NCBI Virus is a great source for investigating SARS-CoV-2 genomic data. Today there are 47,376 SARS-CoV-2 nucleotide sequences in the database, of which 31,884 are complete and 30,972 are from the host Homo sapiens. The rest are partial sequences or come from other or unknown host species. Sequences are added daily.
This blog post is intended to show how to find deletions, point mutations and length differences in SARS-CoV-2 genomes.
Sequence length variation
I was wondering how to explain the length differences between SARS-CoV-2 genomes.
The NCBI genome viewer is a perfect tool for answering that question. I
selected 5 SARS-CoV-2 sequences from different countries plus the reference
sequence, clicked Align and positioned the viewer at the very beginning of the virus
sequence (Fig 1). It turned out that the largest difference between these 5 viruses was
36 nucleotides at the start of the sequence. I did not expect that, but with
this tool it is easy to investigate.
Fig 1. The beginnings of 6 sequences compared.
Fig 2. The ends of 6 sequences compared (at lower resolution the bases are not shown).
The length of the 30,972 sequences in the database varies from 29,403 to 30,018 nucleotides. That is a difference of 615 nucleotides, which is quite large. This can be found simply by sorting on genome length. The Wuhan reference sequence is 29,903 bases long. There are 3,250 sequences in the database with exactly that length, so it is a very common length.
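The same kind of length summary can also be computed offline from a downloaded FASTA file. The following is a minimal sketch, not the way NCBI Virus does it: the function names `fasta_lengths` and `summarize` are my own, and the three demo records are made up only to show the mechanics.

```python
from collections import Counter
from io import StringIO


def fasta_lengths(handle):
    """Return {record_id: sequence_length} for a FASTA file-like object."""
    lengths = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first word of the header line.
            current_id = line[1:].split()[0]
            lengths[current_id] = 0
        elif current_id is not None:
            # Sequence lines may be wrapped, so add up all of them.
            lengths[current_id] += len(line)
    return lengths


def summarize(lengths):
    """Return (shortest, longest, spread, most_common_length, its_count)."""
    values = list(lengths.values())
    most_common_length, count = Counter(values).most_common(1)[0]
    shortest, longest = min(values), max(values)
    return shortest, longest, longest - shortest, most_common_length, count


# Tiny made-up records (hypothetical ids), not real SARS-CoV-2 data.
demo = StringIO(">seqA demo\nACGTAC\nGT\n>seqB demo\nACGT\n>seqC demo\nACGTACGT\n")
demo_lengths = fasta_lengths(demo)
print(summarize(demo_lengths))
```

With a real download, replace the `StringIO` object with `open("sequences.fasta")` (a hypothetical file name).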
Deletions
Further research shows that the shortest sequence (29,403 nt) is missing 265
bases at the beginning of the sequence, has a gap (deletion) of 6 bases (positions 515 to 520, Fig. 3),
and is missing 343 bases at the end. The 6-base deletion corresponds to exactly 2
codons, so the reading frame is preserved and the downstream codons are not affected.
Fig 3. A deletion of 6 bases (positions 515 to 520) in MT810566. Bases identical to the reference are shown as dots (optional setting).
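Whether a deletion keeps the reading frame intact can be checked with a little arithmetic. This is a rough sketch under two assumptions that are not stated in the post: that positions 515 to 520 are coordinates on the reference sequence, and that ORF1ab starts at base 266 of the reference (NC_045512.2). The function name `deletion_effect` is hypothetical.

```python
# Assumption: 1-based position of the first base of ORF1ab in NC_045512.2.
ORF1AB_START = 266


def deletion_effect(del_start, del_end, cds_start=ORF1AB_START):
    """Classify a deletion given 1-based, inclusive genome coordinates."""
    length = del_end - del_start + 1
    # Offset 0 means the deletion begins on the first base of a codon.
    offset = (del_start - cds_start) % 3
    return {
        "length": length,
        "in_frame": length % 3 == 0,
        "whole_codons": length % 3 == 0 and offset == 0,
        "first_codon": (del_start - cds_start) // 3 + 1,
    }


result = deletion_effect(515, 520)
print(result)
```

Under these assumptions the deletion is in frame and removes two whole codons, which fits the observation that the downstream codons are unaffected.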
Mutations
Mutations are easy to spot when you load many sequences (up to 200 in one view) and select 'Show differences' from the Coloring drop-down menu. The screen could look like this:
Fig 4. Overview of mutations (red) in aligned sequences of SARS-CoV-2. Settings: Coloring = Show differences.
Fig 5. Non-synonymous mutation to T in ORF1ab of SARS-CoV-2. Upper 3 rows: the unmutated codon AAG = K = lysine. Lower 3 rows: the mutated codon AAT = N = asparagine. Enlarged image (the original is much smaller).
Clicking the 'click to expand' symbol (Fig 4, left column) expands the sequence with information about codons and amino acids (green and red horizontal bars). I added blue areas to highlight the location of the mutation. In Fig 5, amino acids identical to the reference sequence are indicated by dots (the standard setting). Mutations are shown with a red background.
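The codon change in Fig 5 can be reproduced with the standard genetic code. This is only an illustrative sketch: the viewer does this for you, and the helper name `classify_codon_change` is my own. The table is built from the standard code written in TCAG order.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (reference amino acid, mutated amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# The codon pair from Fig 5: AAG (K, lysine) mutated to AAT (N, asparagine).
print(classify_codon_change("AAG", "AAT"))
# For comparison, a change in the same position that keeps lysine.
print(classify_codon_change("AAG", "AAA"))
```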
Contribution of countries
[Update 6 Jan 2021]
This list shows the contributions of a few countries based on two criteria: SARS-CoV-2 and nucleotide completeness. The 'host' field is not used as a criterion because it is often empty.
- USA: nucleotide: 18,233 protein: 218,445
- Australia: nucleotide: 9,918 protein: 118,982
- Netherlands: nucleotide: 600 protein: 191
- China: nucleotide: 114 protein: 1,294
- Italy: nucleotide: 64 protein: 766
- UK: nucleotide: 4 protein: 48
Note: Great Britain has only 4 complete SARS-CoV-2 sequences in the database. This is strange, because the full genomes of thousands of SARS-CoV-2 viruses have been sequenced there. It means that I cannot (yet) investigate the new British B.1.1.7 variant in this database!
The Netherlands did not supply the 'host' for almost all contributions, so I cannot compare, for example, mink with human SARS-CoV-2.
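The effect of an empty 'host' field on the counts is easy to demonstrate with a small table. This is a sketch on made-up rows: the accessions are invented, and the column names (`Country`, `Nuc_Completeness`, `Host`) are assumptions that may not match the headers of a real NCBI Virus export.

```python
import csv
from collections import Counter
from io import StringIO

# Made-up rows; column names are assumptions, not the real export format.
DEMO_CSV = """Accession,Country,Nuc_Completeness,Host
X1,Netherlands,complete,
X2,Netherlands,complete,Homo sapiens
X3,Netherlands,complete,Neovison vison
X4,Netherlands,partial,
X5,USA,complete,Homo sapiens
"""


def count_complete(rows, host=None):
    """Count complete sequences per country, optionally for one host only."""
    counts = Counter()
    for row in rows:
        if row["Nuc_Completeness"] != "complete":
            continue
        # Rows with an empty Host field are dropped by any host filter.
        if host is not None and row["Host"] != host:
            continue
        counts[row["Country"]] += 1
    return counts


rows = list(csv.DictReader(StringIO(DEMO_CSV)))
any_host = count_complete(rows)
human_only = count_complete(rows, host="Homo sapiens")
print("ignoring host:", dict(any_host))
print("host = Homo sapiens:", dict(human_only))
```

In this toy table the Dutch count falls from 3 to 1 as soon as the host filter is switched on, because a record with a blank host field never matches.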
Thank you Marleen for pointing out problems in the numbers.
References
The New MSA Viewer
(2 min. introduction video)
NCBI Minute: A Beginner's Guide to Genes and Sequences at NCBI
(YouTube, 33 min. introduction)
Gert,
Can you explain why there are such enormous differences in the number of submitted sequences between the USA, China, GB and the Netherlands? Are the researchers in the USA more capable of investigating, and have they thus discovered a higher number of sequences, so that the differences are not real? Or are there effectively fewer sequences in the population in China, GB and the Netherlands?
If the numbers reflect the situation correctly, could these differences be due to Americans flying more intensively around the world and between their states than people in the other countries? The difference is huge. Or are the Americans more active in uploading their data to NCBI?
Marleen, good that you ask. I selected for (1) complete genomes and (2) host = Homo sapiens. But the Dutch researchers left the 'Host' field blank except for 3 human and 13 mink sequences. That's the reason I found only 3! If you ignore Host, then we have 600 nucleotide and 191 protein sequences. So, that's the story! Sorry for the confusion. I didn't expect this!!
Italy: 107 nucleotide and 907 protein sequences. Even better!
But there may be a more serious problem: researchers protect their own data. Maybe the strategy is: first publish in the journals based on your own data, and only after that upload your data to a public database? And there are alternatives, such as GISAID. And if you are busy, maybe uploading is a lot of work?
Anyway, there is a lot to discover! And: it is fun!
update:
https://www.sciencemag.org/news/2021/01/viral-mutations-may-cause-another-very-very-bad-covid-19-wave-scientists-warn
Thanks, Harry P!
I am going to include this in my blog post about the new variant!!!