Evolution blog

05 January 2021

Corona update 5 Jan 2021 NCBI Virus

NCBI Virus is a great source for investigating SARS-CoV-2 genomic data. Today the database holds 47,376 SARS-CoV-2 nucleotide sequences, of which 31,884 are complete, and 30,972 of those come from the host Homo sapiens. The rest are partial sequences or come from other or unknown host species. Sequences are added daily.

This blog post shows how to find deletions, point mutations and length differences in SARS-CoV-2 genomes.

Sequence length variation

I was wondering how to explain the length differences between SARS-CoV-2 genomes. The NCBI genome viewer is a perfect tool for answering that question. I selected 5 SARS-CoV-2 sequences from different countries plus the reference sequence, clicked Align and positioned the viewer at the very beginning of the virus sequence (Fig 1). It turned out that the largest difference between these sequences was 36 nucleotides at the start of the sequence. I did not expect that, but with this tool it is easy to investigate.

Fig 1. the beginnings of 6 sequences compared

Fig 2. the end of 6 sequences compared (lower resolution)
(At lower resolution the bases are not shown)

The length of the 30,972 sequences in the database varies from 29,403 to 30,018 bases. That is a difference of 615 nucleotides, which is quite large. This can be found simply by sorting on genome length. The Wuhan reference sequence is 29,903 bases long, and 3,250 sequences in the database have exactly that length, so it is a very common length.
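The same summary can be computed from a list of genome lengths instead of sorting in the browser. This is a minimal sketch: the function name `summarize_lengths` is my own, and the sample list is made up. Only the values 29,403, 30,018 and 29,903 come from the text above.

```python
from collections import Counter

def summarize_lengths(lengths, reference_length):
    """Return shortest, longest, their difference, and how many match the reference."""
    counts = Counter(lengths)
    shortest, longest = min(lengths), max(lengths)
    return {
        "shortest": shortest,
        "longest": longest,
        "range": longest - shortest,
        "same_as_reference": counts[reference_length],
        "most_common_length": counts.most_common(1)[0][0],
    }

# Made-up sample: only 29403, 30018 and 29903 are taken from the post.
sample_lengths = [29403, 29903, 29882, 29903, 30018, 29903]
summary = summarize_lengths(sample_lengths, reference_length=29903)
print(summary["range"])  # 30018 - 29403
```

With a real export of sequence lengths in place of `sample_lengths`, the same function would report how common the reference length is.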

Deletions

Further research shows that the shortest sequence (29,403 nt) lacks 265 bases at the beginning, has a gap (deletion) of 6 bases (positions 515 to 520, Fig. 3) and lacks 343 bases at the end. The 6-base deletion corresponds to exactly 2 codons, so it does not shift the reading frame of the downstream codons.

Fig 3. A deletion of 6 bases (position 515 to 520) (id=MT810566)
Identical bases relative to the reference are shown as dots (optional).
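Whether a deletion removes whole codons can be checked with a little arithmetic. The sketch below rests on two assumptions: positions 515 to 520 are reference coordinates, and translation starts at base 266 (an earlier post mentions that the first 265 bases are not translated). The function name `deletion_frame_effect` is hypothetical.

```python
def deletion_frame_effect(del_start, del_end, cds_start):
    """Classify a deletion inside a coding region.

    Positions are 1-based and inclusive. cds_start is the first base of the
    start codon of the reading frame that contains the deletion.
    """
    length = del_end - del_start + 1
    in_frame = length % 3 == 0
    # Offset 0 means the deletion begins at the first base of a codon.
    codon_offset = (del_start - cds_start) % 3
    return {
        "length": length,
        "in_frame": in_frame,
        "codon_offset": codon_offset,
        "removes_whole_codons": in_frame and codon_offset == 0,
    }

# Assumption: the coding region starts at base 266.
result = deletion_frame_effect(515, 520, cds_start=266)
print(result)
```

Under these assumptions the deletion starts on a codon boundary and its length is a multiple of 3, which fits the observation that two codons are missing and the downstream frame is intact.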

Mutations

Mutations are easy to spot when loading a lot of sequences (up to 200 in one view) and when 'Show differences' is selected from the Coloring drop-down menu. The screen could look like this:

Fig 4. overview mutations (red) in aligned sequences of SARS-COV-2
Settings: Coloring = Show differences
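The idea behind 'Show differences' can be imitated in a few lines: every base that equals the reference becomes a dot, and only the deviating bases remain visible. This is a sketch of the principle, not the viewer's actual code, and the two short sequences are made up.

```python
def show_differences(reference, sequence):
    """Replace bases identical to the reference by dots; keep the differing ones."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must be aligned to the same length")
    return "".join("." if s == r else s for r, s in zip(reference, sequence))

reference = "ATGAAGCTT"  # made-up aligned fragment
variant = "ATGAATC-T"    # one substitution (G to T) and one gap
row = show_differences(reference, variant)
print(row)  # .....T.-.
```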

Fig 5. Non-synonymous mutation T in ORF1ab of SARS-COV-2
3 upper rows: the unmutated codon AAG = K = Lysine
3 lower rows: mutated codon AAT = N = Asparagine
enlarged image (original is much smaller)

With a mouse click on the 'click to expand' symbol (Fig 4, left column) the sequence expands with information about codons and amino acids (green and red horizontal bars). I added blue areas to highlight the location of the mutation. In Fig 5 amino acids identical to the reference sequence are indicated by dots (standard setting). Mutations are shown with a red background.
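The change from AAG to AAT in Fig 5 can be checked against the standard genetic code. The sketch below builds the codon table from the usual TCAG ordering and classifies a single codon change. The function name `classify_substitution` is my own, and the second example (AAG to AAA) is added only for contrast.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(ref_codon, alt_codon):
    """Return (reference amino acid, mutated amino acid, kind of mutation)."""
    ref_aa = CODON_TABLE[ref_codon.upper().replace("U", "T")]
    alt_aa = CODON_TABLE[alt_codon.upper().replace("U", "T")]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind

print(classify_substitution("AAG", "AAT"))  # ('K', 'N', 'non-synonymous')
print(classify_substitution("AAG", "AAA"))  # ('K', 'K', 'synonymous')
```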

Contribution of countries

[Update 6 Jan 2021]

In this list the contribution of a few countries is shown, based on two criteria: SARS-CoV-2 and nucleotide completeness. The 'host' field is not included because this field is often empty.

USA: nucleotide: 18,233 protein: 218,445
Australia: nucleotide: 9,918 protein: 118,982
Netherlands: nucleotide: 600 protein: 191
China: nucleotide: 114 protein: 1,294
Italy: nucleotide: 64 protein: 766
UK: nucleotide: 4 protein: 48

Note: Great Britain has uploaded only 4 complete SARS-CoV-2 sequences to the database. This is strange, because they have sequenced the full genome of thousands of SARS-CoV-2 viruses. It means that in this database I cannot (yet) investigate the new British B.1.1.7 variant!

The Netherlands did not supply the 'host' in almost all contributions, so I cannot compare for example mink with human SARS-COV-2.

Thank you Marleen for pointing out problems in the numbers.

References

The New MSA Viewer (2 min. introduction video)

NCBI Minute: A Beginner's Guide to Genes and Sequences at NCBI (youtube, introduction of 33 min.)

02 January 2021

Corona update 2 Jan 2021 GISAID

Today I show a nice interactive graphic representation of the SARS-CoV-2 genome.

Official hCoV-19 Reference Sequence (GISAID)

At the start of 2020 nobody could have imagined that an unknown RNA virus with no more than 30,000 bases could bring the world to its knees. 30,000 bases! A genome of 30,000 bases has the power to overwhelm a species with a genome of 3 billion bases, which is 100,000 times more. And that species is Homo sapiens, the most powerful and intelligent species on earth. How can such a small virus hit us so hard?

For a start, let's have a look at the structure of the genome of the SARS-CoV-2 virus. The picture above is the official hCoV-19 Reference Sequence, shown as a schematic clickable image on a GISAID page. When you open this link a short animation starts. The genes have different colours. Click on the genes for more information. ORF means Open Reading Frame: a part of the RNA that can be translated into protein. Not all RNA is translated into proteins. For example, the first 265 bases are not translated. For each gene the proteins are listed and references to the literature are given. The yellow gene S codes for the Spike protein. The Spike protein is considered the most important protein because it is the key to entering a human cell. The biggest gene is ORF1. There is a mysterious tenth gene, ORF10, that does not produce a protein. Why is it there? If it is non-functional, mutations including deletions are expected to accumulate with higher frequency. Could it be that genome length variation is due to length variation in ORF10? One can deduce that the genome must be at least 29,674 bases long, because that is the last base of the ORF10 gene.
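The positions mentioned so far allow a small calculation of how much of the genome lies outside the genes. The sketch below uses the first coding base (266, since the first 265 bases are not translated), the last base of ORF10 (29,674) and the 29,903-base length of the Wuhan reference sequence reported in the 5 January post. The function name `untranslated_ends` is my own.

```python
def untranslated_ends(genome_length, first_coding_base, last_coding_base):
    """Return the number of bases before the first and after the last gene.

    Positions are 1-based and inclusive.
    """
    leading = first_coding_base - 1
    trailing = genome_length - last_coding_base
    return leading, trailing

leading, trailing = untranslated_ends(
    genome_length=29903, first_coding_base=266, last_coding_base=29674
)
print(leading, trailing)  # 265 229
```

So with these numbers there is room for a stretch of 229 bases after ORF10, which means length variation need not come from ORF10 itself.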

Further, in two large genes, ORF1 and ORF9, grey areas are shown. The grey areas show the smaller proteins that are produced after processing the primary polyprotein. For example, the ORF1 protein is split into 15 smaller non-structural proteins (NSP11 is missing from the list) and the ORF9 gene produces two smaller proteins.

In the interactive picture one can see the start and end base position of genes, the length of the genes and the primary and derived proteins, but there are no links to the base sequence itself and the mutations. However, with these limited data one can do simple but interesting calculations. Superficially it looks like the whole genome is translated into proteins. But try this: divide the length of a gene expressed in nucleotides (nt) by 3 (because 3 bases code for one amino acid) and a few surprising facts are uncovered. The number of amino acids and the number of nucleotides do not match exactly [1]. There are more bases in the SARS-COV-2 genome than strictly necessary to code for all the proteins. So, a number of bases are not translated into proteins. Above I mentioned the first 265 bases already.
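The division by 3 is easy to automate. The sketch below assumes that gene positions are 1-based and inclusive, so a gene running from start to end has end - start + 1 bases. Whether the positions on the GISAID page follow that convention is an assumption. The coordinates in the example are hypothetical and only serve to show both outcomes. Note that a stop codon, if it lies inside the annotated range, counts as a whole codon here although it adds no amino acid.

```python
def codon_count(start, end):
    """Return (whole codons, leftover bases) for a gene spanning start..end.

    Assumes 1-based, inclusive positions, so the gene has end - start + 1 bases.
    """
    length = end - start + 1
    return divmod(length, 3)

# Hypothetical coordinates, chosen only to show both outcomes.
print(codon_count(101, 118))  # 18 bases -> (6, 0)
print(codon_count(101, 120))  # 20 bases -> (6, 2)
```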

A more detailed view of the mutations appears when, in the left panel of the main page, the option 'Show entropy' is set ON and the other options are OFF.

GISAID: entropy of SARS-COV-2 shows the number of mutations per position.

An overview of the SARS-CoV-2 genome is shown with the genes and the number of mutations. The vertical axis shows the diversity and the horizontal axis shows the nucleotide (NT) or amino acid (AA) positions. Mutations are unevenly distributed: there are mutation hotspots. A slider with a left and a right pointer can be used to zoom in:

GISAID: slider set to S protein (detail)

In the above example the left and right sliders are moved to select the S gene. The frequency of mutations in the S gene appears as bars. When hovering with the mouse over the bars, further information appears. However, positioning the mouse on a bar is not easy. I think this part of the interface could be improved. Also, in the maximum zoom position all the mutated codons should appear. Explanations of the different options would have been helpful, for example how entropy is defined. I assume the software is still under construction. Conclusion: this genome explorer has nice features, but it has its limitations.
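Since the page does not say how entropy is defined, here is one common definition as an illustration: the Shannon entropy of the bases found at one position of an alignment. That the tool uses exactly this formula, and the natural logarithm, is an assumption on my part. The short alignment is made up.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(n / total) * math.log(n / total) for n in counts.values())

def alignment_entropy(sequences):
    """Entropy per position for equally long, aligned sequences."""
    return [column_entropy(column) for column in zip(*sequences)]

# Made-up alignment of four sequences, five positions each.
alignment = ["ATGCA", "ATGCA", "ATGTA", "ACGTA"]
entropies = alignment_entropy(alignment)
for position, value in enumerate(entropies, start=1):
    print(position, round(value, 3))
```

A position where all sequences agree gets entropy 0. The more evenly the bases are divided, the higher the value, which is why hotspots stand out as tall bars.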

Notes

[1] All genes except ORF1ab contain 2 extra bases that are not and cannot be translated into an amino acid. ORF1ab has 1 extra base apart from the first 265 bases that are not translated.

30 December 2020

Corona Update 30 Dec 2020

Today a short update about a tool to visualize the evolution of SARS-CoV-2.

One such tool is GISAID. In this tool one can easily produce phylogenetic trees of SARS-CoV-2 genomes. There are 3,848 genomes sampled between Dec 2019 and Dec 2020. Many thousands of variants of SARS-CoV-2 are circulating. To simplify matters, they are grouped in clades. A clade is a genetically well-defined lineage that has reached a frequency of 20% globally and has spread globally (see Tutorials: Clade Naming & Definitions). Clades 19A and 19B were sampled in 2019 and clades 20A, 20B and 20C in 2020. Below is a radial phylogenetic tree with 5 clades:

Radial Phylogenetic Tree of SARS-COV-2

I like the radial display with 5 clades because the clades are visually well separated. The direction of time is from the centre outwards to the circumference. The tree starts on 30 December 2019 with the reference genome (Wuhan, China) and ends in December 2020.

One can easily switch from one display type to another. The most common tree type is the Rectangular tree. The reader is encouraged to try out different display options and data selections (left panel). A map and a time lapse animation are also available.

The clades are not geographically restricted. To show this, select 'Color by Region':

SARS-COV-2 Rectangular tree with 6 regions indicated.
Time runs from left to right.

Remarkably, geographic regions are distributed all over the tree. No clade is restricted to one geographic region. This is unnatural for a species. It means that different SARS-CoV-2 clades have been transported all over the globe. Species have been transported all over the world even before the popularity of air travel, and even before travel by ship, but the genetic fingerprints of SARS-CoV-2 offer us unprecedented opportunities to trace the movements of people carrying the virus from continent to continent.
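The claim that no clade is restricted to one region can be tested on a table of samples. The sketch below assumes each sample is a (clade, region) pair. The records are made up for illustration and are not taken from the GISAID tree, and the function names are my own.

```python
from collections import defaultdict

def regions_per_clade(samples):
    """Map each clade to the set of regions in which it was sampled."""
    regions = defaultdict(set)
    for clade, region in samples:
        regions[clade].add(region)
    return dict(regions)

def restricted_clades(samples):
    """Return the clades that occur in exactly one region."""
    return sorted(
        clade for clade, regions in regions_per_clade(samples).items()
        if len(regions) == 1
    )

# Made-up records, for illustration only.
samples = [
    ("19A", "Asia"), ("19A", "Europe"),
    ("19B", "Asia"), ("19B", "North America"),
    ("20A", "Europe"), ("20A", "Africa"),
    ("20B", "Europe"), ("20B", "Oceania"),
    ("20C", "North America"), ("20C", "Europe"),
]
print(restricted_clades(samples))  # [] : no clade is confined to one region
```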

To be continued.

GISAID (website)

GISAID (wikipedia)

Official hCoV-19 Reference Sequence (GISAID)