02 January 2021

Corona update 2 Jan 2021 GISAID

Today I show a nice interactive graphic representation of the SARS-CoV-2 genome.

Official hCoV-19 Reference Sequence (GISAID)

At the start of 2020 nobody could have imagined that an unknown RNA virus with no more than 30,000 bases could bring the world to its knees. 30,000 bases! A genome of 30,000 bases has the power to overwhelm a species whose genome has a total of 3 billion bases, that is, 100,000 times more. And that species is Homo sapiens, the most powerful and intelligent species on earth. How can such a small virus hit us so hard?

For a start, let's have a look at the structure of the genome of the SARS-CoV-2 virus. The picture above is the official hCoV-19 Reference Sequence, shown as a schematic clickable image on a GISAID page. When you open the page a short animation starts. The genes have different colours. Click on the genes for more information. ORF means Open Reading Frame: the part of the RNA that can be translated into protein. Not all RNA is translated; for example, the first 265 bases are not. For each gene, the proteins are listed and references to the literature are given.

The yellow gene S codes for the Spike protein, which is considered the most important protein because it is the key to entering a human cell. The biggest gene is ORF1. There is a mysterious tenth gene, ORF10, that does not produce a protein. Why is it there? If it is non-functional, mutations, including deletions, are expected to accumulate at a higher frequency. Could it be that genome length variation is due to length variation in ORF10? From the picture one can deduce that the genome must be at least 29,674 bases long, because that is the last base of the ORF10 gene. That is only a lower bound: just as the first 265 bases precede the first gene, untranslated bases can follow the last one, and the reference genome is in fact about 29,900 bases long.

Further, in two large genes, ORF1 and ORF9, grey areas are shown. They mark the smaller proteins that are produced after processing of the primary polyprotein. For example, the ORF1 protein is split into 15 smaller non-structural proteins (NSPs; NSP11 is missing from the list), and the ORF9 gene produces two smaller proteins.

In the interactive picture one can see the start and end base positions of the genes, the lengths of the genes, and the primary and derived proteins, but there are no links to the base sequence itself or to the mutations. However, even with these limited data one can do simple but interesting calculations. Superficially it looks as if the whole genome is translated into proteins. But try this: divide the length of a gene expressed in nucleotides (nt) by 3 (because 3 bases code for one amino acid) and a few surprising facts are uncovered. The number of amino acids and the number of nucleotides do not match exactly [1]. There are more bases in the SARS-CoV-2 genome than strictly necessary to code for all the proteins, so a number of bases are not translated into proteins. I already mentioned the first 265 bases above.
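
The divide-by-3 check described above can be sketched in a few lines of Python. This is only an illustration: the function names are my own, the start and end positions in the example are hypothetical and not read from the GISAID page, and I assume that positions are 1-based and inclusive, so that a gene running from base 101 to base 103 is 3 bases long.

```python
def length_from_positions(start, end):
    """Gene length in nucleotides, assuming 1-based inclusive positions."""
    return end - start + 1


def codon_breakdown(length_nt):
    """Split a length in nucleotides into whole codons and leftover bases."""
    codons, leftover = divmod(length_nt, 3)
    return codons, leftover


# Hypothetical gene: these positions are placeholders, not GISAID data.
example_start, example_end = 101, 3102
length_nt = length_from_positions(example_start, example_end)
codons, leftover = codon_breakdown(length_nt)
print(f"{length_nt} nt = {codons} codons + {leftover} leftover base(s)")
```

A leftover other than 0 means the listed length cannot consist of whole codons only. Note that the result depends on how the length is counted: using `end - start` instead of `end - start + 1` shifts the leftover by one base. Also keep in mind that a stop codon occupies three bases without coding for an amino acid.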

A more detailed view of the mutations appears when, in the left panel of the main page, the option 'Show entropy' is set ON and the other options are set OFF.

GISAID: entropy of SARS-CoV-2 shows the number of mutations per position.


An overview of the SARS-CoV-2 genome is shown with the genes and the number of mutations. The vertical axis shows the diversity and the horizontal axis shows the nucleotide (NT) or amino acid (AA) positions. Mutations are unevenly distributed: there are mutation hotspots. A slider with a left and a right pointer can be used to zoom in:

GISAID: slider set to S protein (detail)


In the above example the left and right pointers of the slider are moved to select the S gene. The frequency of mutations in the S gene appears as bars. When hovering with the mouse over the bars, further information appears. However, positioning the mouse on a bar is not easy. I think this part of the interface could be improved. Also, in the maximum zoom position all the mutated codons should appear. Explanations of the different options would have been helpful, for example how entropy is defined. I assume the software is still under construction. Conclusion: this genome explorer has nice features, but it has its limitations.
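
On the missing definition of entropy: a common choice for this kind of diversity plot is the Shannon entropy of each position in a set of aligned sequences. Whether the GISAID explorer uses exactly this formula, or the natural logarithm, is my assumption; the page does not say. The sketch below uses a made-up toy alignment and my own function names, only to show how such a per-position value could be computed.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of the symbols at one position."""
    counts = Counter(column)
    total = sum(counts.values())
    # Sum p * log(1/p) over the symbols observed at this position.
    return sum((n / total) * math.log(total / n) for n in counts.values())


def alignment_entropy(sequences):
    """Entropy per position for equally long, aligned sequences."""
    return [column_entropy(column) for column in zip(*sequences)]


# Toy alignment, invented for illustration only (not real virus data).
toy_alignment = ["ACGT", "ACGT", "ACTT", "ATTT"]
entropies = alignment_entropy(toy_alignment)
for position, value in enumerate(entropies, start=1):
    print(position, round(value, 3))
```

With this definition a position where all sequences agree gets entropy 0, and the value grows as the variants at a position become more evenly mixed. That would explain why the bars show diversity rather than a simple count of mutations.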

 

Notes

  1. All genes except ORF1ab contain 2 extra bases that are not, and cannot be, translated into an amino acid. ORF1ab has 1 extra base, apart from the first 265 bases that are not translated.
