Evolution blog

02 January 2021

Corona update 2 Jan 2021 GISAID

Today I show a nice interactive graphic representation of the SARS-CoV-2 genome.

Official hCoV-19 Reference Sequence (GISAID)

At the start of 2020 nobody could have imagined that an unknown RNA virus with no more than 30,000 bases could bring the world to its knees. 30,000 bases! A genome of 30,000 bases has the power to overwhelm a species with a genome of 3 billion bases, which is 100,000 times as many. That species is Homo sapiens, the most powerful and intelligent species on earth. How can such a small virus hit us so hard?

For a start, let's have a look at the structure of the genome of the SARS-CoV-2 virus. The picture above is the official hCoV-19 Reference Sequence, shown as a schematic clickable image on a GISAID page. When you open this link a short animation starts. The genes have different colours. Click on a gene for more information. ORF means Open Reading Frame: the part of the RNA that can be translated into protein. Not all RNA is translated into protein. For example, the first 265 bases are not translated. For each gene the proteins are listed and references to the literature are given. The yellow gene S codes for the Spike protein. The Spike protein is considered the most important protein because it is the key to entering a human cell. The biggest gene is ORF1. There is a mysterious tenth gene, ORF10, that does not produce a protein. Why is it there? If it is non-functional, mutations including deletions are expected to accumulate with higher frequency. Could it be that genome length variation is due to length variation in ORF10? From the picture one can deduce that the genome must be at least 29,674 bases long, because that is the last base of ORF10. Untranslated bases may follow the last gene, just as 265 untranslated bases precede the first.

Further, grey areas are shown in two large genes, ORF1 and ORF9. The grey areas show the smaller proteins that are produced after processing of the primary polyprotein. For example, the ORF1 protein is split into 15 smaller non-structural proteins (NSP11 is missing from the list) and the ORF9 gene produces two smaller proteins.

In the interactive picture one can see the start and end positions of the genes, the lengths of the genes, and the primary and derived proteins, but there are no links to the base sequence itself or to the mutations. However, with these limited data one can do simple but interesting calculations. Superficially it looks as if the whole genome is translated into proteins. But try this: divide the length of a gene in nucleotides (nt) by 3 (because 3 bases code for one amino acid) and a few surprising facts are uncovered. The number of amino acids and the number of nucleotides do not match exactly [1]. There are more bases in the SARS-CoV-2 genome than strictly necessary to code for all the proteins. So a number of bases are not translated into protein. The first 265 bases, mentioned above, are one example.
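The divide-by-three check can be sketched in a few lines of Python. This is only a minimal sketch. It assumes that the start and end positions are inclusive and 1-based, as they appear to be in the viewer. The function name codon_breakdown and the coordinates in the example are hypothetical and are not taken from the GISAID page.

```python
def codon_breakdown(start, end):
    """Split a gene given by inclusive 1-based positions into whole codons.

    Returns a tuple (length_nt, whole_codons, leftover_nt).
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    # Both the start and the end base belong to the gene, hence the +1.
    length_nt = end - start + 1
    # Three bases make one codon; the remainder cannot form a codon.
    whole_codons, leftover_nt = divmod(length_nt, 3)
    return length_nt, whole_codons, leftover_nt


# Hypothetical coordinates, for illustration only.
print(codon_breakdown(101, 1000))   # a length that is divisible by 3
print(codon_breakdown(101, 1002))   # a length that leaves two bases over
```

Keep in mind that the last whole codon of a gene is normally a stop codon, which is not translated into an amino acid either, so the number of amino acids is usually one less than the number of whole codons. Forgetting the +1 for inclusive coordinates is also an easy way to end up with apparent extra bases.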

A more detailed view of the mutations appears when, in the left panel of the main page, the option 'Show entropy' is switched ON and the other options are switched OFF.

GISAID: entropy of SARS-CoV-2 shows the number of mutations per position.

An overview of the SARS-CoV-2 genome is shown with the genes and the number of mutations. The vertical axis shows the diversity and the horizontal axis shows the nucleotide (NT) or amino acid (AA) positions. Mutations are unevenly distributed: there are mutation hotspots. A slider with a left and a right pointer can be used to zoom in:

GISAID: slider set to S protein (detail)

In the above example the left and right pointers of the slider are moved to select the S gene. The frequency of mutations in the S gene appears as bars. When hovering with the mouse over the bars, further information appears. However, positioning the mouse on a bar is not easy. I think this part of the interface could be improved. Also, in the maximum zoom position all the mutated codons should appear. Explanations of the different options would have been helpful, for example how entropy is defined. I assume the software is still under construction. Conclusion: this genome explorer has nice features, but it has its limitations.
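Since the page does not say how entropy is defined, here is one possibility. A common choice for this kind of plot is the Shannon entropy of the bases observed at each position of an alignment. The sketch below assumes that definition, measured in bits (log base 2). That is my assumption, and the GISAID viewer may compute something different. The function names and the toy alignment are invented for illustration.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the bases observed at one position."""
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column must contain at least one base")
    # Sum of p * log2(1/p) over all observed bases, with p = n / total.
    return sum((n / total) * log2(total / n) for n in counts.values())


def alignment_entropy(sequences):
    """Entropy per position for equally long, already aligned sequences."""
    return [column_entropy(column) for column in zip(*sequences)]


# Toy alignment of four sequences, invented for illustration only.
toy = ["ACGT", "ACGA", "ACTA", "ACTC"]
print(alignment_entropy(toy))
```

With this definition a position where all genomes carry the same base scores zero, and the score rises as more variants share the position more evenly. That would explain why the bars in the plot mark mutation hotspots rather than simply counting mutations.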

Notes

[1] All genes except ORF1ab contain 2 extra bases that are not and cannot be translated into an amino acid. ORF1ab has 1 extra base, apart from the first 265 bases that are not translated.

30 December 2020

Corona Update 30 Dec 2020

Today a short update about a tool to visualize the evolution of SARS-CoV-2.

One of the tools is GISAID. In this tool one can easily produce phylogenetic trees of SARS-CoV-2 genomes. The tree contains 3848 genomes sampled between Dec 2019 and Dec 2020. Many thousands of variants of SARS-CoV-2 are circulating. To simplify matters, they are grouped in clades. A clade is a genetically well-defined lineage that has reached a global frequency of 20% and has spread globally (see Tutorials: Clade Naming & Definitions). Clades 19A and 19B were sampled in 2019, and clades 20A, 20B and 20C in 2020. Below is a radial phylogenetic tree with 5 clades:

Radial Phylogenetic Tree of SARS-CoV-2

I like the radial display with 5 clades because the clades are visually well separated. Time runs from the centre outwards to the circumference. The tree starts on 30 December 2019 with the reference genome (Wuhan, China) and ends in December 2020.

One can easily switch from one display type to another. The most common tree type is the Rectangular tree. The reader is encouraged to try out different display options and data selections (left panel). A map and a time lapse animation are also available.

The clades are not geographically restricted. To show this, select 'Color by Region':

SARS-CoV-2 Rectangular tree with 6 regions indicated.
Time runs from left to right.

Remarkable: geographic locations are distributed all over the tree. No clade is restricted to one geographic region. This is unnatural for a species. It means that different SARS-CoV-2 clades have been transported all over the globe. Species were transported all over the world even before air travel became popular, and even before travel by ship, but the genetic fingerprints of SARS-CoV-2 offer us unprecedented opportunities to trace the movements of people carrying the virus from continent to continent.

To be continued.

GISAID (website)

GISAID (wikipedia)

Official hCoV-19 Reference Sequence (GISAID)

24 December 2020

Corona updates 2020

This blog post gives short corona updates. Developments are going fast. Here I add very short updates (especially on the evolutionary aspects) instead of writing many long and detailed blog posts. Most recent on top of the page.

28 December

In the previous update I argued that the new British SARS-CoV-2 variant has outcompeted the existing variants because it has higher transmissibility. But does it also make people sicker than the standard virus? In other words: does it have a higher virulence? The general idea is that there is a trade-off between virulence and transmissibility. If a virus makes you so sick that you stay at home, then the chances of transmission to other people diminish strongly. If you die, the virus is unable to spread to other people. That virus will not cause a pandemic. On the other hand, if a virus does not make many copies of itself and transmit them to other hosts, the virus will disappear. Making many copies is a burden for the host. To be honest: the host is making those copies! That's why it is a burden. That's why a virus is the ultimate parasite.

I am not saying that the virus has a strategy. The success of a virus depends on its genetic make-up and how people behave. It is a matter of causes and consequences. However, we could describe the behaviour of a virus as an evolutionary strategy. A successful evolutionary strategy produces the most offspring.

The evolutionary strategy of SARS-CoV-2 appears to be a mix of virulence and transmissibility. Transmissibility means producing and shedding a lot of copies of itself before the infected person begins to feel sick. This is called pre-symptomatic transmission. These people unknowingly and unintentionally transmit virus particles. Research suggests that up to 45% of infected people are symptom-free transmitters. In younger people transmissibility seems to be maximized.

The second part of the mixed strategy is producing a very high number of virus particles in a subgroup of people, for example in older people. The downside is that these people get very sick and may die. Virulence is high. This is a short-term strategy. It is a dead end. Maybe it is better to consider this a side effect of the successful high-transmission strategy. In the end, the effect is that SARS-CoV-2 kills people and at the same time spreads around the globe. It resulted in a pandemic. At least, that is the situation in this phase of its evolution. Two factors will determine the long-term evolutionary success of SARS-CoV-2: new mutations and our behaviour.

How the coronavirus escapes an evolutionary trade-off that helps keep other pathogens in check.

25 December

British epidemiologists have analysed the new B.1.1.7 virus variant. Results: of the 17 mutations, 3 have a potential biological effect: N501Y, P681H and del69-70. The other 14 are either neutral or their effect is still unknown. The first two mutations are amino acid changes and the third is a deletion of amino acids 69 and 70 of the Spike protein, so these two are missing. In vitro experiments have shown that this deletion increases infectivity. A striking effect of the deletion is that some standard commercial PCR tests give a false negative: they do not detect the British variant. That is obviously a problem. Those PCR tests must be replaced as soon as possible by tests that do detect the new variant.
The epidemiologists conclude that the strong increase of the new variant cannot be attributed to changing social interactions or mobility. So the variant has risen in frequency because of the selective advantage of the mutations. Positive selection, in other words. More precisely: differential reproduction of genetic variants in the virus population based on a phenotypic effect. This also shows that a deletion is not always harmful, but can confer a selective advantage.

Estimated transmissibility and severity of novel SARS-CoV-2 Variant of Concern 202012/01 in England

19 December

An English team of scientists announced the discovery of a new SARS-CoV-2 variant, B.1.1.7, which most likely arose in England. What is special is that it differs from the existing variants by a unique combination of 17 mutations that apparently arose all at once. That is unique in the short history of the virus. This specific combination of mutations has not been observed before. So far no evolutionary precursors have been found. What attracted most press attention is the fact that the mutant shows a strong increase among new cases. By 9 December the variant had already reached a frequency of 60% in London. The variant is strongly associated with new cases in England. Politicians reacted quickly.

Scientists disagree on whether this strong increase should be explained by chance (a superspreader event) or by selective advantage. The fact is that the variant is displacing other variants. One argument for the hypothesis that this variant has a selective advantage is that the part of the virus responsible for attachment to the human cell (Spike) carries no fewer than 8 mutations. And that turns out to be anything but disadvantageous for the spread of the virus. That the variant has a selective advantage can be confirmed by growing it on human cells in the laboratory, with the standard corona strain as a control.

This is the original publication by the group that discovered and described the variant:

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations.