Evolution blog

03 February 2021

New feature in NCBI virus database: View Mutations in SARS-CoV-2

Corona Update 3 February 2021

Countries seem to be competing to report new SARS-CoV-2 variants. The media try to make sense of them and to answer questions about how dangerous these new variants are. Examples are Scientific American: The Most Worrying Mutations in Five Emerging Coronavirus Variants [1], and The Scientist [5].

The Scientific American article is very useful, and I will return to it. But there are more variants and many more mutations. What is the total number of different mutations that have been found worldwide up to now? The answer is in the NCBI Virus database [2], where the NCBI has started an overview of all mutations in SARS-CoV-2. The information is free, no account is required, and the website is user-friendly.

View Mutations in SARS-CoV-2 SRA Data

Click on the link View Mutations:

Table with all mutations of SARS-CoV-2

After a few seconds a table with all mutations appears. See the Appendix for a description of its columns.

Explanation

An example of a non-synonymous substitution is D614G: amino acid D is replaced by G at position 614 in the Spike (surface glycoprotein). Position 614 is counted from the first amino acid (AA) of the protein. For the Spike protein the position is a number between 1 and 1273, which is the length of the protein. The Spike is a relatively small protein.

The genomic position is a number between 1 and 29,903. That is the length of the standard reference SARS-CoV-2 genome.
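As a rough sketch of how the two coordinate systems relate, the snippet below converts a Spike amino acid position into the genomic positions of its codon. It assumes that the Spike coding sequence starts at nucleotide 21,563 of the reference genome NC_045512.2. That start coordinate and the function name are not taken from the NCBI table and are used here for illustration only.

```python
# Assumption: the Spike coding sequence starts at genomic position 21563
# (1-based) in the reference genome NC_045512.2.
SPIKE_START = 21563


def codon_positions(aa_position, cds_start=SPIKE_START):
    """Return the three 1-based genomic positions of the codon for an amino acid."""
    first = cds_start + (aa_position - 1) * 3
    return (first, first + 1, first + 2)


print(codon_positions(614))  # codon that encodes D614
print(codon_positions(501))  # codon that encodes N501
```

Under this assumption, a mutation listed at amino acid position 614 should have a genomic location inside the three positions printed for 614, which is a quick way to cross-check a row of the table.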

An example of a synonymous substitution is Q613Q: Q is 'replaced' by Q. This still counts as a substitution because the change is at the nucleotide level: CAA > CAG. The nucleotide change is also listed in the table.
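The difference between the two kinds of substitution can be sketched in a few lines of Python. The codon table below contains only the codons needed for the examples (a full table has 64 entries), and GAT > GGT is used merely as an illustrative D to G change. The function names are hypothetical and not part of the NCBI website.

```python
import re

# Only the codons needed for the examples; a full table has 64 entries.
CODON_TO_AA = {"CAA": "Q", "CAG": "Q", "GAT": "D", "GGT": "G"}


def parse_protein_change(change):
    """Split a label such as 'D614G' into (original AA, position, new AA)."""
    match = re.fullmatch(r"([A-Z])\s*(\d+)\s*([A-Z])", change.strip())
    if match is None:
        raise ValueError(f"not a substitution label: {change!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def is_synonymous(codon_before, codon_after):
    """True when both codons translate to the same amino acid."""
    return CODON_TO_AA[codon_before] == CODON_TO_AA[codon_after]


print(parse_protein_change("D614G"))
print(is_synonymous("CAA", "CAG"))  # the Q613Q example from the text
print(is_synonymous("GAT", "GGT"))  # an illustrative D to G change
```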

The Count gives an indication whether the mutation is rare. In Collected location the countries of origin of the virus sample are specified.

Furthermore, a handy feature is that each column can be sorted (up/down) by clicking on the header. Try it!

The NCBI website does not yet provide statistics. On 30 January I counted the number of mutations in the Spike protein (surface glycoprotein):

264 non-synonymous mutations
345 synonymous mutations
609 mutations total
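The counts above can be put in proportion to the length of the Spike with a few lines of Python. This is only a sketch: it divides by the 1273 amino acid positions and treats every mutation as if it were at a different position, so the percentages are upper bounds when several mutations hit the same codon.

```python
SPIKE_LENGTH_AA = 1273  # length of the Spike protein in amino acids
non_synonymous = 264    # counted on 30 January
synonymous = 345        # counted on 30 January

total = non_synonymous + synonymous
pct_non_synonymous = round(100 * non_synonymous / SPIKE_LENGTH_AA, 1)
pct_total = round(100 * total / SPIKE_LENGTH_AA, 1)

print(total)               # 609
print(pct_non_synonymous)  # 20.7
print(pct_total)           # 47.8
```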

This is expected: there are more synonymous than non-synonymous mutations. Still, this is quite a lot for a protein of 1273 amino acids: the 264 amino acid changes amount to about 21% of the 1273 positions, and all 609 mutations together to about 48% of the codons, if each mutation is at a different position. The million dollar question is what the effect is on the behaviour of the protein and the properties of the virus. A first step is:

From one-dimensional RNA to three-dimensional proteins

A spectacular and sophisticated feature is the interactive 3-D display of the protein which is shown when clicking on the link of the Protein Change. Try it!

Click on the link N501Y

Loading data ... please wait ... (ignore error message):

Interactive 3D model of Spike protein
mouse pointer at N501.

try full screen video! (16 sec)

By moving the mouse pointer over the protein, the names of individual Amino Acids with position are displayed. The software is keeping track of all 1273 Amino Acids in this very complicated 3D structure! Really great software! After a lot of trial and error I found the ASN501.

ASN = Asparagine; 1-letter code: N.

Tip: for the table of code names for amino acids see this page.

Asparagine at position 501 (N501) is the location of the famous mutation N501Y: N is replaced by Y. The amino acid is marked in yellow:

zoomed in. Yellow structure is Asparagine in position 501

Not surprisingly, the yellow position 501 is located on the outside of the molecule. It must attach to the human ACE2 receptor, and that could not work if it were located on the inside of the molecule.

Try it. Play with it. Move the cursor over the structure. Change the point of view by holding the mouse button down and moving the mouse. Watch the structure from different angles. Try other mutations (click on other mutations in the main table). Zoom in. Mind you: this is the molecule that caused a pandemic!

Remember: the three-dimensional structure of a protein is the first step in discovering the effect of a mutation.

Problems: Not all links to 3D proteins seem correct. H1000Q results in a protein THR257. Are the links made manually?

Later I discovered that one can select certain locations in the one-dimensional RNA (in the right panel of the page) and the selected amino acid will appear highlighted in yellow in the 3D model. I have to explore that.

The famous N501Y mutation is found in the variants from the UK, South Africa and Brazil. Here is the list from the Scientific American article [1]:

Spain: A222V (Spike) -
UK: - - N501Y (Spike)
South Africa: E484K K417N N501Y [virus escape mutant]
Brazil: E484K K417N/T N501Y

Universe too small! Too short-lived!

The number of possible proteins of length 1273 is staggering. Do the calculation: for every position there are 20 possibilities because there are 20 Amino Acids. "So there are 20×20 = 400 distinct proteins of 2 Amino Acids, 20x20x20 = 8000 proteins of length 3 AA, 160,000 proteins of length 4 AA, 3,200,000 with just 5 AA." [4] etc. Total: 20^1273 AA sequences for the Spike alone. And that is only one protein! Obviously, evolution could not have tried out all those possibilities. The age of the universe is too short to try them all out! So, we can expect endless new virus variants coming as long as we don't interfere with the pandemic and the virus is allowed its natural course.
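The size of this number can be checked with a short Python sketch. Python integers have arbitrary precision, so 20^1273 can be computed exactly. The variable names are only illustrative.

```python
import math

AMINO_ACIDS = 20
SPIKE_LENGTH_AA = 1273

# The small cases quoted in the text: lengths 2, 3, 4 and 5.
small_cases = [AMINO_ACIDS ** k for k in range(2, 6)]
print(small_cases)

# Exact number of possible sequences of Spike length, and its number of digits.
n_sequences = AMINO_ACIDS ** SPIKE_LENGTH_AA
n_digits = len(str(n_sequences))
print(n_digits)

# The same order of magnitude via logarithms: 1273 * log10(20).
print(SPIKE_LENGTH_AA * math.log10(AMINO_ACIDS))
```

The result has 1657 digits, so 20^1273 is on the order of 10^1656.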

Notes

[1] The Most Worrying Mutations in Five Emerging Coronavirus Variants, Scientific American, January 29, 2021.
[2] NCBI Virus database of protein and nucleotide sequences.
[3] NCBI database of all SARS-CoV-2 mutations.
[4] Nature. I made a serious error: 20x1273 = 25,460 is the wrong calculation. It is now corrected in the text. Sorry!
[5] The Scientist: Side-by-Side Comparisons of Important SARS-CoV-2 Variants, Jan 26, 2021. 6 variants.

Appendix

Information in the NCBI table with all mutations:

Protein: all proteins encoded by SARS-CoV-2
Amino Acid substitution (as far as I can see: no insertions/deletions...)
Count: total number of cases in the database of the specific mutation
Genomic location: the position in bases or: nt
Codon change. For example: GCT > GCC (T is used instead of U !)
Non-synonymous (does change AA) or synonymous (does not change AA), AA = amino acid.
Collection location: country of origin of the sample

Sources

This page has a table with the abbreviations of the amino acids.
The video was created with SimpleScreenRecorder for Linux by Maarten Baert.

27 January 2021

Did the highly transmissible British SARS-CoV-2 variant B.1.1.7 originate in one individual?

Corona Update 27 January 2021

After discovering two immuno-compromised patients with high mutation rates and accelerated evolution, I remembered that the original publication describing the highly transmissible British SARS-CoV-2 variant B.1.1.7 also discussed immunodeficient or immunosuppressed patients [1]. The authors discussed such patients for a good reason. They were puzzled by the unusually high number of mutations in B.1.1.7 and by the fact that they did not see any precursors of the variant. Usually mutations accumulate step by step, but the B.1.1.7 variant made a big jump in sequence space. They asked: What evolutionary processes or selective pressures might have given rise to lineage B.1.1.7? They noted that an accumulation of many mutations in immunocompromised patients has been reported in the literature. This could also explain the origin of the B.1.1.7 variant. This is what they conclude:

"These considerations lead us to hypothesise that the unusual genetic divergence of lineage B.1.1.7 may have resulted, at least in part, from virus evolution with a chronically-infected individual. Although such infections are rare, and onward transmission from them presumably even rarer, they are not improbable given the ongoing large number of new infections.

Although we speculate here that chronic infection played a role in the origins of the B.1.1.7 variant, this remains a hypothesis and we cannot yet infer the precise nature of this event."

If this is true, then the highly transmissible British variant originated in one sick individual! One person is the source of a highly transmissible variant that conquered the world and caused severe lock-downs all over the world. Some even talk about a second pandemic.

Knowing this, it seems urgent that immuno-compromised patients with covid-19 are kept in quarantine for a few weeks after they are released from the hospital, in order to prevent the spread of a dangerous new variant.

Together with another case [2] there are possibly 4 cases of immunocompromised patients with high mutation rates. The 4th patient (supposed to be the origin of B.1.1.7) is inferred to exist, but has not been identified as far as I know.

Furthermore, if true, this shows that evolution within a host does not prevent the virus from becoming a better transmitter between hosts.

Notes

25 January 2021

Accidental discovery of a second immuno-compromised patient with accelerated viral evolution

Corona Update 25 January 2021

I discovered a second immuno-compromised patient with accelerated evolution by accident. I wanted to know whether the spontaneous mutations occurring in the first immuno-compromised patient could also be found in the general human population.

I searched for a specific 4AA deletion in the Spike protein in the NCBI SARS-CoV-2 database. I selected all sequences of 1269 AA (4 amino acids shorter than the standard 1273 AA Spike protein). I then added sequences of the standard length of 1273 AA until I hit the maximum of 500 sequences allowed in one search. The result: 27 sequences of length 1269, of which 16 showed the deletion 141-144 (see previous blog). Unexpectedly, two of them showed me the way to a second immuno-compromised patient (Fig. 1).
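The idea behind this search can be sketched in Python. This is not how the NCBI website works internally. It is a minimal illustration with made-up toy sequences of the right length, and it only recognises a sequence that is identical to the reference apart from the deletion. Real sequences may carry other changes as well and would need a proper alignment.

```python
REFERENCE_LENGTH = 1273  # standard Spike length in amino acids


def apply_deletion(sequence, first, last):
    """Remove residues first..last (1-based, inclusive) from a sequence."""
    return sequence[: first - 1] + sequence[last:]


def select_by_length(sequences, length):
    """Keep only the entries whose sequence has the given length."""
    return {name: seq for name, seq in sequences.items() if len(seq) == length}


# Hypothetical toy data: a made-up reference of the right length, not the real Spike.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
toy_reference = "".join(ALPHABET[i % 20] for i in range(REFERENCE_LENGTH))
toy_database = {
    "toy_full_length": toy_reference,
    "toy_del_141_144": apply_deletion(toy_reference, 141, 144),
    "toy_del_1_4": apply_deletion(toy_reference, 1, 4),
}

# Step 1: select everything that is 4 AA shorter than the reference.
short = select_by_length(toy_database, REFERENCE_LENGTH - 4)

# Step 2: keep the sequences that match the reference with residues 141-144 removed.
expected = apply_deletion(toy_reference, 141, 144)
hits = [name for name, seq in short.items() if seq == expected]

print(sorted(short))
print(hits)
```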

Fig 1. Two new sequences with the 141-144 deletion:
QNQ32127; QNQ32151

Fig 2. The publication that describes the virus sequence (source).

Usually the sequences in the database are a 'Direct Submission'. They are not published. But the source of these new sequences (Fig.1) revealed that they were part of a

'Case Study: Prolonged infectious SARS-CoV-2 shedding from an asymptomatic immunocompromised cancer patient' [1].

That's how I discovered my second immuno-compromised patient.

Fig. 3. Long-term SARS-CoV-2 shedding [1] with within-patient variation

The patient is an immuno-compromised individual who persistently tested positive for SARS-CoV-2. Remarkably: it is an asymptomatic individual! The virus mutated and created genetic diversity. This cannot be explained by contamination or secondary infection because the viral genomes of this patient cluster as a mono-phyletic clade*).

This strongly suggests evolution

The authors state: "Throughout the course of infection, there was marked within-host genomic evolution of SARS-CoV-2. Deep sequencing revealed a continuously changing virus population structure with turnover in the relative frequency of the observed genotypes over the course of infection. (...) Potential factors contributing to the observed within-host evolution is prolonged infection and the compromised immune status of the host, possibly resulting in a different set of selective pressures compared with an immune-competent host. These differential selective pressures may have allowed a larger genetic diversity with continuous turnover of dominant viral species throughout the course of infection." [1],[2].

The convalescent plasma therapy was not successful. But it is expected to be a selective pressure on the virus.

Apart from demonstrating evolution, there is an important public health lesson: "an estimated 3 million people in the United States have some form of immuno-compromising condition, including individuals with HIV infection".

The mutations

Fig. 4. Two deletions. red arrow: 21 nt. yellow: 12 nt.
first row (black) shows nt, second row shows AA (colored)

Table 2 Consensus Sequence Variants in Clinical Samples from the Individual and SARS-CoV-2 Isolates Compared with Reference USA/WA1/2020 (MN985325.1) [1].

Two in-frame*) deletions were observed in the Spike glycoprotein coding region:

1) A 21 nt in-frame deletion (residues 21,975–21,995) was found in the N-terminal domain (NTD) of S1, leading to a 7-amino-acid deletion (amino acids [AA] 139–145)

2) A 12 nt deletion (residues 21,982–21,993) was detected in the day 70 isolate, leading to a 4-AA deletion (AA 141–144) in the NTD.
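The arithmetic behind the two deletions can be verified with a small sketch. It takes the first and last deleted nucleotide position as given above and checks whether the length is a multiple of three, which is what keeps the downstream reading frame intact. The function name is only illustrative.

```python
def deletion_summary(first_nt, last_nt):
    """Return (length in nt, in-frame?, number of amino acids removed)."""
    length = last_nt - first_nt + 1
    in_frame = length % 3 == 0
    amino_acids_removed = length // 3 if in_frame else None
    return length, in_frame, amino_acids_removed


print(deletion_summary(21975, 21995))  # (21, True, 7)
print(deletion_summary(21982, 21993))  # (12, True, 4)
```

Both deletions are in-frame, and the number of removed amino acids agrees with the ranges 139–145 (7 AA) and 141–144 (4 AA).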

As can be seen from Fig. 4 the two deletion strains of the virus disappear on day 85 and 105.

*) Abbr

Abbr = abbreviations.

nt = nucleotides or bases. Three bases code for 1 AA.

AA = amino acids.

mono-phyletic clade = an evolutionary group of organisms with one common ancestor.

in-frame deletions = deletions in DNA/RNA whose length is a multiple of three bases, so that the reading frame of the downstream codons (triplets) stays intact: for example a 3 or 6 base deletion. A 1 or 2 base deletion is out-of-frame: it shifts the reading frame and garbles all downstream codons.

viral shedding = the release of virus particles (in the air) (wikipedia)

The authors did not state this explicitly, but these results could, just like those of the patient in my previous blog, be compared with antibiotic resistance after an unsuccessful antibiotic treatment.

Update 26 January.

Small text update: "I searched for a specific 4AA deletion in the Spike protein in the NCBI SARS-CoV-2 database."

Update 27 January:

added Table 2 with overview of all mutations.

Notes

[1] Case Study: Prolonged infectious SARS-CoV-2 shedding from an asymptomatic immunocompromised cancer patient. 23 Dec 2020
[2] There was no selection effect detected in in vitro experiments of mutated virus strains.