27 January 2021

Did the highly transmissible British SARS-CoV-2 variant B.1.1.7 originate in one individual?



 Corona Update 27 January 2021

After discovering two immuno-compromised patients with high mutation rates and accelerated evolution, I remembered that the original publication describing the highly transmissible British SARS-CoV-2 B.1.1.7 variant also discussed immunodeficient or immunosuppressed patients [1]. The authors discussed such patients for a good reason. They were puzzled by the unusually high number of mutations in B.1.1.7 and by the fact that they did not see any precursors of the variant. Normally one would expect a step-by-step accumulation of mutations, but the B.1.1.7 variant made a big jump in sequence space. They asked: What evolutionary processes or selective pressures might have given rise to lineage B.1.1.7? They noted that an accumulation of many mutations in immunocompromised patients has been reported in the literature, which could also explain the origin of the B.1.1.7 variant. This is what they conclude:

"These considerations lead us to hypothesise that the unusual genetic divergence of lineage B.1.1.7 may have resulted, at least in part, from virus evolution with a chronically-infected individual. Although such infections are rare, and onward transmission from them presumably even rarer, they are not improbable given the ongoing large number of new infections.

Although we speculate here that chronic infection played a role in the origins of the B.1.1.7 variant, this remains a hypothesis and we cannot yet infer the precise nature of this event."

If this is true, then the highly transmissible British variant originated in one sick individual! One person is the source of a highly transmissible variant that conquered the world and led to severe lockdowns in many countries. Some even talk about a second pandemic.

Knowing this, it seems urgent that immuno-compromised patients with covid-19 be kept in quarantine for a few weeks after their release from the hospital, in order to prevent the spread of a dangerous new variant.

Together with another case [2] there are possibly 4 cases of immunocompromised patients with high mutation rates. The 4th patient (supposed to be the origin of B.1.1.7) is inferred to exist, but has not been identified as far as I know.

Furthermore, if true, this shows that within-host evolution does not prevent the virus from becoming a better between-host transmitter.



  1. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations,  Dec 2020.
  2. Neutralising antibodies in Spike mediated SARS-CoV-2 adaptation, December 29, 2020

25 January 2021

Accidental discovery of a second immuno-compromised patient with accelerated viral evolution



 Corona Update 25 January 2021

I discovered a second immuno-compromised patient with accelerated evolution by accident. I wanted to know whether the spontaneous mutations occurring in the first immuno-compromised patient could also be found in the general human population. 

I searched for a specific 4AA deletion in the Spike protein in the NCBI SARS-CoV-2 database. I selected all sequences of 1269 AA (4 amino acids shorter than the standard 1273 AA Spike protein). I then added sequences of the standard length of 1273 AA until I hit the maximum of 500 sequences allowed in one search. The result: 27 sequences of length 1269, of which 16 showed the deletion 141-144 (see previous blog). Unexpectedly, two of them showed me the way to a second immuno-compromised patient (Fig. 1).
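The length filter described above can also be reproduced offline on a downloaded multi-FASTA file. What follows is only a minimal sketch and not the NCBI search itself: the function names and the toy records are made up for illustration, and real Spike records would be 1269 or 1273 AA long.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ids_with_length(text, length):
    """Return the IDs (first word of the header) of all records with the given length."""
    return [h.split()[0] for h, seq in parse_fasta(text) if len(seq) == length]


# Toy data (hypothetical, far shorter than a real Spike protein).
toy = """>seqA toy record
MFVFLVLLPL
>seqB toy record
MFVF
LV
"""
print(ids_with_length(toy, 6))  # ['seqB']
```

With a real download one would call `ids_with_length(text, 1269)` and then inspect positions 141-144 of each hit by eye or in an alignment viewer.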

Fig 1. Two new sequences with the 141-144 deletion:
QNQ32127; QNQ32151

Fig 2. The publication that describes the virus sequence (source).

Usually the sequences in the database are a 'Direct Submission'. They are not published. But the source of these new sequences (Fig.1) revealed that they were part of a

'Case Study: Prolonged infectious SARS-CoV-2 shedding from an asymptomatic immunocompromised cancer patient' [1]. 

That's how I discovered my second immuno-compromised patient.

Fig. 3. Long-term SARS-CoV-2 shedding [1] with within-patient variation

It is an immuno-compromised individual persistently testing positive for SARS-CoV-2. Remarkably: it is an asymptomatic individual! The virus mutated and created genetic diversity. This cannot be explained by contamination or secondary infection because the viral genomes of this patient cluster as a mono-phyletic clade*).


This strongly suggests evolution

The authors state: "Throughout the course of infection, there was marked within-host genomic evolution of SARS-CoV-2. Deep sequencing revealed a continuously changing virus population structure with turnover in the relative frequency of the observed genotypes over the course of infection. (...) Potential factors contributing to the observed within-host evolution is prolonged infection and the compromised immune status of the host, possibly resulting in a different set of selective pressures compared with an immune-competent host. These differential selective pressures may have allowed a larger genetic diversity with continuous turnover of dominant viral species throughout the course of infection." [1],[2].

The convalescent plasma therapy was not successful. But it is expected to be a selective pressure on the virus.

Apart from demonstrating evolution, there is an important public health lesson: "an estimated 3 million people in the United States have some form of immuno-compromising condition, including individuals with HIV infection".


The mutations

Fig. 4. Two deletions. Red arrow: 21 nt. Yellow: 12 nt.
first row (black) shows nt, second row shows AA (colored)

Table 2 Consensus Sequence Variants in Clinical Samples from the Individual and SARS-CoV-2 Isolates Compared with Reference USA/WA1/2020 (MN985325.1) [1].


Two in-frame*) deletions were observed in the Spike glycoprotein coding region:

1) A 21 nt in-frame deletion (residues 21,975–21,995) was found in the N-terminal domain (NTD) of S1, leading to a 7-amino-acid deletion (amino acids [AA] 139–145) 

2) A 12 nt deletion (residues 21,982–21,993) was detected in the day 70 isolate, leading to a 4-AA deletion (AA 141–144) in the NTD.
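The arithmetic behind these two deletions is easy to check. The sketch below (the function name is my own) takes the inclusive nucleotide range of a deletion, computes its length, and tests whether that length is a multiple of three. It only checks the length criterion, not where the codon boundaries fall.

```python
def deletion_summary(first_nt, last_nt):
    """Return (length in nt, in-frame?, amino acids removed) for an inclusive nt range."""
    length = last_nt - first_nt + 1
    in_frame = length % 3 == 0
    # The number of removed amino acids is only meaningful for in-frame deletions.
    return length, in_frame, (length // 3 if in_frame else None)


print(deletion_summary(21975, 21995))  # (21, True, 7)
print(deletion_summary(21982, 21993))  # (12, True, 4)
```

Both results agree with the 7 AA and 4 AA deletions reported by the authors.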

As can be seen from Fig. 4 the two deletion strains of the virus disappear on day 85 and 105.

*) Abbreviations and terms:

nt = nucleotides or bases. Three bases code for 1 AA.

AA = amino acids. 

mono-phyletic clade = an evolutionary group of organisms with one common ancestor.

in-frame deletions = deletions in DNA/RNA that leave the codons (triplets) intact: for example a 3 or 6 base deletion which removes only intact codons. A 1 or 2 base deletion is out-of-frame: it shifts the reading frame, so all downstream codons are misread.

viral shedding =  the release of virus particles (in the air) (wikipedia)

The authors did not state this explicitly, but these results could, just like those of the patient in my previous blog, be compared with antibiotic resistance arising after an unsuccessful antibiotic treatment.

Update 26 January. 

Small text update: "I searched for a specific 4AA deletion in the Spike protein in the NCBI SARS-CoV-2 database."

Update 27 January: 

added Table 2 with overview of all mutations.



  1. Case Study: Prolonged infectious SARS-CoV-2 shedding from an asymptomatic immunocompromised cancer patient. 23 Dec 2020
  2. There was no selection effect detected in in vitro experiments of mutated virus strains.

20 January 2021

Accelerated evolution of SARS-COV-2 in a person with immunodeficiency (with hands-on exercise)




Corona Update 20 January 2021

People with weakened immune systems are at higher risk of getting severely sick from SARS-CoV-2, the virus that causes covid-19 [1]. They may also remain infectious for a longer period of time than others with COVID-19. Recently, a group of 32 researchers published the results of viral genome sequencing of an immuno-compromised patient with Covid-19 [2].

To my knowledge, this study is the first where a sufficient number of whole-genome viral sequences were obtained from the same patient to establish the evolution of SARS-CoV-2 in one individual during a longer period of time in which anti-viral therapy was given, such as two remdesivir treatments and a SARS-CoV-2 antibody cocktail against the SARS-CoV-2 spike protein (Regeneron) [3] and more.  

In the study, 9 whole-genome viral sequences were obtained, and 19 RT-PCR tests and a number of quantitative viral load assays of SARS-CoV-2 were done, during a period of 154 days (5 months). After many clinical complications such as hypoxemia, and after remdesivir treatments, the patient died on day 154 from shock and respiratory failure.

In this blog I will focus on the whole genome sequences. The authors did a phylogenetic analysis of the viral sequences which was consistent with persistent infection and accelerated evolution of the virus. 

The authors conclude: "Although most immunocompromised persons effectively clear SARS-CoV-2 infection, this case highlights the potential for persistent infection and  accelerated  viral  evolution  associated  with  an immunocompromised state". 

This situation may be comparable to the continued use of antibiotics in humans or animals to combat bacterial infections. When the selection pressure is not too strong, not every microbe is killed. A few lucky mutants escape and multiply. When the 'right' mutations occur, antibiotic resistance soon develops [8].


The mutations...

For those who insist on knowing the exact mutations, here they are:

Fig 1a. Partial Spike sequences. continued below.

Fig 1b. Partial Spike sequences (continued).
Amino Acids in colored letters. Dots: identical AA.
Numbers above are positions in Spike protein
For complete Spike sequence see below: How to ...


Generally, mutations originate in random places in the genome. But the authors found that amino acid changes were predominantly in the Spike gene (S) and the receptor-binding domain (RBD), which make up 13% and 2% of the viral genome, respectively, but harboured 57% and 38% of the observed changes. That means changes in the Spike protein were far more frequent than expected by chance alone. Selection must have taken place: negative (removing mutated variants) and positive (amplifying mutated variants) [4].
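To get a feeling for how strong this over-representation is, one can divide the share of the observed changes by the share of the genome. This is a back-of-the-envelope sketch using only the four percentages quoted above; the function name is my own and this is not the statistical analysis used by the authors.

```python
def fold_enrichment(share_of_changes, share_of_genome):
    """How many times more changes a region carries than its size alone would predict."""
    return share_of_changes / share_of_genome


spike = fold_enrichment(0.57, 0.13)  # Spike gene: 57% of changes in 13% of the genome
rbd = fold_enrichment(0.38, 0.02)    # RBD: 38% of changes in 2% of the genome

print(round(spike, 1))  # 4.4
print(round(rbd, 1))    # 19.0
```

So the Spike gene carries roughly 4 times, and the RBD roughly 19 times, as many changes as would be expected if changes were spread evenly over the genome.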

Several synonymous as well as non-synonymous mutations in the virus were detected, as well as contiguous deletions of 3, 4 and 7 amino acids (AA). One variant has a 7 AA and a 3 AA deletion (see Fig 1a), so its sequence is 10 AA or 30 nt shorter. Apparently, this is not very harmful to the virus. Remarkably, all mutations in the Spike protein are non-synonymous (see figures). That means amino acids are replaced. In synonymous mutations the amino acid does not change, so they are of little consequence.

The top row in the figure is the first virus genome in the patient that has been sequenced (day 18). It is identical to the Reference sequence (RefSeq) of the Spike protein, except for one substitution N869M in position 869: M has been substituted by N. So, the patient started with a standard virus. But then things changed rapidly. A total of 60 non-synonymous mutations and 15 synonymous mutations were found. That is 4 times as many non-synonymous as synonymous mutations. For comparison: the new Brazilian variant has 24 mutations!

Here is a list of 8 mutations found on the last sampling day (day 152):

del 141-144 [9]



E484A [6]



N501Y [5],[7]


The first mutations appear on day 75 and all virus variants thereafter accumulate mutations. We see a familiar mutation: N501Y, which occurs in the British B.1.1.7 variant [5] and the South African variant [7].

The table Fig 1ab shows only one sequence per day. However, there must exist more sequences at the same time. 

Fig 1c. On Day 152 seven lost amino acids reappear.

For example, in Fig. 1c we see a large 7-amino-acid deletion (positions 12-18) in the Spike protein on Day 146. A week later, on Day 152, the deleted amino acids reappear. Any mutated amino acid can mutate back to the wildtype, but deleted amino acids cannot reappear. So they must have been inherited from other variants without the deletion, which are not shown in the table. The table is therefore incomplete: different sub-populations must exist side by side and evolve independently. See the phylogenetic tree:

Phylogenetic tree of virus populations within one patient [1].

This small tree shows evolutionary diversification. As time progresses more lineages appear. At time T0 there are two lineages and at T3 there are 3 different lineages. Unfortunately, there are not enough data to establish which variants went extinct and how many variants survived until the last day. But, then, this is a patient, not an experiment.



The pharmacological cocktail given to this patient with a weakened immune system seems to have caused accelerated evolution of SARS-CoV-2. The mutations are disproportionately located in the Spike protein and there are more non-synonymous than synonymous mutations. In other words: micro-evolution with positive selection for reproductive success of the virus and adaptation to the internal environment of the patient. Maybe the discontinuous therapy has given the virus time to recover and thereby contributed to the accelerated evolution.

How to download a sequence from NCBI

To download the Spike protein of SARS-COV-2 (Surface glycoprotein):
  • click this link (SARS-CoV-2 Spike protein Reference is selected)
  • select the YP_009724390 (this is the reference sequence; 1273 AA)
  • press DOWNLOAD button 
  • step 1 'Protein' is pre-selected
  • Step 2: Next
  • Download Selected Records 
  • Step 3: 'Use default'. Next
  • Download 
  • the downloaded file is: sequences.fasta
  • this file is a plain text file and can be read in a simple text editor
  • the file is formatted in lines of 60 characters.
  • optional: remove first line and save as YP_009724390.txt
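Once the file is on disk, the last few steps can also be done with a few lines of Python. This is a minimal sketch under two assumptions: the file is called sequences.fasta as in the steps above, and it contains exactly one record. The function names are my own.

```python
from pathlib import Path


def parse_single_fasta(text):
    """Split single-record FASTA text into (header, sequence without line breaks)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected exactly one record")
    return lines[0][1:], "".join(lines[1:])


def rewrap(sequence, width=100):
    """Cut a sequence into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


if __name__ == "__main__":
    path = Path("sequences.fasta")  # file name taken from the download steps
    if path.exists():
        header, spike = parse_single_fasta(path.read_text())
        print(header)
        print(len(spike))  # the reference Spike protein should give 1273
        # Save without the header line, in lines of 100 characters.
        Path("YP_009724390.txt").write_text("\n".join(rewrap(spike)) + "\n")
```

Lines of 100 characters make it easier to find a position by counting, which is why the overview figure below uses that width instead of the 60 characters of the download.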

Overview of all mutations (red) in Spike protein.
Yellow highlight: sequences shown in the publication.
Used is the Reference sequence (1273 AA).
Formatted in lines of 100 characters.

I used the downloaded Reference sequence YP_009724390 of the Spike protein to show the mutations in context and to compare the sequences of the patient with the official reference sequence. Deletions are also shown in red. The sequences shown in the publication are highlighted in yellow. This is an example of how one can use downloaded sequences.
Alternative method [10]: 
click on YP_009724390 and scroll down and you find the amino acid (AA) sequence of the Reference Spike protein in rows of 60 in blocks of 10 per row:
the complete sequence of the Spike protein: 1273 AA


- hypoxemia: an abnormally low level of oxygen in the blood
- A 7 amino acid (AA) deletion corresponds to a 21 nucleotide (nt) deletion.
- Synonymous mutation is a mutation in DNA/RNA that doesn't change the amino acid. A non-synonymous mutation does change the amino acid.

- Remdesivir is a broad-spectrum anti-viral molecule (wikipedia)

- NCBI = The National Center for Biotechnology Information

- RT-PCR = Reverse-Transcriptase–Polymerase-Chain-Reaction.
  1. People with weakened immune systems are at higher risk of getting severely sick from SARS-CoV-2, the virus that causes covid-19
  2. 32 authors: 'Persistence and Evolution of SARS-CoV-2 in an Immunocompromised Host' in The New England Journal of Medicine, November 11, 2020
  3. Experimental treatment for COVID-19 (Regeneron Pharmaceuticals)
  4. Positive selection has been reported by others:  Positive selection within the genomes of SARS-CoV-2 and other Coronaviruses independent of impact on protein function
  5. see a previous blog: Finding the highly transmissible British SARS-CoV-2 B.1.1.7 variant in the USA
  6. "The most consequential mutations, at a location called E484, caused a steep drop in the potency of some individuals’ antibodies. Coronavirus variants identified in South Africa and Brazil carry a mutation at the same spot." Nature.
  7. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. Posted December 22, 2020.
  8. The same effect is predicted to occur when the second covid-19 vaccination is delayed: Could delaying a second vaccine dose lead to more dangerous coronavirus strains?  Jan 14, 2021 (thank you Harry)
  9. correction: del142-144 must be del141-144 (4AA) on the last day. On previous days it was del142-144 (3AA).
  10. Update 24 Jan 2021: I added an alternative method to get the complete amino acid sequence without downloading it to your computer.


Thanks to Marleen for pointing me to the NRC article that referred to publication [2].

16 January 2021

Has the virus ever been isolated, identified, and purified?


Corona Update 16 January 2021

"Has the virus ever been isolated, identified, and purified?"

Last night, in a YouTube livestream hosted by Vincent Everts and Charles Groenhuizen, a pharmacist, a pharmacologist, a virologist and a general practitioner were interviewed about the possible approval of Ivermectin as a treatment for covid-19. Viewers could ask questions. I fell off my chair when I saw the following question come by:

"Has the virus ever been isolated, identified, and purified?" (t=3790)

The SARS-CoV-2 pandemic began at the end of December 2019 in Wuhan. On 15 January 2021, more than a year later, someone asks:

"Has the virus ever been isolated, identified, and purified?"

I assume that the questioner is acting in good faith, and that it is an honest question from someone who simply does not know, but who may have been influenced by conspiracy theorists on social media. The virologist Ab Osterhaus was asked to answer the question. He did so briefly and clearly. The complete SARS-CoV-2 virus genome was published by Chinese researchers on 10 January 2020. That was very important for the rest of the world. On the basis of that publication, PCR tests could be developed in the West to detect SARS-CoV-2 in people, and the pharmaceutical industry could start designing a vaccine against the virus. In the past, scientists did not have the complete genome of viruses at their disposal. It is enormous progress that this is possible today.

Osterhaus did not mention that the RNA of more than 50,000 individual SARS-CoV-2 viruses is known. So the complete genome is known and published down to the last letter. This can be seen in public databases such as GISAID and NCBI. [1],[2]

The virus is also not 100% new, as the name already suggests: it is version 2. Version 1, SARS-CoV-1, caused a smaller epidemic in 2003. Both belong to the family of coronaviruses. You can find all of this in Wikipedia, and it is not hard to find.

In a following blog I will show how anyone can download the RNA sequence of SARS-CoV-2 as a text file. Then you no longer need to ask the question 'Has the SARS virus ever been isolated, identified, and purified?' Never again!


  1. "In the past year, more than 360,000 SARS-CoV-2 genomes have been sequenced and stored on GISAID, a non-profit online database for sharing viral genomes ... covering more than 140 countries." Nature, 15 Jan 2021.
  2. "more than 90 million cases of COVID-19 have been recorded and only about 350,000 virus variants have been sequenced". Eurekalert 28 Jan 2021


LIVE 16:00 Ab Osterhaus and Adam Cohen on the approval of Ivermectin for Covid-19 treatment

One year since first genomes of SARS-CoV-2 released to the world 10 January 2020 00:41UTC

Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes

11 January 2021

ObsIdentify recognizes Cetti's warbler in a backlit photo with 99% certainty

f/9.0  1/1250 sec ISO 1600. Sony 70-350 telephoto zoom. 9 Jan 2021
Edited photo. ObsIdentify predicts:
Cetti's Warbler - Cettia cetti with 96.9% certainty

f/9.0  1/3200 sec ISO 1600. Sony 70-350 telephoto zoom. 9 Jan 2021
ObsIdentify predicts Cetti's Warbler with 99.8% certainty

The image recognition program ObsIdentify has recognized a backlit photo of a Cetti's warbler with 99% certainty.

It was sitting in the shrubs at the edge of a small lake. I had absolutely no idea what it was. Even at home on my computer screen I did not know. It had something of a wren, but the tail was too long. Or a Rufous-tailed Scrub Robin, but that one is very rare. Or a long-tailed tit, but that has a long narrow tail and a smaller body. When I submitted the photos to ObsIdentify (waarneming.nl) I got, to my great surprise, scores of 99% and higher for the Cetti's warbler! For a backlit photo! One with no colour in it at all. Even the best camera and lens cannot conjure colour out of a backlit photo. I edited the photos to bring out as much detail as possible. The sun was shining that afternoon, but I and/or the sun were in the wrong position. The bird was gone again within a minute. I just managed to take 10 photos. The photos were sharp, but they were in effect black-and-white photos.

The algorithm gave the following verdict: "Confident (99%), but a new location". That means that no approved Cetti's warbler had been registered at that spot in recent months. To my surprise there were observations of the bird at that spot in October, November and December 2020, but none of them had been approved for lack of evidence (a photo). The algorithm does not approve such an observation. A validator (moderator) approved my observation: "accepted (with evidence) by validator". I was exceptionally happy and treated everyone to veggie pastries! It is also a very good performance by the ObsIdentify image recognition! Other photos gave: 99.8%, 99.9%, 93.7%, 99.5%, 100.0%. An impressive performance, we may cautiously conclude. The algorithm remains very sensitive to how tightly the photo is cropped.

This is what it looks like in sunlight (René Fousert)

I was extremely lucky to have been able to photograph it, because it turned out that of the 956 observations of the Cetti's warbler on waarneming.nl in the first 10 days of January, only 8 had uploaded photos (including mine). That is less than 1% of the observations. This is a typical chance observation: I happened to be in the right place at the right time, but had the bad luck that it was a backlit shot. The very large number of observations is probably due to over-reporting: it is winter and it is, after all, a special species.

The Cetti's warbler is special to me, because I heard it (not saw it!) for the first time only last spring, even though I have been interested in (song)birds practically all my life. Officially its status is 'fairly common'. To me it is a rare bird. It was the very first time that I 'saw' it (without knowing that it was a Cetti's). It rarely shows itself. In winter they are somewhat easier to see because the leaves are gone. Its song is unmistakable. But if you dig into waarneming.nl you find a veteran who already observed it in the Netherlands 42 years ago. At that time it was a true rarity, a vagrant. Over the last 20 years its numbers in the Netherlands have increased. The bird evidently feels at home in the Netherlands, especially in reed beds and bank vegetation, of which there is apparently still enough.

You would expect such a bird to be an insect eater and therefore to migrate south like the traditional Dutch songbirds (willow warbler, chiffchaff, garden warbler, nightingale). According to Wikipedia it is 'insectivorous'. Apparently it manages to survive the winter well. What it eats in winter is not known, and no mention is made of overwintering. So it is special that you can see or hear it in winter.


Update 12 Jan: some additions.



Cetti's warbler (Dutch Wikipedia), English Wikipedia.

Cetti's warbler (Vogelbescherming)

Cetti's warbler (waarneming.nl)

Previous blog about ObsIdentify: Tesla image recognition and ObsIdentify image recognition: getting better all the time, but still making classic mistakes (10). This is part 11.

For all blogs about ObsIdentify click on the label ObsIdentify

07 January 2021

Finding the highly transmissible British SARS-CoV-2 B.1.1.7 variant in the USA

Can the new British B.1.1.7 variant be found in the NCBI database?


Corona Update 7 Jan 2021



There is much talk these days about the new British SARS-COV-2 B.1.1.7 variant. Even top scientific journals like Science raise the alarm: "Viral mutations may cause another ‘very, very bad’ COVID-19 wave, scientists warn". 

In the previous blogpost I explored the NCBI database. Can I find this new variant in the  NCBI database? How do I find it in a database of nearly 32,000 SARS-CoV-2 nucleotide sequences? I first tried the sequences from the UK, of course. But, amazingly, they uploaded only 5 complete genomes and 60 proteins. None of them were useful. How do I recognize the variant anyway? The new variant is characterised by the unique combination of 17 mutations (Fig 1).

Fig. 1. All 17 mutations of the British variant B.1.1.7 (source)
Note: all substitutions are non-synonymous!

The Spike protein enables the virus to enter human cells, so it is a very important protein. The standard length of the Spike protein is 1273 amino acids (AA). Since the Spike protein of the new variant has two deletions (together 3 amino acids or 9 bases), it is shortened to 1270 amino acids. So any Spike protein with 1273 AA can be eliminated from the search. For a start, I selected all SARS-CoV-2 Spike protein sequences with a length of 1270 AA. They did exist. I then checked whether all 8 Spike mutations were present. Fortunately, they were all present in five sequences (Fig 2):
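I did this check by eye in the NCBI alignment viewer, but the same idea can be sketched in code. The example assumes that the candidate sequence has already been aligned to the reference, so that both strings have the same length and deletions appear as '-' in the candidate. The function names and the toy sequences are hypothetical and only serve to illustrate the logic.

```python
import re


def parse_substitution(code):
    """Split a code such as 'N501Y' into (reference AA, position, new AA)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def has_substitution(reference, aligned, code):
    """True if the reference has the old AA and the aligned sequence the new AA."""
    ref_aa, pos, new_aa = parse_substitution(code)
    return reference[pos - 1] == ref_aa and aligned[pos - 1] == new_aa


def has_deletion(aligned, first, last):
    """True if positions first..last (1-based, inclusive) are all gaps."""
    return all(aa == "-" for aa in aligned[first - 1:last])


# Toy alignment (hypothetical, not real Spike data).
reference = "MFVNLDT"
candidate = "MF-YLGT"
print(has_substitution(reference, candidate, "N4Y"))  # True
print(has_deletion(candidate, 3, 3))                  # True
print(len(candidate.replace("-", "")))                # 6
```

For a real candidate one would test a code such as N501Y against the 1273 AA reference and expect an ungapped length of 1270.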

Fig 2. B.1.1.7 variants - Spike protein (composite image)
row 2,3,4,5,6 are B.1.1.7 variants. Others are controls.
The numbers above the columns are sequence positions.

That's a promising start. But there are nine mutations in other genes. I  entered the five Sequence IDs in the 'Accession' filter (with no further filters). That results in whole genomes. I hit the button 'Align'. I checked the presence of the remaining nine mutations one by one in the rest of the virus. And, lo and behold, they all were found at the exact locations predicted in the table (Fig.3). That means this really is the new British B.1.1.7 variant. Big surprise: they were captured not in the UK, but in the USA (CA, NY, FL). I guess it is extremely unlikely that they arose independently of the British variant. So, they must have been transported by air travel from the UK to the USA. Collection dates of the US samples are: 19, 20, 24 and 29 December 2020. The new variant was reported on 8 December in the UK (source). So, it spread within a few weeks to the US. Maybe earlier and very likely there are far more than those five in the US.

The B.1.1.7 variant doesn't stop mutating. I already found additional mutations in the US variant, for example: del A28271 shared by all 5, but not found in the reference virus Refseq NC_045512. I am busy checking more.

Quite a lot of people in the Netherlands doubt whether the PCR test detects SARS-CoV-2 at all and conclude there is no SARS-CoV-2 pandemic and lockdown should be stopped immediately. Well, at this very moment the NCBI database contains 47,714 SARS-CoV-2 genome sequences. If this is no proof of the existence of a SARS-CoV-2 pandemic, no evidence will be enough for those people.


I will expand this blog when new information becomes available.

Latest news:

The new variant is roughly 50% more transmissible than other variants, and according to others 56% (Nature).

Update 10 Jan 2021

According to this publication there is a D614G mutation in the Spike protein of the B.1.1.7 variant, but this one is not listed in the table Fig. 1 above. I checked my five variants: indeed D614G is present! This is an interesting mutation, which I will blog about later.

Note: all of the substitutions of the new variant in the list Fig. 1 are non-synonymous: they substitute one amino acid (AA) for another. This will change protein properties! I overlooked this important fact. Of course there are also synonymous mutations in the RNA; they do not change an Amino Acid.

Appendix: technical notes.

Figure 2 consists of 8 different screenshots of 8 different positions along the sequence of the Spike protein. The positions are too far apart to capture them in one screenshot. 

Below the screenshot of the 9 mutations in the rest of the B.1.1.7 sequence:

Fig. 3. Composite of nine mutations of B.1.1.7 (outside Spike)
Second row: Refseq NC_045512.2

This completes the 17 mutations of the B.1.1.7 variant. The second row in Fig. 3 is the Refseq NC_045512.2 from Wuhan, Dec 2019. Remarkable: big deletion in ORF1ab. Row 3 and 5 have undefined bases at position 28280-28283 (letter N).

The five B.1.1.7 sequences are loaded by this link in the NCBI database.  

MW422256 29817 bp

MW422255 29763 bp

MW430974 29861 bp

MW430966 29835 bp

MW440433 29792 bp

Reference Surface glycoprotein: YP_009724390 1273 aa (listed in Fig 2) 

NC_045512 Reference genome SARS-CoV-2 Wuhan, China. 29903 bp. Dec 2019



Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations Dec 2020. Gives table with all mutations of the B.1.1.7 variant used in Figure 1.

Viral mutations may cause another ‘very, very bad’ COVID-19 wave, scientists warn  Science 5 Jan 2021

Report 42 - Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: insights from linking epidemiological and genetic data, posted 4 Jan 2021 on medrxiv.org

05 January 2021

Corona update 5 Jan 2021 NCBI Virus

NCBI Virus is really a great source for investigating SARS-COV-2 genomic data. Today there are 47,376 SARS-COV-2 nucleotide sequences in the database, of which 31,884 are complete and 30,972 are found in the host Homo sapiens. The rest are partial sequences or come from other or unknown host species. Sequences are added on a daily basis.

This blogpost is intended to show how to find deletions, point mutations and differences in length of SARS-COV-2.

Sequence length variation

I was wondering how to explain the length differences in SARS-COV-2 genomes. The NCBI genome viewer is a perfect tool for answering that question. I selected 5 SARS-COV-2 sequences from different countries plus the reference sequence, clicked  Align  and positioned the viewer at the very beginning of the virus sequence (Fig 1). It appeared that the largest difference between these 5 viruses was 36 nucleotides at the start of the sequence. I did not expect that. But with this tool it can be easily investigated.

Fig 1. the beginnings of 6 sequences compared

Fig 2. the end of 6 sequences compared (lower resolution)
(At lower resolution the bases are not shown)

The length of the 30,972 sequences in the database varies from 29,403 to 30,018. That is a difference of 615 nucleotides. Quite large. This can be found simply by sorting on genome length. The Wuhan reference sequence is 29,903 bases long. Actually, there are 3,250 sequences in the database with the same length. So, it is a very common length. 
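The same summary can be computed from a plain list of genome lengths, for example one exported from the results table. This is a small sketch with a made-up function name and toy numbers. Only the reference length of 29,903 and the two extremes are taken from the text.

```python
from collections import Counter


def length_summary(lengths, reference_length=29903):
    """Shortest, longest, spread, and how many genomes match the reference length."""
    counts = Counter(lengths)
    shortest, longest = min(lengths), max(lengths)
    return {
        "shortest": shortest,
        "longest": longest,
        "spread": longest - shortest,
        "same_as_reference": counts[reference_length],
    }


# Toy list: the two extremes from the text plus two reference-length genomes.
print(length_summary([29403, 29903, 29903, 30018]))
# {'shortest': 29403, 'longest': 30018, 'spread': 615, 'same_as_reference': 2}
```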


Further research shows that the shortest sequence (29,403 nt) lacks 265 bases at the beginning of the sequence, has a gap (deletion) of 6 bases (position 515 to 520, Fig. 3), and has lost 343 bases at the end. The 6 base deletion corresponds to exactly 2 codons. So, it has no effect on the downstream codons.


Fig 3. A deletion of 6 bases (position 515 to 520) (id=MT810566)
Identical bases relative to the reference are shown as dots (optional).


Mutations are easy to spot when loading a lot of sequences (up to 200 in one view) and when 'Show differences' is selected from the Coloring dropdown menu. The screen could look like this:

Fig 4. overview mutations (red) in aligned sequences of SARS-COV-2
Settings: Coloring = Show differences
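The idea behind 'Show differences' is simple: every base that equals the reference is hidden, and only the deviating bases remain visible. A small sketch of that idea in Python follows; it is not the viewer's actual code, and the two aligned fragments are made up.

```python
def show_differences(reference, sequence):
    """Replace bases identical to the reference with dots."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must be aligned to the same length")
    return "".join("." if s == r else s for r, s in zip(reference, sequence))

def differing_positions(reference, sequence):
    """Return the 1-based alignment positions where the sequence differs."""
    pairs = zip(reference, sequence)
    return [i for i, (r, s) in enumerate(pairs, start=1) if r != s]

reference = "ATGGAGAGCCTTGTC"  # made-up aligned fragment
sample = "ATGGAGAGTCTT---"     # one substitution and a 3-base gap

print(show_differences(reference, sample))     # ........T...---
print(differing_positions(reference, sample))  # [9, 13, 14, 15]
```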

Fig 5. Non-synonymous mutation  in ORF1ab of SARS-COV-2
3 upper rows: the unmutated codon AAG = K = Lysine
3 lower rows: mutated codon AAT = N = Asparagine
enlarged image (original is much smaller)

Clicking the 'click to expand' symbol (Fig 4, left column) expands the sequence with information about codons and amino acids (green and red horizontal bars). I added blue areas to highlight the location of the mutation. In Fig 5 amino acids identical to the reference sequence are indicated by dots (standard setting). Mutations are shown with a red background.
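The codon change in Fig 5 (AAG to AAT) can be checked against the standard genetic code. The sketch below builds the standard codon table in the usual TCAG order and classifies a change as synonymous or non-synonymous; the function name is my own and the code is only an illustration, not part of the NCBI viewer.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify(ref_codon, alt_codon):
    """Return (reference amino acid, mutated amino acid, kind of mutation)."""
    # Accept RNA notation as well by mapping U to T.
    ref_aa = CODON_TABLE[ref_codon.upper().replace("U", "T")]
    alt_aa = CODON_TABLE[alt_codon.upper().replace("U", "T")]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind

print(classify("AAG", "AAT"))  # ('K', 'N', 'non-synonymous')
print(classify("AAG", "AAA"))  # ('K', 'K', 'synonymous')
```

The first call is the mutation of Fig 5: Lysine (K) becomes Asparagine (N). The second call shows that a different change in the same third position would have left the amino acid unchanged.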

Contribution of countries

[Update 6 Jan 2021]

In this list the contribution of a few countries is shown, based on two criteria: SARS-CoV-2 and nucleotide completeness. The 'host' field is not included because this field is often empty.

  • USA:         nucleotide: 18,233  protein: 218,445
  • Australia:   nucleotide:  9,918  protein: 118,982
  • Netherlands: nucleotide:    600  protein:     191
  • China:       nucleotide:    114  protein:   1,294
  • Italy:       nucleotide:     64  protein:     766
  • UK:          nucleotide:      4  protein:      48

Note: Great Britain has only 4 complete SARS-CoV-2 sequences uploaded in the database. This is strange, because they have sequenced the full genome of thousands of SARS-CoV-2 viruses. That means that in this database I cannot (yet) investigate the new British B.1.1.7 variant!

The Netherlands did not supply the 'host' in almost all contributions, so I cannot compare for example mink with human SARS-COV-2.

Thank you Marleen for pointing out problems in the numbers.



The New MSA Viewer (2 min. introduction video)

NCBI Minute: A Beginner's Guide to Genes and Sequences at NCBI (youtube, introduction of 33 min.)

02 January 2021

Corona update 2 Jan 2021 GISAID

Today I show a nice interactive graphic representation of the SARS-CoV-2 genome.






Official hCoV-19 Reference Sequence (GISAID)

At the start of 2020 nobody could have imagined that an unknown RNA virus with no more than 30,000 bases could bring the world to its knees. 30,000 bases! A genome of 30,000 bases has the power to overwhelm a species with a genome of 3 billion bases, that is 100,000 times more bases. And that species is Homo sapiens, the most powerful and intelligent species on earth. How can such a small virus hit us so hard?

To start, let's have a look at the structure of the genome of the SARS-CoV-2 virus. The picture above is the official hCoV-19 Reference Sequence, shown as a schematic clickable image on a GISAID page. When you open this link a short animation starts. The genes have different colours. Click on the genes for more information. ORF means Open Reading Frame: the part of the RNA that can be translated into proteins. Not all RNA is translated into proteins. For example, the first 265 bases are not translated. For each gene the proteins are listed and references to the literature are given. The yellow gene S codes for the Spike protein. The Spike protein is considered the most important protein because it is the key to entering a human cell. The biggest gene is ORF1. There is a mysterious tenth gene, ORF10, that does not produce a protein. Why is it there? If it is non-functional, mutations including deletions are expected to accumulate with higher frequency. Could it be that genome length variation is due to length variation in gene 10? The last base of the ORF10 gene is at position 29,674. Since the Wuhan reference sequence is 29,903 bases long, the final 229 bases lie beyond the last gene and, like the first 265 bases, are not translated.

Further, in two large genes, ORF1 and ORF9, grey areas are shown. The grey areas show the smaller proteins that are produced after processing the primary polyprotein. For example, the ORF1 protein is split into 15 smaller non-structural proteins (NSP11 is missing from the list) and the ORF9 gene produces two smaller proteins.

In the interactive picture one can see the start and end base position of genes, the length of the genes and the primary and derived proteins, but there are no links to the base sequence itself and the mutations. However, with these limited data one can do simple but interesting calculations. Superficially it looks like the whole genome is translated into proteins. But try this: divide the length of a gene expressed in nucleotides (nt) by 3 (because 3 bases code for one amino acid) and a few surprising facts are uncovered. The number of amino acids and the number of nucleotides do not match exactly [1]. There are more bases in the SARS-COV-2 genome than strictly necessary to code for all the proteins. So, a number of bases are not translated into proteins. Above I mentioned the first 265 bases already. 

A more detailed view of the mutations appears when, in the left panel of the main page, the option 'Show entropy' is set ON and the other options are OFF.

GISAID: entropy of SARS-COV-2 shows the number
of mutations per position.

An overview of the SARS-CoV-2 genome is shown with the genes and the number of mutations. The vertical axis shows the diversity and the horizontal axis shows the nucleotide (NT) or amino acid (AA) positions. Mutations are unevenly distributed: there are mutation hotspots. A slider with a left and a right pointer can be used to zoom in:

GISAID: slider set to S protein (detail)

In the above example the left and right pointers of the slider are moved to select the S gene. The frequency of mutations in the S gene appears as bars. When hovering with the mouse over the bars, further information appears. However, positioning the mouse on a bar is not easy. I think this part of the interface could be improved. Also, in the maximum zoom position all the mutated codons should appear. Explanations of the different options would have been helpful, for example how entropy is defined. I assume the software is still under construction. Conclusion: this genome explorer has nice features, but has its limitations.
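The page does not explain how entropy is defined. A common choice for this kind of diversity plot is the Shannon entropy of each alignment column, and the sketch below assumes that definition; whether the GISAID viewer computes exactly this is my assumption. The four aligned toy sequences are made up.

```python
from collections import Counter
from math import log

def column_entropy(column):
    """Shannon entropy (natural log) of the characters in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    frequencies = [n / total for n in counts.values()]
    return sum(p * log(1 / p) for p in frequencies)

def alignment_entropy(sequences):
    """Entropy per position for equally long, aligned sequences."""
    return [column_entropy(column) for column in zip(*sequences)]

# Made-up alignment: position 1 and 4 are conserved, 2 and 3 vary.
aligned = ["ACGT", "ACGT", "ACTT", "ATTT"]
for position, value in enumerate(alignment_entropy(aligned), start=1):
    print(position, round(value, 3))
```

With this definition a position where all sequences agree has entropy 0, and the value rises as more variants share a position more evenly. That would explain why hotspots show up as tall bars.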



  1. All genes except ORF1ab contain 2 extra bases that are not and cannot be translated into an amino acid. ORF1ab has 1 extra base apart from the first 265 bases that are not translated.