Evolution blog

07 January 2021

Finding the highly transmissible British SARS-CoV-2 B.1.1.7 variant in the USA

Can the new British B.1.1.7 variant be found in the NCBI database?

Corona Update 7 Jan 2021

There is much talk these days about the new British SARS-COV-2 B.1.1.7 variant. Even top scientific journals like Science raise the alarm: "Viral mutations may cause another ‘very, very bad’ COVID-19 wave, scientists warn".

In the previous blogpost I explored the NCBI database. Can I find this new variant there? How do I find it in a database of nearly 32,000 SARS-CoV-2 nucleotide sequences? I first tried the sequences from the UK, of course. But, amazingly, the UK has uploaded only 5 complete genomes and 60 proteins. None of them were useful. How do I recognize the variant anyway? The new variant is characterised by a unique combination of 17 mutations (Fig 1).

Fig. 1. All 17 mutations of the British variant B.1.1.7 (source)
Note: all substitutions are non-synonymous!

The Spike protein enables the virus to enter human cells, which makes it a very important protein. The standard length of the Spike protein is 1273 amino acids (AA). Since the Spike protein of the new variant has two deletions (together 3 amino acids or 9 bases), it is shortened to 1270 amino acids. So, any Spike protein with 1273 AA can be eliminated from the search. For a start, I selected all SARS-CoV-2 Spike protein sequences with a length of 1270 AA. They did exist. I then checked whether all 8 Spike mutations were present. Fortunately, they were present in five sequences (Fig 2):

Fig 2. B.1.1.7 variants - Spike protein (composite image)
Rows 2, 3, 4, 5 and 6 are B.1.1.7 variants. The others are controls.
The numbers above the columns are sequence positions.
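The length filter described above can be sketched in a few lines of Python. This is only a minimal sketch, not the NCBI tool itself: the deleted reference positions 69, 70 and 144 are an assumption taken from the table in the cited report (they are not spelled out in the text), and the function names and toy records are hypothetical.

```python
REF_SPIKE_LEN = 1273             # length of the reference Spike protein (AA)
DELETED_REF_POS = (69, 70, 144)  # assumed deleted positions, 1-based

def has_variant_length(seq):
    """True if the protein has the shortened length expected for B.1.1.7."""
    return len(seq) == REF_SPIKE_LEN - len(DELETED_REF_POS)

def position_in_variant(ref_pos):
    """Map a 1-based reference position to the 1-based position in the
    shortened protein. Returns None if the residue itself is deleted."""
    if ref_pos in DELETED_REF_POS:
        return None
    shift = sum(1 for d in DELETED_REF_POS if d < ref_pos)
    return ref_pos - shift

# Toy records: hypothetical names, placeholder sequences of the right length.
records = {"toy_variant": "M" * 1270, "toy_reference": "M" * 1273}
candidates = [acc for acc, seq in records.items() if has_variant_length(seq)]
print(candidates)
print(position_in_variant(501))
```

The second function shows why an alignment against the reference is convenient: in an unaligned 1270 AA sequence every residue after the deletions sits up to three positions earlier than its reference number.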

That's a promising start. But there are nine more mutations in other genes. I entered the five sequence IDs in the 'Accession' filter (with no further filters). That results in whole genomes. I hit the 'Align' button and checked the presence of the remaining nine mutations one by one in the rest of the genome. And, lo and behold, they were all found at the exact locations predicted in the table (Fig. 3). That means this really is the new British B.1.1.7 variant. Big surprise: the samples were collected not in the UK, but in the USA (CA, NY, FL). I guess it is extremely unlikely that they arose independently of the British variant. So, they must have been carried by air travel from the UK to the USA. Collection dates of the US samples are 19, 20, 24 and 29 December 2020. The new variant was reported on 8 December in the UK (source). So, it spread to the US within a few weeks. Maybe it arrived earlier, and very likely there are far more than those five in the US.
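Checking mutations one by one in an alignment amounts to comparing each aligned row with the reference row. A minimal sketch of that comparison follows; it assumes two rows of equal length with '-' for gaps, skips undefined bases (N), and uses made-up toy sequences rather than the real genomes. The function name is hypothetical.

```python
def differences(ref_row, sample_row):
    """List (reference position, reference base, sample base) for every
    column where the sample differs from the reference.
    Positions are 1-based and count only non-gap reference bases, so an
    insertion in the sample is reported at the preceding reference position."""
    if len(ref_row) != len(sample_row):
        raise ValueError("aligned rows must have the same length")
    found = []
    ref_pos = 0
    for ref_base, sample_base in zip(ref_row, sample_row):
        if ref_base != "-":
            ref_pos += 1
        if sample_base == "N":      # undefined base: nothing can be concluded
            continue
        if ref_base != sample_base:
            found.append((ref_pos, ref_base, sample_base))
    return found

# Toy alignment: one substitution (T4A) and one deleted base (position 7).
print(differences("ACGTACGT", "ACGAAC-T"))
```

A list of expected mutations can then be checked by looking up each expected position in the returned list.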

The B.1.1.7 variant doesn't stop mutating. I already found additional mutations in the US sequences, for example del A28271, which is shared by all five but not found in the reference virus RefSeq NC_045512. I am busy checking more.

Quite a lot of people in the Netherlands doubt whether the PCR test detects SARS-CoV-2 at all and conclude there is no SARS-CoV-2 pandemic and lockdown should be stopped immediately. Well, at this very moment the NCBI database contains 47,714 SARS-CoV-2 genome sequences. If this is no proof of the existence of a SARS-CoV-2 pandemic, no evidence will be enough for those people.

I will expand this blog when new information becomes available.

Latest news:

The new variant is roughly 50% more transmissible than other variants; according to another estimate the figure is 56% (Nature).

Update 10 Jan 2021

According to this publication there is a D614G mutation in the Spike protein of the B.1.1.7 variant, but this one is not listed in the table in Fig. 1 above. I checked my five variants: indeed D614G is present! This is an interesting mutation that I will blog about later.

Note: all of the substitutions of the new variant in the list in Fig. 1 are non-synonymous: they substitute one amino acid (AA) for another. This can change protein properties! I overlooked this important fact. Of course there are also synonymous mutations in the RNA; they do not change an amino acid.

Appendix: technical notes.

Figure 2 consists of 8 different screenshots of 8 different positions along the sequence of the Spike protein. The positions are too far apart to capture them in one screenshot.

Below is the screenshot of the 9 mutations in the rest of the B.1.1.7 sequence:

Fig. 3. Composite of nine mutations of B.1.1.7 (outside Spike)
Second row: Refseq NC_045512.2

This completes the 17 mutations of the B.1.1.7 variant. The second row in Fig. 3 is the RefSeq NC_045512.2 from Wuhan, Dec 2019. Remarkable: there is a big deletion in ORF1ab. Rows 3 and 5 have undefined bases at positions 28280-28283 (letter N).

The five B.1.1.7 sequences are loaded by this link in the NCBI database.

MW422256 29817 bp

MW422255 29763 bp

MW430974 29861 bp

MW430966 29835 bp

MW440433 29792 bp

Reference Surface glycoprotein: YP_009724390 1273 aa (listed in Fig 2)

NC_045512 Reference genome SARS-CoV-2 Wuhan, China. 29903 bp. Dec 2019

https://www.ncbi.nlm.nih.gov/protein/QQK53357.1
https://www.ncbi.nlm.nih.gov/protein/QQH18533.1
https://www.ncbi.nlm.nih.gov/protein/QQK53261.1
https://www.ncbi.nlm.nih.gov/protein/QQK89963.1
https://www.ncbi.nlm.nih.gov/protein/QQH18545.1

References

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations Dec 2020. Gives table with all mutations of the B.1.1.7 variant used in Figure 1.

Viral mutations may cause another ‘very, very bad’ COVID-19 wave, scientists warn Science 5 Jan 2021

Report 42 - Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: insights from linking epidemiological and genetic data, posted 4 Jan 2021 on medrxiv.org

05 January 2021

Corona update 5 Jan 2021 NCBI Virus

NCBI Virus is really a great source for investigating SARS-CoV-2 genomic data. Today there are 47,376 SARS-CoV-2 nucleotide sequences in the database, of which 31,884 are complete and 30,972 are found in the host Homo sapiens. The rest are partial sequences or have another or unknown host species. Sequences are added on a daily basis.

This blogpost is intended to show how to find deletions, point mutations and differences in length of SARS-COV-2.

Sequence length variation

I was wondering how to explain the length differences in SARS-COV-2 genomes. The NCBI genome viewer is a perfect tool for answering that question. I selected 5 SARS-COV-2 sequences from different countries plus the reference sequence, clicked Align and positioned the viewer at the very beginning of the virus sequence (Fig 1). It appeared that the largest difference between these 5 viruses was 36 nucleotides at the start of the sequence. I did not expect that. But with this tool it can be easily investigated.

Fig 1. The beginnings of 6 sequences compared

Fig 2. The ends of 6 sequences compared (lower resolution)
(At lower resolution the bases are not shown)

The length of the 30,972 sequences in the database varies from 29,403 to 30,018. That is a difference of 615 nucleotides. Quite large. This can be found simply by sorting on genome length. The Wuhan reference sequence is 29,903 bases long. Actually, there are 3,250 sequences in the database with the same length. So, it is a very common length.
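Sorting on genome length boils down to a minimum, a maximum and a count of the most frequent value. A minimal sketch, using only the three lengths mentioned above in a toy list (the real database export is not reproduced here, and the function name is hypothetical):

```python
from collections import Counter

def length_summary(lengths):
    """Return shortest, longest, spread and the most common genome length."""
    shortest, longest = min(lengths), max(lengths)
    common_length, count = Counter(lengths).most_common(1)[0]
    return {
        "shortest": shortest,
        "longest": longest,
        "spread": longest - shortest,
        "most_common": common_length,
        "most_common_count": count,
    }

# Toy list: shortest, reference length (twice) and longest from the text.
toy_lengths = [29403, 29903, 29903, 30018]
summary = length_summary(toy_lengths)
print(summary["spread"])
```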

Deletions

Further research shows that the shortest sequence (29,403 nt) lacks 265 bases at the beginning of the sequence, has a gap (deletion) of 6 bases (positions 515 to 520, Fig. 3) and lacks 343 bases at the end. The 6-base deletion corresponds to exactly 2 codons, so it has no effect on the reading of the downstream codons.
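Whether a deletion leaves the downstream codons intact can be checked with simple arithmetic. The sketch below assumes that translation of ORF1ab starts at base 266 (the first 265 bases of the reference are untranslated) and that positions are 1-based reference coordinates. The function name is hypothetical.

```python
ORF1AB_START = 266  # assumed first translated base of ORF1ab (1-based)

def deletion_frame(first, last, orf_start=ORF1AB_START):
    """Describe a deletion covering reference positions first..last.
    Returns (length, keeps_reading_frame, starts_on_codon_boundary, first_codon)."""
    length = last - first + 1
    offset = first - orf_start           # distance from the start codon
    keeps_frame = length % 3 == 0        # multiple of 3: no frameshift
    on_boundary = offset % 3 == 0        # deletion begins at a codon's first base
    first_codon = offset // 3 + 1        # 1-based codon number that is hit first
    return length, keeps_frame, on_boundary, first_codon

print(deletion_frame(515, 520))
```

Under these assumptions the deletion at 515 to 520 removes two whole codons (numbers 84 and 85 counted from the start of ORF1ab), so the reading frame is preserved.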

Fig 3. A deletion of 6 bases (position 515 to 520) (id=MT810566)
Identical bases relative to the reference are shown as dots (optional).

Mutations

Mutations are easy to spot when loading a lot of sequences (up to 200 in one view) and when 'Show differences' is selected from the Coloring drop-down menu. The screen could look like this:

Fig 4. overview mutations (red) in aligned sequences of SARS-COV-2
Settings: Coloring = Show differences

Fig 5. Non-synonymous mutation T in ORF1ab of SARS-COV-2
3 upper rows: the unmutated codon AAG = K = Lysine
3 lower rows: mutated codon AAT = N = Asparagine
enlarged image (original is much smaller)

With a mouse click on the 'click to expand' symbol (Fig 4, left column) the sequence expands with information about codons and amino acids (green and red horizontal bars). I added blue areas to highlight the location of the mutation. In Fig 5 amino acids identical to the reference sequence are indicated by dots (standard setting). Mutations are shown with a red background.
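The difference between a synonymous and a non-synonymous mutation can be reproduced with the standard genetic code. The sketch below builds the codon table from the usual compact notation (bases in the order T, C, A, G) and classifies the change from Fig 5, AAG to AAT. The function name is hypothetical; this is not how the NCBI viewer is implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify(ref_codon, alt_codon):
    """Translate both codons and say whether the change alters the amino acid."""
    ref = ref_codon.upper().replace("U", "T")   # accept RNA or DNA letters
    alt = alt_codon.upper().replace("U", "T")
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind

print(classify("AAG", "AAT"))   # the mutation shown in Fig 5
print(classify("AAG", "AAA"))   # a change that keeps lysine
```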

Contribution of countries

[Update 6 Jan 2021]

This list shows the contributions of a few countries, based on two criteria: SARS-CoV-2 and nucleotide completeness. The 'host' field is not included because this field is often empty.

USA: nucleotide: 18,233 protein: 218,445
Australia: nucleotide: 9,918 protein: 118,982
Netherlands: nucleotide: 600 protein: 191
China: nucleotide: 114 protein: 1,294
Italy: nucleotide: 64 protein: 766
UK: nucleotide: 4 protein: 48

Note: Great Britain has uploaded only 4 complete SARS-CoV-2 sequences to the database. This is strange, because they have sequenced the full genome of thousands of SARS-CoV-2 viruses. That means I cannot (yet) investigate the new British B.1.1.7 variant in this database!

The Netherlands did not supply the 'host' in almost all contributions, so I cannot compare for example mink with human SARS-COV-2.

Thank you Marleen for pointing out problems in the numbers.

References

The New MSA Viewer (2 min. introduction video)

NCBI Minute: A Beginner's Guide to Genes and Sequences at NCBI (youtube, introduction of 33 min.)

02 January 2021

Corona update 2 Jan 2021 GISAID

Today I show a nice interactive graphic representation of the SARS-CoV-2 genome.

Official hCoV-19 Reference Sequence (GISAID)

At the start of the year 2020 nobody could have imagined that an unknown RNA virus with no more than 30,000 bases could bring the world to its knees. 30,000 bases! A genome of 30,000 bases has the power to overwhelm a species with a genome of 3 billion bases, that is 100,000 times more bases. That species is Homo sapiens, the most powerful and intelligent species on earth. How can such a small virus hit us so hard?

For a start let's have a look at the structure of the genome of the SARS-CoV-2 virus. The picture above is the official hCoV-19 Reference Sequence shown as a schematic clickable image on a GISAID page. When you open this link a short animation starts. The genes have different colours. Click on the genes for more information. ORF means Open Reading Frame: a part of the RNA that can be translated into protein. Not all RNA is translated into proteins. For example, the first 265 bases are not translated. For each gene the proteins are listed and references to the literature are given. The yellow gene S codes for the Spike protein. The Spike protein is considered the most important protein because it is the key to entering a human cell. The biggest gene is the ORF1 gene. There is a mysterious tenth gene, ORF10, that does not produce a protein. Why is it there? If it is non-functional, mutations including deletions are expected to accumulate with higher frequency. Could it be that genome length variation is due to length variation in gene 10? From the picture one can deduce that the genome must be at least 29,674 bases long, because that is the last base of the ORF10 gene. The reference sequence itself is 29,903 bases long, so the bases after ORF10 are not translated either.

Further, in two large genes, ORF1 and ORF9, grey areas are shown. The grey areas show the smaller proteins that are produced after processing the primary polyprotein. For example ORF1 protein is split into 15 smaller Non Structural Proteins (NSP11 is missing from the list) and the ORF9 gene produces two smaller proteins.

In the interactive picture one can see the start and end base position of genes, the length of the genes and the primary and derived proteins, but there are no links to the base sequence itself and the mutations. However, with these limited data one can do simple but interesting calculations. Superficially it looks like the whole genome is translated into proteins. But try this: divide the length of a gene expressed in nucleotides (nt) by 3 (because 3 bases code for one amino acid) and a few surprising facts are uncovered. The number of amino acids and the number of nucleotides do not match exactly [1]. There are more bases in the SARS-COV-2 genome than strictly necessary to code for all the proteins. So, a number of bases are not translated into proteins. Above I mentioned the first 265 bases already.
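The division by 3 can be tried out on the Spike gene. This is a sketch under assumptions: the start and end positions 21563 and 25384 are the S gene coordinates as I know them from the NC_045512.2 annotation, not values read from the GISAID picture, and the function name is hypothetical. It also shows how much the result depends on whether the gene length is counted as end minus start or with both end positions included.

```python
S_START, S_END = 21563, 25384   # assumed S gene coordinates (NC_045512.2), 1-based
SPIKE_AA = 1273                 # length of the Spike protein in amino acids

def codons(start, end, inclusive=True):
    """Return (length in nt, whole codons, leftover bases) for a gene."""
    length = end - start + (1 if inclusive else 0)
    whole, leftover = divmod(length, 3)
    return length, whole, leftover

print(codons(S_START, S_END, inclusive=True))
print(codons(S_START, S_END, inclusive=False))
```

Counted inclusively the gene has one whole codon more than the protein has amino acids; that extra codon is the stop codon, which is not translated into an amino acid. Counted as end minus start, the same gene appears to have 1273 codons plus 2 leftover bases.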

A more detailed view of the mutations appears when, in the left panel of the main page, the option 'Show entropy' is set ON and the other options are set OFF.

GISAID: entropy of SARS-COV-2 shows the number
of mutations per position.

An overview of the SARS-CoV-2 genome is shown with the genes and the number of mutations. The vertical axis shows the diversity and the horizontal axis shows the nucleotide (NT) or amino acid (AA) positions. Mutations are unevenly distributed: there are mutation hotspots. A slider with a left and a right pointer can be used to zoom in:

GISAID: slider set to S protein (detail)

In the above example the left and right sliders are moved to select the S gene. The frequency of mutations in the S gene appears as bars. When hovering with the mouse over the bars, further information appears. However, positioning the mouse on a bar is not easy. I think this part of the interface could be improved. Also, in the maximum zoom position all the mutated codons should appear. Explanations of the different options would have been helpful, for example how entropy is defined. I assume the software is still under construction. Conclusion: this genome explorer has nice features, but it has its limitations.
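The page does not say how entropy is defined. A common choice for this kind of plot is the Shannon entropy of each alignment column, and the sketch below assumes that definition with the natural logarithm; it is not taken from the GISAID software. The function names and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log

def shannon_entropy(column):
    """Shannon entropy (natural log) of the letters in one alignment column.
    0 means every sequence has the same letter; higher means more diversity."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values())

def entropy_per_position(rows):
    """Entropy for every column of equally long aligned rows."""
    return [shannon_entropy(column) for column in zip(*rows)]

# Toy alignment of four sequences: positions 3 and 4 vary, 1 and 2 do not.
toy_rows = ["ACGT", "ACGA", "ACTA", "ACTA"]
profile = entropy_per_position(toy_rows)
print([round(value, 3) for value in profile[2:]])
```

With this definition a position where half of the sequences carry a mutation scores highest, which would explain why hotspots stand out as tall bars.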

Notes

All genes except ORF1ab contain 2 extra bases that are not and cannot be translated into an amino acid. ORF1ab has 1 extra base apart from the first 265 bases that are not translated.