Evolution blog

11 January 2021

ObsIdentify (NIA) recognises Cetti's warbler in a backlit photo with 99% certainty

f/9.0 1/1250 sec ISO 1600. Sony 70-350 telezoom. 9 Jan 2021
Edited photo. ObsIdentify predicts:
Cetti's Warbler - Cettia cetti with 96.9% certainty

f/9.0 1/3200 sec ISO 1600. Sony 70-350 telezoom. 9 Jan 2021
ObsIdentify predicts Cetti's Warbler with 99.8% certainty

The image recognition program ObsIdentify (NIA) has recognised a backlit photo of a Cetti's warbler with 99% certainty.

It was sitting in the shrubs at the edge of a small lake. I had absolutely no idea what it was. Even at home on my computer screen I could not tell. It looked a bit like a wren, but the tail was too long. Or a Rufous-tailed Scrub Robin, but that species is very rare. Or a long-tailed tit, but that has a long, narrow tail and a smaller body. When I submitted the photos to ObsIdentify (waarneming.nl), I got, to my great surprise, scores of 99% and higher for Cetti's warbler! For a backlit photo! With no colour in it at all. Even the best camera and lens cannot conjure colour out of a backlit photo. I edited the photos to bring out as much detail as possible. The sun was shining that afternoon, but I and/or the sun were in the wrong position. The bird was gone again within a minute. I only just managed to take 10 photos. The photos were sharp, but they were in effect black-and-white photos.

The algorithm gave the following verdict: "Confident (99%), but a new location". That means that no approved Cetti's warbler had been recorded at that spot in recent months. To my surprise, there were observations of the bird at that spot in October, November and December 2020, but none of them had been approved, for lack of evidence (a photo). The algorithm does not approve such an observation. A validator (moderator) approved my observation: "accepted (with evidence) by validator". I was exceptionally happy and treated everyone to veggie pastries! It is also a very good performance by the ObsIdentify image recognition! Other photos gave: 99.8%, 99.9%, 93.7%, 99.5%, 100.0%. An impressive achievement, we may cautiously conclude. The algorithm does remain very sensitive to how tightly the photo is cropped.

This is what it looks like in sunlight (René Fousert)

I was extremely lucky to be able to photograph it: of the 956 observations of Cetti's warbler on waarneming.nl in the first 10 days of January, only 8 came with uploaded photos (including mine). That is less than 1% of the observations. This is a typical chance observation: I happened to be in the right place at the right time, but had the bad luck that it was a backlit shot. The very large number of observations is probably due to over-reporting: it is winter and it is, after all, a special species.

Cetti's warbler is special to me because I first heard it (not saw it!) only last spring, even though I have been interested in (song)birds practically all my life. Officially its status is 'fairly common'. To me it is a rare bird. It was in fact the first time I 'saw' it (without knowing it was a Cetti's). It rarely shows itself. In winter they are slightly easier to see because the leaves are gone. Its song is unmistakable. But if you dig into waarneming.nl you find a veteran birder who observed it in the Netherlands 42 years ago. Back then it was a true rarity, a vagrant. Over the last 20 years its numbers in the Netherlands have increased. The bird apparently feels at home in the Netherlands, especially in reed beds and bank vegetation. And there is apparently still enough of that in the Netherlands.

You would expect such a bird to be an insect eater and therefore to migrate south like the traditional Dutch songbirds (willow warbler, chiffchaff, garden warbler, nightingale). According to Wikipedia it is 'insectivorous'. Apparently it can sustain itself well in winter. What it eats in winter is not known, and no mention is made of overwintering. So it is special that you can see or hear it in winter.

Update 12 Jan: some additions.

References

Cetti's warbler (Dutch Wikipedia), English Wikipedia.

Cetti's warbler (Vogelbescherming)

Cetti's warbler (waarneming.nl)

Previous blog about ObsIdentify: Tesla image recognition and ObsIdentify image recognition: getting better all the time, but still making classic mistakes (10). This is part 11.

For all blogs about ObsIdentify, click the label ObsIdentify.

07 January 2021

Finding the highly transmissible British SARS-CoV-2 B.1.1.7 variant in the USA

Can the new British B.1.1.7 variant be found in the NCBI database?

Corona Update 7 Jan 2021

There is much talk these days about the new British SARS-CoV-2 B.1.1.7 variant. Even top scientific journals like Science are raising the alarm: "Viral mutations may cause another ‘very, very bad’ COVID-19 wave, scientists warn".

In the previous blogpost I explored the NCBI database. Can I find this new variant in the NCBI database? How do I find it in a database of nearly 32,000 SARS-CoV-2 nucleotide sequences? I first tried the sequences from the UK, of course. But, amazingly, they uploaded only 5 complete genomes and 60 proteins. None of them were useful. How do I recognize the variant anyway? The new variant is characterised by the unique combination of 17 mutations (Fig 1).

Fig. 1. All 17 mutations of the British variant B.1.1.7 (source)
Note: all substitutions are non-synonymous!

The Spike protein enables the virus to enter human cells, which makes it a very important protein. The standard length of the Spike protein is 1273 amino acids (AA). Since the Spike protein of the new variant has two deletions (together 3 amino acids or 9 bases), it is shortened to 1270 amino acids. So, any Spike protein with 1273 AA can be eliminated from the search. For a start, I selected all SARS-CoV-2 Spike protein sequences with a length of 1270 AA. They did exist. I then checked whether all 8 Spike mutations were present. Fortunately, they were present in five sequences (Fig 2):
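The length filter described above can be sketched in a few lines of Python. This is only an illustration and not how NCBI Virus does it: the function names `parse_fasta` and `candidates_by_length` are made up for this example, and the two toy sequences are placeholders, not real Spike proteins. Only the lengths 1273 and 1270 come from the text.

```python
from typing import Dict, Iterator, Tuple

REFERENCE_SPIKE_LENGTH = 1273  # reference Spike length in amino acids (from the text)
DELETED_AMINO_ACIDS = 3        # the two B.1.1.7 deletions together (from the text)


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def candidates_by_length(fasta_text: str, expected_length: int) -> Dict[str, str]:
    """Keep only the sequences whose length equals expected_length."""
    return {
        name: seq
        for name, seq in parse_fasta(fasta_text)
        if len(seq) == expected_length
    }


# Toy input (hypothetical identifiers, placeholder residues).
toy_fasta = (
    ">toy_reference\n" + "M" * 1273 + "\n"
    ">toy_variant\n" + "M" * 1270 + "\n"
)
expected = REFERENCE_SPIKE_LENGTH - DELETED_AMINO_ACIDS
hits = candidates_by_length(toy_fasta, expected)
print(expected, list(hits))
```

A length of 1270 only makes a sequence a candidate; the individual mutations still have to be checked one by one, as described next.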

Fig 2. B.1.1.7 variants - Spike protein (composite image)
Rows 2 to 6 are B.1.1.7 variants. The others are controls.
The numbers above the columns are sequence positions.

That's a promising start. But there are nine mutations in other genes. I entered the five Sequence IDs in the 'Accession' filter (with no further filters). That results in whole genomes. I hit the button 'Align'. I checked the presence of the remaining nine mutations one by one in the rest of the virus. And, lo and behold, they were all found at the exact locations predicted in the table (Fig. 3). That means this really is the new British B.1.1.7 variant. Big surprise: they were captured not in the UK, but in the USA (CA, NY, FL). I guess it is extremely unlikely that they arose independently of the British variant. So, they must have been transported by air travel from the UK to the USA. Collection dates of the US samples are: 19, 20, 24 and 29 December 2020. The new variant was reported on 8 December in the UK (source). So, it spread to the US within a few weeks, maybe earlier, and very likely there are far more than those five in the US.
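Checking mutations one by one in the aligned view can also be done programmatically once the alignment is exported. The sketch below assumes two already aligned strings of equal length, with '-' for gaps and 'N' for undefined bases; the function name `list_differences` and the toy sequences are hypothetical and not taken from the NCBI viewer.

```python
from typing import List, Tuple


def list_differences(ref_aligned: str, sample_aligned: str) -> List[Tuple[int, str, str]]:
    """Return (reference_position, reference_base, sample_base) for each difference.

    Positions are 1-based and count reference bases only, so they can be
    compared with positions given relative to the reference genome.
    Undefined bases ('N') in the sample are skipped.
    """
    if len(ref_aligned) != len(sample_aligned):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    ref_pos = 0
    for ref_base, sample_base in zip(ref_aligned, sample_aligned):
        if ref_base != "-":
            ref_pos += 1
        if sample_base == "N":
            continue  # undefined base: tells us nothing
        if ref_base != sample_base:
            differences.append((ref_pos, ref_base, sample_base))
    return differences


# Toy alignment: one substitution, a 2-base deletion and one undefined base.
toy_reference = "ACGTACGTAC"
toy_sample = "ACGAAC--AN"
toy_differences = list_differences(toy_reference, toy_sample)
print(toy_differences)
# [(4, 'T', 'A'), (7, 'G', '-'), (8, 'T', '-')]
```

With real data, the resulting list could be compared against the positions in the mutation table of Fig. 1.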

The B.1.1.7 variant doesn't stop mutating. I already found additional mutations in the US variant, for example: del A28271 shared by all 5, but not found in the reference virus Refseq NC_045512. I am busy checking more.

Quite a lot of people in the Netherlands doubt whether the PCR test detects SARS-CoV-2 at all and conclude there is no SARS-CoV-2 pandemic and lockdown should be stopped immediately. Well, at this very moment the NCBI database contains 47,714 SARS-CoV-2 genome sequences. If this is no proof of the existence of a SARS-CoV-2 pandemic, no evidence will be enough for those people.

I will expand this blog when new information becomes available.

Latest news:

The new variant is roughly 50% more transmissible than other variants; another estimate puts it at 56% (Nature).

Update 10 Jan 2021

According to this publication there is a D614G mutation in the Spike protein of the B.1.1.7 variant, but this one is not listed in the table in Fig. 1 above. I checked my five variants: indeed D614G is present! This is an interesting mutation, which I will blog about later.

Note: all of the substitutions of the new variant listed in Fig. 1 are non-synonymous: they substitute one amino acid (AA) for another. This will change protein properties! I overlooked this important fact. Of course there are also synonymous mutations in the RNA; they do not change an amino acid.

Appendix: technical notes.

Figure 2 consists of 8 different screenshots of 8 different positions along the sequence of the Spike protein. The positions are too far apart to capture them in one screenshot.

Below is the screenshot of the 9 mutations in the rest of the B.1.1.7 sequence:

Fig. 3. Composite of nine mutations of B.1.1.7 (outside Spike)
Second row: Refseq NC_045512.2

This completes the 17 mutations of the B.1.1.7 variant. The second row in Fig. 3 is the Refseq NC_045512.2 from Wuhan, Dec 2019. Remarkable: a big deletion in ORF1ab. Rows 3 and 5 have undefined bases at positions 28280-28283 (letter N).

The five B.1.1.7 sequences can be loaded in the NCBI database with this link.

MW422256 29817 bp

MW422255 29763 bp

MW430974 29861 bp

MW430966 29835 bp

MW440433 29792 bp

Reference Surface glycoprotein: YP_009724390 1273 aa (listed in Fig 2)

NC_045512 Reference genome SARS-CoV-2 Wuhan, China. 29903 bp. Dec 2019

https://www.ncbi.nlm.nih.gov/protein/QQK53357.1
https://www.ncbi.nlm.nih.gov/protein/QQH18533.1
https://www.ncbi.nlm.nih.gov/protein/QQK53261.1
https://www.ncbi.nlm.nih.gov/protein/QQK89963.1
https://www.ncbi.nlm.nih.gov/protein/QQH18545.1

References

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations Dec 2020. Gives table with all mutations of the B.1.1.7 variant used in Figure 1.

Viral mutations may cause another ‘very, very bad’ COVID-19 wave, scientists warn Science 5 Jan 2021

Report 42 - Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: insights from linking epidemiological and genetic data, posted 4 Jan 2021 on medrxiv.org

05 January 2021

Corona update 5 Jan 2021 NCBI Virus

NCBI Virus is really a great source for investigating SARS-CoV-2 genomic data. Today there are 47,376 SARS-CoV-2 nucleotide sequences in the database, of which 31,884 are complete and 30,972 are found in the host Homo sapiens. The rest are partial sequences or have another or unknown host species. Sequences are added on a daily basis.

This blogpost is intended to show how to find deletions, point mutations and differences in length of SARS-CoV-2.

Sequence length variation

I was wondering how to explain the length differences in SARS-CoV-2 genomes. The NCBI genome viewer is a perfect tool for answering that question. I selected 5 SARS-CoV-2 sequences from different countries plus the reference sequence, clicked Align and positioned the viewer at the very beginning of the virus sequence (Fig 1). It appeared that the largest difference between these sequences was 36 nucleotides at the start of the sequence. I did not expect that. But with this tool it can be easily investigated.

Fig 1. the beginnings of 6 sequences compared

Fig 2. the end of 6 sequences compared (lower resolution)
(At lower resolution the bases are not shown)

The length of the 30,972 sequences in the database varies from 29,403 to 30,018. That is a difference of 615 nucleotides. Quite large. This can be found simply by sorting on genome length. The Wuhan reference sequence is 29,903 bases long. Actually, there are 3,250 sequences in the database with the same length. So, it is a very common length.
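The same summary can be computed from a list of genome lengths. The sketch below uses a tiny made-up list that only contains the shortest, longest and reference lengths mentioned above; the function name `length_summary` is hypothetical, and the counts it prints are for the toy list, not for the database.

```python
from collections import Counter
from typing import Dict, Iterable


def length_summary(lengths: Iterable[int], reference_length: int) -> Dict[str, int]:
    """Summarise genome lengths: shortest, longest, spread, and how many
    sequences have exactly the reference length."""
    counts = Counter(lengths)
    shortest, longest = min(counts), max(counts)
    return {
        "shortest": shortest,
        "longest": longest,
        "spread": longest - shortest,
        "same_as_reference": counts[reference_length],
    }


# Toy list: extremes from the text plus two copies of the reference length.
toy_lengths = [29403, 29903, 29903, 30018]
summary = length_summary(toy_lengths, reference_length=29903)
print(summary)
# {'shortest': 29403, 'longest': 30018, 'spread': 615, 'same_as_reference': 2}
```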

Deletions

Further inspection shows that the shortest sequence (29,403 nt) lacks 265 bases at the beginning of the sequence, has a gap (deletion) of 6 bases (positions 515 to 520, Fig. 3), and lacks 343 bases at the end. The 6-base deletion is exactly 2 codons long, so it does not shift the reading frame of the downstream codons.

Fig 3. A deletion of 6 bases (position 515 to 520) (id=MT810566)
Identical bases relative to the reference are shown as dots (optional).
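Why a 6-base deletion leaves the downstream codons intact can be shown with a toy example. The sequence below is invented and has nothing to do with the real genome; the helper names `keeps_reading_frame`, `delete` and `codons` are hypothetical. The only rule used is that a deletion whose length is a multiple of 3 does not shift the reading frame.

```python
from typing import List


def keeps_reading_frame(deletion_length: int) -> bool:
    """A deletion keeps the reading frame if its length is a multiple of 3."""
    return deletion_length % 3 == 0


def delete(seq: str, start: int, end: int) -> str:
    """Remove positions start..end (1-based, inclusive) from seq."""
    return seq[:start - 1] + seq[end:]


def codons(seq: str) -> List[str]:
    """Split seq into complete codons, reading from the first base."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


toy_gene = "ATGAAACCCGGGTTTTAA"   # ATG AAA CCC GGG TTT TAA

six_gone = delete(toy_gene, 4, 9)   # removes AAA CCC (6 bases, 2 codons)
five_gone = delete(toy_gene, 4, 8)  # removes 5 bases: frameshift

print(keeps_reading_frame(6), codons(six_gone))
# True ['ATG', 'GGG', 'TTT', 'TAA']
print(keeps_reading_frame(5), codons(five_gone))
# False ['ATG', 'CGG', 'GTT', 'TTA']
```

In the first case the codons after the deletion are unchanged; in the second case every downstream codon is different.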

Mutations

Mutations are easy to spot when loading a lot of sequences (up to 200 in one view) and when 'Show differences' is selected from the Coloring drop-down menu. The screen could look like this:

Fig 4. Overview of mutations (red) in aligned sequences of SARS-CoV-2
Settings: Coloring = Show differences

Fig 5. Non-synonymous mutation (G to T) in ORF1ab of SARS-CoV-2
3 upper rows: the unmutated codon AAG = K = Lysine
3 lower rows: mutated codon AAT = N = Asparagine
enlarged image (original is much smaller)

With a mouse click on the 'click to expand' symbol (Fig 4, left column) the sequence expands with information about codons and amino acids (green and red horizontal bars). I added blue areas to highlight the location of the mutation. In Fig 5 amino acids identical to the Reference sequence are indicated by dots (standard setting). Mutations are shown with a red background.
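The codon change in Fig 5 (AAG to AAT) can be checked against the standard genetic code. The sketch below builds the standard codon table and classifies a codon change as synonymous or non-synonymous; the function name `classify` is made up for this example, and the second example codon (AAA) is my own illustration, not something from the figure.

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(ref_codon: str, mut_codon: str):
    """Return (reference amino acid, mutated amino acid, kind of mutation)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    mut_aa = CODON_TABLE[mut_codon.upper()]
    kind = "synonymous" if ref_aa == mut_aa else "non-synonymous"
    return ref_aa, mut_aa, kind


print(classify("AAG", "AAT"))  # the change shown in Fig 5
# ('K', 'N', 'non-synonymous')
print(classify("AAG", "AAA"))  # illustration: both codons code for Lysine
# ('K', 'K', 'synonymous')
```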

Contribution of countries

[Update 6 Jan 2021]

In this list the contributions of a few countries are shown, based on two criteria: SARS-CoV-2 and nucleotide completeness. The 'host' field is not included because this field is often empty.

USA: nucleotide: 18,233 protein: 218,445
Australia: nucleotide: 9,918 protein: 118,982
Netherlands: nucleotide: 600 protein: 191
China: nucleotide: 114 protein: 1,294
Italy: nucleotide: 64 protein: 766
UK: nucleotide: 4 protein: 48

Note: Great Britain has uploaded only 4 complete SARS-CoV-2 sequences to the database. This is strange. They have sequenced the full genome of thousands of SARS-CoV-2 viruses. That means that in this database I cannot (yet) investigate the new British B.1.1.7 variant!

The Netherlands did not supply the 'host' in almost all contributions, so I cannot compare, for example, mink with human SARS-CoV-2.

Thank you Marleen for pointing out problems in the numbers.

References

The New MSA Viewer (2 min. introduction video)

NCBI Minute: A Beginner's Guide to Genes and Sequences at NCBI (youtube, introduction of 33 min.)