Bioinformatics helps pinpoint evolutionary mutations of SARS-CoV-2

Apr 21, 2020

A new bioinformatics approach to classifying genetic variants of SARS-CoV-2 helps to answer questions such as whether there are one or more circulating strains of the virus with different virulence levels. And if so, which of those strains should be used for therapeutic research?

Australian researchers explored this technique in an April 19 article in Transboundary and Emerging Diseases.

It is well established that newly emerged viruses can evolve rapidly in their hosts and exhibit quasispecies diversity -- a population of viruses with a large number of variant genomes. This diversity arises largely because viral polymerases replicate with low fidelity and lack the ability to correct errors, which produces highly polymorphic populations.

However, coronaviruses express an exoribonuclease that enables high-fidelity replication of their large genomes and therefore limits their mutation rates. Even so, a diversity of strains with unknown functional differences poses a complex problem in terms of rapid development and evaluation of diagnostics, vaccines, antivirals, and antibody therapies.

Advances in sequencing

Advances in genomic sequencing technology and the willingness of the international community to share information in the public domain have allowed researchers to track the evolution of SARS-CoV-2, the agent responsible for the COVID-19 pandemic. Building on this, researchers from the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia developed a bioinformatics approach to improve epidemiology and response efforts by synthesizing complex information more effectively and systematically.

The researchers calculated the frequency of 10-mers (subsequences 10 nucleotides long) in SARS-CoV-2 isolates and then applied principal component analysis (PCA) to allow for visual comparison. First, the researchers showed that the method can tell coronaviruses apart by distinguishing SARS-CoV-2 from 17 severe acute respiratory syndrome (SARS) and six Middle East respiratory syndrome (MERS) isolates. PCA was then used to determine how SARS-CoV-2 sequences clustered together or formed unique strains.
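As a rough illustration of the k-mer frequency and PCA idea described above, the following Python sketch counts overlapping k-mers in a few sequences and projects the resulting frequency vectors onto two principal components. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the toy sequences, and the choice of k = 3 (the article reports 10-mers on real genomes) are all invented for illustration, and PCA is done with a plain NumPy singular value decomposition.

```python
from collections import Counter

import numpy as np


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a nucleotide sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_matrix(seqs, k):
    """Build a matrix of relative k-mer frequencies.

    Rows are sequences, columns are the k-mers seen in any sequence.
    """
    counts = [kmer_counts(s, k) for s in seqs]
    vocab = sorted(set().union(*counts))
    mat = np.array([[c[kmer] for kmer in vocab] for c in counts], dtype=float)
    mat /= mat.sum(axis=1, keepdims=True)  # normalise each row to sum to 1
    return mat, vocab


def pca(mat, n_components=2):
    """Project rows onto the leading principal components using SVD."""
    centered = mat - mat.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy sequences, invented for illustration; real genomes are ~30,000 nt.
reference = "ATGGCTAGCTAGGATCCGATTACGATCGGATCCTAGGCTA"
toy = {
    "reference": reference,
    "one_substitution": reference[:5] + "A" + reference[6:],
    "with_deletion": reference[:10] + reference[22:],
    "unrelated": "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGG",
}

freqs, vocab = kmer_matrix(list(toy.values()), k=3)  # the article used k = 10
coords = pca(freqs)

for name, (pc1, pc2) in zip(toy, coords):
    print(f"{name:>18}  PC1={pc1:+.3f}  PC2={pc2:+.3f}")
```

Because no alignment is needed, a sequence with a deletion simply contributes a different k-mer profile, which is one way to see why such isolates can land farther from the reference in PCA space than on a tree built from aligned positions.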

"Globally there is now a huge amount of individual virus sequences," said Denis Bauer, PhD, CSIRO's bioinformatics team leader and honorary associate professor at Macquarie University, in a statement. "Assessing the evolutionary distance between these data points and visualizing it helps researchers find out about the different strains of the virus, including where they came from and how they continue to evolve."

Of the four Australian isolates of SARS-CoV-2 that the team analyzed, two were grouped closely to Wuhan-Hu-1 (the reference genome), which is consistent with phylogenetic results and reflects the fact that those sequences had only minimal mutational changes in their core sequences relative to Wuhan-Hu-1. PCA placed the other two isolates farther from Wuhan-Hu-1 than the phylogenetic tree did, because of several deletions in their sequences.

The researchers suggest that the k-mer approach may more accurately reflect the fluidity of changes -- the cloud of variants. Further, because the alignment-free method looks at changes across the whole genome rather than at specific locations, the information gleaned from the analysis can determine high-level similarities between genomes, such as distinct genomic islands with common functions or recombination events.

Determining isolates for preclinical models

Nextstrain, an open-source bioinformatics toolkit, is a powerful tool for visualizing the available strains of SARS-CoV-2 in real time, but it relies only on phylogeny. The researchers demonstrated that phylogeny alone may not provide the most complete depiction of changes in the evolution of the virus and newly emerged strains. When identifying a strain for use in preclinical models, researchers aim to find the most representative and appropriate option.

Phylogenetic trees of SARS-CoV-2 show three main clusters but may not adequately cover the evolutionary space of the actively circulating virus, as revealed by PCA. The researchers reran PCA on consensus sequences (the most frequent residues found at each position in a sequence) of the major and emerging SARS-CoV-2 clusters from the phylogenetic trees.
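To make the definition of a consensus sequence concrete, here is a minimal Python sketch that takes the most frequent residue at each position of a set of already aligned, equal-length sequences. The function name and the toy alignment are hypothetical, and the tie-breaking rule (first residue encountered wins) is an assumption of this sketch rather than anything stated by the researchers.

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent residue at each position of an alignment.

    Assumes all sequences are already aligned and of equal length.
    Ties are broken in favour of the residue seen first in that column.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


# Toy alignment of one hypothetical cluster (invented for illustration).
cluster = ["ACGT", "ACGA", "TCGT"]
print(consensus(cluster))
```

One such synthetic sequence per cluster can then be treated like any other genome in a downstream k-mer and PCA analysis.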

They found that the alignment-free approach may be able to suggest alternative strains that cover all emerging clusters. This analysis identified strains (USA/WA1) that may be a good choice for preclinical testing because of their relatively central location and their ability to represent newly emerging clusters.
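One way to think about "relatively central location" is as a distance problem in PCA space. The Python sketch below picks the isolate whose farthest cluster is closest, that is, the one minimizing its worst-case Euclidean distance to the cluster consensus points. This selection rule, the function name, and every coordinate are assumptions made for illustration; the article does not state how the researchers quantified centrality.

```python
import numpy as np


def most_central(isolate_coords, cluster_coords):
    """Pick the isolate with the smallest worst-case distance to any cluster.

    Both arguments map a name to a point in PCA space (hypothetical values).
    Returns the chosen name and each isolate's worst-case distance.
    """
    names = list(isolate_coords)
    isolates = np.array([isolate_coords[n] for n in names], dtype=float)
    clusters = np.array(list(cluster_coords.values()), dtype=float)
    # distances[i, j] is the Euclidean distance from isolate i to cluster j
    distances = np.linalg.norm(
        isolates[:, None, :] - clusters[None, :, :], axis=2
    )
    worst = distances.max(axis=1)
    return names[int(worst.argmin())], dict(zip(names, worst))


# Invented PCA coordinates, for illustration only.
isolates = {
    "isolate_A": (0.0, 0.0),
    "isolate_B": (5.0, 5.0),
    "isolate_C": (-4.0, 1.0),
}
clusters = {
    "cluster_1": (1.0, 0.0),
    "cluster_2": (-1.0, 0.0),
    "cluster_3": (0.0, 1.5),
}

best, worst_case = most_central(isolates, clusters)
print(best, worst_case)
```

Minimizing the worst-case distance favors an isolate that leaves no cluster far away; minimizing the mean distance would be an equally reasonable alternative with a different emphasis.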

The researchers demonstrated that synthetic consensus sequences can be used to visualize the evolutionary space already claimed by the virus and can better inform the choice of isolates for use in developing diagnostics, vaccines, and other countermeasures.