A novel statistical method predicts mutability of the genomic segments of the SARS-CoV-2 virus

Published online by Cambridge University Press:  13 December 2021

Amir Hossein Darooneh
Affiliation:
Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada
Michelle Przedborski
Affiliation:
Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada
Mohammad Kohandel*
Affiliation:
Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada
*Author for correspondence: Mohammad Kohandel, E-mail: kohandel@uwaterloo.ca

Abstract

The SARS-CoV-2 virus has caused the largest pandemic of the 21st century, with hundreds of millions of cases and millions of fatalities. Scientists all around the world are racing to develop vaccines and new pharmaceuticals to overcome the pandemic and offer effective treatments for COVID-19. Consequently, there is an essential need to better understand how the pathogenesis of SARS-CoV-2 is affected by viral mutations and to determine the conserved segments in the viral genome that can serve as stable targets for novel therapeutics. Here, we introduce a text-mining method to estimate the mutability of genomic segments directly from a reference (ancestral) whole genome sequence. The method relies on calculating the importance of genomic segments based on their spatial distribution and frequency over the whole genome. To validate our approach, we perform a large-scale analysis of the viral mutations in nearly 80,000 publicly available SARS-CoV-2 predecessor whole genome sequences and show that these results are highly correlated with the segments predicted by the statistical method used for keyword detection. Importantly, these correlations are found to hold at the codon and gene levels, as well as for gene coding regions. Using the text-mining method, we further identify codon sequences that are potential candidates for siRNA-based antiviral drugs. Significantly, one of the candidates identified in this work corresponds to the first seven codons of an epitope of the spike glycoprotein, which is the only SARS-CoV-2 immunogenic peptide without a match to a human protein.
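The two inputs named in the abstract, the frequency of a segment and its spatial distribution over the genome, can be illustrated with a minimal sketch. It assumes that a "word" is a non-overlapping three-nucleotide codon read from a fixed frame; the function name `codon_positions`, the `frame` parameter, and the toy sequence are hypothetical, and the sketch does not reproduce the authors' eccentricity or importance formulas.

```python
from collections import defaultdict


def codon_positions(sequence, frame=0):
    """Map each three-letter word to the list of its word indices.

    Assumption: words are non-overlapping triplets read from `frame`;
    any trailing bases that do not fill a triplet are ignored.
    """
    seq = sequence.upper()
    positions = defaultdict(list)
    for word_index, start in enumerate(range(frame, len(seq) - 2, 3)):
        positions[seq[start:start + 3]].append(word_index)
    return dict(positions)


# Toy reference sequence (hypothetical, not taken from the SARS-CoV-2 genome).
toy_reference = "ATGTTCTTCAAATAA"
positions = codon_positions(toy_reference)

# Frequency of occurrence is simply the number of recorded positions.
frequency = {codon: len(where) for codon, where in positions.items()}
```

Any measure of spatial distribution would then be computed from the position lists, and frequency from their lengths.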

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Figure 1. The probability of appearance of each nucleotide (A, Adenine; C, Cytosine; G, Guanine; T, Thymine) in the reference SARS-CoV-2 genome sequence (blue solid bar), and the fraction of mutated nucleotides (orange bar), for the (a) NCBI data set and (b) GISAID data set.

Figure 2. Percentage of different mutations observed for each nucleotide (A, C, G, T) for (a) the NCBI data set and (b) the GISAID data set. Axis labels reference nucleotide substitutions, ‘Del’ refers to deletion events, and ‘Ins’ are nucleotide insertions.

Figure 3. The percentage of mutations that occur among different possible three-nucleotide sequences. For each sequence, the mutation occurs in the central nucleotide, as indicated at the centre of each plot. Results are shown for (a) the NCBI data set and (b) the GISAID data set.

Table 1. Top 10 nucleotide positions with the highest probability of substitution mutation in the SARS-CoV-2 genomic sequence, based on the GISAID data set

Figure 4. Number of codon mutations associated with each nucleotide position in the SARS-CoV-2 whole genome, according to the GISAID data set. The coloured rectangles at the bottom of the figure depict different gene regions. Three codon positions that include nucleotides with more than 50,000 mutations are identified by red arrows.

Table 2. The 10 most probable codon changes in the SARS-CoV-2 genome, according to the GISAID data set

Figure 5. Probability of nucleotide-substitution codon changes in the SARS-CoV-2 genome, based on the GISAID data set. The y-axis corresponds to the origin codons in the reference genome and the x-axis is the destination codon. The codons are arranged in alphabetical order along each axis.

Table 3. Top 10 most frequent deletion mutations in the SARS-CoV-2 genome, causing the merging or removal of codons, based on the GISAID data set

Figure 6. Total number of mutations (blue solid bar) and the number of silent mutations (orange bar) for distinct codon types, based on the GISAID data set. The codons are arranged in alphabetical order along the horizontal axis.

Figure 7. The positions of two codons, CGG and TAA, in the SARS-CoV-2 reference genome. The vertical blue lines mark the positions of the codons and the coloured rectangles are gene regions in the sequence. The two codons have almost the same frequency of occurrence, 11 and 10, respectively; however, their positions along the genome are markedly different. TAA is a stop codon and plays an important role in protein-making instructions.

Figure 8. The eccentricity (solid blue bar) and frequency of occurrence (orange bar) for all the codons in the SARS-CoV-2 reference genome. The codons are arranged according to their frequency of occurrence, from most to least frequent. The red dashed line shows the codon importance, which takes into account both the normalised eccentricity and normalised frequency according to Eq. (2).

Figure 9. Relative density of codons in the SARS-CoV-2 genome, arranged in order of increasing codon importance, for the low mutation repetition group. Calculations were performed for the GISAID data set.

Figure 10. Average relative density and average importance of SARS-CoV-2 viral genes. The former is obtained from mutation data in the GISAID data set and the latter is calculated based on the SARS-CoV-2 reference genome.

Figure 11. Number of mutations versus rank for all codon positions in the genome, for each codon type, obtained from the GISAID data set. Different codons are distinguished by different colours. We observe that Zipf’s law holds for all codons. The black dashed line corresponds to a power law function with the form $y \propto x^{-1}$.
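One way to check a rank-size relation of the form $y \propto x^{-1}$ is to fit a straight line to log(count) against log(rank) and inspect the slope. The sketch below is an illustration under stated assumptions, not the procedure used for the figure: the function name `zipf_slope` and the toy counts are hypothetical, and an ordinary least-squares fit in log-log space is only one of several ways to estimate a power-law exponent.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Counts are ranked from largest to smallest; zero counts are dropped
    because their logarithm is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (hypothetical): counts that follow count = 1000 / rank exactly.
toy_counts = [1000.0 / rank for rank in range(1, 51)]
slope = zipf_slope(toy_counts)
```

A slope close to -1 on real mutation counts would be consistent with the dashed reference line described in the caption.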

Figure 12. Mutation index of each codon, with codons arranged from least to most important. The plot depicts a strong negative correlation between the mutation index and the codon importance.

Figure 13. Average mutation index and average importance of SARS-CoV-2 viral genes. The former is obtained from mutation data in the GISAID data set and the latter is calculated based on the SARS-CoV-2 reference genome. To illustrate the negative correlation, we rank the genes from the lowest to the highest average mutation index (left panel), in contrast to the ranking order for the importance values.

Table 4. Top 10 important segments with six codons in the SARS-CoV-2 genome sequence

Table 5. Top 10 important segments with seven codons in the SARS-CoV-2 genome sequence

Supplementary material

Kohandel et al. supplementary material 1 (File, 72.7 KB)

Kohandel et al. supplementary material 2 (PDF, 555.9 KB)

Review: A novel statistical method predicts mutability of the genomic segments of the SARS-CoV-2 virus — R0/PR1

Conflict of interest statement

None.

Comments

Comments to Author: Darooneh et al. report on the application of text-mining techniques to genomic analysis of the SARS-CoV-2 virus. The authors have analyzed two genomic datasets and identified genomic contexts and regions with differential mutability. The paper is well written and the results are interesting.

Major comments:

- What percentage of sequences in the NCBI and GISAID databases are common between them? (They are not independent databases.) Is it possible that high corroboration between the datasets is because of a large overlap? (If overlap is large, the analysis should be performed on either the union or intersection of datasets.)

- Throughout the paper, statements need quantification and numbers should be provided in the text. For instance, on page 9, in "a nucleotide has a considerable probability..." the probability must be stated. Qualitative words (such as several, rare, small, etc.) should be quantified throughout.

- Related to the previous comment, the text should describe the results. For instance, on page 10, this sentence is pointing to where the results are, but does not describe them: "several of the nucleotide positions with high probability for mutation in Table I are apparent in Fig. 4 as individual or small clusters of peaks in the number of mutations." Moreover, it is not clear where in Figure 4 this information can be found.

- More than 97% of the viral genome codes for genes; therefore, it’s expected that 83% of the mutations are coding. Throughout the paper, normalization for the lengths of the genome or individual genes needs to be considered and explicitly stated so the significance of the findings can be assessed vs. random.

- There are known relationships between Zipf’s law and measures in population genetics such as the infinite-allele model. The findings need to be discussed within what is known about SARS-CoV-2’s evolution and fixed mutations in the genomes analyzed.

- More insight into the behavior of "importance i(w)" would help with clarity and with justifying the logarithms and square roots. How does this approach compare with normalization and weighting approaches to reduce large differences?

- TTC/TTT code for phenylalanine; more than 98% of mutations in this codon have to be C>T on the last base so they don’t change the amino acid. Were they corrected for codon usage in the viral genome and were the 63,555 mutations in TTC unique or were they the same mutation found in many viral genomes suggesting fixation? How were the numbers in Figure 6 calculated exactly?

- How do the results in Figure 8 inform on codon usage in SARS-CoV-2’s genome?

- Loss of GC and increase in AT content is known for many human viruses as they evolve. Many papers exist on the topic, including https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1000079
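The TTC/TTT point in the comments above can be checked against the standard genetic code. The sketch below is a minimal illustration, not a reconstruction of how the numbers in Figure 6 were calculated: the names `CODON_TABLE`, `is_silent`, and `single_substitutions` are hypothetical, and the table is the standard (not a virus-specific) code built from the conventional T, C, A, G ordering.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_silent(reference_codon, mutated_codon):
    """True if both codons encode the same amino acid (or both are stops)."""
    return CODON_TABLE[reference_codon] == CODON_TABLE[mutated_codon]


def single_substitutions(codon):
    """Yield every codon that differs from `codon` at exactly one base."""
    for i, base in enumerate(codon):
        for alt in BASES:
            if alt != base:
                yield codon[:i] + alt + codon[i + 1:]


# Of the nine single-nucleotide neighbours of TTC, only TTT is synonymous.
silent_neighbours = [m for m in single_substitutions("TTC") if is_silent("TTC", m)]
```

Under the standard code, any silent single-base substitution counted for TTC must therefore be the C>T change at the third position.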

Minor comments:

- Are codons defined based on how they code for proteins, or is a word any three consecutive nucleotides? How is a word defined for non-coding regions?

- The presented results point to a mutagenesis process, possibly due to antigenic drift. How can these findings be translated to other novel viruses and pathogens as suggested in Discussion when different mutagenesis processes may be involved?

- Are highly mutated regions known to be subjected to antigenic pressures?

- Can the rankings presented for the importance of the genes be discussed in light of what is now known of their biology?

- In multiple figures, the x axes are codons that are arranged based on various criteria. These criteria and rankings should be stated in the figure itself.

- This sentence is not clear on page 5: "Like the keywords in text, we assume that the significant codons form clusters."

- The phrase "experience mutation" should be replaced with "are mutated" throughout the paper.

Recommendation: A novel statistical method predicts mutability of the genomic segments of the SARS-CoV-2 virus — R0/PR2

Comments

Comments to Author: Reviewer #1: Darooneh et al. report on the application of text-mining techniques to genomic analysis of the SARS-CoV-2 virus. The authors have analyzed two genomic datasets and identified genomic contexts and regions with differential mutability. The paper is well written and the results are interesting.

Review: A novel statistical method predicts mutability of the genomic segments of the SARS-CoV-2 virus — R1/PR3

Conflict of interest statement

Reviewer declares none.

Comments

Comments to Author: The authors developed an approach to calculate the importance of a segment in a sequence based on a text-mining approach. The importance of a segment is a combination of the frequency and eccentricity. Then the authors derived the relative density and mutational index as quantities of mutational behavior. As an application, the authors have performed substantial work to calculate descriptives and investigate the mutational properties of the SARS-CoV-2 virus, as well as use their measures to interpret the findings.

Major comments:

1. I have difficulties with the eccentricity, as defined in Equation (1). The authors describe it as ‘clustering of a codon’ (page 7, line 16), and more generally ‘clustering of a word’ (page 6, line 27).

a. The authors need to clearly define what ‘clustering of a word/codon’ precisely means. Do they mean the clustering of several codons, or the clustering of several occurrences of a single codon? E.g. are two clustered codons on average located closer to each other than two arbitrary other codons? Or are the occurrences of a clustered codon on average closer to each other than for another codon?

b. Equation (1) should capture clustering as well as distance from the first quarter of a region. In the equation, the authors sum the squared position of a codon minus the position of the first quarter. The authors should explain why the sum of squared positions captures clustering. It seems that the equation puts more emphasis on whether a segment is far from the first quarter, regardless of whether it is clustered. Namely, a codon that is clustered around the first quarter has almost no impact on the eccentricity, unless at least one occurrence is far away from the first quarter.

c. How do the authors deal with differences in region length? In a longer region, a codon can be located further away from the first quarter, yielding a higher eccentricity. This seems unfair for shorter regions.

2. I also have difficulties with the importance, defined in Equation (2). I agree that the frequency and eccentricity should be normalized before they are combined in one measure. The authors chose to divide each e(w) and f(w) by the total eccentricity and total frequency, respectively. This leads to a normalized quantity between 0 and 1. The authors also applied a log transformation and a square root to "reduce large differences" and account for a "larger scale of the frequencies". After dividing by the total frequency and eccentricity, these two issues no longer seem to be a problem, as all values lie between 0 and 1. Moreover, taking the square root of numbers between 0 and 1 increases their scale. The authors need to justify taking the log and square root.

3. On page 4, last two lines, the authors describe the data used in this manuscript. How crucial is it to have many known sequences in order to apply their method? Can the mutability be reliably estimated when not much is known about a virus? Also, with a very low or high mutability, do we need significantly more or fewer sequences to identify the important segments? Perhaps the authors can state something about this in the Discussion section.
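
Point 2 above rests on a simple property of the square root on the unit interval, which can be written out as a short worked note. The symbols $x$, $x_1$ and $x_2$ are placeholders introduced here for normalized values; they do not reproduce the authors' Equation (2), which is not shown on this page.

```latex
% For a normalized value 0 < x < 1, the square root is larger than x itself:
\sqrt{x} - x = \sqrt{x}\,\bigl(1 - \sqrt{x}\bigr) > 0,
\qquad \text{e.g. } \sqrt{0.01} = 0.1 .

% The ratio between two such values, however, shrinks:
\frac{\sqrt{x_1}}{\sqrt{x_2}} = \sqrt{\frac{x_1}{x_2}},
\qquad \text{e.g. } \frac{x_1}{x_2} = 100 \;\Rightarrow\; \sqrt{\frac{x_1}{x_2}} = 10 .
```

The square root therefore raises every normalized value while compressing the ratios between them; whether that compression is still needed after normalization is the question the reviewer puts to the authors.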

Minor comments:

- Page 6, line 19. The term "finite probability" is odd, since a probability is by definition finite. Do the authors mean "equal probability" (or maybe "non-zero probability")?

- Page 7, line 17. It would help the reader if the authors added a sentence explaining why clustering near region boundaries is more important.

- Page 19, line 4. Define normalized eccentricity. Is that the second part of Equation (2)?

- Page 19, line 7. The authors state that more frequent codons correspond with lower eccentricity. This cannot be derived from figure 8, as the eccentricity seems more or less equal across codons. The authors should weaken their statement.

Recommendation: A novel statistical method predicts mutability of the genomic segments of the SARS-CoV-2 virus — R1/PR4

Comments

Comments to Author: Reviewer #2: The authors developed an approach to calculate the importance of a segment in a sequence based on a text-mining approach. The importance of a segment is a combination of the frequency and eccentricity. Then the authors derived the relative density and mutational index as quantities of mutational behavior. As an application, the authors have performed substantial work to calculate descriptives and investigate the mutational properties of the SARS-CoV-2 virus, as well as use their measures to interpret the findings.

Recommendation: A novel statistical method predicts mutability of the genomic segments of the SARS-CoV-2 virus — R2/PR5

Comments

No accompanying comment.