Current Bioinformatics

Author(s): Harris Song, Nan Sun, Wenping Yu* and Stephen S.-T. Yau*

DOI: 10.2174/0115748936269106231025064143

A Novel Natural Graph for Efficient Clustering of Virus Genome Sequences

Page: [687 - 703] Pages: 17


Abstract

Background: Viral genome sequences need to be analyzed and their genetic relationships understood. This study introduces a novel natural graph approach for that purpose.

Objective: This study aims to demonstrate the effectiveness and advantages of the proposed natural graph approach in clustering viral genome sequences into distinct clades, subtypes, or districts. It also explores the approach's interpretability, potential applications, and implications for pandemic control and public health interventions.

Methods: The proposed natural graph algorithm is used to cluster viral genome sequences. The results are compared with those of existing methods and with multidimensional scaling to evaluate the performance and effectiveness of the approach.

Results: The natural graph approach successfully clusters viral genome sequences, providing valuable insights into viral evolution and transmission dynamics. The ability to generate directed connections between nodes enhances the interpretability of the results, facilitating the investigation of transmission pathways and viral fitness.
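The abstract does not spell out how the graph is built, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes each genome is summarized by a normalized k-mer frequency vector, that every node gets one directed edge to its nearest neighbor under Euclidean distance, and that clusters are the weakly connected components of the resulting graph. The function names (`kmer_profile`, `nearest_neighbor_edges`, `weak_components`) and the toy sequences are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def kmer_profile(seq, k=2):
    """Normalized k-mer frequency vector over the alphabet ACGT (assumed feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]


def nearest_neighbor_edges(vectors):
    """Directed edge i -> j, where j is the closest other vector to i.

    Assumes at least two vectors are supplied.
    """
    edges = {}
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        edges[i] = min(others, key=lambda j: dist(v, vectors[j]))
    return edges


def weak_components(edges):
    """Group nodes into weakly connected components with union-find."""
    parent = list(range(len(edges)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges.items():
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(edges)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data (not real genomes): two AT-rich and two GC-rich strings.
toy = ["ATATATATAT", "ATATATTTAT", "GCGCGCGCGC", "GCGCGGGCGC"]
profiles = [kmer_profile(s) for s in toy]
edges = nearest_neighbor_edges(profiles)
print(edges)                   # {0: 1, 1: 0, 2: 3, 3: 2}
print(weak_components(edges))  # [[0, 1], [2, 3]]
```

In this sketch the edge direction records which sequence each node regards as its closest relative, which is the kind of information a directed graph can expose and an undirected clustering cannot.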

Conclusion: The findings highlight the potential applications of the natural graph algorithm in pandemic control, transmission tracing, and vaccine design. Future work may scale the analysis to larger datasets and incorporate additional genetic features for improved resolution. The natural graph approach is a promising tool for viral genomics research, with implications for public health interventions.

Keywords: Viral genome analysis, genetic relationships, natural graph approach, phylogenetic clustering.
