Novel Protein Sequence Comparison Method Based on Transition
Probability Graph and Information Entropy

Zhaohui Qi; Xinlong Wen

Abstract

Aim and Objective: Sequence analysis is one of the foundations of bioinformatics. It is widely used to extract the feature metrics hidden in a sequence. Moreover, the graphical representation of biological sequences is an important tool for sequence analysis. This study aims to develop a new graphical representation of biosequences.

Materials and Methods: Transition probabilities are used to describe the amino acid combinations of protein sequences. The combinations consist of amino acids that are either directly adjacent to each other or separated by multiple amino acids. A transition probability graph is built from the transition probabilities of these combinations. Next, a map from the transition probability graph to a transition probability vector is defined using the k-order transition probability graph. Transition entropy vectors are then derived from the transition probability vector and information entropy. Finally, the proposed method is applied to two separate datasets: 499 HA genes of H1N1 and 95 coronaviruses.
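The steps above can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: a combination is taken to be a pair of residues `gap` positions apart, the transition probability is the row-normalized pair count, the entropy term for each edge is -p·log2(p), and sequences are compared by Euclidean distance. The names `transition_probabilities`, `transition_entropy_vector`, `euclidean_distance`, and the parameter `max_gap` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def transition_probabilities(seq, gap=1):
    """Estimate P(b | a) for residues that sit `gap` positions apart.

    Assumption: gap=1 means directly adjacent residues; gap>1 means
    residues separated by gap-1 intervening amino acids.
    """
    pair_counts = Counter()
    row_totals = Counter()
    for a, b in zip(seq, seq[gap:]):
        # Skip pairs containing non-standard or ambiguous codes (e.g. X).
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            pair_counts[(a, b)] += 1
            row_totals[a] += 1
    # Normalize each pair count by the number of transitions leaving `a`.
    return {
        (a, b): count / row_totals[a]
        for (a, b), count in pair_counts.items()
    }


def transition_entropy_vector(seq, max_gap=3):
    """Concatenate per-edge entropy terms -p*log2(p) for gaps 1..max_gap.

    Assumption: one 400-entry block (20 x 20 ordered pairs) per gap.
    """
    vector = []
    for gap in range(1, max_gap + 1):
        probs = transition_probabilities(seq, gap)
        for a in AMINO_ACIDS:
            for b in AMINO_ACIDS:
                p = probs.get((a, b), 0.0)
                vector.append(-p * math.log2(p) if p > 0 else 0.0)
    return vector


def euclidean_distance(u, v):
    """Distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Toy sequences for illustration only; they are not real viral proteins.
toy_a = "ACDEACDFACDE"
toy_b = "ACDEACCFACDA"
distance = euclidean_distance(
    transition_entropy_vector(toy_a),
    transition_entropy_vector(toy_b),
)
print(f"distance between toy sequences: {distance:.4f}")
```

Because every sequence maps to a vector of the same length regardless of its own length, a pairwise distance matrix built this way could be passed to a standard tree-building tool without first aligning the sequences.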

Results: The phylogenetic trees constructed for each application are consistent with the results of other studies.

Conclusion: The graphical representation proposed in this article is a practical and correct method.

Keywords: Graphical bioinformatics, similarity, sequence, descriptors, transition probability graph, information entropy.



Cite As

Combinatorial Chemistry & High Throughput Screening
