Abstract
Aim and Objective: Sequence analysis is one of the foundations of bioinformatics. It is
widely used to uncover the feature metrics hidden in a sequence. In addition, the graphical
representation of biological sequences is an important tool for sequence analysis. This study was
undertaken to develop a new graphical representation of biosequences.
Materials and Methods: Transition probabilities are used to describe the amino acid combinations in
protein sequences. The combinations consist of amino acids that are either directly adjacent to each
other or separated by multiple amino acids. The transition probability graph is built from the
transition probabilities of these amino acid combinations. Next, a map from the transition
probability graph to a transition probability vector is defined by means of the k-order transition
probability graph. Transition entropy vectors are then derived from the transition probability
vector and information entropy. Finally, the proposed method is applied to two separate data sets:
499 HA genes of H1N1 and 95 coronaviruses.
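As a rough illustration of the pipeline described above, the following minimal sketch estimates transition probabilities for residue pairs separated by a fixed gap and condenses them into one entropy value per gap. The abstract does not give the exact definitions, so several choices here are assumptions: the probabilities are estimated as conditional frequencies P(b | a), the gap parameter stands in for the order k, and the entropy is the Shannon entropy summed over all observed transitions. The function names are hypothetical and are not taken from the article.

```python
from collections import Counter
from math import log2


def transition_probabilities(seq, gap):
    """Estimate P(b | a) for residue pairs (a, b) separated by `gap` residues.

    gap=0 means the two residues are directly adjacent (assumed convention).
    """
    step = gap + 1
    # Count every ordered pair (seq[i], seq[i + step]).
    pair_counts = Counter(zip(seq, seq[step:]))
    # Total number of transitions leaving each source residue.
    row_totals = Counter()
    for (a, _b), n in pair_counts.items():
        row_totals[a] += n
    return {(a, b): n / row_totals[a] for (a, b), n in pair_counts.items()}


def transition_entropy(probs):
    """Shannon entropy in bits, summed over all observed transitions (assumed form)."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)


def transition_entropy_vector(seq, max_gap):
    """One entropy value per gap 0..max_gap, giving a fixed-length descriptor."""
    return [
        transition_entropy(transition_probabilities(seq, gap))
        for gap in range(max_gap + 1)
    ]


if __name__ == "__main__":
    # Toy sequence over a two-letter alphabet, for illustration only.
    print(transition_entropy_vector("AABA", max_gap=2))
```

Because every sequence maps to a vector of the same length, descriptors of this kind can be compared with an ordinary distance measure regardless of sequence length, which is what makes them usable as input for tree construction.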
Results: Phylogenetic trees were constructed for each application, and the results were
consistent with those of other studies.
Conclusion: The graphical representation proposed in this article is a practical and valid method for sequence analysis.
Keywords: Graphical bioinformatics, similarity, sequence, descriptors, transition probability graph, information entropy.