SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically

Qing Zhan; Yilei Fu; Qinghua Jiang; Bo Liu; Jiajie Peng; Yadong Wang

Abstract

Background: Multiple sequence alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analyses. The more accurate the alignments, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing its sequences horizontally into two groups and then realigning the two groups. However, this strategy does not take into account that different regions of the sequences are conserved to different degrees. As a result, incorrect residue-residue or residue-gap pairs may arise that horizontal splitting cannot correct.

Objective: Our aim in this article is to develop a novel refinement method based on splitting-splicing vertically.

Methods: We present a novel refinement method based on splitting-splicing vertically, called SpliVert. Given an alignment, we split it vertically into three parts, remove the gap characters from the middle part, realign the middle part alone, and splice the realigned middle part with the two original flanking parts to obtain a refined alignment. During realignment, the aligner focuses only on the selected part and is not disturbed by the other parts, which can help fix incorrect pairs.
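
The split, realign, and splice steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `splivert_refine` and `pad_realign`, the use of column indices `start` and `end` to mark the middle part, and the trivial padding "aligner" are all hypothetical. In practice `realign` would wrap a real MSA tool.

```python
from typing import Callable, List

GAP = "-"


def pad_realign(seqs: List[str]) -> List[str]:
    """Placeholder for a real aligner (assumption): left-justify each
    ungapped segment and pad with gaps so all rows have equal length."""
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, GAP) for s in seqs]


def splivert_refine(
    alignment: List[str],
    start: int,
    end: int,
    realign: Callable[[List[str]], List[str]] = pad_realign,
) -> List[str]:
    """Split the alignment columns into [0, start), [start, end) and
    [end, ncols); realign only the middle part; splice the pieces back."""
    if not alignment:
        return []
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    if not 0 <= start <= end <= ncols:
        raise ValueError("need 0 <= start <= end <= number of columns")

    left = [row[:start] for row in alignment]
    # Remove gap characters from the middle part before realigning it.
    middle = [row[start:end].replace(GAP, "") for row in alignment]
    right = [row[end:] for row in alignment]

    new_middle = realign(middle)
    return [l + m + r for l, m, r in zip(left, new_middle, right)]


initial = ["AC-GT-A", "ACG-TTA", "A-GGT-A"]
refined = splivert_refine(initial, start=1, end=6)
print(refined)
```

Because only the middle columns are rebuilt, the residues of every sequence are preserved and the two flanking parts are returned unchanged. A real aligner passed as `realign` may need special handling for sequences whose middle segment is empty after gap removal.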

Results: We tested our refinement strategy with two leading MSA tools on three standard benchmarks, using the commonly used average SP and TC scores. The results show that, given appropriate proportions for splitting the initial alignment, the average scores after applying our method are comparable to or slightly higher than those of the initial alignments. We also compared the alignments refined by our method with alignments refined directly by the original alignment tools. The results suggest that refining alignments with SpliVert can also outperform refining them directly with the original tools.
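
For readers unfamiliar with the two scores, the sketch below computes them for a test alignment against a reference alignment, using the common definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test, and TC is the fraction of reference columns reproduced in the test. It is a simplified illustration, not the benchmark software used in the study. The function names are hypothetical, and the code assumes both alignments contain the same sequences in the same order and counts only reference columns with at least two residues.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Residue = Tuple[int, int]  # (sequence index, residue index within sequence)


def residue_columns(alignment: List[str], gap: str = "-") -> Dict[Residue, int]:
    """Map every residue to the column it occupies in the alignment."""
    pos: Dict[Residue, int] = {}
    for i, row in enumerate(alignment):
        k = 0
        for col, ch in enumerate(row):
            if ch != gap:
                pos[(i, k)] = col
                k += 1
    return pos


def sp_tc_scores(reference: List[str], test: List[str]) -> Tuple[float, float]:
    """Return (SP, TC) of `test` measured against `reference`."""
    ref_pos = residue_columns(reference)
    test_pos = residue_columns(test)

    # Group reference residues by column.
    columns: Dict[int, List[Residue]] = {}
    for res, col in ref_pos.items():
        columns.setdefault(col, []).append(res)

    ref_pairs = hit_pairs = ref_cols = hit_cols = 0
    for residues in columns.values():
        if len(residues) < 2:
            continue  # a column with one residue defines no aligned pair
        ref_cols += 1
        ok = [test_pos[a] == test_pos[b] for a, b in combinations(residues, 2)]
        ref_pairs += len(ok)
        hit_pairs += sum(ok)
        hit_cols += all(ok)  # column counts only if every pair is kept

    sp = hit_pairs / ref_pairs if ref_pairs else 0.0
    tc = hit_cols / ref_cols if ref_cols else 0.0
    return sp, tc


reference = ["AB", "AB", "AB"]
test_alignment = ["AB-", "AB-", "A-B"]
sp, tc = sp_tc_scores(reference, test_alignment)
print(sp, tc)
```

In this toy case the first reference column is fully recovered, while only one of the three pairs in the second column survives, so SP is 4/6 and TC is 1/2. The example shows why TC is the stricter of the two measures.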

Conclusion: The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.

Keywords: Protein, multiple sequence alignment, progressive alignment, realign, refinement, splitting-splicing vertically.

Cite As

Protein & Peptide Letters