Variable Length Character N-Gram Embedding of Protein Sequences for Secondary Structure Prediction

Ashish Kumar Sharma; Rajeev Srivastava

Abstract

Background: The prediction of a protein's secondary structure from its amino acid sequence is an essential step towards predicting its 3-D structure. Prediction performance improves when homologous multiple sequence alignment information is incorporated. However, homologous information is not available for all proteins, so it is necessary to predict protein secondary structure from single sequences.

Objective and Methods: Protein secondary structure is predicted from primary sequences using n-gram word embedding and a deep recurrent neural network. Protein secondary structure depends on both local and long-range neighboring residues in the primary sequence. In the proposed work, variable-length character n-gram words capture the local contextual information of amino acid residues, and embedding vectors represent these n-gram words. A bidirectional long short-term memory (Bi-LSTM) model then captures long-range context by extracting information from both past and future residues in the primary sequence.
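As a rough illustration of the variable-length character n-gram idea, the sketch below collects, for each residue position, the n-grams of several lengths that start at that position, and assigns each distinct n-gram an integer id suitable for an embedding lookup. The abstract does not state the n-gram lengths or how the words are aligned to residues, so the range 1 to 3, the left-anchored windows, and the names `char_ngrams` and `build_vocab` are assumptions made only for this example.

```python
from typing import Dict, List


def char_ngrams(sequence: str, min_n: int = 1, max_n: int = 3) -> List[List[str]]:
    """For each residue position, list the n-grams (min_n..max_n) starting there.

    The length range and left-anchored windows are illustrative assumptions,
    not settings taken from the paper.
    """
    per_residue = []
    for i in range(len(sequence)):
        grams = [
            sequence[i:i + n]
            for n in range(min_n, max_n + 1)
            if i + n <= len(sequence)  # skip windows that run past the end
        ]
        per_residue.append(grams)
    return per_residue


def build_vocab(sequences: List[str], min_n: int = 1, max_n: int = 3) -> Dict[str, int]:
    """Map every observed n-gram to an integer id; id 0 is left free for padding."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for grams in char_ngrams(seq, min_n, max_n):
            for gram in grams:
                vocab.setdefault(gram, len(vocab) + 1)
    return vocab


# Toy four-residue sequence (hypothetical input, not from any dataset).
tokens = char_ngrams("MKVL", 1, 3)
vocab = build_vocab(["MKVL"])
```

In a full pipeline, the integer ids would index rows of a trainable embedding matrix whose outputs feed the Bi-LSTM; that part is omitted here because the abstract gives no architectural details.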

Results: The proposed model is evaluated on three public datasets: ss.txt, RS126, and CASP9. It achieves Q3 accuracies of 92.57%, 86.48%, and 89.66% on ss.txt, RS126, and CASP9, respectively.
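For readers unfamiliar with the metric, Q3 accuracy is conventionally the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the reference label. A minimal sketch of that computation follows; the function name `q3_accuracy` and the toy label strings are hypothetical and are not drawn from the datasets above.

```python
def q3_accuracy(true_ss: str, pred_ss: str) -> float:
    """Percentage of residues whose predicted 3-state label matches the reference."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("label strings must have the same length")
    if not true_ss:
        raise ValueError("label strings must be non-empty")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * correct / len(true_ss)


# Toy example: 7 of 8 residues agree.
score = q3_accuracy("HHHEECCC", "HHHEECCH")
```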

Conclusion: The performance of the proposed model is compared with state-of-the-art methods available in the literature. The comparative analysis shows that the proposed model performs better than these methods.

Keywords: Proteomics, protein secondary structure, amino acid sequence, character n-gram embedding, deep learning, bidirectional long short-term memory.


Protein & Peptide Letters
