Improved Prediction of Protein-Protein Interaction Mapping on Homo Sapiens by Using Amino Acid Sequence Features in a Supervised Learning Framework

Page: [74 - 83] Pages: 10

Abstract

Background: Protein-protein interactions (PPIs) play a key role in the control of many biological processes, including protein function, disease incidence, and therapy design. However, identifying PPIs by wet-lab experiments is challenging because it is laborious, time-consuming, and expensive. Computational prediction of PPIs is therefore increasingly emphasized ahead of experimental validation, since it is less laborious, faster, and cheaper.

Objective: This study aims to develop an improved computational method for predicting PPIs in Homo sapiens using amino acid sequence features in a supervised learning framework.
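
As an illustration of what an amino acid sequence feature can look like, the sketch below computes amino acid composition (AAC), the fraction of each of the 20 standard residues in a sequence. The function names, the toy sequences, and the choice to encode a protein pair by concatenating the two AAC vectors are assumptions made for illustration; the abstract does not state how pairs were encoded.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Return the 20-dimensional amino acid composition of one sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute; other symbols (e.g. X) are ignored.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def encode_pair(seq_a, seq_b):
    """Assumed pair encoding: concatenate both AAC vectors (40 features)."""
    return aac(seq_a) + aac(seq_b)


# Toy sequences, not taken from the study's data.
features = encode_pair("MKTAYIAKQR", "GAVLIMKK")
```

Each half of the resulting vector sums to 1, so the encoding does not depend on sequence length.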

Methods: We collected 91 experimentally validated positive PPI pairs of human protein sequences from the IntAct Molecular Interaction Database. We then constructed three balanced datasets with ratios of 1:1, 1:2, and 1:3 of positive to negative PPI samples, and partitioned each dataset into a training set (80%) and an independent test set (20%). Each training set was further partitioned into four mutually exclusive groups of equal size, and each group was interchanged in turn with the independent test group to perform 5-fold cross-validation (CV). Finally, we trained seven candidate classifiers (NN, SVM, LR, NB, KNN, AB, and RF) for each ratio and compared their performance scores to identify the best PPI predictor.
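
The partitioning described above can be read as a five-way rotation: the data are divided into five groups, one serves as the test group, and the remaining four form the training set. The following is a minimal sketch under that reading. The helper names and the random seed are hypothetical, the split is not stratified by class, and the way negative pairs were selected is not stated in the abstract, so only labels are generated here.

```python
import random


def build_labels(n_positive, negatives_per_positive):
    """Labels for a dataset with a 1:k ratio of positive to negative pairs."""
    return [1] * n_positive + [0] * (n_positive * negatives_per_positive)


def five_groups(n_samples, seed=0):
    """Shuffle sample indices and deal them into five near-equal groups."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[g::5] for g in range(5)]


def rotations(groups):
    """Yield (train, test) index lists, holding out each group once."""
    for held_out in range(len(groups)):
        test = groups[held_out]
        train = [i for g, grp in enumerate(groups) if g != held_out for i in grp]
        yield train, test


labels = build_labels(91, 2)        # the 1:2 case: 91 positives, 182 negatives
groups = five_groups(len(labels))
folds = list(rotations(groups))     # five (train, test) splits
```

Each split trains on roughly 80% of the samples and tests on the remaining 20%, matching the proportions given in the Methods.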

Results: The random forest (RF)-based predictor trained with a 1:2 ratio of positive to negative PPI samples and amino acid composition (AAC) encoding features gave the most accurate PPI predictions, producing the highest average scores for accuracy (93.50%), sensitivity (95.0%), MCC (85.2%), AUC (0.941), and pAUC (0.236) under 5-fold cross-validation. It also achieved the highest average scores for accuracy (92.0%), sensitivity (94.0%), MCC (83.6%), AUC (0.922), and pAUC (0.207) on the independent test datasets, compared with the other candidate and existing predictors.
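
For readers unfamiliar with the reported scores, the sketch below shows the standard definitions of accuracy, sensitivity, the Matthews correlation coefficient (MCC), and AUC on a toy example. The labels and scores are invented for illustration and are not the study's data. MCC naturally ranges from -1 to 1; the percentages above presumably correspond to that value multiplied by 100. pAUC is omitted because the abstract does not state the false-positive-rate range it was computed over.

```python
import math


def confusion(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = PPI)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, tn, fp, fn):
    return tp / (tp + fn)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four interacting and four non-interacting pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

cm = confusion(y_true, y_pred)
acc, sens, mcc_value = accuracy(*cm), sensitivity(*cm), mcc(*cm)
auc_value = auc(y_true, scores)
```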

Conclusion: The final prediction results strongly suggest that the RF-based predictor is a better model for PPI prediction in Homo sapiens.

Keywords: Protein sequence, protein-protein interaction (PPI) prediction, sequence encoding, feature selection, supervised learning framework, performance comparison, random forest.
