Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites

Yi-Heng Zhu; Jun Hu; Yong Qi; Xiao-Ning Song; Dong-Jun Yu

Abstract

Aim and Objective: Accurate identification of protein-ligand binding sites helps elucidate protein function and facilitates the design of new drugs. Machine-learning-based methods have been widely used to predict protein-ligand binding sites. However, severe class imbalance, in which nonbinding (majority) residues far outnumber binding (minority) residues, degrades the performance of such machine-learning-based predictors.

Materials and Methods: In this study, we aim to relieve the negative impact of class imbalance by boosting multiple granular support vector machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and a reasonably selected subset of the majority samples. The efficacy of BGSVM in dealing with class imbalance was validated by benchmarking it against several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm.
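
The idea of training each base learner on all minority samples plus a selected subset of the majority samples can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify how majority samples are selected, so the weighted random draw used here, the AdaBoost-style weight update, and the names build_granule, train_boosted_granules, and fit_centroid are all assumptions. A nearest-centroid classifier stands in for the SVM so that the example runs with the standard library alone; an actual SVM trainer would be passed in its place.

```python
import math
import random

def build_granule(labels, weights, ratio=1.0, rng=random):
    """Indices of ALL minority (+1) samples plus a weighted draw of majority (-1) samples.

    Assumption: majority samples are drawn without replacement with probability
    tied to their current boosting weight; the paper's selection rule may differ.
    """
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == -1]
    k = min(len(majority), max(1, round(ratio * len(minority))))
    # Weighted sampling without replacement: keep the k largest keys u ** (1 / w).
    keyed = sorted(majority, key=lambda i: rng.random() ** (1.0 / weights[i]), reverse=True)
    return minority + keyed[:k]

def fit_centroid(X, y):
    """Stand-in base learner (NOT an SVM): assign a point to the nearest class centroid."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == -1])
    def model(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return 1 if d_pos <= d_neg else -1
    return model

def train_boosted_granules(X, y, fit_base, rounds=5, ratio=1.0, seed=0):
    """AdaBoost-style loop in which every round trains on a fresh granular subset."""
    rng = random.Random(seed)
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        idx = build_granule(y, w, ratio, rng)
        model = fit_base([X[i] for i in idx], [y[i] for i in idx])
        preds = [model(x) for x in X]
        # Weighted error on the full training set, clamped to keep the log finite.
        err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        # Raise the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * t * p) for wi, p, t in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of the base learners."""
    score = sum(alpha * model(x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1

# Toy imbalanced data: 5 "binding" (+1) versus 50 "nonbinding" (-1) residues.
X = [[3.0 + 0.1 * i, 3.0] for i in range(5)] + [[0.02 * j, 0.0] for j in range(50)]
y = [1] * 5 + [-1] * 50
ensemble = train_boosted_granules(X, y, fit_centroid, rounds=5)
```

With ratio=1.0 each granule is balanced (here 5 minority and 5 majority samples), so no single base learner sees the full 1:10 imbalance, while the boosting weights steer later granules toward majority samples that earlier learners misclassified.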

Results: Rigorous cross-validation and independent validation tests for five types of protein-nucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.

Keywords: Imbalance learning, granular computing, support vector machine, classifier ensemble, protein-nucleotide binding sites.

Cite As

Combinatorial Chemistry & High Throughput Screening
