MK-FSVM-SVDD: A Multiple Kernel-based Fuzzy SVM Model for Predicting DNA-binding Proteins via Support Vector Data Description

Page: [274 - 283] Pages: 10


Abstract

Background: Detecting DNA-binding proteins (DBPs) based on biological and chemical methods is time-consuming and expensive.

Objective: In recent years, the rise of computational biology methods based on Machine Learning (ML) has greatly improved the detection efficiency of DBPs.

Methods: In this study, the Multiple Kernel-based Fuzzy SVM Model with Support Vector Data Description (MK-FSVM-SVDD) is proposed to predict DBPs. First, six features are extracted from the protein sequence. Second, multiple kernels are constructed from these sequence features. The kernels are then integrated by Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Next, fuzzy membership scores of the training samples are calculated with Support Vector Data Description (SVDD). Finally, the FSVM is trained and employed to detect new DBPs.
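
The kernel-fusion and membership steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several simplifying assumptions: the data are synthetic, each "view" is a slice of a toy feature matrix standing in for one sequence feature set, the kernel weights come from a simple normalised-alignment heuristic rather than an optimised CKA-MKL solution, and the SVDD sphere centre is approximated by the class centroid in kernel space. All function names (`rbf_kernel`, `cka_weights`, `centroid_membership`) are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center(K):
    """Center a kernel matrix in feature space: H K H, H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment(K1, K2):
    """Centered kernel alignment between two kernel matrices."""
    K1c, K2c = center(K1), center(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def cka_weights(kernels, y):
    """Assumed heuristic: weight each kernel by its alignment with y y^T."""
    target = np.outer(y, y).astype(float)
    a = np.array([max(alignment(K, target), 0.0) for K in kernels])
    if a.sum() == 0.0:
        return np.full(len(kernels), 1.0 / len(kernels))  # fall back to uniform
    return a / a.sum()

def centroid_membership(K, y, delta=1e-3):
    """Fuzzy membership in (0, 1] from kernel-space distance to the class mean.

    Simplified stand-in for SVDD: the sphere centre is replaced by the centroid.
    """
    s = np.empty(len(y))
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Kc = K[np.ix_(idx, idx)]
        # ||phi(x_i) - m||^2 = K_ii - 2 * mean_j K_ij + mean_jk K_jk
        d2 = np.diag(Kc) - 2.0 * Kc.mean(axis=1) + Kc.mean()
        d = np.sqrt(np.maximum(d2, 0.0))
        s[idx] = 1.0 - d / (d.max() + delta)
    return s

# Toy data: 20 "binding" and 20 "non-binding" samples with 6 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, size=(20, 6)),
               rng.normal(-1.0, 1.0, size=(20, 6))])
y = np.array([1] * 20 + [-1] * 20)

# One kernel per (hypothetical) feature view, then a weighted combination.
views = [X[:, 0:2], X[:, 2:4], X[:, 4:6]]
kernels = [rbf_kernel(V, gamma=0.5) for V in views]
w = cka_weights(kernels, y)
K = sum(wi * Ki for wi, Ki in zip(w, kernels))

# Per-sample fuzzy memberships computed on the fused kernel.
s = centroid_membership(K, y)
```

The memberships `s` would then act as per-sample weights when the classifier is trained on the fused kernel `K`. In scikit-learn, for example, `SVC(kernel="precomputed").fit(K, y, sample_weight=s)` rescales the penalty C per sample, which is one common way to realise a fuzzy SVM.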

Results: Our model is evaluated on several benchmark datasets. Compared with other methods, MK-FSVM-SVDD achieves the best Matthews Correlation Coefficient (MCC) on PDB186 (0.7250) and PDB2272 (0.5476).
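
For readers unfamiliar with the metric, the sketch below shows how MCC is computed from the four confusion-matrix counts of a binary classifier. The function name `mcc` and the example counts are illustrative assumptions and are not taken from the paper's experiments.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when the denominator is zero, a common convention.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical counts for illustration only.
example = mcc(tp=90, tn=80, fp=20, fn=10)
```

Because MCC uses all four counts, it is less sensitive to class imbalance than accuracy.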

Conclusion: MK-FSVM-SVDD is more suitable than a standard SVM as the classifier for identifying DNA-binding proteins.

Keywords: DNA-binding proteins, fuzzy support vector machine, multiple kernel learning, support vector data description, membership function.

