A Binary Classifier for the Prediction of EC Numbers of Enzymes

Page: [383 - 391] Pages: 9

Abstract

Background: Identifying the Enzyme Commission (EC) numbers of enzymes is important for understanding the metabolic processes that produce enough energy to sustain life. Previous studies mainly focused on predicting the six main functional classes or the sub-functional classes, i.e., the first two digits of the EC number.

Objective: In this study, a binary classifier was proposed to identify the full EC number (four digits) of enzymes.

Methods: Enzymes and their known EC numbers were paired as positive samples, and an equal number of negative samples was randomly generated. The association between any two samples was evaluated by integrating the linkages between enzymes and EC numbers. The classic machine learning algorithm, Support Vector Machine (SVM), was adopted as the prediction engine.
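The sample construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_balanced_pairs`, the identifiers, and the rule of drawing negatives uniformly from unobserved enzyme/EC combinations are all assumptions made for the example.

```python
import random


def build_balanced_pairs(known_pairs, seed=0):
    """Return labeled (enzyme, ec_number) pairs with as many negatives as positives.

    known_pairs: iterable of (enzyme_id, ec_number) tuples taken as positive samples.
    Negatives are random enzyme/EC combinations that are not known pairs
    (an assumed sampling rule; the abstract only says they were randomly produced).
    """
    positives = sorted(set(known_pairs))
    positive_set = set(positives)
    enzymes = sorted({enzyme for enzyme, _ in positives})
    ec_numbers = sorted({ec for _, ec in positives})

    # Guard against an impossible request: not enough unobserved combinations.
    available = len(enzymes) * len(ec_numbers) - len(positive_set)
    if available < len(positives):
        raise ValueError("not enough unobserved pairs to balance the dataset")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(positives):
        candidate = (rng.choice(enzymes), rng.choice(ec_numbers))
        if candidate not in positive_set:
            negatives.add(candidate)

    # Label 1 marks a known enzyme/EC pair, label 0 a randomly produced one.
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in sorted(negatives)]
```

Repeating the call with different seeds would give several datasets that share the positive samples but differ in their negatives, which is one plausible reading of the five datasets mentioned in the Results.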

Results: Five-fold cross-validation on five datasets yielded an overall accuracy, Matthews correlation coefficient and F1-measure of about 0.786, 0.576 and 0.771, respectively, suggesting the utility of the proposed classifier. In addition, the effectiveness of the classifier was demonstrated by comparing it with classifiers based on other classic machine learning algorithms.
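For reference, the three reported measures can be computed from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the choice to return 0.0 when a denominator is zero are assumptions of this example, not details taken from the study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, Matthews correlation coefficient and F1 from confusion counts.

    tp, tn, fp, fn: true positives, true negatives, false positives, false negatives.
    Undefined ratios (zero denominator) are reported as 0.0 by convention here.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    return {"accuracy": accuracy, "mcc": mcc, "f1": f1}
```

In a five-fold cross-validation setting, the counts from the five held-out folds would typically be summed before calling such a function, or the per-fold values averaged; the abstract does not state which was done.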

Conclusion: The proposed classifier, which was specially designed for the problem addressed in this study, was effective for predicting the EC numbers of enzymes, as shown by tests on five datasets containing randomly produced samples.

Keywords: Enzyme, EC number, support vector machine, protein-protein interaction, Weka, binary classification, five-fold cross-validation.
