Prediction of Citrullination Sites on the Basis of mRMR Method and SNN

Min Liu; Guangzhong Liu

Abstract

Background: Citrullination, an important post-translational modification of proteins, alters the molecular weight and electrostatic charge of the affected side chains. The conversion of arginine residues to citrulline in protein sequences is catalyzed by a family of Ca2+-dependent peptidyl arginine deiminases (PADs), which comprises five isozymes: PAD 1, 2, 3, 4/5, and 6. Citrullinated proteins have been identified in many biological and pathological processes, and abnormal protein citrullination can lead to serious human diseases, including multiple sclerosis and rheumatoid arthritis.

Objective: Accurate identification of citrullination sites in protein sequences is important because it may contribute to studies of the molecular functions and pathological mechanisms of related diseases.

Methods and Results: In this study, a training set containing 116 positive and 348 negative samples was encoded into a feature matrix, and the mRMR method was used to rank the 941-dimensional features by importance. A predictive model based on a self-normalizing neural network (SNN) was then proposed to predict citrullination sites in protein sequences. Incremental Feature Selection (IFS) and 10-fold cross-validation were used to evaluate the model. Three classical machine learning models, namely random forest, support vector machine, and k-nearest neighbor, were evaluated in the same way and compared with the SNN prediction model. The results suggest that SNN may be the best of these tools for citrullination site prediction: the optimal SNN classifier reached a maximum Matthews Correlation Coefficient (MCC) of 0.672404.
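
The ranking-then-evaluation procedure described above can be illustrated with a minimal sketch. It assumes the features have already been ranked (the list `ranked` stands in for an mRMR ordering), uses a nearest-centroid rule as a lightweight stand-in for the SNN, and runs on synthetic data with the same 1:3 class ratio as the training set. All function names and the toy data are hypothetical and are not taken from the study.

```python
import random

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0

def centroid_classifier(train_X, train_y):
    """Stand-in classifier (NOT the SNN): predict the class of the nearer centroid."""
    def centroid(label):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0
    return predict

def stratified_folds(y, k, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def cross_val_mcc(X, y, features, k=10):
    """Pool out-of-fold predictions over k folds and score them with MCC."""
    y_pred = [None] * len(y)
    for fold in stratified_folds(y, k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        fit_X = [[X[i][f] for f in features] for i in train]
        predict = centroid_classifier(fit_X, [y[i] for i in train])
        for i in fold:
            y_pred[i] = predict([X[i][f] for f in features])
    return mcc(y, y_pred)

def incremental_feature_selection(X, y, ranked_features, k=10):
    """Evaluate the top-1, top-2, ..., top-N ranked features; keep the best prefix."""
    best_score, best_n = float("-inf"), 0
    for n in range(1, len(ranked_features) + 1):
        score = cross_val_mcc(X, y, ranked_features[:n], k)
        if score > best_score:
            best_score, best_n = score, n
    return ranked_features[:best_n], best_score

# Synthetic data: feature 0 is informative, features 1 and 2 are noise.
rng = random.Random(1)
y = [1] * 20 + [0] * 60  # 1:3 positive-to-negative ratio
X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]
ranked = [0, 1, 2]  # hypothetical mRMR order, most important first

best_features, best_mcc = incremental_feature_selection(X, y, ranked)
print(best_features, round(best_mcc, 3))
```

In this sketch the classifier is retrained for every prefix of the ranked list, and the prefix with the highest cross-validated MCC is kept. Swapping in another model only requires replacing `centroid_classifier`.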

Conclusion: The results showed that the SNN-based prediction method performed better than the other three models when evaluated by common metrics such as MCC, accuracy, and F1-measure. The SNN prediction model also achieved a better balance between the recognition of positive and negative samples in the datasets.
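
For reference, the three metrics named above are commonly defined from the confusion-matrix counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These are the standard textbook forms; the abstract does not spell them out, so the notation here is an assumption.

```latex
\begin{align}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} \\
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                     {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```

With 116 positive and 348 negative samples, a classifier that labels every sample negative would reach an accuracy of 348/464 = 0.75 while its MCC is conventionally taken as 0, which is one reason MCC is informative on imbalanced sets like this one.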

Keywords: PTM (post-translational modification), citrullination site, SNN (self-normalizing neural network), mRMR (minimum redundancy maximum relevance), IFS (incremental feature selection), protein sequence.



Combinatorial Chemistry & High Throughput Screening
