A Novel Amino Acid Properties Selection Method for Protein Fold Classification

Page: [287 - 294] Pages: 8

Abstract

Background: Amino acid physicochemical properties encoded in the protein primary structure play a crucial role in protein folding. However, it is not yet clear which properties are most suitable for protein fold classification.

Objective: To avoid an exhaustive search of the entire property space, this study proposes an amino acid property selection method that rapidly obtains a suitable combination of properties for protein fold classification.

Methods: The proposed amino acid property selection method is based on the sequential floating forward selection (SFFS) strategy. Beginning with an empty set, a variable number of features is added at each iteration until the termination condition is met.
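
The selection loop described in the Methods can be illustrated with a minimal sketch of generic sequential floating forward selection. This is not the authors' implementation: the function name `sffs`, the `score` callback, the size cap `max_size` used as the termination condition, and the toy weights are all assumptions made for illustration. In practice `score` would be some evaluation of a property subset, for example the accuracy of a classifier trained on features derived from those properties.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


def sffs(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    max_size: int,
) -> Tuple[FrozenSet[str], float]:
    """Generic SFFS sketch; stops when the subset reaches max_size (assumed rule)."""
    pool = list(candidates)
    selected: FrozenSet[str] = frozenset()
    # Best (score, subset) recorded so far for each subset size.
    best: Dict[int, Tuple[float, FrozenSet[str]]] = {}

    while len(selected) < min(max_size, len(pool)):
        # Forward step: add the single property that yields the highest score.
        remaining = [p for p in pool if p not in selected]
        selected = max((selected | {p} for p in remaining), key=score)
        s = score(selected)
        k = len(selected)
        if k not in best or s > best[k][0]:
            best[k] = (s, selected)

        # Floating step: keep dropping one property while the reduced subset
        # strictly beats the best subset already recorded at that smaller size.
        while len(selected) > 2:
            reduced = max((selected - {p} for p in selected), key=score)
            r = score(reduced)
            if r > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (r, reduced)
            else:
                break

    if not best:  # no candidates were supplied
        return frozenset(), float("-inf")
    top_score, top_set = max(best.values(), key=lambda t: t[0])
    return top_set, top_score


# Toy, hypothetical scoring: integer "accuracy points" per property.
weights = {"hydrophobicity": 30, "polarity": 20, "volume": 10, "charge": -5}


def toy_score(subset: FrozenSet[str]) -> int:
    return sum(weights[p] for p in subset)


chosen, value = sffs(weights, toy_score, max_size=4)
print(sorted(chosen), value)
```

Because the backward step runs after every addition, the number of properties gained per iteration varies, which is what distinguishes the floating variant from plain forward selection.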

Results: With appropriately selected amino acid properties, the proposed method improved prediction accuracy by 0.26-5% on a widely used benchmark dataset.

Conclusion: The proposed property selection method can be extended to other classification problems in bioinformatics that depend on biomolecule properties.

Keywords: Protein fold classification, amino acid properties, sequential floating forward selection, auto-cross correlation, Naive Bayes, support vector machine.
