Abstract
Background: Thermophilic proteins remain active at high temperatures. Studying them is
therefore important for understanding the thermal stability of proteins.
Objective: To address the low precision and low efficiency of existing approaches to predicting
thermophilic proteins, this paper proposes a prediction method based on feature fusion and
machine learning.
Methods: For the selected thermophilic dataset, each protein sequence was first represented by
fusing three feature types: g-gap dipeptide composition, entropy density and autocorrelation
coefficients. Kernel Principal Component Analysis (KPCA) was then used to reduce the
dimensionality of the fused features, which shortens training time and improves efficiency.
Finally, a classification model was built on the reduced features using a classification
algorithm.
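As an illustration of the first feature type named above, the following is a minimal sketch of g-gap dipeptide composition under its usual definition: the normalized frequency of residue pairs separated by g positions, giving a 400-dimensional vector. The function name `g_gap_dipeptide`, the residue ordering, and the choice to skip non-standard residues are assumptions made for this example, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)


def g_gap_dipeptide(sequence, g=0):
    """Return 400 normalized frequencies of residue pairs separated by g positions."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    # Pair residue i with residue i + g + 1, so g = 0 gives adjacent dipeptides.
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

Calling `g_gap_dipeptide(seq, g)` for several values of g and concatenating the results with the other feature vectors would be one way to form a fused representation before dimensionality reduction.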
Results: Several classification algorithms were trained and tested on the selected thermophilic
dataset. Among them, the Support Vector Machine (SVM) achieved an accuracy above 92% under the
jackknife test, and the other evaluation indicators also showed that the SVM performed
best.
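The jackknife test mentioned above is leave-one-out cross-validation: each sample is held out once, the model is trained on the remaining samples, and the held-out prediction is scored. The sketch below shows the procedure only. To stay dependency-free it uses a nearest-centroid classifier as a stand-in rather than an SVM, and the names `jackknife_accuracy` and `nearest_centroid_predict` are hypothetical.

```python
import math


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (not an SVM): return the label of the closest class centroid."""
    sums, counts = {}, {}
    for features, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, math.inf
    for label, acc in sums.items():
        centroid = [v / counts[label] for v in acc]
        dist = math.dist(centroid, x)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and return the fraction predicted correctly."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

Any classifier with the same `predict(train_X, train_y, x)` shape can be passed in, so an SVM wrapper could replace the stand-in without changing the evaluation loop.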
Conclusion: Owing to an effective feature representation and a robust classifier, the proposed
method is suitable for predicting thermophilic proteins and outperforms most previously
reported methods.
Keywords:
Thermophilic proteins, feature fusion, g-gap, entropy density, autocorrelation coefficient, KPCA, machine learning.