Obtaining soluble proteins in sufficient concentrations increases the overall success rate of many experimental studies. Solubility is an individual trait of a protein, ultimately determined by its primary sequence. Exploring the relationship between protein solubility and sequence composition is therefore instrumental in prioritizing targets in large-scale proteomics projects. In this paper, the amino acid composition (20 dimensions) and the dipeptide composition (400 dimensions) were extracted to form a candidate feature pool of 420 dimensions. The features were sorted by the absolute value of their correlation coefficient and added to the feature vector one at a time in that order. Using a Support Vector Machine (SVM) as the prediction engine, we evaluated and recorded the results for each of the 420 resulting feature vectors. Based on the SVM results, the first 208 of the 420 ranked features were selected as the efficient ones. By analyzing the composition of these 208 features, we found that protein solubility was significantly influenced by the occurrence frequencies of the acidic amino acids, the basic amino acids, the non-polar hydrophobic amino acids and two polar neutral amino acids (C, Q) in the protein sequence. Additionally, we found that the dipeptides composed of the acidic amino acids (D, E) and the basic amino acids (K, R and H), especially the dipeptides composed of the acidic amino acids (D, E), were strongly associated with protein solubility.
Keywords: Protein solubility, support vector machine, correlation coefficient, hydrophobic amino acids, dipeptide, vector, proteomics, protein sequence
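The feature construction and ranking summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of the Pearson coefficient against a binary solubility label, the normalization of dipeptide counts by sequence length minus one, and the toy sequences are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
FEATURE_NAMES = list(AMINO_ACIDS) + DIPEPTIDES  # 20 + 400 = 420 features


def composition_features(sequence):
    """Return the 420-dimensional composition vector of one sequence."""
    seq = sequence.upper()
    n = len(seq)
    aa_counts = dict.fromkeys(AMINO_ACIDS, 0)
    dp_counts = dict.fromkeys(DIPEPTIDES, 0)
    for residue in seq:
        if residue in aa_counts:  # non-standard residues are ignored
            aa_counts[residue] += 1
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in dp_counts:
            dp_counts[pair] += 1
    aa_freq = [aa_counts[a] / n if n else 0.0 for a in AMINO_ACIDS]
    dp_total = max(n - 1, 1)  # assumed normalization: number of adjacent pairs
    dp_freq = [dp_counts[d] / dp_total for d in DIPEPTIDES]
    return aa_freq + dp_freq


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(sequences, labels):
    """Sort the 420 features by |correlation| with the labels, descending."""
    vectors = [composition_features(s) for s in sequences]
    scores = []
    for j, name in enumerate(FEATURE_NAMES):
        column = [v[j] for v in vectors]
        scores.append((name, pearson(column, labels)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: 1 = soluble, 0 = insoluble (not from the paper).
toy_sequences = ["DEDEDEKK", "DEEDKRDE", "LLVVIIFF", "VLIFAVLL"]
toy_labels = [1, 1, 0, 0]
ranking = rank_features(toy_sequences, toy_labels)
for name, r in ranking[:5]:
    print(name, round(r, 3))
```

In an incremental selection of the kind described, the first k names of `ranking` would define the feature vector for k = 1, ..., 420, and a classifier such as an SVM would be trained and evaluated once per k; that step is omitted here to keep the sketch dependency-free.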