Protein complexes are involved in most, if not all, of the essential biological processes in a living cell. Many computational methods have been developed to identify protein complexes, most of which search protein-protein interaction networks for groups of densely interacting proteins and report them as complexes. Beyond identifying protein complexes, knowing their biological functions may help reveal their molecular mechanisms and their roles in related biological processes. It is therefore also desirable to predict the functions of protein complexes computationally. To our knowledge, however, no published study has addressed this problem. This paper attempts to address it using yeast as the model organism, for which a total of 50 protein complexes with experimentally validated functions were collected. Each complex was encoded as a numeric vector based on its graph and functional properties. Feature selection techniques, namely Minimum Redundancy Maximum Relevance and Incremental Feature Selection, were adopted to extract the core features for prediction. Three prediction methods, the Nearest Neighbor Algorithm, Bayesian network, and Sequential Minimal Optimization, were applied and evaluated by the jackknife cross-validation test. The Nearest Neighbor Algorithm combined with 22 core features achieved the highest accuracy. These core features are regarded as the most important features for determining the biological functions of protein complexes. Nineteen of the 22 core features were derived from functional properties, indicating that the functions of the individual protein components probably constrain the overall function of the protein complex.
Keywords: Bayesian network, gene ontology, incremental feature selection, jackknife test, minimum redundancy maximum relevance, nearest neighbor algorithm, prediction of functions of protein complex, protein complex, sequential minimal optimization.
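As a rough illustration of how the evaluation pipeline summarized in the abstract fits together, the following minimal sketch combines a nearest neighbor classifier, the jackknife (leave-one-out) test, and Incremental Feature Selection over an already ranked feature list. It is not the authors' implementation. The function names, the Euclidean distance, the tie handling, and the four-sample toy data are all assumptions made for illustration, and the Minimum Redundancy Maximum Relevance ranking is assumed to have been computed beforehand and is simply passed in as `ranked_features`.

```python
import math


def nearest_neighbor_label(query, samples, labels, features):
    """Return the label of the sample closest to `query`.

    Distance is Euclidean over the given feature indices (an assumption;
    the abstract does not state the distance measure).
    """
    best_label, best_dist = None, math.inf
    for x, y in zip(samples, labels):
        dist = math.sqrt(sum((query[f] - x[f]) ** 2 for f in features))
        if dist < best_dist:
            best_label, best_dist = y, dist
    return best_label


def jackknife_accuracy(samples, labels, features):
    """Leave each sample out in turn and predict it from all the others."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = nearest_neighbor_label(samples[i], train_x, train_y, features)
        if predicted == labels[i]:
            correct += 1
    return correct / len(samples)


def incremental_feature_selection(samples, labels, ranked_features):
    """Evaluate the top-1, top-2, ... feature sets and keep the best one.

    `ranked_features` is assumed to come from a prior ranking step
    (for example mRMR). Ties are resolved in favor of the smaller set.
    """
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        acc = jackknife_accuracy(samples, labels, ranked_features[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
toy_samples = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [0.9, 0.9]]
toy_labels = ["a", "a", "b", "b"]
toy_ranking = [0, 1]  # assumed output of a prior feature-ranking step

selected, accuracy = incremental_feature_selection(toy_samples, toy_labels, toy_ranking)
print(selected, accuracy)
```

In this toy setting, adding the noisy second feature lowers the jackknife accuracy, so the selection stops at the first feature. The same logic, applied to the real feature vectors, is what would yield a set of core features such as the 22 reported above.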