An Overview of Protein Function Prediction Methods: A Deep Learning Perspective

Emilio Ispano; Federico Bianca; Enrico Lavezzo; Stefano Toppo

Abstract

Predicting protein function is a major challenge, particularly in the post-genomic era. Traditional experimental methods of determining protein function are accurate but can be resource-intensive and time-consuming. The development of Next-Generation Sequencing (NGS) techniques has produced a large number of new protein sequences, widening the gap between available raw sequences and verified annotated sequences. To address this gap, automated protein function prediction (AFP) techniques have been developed as a faster and more cost-effective alternative that aims to maintain the same level of accuracy.

Several automatic computational methods for protein function prediction have recently been developed and proposed. This paper reviews the best-performing AFP methods presented in the last decade and analyzes their improvements over time to identify the most promising strategies for future methods.

Identifying the most effective method for predicting protein function is still a challenge. The Critical Assessment of Functional Annotation (CAFA) has established an international standard for evaluating and comparing the performance of various protein function prediction methods. In this study, we analyze the best-performing methods identified in recent editions of CAFA. These methods are divided into five categories based on their principles of operation: sequence-based, structure-based, combined-based, ML-based and embeddings-based.

After conducting a comprehensive analysis of the various protein function prediction methods, we observe that there has been a steady improvement in the accuracy of predictions over time, mainly due to the implementation of machine learning techniques. The present trend suggests that all the best-performing methods will use machine learning to improve their accuracy in the future.

We highlight the positive impact that machine learning (ML) has had on protein function prediction. Most recent methods developed in this area use ML, demonstrating its importance in analyzing biological information and making predictions. Despite these improvements in accuracy, there is still a significant gap compared with experimental evidence. New approaches based on Deep Learning (DL) techniques will probably be necessary to close this gap. Significant progress has been made in this area, but more work is needed to fully realize the potential of DL.

Keywords: Protein function prediction, AFP, GO, machine learning, deep learning, feature representation methods, measurements, classifiers, web servers.


Current Bioinformatics
