An Overview of Protein Function Prediction Methods: A Deep Learning Perspective

Page: [621 - 630] Pages: 10

Abstract

Predicting the function of proteins is a major challenge in the scientific community, particularly in the post-genomic era. Traditional experimental methods of determining protein function are accurate but resource-intensive and time-consuming. The development of Next Generation Sequencing (NGS) techniques has produced a large number of new protein sequences, widening the gap between available raw sequences and verified annotated sequences. To address this gap, automated protein function prediction (AFP) techniques have been developed as a faster and more cost-effective alternative that aims to maintain the same level of accuracy.

Several automatic computational methods for protein function prediction have recently been developed and proposed. This paper reviews the best-performing AFP methods presented in the last decade and analyzes their improvements over time to identify the most promising strategies for future methods.

Identifying the most effective method for predicting protein function is still a challenge. The Critical Assessment of Functional Annotation (CAFA) has established an international standard for evaluating and comparing the performance of various protein function prediction methods. In this study, we analyze the best-performing methods identified in recent editions of CAFA. These methods are divided into five categories based on their principles of operation: sequence-based, structure-based, combined-based, ML-based and embeddings-based.

After conducting a comprehensive analysis of the various protein function prediction methods, we observe a steady improvement in prediction accuracy over time, mainly due to the adoption of machine learning techniques. The present trend suggests that all the best-performing methods will use machine learning to improve their accuracy in the future.

We highlight the positive impact that machine learning (ML) has had on protein function prediction. Most recent methods developed in this area use ML, demonstrating its importance in analyzing biological information and making predictions. Despite these improvements in accuracy, there is still a significant gap compared with experimental evidence. New approaches based on Deep Learning (DL) techniques will probably be necessary to close this gap. Significant progress has been made in this area, but more work is needed to fully realize the potential of DL.
