AthEDL: Identifying Enhancers in Arabidopsis thaliana Using an Attention-based Deep Learning Method

Page: [531 - 540] Pages: 10

Abstract

Background: Enhancers are key cis-acting DNA elements that are crucial to gene regulation and promoter function in eukaryotic cells. Accurate identification of enhancers would facilitate the understanding of DNA functions and their physiological roles. Previous studies have shown that computational methods are effective for identifying enhancers in other organisms. However, a large number of enhancers remain unknown, especially in plant species.

Objective: This study aims to build an efficient attention-based neural network model for identifying Arabidopsis thaliana enhancers.

Methods: We propose a sequence-based model that combines convolutional and recurrent neural networks to identify enhancers. Input DNA sequences are represented as feature vectors using 4-mers. The network uses a CNN and a Bi-RNN as sequence feature extractors, and an attention mechanism is added to improve prediction performance.
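As a rough illustration of the 4-mer representation mentioned in the Methods, the following Python sketch splits a DNA sequence into 4-mers and counts them over a fixed vocabulary. The abstract does not specify the stride, the vocabulary order, or whether counts or token IDs are fed to the network, so the overlapping stride-1 windows, the lexicographic ID assignment, and the names `KMER_TO_ID`, `kmer_tokens`, and `kmer_count_vector` are all assumptions made for this example and not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"
K = 4

# Hypothetical vocabulary: each of the 4**4 = 256 possible 4-mers gets an
# integer ID in lexicographic order (AAAA -> 0, ..., TTTT -> 255).
KMER_TO_ID = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def kmer_tokens(seq):
    """Split a DNA sequence into overlapping 4-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]


def kmer_count_vector(seq):
    """Return a 256-dimensional count vector; 4-mers with non-ACGT letters are skipped."""
    counts = [0] * len(KMER_TO_ID)
    for token in kmer_tokens(seq):
        idx = KMER_TO_ID.get(token)
        if idx is not None:
            counts[idx] += 1
    return counts
```

A sequence of length n yields n - 3 overlapping 4-mers, so either the token list (for an embedding layer) or the count vector could serve as model input, depending on the architecture.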

Results: We performed an ablation study on the validation set to select the model configuration and evaluate its effectiveness. On the test set, our model achieved an MCC of 0.955, an AUPRC of 0.638, and an AUROC of 0.837, all significantly higher than those of state-of-the-art methods.
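For readers unfamiliar with the reported metrics, the sketch below shows how two of them are conventionally defined: the Matthews correlation coefficient (MCC) from confusion-matrix counts, and the AUROC as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The function names `mcc` and `auroc` are chosen for this example, the pairwise AUROC is a simple quadratic-time formulation, and nothing here reflects the authors' evaluation code; AUPRC is omitted for brevity.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

MCC ranges from -1 to 1 and depends on a fixed decision threshold, whereas AUROC and AUPRC summarize ranking quality across all thresholds, which is why the three values are not directly comparable with one another.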

Conclusion: The proposed computational framework can be applied to similar problems in non-coding genomic regions and provides valuable insights into the prediction of plant enhancers.

Keywords: Enhancer, Arabidopsis thaliana, DNA sequence, deep learning, attention mechanism, transcriptional regulation.
