Abstract
Background: Enhancers are key cis-regulatory DNA elements that are crucial for gene regulation and
promoter function in eukaryotic cells. Accurate identification of enhancers would facilitate the
understanding of DNA functions and their physiological roles. Previous studies have demonstrated the
effectiveness of computational methods for identifying enhancers in other organisms. However, a large
number of enhancers remain unknown, especially in plant species.
Objective: This study aims to build an efficient attention-based neural network model for identifying
Arabidopsis thaliana enhancers.
Methods: We propose a sequence-based model that uses convolutional and recurrent neural networks to
identify enhancers. Input DNA sequences are represented as feature vectors using 4-mers. The model
uses a CNN and a Bi-RNN as sequence feature extractors, and an attention mechanism is added to improve
prediction performance.
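As a minimal sketch of the 4-mer representation mentioned in the Methods, the snippet below splits a DNA sequence into overlapping 4-mers and maps each to an integer index that an embedding layer could consume. The stride of 1, the reserved index 0 for ambiguous bases, and the names `KMER_INDEX`, `kmer_tokens`, and `kmer_frequencies` are assumptions made for illustration; the abstract does not specify how the 4-mer feature vectors were built.

```python
from itertools import product

K = 4              # k-mer length stated in the abstract
ALPHABET = "ACGT"  # unambiguous DNA bases

# Map each of the 4**K = 256 possible 4-mers to an index in 1..256.
# Index 0 is reserved (an assumption) for windows containing other
# symbols, such as the ambiguity code N.
KMER_INDEX = {
    "".join(p): i + 1 for i, p in enumerate(product(ALPHABET, repeat=K))
}


def kmer_tokens(seq, stride=1):
    """Return the integer index of every 4-mer window in seq."""
    seq = seq.upper()
    return [
        KMER_INDEX.get(seq[i:i + K], 0)
        for i in range(0, len(seq) - K + 1, stride)
    ]


def kmer_frequencies(seq):
    """Return a 257-dimensional normalized 4-mer count vector."""
    tokens = kmer_tokens(seq)
    counts = [0] * (len(KMER_INDEX) + 1)
    for t in tokens:
        counts[t] += 1
    total = len(tokens) or 1  # avoid division by zero on short input
    return [c / total for c in counts]
```

The token list suits a CNN or Bi-RNN front end behind an embedding layer, while the frequency vector is a fixed-length alternative; which of the two the authors used is not stated here.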
Results: We conducted an ablation study on the validation set to select the components of the proposed
model and evaluate their effectiveness. On the test set, the model achieved an MCC of 0.955, an AUPRC
of 0.638, and an AUROC of 0.837, all of which are significantly higher than those of state-of-the-art
methods.
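For readers unfamiliar with the three reported metrics, the following sketch shows one common way to compute them for binary labels. It is not the authors' evaluation code. MCC is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), AUROC as the probability that a positive example is scored above a negative one, and AUPRC is approximated by average precision, which is an assumption about the estimator used. The function names are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is O(n_pos * n_neg), which is
    fine for illustration but slow for large test sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Average precision, a common estimate of AUPRC.

    Tied scores are ordered arbitrarily by the sort (a simplification).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos
```

MCC depends on a chosen decision threshold, whereas AUROC and AUPRC summarize ranking quality across all thresholds, so the three values are not directly comparable with one another.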
Conclusion: The proposed computational framework is intended to address similar problems in non-coding
genomic regions and provides valuable insights into the prediction of plant enhancers.
Keywords:
Enhancer, Arabidopsis thaliana, DNA sequence, deep learning, attention mechanism, transcriptional regulation.