Deep-BSC: Predicting Raw DNA Binding Pattern in Arabidopsis Thaliana

Syed Adnan Shah Bukhari; Abdul Razzaq; Javeria Jabeen; Shaheer Khan; Zulqurnain Khan

Abstract

Background: With the rapid development of sequencing methods in recent years, binding sites have been systematically identified in projects such as NestedMICA and MEME. Predicting DNA motifs with higher accuracy and precision remains an important task for bioinformaticians. Experimental approaches, however, are still time-consuming for large data sets, which makes computational identification of binding sites indispensable.

Objective: To facilitate the identification of binding sites, we propose a deep learning architecture, Deep-BSC (Deep-Learning Binary Search Classification), that predicts binding sites in raw DNA sequence with improved precision and accuracy.

Methods: The proposed architecture relies purely on the raw DNA sequence to predict protein binding sites using a convolutional neural network (CNN). We trained the deep learning model on binding sites at the nucleotide level. The DNA sequence of A. thaliana is used in this study because it is a model plant.
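As a minimal illustration of how a CNN can read raw DNA at the nucleotide level, the sketch below one-hot encodes a sequence, scans it with a single convolutional filter, and applies ReLU and global max-pooling. It is not the Deep-BSC network: the filter `tata_filter`, the helper names, and the example sequences are hypothetical, and a trained model would learn many filters rather than use a hand-written one.

```python
from typing import List

ALPHABET = "ACGT"  # channel order for the one-hot encoding (an assumption)


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a len(seq) x 4 matrix; unknown bases become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d(x: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Valid cross-correlation of an L x 4 input with a k x 4 filter (L - k + 1 outputs)."""
    k = len(kernel)
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(k) for c in range(4))
        for i in range(len(x) - k + 1)
    ]


def relu(values: List[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def global_max_pool(values: List[float]) -> float:
    """Keep only the strongest filter response along the sequence."""
    return max(values) if values else 0.0


# Hypothetical hand-written filter that responds to the 4-mer "TATA".
tata_filter = one_hot("TATA")


def motif_score(seq: str) -> float:
    """Score a sequence by its best match to the illustrative filter."""
    return global_max_pool(relu(conv1d(one_hot(seq), tata_filter)))


print(motif_score("GGCTATAAGG"))  # sequence containing the motif
print(motif_score("GGGGCCCC"))    # sequence without it
```

Max-pooling makes the score independent of where in the sequence the motif occurs, which is one reason it suits binding-site classification.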

Results: The results demonstrate the effectiveness and efficiency of our method in classifying binding sites against random sequences using deep learning. We constructed CNNs with different layers and filters to show the usefulness of the max-pooling technique in the proposed method. To make the approach interpretable, we further visualized binding sites in a saliency map and successfully identified similar motifs in the raw sequence. The proposed computational framework is time- and resource-efficient.
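The idea behind a saliency map can be sketched on a toy model with one filter, ReLU, and global max-pooling. For that model the gradient of the score with respect to the input is simply the filter placed at the best-scoring window, so the saliency can be written analytically. The sketch below uses a gradient-times-input variant; the function names and the filter are hypothetical, and this is not the visualization procedure used for Deep-BSC.

```python
from typing import List

ALPHABET = "ACGT"  # channel order for the one-hot encoding (an assumption)


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a len(seq) x 4 matrix; unknown bases become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def saliency(seq: str, kernel: List[List[float]]) -> List[float]:
    """Per-nucleotide saliency for score = max_i ReLU(window_i . kernel).

    The gradient w.r.t. the input equals the kernel at the best-scoring
    window and zero elsewhere (zero everywhere if the ReLU is inactive).
    Each position's saliency is gradient * input, summed over the 4 channels.
    """
    x = one_hot(seq)
    k = len(kernel)
    n_windows = len(x) - k + 1
    sal = [0.0] * len(x)
    if n_windows < 1:
        return sal
    scores = [
        sum(x[i + j][c] * kernel[j][c] for j in range(k) for c in range(4))
        for i in range(n_windows)
    ]
    best = max(range(n_windows), key=scores.__getitem__)
    if scores[best] <= 0.0:
        return sal  # ReLU inactive: no gradient flows to the input
    for j in range(k):
        sal[best + j] = sum(kernel[j][c] * x[best + j][c] for c in range(4))
    return sal


# Hypothetical filter that responds to "TATA"; nonzero entries mark the motif.
for base, s in zip("GGCTATAAGG", saliency("GGCTATAAGG", one_hot("TATA"))):
    print(base, s)
```

In a trained multi-layer network the same quantity is usually obtained by backpropagating the class score to the one-hot input with an automatic differentiation framework.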

Conclusion: Deep-BSC enables the identification of binding sites in DNA sequences via a highly accurate CNN. The proposed computational framework can also be applied to related problems such as identifying operators, repeats in the genome, DNA markers, and enzyme recognition sites, thereby promoting the use of the Deep-BSC method in the life sciences.

Keywords: Transcription factors (TFs), DNA binding motifs, Arabidopsis thaliana, convolutional neural network (CNN), computational biology, genomics.



Cite As

Current Bioinformatics
