Background: With the rapid development of the sequencing methods in recent years, binding sites have been systematically identified in such projects as Nested-MICA and MEME. Prediction of DNA motifs with higher accuracy and precision has been a very important task for bioinformaticians. Nevertheless, experimental approaches are still time-consuming for big data set, making computational identification of binding sites indispensable.
Objective: To facilitate the identification of the binding site, we proposed a deep learning architecture, named Deep-BSC (Deep-Learning Binary Search Classification), to predict binding sites in a raw DNA sequence with more precision and accuracy.
Methods: Our proposed architecture purely relies on the raw DNA sequence to predict the binding sites for protein by using a convolutional neural network (CNN). We trained our deep learning model on binding sites at the nucleotide level. DNA sequence of A. thaliana is used in this study because it is a model plant.
Results: The results demonstrate the effectiveness and efficiency of our method in the classification of binding sites against random sequences, using deep learning. We construct a CNN with different layers and filters to show the usefulness of max-pooling technique in the proposed method. To gain the interpretability of our approach, we further visualized binding sites in the saliency map and successfully identified similar motifs in the raw sequence. The proposed computational framework is time and resource efficient.
Conclusion: Deep-BSC enables the identification of binding sites in the DNA sequences via a highly accurate CNN. The proposed computational framework can also be applied to problems such as operator, repeats in the genome, DNA markers, and recognition sites for enzymes, thereby promoting the use of Deep-BSC method in life sciences.
Keywords: Transcription factors (TFs), DNA binding motifs, arabidopsis thaliana, convolutional neural network (CNN), computational biology, genomic.