The Influence of Memory-Aware Computation on Distributed BLAST

Majid      Hajibaba; Mohsen         Sharifi; Saeid          Gorgin

Abstract

Background: One of the pivotal challenges in nowadays genomic research domain is the fast processing of voluminous data such as the ones engendered by high-throughput Next-Generation Sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a longestablished and renowned tool in Bioinformatics, has shown to be incredibly slow in this regard.

Objective: To improve the performance of BLAST in the processing of voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing of voluminous data.

Method: We have used a master-worker model for the processing of voluminous data alongside a memory-aware technique in which the master partitions the whole data in equal chunks, one chunk for each worker, and consequently each worker further splits and formats its allocated data chunk according to the size of its memory. Each worker searches every split data one-by-one through a list of queries.

Results: We have chosen a list of queries with different lengths to run insensitive searches in a huge database called UniProtKB/TrEMBL. Our experiments show 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory aware. Comparatively, experiments show even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST.

Conclusion: We have shown that memory-awareness in formatting bulky database, when running BLAST, can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates negative effects of page-faults on performance.

Keywords: BLAST, next-generation sequencing, distributed, parallel, mater-worker, memory-aware.

Graphical Abstract

[1] 
Illumina. Introduction to NGS: Learn how the technology works
and what it can do for you. https://www.illumina.com/science/ technology/next-generation-sequencing.html (Accessed on July 9,
2018).
[2] 
Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA  1977; 74(12): 5463-7.
[3] 
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics  2012; 28(23): 3150-2.
[4] 
Zomaya AY. Parallel computing for bioinformatics and
computational biology: models, enabling technologies, and case
studies Wiley & Sons: New York 2005.
[5] 
Petsko GA, Ringe D. Protein structure and function Oxford
University: Oxford 2004.
[6] 
Altschul S. Basic Local Alignment Search Tool. J Mol Biol  1990; 215(3): 403-10.
[7] 
Mathog DR. Parallel BLAST on split databases. Bioinformatics  2003; 19(14): 1865-6.
[8] 
Bjornson R, Sherman A, Weston S, Willard N, Wing J. TurboBLAST: a parallel implementation of blast built on the turbohub. Proceedings of the 16th International Parallel and Distributed Processing Symposium. 2001 April 15-19; Lauderdale, Florida: IEEE 2002.
[9] 
Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinformatics  2009; 10: 421.
[10] 
SWISS-PROT in U.S. National Library of Medicine: Available from: . https://ftp.ncbi.nlm.nih.gov/blast/db/
[11] 
Braun R, Pedretti K, Casavant T, Scheetz T, Birkett C, Roberts C. Parallelization of local BLAST service on workstation clusters. Future Gener Comput Syst  2001; 17(6): 745-54.
[12] 
Hughey R. Parallel hardware for sequence comparison and alignment. Bioinformatics  1996; 12(6): 473-9.
[13] 
Ye W, Chen Y, Zhang Y, Xu Y. H-BLAST: a fast protein sequence alignment toolkit on heterogeneous computers with GPUs. Bioinformatics  2017; 33(8): 1130-8.
[14] 
Sandes EFDO, Miranda G, Melo ACD, Martorell X, Ayguade E. CUDAlign 3.0: Parallel Biological Sequence Comparison in Large GPU Clusters. 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 2014 May 26-29; Chicago, Illinois: IEEE 2014.
[15] 
Zhang J, Wang H, Feng WC. cuBLASTP: Fine-Grained
Parallelization of Protein Sequence Search on CPU + GPU.
IEEE/ACM Trans Comput Biol Bioinform 2017; 14(4): 830-43.
[16] 
Rognes T. ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res  2001; 29(7): 1647-52.
[17] 
Yang XL, Liu YL, Yuan CF, Huang YH. Parallelization of BLAST with MapReduce for Long Sequence Alignment. Fourth International Symposium on Parallel Architectures, Algorithms and Programming. 2011 Dec 9-11; Tianjin, China: IEEE 2011.
[18] 
Yadav M, Chaudhary S. HCLBLAST for Genome Sequence Matching. Int Jof  Comput Appl  2017; 163(11): 31-4.
[19] 
Azad A, Pavlopoulos GA, Ouzounis CA, Kyrpides NC, Buluç A. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Res  2018; 46(6): e33.
[20] 
Darling AE, Carey L, Feng W. The Design, Implementation, and Evaluation of mpiBLAST. 4th International Conference on Linux Clusters. 2003 June 23-26; San Jose, California. 
[21] 
Lin H, Ma X, Chandramohan P, Geist A, Samatova N. Efficient Data Access for Parallel BLAST. 19th IEEE International Parallel and Distributed Processing Symposium. 2005 April 4-8; Denver, CO: IEEE 2005.
[22] 
Zhang L, Tang B. Parka: A Parallel Implementation of BLAST with MapReduce. Adv in Intel Sys Inter Appl  2017; 686: 185-91.
[23] 
Schatz MC. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics  2009; 25(11): 1363-9.
[24] 
Lu W, Jackson J, Barga R. AzureBlast: a case study of developing
science applications on the cloud. Proceedings of the 19th ACM
International Symposium on High Performance Distributed
Computing; 2010 June 21-25; Chicago, IL. New York: ACM 2010;
pp. 413-20 
[25] 
Senturk IF, Balakrishnan P, Abu-Doleh A, Kaya K, Malluhi Q, Çatalyürek ÜV. A resource provisioning framework for bioinformatics applications in multi-cloud environments. Future Gener Comput Syst  2018; 78(1): 379-91.
[26] 
Xiao S, Lin H, Feng WC. Accelerating Protein Sequence Search in
a Heterogeneous Computing System. 25th IEEE International
Parallel & Distributed Processing Symposium; 2011 May 16-20.
2011; Anchorage, Alaska. IEEE 2011; pp. 1212-22.
[27] 
Kim HS, Kim HJ, Han DS. Hyper-BLAST: A Parallelized BLAST on Cluster System. Lect Notes Comput Sci Comput Sci  2003; 2659: 213-22.
[28] 
Pinthong W, Muangruen P, Suriyaphol P, Mairiang D. A simple grid implementation with Berkeley Open Infrastructure for Network Computing using BLAST as a model. PeerJ  2016; 4: e2248.
[29] 
Choi J, Kim J, Han H. Efficient Memory Mapped File I/O for In-
Memory File Systems. In 9th Workshop on Hot Topics in Storage
and File Systems; 2017 July 10–11; Santa Clara, CA. USENIX
2017 
[30] 
Korf I, Bedell J, Yandell M. BLAST: an essential guide to the basic
local alignment search tool 1st edition O'Reilly Media: California 2003.
[31] 
UniProtKB/TrEMBL UniProt release 2016_07: Available from: http://www.uniprot.org/statistics/TrEMBL

Cite As

Current Bioinformatics

The Influence of Memory-Aware Computation on Distributed BLAST

Abstract

Graphical Abstract