Hardware Performance Evaluation of De novo Transcriptome Assembly Software in Amazon Elastic Compute Cloud

Page: [420 - 430] Pages: 11

Abstract

Background: Bioinformatics software for RNA-seq analysis has high computational requirements in terms of the number of CPUs, RAM size, and processor characteristics. In particular, de novo transcriptome assembly demands large computational infrastructure because of the massive data size and the complexity of the algorithms employed. Comparative studies on the quality of the transcriptomes yielded by de novo assemblers have been published previously, but they lack a hardware efficiency-oriented approach that would help select the assembly hardware platform in a cost-efficient way.

Objective: We tested the performance of two popular de novo transcriptome assemblers, Trinity and SOAPdenovo-Trans (SDNT), in terms of cost-efficiency and quality to assess their limitations, and provided troubleshooting advice and guidelines for running transcriptome assemblies efficiently.

Methods: We built virtual machines with different hardware characteristics (number of CPUs, RAM size) in the Amazon Elastic Compute Cloud (EC2) of Amazon Web Services. Using simulated and real data sets, we measured the elapsed time, cost, CPU percentage, and output size of small and large data set assemblies.
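As an illustration of how two of these metrics relate to raw timings, the following minimal Python sketch derives CPU percentage and cost from a run's timing figures. It is a sketch under stated assumptions, not the procedure used in the study: the `AssemblyRun` class, its field names, the example timings, and the hourly price are all hypothetical. CPU percentage is assumed to follow the convention (user + system CPU time) / elapsed time, and cost is assumed to be elapsed hours multiplied by a flat hourly instance price.

```python
from dataclasses import dataclass


@dataclass
class AssemblyRun:
    """Timing record for one assembly run (all fields are hypothetical)."""
    label: str          # free-text description of assembler and machine
    elapsed_s: float    # wall-clock time in seconds
    user_s: float       # CPU time spent in user mode, in seconds
    sys_s: float        # CPU time spent in kernel mode, in seconds
    hourly_usd: float   # assumed flat hourly price of the virtual machine

    @property
    def cpu_percent(self) -> float:
        # Assumed convention: total CPU time over elapsed time.
        # Values above 100 indicate that more than one core was busy.
        return 100.0 * (self.user_s + self.sys_s) / self.elapsed_s

    @property
    def cost_usd(self) -> float:
        # Assumed billing model: elapsed hours times the hourly price.
        return self.elapsed_s / 3600.0 * self.hourly_usd


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    run = AssemblyRun("example run", elapsed_s=7200, user_s=20000,
                      sys_s=1600, hourly_usd=0.5)
    print(f"{run.label}: {run.cpu_percent:.0f}% CPU, ${run.cost_usd:.2f}")
```

Comparing `cost_usd` across machine configurations for the same data set is one simple way to express cost-efficiency; real cloud billing may use different granularity or pricing tiers.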

Results: For small data sets, SDNT outperformed Trinity by an order of magnitude, significantly reducing the duration and cost of the assembly. For large data sets, Trinity performed better than SDNT. Both assemblers provided good-quality transcriptomes.

Conclusion: The selection of the optimal transcriptome assembler and provision of computational resources depend on the combined effect of size and complexity of RNA-seq experiments.

Keywords: Cloud computing, cost-efficiency, quality, RNA-seq, transcriptome, magnitude.
