Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art

Page: [2 - 15] Pages: 14

  • * (Excluding Mailing and Handling)

Abstract

Background: In bioinformatics, estimation of k-mer abundance histograms or just enumerating the number of unique k-mers and the number of singletons are desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tune runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years.

Objective: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py) to assess their relative merits and demerits.

Methods: Principally, the miscounts/error-rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods.

Results: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has more memory requirements compared to KmerGenie.

Conclusion: The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such results analysis also help researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.

Keywords: K-mer abundance histogram, High-throughput sequencing, Hashing, Streaming algorithms, Singleton k-mers, Distinct k-mers.

Graphical Abstract

[1]
Marçais, G.; Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 2011, 27(6), 764-770.
[2]
Miller, J.R.; Delcher, A.L.; Koren, S.; Venter, E.; Walenz, B.P.; Brownley, A.; Johnson, J.; Li, K.; Mobarry, C.; Sutton, G. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 2003, 24(24), 2818-2824.
[3]
Jaffe, D.B.; Butler, J.; Gnerre, S.; Mauceli, E.; Lindblad-Toh, K.; Mesirov, J.P.; Zody, M.C.; Lander, E.S. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res., 2003, 13(1), 91-96.
[4]
Miller, J.R.; Koren, S.; Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics, 2010, 95(6), 315-327.
[5]
Pevzner, P.A.; Tang, H.; Waterman, M.S. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 2001, 98(17), 9748-9753.
[6]
Zerbino, D.; Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 2008, 18(5), 821-829.
[7]
Simpson, J.T.; Wong, K.; Jackman, S.D.; Schein, J.E.; Jones, S.J.; Birol, I. ABySS: A parallel assembler for short read sequence data. Genome Res., 2009, 19(6), 1117-1123.
[8]
Kelley, D.R.; Schatz, M.C.; Salzberg, S.L. Quake: Quality-aware detection and correction of sequencing errors. Genome Biol., 2010, 11(11), R116.
[9]
Shi, H.; Schmidt, B.; Liu, W.; Müller-Wittig, W. A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware. J. Comput. Biol., 2010, 17(4), 603-615.
[10]
Liu, Y.; Schröder, J.; Schmidt, B. Musket: A multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics, 2012, 29(3), 308-315.
[11]
Medvedev, P.; Scott, E.; Kakaradov, B.; Pevzner, P. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics, 2011, 27(13), 137-141.
[12]
Salmela, L.; Schröder, J. Correcting errors in short reads by multiple alignments. Bioinformatics, 2011, 27(11), 1455-1461.
[13]
Li, R.; Ye, J.; Li, S.; Wang, J.; Han, Y.; Ye, C.; Wang, J.; Yang, H.; Yu, J.; Wong, G.K.S.; Wang, J. ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoSComputat. Biol., 2005, 1(4), e43.
[14]
Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics, 2005, 21(Suppl. 1), 351-358.
[15]
Campagna, D.; Romualdi, C.; Vitulo, N.; Del Favero, M.; Lexa, M.; Cannata, N.; Valle, G. RAP: A new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics, 2004, 21(5), 582-588.
[16]
Lefebvre, A.; Lecroq, T.; Dauchel, H.; Alexandre, J. FORRepeats: Detects repeats on entire chromosomes and between genomes. Bioinformatics, 2003, 19(3), S319-S326.
[17]
Healy, J.; Thomas, E.E.; Schwartz, J.T.; Wigler, M. Annotating large genomes with exact word matches. Genome Res., 2003, 13(10), 2306-2315.
[18]
Kurtz, S.; Narechania, A.; Stein, J.C.; Ware, D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 2008, 9(1), 517.
[19]
Kokot, M.; Długosz, M.; Deorowicz, S. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics, 2017, 33(17), 2759-2761.
[20]
Erbert, M.; Rechner, S.; Müller-Hannemann, M. Gerbil: A fast and memory-efficient k-mer counter with GPU-support. Algor. Mol. Biol., 2017, 12(1), 9.
[21]
Rizk, G.; Lavenier, D.; Chikhi, R. DSK: k-mer counting with very low memory usage. Bioinformatics, 2013, 29(5), 652-653.
[22]
Conway, T.C.; Bromage, A.J. Succinct data structures for assembling large genomes. Bioinformatics, 2011, 27(4), 479-486.
[23]
Stephens, Z.D.; Lee, S.Y.; Faghri, F.; Campbell, R.H.; Zhai, C.; Efron, M.J.; Iyer, R.; Schatz, M.C.; Sinha, S.; Robinson, G.E. Big data: Astronomical or genomical? PLoS Biol., 2015, 13(7), e1002195.
[24]
Brown, T.C.; Howe, A.; Zhang, Q.; Pyrkosz, A.B.; Brom, T.M. A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv:1203.4802, 2012.
[25]
Pell, J.; Hintze, A.; Canino-Koning, R.; Howe, A.; Tiedje, J.M.; Brown, C.T. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci.USA, 2012, 9(33), 13272-13277.
[26]
Junior, L.C.I.; Brown, C.T. Efficient cardinality estimation for k-mers in large DNA sequencing data sets. F1000. Res., 2016, 1-5.
[27]
Zhang, Q.; Pell, J.; Canino-Koning, R.; Howe, A.C.; Brown, C.T. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One, 2014, 9(7), e101271.
[28]
Mohamadi, H.; Khan, H.; Birol, I. ntCard: A streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 2017, 33(9), 1324-1330.
[29]
Melsted, P.; Halldorsson, B.V. KmerStream: Streaming algorithms for k-mer abundance estimation. Bioinformatics, 2014, 30(24), 3541-3547.
[30]
Chikhi, R.; Medvedev, P. Sequence analysis informed and automated k-mer size selection for genome assembly. Bioinformatics, 2014, 30(1), 31-37.
[31]
Alon, N.; Matias, Y.; Szegedy, M. The space complexity of approximating the frequency moments. Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC), Philadelphia, Pennsylvania, U.S1996, pp. 20-29.
[32]
Bar-Yossef, Z.; Jayram, T.S.; Kumar, R.; Sivakumar, D.; Trevisan, L. Counting distinct elements in a data stream. In: International Workshop on Randomization and Approximation Techniques in Computer Science; Springer: Berlin, Heidelberg, Germany, 2002; pp. 1-10.
[33]
Flajolet, P.; Martin, G.N. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 1985, 31(2), 182-209.
[34]
Cormode, G.; Muthukrishnan, S. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 2005, 55(1), 58-75.
[35]
Simpson, J.T. Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 2014, 30(9), 1228-1235.
[36]
Chu, J.; Sadeghi, S.; Raymond, A.; Jackman, S.D.; Nip, K.M.; Mar, R.; Mohamadi, H.; Butterfield, Y.S.; Robertson, A.G.; Birol, I. BioBloom tools: Fast, accurate and memory-efficient host species sequence screening using bloom filters. Bioinformatics, 2014, 30(23), 3402-3404.
[37]
Pérez, N.; Gutierrez, M.; Vera, N. Computational performance assessment of k-mer counting algorithms. J. Comput. Biol., 2016, 23(4), 248-255.
[38]
Mohamadi, H.; Chu, J.; Vandervalk, B.P.; Birol, I. ntHash: Recursive nucleotide hashing. Bioinformatics, 2016, 32(22), 3492-3494.
[39]
Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 1970, 13(7), 422-426.
[40]
Crusoe, M.R.; Alameldin, S.; Awad, E.; Boucher, A.; Caldwell, R.; Cartwright, A.; Charbonneau, B.; Constantinides, G.; Edvenson, S.; Fay, J.; Fenton, T.; Fenzl, J.; Fish, L.; Garcia-Gutierrez, P.; Garland, J.; Gluck, I.; González, S.; Guermond, J.; Guo, A.; Gupta, J.R.; Herr, A.; Howe, A.; Hyer, A.; Härpfer, L.; Irber, R.; Kidd, D.; Lin, J.; Lippi, T.; Mansour, P.; McA’Nulty, E.; McDonald, J.; Mizzi, K.D.; Murray, J.R.; Nahum, K.; Nanlohy, A.J.; Nederbragt, H.; Ortiz-Zuazaga, J.; Ory, J.; Pell, C.; Pepe-Ranney, Z.N.; Russ, E.; Schwarz, C.; Scott, J.; Seaman, S.; Sievert, J.; Simpson, C.T.; Skennerton, J.; Spencer, R.; Srinivasan, D.; Standage, J.A.; Stapleton, S.R.; Steinman, J.; Stein, B.; Taylor, W.; Trimble, H.L.; Wiencko, M.; Wright, B.; Wyss, Q.; Zhang, E. Zyme; C.T. Brown. The khmer software package: Enabling efficient nucleotide sequence analysis. F1000 Res., 2015, 4(115), 900.
[41]
Flajolet, P.; Fusy, É.; Gandouet, O.; Meunier, F. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm In: DiscretMath. Theor. Comput. Sci; , 2007, pp. 137-156.
[42]
Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical recipes: The Art of Scientific Computing, 3rd ed; Cambridge University Press: New York, NY, USA, 2007.