In metagenomics, long contigs can improve both gene prediction and sequence binning. They can also make gene function annotation more accurate, because they provide substantial genomic context. However, repetitive sequences within a genome or shared between genomes mean that metagenome contigs may be misassembled, so a method for validating them is essential. Here, we propose a computational method to validate metagenome contigs. After realigning raw sequencing reads onto a contig, we compute the contig's empirical cumulative distribution function (contig-ECDF) and a corresponding reference, the latter obtained by computational simulation. Because the reference is fixed for a given set of parameters, we use the difference between the contig-ECDF and its reference to assess whether a contig is bona fide: the smaller the difference, the more likely the contig is bona fide. On simulated metagenome datasets, our method showed a good capacity to identify wrongly assembled contigs. We then applied the method to a real metagenome dataset, sequenced from an in vitro-simulated microbial community with known constituent genomes. There, the method showed a strong ability to identify bona fide contigs, and small differences between contig-ECDFs and their references were significantly correlated with contigs being bona fide. In summary, we developed a computational method that assigns each metagenome contig a score; the smaller the score, the more likely the contig is bona fide. The method performed well in validation on both simulated and real datasets.
Keywords: Bona fide contigs, computational method, datasets, metagenome contigs, metagenomics, simulated metagenome.
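
The scoring idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify which quantity the contig-ECDF is built from, how the reference is simulated, or how the difference is measured, so everything below is an assumption made for illustration: the ECDF is taken over read start positions, the reference is the average ECDF of reads placed uniformly at random along the contig, and the score is the maximum absolute difference between the two curves. The function names `ecdf`, `simulated_reference`, and `contig_score` are hypothetical.

```python
import bisect
import random


def ecdf(values, grid):
    """Empirical CDF of `values`, evaluated at each point of `grid`."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect.bisect_right(ordered, x) / n for x in grid]


def simulated_reference(contig_len, read_len, n_reads, n_rounds=200, seed=0):
    """Reference ECDF under an ASSUMED model: read starts are uniform on the contig.

    Returns the evaluation grid (all valid start positions) and the ECDF
    averaged over `n_rounds` simulated read sets.
    """
    rng = random.Random(seed)
    max_start = contig_len - read_len
    grid = list(range(max_start + 1))
    total = [0.0] * len(grid)
    for _ in range(n_rounds):
        starts = [rng.randint(0, max_start) for _ in range(n_reads)]
        for i, value in enumerate(ecdf(starts, grid)):
            total[i] += value
    return grid, [t / n_rounds for t in total]


def contig_score(read_starts, contig_len, read_len, **sim_kwargs):
    """Score a contig; smaller means closer to the simulated reference.

    The distance used here (maximum absolute difference between the two
    curves) is an assumption, not necessarily the measure used in the paper.
    """
    grid, reference = simulated_reference(
        contig_len, read_len, len(read_starts), **sim_kwargs
    )
    observed = ecdf(read_starts, grid)
    return max(abs(o - r) for o, r in zip(observed, reference))


if __name__ == "__main__":
    # Toy data: evenly covered contig vs. one whose second half attracts no reads.
    even_starts = list(range(0, 901, 3))
    skewed_starts = list(range(0, 451, 3)) * 2
    print("even coverage score:  ", contig_score(even_starts, 1000, 100))
    print("skewed coverage score:", contig_score(skewed_starts, 1000, 100))
```

Under these assumptions, a contig whose reads pile up in one region yields a larger score than an evenly covered one, which matches the abstract's rule that smaller scores indicate bona fide contigs. A real pipeline would take read start positions from an alignment file rather than from toy lists.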