Benchmarking Metagenomic Classifiers with Synthetic and Real-World Datasets

The HiSeq and MiSeq metagenomes were built using 20 sets of bacterial whole-genome shotgun reads. These reads were obtained either from the GAGE-B project [21] or from the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes (Additional file 1: Table S1). For both the 10,000 and 10 million read samples of each of these metagenomes, 10% of the sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance). All sequences were trimmed to remove low-quality bases and adapter sequences.
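The equal-abundance sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_equal_abundance_sample`, the in-memory read lists, the toy data, and the fixed seed are all hypothetical.

```python
import random

def build_equal_abundance_sample(read_sets, total_reads, seed=0):
    """Draw an equal share of reads from each genome's read set.

    read_sets: dict mapping a genome name to a list of reads
               (hypothetical in-memory layout, assumed for illustration).
    total_reads: size of the metagenome sample, e.g. 10_000.
    """
    if total_reads % len(read_sets) != 0:
        raise ValueError("total_reads must divide evenly among the genomes")
    per_genome = total_reads // len(read_sets)  # 10% each when there are ten genomes
    rng = random.Random(seed)
    sample = []
    for genome, reads in sorted(read_sets.items()):
        if len(reads) < per_genome:
            raise ValueError(f"{genome} has fewer than {per_genome} reads")
        # Sample without replacement so that no read is duplicated.
        sample.extend((genome, read) for read in rng.sample(reads, per_genome))
    rng.shuffle(sample)  # mix genomes so the reads are not grouped by source
    return sample

# Toy data: ten genomes with 2,000 placeholder reads each.
toy_sets = {f"genome_{i}": [f"g{i}_read{j}" for j in range(2000)] for i in range(10)}
metagenome = build_equal_abundance_sample(toy_sets, total_reads=10_000)
```

With ten read sets and `total_reads=10_000`, each genome contributes 1,000 reads, which matches the 10% share stated in the text.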
The composition of these two metagenomes poses certain challenges to our classifiers. For example, Pelosinus fermentans, found in our HiSeq metagenome, cannot be correctly identified at the genus level by Kraken (or any of the other previously described classifiers), because there are no Pelosinus genomes in the RefSeq complete genomes database; however, there are seven such genomes in Kraken-GB’s database, including six strains of P. fermentans. Similarly, in our MiSeq metagenome, Proteus vulgaris is often classified incorrectly at the genus level because the only Proteus genome in Kraken’s database is a single Proteus mirabilis genome. Five more Proteus genomes are present in Kraken-GB’s database, allowing Kraken-GB to classify reads from that genus more accurately. In addition, the MiSeq metagenome contains five genomes from the Enterobacteriaceae family (Citrobacter, Enterobacter, Klebsiella, Proteus and Salmonella). The high sequence similarity among the genera in this family can make it difficult for any classifier to distinguish between them.
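One way to see the effect of a missing reference genus is to score calls at the genus level. The sketch below is an illustration only: the function `genus_level_summary`, the use of `None` for reads left unclassified at the genus rank, the toy labels, and the particular definitions of sensitivity (correct calls over all reads) and precision (correct calls over genus-level calls) are assumptions made here, not details given in this section.

```python
def genus_level_summary(true_genera, predicted_genera):
    """Summarize genus-level calls for a set of reads.

    true_genera: known genus of origin for each read.
    predicted_genera: classifier call for each read, or None when the
                      read received no genus-level label (assumed convention).
    """
    if len(true_genera) != len(predicted_genera):
        raise ValueError("inputs must have the same length")
    correct = incorrect = unclassified = 0
    for truth, call in zip(true_genera, predicted_genera):
        if call is None:
            unclassified += 1
        elif call == truth:
            correct += 1
        else:
            incorrect += 1
    classified = correct + incorrect
    total = len(true_genera)
    return {
        "correct": correct,
        "incorrect": incorrect,
        "unclassified": unclassified,
        # Assumed definitions, see the note above the code.
        "sensitivity": correct / total if total else 0.0,
        "precision": correct / classified if classified else 0.0,
    }

# Toy example: a genus absent from the reference database can never be called
# correctly, and closely related genera may be confused with one another.
truth = ["Proteus", "Pelosinus", "Pelosinus", "Salmonella"]
calls = ["Proteus", None, None, "Klebsiella"]
summary = genus_level_summary(truth, calls)
```

In this toy input, the two Pelosinus reads lower sensitivity without affecting precision because they receive no genus-level call, while the Salmonella read mislabeled as Klebsiella lowers both.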
The simBA-5 metagenome was created by simulating reads from the set of complete bacterial and archaeal genomes in RefSeq. Replicons from those genomes were used if they were associated with a taxon that had an entry at the genus rank, resulting in a set of replicons from 607 genera. We then used the Mason read simulator [22] with its Illumina model to produce 10 million 100-bp reads from these genomes. First, we created simulated genomes for each species, using a SNP rate of 0.1% and an indel rate of 0.1% (both default parameters), from which we generated the reads. For the simulated reads, we multiplied the default mismatch and indel rates by five, resulting in an average mismatch rate of 2% (ranging from 1% at the beginning of reads to 6% at the ends) and an indel rate of 1% (0.5% insertion probability and 0.5% deletion probability). For the simBA-5 metagenome, the 10,000 read set was generated from a random sample of the 10 million read set.
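Drawing a small random sample from a much larger read set, as in the last step above, can be sketched with reservoir sampling. The text says only that a random sample was taken, so the choice of reservoir sampling (Algorithm R), the function name `reservoir_sample`, and the placeholder read names are assumptions made for illustration.

```python
import random

def reservoir_sample(reads, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of reads.

    Uses reservoir sampling (Algorithm R), so the full read set never has
    to be held in memory. This method is an assumption, not the one
    documented in the source.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < k:
            # Fill the reservoir with the first k reads.
            reservoir.append(read)
        else:
            # Keep the new read with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = read
    return reservoir

# Toy stream standing in for a large simulated read set.
stream = (f"read_{n}" for n in range(100_000))
subset = reservoir_sample(stream, k=1_000)
```

The same function would apply to a 10 million read input with `k=10_000`; a streaming approach is used here only because it avoids loading every read at once.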

Wood, D.E., & Salzberg, S.L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46.

Publication 2014

Archaeal genomes, Bacterial, Bacterial genome, Citrobacter, Deletion, Enterobacter, Enterobacteriaceae, Genome component, Genomes, Indel, Klebsiella, Metagenome, Pelosinus fermentans, Proteus, Proteus mirabilis, Proteus vulgaris, Replicons, Salmonella, Strains

Corresponding Organization : University of Maryland, College Park

Other organizations : Johns Hopkins Medicine, Johns Hopkins University

Protocol cited in 267 other protocols

Variable analysis

independent variables

Metagenome type (HiSeq, MiSeq, simBA-5)
Number of reads (10,000 and 10 million)

dependent variables

Accuracy of taxonomic classification at the genus level

control variables

Sequence quality (trimmed to remove low quality bases and adapter sequences)
Sequence abundance in the HiSeq and MiSeq metagenomes (10% of sequences selected from each of the ten component genome data sets)
Genome composition of the HiSeq and MiSeq metagenomes (ten genomes per metagenome)

controls

Positive control: simBA-5 metagenome created by simulating reads from the set of complete bacterial and archaeal genomes in RefSeq
Negative control: Not explicitly mentioned
