Benchmarking Taxonomic Classification of Microbiome Sequences

The simulated reads used here were derived from the reference databases using the “Cross-validated classification performance” notebooks in our project repository. The reference databases were either Greengenes or UNITE (99% OTUs), cleaned according to taxonomic label to remove sequences with ambiguous or null labels. Reference sequences were trimmed to simulate amplification with standard PCR primers, and the first 250 bases downstream (3′) of the forward primer were sliced out. The bacterial primers used were 27F/1492R [27] to simulate full-length 16S rRNA gene sequences, 515F/806R [28] to simulate 16S rRNA gene V4 domain sequences, and 27F/534R [29] to simulate 16S rRNA gene V1–3 domain sequences; the fungal primers used were BITSf/B58S3r [30] to simulate ITS1 internal transcribed spacer DNA sequences. The exact sequences were used for cross-validation and were not altered to simulate sequencing error; thus, our benchmarks simulate denoised sequence data [4] and isolate classifier performance from the effects of sequencing errors.
Each database was stratified by taxonomy, and 10-fold randomized cross-validation data sets were generated using scikit-learn’s library functions. Where a taxonomic label had fewer than 10 instances, taxonomies were amalgamated to make sufficiently large strata. If, as a result, a taxonomy in any test set was not present in the corresponding training set, the expected taxonomy label was truncated to the nearest common taxonomic rank observed in the training set (e.g., Lactobacillus casei would become Lactobacillus). The notebook detailing simulated read generation (for both cross-validated and novel taxon reads) prior to taxonomy classification is available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/dataset-generation.ipynb.
The evaluation of classification performance also differed slightly from a standard machine-learning scenario, because the classifiers in this study can decline to classify a sample beyond the taxonomic level at which they are confident. This also accommodates the taxonomy truncation that we performed for this test. The methodology was consistent with that used below for novel taxon evaluations, so we defer its description to the next section.
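The stratification step described above can be illustrated with a minimal sketch. The text says only that scikit-learn's library functions were used, so the choice of StratifiedKFold here is an assumption, as is the amalgamation rule (rare labels are merged upward into their parent rank until each stratum is large enough). The function names amalgamate_strata and cross_validation_folds, the semicolon-delimited label format, and the seed are hypothetical; the notebook linked above remains the authoritative implementation.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold


def amalgamate_strata(taxonomies, min_size=10, sep=";"):
    """Return one stratum label per sequence.

    ASSUMPTION: labels with fewer than `min_size` members are merged
    into their parent rank, repeatedly, until every stratum is large
    enough or only the top rank is left.
    """
    strata = [tuple(r.strip() for r in t.split(sep)) for t in taxonomies]
    while True:
        counts = Counter(strata)
        # Strata that are too small and can still be shortened.
        small = {s for s, n in counts.items() if n < min_size and len(s) > 1}
        if not small:
            break
        strata = [s[:-1] if s in small else s for s in strata]
    return [sep.join(s) for s in strata]


def cross_validation_folds(taxonomies, n_splits=10, seed=0):
    """Yield (train_indices, test_indices) for stratified, shuffled k-fold."""
    strata = amalgamate_strata(taxonomies, min_size=n_splits)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Only the stratum labels matter for splitting, so X is a placeholder.
    placeholder = np.zeros(len(strata))
    for train_idx, test_idx in skf.split(placeholder, strata):
        yield train_idx, test_idx
```

Each sequence lands in exactly one test fold, and merging rare labels first gives every stratum enough members to appear in all ten folds.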
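The taxonomy truncation mentioned here can be sketched as follows. This is an illustrative reading of "truncated to the nearest common taxonomic rank observed in the training set", not the authors' code: the helper names training_prefixes and truncate_expected, the semicolon-delimited label format, and the example labels are all assumptions.

```python
def training_prefixes(training_taxonomies, sep=";"):
    """Collect every rank prefix (kingdom, kingdom;phylum, ...) seen in training."""
    known = set()
    for tax in training_taxonomies:
        ranks = tuple(r.strip() for r in tax.split(sep))
        for depth in range(1, len(ranks) + 1):
            known.add(ranks[:depth])
    return known


def truncate_expected(expected, known_prefixes, sep=";"):
    """Trim an expected label to its deepest prefix present in training.

    Returns an empty string if not even the top rank was observed.
    """
    ranks = tuple(r.strip() for r in expected.split(sep))
    for depth in range(len(ranks), 0, -1):
        if ranks[:depth] in known_prefixes:
            return sep.join(ranks[:depth])
    return ""


# Hypothetical example mirroring the Lactobacillus casei case in the text.
train = [
    "k__Bacteria;g__Lactobacillus;s__acidophilus",
    "k__Bacteria;g__Bacillus;s__subtilis",
]
known = training_prefixes(train)
truncated = truncate_expected("k__Bacteria;g__Lactobacillus;s__casei", known)
```

Under this reading, a classifier that stops at the genus for such a test sequence is scored as correct, which is how the truncation and the option to withhold deeper classification fit together.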

Bokulich N.A., Kaehler B.D., Rideout J.R., Dillon M., Bolyen E., Knight R., Huttley G.A., & Caporaso J.G. (2018). Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome, 6, 90.

Publication 2018

Corresponding Organization : Northern Arizona University

Other organizations : Australian National University, University of California, San Diego

Protocol cited in 545 other protocols

Variable analysis

independent variables

Primer used for simulated read generation (27F/1492R, 515F/806R, 27F/534R, BITSf/B58S3r)

dependent variables

Classification performance of taxonomic classifiers

control variables

Reference databases used (Greengenes or UNITE 99% OTUs)
Taxonomic label cleaning process to remove ambiguous or null labels
Simulated read generation by trimming reference sequences to mimic amplification using standard PCR primers
Absence of simulated sequencing errors in the generated reads
10-fold randomized cross-validation data sets generated using scikit-learn's library functions
Taxonomy truncation for test sets where a taxonomy was not present in the corresponding training set
