Benchmarking Taxonomy Classification Pipelines

Representative sequences for all analyses (mock community, cross-validated, and novel taxa) were classified taxonomically using the following taxonomy classifiers and parameter sweeps:

q2-feature-classifier multinomial naive Bayes classifier. Varied k-mer length in {4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32} and confidence threshold in {0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1}.

BLAST+ [9] local sequence alignment followed by consensus taxonomy classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See description below.

VSEARCH [10] global sequence alignment followed by consensus taxonomy classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See description below.

Ribosomal Database Project (RDP) naïve Bayesian classifier [11] (QIIME1 wrapper), with confidence thresholds between 0.0 and 1.0 in steps of 0.1.

Legacy BLAST [15] (QIIME1 wrapper), varying e-value thresholds from 1e-9 to 1000.

SortMeRNA [13] (QIIME1 wrapper), varying minimum consensus fraction from 0.51 to 0.99; similarity from 0.8 to 0.9; max accepts from 1 to 10; and coverage from 0.8 to 0.9.

UCLUST [12] (QIIME1 wrapper), varying minimum consensus fraction from 0.51 to 0.99; similarity from 0.8 to 0.9; and max accepts from 1 to 10.
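
The sweeps listed above amount to parameter grids. As a minimal sketch, assuming every listed k-mer length is paired with every listed confidence threshold (the text does not state this explicitly), the naive Bayes grid could be enumerated as follows. The helper `parameter_grid` and the key names `kmer_length` and `confidence` are hypothetical and are not taken from q2-feature-classifier.

```python
from itertools import product

# Values copied from the naive Bayes sweep described above.
KMER_LENGTHS = [4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32]
CONFIDENCE_THRESHOLDS = [0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1]


def parameter_grid(**axes):
    """Yield one dict per combination of the supplied parameter values.

    Hypothetical helper: assumes a full Cartesian product over all axes.
    """
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


naive_bayes_grid = list(
    parameter_grid(kmer_length=KMER_LENGTHS, confidence=CONFIDENCE_THRESHOLDS)
)
print(len(naive_bayes_grid))  # 12 k-mer lengths x 9 thresholds = 108
```

The same helper could hold the axes of the other classifiers once their intermediate values, which are given above only as ranges, are known.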

With the exception of the UCLUST classifier, we benchmarked only open-source, free, marker-gene-agnostic classifiers, i.e., those that can be trained or aligned on a reference database of any marker gene. Hence, we excluded classifiers that can assign taxonomy only to a particular marker gene (e.g., only bacterial 16S rRNA genes) and those that rely on specialized or unavailable reference databases and cannot be trained on other databases, which effectively restricts their use with other marker genes and custom databases.
Classification of bacterial/archaeal 16S rRNA gene sequences was performed using the Greengenes (13_8 release) [5] reference sequence database preclustered at 99% ID, with amplicons for the domain of interest extracted using primers 27F/1492R [27], 515F/806R [28], or 27F/534R [29] with q2-feature-classifier’s extract_reads method. Classification of fungal ITS sequences was performed using the UNITE database (version 7.1 QIIME developer release) [31] preclustered at 99% ID. For the cross-validation and novel taxon classification tests, we prefiltered the reference databases to remove sequences with incomplete or ambiguous taxonomies (containing the substrings ‘unknown,’ ‘unidentified,’ or ‘_sp’ or terminating at any level with ‘__’).
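
The prefiltering rule above can be sketched as a simple string test. This is a minimal illustration, not the authors' code: it assumes semicolon-delimited, rank-prefixed taxonomy strings (e.g., `g__Lactobacillus`), applies the substrings literally and case-sensitively, and the function names and sequence IDs are hypothetical.

```python
# Substrings named in the text as marking ambiguous taxonomies.
AMBIGUOUS_SUBSTRINGS = ("unknown", "unidentified", "_sp")


def has_complete_taxonomy(taxonomy, delimiter=";"):
    """Return True if a taxonomy string passes the prefilter.

    Assumption: levels are separated by `delimiter`, and an empty rank
    is written as a bare prefix such as 's__'.
    """
    if any(token in taxonomy for token in AMBIGUOUS_SUBSTRINGS):
        return False
    levels = [level.strip() for level in taxonomy.split(delimiter)]
    # Reject lineages in which any level ends with '__' (no name given).
    return not any(level.endswith("__") for level in levels)


def prefilter(taxonomy_by_id):
    """Keep only the entries whose taxonomy is complete and unambiguous."""
    return {
        seq_id: tax
        for seq_id, tax in taxonomy_by_id.items()
        if has_complete_taxonomy(tax)
    }


# Hypothetical reference entries for illustration only.
reference = {
    "seq1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
            "f__Lactobacillaceae; g__Lactobacillus; s__brevis",
    "seq2": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
            "f__Lactobacillaceae; g__Lactobacillus; s__",
    "seq3": "k__Fungi;p__Ascomycota;s__Penicillium_sp",
    "seq4": "k__Fungi;p__unidentified",
}
kept = prefilter(reference)
print(sorted(kept))
```

Because the match is a literal substring test, it also rejects any name in which ‘_sp’ happens to occur, such as a species epithet beginning with ‘sp’ directly after a rank prefix; a stricter pattern would be a design choice beyond what the text specifies.
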
The notebooks detailing taxonomy classification sweeps of mock communities are available at https://github.com/caporaso-lab/tax-credit-data/tree/0.1.0/ipynb/mock-community. Cross-validated read classification sweeps are available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/cross-validated/taxonomy-assignment.ipynb. Novel taxon classification sweeps are available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/taxonomy-assignment.ipynb.

Bokulich N.A., Kaehler B.D., Rideout J.R., Dillon M., Bolyen E., Knight R., Huttley G.A., & Caporaso J.G. (2018). Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome, 6, 90.

Publication 2018

Corresponding organization: Northern Arizona University

Other organizations: Australian National University; University of California, San Diego

Variable analysis

independent variables

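K-mer length, q2-feature-classifier naive Bayes (4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32)
Confidence threshold, q2-feature-classifier naive Bayes (0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1)
Confidence threshold, RDP (0.0 to 1.0 in steps of 0.1)
Max accepts, BLAST+ and VSEARCH (1 to 100)
Percent identity, BLAST+ and VSEARCH (0.80 to 0.99)
Minimum consensus, BLAST+ and VSEARCH (0.51 to 0.99)
E-value threshold, legacy BLAST (1e-9 to 1000)
Minimum consensus fraction, SortMeRNA and UCLUST (0.51 to 0.99)
Similarity, SortMeRNA and UCLUST (0.8 to 0.9)
Max accepts, SortMeRNA and UCLUST (1 to 10)
Coverage, SortMeRNA (0.8 to 0.9)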

dependent variables

Taxonomy classification performance of representative sequences

control variables

Bacterial/archaeal 16S rRNA gene sequences classified using Greengenes (13_8 release) reference database preclustered at 99% ID
Fungal ITS sequences classified using UNITE database (version 7.1 QIIME developer release) preclustered at 99% ID
Sequences with incomplete or ambiguous taxonomies (containing 'unknown,' 'unidentified,' '_sp' or terminating at any level with '__') were prefiltered
