Representative sequences for all analyses (mock community, cross-validated, and novel taxa) were taxonomically classified using the following classifiers and parameter sweeps:

q2-feature-classifier multinomial naive Bayes classifier. Varied k-mer length in {4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32} and confidence threshold in {0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1}.

BLAST+ [9] local sequence alignment followed by consensus taxonomy classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See description below.

VSEARCH [10] global sequence alignment followed by consensus taxonomy classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See description below.

Ribosomal Database Project (RDP) naïve Bayesian classifier [11] (QIIME1 wrapper), with confidence thresholds between 0.0 and 1.0 in steps of 0.1.

Legacy BLAST [15] (QIIME1 wrapper), varying e-value thresholds from 1e-9 to 1000.

SortMeRNA [13] (QIIME1 wrapper), varying minimum consensus fraction from 0.51 to 0.99; similarity from 0.8 to 0.9; max accepts from 1 to 10; and coverage from 0.8 to 0.9.

UCLUST [12] (QIIME1 wrapper), varying minimum consensus fraction from 0.51 to 0.99; similarity from 0.8 to 0.9; and max accepts from 1 to 10.
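
The sweeps listed above can be enumerated programmatically. The following is a minimal sketch, assuming each sweep is the full cross product of the listed values (the text gives the values but does not state how they were combined). The helper `parameter_grid` and the setting names `kmer_length` and `confidence` are hypothetical labels chosen for illustration, not the argument names of any classifier.

```python
from itertools import product

# Values copied from the naive Bayes sweep described in the text.
KMER_LENGTHS = [4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32]
CONFIDENCES = [0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1]

# RDP confidence thresholds: 0.0 to 1.0 in steps of 0.1.
# Integer division by 10 avoids accumulating floating-point error.
RDP_CONFIDENCES = [i / 10 for i in range(11)]


def parameter_grid(**settings):
    """Yield one dict per combination of the supplied setting values.

    Assumption: the sweep is a full cross product of all settings.
    """
    names = list(settings)
    for values in product(*(settings[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical setting names, used only to label the combinations.
naive_bayes_grid = list(
    parameter_grid(kmer_length=KMER_LENGTHS, confidence=CONFIDENCES)
)
print(len(naive_bayes_grid))  # 12 k-mer lengths x 9 thresholds = 108
```

Under the cross-product assumption, each dict in `naive_bayes_grid` corresponds to one classifier configuration to run against every dataset.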

With the exception of the UCLUST classifier, we benchmarked only open-source, free, marker-gene-agnostic classifiers, i.e., those that can be trained/aligned on a reference database of any marker gene. Hence, we excluded classifiers that can only assign taxonomy to a particular marker gene (e.g., only bacterial 16S rRNA genes) and those that rely on specialized or unavailable reference databases and cannot be trained on other databases, which effectively restricts their use with other marker genes and custom databases.
Classification of bacterial/archaeal 16S rRNA gene sequences was made using the Greengenes (13_8 release) [5] reference sequence database preclustered at 99% ID, with amplicons for the domain of interest extracted using primers 27F/1492R [27], 515F/806R [28], or 27F/534R [29] with q2-feature-classifier’s extract_reads method. Classification of fungal ITS sequences was made using the UNITE database (version 7.1 QIIME developer release) [31] preclustered at 99% ID. For the cross-validation and novel taxon classification tests, we prefiltered the reference databases to remove sequences with incomplete or ambiguous taxonomies (containing the substrings ‘unknown,’ ‘unidentified,’ or ‘_sp’ or terminating at any level with ‘__’).
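
The prefiltering rule for incomplete or ambiguous taxonomies can be illustrated with a short sketch. This is not the code used in the notebooks. It is a literal reading of the rule as stated, assuming semicolon-delimited taxonomy strings with rank prefixes such as `g__` and case-sensitive substring matching. The function name `has_complete_taxonomy` is hypothetical.

```python
# Substrings named in the text as markers of ambiguous annotations.
AMBIGUOUS_SUBSTRINGS = ("unknown", "unidentified", "_sp")


def has_complete_taxonomy(taxonomy, sep=";"):
    """Return True if the taxonomy string passes the prefilter.

    Assumptions: ranks are separated by `sep`, and matching is
    case-sensitive, literal substring matching.
    """
    if any(marker in taxonomy for marker in AMBIGUOUS_SUBSTRINGS):
        return False
    # A rank such as 's__' carries a prefix but no name, i.e. the
    # annotation terminates at that level with a double underscore.
    ranks = [rank.strip() for rank in taxonomy.split(sep)]
    return not any(rank.endswith("__") for rank in ranks)


examples = [
    "k__Bacteria; p__Firmicutes; c__Bacilli; g__Lactobacillus; s__brevis",
    "k__Bacteria; p__Firmicutes; c__Bacilli; g__Lactobacillus; s__",
    "k__Fungi;p__Ascomycota;g__Aspergillus;s__Aspergillus_sp",
    "k__Fungi;p__unidentified",
]
kept = [taxonomy for taxonomy in examples if has_complete_taxonomy(taxonomy)]
```

Only the first example is kept. Because the match is a literal substring test, it would also discard any name that happens to contain ‘_sp’, so a production filter may need a stricter pattern.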
The notebooks detailing taxonomy classification sweeps of mock communities are available at https://github.com/caporaso-lab/tax-credit-data/tree/0.1.0/ipynb/mock-community. Cross-validated read classification sweeps are available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/cross-validated/taxonomy-assignment.ipynb. Novel taxon classification sweeps are available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/taxonomy-assignment.ipynb.