The simulated reads used here were derived from the reference databases using the “Cross-validated classification performance” notebooks in our project repository. The reference databases were either Greengenes or UNITE (99% OTUs), cleaned according to taxonomic label to remove sequences with ambiguous or null labels. Reference sequences were trimmed to simulate amplification with standard PCR primers, and the first 250 bases downstream (3′) of the forward primer were sliced out. The bacterial primers used were 27F/1492R [27] to simulate full-length 16S rRNA gene sequences, 515F/806R [28] to simulate 16S rRNA gene V4 domain sequences, and 27F/534R [29] to simulate 16S rRNA gene V1–3 domain sequences; the fungal primers used were BITSf/B58S3r [30] to simulate ITS1 internal transcribed spacer DNA sequences. The exact sequences were used for cross-validation and were not altered to simulate any sequencing error; thus, our benchmarks simulate denoised sequence data [4] and isolate classifier performance from the effects of sequencing error. Each database was stratified by taxonomy, and 10-fold randomized cross-validation data sets were generated using scikit-learn’s library functions. Where a taxonomic label had fewer than 10 instances, taxonomies were amalgamated to make sufficiently large strata. If, as a result, a taxonomy in any test set was not present in the corresponding training set, the expected taxonomy label was truncated to the nearest common taxonomic rank observed in the training set (e.g., Lactobacillus casei would become Lactobacillus). The notebook detailing simulated read generation (for both cross-validated and novel taxon reads) prior to taxonomy classification is available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/dataset-generation.ipynb.
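Two of the steps described above, slicing a simulated read from a reference sequence and truncating an expected label to the nearest rank seen in training, can be illustrated with a minimal Python sketch. This is not the code from the tax-credit notebooks: the function names, the exact-match primer search (real primers contain degenerate bases, which this ignores), and the "; " rank separator are assumptions made purely for illustration.

```python
# Illustrative sketch only. Function names and the exact-match primer
# search are hypothetical and are not taken from the tax-credit notebooks.


def extract_simulated_read(seq, fwd_primer, rev_primer_rc, read_length=250):
    """Return up to `read_length` bases 3' of the forward primer.

    Assumes both primers match the reference exactly (no degenerate
    bases) and that `rev_primer_rc` is already reverse-complemented.
    Returns None if either primer site is not found.
    """
    fwd_pos = seq.find(fwd_primer)
    if fwd_pos == -1:
        return None
    start = fwd_pos + len(fwd_primer)
    end = seq.find(rev_primer_rc, start)
    if end == -1:
        return None
    # Keep only the amplicon interior, then slice the leading bases.
    return seq[start:end][:read_length]


def truncate_to_training(expected, training_labels, sep="; "):
    """Truncate `expected` to the deepest lineage prefix seen in training.

    Returns an empty string if no rank of `expected` occurs in training.
    """
    # Collect every lineage prefix present in the training labels.
    seen = set()
    for label in training_labels:
        ranks = label.split(sep)
        for depth in range(1, len(ranks) + 1):
            seen.add(tuple(ranks[:depth]))

    # Walk up from the most specific rank until a prefix is found.
    ranks = expected.split(sep)
    for depth in range(len(ranks), 0, -1):
        if tuple(ranks[:depth]) in seen:
            return sep.join(ranks[:depth])
    return ""


if __name__ == "__main__":
    training = ["k__Bacteria; g__Lactobacillus; s__acidophilus"]
    print(truncate_to_training(
        "k__Bacteria; g__Lactobacillus; s__casei", training))
```

In this toy case the species-level label is absent from the training set, so the expected label falls back to the genus, mirroring the Lactobacillus casei example in the text.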
The evaluation of classification performance was also slightly modified from a standard machine-learning scenario, because the classifiers in this study are able to refuse classification beyond the taxonomic level at which they are no longer confident for a given sample. This also accommodates the taxonomy truncation that we performed for this test. The methodology was consistent with that used below for novel taxon evaluations, so we defer its description to the next section.