I used the NCBI BLAST 16S rRNA database (BLAST16S) (Sayers et al., 2012), downloaded July 1, 2017; the RDP 16S rRNA training set v16 (RDP16S); and the Warcup fungal ITS training set v2 (WITS) (Deshpande et al., 2015). The sequences and taxonomy annotations in these databases were mostly obtained from authoritatively named isolate strains. While some of the taxonomy annotations may be erroneous, I pragmatically treated them as authoritative and used these databases as truth standards for the benchmark tests.

BLAST16S and the RDP 16S rRNA training set have highly uneven numbers of sequences per genus. For example, ∼40% (950/2,273) of the genera in BLAST16S have only a single sequence, while the most abundant genus, Streptomyces, has 1,162 sequences, more than all singletons combined. To investigate the effects of uneven representation and obtain a more balanced reference, I constructed a subset (BLAST16S/10) by imposing a maximum of 10 sequences per genus, discarding sequences at random as needed to meet this constraint (sketched in code below).

I also considered two larger databases: the subset of Greengenes clustered at 97% identity (GG97), which is the default 16S rRNA reference database in QIIME v1, and UNITE (Kõljalg et al., 2013). GG97 and UNITE were not used as truth standards because most of their taxonomy annotations are computational or manual predictions. To investigate prediction performance with shorter sequences, I extracted the V4 and V3–V5 segments from BLAST16S and BLAST16S/10 using V4 primer sequences from Kozich et al. (2013) and V3–V5 primer sequences from Methé et al. (2012) (also sketched below). Sequence error was not modeled because state-of-the-art methods are able to extract highly accurate sequences from noisy next-generation reads (Edgar, 2013; Callahan et al., 2016).
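A minimal Python sketch of the per-genus subsetting used to build BLAST16S/10 is given below. It is not the original implementation: the file names (blast16s.fa, blast16s_10.fa), the tab-separated accession-to-genus table (blast16s_genus.tsv), and the fixed random seed are assumptions made for illustration; only the cap of 10 sequences per genus and the random discarding of surplus sequences come from the description above.

import random
from collections import defaultdict

MAX_PER_GENUS = 10
random.seed(1)  # fixed seed so the subset is reproducible (illustrative choice)

def read_fasta(path):
    # Yield (accession, sequence) pairs from a FASTA file.
    label, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if label is not None:
                    yield label, "".join(chunks)
                label, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)

# Map accession -> genus from the (assumed) two-column taxonomy table.
genus_of = {}
with open("blast16s_genus.tsv") as handle:
    for line in handle:
        acc, genus = line.rstrip("\n").split("\t")
        genus_of[acc] = genus

# Group accessions by genus, then keep at most MAX_PER_GENUS per genus,
# discarding surplus accessions at random.
by_genus = defaultdict(list)
for acc, genus in genus_of.items():
    by_genus[genus].append(acc)

keep = set()
for genus, accs in by_genus.items():
    if len(accs) > MAX_PER_GENUS:
        accs = random.sample(accs, MAX_PER_GENUS)
    keep.update(accs)

# Write the balanced subset (BLAST16S/10).
with open("blast16s_10.fa", "w") as out:
    for acc, seq in read_fasta("blast16s.fa"):
        if acc in keep:
            out.write(">%s\n%s\n" % (acc, seq))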
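To illustrate how the V4 and V3–V5 segments can be cut out of full-length sequences, the sketch below locates primer sites in silico and extracts the intervening region. It is a sketch under stated assumptions, not the pipeline used here: the two primer strings are a commonly used 515F/806R-style pair given only as placeholders (substitute the exact V4 primers from Kozich et al. (2013) or the V3–V5 primers from Methé et al. (2012)), exact matching of degenerate primers is assumed, and the extracted segment excludes the primer sites themselves.

import re

# IUPAC degenerate nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def revcomp(seq):
    # Reverse-complement a sequence that may contain IUPAC degenerate codes.
    return seq.translate(COMPLEMENT)[::-1]

def primer_regex(primer):
    # Turn a degenerate primer into a compiled regular expression.
    return re.compile("".join(IUPAC[c] for c in primer.upper()))

FWD = "GTGCCAGCMGCCGCGGTAA"   # placeholder forward primer (515F-style)
REV = "GGACTACHVGGGTWTCTAAT"  # placeholder reverse primer (806R-style)

fwd_re = primer_regex(FWD)
rev_re = primer_regex(revcomp(REV))  # the reverse primer binds the opposite strand

def extract_segment(seq):
    # Return the region between the two primer sites, or None if either is missing.
    seq = seq.upper()
    fwd_hit = fwd_re.search(seq)
    if fwd_hit is None:
        return None
    rev_hit = rev_re.search(seq, fwd_hit.end())
    if rev_hit is None:
        return None
    return seq[fwd_hit.end():rev_hit.start()]

Applying extract_segment to each record of BLAST16S and BLAST16S/10 would yield the corresponding V4 (or V3–V5) benchmark sets, with records lacking a primer match discarded.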