Benchmarking was performed on isolate sequences and on an amplicon test dataset. Twenty-four sequences of rumen and intestinal methanogens were selected for benchmarking with isolate sequences (see Table S2). The selected sequences were either exported from SILVA or are published as part of this study (see Table S2 for isolates). Analysis was performed on long (>1,000 bp) sequences and on sequences of the V6–V8 variable regions of the 16S rRNA gene. Taxonomic assignment of sequences was carried out using the parallel_assign_taxonomy_blast.py script in QIIME, version 1.5. Three reference databases were used for taxonomic assignment: RIM-DB (File S1 and File S3), SILVA (release 111; Pruesse et al., 2007) and Greengenes (release GG_13_05; McDonald et al., 2012). QIIME-compatible SILVA and Greengenes databases were downloaded from http://qiime.wordpress.com. The options and files used for taxonomic assignment with SILVA were --id_to_taxonomy Silva_111_taxa_map_full.txt and --blast_db Silva_111_full_unique.fasta; those used with Greengenes were --id_to_taxonomy gg_13_5_taxonomy and --blast_db gg_13_5.fasta. Abundance tables were generated, and only OTUs with a mean relative abundance of at least 1% across all samples were retained.
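To illustrate the role of the ID-to-taxonomy map in a BLAST-based assignment, the following minimal sketch shows how a best-hit reference ID can be translated into a lineage. It is not QIIME's implementation. It assumes that the map is a tab-separated file with a reference sequence ID followed by a semicolon-delimited lineage, and that a best BLAST hit per query has already been determined. The function names, the toy reference IDs and lineages, and the "No blast hit" placeholder are hypothetical and serve illustration only.

```python
import io


def load_id_to_taxonomy(handle):
    """Parse a tab-separated map: reference ID, then a semicolon-delimited lineage.

    Assumed file layout (illustrative): one reference per line, blank lines
    and lines starting with '#' are skipped.
    """
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        seq_id, lineage = line.split("\t", 1)
        mapping[seq_id] = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    return mapping


def assign_from_best_hits(best_hits, id_to_taxonomy):
    """Give each query the lineage of its best-hit reference.

    best_hits maps a query ID to a reference ID, or to None if no hit was found.
    Queries without a usable hit receive a placeholder label.
    """
    assignments = {}
    for query, hit_id in best_hits.items():
        lineage = id_to_taxonomy.get(hit_id) if hit_id is not None else None
        assignments[query] = lineage if lineage is not None else ["No blast hit"]
    return assignments


# Hypothetical toy data, not taken from RIM-DB, SILVA, or Greengenes.
example_map = io.StringIO(
    "ref1\tArchaea;Euryarchaeota;Methanobacteria;Methanobacteriales;"
    "Methanobacteriaceae;Methanobrevibacter\n"
    "ref2\tArchaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;"
    "Methanosarcinaceae;Methanosarcina\n"
)
taxonomy = load_id_to_taxonomy(example_map)
best_hits = {"isolate_A": "ref1", "isolate_B": None}
assignments = assign_from_best_hits(best_hits, taxonomy)
```

Because the lineage comes entirely from the best-hit reference, the same query can receive different labels depending on which reference database and taxonomy map are supplied.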
A test set of amplicon sequence data was generated by combining several sequence datasets (for accession numbers, see Table S4). These datasets contain partial 16S rRNA gene sequences covering nucleotide positions 935–1,385 (Escherichia coli 16S rRNA nucleotide numbering; Brosius et al., 1978). Sequence data were processed using the QIIME package, version 1.5 (Caporaso et al., 2010). Reads were quality filtered and assigned to the corresponding sample by barcode using the QIIME split_libraries.py script. Only reads with average quality scores >25 were included in the analysis. The resulting fna files from all experiments were concatenated and denoised together with the combined flowgram files, using the denoise_wrapper.py script with default settings (Reeder & Knight, 2010). The output was then processed with the inflate_denoiser_output.py script (default settings). Denoised sequence reads were chimera-checked with the QIIME script parallel_identify_chimeric_seqs.py, using the parameters -d 4 and -n 2, with RIM-DB as the reference database. The chimeric sequences identified were removed from the dataset using the QIIME filter_fasta.py script. Subsequently, the denoised and chimera-checked dataset was processed with the QIIME pipeline. Sequences were clustered into operational taxonomic units (OTUs) using the default clustering method, UCLUST (Edgar, 2010), with a sequence similarity cut-off of 99% (pick_otus.py option: -s 0.99). Abundance tables were generated, and only OTUs with a mean relative abundance of at least 1% across all samples were retained. Taxonomic assignment of representative sequences was carried out as described for the isolate sequences.
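The abundance filter described above can be sketched as follows. This is one possible reading of the criterion, not the script used in the study. It assumes that counts are first converted to relative abundances within each sample, then averaged across all samples, and that OTUs with a mean of at least 1% are kept. The function name and the toy counts are hypothetical.

```python
def filter_otus_by_mean_abundance(counts, min_mean=0.01):
    """Return {otu_id: mean relative abundance} for OTUs passing the threshold.

    counts maps an OTU ID to a list of read counts, one entry per sample,
    with samples in the same order for every OTU.
    """
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    # Total reads per sample, used to convert counts to relative abundances.
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for otu, row in counts.items():
        # A sample with zero reads contributes a relative abundance of 0.
        rel = [c / t if t else 0.0 for c, t in zip(row, totals)]
        mean_rel = sum(rel) / n_samples
        if mean_rel >= min_mean:
            kept[otu] = mean_rel
    return kept


# Hypothetical toy table: three OTUs, two samples of 1,000 reads each.
toy_counts = {
    "OTU_1": [900, 700],
    "OTU_2": [95, 295],
    "OTU_3": [5, 5],
}
retained = filter_otus_by_mean_abundance(toy_counts)
```

In this toy table, OTU_3 accounts for 0.5% of the reads in each sample and is therefore dropped, while the other two OTUs are retained.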