For our primary analysis, we used a previously published dataset [19] derived from the microbiomes of fourteen individual nematodes. Two regions of the bacterial 16S ribosomal RNA gene (V3-V5 and V6-V8) were PCR-amplified and sequenced using the Roche-454 GS FLX platform with the Titanium protocol (800 flows), resulting in just over 40,000 reads.
To calculate error rates, we retrieved the Titanium mock community dataset of Quince et al. [6], which was used to validate AmpliconNoise as well as other denoising algorithms [23]. The 62,873 reads were derived from PCR amplification of the V4-V5 region of the 16S gene, using 91 plasmid clones as the source DNA (mock community). The set of original reads (“Stage 0”) was determined by filtering only for the MID tag and primer sequences, allowing one mismatch to the MID tag and two mismatches to the primer. The initial error rate was calculated by finding the best match of each read to the 90 reference sequences (see Additional files 2 and 3) using ClustalW [24] with a reduced gap-opening penalty (-gapopen=1). In this and all other error-rate calculations, we counted only insertions and deletions, which are the dominant form of error in Roche-454 pyrosequencing [9]. We filtered the reads with FlowClus (version 1.1) using criteria similar to those recommended for the QIIME denoising pipeline, and denoised them with a constant value of 0.90. The dataset was also processed through the equivalent steps of AmpliconNoise V1.27 [6] and the denoising pipeline in QIIME 1.8.0 [8].
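The indel-only error count described above can be illustrated with a minimal Python sketch. It assumes that each read has already been aligned to its best-matching reference and that the alignment is available as two equal-length gapped strings. The function names (`count_indels`, `indel_error_rate`) are hypothetical, and the denominator (the number of read bases) is an assumption, since the text does not state how the rate was normalized.

```python
from typing import Iterable, Tuple


def count_indels(aligned_read: str, aligned_ref: str) -> Tuple[int, int]:
    """Count insertions and deletions in one gapped pairwise alignment.

    An insertion is a read base opposite a gap in the reference;
    a deletion is a reference base opposite a gap in the read.
    Substitutions are ignored, following the indel-only convention.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    insertions = deletions = 0
    for read_base, ref_base in zip(aligned_read, aligned_ref):
        if read_base == "-" and ref_base == "-":
            continue  # gap-only column, carries no information
        if ref_base == "-":
            insertions += 1
        elif read_base == "-":
            deletions += 1
    return insertions, deletions


def indel_error_rate(alignments: Iterable[Tuple[str, str]]) -> float:
    """Indels per read base over (aligned_read, aligned_ref) pairs.

    Assumption: the rate is normalized by the total number of read bases.
    """
    total_indels = 0
    total_bases = 0
    for aligned_read, aligned_ref in alignments:
        ins, dels = count_indels(aligned_read, aligned_ref)
        total_indels += ins + dels
        total_bases += sum(1 for base in aligned_read if base != "-")
    return total_indels / total_bases if total_bases else 0.0
```

For example, the alignment of read `ACG-T` against reference `A-GCT` contains one insertion and one deletion over four read bases, so this sketch reports a rate of 0.5 for that single pair.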
To evaluate scalability, we analyzed the large datasets from Krych et al. [25]. In this study of the human gut microbiome, the V3-V4 region of the 16S gene was amplified by PCR and sequenced on the Roche-454 GS FLX Titanium platform. The total number of reads across all three groups (baseline, synbiotic, and placebo) was 2.2 million.