For all data sets, we removed all sequences that contained one or more ambiguous bases (Ns), that did not have an exact match to the expected bar-coded forward primers, or that had an average quality score less than 30 (the V6 region is short and generally low in homopolymer stretches and therefore has high average quality scores) (Sogin et al., 2006; Huse et al., 2007; Kunin et al., 2010). For V6 data sets, we also removed sequences that did not have a recognizable reverse primer sequence.
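The read-level filters described above can be sketched as a single predicate. This is a minimal illustration, not the authors' pipeline: the function name, its parameters, and the choice to treat a "recognizable" reverse primer as an exact substring match are all assumptions made for the example.

```python
def passes_quality_filter(seq, quals, forward_primers,
                          reverse_primer=None, min_avg_quality=30):
    """Return True if a read survives the filters described in the text.

    Hypothetical helper: names and signature are illustrative only.
    Primers are assumed to be given in upper case.
    """
    seq = seq.upper()

    # Reject reads containing one or more ambiguous bases (Ns).
    if "N" in seq:
        return False

    # Require an exact match to an expected bar-coded forward primer
    # at the start of the read.
    if not any(seq.startswith(primer) for primer in forward_primers):
        return False

    # Reject reads whose average quality score is below the threshold.
    if not quals or sum(quals) / len(quals) < min_avg_quality:
        return False

    # V6 data sets only: require a reverse primer. Exact substring
    # matching is an assumption; the text says only "recognizable".
    if reverse_primer is not None and reverse_primer not in seq:
        return False

    return True
```

Passing `reverse_primer=None` skips the last check, which corresponds to the data sets for which no reverse-primer filter is described.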
For non-environmental data sets, we compared all reads to a database of 16S rRNA sequences using GAST (Huse et al., 2008). Reads that had a best match to a non-target sequence that was at least 10% better than the match to the nearest template sequence were considered to be contamination and were removed. Reads that either did not have any match or did not have a match over at least 80% of their length were considered to represent non-target amplification, chimeras or reads with gross errors and were removed. These sequences were compared with the GenBank nt database using BLASTN (Altschul et al., 1990).
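The two removal rules above can be expressed as a small decision function. This is a sketch under stated assumptions: match quality is represented here as percent identity, "at least 10% better" is read as a difference of 10 percentage points, and the function name, labels, and parameters are hypothetical. The text does not specify how GAST scores were compared, so a different interpretation (for example a relative difference, or a distance rather than an identity) may be closer to what was done.

```python
def classify_read(target_identity, nontarget_identity, aligned_fraction,
                  margin=10.0, min_coverage=0.80):
    """Classify a read as 'keep', 'contamination' or 'non_target_or_error'.

    target_identity / nontarget_identity: percent identity of the best
    match to a template (target) and to a non-target sequence, or None
    when there is no match. aligned_fraction: fraction of the read
    length covered by the best match. All names are illustrative.
    """
    # No match at all: non-target amplification, chimera or gross error.
    if target_identity is None and nontarget_identity is None:
        return "non_target_or_error"

    # Match covers less than 80% of the read length.
    if aligned_fraction < min_coverage:
        return "non_target_or_error"

    target = 0.0 if target_identity is None else target_identity
    nontarget = 0.0 if nontarget_identity is None else nontarget_identity

    # Best non-target match beats the nearest template by the margin
    # (assumed here to mean 10 percentage points of identity).
    if nontarget - target >= margin:
        return "contamination"

    return "keep"
```

For example, `classify_read(85.0, 99.0, 1.0)` is labeled as contamination under these assumptions, whereas `classify_read(95.0, 99.0, 1.0)` is kept because the non-target match is better by only 4 points.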
The likelihood of generating chimeras between short, hypervariable rRNA sequences of divergent taxa in the absence of the conserved regions of the gene is very small. The E. coli and S. epidermidis data sets, however, each include two very similar sequences at high abundance. Chimeras formed between these sequences closely resemble the correct sequences and map to the same species; they are therefore detected neither by the minimum BLAST alignment requirement nor by standard chimera-checking software, and they would artificially inflate the calculated error rate of PCR+pyrosequencing. Through visual examination we identified obvious chimeras and removed those specific sequences from the data. Some additional chimeras that contain sequencing errors, and therefore do not exactly match the predicted chimeras, likely remain.
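To illustrate what "predicted chimeras" between two very similar templates could look like, the sketch below enumerates every single-breakpoint hybrid of two parent sequences. It is not the procedure used in the study, which relied on visual examination; it assumes the two parents are already aligned and of equal length, and the function name is hypothetical.

```python
def predicted_chimeras(parent_a, parent_b):
    """Enumerate single-breakpoint chimeras of two aligned parents.

    Assumes both parents have the same length and are aligned position
    by position (an assumption made for this sketch). Hybrids identical
    to either parent are excluded.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned and of equal length")

    chimeras = set()
    for breakpoint in range(1, len(parent_a)):
        # Build both orientations: A-prefix + B-suffix and the reverse.
        for left, right in ((parent_a, parent_b), (parent_b, parent_a)):
            candidate = left[:breakpoint] + right[breakpoint:]
            if candidate != parent_a and candidate != parent_b:
                chimeras.add(candidate)
    return chimeras
```

A read that exactly matches a member of this set would be a candidate chimera; as noted above, a chimera that also carries a sequencing error would not match exactly and would escape this kind of check.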