For non-environmental data sets, we compared all reads to a database of 16S rRNA sequences using GAST (Huse et al., 2008). Reads whose best match was to a non-target sequence, and for which that match was at least 10% better than the match to the nearest template sequence, were considered contamination and removed. Reads that either had no match or had no match covering at least 80% of their length were considered to represent non-target amplification, chimeras or reads with gross errors and were also removed. These sequences were compared with the GenBank nt database using BLASTN (Altschul et al., 1990).
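The two removal rules can be illustrated with a minimal Python sketch. It is not the GAST implementation: the `Match` record, the `classify_read` function and its parameter names are hypothetical. It also assumes that match quality is expressed as fractional identity and that "10% better" means a difference of at least 0.10 in identity, which is only one possible reading of the criterion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    """Hypothetical summary of a read's best hit against one reference set."""
    identity: float       # fraction of aligned positions that agree, 0..1
    aligned_length: int   # number of read bases covered by the alignment


def classify_read(
    read_length: int,
    template: Optional[Match],
    non_target: Optional[Match],
    margin: float = 0.10,        # assumed meaning of "10% better"
    min_coverage: float = 0.80,  # match must span 80% of the read
) -> str:
    """Return 'contamination', 'non_target_or_error' or 'keep'."""
    # Rule 1: the non-target hit beats the nearest template by the margin.
    # Assumption: a missing template hit counts as identity 0.
    if non_target is not None:
        template_identity = template.identity if template is not None else 0.0
        if non_target.identity - template_identity >= margin:
            return "contamination"

    # Rule 2: no match at all, or the best match covers too little of the read.
    candidates = [m for m in (template, non_target) if m is not None]
    if not candidates:
        return "non_target_or_error"
    best = max(candidates, key=lambda m: m.identity)
    if best.aligned_length / read_length < min_coverage:
        return "non_target_or_error"

    return "keep"
```

For example, under these assumptions a 100-base read with a template hit of 0.80 identity and a non-target hit of 0.98 identity is classified as contamination, while a read whose only hit covers 60 of its 100 bases is classified as non-target amplification or error.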
The likelihood of generating chimeras between short, hypervariable rRNA sequences of divergent taxa in the absence of the conserved regions of the gene is very small. The E. coli and S. epidermidis data sets, however, each include two very similar sequences in high density. Chimeras here are very similar to the correct sequences and map to the same species; therefore they are not identified by the minimum