Output sequences were first compared to the known 16S rRNA gene reference sequences of the members of each mock community. If an output sequence exactly matched a reference sequence, it was classified as Reference; if it had one mismatch or gap relative to a reference sequence, it was classified as One Off. Output sequences that were at least Hamming distance 2 from every reference sequence were then BLASTed against the nr/nt database. If the best hit was an exact match covering the full output sequence, the sequence was classified as Exact. If the best hit differed by a single mismatch or indel, it was classified as One Off. Output sequences that remained unclassified at this point were classified as Other.
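The decision procedure described above can be summarized as a minimal Python sketch. It is an illustration under stated assumptions, not the code used in the analysis: the names `is_one_off` and `classify` are hypothetical, and the BLAST search is represented by a caller-supplied function, `best_hit_differences`, that is assumed to return the number of mismatches plus indels between an output sequence and its best full-length nr/nt hit (or `None` if there is no such hit).

```python
from typing import Callable, Iterable, Optional


def is_one_off(a: str, b: str) -> bool:
    """Return True if a and b differ by exactly one substitution
    or by exactly one single-base gap (insertion or deletion)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: count substitutions.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: skip the shared prefix, then drop one base
    # from the longer sequence and require the remainders to agree.
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    i = 0
    while i < len(short) and short[i] == long_[i]:
        i += 1
    return short[i:] == long_[i + 1:]


def classify(
    seq: str,
    references: Iterable[str],
    best_hit_differences: Callable[[str], Optional[int]],
) -> str:
    """Classify one output sequence as Reference, One Off, Exact, or Other."""
    refs = list(references)
    # Step 1: compare against the mock community reference sequences.
    if seq in refs:
        return "Reference"
    if any(is_one_off(seq, ref) for ref in refs):
        return "One Off"
    # Step 2: fall back to the database search (hypothetical callable,
    # standing in for a BLAST query against nr/nt).
    differences = best_hit_differences(seq)
    if differences == 0:
        return "Exact"
    if differences == 1:
        return "One Off"
    return "Other"
```

Passing the database lookup in as a function keeps the sketch independent of any particular BLAST interface; in practice that function would wrap whatever search and parsing steps are used to obtain the best hit.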
We included the BLAST step against nr/nt because even amplicon sequencing data from communities with a putatively known reference composition will contain contaminant sequences. Contaminants represent real, albeit unwanted, biological variation and should be identified when correcting amplicon errors. While the nr/nt database is imperfect, it is reasonable to expect that Exact matches are far more likely to be real variants than are Others. Output sequences classified as Other, together with output sequences classified as One Off that differed by one substitution from a more abundant output sequence, were considered a proxy for false positives. Output sequences classified as Reference or Exact were considered true positives.
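The mapping from classification to true and false positives can likewise be sketched in Python. This is a hedged illustration only: the function names and the input layout (a dictionary from output sequence to its classification and abundance) are assumptions. The text does not say how a One Off sequence without a more abundant neighbor one substitution away is counted, so the sketch labels such sequences "unassigned" rather than guessing.

```python
from typing import Dict, Tuple


def one_substitution_apart(a: str, b: str) -> bool:
    """Return True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def label_outputs(outputs: Dict[str, Tuple[str, int]]) -> Dict[str, str]:
    """Label each output sequence given its (classification, abundance) pair."""
    labels: Dict[str, str] = {}
    for seq, (cls, abundance) in outputs.items():
        if cls in ("Reference", "Exact"):
            labels[seq] = "true positive"
        elif cls == "Other":
            labels[seq] = "false positive"
        elif cls == "One Off":
            # Look for a more abundant output sequence one substitution away.
            has_abundant_neighbor = any(
                other_abundance > abundance and one_substitution_apart(seq, other)
                for other, (_, other_abundance) in outputs.items()
            )
            # "unassigned" is an assumption of this sketch, not a category
            # defined in the text.
            labels[seq] = "false positive" if has_abundant_neighbor else "unassigned"
        else:
            raise ValueError(f"unknown classification: {cls!r}")
    return labels
```

For example, a One Off sequence with abundance 12 that differs at one position from a Reference sequence with abundance 1000 would be labeled a false positive, whereas a One Off sequence with no such neighbor would remain unassigned under this sketch.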