In Sogin et al. [9], we proposed a tag mapping methodology, GAST (Global Alignment for Sequence Taxonomy), to assign a taxonomic classification to environmental V6 tags (http://vamps.mbl.edu/resources/software.php). The first step in GAST is to BLAST each tag against the RefV3 or RefV6 database (no minimum score, expectation value or other cutoffs were imposed). Because the top BLAST hit may not have the highest overall similarity to the tag sequence, particularly because edge effects in such a short region can be pronounced, we aligned the tag sequence to the reference hypervariable region tags corresponding to the top 100 BLAST hits. We used MUSCLE [38] (with parameters -diags and -maxiters 2 to reduce processing time) because it is well suited to high-throughput experiments. We calculated the global distance from the sample tag to each of the aligned reference sequence tags as the number of insertions, deletions and mismatches divided by the length of the tag, using quickdist [9]. We considered the reference sequence or sequences with the minimum global distance to be the top GAST match(es). The top BLAST hit was frequently the best global match; however, for 5% to 25% of tags the best global match was to a reference sequence with a lower BLAST score.
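The distance calculation and the selection of the top GAST match(es) can be illustrated with a minimal Python sketch. It is not the quickdist implementation: the function names `global_distance` and `top_gast_matches` are hypothetical, the input is assumed to be a pair of already-aligned sequences of equal length with `-` as the gap character, and counting every gapped column (including terminal gaps) as one insertion or deletion is an assumption, since the text does not specify how quickdist treats them.

```python
def global_distance(aligned_tag: str, aligned_ref: str) -> float:
    """Differences (insertions, deletions, mismatches) per tag base.

    Both inputs are rows of the same alignment, so they must have equal
    length. Gap handling is an assumption, not the quickdist behavior.
    """
    if len(aligned_tag) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")

    # Length of the tag itself, ignoring alignment gaps.
    tag_length = sum(1 for base in aligned_tag if base != "-")
    if tag_length == 0:
        raise ValueError("tag contains no bases")

    differences = 0
    for tag_base, ref_base in zip(aligned_tag.upper(), aligned_ref.upper()):
        if tag_base == "-" and ref_base == "-":
            continue  # column is empty in both rows; carries no information
        if tag_base != ref_base:
            differences += 1  # mismatch, insertion, or deletion
    return differences / tag_length


def top_gast_matches(aligned_tag: str, aligned_refs: dict[str, str]):
    """Return (reference ids with the minimum distance, that distance)."""
    distances = {
        ref_id: global_distance(aligned_tag, ref_seq)
        for ref_id, ref_seq in aligned_refs.items()
    }
    best = min(distances.values())
    winners = sorted(r for r, d in distances.items() if d == best)
    return winners, best
```

Because ties are kept, a tag can have several top GAST matches, which is what the consensus step that follows operates on.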
For each tag, we identified all of the reference long sequences in RefSSU that contained the exact hypervariable sequence of the top GAST match(es). We compared the taxonomic classification of all corresponding SSU rRNA sequences (with RDP bootstrap values ≥ 80) and generated a consensus taxonomy. If two-thirds or more of the full-length sequences shared the same assigned genus, the tag was assigned to that genus. If there was no such agreement, we proceeded up one level to family. If there was a two-thirds or better consensus at the family level, we assigned this taxonomy to the tag, and if not, we continued to proceed up the tree. Occasionally, a tag could not be assigned a taxonomic classification even at the domain level. This occurred because the RDP Classifier could not assign a domain with an adequate bootstrap value, not because the tag mapped to full-length sequences from different domains. These tags may represent novel organisms whose taxonomy has not yet been determined. Sample tags that did not have a single BLAST match in the RefSSU database also were not given a taxonomic assignment. We chose to use a 66% (two-thirds) majority, although other values, or a distributional rather than a strict percentage approach, can be implemented. We reviewed nearly 17 million tags in our sequencing database (primarily of the V6 region) from a wide range of studies using the 66% majority as the threshold for assignment. A distribution curve of voting majority did not show any obvious break points (graph not shown), although 95% of the tags had a voting majority of 75% or better, and 90% had a voting majority ≥ 83%.
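The two-thirds voting rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the GAST code: the function name `consensus_taxonomy` and the `RANKS` tuple are hypothetical, each lineage is assumed to be a list of taxon names ordered from domain to genus and truncated at the first rank whose RDP bootstrap value fell below 80, and sequences with no assignment at a given rank are assumed to count in the denominator (the text does not say how such sequences are tallied).

```python
from collections import Counter

# Assumed rank order, from most general to most specific.
RANKS = ("domain", "phylum", "class", "order", "family", "genus")


def consensus_taxonomy(lineages, num=2, den=3):
    """Return (rank, lineage) for the deepest rank with a num/den majority.

    Returns None when no rank, not even domain, reaches the threshold.
    """
    total = len(lineages)
    if total == 0:
        return None  # no matching full-length reference sequences

    # Start at genus and move up one rank at a time.
    for depth in range(len(RANKS), 0, -1):
        # Vote on the whole lineage down to this rank, so identical names
        # under different parents are not pooled together.
        votes = Counter(
            tuple(lineage[:depth]) for lineage in lineages if len(lineage) >= depth
        )
        if not votes:
            continue
        winner, count = votes.most_common(1)[0]
        # Integer comparison avoids rounding problems at exactly two-thirds.
        if count * den >= num * total:
            return RANKS[depth - 1], list(winner)
    return None
```

Passing a different `num` and `den` changes the majority required, which corresponds to the alternative threshold values mentioned above.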