We downloaded 503,971 aligned small subunit (SSU) rRNA sequences from the SILVA database, version 92 [35]. Using the SILVA quality assessments, we eliminated low-quality sequences (sequence quality ≤ 50, alignment quality ≤ 50, Pintail score ≤ 40). SSU rRNA genes with identical sequences were flagged as redundant. The resulting dataset included 417,433 unique sequences, of which 99% were between 350 and 2000 nt in length. Although these sequences vary in length and in coverage of the full-length SSU rRNA gene, we refer to them as “long” or “full-length” sequences for the purposes of this paper, and to the dataset as RefSSU. From all aligned RefSSU sequences, we extracted the V3 and V6 hypervariable regions, defined as the alignment positions homologous to positions 338 to 533 of the E. coli SSU rRNA sequence (U00096) for V3, and 967 to 1046 for V6. Sequences shorter than 50 nt or containing ambiguous bases were culled. We removed all gap characters to create a set of 293,265 V3 reference tags (RefV3 database) and 195,344 V6 reference tags (RefV6 database). The higher representation of sequences spanning the V3 region in molecular databases is likely a consequence of experimental designs for generating PCR amplicon libraries that favor the beginning of the molecule. These databases include 123,206 unique V3 tag sequences and 59,830 unique V6 tag sequences. Most V3 sequences (>99%) range in length from 80 nt to 180 nt (maximum 447 nt), while most V6 sequences (>99%) range from 50 nt to 80 nt (maximum 349 nt) (http://vamps.mbl.edu/resources/databases.php).
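The region-extraction step described above can be illustrated with a minimal Python sketch. It is not the code used to build RefV3 and RefV6; the function names (`reference_columns`, `extract_tag`), the gap characters (`-` and `.`), and the toy alignment are assumptions made for illustration. The idea is to map 1-based positions in the aligned E. coli reference onto alignment columns, slice those columns from each aligned sequence, strip gaps, and discard tags that are too short or contain ambiguous bases.

```python
GAP_CHARS = "-."          # assumed gap symbols in the alignment
UNAMBIGUOUS = set("ACGTU")


def reference_columns(aligned_ref, start, end):
    """Map 1-based ungapped reference positions start..end to
    0-based, inclusive alignment column indices (first, last)."""
    pos = 0
    first = last = None
    for col, ch in enumerate(aligned_ref):
        if ch in GAP_CHARS:
            continue              # gaps do not advance the reference position
        pos += 1
        if pos == start:
            first = col
        if pos == end:
            last = col
            break
    if first is None or last is None:
        raise ValueError("reference does not span the requested positions")
    return first, last


def extract_tag(aligned_seq, first, last, min_len=50):
    """Slice the region, remove gaps, and return the tag, or None if the
    tag is shorter than min_len or contains an ambiguous base."""
    region = aligned_seq[first:last + 1]
    tag = "".join(ch for ch in region if ch not in GAP_CHARS).upper()
    if len(tag) < min_len:
        return None
    if any(ch not in UNAMBIGUOUS for ch in tag):
        return None
    return tag


# Toy example: a short made-up alignment, not real SSU rRNA data.
toy_ref = "A-CG-TAC"
first, last = reference_columns(toy_ref, 2, 4)
print(extract_tag("ATCGGTAC", first, last, min_len=3))
```

For the real databases, the same mapping would be computed once from the aligned E. coli sequence (positions 338 to 533 for V3, 967 to 1046 for V6) and reused for every aligned RefSSU sequence, with `min_len` left at 50.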
We classified all bacterial and archaeal long sequences directly with the Ribosomal Database Project (RDP) Classifier [28]. We used only RDP classifications with a bootstrap value of ≥ 80%. If the bootstrap value was < 80%, the taxonomic assignment was moved to a higher classification level until a bootstrap value of 80% or better was achieved. For example, if the genus assignment had a bootstrap value of 70% but the family had a value of 85%, that sequence would be assigned only as far as family and not to genus. The RDP Classifier does not classify sequences below the genus level, but the GAST process is not inherently limited to genus; its resolution is constrained by the taxonomy of the reference sequence database. The accuracy of GAST will improve in response to refinements of the reference database, including an increased number of taxonomically resolved sequences, removal of cryptic chimeric and short sequences, improvement of taxonomic identities for long sequences, and elimination of low-quality entries.
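The bootstrap rule above can be sketched in a few lines of Python. This is an illustration only, not the RDP Classifier's own interface: the function name `trim_assignment`, the `(rank, name, bootstrap)` tuple layout, and the example lineage with its bootstrap values are all assumptions. Starting from the lowest assigned rank, the sketch discards ranks until it reaches one whose bootstrap value meets the threshold.

```python
def trim_assignment(lineage, threshold=80.0):
    """Trim a classification to the lowest rank whose bootstrap value
    is at or above the threshold.

    lineage: list of (rank, name, bootstrap) ordered from the highest
             rank (domain) to the lowest (genus).
    Returns a list of (rank, name) pairs; empty if no rank qualifies.
    """
    kept = list(lineage)
    # Move up one rank at a time while the lowest remaining rank is weak.
    while kept and kept[-1][2] < threshold:
        kept.pop()
    return [(rank, name) for rank, name, _ in kept]


# Hypothetical classifier output with made-up bootstrap values.
example = [
    ("domain", "Bacteria", 100.0),
    ("phylum", "Proteobacteria", 100.0),
    ("class", "Gammaproteobacteria", 98.0),
    ("order", "Enterobacteriales", 95.0),
    ("family", "Enterobacteriaceae", 85.0),
    ("genus", "Escherichia", 70.0),
]

for rank, name in trim_assignment(example):
    print(rank, name)
```

With these values the genus (70%) is dropped and the sequence is assigned only as far as family (85%), matching the example in the text.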