The non-redundant SILVA SSURef release 106 was downloaded in ARB format from the SILVA website at http://www.arb-silva.de. Using the ARB software package [23], we removed all sequences with a Pintail score below 75, an alignment quality score below 75 or a length below 1,200 bp, in order to retain only high-quality sequences. Further, we revised the taxonomy of several bacterial and archaeal taxa. The most significant improvements update the taxonomy of the Archaea to include the proposed phylum Thaumarchaeota [42], [43], the Actinobacteria to comply with Bergey’s Taxonomic Outline [44], the Acidobacteria to incorporate proposed subgroups [45] and the Cyanobacteria to comply with CyanoDB [46] and in some cases specific studies (details given in Supplementary Table S1). Other added taxa include the Zetaproteobacteria [47], Rubritaleaceae [27] and Armatimonadetes [48]. In addition, we identified a number of taxa whose taxonomic annotation disagreed strongly with the topology of the SSURef alignment-based tree and appeared poorly supported by phylogenetic studies. These were re-assigned either to existing parent taxa or to novel ones labeled incertae sedis. Unique taxon names were always used; to this end, we either added the name of the only child taxon to several unlabeled or undetermined taxa or removed them.
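The quality criteria above amount to a conjunction of three thresholds. The following is a minimal Python sketch of that rule, not the ARB procedure actually used; the class `RefSequence` and its field names are hypothetical and do not correspond to ARB database fields.

```python
from dataclasses import dataclass


@dataclass
class RefSequence:
    """Hypothetical record; field names are illustrative, not ARB fields."""
    accession: str
    pintail: float        # Pintail score
    align_quality: float  # alignment quality score
    length: int           # sequence length in bp


# Thresholds as stated in the text: values below these lead to removal.
MIN_PINTAIL = 75
MIN_ALIGN_QUALITY = 75
MIN_LENGTH = 1200


def passes_quality_filter(rec: RefSequence) -> bool:
    """Keep a record only if none of its three values is below its threshold."""
    return (
        rec.pintail >= MIN_PINTAIL
        and rec.align_quality >= MIN_ALIGN_QUALITY
        and rec.length >= MIN_LENGTH
    )


# Toy records (invented values, for illustration only).
records = [
    RefSequence("A1", 90, 80, 1450),
    RefSequence("B2", 60, 95, 1500),  # fails on Pintail score
    RefSequence("C3", 75, 75, 1200),  # exactly at the thresholds, so kept
]
kept = [r.accession for r in records if passes_quality_filter(r)]
```

Because the text removes sequences "below" each threshold, values exactly equal to a threshold are retained in this sketch.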
Annotations of the eukaryotic taxa using the NCBI Taxonomy were taken from the SSURef database and manually verified in order to remove all sequences whose taxonomic affiliation was in clear conflict with the topology of the alignment-based tree. Fungal reference sequences were selected according to recent phylogenetic work [49], [50].
All manual changes are listed in Supplementary Table S1, which can also be downloaded as a text file from http://services.cbu.uib.no/supplementary/crest/ in an unambiguous format that can be parsed by the nds2CREST script (see below). In total, 82 new taxa were added, 123 were renamed and 17 were deleted. All sequences remaining after curation were exported in FASTA format. During this procedure, sequences were cropped so that only the part corresponding to the SSU rRNA gene was retained. This was achieved by applying the Escherichia coli positional filter in ARB, selecting the region between alignment columns 1,000 and 43,183. A tab-separated text file listing the accession number and taxonomic placement of each sequence was also exported (using “NDS export”).
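The cropping step can be pictured as taking a window of alignment columns and discarding gap characters. The sketch below is an illustration only and does not reproduce the ARB positional filter; the function `crop_alignment` is hypothetical, and it assumes 1-based, inclusive column bounds and that "-" and "." denote gaps.

```python
def crop_alignment(aligned: str, first_col: int, last_col: int,
                   gap_chars: str = "-.") -> str:
    """Return the ungapped residues between two alignment columns.

    Columns are assumed to be 1-based and inclusive (an assumption of
    this sketch, not something stated in the text).
    """
    if not 1 <= first_col <= last_col:
        raise ValueError("expected 1 <= first_col <= last_col")
    window = aligned[first_col - 1:last_col]
    return "".join(ch for ch in window if ch not in gap_chars)


# Toy aligned sequence of 15 columns; the real alignment is far longer.
toy = "..AC-GU--ACGG.."
cropped = crop_alignment(toy, 3, 10)
```

In practice the two bounds would be the alignment columns reported above, applied identically to every aligned sequence before export.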
We developed the Python script nds2CREST, distributed together with the CREST LCAClassifier, to convert the exported sequence and taxonomic data from ARB into configuration files for MEGAN [20] and the CREST LCAClassifier. This script also reads a text version of the Manual Changes File (MCF; Supplementary Table S1). For each change specified in the MCF, it confirms that the change was properly carried out. In addition, the script removes all sequences that lack a valid taxonomic annotation or are marked for removal in the MCF. After this procedure, it assigns a taxonomic rank to each taxon based primarily on the NCBI Taxonomy, where such information is available; secondarily on the name of the taxon, using the suffixes “-aceae” and “-ales” to indicate family or order level, respectively; and lastly on the parent rank. The output of nds2CREST is (1) a tree file in Newick format describing the topology of the taxonomy, (2) a tab-separated “mapping file” specifying the name and rank of each taxon, and (3) a reference sequence database in FASTA format. In addition to SilvaMod, we also prepared such files from the Greengenes Taxonomy [21] using the same procedure, but without manual curation or positional filtering.
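The three-tier rank assignment can be sketched as a chain of fallbacks. This is not the nds2CREST source code: the function `assign_rank`, the `RANKS` list and the `ncbi_ranks` lookup are hypothetical, and the last tier assumes that a taxon receives the rank one step below its parent, which the text does not spell out. Following standard nomenclature, the suffix "-ales" is taken to mark an order and "-aceae" a family.

```python
from typing import Dict, Optional

# Hypothetical ordered rank list used for the parent-based fallback.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Standard nomenclatural suffixes: "-ales" marks an order, "-aceae" a family.
SUFFIX_RANKS = {"ales": "order", "aceae": "family"}


def assign_rank(name: str, parent_rank: Optional[str],
                ncbi_ranks: Dict[str, str]) -> Optional[str]:
    """Assign a rank using NCBI data, then the name suffix, then the parent."""
    # 1. Prefer the rank recorded in the (hypothetical) NCBI lookup table.
    if name in ncbi_ranks:
        return ncbi_ranks[name]
    # 2. Otherwise infer order or family from the name's suffix.
    lowered = name.lower()
    for suffix, rank in SUFFIX_RANKS.items():
        if lowered.endswith(suffix):
            return rank
    # 3. Otherwise assume the rank directly below the parent's rank.
    if parent_rank in RANKS:
        idx = RANKS.index(parent_rank)
        if idx + 1 < len(RANKS):
            return RANKS[idx + 1]
    return None  # no rank could be determined


ncbi = {"Thaumarchaeota": "phylum"}  # toy lookup, for illustration only
r1 = assign_rank("Thaumarchaeota", "domain", ncbi)  # from the lookup
r2 = assign_rank("Rubritaleaceae", "order", {})     # from the "-aceae" suffix
r3 = assign_rank("Subgroup 4", "phylum", {})        # from the parent rank
```

Ordering the checks this way lets curated NCBI information override name-based guesses, with the parent rank used only as a last resort.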