The ERRD data contained sequences from ‘environmental samples’. These were excluded since there was little information about them. The 5S were generally around 120 nt long, the 16/18S around 1500 nt and the 23/28S around 3000 nt long, all with no obvious outliers. The length of the eukaryotic rRNAs varied substantially, more than those of bacterial and archaeal rRNAs, but no sequences in the alignments seemed obviously wrong.
The sequences were divided into phylogenetic groups to help with further analysis. Due to sequencing bias, some phylogenetic groups dominated the data sets. Such a skew could potentially cause the predictors to be less sensitive on underrepresented phylogenetic groups. Among the bacteria, 82% of the sequences were from three phyla: Actinobacteria, Firmicutes and Proteobacteria. Around 70% of the archaeal sequences were from Euryarchaeota; among the eukaryotes, the Streptophyta comprised 15% of the data. Several of the sequences also proved to be very similar. Therefore, redundancy reduction inspired by Hobohms second algorithm (14) was performed. This algorithm starts with a sorted list of the number of neighbors each sequence has. An all-against-all comparison between the sequences is performed and neighborship is judged by the level of similarity found. Similarity was measured by Score = ∑i,jnijSij/(N-g) where i and j sum over the four nucleotides, nij counts the number of aligned nucleotide pairs (i, j), N is the length of the sequence and g is the number of gap-only positions; Sij refers to the scoring matrix EDNAFULL created by Todd Lowe. The maximum similarity level allowed was set to ensure that each phylum was represented. Similarity graphs were formed for each group, with the sequences as vertices and edges between similar sequences. The sequence with the highest connectivity and its edges were deleted from the graph, and this was repeated until no edges remained. At the end, all removed sequences were checked to see if they had any edges to vertices in the remaining set. If not, they were reinstated. This procedure was implemented as a C program.
Sequences in ERRD may contain ambiguous nucleotide symbols representing nucleotides that have not been uniquely determined. These occur more frequently in bacteria and eukaryotes than in archaea, and primarily at both ends of the alignment: in 16/18S, predominantly at the end; in 23/28S, predominantly at the beginning. In the latter case, this was mostly due the high prevalence of gaps at the end of the alignment. As we found that ambiguous nucleotides at the ends reduced the ability to predict start and stop positions accurately, we decided to remove all sequences with five or more ambiguous nucleotides in either end of the sequence. A summary of the number of sequences removed during curation of the alignments is shown in
The initial number of rRNA sequences and the number of sequences excluded for different reasons.
Kingdom | Type | Initial count | Environmental samples | Incomplete sequences | Redundancy reduction | Total in HMM |
Archaea | 5S | 58 | 0 | 0 | 10 | 48 |
16S | 589 | 239 | 471 | 287 | 76 | |
23S | 37 | 0 | 18 | 8 | 15 | |
Bacteria | 5S | 461 | 0 | 0 | 101 | 360 |
16S | 12 107 | 1429 | 10 723 | 2485 | 743 | |
23S | 398 | 0 | 155 | 130 | 127 | |
Eukaryotes | 5S | 316 | 0 | 0 | 33 | 283 |
18S | 6585 | 24 | 5222 | 836 | 979 | |
28S | 157 | 0 | 91 | 8 | 58 |
Environmental samples were excluded due to lack of phylogenetic information. Sequences with too many unknown nucleotides in either end of the sequence were excluded to improve HMM accuracy. Redundancy reduction was performed to reduce bias. Note that these groups may overlap. The last column indicates the number of sequences used to build each HMM.
With the aim of increasing the search speed, we determined the 75 most conserved consecutive columns in each alignment, as illustrated in
The graphs show conservation in the alignments as measured by information content: C=∑ifilog2(fi/qi) where i sums over the four nucleotides, fi is the frequency of nucleotide i in the column and qi =1/4 is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding fi, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment.
Annotations of rRNAs were obtained from GenBank. Unfortunately, rRNAs have not been annotated in a uniform manner and it was often unclear exactly what was annotated. In some cases, both the separate rRNAs and the full operon was annotated. In all such cases, the operons were longer than 5000 nt, and all annotations longer than that were thus excluded. In our experience, this affected only operons. In other cases, different pieces of the same gene had been annotated as separate entities. Thus, some predictions matched several annotation entries; these are listed in Supplementary Table S3. A prediction was considered to match an annotation if they were on the same strand and the length of their overlap was at least half the length of the shorter of the two; it was considered to be annotated if it matched at least one annotation. The deviation between annotated and predicted start and stop positions was also examined, but predictions with multiple matching annotations were excluded from this comparison.
Additional analyses were performed for experimentally verified 16S in Anaplasma marginale St. Maries (M60313), Chlamydia muridarum Nigg (D85718), Escherichia coli K12 MG1655 (J01695), Sulfolobus tokodaii St. 7 (AB022438), Thermus thermophilus HB8 (X07998) and Nitrobacter hamburgensis X14 (L11663). Computational speed was assessed on M. capricolum ATCC 27343 (CP000123) Solibacter usitatus Ellin6076 (CP000473) and Sargasso Sea data (AACY01000001-AACY01811372). All test searches reported were performed on an SGI Altix 3000 machine using one 1.3 GHz Itanium 2 processor.