To identify candidate primer sequences for the PCR amplification of prokaryotic 16S rRNA genes, 16S rRNA genes from genome-sequenced strains were used as references because they have been accurately sequenced, are full-length genes, and have well-defined taxonomic information. Bacterial and archaeal genomic sequences were obtained from the NCBI Genome Database (ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/, accessed on 11 November 2013) in November 2008. The 16S rRNA gene sequences in each strain were identified by RNAmmer.25 Then, one 16S rRNA gene sequence per species was chosen at random because slight sequence differences exist among the 16S rRNA genes from strains of the same species,26 and among the gene copies within a genome.27 A total of 531 16S rRNA gene sequences were chosen. Their taxonomic information was obtained from the NCBI Taxonomy Database (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/, accessed on 11 November 2013). A multiple sequence alignment of the 531 16S rRNA gene sequences was constructed using MAFFT version 6.713 with default parameters.28

To find the candidate sequences described above, highly conserved regions in the reference alignment were identified as follows. Generally, the primer lengths for the PCR amplification of 16S rRNA genes are more than 15 nt;9 therefore, we used a sliding window of 15 nt with a step size of 1 nt across the reference alignment. For each window, we calculated the frequency of each 15-nt sequence, allowing one mismatch. The 15-nt sequences that included gaps were also counted when calculating the frequencies. The consensus sequence for each window was defined as the 15-nt sequence that was matched most frequently, within one mismatch, among strains. The coverage rate of a consensus sequence in each phylum was defined as the percentage of genome-sequenced strains in that phylum whose sequences matched the consensus within one mismatch.
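The sliding-window procedure can be illustrated with a minimal Python sketch. This is not the code used in the study. The function names (`hamming`, `window_consensus`, `coverage_by_phylum`, `scan_alignment`) are hypothetical, and several details are assumptions: the alignment is given as equal-length strings, gap characters are compared like any other symbol, only 15-nt sequences actually observed in a window are considered as consensus candidates, and ties are broken by exact-match count and then lexicographic order.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def window_consensus(alignment, start, width=15, max_mismatch=1):
    """Return (consensus, n_matched) for the window beginning at `start`.

    The consensus is the observed window sequence matched by the most
    aligned sequences within `max_mismatch` mismatches. The tie-breaking
    rule (exact count, then lexicographic order) is an assumption.
    """
    counts = Counter(seq[start:start + width] for seq in alignment)

    def matched(candidate):
        return sum(n for w, n in counts.items()
                   if hamming(candidate, w) <= max_mismatch)

    best = max(counts, key=lambda c: (matched(c), counts[c], c))
    return best, matched(best)


def coverage_by_phylum(alignment, phyla, start, consensus, max_mismatch=1):
    """Percentage of sequences per phylum matching the consensus."""
    totals = Counter(phyla)
    hits = Counter()
    for seq, phylum in zip(alignment, phyla):
        window = seq[start:start + len(consensus)]
        if hamming(window, consensus) <= max_mismatch:
            hits[phylum] += 1
    return {p: 100.0 * hits[p] / totals[p] for p in totals}


def scan_alignment(alignment, width=15, step=1):
    """Slide the window across the alignment; one result per position."""
    length = len(alignment[0])
    return [(s, *window_consensus(alignment, s, width))
            for s in range(0, length - width + 1, step)]


if __name__ == "__main__":
    # Toy alignment with a 4-nt window so the result is easy to follow.
    toy = ["ACGTAC", "ACGTAC", "ACGAAC", "TTGTAC"]
    toy_phyla = ["A", "A", "B", "B"]
    consensus, n = window_consensus(toy, 0, width=4)
    print(consensus, n)
    print(coverage_by_phylum(toy, toy_phyla, 0, consensus))
```

In the toy example, the first window's consensus is matched by three of the four sequences, and the per-phylum coverage separates the phylum where every sequence matches from the one where only half do. For the real analysis the window width would be 15 and the input would be the 531-sequence MAFFT alignment with one phylum label per sequence.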