To investigate Synechococcus, we used 16S rRNA gene oligotyping as described previously (ref. 5). This method is based on a supervised algorithm that identifies microdiversity within 16S rRNA gene sequences. Oligotyping differs both from conventional taxonomic classification, which depends on the sequences available in reference databases, and from cluster analysis, which depends on the choice of a similarity threshold. The technique addresses the limited taxonomic resolution of these approaches by identifying the most information-rich nucleotide positions, which together define the oligotypes. Sequences identified as Synechococcus were extracted from the VAMPS database and aligned using PyNAST (ref. 41). Of the 22,387 sequences identified as Synechococcus, 17,941 remained after quality filtering and PyNAST alignment. The mean length of the Synechococcus reads was 254 bp. Next, we removed the uninformative gaps from the aligned sequences using the “o-trim-uninformative-columns-from-alignment” script. We then calculated the Shannon entropy of each nucleotide position with the “analyze-entropy” script of the oligotyping package and ran 16S rRNA oligotyping for the Synechococcus genus until each oligotype had converged. Uninformative nucleotide positions were excluded. In total, seven nucleotide positions were used to define the oligotypes. To minimize the impact of sequencing errors on the oligotyping results, we used a “minimum substantive abundance” criterion (M) of 5; thus, an oligotype was excluded if its most common sequence occurred fewer than five times. To reduce noise, each oligotype was required to appear in at least one sample, but it was not required to comprise a certain percentage of reads or to represent a minimum number of reads across all samples combined. Oligotypes that did not meet these criteria were removed from the analysis.
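The entropy calculation and the minimum substantive abundance filter described above can be illustrated with a minimal Python sketch. It is not the oligotyping pipeline itself: the function names, the toy reads, and the rule of selecting every column with non-zero entropy are assumptions made for illustration, whereas the actual analysis selects positions iteratively until the oligotypes converge.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column given as a string."""
    total = len(column)
    probs = [n / total for n in Counter(column).values()]
    return sum(p * math.log2(1 / p) for p in probs)


def alignment_entropy(aligned_reads):
    """Entropy of every column in a list of equal-length aligned reads."""
    return [column_entropy("".join(col)) for col in zip(*aligned_reads)]


def assign_oligotypes(aligned_reads, positions):
    """Group reads by the bases found at the chosen positions (hypothetical helper)."""
    groups = {}
    for read in aligned_reads:
        key = "".join(read[p] for p in positions)
        groups.setdefault(key, []).append(read)
    return groups


def passes_min_substantive_abundance(reads, m=5):
    """True if the most common unique sequence in the group occurs at least m times."""
    return Counter(reads).most_common(1)[0][1] >= m


# Toy alignment (made-up data): only the third column varies.
reads = ["ACGT"] * 6 + ["ACTT"] * 5 + ["ACAT"] * 2

entropy = alignment_entropy(reads)
# Assumption for this sketch: keep every column whose entropy is above zero.
positions = [i for i, h in enumerate(entropy) if h > 0]

groups = assign_oligotypes(reads, positions)
kept = sorted(k for k, g in groups.items() if passes_min_substantive_abundance(g, m=5))
print(positions, kept)
```

In this toy case the oligotype supported by only two reads is discarded, which mirrors how M = 5 suppresses groups that are likely to stem from sequencing errors.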
The analysis yielded 31 quality-controlled oligotypes, which represented 95% of the total Synechococcus reads. For each oligotype, the oligotyping pipeline chose the most abundant read as the representative sequence for downstream analyses. Upon completion of the oligotyping analysis, the resulting “observation matrices” were concatenated to generate a single “observation matrix” for our V4-V5 dataset. These observation matrices report counts, that is, the number of reads assigned to each oligotype in each sample (Table 1). We then converted the counts to percent abundances within each sample and used these normalized relative abundances for subsequent analyses. To assign taxonomy to each oligotype, we searched its representative sequence using blastn version 2.2.26. We kept the default parameters, except for the percent identity setting (‘per. identity 100’), so that only hits with 100% sequence identity were reported.
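The conversion from read counts to within-sample percent abundances can be sketched as follows. This is an illustrative example only: the sample names, oligotype names, counts, and the function name are hypothetical and do not come from Table 1.

```python
def to_percent_abundance(counts):
    """Convert {sample: {oligotype: read_count}} to within-sample percentages."""
    percents = {}
    for sample, row in counts.items():
        total = sum(row.values())
        percents[sample] = {
            # Samples with no reads get 0.0 to avoid dividing by zero.
            oligo: (100.0 * n / total if total else 0.0)
            for oligo, n in row.items()
        }
    return percents


# Hypothetical observation matrix: reads per oligotype in each sample.
observation_matrix = {
    "sample_A": {"oligo_1": 30, "oligo_2": 10},
    "sample_B": {"oligo_1": 5, "oligo_2": 15},
}

percent_matrix = to_percent_abundance(observation_matrix)
print(percent_matrix)
```

Normalizing within each sample makes oligotype abundances comparable between samples that were sequenced to different depths.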