The sponge data set was filtered using the metadata so that it contained only samples labeled either healthy or stressed, leaving 218 samples for the comparison. Similarly, the sleep apnea study was filtered for IHH and air control samples with a treatment duration of 6 weeks, leaving 189 samples. The infant gut colonization case study was filtered for samples with more than 500 reads/sample and for a single sample from the mother with the title 101.Mother. The 88-soil data set was filtered for samples with more than 500 reads/sample. The keyboard data set was filtered for samples with more than 500 reads/sample and sOTUs with more than 15 reads/sOTU. Additionally, only subject IDs corresponding to M3, M2, and M9 were retained, giving 67 samples. For comparing ordinations at different numbers of samples, the data sets were filtered for samples having 1,000 sequences/sample and balanced to have equal numbers of samples in each subgroup (e.g., equal numbers of Air and IHH samples). Samples were then removed randomly but equally from each subgroup; this was repeated 10 times. The first iteration was used to plot the ordinations, and the mean score across the iterations was used to plot the KNN classification accuracy and the PERMANOVA F-statistic.
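The balanced subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `balanced_subsample`, the sample IDs, and the group sizes are hypothetical, and drawing a fixed number of samples per subgroup is used here as an equivalent of removing samples equally from an already balanced data set.

```python
import random
from collections import defaultdict


def balanced_subsample(sample_groups, n_per_group, seed):
    """Randomly draw n_per_group sample IDs from every subgroup."""
    rng = random.Random(seed)

    # Collect the sample IDs that belong to each subgroup.
    by_group = defaultdict(list)
    for sample_id, group in sample_groups.items():
        by_group[group].append(sample_id)

    chosen = []
    for group in sorted(by_group):
        members = sorted(by_group[group])
        if n_per_group > len(members):
            raise ValueError(f"subgroup {group!r} has only {len(members)} samples")
        chosen.extend(rng.sample(members, n_per_group))
    return chosen


# Hypothetical metadata: 20 Air and 20 IHH samples (already balanced).
sample_groups = {f"air_{i}": "Air" for i in range(20)}
sample_groups.update({f"ihh_{i}": "IHH" for i in range(20)})

# Repeat the random draw 10 times, one seed per iteration.
iterations = [balanced_subsample(sample_groups, 5, seed=i) for i in range(10)]
```

In such a setup, the first element of `iterations` would correspond to the draw used for plotting, and any score computed on each draw would be averaged over the 10 elements.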
Both data sets were then preprocessed with the robust centered log ratio (rclr) transform, and RPCA was run with a rank of 2 because there were two metadata categories of interest in each comparison. Weighted UniFrac distances were calculated using generalized UniFrac with an alpha of one (52). Bray-Curtis distances were calculated through QIIME 2 (49). Both weighted UniFrac and Bray-Curtis distances were calculated on tables rarefied to 1,000 reads per sample. PCoA and PERMANOVA analyses for the Bray-Curtis, RPCA, and weighted UniFrac distance matrices were performed with scikit-bio. The resulting PCoA and PCA axes were plotted with matplotlib (53), with PC1 and PC2 on the x and y axes, respectively.
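As an illustration of the rclr preprocessing step, the sketch below takes the logarithm of the observed (nonzero) counts only and centers each sample by the mean of its observed log values, leaving zeros as missing entries. The function name `rclr`, the samples-as-rows layout, and the example table are assumptions made for this sketch, and each sample is assumed to contain at least one nonzero count.

```python
import numpy as np


def rclr(counts):
    """Robust centered log ratio: center the logs of nonzero counts per sample.

    Rows are samples and columns are features. Zero counts are returned
    as NaN (treated as missing), not replaced with a pseudocount.
    """
    counts = np.asarray(counts, dtype=float)
    logs = np.full(counts.shape, np.nan)
    observed = counts > 0
    logs[observed] = np.log(counts[observed])

    # Per-sample mean of the observed log counts, ignoring missing entries.
    row_means = np.nanmean(logs, axis=1, keepdims=True)
    return logs - row_means


# Hypothetical count table: 2 samples x 3 features.
table = np.array([[10, 0, 100],
                  [1, 1, 1]])
transformed = rclr(table)
```

The missing entries are what the subsequent low-rank (rank 2) factorization in RPCA works around, so no imputation is attempted here.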
The original unprocessed (raw count) tables were sorted by the feature loadings from RPCA. Features with a count sum of less than 10 across all samples were filtered out. The resulting table was then centered log ratio (clr) transformed with a pseudocount of one and plotted as a heat map. Each sOTU was labeled with its lowest available taxonomic classification in the sleep apnea and sponge data sets.
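The heat map preparation can be sketched along these lines. The count table, the loading values, and the variable names are hypothetical, and the samples-as-rows layout is an assumption; only the order of operations (sort by loadings, drop features with a count sum below 10, clr with a pseudocount of one) follows the text.

```python
import numpy as np


def clr_with_pseudocount(counts, pseudocount=1.0):
    """Centered log ratio per sample (row) after adding a pseudocount."""
    logs = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logs - logs.mean(axis=1, keepdims=True)


# Hypothetical raw counts: 3 samples (rows) x 4 features (columns).
counts = np.array([[12, 0, 3, 40],
                   [7, 1, 0, 25],
                   [20, 2, 5, 0]])

# Hypothetical RPCA feature loadings, one value per feature.
loadings = np.array([0.8, -0.1, 0.3, -0.9])

# Sort the feature columns by their loading.
order = np.argsort(loadings)
sorted_counts = counts[:, order]

# Drop features whose total count across all samples is below 10.
keep = sorted_counts.sum(axis=0) >= 10
filtered = sorted_counts[:, keep]

# clr transform with a pseudocount of one; these values feed the heat map.
heatmap_values = clr_with_pseudocount(filtered)
```

The resulting `heatmap_values` array could then be passed to a plotting call such as matplotlib's `imshow`.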
The features in the PC1 axis of the feature loadings from RPCA were selected to represent a manageable number of taxa to compare between subgroups. Those selected features (sOTUs) from the feature loadings were used for log ratios. Log ratios were calculated from the table used to calculate them. Samples that contained zeros in either the numerator or the denominator were removed before the ratios were calculated. Correlations between the log ratios and the PC1 axis were calculated as Pearson correlations with SciPy (54).
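A log ratio and its correlation with PC1 might be computed as in the sketch below. The count vectors and PC1 scores are made-up values, the numerator and denominator are treated as one count per sample (they could equally be sums over several sOTUs), and NumPy's `corrcoef` is swapped in for the SciPy routine named in the text; both give the Pearson correlation coefficient.

```python
import numpy as np

# Hypothetical per-sample counts for the numerator and denominator features.
numerator = np.array([10, 0, 25, 40, 8], dtype=float)
denominator = np.array([5, 3, 0, 10, 16], dtype=float)

# Hypothetical PC1 sample scores from the ordination, in the same sample order.
pc1 = np.array([0.2, -0.1, 0.4, 0.5, -0.3])

# Remove samples with a zero in either the numerator or the denominator.
valid = (numerator > 0) & (denominator > 0)

# Natural log of the ratio for the remaining samples.
log_ratio = np.log(numerator[valid] / denominator[valid])

# Pearson correlation between the log ratio and PC1 for the same samples.
r = np.corrcoef(log_ratio, pc1[valid])[0, 1]
```

Dropping the zero-containing samples first avoids undefined logarithms and keeps the log ratio free of pseudocounts.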