For datasets A1, A2, A3, and B1, the primary sample materials were collected from the COpenhagen Prospective Studies on Asthma in Childhood 2010 (COPSAC2010) mother-child cohort, following 700 children and their families from pregnancy into childhood, as previously described in detail [33]. In this study, we used fecal samples collected at ages 1 week (n = 95), 1 month (n = 361), and 1 year (n = 622); vaginal swabs collected at week 36 of pregnancy (n = 670); and hypopharyngeal aspirates (n = 144) collected at acute wheezy episodes in children with persistent wheeze aged 1–3 years, using a soft suction catheter passed through the nose. DNA was extracted using MoBio PowerSoil kits on an EpMotion 5075, amplified using a two-step PCR reaction with forward and reverse 16S V4 primers, and sequenced using 250 bp paired-end sequencing on an Illumina MiSeq. A full description of the laboratory workflow and the bioinformatics pipeline is available in Additional file 13.
To examine effects in smaller datasets, we subsampled datasets A1, A2, and A3 to 16 (small) and 50 (medium) samples by random sampling with recorded random seeds, resulting in datasets A1s–A3s and A1m–A3m. Additionally, we created a simulated dataset, A4, from dataset A3 by resampling each OTU across samples independently and without replacement.
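The two resampling steps can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy sample IDs, the toy OTU table, and the seed values are all hypothetical, and it assumes that resampling an OTU across samples without replacement amounts to permuting that OTU's counts over the samples.

```python
import random


def subsample_samples(sample_ids, n, seed):
    """Draw n sample IDs without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # dedicated generator, so the seed can be recorded
    return sorted(rng.sample(sample_ids, n))


def permute_otus_independently(otu_table, seed):
    """Shuffle each OTU's counts across samples, independently per OTU.

    otu_table maps an OTU ID to a list of counts, one entry per sample.
    Each OTU keeps its own set of counts, but their assignment to samples
    is randomized separately for every OTU.
    """
    rng = random.Random(seed)
    simulated = {}
    for otu_id, counts in otu_table.items():
        shuffled = list(counts)  # copy, so the input table is left untouched
        rng.shuffle(shuffled)    # draw all values without replacement
        simulated[otu_id] = shuffled
    return simulated


# Toy data, invented for illustration only.
sample_ids = [f"S{i:03d}" for i in range(100)]
small = subsample_samples(sample_ids, 16, seed=1)   # analogous to an "s" dataset
medium = subsample_samples(sample_ids, 50, seed=2)  # analogous to an "m" dataset

toy_table = {"OTU_a": [0, 5, 0, 12, 3], "OTU_b": [7, 0, 0, 1, 0]}
simulated = permute_otus_independently(toy_table, seed=3)
```

Under this assumption, each OTU keeps its own set of counts, while the assignment of those counts to samples is randomized separately for every OTU.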
For dataset B2, we used public data from the Human Microbiome Project [34], testing the ability to separate tongue dorsum (n = 316) from hard palate (n = 301) 16S V3-5 samples (http://hmpdacc.org/HMQCP/). For dataset B3, we used data from Pop et al. [35], downloaded from Bioconductor (http://bioconductor.org/packages/release/data/experiment/html/msd16s.html), testing separation between the age groups 0–6 months (n = 112), 6–12 months (n = 308), 12–18 months (n = 173), 18–24 months (n = 146), and 24–60 months (n = 253).
To reduce the sparsity of dataset B3, chimeras were rechecked using USEARCH v7.0.1090 [36] against the gold database [37], and 3624 chimeras (listed in Additional file 14: Table S2) were removed from the OTU table. Since a phylogenetic tree file was not published along with the OTU table and sample metadata from that paper, we built one using the supplied reference sequences, as described in the “Bioinformatics” section of Additional file 13. Due to issues with TMM normalization of this dataset (see the “Results” section), we agglomerated similar OTUs to reduce the sparsity as a sensitivity analysis. This was achieved by computing pairwise phylogenetic distances from the tree and grouping together all OTUs that were closer to each other than the 0.001 quantile of the distance distribution (see Additional file 1: Table S1). The OTUs were merged with the merge_taxa function in the R package phyloseq [38], using the OTU with the highest sum of counts as the archetype.
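The agglomeration logic can be sketched in plain Python to make the steps concrete. This is an illustration only, not the phyloseq merge_taxa call used in the study. The function names and toy data are hypothetical. It assumes that the quantile is taken by linear interpolation, that groups are the connected components of the "closer than threshold" relation (single linkage), and that merged counts are summed per sample.

```python
from collections import defaultdict


def quantile(values, q):
    """Empirical quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def agglomerate(distances, counts, q=0.001):
    """Merge OTUs whose pairwise distance lies below the q quantile.

    distances: dict mapping (otu_a, otu_b) to a phylogenetic distance,
               with one entry per unordered pair.
    counts:    dict mapping an OTU ID to a list of counts per sample.
    Returns a dict mapping each archetype OTU to the summed counts of its group.
    """
    threshold = quantile(distances.values(), q)
    parent = {otu: otu for otu in counts}

    def find(otu):
        # Union-find lookup with path halving.
        while parent[otu] != otu:
            parent[otu] = parent[parent[otu]]
            otu = parent[otu]
        return otu

    for (a, b), dist in distances.items():
        if dist < threshold:  # assumption: strictly closer than the threshold
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for otu in counts:
        groups[find(otu)].append(otu)

    merged = {}
    for members in groups.values():
        # The archetype is the member with the highest total count.
        archetype = max(members, key=lambda otu: sum(counts[otu]))
        merged[archetype] = [sum(col) for col in zip(*(counts[m] for m in members))]
    return merged


# Toy data, invented for illustration; q is far larger than 0.001 so that
# a merge is visible with only six pairwise distances.
toy_distances = {
    ("A", "B"): 0.01, ("A", "C"): 0.5, ("A", "D"): 0.6,
    ("B", "C"): 0.5, ("B", "D"): 0.6, ("C", "D"): 0.4,
}
toy_counts = {"A": [10, 0, 2], "B": [1, 1, 0], "C": [0, 4, 4], "D": [3, 0, 0]}
merged = agglomerate(toy_distances, toy_counts, q=0.1)
```

In the toy example only A and B fall below the threshold, so they are merged under archetype A, which has the larger total count, while C and D remain separate.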
Thorsen J., Brejnrod A., Mortensen M., Rasmussen M.A., Stokholm J., Al-Soud W.A., Sørensen S., Bisgaard H., & Waage J. (2016). Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome, 4, 62.