Three classes of statistical tests were used to assess metabolic variability across the human microbiome. First, pathways and modules differentially abundant in at least one of the seven analyzed body sites were determined by the LEfSe system for metagenomic biomarker discovery [23]. These differences were summarized into overall patterns of variation using principal component analysis on a matrix of average module abundances per body site, with averages Winsorized at 20% (a robust form of the arithmetic mean [30]), filtered to modules reaching at least 0.01% in at least one site, and normalized to z-scores. Second, since LEfSe is not appropriate for HUMAnN's binary pathway coverage scores, we determined site-enriched or under-enriched pathways and modules as follows: a module was considered present at a site in aggregate if it occurred with coverage ≥0.9 in ≥90% of the site's samples; absent if it occurred with coverage ≤0.1 in ≥90% of samples; and differential if it was present in at least one site and absent in at least one other. Pathways were analyzed identically but with a ≥0.5 coverage criterion for presence, since no large pathways consistently had coverage ≥0.9.
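The coverage-based presence/absence rule can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `site_call` and `is_differential`, the dictionary layout, and the label "indeterminate" for sites that meet neither criterion are assumptions made here for illustration. Only the thresholds (0.9, 0.1, 90%, and 0.5 for pathways) come from the text.

```python
from typing import Dict, List


def site_call(coverages: List[float],
              present_at: float = 0.9,
              absent_at: float = 0.1,
              min_fraction: float = 0.9) -> str:
    """Aggregate call for one module at one body site.

    'present' if coverage >= present_at in at least min_fraction of samples,
    'absent' if coverage <= absent_at in at least min_fraction of samples,
    otherwise 'indeterminate' (a label assumed here, not from the text).
    """
    n = len(coverages)
    if n == 0:
        return "indeterminate"
    frac_high = sum(c >= present_at for c in coverages) / n
    frac_low = sum(c <= absent_at for c in coverages) / n
    if frac_high >= min_fraction:
        return "present"
    if frac_low >= min_fraction:
        return "absent"
    return "indeterminate"


def is_differential(coverage_by_site: Dict[str, List[float]],
                    **thresholds: float) -> bool:
    """Differential: present at one or more sites and absent at one or more others."""
    calls = [site_call(c, **thresholds) for c in coverage_by_site.values()]
    return "present" in calls and "absent" in calls


# Hypothetical coverage values for a single module at two sites.
example = {
    "stool": [1.0] * 9 + [0.5],   # coverage >= 0.9 in 9 of 10 samples
    "anterior_nares": [0.0] * 10, # coverage <= 0.1 in all samples
}
module_is_differential = is_differential(example)
```

For pathways, the same functions would be called with `present_at=0.5`, matching the relaxed criterion described above.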
The third test associated pathway and module abundance not with human microbiome body sites, but with one or more of the subject clinical metadata variables described by the HMP [9]. These included continuous descriptors of each sample (e.g., subject age, body mass index, and vaginal introitus and posterior fornix pH for women) as well as categorical variables (e.g., gender or location; see Supplemental Table S1). Pathway and module abundances were associated with these metadata after first stratifying by body site. Within each body site, each pathway/metadata pair for which the pathway was present above 0.01% in at least 10% of samples was independently tested using Spearman's ρ for continuous metadata and the Kruskal-Wallis nonparametric ANOVA for categorical metadata, after removing any outliers outside of the upper or lower inner fences. The resulting p-values were corrected using the Benjamini-Hochberg method within each body site and thresholded at FDR q<0.1.
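Two steps of this procedure, outlier removal and multiple-testing correction, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the inner fences are taken to be the conventional Tukey definition (quartiles ± 1.5 × IQR), which the text does not spell out, the quartile interpolation method is an arbitrary choice, and the function names are hypothetical. The p-values passed to the correction would come from the Spearman or Kruskal-Wallis tests for one body site.

```python
import statistics
from typing import List


def within_inner_fences(values: List[float], k: float = 1.5) -> List[float]:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional inner-fence multiplier (assumed here).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def significant(pvalues: List[float], fdr: float = 0.1) -> List[bool]:
    """Flag tests whose q-value falls below the FDR threshold."""
    return [q < fdr for q in benjamini_hochberg(pvalues)]


# Hypothetical p-values from the tests run within a single body site.
site_pvalues = [0.01, 0.04, 0.03, 0.5]
site_hits = significant(site_pvalues)
```

Applying the correction separately to each body site's list of p-values, rather than to all tests pooled, mirrors the within-site correction described above.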