Finally, clades and functions were tested for statistically significant associations with clinical metadata of interest using a novel multivariate algorithm. Each clade (excluding ecological measures) was normalized with a variance-stabilizing arcsine square-root transformation and evaluated with a generalized linear model (the glm function in R). Model selection for sparse data was performed per clade using boosting (gbm package [99]). A multivariate linear model associating all available metadata with each clade independently was boosted, and any metadatum selected in at least 1% of the boosting iterations was then tested for significance in a standard generalized linear model. This composite model was thus of the form:
where p indexes the clinical metadata selected by boosting.
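As a rough illustration of the transformation and the final model fit described above, the following Python sketch applies the arcsine square-root transform to relative abundances and fits a linear model by ordinary least squares, which is what a Gaussian generalized linear model with identity link reduces to. It is not the authors' R code: the function names, the toy abundances, and the two metadata columns are hypothetical, and the Gaussian family is an assumption, since the text does not state the family used. The boosting-based selection step is omitted; the metadata columns passed in stand for those already selected.

```python
import math


def arcsine_sqrt(p):
    """Variance-stabilizing transform for a relative abundance p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("relative abundance must lie in [0, 1]")
    return math.asin(math.sqrt(p))


def fit_linear_model(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    rows: one list of covariate values per sample (an intercept is added here).
    y:    transformed abundance of a single clade, one value per sample.
    Returns [b0, b1, ..., bp]: the intercept, then one coefficient per covariate.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical toy data: relative abundance of one clade in six samples,
# plus two made-up metadata columns (a binary status and a numeric covariate).
abundance = [0.01, 0.04, 0.09, 0.16, 0.25, 0.36]
metadata = [[0, 3.0], [0, 5.0], [1, 4.0], [1, 2.0], [1, 6.0], [0, 4.5]]

y = [arcsine_sqrt(p) for p in abundance]
coefficients = fit_linear_model(metadata, y)
```

Here `coefficients` holds the intercept followed by one coefficient per selected metadatum, all on the arcsine square-root scale.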
Within each metadatum/clade association independently, multiple comparisons over factor levels were adjusted using a Bonferroni correction; multiple hypothesis tests over all clades and metadata were adjusted to produce a final Benjamini and Hochberg false discovery rate [100]. Unless otherwise indicated, associations were considered significant below a q-value threshold of 0.25; the KEGG pathway sulfur metabolism (ko00920) had an average q-value of 0.26 for association with Crohn's disease. Multiple factor analysis (FactoMineR R package [101]) was performed to visualize the relationships within heterogeneous factor data as well as with a select group of taxa found to be significantly associated with metadata. Total abundances and significant associations between metadata, taxa, and functions are listed in Additional files