Inverse Simpson diversity, Chao1 richness (using the R fossil package), and Pielou evenness were calculated for clade abundance, KEGG pathway and module abundance, and Gene Ontology term abundance [94-97]. Next, data were pre-processed for quality control before modeling. Clinical metadata were removed when more than 10% of their values were missing, or when they did not vary over the available samples. Clades, pathways, and other features of very low abundance (< 0.001 in ≥ 90% of samples) were removed, as were feature outliers falling outside the lower or upper outer fence (3× interquartile range). For significance testing, missing data were imputed with the mean abundance of the sample; missing factor metadata were imputed with an 'NA' factor level using the na.gam.replace function from the gam R package [98]. Unless stated otherwise, all subsequent analyses and calculations were performed using these processed data. After processing, 228 and 231 samples passed quality control for clade abundance and functional abundance analyses, respectively.
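As an illustration of the ecological measures and the low-abundance filter described above, the following is a minimal sketch in Python, not the code used in the study. It assumes integer counts for the diversity measures and relative abundances for the filter. The Chao1 function uses the common bias-corrected form, which may differ in detail from the estimator implemented in the fossil package, and all function names are hypothetical.

```python
import math

def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i^2) over nonzero proportions."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2.0 * (doubletons + 1))

def pielou_evenness(counts):
    """Pielou's J: Shannon entropy divided by ln(number of observed features)."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if len(props) < 2:
        return 0.0  # evenness is undefined for a single feature; 0 by convention here
    shannon = -sum(p * math.log(p) for p in props)
    return shannon / math.log(len(props))

def is_low_abundance(values, threshold=0.001, min_fraction=0.9):
    """True if the feature is below `threshold` in at least `min_fraction` of samples."""
    below = sum(1 for v in values if v < threshold)
    return below / len(values) >= min_fraction
```

A feature for which is_low_abundance returns True would be dropped before modeling; the defaults mirror the thresholds stated in the text (< 0.001 in ≥ 90% of samples).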
Finally, clades and functions were tested for statistically significant associations with clinical metadata of interest by using a novel multivariate algorithm. Each clade (excluding ecological measures) was normalized with a variance-stabilizing arcsine square-root transformation and evaluated with a general linear model (in R using the glm function). Model selection for sparse data was performed per clade using boosting (gbm package [99]). A multivariate linear model associating all available metadata with each clade independently was boosted, and any metadatum selected in at least 1% of these iterations was then tested for significance in a standard generalized linear model. This composite model was thus of the form:
$$\arcsin\left(\sqrt{y_i}\right) = \beta_0 + \sum_{p} \beta_p X_{i,p} + \varepsilon_i, \quad i = 1, \ldots, n$$
where p indexes the clinical metadata selected by boosting and i indexes the n samples.
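To make the form of this model concrete, the sketch below applies the arcsine square-root transformation and fits an ordinary least-squares line with a single covariate. It is a simplified illustration only: it omits the boosting-based selection step and the multivariate fit, and the data, variable names, and helper functions are hypothetical.

```python
import math

def arcsine_sqrt(abundance):
    """Variance-stabilizing transform for a proportion in [0, 1]."""
    return math.asin(math.sqrt(abundance))

def fit_simple_linear(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: relative abundance of one clade and a binary covariate.
abundance = [0.01, 0.02, 0.015, 0.10, 0.12, 0.09]
disease = [0, 0, 0, 1, 1, 1]

transformed = [arcsine_sqrt(a) for a in abundance]
intercept, slope = fit_simple_linear(disease, transformed)
```

With a binary covariate, the intercept equals the mean transformed abundance in the reference group and the slope equals the difference in group means, which is the quantity tested for significance in a model of this form.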
Within each metadatum/clade association independently, multiple comparisons over factor levels were adjusted using a Bonferroni correction; multiple hypothesis tests over all clades and metadata were adjusted to produce a final Benjamini and Hochberg false discovery rate [100]. Unless otherwise indicated, associations were considered significant below a q-value threshold of 0.25; the KEGG pathway sulfur metabolism (ko00920) had an average q-value of 0.26 for association with Crohn's disease. Multiple factor analysis was performed to visualize the relationships within heterogeneous factor data as well as with a select group of taxa found to be significantly associated with metadata (using the FactoMineR R package [101]). Total abundances and significant associations between metadata, taxa, and functions are listed in Additional files 1 and 11.
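The two multiple-testing adjustments named above can be sketched as follows. This is a generic illustration of the Bonferroni and Benjamini-Hochberg procedures, not the study's code; the function names and the example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p-values for four clade/metadatum associations.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
significant = [q < 0.25 for q in q_values]
```

Here the 0.25 cutoff matches the q-value threshold stated in the text; associations whose q-value falls below it would be reported as significant.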