Linear regression was performed to determine genetic associations with metabolites using KORA F4 CCA and imputed data, and the results were compared with each other. For this analysis, we selected metabolite-SNP pairs for which (i) a genome-wide significant association could be identified in the meta-analysis of the KORA F4 and TwinsUK cohorts in a previous GWAS (Shin et al. 2014; summary statistics retrieved from http://www.gwas.eu); (ii) the proportion of each metabolite’s missing values in KORA F4 was between 10 and 70%; (iii) the metabolite was measured in the EPIC-Norfolk cohort, which we used to further benchmark the preservation of effect sizes; and (iv) a functional connection between the genetic locus of the SNP and the metabolite (e.g., metabolite is a known substrate of the enzyme/transporter) was evident according to manual curation of the GWAS results (Table S8). For each imputed dataset, 18 metabolite-SNP pairs were tested for genetic association using age- and sex-corrected linear regression models under the assumption of an additive genetic model. To avoid spurious associations, metabolic data points more than four SDs from the mean were removed prior to computing linear models. For MI approaches, the regression coefficients were pooled using Rubin’s rules as provided by the R package mice, version 2.25. For each metabolite-SNP pair, the variance of the regression coefficients and p-values were estimated using bootstrapping.
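The association step described here can be illustrated with a minimal Python sketch. It is not the code used in the study, which relied on the R package mice. The function names (`within_four_sd`, `fit_snp_effect`, `pool_rubin`) are hypothetical, the SNP is assumed to be coded as an allele dosage (0, 1, 2), and Rubin's rules are written out by hand rather than taken from a library. Bootstrapping is not shown.

```python
import numpy as np


def within_four_sd(y, n_sd=4.0):
    """Boolean mask: True where y lies within n_sd sample SDs of the mean."""
    y = np.asarray(y, dtype=float)
    return np.abs(y - np.nanmean(y)) <= n_sd * np.nanstd(y, ddof=1)


def fit_snp_effect(y, dosage, age, sex):
    """Age- and sex-adjusted OLS of a metabolite on SNP dosage (additive model).

    Returns the SNP coefficient and its sampling variance.
    """
    y, dosage, age, sex = (np.asarray(a, dtype=float) for a in (y, dosage, age, sex))
    keep = within_four_sd(y)  # drop outlying metabolite values first
    X = np.column_stack([np.ones(keep.sum()), dosage[keep], age[keep], sex[keep]])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    resid = y[keep] - X @ coef
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)               # coefficient covariance
    return coef[1], cov[1, 1]


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2:
        raise ValueError("Rubin's rules need at least two imputed datasets")
    within = u.mean()        # average within-imputation variance
    between = q.var(ddof=1)  # between-imputation variance
    return q.mean(), within + (1.0 + 1.0 / m) * between
```

In this sketch, each imputed dataset contributes one (estimate, variance) pair from `fit_snp_effect`, and the pairs for a given metabolite-SNP combination are then passed to `pool_rubin`.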
To explore which imputation approaches increased statistical power, p-values obtained for the effect sizes based on imputed data were compared with p-values obtained from CCA by calculating their ratio as $P_{gain} = -\log_{10}(p_{imp} / p_{CCA})$, where $p_{imp}$ was the p-value obtained for imputed data and $p_{CCA}$ was the p-value derived from CCA. A ratio less than or equal to zero indicated either no power gain or a power loss, whereas a ratio greater than zero indicated a drop in p-value, which suggested that statistical power increased when imputation was performed.
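As a small illustration of this comparison, the following Python sketch computes a log-scaled p-value ratio. It assumes the ratio is the negative base-10 logarithm of the imputed-data p-value divided by the CCA p-value; that form matches the interpretation around zero given above, but the log base and the function name `p_gain` are assumptions of this sketch.

```python
import math


def p_gain(p_imp, p_cca):
    """Return -log10(p_imp / p_cca) for one metabolite-SNP pair.

    Computed as a difference of logs so that very small p-values
    do not underflow when divided. Positive values mean the p-value
    dropped after imputation. (Assumed form, see lead-in.)
    """
    if not (0.0 < p_imp <= 1.0 and 0.0 < p_cca <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    return math.log10(p_cca) - math.log10(p_imp)
```

Under this definition, a pair whose p-value moves from 1e-8 in CCA to 1e-10 after imputation has a ratio of about 2, while an unchanged p-value gives 0.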
In addition to statistical power gain, the imputation approaches should be able to preserve effect sizes compared to CCA. Standardized effect sizes obtained from the imputed data ($\beta_{imp}$) were compared with standardized effect sizes estimated for CCA ($\beta_{CCA}$) based on the KORA F4 data (n = 1750) and the EPIC-Norfolk data (n = 10,634), assuming estimates from the EPIC-Norfolk data to be close to true effects. We calculated the ratio $\beta_{change} = (\beta_{imp} - \beta_{CCA}) / \beta_{CCA}$, with a ratio close to zero indicating a similar effect size between the imputed data and CCA. A highly negative or positive $\beta_{change}$ indicates an underestimation or overestimation of the effect sizes in imputed data, respectively. A well-performing imputation method is assumed to obtain a high p-value ratio and a low absolute $\beta_{change}$.
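The effect-size comparison can be sketched in the same way. The Python function below assumes the ratio is the relative deviation of the imputed-data estimate from the CCA reference estimate, which is consistent with negative values indicating underestimation and positive values indicating overestimation. The exact form and the name `beta_change` are assumptions of this sketch.

```python
def beta_change(beta_imp, beta_ref):
    """Relative change of an imputed-data effect size versus a reference.

    beta_ref is the standardized CCA estimate (from KORA F4 or
    EPIC-Norfolk). Dividing by beta_ref keeps the sign convention
    intact for negative effects: a larger magnitude than the
    reference gives a positive value. (Assumed form, see lead-in.)
    """
    if beta_ref == 0:
        raise ValueError("reference effect size must be non-zero")
    return (beta_imp - beta_ref) / beta_ref
```

For instance, an imputed estimate of 0.4 against a reference of 0.5 yields a relative change of about -0.2, that is, an underestimation of roughly 20%.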
Do K.T., Wahl S., Raffler J., Molnos S., Laimighofer M., Adamski J., Suhre K., Strauch K., Peters A., Gieger C., Langenberg C., Stewart I.D., Theis F.J., Grallert H., Kastenmüller G., & Krumsiek J. (2018). Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics, 14(10), 128.