All phenotype derivation and genomic analyses were conducted on a homogeneous population of individuals of European (EUR) ancestry (N = 455,146), as determined by: (1) projection onto 1KGP phase 3 PCA space; (2) outlier detection to identify the largest cluster of individuals using the Aberrant R package (ref. 46), selecting the λ at which all clustered individuals fell within 1KGP EUR PC1 and PC2 limits (λ = 4.5); and (3) removal of individuals who did not self-report as “British,” “Irish,” “Any other white background,” “White,” “Do not know,” or “Prefer not to answer,” as self-identified non-EUR ancestry could confound dietary habits.
Prior to phenotype derivation, we removed individuals who were pregnant, had kidney disease as defined by ICD10 codes, or had a cancer diagnosis within the last year (field 40005). The UKB FFQ consists of quantitative continuous variables (e.g., field 1289, tablespoons of cooked vegetables per day), ordinal non-quantitative variables depending on overall daily/weekly frequency (e.g., field 1329, overall oily fish intake), food types (i.e., milk, spread, bread, cereal, or coffee), and foods never eaten (field 6144: dairy, eggs, sugar, and wheat). Supplementary Data 1 lists the UKB field corresponding to the FFQ question for each dietary habit; these fields can be looked up in the UKB Data Showcase (http://biobank.ndph.ox.ac.uk/showcase/). Ordinal variables were ranked and set to quantitative values, while food types and foods never eaten were converted into a series of binary variables. Variables relating to alcoholic drinks per month were derived by combining the drinks-per-month and drinks-per-week questions, which were answered by different individuals depending on their response to overall alcohol frequency (field 1558). All 85 single FI dietary phenotypes were then adjusted for age in months and sex, followed by inverse rank normal transformation of continuous FI-QTs. For individuals with repeated FFQ responses, both the dietary variable and the age in months covariate were averaged over all repeated measures. PCs were then derived from all 85 FI-QTs using the prcomp base function in R, after filling in missing data with the median. FI-QTs with a percent contribution (squared coordinates) greater than expected under a uniform distribution [1/85 × 100 = 1.18%] were included in Fig. 1 and Supplementary Fig. 4, created using the ComplexHeatmap package in R (ref. 47). Phenotype correlation between all 170 dietary habits was estimated using Pearson’s pairwise correlation on complete observations in R.
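As an illustration of two of the steps above, the following is a minimal Python sketch of a rank-based inverse normal transformation and of the uniform-contribution cutoff. It is not the authors' code (the analysis was run in R). The function names and the rank offset `c = 0.5` are assumptions made for this example, since the text does not state which offset was used; the covariate adjustment for age and sex is omitted.

```python
from statistics import NormalDist


def inverse_rank_normal(values, c=0.5):
    """Rank-based inverse normal transform of a list of numbers.

    Tied values share their average rank. The offset c is an assumption
    for this sketch (c = 0.5 here; 3/8 is another common choice).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    standard_normal = NormalDist()
    # Map each rank to a quantile strictly inside (0, 1), then to a z-score.
    return [standard_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


def uniform_contribution_threshold(n_variables):
    """Percent contribution per variable if all variables contributed equally."""
    return 100.0 / n_variables


z_scores = inverse_rank_normal([3.0, 1.0, 2.0])
cutoff = uniform_contribution_threshold(85)  # 100/85, i.e. 1.18% when rounded
```

The transform keeps only the ordering of the input values, so the middle-ranked observation maps to zero and the extremes map to symmetric z-scores.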
All correlations (phenotypic and genetic) with P > 0.05/85 were set to 0. The significance threshold here was selected based on a Bonferroni correction for 85 total FI-QTs to maintain stringency for multiple testing and consistency across phenotype and genetic correlation analyses, while allowing for the nested and non-independent nature of the FFQ questions and derived FI-QTs.
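The thresholding rule can be sketched as follows. This is a hypothetical Python illustration, not the authors' R code; the function name `mask_nonsignificant` and the toy matrices are assumptions, and only the rule itself (correlations with P > 0.05/85 set to 0) comes from the text.

```python
def mask_nonsignificant(correlations, p_values, alpha=0.05, n_tests=85):
    """Set to 0 every correlation whose P value exceeds alpha / n_tests."""
    threshold = alpha / n_tests  # 0.05 / 85, roughly 5.9e-4
    return [
        [r if p <= threshold else 0.0 for r, p in zip(r_row, p_row)]
        for r_row, p_row in zip(correlations, p_values)
    ]


# Toy 2 x 2 example with made-up values, for illustration only.
toy_r = [[1.0, 0.30], [0.30, 1.0]]
toy_p = [[0.0, 1e-3], [1e-3, 0.0]]
masked = mask_nonsignificant(toy_r, toy_p)
```

In this toy input the off-diagonal P value of 1e-3 exceeds 0.05/85, so those correlations are zeroed while the diagonal is kept.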