Genome and plasma proteome data from participants of European descent in the INTERVAL study (subcohort 1 and subcohort 2) were used to establish and validate protein genetic prediction models. Detailed information about the INTERVAL study dataset has been described elsewhere [16]. In brief, participants were aged 18–80 and were generally in good health. The SOMAscan assay was used to measure the relative concentrations of 3620 plasma proteins or protein complexes. Quality control (QC) was performed at the sample and SOMAmer level. After excluding eight non-human protein targets, a total of 3283 SOMAmers remained for further study. DNA was genotyped for ~830,000 variants on the Affymetrix Axiom UK Biobank genotyping array. Standard sample and variant QC was conducted, as described in the original publication [16]. SNPs were further phased using SHAPEIT3 and imputed using a combined 1000 Genomes Phase 3-UK10K reference panel via the Sanger Imputation Server, resulting in over 87 million imputed variants. These SNPs were filtered using the following criteria: (1) imputation quality of at least 0.7, (2) minor allele frequency (MAF) of at least 5%, (3) Hardy–Weinberg equilibrium (HWE) p ≥ 5 × 10⁻⁶, (4) missing rate < 5%, and (5) presence in the 1000 Genomes Project data for European populations. In total, 4,662,360 variants passed these criteria.
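The five variant-level filters listed above can be sketched as a simple predicate. This is only an illustration: the `Variant` record, its field names, and the example values are hypothetical and do not come from the study's pipeline; only the thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant QC summary (field names are illustrative)."""
    snp_id: str
    info: float          # imputation quality score
    maf: float           # minor allele frequency
    hwe_p: float         # Hardy-Weinberg equilibrium p-value
    missing_rate: float  # fraction of samples with a missing genotype
    in_1000g_eur: bool   # present in 1000 Genomes European data


def passes_qc(v: Variant) -> bool:
    """Apply the five filtering criteria stated in the text."""
    return (
        v.info >= 0.7
        and v.maf >= 0.05
        and v.hwe_p >= 5e-6
        and v.missing_rate < 0.05
        and v.in_1000g_eur
    )


# Made-up example variants, not real data.
variants = [
    Variant("snp_a", 0.95, 0.20, 0.40, 0.01, True),   # passes all filters
    Variant("snp_b", 0.65, 0.20, 0.40, 0.01, True),   # low imputation quality
    Variant("snp_c", 0.95, 0.01, 0.40, 0.01, True),   # MAF below 5%
    Variant("snp_d", 0.95, 0.20, 1e-8, 0.01, True),   # fails HWE
    Variant("snp_e", 0.95, 0.20, 0.40, 0.01, False),  # absent from reference
]
kept = [v.snp_id for v in variants if passes_qc(v)]
print(kept)
```

In practice such filters are usually applied with dedicated genotype tooling rather than in-memory records; the sketch only makes the direction of each inequality explicit.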
In subcohort 1 (N = 2481), protein levels were log transformed and adjusted for age, sex, duration between blood draw and processing, and the first three principal components of ancestry. For the rank-inverse normalized residuals of each protein of interest, we followed the TWAS/FUSION framework [17] to develop genetic prediction models, using SNPs located near (within 100 kb of) potentially associated SNPs as potential predictors. A false discovery rate (FDR) < 0.05 and a P-value ≤ 5 × 10⁻⁸ were used to determine potentially associated SNPs in cis- and trans-regions, respectively. We defined the cis-region as the region within 1 Mb of the transcriptional start site (TSS) of the gene encoding the target protein of interest. Subsequently, we extracted all SNPs located within 100 kb of the aforementioned potentially associated SNPs to serve as potential predictors for establishing protein prediction models, excluding any ambiguous SNPs. To include potential predictors from both cis- and trans-regions, we converted all chromosome numbers to Z and combined them into a single pseudo-chromosome. Four methods, namely best linear unbiased predictor, elastic net, LASSO, and top1, were used to establish the models. For protein prediction models with a prediction performance (R²) of at least 0.01, a common threshold in relevant studies [15, 18–23], we further conducted external validation using subcohort 2 (N = 820) data. In brief, we generated predicted expression levels by applying the established protein prediction models to the genetic data, and then compared the predicted vs. measured levels of each protein of interest. We selected proteins with a model prediction R² of ≥ 0.01 in subcohort 1 and a correlation coefficient of ≥ 0.1 in subcohort 2 for the downstream association analysis.
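Two steps of this procedure lend themselves to a short sketch: the rank-based inverse normal transformation applied to the covariate-adjusted residuals, and the two-threshold rule used to select models for downstream analysis. The code below is a minimal illustration under stated assumptions. The rank offset `c = 0.5` is an assumption (the text does not specify one), and the function names, protein labels, and numeric values are hypothetical.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=0.5):
    """Map values to standard normal quantiles by rank.

    Tied values share their average rank. The offset c is an assumed
    convention; the source does not state which one was used.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


def select_proteins(train_r2, validation_corr, r2_min=0.01, corr_min=0.1):
    """Keep proteins with R^2 >= r2_min in training and
    correlation >= corr_min in external validation."""
    return sorted(
        p for p, r2 in train_r2.items()
        if r2 >= r2_min and validation_corr.get(p, float("-inf")) >= corr_min
    )


# Made-up residuals and performance values, for illustration only.
normalized = rank_inverse_normal([2.0, 10.0, 5.0])
train_r2 = {"PROT_A": 0.08, "PROT_B": 0.005, "PROT_C": 0.03}
validation_corr = {"PROT_A": 0.25, "PROT_B": 0.30, "PROT_C": 0.04}
selected = select_proteins(train_r2, validation_corr)
print(selected)
```

Here only `PROT_A` meets both thresholds: `PROT_B` falls short on training R², and `PROT_C` falls short on the validation correlation.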
To assess the associations between genetically predicted circulating protein levels and AD risk, we applied the validated protein prediction models to the summary statistics from a large GWAS meta-analysis of AD risk [24]. Instead of using the conventional approach of including clinically diagnosed AD alone, this GWAS combined clinically confirmed diagnoses with by-proxy phenotypes based on parental diagnoses, an approach that has been shown to substantially increase statistical power [25]. In brief, this study included a total of 85,934 cases (39,106 clinically diagnosed AD and 46,828 proxy AD) and 401,577 controls of European ancestry, obtained from various sources including the European Alzheimer & Dementia Biobank dataset (EADB), the GR@ACE/DEGESCO study, the Rotterdam Study (RS1 and RS2), the European Alzheimer’s Disease Initiative (EADI) Consortium, the Genetic and Environmental Risk in AD (GERAD) Consortium/Defining Genetic, Polygenic, and Environmental Risk for Alzheimer’s Disease (PERADES) Consortium, the Norwegian DemGene Network, the Neocodex–Murcia study (NxC), the Copenhagen City Heart Study (CCHS), the Bonn studies, and UK Biobank. Detailed information on study participants, as well as genotyping and imputation methods for the samples from each of the included studies, can be found in the supplementary files of the original GWAS paper [24]. Risk estimates for the single marker association analyses were adjusted for sex, batch (if applicable), age (if applicable), and top principal components (PCs).
The TWAS/FUSION framework was used to determine the protein-AD associations, leveraging correlation information between the SNPs included in the prediction models, estimated from the 1000 Genomes Project phase 3 data of European ancestry [17]. We calculated the PWAS test statistic as $Z\text{-score} = w'Z / (w'\Sigma_{s,s}w)^{1/2}$, where $Z$ is a vector of standardized effect sizes (Wald z-scores) of the SNPs for a given protein, $w$ is a vector of prediction weights for the abundance feature of the protein being tested, and $\Sigma_{s,s}$ is the LD matrix of the SNPs, estimated using the 1000 Genomes Project as the LD reference panel. A Bonferroni-corrected P-value < 0.05 was used to determine significant associations between genetically predicted protein concentrations and AD risk.
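The test statistic and significance rule described above can be illustrated with a small numerical sketch. This is not the FUSION software itself. The function names, the toy weights, z-scores, LD matrix, and the numbers of tests are all hypothetical. The Bonferroni step is assumed here to mean multiplying each two-sided p-value by the number of proteins tested.

```python
from math import erfc, sqrt


def pwas_z(weights, gwas_z, ld):
    """Compute w'Z / sqrt(w' Sigma w) for one protein model.

    weights: prediction weights w, one per SNP
    gwas_z:  GWAS Wald z-scores Z for the same SNPs
    ld:      LD (correlation) matrix Sigma as a list of rows
    """
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j]
        for i in range(n) for j in range(n)
    )
    return numerator / sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return erfc(abs(z) / sqrt(2.0))


def bonferroni_significant(p, n_tests, alpha=0.05):
    """True if the Bonferroni-adjusted p-value is below alpha."""
    return min(1.0, p * n_tests) < alpha


# Toy two-SNP model; all values are made up for illustration.
w = [0.5, -0.2]
z_gwas = [3.0, -1.0]
sigma = [[1.0, 0.3],
         [0.3, 1.0]]

z_score = pwas_z(w, z_gwas, sigma)
p_value = two_sided_p(z_score)
print(z_score, p_value)
```

The denominator rescales the weighted sum of z-scores by its standard deviation under the null, so correlated SNPs are not double counted. With uncorrelated SNPs and a single nonzero weight, the statistic reduces to that SNP's own GWAS z-score.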
Ingenuity Pathway Analysis (IPA, Ingenuity System Inc, USA) and protein–protein interaction analysis via the STRING database (version 12.0) at a confidence level of 0.400 [26] were implemented to cluster and classify enriched pathways for the identified proteins using default interaction resources, including Textmining, Experiments, Databases, Co-expression, Neighborhood, Gene Fusion, and Co-occurrence. We also investigated potentially repositionable drugs targeting the genes encoding the associated proteins using the GREP (Genome for REPositioning drugs) tool [27]. We further conducted molecular docking analysis with ATP1A1 as the drug target protein and almitrine and ciclopirox as the drug agents [28].
Zhu J., Liu S., Walker K.A., Zhong H., Ghoneim D.H., Zhang Z., Surendran P., Fahle S., Butterworth A., Alam M.A., Deng H.W., Wu C., & Wu L. (2024). Associations between genetically predicted plasma protein levels and Alzheimer’s disease risk: a study using genetic prediction models. Alzheimer's Research & Therapy, 16, 8.