We trained our cross-tissue gene expression imputation model using genotype and normalized gene expression data from 44 tissues in the GTEx project (version V6p, dbGaP accession code: phs000424.v6.p1) [3]. Sample sizes for different tissues ranged from 70 (uterus) to 361 (skeletal muscle). SNPs with ambiguous alleles or minor allele frequency (MAF) < 0.01 were removed. Normalized gene expression values were further adjusted to remove potential confounding effects from sex, sequencing platform, the top three principal components of genotype data, and top probabilistic estimation of expression residuals (PEER) factors
[77]. As previously recommended [17], we included 15 PEER factors for tissues with N < 150, 30 factors for tissues with 150 ≤ N < 250, and 35 factors for tissues with N ≥ 250. All covariates were downloaded from the GTEx portal website (
URLs). We applied 5-fold cross-validation for model tuning and evaluation. Specifically, we randomly divided individuals into five groups of equal size. In each round, we used three groups as the training set, one as the intermediate set for selecting tuning parameters, and the last one as the testing set for performance evaluation. The squared correlation between predicted and observed expression (i.e., R²) was used to quantify imputation accuracy. For each model, we selected gene-tissue pairs with FDR < 0.05 for downstream testing. External validation of imputation accuracy was performed using whole-blood expression data from 421 samples in the 1000 Genomes Project (GEUVADIS consortium)
[32] and the CommonMind consortium [33], which collected expression data across multiple regions from > 1,000 postmortem brain samples (mainly corresponding to Brain_Frontal_Cortex_BA9 in GTEx) from individuals with schizophrenia or bipolar disorder and from individuals with no neuropsychiatric disorders. For the CommonMind data, we focused our analysis on 147 controls with no neuropsychiatric disorders. Average improvements in
R² in both external validation datasets are shown in Supplementary Figure 4. Although the difference was not statistically significant due to the limited sample size, the accuracy of the cross-tissue method was consistently higher than that of the single-tissue approach across quantiles. Furthermore, comparing the tissue-tissue similarity based on the observed and imputed gene expression indicated that cross-tissue imputation removed stochastic noise in the expression data without losing tissue-specific correlational patterns (Supplementary Note; Supplementary Figures 5–6).
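
As a minimal sketch of three pieces of the procedure described above (the sample-size thresholds for PEER factors, the rotation of five groups into training, intermediate, and testing sets, and the squared-correlation accuracy metric), the following Python code may be useful. It is not the authors' implementation: the function names, the use of NumPy, the random seed, and the choice of which group serves as the intermediate set in each rotation are illustrative assumptions, since the text does not specify them.

```python
import numpy as np


def n_peer_factors(n_samples: int) -> int:
    """Number of PEER factors for a tissue, using the thresholds in the text."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    return 35


def five_fold_splits(n_individuals: int, seed: int = 0):
    """Yield (train, intermediate, test) index arrays for five rotations.

    Assumption: in rotation k, group k is the testing set and the next
    group (cyclically) is the intermediate set; the remaining three groups
    form the training set. The text does not state how roles are assigned.
    """
    rng = np.random.default_rng(seed)
    # Randomly partition individuals into five groups of (near-)equal size.
    groups = np.array_split(rng.permutation(n_individuals), 5)
    for k in range(5):
        tune = (k + 1) % 5
        train = np.concatenate(
            [groups[j] for j in range(5) if j not in (k, tune)]
        )
        yield train, groups[tune], groups[k]


def squared_correlation(predicted, observed) -> float:
    """Squared Pearson correlation between predicted and observed values."""
    r = np.corrcoef(predicted, observed)[0, 1]
    return float(r ** 2)


if __name__ == "__main__":
    # Smallest and largest tissue sample sizes reported in the text.
    print(n_peer_factors(70), n_peer_factors(361))
    for train, tune, test in five_fold_splits(361):
        print(len(train), len(tune), len(test))
```

In this sketch each individual appears in the testing set exactly once across the five rotations, so held-out predictions can be pooled before computing the squared correlation; whether the original analysis pooled or averaged across rotations is not stated here.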
Hu Y., Li M., Lu Q., Weng H., Wang J., Zekavat S.M., Yu Z., Li B., Gu J., Muchnik S., Shi Y., Kunkle B.W., Mukherjee S., Natarajan P., Naj A., Kuzma A., Zhao Y., Crane P.K., Lu H., & Zhao H. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics, 51(3), 568-576.