We first construct the n × m genotype matrix X, by centering and scaling the allele counts for each SNP according to Xi,j = (Si,j−2fj) × [2fj (1−fj)]α/2, where fj = ΣiSi,j/2n. If wj and rj denote the LD weight9 (link) and information score for SNP j, then the LDAK Model for estimating SNP heritability hSNP2=σg2/(σg2+σe2) is: Yi=k=1pθkZi,k+j=1mβjXi,j+ei,withβj(0,rjwjσg2/W),ei(0,σe2)andW=j=1mrjwj[2fj(1fj)]1+α.
θk denotes the fixed-effect coefficient for the kth covariate, βj and ei are random-effects indicating the effect size of SNP j and the noise component for Individual i, while σg2 and σe2 are interpreted as genetic and environmental variances, respectively. Note that the introduction of rj is an addition to the model we proposed in 2012.9 (link)
Model (2) is equivalent to assuming:44 , 45 (link)
Y(Zθ,Kσg2+Iσe2),withK=XΩXTW, where I is an n × n identity matrix and Ω denotes a diagonal matrix with diagonal entries (r1w1, …, rmwm). The kinship matrix K, also referred to as a genetic relationship matrix (GRM)1 (link) or genomic similarity matrix (GSM),46 (link) consists of average allelic correlations across the SNPs (adjusted for LD and genotype certainty). Model (3) is typically solved using REstricted Maximum Likelihood (REML), which returns estimates of θ1, …, θp, σg2 and σe2. 12
The heritability of SNP j can be estimated by hj2=βj2Var(Xj)/Var(Y), which under Model (2), and assuming Hardy-Weinberg Equilibrium,47 (link), 48 has expectation 𝔼[hj2]=𝔼[βj2]×Var(Xj)Var(Y)=rjwjσg2/W×[2fi(1fj)]1+αVar(Y). If P1 and P2 index two sets of SNPs of size |P1| and |P2|, then under the LDAK Model, they are expected to contribute heritability in the ratio W1 : W2, where Wl = Σj∈Plrjwj [2fj (1−fj)]1+α. The GCTA Model corresponds to setting wj = rj = 1, in which case Wl = ΣjPl [2fj (1−fj)]1+α. Most applications of GCTA have further assumed α = −1, so that Wl = |Pl|, which corresponds to the assumption that SNP sets are expected to contribute heritability proportional to the number of SNPs they contain.
Model (2) assumes that all effect-sizes can be described by a single prior distribution. This assumption is relaxed by SNP partitioning. Suppose that the SNPs are divided into tranches P1, …, PL of sizes |P1|, …, |PL|; typically these will partition the genome, so that each SNP appears in exactly one tranche and Σl |Pl| = m, but this is not required. This correspond to generalizing Model (2), so that SNPs in Tranche l have effect-size prior distribution βj(0,rjwjσl2/Wl). Letting Σ=σ12++σL2, then hSNP2=Σ/(Σ+σe2), while σl2/Σ represents the contribution to hSNP2 of SNPs in Tranche l. This model can equivalently be expressed as Y(Zθ,K1σ12++KLσL2+Iσe2), where Kl represents allele correlations across the SNPs in Tranche l.
For analyses under the LDAK Model, we used LDAK v.5; for analyses under the GCTA Model, we used GCTA v.1.26. For about a third of GCTA-LDMS analyses, the GCTA REML solver failed with the error “information matrix is not invertible,” in which case we rerun using LDAK (while the GCTA and LDAK solvers are both based on Average Information REML,28 (link), 49 (link) subtle differences mean that when using a large number of tranches, one might complete while the other fails). For the few occasions when both solvers failed, we instead used “GCTA-LD” (i.e., SNPs divided only by LD, rather than by LD and MAF), which we found gave very similar results to GCTA-LDMS for traits where both completed (Supplementary Fig. 7). For diseases, we converted estimates of hSNP2 to the liability scale based on the observed case-control ratio and assumed prevalence.26 (link), 27 (link) In general, we copied the prevalences used by previous studies; however for tuberculosis, where no previous estimate of hSNP2 is available, we derived an estimate of prevalence from World Health Organization data50 (see Supplementary Note).