We first construct the
n ×
m genotype matrix
X, by centering and scaling the allele counts for each SNP according to
Xi,j = (
Si,j−2
fj) × [2
fj (1−
fj)]
α/2, where
fj = Σ
iSi,j/2
n. If
wj and
rj denote the LD weight9 (
link) and information score for SNP
j, then the LDAK Model for estimating SNP heritability
is:
θk denotes the fixed-effect coefficient for the
kth covariate,
βj and
ei are random-effects indicating the effect size of SNP
j and the noise component for Individual
i, while
and
are interpreted as genetic and environmental variances, respectively. Note that the introduction of
rj is an addition to the model we proposed in 2012.9 (
link)
Model (2) is equivalent to assuming:44 , 45 (
link)
where
I is an
n ×
n identity matrix and
Ω denotes a diagonal matrix with diagonal entries (
r1w1, …,
rmwm). The kinship matrix
K, also referred to as a genetic relationship matrix (GRM)1 (
link) or genomic similarity matrix (GSM),46 (
link) consists of average allelic correlations across the SNPs (adjusted for LD and genotype certainty).
Model (3) is typically solved using REstricted Maximum Likelihood (REML), which returns estimates of
θ1, …,
θp,
and
12
The heritability of SNP
j can be estimated by
which under
Model (2), and assuming Hardy-Weinberg Equilibrium,47 (
link), 48 has expectation
If
P1 and
P2 index two sets of SNPs of size |
P1| and |
P2|, then under the LDAK Model, they are expected to contribute heritability in the ratio
W1 :
W2, where
Wl = Σ
j∈Plrjwj [2
fj (1−
fj)]
1+α. The GCTA Model corresponds to setting
wj =
rj = 1, in which case
Wl = Σ
j∈Pl [2
fj (1−
fj)]
1+α. Most applications of GCTA have further assumed
α = −1, so that
Wl = |
Pl|, which corresponds to the assumption that SNP sets are expected to contribute heritability proportional to the number of SNPs they contain.
Model (2) assumes that all effect-sizes can be described by a single prior distribution. This assumption is relaxed by SNP partitioning. Suppose that the SNPs are divided into tranches
P1, …,
PL of sizes |
P1|, …, |
PL|; typically these will partition the genome, so that each SNP appears in exactly one tranche and Σ
l |
Pl| =
m, but this is not required. This correspond to generalizing
Model (2), so that SNPs in Tranche l have effect-size prior distribution
Letting
then
while
represents the contribution to
of SNPs in Tranche
l. This model can equivalently be expressed as
where
Kl represents allele correlations across the SNPs in Tranche
l.
For analyses under the LDAK Model, we used LDAK v.5; for analyses under the GCTA Model, we used GCTA v.1.26. For about a third of GCTA-LDMS analyses, the GCTA REML solver failed with the error “information matrix is not invertible,” in which case we rerun using LDAK (while the GCTA and LDAK solvers are both based on Average Information REML,28 (
link), 49 (
link) subtle differences mean that when using a large number of tranches, one might complete while the other fails). For the few occasions when both solvers failed, we instead used “GCTA-LD” (i.e., SNPs divided only by LD, rather than by LD and MAF), which we found gave very similar results to GCTA-LDMS for traits where both completed (
Supplementary Fig. 7). For diseases, we converted estimates of
to the liability scale based on the observed case-control ratio and assumed prevalence.26 (
link), 27 (
link) In general, we copied the prevalences used by previous studies; however for tuberculosis, where no previous estimate of
is available, we derived an estimate of prevalence from World Health Organization data50 (see
Supplementary Note).