Moorjani et al. (2011) first observed that pairwise LD measurements across a panel of SNPs can be combined to enable accurate inference of the age of admixture, n. The crux of their approach was to harness the fact that the ALD between two sites x and y separated by genetic distance d scales as e^{−nd} multiplied by the product of allele frequency differences δ(x)δ(y) in the mixing populations. While the allele frequency differences δ(⋅) are usually not directly computable, they can often be approximated. Thus, Moorjani et al. (2011) formulated a method, ROLLOFF, that dates admixture by fitting an exponential decay e^{−nd} to correlation coefficients between LD measurements and surrogates for δ(x)δ(y). Note that Moorjani et al. (2011) define z(x, y) as a sample correlation coefficient, analogous to the classical LD measure r, as opposed to the sample covariance (Equation 1) we use here; we find the latter more mathematically convenient.
We build upon these previous results by deriving exact formulas for weighted sums of ALD under a variety of weighting schemes that serve as useful surrogates for δ(x)δ(y) in practice. These calculations will allow us to interpret the magnitude of weighted ALD to obtain additional information about admixture parameters. Additionally, the theoretical development will generally elucidate the behavior of weighted ALD and its applicability in various phylogenetic scenarios.
Following Moorjani et al. (2011), we partition all pairs of SNPs (x, y) into bins of roughly constant genetic distance:

$$S(d) := \{(x, y) : d - \varepsilon/2 < |x - y| < d + \varepsilon/2\},$$

where ε is a discretization parameter inducing a discretization on d. Given a choice of weights w(⋅), one per SNP, we define the weighted LD at distance d as

$$a(d) := \frac{\sum_{S(d)} z(x, y)\, w(x)\, w(y)}{|S(d)|}.$$
Assume first that our weights are the true allele frequency differences in the mixing populations, i.e., w(x) = δ(x) for all x. Applying Equation 3,

$$E[a(d)] = 2\alpha\beta\, E\!\left[\delta(x)^2\, \delta(y)^2\right] e^{-nd} = 2\alpha\beta\, F_2(A, B)^2\, e^{-nd}, \tag{4}$$

where F2(A, B) is the expected squared allele frequency difference for a randomly drifting neutral allele, as defined in Reich et al. (2009) and Patterson et al. (2012). Thus, a(d) has the form of an exponential decay as a function of d, with decay constant n giving the date of admixture.
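As an illustration of how the decay constant could be read off a weighted LD curve, the following minimal sketch fits a(d) ≈ A e^{−nd} by ordinary least squares on log a(d). The function name `fit_decay`, the parameter values, and the log-linear fit itself are assumptions made for this example. It presumes a noise-free, strictly positive curve with d measured in Morgans (so that n is in generations), and it is not necessarily the fitting procedure used in practice.

```python
import numpy as np

def fit_decay(d, a):
    """Fit a(d) ~ A * exp(-n * d) via a straight-line fit to log a(d).

    Returns (A, n). Hypothetical helper for illustration only; requires a > 0.
    """
    d = np.asarray(d, dtype=float)
    a = np.asarray(a, dtype=float)
    if np.any(a <= 0):
        raise ValueError("log-linear fit needs strictly positive a(d)")
    # log a(d) = log A - n * d, so the slope is -n and the intercept is log A.
    slope, intercept = np.polyfit(d, np.log(a), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic curve with amplitude 2 * alpha * beta * F2^2 (all values assumed).
alpha, beta, f2, n_true = 0.3, 0.7, 0.05, 40.0
d = np.arange(0.005, 0.2, 0.005)              # bin centers in Morgans (assumed)
a = 2 * alpha * beta * f2**2 * np.exp(-n_true * d)

amp, n_hat = fit_decay(d, a)
print(f"amplitude = {amp:.6g}, n = {n_hat:.4f}")
```

With real data the curve is noisy and can cross zero at large d, so a nonlinear least-squares fit of the exponential would be the more robust choice than taking logarithms.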
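In practice, we must compute an empirical estimator of a(d) from a finite number of sampled genotypes. Say we have a set of m diploid admixed samples from population C indexed by i = 1, …, m, and denote their genotypes at sites x and y by x_i, y_i ∈ {0, 1, 2}. Also assume we have some finite number of reference individuals from A and B with empirical mean allele frequencies p̂_A(⋅) and p̂_B(⋅). Then our estimator is

$$\hat{a}(d) := \frac{\sum_{S(d)} \hat{z}(x, y)\, (\hat{p}_A(x) - \hat{p}_B(x))\, (\hat{p}_A(y) - \hat{p}_B(y))}{|S(d)|}, \tag{5}$$

where S(d) is the bin of SNP pairs at distance d and

$$\hat{z}(x, y) := \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})$$

is the usual unbiased sample covariance, so the expectation over the choice of samples satisfies

$$E[\hat{a}(d)] = \frac{\sum_{S(d)} z(x, y)\, (\hat{p}_A(x) - \hat{p}_B(x))\, (\hat{p}_A(y) - \hat{p}_B(y))}{|S(d)|}$$

(assuming no background LD, so the ALD in population C is independent of the drift processes producing the weights).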
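The estimator just described can be sketched directly in code: compute the unbiased sample covariance for every SNP pair, weight each pair by the product of reference allele frequency differences, and average over the pairs whose separation falls in the distance bin. The function name `weighted_ld`, its arguments, and the toy data are hypothetical. The sketch is a naive computation that is quadratic in the number of SNPs, intended only to make the definition concrete, and it assumes positions are genetic map coordinates in Morgans.

```python
import numpy as np

def weighted_ld(G, pos, w, d, eps):
    """Naive weighted LD estimate for the bin centered at distance d.

    G   : (m, S) genotypes in {0, 1, 2} for m admixed samples at S SNPs
    pos : (S,) genetic map positions (Morgans assumed)
    w   : (S,) weights, e.g. reference frequency differences p_A - p_B
    eps : bin width; a pair is in the bin if |dist - d| < eps / 2
    Returns NaN if the bin contains no SNP pairs.
    """
    G = np.asarray(G, dtype=float)
    pos = np.asarray(pos, dtype=float)
    w = np.asarray(w, dtype=float)
    m = G.shape[0]
    if m < 2:
        raise ValueError("need at least two samples for a sample covariance")
    Gc = G - G.mean(axis=0)                    # center each SNP column
    Z = Gc.T @ Gc / (m - 1)                    # unbiased covariance, all pairs
    dist = np.abs(pos[:, None] - pos[None, :])
    upper = np.triu(np.ones_like(dist, dtype=bool), k=1)  # each pair once
    in_bin = (np.abs(dist - d) < eps / 2) & upper
    n_pairs = int(in_bin.sum())
    if n_pairs == 0:
        return float("nan")
    return float((Z * np.outer(w, w))[in_bin].sum() / n_pairs)

# Toy data: three samples, three SNPs (values chosen for illustration only).
G = np.array([[0, 0, 2],
              [1, 1, 1],
              [2, 2, 0]])
pos = np.array([0.00, 0.01, 0.02])
p_A = np.array([0.8, 0.7, 0.2])                # hypothetical reference frequencies
p_B = np.array([0.3, 0.2, 0.7])
a_hat = weighted_ld(G, pos, p_A - p_B, d=0.01, eps=0.005)
print(a_hat)
```

In this toy example the third SNP is negatively correlated with the other two, but its weight is also negative, so every pair contributes with the same sign to the weighted sum.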
The weighted sum ∑ z(x, y)w(x)w(y) over S(d) is a natural quantity to use for detecting ALD decay and is common to our weighted LD statistic and previous formulations of ROLLOFF. Indeed, for SNP pairs (x, y) at a fixed distance d, we can think of Equation 3 as providing a simple linear regression model between LD measurements z(x, y) and allele frequency divergence products δ(x)δ(y). In practice, the linear relation is made noisy by random sampling, as noted above, but the regression coefficient 2αβe^{−nd} can be inferred by combining measurements from many SNP pairs (x, y). In fact, the weighted sum in the numerator of Equation 5 is precisely the numerator of the least-squares estimator of the regression coefficient, which is the formulation of ROLLOFF given in Moorjani et al. (2012, Note S1). Note that measurements of z(x, y) cannot be combined directly without a weighting scheme, as the sign of the LD can be either positive or negative; additionally, the weights tend to preserve signal from ALD while depleting contributions from other forms of LD.
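The regression view can be made concrete with a small sketch. It assumes a least-squares line through the origin, for which the slope estimate is the weighted sum ∑ z·ww divided by ∑ ww², where ww stands for the product of weights of a SNP pair. The helper name `regression_coefficient`, the parameter values, and the noise-free toy data are all assumptions made for this example.

```python
import numpy as np

def regression_coefficient(z, ww):
    """Least-squares slope of z on ww for a line through the origin.

    z  : LD measurements for SNP pairs at one fixed distance
    ww : matching products of weights w(x) * w(y)
    """
    z = np.asarray(z, dtype=float)
    ww = np.asarray(ww, dtype=float)
    denom = np.sum(ww ** 2)
    if denom == 0:
        raise ValueError("all weight products are zero")
    return float(np.sum(z * ww) / denom)       # numerator is the weighted sum

# Noise-free toy model z = 2 * alpha * beta * exp(-n * d) * ww (values assumed).
alpha, beta, n, d = 0.4, 0.6, 50.0, 0.02
true_coef = 2 * alpha * beta * np.exp(-n * d)
ww = np.array([0.04, -0.02, 0.09, -0.06, 0.01])  # mixed signs, as in real data
z = true_coef * ww

coef_hat = regression_coefficient(z, ww)
naive_mean = float(np.mean(z))                 # unweighted average of the LD values
print(coef_hat, true_coef, naive_mean)
```

Because the weight products have mixed signs, the unweighted mean of z largely cancels and depends on the particular SNP pairs sampled, whereas the weighted sum aligns the signs before combining.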
Up to scaling, our ALDER formulation is roughly equivalent to the regression coefficient formulation of ROLLOFF (Moorjani et al. 2012, Note S1). In contrast, the original ROLLOFF statistic (Patterson et al. 2012) computed a correlation coefficient between z(x, y) and w(x)w(y) over S(d). However, the normalization term in the denominator of the correlation coefficient can exhibit an unwanted d-dependence that biases the inferred admixture date if the admixed population has undergone a strong bottleneck (Moorjani et al. 2012, Note S1) or in the case of recent admixture and large sample sizes. Beyond correcting the date bias, the curve that ALDER computes has the advantage of a simple form for its amplitude in terms of meaningful quantities, providing us additional leverage on admixture parameters. Additionally, we will show that â(d) can be computed efficiently via a new fast Fourier transform-based algorithm.
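To see why the normalization matters, consider a deliberately extreme toy case in which the LD measurements are noise-free, loosely mimicking the limit of very large sample sizes. The sketch below compares a weighted-mean statistic with an uncentered correlation-style statistic at two distances. Both function names, the uncentered form of the correlation, and all parameter values are assumptions made for illustration and are not taken from the ROLLOFF or ALDER implementations.

```python
import numpy as np

def weighted_mean_stat(z, ww):
    """Weighted sum of z * ww divided by the number of SNP pairs."""
    return float(np.mean(np.asarray(z) * np.asarray(ww)))

def correlation_stat(z, ww):
    """Uncentered correlation between z and ww (assumed form, illustration only)."""
    z = np.asarray(z, dtype=float)
    ww = np.asarray(ww, dtype=float)
    return float(np.sum(z * ww) / np.sqrt(np.sum(z ** 2) * np.sum(ww ** 2)))

alpha, beta, n = 0.5, 0.5, 10.0                  # hypothetical admixture parameters
ww = np.array([0.04, -0.02, 0.09, -0.06, 0.01])  # products of weights per SNP pair

stats = {}
for d in (0.01, 0.05):
    z = 2 * alpha * beta * np.exp(-n * d) * ww   # noise-free ALD at distance d
    stats[d] = (weighted_mean_stat(z, ww), correlation_stat(z, ww))

for d, (wm, corr) in stats.items():
    print(f"d = {d:.2f}: weighted mean = {wm:.6g}, correlation = {corr:.6g}")
```

In this limit the denominator of the correlation-style statistic decays with d at the same rate as the numerator, so the statistic stays at 1 and carries no information about n, while the weighted mean still scales as e^{−nd}. Real data lie between this extreme and the noise-dominated regime, where the denominator is nearly constant in d.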