To efficiently simulate the AFS, we adopt a diffusion approach. Such approaches have a long and distinguished history in population genetics, dating back to R. A. Fischer [28] –[30] . The diffusion approach is a continuous approximation to the population genetics of a discrete number of individuals evolving in discrete generations. An important underlying assumption is that per-generation changes in allele frequency are small. Consequently, the diffusion approximation applies when the effective population size is large and migration rates and selection coefficients are of order .
If we have samples from populations, the numbers of sampled sequences from each population are . (For diploids, is typically twice the number of individuals sampled from population 1.) Entry of the AFS records the number of diallelic polymorphic sites at which the derived allele was found in samples from population 1, from population 2, and so forth. (If ancestral alleles cannot be determined, then the “folded” AFS can be considered, in which entries correspond to the frequency of the minor allele.)
We model the evolution of , the density of derived mutations at relative frequencies in populations at time . (All run from 0 to 1.) Given an infinitely-many-sites mutational model [31] (link) and Wright-Fisher reproduction in each generation, the dynamics of for an arbitrary finite number of populations are governed by a linear diffusion equation: The first term models genetic drift, and the second term models selection and migration. Figure 1A illustrates the effects of different evolutionary forces on components of . Time is in units of , where is the time in generations and is a reference effective population size. The relative effective size of population is . The scaled migration rate is , where is the proportion of chromosomes per generation in population that are new migrants from population . (Thus migration is assumed to be conservative [32] (link)). Finally, the scaled selection coefficient is , where is the relative selective advantage or disadvantage of variants in population . Boundary conditions are no-flux except at two corners of the domain, where all population frequencies are 0 or 1; these are absorbing points corresponding to allele loss or fixation. Because the diffusion equation is linear, we can solve simultaneously for the evolution of all polymorphism by continually injecting density at low frequency in each population (at a rate proportional to the total mutation flux ), corresponding to novel mutations.
Changes in population size and migration alter the parameters in Equation 1, while population splits and mergers alter the dimensionality of . For example, if new population 3 is admixed with a proportion from population 1 and from population 2 then where denotes the Dirac delta function. To remove population 2, is integrated over : .
Given , the expected value of each entry of the AFS, , is found via a P-dimensional integral over all possible population allele frequencies of the probability of sampling derived alleles times the density of sites with those population allele frequencies. For SNP data obtained by resequencing, these probabilities are binomial, so In some cases of ascertained data [33] (link), the resulting bias can be corrected by modifying the above equation [11] (link),[34] (link).
Free full text: Click here