We have tested our program ability to recover demographic parameters from DNA sequence data in four relatively plausible but distinct scenarios of population differentiation involving one to ten populations with migration (see Figure 1). In all cases, we simulated with fastsimcoal2 400,000 unlinked regions of 50 bp, thus totaling 20 Mb of DNA sequences, assuming a mutation rate of 2.5×10−8 bp−1 per generation and an infinite-site model. Pseudo-observed SFS were also directly computed with fastsimcoal2. Parameters were estimated independently from ten data sets generated under each model. For each data set generated under models with one to three populations, we performed 50 parameter estimations via ECM maximization, and each time retained the parameter set with maximum likelihood. For the model with 10 populations we only performed 20 estimations per data sets, and used 50,000 simulations instead of 100,000 for the other models to estimate the expected SFS due to long computation times. We describe the four tested models in Figure 1, and the used parameter values are showed as red dots in Figures 2, 3, S4 and S5. Absolute numbers (generations, population sizes) were obtained by assuming that the mutation rate of 2.5×10−8 bp−1 per generation was known.
As a benchmark, we used to infer the demographic parameters in scenarios shown in Figure 1A–1C involving up to three populations. For each generated data set, we performed 50 parameter estimations using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization method implemented in , and we retained the parameters associated with the maximum likelihood. We followed 's manual specification to set reasonable upper and lower bounds of the search ranges of the parameter. In all cases, the expected SFS was estimated by extrapolating the SFS inferred from 3 grid sizes set to 40, 50 and 60, which are in all cases larger than our maximum samples sizes (30 in the IM model case). The composite likelihood was computed using 's multinomial model, which is in fact a product of Poisson likelihoods, where the expected model entries are scaled to sum up to 1. This likelihood also ignores information about the expected and observed numbers of monomorphic and polymorphic sites used in our likelihood formulation (as well as in [20] (link)). Therefore, the ratio should be equal to showing that barring the terms, the two CLs differ by a single constant value. The difference between likelihoods computed with fastsimcoal and is illustrated in Figure S3 for the case of the bottleneck scenarios shown in Figures 1A. It shows that when monomorphic sites are not taken into account, fastsimcoal and indeed produce essentially identical likelihood profiles around true parameters. However, when monomorphic sites are used in the likelihood, the shape of the likelihood profiles differs, making it more or less peaky depending on the parameter. There is thus no clear advantage in using one or the other likelihood form for this scenario, but our use of monomorphic sites allows us to directly get absolute values of the parameters. We report in Figures 2, S4 and S5 only the results obtained for data sets for which 's best log likelihood was less than 10% lower than the largest log-likelihood obtained with the other data sets, and we considered not to have converged for the discarded data sets.
Free full text: Click here