To understand the modeling choices underlying our new imputation algorithm, it is crucial to consider the statistical issues that arise in imputation datasets. For simplicity, we will discuss these issues in the context of Scenario A, although we will also extend them to Scenario B in the Results section. Fundamentally, imputation is very similar to phasing, so it is no surprise that most imputation algorithms are based on population genetic models that were originally used in phasing methods. The most important distinction between phasing and imputation datasets is that the latter include large proportions of systematically missing genotypes.
Large amounts of missing data greatly increase the space of possible outcomes, and most phasing algorithms are not able to explore this space efficiently enough to be useful for inference in large studies. A standard way to overcome this problem with HMMs [6],[11] is to make the approximation that, conditional on the reference panel, each study individual's multilocus genotype is independent of the genotypes for the rest of the study sample. This transforms the inference problem into a separate imputation step for each study individual, with each step involving only a small proportion of missing data since the reference panel is assumed to be missing few, if any, genotypes.
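In symbols (our notation, not the paper's), writing G_i for the multilocus genotype of study individual i and H for the reference-panel haplotypes, this approximation replaces the joint conditional distribution with a product of per-individual terms:

\[
P(G_1, \dots, G_n \mid H) \;\approx\; \prod_{i=1}^{n} P(G_i \mid H),
\]

so that each study individual can be imputed separately against the reference panel rather than jointly with the other n-1 study genotypes.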
In motivating our new imputation methodology, we pointed out that modeling the study individuals independently, rather than jointly, sacrifices phasing accuracy at typed SNPs; this led us to propose a hybrid approach that models the study haplotypes jointly at typed SNPs but independently at untyped SNPs. We made the latter choice partly to improve efficiency – it is fast to impute untyped alleles independently for different haplotypes, which allows us to use all of the information in large reference panels – but also because of the intuition that there is little to be gained from jointly modeling the study sample at untyped SNPs.
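The following is a minimal sketch of the per-haplotype imputation stage under such a hybrid scheme, assuming the study haplotypes have already been phased jointly at the typed SNPs (that step is not shown) and using a simple local haplotype-copying heuristic as a stand-in for the full HMM; all function and variable names are illustrative and are not taken from the paper's software:

```python
import numpy as np

def impute_untyped(study_hap_typed, ref_typed, ref_untyped, window=10):
    """Impute untyped alleles for one already-phased study haplotype.

    A deliberately simplified stand-in for an HMM copying model: for each
    untyped SNP, find the reference haplotype that best matches the study
    haplotype at nearby typed SNPs and copy its allele.

    study_hap_typed : (n_typed,) 0/1 alleles of one study haplotype at typed SNPs
    ref_typed       : (n_ref, n_typed) reference alleles at typed SNPs
    ref_untyped     : (n_ref, n_untyped) reference alleles at untyped SNPs
    window          : number of flanking typed SNPs used to score matches
    """
    n_ref, n_typed = ref_typed.shape
    n_untyped = ref_untyped.shape[1]
    imputed = np.empty(n_untyped, dtype=ref_untyped.dtype)
    for j in range(n_untyped):
        # Crude positional anchor for untyped SNP j among the typed SNPs;
        # real methods would use genetic-map positions instead.
        anchor = j * n_typed // n_untyped
        lo, hi = max(0, anchor - window), min(n_typed, anchor + window + 1)
        # Hamming distance to each reference haplotype over the local window.
        dists = (ref_typed[:, lo:hi] != study_hap_typed[lo:hi]).sum(axis=1)
        imputed[j] = ref_untyped[dists.argmin(), j]
    return imputed

# Because each study haplotype is handled independently, the loop over the
# study sample is embarrassingly parallel, e.g.:
# imputed_all = [impute_untyped(h, ref_typed, ref_untyped) for h in phased_study_haps]
```

Because the per-haplotype steps do not interact, this stage scales linearly in the number of study haplotypes, which is what makes it feasible to use all of the information in large reference panels.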
By contrast, the recently published BEAGLE [13] imputation approach fits a full joint model to all individuals at all SNPs. To overcome the difficulties caused by the large space of possible genotype configurations, BEAGLE initializes its model using a few ad-hoc burn-in iterations in which genotype imputation is driven primarily by the reference panel. The intuition is that this burn-in period will help the model reach a plausible part of parameter space, which can be used as a starting point for fitting a full joint model.
This alternative modeling strategy raises the question of whether, and to what extent, it is advantageous to model the study sample jointly at untyped SNPs. One argument [20] holds that there is no point in jointly modeling such SNPs because all of the linkage disequilibrium information needed to impute them is contained in the reference panel. A counterargument is that, as with any statistical missing data problem, the “correct” inference approach is to create a joint model of all observed and missing data. We have found that a full joint model may indeed improve accuracy on small, contrived imputation datasets (data not shown), and this leads us to believe that joint modeling could theoretically increase accuracy in more realistic datasets.
However, a more salient question is whether there is any useful information to be gained from jointly modeling untyped SNPs, and whether this information can be obtained with a reasonable amount of computational effort. Most imputation methods, including our new algorithm, implicitly assume that such information is not worth pursuing, whereas BEAGLE assumes that it is. We explore this question further in the sections that follow.