The Michigan Imputation Server implements the complete genotype imputation workflow using the MapReduce programming model for efficient parallelization of computationally intensive tasks. We use the open-source framework Hadoop to implement all workflow steps. Maintenance of the server, including node configuration (for example, the number of parallel tasks and the memory per chunk) and monitoring of all nodes, is handled with Cloudera Manager. During cluster initialization, reference panels, genetic maps, and software packages are distributed across all cluster nodes using the Hadoop Distributed File System (HDFS). The imputation workflow itself consists of two steps: first, we divide the data into non-overlapping chunks (here, chromosome segments of 20 Mb); second, we run an analysis (here, quality control or phasing and imputation) in parallel across chunks. To avoid edge effects, a flanking region of 5 Mb for phasing and 500 kb for imputation is added to each chunk. Finally, all results are combined into an aggregate final output.
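For illustration, the chunking logic can be sketched in Java as follows. This is a minimal, hypothetical example, not taken from the server's source code: the class and field names and the chromosome length are ours, and it simply splits a chromosome into 20-Mb core chunks while recording the flanking window (5 Mb for phasing, 500 kb for imputation) around each one.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the chunking step described above; names and layout
 * are hypothetical and not taken from the imputation server source code.
 */
public class Chunker {

    // A chunk is its non-overlapping core interval plus a flanking window
    // that is only used to stabilize phasing/imputation at the chunk edges.
    public static class Chunk {
        final String chromosome;
        final long start;        // first base of the core region
        final long end;          // last base of the core region
        final long flankedStart; // core start minus the flank (never below 1)
        final long flankedEnd;   // core end plus the flank

        Chunk(String chromosome, long start, long end, long flank) {
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
            this.flankedStart = Math.max(1, start - flank);
            this.flankedEnd = end + flank;
        }
    }

    /** Splits a chromosome of the given length into fixed-size core chunks. */
    public static List<Chunk> split(String chromosome, long chromosomeLength,
                                    long chunkSize, long flank) {
        List<Chunk> chunks = new ArrayList<>();
        for (long start = 1; start <= chromosomeLength; start += chunkSize) {
            long end = Math.min(start + chunkSize - 1, chromosomeLength);
            chunks.add(new Chunk(chromosome, start, end, flank));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 20-Mb chunks with a 5-Mb flank as used for phasing in the text;
        // for imputation the flank would be 500 kb instead. The chromosome
        // length is an approximate value for chromosome 20 (GRCh37).
        List<Chunk> chunks = split("chr20", 63_025_520L, 20_000_000L, 5_000_000L);
        for (Chunk c : chunks) {
            System.out.printf("%s core [%d-%d] flanked [%d-%d]%n",
                    c.chromosome, c.start, c.end, c.flankedStart, c.flankedEnd);
        }
    }
}
```

Because the core intervals do not overlap, each chunk can be handed to an independent map task; the flanking regions only provide context at the chunk boundaries.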
Genotype imputation maps well onto MapReduce because the computationally expensive whole-genome calculations can be split into independent chromosome segments. Our imputation server accepts phased and unphased GWAS genotypes in VCF format. File-format checks and initial statistics (number of individuals and SNVs, detected chromosomes, phased or unphased data, and number of chunks) are generated during the preprocessing step. The submitted genotypes are then compared to the reference panel to ensure that alleles, allele frequencies, strand orientation, and variant coding are correct. In this first MapReduce analysis, the map function calculates the VCF statistics for each file chunk, and the reducer summarizes the results and forwards only chunks that pass quality control to the subsequent imputation step (Supplementary Fig. 2). The MapReduce imputation step is a map-only job: no reducer is applied, and each mapper imputes genotypes with minimac3 on its previously generated chunk. If the user has uploaded unphased genotypes, the data are prephased with one of the available phasing engines: Eagle2, HAPI-UR (ref. 34), or SHAPEIT (ref. 17). A post-processing step generates a compressed and indexed VCF file (using bgzip and tabix; ref. 35) for each imputed chromosome. To minimize input/output load, the reference panel is distributed across the available nodes of the cluster using Hadoop's distributed cache. To ensure data security, imputation results are encrypted on the fly with a one-time password. All result files and reports can be viewed or downloaded via the web interface.
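The shape of such a map-only job with a distributed-cache reference panel can be sketched with the standard Hadoop 2 MapReduce API as below. The HDFS paths, the record layout of the chunk list, and the mapper body are assumptions made for illustration only; the sketch does not reproduce the server's actual code or the way it invokes minimac3.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Sketch of a map-only Hadoop job in the spirit of the imputation step:
 * no reducer is configured, each mapper handles one chunk record, and the
 * reference panel is shipped to the worker nodes via the distributed cache.
 */
public class ImputationJobSketch {

    /** One map call per input record; each record is assumed to describe a chunk. */
    public static class ChunkMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Files registered with the distributed cache are visible to every task.
            URI[] cachedReference = context.getCacheFiles();
            int cachedFiles = (cachedReference == null) ? 0 : cachedReference.length;

            String chunkDescription = value.toString();
            // A real mapper would now run phasing/minimac3 on the flanked chunk and
            // write the imputed VCF back to HDFS; here we only emit a status record.
            context.write(new Text(chunkDescription),
                    new Text("processed with " + cachedFiles + " cached reference file(s)"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "imputation-map-only-sketch");
        job.setJarByClass(ImputationJobSketch.class);

        job.setMapperClass(ChunkMapper.class);
        job.setNumReduceTasks(0);              // map-only: mapper output is the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Hypothetical HDFS locations for the chunk list, results, and reference panel.
        FileInputFormat.addInputPath(job, new Path("/user/imputation/chunks"));
        FileOutputFormat.setOutputPath(job, new Path("/user/imputation/results"));
        job.addCacheFile(new URI("/ref-panels/chr20.m3vcf#reference"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting the number of reducers to zero makes Hadoop skip the shuffle and sort phase entirely, so each mapper writes its output directly to the output directory; the "#reference" fragment on the cache URI is standard Hadoop syntax for the symlink name under which the cached file appears in each task's working directory.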
The imputation server workflow has been integrated into Cloudgene (ref. 24) to provide a graphical user interface. Cloudgene is a high-level workflow system for Apache Hadoop designed as a web application using Bootstrap, CanJS, and jQuery. On the server side, all necessary resources are implemented in Java using the RESTful web framework Restlet. The Cloudgene API provides methods for executing and monitoring MapReduce jobs and can be seen as an additional layer between Hadoop and the client. The imputation server is integrated into Cloudgene through its workflow definition language and plugin interface. On the basis of the workflow information, Cloudgene automatically renders a web form for all required parameters so that individual jobs can be submitted to the Cloudgene server. The server communicates with the Hadoop cluster and receives feedback from currently executing jobs. Client and server communicate via asynchronous HTTP requests (AJAX) with JSON as the interchange format. All transmissions between server and client are encrypted using SSL (Secure Sockets Layer).
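As a rough illustration of such a REST layer, the following Restlet sketch exposes a hypothetical job-status resource that returns JSON over HTTP. The URI template, the JSON fields, and the placeholder job state are invented for this example and are not the actual Cloudgene API.

```java
import org.restlet.Application;
import org.restlet.Component;
import org.restlet.Restlet;
import org.restlet.data.Protocol;
import org.restlet.resource.Get;
import org.restlet.resource.ServerResource;
import org.restlet.routing.Router;

/**
 * Minimal Restlet sketch of a job-monitoring endpoint of the kind the text
 * describes; the real Cloudgene API is considerably richer.
 */
public class JobApiSketch extends Application {

    /** GET /jobs/{jobId}/status -> small JSON document with the job state. */
    public static class JobStatusResource extends ServerResource {

        @Get("json")
        public String represent() {
            String jobId = (String) getRequestAttributes().get("jobId");
            // In a real server the state would come from the running Hadoop job;
            // here we return a fixed placeholder value.
            return String.format("{\"id\": \"%s\", \"state\": \"RUNNING\"}", jobId);
        }
    }

    @Override
    public Restlet createInboundRoot() {
        Router router = new Router(getContext());
        router.attach("/jobs/{jobId}/status", JobStatusResource.class);
        return router;
    }

    public static void main(String[] args) throws Exception {
        // Plain HTTP for the sketch; a production deployment would sit behind SSL/TLS.
        Component component = new Component();
        component.getServers().add(Protocol.HTTP, 8182);
        component.getDefaultHost().attach(new JobApiSketch());
        component.start();
    }
}
```

A browser client would poll such an endpoint with AJAX and render the returned JSON, which mirrors the asynchronous request/response pattern described above.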