DNA was extracted directly from blood samples taken from patients at admission, after leukocyte depletion to minimize contamination from human DNA. Leukocyte depletion was achieved by CF11 filtration for most samples10 or alternatively by Lymphoprep density gradient centrifugation (Axis-Shield) followed by Plasmodipur filtration (Euro-Diagnostica)36 or by Plasmodipur filtration alone. Genomic DNA was extracted using the QIAamp DNA Blood Midi or Maxi kit (Qiagen), and the quantities of human and Plasmodium DNA were determined by fluorescence analysis using a Qubit instrument (Invitrogen) and multispecies quantitative PCR (qPCR) using the Roche LightCycler 480 II system, as described previously11. Samples with >50 ng of DNA and <80% human DNA contamination were selected for sequencing on the Illumina HiSeq platform following the manufacturer's standard protocols37. Paired-end sequencing reads of 200–300 bp in length were obtained, generating approximately 1 Gb of read data per sample.
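The selection rule for sequencing can be written as a simple predicate. The sketch below is illustrative only: the `SampleQuant` record and its field names are hypothetical, and it assumes the human DNA share is stored as a fraction between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class SampleQuant:
    """Hypothetical record of the quantification results for one sample."""
    sample_id: str
    total_dna_ng: float    # total DNA quantity in ng
    human_fraction: float  # share of the DNA that is human, 0.0 to 1.0


def selected_for_sequencing(sample, min_dna_ng=50.0, max_human_fraction=0.80):
    """Apply the stated thresholds: >50 ng of DNA and <80% human DNA."""
    return (sample.total_dna_ng > min_dna_ng
            and sample.human_fraction < max_human_fraction)


samples = [
    SampleQuant("S1", 120.0, 0.35),  # passes both thresholds
    SampleQuant("S2", 40.0, 0.10),   # too little DNA
    SampleQuant("S3", 200.0, 0.92),  # too much human DNA
]
selected = [s.sample_id for s in samples if selected_for_sequencing(s)]
```

Both comparisons are strict, matching the ">" and "<" in the text, so a sample with exactly 50 ng or exactly 80% human DNA is not selected.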
Polymorphism discovery, quality control and sample genotyping followed a process described elsewhere11. Short sequence reads from 3,281 P. falciparum samples included in the MalariaGEN Plasmodium falciparum Community Project were aligned against the P. falciparum 3D7 reference sequence V3 using the bwa program38 as previously described11, to identify an initial global set of 3,373,632 potential SNPs. This list was then used to guide stringent realignment using the SNP-o-matic algorithm39, to reduce misalignment errors. Stringent alignments were then examined by a series of quality filters, with the aim of removing alignment artifacts and their sources. In particular, the following were removed: (i) noncoding SNPs; (ii) SNPs where polymorphisms had extremely low support (<10 reads in 1 sample); (iii) SNPs with more than 2 alleles, with the exception of loci known to be important for drug resistance, which were manually verified to not have artifacts; (iv) SNPs where coverage across samples was lower than the 25th percentile or higher than the 95th percentile of coverage in coding SNPs (these thresholds were determined from an analysis of artifact incidence); (v) SNPs located in regions of relatively low uniqueness11; (vi) SNPs where heterozygosity levels were found to be inconsistent with the heterozygosity distribution at the SNP's allele frequency; and (vii) SNPs where the genotype could not be established in at least 70% of samples. These analyses produced a final list of 681,587 high-quality SNPs in the 14 chromosomes of the nuclear genome, whose genotypes were used for analysis in this study.
All samples were genotyped at each high-quality SNP by a single allele, on the basis of the number of reads observed for the two alleles at that position in the sample. At positions with fewer than five reads, the genotype was set to undetermined (no call was made). At all other positions, the sample was determined to be heterozygous if both alleles were each observed in more than two reads; otherwise, the sample was called as homozygous for the allele observed in the majority of reads. For the purposes of estimating allele frequencies and genetic distances, a within-sample allele frequency (fw) was also assigned to each valid call. For heterozygous calls, fw was estimated as the ratio of the non-reference read count to the reference read count; homozygous calls were assigned fw = 0 when called with the reference allele and fw = 1 when called with the non-reference allele.
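The calling rules above can be sketched as a single function over the two read counts. This is an illustration under stated assumptions, not the study's code: the function name and return labels are hypothetical, and fw for heterozygous calls is computed here as the non-reference share of all reads at the position, an interpretation chosen so that fw stays between the homozygous values 0 and 1.

```python
def call_genotype(ref_reads, nonref_reads, min_coverage=5, min_het_reads=2):
    """Return (call, fw) for one sample at one biallelic SNP.

    call is "ref", "nonref", "het", or None when no call is made.
    """
    total = ref_reads + nonref_reads
    if total < min_coverage:
        return None, None  # fewer than five reads: undetermined

    # Heterozygous if both alleles are each seen in more than two reads.
    if ref_reads > min_het_reads and nonref_reads > min_het_reads:
        # Assumption: fw is the non-reference fraction of all reads.
        return "het", nonref_reads / total

    # Otherwise homozygous for the majority allele. A tie cannot occur here:
    # with at least five reads and one allele at two reads or fewer, the
    # other allele has at least three.
    if nonref_reads > ref_reads:
        return "nonref", 1.0
    return "ref", 0.0


calls = [call_genotype(r, n) for r, n in [(2, 2), (10, 2), (1, 9), (3, 9)]]
```

With these inputs, the first position has too few reads for a call, the second and third are homozygous, and the fourth is heterozygous with fw = 0.75.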
For specific analyses that required no genotype missingness in our data set, we produced a set of genotypes where missing calls (with coverage <5 reads) were assigned a genotype by simple imputation. First, we considered missing calls where the two flanking positions (on each side) had valid genotypes, imputing with the allele that most frequently appeared at the same position between the same flanking alleles in the full sample set. Finally, any remaining missing genotypes were assigned the most common allele at that position in the sample's population.
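The two imputation passes can be sketched as follows. This is a minimal illustration under several assumptions: "flanking positions" is read as the immediately adjacent SNP on each side, alleles are coded as integers with `None` for a missing call, counts are taken from the original (unimputed) calls, and the names `impute`, `genotypes` and `population` are hypothetical.

```python
from collections import Counter


def most_common(alleles):
    """Most frequent non-missing allele, or None if there is none."""
    counts = Counter(a for a in alleles if a is not None)
    # Ties resolve to the allele encountered first.
    return counts.most_common(1)[0][0] if counts else None


def impute(genotypes, population):
    """genotypes: sample -> list of alleles (None = missing);
    population: sample -> population label."""
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]])
    imputed = {s: list(g) for s, g in genotypes.items()}
    for s in samples:
        g = genotypes[s]
        for i in range(n_snps):
            if g[i] is not None:
                continue
            allele = None
            # Pass 1: both neighbours called, so use the allele most often
            # seen between the same flanking alleles across all samples.
            if 0 < i < n_snps - 1 and g[i - 1] is not None and g[i + 1] is not None:
                allele = most_common(
                    genotypes[t][i] for t in samples
                    if genotypes[t][i - 1] == g[i - 1]
                    and genotypes[t][i + 1] == g[i + 1])
            # Pass 2: fall back to the most common allele in the population.
            if allele is None:
                allele = most_common(
                    genotypes[t][i] for t in samples
                    if population[t] == population[s])
            imputed[s][i] = allele
    return imputed


genotypes = {"A": [0, None, 1], "B": [0, 1, 1], "C": [0, 1, 1],
             "D": [1, 0, 0], "E": [None, 0, 0]}
population = {"A": "p1", "B": "p1", "C": "p1", "D": "p2", "E": "p2"}
result = impute(genotypes, population)
```

In this toy data set, sample A is filled from its flanking context (samples B and C carry allele 1 between the same neighbours), while sample E has no left neighbour and so takes the most common allele in population p2.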