We collected GBS data from 1995 accessions of the genus Malus held in the US Department of Agriculture apple germplasm repository in Geneva, NY. The samples were processed with two different restriction enzymes (ApeKI, PstI/EcoT22I) in separate GBS libraries and were sequenced using Illumina HiSeq 2000 technology. Genotypes were called using a custom GBS pipeline described in Gardner et al. (2014). Briefly, 100-bp reads generated from both enzymes were aligned to the Malus domestica reference genome version 1.0 (Velasco et al. 2010) using the default parameters in BWA (Li and Durbin 2009). Genotypes were called using GATK (McKenna et al. 2010) with a minimum of eight reads supporting each genotype. The final genotype matrix was filtered to contain only samples from the domesticated apple, Malus domestica, and ≤20% missing data per SNP and per sample. SNPs with a minor allele frequency (MAF) of <0.01 were then discarded. Finally, the data were pruned to exclude clonal relationships: if two or more samples had identity by descent (IBD) >0.9, they were considered clones and the sample with the least missing data from the group was retained. This resulted in a dataset of 711 samples and 8404 SNPs.
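The missing-data and MAF thresholds described above can be illustrated with a minimal NumPy sketch. It is not the pipeline used in the study: the function name `filter_genotypes`, the samples × SNPs layout, and the 0/1/2 alternate-allele coding with `NaN` for missing calls are all assumptions made for illustration. The sketch also assumes that per-SNP and per-sample missingness are evaluated on the same matrix, since the text does not state an order for the apple data. Clonal pruning is omitted because it depends on externally computed IBD estimates.

```python
import numpy as np

def filter_genotypes(G, max_missing=0.20, min_maf=0.01):
    """Filter a samples x SNPs matrix coded 0/1/2 (alt-allele count), NaN = missing.

    Hypothetical helper: keeps SNPs and samples with <= max_missing missing calls
    (both evaluated on the input matrix), then drops SNPs with MAF < min_maf.
    """
    G = np.asarray(G, dtype=float)
    missing = np.isnan(G)
    keep_snps = missing.mean(axis=0) <= max_missing      # fraction missing per SNP
    keep_samples = missing.mean(axis=1) <= max_missing   # fraction missing per sample
    G = G[keep_samples][:, keep_snps]

    # Alt-allele frequency from observed calls only; MAF is the rarer allele.
    alt_freq = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return G[:, maf >= min_maf]

# Toy example: 6 samples x 4 SNPs.
toy = np.array([
    [0, 1, 2, 0],
    [0, 1, np.nan, 0],   # 25% missing: this sample is dropped
    [1, 2, 2, 0],
    [0, 0, 2, 0],
    [2, 1, 2, 0],
    [1, 1, 2, 0],
])
filtered = filter_genotypes(toy)
print(filtered.shape)
```

In the toy matrix the last two SNPs are monomorphic among the retained samples, so only the first two columns survive the MAF filter.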
To test the accuracy of our imputation method we created a “masked” dataset by setting 10,000 random genotypes to missing. This created “truth known” genotypes to which our imputed genotype calls were compared. We limited our testing to 10,000 masked genotypes, which represents 0.17% of the genotype matrix, in order to maintain a dataset with a reasonable amount of missing data while providing enough masked genotypes to be able to estimate imputation accuracy.
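As a rough illustration of the masking procedure, the sketch below hides a fixed number of genotypes and scores imputed calls against the hidden values. The helper names `mask_genotypes` and `imputation_accuracy`, the 0/1/2 coding with `NaN` for missing, and the choice to draw masked positions only from observed genotypes are assumptions; the text says only that 10,000 random genotypes were set to missing. Accuracy here is the simple proportion of masked genotypes that are recovered exactly.

```python
import numpy as np

def mask_genotypes(G, n_mask=10_000, seed=0):
    """Set n_mask randomly chosen observed genotypes to NaN.

    Returns the masked matrix, the flat indices that were masked,
    and the true genotypes at those positions.
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    observed = np.flatnonzero(~np.isnan(G))          # flat indices of non-missing calls
    chosen = rng.choice(observed, size=n_mask, replace=False)
    truth = G.flat[chosen].copy()
    masked = G.copy()
    masked.flat[chosen] = np.nan
    return masked, chosen, truth

def imputation_accuracy(imputed, chosen, truth):
    """Proportion of masked genotypes whose imputed call equals the truth."""
    imputed = np.asarray(imputed, dtype=float)
    return float(np.mean(imputed.flat[chosen] == truth))
```

For the apple matrix of 711 samples and 8404 SNPs there are 711 × 8404 = 5,975,244 genotypes, so 10,000 masked genotypes correspond to roughly 0.17% of the matrix, consistent with the figure quoted above.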
Biased allele frequency in imputed data has been shown to affect downstream analyses (Han et al. 2014). To determine how well each imputation method estimates allele frequencies, we filtered the genotype matrix to contain no missing data. This resulted in a matrix containing 1001 SNPs from 459 samples (Figure S2). We masked and then imputed 20% (91,952 genotypes) of the genotypes at random and compared the allele frequency estimates from the imputed data to the allele frequency estimates from the complete genotype matrix. As most imputation methods make use of other SNPs to aid imputation, we imputed using all 8404 SNPs in the dataset so as to provide more information to these methods. We then restricted our analysis to the 1001 complete SNPs.
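The allele-frequency comparison can be sketched as a per-SNP difference between the frequency estimated from imputed data and the frequency in the complete matrix. The function names and the 0/1/2 alternate-allele coding are assumptions, and the text does not specify which allele or which summary statistic was used, so this should be read as one plausible way to quantify the bias rather than the analysis actually performed.

```python
import numpy as np

def alt_allele_freq(G):
    """Per-SNP alt-allele frequency for a samples x SNPs matrix coded 0/1/2."""
    return np.nanmean(np.asarray(G, dtype=float), axis=0) / 2.0

def allele_freq_bias(complete, imputed):
    """Imputed minus true allele frequency, one value per SNP."""
    return alt_allele_freq(imputed) - alt_allele_freq(complete)

# Toy example: the last sample's genotype at SNP 0 is imputed as 0 instead of 2.
complete = np.array([[0, 2], [2, 2], [0, 0], [2, 0]])
imputed = np.array([[0, 2], [2, 2], [0, 0], [0, 0]])
bias = allele_freq_bias(complete, imputed)
```

A negative value means the imputation method underestimates the frequency of the alternate allele at that SNP.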
We also tested the performance of our method on genome-wide SNP data from maize and grape. The maize data were downloaded from the International Maize and Wheat Improvement Center (Hearne et al. 2014). We reduced the data to biallelic SNPs with <20% missing data and a MAF >1% and then discarded samples with >20% missing data. This resulted in 43,696 SNPs from 4300 samples.
To generate the grape dataset, we collected GBS data from diverse samples of the genus Vitis, including commercial Vitis vinifera varieties, hybrids, and wild accessions from the USDA grape germplasm collection. The samples were processed with two different restriction enzymes (HindIII/BfaI, HindIII/MseI) and were sequenced using Illumina HiSeq 2000 technology. We then used the 12X grape reference genome (Jaillon et al. 2007; Adam-Blondon et al. 2011) and the Tassel/BWA version 4 pipeline to generate a genotype matrix (Li and Durbin 2009; Glaubitz et al. 2014). Default parameters were used at each stage except for the SNP output stage, where we filtered for biallelic SNPs. We then removed any genotypes with fewer than eight supporting reads using vcftools (Danecek et al. 2011). Using PLINK (Purcell et al. 2007), we removed SNPs with >20% missing data before removing samples with >20% missing data. We then removed SNPs with excess heterozygosity (those that failed a Hardy−Weinberg equilibrium test with a p-value < 0.001) and finally SNPs with a MAF < 0.01. This created a dataset of 8506 SNPs and 77 samples.
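The order of the depth and missingness filters matters, because removing poorly covered SNPs first lowers the missing rate of every sample. The sketch below mimics that order on in-memory arrays. It is not a substitute for the vcftools and PLINK steps described above: the function name `sequential_missing_filter`, the separate `depth` array, and the 0/1/2 coding with `NaN` for missing are assumptions made for illustration. The Hardy−Weinberg and MAF filters are left out.

```python
import numpy as np

def sequential_missing_filter(G, depth, min_depth=8, max_missing=0.20):
    """Depth-mask genotypes, then filter SNPs, then filter samples.

    G and depth are samples x SNPs arrays; genotypes supported by fewer than
    min_depth reads are set to NaN before missingness is computed.
    """
    G = np.array(G, dtype=float)                 # copy so the input is not modified
    G[np.asarray(depth) < min_depth] = np.nan

    keep_snps = np.isnan(G).mean(axis=0) <= max_missing
    G = G[:, keep_snps]                          # SNP filter first

    keep_samples = np.isnan(G).mean(axis=1) <= max_missing
    return G[keep_samples], keep_snps, keep_samples

# Toy example: the third SNP has low depth in every sample.
geno = np.ones((3, 3))
depth = np.array([[10, 10, 5], [10, 10, 5], [10, 10, 5]])
kept, keep_snps, keep_samples = sequential_missing_filter(geno, depth)
```

In this toy case all three samples are retained because the low-depth SNP is removed first. Had samples been filtered before SNPs, each would have shown one missing call out of three and been discarded.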