We downloaded short-read data for 1,057 accessions from the 1001 Genomes Project [19]. Raw paired-end reads were processed with cutadapt (v1.9) [51] to remove 3′ adapters and to trim 5′ ends at a quality threshold of 15 and 3′ ends at a quality threshold of 10 or at terminal N bases. All reads were aligned to the A. thaliana TAIR10 reference genome [52] with BWA-MEM (v0.7.8) [53]. Samtools (v0.1.18) and Sambamba (v0.6.3) were used for file format conversion, sorting and indexing [54, 55], and duplicate reads were marked with MarkDuplicates from Picard (v1.101; http://broadinstitute.github.io/picard/). Further steps were carried out with GATK (v3.4) functions [26, 56]. Local realignment around indels was performed with “RealignerTargetCreator” and “IndelRealigner,” and base recalibration with “BaseRecalibrator,” providing known indels and SNPs from The 1001 Genomes Consortium [19]. Genetic variants were called in individual samples with “HaplotypeCaller,” followed by joint genotyping of all samples as a single cohort with “GenotypeGVCFs.” Initial SNP filtering followed the variant quality score recalibration (VQSR) protocol. Briefly, a subset of ~181,000 high-quality SNPs from the RegMap panel [57] was used as the training set for “VariantRecalibrator,” with an a priori probability of 15 and a maximum of four Gaussians. Finally, only bi-allelic SNPs within a sensitivity tranche level of 99.5 were kept, for a total of 7,311,237 SNPs.
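The VQSR filtering step described above can be sketched as a small Python helper that assembles GATK 3 command lines. This is an illustrative reconstruction, not the authors' script: the jar path, all file names, the resource label `regmap`, the `known=false`/`truth=true` settings and the annotation list (`QD`, `MQ`, `FS`) are assumptions, since the text specifies only the training set, the prior of 15, the four-Gaussian maximum and the 99.5 tranche. The final `SelectVariants` call is one plausible way to keep only passing bi-allelic SNPs.

```python
from typing import List, Sequence

# Assumed location of the GATK 3 jar; adjust to the local installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def vqsr_commands(
    reference: str,
    raw_vcf: str,
    training_vcf: str,
    annotations: Sequence[str] = ("QD", "MQ", "FS"),  # assumption: not given in the text
    prior: float = 15.0,        # a priori probability stated in the text
    max_gaussians: int = 4,     # maximum number of Gaussians stated in the text
    tranche: float = 99.5,      # sensitivity tranche level stated in the text
    prefix: str = "cohort",     # hypothetical output prefix
) -> List[List[str]]:
    """Return three GATK 3 command lines: build the model, apply it, select SNPs."""
    recal = f"{prefix}.snp.recal"
    tranches = f"{prefix}.snp.tranches"
    filtered = f"{prefix}.vqsr.vcf"
    final = f"{prefix}.biallelic_snps.vcf"

    # Step 1: fit the recalibration model on the training SNPs.
    recalibrate = GATK + [
        "-T", "VariantRecalibrator",
        "-R", reference,
        "-input", raw_vcf,
        # known/truth flags are assumptions; only training use is stated.
        f"-resource:regmap,known=false,training=true,truth=true,prior={prior}",
        training_vcf,
        "-mode", "SNP",
        "--maxGaussians", str(max_gaussians),
        "-tranche", str(tranche),
        "-recalFile", recal,
        "-tranchesFile", tranches,
    ]
    for name in annotations:
        recalibrate += ["-an", name]

    # Step 2: flag variants that fall outside the chosen sensitivity tranche.
    apply_recal = GATK + [
        "-T", "ApplyRecalibration",
        "-R", reference,
        "-input", raw_vcf,
        "-mode", "SNP",
        "--ts_filter_level", str(tranche),
        "-recalFile", recal,
        "-tranchesFile", tranches,
        "-o", filtered,
    ]

    # Step 3: keep only unfiltered, bi-allelic SNPs.
    select = GATK + [
        "-T", "SelectVariants",
        "-R", reference,
        "-V", filtered,
        "-selectType", "SNP",
        "-restrictAllelesTo", "BIALLELIC",
        "--excludeFiltered",
        "-o", final,
    ]
    return [recalibrate, apply_recal, select]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the generated commands.
    for cmd in vqsr_commands("TAIR10.fa", "cohort.raw.vcf", "regmap_snps.vcf"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the commands easy to inspect and lets them be passed directly to `subprocess.run` without quoting issues.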