Simulated whole-genome sequencing data were generated to evaluate the sensitivity of CREST in identifying validated germline structural variations (i.e., deletions, duplications and insertions) compiled as a gold standard data set by the 1000 Genomes Project (ref. 13). NA12878 was selected because it was sequenced at high coverage using three sequencing platforms (Illumina/Solexa, Roche/454 and Life Technologies/SOLiD) and analyzed by 19 SV detection methods, 12 of which were evaluated for their sensitivity in detecting deletion polymorphisms. The gold standard data set for NA12878 consists of 642 deletions, 271 duplications and 30 insertions. We were unable to include the 30 insertions in the simulation because the inserted sequences were not accessible. Of the 913 deletion/duplication events, 309 at 138 loci are overlapping events with multiple non-reference deletion/duplication alleles. We treated these as multi-allele polymorphisms with ≥2 non-reference alleles in the population. Two haploid genomes were generated to represent the two non-reference deletion/duplication alleles in these regions. For the 26 loci that have ≥3 overlapping non-reference alleles, two alleles were randomly selected per locus, resulting in a loss of 27 events (23 deletions and 4 duplications). We simulated 100-bp paired-end reads with a mean insert size of 400 bp and a standard deviation (s.d.) of 20 bp using the software MAQ (version 0.7.1; ref. 20) and obtained 20-fold coverage of human assembly NCBI build 36 for each haploid genome. Merging the data from the two haploid genomes gave a total of 1,232,167,792 reads for the diploid genome, with a mean coverage of 40-fold. All reads were mapped to human assembly NCBI build 36 using the program BWA (ref. 10) with the default parameters.
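The reported coverage can be checked with simple arithmetic: mean coverage is the number of reads multiplied by the read length, divided by the genome size. The following minimal sketch illustrates this. The genome size of about 3.08 Gb for NCBI build 36 is an assumed approximate value, not a figure given in the text, and the function name `mean_coverage` is purely illustrative.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Values taken from the text above.
TOTAL_READS = 1_232_167_792   # reads in the merged diploid data set
READ_LENGTH = 100             # bp per read

# ASSUMPTION: approximate size of human assembly NCBI build 36 in bp.
GENOME_SIZE = 3_080_000_000

diploid_cov = mean_coverage(TOTAL_READS, READ_LENGTH, GENOME_SIZE)
haploid_cov = diploid_cov / 2  # the two haploid genomes contribute equally

print(f"diploid coverage: {diploid_cov:.1f}x")
print(f"per-haploid coverage: {haploid_cov:.1f}x")
```

Under the assumed genome size, the result is consistent with the 40-fold diploid and 20-fold per-haploid coverage stated above.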
Two sets of whole-genome simulation data were generated based on two quality models. The first is a normal-quality simulation that derives the sequencing error and quality from a training data set of 250,000 empirical reads randomly selected from our T-ALL WGS data. The second is a high-quality simulation that uses only reads with qualities in the range of 32–40 for training. We created the high-quality simulation data because the mapping rate of the normal-quality simulated WGS data is 10% lower than that of the empirical WGS data for the 10 T-ALL genomes, which ranges from 92% to 95%. By contrast, the high-quality simulation data have a mapping rate of 91%, which is close to the empirical mapping rate.
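The selection of high-quality training reads might be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: it assumes that "qualities in the range of 32–40" means every base quality of a read falls within that range, and that qualities are Phred+33 encoded. The names `phred_scores` and `keep_high_quality` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# A read is represented as (name, sequence, quality string).
FastqRecord = Tuple[str, str, str]


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores.

    ASSUMPTION: Phred+33 encoding; pass offset=64 for older Illumina data.
    """
    return [ord(ch) - offset for ch in quality]


def keep_high_quality(
    records: Iterable[FastqRecord],
    lo: int = 32,
    hi: int = 40,
    offset: int = 33,
) -> Iterator[FastqRecord]:
    """Yield only reads whose base qualities all lie within [lo, hi].

    ASSUMPTION: the per-read criterion is applied to every base; the text
    does not specify whether a per-base or per-read mean criterion was used.
    """
    for name, seq, qual in records:
        scores = phred_scores(qual, offset)
        if scores and all(lo <= q <= hi for q in scores):
            yield name, seq, qual


# Usage: 'A' encodes Phred 32 and 'I' encodes Phred 40 under Phred+33,
# while '5' encodes Phred 20, so only the first read is retained.
reads = [("r1", "ACGT", "AIIA"), ("r2", "ACGT", "A5IA")]
training_reads = list(keep_high_quality(reads))
```

The retained reads would then serve as the training set for the simulator's error and quality model.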