In all examined samples, normal DNA from the same individual had been sequenced to establish the somatic origin of variants. Before analysis, extensive filtering was performed to remove any residual germline mutations and technology-specific sequencing artifacts. Germline mutations were removed from the lists of reported mutations using the complete lists of germline mutations from dbSNP (ref. 60), the 1000 Genomes Project (ref. 61), the NHLBI GO Exome Sequencing Project (ref. 62), and the 69 Complete Genomics panel (http://www.completegenomics.com/public-data/69-Genomes/). Technology-specific sequencing artifacts were removed using panels of BAM files from unmatched normal tissues, comprising more than 120 normal genomes and 500 normal exomes. Any somatic mutation present in at least three well-mapping reads in at least two normal BAM files was discarded. The remaining somatic mutations were used to generate a mutational catalog for every sample. The prevalence of somatic mutations was estimated after all filtering, based on a haploid human genome. In exomes, prevalence was calculated from the mutations identified in protein coding genes, assuming that an average exome has 30 megabases of protein coding sequence with sufficient coverage. In whole genomes, prevalence was calculated from all identified mutations, assuming that an average whole genome has 2.8 gigabases with sufficient coverage.
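The filtering rule and the prevalence calculation described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The mutation key `(chrom, pos, ref, alt)`, the function names, and the representation of the normal panel as a list of per-BAM supporting-read counts are all hypothetical. Expressing prevalence as mutations per megabase is also an assumption about the unit. Only the thresholds (three reads, two normal BAM files) and the callable sizes (30 Mb, 2.8 Gb) come from the text.

```python
# Callable sequence sizes stated in the text, in megabases.
EXOME_MB = 30.0
GENOME_MB = 2800.0


def is_artifact(read_counts_in_normals, min_reads=3, min_normals=2):
    """Return True if the mutation is seen in enough unmatched normals.

    read_counts_in_normals: one count of well-mapping supporting reads
    per normal BAM file (hypothetical input layout).
    """
    supporting = sum(1 for n in read_counts_in_normals if n >= min_reads)
    return supporting >= min_normals


def filter_mutations(mutations, germline_keys, normal_support):
    """Drop known germline variants and panel-of-normals artifacts.

    mutations: list of (chrom, pos, ref, alt) tuples.
    germline_keys: set of such tuples pooled from the germline resources.
    normal_support: dict mapping a tuple to its per-normal read counts.
    """
    kept = []
    for mutation in mutations:
        if mutation in germline_keys:
            continue  # residual germline variant
        if is_artifact(normal_support.get(mutation, ())):
            continue  # recurrent in normals, likely a sequencing artifact
        kept.append(mutation)
    return kept


def mutations_per_mb(n_mutations, is_exome):
    """Prevalence as mutations per megabase of sufficiently covered sequence."""
    return n_mutations / (EXOME_MB if is_exome else GENOME_MB)
```

For example, a mutation with supporting-read counts `[3, 0, 5]` across three normal BAM files would be discarded, whereas one with counts `[2, 2, 2, 10]` would be kept, because only one normal reaches the three-read threshold.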
The immediate 5′ and 3′ sequence context was extracted using the ENSEMBL Core APIs for human genome build GRCh37. Curated somatic mutations that had originally been mapped to an older version of the human genome were re-mapped using UCSC’s freely available Lift Genome Annotations (LiftOver) tool; any somatic mutations with ambiguous or missing mappings were discarded. Dinucleotide substitutions were identified when two substitutions were present at consecutive bases on the same chromosome (sequence context was ignored). The immediate 5′ and 3′ sequence context of all indels was examined, and those present at mono/polynucleotide repeats or microhomologies were included in the analyzed mutational catalogs as their respective types. Strand bias catalogs were derived for each sample using only substitutions identified in the transcribed regions of well-annotated protein coding genes. Genomic regions of bidirectional transcription were excluded from the strand bias analysis.
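The rule for identifying dinucleotide substitutions can be sketched as follows. This is an illustrative Python example and not the code used in this study. The tuple layout `(chrom, pos, ref, alt)` and the function name are hypothetical. Two behaviours are assumptions the text does not specify: pairing is greedy, so in a run of three consecutive substitutions the first two are merged and the third stays a single substitution, and each position is assumed to appear at most once per sample.

```python
def merge_dinucleotides(substitutions):
    """Split single-base substitutions into singles and doublets.

    substitutions: list of (chrom, pos, ref, alt) tuples for one sample.
    Returns (singles, doublets); each doublet is reported at the position
    of its first base with two-base ref and alt alleles.
    """
    # Sort so that neighbouring genomic positions are adjacent in the list.
    ordered = sorted(substitutions, key=lambda s: (s[0], s[1]))
    singles, doublets = [], []
    i = 0
    while i < len(ordered):
        cur = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        # Consecutive bases on the same chromosome form a dinucleotide
        # substitution; flanking sequence context is ignored.
        if nxt is not None and nxt[0] == cur[0] and nxt[1] == cur[1] + 1:
            doublets.append((cur[0], cur[1], cur[2] + nxt[2], cur[3] + nxt[3]))
            i += 2
        else:
            singles.append(cur)
            i += 1
    return singles, doublets
```

Under these assumptions, C>T substitutions at positions 100 and 101 of the same chromosome would be reported as one CC>TT doublet at position 100, while substitutions at the same coordinate on different chromosomes are never merged.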