The processing of exome-sequencing data from ref. 1 and TCGA (ref. 2) involved variant calling on matched-normal
pairs using Mutect (ref. 3). A mutation was considered if the
depth of coverage was ≥10 and at least 3 reads supported the variant. Mutations that
aligned to more than one genomic location were discarded. The WGS gastric cancers (ref. 4) were processed using VarScan2 (ref. 5), with a minimum depth of coverage of 10x and at least 3
reads supporting the variant. Non-CRCs in the TCGA had mutations called using Mutect according
to the pipeline described in ref. 6. Microsatellite
instability in the TCGA colon cancer samples was called using MSIsensor (ref. 7). Annotation was performed with ANNOVAR (ref. 8).
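The read-support criteria above (depth of coverage ≥10 and at least 3 variant-supporting reads) can be illustrated with a minimal Python sketch. The `Variant` record and its field names are hypothetical and do not model the actual output format of Mutect or VarScan2; only the two thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal record for one candidate somatic mutation."""
    depth: int       # total reads covering the site in the tumor sample
    alt_reads: int   # reads supporting the variant allele

    @property
    def vaf(self) -> float:
        # Variant allele frequency: fraction of reads carrying the variant
        return self.alt_reads / self.depth


def passes_filter(v: Variant, min_depth: int = 10, min_alt_reads: int = 3) -> bool:
    """Apply the depth and supporting-read thresholds stated in the text."""
    return v.depth >= min_depth and v.alt_reads >= min_alt_reads


# Illustrative calls (made-up numbers, not data from the study)
calls = [Variant(40, 8), Variant(9, 5), Variant(50, 2), Variant(10, 3)]
kept = [v for v in calls if passes_filter(v)]
```

In this toy input, the second call fails on depth and the third on supporting reads, so two calls are retained.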
To fit the neutral model to allele frequency data, we considered only variants with
allele frequency in the range [fmax,fmin], corresponding
to [t0,t] in equation [2]. The lower boundary fmin reflects the
limit for the reliable detection of low-frequency mutations in NGS data, which is on the
order of 10% (ref. 3). The upper boundary
fmax is necessary to filter out public mutations that were present
in the first transformed cell. In diploid tumors, clonal mutations are expected at
fmax=0.5 (mutations with 50% allelic frequency are heterozygous
public, or clonal); in triploid tumors this threshold drops to 0.33, and in
tetraploid neoplasms it drops to 0.25. For all samples we used a fit boundary of [0.12, 0.24] to
include only reliably called subclonal mutations and to account for tumor purity in the samples. All the
samples considered in this study were reported to have tumor purity ≥70% and a minimum
of 12 reliably called private mutations within the fit boundary. Once these conditions were met
in a sample, equation [7] was used to perform
the fit, as illustrated in Figures 1B and 2B. In particular, for x=1/f, equation [7] becomes a linear model with slope
μ/β and intercept –μ/(β fmax). We exploited the intercept constraint to perform a more
restrictive fit using the model
y=m(x-1/fmax)+0,
in which the line is forced through zero at x=1/fmax, so that the slope m=μ/β is the only free parameter.
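The constrained fit can be sketched in Python as a least-squares regression with zero intercept in the shifted variable x − 1/fmax. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, y is taken to be the cumulative number of mutations with frequency ≥ f inside the fit boundary, and R² is computed against the mean of y (other conventions exist for regression through the origin).

```python
def fit_neutral_model(vafs, fmin=0.12, fmax=0.24, min_mutations=12):
    """Fit y = m * (1/f - 1/fmax) by least squares; m estimates mu/beta."""
    # Keep variants inside the fit boundary, highest frequency first.
    fs = sorted((f for f in vafs if fmin <= f <= fmax), reverse=True)
    if len(fs) < min_mutations:
        raise ValueError("too few mutations within the fit boundary")

    # Shift x = 1/f so that the line is forced through the point (1/fmax, 0).
    xs = [1.0 / f - 1.0 / fmax for f in fs]
    # Assumed response: cumulative count of mutations with frequency >= f
    # (tied frequencies simply receive consecutive ranks).
    ys = [float(rank) for rank in range(1, len(fs) + 1)]

    sxx = sum(x * x for x in xs)
    if sxx == 0.0:
        raise ValueError("all frequencies equal fmax; slope is undefined")
    # Least-squares slope for a line with zero intercept in shifted x.
    slope = sum(x * y for x, y in zip(xs, ys)) / sxx

    # R^2 relative to the mean of y (an assumed definition).
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Synthetic frequencies lying exactly on the model with mu/beta = 5
# (made-up values for illustration only).
synthetic = [1.0 / (k / 5.0 + 1.0 / 0.24) for k in range(1, 21)]
slope, r2 = fit_neutral_model(synthetic)
```

Because the intercept is fixed, the slope has the closed form Σx'y / Σx'² with x' = 1/f − 1/fmax, which is why no general-purpose regression routine is needed here.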
Copy-number changes (allelic deletion or duplication) can alter the frequency of a
variant in a manner that is not described by equation [7]. We assessed the impact of copy-number alterations (CNAs) on our estimates of the
mutation rate within the TCGA colorectal cancer samples by using the paired, publicly
available segmented SNP-array data to exclude somatic mutations that fell within regions of
CNA. CNAs were identified as segments with an absolute log-R ratio >0.5, and the model fitting was
performed only on diploid regions of the genome. In the gastric cancer cohort, regions with
copy-number changes were identified using Sequenza (ref. 9) and
removed from the analysis. Mutation rates were adjusted to the size of the resulting diploid
genome. Supplementary Figures 2 and 5 demonstrate the robustness of our
analysis to copy-number changes. R² values were independent of
the mean coverage of mutations (p=0.32), the total number of mutations in the
sample (p=0.40), the mutation rate (p=0.11), or the number of
mutations within the model range (p=0.65).
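The copy-number exclusion step described above (dropping mutations in segments with an absolute log-R ratio >0.5) could be sketched in Python as follows. The `Segment` and `Mutation` records, their field names, and the choice to drop mutations not covered by any segment are assumptions made for illustration; they do not reflect the actual SNP-array or Sequenza file formats. The sketch also reports the total length of retained segments, but does not reproduce how mutation rates were adjusted, since the text does not give the formula.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical copy-number segment from segmented array data."""
    chrom: str
    start: int    # 1-based, inclusive
    end: int      # inclusive
    log_r: float  # segment log-R ratio


class Mutation(NamedTuple):
    """Hypothetical somatic mutation record."""
    chrom: str
    pos: int
    vaf: float


def is_cna(seg: Segment, threshold: float = 0.5) -> bool:
    # A segment is called a CNA when |log-R ratio| exceeds the threshold.
    return abs(seg.log_r) > threshold


def diploid_mutations(mutations: List[Mutation], segments: List[Segment],
                      threshold: float = 0.5) -> List[Mutation]:
    """Keep mutations that fall in a segment not called as a CNA."""
    kept = []
    for m in mutations:
        for s in segments:
            if s.chrom == m.chrom and s.start <= m.pos <= s.end:
                if not is_cna(s, threshold):
                    kept.append(m)
                break  # assumes segments do not overlap
        # Assumption: mutations covered by no segment are dropped.
    return kept


def diploid_length(segments: List[Segment], threshold: float = 0.5) -> int:
    """Total number of bases in segments retained as diploid."""
    return sum(s.end - s.start + 1 for s in segments if not is_cna(s, threshold))


# Made-up example data, not taken from the study
segments = [Segment("1", 1, 1000, 0.05),
            Segment("1", 1001, 2000, 0.8),
            Segment("2", 1, 500, -0.6)]
mutations = [Mutation("1", 500, 0.20),
             Mutation("1", 1500, 0.15),
             Mutation("2", 100, 0.18)]
kept = diploid_mutations(mutations, segments)
retained_bases = diploid_length(segments)
```

The linear scan over segments keeps the sketch short; for genome-scale data an interval index per chromosome would be the more usual choice.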