We sought to build an accurate model of de novo mutation for each gene. To do so, we extended a previous sequence context-based model of de novo mutation to derive gene-specific probabilities of mutation for each of the following mutation types: synonymous, missense, nonsense, essential splice site, and frameshift [3]. In brief, we used the local sequence context to determine the probability of each base in the coding region mutating to each other possible base, and then determined the coding impact of each possible mutation. These probabilities were summed across all bases of a gene to give a per-gene probability of mutation for each of these mutation types (see Supplementary Note for more details). Here, we applied the method to exons and immediately flanking essential splice sites, but note that the framework is also applicable to non-genic sequences. When fitting the expected rates of mutation to observed data, we added one term for local primate divergence across 1 Mb (to capture additional unmeasured sources of regional mutational variability) and another for the average sequencing depth at each nucleotide (to capture inefficiency of variant discovery at lower sequencing depths); both terms significantly improved the fit of the model to observed data (details in Supplementary Note). We also investigated a regional replication timing term [22], but found no evidence that it significantly improved the model (Supplementary Note).
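The per-gene summation described above can be illustrated with a minimal Python sketch. It is not the published implementation: the function names, the single-base flanks, and the `context_rate` callable (standing in for a trinucleotide mutation rate table) are assumptions made for illustration. The sketch covers only synonymous, missense, and nonsense changes; essential splice site and frameshift probabilities, and the divergence and depth terms, are left out.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(ref_codon, alt_codon):
    """Return the coding impact of changing ref_codon into alt_codon."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    # Simplification: any other amino acid change (including loss of a
    # stop codon) is counted as missense in this sketch.
    return "missense"


def gene_mutation_probabilities(cds, left_flank, right_flank, context_rate):
    """Sum per-base mutation probabilities over one coding sequence.

    cds          -- coding sequence, length a multiple of 3 (hypothetical input)
    left_flank   -- the single base preceding the coding sequence
    right_flank  -- the single base following the coding sequence
    context_rate -- assumed callable (trinucleotide, alt_base) -> probability
    """
    padded = left_flank + cds + right_flank
    totals = {"synonymous": 0.0, "missense": 0.0, "nonsense": 0.0}
    for i, ref in enumerate(cds):
        context = padded[i:i + 3]          # base i with one neighbour per side
        offset = i % 3                     # position of base i within its codon
        ref_codon = cds[i - offset:i - offset + 3]
        for alt in BASES:
            if alt == ref:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            totals[classify(ref_codon, alt_codon)] += context_rate(context, alt)
    return totals
```

Calling `gene_mutation_probabilities` with a real rate table in place of `context_rate` would give one probability per mutation type for the gene; with a constant rate the totals reduce to counts of possible changes in each class.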
To evaluate the predictive value of the model of de novo coding mutations, we extracted synonymous variants that were seen 10 times or fewer in the 6,503 individuals in the NHLBI’s Exome Sequencing Project (ESP) and compared the number of these rare variants in each gene to (1) the length of the gene and (2) the probability of a synonymous mutation for that gene as determined by our model. While gene length alone showed a high correlation (0.880), our full model showed a significantly greater correlation (0.940, p < 10⁻¹⁶). Of note, the stochastic variability of counts from NHLBI ESP is such that, if the model were perfect, the correlation to any instance of these data would be 0.975, indicating that little additional gene-to-gene variability remains to be explained. The relative rates of different types of coding mutations were quite similar to those from previous work based on primate substitutions [23]. With this calibrated model of relative mutability, we determined the absolute expected mutation rate per gene by applying a genome-wide mutation rate of 1.2 × 10⁻⁸ per base pair per generation (Supplementary Note) [24, 25].
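The reasoning behind a correlation ceiling below 1 can be sketched in a few lines of Python. This is an illustration under a stated assumption, not the calculation used in the paper: if each gene's observed count were a Poisson draw whose mean equals the model's expected count, the expected correlation between expected and observed counts would be sqrt(Var(λ) / (Var(λ) + mean(λ))), where λ denotes the expected counts. The function names and the Poisson assumption are introduced here for illustration only.

```python
import math
from statistics import fmean, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def poisson_ceiling(expected):
    """Correlation a perfect model would reach under Poisson count noise.

    Assumes observed counts are Poisson with means given by `expected`,
    so Var(observed) = Var(expected) + mean(expected) and
    Cov(expected, observed) = Var(expected).
    """
    var_lam = pvariance(expected)
    return math.sqrt(var_lam / (var_lam + fmean(expected)))
```

Under these assumptions, `pearson(expected, observed)` can be compared against `poisson_ceiling(expected)` to judge how much gene-to-gene variability a model leaves unexplained.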