To determine whether the measured percent editing was significant, we implemented a null hypothesis significance testing approach using a null distribution modeled from the background noise. The null distribution is generated by trimming the first 20 bases of the sequence and removing the 20 bases of the protospacer. Additionally, bases whose total area falls within the bottom 10th percentile are removed, as small peaks are associated with poor initial primer binding and poor end extension.24 To account for variability in sequencing, the user can manually select the region used to model the null distribution in case the default trimming does not effectively remove low-quality sequence. Next, the value of every “N” trace fluorescence under every non-“N” base call (e.g., T fluorescence under A, C, or G peaks) is compiled to generate a sample of the noise distribution. The sample of the noise distribution for each base is fitted to a zero-adjusted gamma distribution (zΓ; Supplementary Fig. S1) using the package gamlss.25 We chose the zΓ distribution for three reasons: (1) it has a domain from 0 to +∞, (2) it is a continuous distribution allowing for non-integer values, and (3) it allows for a high proportion of zeros in the data, which accounted for 10% of the values in our data (Supplementary Fig. S1).25 Filliben's correlation coefficient ($R_F^2$) is calculated to assess the goodness of fit of the model to the data, where $R_F^2 = 1$ is a perfect fit. From this model, we can assign critical values using a default level of significance (α = 0.01), which the user can manually change in EditR's interface.
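The noise model described above can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not EditR's R code. Several things here are assumptions made for illustration: the function names, the shape/scale parameterization (gamlss uses a different one), the closed-form approximation to the maximum-likelihood gamma shape in place of the gamlss fit, the reading of the critical value as the upper (1 − α) quantile of the fitted zΓ distribution, and the noise values themselves, which are made up.

```python
import math

def collect_noise(calls, areas):
    """Pool, per base, the areas observed under peaks called as another base."""
    noise = {base: [] for base in "ACGT"}
    for called, peak in zip(calls, areas):
        for base in "ACGT":
            if base != called:
                noise[base].append(peak[base])
    return noise

def fit_zero_adjusted_gamma(sample):
    """Return (nu, shape, scale): P(Y = 0) plus a gamma fit to the positive values."""
    positives = [v for v in sample if v > 0]
    if len(positives) < 2:
        raise ValueError("need at least two positive noise values")
    nu = 1 - len(positives) / len(sample)  # point mass at zero
    mean = sum(positives) / len(positives)
    s = math.log(mean) - sum(math.log(v) for v in positives) / len(positives)
    if s <= 0:
        raise ValueError("positive values must not all be identical")
    # Assumption: closed-form approximation to the maximum-likelihood shape.
    shape = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    return nu, shape, mean / shape

def gamma_cdf(x, shape, scale):
    """Gamma CDF from the power series of the regularized incomplete gamma
    function; adequate for moderate x / scale (well below ~700)."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = total = 1.0 / shape
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (shape + n)
        total += term
    return min(1.0, total * math.exp(-z + shape * math.log(z) - math.lgamma(shape)))

def critical_value(nu, shape, scale, alpha=0.01):
    """Smallest y with P(Y <= y) >= 1 - alpha under the fitted model."""
    target = 1 - alpha
    if target <= nu:
        return 0.0  # the zeros alone already cover 1 - alpha
    p = (target - nu) / (1 - nu)  # quantile required from the gamma part
    lo, hi = 0.0, shape * scale
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2
    for _ in range(100):  # bisection
        mid = (lo + hi) / 2
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return hi

# Made-up noise areas for one base; 2 of the 20 values (10%) are zero.
t_noise = [0.0, 0.0, 0.4, 0.7, 1.1, 1.3, 1.8, 2.0, 2.4, 2.9,
           3.1, 3.6, 4.2, 4.8, 5.5, 6.3, 7.4, 8.8, 10.9, 14.6]
nu, shape, scale = fit_zero_adjusted_gamma(t_noise)
threshold = critical_value(nu, shape, scale, alpha=0.01)
print(f"P(0) = {nu:.2f}, shape = {shape:.2f}, scale = {scale:.2f}, "
      f"critical value = {threshold:.2f}")
```

Under this reading, a percent area above the critical value for a given base would be called significant at the chosen α. The sketch does not compute the goodness-of-fit statistic.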
EditR was written in the R statistical programming environment v3.4.0. EditR requires a sample AB1 Sanger sequencing file (i.e., from cells treated with base editor and gRNA) and a 15–24 nt character string of the edited region of interest (i.e., the gRNA protospacer). The program's parameters have default values that the user can adjust under the advanced settings if desired. The EditR web app was written with R Shiny v1.0.1 and incorporates design elements from TIDE and Poly Peak Parser.18,19 Poly Peak Parser identifies simple indel mixtures from Sanger sequencing data, while TIDE calculates the frequency and composition of complex indel mixtures.
The sample file is uploaded and read into EditR. The fluorescence area of all four bases at each base call is assigned, as measured by the software provided by the capillary electrophoresis instrument manufacturer and determined by the makeBaseCalls function of sangerseqR. The percent area of each base is calculated by dividing the area of the focal base by the summed area of all four bases and expressing the result as a percentage. The guide sequence is then aligned to the primary sequence generated from the base calls using the ends-free overlap alignment algorithm in pairwiseAlignment() with the type = "overlap" argument from the Biostrings package.26 Ends-free alignment was chosen because it aligns to a local match while remaining robust to changes in the first base of the guide, as well as to multiple base changes in the middle of the guide.
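The two computations in this step, percent area and ends-free alignment, can be sketched as follows. This is a standard-library Python illustration, not the Biostrings implementation. The function names, the linear scoring scheme (match +1, mismatch −1, gap −2), and the example sequences are all assumptions chosen for readability; Biostrings uses its own substitution and gap-penalty defaults.

```python
def percent_area(areas):
    """Convert the four peak areas at one base call into percentages."""
    total = sum(areas.values())
    if total == 0:
        return {base: 0.0 for base in areas}
    return {base: 100.0 * area / total for base, area in areas.items()}

def overlap_align(guide, read, match=1, mismatch=-1, gap=-2):
    """Ends-free (overlap) alignment: gaps at either end of either sequence
    are free. Returns (score, guide_span, read_span) with half-open spans."""
    rows, cols = len(guide) + 1, len(read) + 1
    # Row 0 and column 0 stay at 0, so leading gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if guide[i - 1] == read[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trailing gaps are free: take the best cell in the last row or column.
    ends = [(score[rows - 1][j], rows - 1, j) for j in range(cols)]
    ends += [(score[i][cols - 1], i, cols - 1) for i in range(rows)]
    best, i, j = max(ends)
    end_i, end_j = i, j
    # Trace back to the first row or column, preferring diagonal moves.
    while i > 0 and j > 0:
        step = match if guide[i - 1] == read[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_i), (j, end_j)

# Made-up example: the guide differs from the read at its first base
# and at one base in the middle.
read = "TTAGCGAT" + "GACCTGAAGCTTGGCATTCC" + "GGTAACGT"
guide = "AACCTGAAGCATGGCATTCC"
score, guide_span, read_span = overlap_align(guide, read)
print(score, guide_span, read_span, read[read_span[0]:read_span[1]])

pct = percent_area({"A": 30.0, "C": 10.0, "G": 50.0, "T": 10.0})
print(pct)
```

Because the first row and column of the scoring matrix are zero and the end point may lie anywhere on the last row or column, the guide can be placed anywhere in the read without penalty, while internal mismatches lower the score without preventing the match.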