Every replicate, both treated and untreated control, is processed independently from alignment through cluster definition, as described in (39). An overlap analysis is then performed to unify the clusters from the different replicates. Clusters that overlap or are separated by less than 1,500 bp are merged and considered a single translocation event [see (Turchiano et al., 2021) for details]. Depending on the number of replicates, the user can define the minimum number of replicates in which a site must be found and the minimum number of samples in which the site must differ significantly from the untreated control (i.e., the number of reads is significantly higher in treated vs. untreated samples based on Fisher’s exact test).
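The merging rule above can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the tuple layout `(chrom, start, end, replicate)`, the function names, and the `min_replicates` argument are assumptions made for illustration, and the Fisher's exact test criterion is not covered here.

```python
def merge_clusters(clusters, max_gap=1500):
    """Merge clusters that overlap or lie less than max_gap bp apart.

    clusters: iterable of (chrom, start, end, replicate) tuples
    (hypothetical layout). Returns a list of merged sites, each
    recording the set of replicates that contributed to it.
    """
    merged = []
    for chrom, start, end, rep in sorted(clusters):
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start - last["end"] < max_gap:
            # Overlapping or closer than max_gap: same translocation event.
            last["end"] = max(last["end"], end)
            last["replicates"].add(rep)
        else:
            merged.append(
                {"chrom": chrom, "start": start, "end": end, "replicates": {rep}}
            )
    return merged


def filter_by_replicates(sites, min_replicates):
    """Keep merged sites found in at least min_replicates replicates."""
    return [s for s in sites if len(s["replicates"]) >= min_replicates]


example = [
    ("chr1", 100, 500, "rep1"),
    ("chr1", 1800, 2200, "rep2"),  # 1,300 bp after the first cluster: merged
    ("chr1", 5000, 5300, "rep1"),  # 2,800 bp after the merged site: separate
    ("chr2", 100, 400, "rep2"),
]
sites = merge_clusters(example)
shared = filter_by_replicates(sites, min_replicates=2)
```

Sorting by chromosome and start position lets a single pass compare each cluster only with the most recently merged site.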
Barcode hopping: We introduced an additional filter to eliminate artifacts generated by barcode hopping events. Barcode hopping events are identified by their low reads:hits ratio compared with real translocation events, based on the distribution of log10(reads:hits).
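As a rough illustration of this filter, the sketch below computes log10(reads:hits) per site and flags sites falling below a cutoff. The cutoff value, the `(name, reads, hits)` layout, and the function names are hypothetical; how the threshold is derived from the distribution is not specified in this section.

```python
import math


def log_reads_per_hit(reads, hits):
    """Return log10 of the reads:hits ratio for one site."""
    return math.log10(reads / hits)


def flag_barcode_hopping(sites, cutoff):
    """Return names of sites whose log10(reads:hits) is below cutoff.

    sites: iterable of (name, reads, hits) tuples (hypothetical layout).
    cutoff: hypothetical threshold, to be chosen from the observed
    distribution of log10(reads:hits).
    """
    return [
        name
        for name, reads, hits in sites
        if log_reads_per_hit(reads, hits) < cutoff
    ]


example_sites = [("siteA", 1000, 10), ("siteB", 12, 10)]
flagged = flag_barcode_hopping(example_sites, cutoff=0.5)
```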
Coverage: For the remaining sites, the read coverage is calculated to identify highly covered regions. Each site is divided into 100 bins of equal size. For each site, the coordinates of the bin with the highest coverage across all replicates are used for downstream analysis instead of the coordinates of the whole site. This new feature restricts the alignment against the target sequence to a smaller, highly covered region, which makes the alignment more specific and less prone to identifying false-positive OMTs/HMTs.
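The binning step can be sketched as follows. This is a minimal example under stated assumptions: reads are pooled across replicates and given as half-open `(start, end)` intervals, and coverage is measured as the number of read bases overlapping each bin. The function name and this definition of coverage are illustrative and may differ from the pipeline.

```python
def best_bin(site_start, site_end, reads, n_bins=100):
    """Split a site into n_bins equal bins and return the most covered one.

    reads: iterable of (start, end) half-open intervals pooled across
    replicates (assumed layout). Coverage of a bin is taken here as the
    total number of read bases overlapping it (an assumption).
    Returns ((bin_start, bin_end), coverage).
    """
    width = (site_end - site_start) / n_bins
    coverage = [0.0] * n_bins
    for i in range(n_bins):
        b_start = site_start + i * width
        b_end = b_start + width
        for r_start, r_end in reads:
            overlap = min(r_end, b_end) - max(r_start, b_start)
            if overlap > 0:
                coverage[i] += overlap
    # On ties, max() returns the first (leftmost) bin.
    best = max(range(n_bins), key=coverage.__getitem__)
    bin_coords = (site_start + best * width, site_start + (best + 1) * width)
    return bin_coords, coverage[best]


reads = [(100, 200), (150, 160), (155, 300)]
coords, depth = best_bin(0, 1000, reads)
```

The returned bin coordinates would then replace the whole-site coordinates in the alignment step.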
Alignment: A new TALEN-specific substitution matrix was implemented (Supplementary Tables S12), inspired by (18), and the analysis was restricted to four TALEN combinations: LF.LR, LF.RR, RF.RR, and RF.LR (L/RX, left/right; XF/R, forward/reverse). To determine the best combination, i.e., the one most likely to cleave an off-target site, spacer lengths from 8 to 28 bp are tested for each combination. Artificial sequences representing the binding sites of two TALEN arms separated by a spacer “N_k” of k nucleotides (k ∈ {8, …, 28}) are tested. N matches any base without cost, so the spacer length does not by itself influence the alignment score. An example sequence is shown in Supplementary Figure S2B. The alignment score is calculated using the pairwiseAlignment function from the Biostrings R package with the “local-global” alignment type. The TALEN combinations and spacer lengths are first selected based on two criteria: a) the first (5′) aligned base is a T, and b) the last (3′) aligned base is an A. They are then ordered by alignment score, and the highest-scoring one is defined as the most probable TALEN combination and spacer length for a given target site. The same approach was applied to randomly selected regions across the entire genome to determine the overall distribution of alignment scores on random sequences.
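The enumeration of candidate sequences and the selection rule can be sketched in Python as below. The alignment itself (performed in the pipeline with Biostrings in R) is not reproduced. The arm sequences in the example are made up, and placing the second arm as its reverse complement is an assumption about how the "F" and "R" orientations are encoded; all function names and the `(combination, k, score, aligned_bases)` layout are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_targets(left_arm, right_arm, spacer_lengths=range(8, 29)):
    """Build artificial target sequences for the four TALEN combinations.

    Each candidate is: first arm (forward) + N * k + second arm
    (reverse complement, an assumption), for k = 8..28.
    Returns a dict keyed by (combination, k).
    """
    arms = {"L": left_arm, "R": right_arm}
    combos = [("L", "L"), ("L", "R"), ("R", "R"), ("R", "L")]
    targets = {}
    for first, second in combos:
        name = f"{first}F.{second}R"
        for k in spacer_lengths:
            targets[(name, k)] = (
                arms[first] + "N" * k + reverse_complement(arms[second])
            )
    return targets


def pick_best(alignments):
    """Select the most probable combination and spacer length.

    alignments: iterable of (combination, k, score, aligned_bases), where
    aligned_bases is the aligned stretch of the site (hypothetical layout).
    Only alignments starting with T and ending with A are eligible; the
    highest-scoring eligible one is returned, or None if there is none.
    """
    eligible = [
        a for a in alignments if a[3].startswith("T") and a[3].endswith("A")
    ]
    return max(eligible, key=lambda a: a[2], default=None)


targets = candidate_targets("TACG", "TGGA")  # made-up arm sequences
best = pick_best(
    [
        ("LF.RR", 15, 42.0, "TACGGGA"),
        ("RF.LR", 12, 50.0, "GACGTTA"),  # higher score, but does not start with T
        ("LF.LR", 20, 38.5, "TTTGCAA"),
    ]
)
```

Four combinations and 21 spacer lengths give 84 candidates per site, each of which would be aligned and scored before `pick_best` is applied.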
The p value of a given combination and spacer length is assessed based on the empirical cumulative distribution function. Sites with p values below 0.05 are considered OMTs. HMTs and NBSs were classified as described in (39).
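A minimal sketch of the empirical p value follows, assuming it is computed as one minus the empirical cumulative distribution function of the random-sequence scores evaluated at the observed score. Whether ties are counted as exceeding the observed score is not stated in the text, so the strict convention used here is an assumption, as are the function names.

```python
from bisect import bisect_right


def empirical_p_value(score, random_scores):
    """Return 1 - ECDF(score), the fraction of random scores above score.

    random_scores: alignment scores obtained on randomly selected
    genomic regions. Ties are counted as not exceeding the observed
    score (an assumption).
    """
    ordered = sorted(random_scores)
    ecdf = bisect_right(ordered, score) / len(ordered)
    return 1.0 - ecdf


def is_omt(score, random_scores, alpha=0.05):
    """Classify a site as OMT when its empirical p value is below alpha."""
    return empirical_p_value(score, random_scores) < alpha


background = list(range(100))  # toy background scores 0..99
p = empirical_p_value(97, background)
```

With a background of n random scores, the smallest non-zero p value obtainable this way is 1/n, so the number of random regions limits the resolution of the test.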