Insertions arise through various mechanisms and have various consequences [26]. To characterize and investigate the origins of the detected insertions, we decomposed them into TRs, TEs, tandem duplications (TDs), satellite sequences, dispersed duplications, processed pseudogenes, alternative sequences, “deletions” in GRCh38, and nuclear mitochondrial DNA sequences (NUMTs).
We first applied Tandem Repeats Finder (TRF) [27] to all inserted sequences and defined an insertion as a TR if the repeat (1) had an element length < 50 bp and (2) covered more than 50% of the inserted sequence. After filtering out TRs, we identified TEs using RepeatMasker [28], requiring that (1) the inserted sequence covered > 50% of the TE, (2) the TE covered > 50% of the inserted sequence (reciprocal overlap), and (3) the total substitutions and indels were < 50% (matching condition).
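The TR and TE criteria above can be sketched as two predicates. This is a minimal illustration, not the pipeline used in this study: the `TrfHit` and `TeHit` records and their field names are hypothetical, and we assume the coverage and divergence values have already been extracted from the TRF and RepeatMasker outputs.

```python
from dataclasses import dataclass


@dataclass
class TrfHit:
    """Hypothetical summary of one TRF call on an inserted sequence."""
    period: int       # repeat element length in bp
    covered_bp: int   # inserted bases covered by the repeat


@dataclass
class TeHit:
    """Hypothetical summary of one RepeatMasker call on an inserted sequence."""
    ins_covered_bp: int  # inserted bases covered by the TE annotation
    te_covered_bp: int   # TE bases covered by the inserted sequence
    te_length: int       # full length of the TE in bp
    divergence: float    # (substitutions + indels) as a fraction, 0..1


def is_tandem_repeat(hit: TrfHit, insertion_len: int) -> bool:
    # (1) element length < 50 bp and (2) repeat covers > 50% of the insertion
    return hit.period < 50 and hit.covered_bp / insertion_len > 0.5


def is_transposable_element(hit: TeHit, insertion_len: int) -> bool:
    covers_te = hit.te_covered_bp / hit.te_length > 0.5       # condition (1)
    covered_by_te = hit.ins_covered_bp / insertion_len > 0.5  # condition (2)
    matches = hit.divergence < 0.5                            # condition (3)
    return covers_te and covered_by_te and matches
```

For example, a 100-bp insertion with a 12-bp repeat element covering 80 bp would pass the TR test, whereas the same coverage with a 60-bp element would not.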
Previous studies have reported that TDs are understudied but widespread [26, 29]. After detecting TRs and TEs, we manually reviewed the remaining insertions and found that they included TDs derived from non-repetitive regions of the reference. To identify this class of insertions, we aligned all insertions except TRs to GRCh38 using BLAT [30]. We then collected insertions that mapped within 5 bp of their original breakpoints with > 90% BLAT identity and defined them as TDs. This process also revealed TRs with long repeat elements that had been missed. These were added to the TR callset if (1) the inserted sequence aligned within 500 bp of the insertion breakpoint and (2) the ratio of the total number of matching bases to the insertion length was > 0.5.
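The two BLAT-based rules in this step can be illustrated as follows. This is a sketch under stated assumptions: the `BlatHit` record is hypothetical, the distance to the breakpoint is taken from the nearest edge of the alignment on the same chromosome, and identity is assumed to be supplied as a percentage. None of these details is specified in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlatHit:
    """Hypothetical summary of one BLAT alignment of an insertion to GRCh38."""
    chrom: str
    t_start: int     # alignment start on the reference
    t_end: int       # alignment end on the reference
    matches: int     # total number of matching bases
    identity: float  # BLAT identity in percent


def breakpoint_distance(hit: BlatHit, chrom: str, breakpoint: int) -> Optional[int]:
    # Assumption: measure from the nearest alignment edge; None if the hit
    # lies on another chromosome.
    if hit.chrom != chrom:
        return None
    return min(abs(hit.t_start - breakpoint), abs(hit.t_end - breakpoint))


def is_tandem_duplication(hit: BlatHit, chrom: str, breakpoint: int) -> bool:
    # Mapped within 5 bp of the original breakpoint with > 90% identity.
    dist = breakpoint_distance(hit, chrom, breakpoint)
    return dist is not None and dist <= 5 and hit.identity > 90.0


def is_rescued_tandem_repeat(hit: BlatHit, chrom: str, breakpoint: int,
                             insertion_len: int) -> bool:
    # (1) aligned within 500 bp of the breakpoint and
    # (2) matching bases / insertion length > 0.5.
    dist = breakpoint_distance(hit, chrom, breakpoint)
    return (dist is not None and dist <= 500
            and hit.matches / insertion_len > 0.5)
```

Keeping the two rules as separate predicates avoids fixing an evaluation order, which the text does not state.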
To understand the remaining insertions, we manually checked their features by aligning them to the reference using BLAT [30]. We identified insertions that aligned from end to end to different chromosomal regions with high identity (> 90%) and defined them as dispersed duplications. Next, we detected insertions that aligned to a series of exons and untranslated regions (UTRs) of coding genes with high identity (> 90%) and classified them as processed pseudogenes. We also found insertions that BLAT aligned to alternative sequences (e.g., “alt” or “fix” sequences) with high identity (> 90%) and classified them as alternative sequences. Some of the insertions left at this point were thought to have arisen from deletion events in GRCh38 because they aligned confidently to the chimpanzee reference genome (panTro6), although they were classified as insertions when compared with GRCh38 [3]. We aligned the remaining insertions to the panTro6 assembly and categorized those that lifted over to panTro6 with high accuracy (> 90%) within 100 bp of the inserted position on GRCh38 as “deletions” in GRCh38. The insertions still remaining were manually reviewed, and features of the surrounding genomic regions (segmental duplications or self-chain) were examined.
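The order of these checks can be summarized as a simple decision cascade. The sketch below is only an illustration of that order: the classification in this study relied on manual inspection, and the `Evidence` record, its field names, and the assumption that each feature is already reduced to a flag or a number are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical per-insertion features gathered from BLAT and liftover."""
    identity: float             # BLAT identity on GRCh38, in percent
    end_to_end: bool            # insertion aligned over its full length
    different_region: bool      # hit lies in a different chromosomal region
    hits_exons_and_utrs: bool   # hit follows exons/UTRs of a coding gene
    hits_alt_or_fix: bool       # hit is on an "alt" or "fix" sequence
    pantro6_accuracy: Optional[float] = None  # percent; None if not lifted
    pantro6_offset_bp: Optional[int] = None   # distance to GRCh38 position


def classify_remaining(e: Evidence) -> str:
    high_identity = e.identity > 90.0
    if high_identity and e.end_to_end and e.different_region:
        return "dispersed_duplication"
    if high_identity and e.hits_exons_and_utrs:
        return "processed_pseudogene"
    if high_identity and e.hits_alt_or_fix:
        return "alternative_sequence"
    if (e.pantro6_accuracy is not None and e.pantro6_offset_bp is not None
            and e.pantro6_accuracy > 90.0 and e.pantro6_offset_bp <= 100):
        return "deletion_in_GRCh38"
    # Everything else goes to manual review of the genomic context.
    return "manual_review"
```

An insertion with no confident GRCh38 hit that lifts over to panTro6 with 97% accuracy, 12 bp from its GRCh38 position, would fall into the "deletion in GRCh38" class under this sketch.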