Manual curation of TEs in rice began after the release of the map-based rice genome [22]. Repetitive sequences in the rice genome were compiled using RECON [44] with a copy number cutoff of 10. Details of the manual curation of LTR sequences were described previously in the LTR_retriever paper [40]. In brief, for the curation of LTR retrotransposons, we first collected known LTR elements and used them to mask LTR candidates. Unmasked candidates were manually checked for terminal motifs, TSD sequences, and conserved coding sequences. Terminal repeats were aligned together with their extended flanking sequences, and candidates were discarded if the alignments extended beyond the terminal repeat boundaries. For the curation of non-LTR retrotransposons, new candidates were required to have a poly-A tail and a TSD. We also collected 13 curated SINE elements from [53] to complement our library.
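The two structural requirements for non-LTR candidates (a poly-A tail and a TSD) can be illustrated with a minimal Python sketch. This is not the procedure used for the curated library, which was checked manually. The function names, the exact-match requirement, and the length thresholds (`min_len`, `max_len`, `min_run`) are assumptions chosen only for illustration.

```python
def find_tsd(left_flank: str, right_flank: str,
             min_len: int = 4, max_len: int = 20) -> str:
    """Return the longest exact direct repeat abutting an element, or ''.

    A TSD is expected at the 3' end of the left flank and at the 5' end
    of the right flank. Exact matching and the length limits are
    illustrative assumptions, not values taken from the text.
    """
    left, right = left_flank.upper(), right_flank.upper()
    upper = min(max_len, len(left), len(right))
    # Try the longest candidate first so the longest repeat is reported.
    for k in range(upper, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""


def has_poly_a_tail(element: str, min_run: int = 8) -> bool:
    """Check whether the element ends in at least `min_run` adenines.

    The run length of 8 is an arbitrary illustrative threshold.
    """
    return element.upper().endswith("A" * min_run)


if __name__ == "__main__":
    print(find_tsd("GGCCATTGC", "ATTGCTTAA"))
    print(has_poly_a_tail("ACGTAAAAAAAA"))
```

In practice a curator would tolerate a mismatch or two in the TSD and an imperfect tail, so a check like this is best treated as a first-pass filter ahead of manual inspection.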
For the curation of DNA TEs with TIRs, flanking sequences (100 bp, or longer if necessary) were extracted and aligned using DIALIGN2 [72] to determine element boundaries. A boundary was defined as the position up to which sequence homology was conserved in more than half of the aligned sequences. Sequences with defined boundaries were then manually examined for the presence of a TSD. Features of the terminal and TSD sequences were used to classify the TEs into families: each transposon family is associated with distinct features in its terminal sequences and TSDs, which can be used to identify elements and assign them to their respective families [14]. For Helitrons, each representative sequence was required to have at least two copies with intact terminal sequences, distinct flanking sequences, and insertion into “AT” target sites.
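The boundary definition above can be sketched in Python as follows. This is only an approximation of the idea, not the DIALIGN2-based manual procedure. It assumes the copies are already aligned as equal-length strings with `-` for gaps. The function names and the `min_run` parameter, which requires several consecutive conserved columns so that a chance match in the flanks is not mistaken for the element, are hypothetical additions.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=0.5):
    """Flag columns where more than `min_fraction` of ALL sequences share a base.

    Gaps and N are ignored when counting bases but still count toward the
    total number of sequences.
    """
    n = len(alignment)
    flags = []
    for column in zip(*alignment):
        counts = Counter(b.upper() for b in column if b.upper() not in "-N")
        top = counts.most_common(1)[0][1] if counts else 0
        flags.append(top / n > min_fraction)
    return flags


def element_boundaries(alignment, min_fraction=0.5, min_run=5):
    """Return (start, end) 0-based inclusive columns of the conserved region.

    The region starts at the first run, and ends at the last run, of
    `min_run` consecutive conserved columns. `min_run` is an illustrative
    assumption. Returns None if no such run exists.
    """
    flags = conserved_columns(alignment, min_fraction)
    starts = [i for i in range(len(flags) - min_run + 1)
              if all(flags[i:i + min_run])]
    if not starts:
        return None
    return starts[0], starts[-1] + min_run - 1


if __name__ == "__main__":
    element = "CACTAGTG"
    copies = [
        "AAAA" + element + "CCCC",
        "CCCC" + element + "GGGG",
        "GGGG" + element + "TTTT",
        "TTTT" + element + "AAAA",
    ]
    print(element_boundaries(copies))
```

Counting the conserved fraction against all sequences, rather than only the ungapped ones, keeps a column supported by a single ungapped copy from being called conserved.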
To build our non-redundant curated library, each new TE candidate was first masked using the current library. The unmasked candidates were further checked for structural integrity and conserved domains. For candidates that were partially masked and appeared to be true elements, the “80-80-80” rule (≥ 80% of the query aligned with ≥ 80% identity over an alignment ≥ 80 bp long) was applied to determine whether the element would be retained. For elements containing detectable known nested insertions, the nested portions were removed and the remaining regions were joined into a single sequence. Finally, protein-coding sequences were removed using the ProtExcluder package [73]. The curated library version 6.9.5 was used in this study and is available as part of the EDTA toolkit.
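Two of the steps above, the “80-80-80” test and the removal of nested insertions, are simple enough to sketch in Python. The function names and the coordinate convention (0-based, half-open intervals) are assumptions for illustration. The sketch only reports whether an alignment meets the rule and leaves the decision to retain or discard the candidate to the caller.

```python
def meets_80_80_80(query_len: int, aligned_len: int, identity_pct: float) -> bool:
    """Return True if an alignment satisfies all three 80-80-80 criteria.

    query_len    : length of the candidate (query) in bp
    aligned_len  : number of query bases covered by the alignment
    identity_pct : percent identity of the alignment (0-100)
    """
    covers_80_pct = aligned_len * 100 >= 80 * query_len  # integer-safe coverage test
    return covers_80_pct and identity_pct >= 80 and aligned_len >= 80


def remove_nested_insertions(sequence: str, nested) -> str:
    """Excise nested insertions and join the remaining regions.

    `nested` is an iterable of (start, end) pairs, 0-based and half-open
    (an assumed convention). Overlapping intervals are handled.
    """
    kept = []
    cursor = 0
    for start, end in sorted(nested):
        if start > cursor:
            kept.append(sequence[cursor:start])
        cursor = max(cursor, end)
    kept.append(sequence[cursor:])
    return "".join(kept)


if __name__ == "__main__":
    print(meets_80_80_80(query_len=200, aligned_len=170, identity_pct=85.0))
    print(remove_nested_insertions("AAAACCCCGGGG", [(4, 8)]))
```

Note that a short candidate can pass the coverage and identity criteria and still fail the rule because its alignment is under 80 bp.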