The PiRATE pipeline was used as in the original publication (Berthelier et al., 2018), including the following steps: 1) Contigs representing repetitive sequences were identified from the assembled contigs using similarity-based, structure-based, and repetitiveness-based approaches. The similarity-based detection programs included RepeatMasker v4.1.0 (
http://repeatmasker.org/RepeatMasker/, using Repbase20.05_REPET.embl.tar.gz as the library instead) and TE-HMMER (Eddy, 2011). The structure-based detection programs included LTRharvest (Ellinghaus et al., 2008), MGEScan non-LTR (Rho and Tang, 2009), HelSearch (Yang et al., 2009), MITE-Hunter (Han and Wessler, 2010), and SINE-finder (Wenke et al., 2011). The repetitiveness-based detection programs included TEdenovo (Flutre et al., 2011) and RepeatScout (Price et al., 2005). 2) Repeat consensus sequences (
e.g., representing multiple subfamilies within a TE family) were also identified from the cleaned, filtered, and unassembled reads with dnaPipeTE (Goubert et al., 2015) and RepeatModeler (
http://www.repeatmasker.org/RepeatModeler/). 3) Contigs identified by each individual program in steps 1 and 2 were filtered to remove those <100 bp in length and clustered with CD-HIT-est (Li and Godzik, 2006) at a 100% sequence identity cutoff to reduce redundancy. This yielded a total of 155,999 contigs. 4) All 155,999 contigs were then clustered together with CD-HIT-est (100% sequence identity cutoff), retaining the longest contig in each cluster and recording the program that classified it. A total of 46,090 contigs were filtered out at this step. 5) The remaining 109,909 repeat contigs were annotated with PASTEC (Hoede et al., 2014) as TEs to the levels of order and superfamily in Wicker’s hierarchical classification system (Wicker et al., 2007), modified to include several recently discovered TE superfamilies, and were checked manually to filter chimeric contigs and those annotated with conflicting evidence (
Supplementary File S2). 6) All classified repeats (“known TEs” hereafter), along with the unclassified repeats (“unknown repeats” hereafter) and putative multi-copy host genes, were combined to produce a
Ranodon-derived repeat library. 7) For each superfamily, we collapsed the contigs to 95% and 80% sequence identity using CD-HIT-est to provide an overall view of within-superfamily diversity; 80% is the sequence identity threshold used to define TE families (Wicker et al., 2007).
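
The length filter and 100% identity collapse described in steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the CD-HIT-est algorithm and not code from the pipeline. It assumes that, at a 100% cutoff, a contig is redundant when its sequence is an exact substring of a longer retained contig, and it ignores reverse-complement matches. The names Contig, filter_short, and collapse_identical are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    program: str  # detection program that reported the contig
    seq: str


MIN_LENGTH = 100  # contigs shorter than this are removed (step 3)


def filter_short(contigs, min_length=MIN_LENGTH):
    """Drop contigs shorter than min_length bp."""
    return [c for c in contigs if len(c.seq) >= min_length]


def collapse_identical(contigs):
    """Greedy collapse at 100% identity, keeping the longest contig.

    Assumption (simplification of CD-HIT-est): a contig is redundant when
    its sequence is an exact substring of an already retained contig.
    Reverse complements are not considered.
    """
    representatives = []
    # Process longest first so every representative is the longest
    # member of its cluster; its 'program' field records the source.
    for contig in sorted(contigs, key=lambda c: len(c.seq), reverse=True):
        if any(contig.seq in rep.seq for rep in representatives):
            continue
        representatives.append(contig)
    return representatives


if __name__ == "__main__":
    long_seq = "ACGTTGCA" * 20  # 160 bp toy sequence
    contigs = [
        Contig("a", "RepeatScout", long_seq),
        Contig("b", "LTRharvest", long_seq[:120]),  # contained in "a"
        Contig("c", "TEdenovo", "ACGT" * 10),       # 40 bp, below cutoff
        Contig("d", "MITE-Hunter", "G" * 100 + "T" * 50),
    ]
    for kept in collapse_identical(filter_short(contigs)):
        print(kept.name, kept.program, len(kept.seq))
```

In this toy input, contig "c" is removed by the length filter and contig "b" is absorbed by the longer contig "a", whose originating program is carried along in the retained record. The pairwise substring check is quadratic in the number of contigs and is meant only to make the logic concrete; it would not scale to 155,999 contigs.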