The reference O. sativa genome [47–51] was selected for testing the software because of its high-quality assembly, small genome size (389 Mb) and the quality of its gene and TE annotations. The same O. sativa genome used by [33] was employed so that the results could be compared with the benchmarked tools in this study. This work used the standard library v6.9.5 created by [33] based on the O. sativa L. ssp. japonica cv. ‘Nipponbare’ v. MSU7 genome, and RepeatMasker v4.0.8 [52] with the following parameters: ‘-pa 36 -q -no_is -norna -nolow -div 40 -cutoff 225’.
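As an illustration of the RepeatMasker settings quoted above, the following minimal sketch assembles the corresponding command line. The file names `genome.fa` and `library.fa` are hypothetical, and passing the curated library through `-lib` is an assumption; the text only lists the remaining parameters.

```python
import shlex


def repeatmasker_command(genome_fasta, library_fasta, threads=36):
    """Build a RepeatMasker argument list with the parameters quoted in the text."""
    return [
        "RepeatMasker",
        "-pa", str(threads),      # number of parallel jobs
        "-q",                     # quick search
        "-no_is",                 # skip bacterial insertion element check
        "-norna",                 # do not mask small RNA genes
        "-nolow",                 # do not mask low-complexity DNA
        "-div", "40",             # maximum divergence from the library sequence
        "-cutoff", "225",         # score cutoff for custom library matches
        "-lib", library_fasta,    # assumption: custom library supplied with -lib
        genome_fasta,
    ]


# Hypothetical file names, used only for illustration.
cmd = repeatmasker_command("genome.fa", "library.fa")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where RepeatMasker is installed.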
Additionally, six different plant genomes (Table 1) were used to test the execution times of Inpactor2 across different genome sizes and TE compositions. The genomes were downloaded from NCBI and analyzed with Inpactor2 using the following parameters, as suggested in [53]: ‘-m 15000 -n 1000 -i no -d no -C 1 -c yes -a no’. Finally, EDTA was run on the same genomes to compare its execution times with those of Inpactor2. EDTA was executed using the EDTA_raw.py script with ‘--type ltr’ and all other parameters at their default values.
Libraries of LTR-RTs for the species shown in Table 1 were then created using Inpactor2 (with and without filtering via the -c flag) and EDTA. In addition, two species not contained in the training data were used: Coffea humblotiana [54] and Gardenia jasminoides [55]. The genomes were then annotated with these libraries using RepeatMasker, and the resulting proportions of LTR-RTs were compared with the proportions reported in the papers describing each genome. All experiments were performed on a workstation with an AMD Ryzen Threadripper 3970X 32-core processor, 128 GB of RAM and an NVIDIA RTX 2080 Super GPU.
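The proportion of a genome corresponding to LTR-RTs can be derived from annotation coordinates by merging overlapping hits and dividing the covered bases by the genome size. The sketch below illustrates that calculation under stated assumptions: intervals are 1-based and inclusive (as in RepeatMasker output), and the coordinates and genome size are invented for the example rather than taken from the study.

```python
def merged_length(intervals):
    """Count bases covered by 1-based inclusive intervals, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def ltr_rt_fraction(intervals, genome_size):
    """Fraction of the genome covered by the given annotation intervals."""
    return merged_length(intervals) / genome_size


# Invented coordinates on a hypothetical 1,000 bp sequence.
hits = [(1, 100), (51, 150), (301, 400)]
fraction = ltr_rt_fraction(hits, genome_size=1000)
print(f"{fraction:.2%}")
```

For a multi-sequence assembly, the same merge would be applied per sequence and the covered bases summed before dividing by the total assembly length.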
To evaluate the performance of Inpactor2 against other software, a methodology similar to the one proposed in [33] was followed. First, Inpactor v.1.0 [34], TEsorter v.1.3 [45], Transposon Ultimate v.1.0 [28], LTR_retriever v.2.9 [56] and LTRharvest [57] were selected for benchmarking because their methodologies classify LTR-RTs to the superfamily level. A workflow was established for each tool, using LTR_FINDER v.1.0.7 as the initial LTR-RT detector. Then, the O. sativa genome was annotated with RepeatMasker and performance metrics were extracted for each workflow. The metrics evaluated were accuracy, precision, specificity, sensitivity, false discovery rate (FDR) and F1-score. Figure 1 shows a schematic representation of the benchmarking metrics. In this study, TP, FN, TN and FP are the numbers of nucleotides belonging to each category (Figure 2).
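The six metrics can be written in terms of the nucleotide counts TP, FP, TN and FN. The sketch below uses the standard definitions of these metrics; it is assumed, not confirmed by the text, that the benchmarking script reports them in the same form. The counts in the example are invented, and the function assumes every denominator is nonzero.

```python
def benchmark_metrics(tp, fp, tn, fn):
    """Compute six benchmarking metrics from nucleotide-level counts."""
    return {
        "sensitivity": tp / (tp + fn),                  # recall of true LTR-RT bases
        "specificity": tn / (tn + fp),                  # non-LTR-RT bases left unannotated
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "fdr": fp / (tp + fp),                          # equals 1 - precision
        "f1": 2 * tp / (2 * tp + fp + fn),              # harmonic mean of precision and sensitivity
    }


# Invented counts for illustration only.
metrics = benchmark_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```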
The ‘lib-test.pl’ script, included in the EDTA toolkit [33], was used to extract the six metrics. Since this study focused only on the LTR-RT category, the script was executed with the ‘-cat ltr’ parameter for the comparative evaluation.