Custom repeat libraries were individually created for I. trifida and I. triloba by combining the putative repeat libraries predicted from MITE-Hunter (ref. 46; v2011) and RepeatModeler (http://www.repeatmasker.org/; v1.0.8). Protein-coding genes were removed from the repeat library using ProtExcluder (ref. 47). The Repbase (v20150807) repeats for green plants (Viridiplantae) were then added to each library to create a final custom repeat library for each species. The pseudomolecules for I. trifida and I. triloba were repeat masked with the respective repeat library using RepeatMasker (http://www.repeatmasker.org/; v4.0.6).
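The masked assemblies are used in two forms in the gene prediction step: soft-masked (repeats kept as lowercase bases) and hard-masked (repeats replaced by N). The relationship between the two can be sketched as follows. This is an illustrative sketch only, not the pipeline used here; the function names `hard_mask` and `masked_fraction` are hypothetical, and the example sequence is made up.

```python
def hard_mask(soft_masked: str) -> str:
    """Convert a soft-masked sequence (repeats in lowercase) to a
    hard-masked one (repeats replaced by N)."""
    return "".join("N" if base.islower() else base for base in soft_masked)


def masked_fraction(soft_masked: str) -> float:
    """Fraction of bases flagged as repeat (lowercase) in a soft-masked sequence."""
    if not soft_masked:
        return 0.0
    return sum(base.islower() for base in soft_masked) / len(soft_masked)


if __name__ == "__main__":
    # Hypothetical toy sequence: the lowercase run stands for a masked repeat.
    example = "ACGTacgtacGGCC"
    print(hard_mask(example))
    print(round(masked_fraction(example), 3))
```

Soft masking preserves the underlying sequence, so a hard-masked assembly can always be derived from a soft-masked one, but not the reverse.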
For gene prediction, AUGUSTUS (ref. 48; v3.1) was trained on the soft-masked assemblies using the leaf RNA-Seq alignments. Gene models were predicted with AUGUSTUS using the hard-masked assemblies and refined with PASA2 (ref. 49; v2.0.2) using the genome-guided transcript assemblies from each tissue as transcript evidence (Supplementary Method 1). Two rounds of gene prediction comparison were performed, and gene models that PASA identified as merged but was unable to split were manually inspected and split as necessary. A third round of gene prediction comparison was then performed to refine the structure of the manually curated models. The final output from PASA2 was used as the working set of gene models. Expression abundances for each gene model were determined from the RNA-Seq read alignments using Cufflinks2 (ref. 50; v2.2.1). A high-confidence gene model set was constructed from the working gene model set by removing partial gene models and gene models with an internal stop codon, a hit to a transposable element, or an FPKM of 0 across the RNA-Seq libraries used for the annotation.
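The high-confidence filter described above can be sketched as a set of per-model checks. This is a minimal illustration, not the code used for the annotation. The `GeneModel` class and its fields are hypothetical, and the definition of a complete (non-partial) model as a CDS that starts with ATG, ends with a stop codon, and has a length divisible by three is an assumption, since the text does not define "partial".

```python
from dataclasses import dataclass, field
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class GeneModel:
    """Hypothetical container for one working-set gene model."""
    gene_id: str
    cds: str                      # coding sequence, 5'->3'
    has_te_hit: bool              # True if the model matched a transposable element
    fpkm_by_library: Dict[str, float] = field(default_factory=dict)


def codons(cds: str) -> List[str]:
    """Split a CDS into complete codons, ignoring any trailing 1-2 bases."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def is_complete(cds: str) -> bool:
    """Assumed definition: in frame, starts with ATG, ends with a stop codon."""
    c = codons(cds)
    return len(cds) % 3 == 0 and len(c) >= 2 and c[0] == "ATG" and c[-1] in STOP_CODONS


def has_internal_stop(cds: str) -> bool:
    """True if any codon other than the last one is a stop codon."""
    return any(codon in STOP_CODONS for codon in codons(cds)[:-1])


def is_expressed(model: GeneModel) -> bool:
    """True if FPKM is above 0 in at least one RNA-Seq library."""
    return any(value > 0 for value in model.fpkm_by_library.values())


def high_confidence(models: List[GeneModel]) -> List[GeneModel]:
    """Keep models that pass all four criteria."""
    return [
        m for m in models
        if is_complete(m.cds)
        and not has_internal_stop(m.cds)
        and not m.has_te_hit
        and is_expressed(m)
    ]


if __name__ == "__main__":
    # Made-up toy models, one per rejection reason plus one that passes.
    working_set = [
        GeneModel("g1", "ATGGCCTAA", False, {"leaf": 3.2, "root": 0.0}),
        GeneModel("g2", "ATGTAAGCCTGA", False, {"leaf": 1.0}),   # internal stop
        GeneModel("g3", "GCCGCCTAA", False, {"leaf": 1.0}),      # partial (no ATG)
        GeneModel("g4", "ATGGCCTAA", True, {"leaf": 1.0}),       # TE hit
        GeneModel("g5", "ATGGCCTAA", False, {"leaf": 0.0}),      # FPKM of 0 everywhere
    ]
    print([m.gene_id for m in high_confidence(working_set)])
```

Note that a model is removed for lack of expression only when its FPKM is 0 in every library, so a gene expressed in a single tissue is retained.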
Functional annotations for the high-confidence gene models were assigned by comparing their protein sequences against the Arabidopsis proteome (TAIR10), Pfam (v29), and Swiss-Prot databases. Proteins that matched only a hypothetical Arabidopsis gene model and had no matches in the other databases were annotated as conserved hypothetical, while proteins with no matches in any of the databases were annotated as hypothetical.
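The labeling rule for proteins without an informative match can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the annotation code itself. The function `assign_annotation`, the keyword list used to recognize a hypothetical Arabidopsis gene model, and the order of precedence among informative hits (Arabidopsis, then Swiss-Prot, then Pfam) are all assumptions, since the text specifies only the two fallback labels.

```python
from typing import Optional

# Assumption: descriptions containing these phrases mark a hypothetical model.
UNINFORMATIVE_KEYWORDS = ("hypothetical protein", "unknown protein")


def is_hypothetical(description: str) -> bool:
    """True if an Arabidopsis hit description looks like a hypothetical gene model."""
    text = description.lower()
    return any(keyword in text for keyword in UNINFORMATIVE_KEYWORDS)


def assign_annotation(
    arabidopsis: Optional[str],
    pfam: Optional[str],
    swissprot: Optional[str],
) -> str:
    """Return a functional description from the best-hit descriptions.

    Each argument is the description of the best match in that database,
    or None when there was no match.
    """
    # An informative Arabidopsis hit is used directly (assumed precedence).
    if arabidopsis and not is_hypothetical(arabidopsis):
        return arabidopsis
    # Otherwise fall back to the other databases (assumed order).
    for description in (swissprot, pfam):
        if description:
            return description
    # Only a hypothetical Arabidopsis model matched.
    if arabidopsis:
        return "conserved hypothetical"
    # Nothing matched in any database.
    return "hypothetical"


if __name__ == "__main__":
    # Made-up hit descriptions for illustration.
    print(assign_annotation("hypothetical protein", None, None))
    print(assign_annotation(None, None, None))
    print(assign_annotation("sucrose synthase 1", None, None))
```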