To assess the accuracy of GeneMark-EP as well as ProtHint, we selected annotated genomes from diverse clades: fungi, worms, plants, insects and vertebrates (Table 1). Genome length varied from <100 Mb (Neurospora crassa) to >1.3 Gb (Danio rerio). With the exception of Solanum lycopersicum, a species representing economically important plants with large genomes, all selected species are model organisms whose genomes presumably have high-quality annotation. To assess the accuracy of gene prediction for the model species, we compared predicted and annotated genes on a whole-genome scale. In the case of S. lycopersicum, we used a limited set of genes validated by available RNA-Seq data. In all genomic datasets, contigs not assigned to any chromosome, as well as organellar genomes, were excluded from the analysis.
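The exclusion of unplaced contigs and organellar genomes can be illustrated with a minimal sketch. It assumes a tab-separated, GFF-like annotation whose first column is the sequence identifier, and a user-supplied set of chromosome identifiers. The function name `keep_chromosomal_records` and the identifiers in the example are hypothetical and are not taken from the datasets in Table 1.

```python
def keep_chromosomal_records(gff_lines, chromosome_ids):
    """Yield annotation lines located on sequences listed in chromosome_ids.

    Assumption: lines are tab-separated with the sequence ID in column 1.
    Anything not listed (unplaced contigs, organelles) is dropped.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        seqid = line.split("\t", 1)[0]
        if seqid in chromosome_ids:
            yield line


# Hypothetical toy annotation: two chromosomes, one scaffold, one organelle.
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaffold_12\tsrc\tgene\t5\t300\t.\t-\t.\tID=g2",
    "chrM\tsrc\tgene\t1\t200\t.\t+\t.\tID=g3",
    "chr2\tsrc\tgene\t50\t400\t.\t+\t.\tID=g4",
]
kept = list(keep_chromosomal_records(example, {"chr1", "chr2"}))
```

Using an explicit allow-list of chromosome identifiers, rather than pattern matching on names, avoids depending on any particular naming convention for scaffolds or organelles.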
We used the OrthoDB v10 protein database (23) as an all-inclusive source of protein sequences. However, to generate protein hints for a particular species, we used a subset of OrthoDB: plant proteins for gene prediction in Arabidopsis thaliana, arthropod proteins for gene prediction in Drosophila melanogaster, etc. (Table 2).
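The selection of a clade-specific protein subset can be sketched as follows. This is an illustration only: it assumes proteins are already available as (protein ID, source species, sequence) tuples together with a species-to-clade mapping. The function `select_clade_proteins`, the record layout and the toy data are hypothetical and do not reflect the actual OrthoDB file formats.

```python
def select_clade_proteins(records, species_to_clade, clade):
    """Return (protein_id, sequence) pairs whose source species is in `clade`.

    Assumption: `records` is an iterable of (protein_id, species, sequence)
    and `species_to_clade` maps a species name to a clade label.
    Species missing from the mapping are excluded.
    """
    return [
        (protein_id, sequence)
        for protein_id, species, sequence in records
        if species_to_clade.get(species) == clade
    ]


# Hypothetical toy data, not real database entries.
species_to_clade = {
    "Oryza sativa": "plants",
    "Apis mellifera": "arthropods",
}
records = [
    ("p1", "Oryza sativa", "MKV"),
    ("p2", "Apis mellifera", "MSL"),
    ("p3", "Unknown species", "MAA"),
]
plant_proteins = select_clade_proteins(records, species_to_clade, "plants")
```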
As an additional test set, we used the annotation of major protein isoforms available in the APPRIS database (24); this assessment was done for Caenorhabditis elegans, D. melanogaster and D. rerio (Supplementary Table S1). Arguably, the accuracy of prediction of major isoforms is of significant interest, since within a gene locus the major isoform has been observed to be expressed at a higher level than the other (minor) isoforms (24).