To avoid an overly optimistic performance estimate for the new genes, the chromosomes of all genomes [for fly and plant downloaded from the UCSC Genome Browser database (17)] were split into two parts in such a way that ∼50% of the genes were located on the first part, and the remaining genes were located on the second part. The second part of all chromosomes was used as the genomic input sequence for training AUGUSTUS, whereas the first part served for accuracy assessment of the gene predictions.
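The split described above can be illustrated with a minimal sketch. It is not the procedure used in the experiments: the function names `find_split_position` and `split_genes`, the representation of a gene as a `(start, end)` pair and the rule of never cutting through a gene are all assumptions made for illustration.

```python
from typing import List, Tuple

Gene = Tuple[int, int]  # (start, end) on one chromosome, inclusive (assumed)


def find_split_position(genes: List[Gene]) -> int:
    """Return a coordinate p such that about half of the genes end at or
    before p and no gene spans p (hypothetical helper)."""
    if len(genes) < 2:
        raise ValueError("need at least two genes to split a chromosome")
    ordered = sorted(genes)
    half = len(ordered) // 2
    # Rightmost end among the first half of the genes.
    boundary = max(end for _, end in ordered[:half])
    # Shift the boundary to the right while a later gene overlaps it,
    # so that the cut falls into an intergenic region.
    idx = half
    while idx < len(ordered) and ordered[idx][0] <= boundary:
        boundary = max(boundary, ordered[idx][1])
        idx += 1
    return boundary


def split_genes(genes: List[Gene], p: int) -> Tuple[List[Gene], List[Gene]]:
    """Partition genes into those on the first part (end <= p) and those
    on the second part (start > p)."""
    first = [g for g in genes if g[1] <= p]
    second = [g for g in genes if g[0] > p]
    return first, second


# Toy coordinates, not real annotation data.
toy_genes = [(100, 200), (300, 400), (500, 600), (700, 800)]
cut = find_split_position(toy_genes)
test_part, training_part = split_genes(toy_genes, cut)
```

In this sketch the first part plays the role of the test region and the second part that of the training region, mirroring the assignment in the text.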
Protein coding genes from FlyBase (18) for D. melanogaster, from TAIR 10 (19) for A. thaliana and from WormBase for C. elegans were used as reference annotations for measuring accuracy.
The exact source of all data sets and the files used for the actual experiments are described in detail in
Commonly used measures of accuracy in gene prediction, both expressed in percent, are sensitivity and specificity,

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100, \qquad \text{Specificity} = \frac{TP}{TP + FP} \times 100,$$

where TP stands for true positives, i.e. the number of predicted features that agree with the gold-standard reference, FN stands for false negatives, i.e. the number of reference features that were missed by the predictor, and FP stands for false positives, i.e. the number of features that were predicted but do not agree with the reference annotation.
Sensitivity and specificity were measured for the features gene (i.e. only a gene structure that was predicted correctly including the exact positions of all CDS exons was counted as TP), exon (i.e. only exons that were predicted correctly were counted as TP) and nucleotide (i.e. every correctly predicted nucleotide was counted as TP).
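As a minimal sketch of how these counts translate into numbers, the following code computes both measures at the exon, nucleotide and gene level. It assumes the usual gene-prediction definitions, sensitivity = TP/(TP+FN) and specificity = TP/(TP+FP), expressed in percent. The helper names, the representation of an exon as a `(chrom, start, end)` tuple with inclusive coordinates and the toy coordinates are illustrative assumptions, not the evaluation scripts used for the experiments.

```python
from typing import FrozenSet, Hashable, Iterable, Set, Tuple

Exon = Tuple[str, int, int]  # (chromosome, start, end), inclusive (assumed)


def sensitivity_specificity(predicted: Iterable[Hashable],
                            reference: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent for two feature sets."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)   # predicted features that match the reference
    fn = len(ref - pred)   # reference features the predictor missed
    fp = len(pred - ref)   # predicted features absent from the reference
    sn = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    sp = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def covered_positions(exons: Iterable[Exon]) -> Set[Tuple[str, int]]:
    """Expand exons into the set of individual coding positions."""
    return {(chrom, pos)
            for chrom, start, end in exons
            for pos in range(start, end + 1)}


# Toy data: one reference gene with two exons, one predicted gene with three.
ref_exons = [("chr1", 100, 199), ("chr1", 300, 399)]
pred_exons = [("chr1", 100, 199), ("chr1", 310, 399), ("chr1", 500, 549)]

# Exon level: an exon counts as TP only if both boundaries match exactly.
exon_sn, exon_sp = sensitivity_specificity(pred_exons, ref_exons)

# Nucleotide level: every position covered by both annotations is a TP.
nuc_sn, nuc_sp = sensitivity_specificity(covered_positions(pred_exons),
                                         covered_positions(ref_exons))

# Gene level: a gene is a TP only if its complete set of CDS exons matches.
ref_genes: Set[FrozenSet[Exon]] = {frozenset(ref_exons)}
pred_genes: Set[FrozenSet[Exon]] = {frozenset(pred_exons)}
gene_sn, gene_sp = sensitivity_specificity(pred_genes, ref_genes)
```

The same function serves all three levels because only the definition of a feature changes. The gene level is the strictest: a single misplaced exon boundary turns an otherwise good prediction into both a false positive and a false negative.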