Benchmarking Gene Prediction Accuracy

Prediction accuracy with parameters trained by WebAUGUSTUS and by human experts was measured using three different data sets. OptA was evaluated using the genome of the insect Drosophila melanogaster (assembly BDGP R5/dm3) and 818 005 ESTs of the same species that were obtained from the National Center for Biotechnology Information (NCBI). OptB was evaluated using the genome of the plant Arabidopsis thaliana (assembly TAIR 10) and 35 375 protein sequences of the same species that were obtained from NCBI. OptC was evaluated using the genome of the worm Caenorhabditis elegans and 18 555 training gene structures retrieved from Wormbase (16).
To avoid an overly optimistic estimate of performance on new genes, the chromosomes of all genomes [those for fly and plant were downloaded from the UCSC Genome Browser database (17)] were split into two parts in such a way that ∼50% of the genes were located on the first part and the remaining genes were located on the second part. The second part of each chromosome was used as the genomic input sequence for training AUGUSTUS, whereas the first part served for assessing the accuracy of gene predictions.
The reference annotations for measuring accuracy were the protein-coding genes from FlyBase (18) for D. melanogaster, from TAIR 10 (19) for A. thaliana and from Wormbase for C. elegans.
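The text does not say how the split point on each chromosome was chosen. As a minimal sketch of one way to do it, the Python below places the split at the end of the gene that closes the first half of the coordinate-sorted gene list. The `(start, end)` gene representation and the function names `find_split_position` and `partition_genes` are assumptions made for illustration, not part of the original pipeline.

```python
from typing import List, Tuple

# Hypothetical representation: a gene as (start, end), 1-based inclusive.
Gene = Tuple[int, int]


def find_split_position(genes: List[Gene]) -> int:
    """Return a coordinate so that roughly half of the genes end at or before it.

    The split is the largest end coordinate among the first half of the
    genes when sorted by start, so none of those genes is cut in two.
    """
    if not genes:
        raise ValueError("no genes on this chromosome")
    ordered = sorted(genes)
    half = len(ordered) // 2
    if half == 0:
        return 0  # a single gene: everything goes to the second part
    return max(end for _, end in ordered[:half])


def partition_genes(genes: List[Gene], split: int):
    """Assign genes to the first part, the second part, or 'spanning'."""
    first = [g for g in genes if g[1] <= split]          # accuracy assessment
    second = [g for g in genes if g[0] > split]          # training
    spanning = [g for g in genes if g[0] <= split < g[1]]  # straddle the split
    return first, second, spanning
```

With overlapping genes the two parts will not be exactly equal in size, and genes that straddle the split point need a separate decision (for example, excluding them from both parts).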
The exact source of all data sets and the files used for the actual experiments are described in detail in Supplementary Materials, section Supplementary Methods: Data Sets.
Commonly used measures of accuracy in gene prediction, expressed in percent, are

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100, \qquad \text{Specificity} = \frac{TP}{TP + FP} \times 100,$$

where TP stands for true positives, i.e. the number of predicted features that agree with the gold-standard reference, FN stands for false negatives, i.e. the number of reference features that were missed by the predictor, and FP stands for false positives, i.e. the number of features that were predicted but do not agree with the reference annotation.
Sensitivity and specificity were measured for the features gene (i.e. only a gene structure that was predicted correctly including the exact positions of all CDS exons was counted as TP), exon (i.e. only exons that were predicted correctly were counted as TP) and nucleotide (i.e. every correctly predicted nucleotide was counted as TP).
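The three feature levels can be sketched in Python as set comparisons. This assumes the definitions usual in gene prediction (sensitivity = TP/(TP+FN), specificity = TP/(TP+FP), both in percent) and one transcript per gene. The exon tuple layout and all function names are hypothetical, and this is not the evaluation code used by the authors.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation: a CDS exon as (chromosome, strand, start, end),
# 1-based inclusive; a gene structure as the frozen set of its CDS exons.
Exon = Tuple[str, str, int, int]
GeneStructure = FrozenSet[Exon]


def sensitivity_specificity(predicted: Set, reference: Set) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent for two feature sets."""
    tp = len(predicted & reference)   # predicted and in the reference
    fn = len(reference - predicted)   # in the reference but missed
    fp = len(predicted - reference)   # predicted but not in the reference
    sn = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    sp = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_set(genes: Iterable[GeneStructure]) -> Set[Exon]:
    """Exon level: an exon counts only if both boundaries match exactly."""
    return {exon for gene in genes for exon in gene}


def nucleotide_set(genes: Iterable[GeneStructure]) -> Set[Tuple[str, str, int]]:
    """Nucleotide level: every coding position is its own feature."""
    return {
        (chrom, strand, pos)
        for gene in genes
        for chrom, strand, start, end in gene
        for pos in range(start, end + 1)
    }


def evaluate(predicted: Iterable[GeneStructure], reference: Iterable[GeneStructure]):
    """Gene level compares whole exon sets, so all CDS exons must match."""
    predicted, reference = list(predicted), list(reference)
    return {
        "gene": sensitivity_specificity(set(predicted), set(reference)),
        "exon": sensitivity_specificity(exon_set(predicted), exon_set(reference)),
        "nucleotide": sensitivity_specificity(
            nucleotide_set(predicted), nucleotide_set(reference)
        ),
    }
```

Because a gene is compared as a whole set of exons, a single shifted exon boundary turns an otherwise correct gene into one false positive and one false negative at the gene level, while most of its nucleotides still count as true positives.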


Hoff K.J, & Stanke M. (2013). WebAUGUSTUS—a web service for training AUGUSTUS and predicting genes in eukaryotes. Nucleic Acids Research, 41(Web Server issue), W123-W128.

Publication 2013



Other organizations : Universität Greifswald


Variable analysis

independent variables

Parameters trained by WebAUGUSTUS
Parameters trained by human experts

dependent variables

Prediction accuracy

control variables

Use of three different data sets: optA (Drosophila melanogaster), optB (Arabidopsis thaliana), and optC (Caenorhabditis elegans)
Splitting of chromosomes into two parts, with the second part used for training and the first part used for accuracy assessment
Use of reference annotations from FlyBase, TAIR 10, and Wormbase for the respective organisms
