AUGUSTUS is based on a generalized hidden Markov model (GHMM), which defines probability distributions for the various sections of genomic sequences. Introns, exons, intergenic regions, etc. correspond to states in the model, and each state is assumed to emit DNA sequences with certain pre-defined emission probabilities. Similar to other HMM-based gene finders, AUGUSTUS finds an optimal parse of a given genomic sequence, i.e. a segmentation of the sequence into states that is most likely according to the underlying statistical model. We probabilistically model the sequence around the splice sites, the sequence of the branch point region, the bases before the translation start, the coding and non-coding regions, the first coding bases of a gene, the length distributions of single exons, initial exons, internal exons, terminal exons and intergenic regions, the distribution of the number of exons per gene and the length distribution of introns.
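To make the notion of an optimal parse concrete, the following is a minimal sketch of the Viterbi algorithm for an ordinary HMM with two toy states. It is not the AUGUSTUS model: a GHMM additionally uses explicit length distributions and many more states, and all names and probabilities here (`STATES`, `LOG_EMIT`, the 0.9/0.1 transition values, and so on) are made up for illustration.

```python
import math

# Toy parameters (assumptions for illustration, not AUGUSTUS values).
STATES = ["noncoding", "coding"]
LOG_INIT = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "noncoding": {"noncoding": math.log(0.9), "coding": math.log(0.1)},
    "coding": {"noncoding": math.log(0.1), "coding": math.log(0.9)},
}
LOG_EMIT = {
    # AT-rich emissions for the noncoding state, GC-rich for the coding state
    "noncoding": {b: math.log(p) for b, p in zip("acgt", (0.4, 0.1, 0.1, 0.4))},
    "coding": {b: math.log(p) for b, p in zip("acgt", (0.1, 0.4, 0.4, 0.1))},
}


def viterbi(seq, states, log_init, log_trans, log_emit):
    """Return the most likely state path for seq under a plain HMM."""
    if not seq:
        return []
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: log_init[s] + log_emit[s][seq[0]] for s in states}]
    # back[i][s]: predecessor state on that best path
    back = [{}]
    for i in range(1, len(seq)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + log_emit[s][seq[i]]
            back[i][s] = prev
    # Trace back from the best final state to recover the full parse.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


parse = viterbi("atatatatgcgcgcgc", STATES, LOG_INIT, LOG_TRANS, LOG_EMIT)
```

With these toy parameters the parse labels the AT-rich prefix as noncoding and the GC-rich suffix as coding, which is the kind of segmentation into states that the text describes.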
The performance of AUGUSTUS has been extensively evaluated on sequence data from human and Drosophila (7,8). These studies showed that, especially for long input sequences, the accuracy of our program is superior to that of existing ab initio gene finding approaches. To make our tool available to the research community, we have set up a WWW server at GOBICS (Göttingen Bioinformatics Compute Server) (9).
AUGUSTUS may be forced to predict an exon, an intron, a splice site, a translation start or a translation end point at a certain position in the sequence. Any number of such constraints may be given; the supported types of constraints are listed in Table 1.
With the term gene structure, we refer to a segmentation of the input sequence into any meaningful sequence of exons, introns and intergenic regions. This includes the possibility of having no genes at all or of having multiple genes. AUGUSTUS tries to predict a gene structure that
(i) is (biologically) consistent in the following sense:
(a) No exon contains an in-frame stop codon.
(b) The splice sites obey the gt–ag consensus.
(c) All complete genes start with atg and end with a stop codon.
(d) Each gene ends before the next gene starts.
(e) The lengths of single exons and introns exceed a species-dependent minimal length.
(ii) obeys all given constraints.
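The consistency conditions above can be checked mechanically for a candidate gene structure. The sketch below is an illustration under simplifying assumptions, not AUGUSTUS code: genes lie on the forward strand, each gene is complete and given as a sorted list of half-open `(start, end)` exon coordinates, and the names `is_consistent_gene`, `is_consistent_structure`, `min_exon_len` and `min_intron_len` as well as the default length limits are hypothetical.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def is_consistent_gene(seq, exons, min_exon_len=3, min_intron_len=4):
    """Check one complete forward-strand gene given as sorted (start, end) exons."""
    if not exons:
        return False
    # Exon lengths must reach the (here hypothetical) minimal length.
    if any(end - start < min_exon_len for start, end in exons):
        return False
    # Every intron must be long enough and obey the gt-ag consensus.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = seq[left_end:right_start]
        if len(intron) < min_intron_len:
            return False
        if not (intron.startswith("gt") and intron.endswith("ag")):
            return False
    # The spliced coding sequence must start with atg and end with a stop codon.
    cds = "".join(seq[start:end] for start, end in exons)
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    if not cds.startswith("atg") or cds[-3:] not in STOP_CODONS:
        return False
    # No in-frame stop codon before the terminal one.
    inner_codons = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    return not any(codon in STOP_CODONS for codon in inner_codons)


def is_consistent_structure(seq, genes, **limits):
    """Check all genes and that each gene ends before the next one starts."""
    if not all(is_consistent_gene(seq, gene, **limits) for gene in genes):
        return False
    return all(
        left[-1][1] <= right[0][0] for left, right in zip(genes, genes[1:])
    )


# Toy sequence: 2 flanking bases, exon "atggcc", intron "gtaaag", exon "tcttaa".
toy_seq = "cc" + "atggcc" + "gtaaag" + "tcttaa" + "cc"
toy_gene = [(2, 8), (14, 20)]
```

An empty list of genes counts as consistent here, matching the statement that a gene structure may contain no genes at all.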
Among all gene structures that are consistent and that obey all constraints, AUGUSTUS finds the most likely gene structure. A constraint may contradict the biological consistency. For example, an exonpart constraint may be impossible to realize because there is no containing open reading frame with allowed exon boundaries. If no consistent gene structure obeys all constraints, then some constraints are ignored. Likewise, if two or more constraints contradict each other, then AUGUSTUS obeys only the constraint that fits the model better. Figure 1 illustrates the concept. Further examples are on the page .
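The selection rule in this paragraph can be mimicked by brute force on a small, explicitly enumerated set of candidates. This is only a conceptual sketch: AUGUSTUS does not enumerate gene structures, and the text does not specify which constraints are dropped beyond the rule for contradicting constraints. The function `best_structure`, the representation of constraints as Python predicates, and the policy of keeping as many constraints as possible are assumptions made for this example.

```python
from itertools import combinations


def best_structure(candidates, constraints):
    """Pick the most likely candidate obeying as many constraints as possible.

    candidates:  list of (structure, log_prob) pairs, all assumed consistent.
    constraints: list of predicates mapping a structure to True or False.
    Returns (structure, indices_of_obeyed_constraints), or (None, ()) if
    there are no candidates.
    """
    # Try to obey all constraints first, then progressively fewer.
    for k in range(len(constraints), -1, -1):
        best = None
        for subset in combinations(range(len(constraints)), k):
            for structure, log_prob in candidates:
                if all(constraints[i](structure) for i in subset):
                    if best is None or log_prob > best[1]:
                        best = (structure, log_prob, subset)
        if best is not None:
            return best[0], best[2]
    return None, ()


# Toy candidates labelled by name, with made-up log-probabilities.
toy_candidates = [("A", -10.0), ("B", -12.0), ("C", -11.0)]

# Two constraints that contradict each other: no candidate satisfies both.
wants_b = lambda s: s == "B"
wants_c = lambda s: s == "C"
chosen, obeyed = best_structure(toy_candidates, [wants_b, wants_c])
```

In the toy case the two constraints cannot both hold, so only the second one is obeyed, because the candidate it selects has the higher log-probability under the (made-up) model scores.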