Rapid Annotation Transfer Tool (RATT)

RATT is programmed in ‘bash’ and ‘PERL’, and its design is illustrated in Figure 1 and Supplementary Figure S1. First, two sequences are compared using ‘nucmer’ from the MUMmer package (17) to define sequence regions that share synteny. Those regions are filtered using configurable parameters that depend on the type of annotation mapping being attempted. Preset parameters are provided for transfers between assembly versions, strains or species (see Supplementary Table S1). To be included, synteny blocks must share at least 40% nucleotide sequence identity. Synteny information is stored as a base range in the query and its associated base range in the reference. However, this information alone is inadequate to map the annotation, because insertions or deletions (indels) change the relative distance between mapped synteny blocks. The coordinates are therefore sequentially adjusted across a synteny block by calling indels using ‘show-snps’ from the MUMmer package. Accurately calling indels within repetitive regions presents a particular challenge. Therefore, RATT recalibrates the adjusted coordinates using single nucleotide polymorphisms (SNPs, also called using ‘show-snps’) as unambiguous anchor points within synteny blocks. In transfers between very closely related sequences (e.g. successive assembly versions), SNPs may occur too infrequently to perform this coordinate adjustment. In such cases, RATT modifies the query by inserting a faux SNP every 300 bp to aid the recalibration step. The final sequence and transferred annotations remain unaffected.
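The indel-based coordinate adjustment described above can be illustrated with a minimal Python sketch. It is not RATT's code (RATT itself is written in bash and PERL); the `SyntenyBlock` class, its field names and the `ref_to_query` function are hypothetical names chosen for illustration. The sketch assumes a forward-strand block, 1-based coordinates, and indels expressed as a shift that applies to every reference position after the indel. It does not handle positions that fall inside bases deleted from the query, and it omits the SNP-based recalibration.

```python
from dataclasses import dataclass


@dataclass
class SyntenyBlock:
    """Hypothetical record for one forward-strand synteny block."""
    ref_start: int    # first reference base of the block (1-based, inclusive)
    ref_end: int      # last reference base of the block (inclusive)
    query_start: int  # query base aligned to ref_start
    # (ref_pos, shift) pairs: every reference position after ref_pos moves by
    # `shift` bases in the query (positive = insertion in the query,
    # negative = deletion in the query).
    indels: list


def ref_to_query(block: SyntenyBlock, ref_pos: int) -> int:
    """Map a reference coordinate to the query, accumulating indel shifts."""
    if not block.ref_start <= ref_pos <= block.ref_end:
        raise ValueError("position lies outside the synteny block")
    offset = block.query_start - block.ref_start
    for pos, shift in sorted(block.indels):
        if pos >= ref_pos:
            break  # remaining indels lie at or after the position of interest
        offset += shift
    return ref_pos + offset


block = SyntenyBlock(ref_start=100, ref_end=200, query_start=1100,
                     indels=[(120, 3), (150, -2)])
print(ref_to_query(block, 110))  # 1110: before any indel
print(ref_to_query(block, 121))  # 1124: after the 3 bp insertion
print(ref_to_query(block, 160))  # 1161: after both indels (+3 - 2)
```

The running offset makes the point of the paragraph concrete: without the indel list, every position after base 120 would be misplaced by a constant block offset.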
Figure 1.

Workflow of RATT.

Once the coordinates within synteny blocks have been defined, RATT proceeds to the annotation-mapping step, whereby each feature within a reference EMBL file is associated with new coordinates on the query (Supplementary Figure S1B). A feature is not mapped (and is written to the non-transferred bin file) if it bridges a synteny break and its coordinate boundaries map to different chromosomes or different DNA strands, or if the distance between its mapped coordinates has increased by more than 20 kb. If a short sequence from the beginning, middle or end of a feature can be placed within a synteny region, mapping is attempted (see Supplementary Figure S1B). In addition, if the exons of a single gene model map to different gene regions, the model is split and identified in the output file. The bin is an EMBL-format file that can be loaded onto the reference sequence for analysis (see Figure 2, brown colour track). Further outputs include statistics about transferred features and the amount of synteny conserved between the reference and query, as well as Artemis-readable files showing SNPs, indels and regions that lack synteny between the compared sequences; see the example on the SourceForge site.
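The rejection rule for mapped features can be sketched in Python as follows. This is one reading of the criteria stated above, not RATT's implementation: the `MappedPos` class, the `is_transferable` function and the constant name are hypothetical, and the sketch assumes that both boundaries of the feature have already been given query coordinates.

```python
from dataclasses import dataclass

MAX_SPAN_INCREASE = 20_000  # 20 kb threshold stated in the text


@dataclass
class MappedPos:
    """Hypothetical record for one feature boundary after mapping."""
    chromosome: str
    strand: int    # +1 or -1
    position: int  # coordinate on the query


def is_transferable(ref_start: int, ref_end: int,
                    new_start: MappedPos, new_end: MappedPos) -> bool:
    """Return False if the feature belongs in the non-transferred bin."""
    if new_start.chromosome != new_end.chromosome:
        return False  # boundaries landed on different chromosomes
    if new_start.strand != new_end.strand:
        return False  # boundaries landed on different strands
    old_span = abs(ref_end - ref_start)
    new_span = abs(new_end.position - new_start.position)
    # Reject features whose mapped span grew by more than 20 kb.
    return new_span - old_span <= MAX_SPAN_INCREASE


ok = is_transferable(1000, 2000,
                     MappedPos("chr1", 1, 5000), MappedPos("chr1", 1, 6500))
too_long = is_transferable(1000, 2000,
                           MappedPos("chr1", 1, 5000), MappedPos("chr1", 1, 27000))
print(ok, too_long)  # True False
```

Features for which this check fails would be written to the bin file rather than to the transferred annotation.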
Figure 2.

Transfer of annotation from the M. tuberculosis strain H37Rv onto the strain F11 sequence, over a deletion. The genomes of H37Rv (upper) and F11 (lower) are shown using the Artemis Comparison Tool (ACT). The source H37Rv annotation (light blue) is directly mapped onto F11 by RATT (green) except for those features corresponding to a region that is unique to the source strain that cannot be transferred and are written to a separate output file (brown).

Although two sequences may be related, differences can occur, such as a change in the start or stop codon of a protein-coding sequence. Therefore, we implemented a correction algorithm in RATT (see Supplementary Figure S1C). Figure 3 shows examples of the correction step. First, the start codon is checked. If it is not present, the upstream sequence is searched for a new start codon (Figure 3A). If a stop codon is encountered during this search, the first start codon downstream is used instead. In the absence of any start codon, an error is recorded in the results file. If the sequence between exons has no stop codon and a length divisible by three bases, but the splice acceptor or donor sequences are wrong, then the intron is eliminated. Likewise, frameshifts previously introduced into the reference to maintain conceptual translations (for instance, in apparent pseudogenes) are removed from coding sequences in the query. RATT also detects, and attempts to fix, incorrect splice sites. As splice sites are difficult to annotate correctly, RATT only tries to correct a gene model that has exactly one wrong splice site. In that case, the closest alternative splice donor or acceptor that generates no frameshift when used is selected. Next, RATT searches for genes or exons with internal stop codons further than 150 bp from the 3′-end. If the introduction of a frameshift would generate a model without internal stop codons, the model is corrected (Figure 3C). Stop codons are corrected last: if a model has fewer than five internal stops in its last exon, the model is shortened to the first stop codon (Figure 3B). If the model has no stop codon, it is extended downstream until a stop codon is found.
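The start- and stop-codon checks can be sketched in Python for the simplest case, a single-exon model on the forward strand. The function names `find_start` and `end_at_first_stop` are hypothetical, coordinates are 0-based, and several details are assumptions rather than statements from the text: the upstream search returns the nearest in-frame start codon, reaching the beginning of the sequence is treated like meeting a stop codon, and the five-internal-stop threshold and all splice-site logic are left out.

```python
STOPS = frozenset({"TAA", "TAG", "TGA"})


def find_start(seq, cds_start, cds_end, starts=frozenset({"ATG"})):
    """Return the 0-based index of a usable start codon, or None."""
    if seq[cds_start:cds_start + 3] in starts:
        return cds_start  # the transferred start codon is intact
    # Walk upstream codon by codon, staying in frame.
    pos = cds_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOPS:
            break  # a stop codon ends the upstream search
        if codon in starts:
            return pos
        pos -= 3
    # Fall back to the first in-frame start codon downstream.
    for pos in range(cds_start + 3, cds_end - 2, 3):
        if seq[pos:pos + 3] in starts:
            return pos
    return None  # the caller would record an error for this model


def end_at_first_stop(seq, cds_start):
    """Return the exclusive end index just past the first in-frame stop."""
    for pos in range(cds_start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOPS:
            return pos + 3
    return None  # no stop codon before the end of the sequence


print(find_start("ATGCCCAAATAA", 3, 12))     # 0: start found upstream
print(find_start("TAACCCATGAAATAA", 3, 15))  # 6: upstream stop, start downstream
print(end_at_first_stop("ATGAAACCCTAG", 0))  # 12: model ends after TAG
```

Passing a larger `starts` set, for example `{"ATG", "GTG", "TTG"}`, is one way to model organisms that use alternative start codons.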
Figure 3.

RATT corrections of transferred annotations. Annotations from H37Rv were transferred onto the F11 sequence (pale blue), corrected (green) and then compared with the existing strain F11 annotation in EMBL (yellow). (A and B) The correction of start and stop codons, respectively. In a more complex mapping situation (C), where all three reading frames are shown for clarity, RATT maps a large single coding sequence (CDS) from H37Rv to a locus within F11 that includes several in-frame stop codons. By inserting a frameshift (i.e. to indicate a pseudogene), the conceptual translation is preserved. This contrasts with the two overlapping genes predicted as part of the F11 genome project.

Different criteria can be specified depending on the translation table that an organism uses (e.g. bacterial TTG and GTG start codons) or whether unusual splice sites are used. RATT is programmed in PERL and was tested in UNIX/LINUX environments. The output can be loaded into Artemis/ACT. A list and explanation of all the output files can be found at the SourceForge site.

Otto T.D., Dillon G.P., Degrave W.S., & Berriman M. (2011). RATT: Rapid Annotation Transfer Tool. Nucleic Acids Research, 39(9), e57.

Publication 2011

Other organizations : Wellcome Sanger Institute

Variable analysis

independent variables

The type of annotation mapping that is being attempted (e.g. transfers between assembly versions, strains or species)

dependent variables

The coordinates within synteny blocks after adjusting for insertions and deletions (indels)
The final sequence and transferred annotations

control variables

The configurable parameters used for filtering the synteny regions
The minimum nucleotide sequence identity between synteny blocks (set at 40%)
The maximum distance increase of the new mapped coordinates compared to the original coordinates (set at 20 kb)

positive controls

None specified

negative controls

None specified
