The input of IntegronFinder is a DNA sequence in FASTA format. The sequence is annotated with Prodigal v2.6.2 (46), using the default mode for replicons larger than 200 kb and the metagenomic mode for smaller replicons ('-p meta' in Prodigal) (Figure 3). In the present work, we omitted the annotation step and used the NCBI RefSeq annotations because they are curated. The annotation step is particularly useful for studying newly acquired or poorly annotated sequences.
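The choice of Prodigal mode described above can be sketched as follows. This is a minimal illustration, not the code used by IntegronFinder: the function name `prodigal_command` and its parameters are hypothetical, and the treatment of a replicon of exactly 200 kb (metagenomic mode here) is an assumption, since the text only distinguishes "larger than 200 kb" from "smaller".

```python
def prodigal_command(fasta_path, protein_out, replicon_length,
                     size_threshold=200_000):
    """Build a Prodigal command line for one replicon (hypothetical helper).

    Replicons larger than `size_threshold` use Prodigal's default mode;
    the others use the metagenomic mode ('-p meta').
    """
    # -i: input FASTA, -a: output file for the protein translations
    cmd = ["prodigal", "-i", fasta_path, "-a", protein_out]
    if replicon_length <= size_threshold:
        # Assumption: a replicon of exactly 200 kb is treated as "small".
        cmd += ["-p", "meta"]
    return cmd
```

The returned list can be passed to `subprocess.run` on a system where Prodigal is installed.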
The program searches for the two protein profiles of the integron-integrase using hmmsearch with default parameters from the HMMER suite version 3.1b1, and for the attC sites with the default mode of cmsearch from INFERNAL 1.1 (Figure 3). Two attC sites are put in the same cluster if they are less than 4 kb apart on the same strand. The clusters are built by transitivity: an attC site less than 4 kb from any attC site of a cluster is integrated into that cluster. Clusters are merged when they are located less than 4 kb apart. The 4 kb threshold was determined empirically as a compromise between sensitivity (large values decrease the probability of missing cassettes) and specificity (small values are less likely to put together two independent integrons). More precisely, the threshold is twice the size of the largest known cassettes (∼2 kb (6)). This guarantees that, even in the worst case (largest known cassettes), two attC sites will be clustered if an intervening site was not detected. Importantly, the user can set this threshold ('--distance_thresh' in IntegronFinder).
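The transitive clustering rule can be illustrated with a short single-linkage sketch. It is written under stated assumptions and is not the IntegronFinder implementation: the `AttC` record and the function `cluster_attc_sites` are hypothetical names, strands are encoded as 1 and -1, and the distance between two sites is taken as the gap between the end of one and the start of the next.

```python
from collections import namedtuple

# Hypothetical record for one attC hit: coordinates in bp, strand as 1 or -1.
AttC = namedtuple("AttC", ["start", "end", "strand"])


def cluster_attc_sites(sites, distance_thresh=4000):
    """Group attC sites lying less than `distance_thresh` bp apart on the
    same strand (single linkage, so clusters grow by transitivity)."""
    clusters = []
    for strand in sorted({s.strand for s in sites}):
        ordered = sorted((s for s in sites if s.strand == strand),
                         key=lambda s: s.start)
        current, current_end = [], None
        for site in ordered:
            # Assumption: distance = gap between the rightmost end of the
            # current cluster and the start of the next site.
            if current and site.start - current_end >= distance_thresh:
                clusters.append(current)
                current, current_end = [], None
            current.append(site)
            current_end = site.end if current_end is None else max(current_end, site.end)
        if current:
            clusters.append(current)
    return clusters
```

Because sites are visited in coordinate order, comparing each site with the end of the growing cluster is enough to obtain both the transitive growth and the merging of neighbouring clusters.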
The results of the searches for the elements of the integron are combined to classify the loci into three categories (Figure 1B, C, D). (i) The elements with intI and at least one attC site were named complete integrons. The word complete is meant to characterize the presence of both elements; we cannot ascertain the functionality or expression of the integron. (ii) The In0 elements have intI but no recognizable attC sites. We do not strictly follow the original definition of In0, which also includes the presence of an attI (47), because this sequence is not known for most integrons (and thus cannot be searched for). (iii) The cluster of attC sites lacking integron-integrase (CALIN) has at least two attC sites and lacks nearby intI.
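The three categories reduce to a small decision rule, sketched below. The function name `classify_locus` and the returned labels are illustrative choices; returning `None` for a lone attC site without intI reflects the fact that such a locus falls into none of the three categories defined above.

```python
def classify_locus(has_integrase, n_attc):
    """Classify a locus from the presence of intI and the number of attC sites.

    Returns 'complete', 'In0', 'CALIN', or None when the locus fits none of
    the three categories (for example a single attC site without intI).
    """
    if has_integrase:
        # intI with at least one attC site is a complete integron,
        # intI alone is an In0 element.
        return "complete" if n_attc >= 1 else "In0"
    if n_attc >= 2:
        # Cluster of attC sites lacking integron-integrase.
        return "CALIN"
    return None
```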
To obtain a better compromise between accuracy and running time, IntegronFinder can re-run INFERNAL to search for attC sites with more precision using the Inside algorithm ('--max' option in INFERNAL), but only around previously identified elements ('--local_max' option in IntegronFinder). More precisely, if a locus contains an integron-integrase and attC sites (complete integron), the search is constrained to the strand encoding attC sites, between the end of the integron-integrase and 4 kb after its most distant attC. If other attC sites are found after this one, the search is extended by 4 kb in that direction until no new sites are found. If the element contains only attC sites (CALIN), the search is performed on the same strand in both directions. If the integron is In0, the search for attC sites is done on both strands in the 4 kb flanking the integron-integrase on each side. The program then searches for promoters and attI sites near the integron-integrase. Finally, it can annotate the integron gene cassettes (defined in the program as the CDS found between intI and 200 bp after the last attC site, or 200 bp before the first and 200 bp after the last attC site if there is no integron-integrase) using a database of protein profiles (option '--func_annot'). For example, in the present study we used the ResFams database to search for antibiotic resistance genes. Any HMMER-compatible profile database can be used with the program.
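The iterative local search and the cassette window can be sketched as below for the simplest case. Several points are assumptions made for illustration only: coordinates are on the forward strand with the integron-integrase upstream of the attC sites, `find_sites(start, end)` is a hypothetical callback standing in for a sensitive INFERNAL run on that interval, "extended by 4 kb" is read as moving the limit to 4 kb past the most distant site found so far, and a CDS is counted when it lies entirely within the window.

```python
def extend_search(find_sites, integrase_end, known_attc, step=4000):
    """Repeat the local search until no attC site is found beyond the
    current most distant one. Sites are (start, end) tuples in bp."""
    sites = set(known_attc)
    limit = max(end for _, end in sites) + step
    while True:
        sites |= set(find_sites(integrase_end, limit))
        new_limit = max(end for _, end in sites) + step
        if new_limit <= limit:  # no site found further away: stop
            return sorted(sites)
        limit = new_limit


def cassette_window(attc_sites, integrase_end=None, flank=200):
    """Interval in which CDS are considered cassette genes."""
    last_end = max(end for _, end in attc_sites)
    if integrase_end is not None:
        start = integrase_end  # from intI to 200 bp after the last attC
    else:
        # No integrase: 200 bp before the first attC site.
        start = max(0, min(s for s, _ in attc_sites) - flank)
    return start, last_end + flank


def cds_in_window(cds_list, window):
    """Keep the CDS (start, end) lying entirely inside the window."""
    lo, hi = window
    return [(s, e) for s, e in cds_list if s >= lo and e <= hi]
```

The CALIN and In0 cases described above would require searching in both directions or on both strands, which this sketch does not cover.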
The program outputs tabular and GenBank files listing all the identified genetic elements associated with an integron. The program also produces a figure in PDF format representing each complete integron. For an interactive view of all the hits, one can use the GenBank file as input in specific programs such as Geneious (48).
The user can change the profiles of the integrases and the covariance model of the attC site. Thus, if novel models of attC sites were to be built in the future, e.g., for novel types of attC sites, they could easily be plugged into IntegronFinder.
Cury J., Jové T., Touchon M., Néron B., & Rocha E.P. (2016). Identification and analysis of integrons and cassette arrays in bacterial genomes. Nucleic Acids Research, 44(10), 4539-4550.