The input of IntegronFinder is a DNA sequence in FASTA format. The sequence is annotated with Prodigal v2.6.2 (46), using the default mode for replicons larger than 200 kb and the metagenomic mode for smaller replicons (‘-p meta’ in Prodigal) (Figure 3). In the present work, we skipped the annotation step and used the NCBI RefSeq annotations because they are curated. The annotation step is particularly useful for studying newly acquired or poorly annotated sequences.
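The choice of Prodigal mode can be illustrated with a minimal sketch. The function name, the file names and the handling of a replicon of exactly 200 kb (treated here as small) are assumptions made for illustration; this is not IntegronFinder's own code. Only the Prodigal options ‘-i’, ‘-a’ and ‘-p meta’ are used.

```python
def prodigal_command(fasta_path, protein_out, replicon_length, size_threshold=200_000):
    """Build a Prodigal command line whose mode depends on the replicon size."""
    cmd = ["prodigal", "-i", fasta_path, "-a", protein_out]
    # Replicons larger than the threshold use Prodigal's default mode (no -p flag).
    # Smaller replicons use the metagenomic mode.
    # Assumption: a replicon of exactly `size_threshold` bp counts as small.
    if replicon_length <= size_threshold:
        cmd += ["-p", "meta"]
    return cmd


# Hypothetical inputs: a 60 kb plasmid and a 4.6 Mb chromosome.
plasmid_cmd = prodigal_command("plasmid.fasta", "plasmid.faa", 60_000)
chromosome_cmd = prodigal_command("chromosome.fasta", "chromosome.faa", 4_600_000)
```

The returned list can be passed to `subprocess.run` if Prodigal is installed.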
The program searches for the two protein profiles of the integron-integrase using hmmsearch with default parameters from the HMMER suite version 3.1b1, and for the attC sites with the default mode of cmsearch from INFERNAL 1.1 (Figure 3). Two attC sites are put in the same cluster if they are less than 4 kb apart on the same strand. The clusters are built by transitivity: an attC site less than 4 kb from any attC site of a cluster is added to that cluster. Clusters are merged when they are located less than 4 kb apart. The 4 kb threshold was determined empirically as a compromise between sensitivity (large values decrease the probability of missing cassettes) and specificity (small values are less likely to put together two independent integrons). More precisely, the threshold is twice the size of the largest known cassettes (∼2 kb (6)). This guarantees that, even in the worst case (largest known cassettes), two attC sites will still be clustered when the site between them was not detected. Importantly, the user can set this threshold (‘--distance_thresh’ in IntegronFinder).
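The clustering rule can be sketched as single-linkage grouping of sorted sites on each strand. This is an illustrative sketch, not IntegronFinder's implementation: the `AttC` tuple, the function name and the choice to measure distance from the end of the cluster to the start of the next site are assumptions.

```python
from collections import namedtuple

# Hypothetical record for an attC hit: coordinates in bp, strand as +1 or -1.
AttC = namedtuple("AttC", ["start", "end", "strand"])


def cluster_attc(sites, distance_thresh=4000):
    """Group attC sites that lie less than `distance_thresh` bp apart on the same strand."""
    clusters = []
    for strand in sorted({s.strand for s in sites}):
        same_strand = sorted((s for s in sites if s.strand == strand),
                             key=lambda s: s.start)
        current, current_end = [], None
        for site in same_strand:
            # Transitivity: compare with the furthest end reached by the cluster so far.
            if current and site.start - current_end < distance_thresh:
                current.append(site)
                current_end = max(current_end, site.end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [site], site.end
        if current:
            clusters.append(current)
    return clusters


sites = [
    AttC(1000, 1060, 1), AttC(3000, 3060, 1), AttC(6500, 6560, 1),
    AttC(20000, 20060, 1),   # too far from the others: its own cluster
    AttC(3500, 3560, -1),    # opposite strand: never merged with the + strand sites
]
clusters = cluster_attc(sites)
cluster_sizes = [len(c) for c in clusters]
```

Because the sites are sorted by position, comparing each site with the current cluster is enough to build clusters by transitivity and to merge neighbouring clusters in one pass.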
The results of the searches for the elements of the integron are combined to classify the loci into three categories (Figure 1B, C, D). (i) The elements with intI and at least one attC site were named complete integrons. The word complete is meant to characterize the presence of both elements; we cannot ascertain the functionality or expression of the integron. (ii) The In0 elements have intI but no recognizable attC sites. We do not strictly follow the original definition of In0, which also includes the presence of an attI (47), because this sequence is not known for most integrons (and thus cannot be searched for). (iii) The clusters of attC sites lacking integron-integrase (CALIN) have at least two attC sites and lack a nearby intI.
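The three categories reduce to a simple decision rule on the number of intI and attC hits at a locus. The sketch below is illustrative only; the function name and the `None` return value for loci that fit no category (for example a lone attC site) are assumptions.

```python
def classify_locus(n_integrase, n_attc):
    """Return the integron category of a locus from its intI and attC counts."""
    if n_integrase >= 1 and n_attc >= 1:
        return "complete"   # intI and at least one attC site
    if n_integrase >= 1:
        return "In0"        # intI without recognizable attC sites
    if n_attc >= 2:
        return "CALIN"      # at least two attC sites, no nearby intI
    return None             # assumption: a lone attC site is not reported


categories = [classify_locus(1, 3), classify_locus(1, 0),
              classify_locus(0, 2), classify_locus(0, 1)]
```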
To obtain a better compromise between accuracy and running time, IntegronFinder can re-run INFERNAL to search for attC sites with more precision using the Inside algorithm (‘--max’ option in INFERNAL), but only around previously identified elements (‘--local_max’ option in IntegronFinder). More precisely, if a locus contains an integron-integrase and attC sites (complete integron), the search is restricted to the strand encoding the attC sites, between the end of the integron-integrase and 4 kb after its most distant attC site. If other attC sites are found beyond this one, the search is extended by 4 kb in that direction until no new sites are found. If the element contains only attC sites (CALIN), the search is performed on the same strand in both directions. If the integron is In0, the search for attC sites is done on both strands in the 4 kb flanking the integron-integrase on each side. The program then searches for promoters and attI sites near the integron-integrase. Finally, it can annotate the integron gene cassettes (defined in the program as the CDS found between intI and 200 bp after the last attC site, or between 200 bp before the first and 200 bp after the last attC site if there is no integron-integrase) using a database of protein profiles (option ‘--func_annot’). For example, in the present study we used the ResFams database to search for antibiotic resistance genes. Any HMMER-compatible profile database can be used with the program.
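The iterative extension of the search window for a complete integron can be sketched as follows. This is a simplified illustration under stated assumptions, not the program's code: `find_sites(start, end)` is a hypothetical stand-in for the sensitive INFERNAL search restricted to a region, sites are `(start, end)` tuples, and the cassettes are assumed to lie at higher coordinates than the integron-integrase.

```python
def local_max_search(find_sites, integrase_end, known_sites, step=4000):
    """Extend the search window by `step` bp until no site is found beyond the last one."""
    found = set(known_sites)
    furthest = max(end for _, end in found)   # end of the most distant known attC
    while True:
        window_end = furthest + step
        hits = set(find_sites(integrase_end, window_end))
        found |= hits
        # Sites lying beyond the current most distant attC trigger another extension.
        beyond = [h for h in hits if h[1] > furthest]
        if not beyond:
            break
        furthest = max(end for _, end in beyond)
    return sorted(found)


# Toy data standing in for the sensitive search: the last site is out of reach.
all_sites = [(5000, 5060), (8000, 8060), (11500, 11560), (30000, 30060)]


def fake_find_sites(start, end):
    return [s for s in all_sites if s[0] >= start and s[1] <= end]


result = local_max_search(fake_find_sites, integrase_end=4000,
                          known_sites=[(5000, 5060)])
```

For a CALIN the same loop would be run in both directions, and for an In0 a single 4 kb window would be searched on each side of the integron-integrase on both strands.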
The program outputs tabular and GenBank files listing all the identified genetic elements associated with an integron. The program also produces a figure in PDF format representing each complete integron. For an interactive view of all the hits, the GenBank file can be opened in programs such as Geneious (48).
The user can change the profiles of the integrases and the covariance model of the attC site. Thus, if novel models of attC sites were to be built in the future, e.g., for novel types of attC sites, they could easily be plugged into IntegronFinder.