The operation of the program can be divided into four main steps, summarized in the CRISPR Finder flow chart. (Step 1) Browsing the maximal repeats to get possible CRISPR localizations using the Vmatch program. (Step 2) Consensus DR selection according to candidate occurrences and a score computation: the score favours internal mismatches between the direct repeats of a cluster over boundary mismatches. (Step 3) DR and spacer size check. (Step 4) Tandem repeat elimination, using ClustalW to align spacers.
The second step retrieves the consensus DR of each cluster. The main difficulty lies in identifying the DR boundaries, which is essential for extracting the correct spacers and comparing DRs. The consensus DR is selected as the maximal repeat that occurs most often in the whole underlying genome sequence, counting both the forward and the reverse complement directions (since two CRISPRs having the same consensus DR may be in opposite directions). Thus, when similar DRs are present in other CRISPRs of the same genomic sequence, the ambiguity in the choice of a DR is resolved. However, if occurrence numbers are equal, more than one candidate consensus DR is kept and the candidates are compared later. Given a candidate consensus DR, the pattern search program fuzznuc of the EMBOSS package (28) is applied to get the DR positions in the related cluster. As the first or the last DR in a CRISPR may be divergent or truncated, a mismatch of up to one-third of the DR length is allowed between the flanking DRs and the candidate consensus DR, whereas smaller nucleotide differences are allowed between the other DRs to take into account possible single mutations. In case of multiple DR candidates, a score is computed and the candidate with the best (minimum) score is picked. This score favours candidates which are encountered more frequently, rather than consensus DRs showing fewer internal mismatches.
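The occurrence-based selection described above can be sketched as follows. This is a minimal illustration and not the program's actual implementation: it uses exact string matching instead of Vmatch or fuzznuc, and the function names (`reverse_complement`, `strand_aware_count`, `select_consensus_candidates`) are hypothetical. Counting a palindromic repeat only once is also an assumption of this sketch.

```python
# Hypothetical sketch: pick the candidate consensus DR(s) that occur most
# often in the genome, counting forward and reverse complement strands.
# Exact matching is used here; the real pipeline relies on external tools.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_overlapping(genome, pattern):
    """Count occurrences of pattern in genome, overlaps included."""
    count = 0
    start = genome.find(pattern)
    while start != -1:
        count += 1
        start = genome.find(pattern, start + 1)
    return count


def strand_aware_count(genome, repeat):
    """Occurrences of repeat on either strand of genome."""
    total = count_overlapping(genome, repeat)
    rc = reverse_complement(repeat)
    if rc != repeat:  # assumption: a palindromic repeat is counted once
        total += count_overlapping(genome, rc)
    return total


def select_consensus_candidates(genome, maximal_repeats):
    """Return every repeat tied for the highest strand-aware count."""
    if not maximal_repeats:
        return []
    counts = {r: strand_aware_count(genome, r) for r in maximal_repeats}
    best = max(counts.values())
    return [r for r in maximal_repeats if counts[r] == best]
```

When several repeats tie, all of them are returned, mirroring the text's rule that tied candidates are kept and compared later by score.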
Once the consensus DR is determined, the corresponding spacers are extracted (Step 3) according to the DR boundaries determined previously. The spacer length is not allowed to be shorter than 0.6 times or longer than 2.5 times the DR length. These bounds are consistent with the sizes of CRISPRs described in the literature.
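The size check can be written as a short sketch. The 0.6 and 2.5 ratios come from the text; the function names, the assumption that all DRs in a cluster share one length, and the inclusive treatment of the bounds are illustrative choices. Integer arithmetic is used so that the boundary cases do not depend on floating-point rounding.

```python
# Hypothetical sketch of the Step 3 size check.

def spacer_length_ok(spacer_len, dr_len):
    """True if 0.6 * dr_len <= spacer_len <= 2.5 * dr_len (inclusive)."""
    # 0.6 = 6/10 and 2.5 = 5/2, rewritten to avoid float comparisons.
    return 10 * spacer_len >= 6 * dr_len and 2 * spacer_len <= 5 * dr_len


def extract_spacers(sequence, dr_starts, dr_len):
    """Return the sequence between consecutive DRs.

    Assumes dr_starts is sorted and every DR has length dr_len.
    """
    spacers = []
    for left, right in zip(dr_starts, dr_starts[1:]):
        spacers.append(sequence[left + dr_len:right])
    return spacers
```

For a DR of 30 nucleotides, this accepts spacers of 18 to 75 nucleotides.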
The last step consists of discarding false CRISPRs. Tandem repeats are eliminated by comparing the consensus DR with the spacer if there is only one spacer, or by comparing the spacers with each other otherwise. The comparison is done with the CLUSTALW program (29), and the percentage of identity between spacers is not allowed to exceed 60%. Finally, candidates having at least three motifs and at least two exactly identical DRs are considered confirmed CRISPRs. The remaining candidates are considered questionable and should be critically investigated by, for example, checking for intraspecies size variation of the locus.
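The filtering and classification rules of this last step can be sketched as below, under several stated assumptions. The `naive_identity` function is a position-wise stand-in and does not reproduce a CLUSTALW alignment. Rejecting a candidate as soon as any one pair exceeds 60% identity is one interpretation of the text. A "motif" is taken here to mean one DR occurrence. All function names are hypothetical.

```python
# Hypothetical sketch of Step 4: tandem repeat filter and final labelling.
from collections import Counter
from itertools import combinations

MAX_IDENTITY_PERCENT = 60.0  # threshold stated in the text


def naive_identity(a, b):
    """Ungapped position-wise percent identity (stand-in for CLUSTALW)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / longest


def looks_like_tandem_repeat(consensus_dr, spacers, identity=naive_identity):
    """Flag a candidate whose compared sequences are too similar."""
    if len(spacers) == 1:
        # Single spacer: compare it with the consensus DR.
        pairs = [(consensus_dr, spacers[0])]
    else:
        # Assumption: any one over-similar pair is enough to reject.
        pairs = combinations(spacers, 2)
    return any(identity(a, b) > MAX_IDENTITY_PERCENT for a, b in pairs)


def classify(drs):
    """Label a surviving candidate as 'confirmed' or 'questionable'."""
    has_identical_pair = any(n >= 2 for n in Counter(drs).values())
    if len(drs) >= 3 and has_identical_pair:
        return "confirmed"
    return "questionable"
```

Passing the identity function as a parameter keeps the threshold logic separate from the alignment method, so an alignment-based measure could be substituted without changing the filter.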