We first used MetaCRT [33], which we modified from CRT [34] to allow detection of partial repeats at the ends of CRISPR arrays, to predict CRISPR arrays in complete bacterial and archaeal genomes. The genomes were downloaded in October 2016 from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq). We focused on complete reference genomes in this study because, when draft genomes are used, CRISPR–Cas systems may be found on separate contigs. However, for a few species that we analyzed in detail, we augmented the list of genomes with draft genomes: 13 draft genomes for Streptococcus thermophilus and 4055 draft genomes for Staphylococcus aureus. In some cases, a long CRISPR array may be split into multiple predicted arrays because of repeats containing excessive mutations or because of long spacers. To avoid such cases, CRISPRs that are close to each other (≤200 bp apart) and share very similar repeat sequences were considered to be in the same locus. We then collected the consensus repeat for each putative CRISPR array and clustered these consensus repeats at 90% sequence identity using CD-HIT-EST [35]. In this way, a “cluster” contains more than two CRISPR arrays, and a “singleton” refers to a repeat found exclusively within its own CRISPR array.
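The merging rule described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the `CrisprArray` record, the position-wise `repeat_identity` measure, and the 0.9 similarity threshold are all assumptions made for the example (the text specifies only the 200 bp distance and "very similar" repeats; the 90% figure in the text applies to the later clustering step, not to merging).

```python
from dataclasses import dataclass


@dataclass
class CrisprArray:
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    repeat: str  # consensus repeat sequence


def repeat_identity(a: str, b: str) -> float:
    """Fraction of matching positions (hypothetical, alignment-free stand-in)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def merge_adjacent_arrays(arrays, max_gap=200, min_identity=0.9):
    """Group arrays into loci when they are close and have similar repeats.

    max_gap follows the 200 bp rule in the text; min_identity is an
    assumed threshold for "very similar" repeats.
    """
    loci = []
    for arr in sorted(arrays, key=lambda a: (a.contig, a.start)):
        if loci:
            last = loci[-1][-1]
            gap = arr.start - last.end - 1  # bases between the two arrays
            if (arr.contig == last.contig
                    and gap <= max_gap
                    and repeat_identity(arr.repeat, last.repeat) >= min_identity):
                loci[-1].append(arr)
                continue
        loci.append([arr])
    return loci


# Illustrative input only; coordinates and contig names are made up.
REPEAT = "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC"
example_arrays = [
    CrisprArray("contig1", 100, 400, REPEAT),
    CrisprArray("contig1", 550, 900, REPEAT[:-1] + "T"),  # one mismatch
    CrisprArray("contig1", 2000, 2300, REPEAT),           # too far away
    CrisprArray("contig2", 10, 200, REPEAT),              # different contig
]
example_loci = merge_adjacent_arrays(example_arrays)
```

With this input, the first two arrays on `contig1` fall into one locus and the other two arrays each form their own locus. A real implementation would probably compare repeats by alignment and consider both strands.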
We then used hmmscan [36] to search the putative proteins found in the genomes against a collection of Cas families (using the gathering cutoff) to predict putative Cas proteins. In total, the collection contains 403 Cas families, of which eight were identified from human microbiomes (using a combination of context-based and similarity-search approaches) [37] and 395 were from a recent study [14]. Since Koonin and colleagues did not build models for the Cas families they curated [14], we used hmmbuild to construct HMMs for all of their families. Because gene prediction is far from perfect for many genomes, for the genomes/contigs that contain CRISPRs but lack cas genes we further used FragGeneScan [38], a gene predictor we developed for predicting complete as well as fragmented genes in genomic sequences, to re-predict the genes. We then repeated the cas gene prediction to rule out the possibility that cas genes were missed because the genes were not predicted in the first place.
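As an illustration of how the hmmscan results might be post-processed, the sketch below reads a per-target table of the kind HMMER writes with `--tblout` and keeps the highest-scoring family for each protein. It assumes the search was already run with the gathering cutoff (`--cut_ga` in HMMER 3), so that every reported hit passes the threshold. The function name, the protein identifiers, and all scores in the sample table are hypothetical and are not results from this study.

```python
def best_cas_hits(tblout_text: str) -> dict:
    """Map each query protein to its highest-scoring family hit.

    Expects hmmscan --tblout text, where whitespace-separated columns are:
    0 = target (profile) name, 2 = query (protein) name, 5 = full-sequence
    bit score. Lines starting with '#' are comments.
    """
    best = {}
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        family, protein = fields[0], fields[2]
        score = float(fields[5])
        if protein not in best or score > best[protein][1]:
            best[protein] = (family, score)
    return best


# Made-up sample in tblout layout, for illustration only.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  E-value  score  bias  exp reg clu ov env dom rep inc description
cas1  -  prot_001  -  1.2e-50  170.3  0.1  1.5e-50  170.0  0.1  1.0  1  0  0  1  1  1  1  -
cas4  -  prot_001  -  1.0e-05   20.1  0.0  2.0e-05   19.5  0.0  1.0  1  0  0  1  1  1  1  -
cas2  -  prot_002  -  3.4e-20   70.5  0.0  4.0e-20   70.2  0.0  1.0  1  0  0  1  1  1  1  -
"""
sample_hits = best_cas_hits(SAMPLE_TBLOUT)
```

Keeping only the top-scoring family per protein is one possible design choice; the text does not state how proteins matching several Cas families were resolved.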
A cas locus, as defined in this study, must contain at least three cas genes, at least one of which is either a universal cas gene for CRISPR adaptation (cas1 or cas2) or a main component of the interference module, including cas7, cas5, cas8, cas10, csf1, cas9, and cpf1 [14].
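The locus definition above amounts to a simple predicate, sketched below. The function name and the use of plain lowercase gene names are assumptions for illustration. In practice the family labels would need to be mapped onto these names, and the step that groups neighbouring genes into a candidate locus is not shown.

```python
# Signature genes exactly as listed in the text.
ADAPTATION_GENES = frozenset({"cas1", "cas2"})
INTERFERENCE_GENES = frozenset(
    {"cas7", "cas5", "cas8", "cas10", "csf1", "cas9", "cpf1"}
)


def is_cas_locus(genes, min_genes=3):
    """Return True if a candidate gene list satisfies the locus definition.

    genes: cas gene names predicted within one candidate locus.
    """
    names = [g.lower() for g in genes]
    has_signature = any(
        n in ADAPTATION_GENES or n in INTERFERENCE_GENES for n in names
    )
    return len(names) >= min_genes and has_signature
```

For example, `is_cas_locus(["cas1", "cas2", "cas4"])` is true, whereas a group of three cas genes with no adaptation or interference signature gene, or a pair such as `["cas9", "cas1"]`, does not qualify.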
Zhang Q, & Ye Y. (2017). Not all predicted CRISPR–Cas systems are equal: isolated cas genes and classes of CRISPR like elements. BMC Bioinformatics, 18, 92.