The protein profiles used were either retrieved from existing databases (PFAM [42], COG [43]) or built from scratch when no adequate profiles existed (see below for details on the building of HMM profiles, and Supplementary Data 2).
New protein profiles for the proteins involved in anti-phage systems were built using a homogeneous procedure. We collected a set of sequences from the protein family that were representative of the diversity of the bacterial taxonomy. Homologous proteins were aligned using MAFFT v7.475 [44] (default options, mode auto) and then used to produce protein profiles with hmmbuild (default options) from the HMMER suite v3.3 [45]. To improve detection, we curated each profile manually by assigning a GA score (used with the hmmsearch option --cut_ga) (Supplementary Figs. 1–2). The GA score defines the threshold above which a hit is considered significant. This threshold was determined manually by inspecting the distribution of the scores. All accession numbers for proteins used to build custom HMM profiles are available in Supplementary Data 2.
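As an illustration only, the alignment, profile-building and search steps described above could be chained roughly as follows. This is a minimal sketch, not the authors' pipeline: the function names and file paths are hypothetical, and only the `mafft --auto`, `hmmbuild` and `hmmsearch --cut_ga --tblout` invocations correspond to the real command-line tools.

```python
import subprocess
from typing import List


def mafft_cmd(fasta: str) -> List[str]:
    # --auto lets MAFFT pick the alignment strategy; the alignment is written to stdout.
    return ["mafft", "--auto", fasta]


def hmmbuild_cmd(hmm_out: str, alignment: str) -> List[str]:
    # Usage: hmmbuild <hmmfile_out> <msafile>, here with default options.
    return ["hmmbuild", hmm_out, alignment]


def hmmsearch_cmd(hmm: str, proteins: str, table_out: str) -> List[str]:
    # --cut_ga applies the GA thresholds stored in the profile;
    # hmmsearch refuses to run with this option if the profile has no GA line.
    return ["hmmsearch", "--cut_ga", "--tblout", table_out, hmm, proteins]


def build_profile(fasta: str, alignment: str, hmm_out: str) -> None:
    # Hypothetical helper: align the family, then build the profile.
    with open(alignment, "w") as handle:
        subprocess.run(mafft_cmd(fasta), stdout=handle, check=True)
    subprocess.run(hmmbuild_cmd(hmm_out, alignment), check=True)
```

The GA threshold itself is not computed here: in the procedure described above it is chosen manually from the score distribution and must be stored in the profile before `--cut_ga` can be used.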
Protein sequences were collected using different methods depending on the information available in the literature about each system (details in Supplementary Data 1). For the systems from [1], dGTPase [8], dCTPdeaminase [8], BREX [3], part of the cyclic-oligonucleotide-based anti-phage signaling systems (CBASS) [17], all the reverse transcriptases of retrons, BstA [10], viperins [7] and DISARM [2], we used a subset (between 20 and 100 proteins) of the proteins available in the supplementary data of each publication. We then tested whether the resulting HMM detected all known occurrences of these proteins. If many proteins remained undetected, we added proteins that were reported in the supplementary materials but missed by our HMM to the list of sequences used for the alignment and subsequent HMM generation.
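The detection check and seed-set update described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and the `max_missed_fraction` parameter are hypothetical, and the 10% default is an arbitrary placeholder, since the text gives no numerical criterion for when too many proteins are undetected.

```python
from typing import Iterable, List, Tuple


def update_seed_set(
    seed_ids: Iterable[str],
    known_ids: Iterable[str],
    detected_ids: Iterable[str],
    max_missed_fraction: float = 0.1,  # assumed placeholder, not from the paper
) -> Tuple[List[str], List[str]]:
    """Return (seed set for the next round, known proteins missed by the HMM)."""
    seeds = list(seed_ids)
    known = set(known_ids)
    missed = sorted(known - set(detected_ids))
    missed_fraction = len(missed) / len(known) if known else 0.0
    if missed_fraction > max_missed_fraction:
        # Add the undetected proteins to the sequences used for the next alignment.
        already = set(seeds)
        seeds += [protein for protein in missed if protein not in already]
    return seeds, missed
```

For example, with seeds `["p1", "p2"]`, known proteins `p1` to `p5` and detected proteins `p1` to `p3`, the missed proteins `p4` and `p5` would be added to the seed set before realigning and rebuilding the profile.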
For AbiEii, AbiH, Abi2, Stk2, Pif, Lit, PrrC, RexAB, part of CBASS, part of BREX (brxA and brxB) and PARIS (AAA15 and AAA21), we used the PFAM profiles available at http://pfam.xfam.org/ or the sequences available in COG (https://www.ncbi.nlm.nih.gov/research/cog-project/). For part of BREX and for DndABCDEFGH [46], we searched NCBI for proteins with the corresponding names and manually curated the resulting list. For systems for which only one sequence was provided, such as Gao's systems [4], Rousset's systems [12], Dnd type SspBCDE and part of the retrons, the protein sequence was used as a BLAST query, and between 20 and 50 sequences with high coverage were selected. For retron proteins other than the reverse transcriptase, we used the IMG genome neighborhood feature to retrieve the proteins adjacent to the reverse transcriptase and repeated the BLAST process. For Cas systems, HMM protein profiles were downloaded from [22, 47]. All HMM profiles used are available at https://github.com/mdmparis/defense-finder-models.
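The BLAST-based selection of homologs described above could be approximated as in the sketch below. It is only an illustration: the `Hit` type and `select_hits` function are hypothetical, the 0.8 query-coverage cutoff and the ranking by bit score are assumptions (the text only specifies "high coverage" and 20 to 50 sequences), and hits are assumed to come from tabular BLAST output with query start and end coordinates.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject_id: str
    qstart: int      # 1-based start of the aligned region on the query
    qend: int        # 1-based end of the aligned region on the query
    bitscore: float


def select_hits(
    hits: Iterable[Hit],
    query_length: int,
    min_coverage: float = 0.8,  # assumed cutoff, not stated in the text
    max_hits: int = 50,
) -> List[str]:
    """Keep the best hit per subject above the coverage cutoff, ranked by bit score."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        coverage = (hit.qend - hit.qstart + 1) / query_length
        if coverage < min_coverage:
            continue
        previous = best.get(hit.subject_id)
        if previous is None or hit.bitscore > previous.bitscore:
            best[hit.subject_id] = hit
    ranked = sorted(best.values(), key=lambda h: h.bitscore, reverse=True)
    return [hit.subject_id for hit in ranked[:max_hits]]
```

The selected identifiers would then be fetched and used as the input sequence set for the alignment and profile-building steps.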