We predicted the subcellular location of the proteins encoded in the genomes included in the analysis (see Genomic data, taxonomic classification, and phylogenetic reconstruction) using PSORTB v 3.129. The PSORTB model was selected according to each species’ monoderm/diderm classification, taken from the literature61. Only proteins classified as “extracellular” by PSORTB and lacking transmembrane domains were considered in our study; all other proteins were discarded. When more than one genome was available per species, we computed the average number of proteins per genome assigned to that location (Supplementary data 5). Extracellular proteins were functionally classified by sequence similarity, searching them with HMMsearch from HMMer v.3.1.2b65 against the eggNOG v. 4.5 database66. We only considered hits with an e-value ≤ 10⁻⁵ and more than 50% similarity. Since different HMMs may be associated with the same functional category in different taxa, we kept the functional annotation of the best hit when more than half of the hits were associated with that same category; otherwise the protein was marked as unknown.
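The annotation rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Hit` record, the function name `annotate`, the choice of the lowest e-value as the "best hit", the decision to count the majority over filtered hits only, and the handling of proteins with no hits are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one profile hit of an extracellular protein."""
    category: str      # functional category of the matched HMM
    evalue: float      # e-value of the hit
    similarity: float  # percent similarity of the hit


def annotate(hits: List[Hit],
             max_evalue: float = 1e-5,
             min_similarity: float = 50.0) -> str:
    """Return the category of the best hit if a strict majority of the
    retained hits share it, otherwise "unknown"."""
    # Keep hits with e-value <= 1e-5 and more than 50% similarity.
    kept = [h for h in hits
            if h.evalue <= max_evalue and h.similarity > min_similarity]
    if not kept:
        # Assumption: a protein without any retained hit is left unannotated.
        return "unknown"
    # Assumption: the best hit is the one with the lowest e-value.
    best = min(kept, key=lambda h: h.evalue)
    agreeing = sum(1 for h in kept if h.category == best.category)
    # "More than half" is read as a strict majority.
    return best.category if agreeing > len(kept) / 2 else "unknown"


# Usage: two of three retained hits agree with the best hit.
example = [Hit("M", 1e-30, 80.0), Hit("M", 1e-20, 70.0), Hit("G", 1e-10, 60.0)]
label = annotate(example)  # "M"
```

Under this reading, a protein whose hits are split evenly between two categories is marked unknown, because the best hit's category does not reach a strict majority.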
Three functional categories were explored more carefully. First, we characterized the repertoire of extracellular bacteriocins. To do so, we searched the extracellular proteins for similarity to the entries of two bacteriocin databases, Bagel and Bactibase67,68, using HMMer. We kept the hits with an e-value < 0.05 and more than 50% coverage of the query sequence (Supplementary Table 2). Second, we identified the extracellular proteins with a degradative activity. We selected enzymatic activities often associated with the extracellular environment: amidase, amylase, cellulase, chitinase, dipeptidase, glycosyl hydrolase, invertase, inulinase, keratinase, and pectinase69. For each degradative enzyme, we collected all previously validated bacterial protein candidates by searching for specific keywords in Uniprot170. We clustered them using usearch with the “cluster_smallmem” algorithm at 70% identity. We aligned the sequences of each cluster using mafft v.7 with the local pairwise alignment option and a maximum of 1000 iterations (“linsi” option)71. The resulting multiple alignments were used to build protein HMM profiles using hmmbuild from HMMer. The HMM profiles were then queried against the previously predicted extracellular proteins. Hits with more than 40% identity and less than 20% difference in length, measured relative to the shorter of the protein and the profile, were kept, and the best hit was used to classify each protein (Supplementary Table 2).
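The hit filter for degradative enzymes can be sketched in Python as follows. This is an illustrative reading of the criteria above, not the authors' implementation: the `EnzymeHit` record, the function names, and the use of the alignment score to rank the "best hit" are assumptions, and the length difference is taken as a fraction of the shorter of the two lengths.

```python
from typing import List, NamedTuple, Optional


class EnzymeHit(NamedTuple):
    """Hypothetical record for one profile hit against an extracellular protein."""
    enzyme: str        # enzymatic activity of the HMM profile, e.g. "chitinase"
    identity: float    # percent identity of the hit
    protein_len: int   # length of the extracellular protein
    profile_len: int   # length of the HMM profile
    score: float       # alignment score, assumed here to rank hits


def passes_filter(hit: EnzymeHit,
                  min_identity: float = 40.0,
                  max_len_diff: float = 0.20) -> bool:
    """Keep hits with more than 40% identity and less than 20% length
    difference relative to the shorter of protein and profile."""
    shorter = min(hit.protein_len, hit.profile_len)
    rel_diff = abs(hit.protein_len - hit.profile_len) / shorter
    return hit.identity > min_identity and rel_diff < max_len_diff


def classify(hits: List[EnzymeHit]) -> Optional[str]:
    """Return the enzyme of the best retained hit, or None if no hit passes."""
    kept = [h for h in hits if passes_filter(h)]
    if not kept:
        return None
    # Assumption: the best hit is the retained hit with the highest score.
    return max(kept, key=lambda h: h.score).enzyme


# Usage: the amylase hit scores higher but fails the length criterion
# (100 / 300 is above 0.20), so the protein is classified as a chitinase.
hits = [
    EnzymeHit("amylase", 55.0, 300, 400, 250.0),
    EnzymeHit("chitinase", 45.0, 300, 330, 180.0),
]
result = classify(hits)  # "chitinase"
```

Normalizing by the shorter length makes the criterion stricter than normalizing by the longer one, which matches the wording "for the smallest of either the protein or profile".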