Ideally, our positive data set would consist of a large number of proteins secreted via non-classical pathways. Unfortunately, a sufficiently large data set could not be obtained, as only a small number of proteins are known to undergo non-classical secretion. Since we are looking for features shared among extracellular proteins, the mechanism by which a protein is secreted should not be important. We therefore trained on the large number of proteins known to be secreted via the classical Sec-dependent secretion mechanism. All sequence data were extracted from Swiss-Prot release 44.0. Separate training sets were created for Firmicutes and Proteobacteria.
A set of 690 extracellular proteins from Firmicutes (Gram-positive) and a set of 2185 extracellular proteins from Proteobacteria (Gram-negative) were extracted from the Swiss-Prot database based on annotations in the feature table (FT) and comment lines (CC) [52]. Partial sequences were excluded from the data set. As we wanted to train a predictor that works in the absence of signal peptides, the signal peptide part of each sequence was removed according to the Swiss-Prot annotation. These lists of secreted proteins formed our positive data sets. Negative training sets were constructed by extracting from Swiss-Prot 1084 proteins for Firmicutes and 2098 proteins for Proteobacteria that were annotated as localised to the cytoplasm. After redundancy reduction of the data sets based on a structural similarity criterion [53], 152 and 350 extracellular sequences were left in the positive data sets for Firmicutes and Proteobacteria, respectively. In the negative data sets, 140 and 334 sequences remained for Firmicutes and Proteobacteria, respectively. For Gram-positive bacteria (Firmicutes and Actinobacteria), a set of non-classically secreted proteins was retrieved from Swiss-Prot based on literature searches (see Table 1).
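The exclusion of partial sequences and the removal of annotated signal peptides can be illustrated with a minimal Python sketch. It is not the code used in this work. It assumes the older column-style Swiss-Prot feature table (`FT   SIGNAL   <start>   <end>`), and it assumes that fragments are recognisable by the `NON_TER` or `NON_CONS` feature keys. The function names are hypothetical, and leaving a sequence unchanged when no unambiguous signal peptide end is annotated is likewise an assumption.

```python
from typing import List, Optional


def signal_peptide_end(ft_lines: List[str]) -> Optional[int]:
    """Return the 1-based end position of an annotated SIGNAL feature, or None.

    Assumed line layout: FT  SIGNAL  <start>  <end>  [description]
    Uncertain positions (e.g. '?25') are not plain digits and are skipped.
    """
    for line in ft_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "FT" and fields[1] == "SIGNAL":
            if fields[3].isdigit():
                return int(fields[3])
    return None


def is_partial(ft_lines: List[str]) -> bool:
    """True if the entry carries a fragment-type feature key (assumed marker)."""
    for line in ft_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "FT" and fields[1] in ("NON_TER", "NON_CONS"):
            return True
    return False


def mature_sequence(sequence: str, ft_lines: List[str]) -> Optional[str]:
    """Return the sequence without its signal peptide, or None for partial entries."""
    if is_partial(ft_lines):
        return None  # partial sequences are excluded from the data set
    end = signal_peptide_end(ft_lines)
    if end is None:
        return sequence  # assumption: no usable annotation, keep sequence as is
    return sequence[end:]  # drop residues 1..end (the signal peptide)


# Toy entry, not taken from Swiss-Prot: a 12-residue protein with a 5-residue signal peptide.
toy_ft = [
    "FT   SIGNAL        1      5       POTENTIAL.",
    "FT   CHAIN         6     12       EXAMPLE PROTEIN.",
]
print(mature_sequence("MKKLLAQDTSGE", toy_ft))
```

Entries for which `mature_sequence` returns `None` would simply be skipped when assembling the positive set.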
All data sets used are available as supplementary information from our website [37].
For identification of putative non-classically secreted proteins in E. coli and B. subtilis, we used the following accession numbers to extract the annotated and translated proteomes: [Genbank:NC_000913] for E. coli and [Genbank:NC_000964] for B. subtilis.
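As an illustration of how translated protein sequences might be pulled from such a GenBank record, the following Python sketch collects the `/translation` qualifiers from GenBank flat-file text. It is a simplified stand-in rather than the procedure used here: it assumes the record has already been downloaded as plain text, the function name is hypothetical, and a full parser would normally be preferable for production use.

```python
import re
from typing import List

# Matches the /translation="..." qualifier of a CDS feature, which may wrap
# across several lines in a GenBank flat file.
_TRANSLATION = re.compile(r'/translation="([^"]+)"')


def extract_translations(genbank_text: str) -> List[str]:
    """Return all annotated protein translations, with line wrapping removed."""
    return ["".join(match.split()) for match in _TRANSLATION.findall(genbank_text)]


# Toy record fragment (made up for illustration, not real genome data).
toy_record = '''     CDS             10..45
                     /gene="exampleA"
                     /translation="MKTAYIAKQR"
     CDS             60..140
                     /gene="exampleB"
                     /translation="MSDNELKQ
                     VAAERFLG"
'''
proteins = extract_translations(toy_record)
print(len(proteins), proteins)
```

Each returned string is one annotated protein, so the list as a whole corresponds to the translated proteome of the record.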