From the Swiss-Prot section of the UniProt database (release 2013_03), we collected 10,780 transporter, carrier, and channel proteins that were well characterized at the protein level and had clear substrate annotations [15], [16]. We removed fragmented sequences. We also removed sequences annotated with more than two substrate specificities, as well as sequences whose biological function annotations were based solely on sequence similarity. We manually curated the biological function annotations of the remaining sequences and compiled a total of 1,110 membrane transport protein sequences for which only one transported substrate has been reported in the literature. Using CD-HIT software [17], we removed 210 sequences that showed greater than 70% similarity (see Figure S1 for details about the data compilation and curation processes). The 900 remaining transporter sequences were then divided into seven major classes of transporters based on their substrate specificity: 85 amino acid/oligopeptide transporters, 72 anion transporters, 296 cation transporters, 70 electron transporters, 85 protein/mRNA transporters, 72 sugar transporters, and 220 other transporters. We also compiled 660 non-transporters as an extra class of control proteins for model development by randomly sampling from all the proteins in UniProt release 2013_03, excluding the 10,780 transporters.
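The counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal bookkeeping sketch, not part of the original pipeline; the variable names are illustrative, and the numbers are copied directly from the text.

```python
# Class sizes after redundancy removal, as reported in the text.
transporter_classes = {
    "amino acid/oligopeptide": 85,
    "anion": 72,
    "cation": 296,
    "electron": 70,
    "protein/mRNA": 85,
    "sugar": 72,
    "other": 220,
}

curated_sequences = 1110   # single-substrate transporters after manual curation
removed_by_cd_hit = 210    # sequences above the 70% similarity threshold
non_transporters = 660     # randomly sampled control proteins

# Sequences left after redundancy removal.
nonredundant = curated_sequences - removed_by_cd_hit

# The seven substrate classes should account for every remaining sequence.
total_transporters = sum(transporter_classes.values())

# Transporters plus controls give the full compiled set.
total_proteins = total_transporters + non_transporters

print(nonredundant, total_transporters, total_proteins)
```

The seven class sizes sum to the same figure obtained by subtracting the CD-HIT removals from the curated set, and adding the control proteins gives the size of the full compiled set.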
We further divided the 1,560 compiled proteins into two datasets: 1) the main dataset, which consisted of 70 amino acid transporters, 60 anion transporters, 260 cation transporters, 60 electron transporters, 70 protein/mRNA transporters, 60 sugar transporters, 200 other transporters, and 600 non-transport proteins for a total of 1,380 proteins; and 2) an independent dataset, which consisted of 15 amino acid transporters, 12 anion transporters, 36 cation transporters, 10 electron transporters, 15 protein/mRNA transporters, 12 sugar transporters, 20 other transporters, and 60 non-transport proteins for a total of 180 proteins (see Table S1 for the detailed dataset partition; all the sequences are available on our TrSSP web server at http://bioinfo.noble.org/TrSSP/ ). We applied a five-fold cross-validation scheme to the 1,380 proteins in the main dataset to develop our SVM models. The performance of these SVM models was then tested and validated on the independent dataset of 180 proteins. To evaluate the prediction accuracy of the models for each class of proteins, proteins within that class were treated as positive examples and proteins from all remaining classes were treated as negative examples.
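The partition and labeling scheme described above can be illustrated with a short sketch. It assumes a stratified assignment of proteins to folds (the text does not say whether the folds were stratified) and uses placeholder identifiers in place of real sequences; `stratified_folds` and `one_vs_rest_labels` are hypothetical helper names, and no SVM is trained here.

```python
import random

# Main-dataset class sizes, as reported in the text.
MAIN = {
    "amino acid": 70, "anion": 60, "cation": 260, "electron": 60,
    "protein/mRNA": 70, "sugar": 60, "other": 200, "non-transporter": 600,
}

def stratified_folds(class_sizes, k=5, seed=0):
    """Spread each class evenly over k folds (stratification is an assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls, n in class_sizes.items():
        # Placeholder protein identifiers: (class name, index within class).
        members = [(cls, i) for i in range(n)]
        rng.shuffle(members)
        for j, member in enumerate(members):
            folds[j % k].append(member)
    return folds

def one_vs_rest_labels(proteins, positive_class):
    """Label proteins of the target class +1 and all other proteins -1."""
    return [1 if cls == positive_class else -1 for cls, _ in proteins]

folds = stratified_folds(MAIN)
fold_sizes = [len(fold) for fold in folds]

# In each round, one fold is held out and the other four form the training set.
held_out = folds[0]
training = [p for fold in folds[1:] for p in fold]

# Example: labels for the sugar-transporter model on the held-out fold.
sugar_labels = one_vs_rest_labels(held_out, "sugar")
n_sugar_positive = sugar_labels.count(1)
```

Because every class size in the main dataset is a multiple of five, this assignment produces equally sized folds with identical class composition. The same one-vs-rest labeling would be applied to the 180 independent proteins when validating each class model.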