The training data set in GPS-Lipid was manually collected by searching the scientific literature (published before Nov. 2014) in PubMed with keywords such as “Palmitoylation”, “Myristoylation”, “Farnesylation” and “Geranylgeranylation”. In total, we collected 737 S-palmitoylation sites in 361 proteins, 106 S-farnesylation sites in 97 proteins, 95 S-geranylgeranylation sites in 70 proteins and 283 N-myristoylation sites in 281 proteins. To provide full access to the collected data set, an online database was developed, and the complete annotations from UniProt and NCBI were integrated.
As previously described, redundant sites were removed to avoid overestimating prediction accuracy: CD-HIT (ref. 39) with a threshold of 40% sequence identity was used to identify homologous proteins. If two proteins were modified by lipid groups at the same position and shared more than 40% sequence identity, only one protein was preserved. In particular, 65 palmitoylation sites were randomly selected from the non-redundant dataset to construct an additional test set. Because of data limitations, additional test sets for the other lipid modifications were not constructed. To prepare the training data sets, we took known lipid modification sites as the positive dataset, while all other non-modified residues of the same type (cysteine for S-palmitoylation, S-farnesylation and S-geranylgeranylation; glycine for N-myristoylation) in the same substrates were taken as the negative dataset. As a result, 579 S-palmitoylation sites, 226 N-myristoylation sites, 82 S-farnesylation sites and 71 S-geranylgeranylation sites were retained from 277, 226, 78 and 52 protein substrates, respectively, as the final positive training data set (Supplementary Tables S3–S6). The corresponding negative datasets contain 3002 non-palmitoylated sites, 6754 non-myristoylated sites, 613 non-farnesylated sites and 192 non-geranylgeranylated sites.
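The positive/negative split described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `split_sites`, the 1-based position convention and the toy sequence are assumptions made only for illustration. For one substrate, the annotated sites form the positive set and every other residue of the same type forms the negative set.

```python
def split_sites(sequence, positive_positions, residue="C"):
    """Split the residues of one substrate into positive and negative sites.

    sequence           -- protein sequence as a string of one-letter codes
    positive_positions -- 1-based positions of experimentally verified sites
    residue            -- "C" for S-acylation/prenylation, "G" for N-myristoylation
    Returns (positives, negatives) as sorted lists of 1-based positions.
    """
    # Reject annotations that do not point at the expected residue type.
    for p in positive_positions:
        if not 1 <= p <= len(sequence) or sequence[p - 1] != residue:
            raise ValueError(f"position {p} is not residue {residue!r}")

    positives = sorted(set(positive_positions))
    known = set(positives)
    # Every other residue of the same type in the same substrate is a negative.
    negatives = [
        i
        for i, aa in enumerate(sequence, start=1)
        if aa == residue and i not in known
    ]
    return positives, negatives


# Hypothetical toy substrate: cysteines at positions 3, 6 and 8, one verified site.
print(split_sites("MGCAACKC", [3], residue="C"))  # ([3], [6, 8])
```

Applying the same function with `residue="G"` would give the glycine-based negative set used for N-myristoylation.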
To include as many lipid modification sites as possible, another 1259 palmitoylated proteins identified in high-throughput experiments were collected from PubMed. Using GPS-Lipid with a high threshold, the exact palmitoylation sites in these proteins were predicted and integrated into the lipid modification database. Notably, we also constructed a sequence library for further investigating the co-regulation mechanisms of lipid modifications by integrating the manually collected data set with the high-throughput data set.
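As a rough illustration of the high-threshold step, the sketch below keeps only candidate sites whose prediction score reaches a cutoff. The scores, the cutoff value and the function name `select_high_confidence` are hypothetical; the GPS-Lipid scoring function and its actual threshold values are not reproduced here.

```python
def select_high_confidence(scored_sites, cutoff):
    """Keep candidate sites whose prediction score is at or above the cutoff.

    scored_sites -- iterable of (protein_id, position, score) tuples
    cutoff       -- minimum score for a site to be retained (hypothetical value)
    Returns a list of (protein_id, position) pairs in input order.
    """
    return [(pid, pos) for pid, pos, score in scored_sites if score >= cutoff]


# Hypothetical scores for cysteines in two high-throughput-verified proteins.
candidates = [("P1", 3, 0.95), ("P1", 8, 0.40), ("P2", 5, 0.90)]
print(select_high_confidence(candidates, cutoff=0.9))  # [('P1', 3), ('P2', 5)]
```

A stricter cutoff trades sensitivity for specificity, which suits this use because the predicted sites are added to a database alongside experimentally verified ones.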