We extracted 843 experimentally validated CPPs from the CPPsite database, which was developed by our group [24]. Peptides containing non-natural amino acids (e.g. selenocysteine) or D-amino acids (D-conformation) were removed, leaving 708 unique CPPs composed of natural amino acids. Three datasets (CPPsite-1, CPPsite-2 and CPPsite-3) were created from these peptides. Because very few peptides have been experimentally validated as non-CPPs (negative examples), an equal number of peptides (15–30 amino acids long) were generated randomly from SwissProt proteins and treated as non-CPPs. This strategy for creating a negative dataset has been used in previous studies [22,25].
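The filtering and negative-sampling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, D-residues are assumed to be written in lower case, "randomly generated from SwissProt proteins" is read as drawing random fragments, and the uniqueness and exclusion checks on the negatives are our own additions.

```python
import random

# The 20 standard amino acids in one-letter code. 'U' (selenocysteine) is
# deliberately absent, so peptides containing it are rejected.
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")


def is_natural(peptide):
    """True if every residue is an upper-case standard amino acid.

    Assumption: D-amino acids are encoded as lower-case letters, so they
    fail this check as well.
    """
    return len(peptide) > 0 and all(res in NATURAL for res in peptide)


def unique_natural(peptides):
    """Keep natural-only peptides, dropping exact duplicates, order preserved."""
    seen, kept = set(), []
    for pep in peptides:
        if is_natural(pep) and pep not in seen:
            seen.add(pep)
            kept.append(pep)
    return kept


def random_negatives(proteins, n, min_len=15, max_len=30, exclude=(), seed=0):
    """Draw n distinct random fragments of min_len..max_len residues.

    Assumption: fragments that match a known positive (exclude) or that
    contain non-standard residues are skipped; the source does not say so.
    """
    rng = random.Random(seed)
    exclude = set(exclude)
    candidates = [p for p in proteins if len(p) >= min_len]
    if not candidates:
        raise ValueError("no protein is long enough to yield a fragment")
    negatives, seen = [], set()
    attempts, max_attempts = 0, 1000 * max(n, 1)
    while len(negatives) < n and attempts < max_attempts:
        attempts += 1
        protein = rng.choice(candidates)
        length = rng.randint(min_len, min(max_len, len(protein)))
        start = rng.randint(0, len(protein) - length)
        frag = protein[start:start + length]
        if is_natural(frag) and frag not in exclude and frag not in seen:
            seen.add(frag)
            negatives.append(frag)
    if len(negatives) < n:
        raise RuntimeError("could not draw enough distinct fragments")
    return negatives
```

In use, `unique_natural` would be applied to the extracted CPP sequences, and `random_negatives` would be called with `n` equal to the number of retained CPPs and `exclude` set to those CPPs.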
The first dataset (CPPsite-1) contains 708 CPPs (positive examples) and 708 non-CPPs (negative examples). Because CPPsite-1 includes CPPs across a wide range of uptake efficiency (low and high), we derived a second dataset, CPPsite-2, from it. CPPsite-2 contains 187 CPPs with high uptake efficiency and an equal number of non-CPPs. The third dataset (CPPsite-3) contains 187 CPPs with high uptake efficiency as positive examples and an equal number of CPPs with low uptake efficiency as negative examples. A model trained on the CPPsite-3 dataset can therefore discriminate between CPPs of high and low uptake efficiency.
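The composition of the three datasets can be summarised in a short sketch. The function names, the `"high"`/`"low"` labels and the use of a random sample to pick the equal-sized negative set are assumptions made for illustration; the text does not state how the negatives for CPPsite-2 or the low-efficiency CPPs for CPPsite-3 were selected.

```python
import random


def balanced(positives, negative_pool, seed=0):
    """Pair positives (label 1) with an equal-sized random sample of negatives (label 0)."""
    if len(negative_pool) < len(positives):
        raise ValueError("negative pool is smaller than the positive set")
    rng = random.Random(seed)
    negatives = rng.sample(list(negative_pool), len(positives))
    return [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]


def build_datasets(cpps, non_cpps, seed=0):
    """Assemble the three balanced datasets.

    cpps     : dict mapping CPP sequence -> 'high' or 'low' uptake efficiency
               (hypothetical labels; the threshold is not given in the text)
    non_cpps : list of negative peptides, at least as many as there are CPPs
    """
    high = [seq for seq, eff in cpps.items() if eff == "high"]
    low = [seq for seq, eff in cpps.items() if eff == "low"]
    return {
        "CPPsite-1": balanced(list(cpps), non_cpps, seed),  # all CPPs vs non-CPPs
        "CPPsite-2": balanced(high, non_cpps, seed),        # high-uptake CPPs vs non-CPPs
        "CPPsite-3": balanced(high, low, seed),             # high-uptake vs low-uptake CPPs
    }
```

Each dataset is returned as a list of `(sequence, label)` pairs, so positives and negatives stay balanced by construction.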
All three datasets (CPPsite-1, CPPsite-2 and CPPsite-3) include several CPPs together with all of their possible Ala-scan mutants or different truncations. Ideally, redundancy should be removed from datasets because it affects the measured performance of a prediction method, and our group has removed redundancy in earlier prediction methods [25,26]. In this study, however, we did not remove redundancy from the CPP datasets, because a single residue can affect the uptake efficiency of a CPP and removing similar sequences could discard this information. To check the performance of our model on redundant data, we also used some benchmark datasets that are themselves redundant.
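To make the source of this redundancy concrete, the sketch below generates the Ala-scan mutants of a peptide and measures how similar each mutant is to its parent. The helper names are hypothetical and the identity measure is a plain position-wise comparison for equal-length sequences, chosen here for illustration only.

```python
def ala_scan(peptide):
    """Return all single-residue alanine substitutions of a peptide.

    Positions that are already alanine are skipped, since substituting
    them would reproduce the parent sequence.
    """
    return [
        peptide[:i] + "A" + peptide[i + 1:]
        for i, res in enumerate(peptide)
        if res != "A"
    ]


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Example with a 16-residue peptide that contains no alanine.
parent = "RQIKIWFQNRRMKWKK"
mutants = ala_scan(parent)
identities = [identity(parent, m) for m in mutants]
```

Every mutant differs from its parent at exactly one position, so for a peptide of length n the identity is (n - 1)/n; for the 16-residue example this is 15/16. Any clustering-based redundancy filter with a cutoff below that value would collapse the parent and all of its mutants into a single representative, which is the loss of single-residue information the paragraph above refers to.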