The first dataset (CPPsite-1) contains 708 CPPs (positive examples) and 708 non-CPPs (negative examples). Because CPPsite-1 includes CPPs with a wide range of uptake efficiency (low and high), we derived a second dataset, CPPsite-2, from it. CPPsite-2 contains 187 CPPs with high uptake efficiency and an equal number of non-CPPs. We also created a third dataset (CPPsite-3), which contains 187 CPPs with high uptake efficiency as positive examples and an equal number of CPPs with low uptake efficiency as negative examples. A model trained on the CPPsite-3 dataset can therefore discriminate between CPPs with high and low uptake efficiency.
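The way the three datasets relate to one another can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the `Peptide` record, the `uptake` labels, the helper names, and the random sampling of negatives are all assumptions introduced for illustration, since the text only states that each dataset is balanced.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Peptide:
    """Hypothetical record; the field names are illustrative assumptions."""
    sequence: str
    is_cpp: bool
    uptake: Optional[str] = None  # "high", "low", or None when unknown


def balanced_pairs(positives: List[Peptide],
                   negative_pool: List[Peptide],
                   seed: int = 0) -> List[Tuple[str, int]]:
    """Pair every positive with an equal number of negatives.

    Negatives are drawn at random from the pool (an assumption; the
    source does not say how negatives were selected).
    """
    if len(negative_pool) < len(positives):
        raise ValueError("not enough negatives to balance the dataset")
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, len(positives))
    labelled = [(p.sequence, 1) for p in positives]
    labelled += [(n.sequence, 0) for n in negatives]
    return labelled


def build_datasets(peptides: List[Peptide],
                   seed: int = 0) -> Dict[str, List[Tuple[str, int]]]:
    """Assemble three balanced datasets mirroring the CPPsite-1/2/3 design."""
    cpps = [p for p in peptides if p.is_cpp]
    non_cpps = [p for p in peptides if not p.is_cpp]
    high = [p for p in cpps if p.uptake == "high"]
    low = [p for p in cpps if p.uptake == "low"]
    return {
        "CPPsite-1": balanced_pairs(cpps, non_cpps, seed),   # CPP vs non-CPP
        "CPPsite-2": balanced_pairs(high, non_cpps, seed),   # high-uptake CPP vs non-CPP
        "CPPsite-3": balanced_pairs(high, low, seed),        # high vs low uptake
    }
```

In this sketch, label 1 always marks the positive class, so the same classifier code can be trained on any of the three datasets; only the meaning of the negative class changes.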
All three datasets (CPPsite-1, CPPsite-2 and CPPsite-3) include several CPPs together with all of their possible Ala-scan mutants or with different truncated forms. Ideally, redundancy in a dataset should be removed because it affects the measured performance of a prediction method. In the past, our group has removed redundancy when developing various prediction methods [25,26]. In this study, however, we did not remove redundancy from the CPP datasets, because a single residue can affect the uptake efficiency of a CPP, and removing similar sequences could therefore discard information about CPPs. To check the performance of our models on redundant data, we used some benchmark datasets that are themselves redundant.
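To make the source of this redundancy concrete, the following Python sketch enumerates the Ala-scan mutants of a peptide and measures how similar each mutant is to its parent. The function names are hypothetical, and the Tat-derived sequence `YGRKKRRQRRR` is used only as a familiar illustration; it is not taken from the datasets described here.

```python
from typing import List


def ala_scan_mutants(sequence: str) -> List[str]:
    """Return every single-residue alanine substitution of `sequence`.

    Positions that already hold alanine are skipped, because substituting
    them would reproduce the parent peptide.
    """
    mutants = []
    for i, residue in enumerate(sequence):
        if residue != "A":
            mutants.append(sequence[:i] + "A" + sequence[i + 1:])
    return mutants


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    parent = "YGRKKRRQRRR"  # illustrative 11-residue peptide, no alanine
    for mutant in ala_scan_mutants(parent):
        print(mutant, round(identity(parent, mutant), 3))
```

Each of the 11 mutants differs from the parent at exactly one position, so every pair shares 10 of 11 residues. A conventional redundancy filter based on a sequence-identity cutoff would therefore collapse most of such a family into a single representative, even though the members may differ in uptake efficiency.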