We searched PubMed with the keyword “calpain” for experimentally verified calpain substrates with known cleavage sites (published before June 30th, 2010). We also integrated the data collected by Tompa et al. and duVerle et al. [16], [22], and retrieved the protein sequences from the UniProt database.
We defined a calpain cleavage peptide CCP(m, n) as a cleavage bond flanked by m residues upstream and n residues downstream. As previously described [23], [24], we regarded all experimentally verified cleavage sites as positive data (+), while all other non-cleavage sites in the same substrates were taken as negative data (−). If a cleavage site is located close to the N- or C-terminus of the protein, so that the peptide is shorter than m+n residues, we added one or more “*” characters as pseudo amino acids to complete the CCP(m, n). The positive data (+) set for training might contain several homologous sites from homologous proteins. If the training data were highly redundant, with too many homologous sites, the prediction accuracy would be overestimated. To avoid such overestimation, we clustered the protein sequences with CD-HIT at a threshold of 40% identity [25]. If two proteins shared ≥40% identity, we re-aligned them with BL2SEQ, a program in the BLAST package [26], and checked the results manually. If two calpain cleavage sites from two homologous proteins were at the same position after sequence alignment, only one was kept and the other was discarded. Finally, the non-redundant benchmark data set for training contained 368 positive sites from 130 unique substrates (Supplementary Table S1).
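The CCP(m, n) construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `ccp` and `labelled_windows`, the example sequence, and the convention of indexing a bond by the number of residues upstream of it are all assumptions made for illustration.

```python
def ccp(sequence, bond, m, n, pad="*"):
    """Build a CCP(m, n) window around one peptide bond.

    `bond` is the number of residues upstream of the bond (an assumed
    convention), so bond=2 is the bond between residues 2 and 3.
    """
    if not 1 <= bond < len(sequence):
        raise ValueError("bond must lie between two residues of the sequence")
    upstream = sequence[max(0, bond - m):bond]   # up to m residues before the bond
    downstream = sequence[bond:bond + n]         # up to n residues after the bond
    # Pad with pseudo amino acids when the window runs past a terminus.
    return upstream.rjust(m, pad) + downstream.ljust(n, pad)


def labelled_windows(sequence, cleaved_bonds, m, n):
    """Label every bond of a substrate: True for verified cleavage sites
    (positive data), False for all other bonds (negative data)."""
    positives = set(cleaved_bonds)
    return [(ccp(sequence, b, m, n), b in positives)
            for b in range(1, len(sequence))]


# Hypothetical 10-residue sequence with one cleaved bond after residue 2.
example = "MKTAYIAKQR"
print(ccp(example, 2, 4, 4))   # window padded at the N-terminus
print(ccp(example, 9, 4, 4))   # window padded at the C-terminus
print(sum(label for _, label in labelled_windows(example, [2], 4, 4)))
```

Padding on both sides keeps every window exactly m+n characters long, so cleavage sites near either terminus can be compared position by position with internal sites.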