To assess the applicability of mCSM signatures in predicting the impact of mutations on protein stability, several data sets derived from the ProTherm database (Kumar et al., 2006) were considered. ProTherm is a collection of experimental thermodynamic parameters for wild-type and mutant proteins, including the change in Gibbs free energy (ΔΔG). Only single-point mutations were considered. The data sets were used in comparative experiments with other methods, in regression and classification tasks, which consist of predicting the numerical value and the direction of change in ΔΔG, respectively.
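The distinction between the two tasks can be illustrated with a minimal sketch. It assumes that the classification label is simply the sign of the measured value; the function name `make_targets`, the label strings, the handling of a value of exactly zero and the example numbers are all hypothetical and are not taken from the original study.

```python
from typing import List, Tuple


def make_targets(ddg_values: List[float]) -> Tuple[List[float], List[str]]:
    """Derive regression and classification targets from measured values.

    Regression target: the numerical value itself.
    Classification target: the direction (sign) of the change.
    Assumption: a value of exactly zero is grouped with the negative class.
    """
    regression_targets = list(ddg_values)
    classification_targets = [
        "positive" if value > 0 else "non-positive" for value in ddg_values
    ]
    return regression_targets, classification_targets


# Made-up values, for illustration only (not experimental data).
example_ddg = [-1.2, 0.4, -0.3, 2.1]
reg, cls = make_targets(example_ddg)
```

Whether a positive value denotes stabilization or destabilization depends on the sign convention of the data source, so the sketch deliberately uses neutral labels.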
S2648: The first data set, S2648, was used in comparative regression tasks where the aim is to predict the change in Gibbs free energy (ΔΔG) between wild-type and mutant protein. The data set comprises 2648 single-point mutations in 131 different globular proteins. For experiments with these data, we used 5-fold cross-validation, the same validation procedure used by the authors of the PoPMuSiC algorithm (Dehouck et al., 2009).
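A k-fold cross-validation partition of this kind can be sketched as follows. This is an illustration only: it assumes that mutations are shuffled and assigned to folds at random, which the text does not specify, and the function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: samples are shuffled once and split into k folds of
    near-equal size; each fold serves as the test set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 5-fold partition of a data set with 2648 entries, as for S2648.
folds = list(k_fold_indices(2648, 5))
```

In each round a model would be trained on the `train` indices and evaluated on the `test` indices, so that every mutation is predicted exactly once by a model that did not see it during training.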
S350: The second data set, S350, comprises 350 mutations in 67 different proteins. It is a randomly selected subset of the S2648 data set and was also used in comparative regression experiments. In this case, the remaining 2298 mutations from the S2648 data set were used to train the predictive model, whereas the S350 data set was used as a test set. This data set is widely used in the literature to compare the performance of different methods.
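The hold-out arrangement described here amounts to removing the test entries from the full set. The sketch below uses placeholder integer identifiers and a random draw in place of the real S350 membership, which is fixed by the published data set; the function name `hold_out_split` is hypothetical.

```python
import random
from typing import Hashable, Iterable, List, Tuple


def hold_out_split(
    all_ids: List[Hashable], test_ids: Iterable[Hashable]
) -> Tuple[List[Hashable], List[Hashable]]:
    """Split all_ids into (train, test), where test is a given subset."""
    test_set = set(test_ids)
    unknown = test_set - set(all_ids)
    if unknown:
        raise ValueError(f"{len(unknown)} test ids are not in the full set")
    train = [i for i in all_ids if i not in test_set]
    test = [i for i in all_ids if i in test_set]
    return train, test


# Placeholder identifiers: 2648 entries, 350 of them drawn as a stand-in test set.
all_ids = list(range(2648))
test_ids = random.Random(0).sample(all_ids, 350)
train, test = hold_out_split(all_ids, test_ids)
```

With these sizes the training set contains 2648 - 350 = 2298 entries, matching the count given in the text.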
S1925: The data set S1925 was used in both regression and classification experiments. It comprises 1925 mutations in 55 proteins, which are uniformly distributed across the four major SCOP classes (Murzin et al., 1995). A 20-fold cross-validation protocol was used, the same protocol used by the AUTOMUTE method (Masso and Vaisman, 2008).
p53: Finally, as a case study, we assembled a data set of 42 mutations within the DNA-binding domain of the tumour suppressor protein p53, whose thermodynamic effects have previously been experimentally characterized (Ang et al., 2006; Bullock et al., 2000; Joerger et al., 2006; Nikolova et al., 1998, 2000). The full data set description is available as Supplementary Material.