S2648. This dataset, used for training and testing, was collected from the ProTherm database [28]. It contains 2648 unique single-point missense entries from 131 different proteins, together with the corresponding ΔΔG values.
S350. This is our validation dataset. To allow comparison with other methods, we used the same validation dataset as other developers [16,17,18,19]; it contains 350 mutations from 67 different proteins, randomly selected from S2648.
S276. This blind dataset was collected from the work of Cao et al. [29] and includes 276 unique single-point missense entries from 37 different proteins. None of these entries is in the training or validation set.
p53. This is the second blind dataset. It consists of 42 single-point missense mutations within the DNA-binding domain of the tumor suppressor protein p53 whose thermodynamic effects have been experimentally determined [46,47,48]. As in the previous case, none of them appeared in our training set.
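The random selection of a validation subset and the check that a blind set shares no entries with the source dataset can be sketched as follows. This is a minimal illustration, not the authors' procedure: the `(protein ID, mutation)` key, the function names `split_holdout` and `overlap`, the toy entries, and the choice to remove the held-out subset from the remainder are all assumptions made for the example.

```python
import random

# Each mutation is identified by a (protein ID, mutation string) pair.
# This key is an assumption for illustration, not the authors' identifier.

def split_holdout(entries, n_holdout, seed=0):
    """Randomly pick n_holdout entries; return (remaining, holdout) as sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    # Sort first so the sample does not depend on set iteration order.
    holdout = set(rng.sample(sorted(entries), n_holdout))
    remaining = set(entries) - holdout
    return remaining, holdout

def overlap(a, b):
    """Return the entries present in both collections."""
    return set(a) & set(b)

# Toy placeholder entries (hypothetical, not taken from any real dataset).
full_set = {
    ("P1", "A10G"), ("P1", "L25V"),
    ("P2", "K7E"), ("P2", "D40N"),
    ("P3", "T5S"),
}
blind_set = {("P4", "R12H"), ("P5", "G33A")}

train, validation = split_holdout(full_set, n_holdout=2)

# The held-out subset and the remainder partition the full set.
assert not overlap(train, validation)
assert train | validation == full_set

# A blind set must share no entries with the full source dataset.
assert not overlap(blind_set, full_set)
```

Comparing on a composite key rather than on the mutation string alone avoids treating the same substitution in two different proteins as a duplicate.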
PTEN and TPMT. For the third blind dataset, we collected two independent datasets, one for the phosphatase and tensin homologue (PTEN) protein and one for the thiopurine S-methyl transferase (TPMT) protein, from the Critical Assessment of Genome Interpretation (CAGI) challenge [30]. It can be downloaded from