We used several datasets to develop, validate, and independently test the SAAFEC-SEQ method. These datasets contain experimental thermodynamic information for wild-type and mutant proteins, including the change in Gibbs free energy (ΔΔG). All of the datasets below contain only single-chain proteins and single-point missense mutations.
S2648. This is our training and test dataset. It was collected from the ProTherm database [28 (link)] and includes 2648 unique single-point missense entries in 131 different proteins, together with the corresponding ΔΔG values.
S350. This is our validation dataset. To allow comparison with other methods, we used the same validation dataset adopted by other developers [16 (link),17 (link),18 (link),19 (link)], which contains 350 mutations (from 67 different proteins) randomly selected from S2648.
S276. This is the first blind dataset. It was collected from the work of Cao et al. [29 (link)] and includes 276 unique single-point missense entries in 37 different proteins. None of these entries appears in the training or validation set.
p53. This is the second blind dataset. It consists of 42 single-point missense mutations within the DNA-binding domain of the tumor suppressor protein p53, whose thermodynamic effects have been experimentally determined [46 (link),47 (link),48 (link)]. As in the previous case, none of them appears in our training set.
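The requirement that blind-set mutations do not occur in the training data can be checked with a simple set intersection. The following is a minimal sketch, not the procedure used by the authors: the `Mutation` record, the function `overlapping`, and the protein identifiers `protA` and `protB` are all hypothetical and serve only to illustrate the idea.

```python
from typing import Iterable, NamedTuple, Set


class Mutation(NamedTuple):
    """Hypothetical record that identifies one single-point missense mutation."""
    protein_id: str  # assumed identifier, e.g. a PDB or UniProt accession
    wild_type: str   # one-letter code of the original residue
    position: int    # residue number
    mutant: str      # one-letter code of the substituted residue


def overlapping(blind: Iterable[Mutation], training: Iterable[Mutation]) -> Set[Mutation]:
    """Return the mutations that occur in both the blind and the training set."""
    return set(blind) & set(training)


# Toy data with made-up identifiers, for illustration only.
training = [Mutation("protA", "A", 10, "G"), Mutation("protA", "L", 25, "P")]
blind = [Mutation("protB", "R", 175, "H"), Mutation("protA", "L", 25, "P")]

shared = overlapping(blind, training)
print(shared)  # a truly blind set would give an empty set here
```

In this toy case one entry is shared, so it would have to be removed from the blind set before testing.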
PTEN and TPMT. For the third blind dataset, we collected two independent datasets for the phosphatase and tensin homologue (PTEN) and thiopurine S-methyl transferase (TPMT) proteins from the Critical Assessment of Genome Interpretation (CAGI) challenge [30 (link)]. They can be downloaded from https://genomeinterpretation.org/content/predict-effect-missense-mutations-pten-and-tpmt-protein-stability. We removed mutations with an unknown amino acid “X” (whether in the wild-type or the mutant position), which left a total of 7363 missense mutations: 3736 for PTEN and 3627 for TPMT.
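The filtering step for unknown residues can be sketched as follows. This is an illustrative example only: it assumes that each mutation is written as a label of the form wild-type residue, position, mutant residue (for example `R130Q`), which may differ from the actual layout of the CAGI files. The helper names `parse_mutation` and `drop_unknown` and the example labels are hypothetical.

```python
from typing import List, Tuple


def parse_mutation(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'R130Q' into (wild_type, position, mutant).

    The label format is an assumption made for this sketch.
    """
    return label[0], int(label[1:-1]), label[-1]


def drop_unknown(labels: List[str]) -> List[str]:
    """Keep only mutations in which neither residue is the unknown amino acid 'X'."""
    kept = []
    for label in labels:
        wild_type, _position, mutant = parse_mutation(label)
        if wild_type != "X" and mutant != "X":
            kept.append(label)
    return kept


# Made-up labels, for illustration only.
example = ["M1V", "X5A", "K13X", "R130Q"]
filtered = drop_unknown(example)
print(filtered)
```

Applied separately to the PTEN and TPMT label lists, a filter of this kind would leave the entries that are counted in the totals above.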