In addition to the above data, SKEMPI 2.0 provides the location of the mutated residues, the homology between interactions in the dataset, and processed PDB files that can be easily parsed.
Residue location: Each mutated residue is classified according to the scheme proposed by Levy (2010). Residues at the interface are classified as support (mostly buried when unbound and entirely buried upon binding), core (mostly solvent exposed when unbound but buried upon binding) or rim (partly buried upon binding), while residues away from the binding site are classified as interior or surface. Solvent-exposed surface area was calculated using CCP4 (Winn et al., 2011).
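As an illustration of how such a classification can be applied, the following minimal sketch assigns one of the five classes from a residue's relative solvent accessibility in the unbound and bound states. The function name `classify_residue`, the 25% cutoff and the tolerance `eps` are assumptions made for this example (the cutoff follows our reading of the Levy scheme and is not stated in the text above); it is not the code used to build the dataset.

```python
def classify_residue(rasa_unbound, rasa_bound, cutoff=0.25, eps=1e-9):
    """Classify a residue from its relative accessible surface area (0 to 1).

    rasa_unbound: relative accessibility in the isolated protein.
    rasa_bound:   relative accessibility in the complex.
    cutoff:       assumed buried/exposed threshold (25%), hypothetical default.
    """
    # A residue belongs to the interface only if binding buries some of its surface.
    loses_surface = (rasa_unbound - rasa_bound) > eps
    if not loses_surface:
        return "interior" if rasa_unbound < cutoff else "surface"
    if rasa_unbound < cutoff:
        return "support"  # already mostly buried before binding
    if rasa_bound < cutoff:
        return "core"     # exposed when unbound, buried in the complex
    return "rim"          # still partly exposed in the complex


# Example usage with made-up accessibility values
for unbound, bound in [(0.10, 0.10), (0.60, 0.60), (0.15, 0.00), (0.60, 0.05), (0.70, 0.40)]:
    print(unbound, bound, classify_residue(unbound, bound))
```

The accessibility values would in practice come from a surface area program run on the complex and on each partner in isolation; here they are simply passed in as numbers.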
Processed PDB files: The PDB files for the interactions, as downloaded from the Protein Data Bank (Berman et al., 2000), often contain multiple copies of the interacting proteins in the unit cell, or other chains irrelevant to the interaction. In one instance, the binding of dimeric myostatin to follistatin-like 3, the myostatin dimer must be created by tessellating the unit cell. Further, some PDB files contain features that are not readily parsed by some software, such as residue insertion codes or negative residue numbers. To help users, we provide “cleaned” PDB files which contain only the chains of interest, renumbered from one, as well as waters and other molecules with a non-hydrogen atom within 5 Å of a non-hydrogen atom of any of the chains of interest. Consequently, each mutation is reported using both the original PDB numbering and the renumbered scheme.
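The two cleaning steps described above, renumbering and distance-based selection of waters and other molecules, can be sketched as follows. This is an illustrative sketch, not the pipeline used to produce the cleaned files: the helper names `renumber` and `within_cutoff` are hypothetical, residues are represented as plain `(chain, number, insertion code)` tuples rather than parsed from a file, numbering is assumed to restart at one for each chain, and "within 5 Å" is assumed to include a distance of exactly 5 Å.

```python
import math


def renumber(residue_ids):
    """Map (chain, resseq, icode) identifiers, in file order, to 1-based numbers.

    Numbering is assumed to restart at one for every chain, so insertion
    codes and negative residue numbers disappear from the cleaned file.
    """
    counters = {}
    mapping = {}
    for chain, resseq, icode in residue_ids:
        key = (chain, resseq, icode)
        if key not in mapping:
            counters[chain] = counters.get(chain, 0) + 1
            mapping[key] = counters[chain]
    return mapping


def within_cutoff(molecule_atoms, chain_atoms, cutoff=5.0):
    """True if any atom of the molecule lies within `cutoff` Å of any chain atom.

    Both arguments are lists of (x, y, z) coordinates of non-hydrogen atoms.
    """
    return any(math.dist(a, b) <= cutoff for a in molecule_atoms for b in chain_atoms)


# Example usage with made-up residue identifiers and coordinates
ids = [("A", -2, " "), ("A", -1, " "), ("A", 52, " "), ("A", 52, "A"), ("B", 10, " ")]
mapping = renumber(ids)
print(mapping[("A", 52, "A")])                         # residue 52A in chain A
print(within_cutoff([(3.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]))  # a water 3 Å away
```

Keeping the mapping from original to new identifiers is what allows each mutation to be reported in both numbering schemes.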
Defining homologous interactions: Each entry also specifies which other entries are mutations in homologous interactions. Two interactions are deemed homologous if they have a shared or homologous binding partner and at least 70% of the corresponding interface residues are common to both interactions. We determine the homology between proteins using the GAP4 program (Huang and Brutlag, 2007), and define homologous proteins as those with a similarity score greater than 50 and at least 30% sequence identity. Interface residues are defined as those with a non-hydrogen atom within 10 Å of a non-hydrogen atom on the binding partner. Interactions falling within manually assigned clusters of homologous interactions are designated as pMHC/TCR, antibody/antigen or protease/inhibitor. While the names of these clusters have been chosen to reflect the predominant function of their constituent interactions, they reflect the homologies within the dataset and are not functional assignments. Thus, for instance, some nanobodies are classified as antibodies because they bind to the same site as cetuximab; 14.3.d is classified as a TCR, even though it is only the β chain; and its binding partner, enterotoxin C3, is classified as a pMHC.
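The homology criteria above can be expressed as a short sketch. The function names are hypothetical, the similarity score and sequence identity are assumed to have been computed beforehand by an alignment program, and interface residues are assumed to be given as positions already mapped onto a common numbering. The text does not state which interface the 70% refers to, so this example takes the stricter reading and divides the shared residues by the size of the larger interface; that choice is an assumption.

```python
def proteins_homologous(similarity_score, sequence_identity):
    """Similarity score greater than 50 and at least 30% sequence identity."""
    return similarity_score > 50 and sequence_identity >= 0.30


def interface_overlap(interface_a, interface_b):
    """Fraction of interface residues common to both interactions.

    Assumption: the shared count is divided by the larger interface, so the
    criterion must hold from the point of view of both interactions.
    """
    a, b = set(interface_a), set(interface_b)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def interactions_homologous(interface_a, interface_b, partner_related, min_overlap=0.70):
    """partner_related: True if the binding partner is shared or homologous."""
    return partner_related and interface_overlap(interface_a, interface_b) >= min_overlap


# Example usage with made-up interface positions on a shared binding partner
site_1 = range(1, 11)   # positions 1 to 10
site_2 = range(3, 13)   # positions 3 to 12, eight of which are shared with site_1
print(interface_overlap(site_1, site_2))
print(interactions_homologous(site_1, site_2, partner_related=True))
```

Under this reading, two ligands that bind overlapping but largely distinct patches of the same protein are not grouped together, whereas ligands that occupy essentially the same site are, which is consistent with nanobodies being clustered with an antibody that binds the same site.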