DisCo is derived from QMEANDist, a quasi-single model method that participated in the CASP9 experiment as a global quality predictor (Biasini, 2013; Kryshtafovych et al., 2011). We revisited the approach of assessing the agreement of pairwise residue–residue distances with ensembles of constraints extracted from experimentally determined protein structures that are homologous to the assessed model. Instead of generating global quality estimates, DisCo predicts local per-residue quality estimates. After extracting the target sequence of the model to be assessed, homologues are identified using HHblits (Remmert et al., 2011; the command line arguments used are available in the Supplementary Materials). For each homologue k, all Cα positions are mapped onto the target sequence using the HHblits alignment. Gaussian distance constraints for residue pairs (i, j) are generated for all Cα–Cα distances μ_ijk below 15 Å:
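The constraint generation described above can be sketched as follows. This is a minimal illustration, not the published implementation: the unnormalized Gaussian form, the width `SIGMA = 1.0` Å, the helper names, and the dictionary layout of the mapped Cα coordinates are all assumptions made for the example. Only the 15 Å cutoff comes from the text.

```python
import math

DIST_CUTOFF = 15.0  # Å, from the text: only template distances below this yield a constraint
SIGMA = 1.0         # Å, ASSUMPTION: the Gaussian width is not given in this section


def gaussian_constraint(d_ij, mu_ijk, sigma=SIGMA):
    """Assumed form of g_ijk(d_ij): an unnormalized Gaussian centred on mu_ijk."""
    return math.exp(-0.5 * ((d_ij - mu_ijk) / sigma) ** 2)


def constraint_means(template_ca, cutoff=DIST_CUTOFF):
    """Collect mu_ijk for a single template k.

    template_ca maps target-sequence positions to Cα coordinates (x, y, z).
    Positions that the alignment does not cover are simply absent.
    """
    positions = sorted(template_ca)
    means = {}
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            mu = math.dist(template_ca[i], template_ca[j])
            if mu < cutoff:  # distances at or beyond the cutoff give no constraint
                means[(i, j)] = mu
    return means
```

With this form, a model distance equal to the template distance scores 1, and the score decays towards 0 as the two diverge.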
The goal is to construct a pairwise scoring function s_ij(d_ij) that assesses the consistency of a particular pairwise Cα–Cα distance d_ij in the model with all corresponding constraints g_ijk(d_ij). To avoid bias towards sequence families that are overrepresented among the homologues found, the homologues are clustered based on their pairwise sequence similarity, as specified in the Supplementary Materials. Since the templates often do not cover the entire target sequence, some Cα–Cα pairs are not represented in every template; consequently, the number of templates n_ijc containing a given Cα–Cα pair varies within a cluster for different (i, j). A cluster scoring function h_ijc(d_ij) is constructed only if the Cα–Cα pair is present in cluster c:
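A possible reading of the cluster scoring function is sketched below. It assumes that h_ijc(d_ij) is the mean of the Gaussian constraints over the n_ijc templates of cluster c that contain the pair. The text introduces n_ijc in a way that suggests such a normalization, but the exact formula is not reproduced in this section, so treat the averaging, the assumed Gaussian width, and the function names as hypothetical.

```python
import math


def gaussian_constraint(d_ij, mu_ijk, sigma=1.0):
    # ASSUMPTION: unnormalized Gaussian with a width of 1.0 Å (not given in the text)
    return math.exp(-0.5 * ((d_ij - mu_ijk) / sigma) ** 2)


def cluster_score(d_ij, cluster_means):
    """Sketch of h_ijc(d_ij) for one cluster c and one residue pair (i, j).

    cluster_means holds mu_ijk for each template k in the cluster, or None
    where template k does not cover the pair. Returns None when no template
    in the cluster contains the pair, so that no function is constructed.
    """
    present = [mu for mu in cluster_means if mu is not None]
    if not present:  # n_ijc == 0: the pair is absent from this cluster
        return None
    n_ijc = len(present)
    return sum(gaussian_constraint(d_ij, mu) for mu in present) / n_ijc
```

Dividing by n_ijc rather than by the cluster size keeps templates that do not cover the pair from lowering the score.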
To obtain the desired pairwise scoring function s_ij(d_ij), we combine the h_ijc(d_ij) from each cluster c in a weighted manner, as exemplified in Figure 2. Clusters expected to be closely related to the target sequence contribute more than others:
with weights w_c defined as exp[γ·SS_c] and normalized so that the weights of all clusters in which the Cα–Cα pair is present sum to one. SS_c is the average normalized sequence similarity of cluster c to the target sequence, and γ is a constant that controls how fast the influence of a cluster vanishes as a function of SS_c. The default value for γ is 70; the effect of varying γ is discussed in Supplementary Figure S3. The DisCo score of a single residue of the model at position i is then computed by averaging the outcome of all n pairwise scoring functions s_ij(d_ij) towards other residues j ≠ i with their Cα positions within 15 Å:
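The weighting and averaging steps can be illustrated with a short sketch. The weight definition exp[γ·SS_c], its normalization over the clusters containing the pair, the default γ = 70, and the per-residue average follow the text. The function names, the list-based inputs, and returning `None` for a residue without scored partners are assumptions made for the example.

```python
import math

GAMMA = 70.0  # default value stated in the text


def cluster_weights(seq_sims, gamma=GAMMA):
    """Normalized w_c = exp(gamma * SS_c) over the clusters that contain the pair."""
    top = max(seq_sims)  # shifting by the maximum cancels on normalization and avoids overflow
    raw = [math.exp(gamma * (ss - top)) for ss in seq_sims]
    total = sum(raw)
    return [r / total for r in raw]


def pairwise_score(cluster_scores, seq_sims, gamma=GAMMA):
    """s_ij(d_ij) as the weighted sum of h_ijc(d_ij).

    Both lists cover only the clusters in which the pair (i, j) is present.
    """
    weights = cluster_weights(seq_sims, gamma)
    return sum(w * h for w, h in zip(weights, cluster_scores))


def disco_score(pair_scores):
    """Average of the n available s_ij(d_ij) values for residue i."""
    if not pair_scores:
        return None  # ASSUMPTION: handling of residues without partners is not stated here
    return sum(pair_scores) / len(pair_scores)
```

With γ = 70, a difference of 0.1 in SS_c changes the ratio of two cluster weights by exp(7), roughly a factor of 1100, so the cluster most similar to the target dominates unless the other clusters are nearly as similar.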
As the accuracy of DisCo depends on the underlying templates, features describing its reliability are required to optimally weigh DisCo with the single model scores in a subsequent machine-learning step. For each residue i there are:
Studer G., Rempfer C., Waterhouse A.M., Gumienny R., Haas J, & Schwede T. (2019). QMEANDisCo—distance constraints applied on model quality estimation. Bioinformatics, 36(6), 1765-1771.