The PDZ domain dataset (SI Appendix, Table S1) was taken from ref. 20 and consisted of 17 human PDZ domains with experimentally determined structures. Binder peptides for the 17 PDZ domains were downloaded from supplemental data for ref. 21 (https://baderlab.org/Data/PDZ). Experimental amino acid frequency matrices (PWMs) were constructed from the PDZ binder peptides, with clone frequency weighting. For the AlphaFold simulations, 20,000 random peptide sequences of length equal to the peptide in the experimental structure were generated using NNK codon frequencies to match the amino acid bias in the phage-display libraries. The experimental structure listed in the template column of SI Appendix, Table S1 was used as the sole template, with the random peptide sequences aligned to the template peptide and single-sequence MSA information.
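The NNK-based peptide generation described above can be sketched roughly as follows. This is an illustration rather than the authors' code. It assumes that each peptide position is drawn independently from the amino acid frequencies implied by the 32 NNK codons (N = A/C/G/T, K = G/T) under the standard genetic code, and that the single NNK stop codon (TAG) is simply excluded. The function names and the seed parameter are hypothetical.

```python
import random
from collections import Counter

# Standard genetic code with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
TRANSLATION = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: TRANSLATION[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def nnk_amino_acid_frequencies():
    """Amino acid frequencies implied by NNK codons (N = any base, K = G or T)."""
    counts = Counter(
        CODON_TABLE[b1 + b2 + b3]
        for b1 in BASES
        for b2 in BASES
        for b3 in "GT"
    )
    # Assumption: the one NNK stop codon (TAG) is dropped and the rest renormalized.
    counts.pop("*", None)
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


def sample_peptides(length, n_peptides, seed=0):
    """Draw random peptides with independent, NNK-weighted positions."""
    freqs = nnk_amino_acid_frequencies()
    rng = random.Random(seed)
    residues = list(freqs)
    weights = list(freqs.values())
    return [
        "".join(rng.choices(residues, weights=weights, k=length))
        for _ in range(n_peptides)
    ]
```

Under these assumptions, a call such as sample_peptides(6, 20000) would produce a library of 20,000 six-residue peptides; the length 6 is only an example, since the text sets the length from the peptide in each experimental structure.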
The SH3 domain dataset (SI Appendix, Table S2) consisted of 19 SH3 domains with experimentally determined structures extracted from the Database of Peptide Recognition Modules (http://prm-db.org/) (22). Experimental PWMs were downloaded from the PRM-DB. SH3 domains can bind peptides in two orientations, denoted Class I and Class II, which have opposite chain orientations: Class I peptides often match a +XXPXXP sequence motif, where “+” denotes a positively charged amino acid and X is any amino acid; Class II peptides often match a PXXPX+ motif. SH3-peptide PWMs from PRM-DB were annotated as Class I or Class II by choosing the class whose sequence motif had the highest PWM frequency (averaged over the three motif positions). Five of the SH3 domains had multiple PWMs in the PRM-DB, one of which was assigned as Class I and one as Class II; these domains were modeled twice, once in each orientation. For AlphaFold modeling, the native PDB structure listed in SI Appendix, Table S2 (“SH3 template” column) was used as the template for the SH3 domain. Four peptide-SH3 structures with peptides in the desired orientation (i.e., Class I or Class II) were chosen as “Peptide templates” based on SH3 domain sequence identity (SI Appendix, Table S2, “Peptide templates” column). The peptides in these structures were transformed into the reference frame of the SH3 domain template by structural superimposition to create hybrid template models for AlphaFold. Multiple structural alignment was used to identify the core motif positions (+XXPXXP or PXXPX+) in each template peptide. The peptide sequence modeled in the AlphaFold runs consisted of the core motif together with one residue on either side (nine residues for Class I and eight residues for Class II).
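The Class I versus Class II annotation of a PWM, as described above, might look something like the sketch below. Several details are assumptions rather than statements from the text: "+" is taken to mean Lys or Arg, a PWM is represented as a list of per-position dictionaries mapping amino acids to frequencies, and each motif is slid along the PWM with the best-scoring placement kept. All names are hypothetical.

```python
# Assumption: "+" (positively charged) is interpreted as Lys or Arg.
POSITIVE = "KR"

CLASS_MOTIFS = {"Class I": "+XXPXXP", "Class II": "PXXPX+"}


def motif_score(pwm, motif, start):
    """Mean PWM frequency over the three non-X motif positions.

    `pwm` is a list of dicts (one per position) mapping amino acid to
    frequency; the motif is placed with its first symbol at index `start`.
    """
    scores = []
    for offset, symbol in enumerate(motif):
        if symbol == "X":
            continue  # X matches any amino acid and is not scored
        column = pwm[start + offset]
        allowed = POSITIVE if symbol == "+" else symbol
        scores.append(sum(column.get(aa, 0.0) for aa in allowed))
    return sum(scores) / len(scores)


def classify_pwm(pwm):
    """Return (label, scores): the class whose motif scores highest."""
    best = {}
    for name, motif in CLASS_MOTIFS.items():
        # Assumption: try every placement of the motif and keep the best one.
        starts = range(len(pwm) - len(motif) + 1)
        best[name] = max(
            (motif_score(pwm, motif, s) for s in starts), default=0.0
        )
    label = max(best, key=best.get)
    return label, best
```

A PWM too short to hold a motif scores 0.0 for that class here, which is a simplification chosen for the sketch.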
For comparison with experimental PWMs, predicted PWMs were constructed from the top-ranked peptide sequences. Peptides were ranked by protein–peptide inter-PAE: the sum of the residue–residue PAE scores for all (protein, peptide) and (peptide, protein) residue pairs, where PAE is AlphaFold’s “predicted aligned error” accuracy measure. The experimental PWMs were derived from phage-display experiments with random peptide libraries of size 10^9 and greater, whereas the predicted PWMs were based on 20,000 modeled peptides. To account for this differential and better match the entropy of the amino acid frequency distributions, we squared the predicted amino acid frequencies and renormalized them to sum to 1. This had the effect of increasing the information content of the predicted PWMs without changing the order of amino acid preference. The exponent of 2 can be partly rationalized by the approximate twofold differential in log search-space size between predictions and experiments. Following ref. 20, predicted and experimental PWMs for PDZ domains were compared over the last five C-terminal peptide positions. Predicted and experimental PWMs for SH3 domains were compared over the core 7 (for Class I) or 6 (for Class II) positions of the SH3 motif together with the immediately adjacent positions, if those positions were present in the experimental PWM. Two measures of PWM column divergence were used to assess predictions: average absolute difference (AAD) and the Frobenius metric. The AAD for a single PWM position equals the sum of the absolute frequency deviations for all amino acids, divided by 20; AAD ranges from 0.0 (perfect agreement) to 0.1 (maximal divergence). The Frobenius metric for a single PWM position equals the square root of the sum of the squared frequency deviations; it ranges from 0.0 (perfect agreement) to the square root of 2 (maximal divergence). The AAD and Frobenius values in Fig. 4 were averaged over all PWM columns.
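The ranking score, the frequency squaring, and the two divergence measures defined above can be written compactly, as in the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the PAE matrix is assumed to be a square list of lists with protein residues indexed before peptide residues, PWM columns are assumed to be dictionaries mapping amino acids to frequencies, and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def inter_pae(pae, n_protein):
    """Sum PAE over all (protein, peptide) and (peptide, protein) pairs.

    Assumption: rows/columns 0..n_protein-1 are protein residues and the
    remaining indices are peptide residues.
    """
    total = 0.0
    for i in range(n_protein):
        for j in range(n_protein, len(pae)):
            total += pae[i][j] + pae[j][i]
    return total


def sharpen(column, exponent=2.0):
    """Raise frequencies to `exponent` (2 in the text) and renormalize to 1."""
    powered = {aa: column.get(aa, 0.0) ** exponent for aa in AMINO_ACIDS}
    norm = sum(powered.values())
    return {aa: p / norm for aa, p in powered.items()}


def aad(pred, expt):
    """Average absolute difference for one PWM column (0.0 to 0.1)."""
    return sum(
        abs(pred.get(aa, 0.0) - expt.get(aa, 0.0)) for aa in AMINO_ACIDS
    ) / 20


def frobenius(pred, expt):
    """Frobenius metric for one PWM column (0.0 to sqrt(2))."""
    return math.sqrt(sum(
        (pred.get(aa, 0.0) - expt.get(aa, 0.0)) ** 2 for aa in AMINO_ACIDS
    ))


def mean_over_columns(metric, pred_pwm, expt_pwm):
    """Average a per-column metric over aligned PWM columns."""
    values = [metric(p, e) for p, e in zip(pred_pwm, expt_pwm)]
    return sum(values) / len(values)
```

As a check on the stated ranges: two columns that each put all their frequency on a different amino acid give an AAD of 2/20 = 0.1 and a Frobenius value of sqrt(2), the maxima quoted in the text. Squaring a column with frequencies 0.75 and 0.25 yields 0.9 and 0.1, which sharpens the distribution while preserving the preference order.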
Motmaen A., Dauparas J., Baek M., Abedi M.H., Baker D., & Bradley P. (2023). Peptide-binding specificity prediction using fine-tuned protein structure prediction networks. Proceedings of the National Academy of Sciences of the United States of America, 120(9), e2216697120.