Consider a macromolecular system of n (nonhydrogen) atoms with Cartesian coordinates. To rapidly evaluate the energy of a particular configuration of the system (including hydrogens), we will decompose the system into a collection of distinct chemical groups, {Ai}, consisting of atoms for which the protonation state is unknown and a set P, the part of the system for which there is assumed to be no uncertainty regarding its protonation state.
The decomposition proceeds as follows: implicitly break all bonds between 4-coordinated alkane sp3 carbon atoms and collect the resulting connected (bonded) groups of atoms. For proteins, this will leave the backbone intact, isolate the alkane carbons, and produce a collection of m-methylamide (Asn, Gln), thiomethanol (Cys), methylimidazoles (His), methylguanidinium (Arg), methyl carboxylic acids (Asp, Glu), methanol (Ser, Thr), indole (Trp), methylphenol (Tyr) and methylbenzene (Phe), methylamine (Lys), and thioether (Met) groups. A special case disconnection of the standard termini will produce a methyl amine (N terminus) and a methyl carboxylic acid (C terminus). Solvent and disconnected ions are considered to be separate groups. Collect the backbone and isolated alkane atoms into a set, P, the “known” portion of the system. The remaining atoms in the chemical groups are collected (by connectivity) into m sets, {Ai}, the sets of the atoms for which there is uncertainty with respect to their protonation geometry, tautomer, or ionization state. This decomposition procedure assumes that alkane carbons and the protein peptide backbone have a known protonation state. In principle, any partitioning method can be used by Protonate3D provided that (relatively) apolar bonds are used to divide the system. The reason for this has to do with the thermodynamic approximations and the calculation of partial charges (which will be described later).
The hydrogen atoms of the heavy atoms of P (the “known” atoms) are added at standard bond lengths and angles according to the hybridization state of the atoms; for example, the backbone nitrogen in nonproline peptide bonds is given one hydrogen in the peptide plane; the Cα of nonglycine residues is given one hydrogen placed in an ideal tetrahedral geometry; sp3 carbons with two heavy neighbors (e.g., Cβ of Glu) are given two hydrogens placed at ideal tetrahedral geometry; terminal methyls are given three hydrogens in tetrahedral geometry in staggered conformation with respect to their (necessarily) alkane carbon neighbors. Henceforth, P will denote the hydrogen augmented set of atoms in the “known” part of the macromolecule.
For each chemical group Ai, we generate a finite collection Si = {Ai1,Ai2,…} of states consisting of the heavy atoms, flipped states, and all rotamer, tautomer, and ionization/protonation combinations of hydrogen atoms (seeFig. 1 ). In general, the states of chemical groups are generated according to a parameter file containing definitions of each chemical group and all of their topological tautomer and ionization states. The parameter file also contains, for each state, a tautomer strain energy (to provide for tautomer preferences). Rotamer (conformational) strain energy of each state is also considered and generated from force field parameter files such as OPLS-AA18 by applying the dihedral energy terms to the fragment geometry (as though still connected to P) and the intrafragment van der Waals energy terms (interfragment energies are handled by the matrix formulation of Eq. (1) , later).
For proteins, the sp3 carbon atoms with two heavy neighbors are given hydrogens in a similar manner to the carbons of P; sp2 carbon atoms with one heavy neighbor (e.g., aromatic carbons) are given one hydrogen at standard bond lengths and angles in the π system plane. Primary amides are given two hydrogens at standard planar geometry; planar nitrogen atoms with two heavy neighbors and one hydrogen has that hydrogen placed in-plane at standard bond lengths and angles. The polar hydrogens and terminal methyls are given hydrogens appropriate to their ionization state and hybridization at standard bond lengths and angles. The dihedral combinations are determined according to the chemical type of the heavy atom: hydrogens in hydroxyls and thiols are sampled at 60° dihedral increments starting at a staggered rotamer; phenol hydrogens and other conjugated hydroxyls are sampled at 30° dihedral increments starting at an in-plane rotamer; methyls and primary amines are sampled at 60° dihedral increments starting at an extended conformation; hydrogens on other terminal atoms are given similar geometries. The anionic state of phenols, alcohols, thiols, and indoles are generated in addition to the neutral forms. The flip states of terminal amides, sulfonamides, and phosphonamides are generated. The anionic state and both neutral tautomers of carboxylic acids are generated (with the hydrogen cis to the carbonyl oxygen). Primary amines are generated in neutral and cationic forms and dihedral angles sampled at 60° increments starting at a staggered rotamer. Imidazoles are generated in anionic, cationic and two neutral tautomers (HID and HIE) as well as in flipped states (for a total of eight states). The states neutral of guanidines consist of all planar tautomers and rotamers. Water states consist of ∼500 rigid body orientations and isolated metals are given appropriate ionization states for groups I and II and a collection of ionization states from {+1,+2,+3} for transition metals under the assumption of zero ionization potential.
Thus, each Aij consists of an all-atom chemical group with an appropriate ionization state, the heavy atoms, all of its hydrogen atoms in reasonable geometry and has an associated internal energy, sij, consisting of the sum of its conformational and tautomeric energy.Figure 1 depicts a hypothetical fixed part P (with known protonation state and geometry) of a macromolecular system and three chemical groups each with a collection Si of alternative protonation states; A1 has four alternative states, A2 has two states, and A3 has three states.
To represent the state ensemble of the system, arrange all of the individual chemical group states in all of the {Si} into single state list, S, divided into contiguous blocks corresponding to the {Si}, each of length mi = |Si|.
The first block of m1 elements in the list are the states of chemical group 1, the next block of m2 elements in the list are the states of group 2, and so on. (The reason for this arrangement will become clear shortly.) A configuration of the entire system consists of a selection of exactly one particular state from each block associated with a chemical group. Thus, there are a total of m1 × m2 × m3 × … configurations of the system. In typical proteins, the number of configurations exceeds 10100. A binary vectorx of length equal to the length of the list S conveniently encodes a configuration, with a value 1 denoting the selection of an individual state. For example, in Figure 1 , the vector x = (0,1,0,0,1,0,0,0,1) denotes the configuration state 2 from group 1, state 1 from group 2, and state 3 from group 3; to see this, introduce dividers into x corresponding to the blocks: x = (0,1,0,0 | 1,0 | 0,0,1), so that the position of the 1 value within each block (counting from the left) indicates the number of the state within the group. Admissible, or permitted, configuration vectors, x , have the property that there is exactly one 1 value in each block corresponding to a chemical group; this means that an admissible configuration vector encodes a definite single state for each chemical group. This constraint giving rise to the admissible configuration vectors is called the unary constraint, inspired by unary (base 1) notation of numbers in which “1” = 1, “10” = 2, “100” = 3, “1000” = 4, “10,000” = 5, and so on.
Suppose that we are given a pairwise interaction energy function f(i,j), for atoms i and j (e.g., Coulomb's law or a Lennard-Jones van der Waals potential), without loss of generality, we will assume that f(i,i) is well defined (e.g., for Coulomb's law, f(i,i) = 0). If X and Y are two disjoint sets of atoms (e.g., two chemical states), then the interaction energy between X and Y is
Form a matrixU with entries equal to the interaction energy of the various chemical group states in the list S. We will take the interaction energy between two states of the same chemical group to be zero. For notational convenience, let I(k) denote the chemical group to which state k belongs. Thus, the matrix U will have Uij = f(Ai,Bj) if I(i) ≠ I(j) and 0 otherwise. Form a vector u with entries ui = f(P,Ai) + si, the interaction energy between a chemical group state and the known part of the protein, P, and the internal energy of the state, si (to be described later). Let u0 = f(P,P)/2, the (constant) internal interaction energy of the known part of the protein P. With this matrix notation, we can write the total energy of a particular configuration encoded by admissible binary vector, x , compactly (and efficiently) with
Thus, the total energy of a configuration of the system specified byx can be evaluated by a multidimensional quadratic form. If all of the values of u and U are calculated in advance, then a matrix–vector multiplication and two inner products are all that is required to evaluate the total energy for any arbitrary configuration of the system. Finding the optimal configuration of the system now is a matter of finding the smallest value of the quadratic form E over all binary vectors x satisfying the unary constraint; this optimization problem is called the “Unary Quadratic Optimization” problem.
Postponing the details of the energy model, the algorithmic structure of Protonate3D is (a more detailed set of steps is given at the end of this section):
The addition of many (more than 20) water molecules (each with ∼500 orientations) becomes impractical. As a result, most of the water molecules are typically left out of the preceding steps and oriented afterward. This is done by orienting the waters one by one proceeding from the water in the strongest electrostatic field (of the protein and previously oriented waters) to the weakest. The selection of water molecules to include in the main calculation is left to the user—typically, water molecules near the sites of interest are treated in the main calculation.
The Unary quadratic optimization algorithm used by Protonate3D proceeds as follows. First, a dead-end elimination14 (link) procedure is applied to eliminate states that cannot possibly be part of the optimal solution. This has the effect of reducing the dimensions of theU matrix and u vector of the quadratic energy function in a provably correct way. Suppose, elements r and s of the list S belong to the same chemical group X; if (where the sum extends over all chemical groups Y different from X) we can eliminate state r. The dead-end elimination criterion, when satisfied, eliminates r because no matter what state assignment is made, some state X, different from r, will result in a lower energy. This criterion is applied repeatedly until no more elimination is possible. Typically, the majority of the configurations are eliminated a priori, but it is still not practical to conduct a brute force search over the remaining configurations.
In an effort to speed up the state space search to follow, a “Mean Field Theory” calculation is performed to produce a Boltzmann distribution over all of the remaining individual chemical group states. This results in an estimate of the probability of each state in the Boltzmann-weighted ensemble of configurations. Briefly, the state probabilitiesp k are determined by solving the nonlinear equation. where p is the probability vector; U and u are as in Eq. (1) ; e k is a vector of all zeros and a single 1 at position k; and β = 1/kT. The nonlinear equation can be solved efficiently by successive feedback iteration. These probabilities, p , are the population probabilities of the individual states under the assumption that each state feels the Boltzmann weighted average interactions of the other states. The vector p is used as a heuristic state priority in the subsequent search over states; the idea is to investigate high mean field probability states first under the assumption that they will lead to low energy configurations of the entire system (an approximate best-first search). The mean field probabilities, p , only affect the run-time of the state-space search and not its correctness; moreover, the energy of a system is evaluated using Eq. (1) , which does not depend on p . The value of β must be chosen carefully to guarantee the uniqueness of p ; in general, the solutions to Eq. (3) depend on the starting p vector. However, for certain values of β, the solution will be independent of the starting point (see the Appendix ) and consequently p can be initialized with a uniform distribution on the states of each chemical group.
Finally, a recursive tree search is conducted over all admissible binary vectors,x , to locate the lowest energy state as calculated by Eq. (1) (which provides for rapid evaluation of energies). The performance of the search depends critically on the ability to prune the search space without loss of correctness. At any given point in the search, some of the elements of x , corresponding to some set of groups, G, will be assigned and others are yet to be assigned (with zero values). A lower bound, L(x ), on the minimum energy of the system assuming the assigned part of x is
If this lower bound value exceeds the energy of the best energy determined thus far, then no further search of configurations containing the assigned part ofx is required, thereby pruning the search tree and bypassing the examination of descendant configurations. During the recursive search, trial elements of the unassigned portion of x are made in decreasing order of the mean field probability. This greatly improves the pruning performance of the lower bound because the likelihood of visiting the best configurations first is increased. Moreover, premature termination of the search will produce the best solution with high probability.
The pseudocode for the recursive tree search procedure is given inFigure 2 .
We now turn to the energy model for the macromolecular system. We will use an energy model that contains van der Waals repulsion, Coulomb electrostatic, and Generalized Born implicit solvation energies. Use of the Poisson-Boltzmann Equation (PBE) was not attempted because it was felt that the run-time would be prohibitively long for large systems, requiring at least one PBE solution per state. The van der Waals and Coulomb functional forms terms are pairwise and fit neatly into the quadratic form ofEq. (1) ; however, the Generalized Born model is not a two-body potential and certain approximations will be used to reformulate it into an effective two-body potential. In addition, because the number of particles may change upon ionizing a chemical group, we must introduce free energy terms related to group titration (because potential energies cannot be compared for systems with different numbers of particles).
Each atom of the system, whether in the known part, P, or in one of the group states {Aij} has associated van der Waals radius, van der Waals well depth parameters, as well as a partial charge. The van der Waals parameters and partial charges are permitted to depend on the particular tautomer, rotamer, or ionization state of each chemical group. In the interests of efficiency, we impose the requirement that the van der Waals parameters and partial charge assignments of one chemical group do not depend on the particular state selection of another chemical group. In particular, we require that the partial charge model be a nonpolarizable charge model (see the titration theory, later). The decomposition of the system along apolar bonds is done to reduce the potential adverse impact of these independence requirements.
Protonate3D uses a slightly modified version of MMFF9419 partial charges because (a) the MMFF94 charge model is based on fixed (topological) bond charge increments; (b) the chemical contexts for atom types in MMFF94 do not cross sp3 carbon atoms; (c) the bond charge increment between sp3 carbon bonds is zero (a purely apolar bond); and (d) MMFF94 supports general organic compounds. The slight modification to the MMFF94 charge model is that the normal zero bond charge increment between alkane hydrogens and carbons was replaced with a bond charge increment of 0.08 electrons, in better agreement with protein force field partial charges such as AMBER.20 Protonate3D uses Engh–Huber21 van der Waals parameters; however, hydrogens on oxygen and nitrogen are taken to have zero van der Waals radius, consistent with OPLS-AA. Coulomb's law is used for electrostatic interactions and special form of van der Waals interaction is used: only the repulsive part of the van der Waals interaction energy is modeled (although, the standard Lennard-Jones functions with the attractive term are not precluded). The special functional form is 800εij (1 − r/Rij),3 (link) where r < Rij is the interatomic separation, Rij is the sum of the van der Waals radii, and εij is the geometric mean of the van der Waals well depth parameters for the two interacting atoms. Because of the 800 factor derived from a series expansion, this functional form lies in between the 12-6 and 9-6 Lennard-Jones functions at distances below the optimal interaction distance and approximates the 12-6 form well near the energy minimum. Because the OPLS-AA van der Waals parameters for polar hydrogen atoms are zero, the van der Waals terms are used by Protonate3D to handle side-chain “flip” states; the special form was used largely to mimic the sphere overlap test of Reduce.7 (link) The elements ofU matrix and u vector are populated by a straightforward application of the pairwise formulae given previously.
Protonate3D uses a modified version of the Generalized Born/Volume Integral (GB/VI) formalism22 (link) for implicit solvent electrostatics (although other Generalized Born models are not precluded):
In this equation, ε is the dielectric constant of the interior of a solute, εsol is the dielectric constant of the solvent, {γi} are (topological) atom-type-dependent constants that account for nonpolar energies including cavitation and dispersion using an inverse sixth-power integral instead of surface area, {Ri} are (topological) atom-type-dependent solvation radii, κ is the Debye ionic screening parameter that depends on salt concentration, {qi} are the atomic partial charges, {Bi} are the Born self-energies (inversely proportional to the Born radii), which are estimated with a pairwise sphere approximation23 to the solute cavity, and rij denotes the distance between atoms i and j. Were it not for the {Bi}, the GB/VI equations would be a pairwise potential; however, because the Bi of a particular atom i depends on the state assignment of atoms in other chemical groups with possibly unknown state, we must calculate a set of {Bi} that (a) remain fixed despite the protonation state of other groups and (b) reasonably preserve the GB/VI energy values.
Consider an atom k in the system (whether in P or in some state Aij). The contribution to Bk from all of the other atoms in the system will fall as the sixth power in the integrand ofEq. (7) . Thus, atoms far away from k will contribute little, no matter if they are in some other group with unknown state. The various states in the system differ only in the position or absence of hydrogen atoms, which contribute relatively little to the volume integral (because of their small solvation radius); thus, the bulk of the states' contribution (from the heavy atoms) will be accurate no matter which state is selected. In any event, the approximation to the volume integral in the GB/VI is a pairwise summation of the form for a specific function22 (link) V. To minimize the impact of the hydrogen positions of the unknown states, Protonate3D uses a separate mean field approximation to the volume integrals. A separate U matrix and u vector is created containing only the van der Waals repulsion terms, the states' internal strain energies, and the pH-dependent isolated group titration energies (see later). For each separate group state, the mean field equation of Eq. (3) is then solved to produce a set of state probabilities p . Each atom in each group state as well as the known part P is given the probability of its chemical group state, or 1 if the atom is in P. The Born factors are then calculated with resulting in a mean field approximation to the Born factors that takes steric, rotamer/tautomer, and isolated group pKa free energies into account. This approximation works very well in practice; indeed, one can argue it is in some sense superior to the original in that it takes some protonation state flexibility into account. It should be noted that some GB implicit solvent models do not include hydrogens in the volume integration24 (link); consequently, we believe that our calculation of the mean field Born factors is eminently reasonable. In this way, we approximate the three-body GB/VI model with a close pairwise model more suited for the quadratic form of Eq. (1) .
It remains to deal with the pH-dependent free energy of ionization of the chemical groups that must be included in the calculation. Consider the free energy, a, of the reaction PAH → PA− + H+, where AH is an acidic group bound (possibly covalently) to a macromolecule P. Our approach is to introduce a thermodynamic cycle linking the reaction to the isolated group reaction AH → A− + H+, whose free energy will be assumed to be known. In the covalent case, we consider the thermodynamic cycle in which a = b + c + d. If the pKa of the reaction HAH → HA− + H+ is known (say from experiment), then for a given pH, we have that c = −kT (log 10) (pH−pKa), where k is Boltzmann's constant and T is the temperature of the system. Because the (vertical) reaction equation H2 + PAH → PH + HAH is balanced and, by construction, E = ECOUL + ESOL is the free energy of charging and solvating the system, we may simply write
The case of a noncovalently bound group AH near a macromolecule P is simpler in that the H2 molecule is not required to balance the equation and, in this case,
We shall deal with the noncovalent case first, because it is simpler and provides insight into the covalent case. The noncovalent d is similar to b resulting in and using the fact that E(A + B) = E(A) + E(B) we have that
The superscript iso is used to signify that the E is calculated for the isolated AH and A− systems (i.e., calculated with Born factors derived from the isolated system, ignoring P). These iso superscripted quantities involve only a small number of atoms—the atoms of AH and A−—and direct evaluations of E are used to calculate the required energy. The iso superscripted quantities are included directly in theu vector of Eq. (1) for the corresponding group state so that b + d is simply a difference of configuration energies.
With a similar line of reasoning as in the noncovalent case, we find that as a result of cancellations of E(PH) and E(H2), for the covalent case and, as before, the iso superscripted quantities can be calculated directly (because few atoms are involved) and included in theu vector. In practice, the distinction between covalent and noncovalent groups makes only a small difference—on the order of 0.5 kcal/mol (∼2% error) for ionic species. A small correction to the experimental isolated pKa values for covalently bound species can account for most of this difference. In any event, the static nature of the entire calculation and the approximations inherent in a Generalized Born model will in all likelihood overshadow any lack of distinction between the cases.
The free energy c = −kT (log 10) (pH − pKa) remains to be included inEq. (1) . Consider a polyprotic species AHn with pKa values pKi, corresponding to AHi → AHi−1. The free energy of the reaction AHi → AHi−1 + H+ is then ΔGi = −kT log 10 (pH − pKi). If we assign we will have that ΔGi = Gi − Gi−1; thus, we can incorporate the Gi values into the relevant u vector entries for each acidic chemical group state with i titratable protons. The reasoning for the b and d quantities generalizes to polyprotic species and multiple-site titration straightforwardly, because of the overall pairwise nature of the energy terms that make up the effective configuration energy.
We now summarize the Protonate3D procedure:
This brings to a close the exposition of the Protonate3D methodology. Protonate3D was implemented in the Scientific Vector Language of the Molecular Operating Environment25 version 2006.08. Computational experiments were conducted on a 2 GHz Pentium IV processor running Microsoft Windows.
The decomposition proceeds as follows: implicitly break all bonds between 4-coordinated alkane sp3 carbon atoms and collect the resulting connected (bonded) groups of atoms. For proteins, this will leave the backbone intact, isolate the alkane carbons, and produce a collection of m-methylamide (Asn, Gln), thiomethanol (Cys), methylimidazoles (His), methylguanidinium (Arg), methyl carboxylic acids (Asp, Glu), methanol (Ser, Thr), indole (Trp), methylphenol (Tyr) and methylbenzene (Phe), methylamine (Lys), and thioether (Met) groups. A special case disconnection of the standard termini will produce a methyl amine (N terminus) and a methyl carboxylic acid (C terminus). Solvent and disconnected ions are considered to be separate groups. Collect the backbone and isolated alkane atoms into a set, P, the “known” portion of the system. The remaining atoms in the chemical groups are collected (by connectivity) into m sets, {Ai}, the sets of the atoms for which there is uncertainty with respect to their protonation geometry, tautomer, or ionization state. This decomposition procedure assumes that alkane carbons and the protein peptide backbone have a known protonation state. In principle, any partitioning method can be used by Protonate3D provided that (relatively) apolar bonds are used to divide the system. The reason for this has to do with the thermodynamic approximations and the calculation of partial charges (which will be described later).
The hydrogen atoms of the heavy atoms of P (the “known” atoms) are added at standard bond lengths and angles according to the hybridization state of the atoms; for example, the backbone nitrogen in nonproline peptide bonds is given one hydrogen in the peptide plane; the Cα of nonglycine residues is given one hydrogen placed in an ideal tetrahedral geometry; sp3 carbons with two heavy neighbors (e.g., Cβ of Glu) are given two hydrogens placed at ideal tetrahedral geometry; terminal methyls are given three hydrogens in tetrahedral geometry in staggered conformation with respect to their (necessarily) alkane carbon neighbors. Henceforth, P will denote the hydrogen augmented set of atoms in the “known” part of the macromolecule.
For each chemical group Ai, we generate a finite collection Si = {Ai1,Ai2,…} of states consisting of the heavy atoms, flipped states, and all rotamer, tautomer, and ionization/protonation combinations of hydrogen atoms (see
For proteins, the sp3 carbon atoms with two heavy neighbors are given hydrogens in a similar manner to the carbons of P; sp2 carbon atoms with one heavy neighbor (e.g., aromatic carbons) are given one hydrogen at standard bond lengths and angles in the π system plane. Primary amides are given two hydrogens at standard planar geometry; planar nitrogen atoms with two heavy neighbors and one hydrogen has that hydrogen placed in-plane at standard bond lengths and angles. The polar hydrogens and terminal methyls are given hydrogens appropriate to their ionization state and hybridization at standard bond lengths and angles. The dihedral combinations are determined according to the chemical type of the heavy atom: hydrogens in hydroxyls and thiols are sampled at 60° dihedral increments starting at a staggered rotamer; phenol hydrogens and other conjugated hydroxyls are sampled at 30° dihedral increments starting at an in-plane rotamer; methyls and primary amines are sampled at 60° dihedral increments starting at an extended conformation; hydrogens on other terminal atoms are given similar geometries. The anionic state of phenols, alcohols, thiols, and indoles are generated in addition to the neutral forms. The flip states of terminal amides, sulfonamides, and phosphonamides are generated. The anionic state and both neutral tautomers of carboxylic acids are generated (with the hydrogen cis to the carbonyl oxygen). Primary amines are generated in neutral and cationic forms and dihedral angles sampled at 60° increments starting at a staggered rotamer. Imidazoles are generated in anionic, cationic and two neutral tautomers (HID and HIE) as well as in flipped states (for a total of eight states). The states neutral of guanidines consist of all planar tautomers and rotamers. Water states consist of ∼500 rigid body orientations and isolated metals are given appropriate ionization states for groups I and II and a collection of ionization states from {+1,+2,+3} for transition metals under the assumption of zero ionization potential.
Thus, each Aij consists of an all-atom chemical group with an appropriate ionization state, the heavy atoms, all of its hydrogen atoms in reasonable geometry and has an associated internal energy, sij, consisting of the sum of its conformational and tautomeric energy.
To represent the state ensemble of the system, arrange all of the individual chemical group states in all of the {Si} into single state list, S, divided into contiguous blocks corresponding to the {Si}, each of length mi = |Si|.
The first block of m1 elements in the list are the states of chemical group 1, the next block of m2 elements in the list are the states of group 2, and so on. (The reason for this arrangement will become clear shortly.) A configuration of the entire system consists of a selection of exactly one particular state from each block associated with a chemical group. Thus, there are a total of m1 × m2 × m3 × … configurations of the system. In typical proteins, the number of configurations exceeds 10100. A binary vector
Suppose that we are given a pairwise interaction energy function f(i,j), for atoms i and j (e.g., Coulomb's law or a Lennard-Jones van der Waals potential), without loss of generality, we will assume that f(i,i) is well defined (e.g., for Coulomb's law, f(i,i) = 0). If X and Y are two disjoint sets of atoms (e.g., two chemical states), then the interaction energy between X and Y is
Form a matrix
Thus, the total energy of a configuration of the system specified by
Postponing the details of the energy model, the algorithmic structure of Protonate3D is (a more detailed set of steps is given at the end of this section):
The addition of many (more than 20) water molecules (each with ∼500 orientations) becomes impractical. As a result, most of the water molecules are typically left out of the preceding steps and oriented afterward. This is done by orienting the waters one by one proceeding from the water in the strongest electrostatic field (of the protein and previously oriented waters) to the weakest. The selection of water molecules to include in the main calculation is left to the user—typically, water molecules near the sites of interest are treated in the main calculation.
The Unary quadratic optimization algorithm used by Protonate3D proceeds as follows. First, a dead-end elimination14 (link) procedure is applied to eliminate states that cannot possibly be part of the optimal solution. This has the effect of reducing the dimensions of the
In an effort to speed up the state space search to follow, a “Mean Field Theory” calculation is performed to produce a Boltzmann distribution over all of the remaining individual chemical group states. This results in an estimate of the probability of each state in the Boltzmann-weighted ensemble of configurations. Briefly, the state probabilities
Finally, a recursive tree search is conducted over all admissible binary vectors,
If this lower bound value exceeds the energy of the best energy determined thus far, then no further search of configurations containing the assigned part of
The pseudocode for the recursive tree search procedure is given in
We now turn to the energy model for the macromolecular system. We will use an energy model that contains van der Waals repulsion, Coulomb electrostatic, and Generalized Born implicit solvation energies. Use of the Poisson-Boltzmann Equation (PBE) was not attempted because it was felt that the run-time would be prohibitively long for large systems, requiring at least one PBE solution per state. The van der Waals and Coulomb functional forms terms are pairwise and fit neatly into the quadratic form of
Each atom of the system, whether in the known part, P, or in one of the group states {Aij} has associated van der Waals radius, van der Waals well depth parameters, as well as a partial charge. The van der Waals parameters and partial charges are permitted to depend on the particular tautomer, rotamer, or ionization state of each chemical group. In the interests of efficiency, we impose the requirement that the van der Waals parameters and partial charge assignments of one chemical group do not depend on the particular state selection of another chemical group. In particular, we require that the partial charge model be a nonpolarizable charge model (see the titration theory, later). The decomposition of the system along apolar bonds is done to reduce the potential adverse impact of these independence requirements.
Protonate3D uses a slightly modified version of MMFF9419 partial charges because (a) the MMFF94 charge model is based on fixed (topological) bond charge increments; (b) the chemical contexts for atom types in MMFF94 do not cross sp3 carbon atoms; (c) the bond charge increment between sp3 carbon bonds is zero (a purely apolar bond); and (d) MMFF94 supports general organic compounds. The slight modification to the MMFF94 charge model is that the normal zero bond charge increment between alkane hydrogens and carbons was replaced with a bond charge increment of 0.08 electrons, in better agreement with protein force field partial charges such as AMBER.20 Protonate3D uses Engh–Huber21 van der Waals parameters; however, hydrogens on oxygen and nitrogen are taken to have zero van der Waals radius, consistent with OPLS-AA. Coulomb's law is used for electrostatic interactions and special form of van der Waals interaction is used: only the repulsive part of the van der Waals interaction energy is modeled (although, the standard Lennard-Jones functions with the attractive term are not precluded). The special functional form is 800εij (1 − r/Rij),3 (link) where r < Rij is the interatomic separation, Rij is the sum of the van der Waals radii, and εij is the geometric mean of the van der Waals well depth parameters for the two interacting atoms. Because of the 800 factor derived from a series expansion, this functional form lies in between the 12-6 and 9-6 Lennard-Jones functions at distances below the optimal interaction distance and approximates the 12-6 form well near the energy minimum. Because the OPLS-AA van der Waals parameters for polar hydrogen atoms are zero, the van der Waals terms are used by Protonate3D to handle side-chain “flip” states; the special form was used largely to mimic the sphere overlap test of Reduce.7 (link) The elements of
Protonate3D uses a modified version of the Generalized Born/Volume Integral (GB/VI) formalism22 (link) for implicit solvent electrostatics (although other Generalized Born models are not precluded):
In this equation, ε is the dielectric constant of the interior of a solute, εsol is the dielectric constant of the solvent, {γi} are (topological) atom-type-dependent constants that account for nonpolar energies including cavitation and dispersion using an inverse sixth-power integral instead of surface area, {Ri} are (topological) atom-type-dependent solvation radii, κ is the Debye ionic screening parameter that depends on salt concentration, {qi} are the atomic partial charges, {Bi} are the Born self-energies (inversely proportional to the Born radii), which are estimated with a pairwise sphere approximation23 to the solute cavity, and rij denotes the distance between atoms i and j. Were it not for the {Bi}, the GB/VI equations would be a pairwise potential; however, because the Bi of a particular atom i depends on the state assignment of atoms in other chemical groups with possibly unknown state, we must calculate a set of {Bi} that (a) remain fixed despite the protonation state of other groups and (b) reasonably preserve the GB/VI energy values.
Consider an atom k in the system (whether in P or in some state Aij). The contribution to Bk from all of the other atoms in the system will fall as the sixth power in the integrand of
It remains to deal with the pH-dependent free energy of ionization of the chemical groups that must be included in the calculation. Consider the free energy, a, of the reaction PAH → PA− + H+, where AH is an acidic group bound (possibly covalently) to a macromolecule P. Our approach is to introduce a thermodynamic cycle linking the reaction to the isolated group reaction AH → A− + H+, whose free energy will be assumed to be known. In the covalent case, we consider the thermodynamic cycle in which a = b + c + d. If the pKa of the reaction HAH → HA− + H+ is known (say from experiment), then for a given pH, we have that c = −kT (log 10) (pH−pKa), where k is Boltzmann's constant and T is the temperature of the system. Because the (vertical) reaction equation H2 + PAH → PH + HAH is balanced and, by construction, E = ECOUL + ESOL is the free energy of charging and solvating the system, we may simply write
The case of a noncovalently bound group AH near a macromolecule P is simpler in that the H2 molecule is not required to balance the equation and, in this case,
We shall deal with the noncovalent case first, because it is simpler and provides insight into the covalent case. The noncovalent d is similar to b resulting in and using the fact that E(A + B) = E(A) + E(B) we have that
The superscript iso is used to signify that the E is calculated for the isolated AH and A− systems (i.e., calculated with Born factors derived from the isolated system, ignoring P). These iso superscripted quantities involve only a small number of atoms—the atoms of AH and A−—and direct evaluations of E are used to calculate the required energy. The iso superscripted quantities are included directly in the
With a similar line of reasoning as in the noncovalent case, we find that as a result of cancellations of E(PH) and E(H2), for the covalent case and, as before, the iso superscripted quantities can be calculated directly (because few atoms are involved) and included in the
The free energy c = −kT (log 10) (pH − pKa) remains to be included in
We now summarize the Protonate3D procedure:
This brings to a close the exposition of the Protonate3D methodology. Protonate3D was implemented in the Scientific Vector Language of the Molecular Operating Environment25 version 2006.08. Computational experiments were conducted on a 2 GHz Pentium IV processor running Microsoft Windows.
Full text: Click here