Efficient HMM-HMM Alignment with Local Sequence Scoring

The Viterbi algorithm, when applied to profile HMMs, is formally equivalent to global sequence alignment with position-specific gap penalties [27 (link)]. We had previously introduced a modification of the Viterbi algorithm that is formally equivalent to Smith-Waterman local sequence alignment [15 (link)]. In HH-suite we use it to compute the best-scoring local alignment between two profile HMMs.
HH-suite models MSA columns with <50% gaps (default value) by match states and all other columns as insertion states. By traversing through the states of a profile HMM, the HMM can “emit” sequences. A match state (M) emits amino acids according to the 20 probabilities of amino acids estimated from their fraction in the MSA column, plus some pseudocounts. Insert states (I) emit amino acids according to a standard amino acid background distribution, while delete states (D) do not emit any amino acids.
The alignment score between two HMMs in HH-suite is the sum over all co-emitted sequences of the log odds scores for the probability for the two aligned HMMs to co-emit this sequence divided by the probability of the sequence under the background model. Since M and I states emit amino acids and D states do not, M and I in one HMM can only be aligned with M or I states in the other HMM. Conversely, a D state can only be aligned with a D state or with a Gap G (Fig. 1). The co-emission score can be written as the sum of the similarity scores of the aligned profile columns, in other words the match-match (MM) pair states, minus the position-specific penalties for indels: delete-open, delete-extend, insert-open and insert-extend.
Fig. 1

HMM-HMM alignment of query and target. The alignment is represented as red path through both HMMs. The corresponding pair state sequence is MM, MM, MI, MM, MM, DG, MM

We denote the alignment pair states as MM, MI, IM, II, DD, DG, and GD. Figure 1 shows an example of two aligned profile HMMs. In the third column HMM q emits a residue from its M state and HMM p emits a residue from the I state. The pair state for this alignment column is MI. In column six of the alignment HMM q does not emit anything since it passes through the D state. HMM p does not emit anything either since it has a gap in the alignment. The corresponding pair state is DG. To speed up the alignment, we exclude pair states II and DD, and we only allow transitions between a pair state and itself and between pair state MM and pair states MI, IM, DG, or GD.
To calculate the local alignment score, we need five dynamic programming matrices S_XY, one for each pair state XY ∈{MM, MI, IM, DG, GD }. They contain the score of the best partial alignment which ends in column i of q and column j of p in pair state XY. These five matrices are calculated recursively.

\begin{align} S_{MM} (i, j) = S_{aa} (q_{i}^{p}, t_{j}^{p}) + S_{ss} (q_{i}^{ss}, t_{j}^{ss}) + \\ max \{\begin{matrix} \begin{matrix} 0 (for local alignment) \\ S_{MM} (i - 1, j - 1) + log (q_{i - 1} (M,M) t_{j - 1} (M,M)) \\ S_{MI} (i - 1, j - 1) + log (q_{i - 1} (M,M) t_{j - 1} (I,M)) \\ S_{II} (i - 1, j - 1) + log (q_{i - 1} (I,M) t_{j - 1} (M,M)) \\ S_{DG} (i - 1, j - 1) + log (q_{i - 1} (D,M) t_{j - 1} (M,M)) \\ S_{GD} (i - 1, j - 1) + log (q_{i - 1} (M,M) t_{j - 1} (D,M)) \end{matrix} \end{matrix} \end{align}

\begin{matrix} S_{MI} (i, j) = max \{\begin{matrix} S_{MM} (i - 1, j) + log (q_{i - 1} (M,M) t_{j} (D,D)) \\ S_{MI} (i - 1, j) + log (q_{i - 1} (M,M) t_{j} (I,I)) \end{matrix} \end{matrix}

\begin{align} S_{DG} (i, j) = max \{\begin{matrix} S_{MM} (i - 1, j) + log (q_{i - 1} (D,M)) \\ S_{DG} (i - 1, j) + log (q_{i - 1} (D,D)) \end{matrix} \end{align}

\begin{align} S_{aa} (q_{i}^{p}, t_{j}^{p}) = log \sum_{a = 1}^{20} \frac{q_{i}^{p} (a) t_{j}^{p} (a)}{f_{a}} \end{align}

Vector

q_{i}^{p}

contains the 20 amino acid probabilities of q at position i,

t_{j}^{p}

are the amino acid probabilities t at j, and f_a denotes the background frequency of amino acid a. The score S_aa measures the similarity of amino acid distributions in the two columns i and j. S_ss can optionally be added to S_aa. It measures the similarity of the secondary structure states of query and target HMM at i and j [15 (link)].

Free full text: Click here

Steinegger M., Meier M., Mirdita M., Vöhringer H., Haunsberger S.J, & Söding J. (2019). HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics, 20, 473.

Publication 2019

Acids amino acid Amino acid Emit Hmms Indels Sequence alignment Vector

Corresponding Organization :

Other organizations : Max Planck Institute for Biophysical Chemistry, Royal College of Surgeons in Ireland

Top 5 similar protocols

Protocol cited in 14 other protocols

Variable analysis

independent variables

Not explicitly mentioned

dependent variables

Not explicitly mentioned

control variables

Not explicitly mentioned

positive controls

Not specified

negative controls

Not specified

Annotations

This protocol is too long. Unable to provide accurate annotations

About PubCompare

Our mission is to provide scientists with the largest repository of trustworthy protocols and intelligent analytical tools, thereby offering them extensive information to design robust protocols aimed at minimizing the risk of failures.

We believe that the most crucial aspect is to grant scientists access to a wide range of reliable sources and new useful tools that surpass human capabilities.

However, we trust in allowing scientists to determine how to construct their own protocols based on this information, as they are the experts in their field.

Ready to get started?

Revolutionizing how scientists
search and build protocols!