Let $i = (i_1, i_2, \ldots, i_w)$ denote a sequence of amino acids of width $w$ that has been extracted as a window from a protein sequence, and let $j = 1, \ldots, w$ denote the position in this window. On the basis of $i$, the hidden Markov model predicts whether the center position of the window is annotated as part of an epitope. Near the N- and C-termini, parts of the extracted windows extend beyond the ends of the sequence. For these positions the character 'X' is used, which does not count when the hidden Markov model is used for the predictions. The prediction score for a window is given by

$$S = \sum_{j=1}^{w} \log \frac{p_{i_j,j}}{q_{i_j}} \qquad (1)$$

which is the log odds of the residue at the center position of the window being part of an epitope (Epitope model) as opposed to occurring by chance (Random model).
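The scoring step in (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `window_score`, the two-letter toy alphabet, and all numbers are hypothetical, and the natural logarithm is assumed because the text does not state the base.

```python
import math


def window_score(window, p, q):
    """Sum the per-position log odds of a window.

    window: string of length w (may contain the padding character 'X')
    p:      list of dicts, p[j][aa] = Epitope-model probability at position j
    q:      dict, q[aa] = Random-model (background) frequency
    """
    score = 0.0
    for j, aa in enumerate(window):
        if aa == "X":
            # Padding beyond the termini does not count towards the score.
            continue
        score += math.log(p[j][aa] / q[aa])
    return score


# Toy model: alphabet {A, G}, window width w = 3 (illustrative numbers only).
q = {"A": 0.5, "G": 0.5}
p = [
    {"A": 0.5, "G": 0.5},
    {"A": 0.8, "G": 0.2},
    {"A": 0.5, "G": 0.5},
]

score_aaa = window_score("AAA", p, q)  # only the center position is informative
score_xax = window_score("XAX", p, q)  # padded window near a terminus
score_xgx = window_score("XGX", p, q)  # center residue disfavoured by the model
```

A positive score means the Epitope model explains the window better than the Random model, and a negative score means the opposite.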
To construct the Random model, the background frequencies of the Swiss-Prot database [23], $q_i$, are used. For the Epitope model, $p_{i,j}$ is the effective probability of having amino acid $i$ at position $j$ according to the model.
To calculate the values of $p_{i,j}$, all windows whose center position is annotated as part of an epitope are extracted from a training data set. Again, if an extracted window extends beyond the N- or C-terminus, the character 'X' is used, which does not count when calculating the parameters.
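The window extraction described above might look as follows. This sketch assumes an odd window width so that the window has a single center residue, and it assumes the annotation is available as one boolean per residue; the function name `extract_windows` and the example sequence are hypothetical.

```python
def extract_windows(sequence, epitope_mask, w):
    """Return all width-w windows centered on epitope-annotated residues.

    sequence:     protein sequence as a string
    epitope_mask: one boolean per residue, True if annotated as epitope
    w:            window width (assumed odd)
    """
    if w % 2 == 0:
        raise ValueError("window width w must be odd so that it has a center")
    half = w // 2
    # Pad both termini with 'X' so that every residue can be a window center.
    padded = "X" * half + sequence + "X" * half
    windows = []
    for pos, is_epitope in enumerate(epitope_mask):
        if is_epitope:
            # Residue pos of the sequence sits at index pos + half of padded,
            # so the window centered on it starts at index pos.
            windows.append(padded[pos:pos + w])
    return windows


# First and last residues annotated as epitope; both windows receive padding.
windows = extract_windows("ACDEF", [True, False, False, False, True], 3)
```

The 'X' characters would then be skipped when column frequencies are counted.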
These extracted peptide windows form a matrix of aligned peptides of width $w$. From this alignment, $p_{i,j}$ is calculated as the pseudo count corrected probability of occurrence of amino acid $i$ in column $j$, estimated as in [24]. To make the pseudo count correction, pseudo count frequencies, $g_{i,j}$, are calculated. They are given by

$$g_{i,j} = \sum_{k} p_{k,j}\, b_{i,k} \qquad (2)$$

where $p_{k,j}$ is the observed frequency of amino acid $k$ in column $j$ of the alignment [25]. The variable $b_{i,k}$ is the Blosum 62 substitution matrix frequency, i.e. the frequency with which $i$ is aligned to $k$ [26].
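As a rough sketch of the pseudo count step for a single alignment column, the sum over observed amino acids can be written as below. The substitution frequencies here are a made-up two-letter toy table, not Blosum 62 values, and they are treated as conditional frequencies of $i$ given $k$ (each column over $i$ sums to one), which is an assumption about how the matrix is normalised. The function name is hypothetical.

```python
def pseudo_count_frequencies(observed, b):
    """Compute g_i = sum_k observed[k] * b[i][k] for one alignment column.

    observed: dict, observed[k] = observed frequency of amino acid k
    b:        dict of dicts, b[i][k] = frequency with which i is aligned to k
    """
    return {
        i: sum(p_k * row.get(k, 0.0) for k, p_k in observed.items())
        for i, row in b.items()
    }


# Toy substitution table over the alphabet {A, G} (illustrative numbers only).
b = {
    "A": {"A": 0.7, "G": 0.4},
    "G": {"A": 0.3, "G": 0.6},
}

# Only A was observed in this column, yet G receives a nonzero pseudo count.
g = pseudo_count_frequencies({"A": 1.0}, b)
```

The point of the construction is visible in the toy output: an amino acid that never occurs in the column still gets probability mass in proportion to how often it substitutes for the residues that do occur.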
To give an example of using (2), let the window size be $w = 1$. The model then covers only residues that are annotated as being part of linear B-cell epitopes. If the observed peptides consist of the single amino acid sequences L and V, with the frequencies $p_{L,1} = 0.5$ and $p_{V,1} = 0.5$, then the pseudo count frequency for e.g. I is given by

$$g_{I,1} = \sum_{k} p_{k,1}\, b_{I,k} = 0.5\, b_{I,L} + 0.5\, b_{I,V}$$
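The effective amino acid frequencies are calculated as a weighted average of the observed frequency and the pseudo count frequency,

$$p_{i,j} = \frac{\alpha\, p^{\mathrm{obs}}_{i,j} + \beta\, g_{i,j}}{\alpha + \beta}$$

where $p^{\mathrm{obs}}_{i,j}$ denotes the observed frequency of amino acid $i$ in column $j$. Here, $\alpha$ is the effective number of sequences in the alignment minus one, and $\beta$ is the pseudo count correction [25], which is also called the weight on low counts. To finish the calculation example, let $\beta$ be very large, as it is in this work. Then $p_{I,1} \approx g_{I,1} = 0.14$.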
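The limiting behaviour in the worked example can be checked numerically. In the sketch below, the observed frequency of I is 0 (only L and V were observed) and the pseudo count frequency 0.14 is taken from the text; the values of `alpha` and `beta` are hypothetical, chosen only to show what a very large weight on low counts does, and the function name is likewise an assumption.

```python
def effective_frequency(observed, pseudo, alpha, beta):
    """Weighted average of observed and pseudo count frequency."""
    return (alpha * observed + beta * pseudo) / (alpha + beta)


g_I = 0.14     # pseudo count frequency of I, from the worked example
p_obs_I = 0.0  # I does not occur among the observed peptides L and V
alpha = 1.0    # hypothetical: two effective sequences minus one

# With a very large beta the effective frequency approaches the pseudo count.
p_large_beta = effective_frequency(p_obs_I, g_I, alpha, beta=1e6)

# With beta = 0 the pseudo counts are ignored and I keeps probability zero.
p_no_pseudo = effective_frequency(p_obs_I, g_I, alpha, beta=0.0)
```

The contrast between the two calls shows why the correction matters for scoring: without pseudo counts an unobserved amino acid would have probability zero and an undefined log odds.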
Note that we shall use the term hidden Markov model throughout this work to refer to the weight matrix generated using (1). The parameters of the ungapped Markov model are calculated using a so-called Gibbs sampler, written by Nielsen et al. [24].
The result of applying (1) is a prediction score for every residue of the query sequence. To reduce fluctuations, a smoothing window is applied to every position. It is made asymmetric at the N- and C-termini in order to conserve prediction examples.
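One way to realise such a smoothing step is a moving average whose window is simply truncated where it would run past either terminus, so that every residue keeps a score. The text does not specify the kind of smoothing or its width, so the plain average, the `half_width` parameter, and the function name below are assumptions made for illustration.

```python
def smooth_scores(scores, half_width):
    """Moving average that is truncated (asymmetric) at the sequence ends.

    scores:     per-residue prediction scores
    half_width: number of neighbours included on each side where available
    """
    n = len(scores)
    smoothed = []
    for pos in range(n):
        # Clip the window to the sequence instead of dropping terminal residues.
        start = max(0, pos - half_width)
        end = min(n, pos + half_width + 1)
        segment = scores[start:end]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


smoothed = smooth_scores([1.0, 2.0, 3.0, 4.0, 5.0], half_width=1)
```

Interior positions average over three residues here, while the first and last positions average over the two that exist, so the output has the same length as the input.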
Larsen J.E., Lund O., & Nielsen M. (2006). Improved method for predicting linear B-cell epitopes. Immunome Research, 2, 2.