glam2 examines a set of sequences provided by the user, and returns an alignment of segments of these sequences. A typical alignment is shown in Figure 1. Each sequence contributes at most one segment to the alignment. Our approach assumes that a motif is defined by residue preferences at certain positions, which we call key positions. These are analogous to the “turned-on” columns of the second-generation Gibbs sampler, or to the match states of a profile HMM. In a particular motif instance, some key positions may be deleted, and residues may be inserted between key positions (Figure 1).
glam2 defines a scoring scheme for alignments such as that in Figure 1. It rewards alignment of identical or similar residues in the same key position, and penalizes deletions and insertions. However, deletions and insertions are penalized less strongly if they repeatedly occur in the same locations. This is reasonable because some locations in a motif may be more prone to deletions or insertions than others. Having defined a scoring scheme for alignments, it is straightforward to calculate the marginal score of one aligned segment: the score of the alignment including this segment minus the score of the alignment excluding this segment. These marginal scores reflect how well each segment matches the other segments.
Having defined a scoring scheme, glam2 attempts to find a motif alignment with maximum score. Even in the gapless case, the number of possible alignments is too huge to enumerate, and there is no practical algorithm to guarantee finding the optimal alignment. This problem is only exacerbated in the gapped case. Thus glam2 uses a heuristic optimisation method – simulated annealing – highly analogous to the optimisation methods of the gapless Gibbs samplers [10] (link),[27] (link).
Simulated annealing takes an initial, presumably non-optimal, alignment and repeatedly makes changes to it. These changes have an element of randomness: they generally increase the score, but sometimes decrease it, which avoids getting stuck in local optima. The process is analogous to crystallization in a cooling material. Two types of change are performed by glam2, which we call site sampling and column sampling, because they are analogous to similarly-named procedures in the original Gibbs sampler [10] (link),[24] (link). Site sampling adjusts the alignment of one sequence to the motif, using the clever stochastic traceback procedure from hmmer to efficiently sample one from all possible such alignments [22] (link). In column sampling, one key position is moved, added, or deleted. These changes are carefully designed to satisfy the reversibility and detailed balance conditions of simulated annealing (Text S1). Such changes are applied until the score fails to improve for n (e.g. 10000) changes in succession. To check that a reproducible, high-scoring motif has been found, the whole procedure is repeated r (e.g. 10) times from different random starting alignments.
glam2's behaviour can be controlled with numerous adjustable parameters. The allowed alignments can be constrained by specifying a minimum number of key positions (a), a maximum number of key positions (b), and a minimum number of segments in the alignment (z). This z parameter is a useful generalization of the OOPS (one occurrence per sequence) and ZOOPS (zero or one occurrence per sequence) modes of previous motif discovery algorithms [28] . The annealing follows a simple geometric cooling schedule with initial temperature t and cooling rate c per n changes. glam2 can find the optimal number of key positions more quickly if the initial number (w) is set to a near-optimal value. All parameters have sensible default values.
glam2scan takes a motif found by glam2, and scans it against a database of sequences. It performs short-in-long alignments of the motif against the sequences, using position-specific residue scores, deletion scores, and insertion scores, which are derived from the glam2 alignment. The highest-scoring such alignments are reported.
Free full text: Click here