User's Guide




Plan 7

The Plan 7 architecture

The Plan 7 architecture is substantially different from the original HMMER model architecture.

The abbreviations for the states are as follows:

Figure 2.1: The Plan 7 architecture. Squares indicate match states (modeling consensus positions in the alignment). Diamonds indicate insert states (modeling insertions relative to consensus) and special random sequence emitting states. Circles indicate delete states (modeling deletions relative to consensus) and special begin/end states. Arrows indicate state transitions. See text for more details.

The section of the model composed of M, D, and I states, and the B and E states, is essentially a Krogh/Haussler profile HMM. I refer to this as the ``main model''. A group of three states M/D/I at the same consensus position in the alignment is called a ``node''. The main model controls the data dependent features of the model. The probability parameters in the main model are generally learned from data in a multiple sequence/structure alignment.

Unlike the original Krogh/Haussler and HMMER model architecture, Plan 7 has no D $\rightarrow$ I or I $\rightarrow$ D transitions. This reduction from 9 to 7 transitions per node in the main model is the origin of the codename Plan 7. (The original HMMER architecture is called Plan 9 in parts of the code.)
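The count of 9 versus 7 can be checked by enumerating the transitions among the three state types of a node. The following is a minimal Python sketch; the variable names are illustrative and are not taken from the HMMER source.

```python
from itertools import product

# State types within one node of the main model.
STATE_TYPES = ("M", "D", "I")

# Plan 9: every ordered pair of state types is an allowed transition.
plan9 = set(product(STATE_TYPES, repeat=2))

# Plan 7: the same set, minus the D->I and I->D transitions.
plan7 = plan9 - {("D", "I"), ("I", "D")}

print(len(plan9), len(plan7))  # 9 7
```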

The other states (S,N,C,T,J) are called ``special states''. They (combined with special entry probabilities from B and exit probabilities to E) control the algorithm dependent features of the model: how likely the model is to generate various sorts of local or multihit alignments. The algorithm dependent parameters are typically not learned from data, but rather set externally by choosing a desired alignment style.

Local alignments in Plan 7

The Plan 7 architecture models a complete sequence, regardless of how much of that sequence matches the main model. All alignments to a Plan 7 model are ``global'' alignments, but some of the sequence may be assigned to Plan 7 states (N,C,J) that generate ``random'' sequence that is not aligned to the main model. Thus, the algorithm dependent parts of the model control the apparent locality of the alignments.

Local alignments with respect to the sequence (i.e., allowing a match to the main model anywhere internal to a longer sequence) are controlled by the N and C states. If the N $\rightarrow$ N transition is set to 0, alignments are constrained to start in the main model at the very first residue. Similarly, if the C $\rightarrow$ C transition is set to 0, alignments are constrained to match the main model at the very last residue in the sequence.

Local alignments with respect to the model (i.e., allowing fragments of the model to match the sequence) are controlled by B $\rightarrow$ M ``entry'' transitions and M $\rightarrow$ E ``exit'' transitions, shown as dotted lines in the Plan 7 figure. Setting all entries but the B $\rightarrow$ M1 transition to 0 forces a partially ``global'' alignment in which all alignments to the main model must start at the first match or delete state. Setting all exits to 0 but the final M $\rightarrow$ E transition (which is always 1.0) forces a partially global alignment in which all alignments to the main model must end at the final match or delete state.

Multiple hit alignments are controlled by the E $\rightarrow$ J transition and the J state. If the E $\rightarrow$ J transition is set to 0, a sequence may only contain one domain (one alignment to the main model). If it is nonzero, more than one domain per sequence can be aligned to the main model. The J $\rightarrow$ J transition controls the expected length of the intervening sequence between domains; the lower this probability, the more clustered the domains are expected to be.
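To see how a self-loop probability such as J $\rightarrow$ J relates to an expected length, here is a small sketch. It assumes that the looping state emits exactly one residue each time the self-transition is taken, so the number of residues is geometrically distributed; that emission convention and the function name expected_loop_length are assumptions of this example, not statements about the HMMER implementation.

```python
def expected_loop_length(p_loop: float) -> float:
    """Expected number of residues emitted by a self-looping state.

    Assumes one residue per self-transition, so the count k follows
    P(k) = p_loop**k * (1 - p_loop), whose mean is p_loop / (1 - p_loop).
    """
    if not 0.0 <= p_loop < 1.0:
        raise ValueError("p_loop must be in [0, 1)")
    return p_loop / (1.0 - p_loop)


# Lower loop probabilities give shorter expected intervening sequence.
for p in (0.0, 0.5, 0.9, 0.99):
    print(p, expected_loop_length(p))
```

Under the same assumption, setting the loop probability to 0 gives an expected length of 0, which matches the behavior described above for N $\rightarrow$ N and C $\rightarrow$ C.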

The original HMMER1 search programs are encoded in Plan 7 models as follows:

One advantage of Plan 7 is great flexibility in choosing an alignment style. Complicated alignment styles are easily encoded in the model parameters without changing the alignment algorithm. For example, say you wanted to model human L1 retrotransposon elements. Because of the way L1 elements are inserted by reverse transcriptase (RVT), L1 elements tend to have a defined 3' end (RVT starts replication at the same place in each new L1) but a ragged 5' end (RVT prematurely falls off a new L1 in an unpredictable fashion). A specialized L1 model could define non-zero internal entry probabilities and zero internal exit probabilities to model this biological situation.
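As a rough illustration of the L1 example, the sketch below builds entry and exit probabilities for a small model. The function entry_exit is hypothetical, the uniform spread of internal entry probability is an assumption made for simplicity, the B $\rightarrow$ D1 transition is ignored, and the numbers are invented.

```python
def entry_exit(M, internal_entry, internal_exit):
    """Return (entry, exit_) lists indexed 0..M-1 for match states M1..MM.

    internal_entry: total probability of B -> Mk for k > 1, spread
                    uniformly (an assumption of this sketch).
    internal_exit:  probability of Mk -> E at each internal node k < M.
    """
    if M < 2:
        raise ValueError("need at least two nodes")
    entry = [1.0 - internal_entry] + [internal_entry / (M - 1)] * (M - 1)
    exit_ = [internal_exit] * (M - 1) + [1.0]  # final M -> E is always 1.0
    return entry, exit_


# L1-like style: ragged 5' end (internal entries allowed),
# defined 3' end (no internal exits).
entry, exit_ = entry_exit(M=5, internal_entry=0.5, internal_exit=0.0)
print(entry)
print(exit_)
```

Calling entry_exit with internal_entry=0.0 and internal_exit=0.0 would instead describe an alignment that is global with respect to the model.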

One disadvantage of Plan 7 is that if you decide you want to do both local and global alignments, you need two different models (or you need to do one search, then change the model). This wouldn't be a terrible burden except for the fact that the algorithm-dependent parameters strongly affect the values of the $\mu$ and $\lambda$ parameters that E-value statistics depend on. If the algorithm-dependent parameters are changed, the $\mu$ and $\lambda$ values are lost and the model should be recalibrated with hmmcalibrate, which is relatively slow.

The Plan 7 null model

When HMM alignments are scored, they are scored by a log-odds score relative to a ``null model'' of random sequence composition [Barrett et al., 1997]. In Plan 7, this model is now specified as a full probabilistic model too:


The G state has a symbol emission probability distribution for K symbols in the alphabet. By default, this distribution is set either to the average amino acid composition of SWISSPROT 34, or to 0.25 for each nucleotide. The G $\rightarrow$ G transition controls the expected length of observed random sequences; in practice, this transition probability is so close to 1 that it has very little effect. The F state is just a dummy end state like the T state in the Plan 7 architecture.
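The following sketch shows one way the null model probability and a log-odds score could be computed. It assumes that G emits one residue per G $\rightarrow$ G transition and that the path ends with a single G $\rightarrow$ F transition; the function names, the use of base-2 logarithms, the value 0.999 for G $\rightarrow$ G, and the model log-probability are all illustrative assumptions rather than values taken from HMMER.

```python
import math


def null_log2_prob(seq, emit, p_gg):
    """log2 P(seq | null), assuming one emission per G->G and a final G->F."""
    log_p = sum(math.log2(emit[c]) for c in seq)
    log_p += len(seq) * math.log2(p_gg) + math.log2(1.0 - p_gg)
    return log_p


def log_odds_bits(log2_p_model, log2_p_null):
    """Log-odds score in bits: log2 of P(seq | model) / P(seq | null)."""
    return log2_p_model - log2_p_null


# Default nucleotide null emissions described in the text: 0.25 each.
dna = {c: 0.25 for c in "ACGT"}
null_lp = null_log2_prob("ACGTACGT", dna, p_gg=0.999)

# Hypothetical log2 probability of the same sequence under some profile HMM.
model_lp = -12.0
print(null_lp, log_odds_bits(model_lp, null_lp))
```

A positive score means the sequence is more probable under the profile than under the null model of random sequence.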

Wing retraction in Plan 7 dynamic programming

In the figure of the Plan 7 architecture, you may have noticed that the first and last delete states are greyed out. Internally in HMMER, these delete states exist in the ``probability form'' of the model (when the model is being worked with in every way except alignments) but they are carefully removed in the ``search form'' of the model (when the model is converted to log-odds scores and used for alignments). This process is called ``wing retraction'' in the code, by analogy to a swept-wing fighter changing from a wings-out takeoff and landing configuration to a wings-back configuration for high speed flight.

The problem is that the Plan 7 model allows cycles through the J state. If a continuous nonemitting ``mute cycle'' were possible (J, B, D states, E, and back to J), dynamic programming recursions would fail. This is why special mute states like delete states must be handled carefully in HMM dynamic programming algorithms; see [Durbin et al., 1998] for further discussion. The easiest way to prevent a mute cycle is to make sure that the model must pass through at least one match state per path through the main model.
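The mute cycle can be made concrete by treating the non-emitting transitions as a directed graph and searching for a cycle. This is only an illustrative sketch: the helper names are hypothetical, and the edge lists encode the assumption that removing the first and last delete states removes the B $\rightarrow$ D1 and final D $\rightarrow$ E transitions.

```python
def has_mute_cycle(edges):
    """Detect a cycle in a directed graph of non-emitting transitions (DFS)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {s: WHITE for s in graph}

    def visit(s):
        color[s] = GREY
        for t in graph[s]:
            # A GREY neighbor is on the current DFS path: a cycle exists.
            if color[t] == GREY or (color[t] == WHITE and visit(t)):
                return True
        color[s] = BLACK
        return False

    return any(color[s] == WHITE and visit(s) for s in graph)


def mute_edges(M, wings_retracted):
    """Non-emitting transitions for an M-node model (illustrative only)."""
    deletes = range(2, M) if wings_retracted else range(1, M + 1)
    d = [f"D{k}" for k in deletes]
    edges = [("E", "J"), ("J", "B")]
    edges += list(zip(d, d[1:]))  # Dk -> D(k+1)
    if not wings_retracted:
        edges += [("B", "D1"), (f"D{M}", "E")]
    return edges


print(has_mute_cycle(mute_edges(4, wings_retracted=False)))  # True
print(has_mute_cycle(mute_edges(4, wings_retracted=True)))   # False
```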

Wing retraction involves folding the probabilities of the terminal delete paths into the Plan 7 entry and exit probabilities. For example, in wing retraction the ``algorithm dependent'' B $\rightarrow$ M3 entry probability is incremented by the probability of the ``data dependent'' path B $\rightarrow$ D1 $\rightarrow$ D2 $\rightarrow$ M3.
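A minimal sketch of this folding step for the entry side is shown below. The function retract_left_wing and its argument layout are hypothetical and the probabilities are invented; the exit side would be handled analogously.

```python
def retract_left_wing(entry, b_to_d1, d_to_d, d_to_m):
    """Fold B -> D1 -> ... -> D(k-1) -> Mk paths into the B -> Mk entries.

    entry[k-1]  : B -> Mk entry probability, k = 1..M
    b_to_d1     : B -> D1 probability
    d_to_d[k-1] : Dk -> D(k+1) probability
    d_to_m[k-1] : Dk -> M(k+1) probability
    """
    folded = list(entry)
    path = b_to_d1  # probability of reaching D1 from B
    for k in range(2, len(entry) + 1):
        folded[k - 1] += path * d_to_m[k - 2]  # ... D(k-1) -> Mk
        path *= d_to_d[k - 2]                  # extend the path to Dk
    return folded


# Three-node example with made-up probabilities.
folded = retract_left_wing(
    entry=[0.7, 0.1, 0.1],
    b_to_d1=0.1,
    d_to_d=[0.5, 0.5],
    d_to_m=[0.5, 0.5],
)
print(folded)
```

In this example the B $\rightarrow$ M3 entry grows by 0.1 x 0.5 x 0.5, the probability of the path B $\rightarrow$ D1 $\rightarrow$ D2 $\rightarrow$ M3.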

Having the wing retraction step, rather than always folding these probabilities together, is a design decision, preserving a distinction between the ``algorithm dependent'' and ``data dependent'' parts of the model.


Direct comments and questions to <eddy@genetics.wustl.edu>