Profile hidden Markov models (profile HMMs) are statistical models of the primary structure consensus of a sequence family. Anders Krogh, David Haussler, and co-workers at UC Santa Cruz introduced profile HMMs [Krogh et al., 1994], adopting HMM techniques that had been used for years in speech recognition. HMMs had been used in biology before the Krogh/Haussler work, but the Krogh paper had a particularly dramatic impact, because HMM technology was so well-suited to the popular ``profile'' methods for searching databases using multiple sequence alignments instead of single query sequences. Since then, several computational biology groups (including ours) have adopted HMMs as the underlying formalism for sequence profile analysis.
``Profiles'' were introduced by Gribskov and colleagues [Gribskov et al., 1987, Gribskov et al., 1990] at about the same time that other groups introduced similar approaches, such as ``flexible patterns'' [Barton, 1990] and ``templates'' [Bashford et al., 1987, Taylor, 1986]. The term ``profile'' has stuck. All of these are more or less statistical descriptions of the consensus of a multiple sequence alignment. They use position-specific scores for amino acids (or nucleotides) and position-specific scores for opening and extending an insertion or deletion. In contrast, traditional pairwise alignment (for example, BLAST [Altschul et al., 1990], FASTA [Pearson and Lipman, 1988], or the Smith/Waterman algorithm [Smith and Waterman, 1981]) uses position-independent scoring parameters. The position-specific scoring of profiles captures important information about the degree of conservation at various positions in the multiple alignment, and the varying degree to which gaps and insertions are permitted.
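The difference between position-independent and position-specific scoring can be sketched in a few lines of Python. All scores below are made-up toy values for illustration; they are not HMMER's, BLOSUM's, or any real profile's parameters:

```python
# Position-independent scoring: one substitution matrix scores every
# aligned column the same way, as in BLAST or Smith/Waterman. (Toy values.)
subst = {("A", "A"): 4, ("A", "G"): 0, ("G", "A"): 0, ("G", "G"): 6}

def pairwise_score(query, target):
    """Sum the same substitution score over every aligned column."""
    return sum(subst[(q, t)] for q, t in zip(query, target))

# Position-specific scoring: a profile stores a separate score vector
# per consensus column, so strongly conserved columns can be weighted
# more heavily than variable ones. (Toy values.)
profile = [
    {"A": 5, "G": -2},   # column 1: strongly prefers A
    {"A": 1, "G": 1},    # column 2: weakly conserved
    {"A": -3, "G": 6},   # column 3: strongly prefers G
]

def profile_score(seq):
    """Sum each column's own score for the residue aligned to it."""
    return sum(col[res] for col, res in zip(profile, seq))

print(pairwise_score("AAG", "AAG"))  # 4 + 4 + 6 = 14
print(profile_score("AAG"))          # 5 + 1 + 6 = 12
print(profile_score("GAA"))          # -2 + 1 - 3 = -4
```

The point of the sketch is only the shape of the lookup: the pairwise matrix is indexed by residue pair alone, while the profile is indexed by column first, then residue.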
The advantage of using HMMs is that they have a formal probabilistic basis. We can use Bayesian probability theory to guide how all the probability (scoring) parameters should be set. Though this might sound like a purely academic issue, the probabilistic basis lets us do things that the more heuristic methods cannot do easily. For example, an HMM can be trained from unaligned sequences if a trusted alignment isn't yet known. Another consequence is that HMMs have a consistent theory behind gap and insertion scores. In most details, HMMs are only a slight improvement over a carefully constructed profile, but far less skill and manual intervention are necessary to train a good HMM and use it. This allows us to make libraries of hundreds of profile HMMs and apply them on a very large scale to whole-genome or EST sequence analysis. One such database of protein domain models is Pfam [Sonnhammer et al., 1997]; the construction and use of Pfam is tightly tied to the HMMER software package.
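As a much-simplified sketch of what setting scores probabilistically means: a column's match scores can be derived as log-odds ratios of estimated residue probabilities to background frequencies. The counts below are invented, and the uniform +1 pseudocount is only a crude stand-in for a real prior (practical profile HMM software uses more sophisticated Dirichlet priors):

```python
import math

# Uniform background frequencies for a 4-letter nucleotide alphabet:
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Invented residue counts observed in one column of a trusted alignment:
counts = {"A": 8, "C": 0, "G": 1, "T": 1}

# A uniform +1 pseudocount (a crude stand-in for a Dirichlet prior)
# converts counts to probabilities without zeroing out unseen residues:
total = sum(counts.values()) + len(counts)
probs = {r: (c + 1) / total for r, c in counts.items()}

# Each match score is a log-odds ratio, in bits: positive where the
# column is enriched relative to background, negative where depleted.
scores = {r: math.log2(probs[r] / background[r]) for r in probs}
print(scores["A"])  # enriched residue, score > 0
print(scores["C"])  # unseen residue, score < 0 but finite
```

The same recipe (probability estimate plus prior, then log-odds against a background) applies uniformly to emission and transition parameters, which is what gives gap and insertion scores a consistent theory rather than hand-tuned penalties.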
HMMs do have important limitations. One is that HMMs do not capture any higher-order correlations: an HMM assumes that the identity of a particular position is independent of the identity of all other positions. HMMs therefore make poor models of RNAs, for instance, because an HMM cannot describe base pairs. Contrast this with protein ``threading'' methods, which include scoring terms for nearby amino acids in a three-dimensional protein structure.
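The independence assumption can be made concrete: along a fixed path through the model, the probability of a sequence is a plain product of per-position terms, with no factor coupling one position to another. A minimal sketch with an invented match-only toy model over an RNA alphabet:

```python
# Invented per-position emission probabilities for a 3-column,
# match-only toy model (transitions omitted for simplicity):
emissions = [
    {"A": 0.9, "U": 0.1},   # column 1
    {"A": 0.5, "U": 0.5},   # column 2
    {"A": 0.1, "U": 0.9},   # column 3
]

def seq_probability(seq):
    """P(seq) factorizes into independent per-position emissions."""
    p = 1.0
    for col, res in zip(emissions, seq):
        p *= col[res]
    return p

# Because there is no cross-position term in the product, the model
# cannot express a constraint like "columns 1 and 3 must form an A-U
# base pair"; each column is scored in isolation.
print(seq_probability("AAU"))  # 0.9 * 0.5 * 0.9 = 0.405
```

Modeling base pairing requires a formalism whose probability terms can span two positions at once, which is exactly what this factorized product rules out.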
A general definition of HMMs and an excellent tutorial introduction to their use have been written by Rabiner [Rabiner, 1989]. Throughout, I will often use ``HMM'' to refer to the specific case of profile HMMs as described by Krogh et al. [Krogh et al., 1994]. This shorthand usage is for convenience only. For a review of profile HMMs, see [Eddy, 1996], and for a complete book on the subject of probabilistic modeling in computational biology, see [Durbin et al., 1998].