User's Guide

| Dept. of Genetics | WashU | Medical School | Sequencing Center | CGM | IBC|
| Eddy lab | Internal (lab only) | HMMER | PFAM | tRNAscan-SE | Software | Publications |

next up previous contents
Next: Searching a query sequence Up: Tutorial Previous: Files used in the


Searching a sequence database with a single profile HMM

One common use of HMMER is to search a sequence database for homologues of a protein family of interest. You need a multiple sequence alignment of the sequence family you're interested in. (Profile HMMs can be trained from unaligned sequences; however, this functionality is temporarily withdrawn from HMMER. I recommend CLUSTALW as an excellent, freely available multiple sequence alignment program.)

HMM construction with hmmbuild

Let's assume you have a multiple sequence alignment of a protein domain or protein sequence family. To use HMMER to search for additional remote homologues of the family, you want to first build a profile HMM from the alignment. The following command builds a profile HMM from the alignment of 50 globin sequences in globins50.msf:

> hmmbuild globin.hmm globins50.msf

hmmbuild - build a hidden Markov model from an alignment
HMMER 2.0 (June 1998)
Copyright (C) 1992-1998 Washington University School of Medicine
HMMER is freely distributed under the GNU General Public License (GPL).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Training alignment:                globins50.msf
Number of sequences:               50
Search algorithm configuration:    Multiple domain (hmmls)
Model construction strategy:       MAP (gapmax hint: 0.50)
Prior used:                        (default)
Prior strategy:                    Dirichlet
Sequence weighting method:         G/S/C tree weights
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Determining effective sequence number    ... done. [13]
Weighting sequences heuristically        ... done.
Constructing model architecture          ... done.
Converting counts to probabilities       ... done.
Setting model name, etc.                 ... done. [globins50]

Constructed a profile HMM (length 162)
Average score:      283.03 bits
Minimum score:      137.32 bits
Maximum score:      343.50 bits
Std. deviation:      53.21 bits

Finalizing model configuration           ... done.
Saving model to file                     ... done. [globin.hmm]

The process takes a second or two. hmmbuild create a new HMM file called globin.hmm. This is a human and computer readable ASCII text file, but for now you don't care. You also don't care for now what all the stuff in the output means; I'll describe it in detail later. The profile HMM can be treated as a compiled model of your alignment.

HMM calibration with hmmcalibrate

This step is optional, but doing it will increase the sensitivity of your database search.

When you search a sequence database, it is useful to get ``E-values'' (expectation values) in addition to raw scores. When you see a database hit that scores x, an E-value tells you the number of hits you would've expected to score x or more just by chance in a sequence database of this size.

HMMER will always estimate an E-value for your hits. However, unless you ``calibrate'' your model before a database search, HMMER uses an analytic upper bound calculation that is extremely conservative. An empirical HMM calibration costs time (about 10% the time of a SWISSPROT search) but it only has to be done once per model, and can greatly increase the sensitivity of a database search. To empirically calibrate the E-value calculations for the globin model, type:

> hmmcalibrate globin.hmm

hmmcalibrate -- calibrate HMM search statistics
HMMER 2.0 (June 1998)
Copyright (C) 1992-1998 Washington University School of Medicine
HMMER is freely distributed under the GNU General Public License (GPL).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                 globin.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples:        5000
random seed:              895790561
histogram(s) saved to:    [not saved]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

HMM    : globins50
mu     :  -107.928612
lambda :     0.186579
max    :   -63.859001

This takes several minutes. Go have a cup of coffee. When it is complete, the relevant parameters are added to the HMM file.

Calibrated HMMER E-values tend to be relatively accurate. E-values of 0.1 or less are, in general, very significant hits. Uncalibrated HMMER E-values are also reliable, erring on the cautious side; uncalibrated models may miss remote homologues.

Sequence database search with hmmsearch

As an example of searching for new homologues using a profile HMM, we'll use the globin model to search for globin domains in the example Artemia globin sequence in Artemia.fa:

> hmmsearch globin.hmm Artemia.fa

The output comes in several sections, and unlike building and calibrating the HMM (where we treated the HMM as a black box), now you do care about what it's saying.

The first section is the header that tells you waht program you ran, on what, and with what options:

hmmsearch - search a sequence database with a profile HMM
HMMER 2.0 (June 1998)
Copyright (C) 1992-1998 Washington University School of Medicine
HMMER is freely distributed under the GNU General Public License (GPL).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                 globin.hmm [globins50]
Sequence database:        Artemia.fa
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query HMM:  globins50  
  [HMM has been calibrated; E-values are empirical estimates]

The second section is the sequence top hits list. It is a list of ranked top hits (sorted by E-value, most significant hit first), formatted in a BLAST-like style:

Scores for complete sequences (score includes all domains):
Sequence Description                                    Score    E-value  N 
-------- -----------                                    -----    ------- ---
S13421   S13421 GLOBIN - BRINE SHRIMP                   335.5     1e-101   8

The first field is the name of the target sequence, then followed by the description line for the sequence. The last three fields are the raw score (in units of ``bits''), the estimated E-value, and the total number of domains detected in the sequence. By default, every sequence with an E-value over 10.0 is listed in this output.

The second section is the domain top hits list. By default, for every sequence with an E-value less than 10, every domain with a non-zero raw score is listed. (Read that carefully. In a later chapter we'll discuss some caveats about how hmmsearch identifies domains, and how to control its output in different ways.) Each domain detected in the search is output in a list ranked by E-value:

Parsed for domains:
Sequence Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
-------- ------- ----- -----    ----- -----      -----  -------
S13421     6/8     928  1075 ..     1   162 []    65.8  1.5e-20
S13421     2/8     149   288 ..     1   162 []    59.4  1.3e-18
S13421     3/8     303   450 ..     1   162 []    58.4  2.6e-18
S13421     8/8    1238  1390 ..     1   162 []    42.5  1.6e-13
S13421     5/8     771   918 ..     1   162 []    33.8  3.3e-12
S13421     7/8    1085  1234 ..     1   162 []    32.0  4.6e-12
S13421     4/8     454   607 ..     1   162 []    26.1  1.4e-11
S13421     1/8       1   139 [.     1   162 []    21.9    3e-11

The first field is the name of the target sequence. The second field is the number of this domain: e.g. ``6/8'' means the sixth domain of eight total domains detected.

The fields marked ``seq-f'' and ``seq-t'' mean ``sequence from'' and ``sequence to'': the start and end points of the alignment on the target sequence. After these two fields is a shorthand annotation for whether the alignment is ``global'' with respect to the sequence or not. A dot (.) means the alignment does not go all the way to the end; a bracket ([ or ]) means it does. Thus, .. means that the alignment is local within the sequence; [. means that the alignment starts at the beginning of the sequence, but doesn't go all the way to its end; .] means the alignment starts somewhere internally and goes all the way to the end; and [] means the alignment includes the entire sequence.

Analogously, the fields marked ``hmm-f'' and ``hmm-t'' indicate the start and end points with respect to the consensus coordinates of the model, and the following field is a shorthand for whether the alignment is global with respect to the model. Here, for instance, all the globin domains in the Artemia sequence are complete matches to the entire globin model - because, by default, hmmbuild built the HMM to only look for those kinds of alignments. We'll discuss later how to modify the profile HMM for other search styles.

The final two fields are the raw score in bits and the estimated E-value, for the isolated domain. Because of the method HMMER uses to correct raw scores for biased sequence composition, the raw scores for the domains do not necessarily sum up to the raw score of the sequence.

The next section is the alignment output. By default, every domain that appeared in the domain top hits list now appears as a BLAST-like alignment. For example:

Alignments of top-scoring domains:
S13421: domain 6 of 8, from 928 to 1075: score 65.8, E = 1.5e-20
                      +           LSa e a Vk++W   V+ ++ ++G  ++  lF +
      S13421   928    G-----------LSAREVAVVKQTW-NLVKPDLMGVGMRIFKSLFEA 962  

                   +P  q+ Fp+F+d+ +ld +++ p v +H   V t l++ ++ LD   nl

                   +    +L+e H   lrV+   Fk +g+vlv  L   +g  f+  +  +w 

                   K++++++  +++      
      S13421  1061 KAYDVIVEYIEEGLQ    1075

The top line is the HMM consensus. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM (not necessarily the highest scoring amino acid, though). Capital letters mean ``highly conserved'' residues: those with a probability of > 0.5 for protein models, or > 0.9 for DNA models.

The center line shows letters for ``exact'' matches to the highest probability residue in the HMM, or a ``+'' when the match has a positive score and is therefore considered to be ``conservative'' according to the HMM's view of this particular position in the model - not the usual definition of conservative changes in general.

The third line shows the sequence itself, of course.

The next section of the output is the score histogram. It shows a histogram with raw score increasing along the Y axis, and the number of sequence hits represented as a bar along the X axis. In our example here, since there's only a single sequence, the histogram is very boring:

Histogram of all scores:
score    obs    exp  (one = represents 1 sequences)
-----    ---    ---
  335      1      0|=

Notice though that it's a histogram of the whole sequence hits, not the domain hits.

You can ignore the rest of the hmmsearch output:

% Statistical details of theoretical EVD fit:
              mu =  -107.9286
          lambda =     0.1866
chi-sq statistic =     0.0000
  P(chi-square)  =          0

Whole sequence top hits:
tophits_s report:
     Total hits:           1
     Satisfying E cutoff:  1
     Total memory:         15K

Domain top hits:
tophits_s report:
     Total hits:           8
     Satisfying E cutoff:  8
     Total memory:         20K

This is just some trailing internal info about the search that's useful to me sometimes, but probably not to you.

Searching major databases like NR or SWISSPROT

HMMER reads all major database formats and does not need any special database indexing. You can search any large sequence database you have installed locally just by giving the full path to the database file, e.g. something like:

> hmmsearch globin.hmm /nfs/databases/swiss35/sprot35.dat

If you have BLAST installed locally, it's likely that you have a directory (or directories) in which the BLAST databases are kept. These directories are specified in an environment variable called BLASTDB. HMMER will read the same environment variable. For example, if you have a BLAST database called /nfs/databases/blast-db/swiss35, the following commands will work:

> setenv BLASTDB /nfs/databases/blast-db/
> hmmsearch globin.hmm swiss35

Obviously, you'd tend to have the setenv command as part of the local configuration of your machine, rather than typing it at the command line.

Local alignment searches with hmmsearch

This is extremely important. HMMER does not do local (Smith/Waterman) and global (Needleman/Wunsch) style alignments in the same way that most computational biology analysis programs do it. To HMMER, whether local or global alignments are allowed is part of the model, rather than being accomplished by running a different algorithm. (This will be discussed in greater detail later; it is part of the ``Plan7'' architecture of the new HMMER2 models.)

Therefore, you need to choose what kind of alignments you want to allow when you build the model with hmmbuild. By default, hmmbuild builds models which allow alignments that are global with respect to the HMM, local with respect to the sequence, and allows multiple domains to hit per sequence. Such models will only find complete domains.

hmmbuild provides some standard options for common alignment styles. The following table shows the four alignment styles supported by hmmbuild, and also shows the equivalent old HMMER 1.x search program style (to orient experienced HMMER users).

Command w.r.t. sequence w.r.t. HMM multidomain HMMER 1 equivalent
hmmbuild local global yes hmmls
hmmbuild -f local local yes hmmfs
hmmbuild -g local global no hmms
hmmbuild -s local local no hmmsw

In brief, if you want maximum sensitivity at the expense of only finding complete domains, use the hmmbuild default. If you need to find fragments (local alignments) too, and are willing to give up some sensitivity to see them, use hmmbuild -f. If you want the best of both worlds, build two models and search with both of them.

next up previous contents
Next: Searching a query sequence Up: Tutorial Previous: Files used in the

Direct comments and questions to <eddy@genetics.wustl.edu>