![]() |
HMMER
User's Guide |
One common use of HMMER is to search a sequence database for homologues of a protein family of interest. You need a multiple sequence alignment of the sequence family you're interested in. (Profile HMMs can be trained from unaligned sequences; however, this functionality is temporarily withdrawn from HMMER. I recommend CLUSTALW as an excellent, freely available multiple sequence alignment program.)
Let's assume you have a multiple sequence alignment of a protein domain or protein sequence family. To use HMMER to search for additional remote homologues of the family, you want to first build a profile HMM from the alignment. The following command builds a profile HMM from the alignment of 50 globin sequences in globins50.msf:
> hmmbuild globin.hmm globins50.msf
hmmbuild - build a hidden Markov model from an alignment HMMER 2.0 (June 1998) Copyright (C) 1992-1998 Washington University School of Medicine HMMER is freely distributed under the GNU General Public License (GPL). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Training alignment: globins50.msf Number of sequences: 50 Search algorithm configuration: Multiple domain (hmmls) Model construction strategy: MAP (gapmax hint: 0.50) Prior used: (default) Prior strategy: Dirichlet Sequence weighting method: G/S/C tree weights - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Determining effective sequence number ... done. [13] Weighting sequences heuristically ... done. Constructing model architecture ... done. Converting counts to probabilities ... done. Setting model name, etc. ... done. [globins50] Constructed a profile HMM (length 162) Average score: 283.03 bits Minimum score: 137.32 bits Maximum score: 343.50 bits Std. deviation: 53.21 bits Finalizing model configuration ... done. Saving model to file ... done. [globin.hmm]
The process takes a second or two. hmmbuild create a new HMM file called globin.hmm. This is a human and computer readable ASCII text file, but for now you don't care. You also don't care for now what all the stuff in the output means; I'll describe it in detail later. The profile HMM can be treated as a compiled model of your alignment.
This step is optional, but doing it will increase the sensitivity of your database search.
When you search a sequence database, it is useful to get ``E-values'' (expectation values) in addition to raw scores. When you see a database hit that scores x, an E-value tells you the number of hits you would've expected to score x or more just by chance in a sequence database of this size.
HMMER will always estimate an E-value for your hits. However, unless you ``calibrate'' your model before a database search, HMMER uses an analytic upper bound calculation that is extremely conservative. An empirical HMM calibration costs time (about 10% the time of a SWISSPROT search) but it only has to be done once per model, and can greatly increase the sensitivity of a database search. To empirically calibrate the E-value calculations for the globin model, type:
> hmmcalibrate globin.hmm
hmmcalibrate -- calibrate HMM search statistics HMMER 2.0 (June 1998) Copyright (C) 1992-1998 Washington University School of Medicine HMMER is freely distributed under the GNU General Public License (GPL). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM file: globin.hmm Length distribution mean: 325 Length distribution s.d.: 200 Number of samples: 5000 random seed: 895790561 histogram(s) saved to: [not saved] - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM : globins50 mu : -107.928612 lambda : 0.186579 max : -63.859001 //
This takes several minutes. Go have a cup of coffee. When it is complete, the relevant parameters are added to the HMM file.
Calibrated HMMER E-values tend to be relatively accurate. E-values of 0.1 or less are, in general, very significant hits. Uncalibrated HMMER E-values are also reliable, erring on the cautious side; uncalibrated models may miss remote homologues.
As an example of searching for new homologues using a profile HMM, we'll use the globin model to search for globin domains in the example Artemia globin sequence in Artemia.fa:
> hmmsearch globin.hmm Artemia.fa
The output comes in several sections, and unlike building and calibrating the HMM (where we treated the HMM as a black box), now you do care about what it's saying.
The first section is the header that tells you waht program you ran, on what, and with what options:
hmmsearch - search a sequence database with a profile HMM HMMER 2.0 (June 1998) Copyright (C) 1992-1998 Washington University School of Medicine HMMER is freely distributed under the GNU General Public License (GPL). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM file: globin.hmm [globins50] Sequence database: Artemia.fa - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Query HMM: globins50 [HMM has been calibrated; E-values are empirical estimates]
The second section is the sequence top hits list. It is a list of ranked top hits (sorted by E-value, most significant hit first), formatted in a BLAST-like style:
Scores for complete sequences (score includes all domains): Sequence Description Score E-value N -------- ----------- ----- ------- --- S13421 S13421 GLOBIN - BRINE SHRIMP 335.5 1e-101 8
The first field is the name of the target sequence, then followed by the description line for the sequence. The last three fields are the raw score (in units of ``bits''), the estimated E-value, and the total number of domains detected in the sequence. By default, every sequence with an E-value over 10.0 is listed in this output.
The second section is the domain top hits list. By default, for every sequence with an E-value less than 10, every domain with a non-zero raw score is listed. (Read that carefully. In a later chapter we'll discuss some caveats about how hmmsearch identifies domains, and how to control its output in different ways.) Each domain detected in the search is output in a list ranked by E-value:
Parsed for domains: Sequence Domain seq-f seq-t hmm-f hmm-t score E-value -------- ------- ----- ----- ----- ----- ----- ------- S13421 6/8 928 1075 .. 1 162 [] 65.8 1.5e-20 S13421 2/8 149 288 .. 1 162 [] 59.4 1.3e-18 S13421 3/8 303 450 .. 1 162 [] 58.4 2.6e-18 S13421 8/8 1238 1390 .. 1 162 [] 42.5 1.6e-13 S13421 5/8 771 918 .. 1 162 [] 33.8 3.3e-12 S13421 7/8 1085 1234 .. 1 162 [] 32.0 4.6e-12 S13421 4/8 454 607 .. 1 162 [] 26.1 1.4e-11 S13421 1/8 1 139 [. 1 162 [] 21.9 3e-11
The first field is the name of the target sequence. The second field is the number of this domain: e.g. ``6/8'' means the sixth domain of eight total domains detected.
The fields marked ``seq-f'' and ``seq-t'' mean ``sequence from'' and ``sequence to'': the start and end points of the alignment on the target sequence. After these two fields is a shorthand annotation for whether the alignment is ``global'' with respect to the sequence or not. A dot (.) means the alignment does not go all the way to the end; a bracket ([ or ]) means it does. Thus, .. means that the alignment is local within the sequence; [. means that the alignment starts at the beginning of the sequence, but doesn't go all the way to its end; .] means the alignment starts somewhere internally and goes all the way to the end; and [] means the alignment includes the entire sequence.
Analogously, the fields marked ``hmm-f'' and ``hmm-t'' indicate the start and end points with respect to the consensus coordinates of the model, and the following field is a shorthand for whether the alignment is global with respect to the model. Here, for instance, all the globin domains in the Artemia sequence are complete matches to the entire globin model - because, by default, hmmbuild built the HMM to only look for those kinds of alignments. We'll discuss later how to modify the profile HMM for other search styles.
The final two fields are the raw score in bits and the estimated E-value, for the isolated domain. Because of the method HMMER uses to correct raw scores for biased sequence composition, the raw scores for the domains do not necessarily sum up to the raw score of the sequence.
The next section is the alignment output. By default, every domain that appeared in the domain top hits list now appears as a BLAST-like alignment. For example:
Alignments of top-scoring domains: S13421: domain 6 of 8, from 928 to 1075: score 65.8, E = 1.5e-20 *->vilealvnssShLSaeekalVkslWYgKVegnaeeiGaeaLgRlFvv + LSa e a Vk++W V+ ++ ++G ++ lF + S13421 928 G-----------LSAREVAVVKQTW-NLVKPDLMGVGMRIFKSLFEA 962 YPwTqryFphFgdLssldavkgspkvKaHGkKVltalgdavkhLDdtgnl +P q+ Fp+F+d+ +ld +++ p v +H V t l++ ++ LD nl S13421 963 FPAYQAVFPKFSDV-PLDKLEDTPAVGKHSISVTTKLDELIQTLDEPANL 1011 kgalakLSelHadklrVDPeNFklLghvlvvvLaehfgkdftPevqAAwd + +L+e H lrV+ Fk +g+vlv L +g f+ + +w S13421 1012 ALLARQLGEDH-IVLRVNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWH 1060 KflagvanaLahKYr<-* K++++++ +++ S13421 1061 KAYDVIVEYIEEGLQ 1075
The top line is the HMM consensus. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM (not necessarily the highest scoring amino acid, though). Capital letters mean ``highly conserved'' residues: those with a probability of > 0.5 for protein models, or > 0.9 for DNA models.
The center line shows letters for ``exact'' matches to the highest probability residue in the HMM, or a ``+'' when the match has a positive score and is therefore considered to be ``conservative'' according to the HMM's view of this particular position in the model - not the usual definition of conservative changes in general.
The third line shows the sequence itself, of course.
The next section of the output is the score histogram. It shows a histogram with raw score increasing along the Y axis, and the number of sequence hits represented as a bar along the X axis. In our example here, since there's only a single sequence, the histogram is very boring:
Histogram of all scores: score obs exp (one = represents 1 sequences) ----- --- --- 335 1 0|=
Notice though that it's a histogram of the whole sequence hits, not the domain hits.
You can ignore the rest of the hmmsearch output:
% Statistical details of theoretical EVD fit: mu = -107.9286 lambda = 0.1866 chi-sq statistic = 0.0000 P(chi-square) = 0 Whole sequence top hits: tophits_s report: Total hits: 1 Satisfying E cutoff: 1 Total memory: 15K Domain top hits: tophits_s report: Total hits: 8 Satisfying E cutoff: 8 Total memory: 20K
This is just some trailing internal info about the search that's useful to me sometimes, but probably not to you.
HMMER reads all major database formats and does not need any special database indexing. You can search any large sequence database you have installed locally just by giving the full path to the database file, e.g. something like:
> hmmsearch globin.hmm /nfs/databases/swiss35/sprot35.dat
If you have BLAST installed locally, it's likely that you have a directory (or directories) in which the BLAST databases are kept. These directories are specified in an environment variable called BLASTDB. HMMER will read the same environment variable. For example, if you have a BLAST database called /nfs/databases/blast-db/swiss35, the following commands will work:
> setenv BLASTDB /nfs/databases/blast-db/
> hmmsearch globin.hmm swiss35
Obviously, you'd tend to have the setenv command as part of the local configuration of your machine, rather than typing it at the command line.
This is extremely important. HMMER does not do local (Smith/Waterman) and global (Needleman/Wunsch) style alignments in the same way that most computational biology analysis programs do it. To HMMER, whether local or global alignments are allowed is part of the model, rather than being accomplished by running a different algorithm. (This will be discussed in greater detail later; it is part of the ``Plan7'' architecture of the new HMMER2 models.)
Therefore, you need to choose what kind of alignments you want to allow when you build the model with hmmbuild. By default, hmmbuild builds models which allow alignments that are global with respect to the HMM, local with respect to the sequence, and allows multiple domains to hit per sequence. Such models will only find complete domains.
hmmbuild provides some standard options for common alignment styles. The following table shows the four alignment styles supported by hmmbuild, and also shows the equivalent old HMMER 1.x search program style (to orient experienced HMMER users).
Command | w.r.t. sequence | w.r.t. HMM | multidomain | HMMER 1 equivalent |
hmmbuild | local | global | yes | hmmls |
hmmbuild -f | local | local | yes | hmmfs |
hmmbuild -g | local | global | no | hmms |
hmmbuild -s | local | local | no | hmmsw |
In brief, if you want maximum sensitivity at the expense of only finding complete domains, use the hmmbuild default. If you need to find fragments (local alignments) too, and are willing to give up some sensitivity to see them, use hmmbuild -f. If you want the best of both worlds, build two models and search with both of them.