HMMER
User's Guide
The file Demos/rrm.hmm gives an example of a HMMER ASCII save
file. An abridged version is shown here, where (...) mark deletions
made for clarity and space:
HMMER2.0
NAME rrm
ACC PF00076
DESC RNA recognition motif. (aka RRM, RBD, or RNP domain)
LENG 72
ALPH Amino
RF no
CS no
MAP yes
COM ../src/hmmbuild -F rrm.hmm rrm.slx
COM ../src/hmmcalibrate rrm.hmm
NSEQ 70
DATE Wed Jul 8 08:13:25 1998
CKSUM 2768
GA 13.3 0.0
TC 13.40 0.60
NC 13.20 13.20
XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455
NULE 595 -1558 85 338 -294 453 -1158 (...)
EVD -49.999123 0.271164
HMM A C D E F G H I (...)
m->m m->i m->d i->m i->i d->m d->d b->m m->e
-21 * -6129
1 -1234 -371 -8214 -7849 -5304 -8003 -7706 2384 (...) 1
- -149 -500 233 43 -381 399 106 -626 (...)
- -11 -11284 -12326 -894 -1115 -701 -1378 -21 *
2 -3634 -3460 -5973 -5340 3521 -2129 -4036 -831 (...) 2
- -149 -500 233 43 -381 399 106 -626 (...)
- -11 -11284 -12326 -894 -1115 -701 -1378 * *
(...)
71 -1165 -4790 -240 -275 -5105 -4306 1035 -2009 (...) 90
- -149 -500 233 43 -381 398 106 -626 (...)
- -43 -6001 -12336 -150 -3342 -701 -1378 * *
72 -1929 1218 -1535 -1647 -3990 -4677 -3410 1725 (...) 92
- * * * * * * * * (...)
- * * * * * * * * 0
//
HMMER2 profile HMM save files have a very different format compared to
the previous HMMER1 ASCII formats. The HMMER2 format provides all the
necessary parameters to compare a protein sequence to a HMM, including
the search mode of the HMM (hmmls, hmmfs, hmmsw, and hmms in the old
HMMER1 package), the null (background) model, and the statistics to
evaluate the match on the basis of a previously fitted extreme value
distribution.
The format consists of one or more HMMs. Each HMM starts with the
identifier ``HMMER2.0'' on a line by itself and ends with // on a line
by itself. The identifier allows backward compatibility as the HMMER
software evolves. The closing // allows multiple HMMs to be
concatenated into a single file to provide a database of HMMs.
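The record structure just described can be sketched in a few lines of Python. This is only an illustration, not HMMER's own parser: the function name split_hmm_records and the second model name fn3 in the demo text are made up for the example.

```python
def split_hmm_records(text):
    """Split the text of a save file into one list of lines per HMM."""
    records, current = [], None
    for line in text.splitlines():
        if line.strip() == "HMMER2.0":    # identifier on a line by itself opens a record
            current = [line]
        elif current is not None:
            current.append(line)
            if line.strip() == "//":      # terminator on a line by itself closes it
                records.append(current)
                current = None
    return records


# Two concatenated (and heavily truncated) models; "fn3" is a hypothetical name.
demo = "HMMER2.0\nNAME  rrm\n//\nHMMER2.0\nNAME  fn3\n//\n"
records = split_hmm_records(demo)
print(len(records))      # number of models found in the text
print(records[1][1])     # NAME line of the second model
```

Lines outside an identifier/terminator pair are ignored here; a stricter reader would probably want to report them as errors.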
The format for an HMM is divided into two regions. The first region
contains text information and miscellaneous parameters in a (roughly)
tag-value scheme, akin to EMBL formats. This section is ended by a
line beginning with the keyword HMM. The second region is of a
more fixed format and contains the main model parameters. It is ended
by the // that ends the entire definition for a single profile-HMM.
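Both regions contain probabilities that are used to parameterize the HMM.
These are stored as integers which are related to the probability via
a log-odds calculation. The log-odds score calculation is defined in
mathsupport.c and is:
\[ \mathrm{score} = \left\lfloor 0.5 + \mathrm{INTSCALE} \cdot \log_2 \frac{\mathrm{prob}}{\mathrm{null\_prob}} \right\rfloor \]
so conversely, to get a probability from the scores in an HMM save
file:
\[ \mathrm{prob} = \mathrm{null\_prob} \cdot 2^{\mathrm{score}/\mathrm{INTSCALE}} \]
INTSCALE is defined in config.h as 1000.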
Notice that you must know a null model probability to convert scores
back to HMM probabilities.
The special case of prob = 0 is translated to ``*'', so a score of *
is read as a probability of 0. Null model probabilities are not
allowed to be 0.
This log-odds format has been chosen because it has a better dynamic
range than storing probabilities as ASCII text, and because the
numbers are more meaningful to a human reader to a certain extent:
positive values mean a better than expected probability, and negative
values a worse than expected probability. However, because the scores
are converted from probabilities, you should
not edit the numbers in a HMMER save file directly. The HMM is a
probabilistic model and expects state transition and symbol emission
probability distributions to sum to one. If you want to edit the HMM,
you must understand the underlying Plan7 probabilistic model, and
ensure the correct summations yourself.
A more detailed description of the format follows.
In the header section, each line after the initial identifier has a
unique tag of five characters or less. For shorter tags, the remainder
of the five characters is padded with spaces. Therefore the first six
characters of these lines are reserved for the tag and a space. The
remainder of the line starts at the seventh character. The parser
requires this layout.
- [HMMER2.0]
File format version; a unique identifier for this save file
format. Used for backwards compatibility. Not necessarily the
version number of the HMMER software that generated it; rather, the
version number of the last HMMER that changed the format so much
that a whole new function had to be introduced to do the parsing.
(i.e., HMMER 2.8 might still be writing save files that are headed
HMMER2.0).
Mandatory.
- [NAME <s>] Model name; <s> is a single word name for the HMM.
No spaces or tabs may occur in the name.
hmmbuild will use the #=ID line from a SELEX alignment
file to set the name. If this is not present, or the alignment
is not in SELEX format, hmmbuild sets the HMM name using the name
of the alignment file, after removing any file type suffix. For
example, an HMM built from the alignment file
rrm.slx would be named rrm by default.
Mandatory.
- [ACC <s>] Accession number; <s> is a one-word
accession number for an HMM. Used in Pfam maintenance. Accessions are
stable identifiers for Pfam models, whereas names may change from
release to release. Added in v2.1.1. Optional.
- [DESC <s>] Description line; <s> is a one-line description
of the HMM. hmmbuild will use the #=DE line from a
SELEX alignment file to set the description line. If this is not
present, or the alignment is not in SELEX format, the description
line is left blank; one can be added manually (or by Perl script)
if you wish. Optional.
- [LENG <d>] Model length; <d>, a positive nonzero integer,
is the number of match states in the model.
Mandatory.
- [ALPH <s>] Symbol alphabet; <s> must be either
Amino or Nucleic. This determines the symbol alphabet and the
size of the symbol emission probability distributions. If
Amino, the alphabet size is set to 20 and the symbol alphabet
to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order). If Nucleic, the
alphabet size is set to 4 and the symbol alphabet to ``ACGT''. Case
insensitive. Mandatory.
- [RF <s>] Reference annotation flag; <s> must
be either no or yes (case insensitive). If set to
yes, a character of reference annotation is read for each match
state/consensus column in the main section of the file (see below);
else this data field will be ignored. Reference annotation lines are
currently somewhat inconsistently used. The only major use in HMMER is
to specify which columns of an alignment get turned into match states
when using the hmmbuild -hand manual model construction option. Reference
annotation can only be picked up from SELEX format alignments. See
description of SELEX format for more details on reference annotation
lines. Optional; assumed to be no if not present.
- [CS <s>] Consensus structure annotation flag;
<s> must be either no or yes (case insensitive). If set to yes, a character
of consensus structure annotation is read for each match
state/consensus column in the main section of the file (see below);
else this data field will be ignored. Consensus structure annotation
lines are currently somewhat inconsistently used. Consensus structure
annotation can only be picked up from SELEX format alignments. See
description of SELEX format for more details on consensus structure
annotation lines. Optional; assumed to be no if not present.
- [MAP <s>] Map annotation flag;
<s> must be either no or yes (case insensitive).
If set to yes, each line of data for the match state/consensus
column in the main section of the file is followed by an extra number.
This number gives the index of the alignment column that the match
state was made from. This information provides a ``map'' of the match
states (1..M) onto the columns of the alignment (1..alen). It is
used for quickly aligning the model back to the original alignment,
e.g. when using hmmalign -mapali. Added in v2.0.1.
Optional; assumed to be no if not present.
- [COM <s>] Command line log; <s> is a one-line
command. There may be more than one COM line per save
file. These lines record the command line for every HMMER command that
modifies the save file. This helps us automatically log Pfam
construction strategies, for example. Optional.
- [CKSUM <d>] Training alignment checksum; <d> is a nonzero
positive integer. This number is calculated from the training
alignment and stored when hmmbuild is used. It is used in
conjunction with the alignment map information to verify that some
alignment is indeed the alignment that the map is for. Added in
v2.0.1. Optional.
- [GA <f> <f>] Pfam gathering thresholds GA1 and GA2.
This is a feature in progress. See Pfam documentation of GA lines.
Added in v2.1.1. Optional.
- [TC <f> <f>] Pfam trusted cutoffs TC1 and TC2.
This is a feature in progress. See Pfam documentation of TC lines.
Added in v2.1.1. Optional.
- [NC <f> <f>] Pfam noise cutoffs NC1 and NC2.
This is a feature in progress. See Pfam documentation of NC lines.
Added in v2.1.1. Optional.
- [NSEQ <d>] Sequence number; <d> is a nonzero
positive integer, the number of sequences that the HMM was trained on.
This field is only used for logging purposes.
Optional.
- [DATE <s>] Creation date; <s> is a date string.
This field is only used for logging purposes.
Optional.
- [XT <d>*8] Eight ``special'' transitions for
controlling the algorithm-specific parts of the Plan7 model.
The null probability used to convert these back to model probabilities
is 1.0. The order of the eight fields is N->B, N->N, E->C, E->J,
C->T, C->C, J->B, J->J. (Another
way to view the order is as four transition probability distributions
for N,E,C,J; each distribution has two probabilities, the first one
for ``moving'' and the second one for ``looping''.) For an explanation
of these special transitions (and definition of the state names), read
the Plan7 architecture documentation.
Mandatory.
- [NULT <d> <d>] The transition probability distribution
for the null model (single G state). The null probability used to
convert these back to model probabilities is 1.0. The order is
G->G, G->F.
Mandatory.
- [NULE <d>*K] The symbol emission probability
distribution for the null model (G state); consists of K (e.g. 4 or
20) integers. The null probability used to convert these back to model
probabilities is 1/K. (Yes, it's a little weird to have a ``null
probability'' for the null model symbol emission probabilities; this
is strictly an aesthetic decision, so one can look at the null model
and easily tell which amino acids are more common than chance
expectation in the background distribution.)
Mandatory.
- [EVD <f> <f>] The extreme value distribution
parameters μ and λ, respectively; both floating point
values. λ is positive and nonzero. These values are set when
the model is calibrated with hmmcalibrate. They are used to
determine E-values of bit scores. If this line is not present,
E-values are calculated using a conservative analytic upper bound.
Optional.
- [HMM ] HMM flag line; flags the end of the header
section. Otherwise not parsed. Strictly for human readability, the
symbol alphabet is also shown on this line, aligned to the NULE
fields and the fields of the match and insert symbol emission
distributions in the main model. The immediately next line is also an
unparsed human annotation line: column headers for the state
transition probability fields in the main model section that follows.
Both lines are mandatory.
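To make the fixed-column header layout concrete, the following Python sketch reads tag/value pairs up to the HMM flag line and then decodes the XT line into its four move/loop distributions. The helper names parse_header and score2prob are assumptions made for this example, the demo header is a shortened copy of the rrm example with its column padding restored by hand, and none of this is HMMER's actual parsing code.

```python
INTSCALE = 1000  # per the text, the value defined in config.h


def score2prob(score, null_prob=1.0):
    """Convert one save-file score to a probability; '*' means probability 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)


def parse_header(lines):
    """Collect header values by tag, stopping at the HMM flag line."""
    header = {}
    for line in lines:
        # Tag lives in the first five characters; the value starts at the seventh.
        tag, value = line[:5].strip(), line[6:].strip()
        if tag == "HMMER":      # the HMMER2.0 identifier line carries no value
            continue
        if tag == "HMM":        # flag line: end of the header section
            break
        header.setdefault(tag, []).append(value)   # COM may occur several times
    return header


demo = [
    "HMMER2.0",
    "NAME  rrm",
    "LENG  72",
    "ALPH  Amino",
    "COM   ../src/hmmbuild -F rrm.hmm rrm.slx",
    "COM   ../src/hmmcalibrate rrm.hmm",
    "XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4",
    "HMM        A      C      D      E  (...)",
]
header = parse_header(demo)

# XT holds (move, loop) pairs for N, E, C, J in that order; null probability 1.0.
xt_scores = header["XT"][0].split()
xt = {state: (score2prob(xt_scores[2 * i]), score2prob(xt_scores[2 * i + 1]))
      for i, state in enumerate("NECJ")}
print(header["NAME"], header["COM"])
print(xt["E"])   # move and loop probabilities for the E state
```

Values are stored as lists so that repeated tags such as COM are preserved in order.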
All the remaining fields are mandatory, except for the
alignment map.
The first line in the main model section is atypical; it contains
three fields, for transitions from the B state into the first node of
the model. The only purpose of this line is to set the B->D
transition probability. The first field is the score
for 1 - t(B->D).
The second field is always ``*'' (there is no B->I
transition). The third field is the score for t(B->D).
The null probability used for converting these
scores back to probabilities is 1.0. In principle, only the third
number is needed to obtain t(B->D).
In practice, HMMER
reads both the first and the third number, converts them to
probabilities, and renormalizes the distribution to obtain
t(B->D).
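A short Python sketch of that renormalization, using the atypical first line from the rrm example (-21 * -6129). The helper score2prob is an assumption of this example, taking INTSCALE = 1000 and a null probability of 1.0 as stated in the text.

```python
INTSCALE = 1000  # per the text, the value defined in config.h


def score2prob(score, null_prob=1.0):
    """Convert one save-file score to a probability; '*' means probability 0."""
    return 0.0 if score == "*" else null_prob * 2.0 ** (int(score) / INTSCALE)


fields = "-21 * -6129".split()      # first line of the main model section
not_bd = score2prob(fields[0])      # score for 1 - t(B->D)
bd = score2prob(fields[2])          # score for t(B->D)

total = not_bd + bd                 # close to 1, but not exact after rounding
t_bd = bd / total                   # renormalized B->D probability
print(total, t_bd)
```

The two decoded values sum to slightly less than 1 here, which is the kind of rounding error the renormalization removes.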
The remainder of the model has three lines per node, for M nodes
(where M is the number of match states, as given by the LENG
line). These three lines are:
- [Match emission line] The first field is the node number (1..M).
The HMMER parser verifies this number as a consistency check (it
expects the nodes to come in order). The next K numbers are match
emission scores, one per symbol. The null probability used to convert
them to probabilities is the relevant null model emission probability
calculated from the NULE line.
If MAP was yes, then there is one more number on this
line, representing the alignment column index for this match state.
See MAP above for more information about the alignment map, and
also see the man pages for hmmalign -mapali. Added in
v2.0.1. This field is optional, for backwards compatibility with 2.0.
- [Insert emission line] The first field is a character of
reference annotation (RF), or ``-'' if there is no reference
annotation. The remaining fields are K numbers for insert emission
scores, one per symbol, in alphabetic order. The null probability used
to convert them to probabilities is the relevant null model emission
probability calculated from the NULE line.
- [State transition line] The first field is a character
of consensus structure annotation (CS), or ``-'' if there is no
consensus structure annotation. The remaining 9 fields are state
transition scores. The null probability used to convert them back from
log odds scores to probabilities is 1.0. The order of these scores is
given by the annotation line at the top of the main section: it is
M->M, M->I, M->D; I->M, I->I; D->M, D->D; B->M; M->E.
The insert emission and state transition lines for the final node M
are special. Node M has no insert state, so all the insert
emissions are given as ``*''. (In fact, this line is skipped by the
parser, except for its RF annotation.) There is also no next node, so
only the B->M and M->E transitions are
valid; the first seven transitions are always ``*''. (Incidentally,
the M->E transition score for the last node is always 0,
because this probability has to be 1.0.)
Finally, the last line of the format is the ``//'' record separator.
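The three-line node layout can be read with a few splits. The sketch below is a hedged illustration only: parse_node and to_score are hypothetical helpers, and the demo node is an invented Nucleic (K=4) node whose emission scores are made up, since the rrm example above is abridged.

```python
TRANSITIONS = ["m->m", "m->i", "m->d", "i->m", "i->i", "d->m", "d->d", "b->m", "m->e"]


def to_score(field):
    """Keep scores as integers; '*' (probability 0) becomes None."""
    return None if field == "*" else int(field)


def parse_node(lines, K, has_map):
    """Parse the match, insert, and transition lines of one node."""
    match, insert, trans = (line.split() for line in lines)
    return {
        "index": int(match[0]),                          # node number, 1..M
        "match": [to_score(f) for f in match[1:1 + K]],  # K match emission scores
        "map": int(match[1 + K]) if has_map else None,   # alignment column, if MAP yes
        "rf": insert[0],                                 # RF character or '-'
        "insert": [to_score(f) for f in insert[1:1 + K]],
        "cs": trans[0],                                  # CS character or '-'
        "trans": dict(zip(TRANSITIONS, (to_score(f) for f in trans[1:10]))),
    }


# Invented K=4 node; only the layout is meant to be realistic.
demo_node = [
    "     1   -500    300   -200    100     1",
    "     -   -100    -50     20     10",
    "     -    -11 -11284 -12326   -894  -1115   -701  -1378    -21      *",
]
node = parse_node(demo_node, K=4, has_map=True)
print(node["index"], node["map"], node["trans"]["b->m"], node["trans"]["m->e"])
```

Splitting on whitespace is a simplification; it assumes every field is present and separated by at least one space.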
After the parser reads the file and converts the scores back to
probabilities, it renormalizes the probability distributions to sum to
1.0 to eliminate minor rounding/conversion/numerical imprecision
errors. If you're trying to emulate HMMER save files, it might be
useful to know what HMMER considers to be a probability
distribution. See
Plan7Renormalize() in plan7.c for the relevant
function.
- [null emissions] The K symbol emissions
given on the NULE line.
- [null transitions] The two null model transitions
given on the NULT line.
- [N,E,C,J specials] Each of the four special states N,E,C,J have two
state transition probabilities (move and loop). All four distributions
are specified on the XT line.
- [B transitions] The M B->M entry probabilities are given by
the 9th field in the state
transition line of each of the M nodes. The B->D
transition (from the atypical first line of the main model section) is
also part of this state transition distribution.
- [match transitions] One distribution of 4 numbers per node;
M->M, M->I, M->D, and M->E (fields 2,
3, 4, and 10 in the state transition line of each node). Note the
asymmetry between B->M and M->E; entries are
a probability distribution of their own, while exits are not.
- [insert transitions] One distribution of 2 numbers per node;
I->M, I->I (fields 5 and 6 of the state transition line of each
node).
- [delete transitions] One distribution of 2 numbers per
node; D->M, D->D (fields 7 and 8 of the
state transition line of each node).
- [match emissions] One distribution of K numbers
per node; the K match symbol emissions given on the first line of
each node in the main section.
- [insert emissions] One distribution of K numbers
per node; the K insert symbol emissions given on the second line of
each node in the main section.
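As a small worked example of renormalizing one of the distributions listed above, the Python sketch below decodes the insert transitions of node 1 in the rrm example (fields 5 and 6, scores -894 and -1115) and rescales them to sum to one. It is a simplified stand-in written for this guide, not a transcription of Plan7Renormalize(); the helper names are assumptions.

```python
INTSCALE = 1000  # per the text, the value defined in config.h


def score2prob(score, null_prob=1.0):
    """Convert one save-file score to a probability; '*' means probability 0."""
    return 0.0 if score == "*" else null_prob * 2.0 ** (int(score) / INTSCALE)


def renormalize(probs):
    """Rescale a list of probabilities so that it sums to 1.0."""
    total = sum(probs)
    return [p / total for p in probs]


# I->M and I->I for node 1 of the rrm example; null probability is 1.0.
insert_t = [score2prob(-894), score2prob(-1115)]
fixed = renormalize(insert_t)
print(sum(insert_t), sum(fixed))   # slightly below 1 before, 1 after (to rounding)
```

The same two steps would apply to each of the other distributions, using the null probabilities given for them in the field descriptions.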
Though I make an effort to keep this documentation up to date, it may
lag behind the code. For definitive answers, please check the parsing
code in hmmio.c. The relevant function to see what's being
written is WriteAscHMM(). The relevant function to see how it's
being parsed is read_asc20hmm().
Direct comments and questions to <eddy@genetics.wustl.edu>