User's Guide

| Dept. of Genetics | WashU | Medical School | Sequencing Center | CGM | IBC|
| Eddy lab | Internal (lab only) | HMMER | PFAM | tRNAscan-SE | Software | Publications |

next up previous contents
Next: Installation Up: File formats Previous: HMMER prior files


Sequence files

Supported file formats

HMMER can automatically recognize and parse a number of common file formats. The best supported of these formats are listed below. If you know your sequence database is in one of these formats, you can use the file. If you are formatting sequences yourself, see the section of FASTA format below for unaligned sequences, and the section on SELEX format for alignments; these are the simplest formats to hand type

Unaligned sequence formats
FASTA Pearson FASTA format; BLAST databases
SWISS-PROT SWISS-PROT protein sequence database
PIR PIR protein sequence database
EMBL EMBL DNA sequence database
GenBank GenBank DNA database flat files
GCGdata Wisconsin GCG sequence database format
GCG Wisconsin GCG single sequence format

Multiple sequence alignment formats
SELEX The preferred format for HMMER/Pfam (see below)
MSF GCG alignment format
A2M ``Aligned FASTA''

For programs that do sequential, database-style access (i.e. where you'd usually use an unaligned flat file), the alignment formats are read as if they were multiple unaligned sequences.

There is no provision for enforcing that single unaligned sequence formats (i.e. GCG unaligned sequence format) are single sequence only. HMMER will happily try to read more than one sequence if your file contains more than one. However, this may not give the results you expected.

Staden ``experiment file'' format is parsed using the EMBL file parser, but this functionality is relatively unsupported. There is one wrinkle in this. Staden experiment files use '-' characters to indicate 'N' - i.e., that a base is present in a sequence read, but its identity is unknown. Therefore, the software replaces any '-' in an EMBL sequence with an 'N'. Sometimes people use the unaligned formats to distribute aligned sequences by including gap characters. If EMBL files are used i this way for aligned strings, they must use a different character than '-' to indicate gaps.

FASTA unaligned sequence format

An example of a simple FASTA file:

> seq1 This is the description of my first sequence.
> seq2 This is a description of my second sequence.

FASTA is probably the simplest of formats for unaligned sequences. FASTA files are easily created in a text editor. Each sequence is preceded by a line starting with >. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence (free format). The remaining lines contain the sequence itself. You can put as many letters on a sequence line as you want.

Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence. Any other non-amino or non-nucleic acid symbols in the sequence should produce an appropriately strident string of warnings on yur terminal screen when you try to use the file.

HMMER currently has a limit of 4095 characters on lines in files. You may see this limitation in NCBI NR databases, where some description lines will be truncated by HMMER.

SELEX alignment format

An example of a simple SELEX alignment file:

# Example selex file

seq2     ..GGGAAAGG.GA
seq3     UUU..AAAUUU.A

seq1  ..ACG
seq2  AAGGG
seq3  AA...UUU

SELEX is an interleaved multiple alignment format that arose as an intuitive format that was easy to write and manipulate manually with a text editor like emacs. It is usually easy to convert other alignment formats into SELEX format, even with a couple of lines of Perl, but it can be harder to go the other way, since SELEX is more free-format than other alignment formats. For instance, GCG's MSF format and the output of the CLUSTALV multiple alignment program are similar interleaved formats that can be converted to SELEX just by stripping a small number of non-sequence lines out. Because SELEX evolved to accomodate different user input styles, it is very tolerant of various inconsistencies such as different gap symbols, varying line lengths, etc.

Each line contains a name, followed by the aligned sequence. A space, dash, underscore, or period denotes a gap. If the alignment is too long to fit on one line, the alignment is split into multiple blocks, separated by blank lines. The number of sequences, their order, and their names must be the same in every block (even if a sequence has no residues in a given block!) Other blank lines are ignored. You can add comments to the file on lines starting with a #.

SELEX stands for ``Systematic Evolution of Ligands by Exponential Enrichment'' - it refers to the Tuerk and Gold technology for evolving families of small RNAs for particular functions [Tuerk and Gold, 1990]. SELEX files were what we used to keep track of alignments of these small RNA families. It's an interesting piece of historical baggage - most of my lab works on RNA, not protein.

As the format evolved, more features have been added. To maintain compatibility with past alignment files, the new features are added using a reserved comment style. You don't have to worry about adding these extra information lines. They are generally the province of automated SELEX-generating software, such as my koala sequence alignment editor or the COVE and HMMER sequence analysis packages. This extra information includes consensus and individual RNA or protein secondary structure, sequence weights, a reference coordinate system for the columns, and database source information including name, accession number, and coordinates (for subsequences extracted from a longer source sequence) See below for details.

Detailed specification of a SELEX file

Any line beginning with a #= as the first two characters is a machine comment. #= comments are reserved for parsed data about the alignment. Often these features are maintained by software such as the koala editor or the HMMER software, not by hand. The format of #= lines is usually quite specific, since they must be parsed by the file-reading software.

All other lines beginning with a % or # as the first character are user comments. User comments are ignored by all software. Anything may appear on these lines. Any number of comments may be included in a SELEX file, and at any point.

Lines of data consist of a name followed by a sequence. The total length of the line must be smaller than 4096 characters.

Names must be a single word. Any non-whitespace characters are accepted. No spaces are tolerated in names: names MUST be a single word. Names must be less than 32 characters long.

In the sequence, any of the characters -_. or a space are recognized as gaps. Any other characters are interpreted as sequence. Sequence is case-sensitive. There is a common assumption by my software that upper-case symbols are used for consensus (match) positions and lower-case symbols are used for inserts. This language of ``match'' versus ``insert'' comes from the hidden Markov model formalism [Krogh et al., 1994]. To almost all of my software, this isn't important, and it immediately converts the sequence to all upper-case after it's read.

Multiple different sequences are grouped in a block of data lines. Blocks are separated by blank lines. No blank lines are tolerated between the sequence lines in a block. Each block in a multi-block file of a long alignment must have its sequences in the same order in each block. The names are checked to verify that this is the case; if not, only a warning is generated. (In manually constructed files, some users may wish to use shorthand names in subsequent blocks after an initial block with full names - but this isn't recommended.)

Optional SELEX file annotation


Several pieces of information can be added to the header of the file using the following machine comments:

Sequence header

Additional per-sequence information can be placed after the main header lines and before any blocks appear. These lines, one per sequence and in exactly the same order as the sequences appear in the alignment, are formatted like

#=SQ <name> <wgt> <source> <accession> <start..stop::orig length> <desc>

This information includes a sequence weight (for compensating for biased representation of subfamilies of sequences in the alignment); source information, if the sequence came from a database, consisting of identifier, accession number, and source coordinates; and a description of the sequence.

If a #=SQ line is present, all the fields must be present on each line and one #=SQ line must be present for each sequence. If no information is available for a field, use '-' for all the fields except the source coordinates, which would be given as '0'.

Secondary structure annotation

Lines beginning with #=SS or #=CS are individual or consensus secondary structure data, respectively. #=SS individual secondary structure lines must immediately follow the sequence they are associated with. There can only be one #=SS per sequence. All, some, or none of the sequences may have #=SS annotation.

#=CS consensus secondary structure predictions precede all the sequences in each block. There can only be one #=CS per file.

I use one-letter codes to indicate secondary structures. Secondary structure strings are aligned to sequence blocks just like additional sequences.

For RNA secondary structure, the symbols > and < are used for base pairs (pairs point at each other). + indicate definitely single-stranded positions, and any gap symbol indicates unassigned bases or single-stranded positions. This description roughly follows [Konings and Hogeweg, 1989]. For protein secondary structure, I use E to indicate residues in $\beta$-sheet, H for those in $\alpha$-helix, L for those in loops, and any gap symbol for unassigned or unstructured residues.

RNA pseudoknots are represented by alphabetic characters, with upper case letters representing the 5' side of the helix and lower case letters representing the 3' side. Note that this restricts the annotation to a maximum of 26 pseudoknots per sequence.

Reference coordinate system

Alignments are usually numbered by some reference coordinate system, often a canonical molecule. For instance, tRNA positions are numbered by reference to the positions of yeast tRNA-Phe.

A line beginning with #=RF preceding the sequences in a block gives a reference coordinate system. Any non-gap symbol in the #=RF line indicates that sequence positions in its columns are numbered. For instance, the #=RF lines for a tRNA alignment would have 76 non-gap symbols for the canonical numbered columns; they might be the aligned tRNA-Phe sequence itself, or they might be just X's.

The #=RF line also provides a good way to mark conserved key residues in a protein alignment. HMMER picks up both the #=RF and #=CS lines when it builds an HMM, if they're there, and will display them in the context of any alignments to target sequences.

An example SELEX file with annotation

Here's an example SELEX file with examples of all the optional fields. The sequences are evolved RNA ligands for the phage R17 coat protein [Schneider et al., 1992].

#=ID r17
#=AC RNA00001
#=DE RNA ligands evolved to bind phage R17 coat protein 
#=GA 0.0 0.0
#=TC 0.0 0.0
#=NC 0.0 0.0
#=AU COVE 2.2.3 Wed Dec  2 08:39:58 1998

#=SQ lig28 1.0 - - 1..29:29 R17 coat protein ligand 28
#=SQ lig1  1.0 - - 1..17:17 R17 coat protein ligand 1
#=SQ lig2  1.0 - - 1..29:29 R17 coat protein ligand 2

#=RF         ***A** loop *****
#=CS  .......>>>+>> ++++ <<<<<....... 
#=SS   >>>>>>>>>+>> ++++ <<<<<<<<<<<<
#=SS        >>>>+>> ++++ <<<<<<
#=SS  >>>>>>>>>>+>> ++++ <<<<<<<<<<<<

next up previous contents
Next: Installation Up: File formats Previous: HMMER prior files

Direct comments and questions to <eddy@genetics.wustl.edu>