|Unaligned sequence formats|
|FASTA||Pearson FASTA format; BLAST databases|
|SWISS-PROT||SWISS-PROT protein sequence database|
|PIR||PIR protein sequence database|
|EMBL||EMBL DNA sequence database|
|GenBank||GenBank DNA database flat files|
|GCGdata||Wisconsin GCG sequence database format|
|GCG||Wisconsin GCG single sequence format|
|Multiple sequence alignment formats|
|SELEX||The preferred format for HMMER/Pfam (see below)|
|MSF||GCG alignment format|
|CLUSTAL||CLUSTALW and CLUSTALV format|
For programs that do sequential, database-style access (i.e. where you'd usually use an unaligned flat file), the alignment formats are read as if they were multiple unaligned sequences.
There is no provision for enforcing that single unaligned sequence formats (i.e. GCG unaligned sequence format) are single sequence only. HMMER will happily try to read more than one sequence if your file contains more than one. However, this may not give the results you expected.
Staden ``experiment file'' format is parsed using the EMBL file parser, but this functionality is relatively unsupported. There is one wrinkle in this. Staden experiment files use '-' characters to indicate 'N' - i.e., that a base is present in a sequence read, but its identity is unknown. Therefore, the software replaces any '-' in an EMBL sequence with an 'N'. Sometimes people use the unaligned formats to distribute aligned sequences by including gap characters. If EMBL files are used i this way for aligned strings, they must use a different character than '-' to indicate gaps.
An example of a simple FASTA file:
> seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG CATCGTCAGTTACTGCATGCTCG
FASTA is probably the simplest of formats for unaligned sequences.
FASTA files are easily created in a text editor. Each sequence is
preceded by a line starting with
>. The first word on this line
is the name of the sequence. The rest of the line is a description of
the sequence (free format). The remaining lines contain the sequence
itself. You can put as many letters on a sequence line as you want.
Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence. Any other non-amino or non-nucleic acid symbols in the sequence should produce an appropriately strident string of warnings on yur terminal screen when you try to use the file.
HMMER currently has a limit of 4095 characters on lines in files. You may see this limitation in NCBI NR databases, where some description lines will be truncated by HMMER.
An example of a simple SELEX alignment file:
# Example selex file seq1 ACGACGACGACG. seq2 ..GGGAAAGG.GA seq3 UUU..AAAUUU.A seq1 ..ACG seq2 AAGGG seq3 AA...UUU
SELEX is an interleaved multiple alignment format that arose as an intuitive format that was easy to write and manipulate manually with a text editor like emacs. It is usually easy to convert other alignment formats into SELEX format, even with a couple of lines of Perl, but it can be harder to go the other way, since SELEX is more free-format than other alignment formats. For instance, GCG's MSF format and the output of the CLUSTALV multiple alignment program are similar interleaved formats that can be converted to SELEX just by stripping a small number of non-sequence lines out. Because SELEX evolved to accomodate different user input styles, it is very tolerant of various inconsistencies such as different gap symbols, varying line lengths, etc.
Each line contains a name, followed by the aligned sequence. A space,
dash, underscore, or period denotes a gap. If the alignment is too
long to fit on one line, the alignment is split into multiple blocks,
separated by blank lines. The number of sequences, their order, and
their names must be the same in every block (even if a sequence has no
residues in a given block!) Other blank lines are ignored. You can add
comments to the file on lines starting with a
SELEX stands for ``Systematic Evolution of Ligands by Exponential Enrichment'' - it refers to the Tuerk and Gold technology for evolving families of small RNAs for particular functions [Tuerk and Gold, 1990]. SELEX files were what we used to keep track of alignments of these small RNA families. It's an interesting piece of historical baggage - most of my lab works on RNA, not protein.
As the format evolved, more features have been added. To maintain compatibility with past alignment files, the new features are added using a reserved comment style. You don't have to worry about adding these extra information lines. They are generally the province of automated SELEX-generating software, such as my koala sequence alignment editor or the COVE and HMMER sequence analysis packages. This extra information includes consensus and individual RNA or protein secondary structure, sequence weights, a reference coordinate system for the columns, and database source information including name, accession number, and coordinates (for subsequences extracted from a longer source sequence) See below for details.
#=as the first two characters is a machine comment.
#=comments are reserved for parsed data about the alignment. Often these features are maintained by software such as the koala editor or the HMMER software, not by hand. The format of
#=lines is usually quite specific, since they must be parsed by the file-reading software.
#as the first character are user comments. User comments are ignored by all software. Anything may appear on these lines. Any number of comments may be included in a SELEX file, and at any point.
-_.or a space are recognized as gaps. Any other characters are interpreted as sequence. Sequence is case-sensitive. There is a common assumption by my software that upper-case symbols are used for consensus (match) positions and lower-case symbols are used for inserts. This language of ``match'' versus ``insert'' comes from the hidden Markov model formalism [Krogh et al., 1994]. To almost all of my software, this isn't important, and it immediately converts the sequence to all upper-case after it's read.
Several pieces of information can be added to the header of the file using the following machine comments:
Additional per-sequence information can be placed after the main header lines and before any blocks appear. These lines, one per sequence and in exactly the same order as the sequences appear in the alignment, are formatted like
#=SQ <name> <wgt> <source> <accession> <start..stop::orig length> <desc>
This information includes a sequence weight (for compensating for biased representation of subfamilies of sequences in the alignment); source information, if the sequence came from a database, consisting of identifier, accession number, and source coordinates; and a description of the sequence.
#=SQ line is present, all the fields must be present on
each line and one
#=SQ line must be present for each sequence.
If no information is available for a field, use '-' for all the fields
except the source coordinates, which would be given as '0'.
Lines beginning with
#=CS are individual or
consensus secondary structure data, respectively.
individual secondary structure lines must immediately follow the
sequence they are associated with. There can only be one
per sequence. All, some, or none of the sequences may have
#=CS consensus secondary structure predictions precede all the
sequences in each block. There can only be one
#=CS per file.
I use one-letter codes to indicate secondary structures. Secondary structure strings are aligned to sequence blocks just like additional sequences.
For RNA secondary structure, the symbols
used for base pairs (pairs point at each other).
definitely single-stranded positions, and any gap symbol indicates
unassigned bases or single-stranded positions. This description
roughly follows [Konings and Hogeweg, 1989]. For protein secondary structure, I
use E to indicate residues in -sheet, H for those
in -helix, L for those in loops, and any gap symbol for
unassigned or unstructured residues.
RNA pseudoknots are represented by alphabetic characters, with upper case letters representing the 5' side of the helix and lower case letters representing the 3' side. Note that this restricts the annotation to a maximum of 26 pseudoknots per sequence.
Alignments are usually numbered by some reference coordinate system, often a canonical molecule. For instance, tRNA positions are numbered by reference to the positions of yeast tRNA-Phe.
A line beginning with
#=RF preceding the sequences in a block
gives a reference coordinate system. Any non-gap symbol in the
#=RF line indicates that sequence positions in its columns are
numbered. For instance, the
#=RF lines for a tRNA alignment
would have 76 non-gap symbols for the canonical numbered columns; they
might be the aligned tRNA-Phe sequence itself, or they might be just
#=RF line also provides a good way to mark conserved key
residues in a protein alignment. HMMER picks up both the
#=CS lines when it builds an HMM, if they're there, and
will display them in the context of any alignments to target
Here's an example SELEX file with examples of all the optional fields. The sequences are evolved RNA ligands for the phage R17 coat protein [Schneider et al., 1992].
#=ID r17 #=AC RNA00001 #=DE RNA ligands evolved to bind phage R17 coat protein #=GA 0.0 0.0 #=TC 0.0 0.0 #=NC 0.0 0.0 #=AU COVE 2.2.3 Wed Dec 2 08:39:58 1998 #=SQ lig28 1.0 - - 1..29:29 R17 coat protein ligand 28 #=SQ lig1 1.0 - - 1..17:17 R17 coat protein ligand 1 #=SQ lig2 1.0 - - 1..29:29 R17 coat protein ligand 2 #=RF ***A** loop ***** #=CS .......>>>+>> ++++ <<<<<....... lig28 GGAGUAAGAUAGC AUCA GCAUCUUGUUCC #=SS >>>>>>>>>+>> ++++ <<<<<<<<<<<< lig1 GUUCACC AUCA GGGGAC #=SS >>>>+>> ++++ <<<<<< lig2 AUGGAUGCGCACC AUCA GGGCGUAUCUAU #=SS >>>>>>>>>>+>> ++++ <<<<<<<<<<<<