The WALIGN menu provides functions for sequence alignment, profile alignment, and
multiple sequence alignment. This menu was written for management of the TM7
GPCR file server (send the message HELP to "TM7@EMBL-Heidelberg.DE"
for information) but has further capabilities.
The basic process involves storing (a maximum of 9) profiles and an unlimited number
of sequences (maximum sequence length 1000 residues) in a file called BIGFILE,
and performing all operations on this file.
This menu provides tools to search rapidly for identical or highly homologous
sequences, to perform (iterative) profile alignments, to display and
manipulate sequences, and in general, to perform functions unavailable in other
packages or to surpass the sequence limitations of these packages.
WHAT IF always maintains the original sequence in memory. This means
that following alignment, one can always return to
the original sequence. This allows for easy experimentation with gap
weights, etc.
In order to conduct sequence analyses and manipulations, several commands are
needed. The
sequence- and profile-related commands are therefore divided among a group of
related menus. However, all commands from these menus can be executed directly
from the WALIGN menu, without the percent sign (%) prefix.
The command BIGFIL can be used to create a so-called BIGFILE. In the
BIGFILE all profiles and sequences are stored. It is only possible to
operate on
sequences and profiles once they are stored in the BIGFILE. If a BIGFILE exists
with the name WALIGN.BIG, then this file will automatically be opened;
otherwise the first option in the WALIGN menu MUST be BIGFIL. If a BIGFILE is
already open, this command allows you to close the file and save it and
to create a new BIGFILE, or to open another already existing one.
The command WALINI will erase all information in the BIGFILE, and
it will additionally erase all other arrays, variables and parameters
related to any WALIGN related option.
You normally only use this option when you are in deep trouble.
When the command DOUBLS is issued, WHAT IF will prompt for two sequence
ranges (of sequences present in the BIGFILE). It will list all pairs within
these ranges
that are identical in sequence. Be aware that (profile) alignment, for example,
may delete residues from sequences, and that any discarded residues
are not used by the DOUBLS option. (See ORGBCK).
DOUBLS performs a pairwise comparison of
the sequences returned by the LSTSQS option. This option is much faster
than DMATCH (see below) because it searches for 100 percent identical sequences
and thus does not require any alignment. The DOUBLS command can also be used
to search for close homologs, i.e. 95-100 percent identical sequences, but
the comparison method chosen is only exact for 100 percent identity between
sequence pairs.
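The exact-identity search can be illustrated with a small Python sketch. It is not the WHAT IF implementation; the function name and the use of a plain list of sequence strings are assumptions made for illustration. Because only 100 percent identical sequences are sought, grouping by the sequence string itself is enough and no alignment is needed.

```python
from collections import defaultdict
from itertools import combinations


def identical_pairs(sequences):
    """Return index pairs (i, j), i < j, of sequences that are exactly identical.

    Hypothetical helper: the sequences are plain strings, indexed from 0.
    """
    groups = defaultdict(list)
    for index, seq in enumerate(sequences):
        groups[seq].append(index)          # identical strings land in one group
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


example = ["MKTLLA", "MKTLLV", "MKTLLA", "GGSW"]
print(identical_pairs(example))
```

Grouping costs one pass over the sequences, which is why an identity search of this kind can be much faster than an alignment-based comparison.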
The command DMATCH searches for pairs of sequences with high
homology. DOUBLS (see above) assumes that the sequences are already aligned.
If the sequences are not yet aligned, DMATCH may be used.
The command DMATCH will cause WHAT IF to prompt you for two sequence ranges
and a sequence identity percentage cutoff. All pairs of sequences that
show a pairwise sequence identity (following alignment) above the specified
cutoff
will be listed. To avoid days of CPU time, a crude filter based on nearest-
neighbour sequence relations is applied. This filter makes the process
considerably
faster; however, as a consequence, the option is only reliable at identity
levels above 90%.
At lower levels the reported identity levels are still accurate, but the
algorithm may fail to
detect some homologous pairs.
Be aware that profile alignment, for example,
will delete residues from sequences, and that discarded residues
are not used by the DMATCH option. (See ORGBCK).
Following alignment against a profile, residues coinciding with
insertions are deleted from their sequences. If original sequences are
later desired,
the ORGBCK command may be issued. WHAT IF will prompt for
a sequence range, and all sequences within this range will be restored
to their original
state (upon being read from a file).
The command BIGSTS will tell you how many sequences and how many
profiles presently are loaded in the BIGFILE.
The command MFILES will create a file called FILES.LIST that has for
every BIGFILE sequence entry the filename, and the DE, AC and OS
record in it.
When dealing with multiple sequence alignments the biggest problem
is probably residue numbering. What is the number of that conserved
cysteine that is at position 15 in one sequence and at position 17
in another? Should it be 15, 17, or should we take the average, 16,
smile, smile.
In the WALIGN menu we solved this problem with the so-called arbitrary
sequence number. These numbers are stored in the profile, and you have
to hand edit them in the profiles.
If you want to make the arbitrary sequence numbers known to WHAT IF, you
can have WHAT IF read them by issuing the GETARB command. It
seems logical to obtain the sequence numbers from the profile that you used
for the alignment. The GETARB option therefore seems only needed
for cases where you want to modify the arbitrary sequence
numbers on the fly or something like that.
A simple alignment of two sequences involves the matching and scoring of
pairs of
residues. The classical method for performing these tasks is the
Dayhoff exchange matrix. The file DAYHOF.MAT in the */dbdata directory
holds the default exchange matrix used by WHAT IF. If WHAT IF is requested
to use a Dayhoff matrix for the alignment, the file
DAYHOF.MAT may be copied from the .../dbdata directory to the directory where
WHAT IF is executed; here the file may be modified. WHAT IF searches for this
file first in
the local directory, and thereafter in the database directory. If the DAYHOF.MAT is
modified, the user must be careful to preserve the original format.
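The search order (local directory first, database directory second) can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the two temporary directories stand in for the working directory and the dbdata directory.

```python
import tempfile
from pathlib import Path


def find_matrix_file(local_dir, database_dir, name="DAYHOF.MAT"):
    """Return the first existing copy of the file, preferring the local directory."""
    for directory in (local_dir, database_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


# Demonstration with two temporary directories (stand-ins, not real WHAT IF paths).
with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as dbdata:
    (Path(dbdata) / "DAYHOF.MAT").write_text("default matrix\n")
    found_default = find_matrix_file(local, dbdata)
    default_from_dbdata = found_default.parent == Path(dbdata)

    # A modified local copy now takes precedence over the database copy.
    (Path(local) / "DAYHOF.MAT").write_text("modified matrix\n")
    found_local = find_matrix_file(local, dbdata)
    local_takes_precedence = found_local.parent == Path(local)

print(default_from_dbdata, local_takes_precedence)
```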
The command SHODAY can be used to display the present scoring matrix (also
called exchange matrix) at the terminal.
The command SETMAT can be used to reset the scoring matrix (also
called exchange matrix). When this command is issued, WHAT IF presents a minimenu
for selecting a unity matrix (scoring only identities), the default Dayhoff matrix,
or another exchange matrix with a specified name. The
default Dayhoff matrix is the file .../dbdata/DAYHOF.MAT, unless a
file with the same name is present in the local directory (i.e. the current working
directory), in which case the local file is the default.
There are three types of sequence options: 1) options that operate on several
sequences, e.g. profile alignment; 2) options that operate on two sequences, e.g.
sequence alignment; and 3) options that work on one sequence, e.g. listing a
sequence or counting the residues in it. Some options are difficult to classify; for
example, listing two sequences without comparing them
is placed under single sequence options.
The command GETSEQ is used to read sequences in. WHAT IF recognizes three file
formats: PIR, Swissprot, and GCG. WHAT IF will prompt
for the format of the file. When reading multiple files, it is recommended
that the names of the files be placed in a single text file (one filename per
line); at the prompt this text file may be specified by @ (which is Shift-2
on most keyboards)
followed by the name of the text file. All files read by this method
should be of the same file type. In order to read multiple
Swissprot and PIR files, the GETSEQ command should be issued twice. WHAT IF
will attempt to recognize a format if an incorrect format is specified, but
this recognition
may not be reliable.
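The @ convention can be sketched in Python as shown here. The function names are hypothetical, and the handling of blank lines and surrounding spaces is an assumption; the text above only states that the list file holds one filename per line.

```python
from pathlib import Path


def parse_name_list(text):
    """Split the contents of a list file into file names, one per line.

    Assumption: blank lines and surrounding whitespace are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def expand_file_argument(argument):
    """Turn a prompt answer into a list of sequence file names.

    '@names.txt' means: read the names from names.txt.
    Anything else is taken as a single file name.
    """
    if argument.startswith("@"):
        return parse_name_list(Path(argument[1:]).read_text())
    return [argument]


print(parse_name_list("opsd_human.pir\n\n  opsd_bovin.pir \n"))
print(expand_file_argument("opsd_human.pir"))
```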
When the command DIRSEQ is issued, WHAT IF will prompt for a sequence
range. For
all sequences in the specified range, the header information will be
listed in the
text window.
When the command LSTSEQ is issued, WHAT IF will prompt for a range of
sequences.
The corresponding file names, titles, and sequence information will be listed.
When the command LSTSQS is issued, WHAT IF will prompt for a sequence range
and a residue range. The specified residues of the specified sequences will
be displayed on the screen. Make sure that the window can accommodate the
number of requested residues (not more than 100 at a time); otherwise ugly
wrap-arounds will result.
When the command DELSEQ is issued, WHAT IF will prompt for a sequence range.
The specified sequences will be removed from the BIGFILE. They will NOT be deleted
from disk.
When the command MAKSEQ is issued, WHAT IF will prompt for the number
of a sequence,
a file type, a file name and a title. The requested sequence will be
written to a
file with the specified name. File types can be PIR, GCG or Swissprot.
The command MAKINT writes a range of sequences with all associated information
to a formatted
(human readable) file. The advantage of this file is that it is
smaller than the BIGFILE, and can be hand-edited, hand-sorted, etc.
This command reads sequences from a file written with the MAKINT command (see
MAKINT). All sequences in the file will be read; it is not possible to read
a subset. To access
only a subset, the file must be edited accordingly.
The command CCNSEQ will prompt you for two sequences. It will
make one new sequence in which the second sequence is concatenated
after the first one.
The profiles in this menu are mainly meant for the alignment of the
seven-helix membrane bundles of GPCRs. However, as usual, the options can be
misused for other purposes. Most profile operations use simple counting
statistics to build the profile, rather than using a Dayhoff-type matrix.
In other words, it is a normal profile, but a unitary Dayhoff matrix is
used in the generation.
The format of a profile is as follows:
****** -PROFILE V1.0 ******
ID :profile
HEADER :some header information
COMPOUND :some compound information
SOURCE :some info about where the profile came from
AUTHOR :username, for example
PDB :only if applicable
DSSP :only if applicable
CHAINS :'.' irrelevant
PREFERENCE:AM irrelevant
EVAL :SCALED irrelevant
SMIN : -0.05 irrelevant
SMAX : 1.0 irrelevant
NRES : 394 length of the profile
SeqNo PDBNo AA STRUCTURE BP1 BP2 ACC NOCC OPEN ELONG WEIGHT V L
18 18 L < In this area the > 9 3.00 0.10 0.000 0.200
19 19 A < WHAT IF profile and > 9 3.00 0.10 0.000 0.175
20 20 L < MAXHOM profile are > 9 3.00 0.10 0.000 0.340
21 21 W < different, but that > 9 3.00 0.10 0.045 0.045
22 22 A < should not affect > 9 3.00 0.10 0.000 0.000
23 23 N < either of these two > 9 3.00 0.10 0.026 0.000
24 24 A < programs. > 3 3.00 0.10 0.000 0.048
342 342 V 0 4 1 0.00 0.00 1.00 0.040 -0.010
//
This whole profile is fixed format, so care is recommended in producing it.
In the .../dbdata directory an example profile can be found called
PROF.PRF. The irregular order in which the residues are listed in the
profile is necessary for compatibility with other profile programs such as
MAXHOM/HSSP. "File standards" are called standards because it is standard
behaviour to change them regularly, so it is recommended that the user invoke the
NEWPRF command in the PROF2D menu to ascertain the present standard....
The ALIPRF command aligns sequences against a profile. WHAT IF prompts
for the profile number and a range of sequences. The sequences
are then aligned against the profile. Insertions in the profile are not
permitted;
a corresponding deletion in the non-profile sequence is made.
Profile alignment requires approximately one second per 300 amino acids. If
several
sequences must be aligned, the MAXHOM/HSSP program is recommended.
For each aligned sequence the fit between the sequence and the profile is
provided.
The sequences are altered in the BIGFILE, but the original sequences can
always be
retrieved with the ORGBCK command.
The command UPDPRF is intended to update a profile.
The command NEWPRF also creates a profile from aligned sequences,
but, in contrast to UPDPRF, does not restrict the new profile to the
length of the
existing profile. WHAT IF prompts for a range of sequences, and a
profile is made
based on these sequences. In this new profile all gap open
penalties are 3.0,
and the gap elongation penalties are 0.1. Profile values range from
-0.01, for absent, to approximately 1.5 for an absolutely conserved
residue.
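As an illustration of building a profile by simple counting, the Python sketch below computes plain residue frequencies per alignment column. It is a simplification under stated assumptions: the names are hypothetical, gaps and unknown residues are simply ignored, and the values are raw frequencies between 0 and 1 rather than the -0.01 to approximately 1.5 scale used by WHAT IF, whose exact weighting is not described here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(aligned):
    """Return one dict per alignment column mapping residue type to frequency."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for position in range(length):
        column = [seq[position] for seq in aligned]
        # Count only the 20 standard residue types; gaps etc. are skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


profile = frequency_profile(["ACD", "ACE", "AGE"])
print(profile[0]["A"], round(profile[1]["C"], 3), round(profile[2]["D"], 3))
```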
If you make a new profile based on just one sequence, this profile might
not be detailed enough to start the ALIPRF profile alignment
properly. In such cases you can try to run CNVPRF. This option
applies the present scoring matrix (see the SETMAT option) to the
profile. This implies that the first round that you do with
ALIPRF will actually be the same as a whole series of 2ALIGN
options (each time between the sequence you used for NEWPRF and
the sequence you are aligning against the profile).
This defies the basic principles of the WHAT IF iterative profile
alignment method a little bit, but if it is needed to get good results,
who cares?
The command SOUPRF will create a profile based on the molecule(s) in
the SOUP. You will NOT be prompted for the residue range in the
SOUP, so you have to make sure that the SOUP just holds those
residues for which you want a profile. This option tries to set
gap open and gap elongation penalties in agreement with the
observed structural characteristics.
The command AL2PRF can be used to align two profiles. WHAT IF prompts
for two profile numbers in the BIGFILE. Although the result of the
alignment will be expressed as the result of a consensus sequence
alignment, the real alignment optimizes the inner (or dot) products
of the profile vectors. Thus, instead of comparing similarities of
individual amino acids, similarities of vectors of 20 profile
values will be compared.
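The scoring idea can be sketched in Python as follows. Only the dot-product scoring of profile positions is shown; the alignment step itself and the gap handling are omitted. The function names are hypothetical, and the example uses short vectors instead of 20 values to keep it readable.

```python
def dot(u, v):
    """Inner product of two profile vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("profile vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def score_matrix(profile_a, profile_b):
    """Score every position of profile_a against every position of profile_b.

    Entry [i][j] is the dot product of position i in profile_a and
    position j in profile_b; an alignment routine would optimise a path
    through this matrix.
    """
    return [[dot(a, b) for b in profile_b] for a in profile_a]


# Two toy profiles with 2 values per position (a real profile has 20).
prof_a = [[1.0, 0.0], [0.0, 1.0]]
prof_b = [[0.5, 0.5], [1.0, 0.0]]
print(score_matrix(prof_a, prof_b))
```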
The command PCTPRF can be used to determine how well a sequence fits
to a profile. WHAT IF prompts for a range of sequences and a
profile number. Two values will be produced for every sequence.
The first number indicates how often the residue in the sequence
is identical to the consensus sequence of the profile. The second
number is the average profile value corresponding to the residue in
the sequence; in other words, the convolution of the sequence
with the profile. Because the profile values normally fall
between -0.1 and 1.54, the latter figure can be less than
zero or greater than 100 percent.
See also SRTPCT.
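The two numbers can be illustrated with a Python sketch. It assumes that the sequence has already been aligned to the profile (one residue per profile position), that a profile is held as a list of dictionaries mapping residue type to profile value, and that a residue missing from a position counts as 0.0. These are all assumptions for illustration, and the consensus residue is taken as the residue with the highest profile value at each position.

```python
def consensus(profile):
    """Residue with the highest profile value at each position."""
    return "".join(max(column, key=column.get) for column in profile)


def fit_to_profile(sequence, profile):
    """Return (percent identity to consensus, mean profile value as percent)."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    cons = consensus(profile)
    identical = sum(1 for res, ref in zip(sequence, cons) if res == ref)
    identity_pct = 100.0 * identical / len(profile)
    value_sum = sum(col.get(res, 0.0) for res, col in zip(sequence, profile))
    mean_value_pct = 100.0 * value_sum / len(profile)
    return identity_pct, mean_value_pct


toy_profile = [{"A": 1.5, "G": -0.01}, {"C": 0.8, "S": 0.4}]
print(consensus(toy_profile))
print(fit_to_profile("AC", toy_profile))
print(fit_to_profile("GS", toy_profile))
```

Because profile values can exceed 1.0 or be negative, the second number is not bounded by 0 and 100 percent, as the toy profile shows for the sequence AC.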
GETPRF can be used to read a profile from a file. The format of
a profile file is given above. WHAT IF prompts for the file name. The
profile will be automatically stored in the next free slot in BIGFILE.
A maximum of nine profiles may be held in BIGFILE.
The option DIRPRF provides a short list describing all profiles
presently available in the BIGFILE.
When the LSTPRF command is issued, WHAT IF prompts for the number
of a profile
in the BIGFILE. It will determine for each position in the profile the
residue with the highest profile value, and call that the consensus
residue at that position. The consensus sequence consisting of
these residues will
be displayed. The original sequence, i.e. the one
present in the profile file, will also be shown.
The command SHOPRF first performs the same function as LSTPRF
(see above),
and furthermore lists the complete profile. A wide text window
is recommended.
When the command DELPRF is issued, WHAT IF prompts for a profile number.
The profile specified will be deleted from BIGFILE. The profile file on
disk will of course NOT be deleted.
When the command DELPRS is issued, WHAT IF prompts for a range
of profile numbers.
The profiles specified will be deleted from BIGFILE. The profile files on
disk will of course NOT be deleted.
If it is discovered that most of the aligned sequences have an
insertion with respect to the profile at a given position,
the user may wish to insert one or more residues in the profile at this
position. When the command INSPRF is issued, WHAT IF will prompt
for the position in
the profile and will insert one residue in the profile at this
position. The values for all 20 amino acids at this profile
position are set identical. The commands MAKPRF and GETPRF,
as well as the editor, may be used to change these values.
When the command MAKPRF is issued, WHAT IF prompts for a profile
number and
for an output profile file name. The profile will be written in that
file in the format as described above.
The command UPDPRF can be used to create a profile from a multiple sequence
alignment. This option is explicitly meant for iterative profile alignment
of GPCR sequences, but may also be useful for other purposes.
WHAT IF prompts for an old profile. Preferably, this should be the
profile that was used to align the sequences using the ALIPRF
command. WHAT IF will then prompt for a range of sequences. The
frequency of residue types at each position in the
sequences will determine the profile values for that position.
Inspection of the resulting profile is recommended. It may
not resemble what you had in mind....
See also the NEWPRF command.
This option does the same as UPDPRF but additionally UPAPRF will
update the gap open and gap elongation penalties in the
updated profile.
When the command SEQPRF is issued, WHAT IF will prompt for one sequence,
and for a profile
file name. The requested sequence will be written as a profile to
the requested file. The resulting profile is not a good profile for alignment
purposes but can be administratively useful for placing a profile
file on disk. Furthermore, this profile can be used to start an
iterative profile alignment procedure.
The command MAKMSF can be used to create an MSF file. The MSF format
is the GCG standard format for multiple sequence alignments.
WHAT IF will prompt for a sequence range. The output file will be
called PROF.MSF.
Many of the profile alignment related options and some of the correlation
related options use weights for sequences and/or weights for positions
in the sequences. The values of these weights are a function of the
present alignment. For some options it is really important to
use the most up-to-date weights, and in those cases the weights
are automatically updated when that option is used. However,
updating the weights is rather time consuming, and is therefore not
always performed automatically. The option PRFSWT allows you to force
WHAT IF to update all weight factors.
Sometimes residues can only mutate in pairs. For example, a salt
bridge on a
dimer interface typically consists of Asp-Arg or Arg-Asp pairs. When
a sequence
lacks the aspartic acid, it is probable that the arginine has also mutated.
Considerable information is available about such correlated mutations,
and the reader
is referred to the literature for further information. WHAT IF has its
own correlated
mutation module. The theory and methodology of this module is described
in volume 5
of the 7TM journal.
Sometimes there is a strong correlation between the type of
certain residues and the classification of the molecule. This is
seen most trivially in serine or cysteine proteases. However, this is also
true at a more subtle level. For example, in the GPCRs, all amine receptors
have an aspartic acid at one particular position. However, subclasses and
subclasses within these subclasses are often also characterized by certain
residue positions.
WHAT IF provides several tools to perform correlation analysis of
residues among sequences, or of residues with the class of the molecules.
The main idea behind correlated mutation analysis (CMA) or correlation
analysis in general is that we detect residues that are conserved
in sequences that perform function X, but are not conserved in the
sequences that do not perform this function. For this purpose the
CMC file can be used to teach WHAT IF what the function of the individual
sequences is.
Sometimes one works with functions that are not a result of an
evolutionary process, but accidental. A good example for that is the
binding of exogenous ligands (or drugs). In such a case one
wants to search for residues that
The correlated mutation module requires for many of its options a
"correlated
mutation code file" (CMC-file). The following options exist
to work with CMC-files:
The command GETCMC causes WHAT IF to prompt for the name of the
correlation file. This file should hold the accession numbers of the
sequences to be sorted, and the class identifiers. (See below).
The sequences in the BIGFILE will be sorted according to the
order in the correlation file. If the correlation file holds
accession numbers for non-existing sequences, an error message is
issued, and the option is terminated. If sequences are present in the
BIGFILE but not in the correlation file, these sequences may be placed at
the END of the BIGFILE, or removed from the BIGFILE.
The command GETTIT does the same as the command GETCMC (see above),
but GETTIT will additionally replace the file titles in the BIGFILE
with the file titles found in the CMC file.
The command MAKCMC will create a simple file called FILE.CMC.
The file is correctly formatted for input to GETCMC and
many of the COR*** commands. The correlation code is always X, and the
comment consists of the first ten characters of the file name and the
title of the sequence.
The command SRTCMC will sort a CMC file. This is often nice, because
if the CMC file is sorted such that the sequences with the same
CMC codes are next to each other in the CMC file, they will also
sit next to each other in all output.
Sometimes you want to skip certain residues in a correlation analysis.
For example, completely conserved residues, or the first and last 50
residues often only hold little information, but provide lots of
output. For these cases the skip file can come in handy. Since it
is unpleasant work to create a skip file, there are some options
to aid you with this:
The format of the correlation file is as follows:
One line per sequence. Each line holds the following information:
First 10 characters: Accession number.
Character 11 : class identifiers.
Character 12-15 : reserved for future use.
Characters 16-80 : comments.
The correlation file for the alpha adrenergic receptors, for example,
could look like (without the top 2 lines!):
10 20
1234567890123456789012345
A40132 A P1;A40132 - Alpha-2-adrenergic receptor
P08913; A ALPHA-2A ADRENERGIC RECEPTOR (SUBTYPE C1
SWP22909 A ALPHA_ADRENERGIC A-2 GCR_0200
A40392 B P1;A40392 - *Alpha-2-adren
P18825; B ALPHA-2C ADRENERGIC RECEPTOR (SUBTYPE C4
S13023 B P1;S13023 - *alpha-2-Adrenergic receptor
D00819 B ALPHA_ADRENERGIC A-2 GCR_0538
M58316 B ALPHA_ADRENERGIC A-2 GCR_0114
P19328; C ALPHA-2c ADRENERGIC RECEPTOR.
P30545; C ALPHA-2c ADRENERGIC RECEPTOR.
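The fixed-column layout described above can be read with a short Python sketch. The function name and the dictionary keys are hypothetical; the column positions are the ones listed in the format description (characters 1-10, 11, 12-15 and 16-80). The example line is built explicitly so that every field sits in its column.

```python
def parse_cmc_line(line):
    """Split one correlation-file line into its fixed-width fields."""
    padded = line.rstrip("\n").ljust(80)      # short lines are padded with blanks
    return {
        "accession": padded[0:10].strip(),    # characters 1-10
        "class_id": padded[10],               # character 11
        "reserved": padded[11:15],            # characters 12-15
        "comment": padded[15:80].strip(),     # characters 16-80
    }


# Build an example line with each field in its proper column.
example_line = "A40132".ljust(10) + "A" + " " * 4 + "Alpha-2-adrenergic receptor"
record = parse_cmc_line(example_line)
print(record["accession"], record["class_id"], record["comment"])
```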
The command MAKTFF will create a skip file called SKIP.FIL. This file
can be used to skip all residues in many of the correlation options.
MAKTFF will write in this file residues that are completely conserved.
You will be prompted for the sequence range, the residue range, and
the conservation percentage above which a residue is called conserved.
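Selecting conserved positions by a percentage cutoff can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, positions are reported starting at 1, gaps are counted like any other character, and a position is called conserved when its most common residue reaches the cutoff (so a cutoff of 100 selects completely conserved positions). The layout of SKIP.FIL itself is not shown because it is not described here.

```python
from collections import Counter


def conserved_positions(aligned, cutoff_percent):
    """Return 1-based positions whose most common residue reaches the cutoff."""
    positions = []
    for index in range(len(aligned[0])):
        column = [seq[index] for seq in aligned]
        most_common_count = Counter(column).most_common(1)[0][1]
        if 100.0 * most_common_count / len(column) >= cutoff_percent:
            positions.append(index + 1)
    return positions


alignment = ["ACD", "ACE", "AGE"]
print(conserved_positions(alignment, 100))
print(conserved_positions(alignment, 60))
```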
The command MAKPTF will create a skip file called SKIP.FIL. This file
can be used to skip all residues in many of the correlation options.
MAKPTF will write in this file residues that are completely conserved
in the files labeled with a + sign in the CMC file that you would
use for any of the CORAN like options.
You will be prompted for the sequence range, the residue range, and
the conservation percentage above which a residue is called conserved.
Several options exist to search for correlated behaviour among
residues. These options can be divided in three groups: CORMUT, CORAN1-like,
and the +/- correlations.
CORMUT looks for residues that mutate in tandem. The CORAN1-like options
look for residues or residue pairs that mutate together with a code
entered via the CMC-file. The other COR*** options correlate residues with a
CMC code that can only be plus or minus.
Be aware that the correlation values that are listed are NOT correlation
coefficients in the mathematical sense.
CORMUT will cause WHAT IF to prompt you
for a range of sequences and for a range
of residues in these sequences. It will then search for all moderately
conserved pairs of positions that show correlated mutational behaviour.
In other words, the search is for pairs of positions where mutations are not too
frequent, but where a mutation between two sequences at the
one position makes a mutation between the same two sequences
very likely at the other position.
After the calculations, the maximal mutation correlation coefficient is
displayed,
and WHAT IF prompts for a cutoff correlation coefficient.
All pairs that show correlated mutational behaviour with mutation coefficient
above this cutoff will be listed, together with the actual residues, and
a frequency of all observed exchanges.
The option CORMUT (see above) requires a certain degree of variability
for the residue positions. CORMUN, in contrast, does not take variability into
account, and will thus call a pair of completely conserved residues highly
correlated.
The CORMUM option does a correlation analysis just like the basic CORMUT
option, but rather than scoring binary (+1 for conserved or mutated in
tandem, and 0 otherwise) CORMUM scores all pairs by the difference between
the exchange matrix scores for the two positions. See the SETMAT
option about exchange matrices.
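The binary scoring mentioned for CORMUT (+1 when two positions are both conserved or both mutated between a pair of sequences, 0 otherwise) can be sketched in Python. This is a simplified illustration, not the WHAT IF algorithm: the function name is hypothetical, positions are counted from 0, the score is simply the fraction of sequence pairs that agree, and no variability requirement is applied, so two completely conserved positions score 1.0 here.

```python
from itertools import combinations


def tandem_score(aligned, pos_a, pos_b):
    """Fraction of sequence pairs in which pos_a and pos_b behave the same way.

    A sequence pair scores 1 when both positions are unchanged or both
    positions differ between the two sequences, and 0 otherwise.
    """
    agree = 0
    total = 0
    for seq_s, seq_t in combinations(aligned, 2):
        changed_a = seq_s[pos_a] != seq_t[pos_a]
        changed_b = seq_s[pos_b] != seq_t[pos_b]
        if changed_a == changed_b:
            agree += 1
        total += 1
    return agree / total


# Positions 0 and 2 swap together (D..R versus E..K); position 1 varies freely.
alignment = ["DAR", "DGR", "EAK", "EGK"]
print(tandem_score(alignment, 0, 2))
print(round(tandem_score(alignment, 0, 1), 3))
```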
Does about the same as CORMUT, but puts a hefty penalty on missing
residues.
Sometimes there is a strong correlation between the type of
certain residues and the classification of the molecule. This is
seen most trivially in serine or cysteine proteases. However, it is also
true at a more subtle level. For example, in the GPCRs, all amine receptors
have an aspartic acid at a particular position. However, subclasses and
subclasses within these subclasses are often also characterized by certain
residue positions.
WHAT IF provides a method for detecting these residues. To do so, a
form of correlated mutational behaviour as described above is incorporated
that correlates residues with class identifiers. A class identifier is
a character or number that is characteristic for the class, or subclass
of the sequence. Every sequence can have one class identifier. (see above
for a description of the file needed to instruct WHAT IF about the class
identifiers).
When the command CORAN1 is issued, WHAT IF prompts for the name of the
correlation file. WHAT IF will ask for the number of the profile that
was used for the alignment. It will
also prompt for a residue range. If you want, you can provide a so called
skip file. This is a file that holds the numbers of the residues that should
not be used in the analysis, give 0 (zero) if you do not have or do not want to
use such a file. The results will be similar to those
described for the CORMUT command (see above), but instead of
showing two correlated residues, it will present the correlation
between the class identifiers and the residues. This is a true
correlation, and not, as for the CORMUT option, a noise multiplied
correlation.
The sequences in the BIGFILE will be sorted according to the
order in the correlation file. If the correlation file holds
accession numbers for non-existing sequences, an error message is
issued, and the option is terminated. If sequences are present in the
BIGFILE but not in the correlation file, the user can choose between getting
those sequences placed at the
end of the BIGFILE or removing them from the BIGFILE.
This option functions similarly to CORAN1, but is less strict in the negative
correlations.
CORPM1 functions similarly to the CORAN options mentioned above. However,
the CMC
file is only allowed to hold the CMC codes + (plus) and - (minus). This
presents some restrictions, but accelerates the computations so much that
correlations over more residues at the same time become calculable.
The principle is the following: For every residue position the most
prevalent residue
in all sequences marked with a + is determined. The method then
considers all pairs of
residue positions and scores the cases where both residues agree with that
majority in the + labeled sequences, whereas at least one of them differs
from the majority of the + labeled sequences in the - labeled sequences.
If this sounds complicated to you, you are right. Just try it, it does not
take too much time.
CORPM2 is very similar to CORPM1. The only difference is that CORPM2 is
more critical about the - labeled residues. They have to differ from
the + labeled ones.
This option is still being worked on. If you want to try it, be aware that
WHAT IF could crash...
Often one wants to focus on a subset of the available sequences. Rather
than introducing active and inactive sequences (which means doing complicated
things inside the program), I have decided for a rather crude approach. The user
can simply remove any undesired sequences from the BIGFILE.
Since removing a sequence from the BIGFILE is irreversible,
a good backup of the BIGFILE is recommended (normally called WALIGN.BIG)
before any options in this menu are used.
The command KPNAME will cause WHAT IF to prompt you for a series of keywords
or text strings. All sequences that have one of these strings rendered EXACTLY,
either in the file name or the title, will be tagged to be kept.
Although matching is exact, the matching of text fragments is not case-
sensitive. After the search, the number of sequences found with this string
in it will be listed, and the user is asked to confirm deletion of
all the other files.
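The keyword matching (exact text fragment, case-insensitive, in either the file name or the title) can be sketched in Python. The function names and the representation of an entry as a (file name, title) tuple are assumptions made for illustration.

```python
def matches_keywords(keywords, file_name, title):
    """True if any keyword occurs, ignoring case, in the file name or the title."""
    name = file_name.lower()
    text = title.lower()
    return any(key.lower() in name or key.lower() in text for key in keywords)


def split_by_keywords(entries, keywords):
    """Split (file_name, title) entries into those to keep and those to delete."""
    kept = [e for e in entries if matches_keywords(keywords, e[0], e[1])]
    dropped = [e for e in entries if not matches_keywords(keywords, e[0], e[1])]
    return kept, dropped


entries = [
    ("a2aa_human.pir", "ALPHA-2A ADRENERGIC RECEPTOR"),
    ("opsd_bovin.pir", "Rhodopsin"),
    ("b2ar_human.pir", "Beta-2 adrenergic receptor"),
]
kept, dropped = split_by_keywords(entries, ["adrenergic"])
print(len(kept), len(dropped))
```

A command that deletes by keyword, rather than keeps, would use the same test and remove the first list instead of the second.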
The command DELNAM will cause WHAT IF to prompt for a series of keywords
or text strings. All sequences that have one of these strings rendered exactly,
either in their file name or in their title, will be tagged to be removed
from the BIGFILE.
Although matching is exact, the matching of text fragments is not case-
sensitive. After the search, the number of sequences found with this string
in it will be listed, and the user is asked to confirm deletion of
all these files.
The command KILDBL will cause WHAT IF to prompt for two sequence ranges.
The same range may be specified twice. Any sequence in the second
range that is completely identical to a sequence in the
first range will be removed from the BIGFILE. If the same range is specified twice
and two identical sequences are detected, the sequence listed
later in the BIGFILE will be removed.
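The removal rule can be sketched in Python. This is an illustration under stated assumptions: the function name is hypothetical, sequences are plain strings indexed from 0, and for partly overlapping ranges the same "later one is removed" rule is applied to pairs that lie in both ranges.

```python
def doubles_to_remove(sequences, first_range, second_range):
    """Return indices in second_range that duplicate a sequence in first_range."""
    first, second = set(first_range), set(second_range)
    remove = []
    for j in sorted(second):
        for i in sorted(first):
            if i == j or sequences[i] != sequences[j]:
                continue
            # If both members lie in both ranges, only the later one is removed.
            symmetric = i in second and j in first
            if i < j or not symmetric:
                remove.append(j)
                break
    return remove


seqs = ["AAA", "CCC", "AAA", "CCC", "GGG"]
print(doubles_to_remove(seqs, range(5), range(5)))        # same range twice
print(doubles_to_remove(seqs, range(2, 5), range(0, 2)))  # two separate ranges
```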
The command DELALI will cause WHAT IF to prompt you for a profile number,
and two cutoff percentages. These are the percentage identity between
the sequence and the consensus sequence of the profile, and the convolution
between the sequence and the profile. (these two numbers are shown in the
first table in the HSSP output file generated by the MAKHSP command).
All sequences that have either one of these percentages below the given
cutoff are deleted from the BIGFILE.
Often sequences are obtained for which the biological function is unknown or
only partly understood, and consequently it is difficult to name these
sequences. The option UNKTYP allows for the comparison of
sequences with a series of profiles. WHAT IF prompts for the name of
a file that holds the names of all profile files, one profile file name
per line. WHAT IF also prompts for the range of sequences. All
sequences will be compared with all profiles, and the convolution of the
sequence with the profile, after optimal alignment, will be listed.
The command SRTPCT will cause WHAT IF to prompt you for a profile. For
all sequences the similarity with this profile is calculated and the
sequences are sorted by descending similarity.
The command SRTPID will cause WHAT IF to prompt you for a profile. For
all sequences the identity with this profile is calculated and the
sequences are sorted by descending identity.
The command SRTACC will sort the sequences in the BIGFILE as a function
of their accession code (sorting in increasing order).
The command SRTFLN will sort the sequences in the BIGFILE as a function
of their file name (sorting in increasing order).
The command SRTGMB will sort the sequences in the BIGFILE as a function
of their grey meatball character (sorting in increasing order).
The grey meatball value gets higher when a sequence looks more like all
other sequences.
The command WALGRA calls the menu for graphic representation of sequences.
When the command GRASQS is issued, WHAT IF will prompt for a range of
sequences and a range of residues. The specified residues in the sequences
will be sent to the graphics window as MOL-item(s). The residues are coloured
by residue type (see COLSQS). Limited interactive graphics are available with
the local command GO.
The colours for the residues are determined by the values given in the
file SEQCOL.FIL. The default for this file looks like:
A 240
C 180
D 120
E 120
F 260
G 220
H 120
I 240
K 40
L 240
M 240
N 80
P 220
Q 80
R 40
S 220
T 220
V 240
W 260
Y 260
X 150
- 350
If you have a file called SEQCOL.FIL in your local directory, this file will
be used rather than the default file. The command COLSQS will bring the local
copy of this file into the editor. If you do not have a local copy
of this file yet, the default file will first be created in the
present directory, and thereafter
the file will be brought into the editor. After leaving the editor the
file will be automatically read by WHAT IF, and the residues on the screen
get the colours you requested. If the GRASQS option is run again, these
new colours will also be used for the new sequences.
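For those who want to check or generate a SEQCOL.FIL outside WHAT IF, the
following Python sketch reads the format shown above. It assumes one
whitespace-separated "residue value" pair per line and silently skips
anything else. The function names are hypothetical, and the local-file-first
lookup merely mimics the behaviour described in the text.

```python
from pathlib import Path


def parse_seqcol(text):
    """Parse 'residue value' lines into a dict such as {'A': 240}."""
    colours = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: blank or malformed lines are ignored
        residue, value = fields
        colours[residue] = int(value)
    return colours


def load_seqcol(default_text, directory="."):
    """Use a local SEQCOL.FIL if present, else fall back to the default text."""
    local = Path(directory) / "SEQCOL.FIL"
    if local.is_file():
        return parse_seqcol(local.read_text())
    return parse_seqcol(default_text)


sample = "A 240\nC 180\nK 40\n- 350\n"
print(parse_seqcol(sample))
```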
The command SHOW in the WALIGN related menus will pass control to the graphics
window as is usually done by GO. The difference with GO is that the
main menu at the right side of the screen
now has many different options. These are:
WAIT : Cancel option.
T > : Translate a few residues to the right.
T < : Translate a few residues to the left.
T >> : Translate many residues to the right.
T << : Translate many residues to the left.
T ^ : Translate a few sequences upwards.
T V : Translate a few sequences downwards.
T ^^ : Translate many sequences upwards.
T VV : Translate many sequences downwards.
COLR : Allow for interactive modification of the colouring scheme.
M1 : Store the present view in view memory 1.
M2 : Store the present view in view memory 2.
M3 : Store the present view in view memory 3.
M4 : Store the present view in view memory 4.
VMS : Spawn a subprocess (create a shell).
CHAT : Pass control back to the text window.
RSET : Reset the viewing parameters.
>> : Scale the display up.
<< : Scale the display down.
MOV+ : Move one step forward in the movie.
MOV- : Move one step back in the movie.
HELP : Activate/deactivate the interactive HELP option.
The command 2DPLOT requires a file called ARBNUM.POS. This file has the
following format:
170
111 -8.0 8.0 0.0
112 -7.0 7.5 0.0
113 -6.0 7.0 0.0
164 lines removed for clarity...
730 16.0 -3.0 0.0
731 17.0 -3.5 0.0
732 18.0 -4.0 0.0
The first number indicates the number of lines to follow. Each of these
lines gives, for one residue, the arbitrary sequence number (this is the second number
given to it in the profile file) and its position in space. At this position
the residue will be shown in a small box.
This option is not yet entirely bug-free.
The command PLOT2D is an alternative spelling for 2DPLOT.
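A small Python sketch for reading a file in the ARBNUM.POS format is given
below. It assumes whitespace-separated fields and checks that the count on
the first line matches the number of records that follow. The function name
is hypothetical, and how WHAT IF itself treats a mismatch is not documented
here.

```python
def parse_arbnum_pos(text):
    """Return a list of (arbitrary_number, x, y, z) tuples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    expected = int(lines[0])  # first line: number of records to follow
    records = []
    for ln in lines[1:]:
        number, x, y, z = ln.split()
        records.append((int(number), float(x), float(y), float(z)))
    if len(records) != expected:
        raise ValueError(
            f"header announces {expected} lines, found {len(records)}"
        )
    return records


sample = "3\n111 -8.0 8.0 0.0\n112 -7.0 7.5 0.0\n113 -6.0 7.0 0.0\n"
for record in parse_arbnum_pos(sample):
    print(record)
```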
The option PRETYG will cause WHAT IF to prompt you for a set of
sequences and the ranges in these sequences. It will put the requested
sets of residues on the screen, coloured and with conserved
residues boxed. See the parameter menu for parameters that can modify
the output of this option.
The option PRETYP will cause WHAT IF to prompt you for a set of
sequences and the ranges in these sequences. It will put the requested
sets of residues in a PostScript file, with the conserved
residues boxed. See the parameter menu for parameters that can modify
the output of this option.
You will also be prompted for the sequence to be used to determine how
to put numbers on top of the alignment. Please give the rank order
of the sequence in the output, not the number in the BIGFILE. So, if
you plot the sequences 3 to 7, and you want the residues in the output
to be numbered according to sequence 3, you have to answer the question
about which sequence to be used for the numbering not with 3 but with 1
because the BIGFILE sequence 3 is sequence 1 in the output. If you enter
a ridiculously large sequence number, e.g. 1000000, the numbering will
not follow any sequence, but simply count everything, residues and
insertions.
After that you will be asked for the residue ranges above which an * is
to be plotted, and the ranges to be shaded with dark
grey or light grey respectively. Also here, you have to give the number
in the output, not the residue number. So, if you want residue 61 to be
shaded grey or labeled with an * on the top of the page,
and you are plotting the residues from 60 to 80, you should
ask for residue 2 to be shaded, because residue 61 is in this example the second
residue in the output.
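The conversion from BIGFILE numbers to output numbers in both examples is
simply "number minus start of range plus one". The Python helpers below are
hypothetical and only spell out that arithmetic. They assume a contiguous
range and ignore any shift caused by insertions in the alignment.

```python
def output_rank(bigfile_number, first_plotted_sequence):
    """1-based rank of a BIGFILE sequence within the plotted sequence range."""
    return bigfile_number - first_plotted_sequence + 1


def output_position(residue_number, first_plotted_residue):
    """1-based position of a residue within the plotted residue range."""
    return residue_number - first_plotted_residue + 1


# Sequences 3 to 7 are plotted; numbering should follow BIGFILE sequence 3.
print(output_rank(3, 3))        # 1
# Residues 60 to 80 are plotted; residue 61 should be shaded.
print(output_position(61, 60))  # 2
```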
The WALSER menu is still highly experimental, and therefore scarcely
documented. The WALSER menu is the main ingredient for the production
process of the WWW-based information system GPCRDB.
It is envisaged that in due time this menu will also be useful
for the creation and maintenance of other class-specific databases.
The WALSER menu stores all kinds of information (file names, locations,
profile names, mutation pointers, etc.) in a big file called SERVER.BIG.
The first time around you need to initialize this file with the
GPCINI command, but every other time you should open this file with the
GPCOPE command.
Please use GPCOPE before any of the commands that follow in this
chapter.
This option lists roughly all statistics about the file
SERVER.BIG.
This option lists some of the vital statistics about the file
SERVER.BIG.
Part of the GPCRDB (and future similar) projects is the classification
of sequences into families, sub-families, sub-sub-families, etc. The
file _7TM.CLASSES holds this hierarchy of classes. The command GPCGCL
reads this file and stores the data in the big server file SERVER.BIG.
Every family, sub-family, etc., has its own, optimised, profile.
The file _7TMPROF.LIST holds the names of these profiles. The
command GPCGPR reads this file and stores the data in the big
server file SERVER.BIG.
After running GPCGPR, you can actually read all profiles with
the command GPCRPR. The profiles will end up in the BIGFILE. The
information about which profile belongs with which family is stored by the
GPCGPR option in the big server file.
The command GPCSPL will align all sequences in the BIGFILE against
all profiles in the BIGFILE. If a hit is found, the sequence is
copied to the corresponding directory.
Does the same as GPCSPL; the only difference is that the sequences are
not moved into the corresponding directory. Instead, that directory is
just listed on the screen.
The GPCMAL option will run over all directories, and in every directory
it will update all files (.HSSP, .MSF, .PRF, etc.).
The option GPCO7T will cause WHAT IF to read the file 7TM.FILES.
As this normally should go automatically, there should never be a reason
to run this option.
The sequence files that we want to use should be present in the local
directory. There are of course many ways to get them there, and a
WHAT IF independent script seems to be the preferred way. However,
at EMBL you can obtain those files that are located in the SwissProt
directories by listing them in the file 7tmrlist.txt and running
the command GPCG7T.
The 2ALIGN command can be used to align two sequences. WHAT IF will
prompt for two sequence
numbers, a gap open penalty and a gap elongation penalty. The default
penalties that are suggested are meant to be used with the default
Dayhoff-type matrix obtained with the SETMAT command. Otherwise you
are on your own, and believe me, there is much you can do wrong here....
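To get a feeling for how the two penalties interact, here is a minimal Python
sketch of a global alignment score with separate gap open and gap elongation
penalties (the three-matrix scheme usually attributed to Gotoh). It is not
the WHAT IF implementation. It assumes a gap of length L costs
gap_open + (L - 1) * gap_extend, and it replaces the Dayhoff-type matrix by a
toy match/mismatch score. All names and default values are made up for the
example.

```python
def affine_align_score(a, b, gap_open, gap_extend, match=2, mismatch=-1):
    """Best global alignment score of a and b with affine gap penalties."""
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]
    # X: a[i-1] aligned with a gap; Y: b[j-1] aligned with a gap
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a new gap costs gap_open; extending one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_align_score("ACDEFG", "ACDEFG", gap_open=5, gap_extend=1))
print(affine_align_score("ACDEFG", "ACFG", gap_open=5, gap_extend=1))
```

With a low elongation penalty one long gap is much cheaper than two short
ones, which is why penalties tuned for one scoring matrix rarely transfer to
another.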
The command MAKHSP requires that you input a profile number, a range
of sequences and an HSSP file name. You will also be asked whether you
want to calculate the variability (if you say yes, this will take
a lot of CPU time, so you normally only say yes in the final step).
If you have only 1 profile in the big file, you will not be prompted
for the profile number. If there already exists a file with the same
name
as the HSSP file you want to generate, you will be asked if you
want to overwrite the old one.
The command SHOIDM will ask you for a sequence range.
All pairwise sequence identities in the overlapping areas are
calculated and listed as percentages in the first table.
The second table lists the differences rather than the similarities.
The last table shows the similarities after subtraction of the
smallest number found in the table.
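The three tables can be mimicked with the Python sketch below. It assumes
pre-aligned sequences of equal length with '-' for gaps, takes "overlapping
areas" to mean the columns in which both sequences have a residue, and takes
the second table to be 100 minus the first. These are interpretations, not
statements about the actual SHOIDM code, and the function names are
hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Identity percentage over columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def identity_tables(seqs):
    """Return (identities, differences, identities minus the table minimum)."""
    n = len(seqs)
    ident = [[percent_identity(seqs[i], seqs[j]) for j in range(n)]
             for i in range(n)]
    diff = [[100.0 - v for v in row] for row in ident]
    lowest = min(min(row) for row in ident)
    shifted = [[v - lowest for v in row] for row in ident]
    return ident, diff, shifted


seqs = ["ACDEFG", "ACDQFG", "AC-EYG"]
ident, diff, shifted = identity_tables(seqs)
for row in ident:
    print(" ".join(f"{v:6.1f}" for v in row))
```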
The command HISTID lists all pairwise similarity percentages
between two ranges of sequences that you will be prompted for.
Additionally you get a histogram of the observed percentages, and
some statistics like the average similarity, and the standard
deviation, etc.
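A histogram plus average and standard deviation of a list of percentages
takes only a few lines of Python. The sketch below assumes 10%-wide bins and
the population standard deviation. The bin width and the kind of standard
deviation that HISTID actually uses are not documented here, so treat both as
assumptions.

```python
import statistics


def histogram(values, bin_width=10):
    """Count values per bin; a value of exactly 100 goes into the top bin."""
    bins = {}
    for v in values:
        start = min(int(v // bin_width) * bin_width, 100 - bin_width)
        bins[start] = bins.get(start, 0) + 1
    return dict(sorted(bins.items()))


def summarise(values):
    """Average and (population) standard deviation of the percentages."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }


values = [60.0, 80.0, 80.0, 100.0]  # hypothetical pairwise percentages
for start, count in histogram(values).items():
    print(f"{start:3d}-{start + 10:3d}% {'*' * count}")
print(summarise(values))
```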
This option is not yet ready.
The command MFETCH prompts you for a sequence range. It creates
a file called FETCH.LIS that can be edited to be used by GCG to
fetch the sequences from the database(s).
Gives a matrix with coloured squares. Colours relate to the pairwise
identity percentage.
For debugging only.
Normally gap penalties are an integral part of the profile.
If you want to use the gap open and gap elongation penalties
that are set in the parameter file (parameters 373 and 374) you
can use this WLHGAP option.
For debugging only.
This is a GPCR specific option.
This is a GPCR specific option.
This is a GPCR specific option.
This is a GPCR specific option.
Often a DNA sequence slips in with a large family of protein
sequences. WLHDNA will remove those that accidentally entered the
BIGFILE.
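How WLHDNA recognises DNA is not described here. One simple heuristic,
sketched below in Python, is to call a sequence DNA when nearly all of its
letters are nucleotide codes. The 90% threshold, the letter set and the
function name are assumptions chosen for the illustration only.

```python
def looks_like_dna(sequence, threshold=0.9):
    """Guess whether a sequence is DNA from its fraction of A/C/G/T/U/N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    nucleotide_like = sum(1 for c in letters if c in "ACGTUN")
    return nucleotide_like / len(letters) >= threshold


sequences = {
    "dna_entry": "ATGGCCAAGTTTGGCTAA",
    "protein_entry": "MKTAYIAKQRQISFVKSHFSRQ",
}
protein_only = {name: s for name, s in sequences.items() if not looks_like_dna(s)}
print(sorted(protein_only))
```

Short peptides that happen to consist of A, C, G and T residues would fool
this test, so a real filter would need a length check as well.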
This is a GPCR specific option.
This is a GPCR specific option.
This option will try to optimise the order of the sequences in the
BIGFILE such that in the WLHPID output the highest homologies
are closest to the diagonal.
This option just does what the title suggests.
This is a GPCR specific option.
The option is supposed to do just what the title suggests, but
not being a molecular biologist, I doubt if I have implemented a smart
algorithm.
Writes a range of sequences as long single strings of characters
to a file, one line per sequence. That file is often useful for
manual operations. See WLHRIN about reading the sequences back in.
See also WLHWIA.
This is a GPCR specific option.
Reads back the sequences written by WLHWIN. Normally you would
have edited the file between WLHWIN and WLHRIN. Also, don't forget
to delete the original sequences from the BIGFILE if you don't want duplicates.
This option runs over a series of sequences, tries every fragment,
and calculates the weight of that fragment. If the fragment
falls within the limits given by the user, the fragment is listed.
This option is similar to WLHWIN. However, WLHWIN writes the sequences
as compactly as possible. WLHWIA also writes sequences to a file at one
sequence per line, but it inserts - signs for deletions.
This is a handy option if you want to do some hand optimisation
of alignments.
This option indeed does something.
This option aligns all sequences in the BIGFILE one after the other
against each profile. The results are reported.
Translates DNA sequences into protein sequences.
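For reference, a translation with the standard genetic code looks roughly
like the Python sketch below. It reads the first reading frame only, stops at
the first stop codon by default, and writes X for codons it cannot interpret.
Whether the WHAT IF option handles frames, stop codons and ambiguous bases
this way is an assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, stop_at_stop=True):
    """Translate the first reading frame of a DNA (or RNA) string."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if residue == "*" and stop_at_stop:
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCAAGTTTTAA"))
```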
Takes a range of sequences. Assumes that you pre-aligned them, and
makes a phylogenetic tree using the neighbour joining algorithm.
The option EMBED3 will do an eigenvalue analysis on the pairwise
sequence identity matrix. The result is a series of crosses
in 3D space that represent the sequences. The distance between any
two points is the best possible measure for the distance between the
sequences in sequence space. The crosses are pickable.
One day this option will be a cutoff for the sequence identity
percentage in profile alignment for those options that scan large volumes
of sequences.
One day this option will be a cutoff for the sequence-profile convolution
value in profile alignment for those options that scan large volumes
of sequences.
The option 2ALIGN uses a gap open penalty and a gap elongation
penalty. Upon running 2ALIGN you are also asked to give these
penalties, so setting this parameter in the PARAMS menu is
somewhat redundant.
The option ALIPRF uses a position-specific gap open penalty
and gap elongation penalty. These penalties are stored in the
profile. With the option WLHGAP you can set all gap open penalties
in one go to the value of the PRFOPE parameter. WLHGAP also sets
all gap elongation penalties according to the PRFELO parameter.
The option ALIPRF uses a position-specific gap open penalty
and gap elongation penalty. These penalties are stored in the
profile. With the option WLHGAP you can set all gap elongation
penalties in one go to the value of the PRFELO parameter. WLHGAP also sets
all gap open penalties according to the PRFOPE parameter.
The parameter CHRSIZ allows you to change the character size
in the PRETYP and PRETYG options.
The parameter LIMBOX determines how many identical residues are
needed at one position in an alignment before the PRETY* options
decide to draw a box around them.
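The effect of LIMBOX can be imitated with the Python sketch below, which
lists the alignment columns where the most common residue occurs at least
LIMBOX times. Gaps ('-') are assumed not to count as residues, and the
function name is hypothetical.

```python
from collections import Counter


def boxed_columns(aligned, limbox, gap="-"):
    """Return 1-based column numbers with at least `limbox` identical residues."""
    boxed = []
    for column, residues in enumerate(zip(*aligned), start=1):
        counts = Counter(r for r in residues if r != gap)
        if counts and max(counts.values()) >= limbox:
            boxed.append(column)
    return boxed


aligned = ["ACDEFG", "ACDQFG", "AC-EYG"]
print(boxed_columns(aligned, limbox=3))
print(boxed_columns(aligned, limbox=2))
```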
The RESLIN parameter determines how many residues there will be
per line in the PRETYP option.