Sequence manipulation (WALIGN)

Introduction.

The WALIGN menu provides functions for sequence alignment, profile alignment, and multiple sequence alignment. This menu was written for management of the TM7 GPCR file server (send the message HELP to "TM7@EMBL-Heidelberg.DE" for information) but has further capabilities.

The basic process involves storing (a maximum of 9) profiles and an unlimited number of sequences (maximum sequence length 1000 residues) in a file called BIGFILE, and performing all operations on this file.

This menu provides tools to search rapidly for identical or highly homologous sequences, to perform (iterative) profile alignments, to display and manipulate sequences, and in general, to perform functions unavailable in other packages or to surpass the sequence limitations of these packages.

WHAT IF always maintains the original sequence in memory. This means that following alignment, one can always return to the original sequence. This allows for easy experimentation with gap weights, etc.

In order to conduct sequence analyses and manipulations, several commands are needed. The sequence- and profile-related commands are therefore divided among a group of related menus. However, all commands from these menus can be executed directly from the WALIGN menu, without the percent sign (%) prefix.

Big file administration

Creating a big file (BIGFIL)

The command BIGFIL can be used to create a so-called BIGFILE. In the BIGFILE all profiles and sequences are stored. It is only possible to operate on sequences and profiles once they are stored in the BIGFILE. If a BIGFILE exists with the name WALIGN.BIG, then this file will automatically be opened; otherwise the first option in the WALIGN menu MUST be BIGFIL. If a BIGFILE is already open, this command allows you to close the file and save it and to create a new BIGFILE, or to open another already existing one.

Initializing a BIGFILE (WALINI)

The command WALINI will erase all information in the BIGFILE, and it will additionally erase all other arrays, variables and parameters related to any WALIGN related option.

You normally only use this option when you are in deeeep shit....

Checking the big file for double occurrences (DOUBLS)

When the command DOUBLS is issued, WHAT IF will prompt for two sequence ranges (of sequences present in the BIGFILE). It will list all pairs within these ranges that are identical in sequence. Be aware that (profile) alignment, for example, may delete residues from sequences, and that any discarded residues are not used by the DOUBLS option. (See ORGBCK).

DOUBLS performs a pairwise comparison of the sequences returned by the LSTSQS option. This option is much faster than DMATCH (see below) because it searches for 100 percent identical sequences and thus does not require any alignment. The DOUBLS command can also be used to search for close homologs, i.e. 95-100 percent identical sequences, but the comparison method chosen is only exact for 100 percent identity between sequence pairs.
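
The kind of comparison DOUBLS performs can be sketched as follows. This is only a minimal Python illustration, not WHAT IF's actual code; the function name identical_pairs and the use of a plain list of strings to stand in for the BIGFILE are assumptions made for the example.

```python
def identical_pairs(sequences, range_a, range_b):
    """List pairs of sequence numbers whose sequences are identical.

    sequences : list of strings (hypothetical stand-in for BIGFILE entries)
    range_a, range_b : (first, last) inclusive, 1-based sequence ranges
    """
    pairs = []
    seen = set()
    for i in range(range_a[0], range_a[1] + 1):
        for j in range(range_b[0], range_b[1] + 1):
            if i == j:
                continue  # a sequence is trivially identical to itself
            pair = (min(i, j), max(i, j))
            if pair in seen:
                continue  # report each pair only once
            if sequences[i - 1] == sequences[j - 1]:
                seen.add(pair)
                pairs.append(pair)
    return pairs


if __name__ == "__main__":
    demo = ["ACDE", "ACDF", "ACDE", "ACDF"]
    print(identical_pairs(demo, (1, 4), (1, 4)))
```

Because only exact string equality is tested, no alignment is needed, which is why this kind of search is fast.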

Searching for highly homologous pairs (DMATCH)

The command DMATCH searches for pairs of sequences with high homology. DOUBLS (see above) assumes that the sequences are already aligned. If the sequences are not yet aligned, DMATCH may be used. The command DMATCH will cause WHAT IF to prompt you for two sequence ranges and a sequence identity percentage cutoff. All pairs of sequences that show a pairwise sequence identity (following alignment) above the specified cutoff will be listed. To avoid days of CPU time, a crude filter based on nearest-neighbour sequence relations is applied. This filter makes the process considerably faster; however, as a consequence, the option is only reliable at identity levels above 90%. At lower levels the reported identity levels are still accurate, but the algorithm may fail to detect some homologous pairs.
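
The identity percentage that is compared against the cutoff can be illustrated with a small Python sketch. It assumes the two sequences have already been aligned and that gaps are written as "-"; the function name, the gap character, and the choice to count only positions where both sequences have a residue are assumptions for the example, and the nearest-neighbour filter is not reproduced.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over positions without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped positions are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    cutoff = 75.0
    identity = percent_identity("ACDE-F", "ACDQGF")
    print(identity, identity > cutoff)
```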

Be aware that profile alignment, for example, will delete residues from sequences, and that discarded residues are not used by the DMATCH option. (See ORGBCK).

Recovering original sequences (ORGBCK)

Following alignment against a profile, residues coinciding with insertions are deleted from their sequences. If original sequences are later desired, the ORGBCK command may be issued. WHAT IF will prompt for a sequence range, and all sequences within this range will be restored to their original state (as they were when read from a file).

Show contents of BIGFILE (BIGSTS)

The command BIGSTS will tell you how many sequences and how many profiles presently are loaded in the BIGFILE.

Show contents of BIGFILE (MFILES)

The command MFILES will create a file called FILES.LIST that has for every BIGFILE sequence entry the filename, and the DE, AC and OS record in it.

Arbitrary sequence numbers

When dealing with multiple sequence alignments the biggest problem is probably residue numbering. What is the number of that conserved cysteine that is at position 15 in one sequence and at position 17 in another? Should it be 15, 17, or should we take the average, 16, smile, smile.

In the WALIGN menu we solved this problem with the so-called arbitrary sequence number. These numbers are stored in the profile, and you have to hand-edit them there.

Obtaining the arbitrary sequence numbers (GETARB)

If you want to make the arbitrary sequence numbers known to WHAT IF, you can have WHAT IF read them by issuing the GETARB command. It seems logical to obtain the sequence numbers from the profile that you used for the alignment. The GETARB option therefore seems only needed for cases where you want to modify the arbitrary sequence numbers on the fly or something like that.

The scoring matrix

A simple alignment of two sequences involves the matching and scoring of pairs of residues. The classical method for performing these tasks is the Dayhoff exchange matrix. The file DAYHOF.MAT in the */dbdata directory holds the default exchange matrix used by WHAT IF. If WHAT IF is requested to use a Dayhoff matrix for the alignment, the file DAYHOF.MAT may be copied from the .../dbdata directory to the directory where WHAT IF is executed; there the file may be modified. WHAT IF searches for this file first in the local directory, and thereafter in the database directory. If DAYHOF.MAT is modified, the user must be careful to preserve the original format.
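
The search order described above (local directory first, database directory second) can be sketched in Python. This is an illustration only; the function name find_matrix_file and its parameters are hypothetical, and the caller has to supply the actual directory locations.

```python
from pathlib import Path


def find_matrix_file(local_dir, database_dir, name="DAYHOF.MAT"):
    """Return the first existing matrix file, local directory first."""
    for directory in (local_dir, database_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None  # no matrix file found in either place
```

A modified local copy therefore takes precedence over the default file without the default having to be touched.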

Displaying the exchange matrix (SHODAY)

The command SHODAY can be used to display the present scoring matrix (also called exchange matrix) at the terminal.

(Re-)setting the exchange matrix (SETMAT)

The command SETMAT can be used to reset the scoring matrix (also called exchange matrix). When this command is issued, WHAT IF presents a minimenu for selecting a unity matrix (scoring only identities), the default Dayhoff matrix, or another exchange matrix with a specified name. The default Dayhoff matrix is the file .../dbdata/DAYHOF.MAT, unless a file with the same name is present in the local directory (i.e. the current working directory), in which case the local file is the default.

Input, output and administration of sequences SEQADM

There are three types of sequence options: 1) options that operate on several sequences, e.g. profile alignment; 2) options that operate on two sequences, e.g. sequence alignment; and 3) options that work on one sequence, e.g. listing a sequence or counting the residues in it. Some options are difficult to classify; for example, listing two sequences without comparing them is placed under single sequence options.

Reading sequences (GETSEQ)

The command GETSEQ is used to read sequences in. WHAT IF recognizes three file formats: PIR, Swissprot, and GCG. WHAT IF will prompt for the format of the file. When reading multiple files, it is recommended that the names of the files be placed in a single text file (one filename per line); at the prompt this text file may be specified by @ (which is Shift-2 on most keyboards) followed by the name of the text file. All files read by this method should be of the same file type. In order to read multiple Swissprot and PIR files, the GETSEQ command should be issued twice. WHAT IF will attempt to recognize a format if an incorrect format is specified, but this recognition may not be reliable.

Summary of available sequences (DIRSEQ)

When the command DIRSEQ is issued, WHAT IF will prompt for a sequence range. For all sequences in the specified range, the header information will be listed in the text window.

Listing a sequence (LSTSEQ)

When the command LSTSEQ is issued, WHAT IF will prompt for a range of sequences. The corresponding file names, titles, and sequence information will be listed.

Displaying a multi sequence alignment (LSTSQS)

When the command LSTSQS is issued, WHAT IF will prompt for a sequence range and a residue range. The specified residues of the specified sequences will be displayed on the screen. Make sure that the window can accommodate the number of requested residues (not more than 100 at a time); otherwise ugly wrap-arounds will result.

Deleting a sequence (DELSEQ)

When the command DELSEQ is issued, WHAT IF will prompt for a sequence range. The specified sequences will be removed from the BIGFILE. They will NOT be deleted from disk.

Writing a sequence to file (MAKSEQ)

When the command MAKSEQ is issued, WHAT IF will prompt for the number of a sequence, a file type, a file name and a title. The requested sequence will be written to a file with the specified name. File types can be PIR, GCG or Swissprot.

Write internal format multiple sequence file (MAKINT)

This command writes a range of sequences with all associated information to a formatted (human readable) file. The advantage of this file is that it is smaller than the BIGFILE, and can be hand-edited, hand-sorted, etc.

Read sequences from a multiple sequence file (GETINT)

This command reads sequences from a file written with the MAKINT command (see MAKINT). All sequences in the file will be read; it is not possible to read a subset. To access only a subset, the file must be edited accordingly.

Concatenating sequences (CCNSEQ)

The command CCNSEQ will prompt you for two sequences. It will make one new sequence in which the second sequence is concatenated after the first one.

Profile administration commands PROF2D

The profiles in this menu are mainly meant for the alignment of seven helix membrane bundles of GPCRs. However, as usual, the options can be misused for other purposes. Most profile operations use simple counting statistics to build the profile, rather than using a Dayhoff type matrix. Or in other words, it is a normal profile, but a unitary Dayhoff matrix is used in the generation.

The format of a profile is as follows:


****** -PROFILE V1.0 ******
ID        :profile
HEADER    :some header information
COMPOUND  :some compound information
SOURCE    :some info about where the profile came from
AUTHOR    :username, for example
PDB       :only if applicable
DSSP      :only if applicable
CHAINS    :'.'       irrelevant
PREFERENCE:AM        irrelevant
EVAL      :SCALED    irrelevant
SMIN      :  -0.05   irrelevant
SMAX      :   1.0    irrelevant
NRES      : 394      length of the profile
 SeqNo  PDBNo AA STRUCTURE BP1 BP2  ACC NOCC  OPEN ELONG  WEIGHT   V      L      
    18   18   L   < In this area the    >  9  3.00  0.10          0.000   0.200 
    19   19   A   < WHAT IF profile and >  9  3.00  0.10          0.000   0.175
    20   20   L   < MAXHOM profile are  >  9  3.00  0.10          0.000   0.340
    21   21   W   < different, but that >  9  3.00  0.10          0.045   0.045
    22   22   A   < should not affect   >  9  3.00  0.10          0.000   0.000
    23   23   N   < either of these two >  9  3.00  0.10          0.026   0.000
    24   24   A   < programs.           >  3  3.00  0.10          0.000   0.048
   342  342   V       0   4                1  0.00  0.00    1.00  0.040  -0.010
//
This whole profile is fixed format, so care is recommended in producing it. In the .../dbdata directory an example profile can be found called PROF.PRF. The irregular order in which the residues are listed in the profile is necessary for compatibility with other profile programs such as MAXHOM/HSSP. "File standards" are called standards because it is standard behaviour to change them regularly, so it is recommended that the user invoke the NEWPRF command in the PROF2D menu to ascertain the present standard....

Profile alignment (ALIPRF)

The ALIPRF command aligns sequences against a profile. WHAT IF prompts for the profile number and a range of sequences. The sequences are then aligned against the profile. Insertions in the profile are not permitted; a corresponding deletion in the non-profile sequence is made. Profile alignment requires approximately one second per 300 amino acids. If several sequences must be aligned, the MAXHOM/HSSP program is recommended. For each aligned sequence the fit between the sequence and the profile is provided. The sequences are altered in the BIGFILE, but the original sequences can always be retrieved with the ORGBCK command.

Creating a new profile from aligned sequences (NEWPRF)

The command UPDPRF is intended to update a profile. The command NEWPRF also creates a profile from aligned sequences, but, in contrast to UPDPRF, does not restrict the new profile to the length of the existing profile. WHAT IF prompts for a range of sequences, and a profile is made based on these sequences. In this new profile all gap open penalties are 3.0, and the gap elongation penalties are 0.1. Profile values range from -0.01, for absent, to approximately 1.5 for an absolutely conserved residue.
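
The idea of building a profile by simple counting can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sequences are already aligned and equally long, gaps are any character outside the 20 amino acid letters, and the values are plain frequencies between 0 and 1. The scaling that WHAT IF applies to arrive at values between -0.01 and approximately 1.5 is not reproduced, and the name counting_profile is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def counting_profile(aligned):
    """Per-position residue frequencies from aligned, equal-length sequences."""
    length = len(aligned[0])
    profile = []
    for pos in range(length):
        # count only real residues; gaps and unknown letters are ignored
        counts = Counter(seq[pos] for seq in aligned if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values())
        column = {}
        for aa in AMINO_ACIDS:
            column[aa] = counts[aa] / total if total else 0.0
        profile.append(column)
    return profile


if __name__ == "__main__":
    prof = counting_profile(["ACD", "ACE", "AGE", "A-E"])
    print(prof[0]["A"], prof[1]["C"], prof[2]["E"])
```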

Modification of profile with matrix (CNVPRF)

If you make a new profile based on just one sequence, this profile might not be detailed enough to start the ALIPRF profile alignment properly. In such cases you can try to run CNVPRF. This option applies the present scoring matrix (see the SETMAT option) to the profile. This implies that the first round that you do with ALIPRF will actually be the same as a whole series of 2ALIGN options (every time between the sequence you used for NEWPRF and the sequence you are aligning against the profile).

This defies the basic principles of the WHAT IF iterative profile align method a little bit, but if it is needed to get good results, who cares?

Creating a profile from the soup (SOUPRF)

The command SOUPRF will create a profile based on the molecule(s) in the SOUP. You will NOT be prompted for the residue range in the SOUP, so you have to make sure that the SOUP just holds those residues for which you want a profile. This option tries to set gap open and gap elongation penalties in agreement with the observed structural characteristics.

Aligning two profiles (AL2PRF)

The command AL2PRF can be used to align two profiles. WHAT IF prompts for two profile numbers in the BIGFILE. Although the result of the alignment will be expressed as the result of a consensus sequence alignment, the real alignment optimizes the inner (or dot) products of the profile vectors. Thus, instead of comparing similarities of individual amino acids, similarities of vectors of 20 profile values will be compared.
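
The scoring of one profile position against another by an inner product can be sketched in Python. This only illustrates the score that an alignment would optimize; the alignment procedure itself and its gap handling are not shown, and the function names are hypothetical.

```python
def position_score(vector_a, vector_b):
    """Inner (dot) product of two vectors of 20 profile values."""
    if len(vector_a) != len(vector_b):
        raise ValueError("profile vectors must have the same length")
    return sum(x * y for x, y in zip(vector_a, vector_b))


def score_matrix(profile_a, profile_b):
    """Score every position of profile_a against every position of profile_b."""
    return [[position_score(a, b) for b in profile_b] for a in profile_a]
```

Two positions score highly when the same residue types have high profile values in both, even if their consensus residues differ.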

Determine fit of sequence to profile (PCTPRF)

The command PCTPRF can be used to determine how well a sequence fits to a profile. WHAT IF prompts for a range of sequences and a profile number. Two values will be produced for every sequence. The first number indicates how often the residue in the sequence is identical to the consensus sequence of the profile. The second number is the average profile value corresponding to the residue in the sequence; in other words, the convolution of the sequence with the profile. Because the profile values normally fall between -0.1 and 1.54, the latter figure can be less than zero or greater than 100 percent.
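
The two numbers can be illustrated with a short Python sketch. It assumes that the sequence is already aligned to the profile, that the profile is represented as one dictionary of residue values per position, and that gap positions (written "-") are skipped; these representations and the name fit_to_profile are assumptions for the example.

```python
def fit_to_profile(sequence, profile, gap="-"):
    """Return (percent identity to consensus, average profile value as percent)."""
    identical = 0
    total_value = 0.0
    counted = 0
    for residue, column in zip(sequence, profile):
        if residue == gap:
            continue  # assumption: gapped positions do not contribute
        counted += 1
        consensus = max(column, key=column.get)  # residue with highest value
        if residue == consensus:
            identical += 1
        total_value += column.get(residue, 0.0)
    if counted == 0:
        return 0.0, 0.0
    return 100.0 * identical / counted, 100.0 * total_value / counted


if __name__ == "__main__":
    demo_profile = [{"A": 1.5, "C": -0.1}, {"A": 0.2, "C": 0.8}]
    print(fit_to_profile("AC", demo_profile))
```

With profile values above 1.0, as in the demo profile, the second figure exceeds 100 percent for a sequence that matches the consensus throughout.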

See also SRTPCT.

Reading a profile (GETPRF)

GETPRF can be used to read a profile from a file. The format of a profile file is given above. WHAT IF prompts for the file name. The profile will be automatically stored in the next free slot in BIGFILE. A maximum of nine profiles may be held in BIGFILE.

Listing profiles (DIRPRF)

The option DIRPRF provides a short list describing all profiles presently available in the BIGFILE.

Show consensus sequence of profile (LSTPRF)

When the LSTPRF command is issued, WHAT IF prompts for the number of a profile in the BIGFILE. It will determine for each position in the profile the residue with the highest profile value, and call that the consensus residue at that position. The consensus sequence consisting of these residues will be displayed. The original sequence, i.e. the one present in the profile file, will also be shown.
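
Determining the consensus sequence in this way can be sketched in a few lines of Python. The representation of the profile as one dictionary of residue values per position, and the name consensus_sequence, are assumptions for the example.

```python
def consensus_sequence(profile):
    """Pick, for each position, the residue with the highest profile value."""
    return "".join(max(column, key=column.get) for column in profile)


if __name__ == "__main__":
    demo_profile = [
        {"A": 0.9, "C": 0.1},
        {"A": 0.2, "C": 0.7},
        {"W": 1.5, "A": -0.01},
    ]
    print(consensus_sequence(demo_profile))
```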

List the whole profile (SHOPRF)

The command SHOPRF first performs the same function as LSTPRF (see above), and furthermore lists the complete profile. A wide text window is recommended.

Deleting a profile from BIGFILE (DELPRF)

When the command DELPRF is issued, WHAT IF prompts for a profile number. The profile specified will be deleted from BIGFILE. The profile file on disk will of course NOT be deleted.

Deleting multiple profiles from BIGFILE (DELPRS)

When the command DELPRS is issued, WHAT IF prompts for a range of profile numbers. The profiles specified will be deleted from BIGFILE. The profile files on disk will of course NOT be deleted.

Adding insertions to a profile (INSPRF)

If it is discovered that most of the aligned sequences have an insertion with respect to the profile at a given position, the user may wish to insert one or more residues in the profile at this position. When the command INSPRF is issued, WHAT IF will prompt for the position in the profile and will insert one residue in the profile at this position. The values for all 20 amino acids at this profile position are set identical. The commands MAKPRF and GETPRF, as well as the editor may be used to change these values.

Writing a profile to file (MAKPRF)

When the command MAKPRF is issued, WHAT IF prompts for a profile number and for an output profile file name. The profile will be written in that file in the format as described above.

Updating a profile from multiple sequence alignment (UPDPRF)

The command UPDPRF can be used to create a profile from a multiple sequence alignment. This option is explicitly meant for iterative profile alignment of GPCR sequences, but may also be useful for other purposes. WHAT IF prompts for an old profile. Preferably, this should be the profile that was used to align the sequences with the ALIPRF command. WHAT IF will then prompt for a range of sequences. The frequency of residue types at each position in the sequences will determine the profile values for that position. Inspection of the resulting profile is recommended. It may not resemble what you had in mind....

See also the NEWPRF command.

Updating a profile from multiple sequence alignment (UPAPRF)

This option does the same as UPDPRF but additionally UPAPRF will update the gap open and gap elongation penalties in the updated profile.

Make a profile of one sequence (SEQPRF)

When the command SEQPRF is issued, WHAT IF will prompt for one sequence, and for a profile file name. The requested sequence will be written as a profile to the requested file. The resulting profile is not a good profile for alignment purposes but can be administratively useful for placing a profile file on disk. Furthermore, this profile can be used to start an iterative profile alignment procedure.

Make a GCG-style MSF file (MAKMSF)

The command MAKMSF can be used to create an MSF file. The MSF format is the GCG standard format for multiple sequence alignments. WHAT IF will prompt for a sequence range. The output file will be called PROF.MSF.

Sequence or position weights (PRFSWT)

Many of the profile alignment related options and some of the correlation related options use weights for sequences and/or weights for positions in the sequences. The values of these weights are a function of the present alignment. For some options it is really important to use the most up-to-date weights, and in those cases the weights are automatically updated when that option is used. However, updating the weights is rather time consuming, and is far from always performed automatically. The option PRFSWT allows you to force WHAT IF to update all weight factors.

Correlation analysis WALCOR

Sometimes residues can only mutate in pairs. For example, a salt bridge on a dimer interface typically consists of Asp-Arg or Arg-Asp pairs. When a sequence lacks the aspartic acid, it is probable that the arginine has also mutated. Considerable information is available about such correlated mutations, and the reader is referred to the literature for further information. WHAT IF has its own correlated mutation module. The theory and methodology of this module is described in volume 5 of the 7TM journal.

Sometimes there is a strong correlation between the type of certain residues and the classification of the molecule. This is seen most trivially in serine or cysteine proteases. However, this is also true at a more subtle level. For example, in the GPCRs, all amine receptors have an aspartic acid at one particular position. However, subclasses and subclasses within these subclasses are often also characterized by certain residue positions.

WHAT IF provides several tools to perform correlation analysis of residues among sequences, or of residues with the class of the molecules.

Correlation theory

The main idea behind correlated mutation analysis (CMA) or correlation analysis in general is that we detect residues that are conserved in sequences that perform function X, but are not conserved in the sequences that do not perform this function. For this purpose the CMC file can be used to teach WHAT IF what the function of the individual sequences is.

Sometimes one works with functions that are not a result of an evolutionary process, but accidental. A good example for that is the binding of exogenous ligands (or drugs). In such a case one wants to search for residues that

Correlation code administration

The correlated mutation module requires for many of its options a "correlated mutation code file" (CMC-file). The following options exist to work with CMC-files:

Sorting according to class identifiers (GETCMC)

The command GETCMC causes WHAT IF to prompt for the name of the correlation file. This file should hold the accession numbers of the sequences to be sorted, and the class identifiers. (See below).

The sequences in the BIGFILE will be sorted according to the order in the correlation file. If the correlation file holds accession numbers for non-existing sequences, an error message is issued, and the option is terminated. If sequences are present in the BIGFILE but not in the correlation file, these sequences may be placed at the END of the BIGFILE, or removed from the BIGFILE.

Getting new titles (GETTIT)

The command GETTIT does the same as the command GETCMC (see above), but GETTIT will additionally replace the file titles in the BIGFILE with the file titles found in the CMC file.

Creating a CMC-file (MAKCMC)

The command MAKCMC will create a simple file called FILE.CMC. The file is correctly formatted for input to GETCMC and many of the COR*** commands. The correlation code is always X, and the comment consists of the first ten characters of the file name and the title of the sequence.

Sorting a CMC file (SRTCMC)

The command SRTCMC will sort a CMC file. This is often nice, because if the CMC file is sorted such that the sequences with the same CMC codes are next to each other in the CMC file, they will also sit next to each other in all output.

Skipping residues in correlation analysis

Sometimes you want to skip certain residues in a correlation analysis. For example, completely conserved residues, or the first and last 50 residues often only hold little information, but provide lots of output. For these cases the skip file can come in handy. Since it is unpleasant work to create a skip file, there are some options to aid you with this:

Correlation file (CMCfile) format

The format of the correlation file is as follows:

One line per sequence. Each line holds the following information:

First 10 characters: Accession number.

Character 11 : class identifiers.

Characters 12-15 : reserved for future use.

Characters 16-80 : comments.


The correlation file for the alpha adrenergic receptors, for example, could look like (without the top 2 lines!):

        10        20
1234567890123456789012345
A40132    A    P1;A40132 - Alpha-2-adrenergic receptor 
P08913;   A    ALPHA-2A ADRENERGIC RECEPTOR (SUBTYPE C1
SWP22909  A    ALPHA_ADRENERGIC A-2 GCR_0200          
A40392    B    P1;A40392 - *Alpha-2-adren
P18825;   B    ALPHA-2C ADRENERGIC RECEPTOR (SUBTYPE C4
S13023    B    P1;S13023 - *alpha-2-Adrenergic receptor
D00819    B    ALPHA_ADRENERGIC A-2 GCR_0538          
M58316    B    ALPHA_ADRENERGIC A-2 GCR_0114          
P19328;   C    ALPHA-2c ADRENERGIC RECEPTOR.           
P30545;   C    ALPHA-2c ADRENERGIC RECEPTOR.           
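
Reading one line of such a file can be sketched in Python, following the column layout given above (characters 1-10 accession number, character 11 class identifier, characters 12-15 reserved, characters 16-80 comments). This is only an illustration; the function name parse_cmc_line and the dictionary keys are hypothetical.

```python
def parse_cmc_line(line):
    """Split one fixed-column CMC line into its fields."""
    padded = line.rstrip("\n").ljust(80)  # pad short lines to full width
    return {
        "accession": padded[0:10].strip(),   # characters 1-10
        "class_id": padded[10],              # character 11
        "comment": padded[15:80].strip(),    # characters 16-80
    }


if __name__ == "__main__":
    example = "P08913;   A    ALPHA-2A ADRENERGIC RECEPTOR (SUBTYPE C1"
    print(parse_cmc_line(example))
```

Because the format is positional, the accession number is taken exactly as it stands in the first ten columns, including any trailing punctuation such as the semicolon in the example.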

Create a skip file (MAKTFF)

The command MAKTFF will create a skip file called SKIP.FIL. This file can be used to skip the residues listed in it in many of the correlation options. MAKTFF will write in this file the residues that are conserved. You will be prompted for the sequence range, the residue range, and the conservation percentage above which a residue is called conserved.
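
Selecting the conserved positions can be sketched in Python as follows. The sketch assumes aligned, equally long sequences and treats a position as conserved when its most frequent residue reaches the given percentage; whether WHAT IF uses "greater than" or "at least" is not stated, so the comparison used here is an assumption, as is the name conserved_positions. The layout of SKIP.FIL itself is not reproduced.

```python
from collections import Counter


def conserved_positions(aligned, cutoff_percent):
    """Return 1-based positions whose most common residue reaches the cutoff."""
    positions = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        most_common_count = Counter(column).most_common(1)[0][1]
        percentage = 100.0 * most_common_count / len(column)
        if percentage >= cutoff_percent:  # assumption: cutoff is inclusive
            positions.append(pos + 1)
    return positions


if __name__ == "__main__":
    demo = ["ACD", "ACE", "AGE", "ACE"]
    print(conserved_positions(demo, 90.0))
```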

Create a skip file (MAKPTF)

The command MAKPTF will create a skip file called SKIP.FIL. This file can be used to skip the residues listed in it in many of the correlation options. MAKPTF will write in this file the residues that are conserved in the files labeled with a + sign in the CMC file that you would use for any of the CORAN-like options. You will be prompted for the sequence range, the residue range, and the conservation percentage above which a residue is called conserved.

Detection of residues with correlated mutational behaviour

Several options exist to search for correlated behaviour among residues. These options can be divided in three groups: CORMUT, CORAN1-like, and the +/- correlations. CORMUT looks for residues that mutate in tandem. The CORAN1-like options look for residues or residue pairs that mutate together with a code entered via the CMC-file. The other COR*** options correlate residues with a CMC code that can only be plus or minus.

Be aware that the correlation values that are listed are NOT correlation coefficients in the mathematical sense.

Detection of correlated mutations (CORMUT)

CORMUT will cause WHAT IF to prompt you for a range of sequences and for a range of residues in these sequences. It will then search for all moderately conserved pairs of positions that show correlated mutational behaviour. In other words, pairs of residue positions are searched where mutations are not too frequent, but where, if a mutation occurs between two sequences at the one position, a mutation between the same two sequences is also very likely at the other position.

After the calculations, the maximal mutation correlation coefficient is displayed, and WHAT IF prompts for a cutoff correlation coefficient. All pairs that show correlated mutational behaviour with mutation coefficient above this cutoff will be listed, together with the actual residues, and a frequency of all observed exchanges.

Detection of correlated mutations (CORMUN)

The option CORMUT (see above) requires a certain degree of variability for the residue positions. CORMUN, in contrast, does not take variability into account, and will thus call a pair of completely conserved residues highly correlated.

Detection of correlated mutations (CORMUM)

The CORMUM option does a correlation analysis just like the basic CORMUT option, but rather than scoring binary (+1 for conserved or mutated in tandem, and 0 otherwise) CORMUM scores all pairs by the difference between the exchange matrix scores for the two positions. See the SETMAT option about exchange matrices.
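
The binary scoring mentioned above (+1 for conserved or mutated in tandem, 0 otherwise) can be sketched in Python for one pair of positions. This is a rough illustration and not the WHAT IF implementation: the averaging over all sequence pairs, the absence of any variability requirement, and the name tandem_score are assumptions for the example.

```python
from itertools import combinations


def tandem_score(aligned, pos_i, pos_j):
    """Fraction of sequence pairs where positions i and j behave alike.

    A sequence pair scores 1 when both positions are unchanged or both
    are changed between the two sequences, and 0 otherwise.
    Positions are 0-based indices into the aligned sequences.
    """
    agree = 0
    total = 0
    for seq_a, seq_b in combinations(aligned, 2):
        changed_i = seq_a[pos_i] != seq_b[pos_i]
        changed_j = seq_a[pos_j] != seq_b[pos_j]
        total += 1
        if changed_i == changed_j:
            agree += 1
    return agree / total if total else 0.0


if __name__ == "__main__":
    print(tandem_score(["DR", "DR", "EK", "EK"], 0, 1))
    print(tandem_score(["DR", "DK", "ER", "EK"], 0, 1))
```

Without a variability requirement, a pair of completely conserved positions obtains the maximal score in this sketch, which is the behaviour described for CORMUN rather than for CORMUT.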

Detection of correlated mutations (CORMUF)

Does about the same as CORMUT, but puts a hefty penalty on missing residues.

Correlating sequences against external factors

Directed detection of correlations (CORAN1)

Sometimes there is a strong correlation between the type of certain residues and the classification of the molecule. This is seen most trivially in serine or cysteine proteases. However, it is also true at a more subtle level. For example, in the GPCRs, all amine receptors have an aspartic acid at a particular position. However, subclasses and subclasses within these subclasses are often also characterized by certain residue positions.

WHAT IF provides a method for detecting these residues. To do so, a form of correlated mutational behaviour as described above is incorporated that correlates residues with class identifiers. A class identifier is a character or number that is characteristic for the class, or subclass of the sequence. Every sequence can have one class identifier. (see above for a description of the file needed to instruct WHAT IF about the class identifiers).

When the command CORAN1 is issued, WHAT IF prompts for the name of the correlation file. WHAT IF will ask for the number of the profile that was used for the alignment. It will also prompt for a residue range. If you want, you can provide a so called skip file. This is a file that holds the numbers of the residues that should not be used in the analysis, give 0 (zero) if you do not have or do not want to use such a file. The results will be similar to those described for the CORMUT command (see above), but instead of showing two correlated residues, it will present the correlation between the class identifiers and the residues. This is a true correlation, and not, as for the CORMUT option, a noise multiplied correlation.

The sequences in the BIGFILE will be sorted according to the order in the correlation file. If the correlation file holds accession numbers for non-existing sequences, an error message is issued, and the option is terminated. If sequences are present in the BIGFILE but not in the correlation file, the user can choose between getting those sequences placed at the end of the BIGFILE or removing them from the BIGFILE.

Detecting correlations of residues with the class (CORAN2)

This option functions similarly to CORAN1, but is less strict in the negative correlations.

Detecting correlations of residues with +/- classes (CORPM1)

CORPM1 functions similarly to the CORAN options mentioned above. However, the CMC file is only allowed to hold the CMC codes + (plus) and - (minus). This presents some restrictions, but accelerates the computations so much that correlations over more residues at the same time become calculable.

The principle is the following: for every residue position, the most prevalent residue among all sequences marked with a + is determined. The method then considers all pairs of residue positions and scores the cases where both residues agree with these most prevalent residues in the + labeled sequences, whereas in the - labeled sequences at least one of them differs from the majority of the + labeled sequences.

If this sounds complicated to you, you are right. Just try it, it does not take too much time.

Detecting correlations of residues with +/- classes (CORPM2)

CORPM2 is very similar to CORPM1. The only difference is that CORPM2 is more critical about the - labeled residues. They have to differ from the + labeled ones.

Detecting correlations based on residue types (CORGR1)

This option is still being worked on. If you want to try it, be aware that WHAT IF could crash...

Sorting and selecting sequences WALSRT

Often one wants to focus on a subset of the available sequences. Rather than introducing active and inactive sequences (which means doing complicated things inside the program), I have decided for a rather crude approach. The user can simply remove any undesired sequences from the BIGFILE. Since removing a sequence from the BIGFILE is irreversible, making a good backup of the BIGFILE (normally called WALIGN.BIG) is recommended before any options in this menu are used.

Keeping only files with a certain keyword in it (KPNAME)

The command KPNAME will cause WHAT IF to prompt you for a series of keywords or text strings. All sequences that have one of these strings rendered EXACTLY, either in the file name or the title, will be tagged to be kept. Although matching is exact, the matching of text fragments is not case-sensitive. After the search, the number of sequences found with one of these strings in them will be listed, and the user is asked to confirm deletion of all the other files.

Deleting files with a certain keyword in it (DELNAM)

The command DELNAM will cause WHAT IF to prompt for a series of keywords or text strings. All sequences that have one of these strings rendered exactly, either in their file name or in their title, will be tagged to be removed from the BIGFILE. Although matching is exact, the matching of text fragments is not case-sensitive. After the search, the number of sequences found with one of these strings in them will be listed, and the user is asked to confirm deletion of all these files.
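The keyword matching used by KPNAME and DELNAM can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dictionary with "file_name" and "title") and the function name split_by_keywords are hypothetical, and the sketch takes "exact but not case-sensitive" to mean a case-insensitive substring match.

```python
def split_by_keywords(records, keywords):
    """Split records into (matching, non-matching) on case-insensitive substrings."""
    keys = [k.lower() for k in keywords]
    hits, rest = [], []
    for rec in records:
        # Look in the file name and in the title, as the text describes.
        fields = (rec["file_name"].lower(), rec["title"].lower())
        if any(k in f for k in keys for f in fields):
            hits.append(rec)
        else:
            rest.append(rec)
    return hits, rest

records = [
    {"file_name": "OPSD_HUMAN", "title": "Rhodopsin"},
    {"file_name": "ACM1_HUMAN", "title": "Muscarinic acetylcholine receptor M1"},
    {"file_name": "OPSB_HUMAN", "title": "Blue-sensitive opsin"},
]
hits, rest = split_by_keywords(records, ["opsin"])
```

With this split, a KPNAME-like operation would keep hits and delete rest, whereas a DELNAM-like operation would delete hits and keep rest.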

Deleting double occurring sequences (KILDBL)

The command KILDBL will cause WHAT IF to prompt for two sequence ranges. The same range may be specified twice. Any sequence in the second range that is completely identical to a sequence in the first range will be removed from the BIGFILE. If the same range is specified twice and two identical sequences are detected, the sequence listed later in the BIGFILE will be removed.
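The removal rule can be illustrated with a short sketch. It assumes that sequences are plain strings, that ranges are 1-based and inclusive, and that the function name kill_doubles is hypothetical; it only reports which sequence numbers would be removed.

```python
def kill_doubles(seqs, first, second):
    """Return the 1-based numbers of sequences in `second` that duplicate `first`."""
    earliest = {}  # sequence string -> lowest number at which it occurs in `first`
    for i in range(first[0], first[1] + 1):
        earliest.setdefault(seqs[i - 1], i)
    remove = []
    for j in range(second[0], second[1] + 1):
        i = earliest.get(seqs[j - 1])
        # A sequence never counts as a duplicate of itself.
        if i is not None and i != j:
            remove.append(j)
    return remove

seqs = ["AAA", "CCC", "AAA", "GGG", "CCC"]
same_range = kill_doubles(seqs, (1, 5), (1, 5))
two_ranges = kill_doubles(seqs, (4, 5), (1, 3))
```

When the same range is given twice, only the later copy of each identical pair ends up in the removal list, which matches the behaviour described above.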

Delete sequences that differ too much (DELALI)

The command DELALI will cause WHAT IF to prompt you for a profile number and two cutoff percentages. These are cutoffs on the percentage identity between the sequence and the consensus sequence of the profile, and on the convolution between the sequence and the profile (these two numbers are shown in the first table in the HSSP output file generated by the MAKHSP command).

All sequences that have either one of these percentages below the given cutoff are deleted from the BIGFILE.

Determining the class of sequences (UNKTYP)

Often sequences are obtained for which the biological function is unknown or only partly understood, and consequently it is difficult to name these sequences. The option UNKTYP allows for the comparison of sequences with a series of profiles. WHAT IF prompts for the name of a file that holds the names of all profile files, one profile file name per line. WHAT IF also prompts for the range of sequences. All sequences will be compared with all profiles, and the convolution of the sequence with the profile, after optimal alignment, will be listed.

Sorting sequences (SRTPCT)

The command SRTPCT will cause WHAT IF to prompt you for a profile. For all sequences the similarity with this profile is calculated and the sequences are sorted by descending similarity.

Sorting sequences (SRTPID)

The command SRTPID will cause WHAT IF to prompt you for a profile. For all sequences the identity with this profile is calculated and the sequences are sorted by descending identity.

Sort by accession code (SRTACC)

The command SRTACC will sort the sequences in the BIGFILE as a function of their accession code (sorting in increasing order).

Sort by file name (SRTFLN)

The command SRTFLN will sort the sequences in the BIGFILE as a function of their file name (sorting in increasing order).

Sort by grey meatball value (SRTGMB)

The command SRTGMB will sort the sequences in the BIGFILE as a function of their grey meatball character (sorting in increasing order). The grey meatball value gets higher when a sequence looks more like all other sequences.

Graphical commands WALGRA

The command WALGRA calls the menu for graphic representation of sequences.

Display sequences (coloured) at the graphics (GRASQS)

When the command GRASQS is issued, WHAT IF will prompt for a range of sequences and a range of residues. The specified residues in the sequences will be sent to the graphics window as one or more MOL-items. The residues are coloured by residue type (see COLSQS). Limited interactive graphics are available with the local command GO.

Changing the residue colours (COLSQS)

The colours for the residues are determined by the values given in the file SEQCOL.FIL. The default for this file looks like:

A 240
C 180
D 120
E 120
F 260
G 220
H 120
I 240
K  40
L 240
M 240
N  80
P 220
Q  80
R  40
S 220
T 220
V 240
W 260
Y 260
X 150
- 350

If you have a file called SEQCOL.FIL in your local directory, this file will be used rather than the default file. The command COLSQS will bring the local copy of this file into the editor. If you do not have a local copy of this file yet, the default file will first be created in the present directory, and thereafter the file will be brought into the editor. After leaving the editor the file will be automatically read by WHAT IF, and the residues on the screen get the colours you requested. If the GRASQS option is run again, these new colours will also be used for the new sequences.
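If you want to check or generate a SEQCOL.FIL outside WHAT IF, a small reader such as the sketch below may help. It assumes that the file is whitespace separated with one residue code and one integer per line, as in the default shown above; it makes no assumption about what the numbers mean to the graphics. The fallback to the X entry for unlisted residues is an assumption of this sketch, not documented behaviour.

```python
def parse_seqcol(text):
    """Read 'residue value' lines into a dictionary; other lines are skipped."""
    colours = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            colours[parts[0]] = int(parts[1])
    return colours

def colour_for(residue, colours):
    """Look up a residue; fall back to the X entry (assumed, not documented)."""
    return colours.get(residue, colours.get("X"))

example = "A 240\nK  40\nX 150\n- 350\n"
colours = parse_seqcol(example)
```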

Interactive sequence graphics (SHOW)

The command SHOW in the WALIGN related menus will pass control to the graphics window as is usually done by GO. The difference with GO is that the main menu at the right side of the screen now has many different options. These are:

WAIT : Cancel option.
T >  : Translate a few residues to the right.
T <  : Translate a few residues to the left.
T >> : Translate many residues to the right.
T << : Translate many residues to the left.
T ^  : Translate a few sequences upwards.
T V  : Translate a few sequences downwards.
T ^^ : Translate many sequences upwards.
T VV : Translate many sequences downwards.
COLR : Allow for interactive modification of the colouring scheme.
M1   : Store the present view in view memory 1.
M2   : Store the present view in view memory 2.
M3   : Store the present view in view memory 3.
M4   : Store the present view in view memory 4.
VMS  : Spawn a subprocess (create a shell).
CHAT : Pass control back to the text window.
RSET : Reset the viewing parameters.
 >>  : Scale the display up.
 <<  : Scale the display down.
MOV+ : Move one step forward in the movie.
MOV- : Step one step back in the movie.
HELP : Activate/deactivate the interactive HELP option.

Plotting the sequence in the membrane (2DPLOT)

The command 2DPLOT requires a file called ARBNUM.POS. This file has the following format:

  170
  111 -8.0  8.0  0.0
  112 -7.0  7.5  0.0
  113 -6.0  7.0  0.0
164 lines removed for clarity...
  730 16.0 -3.0  0.0
  731 17.0 -3.5  0.0
  732 18.0 -4.0  0.0

The first number indicates the number of lines to follow. Thereafter follow, for each residue, the arbitrary sequence number (this is the second number given to it in the profile file) and its position in space. At this position the residue will be shown in a small box.

This option is not yet entirely bug-free.
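The format can be checked with a small reader like the one sketched here. It assumes free-format, whitespace-separated numbers; WHAT IF itself may well read the file with fixed columns, so treat this only as a sanity check. The function name parse_arbnum is hypothetical.

```python
def parse_arbnum(text):
    """Return {arbitrary sequence number: (x, y, z)} from ARBNUM.POS-style text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0].split()[0])          # first number: lines to follow
    positions = {}
    for ln in lines[1:1 + count]:
        number, x, y, z = ln.split()
        positions[int(number)] = (float(x), float(y), float(z))
    if len(positions) != count:
        raise ValueError("header announces %d lines, found %d" % (count, len(positions)))
    return positions

example = "    3\n  111 -8.0  8.0  0.0\n  112 -7.0  7.5  0.0\n  113 -6.0  7.0  0.0\n"
positions = parse_arbnum(example)
```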

Plotting the sequence in the membrane (PLOT2D)

The command PLOT2D is an alternative spelling for 2DPLOT.

Boxed sequence alignment output (PRETYG)

The option PRETYG will cause WHAT IF to prompt you for a set of sequences and the ranges in these sequences. It will put the requested sets of residues on the screen, coloured and with conserved residues boxed. See the parameter menu for parameters that can modify the output of this option.

Boxed sequence alignment output (PRETYP)

The option PRETYP will cause WHAT IF to prompt you for a set of sequences and the ranges in these sequences. It will put the requested sets of residues in a postscript file, with the conserved residues boxed. See the parameter menu for parameters that can modify the output of this option.

You will also be prompted for the sequence to be used to determine how to put numbers on top of the alignment. Please give the rank order of the sequence in the output, not the number in the BIGFILE. So, if you plot sequences 3 through 7 and you want the residues in the output to be numbered according to sequence 3, you have to answer the question about which sequence to use for the numbering not with 3 but with 1, because BIGFILE sequence 3 is sequence 1 in the output. If you enter a ridiculously large sequence number, e.g. 1000000, the numbering will not follow any sequence, but simply count everything, residues and insertions.

After that you will be asked for the residue ranges above which an * is to be plotted, and the ranges to be shaded with dark grey or light grey respectively. Also here, you have to give the number in the output, not the residue number. So, if you want residue 61 to be shaded grey or labeled with an * on the top of the page, and you are plotting the residues from 60 to 80, you should ask residue 2 to be shaded, because residue 61 is in this example the second residue in the output.
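Both conversions come down to the same bit of arithmetic, shown here as a tiny sketch. The function name rank_in_output is hypothetical, and the sketch assumes that a contiguous range is plotted.

```python
def rank_in_output(number, first_plotted):
    """Convert a BIGFILE sequence number or residue number to its rank in the output."""
    return number - first_plotted + 1

# Sequences 3 through 7 are plotted; numbering should follow BIGFILE sequence 3.
sequence_answer = rank_in_output(3, 3)

# Residues 60 to 80 are plotted; residue 61 should be shaded.
residue_answer = rank_in_output(61, 60)
```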

Making a WWW-based database WALSER

The WALSER menu is still highly experimental, and therefore scarcely documented. The WALSER menu is the main ingredient for the production process of the WWW based information system GPCRDB.

It is envisaged that in due time this menu will also be useful for the creation and maintenance of other class specific databases.

Initialize the big server file (GPCINI)

The WALSER menu stores all kinds of information (file names, locations, profile names, mutation pointers, etc.) in a big file called SERVER.BIG. The first time around you need to initialize this file with the GPCINI command.

Open the existing big server file (GPCOPE)

The WALSER menu stores all kinds of information (file names, locations, profile names, mutation pointers, etc.) in a big file called SERVER.BIG. The first time around you need to initialize this file with the GPCINI command, but every other time you should open this file with the GPCOPE command.

Please use GPCOPE before any of the commands that follow in this chapter.

Show all statistics about the big server file (GPCSTS)

This option lists roughly all statistics about the file SERVER.BIG.

Shows summary statistics about the big server file (GPCSRC)

This option lists some of the vital statistics about the file SERVER.BIG.

Read the classes from _7TM.CLASSES (GPCGCL)

Part of the GPCRDB (and future similar) projects is the classification of sequences in families, sub-families, sub-sub-families, etc. The file _7TM.CLASSES holds this hierarchy of classes. The command GPCGCL reads this file and stores the data in the big server file SERVER.BIG.

Get the profile information from _7TMPROF.LIST (GPCGPR)

Every family, sub-family, etc., has its own, optimised, profile. The file _7TMPROF.LIST holds the names of these profiles. The command GPCGPR reads this file and stores the data in the big server file SERVER.BIG.

Read the profiles obtained with GPCGPR (GPCRPR)

After running GPCGPR, you can actually read all profiles with the command GPCRPR. The profiles will end up in the BIGFILE. The information about which profile belongs with which family is stored by the GPCGPR option in the big server file.

Use the profiles to classify the sequences (GPCSPL)

The command GPCSPL will align all sequences in the BIGFILE against all profiles in the BIGFILE. If a hit is found, the sequence is copied to the corresponding directory.

Use the profiles to classify the sequences (GPCSEL)

Does the same as GPCSPL, the only difference being that the sequences are not moved into the corresponding directory; instead, that directory is just listed on the screen.

Loop over all classes and update them (GPCMAL)

The GPCMAL option will run over all directories, and in every directory it will update all files (.HSSP, .MSF, .PRF, etc).

Read the 7TM directories file (GPCO7T)

The option GPCO7T will cause WHAT IF to read the file 7TM.FILES. As this normally should go automatically, there should never be a reason to run this option.

Gets the files from 7tmrlist.txt local (GPCG7T)

The sequence files that we want to use should be present in the local directory. There are of course many ways to get them there, and a WHAT IF independent script seems to be the preferred way. However, at EMBL you can obtain those files that are located in the SwissProt directories by listing them in the file 7tmrlist.txt and running the command GPCG7T.

Other commands

Aligning two sequences (2ALIGN)

The 2ALIGN command can be used to align two sequences. WHAT IF will prompt for two sequence numbers, a gap open penalty and a gap elongation penalty. The default penalties that are suggested are meant to be used with the default Dayhoff type matrix obtained with the SETMAT command. Otherwise you are on your own, and believe me, there is much you can do wrong here....
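To get a feeling for how the two penalties interact, you can experiment with the sketch below. It is not the 2ALIGN algorithm, which this manual does not specify. It is a generic global alignment score with affine gaps (Gotoh-style dynamic programming), and it uses a plain match/mismatch score instead of a Dayhoff type matrix. In this sketch the first position of a gap costs gap_open and every further position costs gap_elong; all names and default values are assumptions.

```python
NEG = float("-inf")

def align_score(a, b, gap_open=10.0, gap_elong=1.0, match=5.0, mismatch=-4.0):
    """Best global alignment score of a and b with affine gap penalties."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_elong
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_elong
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Open a new gap from M or from the other gap state, or extend one.
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_elong)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_elong)
    return max(M[n][m], X[n][m], Y[n][m])

one_gap = align_score("ACDE", "ACE")    # three matches, one gap of length 1
long_gap = align_score("AAAA", "AA")    # two matches, one gap of length 2
```

Raising gap_open relative to gap_elong favours a few long gaps over many short ones, which is why the two penalties should be chosen together with the scoring matrix.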

Writing aligned sequences in a HSSP file (MAKHSP)

The command MAKHSP requires that you input a profile number, a range of sequences and a HSSP file name. You will also be prompted if you want to calculate the variability (if you say yes, this will take a lot of CPU time, so you normally only say yes in the final step). If you have only 1 profile in the big file, you will not be prompted for the profile number. If there already exists a file with the same name as the HSSP file you want to generate, you will be asked if you want to overwrite the old one.

List pairwise identities (SHOIDM)

The command SHOIDM will ask you for a sequence range. All pairwise sequence identities in the overlapping areas are calculated and listed as percentages in the first table. The second table lists the differences rather than the similarities. The last table shows the similarities after subtraction of the smallest number found in the table.
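The three tables can be mimicked with a few lines of code. The sketch below rests on assumptions: sequences are pre-aligned strings of equal length with - for gaps, the "overlapping area" is taken to be the columns in which both sequences have a residue, and the difference is taken as 100 minus the identity. The function names are hypothetical.

```python
def pairwise_identity(a, b):
    """Percentage identity over the columns where both sequences have a residue."""
    both = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not both:
        return 0.0
    return 100.0 * sum(x == y for x, y in both) / len(both)

def identity_tables(seqs):
    """Return (identities, differences, identities minus the smallest identity)."""
    n = len(seqs)
    ident = [[pairwise_identity(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
    diff = [[100.0 - v for v in row] for row in ident]
    lowest = min(v for row in ident for v in row)
    shifted = [[v - lowest for v in row] for row in ident]
    return ident, diff, shifted

ident, diff, shifted = identity_tables(["ACDE", "ACDF", "GGDE"])
```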

List pairwise identities (HISTID)

The command HISTID lists all pairwise similarity percentages between two ranges of sequences that you will be prompted for. Additionally you get a histogram of the observed percentages, and some statistics like the average similarity, and the standard deviation, etc.
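The kind of summary described here can be sketched as follows, starting from a list of percentages that has already been computed. The bin width of 10 percent, the use of the population standard deviation, and the function name summarise are assumptions; HISTID may bin and normalise differently.

```python
import statistics

def summarise(percentages, bin_width=10):
    """Return (histogram counts, mean, population standard deviation)."""
    bins = [0] * (100 // bin_width)
    for p in percentages:
        # A value of exactly 100 is counted in the last bin.
        index = min(int(p // bin_width), len(bins) - 1)
        bins[index] += 1
    return bins, statistics.mean(percentages), statistics.pstdev(percentages)

bins, mean, sd = summarise([75.0, 50.0, 25.0, 100.0])
for low, count in zip(range(0, 100, 10), bins):
    print("%3d-%3d %s" % (low, low + 10, "*" * count))
```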

Searching original sequence files (KWCHEK)

This option is not yet ready.

Fetching files from databases with GCG (MFETCH)

The command MFETCH prompts you for a sequence range. It creates a file called FETCH.LIS that can be edited to be used by GCG to fetch the sequences from the database(s).

Hidden commands

Graphical display of pairwise similarities (WLHPID)

Gives a matrix with coloured squares. Colours relate to the pairwise identity percentage.

Read sequences from LFK style file (WLHLFK)

For debugging only.

Over-rule profile gap penalties with (WLHGAP)

Normally gap penalties are an integral part of the profile. If you want to use the gap open and gap elongation penalties that are set in the parameter file (parameters 373 and 374) you can use this WLHGAP option.

Reads sequences from LFK-daily-update-files (WLHLFD)

For debugging only.

Checks if sequences have the GPCR fingerprint (WLHFIN)

This is a GPCR specific option.

Delete sequences that miss the GPCR fingerprint (WLHDFN)

This is a GPCR specific option.

Do 7TM HSSP process for one molecule class (WLH1ST)

This is a GPCR specific option.

Read sequences from combined swissprot files (WLHSWS)

This is a GPCR specific option.

Delete DNA sequences from the BIGFILE (WLHDNA)

Often a DNA sequence slips in with a large family of protein sequences. WLHDNA will remove those that accidentally entered the BIGFILE.

Read sequences from 'list2seq' file (WLHL2S)

This is a GPCR specific option.

Delete fragments from the BIGFILE (WLHDFR)

This is a GPCR specific option.

Cluster sequences after alignment (WLHCLU)

This option will try to optimise the order of the sequences in the BIGFILE such that in the WLHPID output the highest homologies are closest to the diagonal.

List WALIGN related common parameters and values (WLHSPA)

This option just does what the title suggests.

Creates the file PDB.LIS (WLHCPL)

This is a GPCR specific option.

Searches for decamers that can be good PCR probes (WLHPCR)

The option is supposed to do just what the title suggests, but not being a molecular biologist, I doubt if I have implemented a smart algorithm.

Write unaligned sequences as single records in a file (WLHWIN)

Writes a range of sequences as long single strings of characters in a file. One line per sequence. That file is often useful for manual operations. See WLHRIN about reading the sequences back in. See also WLHWIA.

As HIDE08, but including gap penalty updates (WLH2ST)

This is a GPCR specific option.

Reads sequences back that were written with WLHWIN (WLHRIN)

Reads back the sequences written by WLHWIN. Normally you would have edited the file between WLHWIN and WLHRIN. Also, don't forget to delete the original sequences from the BIGFILE if you don't want them double.

Searches for a fragment of requested weight (WLHGWT)

This option runs over a series of sequences, tries every fragment, and calculates the weight of that fragment. If the fragment falls within the limits given by the user, the fragment is listed.
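The search can be sketched as below. The sketch assumes that "weight" means molecular weight in daltons, computed from average residue masses plus one water molecule per fragment; this manual does not say how WHAT IF calculates it. The function name is hypothetical, and residue codes outside the 20 standard amino acids will raise a KeyError.

```python
# Average residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

def fragments_in_weight_range(seq, low, high):
    """List (first, last, weight) for every fragment with low <= weight <= high."""
    hits = []
    for start in range(len(seq)):
        mass = WATER
        for end in range(start, len(seq)):
            mass += RESIDUE_MASS[seq[end]]
            if mass > high:
                break  # longer fragments from this start are only heavier
            if mass >= low:
                hits.append((start + 1, end + 1, round(mass, 2)))
    return hits

hits = fragments_in_weight_range("GAG", 140.0, 150.0)
```

Residue positions in the result are 1-based and inclusive.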

Writes aligned sequences as single records in a file (WLHWIA)

This option is similar to WLHWIN. However, WLHWIN writes the sequences as compactly as possible. WLHWIA also writes sequences in a file at one sequence per line, but it inserts - signs for deletions. This is a handy option if you want to do some hand optimisation of alignments.
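The difference between the two output styles, and the one-line-per-sequence layout, can be illustrated with this sketch. It assumes that aligned sequences are held in memory as strings with - for deletions; the function names are hypothetical and the real files may carry extra information.

```python
import io

def write_records(seqs, handle, aligned):
    """Write one sequence per line; drop the '-' signs unless aligned is True."""
    for seq in seqs:
        handle.write((seq if aligned else seq.replace("-", "")) + "\n")

def read_records(handle):
    """Read the sequences back, one per non-empty line."""
    return [line.rstrip("\n") for line in handle if line.strip()]

seqs = ["AC-DE", "ACFDE", "A--DE"]

compact = io.StringIO()
write_records(seqs, compact, aligned=False)   # WLHWIN-like output
with_gaps = io.StringIO()
write_records(seqs, with_gaps, aligned=True)  # WLHWIA-like output

with_gaps.seek(0)
restored = read_records(with_gaps)
```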

An option that does something (WLHXXX)

This option indeed does something.

Runs all sequences against all profiles (WLHSPR)

This option aligns all sequences in the BIGFILE one after the other against each profile. The results are reported.

Converts DNA sequence into protein sequence (WLHNAP)

Translates DNA sequences into protein sequences.
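For reference, a translation with the standard genetic code can be sketched as follows. The choices made here are assumptions rather than documented WLHNAP behaviour: only the first reading frame is translated, stop codons are written as *, codons with unknown bases become X, and a trailing incomplete codon is ignored.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate the first reading frame of a DNA (or RNA) string."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)

protein = translate("ATGGCCTGA")
```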

Draws phylogenetic tree (PHYTRE)

Takes a range of sequences. Assumes that you pre-aligned them, and makes a phylogenetic tree using the neighbour joining algorithm.

Embed sequence space in 3 dimensions (EMBED3)

The option EMBED3 will do an eigenvalue analysis on the pairwise sequence identity matrix. The result is a series of crosses in 3D space that represent the sequences. The distance between any two points is the best possible measure for the distance between the sequences in sequence space. The crosses are pickable.

Parameters for the WALIGN related menus

(SEQSCO)

One day this option will be a cutoff for the sequence identity percentage in profile alignment for those options that scan large volumes of sequences.

(PRFSCO)

One day this option will be a cutoff for the sequence-profile convolution value in profile alignment for those options that scan large volumes of sequences.

Pairwise alignment gap open penalty (GAPOPE)

The option 2ALIGN uses a gap open penalty and a gap elongation penalty. Upon running 2ALIGN you are also asked to give these penalties, so setting this parameter in the PARAMS menu is somewhat redundant.

Pairwise alignment gap elongation penalty (GAPELO)

The option 2ALIGN uses a gap open penalty and a gap elongation penalty. Upon running 2ALIGN you are also asked to give these penalties, so setting this parameter in the PARAMS menu is somewhat redundant.

Profile alignment gap open penalty (PRFOPE)

The option ALIPRF uses a position specific gap open penalty and gap elongation penalty. These penalties are stored in the profile. With the option WLHGAP you can set all gap open penalties in one shot to the value of the PRFOPE parameter. WLHGAP also sets all gap elongation penalties according to the PRFELO parameter.

Profile alignment gap elongation penalty (PRFELO)

The option ALIPRF uses a position specific gap open penalty and gap elongation penalty. These penalties are stored in the profile. With the option WLHGAP you can set all gap elongation penalties in one shot to the value of the PRFELO parameter. WLHGAP also sets all gap open penalties according to the PRFOPE parameter.

Character size in PRETY* options (CHRSIZ)

The parameter CHRSIZ allows you to change the character size in the PRETYP and PRETYG options.

Identities needed for boxing (LIMBOX)

The parameter LIMBOX determines how many identical residues are needed at one position in an alignment before the PRETY* options decide to draw a box around them.
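The effect of LIMBOX can be illustrated with a short sketch. It assumes pre-aligned strings of equal length, ignores - signs when counting, and boxes a column when its most common residue occurs at least LIMBOX times; whether the PRETY* options count in exactly this way is an assumption.

```python
from collections import Counter

def boxed_columns(aligned, limbox):
    """Return one True/False per alignment column: would a box be drawn?"""
    boxed = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        boxed.append(bool(counts) and counts.most_common(1)[0][1] >= limbox)
    return boxed

alignment = ["ACDE", "ACDF", "GCDE"]
strict = boxed_columns(alignment, 3)
loose = boxed_columns(alignment, 2)
```

With three sequences, a value of 3 boxes only the fully conserved columns, while a value of 2 also boxes columns in which a single sequence deviates.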

Residues per line in PRETYP options (RESLIN)

The RESLIN parameter determines how many residues there will be per line in the PRETYP option.