The structure fragment database (DGLOOP)

Introduction.

This chapter should start with honouring Alwyn Jones for having the idea to create a fragment database. Even though I did not understand how his method was implemented, and I therefore had to redesign the whole procedure around another (faster) algorithm, the idea was his.

The idea is that all proteins are built from a limited number of short fragments that together cover all possible backbone conformations. Therefore, if one has a large enough fragment database, it must be possible to build a new protein using only these fragments. The problem, however, is how to find, for example, all groups of 9 amino acids in the whole database that have an RMS deviation smaller than 1.0 Angstrom on the C-alpha positions when fitted to a group of 9 amino acids in the molecule we are working on. Doing this by brute force would take around 50 hours of CPU time on a micro VAX, or about 3 minutes on a 1997 workstation. Using inversely sorted C-alpha distance tables with integer distance pointer arrays can speed this process up by many orders of magnitude. The ability to find database fragments that superimpose well on a part of the molecule you are working on has been incorporated in many places in the program WHAT IF.
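
The role of C-alpha distance tables can be illustrated with a small sketch. This is not the WHAT IF implementation (which uses sorted tables and pointer arrays that are not described here); it only shows that internal C-alpha distances can be compared without any superposition, so that poor candidates can be rejected cheaply. The function names and the toy coordinates are assumptions made for this example.

```python
import math

def distance_table(ca):
    """All pairwise C-alpha distances within one fragment, as a flat list."""
    n = len(ca)
    return [math.dist(ca[i], ca[j]) for i in range(n) for j in range(i + 1, n)]

def candidate_windows(query_ca, chain_ca, max_dist_error):
    """Start indices of windows in chain_ca whose internal distances all
    agree with those of the query within max_dist_error (Angstrom)."""
    length = len(query_ca)
    query_table = distance_table(query_ca)
    hits = []
    for start in range(len(chain_ca) - length + 1):
        window_table = distance_table(chain_ca[start:start + length])
        if all(abs(q - w) <= max_dist_error
               for q, w in zip(query_table, window_table)):
            hits.append(start)
    return hits

# Toy data: a straight chain with 3.8 Angstrom spacing and one kinked atom.
chain = [(3.8 * i, 0.0, 0.0) for i in range(12)]
chain[6] = (3.8 * 6, 2.5, 0.0)
# The query is also straight, but points along another axis.
query = [(0.0, 3.8 * i, 0.0) for i in range(5)]
straight_starts = candidate_windows(query, chain, max_dist_error=0.5)
```

Because distances do not change under rotation or translation, the rotated query still matches the straight windows, and every window that contains the kinked atom is rejected. Only the surviving candidates would need a real superposition.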

Almost all these commands start with the two characters DG. This follows Alwyn who used the same nomenclature.

Because most DG*** options at some point explicitly use the middle amino acid of the stretch, your group length should always be odd (it can be set with the LENSET command). The DG*** commands are all activated from the DGLOOP menu. Type DGLOOP to enter this menu.

Implications of the algorithm

WHAT IF accepts every hit that meets the user-defined (or default) criteria for the RMS and maximal errors. However, most options have an upper limit on the number of hits. This explains why, for example, you can work with rubredoxin but not find the perfect hit, even though rubredoxin is in the database: enough hits were found before the database entry that came from rubredoxin was actually inspected. If you want to be sure that you will get all hits, set the number of hits high and the search criteria tight. Also, hits that give an RMS better than 0.00001 Angstrom are skipped, because that normally indicates that the database contains the exact protein you are working with. Using those hits would skew your view.
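
A minimal sketch of this behaviour, assuming that the database is scanned in a fixed order and that the search stops as soon as the hit limit is reached. The function, the entry labels, and the RMS values are invented for the illustration.

```python
def collect_hits(candidates, max_rms, max_hits, self_rms=0.00001):
    """candidates: iterable of (label, rms) pairs in database order."""
    hits = []
    for label, rms in candidates:
        if rms < self_rms:
            # A (near) perfect fit: probably the protein itself, so skip it.
            continue
        if rms <= max_rms:
            hits.append((label, rms))
            if len(hits) == max_hits:
                break  # entries further down are never inspected
    return hits

# Hypothetical database order; entry_d fits best but comes late.
database = [("entry_a", 0.62), ("entry_b", 0.55), ("entry_c", 0.90),
            ("entry_d", 0.05), ("entry_e", 0.0)]

loose = collect_hits(database, max_rms=0.7, max_hits=2)
tight = collect_hits(database, max_rms=0.1, max_hits=2)
```

With the loose criterion the limit of two hits is reached before entry_d is inspected. With the tight criterion only entry_d is accepted, and entry_e is skipped as an exact match.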

Searching in the database

Finding stretches (DGFIND)

DGFIND will cause WHAT IF to prompt you for a residue number. This cannot be a residue that is too close to the N- or C-terminus of any chain (the reason is explained below). WHAT IF will take the fragment (of at least 5 residues, see LENSET) with this residue in the middle and search the database for equally long fragments with a highly similar backbone conformation. What counts as highly similar is defined by the parameters, but typically it means that the RMS on the alpha carbons is better than 0.7 Angstrom. There are no additional constraints on this fragment.
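
The relation between the middle residue, the odd group length, and the chain termini can be sketched as follows. This is only an illustration under the assumption that residues in a chain are numbered consecutively; the function name and its arguments are hypothetical and are not WHAT IF commands.

```python
def fragment_window(centre, chain_start, chain_end, length=5):
    """Return (first, last) residue numbers of the fragment around centre."""
    if length % 2 == 0:
        raise ValueError("group length must be odd to have a middle residue")
    half = length // 2
    first, last = centre - half, centre + half
    if first < chain_start or last > chain_end:
        raise ValueError("residue is too close to a chain terminus")
    return first, last

window = fragment_window(10, chain_start=1, chain_end=50)
```

For a group length of 5, residue 10 gives the fragment 8 to 12, whereas residue 2 in a chain that starts at residue 1 has too few residues on its N-terminal side.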

Inserting residue(s) using the database (DGINS)

The DGINS option does rather a lot of things, one after the other. You will first be prompted for a residue after which to insert between 1 and N amino acids (N depends on parameter 638 in the PARAMS.FIG file, see also PRP006). Then you will be asked for the number of amino acids to be inserted. The program will then send the best hits to the graphics window, and you can loop through them with the movie buttons (MOV+ and MOV-). After clicking CHAT you are asked to choose which one you want to use for the insertion. Only the backbone of the inserted residues is inserted (so it is a poly-glycine insertion). No corrections for non-covalent contacts (bumps) are made!

Inserting residue(s) using the database (DGINSS)

The DGINSS option does rather a lot of things, one after the other. You will first be prompted for a residue after which to insert. Then you will be asked for the number of amino acids to be inserted. Here you can answer 1 or 2. You will thereafter be prompted for the amino acid type at position one and, if you want two residues to be inserted, also for the amino acid type at position two. The program will then search the database for well-fitting insertions, and it will send the best hits to the graphics window. You can loop through them with the movie buttons (MOV+ and MOV-). After clicking CHAT you are asked to choose which one you want to use for the insertion. So do not forget to note the number of the hit you want while flipping through the movie; the number is given in the top right corner of the screen as "movie step x".

In contrast to DGINS, which only inserts the backbone, DGINSS will actually insert the entire residue. Virtually no corrections for non-covalent contacts (bumps) are made!

Finding alternative conformations (DGFIX)

DGFIX will cause WHAT IF to prompt you for a residue number. This cannot be a residue that is too close to the N- or C-terminus of any chain. WHAT IF will take the fragment (of at least 5 residues, see LENSET) with this residue in the middle and search the database for equally long fragments with a highly similar backbone conformation. What counts as highly similar is defined by the parameters, but typically it means that the RMS on the alpha carbons is better than 0.7 Angstrom. The middle residue in the database fragment must be of the same type as the residue on which you perform the search.

Mutating using the database (DGMUT)

DGMUT will cause WHAT IF to prompt you for a residue number. This cannot be a residue that is too close to the N- or C-terminus of any chain. WHAT IF will take the fragment (of at least 5 residues, see LENSET) with this residue in the middle and search the database for equally long fragments with a highly similar backbone conformation. What counts as highly similar is defined by the parameters, but typically it means that the RMS on the alpha carbons is better than 0.7 Angstrom. You will be prompted for the residue type of the middle residue in the database fragments.

Look-alike contacts (DGCONT)

If you want to lift all contacts from the database that look like a certain contact in your protein, then the option to do this is DGCONT. This option is present in the DGLOOP menu and in the SCAN3D menu. Extensive documentation can be found in the SCAN3D menu.

Replacing a residue with a hit (DGREP)

The options DGFIND, DGFIX and DGMUT all prepare groups of hits. If you want to replace the amino acid that was used to make these hits by the middle amino acid of one of these hits, you should use the DGREP option. This option does the same as the DGGRA option (see DGGRA), but after showing the hits on the PS300 screen you are prompted for the number of the hit to be used. These numbers are indicated at the top right of the screen while you click through the movie with MOV+ and MOV-. If there is no hit to your liking, you can (as usual) escape by typing zero.

Display fragments

Movie of fragments (DGGRA)

The command DGGRA can be used to send hits to the graphics window for visual inspection. After typing DGGRA you will be prompted for a group number. You can only look at groups that were made using any of the DG*** options (also after a logical operation with another group has been performed). The hits are sent to the MOVIE. The middle residue, the one of interest, is drawn somewhat more intensely than the other residues. The right hand side of the top bar indicates the number of the hit presently on the screen. You can switch the movie off with the MOVIE button at the bottom of the screen. Note that a next set of DG*** hits will overwrite the previous one when it is sent over with a subsequent DGGRA command.

Showing all fragments at once (DGGRAL)

The command DGGRAL can be used to send hits to the graphics window for visual inspection. After typing DGGRAL you will be prompted for a group number. You can only look at groups that were made using any of the DG*** options (also after a logical operation with another group has been performed). The hits are stored in a MOL-item. They are coloured by quality of fit: blue for the best one, red for the worst.
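
A colour ramp of this kind could look like the sketch below. It assumes that the colour varies linearly with the RMS of each hit between pure blue and pure red; how WHAT IF actually interpolates between the two ends is not documented here, so the function and the RGB convention are assumptions.

```python
def fit_colours(rms_values):
    """Map each RMS value to an (r, g, b) triple: blue = best, red = worst."""
    lowest, highest = min(rms_values), max(rms_values)
    span = highest - lowest
    colours = []
    for rms in rms_values:
        # t runs from 0.0 (best fit) to 1.0 (worst fit).
        t = 0.0 if span == 0 else (rms - lowest) / span
        colours.append((t, 0.0, 1.0 - t))
    return colours

colours = fit_colours([0.2, 0.45, 0.7])
```

The best hit comes out as (0.0, 0.0, 1.0), the worst as (1.0, 0.0, 0.0), and hits in between get a mixture of red and blue.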

Listing hits (DGSHOW)

The command DGSHOW does almost the same as the command SHOHIT (see SHOHIT in the SCAN3D menu). It lists the hits one by one, including sequence, secondary structure determination for the fragment, and the RMS deviation for the alpha-carbons after superpositioning. Be aware that the RMS deviation is no longer correct if you have done logical combinations on this group.

Working with alpha carbons only

Building a structure from alpha carbons (CATOAL)

The command CATOAL will run over the entire molecule and replace every amino acid for which only the alpha carbon coordinates are present by a complete residue. This option loops over the DGMUT option, and every time accepts the best hit found, without user intervention.

If you are running this option on experimental alpha carbon positions, you should probably run the RELAX option (see below) a couple of times before starting with CATOAL.

Keeping only alpha carbons (ALTOCA)

The command ALTOCA causes WHAT IF to set all coordinates to zero except those for the alpha carbons. This is of course a rather useless command, but it is nice to test the quality of the CATOAL option.

Rotamer searches

Single rotamer searches (DGR1-1)

This option does the same as DGROTA (see below). This option is only added for option nomenclature consistency.

Single rotamer searches (DGROTA)

The command DGROTA does almost the same as DGMUT. However, it will automatically add a DGGRAL option at the end. In this DGGRAL option only the side chains of the middle residue of the search string will be shown. Also, in DGROTA the weight on the central residue's alpha carbon is infinite in the superposition.

This is a very good option to get an impression about possible sidechain conformations (=rotamers) at a certain position.

The command DGROTA does the same as DGR1-1. It is left in here for compatibility purposes.

Multiple residue rotamers at one position (DGRN-1)

The command DGRN-1 will prompt you for one residue. It will then determine the rotamers (as described for the DGR1-1 option) for all 20 residue types at this position (nothing is shown for glycine because it has no side chain). The hits will be stored in the first 20 frames of the movie option.

One residue type rotamer for a range of residues (DGR1-N)

The command DGR1-N will prompt you for a residue range and a residue type. The range should not span more than 100 residues. For every residue in the range the rotamers for the requested residue type will be determined as described for the DGR1-1 option, and put in the movie.

At present the output is also a surprise to me.

All rotamers for a residue range (DGRN-N)

The command DGRN-N determines rotamer distributions for all residue types for a complete range of residues. As this can no longer be displayed, you get the Chi-1 statistics. The statistics consist of a table that gives, for every position and every residue type, the distribution of preferred Chi-1 angles in steps of 10 degrees. Also, a plot with three graphs will be shown, giving the frequency of occurrence around +60, +/-180, and -60 degrees (from bottom to top) at each position, averaged over the 17 residue types (gly, ala, pro are excluded). A second plot shows the distribution of the average residue over the 360 degrees of Chi-1, averaged over the 17 residue types.
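
The kind of Chi-1 statistics described here can be sketched as follows. The 10 degree bins follow the text; the 30 degree tolerance around the three preferred values, the function names, and the example angles are assumptions made for this illustration.

```python
def chi1_histogram(angles, bin_width=10):
    """Count Chi-1 angles (degrees, -180..180) in bins of bin_width degrees.
    Bin 0 starts at -180 degrees."""
    counts = [0] * (360 // bin_width)
    for angle in angles:
        wrapped = (angle + 180.0) % 360.0  # map onto [0, 360)
        counts[int(wrapped // bin_width)] += 1
    return counts

def rotamer_fractions(angles, tolerance=30.0):
    """Fraction of angles within tolerance of +60, 180 and -60 degrees."""
    wells = {"+60": 60.0, "180": 180.0, "-60": -60.0}
    totals = dict.fromkeys(wells, 0)
    for angle in angles:
        for name, centre in wells.items():
            # Smallest signed angular difference, in (-180, 180].
            diff = (angle - centre + 180.0) % 360.0 - 180.0
            if abs(diff) <= tolerance:
                totals[name] += 1
                break
    return {name: totals[name] / len(angles) for name in wells}

example_chi1 = [-65.0, -58.0, 62.0, 178.0, -179.0, 55.0, -62.0, 65.0]
histogram = chi1_histogram(example_chi1)
fractions = rotamer_fractions(example_chi1)
```

Note that 178 and -179 degrees both count towards the +/-180 rotamer, which is why the angular difference is wrapped before it is compared with the tolerance.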

Since these two plots are drawn in the colour of the residues (actually their alpha carbons), you are advised to think about colouring them cleverly before you run this extremely time consuming option!

At present the output is also a surprise to me.

Self rotamers (DGRSLF)

The command DGRSLF will cause WHAT IF to prompt you for a residue range. It will then execute the DGR1-1 option on each residue in this range, and store the results in the movie. The rotamers will be for the residue type that is present at that position. This option allows you to inspect how many of your residues are in the most preferred conformation.

The range should not span more than 100 residues.

Geometric best rotamers for a residue range (DGRS-N)

The command DGRS-N will cause WHAT IF to prompt you for a residue range. For all residues in this range the geometrically best rotamer (that is the rotamer that is closest to the middle of the cloud and has the best backbone fit) will be determined. These best rotamers will be plotted.

Fragment group administration

Resetting the group length (LENSET)

The command LENSET can be used to change the length of the groups to search for. The commands DGFIX, DGFIND and DGMUT need the group length to be odd. DGCONT works independent of the group length.

This LENSET command is completely equivalent to the SETLEN command in the SCAN3D menu.

Initializing the groups (GRPINI)

The command GRPINI does the same as the command INIGRP in the SCAN3D menu: it initializes all groups. This is an irreversible command. The only way to get the groups back is by regenerating them.

Showing the search groups (GRPSHO)

The command GRPSHO does the same as the command SHOGRP in the SCAN3D menu: it shows you all groups. The presently available groups are shown including their group number, the number of hits in the group, and a short description of how the group was created.

Parameter related options

Tightening the parameters (TIGHT)

The command TIGHT will cause WHAT IF to tighten all DGLOOP related parameters by a factor of 1.67. This means that the quality of the hits will on average get better, at the cost of the number of hits.

Relaxing the parameters (RELAX)

The command RELAX will cause WHAT IF to relax all DGLOOP related parameters by a factor of 1.67. This means that the quality of the hits will on the average get worse, but you will get more hits.
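
As a rough illustration of TIGHT and RELAX, the sketch below assumes that tightening means dividing an error threshold by 1.67 and relaxing means multiplying it by 1.67, so that one RELAX undoes one TIGHT. Which parameters are actually scaled, and their default values, are not listed here; the two values below are merely illustrative.

```python
FACTOR = 1.67

def tight(params):
    """Return a copy with every error threshold divided by FACTOR."""
    return {name: value / FACTOR for name, value in params.items()}

def relax(params):
    """Return a copy with every error threshold multiplied by FACTOR."""
    return {name: value * FACTOR for name, value in params.items()}

# Illustrative values only (Angstrom); not the WHAT IF defaults.
params = {"DGCERR": 1.4, "DGTERR": 0.7}
tightened = tight(params)
restored = relax(tightened)
```

Under this assumption, running RELAX a couple of times (as suggested for experimental alpha carbon positions) widens each threshold by 1.67 per call.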

Resetting the default parameters (RESPAR)

The command RESPAR will cause WHAT IF to reset all DGLOOP related parameters to their default values.

Showing DGLOOP parameters (SHOPAR)

The command SHOPAR will cause WHAT IF to show you all DGLOOP related parameters.

Parameters (PARAMS)

The command PARAMS brings you directly in the menu to change the DG*** related parameters. The following parameters can be set. May I suggest that you only change parameters for which you really know what they do...

Maximal Ca-Ca distance misfit allowed (DGCERR)

Speed performance related parameter. This is the maximally allowed Calpha-Calpha distance error in any hit. Should in principle be set at twice the desired final RMS fit error. If you know you will get very many hits, e.g. if you are only modelling helices, you can decrease this parameter perhaps even to one and a half times the desired final RMS fit error. But remember, this is only a CPU speed optimiser, it does not give you better hits.

Allowed RMS fit error (DGTERR)

This is the maximal Calpha RMS misfit between database hit and the real structure. This is one of the critical quality parameters. This parameter should be chosen as a function of the quality of your molecule in the soup, and the average quality of the database files. I have not really thoroughly tested how this parameter influences the performance of the program, but the default feels OK to me.

Maximal Ca misplacement after real suppos (MXCAER)

This is a CPU performance parameter the use of which still has to be proven. May I suggest you look in the source code before you change this parameter....

Maximal number of hits to be searched for (DGSHWT)

This parameter determines how many hits WHAT IF will maximally extract from the database. The upper limit for this parameter can be found in the include file called DGLOOP.INC. If you do many DGRN-1 related options, it might improve the turn around if you set this parameter lower than the default (which is 80); for example, 20 will also be fine for many visual inspection options.

Additional backbone fit flag (ADDFIT)

The option DGREP will superpose the backbone of the database fragments only on the corresponding alpha carbons in the soup. In this superposition the weight on the central Calpha is infinite. The obtained superposition is then applied to the sidechain of the residue to be DGREP-ed. The ADDFIT parameter determines whether more backbone atoms of the central residue than just the alpha carbon should be used upon superposing. You can choose 0 (only use Ca), 3 (use N, Ca, C), 4 (use N, Ca, C, O) or 5 (use N, Ca, C, O, and Cb). The default is 5.
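
The four allowed ADDFIT values can be written as a simple lookup table. The sketch below uses PDB-style atom names (CA, CB); the table and the function are illustrative and are not taken from the WHAT IF source code.

```python
# Atoms of the central residue used in the superposition, per ADDFIT value.
ADDFIT_ATOMS = {
    0: ("CA",),
    3: ("N", "CA", "C"),
    4: ("N", "CA", "C", "O"),
    5: ("N", "CA", "C", "O", "CB"),
}

def central_fit_atoms(addfit=5):
    """Return the atom names for a given ADDFIT value (default 5)."""
    if addfit not in ADDFIT_ATOMS:
        raise ValueError("ADDFIT must be 0, 3, 4 or 5")
    return ADDFIT_ATOMS[addfit]

default_atoms = central_fit_atoms()
```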

Number of anchoring residues around insertions (INANCH)

The options DGINS and DGINSS do a search in the fragment database in which the residues that have to be inserted carry no weight. In order to do something useful, one should of course have a few residues before and after the insertion that match between the database fragment and the residues in the soup flanking the gap that has to be filled. The INANCH parameter determines how long these flanking stretches will be. Make INANCH too small (e.g. 1 is really stupid...) and you will get very many very poor hits. Make INANCH too large, and you are likely to get no hits at all. The default for INANCH is 3, i.e., three residues at either side of the insertion.
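
The weighting idea can be sketched as follows, assuming a weight of 1 for every anchoring residue and 0 for every inserted residue. The per-residue deviations are invented numbers, and a real search would of course obtain them from a superposition.

```python
import math

def insertion_weights(n_insert, inanch=3):
    """Weights for a database fragment: anchors count, inserted residues do not."""
    return [1.0] * inanch + [0.0] * n_insert + [1.0] * inanch

def weighted_rms(deviations, weights):
    """Weighted RMS of per-residue C-alpha deviations (Angstrom)."""
    total = sum(weights)
    return math.sqrt(sum(w * d * d for d, w in zip(deviations, weights)) / total)

weights = insertion_weights(n_insert=2)
# The two middle residues have no counterpart in the soup; their values
# are arbitrary and do not influence the result.
deviations = [0.3, 0.3, 0.3, 5.0, 5.0, 0.3, 0.3, 0.3]
anchor_rms = weighted_rms(deviations, weights)
```

With INANCH at 3 and two residues to insert, the database fragment is 8 residues long, and only the 6 anchoring residues decide whether it is accepted.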

Should accessibility constraints be used (USEACC)

The parameter USEACC allows you to switch ON/OFF the use of accessibility constraints on the central residue in the database fragment.

Lower limit on accessibility of database hit (LOWACC)

If the USEACC parameter is switched ON, LOWACC tells WHAT IF the minimal accessible molecular surface area that the central residue in the database fragment should have in order to be acceptable.

Upper limit on accessibility of database hit (HGHACC)

If the USEACC parameter is switched ON, HGHACC tells WHAT IF the maximal accessible molecular surface area that the central residue in the database fragment should have in order to be acceptable.
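
Taken together, USEACC, LOWACC and HGHACC act as a simple window filter on the accessibility of the central residue. A minimal sketch, in which the function name and the numerical values are assumptions:

```python
def passes_accessibility(accessibility, useacc, lowacc, hghacc):
    """True if the central residue of a database fragment is acceptable."""
    if not useacc:
        return True  # constraints switched OFF: every hit passes
    return lowacc <= accessibility <= hghacc

# Illustrative values only; units are those of the accessibility calculation.
buried_ok = passes_accessibility(4.0, useacc=True, lowacc=0.0, hghacc=10.0)
exposed_ok = passes_accessibility(45.0, useacc=True, lowacc=0.0, hghacc=10.0)
unfiltered = passes_accessibility(45.0, useacc=False, lowacc=0.0, hghacc=10.0)
```

Here the buried residue is accepted and the exposed one is rejected, unless USEACC is switched OFF, in which case the limits are ignored.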

Maximal error in DGCONT option (CNTERR)

The option DGCONT searches for pairs of residues in the database that superpose well on pairs of residues in the soup. An all-atom superposition is performed, and if the RMS atomic displacement error is smaller than CNTERR, the hit will be accepted.

Automatic mutant prediction

There are a few options in WHAT IF that use DG*** options implicitly for the purpose of predicting which mutations can be made safely. See MUTQUA, TRYMUT and SUGMUT.

Statistics on groups of hits

The SCNSTS menu that is normally used to evaluate SCAN3D relational database hits can also be used to determine residue statistics for DG*** groups. However, DG*** groups can hold only a limited number of hits, which makes the use of SCNSTS options for the analysis of DG*** groups not always equally useful.