(Re-)creating the database (SEQ3D)

                 W A R N I N G :

       DO NOT TRY TO (RE-)CREATE THE DATABASE UNTIL:
 
       1) YOU HAVE GOOD BACKUPS.

       2) YOU HAVE STUDIED THE SOURCE CODE IN THE MAKDB LIBRARY.

       3) YOU ARE SURE WHAT YOU ARE GOING TO DO.

       4) YOU CAN COUNT ON TENS OF HOURS OF CPU TIME BEING 
          AVAILABLE OVER THE NEXT DAY OR SO.

       5) YOU ARE SURE YOU HAVE ENOUGH DISK SPACE AVAILABLE.

       6) YOU ARE SURE YOU REALLY WANT TO DO THIS.

       7) YOU HAVE TAKEN OUT INSURANCE AGAINST MURDER BY YOUR COLLEAGUES.

Introduction.

Making a new database is in principle not needed. In fact, when you think you need a new database, you probably need an updated version of the whole program. However, if you want to make a class-specific database, e.g. a database that contains only antibodies and beta-only structures, first read the warning above, then read this entire chapter, and only then go ahead.

Preparation

You first need a file called PDB.LIS. This file lists all the PDB files that you want to incorporate in the database. These files should be present in the directory indicated in the CCONFI.FIG file, or in the local directory. The first 4 characters of each line should be the 4-character PDB identifier; the fifth character should be the name of the chain you want to use. Use a blank as the fifth character if there is only one chain.

Now you are ready to run the options PRP001, PRP002, etc., but first we will explain which files will be generated.
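
The line layout described above can be checked before any long-running option is started. The following is a minimal Python sketch, not part of WHAT IF: the names parse_pdb_lis and ListEntry are hypothetical, and skipping blank lines is an assumption.

```python
from typing import List, NamedTuple


class ListEntry(NamedTuple):
    pdb_id: str  # characters 1-4 of the line
    chain: str   # character 5; a blank means the entry has only one chain


def parse_pdb_lis(text: str) -> List[ListEntry]:
    """Split PDB.LIS-style lines into (identifier, chain) pairs."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if len(line) < 4 or " " in line[:4]:
            raise ValueError(f"line {number}: need a 4-character PDB identifier")
        # A line that stops after 4 characters is treated as single-chain.
        chain = line[4] if len(line) > 4 else " "
        entries.append(ListEntry(line[:4], chain))
    return entries


example = "1CRN \n4HHBA\n4HHBB\n"
for entry in parse_pdb_lis(example):
    print(entry.pdb_id, repr(entry.chain))
```

Running such a check first is cheap compared with discovering a malformed line many hours into PRP001.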

ascdata

ALLACC.ACC    Contains all summed accessibilities per residue.
ALLCYS.CYS    Information about Cys-Cys bridges.
ALLDSP.DSP    Contains all original determinations by DSSP.
ALLHST.HST    Contains all WHAT IF-ised determinations by DSSP.
ALLHYD.HYD    Holds all hydrophobic moments for the protein database.
ALLOME.GAS    Contains all omega angles for the protein database.
ALLPHI.PHI    Contains all phi angles for the protein database.
ALLPSI.PSI    Contains all psi angles for the protein database.
BBCDAT.DAT
CHI00*.CHI    Contains the values for chi-* for the protein database.
HBONDS.HBO    Will be rewritten soon.
IMPROPER.DAT
MUTDB.IND     Holds counters for the database proteins.
SYMMAT.MAT
TOTALS.SEQ    Contains all sequences. Will be read upon startup.

bindata

ALCONT.ACT    Contains all atom-atom contacts for the protein database.
ALCONT.WHT    Hash tables for ALCONT.ACT.
ALCOOR.XYZ    Contains all coordinates, B-factors, etc. for protein.
ALHASH.CON    Is the hash table for ALCONT.ACT.
ALLNEA.DIS    Holds information about nearest neighbours in the database.
ALLNUM.NAM    Contains original names for amino acids in protein database.
CACA.NEW      Tables for the DGLOOP options.
CONSER.HSP    Sequence conservation values for the database (from HSSP).
PROFIL.SEQ    Sequence profiles for the database (from HSSP).

predata

This directory holds files that you should not touch (actually, if you still have these files, you can delete them and save a lot of space). These files are formatted versions of the unformatted files in bindata. As computers are so nicely standardised that they all use different formats for unformatted (and thus fast) files, we let WHAT IF regenerate those unformatted files from the predata files upon installation.
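
Byte order is one well-known way in which unformatted files differ between machines. The Python sketch below only illustrates that general point; it says nothing about the actual layout of the WHAT IF files.

```python
import struct

value = 1.5
little = struct.pack("<f", value)  # 4-byte float, little-endian byte order
big = struct.pack(">f", value)     # the same float, big-endian byte order

print(little.hex(), big.hex())

# Reading the bytes with the wrong byte order silently yields a wrong number.
misread = struct.unpack(">f", little)[0]
print(misread == value)

# A formatted (text) representation has no byte order, so it survives the move.
as_text = f"{value:.6e}"
print(float(as_text) == value)
```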

summary of SEQ3D options

Options that you need:
PRP001  Corrects all PDB files (Easily takes 24 hours and 100M space)
PRP002  Creates MUTDB.IND and TOTALS.SEQ
PRP003  Creates ALCOOR.XYZ (all PDB coordinates)
PRP004  Corrects atom correctness flags in ALCOOR.XYZ
PRP005  Creates ALLHST.HST (secondary structure database by DSSP)
PRP006  Creates DGloop pointer files
PRP007  Puts accessibilities in ALCOOR.XYZ
PRP008  Creates the NEACON file
PRP009  Creates phi-psi value tables for database
PRP010  Creates accessibility tables for database
PRP011  Creates atomic contact tables for database
PRP012  Creates hydrophobic moment tables for database
PRP013  Creates cys-cys bridge info tables for database
PRP014  Does nothing
PRP015  Puts the eight torsion angles as term 10 in ALCOOR.XYZ
PRP016  Makes the secondary structure element tables
PRP017  Makes the nearest neighbours index tables
PRP018  Creates ALLDSP.DSP for use of original DSSP values
PRP019  Creates HSSP derived sequence profiles
PRP020  Creates conservation table
PRP021  Creates backbone hydrogen bond table
AUX001  Creates RAMA.LIN (backdrop for Ramachandran plots)
AUX002  Calibrates Ramachandran Z-score (parameters in PARAMS.FIG)
AUX003  Calibrates chi-1/chi-2 Z-score (parameters in PARAMS.FIG)
AUX004  Creates NQA boxes (needs PRP001 and 200M space in current directory)
AUX005  Creates NQA residue calibration
AUX006  Creates NQA structure calibration
AUX007  Creates NQA virtual atom boxes
AUX008  Creates inside/outside calibration database (INOUTF.DAT)
AUX009  Creates Backbone conformation Z-score calibration
AUX010  Creates EVACHI.CHI (Calibration table for CHICHK)
AUX011  Creates IMPROPER.DAT (Calibration table for HNDCHK)

Options that you don't need:
GETORG  Overwrites the HST values in common with originals
MAKFMT  Creates a formatted coordinate file (ALCOOR.FMT)
MAKXYZ  Regenerates ALCOOR.XYZ database coordinates from formatted file
PRPCHK  Checks if all PDB files required for database generation exist
PRP101  Copies all PDB files to default directory
PRP102  Regenerates PDB.LIS from the database
PRP103  Checks if all HSSP files exist
PRP107  Creates formatted HSSP derived files
PRP108  Re-creates real HSSP derived files
PRP110  Creates the 1-3 bond table
PRP115  Creates the 1-2-3 angle table for the soup

The PRP options

Most PRP options are obligatory for users who want to use their own database. Some options are optional (and are marked as such). PRP008 and PRP011 create the least-needed big output files, and we suggest you do not run these options until WHAT IF tells you that you have to run them.

Many of these options give some output for every database protein. Of course you should carefully read all this output!

Description of (PRP001)

This option takes all the PDB files mentioned in the PDB.LIS file, performs a number of corrections, and dumps the corrected PDB files in the local directory. This takes about 20 hours on a very fast workstation (this sentence was written in May 1997). In theory PRP001 is optional, but in practice you don't want to skip it.

Description of (PRP002)

The first command needed to really generate the first files of the database is PRP002. This command reads the file PDB.LIS. It then reads all PDB files (chains) and creates two files. TOTALS.SEQ holds all sequences and is read whenever WHAT IF is started with the database present. MUTDB.IND is a formatted file with some statistical (= dumb counters) information about the files in the database. MUTDB.IND is very important, as it holds the basic hash values for the internal pointer administration for database usage.

Description of (PRP003)

The second command needed upon generation of the database is PRP003. This command reads the files MUTDB.IND (generated by PRP002) and PDB.LIS. It then reads all PDB files (chains) and creates the files ALCOOR.XYZ, which holds all protein information, and ALDRUG.XYZ, which holds all information about the drugs and waters in each entry.

These files are direct access files for which the indexing hash tables are generated upon starting WHAT IF from the information stored in MUTDB.IND.
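
The idea of a direct access file whose index is rebuilt from per-entry counters can be sketched as follows. This is an illustration in Python under stated assumptions, not the WHAT IF implementation: the record layout (three 4-byte floats), the counter values and the name read_entry are all hypothetical.

```python
import io
import struct
from itertools import accumulate

RECORD = struct.Struct("<3f")  # hypothetical record: x, y, z as 4-byte floats

# Hypothetical per-entry record counts, standing in for counters in an index file.
atom_counts = {"1CRN": 2, "4HHB": 3}

# First record number of each entry = running total of all preceding counts.
starts = dict(zip(atom_counts, accumulate([0] + list(atom_counts.values())[:-1])))


def read_entry(handle, name):
    """Seek straight to one entry's records without scanning the file."""
    handle.seek(starts[name] * RECORD.size)
    return [RECORD.unpack(handle.read(RECORD.size)) for _ in range(atom_counts[name])]


# Build a small in-memory "direct access file" for the demonstration.
data = io.BytesIO()
for i in range(sum(atom_counts.values())):
    data.write(RECORD.pack(float(i), 0.0, 0.0))

print(read_entry(data, "4HHB"))
```

Because every record has the same size, the counters alone are enough to compute where each entry starts, which is why the index can be regenerated at startup rather than stored.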

Description of (PRP004)

The command PRP004 can be run at any moment after PRP002 and PRP003. It will cause WHAT IF to run over all residues stored in the file ALCOOR.XYZ (which was generated by PRP003) and check the intra-residue bond lengths. Any pair of covalently bound atoms with too long a bond distance will be tagged as bad atoms for further WHAT IF operation. Several other checks for bad atoms are performed too. It is highly recommended that you run this option.
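
The bond length part of this check can be pictured with a small Python sketch. The 2.0 Angstrom cut-off, the function name flag_long_bonds and the example coordinates are assumptions made for illustration; the criteria WHAT IF actually applies are in its source code.

```python
import math


def flag_long_bonds(coords, bonds, max_length=2.0):
    """Return the names of atoms involved in any bond longer than max_length.

    max_length (in Angstrom) is an assumed cut-off, not the WHAT IF value.
    """
    bad = set()
    for a, b in bonds:
        if math.dist(coords[a], coords[b]) > max_length:
            bad.update((a, b))  # both partners of a bad bond are tagged
    return bad


# Hypothetical backbone atoms of one residue; the C atom is misplaced.
coords = {"N": (0.0, 0.0, 0.0), "CA": (1.46, 0.0, 0.0), "C": (5.0, 0.0, 0.0)}
bonds = [("N", "CA"), ("CA", "C")]
print(sorted(flag_long_bonds(coords, bonds)))
```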

Description of (PRP005)

The command PRP005 can be run at any moment after PRP002 and PRP003. It will cause WHAT IF to run DSSP (or the WHAT IF DSSP-emulator if you do not have a DSSP licence) on all files in the database. The secondary structure determinations will be extracted from the DSSP output files and stored in the file ALLHST.HST for fast access later.

Description of (PRP006)

The command PRP006 can be run at any moment after PRP002 and PRP003. It will cause WHAT IF to run over all residues stored in the file ALCOOR.XYZ (which was generated by PRP003) and create the fragment database file from that. This file is called CACA.NEW. CACA stands for C-Alpha - C-Alpha distance. This file holds the distance records and the pointers (hash tables) to these distance records. This file is mostly needed for model building, structure validation and DG*** based mutation prediction.
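
One way to picture distance records with pointers to them is a table that groups fragments by their binned C-alpha to C-alpha end-to-end distance. The Python sketch below is only such a picture: the bin width, the fragment length and the name build_distance_index are assumptions, and the real CACA.NEW layout may be quite different.

```python
import math
from collections import defaultdict


def build_distance_index(ca_coords, length, bin_width=1.0):
    """Map a distance bin to the start positions of all fragments in that bin.

    The distance is taken between the first and last C-alpha of each
    fragment of `length` residues.
    """
    index = defaultdict(list)
    for start in range(len(ca_coords) - length + 1):
        d = math.dist(ca_coords[start], ca_coords[start + length - 1])
        index[int(d // bin_width)].append(start)
    return index


# Hypothetical fully extended chain: C-alphas 3.8 Angstrom apart on a line.
ca_coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
index = build_distance_index(ca_coords, length=3)
print(dict(index))
```

A loop-building option can then look up only the fragments in the bin that matches the gap it has to bridge, instead of scanning every fragment.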

Description of (PRP007)

The command PRP007 can be run at any moment after PRP002 and PRP003. This option will add the accessibility to every atom in the protein database file ALCOOR.XYZ. This option is rather time consuming. It is used when coordinates are extracted from the database (e.g. for DGLOOP options) to have their accessibility values directly available. Database search options that involve accessibilities use the special tables made by PRP010.

Description of (PRP008)

The command PRP008 can be run at any moment after PRP002 and PRP003. It creates the data file needed by the NEACON option. This file is big and is only used by one option: NEACON. It is not distributed in the standard version of WHAT IF. It is not recommended that you run this option before WHAT IF tells you that you need it.

Description of (PRP009)

The command PRP009 can be run at any moment after PRP002 and PRP003. It creates the files with phi, psi, omega, and chi1-5 values (8 files in total) which are needed for the backbone torsion angle and chi-angle operations of the relational database.
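
All eight angles are torsion angles, each defined by four consecutive atoms. As a reminder of what is being tabulated, here is a minimal Python sketch of a torsion angle calculation with the usual sign convention (the four-point formula is standard geometry; the function names are hypothetical and this is not the WHAT IF code).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p1, p2, p3, p4):
    """Torsion angle in degrees, in the range -180 to 180, for four points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


# Planar test geometries: cis gives 0 degrees, trans gives 180 degrees.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))
```

For phi the four points would be C(i-1), N(i), CA(i), C(i); for psi they would be N(i), CA(i), C(i), N(i+1).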

Description of (PRP010)

The command PRP010 can be run at any moment after PRP002 and PRP003. It creates the file ALLACC.ACC which is needed for the residue accessibility options of the relational protein database. (See also PRP007).

Description of (PRP011)

The command PRP011 can be run at any moment after PRP002 and PRP003. It creates the very, VERY BIG file ALCONT.ACT (more than 100 MB) and its hash table ALHASH.CON. These files are used by the contact options of the relational database. If you need to delete any database files because of space problems, these are probably the best candidates. It is not recommended that you run this option unless WHAT IF tells you that you should.

Description of (PRP012)

The command PRP012 can be run at any moment after PRP002 and PRP003. It creates the file ALLHYD.MOM which is used by the relational protein database for the hydrophobic moment related database search options.
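
For readers unfamiliar with the quantity, a hydrophobic moment in the sense of Eisenberg can be sketched as below. That WHAT IF uses exactly this definition is an assumption; so are the 100 degree helical angle and the function name. The hydrophobicity values are to be supplied by the caller from whatever scale is in use.

```python
import math


def hydrophobic_moment(hydrophobicities, angle_deg=100.0):
    """Length of the vector sum of per-residue hydrophobicities.

    Residue n contributes a vector of length hydrophobicities[n] pointing
    in the direction n * angle_deg (100 degrees per residue for a helix).
    """
    s = sum(h * math.sin(math.radians(angle_deg * n))
            for n, h in enumerate(hydrophobicities))
    c = sum(h * math.cos(math.radians(angle_deg * n))
            for n, h in enumerate(hydrophobicities))
    return math.hypot(s, c)


# Equal values over 18 residues (five full helical turns) cancel out,
# while a pattern that alternates in phase with the angle adds up.
print(hydrophobic_moment([1.0] * 18))
print(hydrophobic_moment([1.0, -1.0, 1.0, -1.0], angle_deg=180.0))
```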

Description of (PRP013)

The command PRP013 can be run at any moment after PRP002 and PRP003. It creates the file ALLCYS.CYS which is used by the relational protein database to look up CYS-CYS bridges (or free cysteines).

Description of (PRP014)

This option is reserved for future use.

Description of (PRP015)

The command PRP015 can be run at any moment after PRP002 and PRP003. It adds all torsion angles to the file ALCOOR.XYZ. There are not yet many applications for this, but since this takes no extra space, why not do it? Some of the options that extract coordinates from the database use these torsion angles, and one can expect a small speed improvement in homology modelling when this option has been executed. No option will give different answers depending on whether this option has been run.

Description of (PRP016)

This option stores the beginning and end of all secondary structure elements in the database in the file ELMNTS.DIS. As this file is not yet used by a single option, you can just as well skip PRP016 for now.

Description of (PRP017)

This option is reserved for future use.

Description of (PRP018)

Upon reading DSSP files WHAT IF converts secondary structure codes (see the DSSP chapter). If you want to work with the original DSSP determinations you need to have run this option, and you can then use the option ORGDSP in the DSSP menu to overwrite the internal secondary structure data in the database. ORGDSP only overwrites it for the present run; the next time you or somebody else runs WHAT IF you get the converted secondary structure determination back.

Description of (PRP019)

This option makes the HSSP profiles relationally accessible to WHAT IF. It creates the file PROFIL.SEQ. As only one, not often used option uses this file, I suggest that you don't run PRP019 till WHAT IF tells you that you need to do so.

Description of (PRP020)

This option makes the HSSP residue conservation relationally accessible to WHAT IF. It creates the file CONSER.HSP. As only one, not often used option uses this file, I suggest that you don't run PRP020 till WHAT IF tells you that you need to do so.

Description of (PRP021)

This option makes the backbone hydrogen bond tables for SCAN3D.
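
The ordering constraints stated in the descriptions above (PRP002 before PRP003, and most other options only after both) can be turned into a run order mechanically. A minimal Python sketch, assuming Python 3.9 or later for graphlib; placing PRP001 before PRP002 reflects the remark that PRP001 should not be skipped in practice, and options for which no constraint is stated are left out.

```python
from graphlib import TopologicalSorter

# Options described as runnable "at any moment after PRP002 and PRP003".
after_002_and_003 = [
    "PRP004", "PRP005", "PRP006", "PRP007", "PRP008", "PRP009",
    "PRP010", "PRP011", "PRP012", "PRP013", "PRP015",
]

# Each key must run after every option in its value set.
depends_on = {"PRP002": {"PRP001"}, "PRP003": {"PRP002"}}
depends_on.update({opt: {"PRP002", "PRP003"} for opt in after_002_and_003})

order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

The order among the later options is arbitrary, so PRP008 and PRP011 can simply be dropped from the list until WHAT IF asks for them.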

The auxiliary options to create auxiliary files

Some files/pieces of information are not essentially coupled to the database, but describe features that give a "feel" of protein structure. These options should definitely only be run if the database being generated is an unspecialized general database. Running these should never be needed.

Description of (AUX001)

This option creates RAMA.LIN. There should never be a need to run this option. The distributed version should always work well.

Description of (AUX002)

This option calculates the average and standard deviation of the Ramachandran check for all proteins in the database. This is used to calibrate the Ramachandran Z-score. The results are stored in PARAMS.FIG.
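
The calibration amounts to the usual Z-score arithmetic: a raw score is expressed as the number of standard deviations it lies from the database average. A minimal Python sketch with made-up raw scores; the use of the population standard deviation and the function names are assumptions.

```python
import statistics


def calibrate(raw_scores):
    """Return (mean, standard deviation) of the raw scores of the database."""
    return statistics.mean(raw_scores), statistics.pstdev(raw_scores)


def z_score(raw, mean, sd):
    """Express one raw score relative to the calibration."""
    return (raw - mean) / sd


# Hypothetical raw check values, one per database protein.
database_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sd = calibrate(database_scores)
print(mean, sd, z_score(9.0, mean, sd))
```

This also shows why the calibration should come from an unspecialized database: a class-specific set of proteins shifts the average and changes what counts as unusual.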

Description of (AUX003)

This option calculates the average and standard deviation of the chi-1/chi-2 check for all proteins in the database. This is used to calibrate the chi-1/chi-2 Z-score. The results are stored in PARAMS.FIG.

Description of (AUX004)

This option is step one in the generation of the NEW quality control boxes: It calculates all the boxes. Before this option is run, the NQA database must be completely deleted manually. Not for the faint of heart.

Description of (AUX005)

This option is step two in the generation of the NEW quality control boxes: It calibrates all the residue scores. This option needs to be run immediately after AUX004.

Description of (AUX006)

This option is step three in the generation of the NEW quality control boxes: It calibrates the structure Z-scores. This option needs to be run immediately after AUX005.

Description of (AUX007)

This option is the last in the generation of the NEW quality control boxes: It adds the virtual atom types that make reverse quality control possible. This option needs to be run immediately after AUX006.

Description of (AUX008)

This option creates the database (INOUTF.DAT) used for the inside/outside check of residue polarity for the INOCHK.

Description of (AUX009)

This option calibrates the backbone conformation Z-score for the BBCCHK.

Description of (AUX010)

This option calibrates the chi angle check for EVACHI and CHICHK. It creates the file EVACHI.CHI.

Description of (AUX011)

This option calibrates the improper dihedrals used for the HNDCHK.

Hidden options

Check if all files exist (PRPCHK)

The option PRPCHK will check if all PDB files, etc., that are needed for the generation of the database are actually available to WHAT IF.
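
Outside WHAT IF, a comparable existence check could look like the Python sketch below. The file name pattern pdb{id}.ent is purely an assumption about how the archive is organised, and missing_files is a hypothetical helper; adjust both to your own installation.

```python
import tempfile
from pathlib import Path


def missing_files(identifiers, directories, pattern="pdb{id}.ent"):
    """Return the identifiers for which no file is found in any directory.

    The file name pattern is an assumption, not a WHAT IF convention.
    """
    missing = []
    for pdb_id in identifiers:
        name = pattern.format(id=pdb_id.lower())
        if not any((Path(d) / name).is_file() for d in directories):
            missing.append(pdb_id)
    return missing


# Demonstration with a temporary directory that holds one of two entries.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pdb1crn.ent").write_text("HEADER\n")
    result = missing_files(["1CRN", "4HHB"], [tmp])
print(result)
```

Passing both the configured PDB directory and the local directory mirrors the two places where the PDB files are expected to be.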

Create formatted coordinate file (MAKFMT)

MAKFMT creates a formatted coordinate file (ALCOOR.FMT). Only needed upon installation; don't use it yourself.

Regenerate coordinate file (MAKXYZ)

MAKXYZ regenerates the ALCOOR.XYZ database coordinates from the formatted file. Only needed upon installation; don't use it yourself.

Copy all PDB files to default directory (PRP101)

Should not be needed. Use only if your PDB archive is a shared CD or something like that.

Regenerate PDB.LIS from the database (PRP102)

Only for developers. Don't use.

Check if all HSSP files exist (PRP103)

This option quickly checks whether the HSSP files for all entries in PDB.LIS are available. It is useful to run it BEFORE any of the options that depend on HSSP files. If HSSP files are missing, you can use the PHD server to get them generated.

Create formatted HSSP derived files (PRP107)

Only needed upon installation, don't use yourself.

Re-create real HSSP derived files (PRP108)

Only needed upon installation, don't use yourself.

Create the 1-3 bond table (PRP110)

Only for developers. Don't use.

Create the 1-2-3 angle table for the soup (PRP115)

Only for developers. Don't use.