W A R N I N G :
DO NOT TRY TO (RE-)CREATE THE DATABASE UNTIL:
1) YOU HAVE GOOD BACKUPS.
2) YOU HAVE STUDIED THE SOURCE CODE IN THE MAKDB LIBRARY.
3) YOU ARE SURE WHAT YOU ARE GOING TO DO.
4) YOU CAN COUNT ON TENS OF HOURS OF CPU TIME BEING
AVAILABLE OVER THE NEXT DAY OR SO.
5) YOU ARE SURE YOU HAVE ENOUGH DISK SPACE AVAILABLE.
6) YOU ARE SURE YOU REALLY WANT TO DO THIS.
7) YOU HAVE TAKEN OUT INSURANCE AGAINST MURDER BY YOUR COLLEAGUES.
Making a new database is in principle not needed. Actually, when
you think you need a new database, you probably need an updated
version of the whole program. However, if you want to make a
class-specific database, e.g. a database that contains only antibodies and
beta-only structures, first read the previous paragraph again, then read
this entire chapter, and only then go ahead.
You first need a file PDB.LIS. This file has the names of all PDB
files in it that you want to incorporate in the database. These files
should be present in the directory indicated in the CCONFI.FIG file,
or in the local directory. The first 4 characters of each line should
be the 4 letter PDB identifier, the fifth character should be the
chain-name of the chain you want to use. Use a blank as fifth
character if there is only one chain.
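The line layout of PDB.LIS described above can be sketched in Python. This is
only an illustration and not code from WHAT IF: the names ChainEntry,
parse_pdb_lis_line and parse_pdb_lis are hypothetical, and treating a line
of exactly four characters as "blank chain" is an assumption.

```python
from typing import NamedTuple, Optional


class ChainEntry(NamedTuple):
    pdb_id: str           # characters 1-4: the PDB identifier
    chain: Optional[str]  # character 5: chain name, None when blank


def parse_pdb_lis_line(line: str) -> ChainEntry:
    """Split one PDB.LIS line into PDB identifier and chain name."""
    text = line.rstrip("\r\n")
    if len(text) < 4 or " " in text[:4]:
        raise ValueError(f"expected a 4-character PDB identifier: {line!r}")
    pdb_id = text[:4]
    # Assumption: a missing fifth character means the same as a blank one.
    chain_char = text[4:5]
    chain = chain_char if chain_char.strip() else None
    return ChainEntry(pdb_id, chain)


def parse_pdb_lis(lines):
    """Parse all non-empty lines of a PDB.LIS file."""
    return [parse_pdb_lis_line(line) for line in lines if line.strip()]
```

Reading a real file would then be something like
parse_pdb_lis(open("PDB.LIS")), after which each entry gives the identifier
and the chain (or None) to pass on to further checks.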
Now you are ready to run the options PRP001, PRP002, etc., but first
we will explain which files will be generated.
ALLACC.ACC    Contains all summed accessibilities per residue.
ALLCYS.CYS    Information about Cys-Cys bridges.
ALLDSP.DSP    Contains all original determinations by DSSP.
ALLHST.HST    Contains all WHAT IF-ised determinations by DSSP.
ALLHYD.HYD    Holds all hydrophobic moments for the protein database.
ALLOME.GAS    Contains all omega angles for the protein database.
ALLPHI.PHI    Contains all phi angles for the protein database.
ALLPSI.PSI    Contains all psi angles for the protein database.
BBCDAT.DAT
CHI00*.CHI    Contains the values for chi-* for the protein database.
HBONDS.HBO    Will be rewritten soon.
IMPROPER.DAT
MUTDB.IND     Holds counters for the database proteins.
SYMMAT.MAT
TOTALS.SEQ    Contains all sequences. Will be read upon startup.
ALCONT.ACT    Contains all atom-atom contacts for the protein database.
ALCONT.WHT    Hash tables for ALCONT.ACT.
ALCOOR.XYZ    Contains all coordinates, B-factors, etc. for protein.
ALHASH.CON    Is the hash table for ALCONT.ACT.
ALLNEA.DIS    Holds information about nearest neighbours in the database.
ALLNUM.NAM    Contains original names for amino acids in protein database.
CACA.NEW      Tables for the DGLOOP options.
CONSER.HSP    Sequence conservation values for the database (from HSSP).
PROFIL.SEQ    Sequence profiles for the database (from HSSP).
This directory holds files that you should not touch (actually, if you
still have these files, you can delete them and save a lot of
space). These files are formatted versions of the unformatted files
in bindata. As computers are so nicely standardised that they all use
different formats for unformatted (and thus fast) files, we let WHAT
IF re-generate those unformatted files from the predata files upon
installation.
Options that you need:
PRP001  Corrects all PDB files (easily takes 24 hours and 100M of space)
PRP002  Creates MUTDB.IND and TOTALS.SEQ
PRP003  Creates ALCOOR.XYZ (all PDB coordinates)
PRP004  Corrects atom correctness flags in ALCOOR.XYZ
PRP005  Creates ALLHST.HST (secondary structure database by DSSP)
PRP006  Creates DGloop pointer files
PRP007  Puts accessibilities in ALCOOR.XYZ
PRP008  Creates the NEACON file
PRP009  Creates phi-psi value tables for database
PRP010  Creates accessibility tables for database
PRP011  Creates atomic contact tables for database
PRP012  Creates hydrophobic moment tables for database
PRP013  Creates cys-cys bridge info tables for database
PRP014  Does nothing
PRP015  Puts the eight torsion angles as term 10 in ALCOOR.XYZ
PRP016  Makes the secondary structure element tables
PRP017  Makes the nearest neighbours index tables
PRP018  Creates ALLDSP.DSP for use of original DSSP values
PRP019  Creates HSSP derived sequence profiles
PRP020  Creates conservation table
PRP021  Creates backbone hydrogen bond table
AUX001  Creates RAMA.LIN (backdrop for Ramachandran plots)
AUX002  Calibrates Ramachandran Z-score (parameters in PARAMS.FIG)
AUX003  Calibrates chi-1/chi-2 Z-score (parameters in PARAMS.FIG)
AUX004  Creates NQA boxes (needs PRP001 and 200M of space in current directory)
AUX005  Creates NQA residue calibration
AUX006  Creates NQA structure calibration
AUX007  Creates NQA virtual atom boxes
AUX008  Creates inside/outside calibration database (INOUTF.DAT)
AUX009  Creates backbone conformation Z-score calibration
AUX010  Creates EVACHI.CHI (calibration table for CHICHK)
AUX011  Creates IMPROPER.DAT (calibration table for HNDCHK)
Options that you don't need:
GETORG  Overwrites the HST values in common with originals
MAKFMT  Creates a formatted coordinate file (ALCOOR.FMT)
MAKXYZ  Regenerates ALCOOR.XYZ database coordinates from formatted file
PRPCHK  Checks if all PDB files required for database generation exist
PRP101  Copies all PDB files to default directory
PRP102  Regenerates PDB.LIS from the database
PRP103  Checks if all HSSP files exist
PRP107  Creates formatted HSSP derived files
PRP108  Re-creates real HSSP derived files
PRP110  Creates the 1-3 bond table
PRP115  Creates the 1-2-3 angle table for the soup
Most PRP options are obligatory for most users who want to use their
own database. Some options are optional (and those are marked as such).
PRP008 and PRP011 create the least-needed big output files, and we suggest
you do not run these options until WHAT IF tells you that you have to
run them.
Many of these options give some output for every database protein. Of
course you should carefully read all this output!
This option takes all the PDB files mentioned in the PDB.LIS file,
performs a number of corrections, and dumps the corrected PDB file in
the local directory. This takes about 20 hours on a very fast
workstation (this sentence was written in May 1997). In theory, PRP001
is optional, but in practice you do not want to skip it.
The first command needed to really generate the first files of the
database is PRP002. This command reads the file PDB.LIS and then
all PDB-files (chains) listed in it. It creates the file TOTALS.SEQ,
which holds all sequences and is read upon starting WHAT IF with the
database present, and the file MUTDB.IND,
which is a formatted file with some statistical (= dumb counters)
information about the files in the database. MUTDB.IND is very
important, as it holds the basic hash values for the internal pointer
administration for database usage.
The second command needed upon generation of the database is PRP003. This
command reads the files MUTDB.IND (generated by PRP002) and PDB.LIS.
It will thereafter read all PDB-files (chains)
and create the file ALCOOR.XYZ, which holds all protein information,
and the file ALDRUG.XYZ, which holds all information about the drugs and
waters in each entry.
These files are direct access files for which the indexing hash tables are
generated upon starting WHAT IF from the information stored in MUTDB.IND.
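The order dependency between PRP002 and PRP003 follows from the files they
read and write. The Python sketch below only restates that bookkeeping; it
is not part of WHAT IF, and the names OPTION_FILES and missing_inputs are
hypothetical. PDB.LIS is assumed to be present because the user supplies it.

```python
# Files read and written per option, as described in the text above.
OPTION_FILES = {
    "PRP002": {
        "reads": {"PDB.LIS"},
        "writes": {"TOTALS.SEQ", "MUTDB.IND"},
    },
    "PRP003": {
        "reads": {"PDB.LIS", "MUTDB.IND"},
        "writes": {"ALCOOR.XYZ", "ALDRUG.XYZ"},
    },
}


def missing_inputs(run_order, initial_files=("PDB.LIS",)):
    """Return (option, missing files) pairs for options that are run too early."""
    available = set(initial_files)
    problems = []
    for option in run_order:
        spec = OPTION_FILES[option]
        missing = sorted(spec["reads"] - available)
        if missing:
            problems.append((option, missing))
        # After the option has run, its output files exist.
        available |= spec["writes"]
    return problems
```

Calling missing_inputs(["PRP003", "PRP002"]) reports that PRP003 lacks
MUTDB.IND, which is why PRP002 has to come first.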
The command PRP004 can be run at any moment after PRP002 and PRP003.
It will cause WHAT IF to run over all residues stored in the file
ALCOOR.XYZ (which was generated by PRP003) and check the intra-residue
bond lengths. Any pair of covalently bound atoms whose bond distance
is too long will be tagged as bad atoms for further WHAT IF
operation. Several other checks for bad atoms are performed too. It is
highly recommended that you run this option.
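The idea behind the bond-length check can be illustrated with a few lines of
Python. This is a sketch under stated assumptions and not the WHAT IF
implementation: the function name flag_long_bonds is hypothetical, the
caller must supply the list of covalent bonds, and the cut-off max_length is
a free parameter because this chapter does not say which value WHAT IF uses.

```python
import math


def flag_long_bonds(coords, bonds, max_length):
    """Return the names of atoms that take part in a too-long bond.

    coords     : dict mapping atom name -> (x, y, z) in Angstrom
    bonds      : iterable of (atom_a, atom_b) pairs that are covalently bound
    max_length : cut-off in Angstrom (an assumption chosen by the caller)
    """
    bad_atoms = set()
    for atom_a, atom_b in bonds:
        distance = math.dist(coords[atom_a], coords[atom_b])
        if distance > max_length:
            # Both partners of a too-long bond are tagged as bad.
            bad_atoms.update((atom_a, atom_b))
    return bad_atoms
```

For example, with a backbone N at the origin, CA at 1.46 Angstrom along x and
a misplaced C at 5.0 Angstrom along x, a cut-off of 2.0 flags CA and C but
not N.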
The command PRP005 can be run at any moment after PRP002 and PRP003.
It will cause WHAT IF to run DSSP (or the WHAT IF DSSP-emulator if you
do not have a DSSP license) on all files in the database. The
secondary structure determinations will be extracted from the DSSP
output files and stored in the file ALLHST.HST for fast access
later.
The command PRP006 can be run at any moment after PRP002 and PRP003.
It will cause WHAT IF to run over all residues stored in the file
ALCOOR.XYZ (which was generated by PRP003) and create the fragment
database file from that. This file is called CACA.NEW. CACA stands for
C-Alpha - C-Alpha distance. This file holds the distance records and
the pointers (hash-tables) to these distance records. This file is
mostly needed for model building, structure validation and DG*** based
mutation prediction.
The command PRP007 can be run at any moment after PRP002 and PRP003.
This option will add the accessibility to every atom in the protein database
file ALCOOR.XYZ. This option is rather time consuming. It is used when
coordinates are extracted from the database (e.g. for DGLOOP options) to
have their accessibility values directly available. Database search options
that involve accessibilities use the special tables made by PRP010.
The command PRP008 can be run at any moment after PRP002 and PRP003.
It creates the datafile needed by the NEACON option. This file is big
and is only used by one option: NEACON. It is not distributed in the
standard version of WHAT IF. It is not recommended that you run this
option until WHAT IF tells you that you need it.
The command PRP009 can be run at any moment after PRP002 and PRP003.
It creates the files with phi, psi, omega, and chi1-5 values (8 files in
total) which are needed for the backbone torsion angle and chi-angle
operations of the relational database.
The command PRP010 can be run at any moment after PRP002 and PRP003.
It creates the file ALLACC.ACC which is needed for the residue accessibility
options of the relational protein database (see also PRP007).
The command PRP011 can be run at any moment after PRP002 and PRP003.
It creates the very VERY BIG file (more than 100MB) ALCONT.ACT and its
hash table ALHASH.CON. These files are used by the contact options of
the relational database. If you need to delete any database files
because of space problems, these are probably the best candidates. It
is not recommended that you run this option unless WHAT IF tells you
that you should.
The command PRP012 can be run at any moment after PRP002 and PRP003.
Creates the file ALLHYD.MOM which is used by the relational protein
database for the hydrophobic moment option. This table is used for
hydrophobic moment related database search options.
The command PRP013 can be run at any moment after PRP002 and PRP003.
Creates the file ALLCYS.CYS which is used by the relational protein
database to look up CYS-CYS bridges (or free cysteines).
This option is reserved for future use.
The command PRP015 can be run at any moment after PRP002 and PRP003.
Adds all torsion angles to the file ALCOOR.XYZ. There are not yet many
applications for this, but since this takes no extra space, why not do
it? Some of the options that extract coordinates from the database
use these torsion angles, and one can expect a small speed improvement
in homology modelling when this option has been executed. No option
will give different answers depending on this option being used or
not.
This option stores the beginning and end of all secondary structure
elements in the database in the file ELMNTS.DIS. As this file is not yet
used by a single option, you can just as well skip PRP016 for now.
This option is reserved for future use.
Upon reading DSSP files WHAT IF converts the secondary structure codes
(see the DSSP chapter). If you want to work with the original DSSP
determinations, you need to have run this option; you can then use
the option ORGDSP in the DSSP menu to overwrite the internal
secondary structure data in the database. ORGDSP only overwrites
it for the present run: the next time you or somebody else runs WHAT IF,
you get the converted secondary structure determination back.
This option makes the HSSP profiles relationally accessible to WHAT IF.
It creates the file PROFIL.SEQ. As only one, not often used option uses
this file, I suggest that you don't run PRP019 till WHAT IF tells you
that you need to do so.
This option makes the HSSP residue conservation relationally
accessible to WHAT IF. It creates the file CONSER.HSP. As only one,
not often used option uses this file, I suggest that you don't run
PRP020 till WHAT IF tells you that you need to do so.
This option makes the backbone hydrogen bond tables for SCAN3D.
Some files/pieces of information are not essentially coupled to the
database, but describe features that give a "feel" of protein
structure. These options should definitely only be run if the
database being generated is an unspecialized general database.
Running these should never be needed.
This option creates RAMA.LIN. There should never be a need to run this
option. The distributed version should always work well.
This option calculates the average and standard deviation of the
Ramachandran check for all proteins in the database. This is used
to calibrate the Ramachandran Z-score. The results are stored in
PARAMS.FIG.
This option calculates the average and standard deviation of the
Chi-1/Chi-2 check for all proteins in the database. This is used to
calibrate the chi1/chi2 Z-score. The results are stored in
PARAMS.FIG.
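The calibration described for AUX002 and AUX003 amounts to computing an
average and a standard deviation over the database and using them to turn a
raw check score into a Z-score. The Python sketch below illustrates only
that arithmetic. The function names are hypothetical, the use of the sample
standard deviation is an assumption, and nothing is said here about how WHAT
IF computes the raw scores or lays out PARAMS.FIG.

```python
import statistics


def calibrate(raw_scores):
    """Return (average, standard deviation) of the raw scores of all proteins."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(raw_scores), statistics.stdev(raw_scores)


def z_score(raw_score, average, standard_deviation):
    """Express a raw score in standard deviations from the database average."""
    return (raw_score - average) / standard_deviation
```

A protein whose raw score equals the database average gets a Z-score of 0;
a score two standard deviations above the average gets a Z-score of 2.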
This option is step one in the generation of the NEW quality control
boxes: It calculates all the boxes. Before this option is run, the
NQA database must be completely deleted manually. Not for the faint of
heart.
This option is step two in the generation of the NEW quality control
boxes: It calibrates all the residue scores. This option needs to be run
immediately after AUX004.
This option is step three in the generation of the NEW quality control
boxes: It calibrates the structure Z-scores. This option needs to be run
immediately after AUX005.
This option is the last in the generation of the NEW quality control
boxes: It adds the virtual atom types that make reverse quality
control possible. This option needs to be run immediately after
AUX006.
This option creates the database used for the inside/outside check of
residue polarity by the INOCHK option.
This option calibrates the backbone conformation Z-score for the BBCCHK.
This option calibrates the chi angle check for EVACHI and CHICHK. It creates
the file EVACHI.CHI.
This option calibrates the improper dihedrals used for the HNDCHK.
The option PRPCHK will check if all PDB files, etc., that are needed
for the generation of the database are actually available to WHAT IF.
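A comparable availability check can be sketched outside WHAT IF. The Python
below is an illustration only: the function name missing_pdb_files is
hypothetical, and the default file naming (pdbXXXX.ent in lower case) is an
assumption that must be adapted to the naming used in your own PDB
directory. The directories to search would be the one indicated in
CCONFI.FIG and the local directory.

```python
from pathlib import Path


def default_name(pdb_id):
    # Assumption: files are called e.g. pdb1crn.ent; adapt to your archive.
    return f"pdb{pdb_id.lower()}.ent"


def missing_pdb_files(pdb_ids, directories, name_for=default_name):
    """Return the identifiers for which no file exists in any directory."""
    missing = []
    for pdb_id in pdb_ids:
        file_name = name_for(pdb_id)
        found = any((Path(d) / file_name).is_file() for d in directories)
        if not found:
            missing.append(pdb_id)
    return missing
```

An empty result means every identifier has a file in at least one of the
directories searched.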
MAKFMT Creates a formatted coordinate file (ALCOOR.FMT)
Only needed upon installation, don't use yourself.
MAKXYZ Regenerates ALCOOR.XYZ database coordinates from formatted file
Only needed upon installation, don't use yourself.
Should not be needed. Use only if your PDB archive is a shared CD or
something like that.
Only for developers. Don't use.
Quickly checks whether the HSSP files for all entries in PDB.LIS are
available. It is useful to run this option BEFORE any of the options that
depend on HSSP files. If HSSP files are
missing, you can use the PHD server to get them generated.
Only needed upon installation, don't use yourself.
Only needed upon installation, don't use yourself.
Only for developers. Don't use.
Only for developers. Don't use.