Automated Refinement Procedure

Version 4b

User Guide

March 1997.

General

The Automated Refinement Procedure (ARP) is a program package for refining protein structures. The scientific aspects of the procedure were published by Lamzin, V.S. & Wilson, K.S. in Acta Crystallographica (1993) D49, 129-147, and a more recent manuscript has been accepted for publication in Methods in Enzymology (1997): Macromolecular Crystallography (Carter, C. & Sweet, B. eds.).

The refinement procedure requires iterative use of least-squares minimisation, density map calculation and the arp program itself. The least-squares minimisation can be done with the CCP4 programs (sfall/protin/prolsq or protin/refmac), with optional scaling (e.g. using rstats). CCP4 is the SERC (UK) Collaborative Computing Project No. 4: a Suite of Programs for Protein Crystallography, distributed from Daresbury Laboratory, Warrington WA4 4AD, UK. Use of other programs for least-squares minimisation, such as shelxl-96 (Sheldrick, G.M. & Schneider, T. (1996) SHELXL: High-resolution refinement. In Methods Enzymol.: Macromolecular Crystallography (Carter CM, Sweet RM, Eds.), in the press) or tnt, requires additional conversion to the CCP4 format. Density map calculation should be carried out with the CCP4 programs fft and extend (mapmask).

Arp updates the model by identifying and removing poorly defined atoms and adding new atoms. Rejection of atoms is carried out on the basis of the density interpolated at the atomic centre, the deviation of the density shape from sphericity and some distance criteria. Addition of atoms is performed on the basis of difference density coupled with distance constraints. Some other facilities like sphere-based real space refinement are also provided. wARP (Perrakis, A., Sixma, T.K., Wilson, K.S. & Lamzin, V.S. submitted) is an extension to ARP, involving the weighted averaging of multiple refinements. wARP is used as a map modification technique for improvement of MIR, MAD or Molecular Replacement maps.

Limitations

As ARP runs in conjunction with programs of the CCP4 suite, all limitations of the latter apply. Arp itself has the following limitations:

1. Density map in the CCP4 format.

2. Maximum map section size is 400,000 points.

3. Maximum number of map sections is 1,000.

4. Maximum number of atoms in extended asymmetric unit is 250,000.

5. Only acentric space groups (typical for proteins) and P1 bar are supported.

6. Arp operates with coordinate files in the standard PDB format as follows:


CRYST1  104.000   54.600   79.000  90.00 100.70  90.00 
SCALE1      0.009615  0.000000  0.001817       0.000000 
SCALE2      0.000000  0.018315  0.000000       0.000000 
SCALE3      0.000000  0.000000  0.012882       0.000000 
ATOM      1  N   ALA A   1      25.100   9.398  92.698  1.00 23.96 
ATOM      2  CA  ALA A   1      25.093  10.741  92.132  1.00 24.99 
.............. 
ATOM   5939  OW0 WAT X   1      23.326   9.951  59.663  1.00   7.68 
ATOM   5940  OW0 WAT X   2      19.153   7.518  59.052  1.00   8.48

Cell parameters are read as the first 6 numbers after the CRYST1 card. The orthogonalisation matrix is read as the first 4 numbers after each of the SCALE1, SCALE2 and SCALE3 cards. The fourth (translational) component must be zero. The coordconv program from CCP4 can be used to output coordinates in this format.

Water molecules are recognised as having the residue name WAT or HOH and an atom name starting with O. Metals are recognised by their atom name.

7. The CCP4 conventions should be set up before running arp.
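The header cards and the water rule described in item 6 can be illustrated with a short Python sketch. This is not code from arp: the function names are hypothetical, the header fields are assumed to be separated by white space, and atom and residue names are assumed to sit in the standard PDB columns.

```python
def read_cell(line):
    """Return the first 6 numbers after a CRYST1 card (a, b, c, alpha, beta, gamma)."""
    fields = line.split()
    if not fields or not fields[0].startswith("CRYST"):
        raise ValueError("not a CRYST1 card")
    cell = tuple(float(f) for f in fields[1:7])
    if len(cell) != 6:
        raise ValueError("CRYST1 card must carry 6 cell parameters")
    return cell


def read_scale_row(line):
    """Return the first 4 numbers after a SCALE1, SCALE2 or SCALE3 card."""
    fields = line.split()
    if not fields or fields[0] not in ("SCALE1", "SCALE2", "SCALE3"):
        raise ValueError("not a SCALE card")
    row = tuple(float(f) for f in fields[1:5])
    if len(row) != 4:
        raise ValueError("SCALE card must carry 4 numbers")
    if row[3] != 0.0:
        # The guide requires the translational component to be zero.
        raise ValueError("translational component must be zero")
    return row


def is_water(atom_line):
    """Water: residue name WAT or HOH and an atom name starting with O."""
    atom_name = atom_line[12:16].strip()   # PDB columns 13-16
    res_name = atom_line[17:20].strip()    # PDB columns 18-20
    return res_name in ("WAT", "HOH") and atom_name.startswith("O")
```

Applied to the sample file above, read_cell returns the six values from the CRYST1 card and is_water accepts the OW0 atoms of residue WAT while rejecting the ALA atoms.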

Originators

ARP Victor S. Lamzin (EMBL Hamburg), Keith S. Wilson (EMBL Hamburg, York University)

wARP Anastasis Perrakis (NKI Amsterdam)

References

The user is requested to forward all bugs or suggested changes to:

Victor S. Lamzin, EMBL c/o DESY, Notkestrasse 85, D-22603 Hamburg, Germany, Tel +49-40-89902-121, Fax +49-40-89902-149, E-mail victor@embl-hamburg.de

or to the Email server accessible via the ARP page http://den.nki.nl/~perrakis/arp.html

Any application of ARP should cite Lamzin, V.S. & Wilson, K.S. (1993). Automated refinement of protein models. Acta Crystallogr. D49, 129-147.

Use of wARP should cite Perrakis, A., Sixma, T.K., Wilson, K.S. & Lamzin, V.S. (1997) wARP: Improvement and extension of initial crystallographic phases by weighted averaging of multiple refined dummy models. Acta Crystallogr. D, submitted.

Acknowledgements

The authors are grateful to E. Dodson (York), J. Sevcik (Bratislava), S. Butterworth (Hamburg), P. Evans (Cambridge) and Z. Dauter (York) for their valuable suggestions.

ARP modes

ARP can be used in various modes as shown in Table 1. If the initial model needs to be substantially improved, then unrestrained least-squares minimisation with updating of all atoms in the model is carried out (Mode 1, unrestrained ARP). If the initial model is more or less correct and essentially only solvent needs to be improved, restrained (standard) refinement with automatic adjustment of solvent structure using arp is useful (Mode 2, restrained ARP). There is also a possibility of improving MIR phases (Mode 3), which consists of building a "protein like" model in the MIR map followed by unrestrained ARP of this model. This is described in the wARP supplement.


Table 1. ARP modes.

---------------------------------------------------------------------------------------------------------
Mode               Input          Initial R-fac (%)   LSQ            Resolution (A)   To improve
---------------------------------------------------------------------------------------------------------
Unrestrained ARP   "Bad" model    > 30                unrestrained   2.0 or better    loops, side chains
Restrained ARP     "Good" model   < 30                restrained     2.5 or better    solvent
wARP               Density map    -                   unrestrained   2.5 or better    map
---------------------------------------------------------------------------------------------------------

The output of arp is a set of ARP atoms (ARP model). Atoms which were removed are left in the coordinate file but assigned zero occupancy. New atoms added to the model, as well as solvent molecules, are transformed to lie around the protein atoms (if present in the file) using symmetry operators. A density map with coefficients (3Fo-2Fc, alpha_c) must be calculated from the ARP model and analysed using graphics. For restrained ARP (Mode 2) both (3Fo-2Fc, alpha_c) and (Fo-Fc, alpha_c) maps should be inspected. The initial or ARP model is rebuilt to fit these maps. Experience shows that if the X-ray resolution is high enough and the initial model is essentially correct, even in the case of unrestrained ARP, the ARP atoms are located at approximately the true protein atom positions and can be used as guides for rebuilding, either automatically or using graphics.
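Because removed atoms stay in the output file with zero occupancy, it can be convenient to separate them before loading the model into a graphics program. The Python sketch below shows one way to do this. It is not part of the ARP distribution: both function names are hypothetical, and it assumes the occupancy sits in the standard PDB columns 55-60.

```python
def split_removed(pdb_lines):
    """Separate kept lines from atoms flagged as removed (occupancy 0.00)."""
    kept, removed = [], []
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        if is_atom and float(line[54:60]) == 0.0:
            removed.append(line)
        else:
            kept.append(line)   # header cards and atoms that survived
    return kept, removed


def atom_line(serial, name, res, chain, resno, x, y, z, occ, b):
    """Format one ATOM record in standard PDB columns (helper for the demo)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, res, chain, resno, x, y, z, occ, b)


if __name__ == "__main__":
    demo = [
        "CRYST1  104.000   54.600   79.000  90.00 100.70  90.00",
        atom_line(1, " N  ", "ALA", "A", 1, 25.100, 9.398, 92.698, 1.00, 23.96),
        atom_line(2, " OW0", "WAT", "X", 1, 23.326, 9.951, 59.663, 0.00, 7.68),
    ]
    kept, removed = split_removed(demo)
    print(len(kept), "lines kept,", len(removed), "atom(s) flagged as removed")
```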

Quality of initial model

As both rejection and addition are carried out on the basis of electron density maps calculated with model phases, the starting model for the refinement should be "reasonable". Experience shows that if the model includes all protein atoms but not water molecules, and CA atoms are within 0.5 A of their correct positions, it can be successfully refined. If the initial model is about 75 % complete and CA atoms are within 1.0 to 1.5 A of their true positions, ARP still provides a clear improvement of the density. Updating water molecules only (restrained ARP) requires good quality for the protein part of the model.

Quality of X-ray data

The data should normally be of high resolution (see Table 1 above). Unrestrained ARP at lower resolution can potentially lead to a poor quality density map. However, if the model is nearly correct and only limited parts need to be improved (e.g. solvent or ligand molecules), unrestrained ARP can be applied at lower resolutions such as 2.5 A. The X-ray data should be complete, especially in the low resolution range (5 A and lower). If the strong low resolution data are systematically incomplete (e.g. missing overloaded reflections), the density map, even in the case of good models, is usually discontinuous and not consistent with the model. Because ARP involves updating on the basis of density maps, such discontinuity can lead to incorrect interpretation of the density and, as a result, to slow convergence or even non-interpretable maps on output. The wARP procedure for density modification needs data to at least 2.2 - 2.5 A resolution, depending on the solvent content. In general, the number of observed reflections MUST be at least 6 times higher than the number of protein and solvent atoms in the model.
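The reflection-to-atom rule in the last sentence above is simple arithmetic and can be checked before a job is set up. The following Python sketch is only an illustration; the function names are hypothetical and nothing like them is provided by arp.

```python
def enough_reflections(n_reflections, n_atoms, min_ratio=6.0):
    """True if observed reflections number at least min_ratio times the atoms."""
    return n_reflections >= min_ratio * n_atoms


def largest_model(n_reflections, min_ratio=6.0):
    """Largest number of protein plus solvent atoms the data can support."""
    return int(n_reflections // min_ratio)


if __name__ == "__main__":
    # Hypothetical data set: 15,000 observed reflections, 2,400 atoms.
    print(enough_reflections(15000, 2400))
    print(largest_model(15000))
```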

ARP command file

The ARP command file consists of several blocks which are iterated. The last block is concerned with removal of poorly located atoms and addition of new atoms and is performed by the arp program itself. Examples of command files are available as separate files.

Keyworded input to ARP

Arp input is keyworded. Only the first four characters of each keyword or subkeyword (except END) are important. An input card starts with a keyword which may be followed by subkeywords. Input cards can appear in any order except END which must be the last one. Subkeywords within a card may also appear in any order.

arp       XYZIN input_coordinates MAPIN1 3Fo-2Fc_map_file_name \ 
          MAPIN2 Fo-Fc_map_file_name XYZOUT output_coordinates << eof 
CELL      number number number number number number 
SYMMETRY  number/string 
REMOVE    ATOMS number ANALYSE WATERS/ALLATOMS CUTSIGMA number MERGE number 
FIND      ATOMS number CHAIN string CUTSIGMA number/AUTO 
FDISTANCE NEWOLD number number NEWNEW number 
REFINE    WATERS/ALLATOMS 
END 
eof


Keywords CELL, SYMM and END are obligatory and must be given.

Keyword REFI and subkeyword MERG are optional.

Keywords REMO, FIND and FDIS are "half optional". Either REMO or (FIND and FDIS) or all three keywords must be given.

Subkeyword ANAL plays an important role: it influences both removal and addition (see below).

MAPIN1 should be the (3Fo-2Fc, alpha_c) map and must be provided if the REMO card is given.

MAPIN2 should be the (Fo-Fc, alpha_c) map and must be provided if the FIND and FDIS cards are given.
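The rules above on obligatory and optional cards can be summarised in a small checker. The Python sketch below is an assumption-laden illustration and not the arp pre-processor itself: the function name, its arguments and the wording of the messages are all hypothetical.

```python
def check_cards(cards, have_mapin1=False, have_mapin2=False):
    """Return a list of problems found in a list of arp input cards."""
    # Only the first four characters of each keyword are significant.
    keys = [c.split()[0][:4].upper() for c in cards if c.strip()]
    problems = []
    for required in ("CELL", "SYMM", "END"):
        if required not in keys:
            problems.append("card %s is missing" % required)
    if "END" in keys and keys[-1] != "END":
        problems.append("END must be the last card")
    remove = "REMO" in keys
    find = "FIND" in keys and "FDIS" in keys
    if ("FIND" in keys) != ("FDIS" in keys):
        problems.append("FIND and FDIS must be given together")
    if not remove and not find:
        problems.append("either REMO or (FIND and FDIS) must be given")
    if remove and not have_mapin1:
        problems.append("MAPIN1 is required when REMO is given")
    if find and not have_mapin2:
        problems.append("MAPIN2 is required when FIND and FDIS are given")
    return problems


if __name__ == "__main__":
    deck = ["SYMMETRY 19", "CELL 30 45 47 90 90 90", "END"]
    for message in check_cards(deck):
        print(message)
```

An empty list means that the deck satisfies these particular rules. It says nothing about the values on each card, which arp checks separately.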

CELL

Cell parameters a, b, c, alpha, beta, gamma in A and degrees.

SYMMETRY

True crystal symmetry. Can be given either as a space group name or number (e.g. P212121 or 19).

REMOVE

Presence of this card initialises rejection mode. Rejection influences the success of refinement to a much greater extent than addition of new atoms and should certainly be used. The number after ATOMS is the number of atoms to reject. A value of about 25 to 100 % of the number of atoms to be added is recommended. This defines the maximum number of atoms which may be rejected; the actual number will be determined by the program. The string after ANALYSE can be either ALLATOMS (both protein and water atoms from the model will be considered for rejection) or WATERS (only water molecules will be rejected). Water molecules are assumed to have residue name WAT or HOH. Metals will be treated as non-water atoms. The mode ANALYSE ALLATOMS is used for unrestrained ARP, where both protein and solvent need to be improved. The mode ANALYSE WATERS is used in restrained ARP if the protein part of the model is essentially correct. All removed atoms will be retained in the coordinate set but assigned zero occupancy.

The number after CUTSIGMA is a density cutoff. Atoms will be considered for rejection only if they are located in density below CUTSIGMA * r.m.s. density (MAPIN1). A value of 0.5 to 1 is recommended. The number after the MERGE subkeyword is the shortest distance allowed between two atoms if they are not to be merged. If the mode is ANALYSE ALLATOMS, any pair closer than this will be inspected; one atom will be rejected and the other assigned the weighted average xyz and 1/B parameters. A merging distance of 0.6 A is recommended in this mode. If the mode is ANALYSE WATERS, the same will be applied to water-water pairs, and any water that is too close to a protein atom will simply be rejected without merging. A merging distance of 2.2 A is recommended in this mode. ARP automatically removes an atom (in ANALYSE ALLATOMS mode) or a water atom (in ANALYSE WATERS mode) if it has no neighbours within 3.5 A.

FIND

The number after ATOMS is the number of atoms to add. At the end of refinement (it may take 20 to 50 cycles) the model should contain all atoms. The "target" number of atoms in the final model can be estimated as the number of protein atoms * 1.2, where the extra 20 % corresponds both to ordered water molecules and to weak ones important for a pseudo solvent continuum. The number of atoms allowed to be added in each cycle depends on the resolution. A simple empirical guide is that the maximum number to add is N*0.08/d^3, where N is the current number of atoms and d is the highest resolution (A). Thus at a resolution of 1.8 A and with a coordinate file of 2,000 atoms the maximum number to be added is 27. As for removal, the actual number of atoms added will be determined by the program. New atoms will be automatically assigned a temperature factor on the basis of the density height.

The string after CHAIN is the chain identifier for new atoms. All new atoms will have this chain identifier and be numbered sequentially. It is advisable that none of the atoms in the initial model has this chain identifier. The field after CUTSIGMA can be either a number or AUTO. The number is a density cutoff. Atoms will be looked for in density above CUTSIGMA * r.m.s. density (MAPIN2). A value of 3 to 4 sigma should be used. The statistically significant density threshold can be defined automatically by the program if AUTO is used. This is strongly recommended, especially for ANALYSE WATERS mode, as it prevents too many extra atoms being added.
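The two estimates in this paragraph, the per-cycle limit N*0.08/d^3 and the target model size of 1.2 times the number of protein atoms, are easy to evaluate by hand or with a few lines of Python. The sketch below uses hypothetical function names, and rounding the limit down to a whole atom is an assumption, chosen because it reproduces the worked example of 27 atoms.

```python
def max_atoms_to_add(n_atoms, resolution):
    """Empirical per-cycle limit N * 0.08 / d**3, rounded down."""
    if n_atoms < 0 or resolution <= 0:
        raise ValueError("need n_atoms >= 0 and resolution > 0")
    return int(n_atoms * 0.08 / resolution ** 3)


def target_atom_count(n_protein_atoms):
    """Estimated final model size: protein atoms plus about 20 % solvent."""
    return round(n_protein_atoms * 1.2)


if __name__ == "__main__":
    # Worked example from the text: 2,000 atoms at 1.8 A resolution.
    print(max_atoms_to_add(2000, 1.8))
    print(target_atom_count(2000))
```

The cubic dependence on d means the limit falls quickly with resolution: the same 2,000-atom file at 2.5 A allows only about 10 atoms per cycle.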

FDISTANCE

Distance constraints to find new atoms. The two numbers after NEWOLD are the minimum and maximum distances between new atoms and existing atoms in the model. The minimum distance is usually set to 0.7*resolution for ANALYSE ALLATOMS mode or to 2.2 A, the minimum water-protein distance, for ANALYSE WATERS. There is no reason to make the maximum distance between a new atom and existing atoms in the model longer than about 3.3 A, the maximum water-protein distance. The addition of new atoms is based on analysis of the difference Fourier map, which does not predict density well if it is too far from the current model. If there are new features far from the model, the procedure should reach these after a few cycles. The number after NEWNEW is the minimum distance between new atoms. A value of 2.2 A, the minimum water-water distance, is usually used to prevent several atoms being put in one peak where there should be one atom only.
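The usual FDISTANCE settings described above depend only on the ANALYSE mode and the resolution, so the card can be generated rather than typed. The Python sketch below is an illustration with a hypothetical function name. The two-decimal formatting is an arbitrary choice, and the values are the usual ones quoted in the text, not limits enforced by arp.

```python
def fdistance_card(mode, resolution):
    """Build an FDISTANCE card from the usual values quoted in the guide."""
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    if mode == "ALLATOMS":
        new_old_min = 0.7 * resolution   # unrestrained ARP
    elif mode == "WATERS":
        new_old_min = 2.2                # minimum water-protein distance (A)
    else:
        raise ValueError("mode must be ALLATOMS or WATERS")
    new_old_max = 3.3                    # maximum water-protein distance (A)
    new_new_min = 2.2                    # minimum water-water distance (A)
    return "FDISTANCE NEWOLD %.2f %.2f NEWNEW %.2f" % (
        new_old_min, new_old_max, new_new_min)


if __name__ == "__main__":
    print(fdistance_card("ALLATOMS", 2.0))
    print(fdistance_card("WATERS", 2.0))
```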

REFINE

This provides real space refinement on the assumption that the atomic density should be spherical. It is more powerful at resolution higher than 2 A or if atoms are clearly separated (e.g. if only solvent atoms are refined). The mode can be either ALLATOMS (all atoms will be refined; not recommended unless the resolution is about 1.0 A) or WATERS (strongly recommended for ANALYSE WATERS mode).

END

Must be the last data card terminating input to ARP.

On-line help

The arp input pre-processor gives warnings or error messages if something is wrong. These should be checked carefully. It is also advisable to check the arp input before submitting a long refinement job.

Examples of on-line commands:


ARP 
END
 An error message: Data Card SYMM is missing.

ARP 
SYMM
 An error message: Keyword SYMM must be followed by 1 field(s). Expected format: SYMMETRY Space_group_name.

ARP 
SYMM 4
 Output message: Asymmetric unit limits 1 / 1   1 / 2   1 / 1. These limits should be used for input maps.
END
 Error message: Data Card CELL is missing.

ARP 
SYMM 4 
CELL
 An error message: Keyword CELL must be followed by 6 field(s). Expected format: CELL a b c alpha beta gamma.

ARP 
SYMM 4 
CELL 30 45 47 90 90 90 1
 An error message: Cannot accept field shown by arrows: CELL 30 45 47 90 90 90 ==>1<==.

ARP 
SYMM 4 
CELL 30 45 47 90 90 90 
END
 An error message: Either REMO or FIND Data Card must be given.

ARP 
SYMM 4 
CELL 30 45 47 90 90 90 
REMO 
END
 An error message: Keyword REMO must be followed by 1 field(s). Expected format: REMOVE ATOMS number ANALYSE string CUTSIGMA number [MERGE number].

ARP 
SYMM 4 
CELL 30 45 47 90 90 90 
REMO ATOMS 10 ANALYSE WATERS CUTSIGMA 2.0 
END
 Now the command file is formally correct and the program gives warnings only:
 Comments: Remove 10 old atoms if below 2.0 sigma in MAPIN1.
 -- This is not a standard use of ARP
 -- a value between 0.5 and 1.0 sigma is recommended
 -- assuming that MAPIN1 is 3Fo-2Fc map.
 Comments: Analyse waters only for removal
 -- This is not a standard use of ARP
 -- use of MERGE data card is desirable
 -- use both removal and addition of atoms is advisable 
 -- real space refinement of waters is advisable

Fix these and try the following:

ARP 
SYMM 4 
CELL 30 45 47 90 90 90 
REMO ATOMS 10 ANALYSE WATERS CUTSIGMA 1.0 MERGE 1.0 
FIND ATOMS 10 CUTSIGMA AUTO CHAIN X 
FDIS NEWOLD 2.2 3.3 NEWNEW 2.2 
REFI WATERS 
END
 Only one warning is now given:
 Comments: Waters located closer than 1.00 A to protein atoms will be rejected.
 -- This is not a standard use of ARP
 -- a value of 2.2 A is recommended

Fix this and the arp program will finally accept the input. However, it will then make further checks.

Further input processing and limitations

Arp checks that the input cell parameters are identical to those in the coordinate and map file headers. Arp does not check whether the cell parameters are consistent with the space group, i.e. it will accept CELL 67.1 82.2 79.9 102.2 98.9 100.3 together with SYMM P212121.

Arp checks whether the orthogonalisation matrix derived from CELL is consistent with the matrix written at the top of the coordinate file.
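A check of this kind can be reproduced outside arp. The Python sketch below derives the SCALE matrix from the cell parameters and compares it with the SCALE cards. It assumes the standard PDB convention (a along x, b in the xy plane). The function names and the tolerance are hypothetical, and the test that arp itself applies is not described in this guide.

```python
import math


def scale_matrix(a, b, c, alpha, beta, gamma):
    """Matrix taking orthogonal A coordinates to fractional (SCALE1-3 rows)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # v is the cell volume divided by a*b*c.
    v = math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    return [
        [1.0 / a, -cg / (a * sg), (ca * cg - cb) / (a * v * sg)],
        [0.0, 1.0 / (b * sg), (cb * cg - ca) / (b * v * sg)],
        [0.0, 0.0, sg / (c * v)],
    ]


def consistent(cell, scale_rows, tol=1e-5):
    """True if every SCALE term agrees with the cell within tol."""
    derived = scale_matrix(*cell)
    return all(abs(derived[i][j] - scale_rows[i][j]) <= tol
               for i in range(3) for j in range(3))


if __name__ == "__main__":
    # Header of the sample coordinate file shown under Limitations.
    cell = (104.0, 54.6, 79.0, 90.0, 100.7, 90.0)
    scale = [(0.009615, 0.0, 0.001817),
             (0.0, 0.018315, 0.0),
             (0.0, 0.0, 0.012882)]
    print(consistent(cell, scale))
```

The tolerance has to allow for the six decimal places to which the SCALE cards are written.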

Arp will refuse to accept a negative value for the number of atoms to update but does not check whether these numbers are too high, i.e. whether they are consistent with the formula given above.

Arp does not check whether the input MAPIN1 is indeed a (3Fo-2Fc, alpha_c) map and MAPIN2 a (Fo-Fc, alpha_c) map. It also does not check the input coordinate file in terms of proper connectivity, residue and atom names, etc.

Monitoring

Several parameters can be used as convergence criteria. The first criterion is map quality: a map with coefficients (3Fo-2Fc, alpha_c) is calculated from the last ARP model. The crystallographic R factor is also a reasonable quantity to monitor.

Arp gives several parameters on output. These are:

The number of atoms merged,

The number of atoms removed,

The sphericity functions indicating whether atoms are well shaped - a value of about 0.05 to 0.10 (the lower the better) is reasonable,

The result of improvement of the sphericity function if sphere-based real space refinement is used,

The statistically significant threshold in difference density (if FIND CUTSIGMA AUTO is provided) for addition of new atoms,

The number of atoms added.

It is important to look through the arp part of the output file.

The AUTO option provides an attempt to be objective in adding atoms. The actual number of atoms to remove depends on both the REMOVE CUTSIGMA value and the ATOMS number. If, while reshuffling the structure, the user asks for too little removal, the result will be that not enough new atoms are found. If the requested number for removal is too high (but still satisfies the formula given above), more new atoms will be found. A situation where in each cycle arp removes fewer than about 2-3 atoms (for a typical structure of 1,000 to 3,000 atoms) and finds the same number of new ones, while the R factor does not change, indicates that convergence has been achieved.

There is no reason to run millions of cycles. Usually refinement essentially converges after 10 to 20 cycles. However if the density is still getting better the number of cycles can be increased to 50 or even 100.
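The convergence criterion described above (few atoms removed and added per cycle, and a stable R factor) can be tested automatically once the numbers have been collected from the output of each cycle. The Python sketch below is only an illustration. The function name is hypothetical, and the three-cycle window, the limit of 3 atoms and the R factor tolerance are assumptions, not values prescribed by arp.

```python
def has_converged(cycles, max_updates=3, r_tolerance=0.001, window=3):
    """Decide convergence from per-cycle (n_removed, n_added, r_factor) tuples.

    cycles is ordered oldest first; r_factor is given as a fraction.
    """
    if len(cycles) < window:
        return False
    recent = cycles[-window:]
    # Few atoms removed and added in every recent cycle.
    few_updates = all(removed <= max_updates and added <= max_updates
                      for removed, added, _ in recent)
    # R factor essentially unchanged over the same cycles.
    r_values = [r for _, _, r in recent]
    stable_r = max(r_values) - min(r_values) <= r_tolerance
    return few_updates and stable_r


if __name__ == "__main__":
    # Hypothetical history of five cycles.
    history = [(20, 25, 0.250), (10, 12, 0.220),
               (2, 2, 0.2010), (1, 2, 0.2005), (2, 1, 0.2005)]
    print(has_converged(history))
```

Map quality remains the first criterion, so a check like this can only suggest when it is worth inspecting the maps.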

Supplementary use of arp

After restrained refinement is complete and before using the graphics, it is worth knowing which parts of the model should be corrected. Arp can be used for this purpose.

$ ARP XYZIN input.BRK MAPIN1 3Fo-2Fc.MAP XYZOUT temp 
CELL number number number number number number 
SYMMETRY number/string 
REMOVE ATOMS 50 ANALYSE ALLATOMS CUTSIGMA 1.0 
END 

The output of this job will contain a list of the 50 worst atoms (from arp's point of view), i.e. those which do not agree with the electron density. These atoms should be inspected first. The input MAPIN1 should be the (3Fo-2Fc, alpha_c) map.

$ ARP XYZIN input.BRK MAPIN2 Fo-Fc.MAP XYZOUT output.BRK 
CELL number number number number number number 
SYMMETRY number/string 
FIND ATOMS 10 CHAIN F CUTSIGMA 3.0 
FDIS NEWOLD 0.01 5.0 NEWNEW 2.2 
END 

This job is a sort of "peak" search. The input MAPIN2 should be the (Fo-Fc, alpha_c) map. The new atoms will be placed in the highest peaks of the difference map and will be appended at the end of the coordinate file. The user can simply load this file into the graphics program, centre on these new atoms and try to find out what is wrong with the model.