X-PLOR provides the possibility of cross-validation in reciprocal space, as described by Brünger (1992, 1993).

The most common measure of the quality of a crystal structure is the
**R** value (Eq. 12.2).
**R** is closely related to the crystallographic
residual (cf. Eq. 12.1)

$$\sum_{h,k,l} \left( |F_{obs}(h,k,l)| - k\,|F_{calc}(h,k,l)| \right)^2$$

which is a linear function of the negative logarithm of the
likelihood of the atomic model, assuming that
all observations are independent and normally distributed.
**R** can be made arbitrarily
small by increasing the number of model
parameters and subsequent refinement against
the residual; i.e., the diffraction data can be overfit
without changing the information content of the atomic model.

Crystallographic diffraction data are redundant to some degree; e.g.,
a small portion of the
data can be omitted without seriously affecting the
result.
Following the statistical concept of
cross-validation,
the observed reflections are partitioned into a test set **T** and a
working set **A** (Brünger 1992); that is,
**T** and **A** are
disjoint, and their union is the full set of
observed reflections.
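The value

$$R^{free}_{T} = \frac{\sum_{(h,k,l) \in T} \left|\, |F_{obs}(h,k,l)| - k\,|F_{calc}(h,k,l)| \,\right|}{\sum_{(h,k,l) \in T} |F_{obs}(h,k,l)|}$$

is referred to
as the free **R** value computed for the **T** set of reflections.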
**T** is omitted in the modeling process;
e.g., in the case of
crystallographic refinement,
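the residual to be minimized is given by

$$\sum_{(h,k,l) \in A} \left( |F_{obs}(h,k,l)| - k\,|F_{calc}(h,k,l)| \right)^2 .$$

One would expect the free **R** value to be less
prone to overfitting than **R**.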
This concept can be applied to the
other statistical quantities available in X-PLOR,
such as the standard linear correlation
coefficient (Eq. 12.1). It can even be applied to
crystal structures that have already been refined with all diffraction
data included: refinement by simulated annealing
with **T**
omitted will remove some of the model's memory of **T**.
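As an illustration only (this is Python, not X-PLOR input), the conventional and free **R** values for a partitioned data set might be computed as sketched below. The function name `r_values`, the use of a single least-squares scale factor k determined on the working set, and the convention that a nonzero flag marks the test set are assumptions made for this sketch; they do not describe X-PLOR's internals.

```python
def r_values(f_obs, f_calc, test):
    """Return (R over the working set A, free R over the test set T).

    f_obs, f_calc: structure factor amplitudes, one entry per reflection.
    test: flags, 0 = working set A, nonzero = test set T.
    Both sets are assumed to be non-empty.
    """
    work = [(fo, fc) for fo, fc, t in zip(f_obs, f_calc, test) if t == 0]
    free = [(fo, fc) for fo, fc, t in zip(f_obs, f_calc, test) if t != 0]

    # Assumption: the scale factor k is the least-squares scale between
    # observed and calculated amplitudes, determined on the working set only.
    k = sum(fo * fc for fo, fc in work) / sum(fc * fc for _, fc in work)

    def r(pairs):
        # R = sum | |Fobs| - k |Fcalc| | / sum |Fobs| over the given set
        return sum(abs(fo - k * fc) for fo, fc in pairs) / sum(fo for fo, _ in pairs)

    return r(work), r(free)
```

For the amplitudes `[10, 20, 30, 40]` and `[10, 20, 30, 50]` with flags `[0, 0, 0, 1]`, the working-set **R** is 0 while the free **R** is 0.25: the model agrees perfectly with the data it was fit to, and only the omitted reflection reveals the discrepancy.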

Both the free **R** value and the
rms difference between the model refined against the complete data
set and the model refined against **A**
increase more or less monotonically as a function of
the percentage of omitted data. This is to be
expected of terms that monitor the validity of a model.
In contrast, the conventional **R** decreases,
which is a paradoxical and misleading behavior for an indicator of the
model's accuracy. As a compromise between
avoiding fluctuations of the free **R** value and maintaining small rms
differences between refined models, obtain **T** from
a random selection of 10% of the observed reflections.

The free **R** value (or correlation coefficient)
is printed along with the conventional
**R** value (correlation coefficient) during all
refinement procedures in X-PLOR, including **PC**-refinement
for molecular replacement. In addition, the data
analysis can be carried out for both the test set **T** and
the working set **A** when one is using the "PRINt R", "PRINt PHASe",
and "PRINt COMPleteness" statements. The **R** values or
correlation coefficients are stored in the symbols
$R, $TEST_R, $CORR, and $TEST_CORR
whenever a computation of the crystallographic residual has been carried out, e.g.,
when a "PRINt TARGet" statement has been issued or an
energy calculation has been carried out.

The following two example files show how to use the
free **R** value concept in X-PLOR. Basically, none of the
example files described in the previous section
have to be changed. The only requirement is
to create a special reflection file that tells X-PLOR
which reflections belong to the test set and the
working set. This is indicated by the TEST array.
The example file below randomly selects 10% of the
data and sets the TEST array to 1 for them. Subsequently,
a new reflection file "amy.cv" is written that should be
used for all subsequent X-PLOR runs. X-PLOR
automatically partitions the data into the
working set and the test set whenever the TEST array
contains nonzero elements. The reflections with
TEST=1 are used for the free **R** value (correlation)
computation.
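The flagging step can be pictured with the following Python sketch. It is not the X-PLOR example file and does not use X-PLOR syntax. The function name `assign_test_flags` and its parameters are hypothetical, and the sketch selects exactly the requested fraction of reflections, which may differ from how X-PLOR draws its random selection.

```python
import random


def assign_test_flags(n_reflections, fraction=0.10, seed=None):
    """Return a list of TEST flags: 1 for the test set, 0 for the working set."""
    rng = random.Random(seed)  # a fixed seed makes the partition reproducible
    n_test = round(n_reflections * fraction)
    flags = [0] * n_reflections
    # Draw n_test distinct reflection indices at random and flag them.
    for index in rng.sample(range(n_reflections), n_test):
        flags[index] = 1
    return flags
```

Whatever tool creates the flags, the partition should be made once and then reused for every subsequent run, as described above, so that the test reflections never enter the refinement.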

The example file below is a combination of the slow-cooling simulated annealing refinement cycle described in Section 13.1.3 and the restrained B-factor refinement described in Section 13.4. Note that no change was required in the input files except for using the "amy.cv" reflection file.

As a consequence of the SA-refinement with the test set
omitted, the free **R** value deviates from the
conventional **R** value. However, the free **R** value
decreases during the course of the refinement, even
though the test set of reflections has been omitted from
the refinement process. This indicates that the
information content and phase accuracy of the model
increase during the refinement process. If at
any stage in the refinement process (e.g., after
refining additional water molecules) the free **R** value
increased, it would indicate that the phase accuracy
of the model was worsened by the additional refinement.
The free **R** value can thus be used to prevent the user
from overfitting the diffraction data.

Figure 15.1 was produced by obtaining the
free and conventional **R** values using the UNIX
grep facility from the X-PLOR output file
(searching for "TEST=1" and "TEST=0").
The resulting lines were fed into a spreadsheet
program.
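Where grep is not available, the same filtering can be sketched in Python. The function name `split_r_lines` and the file name in the comment are hypothetical. The sketch only collects the matching lines, as grep would, and makes no assumption about the layout of the X-PLOR output beyond the two search strings given above.

```python
def split_r_lines(lines):
    """Separate output lines mentioning TEST=0 (working set) and TEST=1 (test set)."""
    working, free = [], []
    for line in lines:
        text = line.rstrip("\n")
        if "TEST=1" in text:
            free.append(text)
        if "TEST=0" in text:
            working.append(text)
    return working, free


# Example use (the file name is hypothetical):
# with open("refine.out") as handle:
#     working, free = split_r_lines(handle)
```

The two lists can then be written out or pasted into a spreadsheet program for plotting.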

**Figure 15.1:** Course of refinement.

Sat Mar 11 09:37:37 PST 1995