I am not a neural network expert, so do not expect very fancy features or
novel developments. The WHAT IF neural network module is written as a toy
that can be used universally for small data sets. For theory about neural
networks you are referred to your local library.
Use the general command NEURAL to enter the neural network module.
The network will probably be used by people
interested in QSAR techniques (structure-activity relationships
for drugs). In principle the network should be able to replace the classical
QSAR modules, but in practice I think it is only useful to use QSAR and
the network in parallel.
Neural networks are normally used to do pattern recognition. They are often
useful to detect hidden correlations between data. In the special case of QSAR
problems a neural network should in principle be capable of finding correlations
between the parameters for the variable active groups.
Most neural networks accept bit patterns as input. The WHAT IF neural network,
however, expects real numbers as input, which greatly enhances its
flexibility. In practice, the neural
network will find a two- to three-fold smaller deviation between the
observed and calculated binding constants than classical QSAR methods for
90 to 95 percent of all data points. However, for the other 5 to 10
percent of the data points the correlation is five to ten times worse.
Many experiments still have to be performed, but it looks to me that this
is a good way of detecting outliers in the data set.
Most neural networks suffer badly from the multiple minimum problem. The
WHAT IF neural network uses an optimization scheme based on random
neuron alterations. This ensures that, given sufficient CPU time, the
global minimum can also be found.
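To illustrate the idea (this is a Python sketch, not the WHAT IF code; the
error_of helper and the greedy accept/reject rule are my own assumptions,
and the real scheme, with its temperature handling described further below,
is more involved), one round of random neuron alterations could look like this:

    import random

    def train_round(junctions, error_of, stepsize):
        """One round: randomly alter each junction, keep only improvements.
        'junctions' is a flat list of junction values; 'error_of' is assumed
        to return the RMS error on the training set for a given list."""
        best_error = error_of(junctions)
        for i in range(len(junctions)):
            old = junctions[i]
            junctions[i] = old + random.uniform(-stepsize, stepsize)
            new_error = error_of(junctions)
            if new_error < best_error:
                best_error = new_error      # keep the improvement
            else:
                junctions[i] = old          # undo the alteration
        return best_error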
If one only wants to analyse a data set, one can just read in a set of
data points, each consisting of a series of variables X and an associated
parameter Y that is a function of the X's: Y=F(Xi,Xj,Xk,...). In QSAR
applications the X's are the volume, charge, etc. of groups in the molecule, and
Y is the binding constant. The
option TRAIN will then try to optimize the junctions (also called neurons) in
the neural network to fit this dataset. WHAT IF will at regular intervals
give some information about the present status of the fitting procedure.
The commands SAVNEU and RESNEU can be used to respectively save and restore
the network architecture, and the values for the neurons. If one wants to
predict the binding constant for unknown compounds, one should use the
GETSET command again, but now with 0.0 for the last parameter (the binding
constant) for each compound. The command SHOSET can be used both to evaluate
the progress of the training and to use the present neuron values
to predict the binding constants.
The training phase can be very CPU intensive; the testing phase, however, is
blazingly fast.
Please read some literature about neural networks, especially about the size of
the dataset and the corresponding network architecture. If there are not
enough neurons in the net, the network will not generalize and errors will
be larger than necessary. If there are too
many, the network will get over-trained, and the predictions will become
random. I suggest you start with twice as many neurons as there are
variables in your dataset. I also suggest that you do not use too many
hidden layers; one, or sometimes two, will almost always work fine.
The command NETWRK can be used to define the network architecture (number
of input nodes, hidden layers, etc.).
The WHAT IF neural network is a bit geared towards QSAR related problems.
One can use one input layer with a maximal width of 50, or in QSAR terms,
every compound can have at most 50 variable parameters. There are at most
ten hidden layers each with a maximal width of 50. This does not seem very
much, but be aware that a QSAR set of 50 compounds with five parameters
per compound has only 250 variables, and a neural network with 5000 neurons
can certainly learn such a dataset by heart. The WHAT IF neural network has
only one output unit. This unit will hold the binding constant.
Of course you can use the network for other purposes; the limitation is then
that you need N (N at most 50) reals as input, and one real as output.
I have personally also used it for secondary structure prediction, but this
took very, very much CPU time (and gave Chou-and-Fasman-like results).
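As a rough picture of the kind of network involved (a sketch, not the WHAT IF
source; the tanh transfer function is an assumption), a fully connected
feed-forward net with one output unit can be evaluated like this in Python:

    import math

    def predict(x, hidden_layers, output_weights):
        """'x' is the list of input reals (at most 50 in WHAT IF terms),
        'hidden_layers' is a list of weight matrices (one row of weights per
        node), and 'output_weights' holds the junctions from the last hidden
        layer to the single output unit (the binding constant in QSAR)."""
        signal = x
        for weights in hidden_layers:
            signal = [math.tanh(sum(w * s for w, s in zip(row, signal)))
                      for row in weights]
        return sum(w * s for w, s in zip(output_weights, signal))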
The following is a training session. Just do EXACTLY as you are being told.
If the input is typed in capitals in this write-up, you type it in capitals
in WHAT IF; if it is in lower case here, you type it in lower case....
Leave WHAT IF.
Restart WHAT IF.
Type:
neural
exampl
getset
TRAIN.NEU
netwrk
2
5
2.5
5.0
train
200
shoset
grafic
end
scater
grafic
go
You now see the results graphically. You can rotate/translate it etc.
Click on CHAT, because it is now time to USE the net to predict values
with the neural net.
Below you see a dataset that has the answers given. The file without the answers
is called TEST.NEU. So, type:
end
getset
TEST.NEU
shoset
With `neural` you went to the neural network menu. The `exampl` command copied
a training dataset, called TRAIN.NEU, that can be approximated
with a non-linear function. With
the `getset` command you read this dataset in. There are 30 data points.
With `netwrk`, `2`, `5`, `2.5`, `5.0` you created a network architecture consisting
of 2 hidden layers of 5 nodes each. WHAT IF will try to keep the values of
the junctions between -2.5 and 2.5, but junctions outside -5.0, 5.0 are
forbidden.
With `train` and `200` you told WHAT IF to do 200 rounds of network optimization.
This will take a couple of minutes on an INDIGO workstation. You will see
the error converge around a value of 0.20. That is a little bit
bigger than the error that I put into this dataset (0.14). (Try more and wider
hidden layers overnight, and you will see that the error can get smaller.
This is called over-training: the network learns the data by heart rather
than extracting the hidden correlations.) The `shoset` command gives two sets of
output: the first half shows the input values, the observed results, the
calculated results, and the error in the calculated results. The second
half also displays the tolerance of the net (see below).
The little excursion with `grafic`
and `end` is needed to initialize the graphics window. The command `scater`
(`scatter`, which is better English, is accepted too) will make a scatter
plot in which the data points are green and the calculated values red. The
size of the cross is a measure of the error. The second `shoset` command
does the same as the first, but now the errors are of course irrelevant. You
should just look at the calculated answers. The true answers are given below.
If you were to take the trouble of calculating the RMS between the expected
and calculated values in the test set, you would probably find an RMS around
0.7. That nicely indicates one of the problems of neural nets. They are black
boxes, very deep-black black boxes.....
1.823 1.311 3.633
0.424 0.140 0.549
0.906 1.296 2.603
0.129 0.690 0.605
1.472 0.419 1.728
1.013 0.226 1.155
1.202 0.733 1.836
0.409 1.550 2.984
0.681 1.092 2.003
1.511 1.764 4.697
1.397 1.096 2.740
1.462 1.560 3.916
1.772 0.221 1.949
0.146 0.777 0.907
0.871 1.240 2.530
0.959 0.482 1.267
0.274 0.907 1.185
0.453 1.726 3.545
1.355 0.504 1.620
0.782 0.658 1.283
1.076 1.002 2.194
0.515 0.201 0.712
1.666 0.574 2.175
0.140 0.430 0.330
1.565 0.476 1.839
0.778 1.875 4.439
1.266 0.920 2.299
1.222 1.545 3.663
0.473 0.609 0.874
1.982 0.616 2.367
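If you want to compute that RMS between the true and the predicted answers
yourself, a few Python lines like these will do (a sketch; the two short
value lists are made up for the example):

    def rms(expected, calculated):
        """Root-mean-square deviation between two equally long lists."""
        n = len(expected)
        return (sum((e - c) ** 2 for e, c in zip(expected, calculated)) / n) ** 0.5

    print(rms([3.633, 0.549], [3.1, 0.8]))   # two hypothetical predictions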
The training data file and the files with data points for which the output
value (binding constant in QSAR) should be predicted have the same format.
The only difference
is that in a training dataset the last column, which is the measured output
(binding constant in case of QSAR), is relevant, whereas in a testing set
this number is irrelevant (but has to be there).
The input can be free format; there should, however, be at least one blank
character between numbers. For the time being all numbers for one data point
should fit on one 80-character line.
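A small Python sketch of how one such free-format line can be split into the
X's and the Y may make the layout clearer (just an illustration, not part of
WHAT IF):

    def parse_point(line):
        """Numbers are separated by at least one blank; the whole data point
        fits on one line of at most 80 characters. The last number is the
        measured output, or 0.0 in a testing set."""
        numbers = [float(token) for token in line.split()]
        return numbers[:-1], numbers[-1]     # (X's, Y)

    xs, y = parse_point("1.823  1.311  3.633")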
The command GETSET can be used either for reading the training dataset, or
for reading the dataset that holds the variables for which the output should
be predicted.
The command ININEU will cause WHAT IF to initialize all neurons, and to reset
the stepsize (the maximal amount by which a neuron can change in one round) and
other parameters that were automatically updated during the training phase.
Use SHOPAR to see what parameters you now have.
The command MAKNEU will cause WHAT IF to prompt you for a file name.
It will write all junction values (those are the parameters that are
supposed to be optimised) in that file. See also GETNEU.
The command GETNEU will cause WHAT IF to prompt you for a file name.
It will read all junction values (those are the parameters that are
supposed to be optimised) from that file. See also MAKNEU.
The command SAVNEU will prompt you for a network save-file number. I strongly
suggest that you use the suggested default until you know what you are
doing. A file called NEURAL***.WHAT IF will be created; *** is the save-file
number. The network architecture and the values of all junctions (neurons)
will be stored in this file. Use RESNEU to restore the network from this
file. The input dataset will NOT be stored in this save file.
If you have saved the network architecture and neuron values in a file with
the SAVNEU command, you can restore the net from this file with the RESNEU
command. You will be prompted for the save-file number. This number should
of course be the same number as used for the SAVNEU command. The input dataset
is not saved in the save file, and thus cannot be restored from it.
WARNING. Strange things will happen if the network architecture and the
data set do not belong together!
The network needs to be trained before it can predict anything. You should
give it a sufficiently large and reliable data set so that it can try to
find the hidden rules that govern the relation between the variables used
as input (values for variables like charge, volume, etc. in QSAR), and the
value (binding constant in QSAR) that comes out.
If you have too many neurons in the network you will run into the over-training
problem. That means that the network will not determine general rules, but
rather will learn your data by heart. If you want to circumvent this, you
should not take too many training steps, or use fewer neurons, but it
is impossible to determine
the optimum. A good, but time consuming, way of checking that you have the
correct architecture and training length is the jack-knife method. That is,
take out the data points one after the other, train the network on all but
this one data
point, and for each training run determine at the end the error in the
prediction of the output parameter (binding constant in QSAR) for the one
value that was removed from the data set.
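A minimal Python sketch of that jack-knife loop; the train() and predict()
helpers are hypothetical stand-ins for the TRAIN and SHOSET steps:

    def jackknife(points, train, predict):
        """Leave-one-out estimate of the prediction error. 'points' is a
        list of (xs, y) tuples; 'train' builds a network from a training
        set, 'predict' returns the network output for one xs."""
        errors = []
        for i, (xs, y) in enumerate(points):
            training_set = points[:i] + points[i + 1:]   # all but this point
            net = train(training_set)
            errors.append(predict(net, xs) - y)
        n = len(errors)
        return (sum(e * e for e in errors) / n) ** 0.5   # RMS over left-out points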
If there are data points that you trust more than others, you can make this
clear to WHAT IF by putting those data points multiple times in the input
data.
To start a
training procedure, you use the command TRAIN. You will be prompted for the
number of rounds. I suggest that you start with one round to get an
impression about the amount of CPU time needed. As WHAT IF trains the net
incrementally, no training round will ever get lost.
Use the SHOSET command to see the progress of the training.
The option TRAIN (see above) will all the time use the entire dataset.
This could lead to a situation in which the network learns to
recognise all data points, while losing the ability to extrapolate.
This artefact can be reduced (at the cost of a considerable amount
of CPU time) by leaving out data points, and judging the
quality of the network by its predictive power on those points
that were left out.
The WHAT IF neural network is just a toy, but one of the things
it has actually proven to be good at is outlier detection in QSAR
datasets. The option QQSAR behaves the same as TRAIN, but will actually
be doing outlier detection in the input dataset.
The command SHONEU will display the values for all junctions (connections
between neurons) in the network. You will see N lines of M numbers. N is the
number of nodes in the input layer, M the number of nodes per hidden layer.
Thereafter L-1 times M lines with M numbers will be shown. L is the number
of hidden layers, and M the number of nodes per hidden layer.
Finally M numbers, the junctions from the last hidden layer to the output
unit, will be shown. See also the figure at the end of this chapter.
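The number of junction values shown thus follows directly from the three
architecture parameters; a small Python sketch of that bookkeeping, assuming
two input variables as in the example dataset of the tutorial above:

    def junction_count(n_input, n_hidden_layers, hidden_width):
        """N*M junctions from input to first hidden layer, (L-1)*M*M between
        hidden layers, and M from the last hidden layer to the output unit."""
        return (n_input * hidden_width
                + (n_hidden_layers - 1) * hidden_width * hidden_width
                + hidden_width)

    print(junction_count(2, 2, 5))   # 2*5 + 1*5*5 + 5 = 40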
The command SHOSET will cause WHAT IF to display all datapoints from the
input dataset, together with the measured output value (binding constant for
QSAR applications) and the calculated/predicted output value, and the
difference between these last two values, which is called the error. At
the end, the RMS error will be shown. In case you
are training the net, this is a good way of checking the training performance.
In case you are using the net for prediction purposes you should of course
neglect the errors, and only look at the predicted output values (which in
case of QSAR applications will be the predicted binding constants).
The command SHOPAR will (in the NEURAL menu) display the neural network
architecture and training dynamics parameters. The number of input nodes
(also called input units, or the width of the input layer), the number of
hidden layers (ranging from one to ten), and the number of nodes in the
hidden layers (also called the width of the hidden layers) will be
displayed. As there is only one output unit, and since all layers are always
completely connected, and since the WHAT IF network only works in so-called
feed forward mode, these three parameters completely define the network
architecture.
A hard and a soft limit for the neurons will also be listed. During the training
phase WHAT IF tries to keep the values of the neurons between plus and minus the
soft limit. However, as soon as the absolute value of a neuron exceeds the
hard limit, a reset will be done for this neuron, even if this makes the
whole performance worse. If this happens, you should increase
the hard and soft limits. Be aware that the product of the width of
the hidden layers and the hard limit should have at least the same order of
magnitude as the expected values at the output unit.
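A Python sketch of that limit behaviour; the reset value used here is my own
assumption (WHAT IF may reset a runaway neuron differently):

    import random

    def apply_hard_limit(value, soft, hard):
        """Reset a junction whose absolute value exceeds the hard limit;
        here it is reset to a random value inside the soft range."""
        if abs(value) > hard:
            return random.uniform(-soft, soft)
        return value

    # Tutorial limits: soft 2.5, hard 5.0.  Order-of-magnitude check from the
    # text: hidden-layer width * hard limit = 5 * 5.0 = 25, comfortably above
    # binding constants of a few units.
    print(apply_hard_limit(6.3, 2.5, 5.0))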
If the network training is going well and you are getting
close to convergence, cooling down, that is, decreasing
the size of the steps that WHAT IF uses to change the
neurons, can speed up the convergence. A smaller stepsize
will lead to faster training, but a greater chance of getting
stuck in a local minimum. Type COOLNW to reduce the step size
by a factor of 1.5. See also HEATNW.
The process of slowly decreasing the step size in Monte Carlo-like procedures,
such as the one chosen to optimize the WHAT IF network, is often called
simulated annealing.
If you think that the network is stuck in a local minimum
you can try the HEATNW option. It will increase the maximal
allowed change in junction value per training step. Heating
up will increase the chance of finding the global minimum,
but will slow down the training process. The option HEATNW
increases the step size by a factor of 1.5. See also COOLNW.
The option SHAKEN will shake the whole neural net a little bit.
This will of course make the results worse, but it could help
getting out of a local minimum in network space.
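A very small Python sketch of such a shake (the uniform noise is an
assumption; the amount would in practice be tied to the current step size):

    import random

    def shake(junctions, amount):
        """Add a small random perturbation to every junction value."""
        return [j + random.uniform(-amount, amount) for j in junctions]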
The parameters as described under the SHOPAR command can be (re-)set with the
PARAMS command. In this case you will simply be prompted for the five
parameters in a row.
This parameter determines the initial 'temperature' of the network.
Optimally, choose a value that is a bit bigger than the expected
average change in the neurons during the training. If you
don't know this value, take the average difference between all
input values.
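My reading of that rule of thumb, as a Python sketch (pool all input X values
into one list and take the mean absolute pairwise difference); check it
against your own data:

    def average_difference(values):
        """Mean absolute difference over all pairs of values."""
        diffs = [abs(a - b)
                 for i, a in enumerate(values)
                 for b in values[i + 1:]]
        return sum(diffs) / len(diffs)

    print(average_difference([1.823, 1.311, 0.424, 0.140, 0.906, 1.296]))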
The 'temperature' of the neurons is automatically scaled
during the run when the observed temperature and the desired
temperature differ by more than this factor. I have not yet determined
any useful rule of thumb for this parameter. Play with it if
all else fails.
It sometimes (not too often) helps to add a linear offset to the
relation between
the input and output layer. Play with this option if
all else fails.
If you expect a small number of very bad values in your input data
it might help to use robust instead of quadratic error
estimates. This flag can do that for you.
If the output of the neural net tells you that you have some
data points that really don't fit with the rest, you can try to
use a third-power error function instead of the normal
quadratic one. This forces the training to pay much more attention
to those few bad values. The SMTERR parameter does just that for you.
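To make the difference concrete, here are the three error measures on a
single residual as a Python sketch; the exact functional forms used inside
WHAT IF are not documented here, so these are assumptions:

    def quadratic_error(r):
        return r ** 2              # the normal least-squares term

    def robust_error(r):
        return abs(r)              # down-weights large outliers

    def third_power_error(r):
        return abs(r) ** 3         # emphasises the worst points (SMTERR style)

    for r in (0.1, 1.0, 3.0):
        print(r, quadratic_error(r), robust_error(r), third_power_error(r))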
This option switches on the debug flag. It could be handy if you
start modifying the code yourself. Otherwise, better don't use it....
Together with the DSSP option HSTMTS, MAKTST is needed for the not
yet functioning neural network that one day will improve the secondary
structure determinations.
The option SCATER will, after training, make a scatter plot of the
training set values versus the values that the net produced.