TRAINING

Feature normalization

The main training algorithm of the toolkit is back-propagation through time. However, before we go into the BackProp command, let's look at the normalization of the input data. The values in the external data files are not always well suited to be copied directly to network activities. This is because the activities in the network are normally in the range [-1; 1], and if some activities differ significantly from this range, training performance will be degraded (this is also discussed in the Streams section). Streams have a linear transformation between their external and internal data. The command NormStream can be used to set up this transform to normalize input streams so that the internal data is roughly in the range [-1; 1], which improves the training performance.

This is the syntax of NormStream:

USAGE: NormStream [options] Net [Input]
       Option                                               Default
       -S             Treat 'Input' as an inputfile script   (off)
       -s stream      Process the specified stream          (all streams)
       -d mult        Center around mean
                      mapping: [mean-s*mult, mean+s*mult] -> [-1,1]
                                                            (min,max -> -1,1)
       -0             Reset linear coeff's (no Input)
"Input" is a "training data" kind. By examining the external data of "Input", the program computes the normalization parameters. Often the training data is stored in many different files (in speech examples, typically one file per utterance). Then the -S option should be used. When the -S option is used, "Input" is not treated as a datafile, but as a script file holding a list of data files. In either case, it is the base name of the datafiles that matters -- the directory and the extension is taken from the information in each stream of the network (see the Streams section).

This method of specifying multiple datafiles with the -S option is shared by all training and evaluation commands in the toolkit. It is practical to make a few different script files, for example one each for training files, evaluation files and testing files.
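
A script file is simply a plain text file with one datafile per line. Since only the base names matter, a hypothetical "trainfiles.script" for a speech task could look like this:

       utt0001
       utt0002
       utt0003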

The default normalization method is to normalize each component in such a way that the minimum value is mapped to -1 and the maximum value is mapped to +1. However, this can sometimes be a bad idea, since it only considers the most extreme, and often rare, cases. The -d option provides an alternative to this min/max strategy: it maps the mean of the external data to 0.0, and a user-specified factor times one standard deviation to +/-1.0.
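
From the mapping quoted in the usage message, [mean-s*mult, mean+s*mult] -> [-1,1], it follows that the transform is internal = (external - mean) / (mult * s), where s is the standard deviation of the component. To map, for example, three standard deviations around the mean to [-1; 1] (file names hypothetical as before):

       NormStream -S -d 3.0 mynet trainfiles.script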

Back-propagation training

When the input streams are normalized, we are ready to start training the network with the BackProp command. Here is the syntax:
USAGE: BackProp [options] Net Input
       Option                                                           Default
       -S            Treat 'Input' as an inputfile script               (off)
       -s obj1 obj2  Select connections from obj1 to obj2               (all)
       -g gain       Linear gain factor                                 (1e-3)
       -m float      Momentum parameter                                 (0.9)
       -w factor     Weight decay                                       (off)
       -f num        Update frequency - update every 'num' frames       (Max+1)
       -F min max    Random update after between min and max frames     (off)
       -E            Epoch updating                                     (off)
       -p name       Name of runtime error progress report file         (none)
       -T level      Level of detail in progress report file            (0)
       -P name m n   Update progress every m:th and net every n:th iteration
       -i iter       Number of iterations                               (100)
       -V set stream Specify a Validation set and stream                (off)
       -B decay num  Multiply gain with 'decay' after epochs where
                     the validation set's global error is not
                     improved, but maximum 'num' times.                 (off)
       -e float      Error Criterium                                    (off)
       -d            Store external data in RAM                         (off)
       -M address    Send mail to 'address' when finished               (off)
Again, "Input" is either a data file or, with the -S option, a script file with a list of datafiles.

The back-propagation training often takes a lot of time. Typically BackProp is run in the background, sometimes for several days. In such cases (and in other cases too) I recommend that you use the -p option (or the -P option). The -p option causes the progress of a training session to be printed after each epoch (epoch = one iteration through all training data). Then, if something is wrong and the network isn't learning, you can stop the session and restart with new parameters.
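
As a sketch, such a background run with a progress report could be started like this (the file names and the number of iterations are hypothetical):

       BackProp -S -p progress.log -i 50 mynet trainfiles.script &

The progress can then be followed in "progress.log" while the session is running.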

The -f and -F options are used when working with dynamic networks. Theoretically, the back-propagation through time algorithm requires that, for each data file, you first do the forward pass to the end of the file and then do the backward pass from the end to the beginning. But this would allow weight updating only once per file. However, if the temporal dependencies that we expect the network to learn have a smaller time-span than whole files, we can allow ourselves to restart the back-propagation after a smaller number of samples. For speech recognition, for example, it seems sufficient to use -F 20 30 when a sample (frame) is 10 ms.
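
With the hypothetical file names from above, such a run could look like this; the back-propagation is then restarted at a random point every 20 to 30 frames (200-300 ms at a 10 ms frame rate):

       BackProp -S -F 20 30 -p progress.log mynet trainfiles.script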

The optimal gain and momentum parameters are highly task dependent. You should not trust the default values for the -g and -m options. It can be worthwhile to experiment a little with these before starting a big training session.
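
For example, a short trial run with a smaller gain and momentum might look like this (the values are only a hypothetical starting point, not a recommendation):

       BackProp -S -g 5e-4 -m 0.7 -i 5 -p progress.log mynet trainfiles.script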

The -B option defines a scheme for reducing the gain during the backprop run. A reduction occurs if the error on the validation set does not improve from one epoch to the next. Since the scheme is based on the validation set's error, -B is used together with the -V option. The first argument after -B specifies the factor to multiply the gain with at each reduction, and the second argument specifies the maximum allowed number of reductions. For example, "-B 0.5 5" means that the gain is halved at each reduction and that at most five reductions will take place. If the error still does not improve after five reductions, the training is terminated.

The -s option provides a way to train only a selected set of connection weights. This can also be done by manipulating the connections' plasticity (see the Connections section and the command SetPlast).

For classification problems the -V option can be useful. By a classification problem I mean one where the stream has one component for each class and the "choice" of the network is the component with the highest value (see also the Evaluation section). The validation set should be different from the training set. This way you can monitor in the progress file (-p option) how the classification performance changes during the training. You can then stop the training when the classification performance starts to degrade due to over-learning of the training data.
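
Putting the pieces together, a full training run with a validation set, gain reduction and a progress report could be sketched like this, where all file names and the stream name "out" are hypothetical:

       BackProp -S -V validfiles.script out -B 0.5 5 -p progress.log mynet trainfiles.script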