A SPEECH EXAMPLE

Introduction

This is an example application of phoneme probability estimation: a network for phoneme recognition at the acoustic frame level is built and trained on the TIMIT database. The TIMIT CDROM is one of the best-known phonetically labelled speech databases and can be ordered from the Linguistic Data Consortium (LDC).

Structuring the database

The training, feature extraction, and target generation tools place some constraints on how you can organize your database: 1) all files corresponding to the same utterance must have the same base-name; 2) all files of the same type, e.g., phonetic label files or feature vector files, must have the same file extension and live in the same directory. If your database is not organized this way, you can create new directories and make symbolic links to the physical locations of the files. The script make_timit_links (Figure 1) does just that for the TIMIT CDROM. Change the variable TIMITDIR to the actual path of the CDROM on your system and the variable LINKDIR to the directory where you want to put the links; the script also writes the utterance lists trainutts and testutts used in the following sections.
 
#! /bin/csh -f

# set this variable to the directory of the TIMIT database
set TIMITDIR = /mnt/cdrom/timit

# set this variable to the (absolute) directory where you want to store
# the links to wave-form files and phoneme label files
set LINKDIR = `pwd`

set WAVEDIR = $LINKDIR/waveform 
set PHONEDIR = $LINKDIR/phnlab 

mkdir -p $WAVEDIR
mkdir -p $PHONEDIR

rm -f $LINKDIR/trainutts
rm -f $LINKDIR/testutts

# training set: link the wave-form and label files and build the
# list of training utterance base-names
set CATEGORY = ${TIMITDIR}/train
foreach DIALECT ( ${CATEGORY}/dr? )
  pushd ${DIALECT}
  foreach SPEAKER ( * )
    pushd ${SPEAKER}
    foreach FILE ( *.wav )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $WAVEDIR/${SPEAKER}_${FILE}
      echo ${SPEAKER}_${FILE} | sed 's/\.wav$//' >> $LINKDIR/trainutts
    end
    foreach FILE ( *.phn )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $PHONEDIR/${SPEAKER}_${FILE}
    end
    popd
  end
  popd
end

# test set: same procedure, but the base-names go to testutts
set CATEGORY = ${TIMITDIR}/test
foreach DIALECT ( ${CATEGORY}/dr? )
  pushd ${DIALECT}
  foreach SPEAKER ( * )
    pushd ${SPEAKER}
    foreach FILE ( *.wav )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $WAVEDIR/${SPEAKER}_${FILE}
      echo ${SPEAKER}_${FILE} | sed 's/\.wav$//' >> $LINKDIR/testutts
    end
    foreach FILE ( *.phn )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $PHONEDIR/${SPEAKER}_${FILE}
    end
    popd
  end
  popd
end

Figure 1. Script for creating symbolic links to the wave-form and phonetic label files of the TIMIT database, and for writing the utterance lists trainutts and testutts.
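
After running the script, LINKDIR contains the two utterance lists and the two link directories. As an illustration (using the TIMIT training-set speaker fcjf0 in dialect region dr1), the links and list entries look roughly like this:

waveform/fcjf0_sa1.wav -> /mnt/cdrom/timit/train/dr1/fcjf0/sa1.wav
phnlab/fcjf0_sa1.phn -> /mnt/cdrom/timit/train/dr1/fcjf0/sa1.phn

and trainutts contains the corresponding base-name fcjf0_sa1 on a line of its own.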

Feature extraction

The tool Barkfib implements standard feature extraction based on a Bark- or Mel-scaled filter-bank computed from the FFT spectrum. The -c option selects cepstrum features instead of filter-bank features. In the following example it is assumed that the directory waveform contains the links to the TIMIT wave-form files created above, trainutts and testutts are the utterance lists written by the script in Figure 1, and mfcc is the directory to store the feature files in. With -c12 and -e, each frame presumably gets a 13-dimensional feature vector (12 cepstrum coefficients plus an energy measure); this is the dimensionality referred to in the following sections.

Barkfib -T1 -S -m -Fnist -n24 -c12 -e -d mfcc -xmfcc -p waveform -qwav testutts
Barkfib -T1 -S -m -Fnist -n24 -c12 -e -d mfcc -xmfcc -p waveform -qwav trainutts
 

Generation of phoneme targets

The phoneme targets for each 10 ms frame of the feature files can be extracted from the phonetic transcription files of the database. The tool Lab2Targ does this by checking the length of the parameter files and using the time-marks of the transcription files to compute the target for each phoneme output unit at each frame. In this example it is assumed that the directory mfcc contains the parameter files created in the previous section, phnlab contains the links to the TIMIT phonetic transcription files, trainutts and testutts are the utterance lists from Figure 1, and phntarg is the directory to store the target files in. The names of the 61 phonemes should be supplied in the file phoneme_set, one phoneme per line.

Lab2Targ -S -p phnlab -qphn -d phntarg -x targ -P mfcc -Qmfcc -Ibinary -Ltimit -Fcodebook phoneme_set 13 testutts
Lab2Targ -S -p phnlab -qphn -d phntarg -x targ -P mfcc -Qmfcc -Ibinary -Ltimit -Fcodebook phoneme_set 13 trainutts

There will be a lot of warnings of the type "Warning: Label endtime out of bound", but don't worry about them.

Specifying the network structure

The script make_simpletimit (Figure 2) builds a network with the cepstrum features as input and the 61 phonemes as output units. It showcases the most fundamental commands of the toolkit. Detailed information about the commands can be found in the toolkit documentation.
 
#! /bin/csh -f

# create a new (empty) network file
CreateNet simpletimit.rtdnn simpletimit.rtdnn 

# input stream and input group: 13-dimensional feature vectors
# (12 cepstrum coefficients + energy) read from the mfcc directory
AddStream -x mfcc -d mfcc -Fbinary 13 r CEP simpletimit.rtdnn 
AddGroup cep simpletimit.rtdnn 
AddUnit -i -u 13 cep simpletimit.rtdnn 
LinkGroup CEP cep simpletimit.rtdnn 

# target stream and output group: the 61 phoneme targets from the
# phntarg directory; SetType selects the 'cross' objective function
AddStream -x targ -d phntarg -Fcodebook -S phoneme_set 61 t PHONEME simpletimit.rtdnn 
AddGroup phoneme simpletimit.rtdnn 
AddUnit -o -S phoneme_set phoneme simpletimit.rtdnn 
LinkGroup PHONEME phoneme simpletimit.rtdnn 
SetType -O cross phoneme simpletimit.rtdnn 

# one hidden layer of 100 units, with a +/-3 frame window on the input
# group and a +/-1 frame window to the output group
AddGroup features simpletimit.rtdnn 
AddUnit -u 100 features simpletimit.rtdnn 
Connect -D -3 3 cep features simpletimit.rtdnn 
Connect -D -1 1 features phoneme simpletimit.rtdnn 

Figure 2. Script for building a simple phone recognition network.
 

Training

The command for normalizing the input units is:

NormStream -S -s CEP -d1.0 simpletimit.rtdnn trainutts

Here the -d option is used to normalize based on the distribution (mean and standard deviation) of the training data.

Next we use the most important training tool of the toolkit, BackProp.  Here's how you start the actual network training:

BackProp -S -m0.7 -g1e-6 -i30 -F 20 30 -p logfile.log simpletimit.rtdnn trainutts

There will be warnings about "large gradients", but you can ignore them.
This training can take a very long time, so I recommend that you first try a small number of training utterances to make sure everything works. For example, run:

head -100 trainutts > sometrainutts
BackProp -S -m0.7 -g1e-6 -i30 -F 20 30 -p logfile.log simpletimit.rtdnn sometrainutts

I also recommend that you run the training in the background and check logfile.log regularly to see whether there is any progress, for example as sketched below. You may need to adjust the gain (-g) and/or the momentum (-m) parameter to get fast convergence. You may also want to experiment with the different types of validation and gain decay schemes available for BackProp.
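
A minimal sketch of how this could look from a csh shell (the console output file backprop.out is an arbitrary name):

BackProp -S -m0.7 -g1e-6 -i30 -F 20 30 -p logfile.log simpletimit.rtdnn trainutts >& backprop.out &
tail -f logfile.log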

Frame-level evaluation

Although the toolkit does not have tools for evaluation at the segment level, it is possible to evaluate the frame classification performance, i.e., how often the correct phoneme has the highest output activation. The tool to use is CResult. The following example evaluates the classification of the 61 phonemes. CResult has many options, e.g., confusion matrices and top-N evaluation; details can be found in the toolkit documentation.

CResult -S simpletimit.rtdnn testutts

What next?

When you have tried the simpletimit example above, start experimenting with more hidden units, more than one hidden layer, different time-delay windows (the -D option of Connect), and recurrent connections (connecting a layer to itself with a time delay); a sketch of such variations follows below.
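
As a sketch of what such variations could look like, the hidden part of the script in Figure 2 could be replaced with two hidden layers, a wider input window, and a recurrent self-connection. The group names and unit counts below are arbitrary, and only commands and options already used in Figure 2 are assumed; in particular it is assumed that the two -D arguments give the earliest and latest frame offsets of the connection window, so that a purely backward window from a group to itself (here -1 -1) gives a recurrent connection.

AddGroup hidden1 simpletimit.rtdnn
AddUnit -u 200 hidden1 simpletimit.rtdnn
Connect -D -4 4 cep hidden1 simpletimit.rtdnn

AddGroup hidden2 simpletimit.rtdnn
AddUnit -u 100 hidden2 simpletimit.rtdnn
Connect -D -1 1 hidden1 hidden2 simpletimit.rtdnn
Connect -D -1 -1 hidden2 hidden2 simpletimit.rtdnn

Connect -D -1 1 hidden2 phoneme simpletimit.rtdnn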

Then try playing with sparse connectivity (the -r option of Connect), and metrically based sparse connectivity (the Metricnct command).

You can use a more sophisticated version of the backpropagation algorithm by using a validation set and dynamically adjusting the gain parameter (the -V, -v, -C, and -D options of the BackProp command).

You can try different types of output units (use SetType), and you can even create a 'softmax' output layer (see the Units section).

After a network is trained, try the Prune command and then re-train it to get a much more compact network with almost the same accuracy.

Interested in feature representations? Use the Barkfib command to experiment with different feature vectors. The conventional wisdom is that around 12 cepstrum coefficients plus some energy measure work well for Gaussian mixture pdf's, but is that also true for ANN-based pdf's? An example is sketched below.
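
For example, a feature set with more cepstrum coefficients could be generated like this (a sketch only; it is assumed that -n sets the number of filter-bank channels and -c the number of cepstrum coefficients, and the directory and extension mfcc16 are arbitrary names):

Barkfib -T1 -S -m -Fnist -n24 -c16 -e -d mfcc16 -xmfcc16 -p waveform -qwav trainutts
Barkfib -T1 -S -m -Fnist -n24 -c16 -e -d mfcc16 -xmfcc16 -p waveform -qwav testutts

The dimensionality arguments in the AddStream, AddUnit and Lab2Targ calls (13 above) would then have to be changed accordingly, presumably to 17 (16 cepstrum coefficients plus the energy measure).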