A SPEECH EXAMPLE
Introduction
This section is an application of phoneme probability
estimation. A network for phoneme recognition at the acoustic frame level
is built and trained on the TIMIT database. The TIMIT CD-ROM is one of
the best-known phonetically labelled speech databases and can be
ordered from the Linguistic Data
Consortium (LDC).
Structuring the database
The training, feature extraction, and
target generation tools put two constraints on how you can organize your
database: 1) all files corresponding to the same utterance must
have the same basename, and 2) all files of the same type, e.g.,
phonetic label files or feature vector files, must have the same file
extension and live in the same directory. If your database is not organized
this way, you can create new directories and make symbolic links to
the physical locations of the files. The script make_timit_links (Figure 1)
does just that for the TIMIT CD-ROM. Change the variable TIMITDIR
to the actual path of the CD-ROM on your system and the variable LINKDIR
to the directory where you want to put the database.
#! /bin/csh -f
# set this variable to the directory of the TIMIT database
set TIMITDIR = /mnt/cdrom/timit
# set these variables to where you want to store links to
# wave-form files and phoneme label files
# ($cwd is the directory the script is started from; use an absolute
# path, because the script changes directory while it runs)
set LINKDIR = $cwd
set WAVEDIR = $LINKDIR/waveform
set PHONEDIR = $LINKDIR/phnlab
mkdir -p $WAVEDIR
mkdir -p $PHONEDIR
rm -f $LINKDIR/trainutts
rm -f $LINKDIR/testutts

# training part of the database
set CATEGORY = ${TIMITDIR}/train
foreach DIALECT ( ${CATEGORY}/dr? )
  pushd ${DIALECT}
  foreach SPEAKER ( * )
    pushd ${SPEAKER}
    foreach FILE ( *.wav )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $WAVEDIR/${SPEAKER}_${FILE}
      echo ${SPEAKER}_$FILE | sed 's/\.wav//' >> $LINKDIR/trainutts
    end
    foreach FILE ( *.phn )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $PHONEDIR/${SPEAKER}_${FILE}
    end
    popd
  end
  popd
end

# test part of the database
set CATEGORY = ${TIMITDIR}/test
foreach DIALECT ( ${CATEGORY}/dr? )
  pushd ${DIALECT}
  foreach SPEAKER ( * )
    pushd ${SPEAKER}
    foreach FILE ( *.wav )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $WAVEDIR/${SPEAKER}_${FILE}
      echo ${SPEAKER}_$FILE | sed 's/\.wav//' >> $LINKDIR/testutts
    end
    foreach FILE ( *.phn )
      ln -s ${DIALECT}/${SPEAKER}/${FILE} $PHONEDIR/${SPEAKER}_${FILE}
    end
    popd
  end
  popd
end
Figure 1. Script for creating symbolic links to the wave-form and
phonetic label files of the TIMIT database.
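To build the link directories, run the script once from the directory that should
contain them. This is just a usage sketch; it assumes csh is installed and that
make_timit_links is in the current directory.
chmod +x make_timit_links
./make_timit_links
ls waveform phnlab          # one symbolic link per .wav and .phn file
wc -l trainutts testutts    # lists of training and test utterance base-names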
Feature extraction
The tool Barkfib implements a standard feature
extraction based on a Bark- or Mel-scaled filter-bank computed from the
FFT spectrum. The -c option selects cepstrum features instead of filter-bank features.
In the following example it is assumed that the directory waveform contains
links to the waveform files of the TIMIT database, trainutts and testutts are
script-files with the base-names of the utterances to process (created by
make_timit_links above), and mfcc is the directory in which to store the feature files.
Barkfib -T1 -S -m -Fnist -n24 -c12 -e -d mfcc -xmfcc -p waveform -qwav testutts
Barkfib -T1 -S -m -Fnist -n24 -c12 -e -d mfcc -xmfcc -p waveform -qwav trainutts
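As a quick sanity check (a sketch, assuming the directory layout above), the number
of feature files in mfcc should equal the total number of base-names in trainutts
and testutts:
wc -l trainutts testutts    # number of utterance base-names
ls mfcc | wc -l             # should equal the sum of the two counts above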
Generation of phoneme targets
The phoneme target for each 10 ms frame of
the feature files can be extracted from the phonetic transcription files
of the database. The tool Lab2Targ does this by checking the length of
the parameter files and using the time-marks of the transcription files
to compute the targets for each phoneme output unit at each frame. In this
example it is assumed that the directory mfcc contains the parameter
files created in the previous section, phnlab contains links to the phonetic
transcription files of the TIMIT database, trainutts and testutts are the
script-files with the base-names of the utterances to process, and phntarg is
the directory in which to store the target files. The names of the 61 phonemes
should be supplied in the file phoneme_set, one phoneme per line; one way to
generate this file is sketched below.
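One way to create phoneme_set is to collect the unique labels from the TIMIT
transcription files themselves. This is only a sketch; it relies on the fact that
each line of a TIMIT .phn file has the form "start-sample end-sample label".
awk '{print $3}' phnlab/*.phn | sort -u > phoneme_set
wc -l phoneme_set           # should report 61 for the full TIMIT label set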
Lab2Targ -S -p phnlab -qphn -d phntarg -x targ -P mfcc -Qmfcc -Ibinary -Ltimit -Fcodebook phoneme_set 13 testutts
Lab2Targ -S -p phnlab -qphn -d phntarg -x targ -P mfcc -Qmfcc -Ibinary -Ltimit -Fcodebook phoneme_set 13 trainutts
Lab2Targ will print many warnings of the
type "Warning: Label endtime out of bound"; they can safely be ignored here.
Specifying the network structure
The script make_simpletimit
(Figure 2) builds a network with the cepstrum features as input units and the
61 phonemes as output units; the 13 input units correspond to the 12 cepstrum
coefficients plus the energy measure computed by Barkfib. The script showcases
the most fundamental commands of the toolkit. Detailed information about the
commands can be found in the toolkit documentation.
#! /bin/csh -f
CreateNet simpletimit.rtdnn simpletimit.rtdnn
# input stream: 13-dimensional feature vectors (12 cepstra + energy) from mfcc
AddStream -x mfcc -d mfcc -Fbinary 13 r CEP simpletimit.rtdnn
AddGroup cep simpletimit.rtdnn
AddUnit -i -u 13 cep simpletimit.rtdnn
LinkGroup CEP cep simpletimit.rtdnn
# target stream: 61 phoneme targets from phntarg
AddStream -x targ -d phntarg -Fcodebook -S phoneme_set 61 t PHONEME simpletimit.rtdnn
AddGroup phoneme simpletimit.rtdnn
AddUnit -o -S phoneme_set phoneme simpletimit.rtdnn
LinkGroup PHONEME phoneme simpletimit.rtdnn
SetType -O cross phoneme simpletimit.rtdnn
# hidden layer with 100 units
AddGroup features simpletimit.rtdnn
AddUnit -u 100 features simpletimit.rtdnn
# input -> hidden with a time-delay window of -3..+3 frames
Connect -D -3 3 cep features simpletimit.rtdnn
# hidden -> output with a time-delay window of -1..+1 frames
Connect -D -1 1 features phoneme simpletimit.rtdnn
Figure 2. Script for building a simple phone recognition network.
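To build the network, run the script from the directory that contains mfcc,
phntarg and phoneme_set (a minimal usage sketch):
chmod +x make_simpletimit
./make_simpletimit
ls -l simpletimit.rtdnn     # the network definition file created by the script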
Training
The command for normalizing the input units
is:
NormStream -S -s CEP -d1.0 simpletimit.rtdnn trainutts
Here the -d option normalizes based
on the distribution (mean and standard deviation) of the data.
Next we use the most important training
tool of the toolkit, BackProp. Here's how you start the actual network
training:
BackProp -S -m0.7 -g1e-6 -i30 -F 20 30 -p logfile.log simpletimit.rtdnn trainutts
There will be warnings about "large gradients",
but you can ignore them.
This training can take a very long time.
I recommend that you first try a small number of training utterances to
make sure everything works. For example, run:
head -100 trainutts > sometrainutts
BackProp -S -m0.7 -g1e-6 -i30 -F 20 30 -p logfile.log simpletimit.rtdnn sometrainutts
I also recommend that you run the training
in the background and check logfile.log regularly to see whether there is
any progress; one way to do this is sketched below. You may need to modify
the gain (-g) and/or the momentum (-m) parameter to get fast convergence.
You may also want to experiment with the different types of validation and
gain decay schemes available for BackProp.
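A possible background run (csh syntax; the output file name backprop.out is only
an example):
# start training in the background, then follow the log file
nohup BackProp -S -m0.7 -g1e-6 -i30 -F 20 30 -p logfile.log simpletimit.rtdnn trainutts >& backprop.out &
tail -f logfile.log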
Frame-level evaluation
Although the toolkit does not have tools for
evaluation on the segment level, it is possible to evaluate the frame classification
performance, i.e., how frequently the correct phoneme has the highest output
activation. The tool to use is CResult. This example is an evaluation of
the classification of the 61 phonemes. The CResult tool has many options,
e.g., confusion matrix, top-N evaluation, etc. Details can be found in
the toolkit documentation.
CResult -S simpletimit.rtdnn testutts
What next?
When you have tried the simpletimit example
above, start experimenting with more hidden units, more than one hidden
layer, different time-delay windows (the -D option to Connect), and recurrency
(connecting layers to themselves with time-delay), as in the sketch below.
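A simple recurrent hidden layer could look like this; the delay window of -1 to -1
is an assumption, and means that each hidden unit also receives the hidden
activations from the previous frame:
# recurrent (self) connection on the hidden layer, one-frame delay (sketch)
Connect -D -1 -1 features features simpletimit.rtdnn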
Then try playing with sparse connectivity
(the -r option of Connect), and metrically based sparse connectivity (the
Metricnct command).
You can use a more sophisticated version
of the backpropagation algorithm by using a validation set and dynamically
adjusting the gain parameter (-V, -v, -C, and -D options of the BackProp command).
You can try different types of output units
(use SetType), and you can even create a 'softmax' output layer (see the
Units section).
After a network is trained, try the Prune
command and then re-train it to get a much more compact network with almost
the same accuracy.
Interested in feature representations?
Use the Barkfib command to experiment with different feature vectors. The
conventional wisdom is that around 12 cepstrum coefficients plus some energy
measure work well for Gaussian mixture pdf's, but is that also true for
ANN-based pdf's? The filter-bank sketch below is one starting point.
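For example, dropping the -c option should give the Bark/Mel filter-bank values
themselves instead of cepstra. This is only a sketch: the directory and extension
names fbank are made up here, and the number of input units in the
network-building script must be changed to match the new feature dimension.
# filter-bank features (no -c option), stored under the example name fbank
Barkfib -T1 -S -m -Fnist -n24 -e -d fbank -xfbank -p waveform -qwav trainutts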