
Joint optimization on decoding graphs using minimum classification error criterion

Published online by Cambridge University Press:  29 April 2014

Abdelaziz A. Abdelhamid*
Affiliation:
Computer Science, Ain Shams University, Egypt
Waleed H. Abdulla
Affiliation:
Electrical and Computer Engineering, Auckland University, New Zealand
* Corresponding author: Abdelaziz A. Abdelhamid. Email: abdelaziz.ieee@live.com

Abstract

Motivated by the inherent correlation between speech features and their lexical words, we propose a new framework for jointly learning the parameters of the corresponding acoustic and language models. The proposed framework is based on discriminative training of the models' parameters using the minimum classification error criterion. To verify the effectiveness of the proposed framework, a set of four large decoding graphs is constructed using weighted finite-state transducers as a composition of two sets of context-dependent acoustic models and two sets of n-gram-based language models. Experiments conducted on this set of decoding graphs validated the effectiveness of the proposed framework when compared with four baseline systems based on maximum likelihood estimation and separate discriminative training of acoustic and language models in benchmark testing on two speech corpora, namely TIMIT and RM1.
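
For readers unfamiliar with the minimum classification error criterion, the following is a minimal sketch of the commonly used smoothed loss, not the exact formulation of this paper. It assumes each hypothesis has a single combined score (for example, acoustic plus language model log-scores along a path in the decoding graph), and the names `mce_loss`, `mce_loss_gradient`, and the slope parameter `gamma` are hypothetical and chosen for illustration only.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mce_loss(ref_score, competitor_scores, gamma=1.0):
    """Smoothed 0/1 loss for one utterance (illustrative assumption).

    The misclassification measure d is the best competing score minus
    the reference score, so d < 0 means the reference hypothesis wins.
    """
    d = max(competitor_scores) - ref_score
    return _sigmoid(gamma * d)


def mce_loss_gradient(ref_score, competitor_scores, gamma=1.0):
    """Derivative of the loss with respect to d: gamma * l * (1 - l).

    It peaks when the reference and the best competitor are tied and
    vanishes when the utterance is clearly right or clearly wrong.
    """
    loss = mce_loss(ref_score, competitor_scores, gamma)
    return gamma * loss * (1.0 - loss)
```

Under these assumptions, the gradient magnitude is largest for utterances near the decision boundary, which is why discriminative updates of this kind concentrate on confusable hypotheses rather than on utterances that are already well separated.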

Information

Type
Original Paper
Creative Commons
The online version of this article is published within an Open Access environment subject to the conditions of the Creative Commons Attribution licence http://creativecommons.org/licenses/by/3.0/
Copyright
Copyright © The Authors, 2014

Fig. 1. Flowchart of the proposed joint discriminative training framework for learning the parameters of acoustic and language models on an integrated decoding graph. k denotes the graph number (i.e., 1, 2, 3, or 4). N and U refer to the number of training iterations and the size of the development set, respectively.

Table 1. Number of utterances of the development and evaluation sets of the TIMIT and RM1 speech corpora.

Table 2. Baseline acoustic models.

Table 3. Baseline language models.

Table 4. Baseline large decoding graphs.

Table 5. Evolution of graph size when constructing the large WFST-based decoding graphs.

Fig. 2. Discriminative training approaches of acoustic and language models' parameters.

Fig. 3. Parameter update value (y-axis) for the score differences (x-axis) when ε = 10.

Fig. 4. Score difference between reference and competing hypotheses.

Table 6. Parameter setting for discriminative training and testing.

Fig. 5. Evolution of language model perplexity. The baseline perplexities are 375.05 and 411.57 for the tri-gram (LM1) and bi-gram (LM2) language models, respectively.

Fig. 6. Error reduction rate (%) using Graph1 on the TIMIT evaluation set with respect to the baseline models.

Fig. 7. Error reduction rate (%) using Graph1 on the RM1 evaluation set with respect to the baseline models.

Fig. 8. Average training time and accuracy of four discriminative training approaches.

Fig. 9. Histogram of model separation calculated by MLE, MCE LM, MCE AM, (MCE AM, MCE LM), and (MCE Joint AM,LM) models on the TIMIT evaluation set using Graph1.

Fig. 10. Histogram of model separation calculated by MLE, MCE LM, MCE AM, (MCE AM, MCE LM), and (MCE Joint AM,LM) models on the RM1 evaluation set using Graph1.