Adaptive feature truncation to address acoustic mismatch in automatic recognition of children's speech

Published online by Cambridge University Press:  09 August 2016

Shweta Ghai*
Affiliation:
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India. Phone: +1 404 785 9340
Rohit Sinha
Affiliation:
Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India. Phone: +1 404 785 9340
*Corresponding author: S. Ghai. Email: shweta.ghai@emory.edu

Abstract

An algorithm for adaptive Mel frequency cepstral coefficient (MFCC) feature truncation is proposed to improve automatic speech recognition (ASR) performance under acoustically mismatched conditions. Using the relationship found between MFCC base feature truncation and the degree of acoustic mismatch of speech signals with respect to the recognition models, the proposed algorithm performs utterance-specific MFCC feature truncation for test signals to address their acoustic mismatch in the context of ASR. Without any prior knowledge about the speaker of the test utterance, the proposed technique gives 38% (on a connected-digit recognition task) and 36% (on a continuous speech recognition task) relative improvement over baseline in ASR performance for children's speech on models trained on adult speech. This improvement is also found to be additive to the improvements obtained with vocal tract length normalization and/or constrained maximum likelihood linear regression. The generality and effectiveness of the algorithm are also validated for automatic recognition of children's and adults' speech under matched and mismatched conditions.
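
As an illustration of the truncation operation itself, the sketch below keeps only the lowest-order coefficients of each frame's base MFCC vector. It is a minimal sketch, not the authors' implementation: the function name `truncate_base_mfcc`, the 13-dimensional toy input, and the assumption that each frame is ordered from the lowest-order to the highest-order coefficient are all hypothetical. The rule the paper uses to choose the truncation length for each utterance is not reproduced here, so the length is passed in as a plain parameter.

```python
from typing import List, Sequence


def truncate_base_mfcc(frames: Sequence[Sequence[float]], keep: int) -> List[List[float]]:
    """Keep only the first `keep` (lowest-order) coefficients of every frame.

    Assumes each frame is ordered from lowest- to highest-order coefficient;
    this ordering is an assumption of the sketch, not taken from the paper.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    truncated = []
    for frame in frames:
        if keep > len(frame):
            raise ValueError("keep exceeds the base feature dimension")
        truncated.append(list(frame[:keep]))
    return truncated


# Toy utterance: 3 frames of 13 made-up coefficient values (not real speech data).
utterance = [[float(10 * t + k) for k in range(13)] for t in range(3)]

# Truncate every frame of the utterance to a 4-dimensional base feature.
short = truncate_base_mfcc(utterance, keep=4)
print(len(short), len(short[0]))
```

In a full front end, any dynamic (delta) features would presumably be computed from the truncated base features rather than from the full-length ones; that step is omitted from this sketch.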

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
Copyright © The Authors, 2016

Table 1. Details of the speech corpora used for automatic speech recognition experiments.

Table 2. Division of different age groups of the child test set “CHts1” used in the connected-digit recognition task.

Table 3. Division of different age groups of the child test set “PFts” used in the continuous speech recognition task.

Table 4. Baseline ASR performances (in WER) for adult and child test sets on models trained on adult and child speech for both connected-digit and continuous speech recognition tasks.

Table 5. Frequency of substitution, deletion, and insertion errors obtained by evaluating the outputs of different ASR systems for “CHts1” on models trained on “ADtr”.

Fig. 1. Scatter plots showing distributions of C11 and C12 coefficients of digit “FIVE” utterances from (a) original “CHts1” (in blue) along with Gaussian distributions (in gray scale) of those coefficients in digit “NINE” models trained with “ADtr”, and (b) explicitly pitch normalized “CHts1” (in magenta) along with Gaussian distributions (in gray scale) of those coefficients in digit “FIVE” models trained with “ADtr”. The spread of the C11-C12 distribution for digit “FIVE” utterances from the “CHts1” test set is significantly reduced and is better mapped for digit “FIVE” models after explicit pitch normalization of the test set.

Fig. 2. Scatter plots showing distributions of C1 and C2 coefficients of digit “FIVE” utterances from (a) original “CHts1” (in blue) along with Gaussian distributions (in gray scale) of those coefficients in digit “NINE” models trained with “ADtr”, and (b) explicitly pitch normalized “CHts1” (in magenta) along with Gaussian distributions (in gray scale) of those coefficients in digit “FIVE” models trained with “ADtr”. No significant change is observed in the distribution of lower-order C1 and C2 coefficients for digit “FIVE” utterances from “CHts1” test set after pitch normalization.

Fig. 3. Scatter plots showing distributions of C11 and C12 coefficients of digit “OH” utterances from (a) original “CHts1” (in blue) along with Gaussian distributions (in gray scale) of those coefficients in digit “TWO” models trained with “ADtr”, and (b) explicitly pitch normalized “CHts1” (in magenta) along with Gaussian distributions (in gray scale) of those coefficients in digit “OH” models trained with “ADtr”. The spread of the C11-C12 distribution for digit “OH” utterances from the “CHts1” test set is significantly reduced and is better mapped for digit “OH” models after explicit pitch normalization of the test set.

Fig. 4. Scatter plots showing distributions of C1 and C2 coefficients of digit “OH” utterances from (a) original “CHts1” (in blue) along with Gaussian distributions (in gray scale) of those coefficients in digit “TWO” models trained with “ADtr”, and (b) explicitly pitch normalized “CHts1” (in magenta) along with Gaussian distributions (in gray scale) of those coefficients in digit “OH” models trained with “ADtr”. No significant change is observed in the distribution of lower-order C1 and C2 coefficients for digit “OH” utterances from “CHts1” test set after pitch normalization.

Fig. 5. Plots of (a) smoothed spectra corresponding to base MFCC features of different dimensions, along with (b) the corresponding linear DFT spectrum, for a frame of vowel /iy/ having an average pitch value of 300 Hz.

Table 6. Performance of “CHts1” on models trained with “ADtr” for various truncations of the base MFCC features along with its pitch group-wise breakup.

Table 7. Age group-wise breakup of the best recognition performance obtained for “CHts1” on models trained with “ADtr” using 4-D base MFCC features.

Table 8. Performance of “PFts” on models trained with “CAMtr” for various truncations of the base MFCC features along with its pitch group-wise breakup.

Table 9. Age group-wise breakup of the best recognition performance obtained for “PFts” on models trained with “CAMtr” using 6-D base MFCC features.

Table 10. Performance of “PFts” on models trained on “CAMtr” for various truncations of the base MFCC features along with their VTLN warp factor-wise breakup.

Table 11. Performance of “CAMts” on models trained on “CAMtr” for various MFCC base feature truncations along with their VTLN warp factor-wise breakup.

Fig. 6. Graph showing the relation proposed between appropriate length of MFCC base feature and VTLN warp factor for both adult and child test signals.

Fig. 7. Flow diagram of the proposed algorithm to determine the appropriate MFCC base feature length for a test signal on models trained on adult speech (in solid lines) or models trained on child speech (in dashed lines).

Table 12. Performances of test sets on recognition models trained with different training data sets using MFCC features derived using the proposed algorithm, referred to as “Proposed”, for both connected-digit and continuous speech recognition tasks.

Table 13. Performances of different test sets using the default MFCC features, referred to as “Default”, and MFCC features derived using the proposed algorithm, referred to as “Proposed”, both with and without VTLN and/or CMLLR on recognition models trained with different training data sets for the continuous speech recognition task. Relative gain in ASR performance obtained with CMLLR over the respective baseline is also given.