
Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children

Published online by Cambridge University Press:  12 April 2016

ROMAIN SERIZEL
Affiliation:
LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 46 rue Barrault, Paris, 75013, France e-mail: romain.serizel@telecom-paristech.fr; HLT research unit, Fondazione Bruno Kessler (FBK), Via Sommarive 18, Trento, 38121, Italy
DIEGO GIULIANI
Affiliation:
HLT research unit, Fondazione Bruno Kessler (FBK) Via Sommarive 18, Trento, 38121, Italy e-mail: giuliani@fbk.eu

Abstract

This paper introduces deep neural network (DNN)–hidden Markov model (HMM)-based methods to tackle speech recognition in heterogeneous groups of speakers including children. We target three speaker groups consisting of children, adult males and adult females. Two different kinds of approaches are introduced here: approaches based on DNN adaptation and approaches relying on vocal-tract length normalisation (VTLN). First, the recent approach that consists in adapting a general DNN to domain/language-specific data is extended to target age/gender groups in the context of DNN–HMM. Then, VTLN is investigated by training a DNN–HMM system on either mel frequency cepstral coefficients normalised with standard VTLN, or acoustic features derived from mel frequency cepstral coefficients combined with the posterior probabilities of the VTLN warping factors. In this latter, novel approach, the posterior probabilities of the warping factors are obtained with a separate DNN, and decoding can be performed in a single pass, whereas the standard VTLN approach requires two decoding passes. Finally, the different approaches presented here are combined to take advantage of their complementarity. The combination of several approaches is shown to improve the baseline phone error rate by thirty to thirty-five per cent relative, and the baseline word error rate by about ten per cent relative.
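The novel single-pass approach described above concatenates the acoustic features with warping-factor posteriors produced by a separate DNN. A minimal sketch of that feature augmentation, assuming a hypothetical discrete grid of warping factors and a toy stand-in for the warp-estimating DNN (the names `dnn_warp`, `WARP_FACTORS` and all dimensions are illustrative, not taken from the paper):

```python
import numpy as np

# Assumed discrete grid of VTLN warping factors (illustrative values).
WARP_FACTORS = np.linspace(0.88, 1.12, 25)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dnn_warp(frames):
    """Toy stand-in for the separate DNN that estimates, per frame,
    posterior probabilities over the discrete warping factors."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((frames.shape[1], len(WARP_FACTORS)))
    return softmax(frames @ w)

def augment_features(frames):
    """Concatenate MFCC-derived features with warping-factor posteriors;
    the augmented vectors feed the DNN-HMM, so no second VTLN decoding
    pass is needed."""
    posteriors = dnn_warp(frames)
    return np.concatenate([frames, posteriors], axis=1)

# 100 frames of 39-dimensional MFCC-derived features (random placeholder).
frames = np.random.default_rng(1).standard_normal((100, 39))
aug = augment_features(frames)
assert aug.shape == (100, 39 + 25)
```

In this sketch the posteriors act as a soft, per-frame speaker-normalisation cue appended to the input, rather than warping the features themselves.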

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Fig. 1. Training of the DNN-warp.


Fig. 2. Training of the warping factor aware DNN–HMM.


Fig. 3. Joint optimisation of the DNN-warp and the DNN–HMM.


Table 1. Distribution of data across the speech corpora. (f) and (m) denote speech from female and male speakers, respectively


Table 2. Distribution of speakers in the ChildIt corpus per grade. Children in grade 2 are approximately 7 years old, while children in grade 8 are approximately 13 years old


Table 3. Phone error rate achieved with the DNN–HMM trained on age/gender group-specific data


Table 4. Phone error rate achieved with the DNN–HMM trained on a mixture of adult and children's speech and adapted to specific age/gender groups


Table 5. Phone error rate achieved with VTLN approaches to DNN–HMM


Table 6. Phone error rate achieved with combination of approaches


Table 7. Word error rate achieved with the DNN–HMM trained on a mixture of adult and children's speech and adapted to specific age/gender groups


Table 8. Word error rate achieved with several VTLN approaches to DNN–HMM