
Speech emotion recognition based on listener-dependent emotion perception models

Published online by Cambridge University Press: 20 April 2021

Atsushi Ando*
Affiliation:
NTT Corporation, Yokosuka, Kanagawa 239-0847, Japan
Nagoya University, Nagoya, Aichi 464-8601, Japan
Takeshi Mori
Affiliation:
NTT Corporation, Yokosuka, Kanagawa 239-0847, Japan
Satoshi Kobashikawa
Affiliation:
NTT Corporation, Yokosuka, Kanagawa 239-0847, Japan
Tomoki Toda
Affiliation:
Nagoya University, Nagoya, Aichi 464-8601, Japan
Corresponding author: A. Ando. Email: atsushi.ando.hd@hco.ntt.co.jp

Abstract

This paper presents a novel speech emotion recognition scheme that leverages the individuality of emotion perception. Most conventional methods simply poll multiple listeners and directly model the majority decision as the perceived emotion. However, emotion perception varies from listener to listener, which forces conventional single-model methods to learn a complex mixture of emotion perception criteria. To mitigate this problem, we propose a majority-voted emotion recognition framework that constructs listener-dependent (LD) emotion recognition models. An LD model can estimate not only the emotion perceived by each listener but also the majority decision, obtained by averaging the outputs of multiple LD models. Three LD models are introduced: fine-tuning, auxiliary input, and sub-layer weighting, all inspired by successful domain-adaptation frameworks in various speech processing tasks. Experiments on two emotional speech datasets demonstrate that the proposed approach outperforms conventional emotion recognition frameworks in not only majority-voted but also listener-wise perceived emotion recognition.
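
To make the averaging step concrete, the sketch below (not the authors' implementation) shows how a majority-voted emotion could be estimated from LD models: each LD model yields per-class posteriors approximating one listener's perception, and the majority decision is approximated by averaging those posteriors and taking the argmax. The label set, the model callables, and the `predict_posteriors` helper are all hypothetical.

```python
# Minimal sketch, assuming each listener-dependent (LD) model is a callable
# that maps utterance features to per-emotion logits. All names and shapes
# here are illustrative, not taken from the paper.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set

def predict_posteriors(ld_model, features):
    """Per-listener inference: softmax over one LD model's logits."""
    logits = np.asarray(ld_model(features), dtype=float)  # shape: (n_emotions,)
    exp = np.exp(logits - logits.max())                   # numerically stable softmax
    return exp / exp.sum()

def majority_voted_emotion(ld_models, features):
    """Approximate the listeners' majority vote by averaging the
    LD models' posterior distributions and taking the argmax."""
    posteriors = np.stack([predict_posteriors(m, features) for m in ld_models])
    avg = posteriors.mean(axis=0)                         # shape: (n_emotions,)
    return EMOTIONS[int(avg.argmax())], avg

# Toy usage: three "listeners" modeled as fixed random logit functions.
rng = np.random.default_rng(0)
toy_models = [lambda f, w=rng.normal(size=len(EMOTIONS)): w for _ in range(3)]
label, avg = majority_voted_emotion(toy_models, features=None)
print(label, np.round(avg, 3))
```

Averaging soft posteriors rather than hard per-listener votes lets the majority estimate reflect each LD model's confidence, which is one plausible reading of "averaging the outputs" in the abstract.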

Information

Type
Original Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press in association with the Asia Pacific Signal and Information Processing Association.
Figures and Tables

Fig. 1. An example of the conventional emotion recognition model based on direct modeling of majority-voted emotion.

Fig. 2. Overview of the proposed majority-voted emotion recognition based on listener-dependent (LD) models. (a) Fine-tuning. (b) Auxiliary input. (c) Sub-layer weighting.

Fig. 3. Structure of the Sub-layer Weighting-based Adaptation Layer (SWAL).

Fig. 4. Adaptation for the auxiliary input-based LD model.

Table 1. Number of utterances in IEMOCAP

Table 2. Number of utterances in MSP-Podcast

Fig. 5. Cohen's kappa coefficients of listener annotations. (a) IEMOCAP. (b) MSP-Podcast.

Table 3. Network architectures of the emotion recognition model

Table 4. Number of model parameters

Table 5. Estimation accuracies of the majority-voted emotions. Bold indicates the highest accuracy.

Fig. 6. Confusion matrices for IEMOCAP. (a) Majority-voted model. (b) LD model.

Fig. 7. Confusion matrices for MSP-Podcast. (a) Majority-voted model. (b) LD model.

Fig. 8. WAs of listener-wise emotion recognition with LD models. (a) IEMOCAP. (b) MSP-Podcast.

Table 6. Macro-average of estimation accuracies of the listener-dependent perceived emotions. Bold indicates the highest accuracy.

Table 7. Macro-average of WAs and UAs in the listener-open dataset

Fig. 9. WA for each listener in the listener-open dataset.