
Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning

Published online by Cambridge University Press:  27 May 2020

Bagus Tris Atmaja*
Affiliation:
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan; Sepuluh Nopember Institute of Technology, Kampus ITS Sukolilo, Surabaya 60111, Indonesia
Masato Akagi
Affiliation:
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
*
Corresponding author: Bagus Tris Atmaja Email: bagus@jaist.ac.jp

Abstract

The majority of research in speech emotion recognition (SER) is conducted to recognize emotion categories. Recognizing dimensional emotion attributes is also important, however, and it has several advantages over categorical emotion. For this research, we investigate dimensional SER using both speech features and word embeddings. A concatenation network joins the acoustic and text networks built from these bimodal features. We demonstrate that these bimodal features, both extracted from speech, improve the performance of dimensional SER over unimodal SER using either acoustic features or word embeddings. The addition of word embeddings to the SER system contributes a significant improvement on the valence dimension, while the arousal and dominance dimensions are also improved. We propose a multitask learning (MTL) approach for the prediction of all emotional attributes. This MTL maximizes the concordance correlation between predicted emotion degrees and true emotion labels simultaneously. The findings suggest that MTL with two parameters represents the interrelation of emotional attributes better than the other evaluated methods. In unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient (CCC) score of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.
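The abstract describes an MTL objective that maximizes the concordance correlation between predicted and true degrees of valence, arousal, and dominance. The following is a minimal numpy sketch of that idea, not the authors' implementation: `ccc` computes Lin's concordance correlation coefficient, and `mtl_loss` forms a weighted sum of the three `1 - CCC` terms. The two-parameter weighting with the dominance weight set to `1 - alpha - beta`, and the default values `alpha=0.7, beta=0.2`, are assumptions based on the figure captions below.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between predictions x and labels y:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

def mtl_loss(pred, true, alpha=0.7, beta=0.2):
    """Weighted multitask loss over the three emotional attributes.

    pred, true: arrays of shape (n_samples, 3) with columns
    [valence, arousal, dominance]. The (1 - alpha - beta) weight on the
    dominance term is a hypothetical choice for this sketch.
    """
    loss_v = 1.0 - ccc(pred[:, 0], true[:, 0])
    loss_a = 1.0 - ccc(pred[:, 1], true[:, 1])
    loss_d = 1.0 - ccc(pred[:, 2], true[:, 2])
    return alpha * loss_v + beta * loss_a + (1.0 - alpha - beta) * loss_d
```

A perfect prediction gives CCC = 1 on every attribute, so the loss is 0; minimizing this loss therefore maximizes the concordance correlations of all three attributes simultaneously, as the abstract describes.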

Information

Type
Original Paper
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2020. Published by Cambridge University Press in association with Asia Pacific Signal and Information Processing Association

Fig. 1. Block diagram of the proposed dimensional emotion recognition system from speech features and word embeddings; multitask learning trains the network weights on acoustic and text input features to predict degrees of valence, arousal, and dominance simultaneously.


Table 1. Acoustic feature sets: GeMAPS [28] and pyAudioAnalysis [27]. The numbers in parentheses indicate the total numbers of features (LLDs).


Table 2. Word embedding feature sets.


Fig. 2. Overview of the LSTM and CNN models used in the acoustic network. A number in parentheses represents the number of units, with the second number for the convolutional layers representing the kernel size. The numbers after # represent the numbers of trainable parameters. For the text network, the flatten layer was replaced by a dense layer.


Fig. 3. Architecture of a combined system using LSTMs for both the acoustic and text networks. HSF: high-level statistical functions; WE: word embedding; V: valence; A: arousal; D: dominance.


Table 3. CCC score results on the acoustic networks.


Table 4. CCC score results on the text networks.


Table 5. Results of bimodal feature fusion (without parameters) by concatenating the acoustic and text networks; each modality used either an LSTM, CNN, or dense network; batch size $= 8$.


Fig. 4. Surface plot of different $\alpha$ and $\beta$ factors for MTL with two parameters. The best mean CCC score of 0.51 was obtained using $\alpha = 0.7$ and $\beta = 0.2$. Both factors were searched jointly (dependently).


Fig. 5. CCC scores for MTL with three parameters, obtained while searching for the optimal weighting factors. A linear search was performed independently on each parameter. The best weighting factors were $\alpha = 0.9$, $\beta = 0.9$, and $\gamma = 0.2$.


Table 6. Results of MTL with and without parameters for bimodal feature fusion $(\text{LSTM} + \text{LSTM})$; batch size $= 256$.


Fig. 6. Analysis of dropout rates applied to the acoustic and text networks before concatenation. The dropout rate was varied independently on one network while keeping a fixed rate on the other.