3D skeletal movement-enhanced emotion recognition networks

Published online by Cambridge University Press: 05 August 2021

Jiaqi Shi*
Affiliations:
Graduate School of Engineering Science, Osaka University, Osaka, Japan; Guardian Robot Project, RIKEN, Kyoto, Japan
Chaoran Liu
Affiliation:
Advanced Telecommunications Research Institute International, Kyoto, Japan
Carlos Toshinori Ishi
Affiliations:
Guardian Robot Project, RIKEN, Kyoto, Japan; Advanced Telecommunications Research Institute International, Kyoto, Japan
Hiroshi Ishiguro
Affiliations:
Graduate School of Engineering Science, Osaka University, Osaka, Japan; Advanced Telecommunications Research Institute International, Kyoto, Japan
*Corresponding author: J. Shi. Email: shi.jiaqi@irl.sys.es.osaka-u.ac.jp

Abstract

Automatic emotion recognition has become an important research direction in natural human–computer interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication and has a considerable impact on how emotion is perceived, it is rarely considered in emotion recognition studies, largely because of the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to the IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network that takes the extracted skeletal data as input to predict the speaker's emotional state. We also propose a graph attention-based fusion method that combines this model with models trained on other modalities, so that the modalities provide complementary information for emotion classification and multimodal cues are fused effectively. The combined model utilizes audio signals, text information, and skeletal data. It significantly outperforms the bimodal model and other fusion strategies, demonstrating the effectiveness of the method.
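For readers who want a concrete picture of the skeletal-movement branch described above, the following is a minimal PyTorch sketch of an attention-based CNN over 3D joint sequences. It is illustrative only: the layer sizes, joint count, number of emotion classes, and the temporal-attention formulation are assumptions for exposition, not the paper's exact SMACN.

```python
# Minimal sketch of an attention-based CNN over 3D skeletal sequences.
# Illustrative only: layer sizes, joint count, and the attention form
# are assumptions, not the paper's exact SMACN.
import torch
import torch.nn as nn


class SkeletonAttentionCNN(nn.Module):
    def __init__(self, num_joints=17, num_classes=4):
        super().__init__()
        # Input: (batch, 3, time, joints) -- x/y/z coordinates per joint.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
        )
        # Temporal attention: score each frame, then pool with the weights.
        self.attn = nn.Linear(64 * num_joints, 1)
        self.fc = nn.Linear(64 * num_joints, num_classes)

    def forward(self, x):
        h = self.conv(x)                        # (batch, 64, time, joints)
        b, c, t, j = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * j)  # (batch, time, feat)
        w = torch.softmax(self.attn(h), dim=1)  # (batch, time, 1)
        pooled = (w * h).sum(dim=1)             # attention-weighted pooling
        return self.fc(pooled)                  # emotion logits


# Example: 10 clips, 64 frames, 17 joints with 3D coordinates.
logits = SkeletonAttentionCNN()(torch.randn(10, 3, 64, 17))
print(logits.shape)  # torch.Size([10, 4])
```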

Information

Type
Original Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press.
Figures and tables

Fig. 1. Architecture of the proposed SMACN.

Fig. 2. SMERN framework where audio, text, and gesture are used for emotion classification simultaneously.

Fig. 3. Illustration of the multimodal model with graph attention. GA represents the graph attention module (a code sketch of one possible GA module follows this list).

Table 1. Comparison of unimodal and multimodal models

Table 2. Performance of different models for skeletal movement-based emotion recognition

Fig. 4. Confusion matrices of each model in our experiment: (a) gesture, (b) audio, (c) text, and (d) multimodal.

Table 3. Comparison between noisy data and clean data

Table 4. Comparison of dataset division methods

Fig. 5. Model performance as a function of the proportion of training data used, with and without the pretrained model.
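As a complement to Fig. 3, here is a hedged sketch of what a graph attention (GA) fusion module over the three modality nodes (audio, text, gesture) could look like. The GAT-style pairwise scoring, the feature dimensionality, and the mean pooling over nodes are assumptions made for illustration, not the paper's exact module.

```python
# Sketch of a graph attention (GA) fusion over three modality nodes
# (audio, text, gesture). The pairwise scoring follows the standard
# GAT style but is an assumption, not necessarily the paper's module.
import torch
import torch.nn as nn


class GraphAttentionFusion(nn.Module):
    def __init__(self, dim=128, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Pairwise attention score from concatenated node features.
        self.score = nn.Linear(2 * dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, nodes):
        # nodes: (batch, 3, dim) -- one feature vector per modality.
        h = self.proj(nodes)
        b, n, d = h.shape
        # Build all (i, j) node pairs and score each edge.
        hi = h.unsqueeze(2).expand(b, n, n, d)
        hj = h.unsqueeze(1).expand(b, n, n, d)
        e = torch.relu(self.score(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)        # (batch, n, n) edge weights
        h = torch.bmm(alpha, h)                 # attention-weighted neighbors
        fused = h.mean(dim=1)                   # pool the modality nodes
        return self.classifier(fused)


# Example: one 128-d feature vector per modality for a batch of 10.
audio, text, gesture = (torch.randn(10, 128) for _ in range(3))
logits = GraphAttentionFusion()(torch.stack([audio, text, gesture], dim=1))
print(logits.shape)  # torch.Size([10, 4])
```

In a sketch like this, each modality attends over the others, so a weak cue (e.g. noisy skeletal data) can be reweighted by the stronger audio or text features before classification; the actual weighting scheme in the paper may differ.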