Survey on audiovisual emotion recognition: databases, features, and data fusion strategies

  • Chung-Hsien Wu (a1), Jen-Chun Lin (a1) (a2) and Wen-Li Wei (a1)

Abstract

Emotion recognition is the ability to identify what a person is feeling from moment to moment and to understand the connection between feelings and expressions. In today's world, human–computer interaction (HCI) interfaces undoubtedly play an important role in daily life. Toward harmonious HCI, the automated analysis and recognition of human emotion have attracted increasing attention from researchers across multiple disciplines. This paper surveys theoretical and practical work offering new and broad views of the latest research in emotion recognition from bimodal information, namely facial and vocal expressions. First, the currently available audiovisual emotion databases are described. Facial and vocal features and audiovisual bimodal data fusion methods for emotion recognition are then surveyed and discussed. The survey also covers the recent emotion recognition challenges held at several conferences. Conclusions outline and address some open issues in emotion recognition.
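The bimodal fusion methods covered by the survey fall broadly into feature-level (early) and decision-level (late) schemes. As a concrete illustration, the following is a minimal sketch of decision-level fusion: two hypothetical unimodal classifiers, one acoustic and one facial, each produce per-class posteriors for an utterance, and these are combined by a weighted sum before the final decision. The label set, weights, and posterior values below are illustrative assumptions, not drawn from any specific system surveyed here.

    # Minimal sketch of decision-level (late) audiovisual fusion.
    # Assumption: each unimodal model has already produced a posterior
    # distribution over the same emotion classes for one utterance.
    import numpy as np

    EMOTIONS = ["anger", "happiness", "sadness", "neutral"]  # hypothetical label set

    def late_fusion(audio_probs: np.ndarray,
                    visual_probs: np.ndarray,
                    audio_weight: float = 0.5) -> str:
        """Fuse per-class posteriors from the audio and visual streams
        with a weighted sum and return the winning emotion label."""
        fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
        return EMOTIONS[int(np.argmax(fused))]

    # Illustrative posteriors for a single utterance (made-up numbers).
    audio_probs = np.array([0.10, 0.60, 0.20, 0.10])   # from a hypothetical speech model
    visual_probs = np.array([0.05, 0.70, 0.15, 0.10])  # from a hypothetical facial model
    print(late_fusion(audio_probs, visual_probs))      # -> happiness

Feature-level fusion would instead concatenate the audio and visual feature vectors before training a single classifier, while model-level schemes such as coupled HMMs sit between these extremes by explicitly modeling the temporal interaction of the two streams.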

Copyright

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Corresponding author

Chung-Hsien Wu, chunghsienwu@gmail.com


Supplementary materials

Wu Supplementary Material (PDF, 168 KB)
