
6 - Computational Analysis of Vocal Expression of Affect: Trends and Challenges

from Part I - Conceptual Models of Social Signals

Published online by Cambridge University Press:  13 July 2017

Klaus Scherer, University of Geneva
Björn Schuller, Imperial College London and Technical University Munich
Aaron Elkins, San Diego State University
Judee K. Burgoon, University of Arizona
Nadia Magnenat-Thalmann, Université de Genève
Maja Pantic, Imperial College London
Alessandro Vinciarelli, University of Glasgow

Summary

In this chapter, we first provide a short introduction to the "classic" audio features used in this field and to the methods that enable the automatic recognition of human emotion as reflected in the voice. From there, we focus on the main trends leading up to the central challenges for future research. Admittedly, the line is difficult to draw here: it is not obvious what counts as a contemporary trend and where the "future" starts. Further, several of the named trends and challenges are not limited to the analysis of speech but hold for many, if not all, modalities; we focus on examples and references from the speech analysis domain.

“Classic Features”: Perceptual and Acoustic Measures

Systematic treatises on the importance of emotional expression in speech communication, and on its powerful impact on the listener, can be found throughout history. Early Greek and Roman manuals on rhetoric (e.g., by Aristotle, Cicero, and Quintilian) suggested concrete strategies for making speech emotionally expressive. Evolutionary theorists such as Spencer, Bell, and Darwin highlighted the social functions of emotional expression in speech and music. The empirical investigation of the effect of emotion on the voice started with psychiatrists trying to diagnose emotional disturbances and with early radio researchers concerned with the communication of speaker attributes and states via vocal cues in speech, both using the newly developed methods of electroacoustic analysis. Systematic research programs started in the 1960s, when psychiatrists renewed their interest in diagnosing affective states, nonverbal communication researchers explored the capacity of different bodily channels to carry signals of emotion, emotion psychologists charted the expression of emotion in different modalities, and linguists – particularly phoneticians – discovered the importance of pragmatic information, all making use of ever more sophisticated technology to study the effects of emotion on the voice (see Scherer, 2003, for further details).

While much of the relevant research has focused exclusively on the recognition of vocally expressed emotions by naive listeners, research on the production of emotional speech has used the extraction of acoustic parameters from the speech signal as a method to understand the patterning of the vocal expression of different emotions.
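As a minimal illustration of what such acoustic parameter extraction involves, the sketch below estimates two classic measures – fundamental frequency (F0) and short-time energy – from a single analysis frame. The synthetic 220 Hz signal, the 40 ms frame length, and the autocorrelation-based F0 estimator are illustrative assumptions, not the specific procedures used in the studies cited in this chapter.

```python
import numpy as np

def frame_energy(frame):
    """Mean short-time energy (average squared amplitude) of one frame."""
    return float(np.sum(frame ** 2) / len(frame))

def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Crude F0 estimate: pick the autocorrelation peak whose lag lies
    in the plausible voice range [fmin, fmax] Hz."""
    frame = frame - frame.mean()
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)   # smallest admissible period in samples
    hi = int(sr / fmin)   # largest admissible period in samples
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# Synthetic "vocal" signal: a 220 Hz tone sampled at 16 kHz, one 40 ms frame.
sr = 16000
t = np.arange(0, 0.04, 1 / sr)
frame = 0.5 * np.sin(2 * np.pi * 220 * t)

f0 = estimate_f0(frame, sr)       # close to 220 Hz
energy = frame_energy(frame)      # close to 0.125 for a 0.5-amplitude sine
```

In practice, such low-level descriptors are computed over a sliding window across an utterance and then summarized by statistical functionals (means, ranges, contours) before being related to the expressed emotion.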


References

Banea, C., Mihalcea, R., & Wiebe, J. (2011). Multilingual sentiment and subjectivity. In I. Zitouni & D. Bikel (Eds), Multilingual Natural Language Processing. Prentice Hall.
Batliner, A. & Schuller, B. (2014). More than fifty years of speech processing – the rise of computational paralinguistics and ethical demands. In Proceedings ETHICOMP 2014. Paris, France: CERNA, Commission de réflexion sur l'Ethique de la Recherche en sciences et technologies du Numérique d'Allistene.
Bonneh, Y. S., Levanon, Y., Dean-Pardo, O., Lossos, L., & Adini, Y. (2011). Abnormal speech spectrum and increased pitch variability in young autistic children. Frontiers in Human Neuroscience, 4.
Callejas, Z. & López-Cózar, R. (2008). Influence of contextual information in emotion annotation for spoken dialogue systems. Speech Communication, 50(5), 416–433.
Chen, S. X. & Bond, M. H. (2010). Two languages, two personalities? Examining language effects on the expression of personality in a bilingual context. Personality and Social Psychology Bulletin, 36(11), 1514–1528.
Cirillo, J. (2004). Communication by unvoiced speech: The role of whispering. Annals of the Brazilian Academy of Sciences, 76(2), 1–11.
Cirillo, J. & Todt, D. (2002). Decoding whispered vocalizations: Relationships between social and emotional variables. In Proceedings IX International Conference on Neural Information Processing (ICONIP) (pp. 1559–1563).
Coutinho, E., Deng, J., & Schuller, B. (2014). Transfer learning emotion manifestation across music and speech. In Proceedings 2014 International Joint Conference on Neural Networks (IJCNN) as part of the IEEE World Congress on Computational Intelligence (IEEE WCCI). Beijing: IEEE.
Cowie, R. (2011). Editorial: "Ethics and good practice" – computers and forbidden places: Where machines may and may not go. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented Systems: The Humaine Handbook (pp. 707–712). Berlin: Springer.
Davidov, D., Tsur, O., & Rappoport, A. (2010). Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings 14th Conference on Computational Natural Language Learning (pp. 107–116).
Deng, J. & Schuller, B. (2012). Confidence measures in speech emotion recognition based on semi-supervised learning. In Proceedings Interspeech 2012. Portland, OR.
Deng, J., Han, W., & Schuller, B. (2012). Confidence measures for speech emotion recognition: A start. In T. Fingscheidt & W. Kellermann (Eds), Proceedings 10th ITG Conference on Speech Communication (pp. 1–4). Braunschweig, Germany: IEEE.
Deng, J., Zhang, Z., Marchi, E., & Schuller, B. (2013). Sparse autoencoder-based feature transfer learning for speech emotion recognition. In Proceedings 5th Biannual Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII 2013) (pp. 511–516). Geneva: IEEE.
Deng, J., Xia, R., Zhang, Z., Liu, Y., & Schuller, B. (2014). Introducing shared-hidden-layer autoencoders for transfer learning and their application in acoustic emotion recognition. In Proceedings 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014). Florence, Italy: IEEE.
Dhall, A., Goecke, R., Joshi, J., Wagner, M., & Gedeon, T. (Eds) (2013). Proceedings of the 2013 Emotion Recognition in the Wild Challenge and Workshop. Sydney: ACM.
Döring, S., Goldie, P., & McGuinness, S. (2011). Principalism: A method for the ethics of emotion-oriented machines. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented Systems: The Humaine Handbook (pp. 713–724). Berlin: Springer.
Forbes-Riley, K. & Litman, D. (2004). Predicting emotion in spoken dialogue from multiple knowledge sources. In Proceedings HLT/NAACL (pp. 201–208).
Goldie, P., Döring, S., & Cowie, R. (2011). The ethical distinctiveness of emotion-oriented technology: Four long-term issues. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented Systems: The Humaine Handbook (pp. 725–734). Berlin: Springer.
Grossman, R. B., Bemis, R. H., Skwerer, D. P., & Tager-Flusberg, H. (2010). Lexical and affective prosody in children with high-functioning autism. Journal of Speech, Language, and Hearing Research, 53, 778–793.
Gunes, H., Schuller, B., Pantic, M., & Cowie, R. (2011). Emotion representation, analysis and synthesis in continuous space: A survey. In Proceedings International Workshop on Emotion Synthesis, Representation, and Analysis in Continuous Space (EmoSPACE 2011), held in conjunction with the 9th IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011) (pp. 827–834). Santa Barbara, CA: IEEE.
Han, W., Zhang, Z., Deng, J., et al. (2012). Towards distributed recognition of emotion in speech. In Proceedings 5th International Symposium on Communications, Control, and Signal Processing (ISCCSP 2012) (pp. 1–4). Rome, Italy: IEEE.
Han, W., Li, H., Ruan, H., et al. (2013). Active learning for dimensional speech emotion recognition. In Proceedings Interspeech 2013 (pp. 2856–2859). Lyon, France: ISCA.
Havasi, C., Speer, R., & Alonso, J. (2007). ConceptNet 3: A flexible, multilingual semantic network for common sense knowledge. In Recent Advances in Natural Language Processing, September.
Hayes, A. F. & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89.
Juslin, P. N. & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814.
Kajackas, A., Anskaitis, A., & Gursnys, D. (2008). Peculiarities of testing the impact of packet loss on voice quality. Electronics and Electrical Engineering, 82(2), 35–40.
Kövecses, Z. (2000). The concept of anger: Universal or culture specific? Psychopathology, 33, 159–170.
Lindquist, K., Feldman Barrett, L., Bliss-Moreau, E., & Russell, J. (2006). Language and the perception of emotion. Emotion, 6(1), 125–138.
Liscombe, J., Riccardi, G., & Hakkani-Tür, D. (2005). Using context to improve emotion detection in spoken dialog systems. In Proceedings Interspeech (pp. 1845–1848).
Mahdhaoui, A. & Chetouani, M. (2009). A new approach for motherese detection using a semi-supervised algorithm. In Machine Learning for Signal Processing XIX – Proceedings of the 2009 IEEE Signal Processing Society Workshop (MLSP) (pp. 1–6).
Marchi, E., Schuller, B., Batliner, A., et al. (2012a). Emotion in the speech of children with autism spectrum conditions: Prosody and everything else. In Proceedings 3rd Workshop on Child, Computer and Interaction (WOCCI 2012), Satellite Event of Interspeech 2012. Portland, OR: ISCA.
Marchi, E., Batliner, A., Schuller, B., et al. (2012b). Speech, emotion, age, language, task, and typicality: Trying to disentangle performance and feature relevance. In Proceedings 1st International Workshop on Wide Spectrum Social Signal Processing (WS3P 2012), held in conjunction with the ASE/IEEE International Conference on Social Computing (SocialCom 2012). Amsterdam, The Netherlands: IEEE.
Obin, N. (2012). Cries and whispers – classification of vocal effort in expressive speech. In Proceedings Interspeech. Portland, OR: ISCA.
Patel, S. & Scherer, K. R. (2013). Vocal behaviour. In J. A. Hall & M. L. Knapp (Eds), Handbook of Nonverbal Communication. Berlin: Mouton-DeGruyter.
Ramírez-Esparza, N., Gosling, S. D., Benet-Martínez, V., Potter, J. P., & Pennebaker, J. W. (2006). Do bilinguals have two personalities? A special case of cultural frame switching. Journal of Research in Personality, 40, 99–120.
Riviello, M. T., Chetouani, M., Cohen, D., & Esposito, A. (2010). On the perception of emotional "voices": A cross-cultural comparison among American, French and Italian subjects. In Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues (vol. 6800, pp. 368–377). Springer LNCS.
Sauter, D., Eisner, F., Ekman, P., & Scott, S. K. (2010). Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proceedings of the National Academy of Sciences of the United States of America, 107(6), 2408–2412.
Sauter, D. A. (2006). An investigation into vocal expressions of emotions: The roles of valence, culture, and acoustic factors. PhD thesis, University College London.
Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99, 143–165.
Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227–256.
Scherer, K. R. & Brosch, T. (2009). Culture-specific appraisal biases contribute to emotion dispositions. European Journal of Personality, 23, 265–288.
Schröder, M., Devillers, L., Karpouzis, K., et al. (2007). What should a generic emotion markup language be able to represent? In A. Paiva, R. W. Picard, & R. Prada (Eds), Affective Computing and Intelligent Interaction: Second International Conference, ACII 2007, Lisbon, Portugal, September 12–14, 2007, Proceedings. Lecture Notes in Computer Science (LNCS) (vol. 4738, pp. 440–451). Berlin: Springer.
Schuller, B. (2012). The computational paralinguistics challenge. IEEE Signal Processing Magazine, 29(4), 97–101.
Schuller, B. & Batliner, A. (2013). Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Hoboken, NJ: Wiley.
Schuller, B. & Devillers, L. (2010). Incremental acoustic valence recognition: An inter-corpus perspective on features, matching, and performance in a gating paradigm. In Proceedings Interspeech (pp. 2794–2797). Makuhari, Japan: ISCA.
Schuller, B., Dunwell, I., Weninger, F., & Paletta, L. (2013a). Serious gaming for behavior change – the state of play. IEEE Pervasive Computing Magazine, Special Issue on Understanding and Changing Behavior, 12(3), 48–55.
Schuller, B., Steidl, S., Batliner, A., et al. (2013b). The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings Interspeech 2013 (pp. 148–152). Lyon, France: ISCA.
Silverman, K., Beckman, M., Pitrelli, J., et al. (1992). ToBI: A standard for labeling English prosody. In Proceedings ICSLP (vol. 2, pp. 867–870).
Sneddon, I., Goldie, P., & Petta, P. (2011). Ethics in emotion-oriented systems: The challenges for an ethics committee. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented Systems: The Humaine Handbook. Berlin: Springer.
Sundberg, J., Patel, S., Björkner, E., & Scherer, K. R. (2011). Interdependencies among voice source parameters in emotional speech. IEEE Transactions on Affective Computing, 99, 2423–2426.
Tawari, A. & Trivedi, M. M. (2010a). Speech emotion analysis: Exploring the role of context. IEEE Transactions on Multimedia, 12(6), 502–509.
Tawari, A. & Trivedi, M. M. (2010b). Speech emotion analysis in noisy real world environment. In Proceedings 20th International Conference on Pattern Recognition (ICPR) (pp. 4605–4608). Istanbul, Turkey: IAPR.
Weninger, F., Eyben, F., Schuller, B., Mortillaro, M., & Scherer, K. R. (2013). On the acoustics of emotion in audio: What speech, music and sound have in common. Frontiers in Psychology, Emotion Science, Special Issue on Expression of Emotion in Music and Vocal Communication, 4(292), 1–12.
Wöllmer, M., Eyben, F., Reiter, S., et al. (2008). Abandoning emotion classes – towards continuous emotion recognition with modelling of long-range dependencies. In Proceedings Interspeech 2008 (pp. 597–600). Brisbane, Australia: ISCA.
Wöllmer, M., Weninger, F., Knaup, T., et al. (2013). YouTube movie reviews: Sentiment analysis in an audiovisual context. IEEE Intelligent Systems Magazine, Special Issue on Statistical Approaches to Concept-Level Sentiment Analysis, 28(3), 46–53.
Wu, D. & Parsons, T. (2011). Active class selection for arousal classification. In Proceedings Affective Computing and Intelligent Interaction (ACII) (pp. 132–141).
Zhang, Z. & Schuller, B. (2012). Active learning by sparse instance tracking and classifier confidence in acoustic emotion recognition. In Proceedings Interspeech 2012. Portland, OR: ISCA.
Zhang, Z., Weninger, F., Wöllmer, M., & Schuller, B. (2011). Unsupervised learning in cross-corpus acoustic emotion recognition. In Proceedings 12th Biannual IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2011) (pp. 523–528). Big Island, HI: IEEE.
Zhang, Z., Deng, J., Marchi, E., & Schuller, B. (2013a). Active learning by label uncertainty for acoustic emotion recognition. In Proceedings Interspeech 2013 (pp. 2841–2845). Lyon, France: ISCA.
Zhang, Z., Deng, J., & Schuller, B. (2013b). Co-training succeeds in computational paralinguistics. In Proceedings 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013) (pp. 8505–8509). Vancouver: IEEE.
