Engagement recognition by a latent character model based on multimodal listener behaviors in spoken dialogue

  • Koji Inoue, Divesh Lala, Katsuya Takanashi and Tatsuya Kawahara

Engagement represents how much a user is interested in and willing to continue the current dialogue. Recognizing engagement gives dialogue systems an important cue for adapting their behavior to the user. This paper addresses engagement recognition based on four multimodal listener behaviors: backchannels, laughing, head nodding, and eye gaze. In engagement annotation, ground-truth labels often differ from one annotator to another because the perception of engagement is subjective. To deal with this, we assume that each annotator has a latent character that affects his or her perception of engagement. We propose a hierarchical Bayesian model that estimates both engagement and the character of each annotator as latent variables. Furthermore, we integrate the engagement recognition model with automatic detection of the listener behaviors to realize online engagement recognition. Experimental results show that the proposed model improves recognition accuracy compared with methods that do not consider annotator character, such as majority voting. We also achieve online engagement recognition without degrading accuracy.
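
As a rough illustration of the majority-voting baseline mentioned above, and of why annotator disagreement matters, the following is a minimal sketch rather than the authors' implementation. The binary labels, the toy `annotations` data, the tie-breaking rule, and the helper names `majority_vote` and `agreement_with_majority` are all assumptions made for this example.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate binary engagement labels (1 = engaged, 0 = not engaged).

    Ties resolve to 0; this tie rule is an assumption of the sketch.
    """
    counts = Counter(labels)
    return 1 if counts[1] > counts[0] else 0

# Hypothetical annotations: each row is one dialogue segment,
# each column is one annotator's binary engagement label.
annotations = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]

# Majority voting collapses the annotators into a single reference label
# per segment, discarding who gave which label.
reference = [majority_vote(row) for row in annotations]

def agreement_with_majority(annotations, reference, annotator):
    """Fraction of segments where one annotator matches the majority label."""
    matches = sum(row[annotator] == ref for row, ref in zip(annotations, reference))
    return matches / len(annotations)

# Per-annotator agreement rates differ, which is the kind of
# annotator-specific tendency that majority voting ignores.
rates = [agreement_with_majority(annotations, reference, a) for a in range(3)]
print(reference, rates)
```

In this toy data the three annotators agree with the majority label at different rates, so a single aggregated label hides systematic differences between them. The proposed model instead treats such differences as a latent character of each annotator.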

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence, which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Corresponding author: Koji Inoue
APSIPA Transactions on Signal and Information Processing
  • ISSN: 2048-7703
  • EISSN: 2048-7703
  • URL: /core/journals/apsipa-transactions-on-signal-and-information-processing