Actionable conversational quality indicators for improving task-oriented dialog systems

Published online by Cambridge University Press: 09 January 2024

Michael Higgins
Affiliation:
LivePerson Inc. New York, NY, USA
Dominic Widdows*
Affiliation:
LivePerson Inc. New York, NY, USA; IonQ Inc. College Park, MD, USA
Beth Ann Hockey
Affiliation:
LivePerson Inc. New York, NY, USA
Akshay Hazare
Affiliation:
LivePerson Inc. New York, NY, USA
Kristen Howell
Affiliation:
LivePerson Inc. New York, NY, USA
Gwen Christian
Affiliation:
LivePerson Inc. New York, NY, USA
Sujit Mathi
Affiliation:
LivePerson Inc. New York, NY, USA
Chris Brew
Affiliation:
LivePerson Inc. New York, NY, USA
Andrew Maurer
Affiliation:
LivePerson Inc. New York, NY, USA
George Bonev
Affiliation:
LivePerson Inc. New York, NY, USA
Matthew Dunn
Affiliation:
LivePerson Inc. New York, NY, USA
Joseph Bradley
Affiliation:
LivePerson Inc. New York, NY, USA
* Corresponding author: Dominic Widdows; Email: widdows@ionq.com

Abstract

Automatic dialog systems have become a mainstream part of online customer service. Many such systems are built, maintained, and improved by customer service specialists, rather than dialog systems engineers and computer programmers. As conversations between people and machines become commonplace, it is critical to understand what is working, what is not, and what actions can be taken to reduce the frequency of inappropriate system responses. These analyses and recommendations need to be presented in terms that directly reflect the user experience rather than the internal dialog processing.

This paper introduces and explains the use of Actionable Conversational Quality Indicators (ACQIs), which are used both to recognize parts of dialogs that can be improved and to recommend how to improve them. This approach combines the benefits of previous approaches, some of which have focused on producing dialog quality scores while others have sought to categorize the types of errors the dialog system is making. We demonstrate the effectiveness of using ACQIs on LivePerson internal dialog systems used in commercial customer service applications and on the publicly available LEGOv2 conversational dataset. We report on the annotation and analysis of conversational datasets, showing which ACQIs are important to fix in various situations.

The annotated datasets are then used to build a predictive model that uses a turn-based vector embedding of the message texts and achieves a weighted average F1-measure of 79% at the task of finding the correct ACQI for a given conversation. We predict that if such a model worked perfectly, the range of potential improvement actions a bot-builder must consider at each turn could be reduced by an average of 81%.
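
For readers unfamiliar with the metric, the following is a minimal sketch of how a weighted average F1-measure can be computed for a multi-class labeling task: each class's F1 is weighted by the number of reference items in that class. The function name `weighted_f1` and the labels `label_a`, `label_b`, `label_c` are hypothetical placeholders, not ACQI names or code from this work.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    support = Counter(y_true)          # number of reference items per class
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1  # weight each class by its support
    return score


# Hypothetical labels, for illustration only.
reference = ["label_a", "label_a", "label_b", "label_c"]
predicted = ["label_a", "label_b", "label_b", "label_c"]
print(round(weighted_f1(reference, predicted), 4))
```

Weighting by support means frequent classes dominate the score, which is worth keeping in mind when the label distribution is skewed.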

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. Typical dialog system architecture illustrating components for both spoken and text-based input and output.

Figure 2. A dialog system-building interface where the bot-builder is about to add a multiple-choice question.

Table 1. Summary description and statistics on parts of LEGOv2 and LivePerson datasets annotated and used in this work

Table 2. ACQIs with their associated actions in example text and spoken dialog systems (NLU = Natural Language Understanding. ASR = Automatic Speech Recognition)

Table 3. Average minimum and final IQ scores from annotation

Table 4. Annotator agreement for annotating IQ and ACQI, showing linear weighted Cohen kappa (LWCK), unweighted average recall (UAR), Spearman rank correlation ($\rho$) for IQ and Cohen kappa (CK) for ACQI. Following Schmitt and Ultes (2015), we take the average agreement across each pair of annotators
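
As a rough illustration of two of the agreement measures named in the Table 4 caption, the sketch below computes linear weighted Cohen kappa and unweighted average recall for two sequences of ordinal ratings. It assumes a 1 to 5 IQ scale, and the function names and example ratings are hypothetical; it is not the evaluation code used in this work.

```python
def linear_weighted_kappa(a, b, categories):
    """Cohen kappa with disagreement weights proportional to |i - j|."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed disagreement and weighted chance disagreement.
    num = sum(abs(i - j) * observed[i][j] for i in range(k) for j in range(k))
    den = sum(abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k))
    # Assumption: if both raters use one identical category, report 1.0.
    return 1.0 - num / den if den else 1.0


def unweighted_average_recall(reference, predicted):
    """Mean of per-class recall over the classes present in the reference."""
    pairs = list(zip(reference, predicted))
    recalls = []
    for c in sorted(set(reference)):
        hits = sum(1 for r, p in pairs if r == c and p == c)
        count = sum(1 for r in reference if r == c)
        recalls.append(hits / count)
    return sum(recalls) / len(recalls)


# Hypothetical ratings from two annotators on a 1-5 scale.
rater_1 = [1, 2, 3, 4, 5, 5]
rater_2 = [1, 2, 3, 4, 5, 4]
print(linear_weighted_kappa(rater_1, rater_2, [1, 2, 3, 4, 5]))
print(unweighted_average_recall(rater_1, rater_2))
```

The linear weighting penalizes a disagreement of 5 versus 1 four times as heavily as 5 versus 4, which suits an ordinal scale; unweighted average recall treats every class equally regardless of how often it occurs.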

Figure 3. IQ score distribution between IQ annotations in Ultes et al. (2015) and the work reported in this paper.

Figure 4. Distribution of negative/neutral/positive score changes grouped by LEGOv2 and LivePerson dialog systems.

Figure 5. ACQIs from Table 2 along with the proportions of each that were aligned with positive, negative, and neutral changes in IQ. Note that for the above graphic, we excluded any turn whose preceding IQ score was a 1 or 5.

Table 5. Distribution of ACQIs given a decrease in IQ score

Figure 6. Relationship between number of confirmations and score change.

Table 6. IQ Model Performance: linear weighted Cohen kappa (LWCK), unweighted average recall (UAR), Spearman rank correlation ($\rho$) for IQ. Model selection and hyper-parameter selection were accomplished by nested cross-validation (5 folds)

Table 7. ACQI model performance

Figure 7. Performance of ACQI models based on number of training conversations.

Table 8. Average number of recommended actions per dialog system when there is no measurement strategy (None), IQ is available (assuming no actions required when score does not decrement), ACQI alone is available, and IQ + ACQI. 95% confidence intervals were calculated taking 1000 bootstrapped samples (at turn level) per dialog system
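
The bootstrapped confidence intervals mentioned in the Table 8 caption can be sketched as follows. This is a minimal percentile-bootstrap example that resamples turn-level values with replacement 1000 times; the percentile method, the function name `bootstrap_ci`, and the example action counts are assumptions made for illustration, since the caption does not specify these details.

```python
import random


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n  # resample turns with replacement
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical number of recommended actions at each turn.
actions_per_turn = [3, 1, 0, 2, 4, 1, 0, 0, 2, 3, 1, 5]
low, high = bootstrap_ci(actions_per_turn)
print(low, high)
```

Resampling at turn level treats turns as independent draws; resampling whole conversations instead would be an alternative design if turns within a dialog are strongly correlated.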

Table A1. Quality score annotation guidelines

Figure A1. Annotation tool showing tooltips for Bot State.

Table A2. ACQI options and descriptions

Table B1. Generalization of ACQI models for production dialog systems across data from different industries within the LivePerson framework

Table B2. Generalization of IQ models for production dialog systems across data from different industries within the LivePerson framework: Linear weighted Cohen kappa (LWCK), unweighted average recall (UAR), Spearman rank correlation ($\rho$)

Table B3. Generalization of ACQI models between LivePerson and LEGOv2

Table B4. Generalization of IQ models between LivePerson and LEGOv2: Linear weighted Cohen kappa (LWCK), unweighted average recall (UAR), Spearman rank correlation ($\rho$)