
Dialogue agents 101: a beginner’s guide to critical ingredients for designing effective conversational systems

Published online by Cambridge University Press:  09 September 2024

Shivani Kumar*
Affiliation:
Indraprastha Institute of Information Technology, Delhi, India
Sumit Bhatia
Affiliation:
Media and Data Science Research Lab, Adobe, India
Milan Aggarwal
Affiliation:
Media and Data Science Research Lab, Adobe, India
Tanmoy Chakraborty
Affiliation:
Indian Institute of Technology, Delhi, India
*
Corresponding author: Shivani Kumar; Email: shivaniku@iiitd.ac.in

Abstract

Sharing ideas through communication with peers is the primary mode of human interaction. Consequently, extensive research has been conducted in the area of conversational AI, leading to an increase in the availability and diversity of conversational tasks, datasets, and methods. However, with numerous tasks being explored simultaneously, the current landscape of conversational AI has become fragmented. As a result, designing a well-thought-out dialogue agent can pose significant challenges for a practitioner. To highlight the critical ingredients a practitioner needs to design a dialogue agent from scratch, this study provides a comprehensive overview of the primary characteristics of a dialogue agent, the supporting tasks, their corresponding open-domain datasets, and the methods used to benchmark these datasets. We observe that distinct dialogue tasks have been tackled with different methods. However, building a separate model for each task is costly and fails to leverage the correlations among the several tasks of a dialogue agent. Consequently, recent trends suggest a shift toward building unified foundation models. To this end, we propose Unit, a Unified dialogue dataset constructed from conversations drawn from varying datasets for different dialogue tasks, capturing the nuances of each. We then train a unified dialogue foundation model, GPT-2$^{\textrm{U}}$, and present a concise comparative evaluation of GPT-2$^{\textrm{U}}$ against existing large language models. We also examine the evaluation strategies used to measure the performance of dialogue agents and highlight the scope for future research in conversational AI, with a thorough discussion of popular models such as ChatGPT.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Characteristics of each task based on the taxonomic characteristics of a dialogue agent. Size indicates an approximate value expressed in thousands (k). Abbreviations—DR: Dialogue Rewrite, DS: Dialogue Summary, D2S: Dialogue to Structure, QA: Question Answering, KGR: Knowledge-Grounded Response, CC: Chit-chat, TOD: Task-Oriented Dialogues, ID: Intent Detection, SF: Slot Filling, DST: Dialogue State Tracking, AD: Affect Detection, GO: Goal-Oriented, Spc: Specific, ST: Single Turn, MT: Multi Turn, U: Unimodal, M: Multimodal, Unstr: Unstructured, Str: Structured, Eng: Engaging, Inf: Informative, Instr: Instructional, Emp: Empathetic


Figure 1. A taxonomic overview of a dialogue agent. The major components for designing a complete pipeline of a dialogue agent are—input(s), natural language understanding (NLU), generated output(s), and model evaluation. Each component can be further divided based on the characteristics required in the final dialogue agent.


Figure 2. Dialogues highlighting different attributes of a dialogue agent input and output.


Table 2. Statistics of Unit, the unified dialogue dataset. Abbreviations—Dlgs: Dialogues, Utts: Utterances


Figure 3. All $39$ datasets from distinct tasks are standardized and combined into a single conversational dataset called Unit. Unit is then used to further pretrain GPT-2 with the intent of capturing the nuances of all tasks.
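The standardize-and-combine step of Figure 3 can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: the source names, record formats, and the unified schema (a list of speaker-tagged turns) are hypothetical, chosen only to show how heterogeneous dialogue datasets might be mapped onto one format before pretraining.

```python
# Hypothetical sketch of the standardization behind Unit: records from
# different dialogue datasets are mapped onto a common schema and then
# concatenated. Field names are illustrative, not the authors' format.

def standardize(record, source):
    """Map a source-specific dialogue record to a unified schema."""
    if source == "chit_chat":          # e.g. a list of alternating strings
        turns = [{"speaker": f"S{i % 2 + 1}", "text": t}
                 for i, t in enumerate(record["utterances"])]
    elif source == "task_oriented":    # e.g. (speaker, text) pairs
        turns = [{"speaker": s, "text": t} for s, t in record["turns"]]
    else:
        raise ValueError(f"unknown source: {source}")
    return {"source": source, "turns": turns}

def build_unified(datasets):
    """Concatenate standardized records from all source datasets."""
    unified = []
    for source, records in datasets.items():
        unified.extend(standardize(r, source) for r in records)
    return unified

datasets = {
    "chit_chat": [{"utterances": ["Hi!", "Hello, how are you?"]}],
    "task_oriented": [{"turns": [("user", "Book a table"),
                                 ("agent", "For how many people?")]}],
}
unit = build_unified(datasets)
```

The unified records could then be serialized into plain text for further pretraining a causal language model such as GPT-2.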


Figure 4. Log–log distribution of the number of speakers and the number of utterances per dialogue in Unit. Most dialogues contain $2$ speakers ($10$ utterances), while the maximum number of speakers (utterances) in a dialogue is $260$ ($527$).
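The per-dialogue counts plotted in Figure 4 can be computed with a few lines of standard-library Python. This is a minimal sketch assuming the unified schema used above (each dialogue is a list of speaker-tagged turns); the function name and data layout are illustrative.

```python
from collections import Counter

def dialogue_stats(dialogues):
    """Count distinct speakers and utterances per dialogue,
    as summarized in the Figure 4 distributions."""
    speakers_per_dialogue = Counter(
        len({turn["speaker"] for turn in d}) for d in dialogues)
    utterances_per_dialogue = Counter(len(d) for d in dialogues)
    return speakers_per_dialogue, utterances_per_dialogue

dialogues = [
    [{"speaker": "A", "text": "hi"}, {"speaker": "B", "text": "hello"}],
    [{"speaker": "A", "text": "q"}, {"speaker": "B", "text": "a"},
     {"speaker": "A", "text": "thanks"}],
]
spk, utt = dialogue_stats(dialogues)
```

Plotting these counters on logarithmic axes (e.g. with matplotlib's `loglog`) would reproduce the style of distribution shown in the figure.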


Figure 5. Distribution of the sizes of the different datasets in Unit. The four largest datasets are the Ubuntu Dialogue Corpus, SODA, ConvAI3: ClariQ, and bAbI, followed by comparatively smaller datasets.


Table 3. Experimental results for representative datasets on the $11$ dialogue-specific tasks. The metric used for generation is ROUGE-1, whereas classification is evaluated using accuracy. For abbreviations, please refer to Table 1


Table 4. Results of human evaluation for the representative tasks


Figure 6. Distribution of datasets covering the specific dialogue attributes. Abbreviations—ip-im-ug-cc: input-implicit-user goal-chit chat, ip-im-ug-gc: input-implicit-user goal-goal completion, ip-im-d-o: input-implicit-domain-open, ip-im-d-sp: input-implicit-domain-specific, ip-im-c-st: input-implicit-context-single turn, ip-im-c-mt: input-implicit-context-multi turn, ip-ex-m-u: input-explicit-modality-unimodal, ip-ex-m-m: input-explicit-modality-multimodal, ip-ex-k-n: input-explicit-knowledge-none, ip-ex-k-u: input-explicit-knowledge-unstructured, ip-ex-k-s: input-explicit-knowledge-structured, op-im-t-cc: output-implicit-type-chit chat, op-im-t-gc: output-implicit-type-goal completion, op-im-s-e: output-implicit-style-engaging, op-im-s-inf: output-implicit-style-informative, op-im-s-in: output-implicit-style-instructional, op-im-s-em: output-implicit-style-empathetic, op-ex-m-u: output-explicit-modality-unimodal, op-ex-m-m: output-explicit-modality-multimodal, op-ex-s-st: output-explicit-structure-short text, op-ex-s-lt: output-explicit-structure-long text, op-ex-s-str: output-explicit-structure-structural.