A randomized prospective study of a hybrid rule- and data-driven virtual patient

Published online by Cambridge University Press:  23 September 2022

Adam Stiff*
Affiliation:
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Michael White
Affiliation:
Department of Linguistics, The Ohio State University, Columbus, OH, USA
Eric Fosler-Lussier
Affiliation:
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Lifeng Jin
Affiliation:
Department of Linguistics, The Ohio State University, Columbus, OH, USA
Evan Jaffe
Affiliation:
Department of Linguistics, The Ohio State University, Columbus, OH, USA
Douglas Danforth
Affiliation:
The Ohio State University Wexner Medical Center, Columbus, OH, USA
*Corresponding author. E-mail: stiff.4@osu.edu

Abstract

Randomized prospective studies represent the gold standard for experimental design. In this paper, we present a randomized prospective study to validate the benefits of combining rule-based and data-driven natural language understanding methods in a virtual patient dialogue system. The system uses a rule-based pattern matching approach together with a machine learning (ML) approach in the form of a text-based convolutional neural network, combining the two methods with a simple logistic regression model to choose between their predictions for each dialogue turn. In an earlier, retrospective study, the hybrid system yielded a nearly 50% error reduction on our initial data, in part due to the differential performance between the two methods as a function of label frequency. Given these gains, and considering that our hybrid approach is unique among virtual patient systems, we compare the hybrid system to the rule-based system by itself in a randomized prospective study. We evaluate 110 unique medical student subjects interacting with the system over 5,296 conversation turns, to verify whether similar gains are observed in a deployed system. This prospective study broadly confirms the findings from the earlier one but also highlights important deficits in our training data. The hybrid approach still improves over either rule-based or ML approaches individually, even handling unseen classes with some success. However, we observe that live subjects ask more out-of-scope questions than expected. To better handle such questions, we investigate several modifications to the system combination component. These show significant overall accuracy improvements and modest F1 improvements on out-of-scope queries in an offline evaluation. We provide further analysis to characterize the difficulty of the out-of-scope problem that we have identified, as well as to suggest future improvements over the baseline we establish here.
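The system combination described above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature definitions (log probability and confidence of the top CNN class, entropy of the CNN distribution), the weights, the bias, and the label names are all hypothetical, chosen only to show how a logistic regression chooser might pick between a rule-based prediction and a CNN prediction for one dialogue turn.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def chooser_features(cnn_probs):
    """Assumed feature set: [log prob of top class, entropy, top-class probability]."""
    top = max(cnn_probs)
    return [math.log(top), entropy(cnn_probs), top]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def choose(rule_label, cnn_labels, cnn_probs, weights, bias):
    """Return the CNN's top label if the logistic model favors it, else the rule-based label."""
    feats = chooser_features(cnn_probs)
    z = bias + sum(w * f for w, f in zip(weights, feats))
    if sigmoid(z) >= 0.5:
        return cnn_labels[cnn_probs.index(max(cnn_probs))]
    return rule_label

# Hypothetical, hand-set parameters; a real chooser would learn these from data.
WEIGHTS = [0.0, -2.0, 4.0]
BIAS = -2.0
LABELS = ["greeting", "pain_location", "pain_onset"]  # hypothetical label names

# A peaked CNN distribution leads the chooser to trust the CNN.
confident = choose("pain_location", LABELS, [0.9, 0.05, 0.05], WEIGHTS, BIAS)

# A flat CNN distribution leads the chooser to fall back on the rule-based label.
uncertain = choose("pain_location", LABELS, [0.4, 0.35, 0.25], WEIGHTS, BIAS)
```

In this sketch the chooser is a binary decision per turn; high entropy in the CNN output pushes the decision toward the rule-based prediction, while a high top-class probability pushes it toward the CNN.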

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press
Figure 1. The user interface of the Virtual Patient. Note the dialogue input box at the top of the screen, where a user has typed “Hi Jim, how are you?” After the user presses the Enter key, the patient will respond with “Pretty good, except for this back pain.”

Figure 2. Frequency of labels in the core dataset by rank, with quintiles by color.

Figure 3. Overview of the stacked CNN architecture.

Table 1. Mean accuracy across 10 folds, with standard deviations.

Figure 4. Accuracy of the tested models by label quintiles.

Table 2. Features used by the chooser.

Table 3. Hybrid system results. ChatScript and stacked CNN baselines are repeated from above. “Conf” is using only the confidence feature from Table 2; “base” is using Log Prob, Entropy, and Confidence. CS = ChatScript, CNN = stacked CNN.

Table 4. Summary statistics of collected data.

Table 5. Raw accuracies of control and test conditions, and subcomponents of the hybrid system (CS = ChatScript, CNN = stacked CNN).

Table 6. Top changes in class frequency by magnitude of difference between core and enhanced datasets. Change is the count of examples of the class in the enhanced set minus the count in the core set.

Figure 5. Accuracy of the hybrid system and constituent components by label quintiles. “Core dataset quintiles” have the same label membership as in the previous experiment; “raw enhanced dataset quintiles” are the quintiles according to the frequencies observed in Experiment 2. “Fair enhanced dataset quintiles” remove No Answer labels and other unseen labels from consideration.

Table 7. Adjusted accuracies of control and test conditions, and subcomponents of the combined system.

Figure 6. A simple illustration of scope and domain boundaries, with specific examples.

Table 8. New features used by the three-way chooser.

Table 9. Development accuracy with additional features.

Table 10. Effects of retraining on test performance.

Table 11. Test results for three-way choosers with retraining.

Figure 7. t-SNE plot of Multilabel chooser training data. Units are included to enable discussion of features of the visualization, but note that they are meaningless in terms of the original data, given that this is a stochastic, lossy projection of the original space.

Figure 8. Quintile accuracies for multilabel components, with baselines.

Figure A1. Overview of client-server architecture.

Table A1. The implemented REST-like API.

Table C1. Selected turns from a single conversation.