A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding

Published online by Cambridge University Press:  22 April 2022

Viktor Schlegel*
Affiliation:
Department of Computer Science, University of Manchester, Manchester M13 9PL, UK
Goran Nenadic
Affiliation:
Department of Computer Science, University of Manchester, Manchester M13 9PL, UK
Riza Batista-Navarro
Affiliation:
Department of Computer Science, University of Manchester, Manchester M13 9PL, UK
*Corresponding author. E-mail: viktor.schlegel@manchester.ac.uk

Abstract

Recent years have seen a growing number of publications that analyse Natural Language Understanding (NLU) datasets for superficial cues, asking whether such cues undermine the complexity of the tasks underlying those datasets and how they affect the models that are optimised and evaluated on this data. This structured survey provides an overview of this evolving research area by categorising reported weaknesses in models and datasets, together with the methods proposed to reveal and alleviate those weaknesses, for the English language. We summarise and discuss the findings and conclude with a set of recommendations for possible future research directions. We hope that the survey will be a useful resource both for researchers who propose new datasets, to assess the suitability and quality of their data for evaluating various phenomena of interest, and for those who propose novel NLU approaches, to further understand the implications of their improvements with respect to their models’ acquired capabilities.

Information

Type
Survey Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Figure 1. Bar chart with RTE, MRC and other datasets that were investigated by at least three surveyed papers. Datasets investigated once or twice are summarised with ‘Multiple’. Full statistics can be observed in the Appendix.

Figure 2. Example for a dataset artefact where the requirement to synthesise information from 2 out of 10 accompanying passages can be circumvented by exploiting simple word co-occurrence between question and answer sentence.

Figure 3. Taxonomy of investigated methods. Labels (a), (b) and (c) correspond to the coarse grouping discussed in Section 5.

Figure 4. Number of methods per category split by task. As multiple papers report more than one method, the maximum (160) does not add up to the number of surveyed papers (121).

Table 1. Summary of data-investigating methods with the corresponding research questions as described in Section 5.1

Table 2. Proposed adversarial and challenge evaluation sets with their target phenomenon, grouped by task and, where appropriate, with the original resource name. The last column ‘OOD’ indicates whether the authors acknowledge and discount for the distribution shift between training and challenge set data (Y), do not do so (N), treat performance under the distribution shift as part of the research question (P), provide an informal argument (I), or whether this is not applicable (-)

Table 3. Categorisation of methods that have been proposed to overcome weaknesses in models and data. To indicate that a method was applied to improve performance on a challenge set, we specify the challenge set name as presented in Table 2

Figure 5. Datasets by publication year, grouped by whether spurious-correlation detection methods were not applied or applied; applied in a later publication; the dataset was created adversarially; or both.

Table A1. Google Scholar Queries for the extended dataset corpus

Table B1. Table of datasets where no quantitative methods that describe dataset weaknesses have been applied yet