
A resampling-based method to evaluate NLI models

Published online by Cambridge University Press:  09 June 2023

Felipe de Souza Salvatore*
Affiliation:
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
Marcelo Finger
Affiliation:
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
Roberto Hirata Jr.
Affiliation:
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
Alexandre G. Patriota
Affiliation:
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
Corresponding author: Felipe de Souza Salvatore; Email: felipessalvador@googlemail.com

Abstract

The recent progress of deep learning techniques has produced models capable of achieving high scores on traditional Natural Language Inference (NLI) datasets. To understand the generalization limits of these powerful models, an increasing number of adversarial evaluation schemes have appeared. These works follow a similar evaluation method: they construct a new NLI test set based on sentences with known logical and semantic properties (the adversarial set), train a model on a benchmark NLI dataset, and evaluate it on the new set. Poor performance on the adversarial set is identified as a model limitation. The problem with this evaluation procedure is that it may only indicate a sampling problem: a machine learning model can perform poorly on a new test set simply because the text patterns present in the adversarial set are not well represented in the training sample. To address this problem, we present a new evaluation method, the Invariance under Equivalence test (IE test). The IE test trains a model with sufficient adversarial examples and checks the model’s performance on two equivalent datasets. As a case study, we apply the IE test to state-of-the-art NLI models using synonym substitution as the form of adversarial examples. The experiment shows that, despite their high predictive power, these models usually produce different inference outputs for equivalent inputs and, more importantly, that this deficiency cannot be solved by adding adversarial observations to the training data.
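
The paired comparison behind the IE test can be written down in a standard form. The sketch below uses our own illustrative notation (not necessarily the paper’s): it assumes per-example 0/1 correctness scores $x_i$ and $y_i$ on the original and on the equivalent, transformed test set, and tests whether their expected difference is zero.

```latex
% Paired t statistic on per-example score differences (illustrative notation).
% x_i, y_i in {0,1}: correctness on the i-th original / transformed example.
\[
  d_i = x_i - y_i, \qquad
  \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
  t = \frac{\bar{d}}{s_d/\sqrt{n}},
\]
\[
  H_0\colon \mathbb{E}[d_i] = 0
  \quad \text{(the model is invariant under the equivalence transformation),}
\]
```

where $s_d$ is the sample standard deviation of the differences $d_i$. The figure captions below describe how this statistic is combined with resampling and repeated training runs.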

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. The bootstrap version of the paired t-test applied multiple times. For $m = 1, \ldots, M$, $g_m$ is a classifier trained on the transformed sample $(\mathcal{D}_{T}^{m}, \mathcal{D}_{V}^{m})$. The p-value $p_m$ is obtained by comparing the observed test statistic associated with $g_m$, $\hat{t}_m$, with the bootstrap distribution of $t$ under the null hypothesis.
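
The following is a minimal sketch of one such paired bootstrap test, not the authors’ implementation; the function name bootstrap_paired_t_test and the arrays orig and transf (0/1 correctness indicators for the same test examples before and after the equivalence-preserving transformation) are hypothetical.

```python
# Minimal sketch of a bootstrap paired t-test on per-example correctness scores.
# Assumption: `orig` and `transf` are 0/1 arrays for the same test examples,
# evaluated on the original and on the transformed (equivalent) version.
import numpy as np

def bootstrap_paired_t_test(orig, transf, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    d = np.asarray(orig, dtype=float) - np.asarray(transf, dtype=float)
    n = len(d)

    def t_stat(x):
        s = x.std(ddof=1)
        return 0.0 if s == 0 else x.mean() / (s / np.sqrt(len(x)))

    t_obs = t_stat(d)       # observed statistic, playing the role of \hat{t}_m
    d_null = d - d.mean()   # center the differences to impose the null hypothesis
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        t_boot[b] = t_stat(rng.choice(d_null, size=n, replace=True))

    # two-sided p-value: how often the bootstrap statistic is at least as extreme
    return float((np.abs(t_boot) >= abs(t_obs)).mean())
```

Repeating such a procedure for each of the $M$ trained classifiers yields the p-values $p_1, \ldots, p_M$ referred to in the caption.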

Figure 2. Toy example of a sentence transformation (not drawn from a real dataset). In this case, two synonyms are associated with the only noun appearing in the source sentence (dog). Since both synonyms have the same frequency in the corpus (zero), the synonym with the smaller edit distance (domestic dog) is selected.
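
A minimal sketch of this selection rule (highest corpus frequency first, edit distance as the tie-break), assuming WordNet accessed through NLTK as the synonym source and a hypothetical corpus_freq word-count mapping; the paper’s actual candidate generation and sense filtering may differ.

```python
# Sketch of the synonym-selection rule illustrated in Figure 2 (assumptions noted above).
import nltk
from nltk.corpus import wordnet as wn

def pick_synonym(noun, corpus_freq):
    # Candidate synonyms: all lemma names of the noun's WordNet synsets.
    candidates = {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(noun, pos=wn.NOUN)
        for lemma in synset.lemmas()
    }
    candidates.discard(noun)
    if not candidates:
        return noun
    # Prefer the most frequent candidate in the corpus; break ties by the
    # smallest edit distance to the original noun.
    return min(
        candidates,
        key=lambda w: (-corpus_freq.get(w, 0), nltk.edit_distance(w, noun)),
    )
```

Under this rule, when every candidate has corpus frequency zero the edit-distance tie-break decides, which is how “domestic dog” ends up selected in the toy example above.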

Table 1. Percentage of sound transformations for the transformation function based on the WordNet database. The values were estimated using a random sample of 400 sentence pairs from the training set.

Figure 3. Baseline results. The $x$-axis shows the different transformation probabilities used in training. The $y$-axis displays the minimum of the p-values obtained from five paired t-tests. We reject the null hypothesis if the minimum p-value is smaller than $1\%$.
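
A one-line sketch of this decision rule, assuming p_values holds the five p-values from the paired t-tests:

```python
# Reject the invariance hypothesis if any of the five paired t-tests is
# significant at the 1% level, i.e. the smallest p-value falls below 0.01.
reject_null = min(p_values) < 0.01
```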

Figure 4. SNLI results. The $x$-axis shows the different transformation probabilities used in training. The $y$-axis displays the accuracy. Each point represents the average accuracy over five runs, and the vertical lines display the associated standard deviation. The black and gray lines represent the values for the original and transformed test sets, respectively.

Figure 5. MNLI results. The $x$-axis shows the different transformation probabilities used in training. The $y$-axis displays the accuracy. Each point represents the average accuracy over five runs, and the vertical lines display the associated standard deviation. The black and gray lines represent the values for the original and transformed test sets, respectively.

Figure 6. Test statistics from the IE test for all models. The $x$-axis shows the different transformation probabilities used in training. The $y$-axis displays the value of the test statistic. Each point represents the average test statistic over five paired t-tests, and the vertical lines display the associated standard deviation. The baseline is a bag-of-words (BoW) model.

Table 2. Models ranked according to the SNR metric. In this case, the noise is the synonym substitution transformation.

Figure 7. Models’ accuracy on the original test set. The $x$-axis shows the different transformation probabilities used in training. The $y$-axis displays the accuracy. Each point represents the average accuracy over five runs, and the vertical lines display the associated standard deviation. The baseline is a bag-of-words (BoW) model.

Table 3. Sound transformations for SNLI.

Table 4. Unsound transformations for SNLI.

Table 5. Sound transformations for MNLI.

Table 6. Unsound transformations for MNLI.

Table 7. Best hyperparameter assignments for the Gradient Boosting classifier.

Table 8. Best hyperparameter assignments for ALBERT.

Table 9. Best hyperparameter assignments for BERT.

Table 10. Best hyperparameter assignments for XLNet.

Table 11. Best hyperparameter assignments for RoBERTa$_{BASE}$.

Table 12. Best hyperparameter assignments for RoBERTa$_{LARGE}$.

Figure 8. The twenty most frequent transformations in each dataset. The $x$-axis displays the word substitution frequency; the $y$-axis lists each substitution.

Table 13. Average text input length for the different partitions of the SNLI training data. The NLI labels define the partitions.

Table 14. Average text input length for the different partitions of the MNLI training data. The NLI labels define the partitions.

Table 15. Frequency changes for the terms selected from Gururangan et al. (2018). We only show words whose frequency was affected by the synonym substitution function. By $X\% \rightarrow Y\%$ we denote the change in a word’s frequency within a partition of the dataset (the NLI label defines the partition): $X\%$ is the original frequency, while $Y\%$ is the word’s frequency in the transformed dataset.

Table 16. Percentage of sound transformations for the transformation function based on the WordNet database. The values were estimated using a random sample of 400 sentence pairs from the test set.

Table 17. Examples of hard cases for SNLI. Here, a “hard case” is an observation that all deep learning models predict correctly in its original form but misclassify after the synonym substitution transformation.

Table 18. Examples of hard cases for MNLI. Here, a “hard case” is an observation that all deep learning models predict correctly in its original form but misclassify after the synonym substitution transformation.