
NLP verification: towards a general methodology for certifying robustness

Published online by Cambridge University Press:  02 April 2025

Marco Casadio*
Affiliation:
Heriot-Watt University, Edinburgh, UK
Tanvi Dinkar
Affiliation:
Heriot-Watt University, Edinburgh, UK
Ekaterina Komendantskaya
Affiliation:
Heriot-Watt University, Edinburgh, UK; University of Southampton, Southampton, UK
Luca Arnaboldi
Affiliation:
University of Birmingham, Birmingham, UK
Matthew L. Daggitt
Affiliation:
University of Western Australia, Perth, Australia
Omri Isac
Affiliation:
The Hebrew University of Jerusalem, Jerusalem, Israel
Guy Katz
Affiliation:
The Hebrew University of Jerusalem, Jerusalem, Israel
Verena Rieser
Affiliation:
Heriot-Watt University, Edinburgh, UK; Google DeepMind, London, UK
Oliver Lemon
Affiliation:
Heriot-Watt University, Edinburgh, UK
Corresponding author: Marco Casadio; Email: mc248@hw.ac.uk

Abstract

Machine learning has achieved substantial success in natural language processing (NLP). For example, large language models have empirically proven capable of producing text of high complexity and cohesion. At the same time, however, they are prone to inaccuracies and hallucinations. As these systems are increasingly integrated into real-world applications, ensuring their safety and reliability becomes a primary concern. In safety-critical contexts, such models must be robust to variability or attack, and must give guarantees over their output. Computer vision pioneered the use of formal verification of neural networks for such scenarios and developed common verification standards and pipelines, leveraging precise formal reasoning about geometric properties of data manifolds. In contrast, NLP verification methods have only recently appeared in the literature. While presenting sophisticated algorithms in their own right, these papers have not yet crystallised into a common methodology; they are often light on the pragmatic issues of NLP verification, and the area remains fragmented. In this paper, we attempt to distil and evaluate the general components of an NLP verification pipeline that emerge from the progress in the field to date. Our contributions are twofold. First, we propose a general methodology for analysing the effect of the embedding gap – the discrepancy between the verification of geometric subspaces and the semantic meaning of the sentences those subspaces are supposed to represent. We propose a number of practical NLP methods that help to quantify the effects of the embedding gap. Second, we give a general method for training and verifying neural networks that leverages a more precise geometric estimation of the semantic similarity of sentences in the embedding space and helps to overcome the effects of the embedding gap in practice.

Information

Type
Papers
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. An example of verifiable but not generalisable $\epsilon$-balls (a), a convex hull around selected embedded points (b), a hyper-rectangle around the same points (c) and a rotation of that hyper-rectangle (d), in 2 dimensions. The red dots represent sentences in the embedding space from the training set belonging to one class, while the turquoise dots are embedded sentences from the test set belonging to the same class.
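The rotated hyper-rectangles of panels (c) and (d) can be sketched in a few lines of NumPy: the box is aligned with the eigenvectors of the point cloud's covariance, which typically yields a much tighter fit than an axis-aligned box. This is an illustrative reconstruction under hypothetical names (`rotated_hyper_rectangle`, `contains`), not the implementation used in the paper's pipeline.

```python
import numpy as np

def rotated_hyper_rectangle(points):
    """Fit an eigenspace-rotated hyper-rectangle around embedded points.

    Returns the mean, the rotation (eigenvectors of the covariance matrix)
    and the per-dimension lower/upper bounds in the rotated basis.
    """
    mean = points.mean(axis=0)
    centred = points - mean
    # Eigenvectors of the (symmetric) covariance matrix give the rotation
    # that aligns the box with the principal axes of the point cloud.
    _, rotation = np.linalg.eigh(np.cov(centred, rowvar=False))
    projected = centred @ rotation
    return mean, rotation, projected.min(axis=0), projected.max(axis=0)

def contains(x, mean, rotation, lower, upper):
    """Membership test: rotate the query point, then check the bounds."""
    z = (x - mean) @ rotation
    return bool(np.all(z >= lower) and np.all(z <= upper))
```

By construction, every point used to fit the box satisfies the membership test, while points far from the cloud fall outside it.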


Table 1. Summary of the main features of the datasets used in NLP verification


Table 2. Summary of the main features of the existing NLP verification approaches. State-of-the-art methods are shown in bold


Table 3. Summary of the main features of the existing randomised smoothing approaches


Figure 2. Visualisation of the NLP verification pipeline followed in our approach.


Table 4. Character-level perturbations: their types and examples of how each type acts on a given sentence from the R-U-A-robot dataset [129]. Perturbations are applied to randomly selected words with 3 or more characters; the first and last characters of a word are never perturbed


Table 5. Word-level perturbations: their types and examples of how each type acts on a given sentence from the R-U-A-robot dataset [129].


Table 6. Construction complexity for different geometric shapes, where $m$ is the number of dimensions and $p$ is the number of points or generators


Figure 3. An example of a hyper-rectangle drawn around all points of the same class (a), the shrunk hyper-rectangle $\mathbb {H}_{sh}$ that is obtained by excluding all points from the opposite class (b) and clustered hyper-rectangles (c) in 2 dimensions. The red dots represent sentences in the embedding space of one class, while the blue dots are embedded sentences that do not belong to that class.
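The shrinking step of panel (b) can be illustrated with a simple greedy heuristic: start from the axis-aligned box around the positive class, then, for each opposite-class point inside it, move the single face whose shift sacrifices the least side length. This is a minimal sketch under a hypothetical name (`shrunk_hyper_rectangle`); the paper's exact shrinking algorithm may differ, and the greedy choice is not guaranteed to be volume-optimal.

```python
import numpy as np

def shrunk_hyper_rectangle(pos, neg):
    """Axis-aligned box around `pos`, greedily shrunk to exclude `neg`."""
    lower = pos.min(axis=0).copy()
    upper = pos.max(axis=0).copy()
    m = pos.shape[1]
    for q in neg:
        if np.any(q <= lower) or np.any(q >= upper):
            continue  # already outside (or on the boundary of) the box
        # Cost of raising each lower face to q, or lowering each upper face.
        costs = np.concatenate([q - lower, upper - q])
        d = int(np.argmin(costs))
        if d < m:
            lower[d] = q[d]          # raise the cheapest lower face
        else:
            upper[d - m] = q[d - m]  # lower the cheapest upper face
    return lower, upper
```

Note that shrinking may also exclude some positive points, which is the verifiability/generalisability trade-off the figure illustrates.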


Table 7. Mean and standard deviation of the accuracy of the baseline DNN on the RUAR and the medical datasets. All experiments are replicated five times


Table 8. Sets of geometric subspaces used in the experiments, their cardinality and average volumes of hyper-rectangles. All shapes are eigenspace rotated for better precision


Table 9. Verifiability of the baseline DNN on the RUAR and the medical datasets, for a selection of geometric subspaces; using the ERAN verifier


Table 10. Generalisability of the selected geometric subspaces $\mathbb {H}_{\epsilon =0.005}$, $\mathbb {H}_{\epsilon =0.05}$ and $\mathbb {H}_{sh}$, measured on the sets of semantic perturbations $\mathcal {A}^{{b}}_{{t}_{RUAR}}(\mathcal {Y}^{pos})$ and $\mathcal {A}^{{b}}_{{t}_{medical}}(\mathcal {Y}^{pos})$


Table 11. Sets of semantic subspaces used in the experiments, their cardinality and average volumes of hyper-rectangles. All shapes are eigenspace rotated for better precision


Table 12. Verifiability of the baseline DNN on the RUAR and the medical datasets, for a selection of semantic subspaces; using the ERAN verifier


Table 13. Verifiability of the baseline DNN on the RUAR and the medical datasets, for a selection of semantic subspaces; using the Marabou verifier


Table 14. Generalisability of the selected geometric subspaces $\mathbb {H}_{\epsilon =0.005}$ and $\mathbb {H}_{\epsilon =0.05}$ and the semantic subspaces $\mathbb {H}_{word}$ and $\mathbb {H}_{pj}$, measured on the sets of semantic perturbations $\mathcal {A}^{{b}}_{{t}_{RUAR}}(\mathcal {Y}^{pos})$ and $\mathcal {A}^{{b}}_{{t}_{medical}}(\mathcal {Y}^{pos})$. Note that the generalisability of $\mathbb {H}_{sh}$ (Table 10), despite its volume being 19 orders of magnitude larger, is only $3\%$ greater than that of $\mathbb {H}_{word}$


Table 15. Total volume and percentage of the training embedding space covered by our best hyper-rectangles. The total volume of the training embedding space for RUAR is $6.14\times 10^{-5}$, and for medical is $1.43\times 10^{-5}$


Table 16. Accuracy of the robustly trained DNNs on the RUAR and the medical datasets. $\mathcal {Y}$ stands for either RUAR or medical depending on the column


Table 17. Accuracy of the DNNs trained adversarially on the RUAR and the medical datasets


Table 18. Verifiability of the robustly trained DNNs on the RUAR and the medical datasets, for a selection of semantic subspaces; using the ERAN verifier


Table 19. Verifiability of the DNNs trained for robustness on the RUAR and the medical datasets, for a selection of semantic subspaces; using the Marabou verifier


Table 20. Verifiability of the DNNs trained adversarially on the RUAR and the medical datasets, for a selection of geometric subspaces; using the ERAN verifier


Table 21. Comparison of different subspace shapes for verification and verifiers for $N_{pj-adv}$ and $\mathbb {H}_{pj}$ on the medical dataset


Figure 4. Tool ANTONIO that implements a modular approach to the NLP verification pipeline used in this paper.


Figure 5. Zero-shot prompts with 2 basic examples from the R-U-A-robot dataset. Answers from vicuna-13b are given in italics. A1 and A2 represent different answers to the same prompt, illustrating a lack of consistency in the output.


Table 22. Performance of the models on the test/perturbation set. The average standard deviation is $0.0049$


Table 23. Section 5 NLP verification pipeline setup, implemented using ANTONIO. Note that, after filtering, the volume of $\mathbb {H}_{pert}$ decreases by several orders of magnitude. Note also the gap in volumes of the subspaces generated by s-bert and s-gpt embeddings


Table 24. Verifiability, generalisability and embedding error of the baseline and the robustly (adversarially) trained DNNs on the RUAR and the medical datasets, for $\mathbb {H}_{pert^\diamondsuit }$ ($N_{base}$ and $N_{pert^\diamondsuit }$) and $\mathbb {H}_{pert}$ ($N_{pert}$); using the Marabou verifier


Table 25. Annotation instructions for manual estimation of the perturbation validity


Table 26. Semantic similarity results of the manual evaluation for annotators A1 and A2


Table 27. Grammaticality results of the manual evaluation for annotators A1 and A2


Table 28. Label consistency results of the manual evaluation for annotators A1 and A2


Table 29. Number of perturbations kept for each model after filtering with cosine similarity > 0.6, used as an indicator of the similarity of perturbed sentences to the original sentences
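The cosine-similarity filter behind Table 29 is straightforward to sketch: embed each original sentence and its perturbation, and keep only pairs whose cosine similarity exceeds the threshold. The function name `filter_by_cosine` and the row-paired array layout are our own assumptions for illustration; the 0.6 cut-off is the one reported in the table.

```python
import numpy as np

def filter_by_cosine(original, perturbed, threshold=0.6):
    """Keep perturbations whose embedding stays close to the original.

    `original` and `perturbed` are (n, m) arrays of sentence embeddings,
    with row i of `perturbed` a perturbation of row i of `original`.
    Returns a boolean keep-mask and the cosine similarities.
    """
    dots = np.sum(original * perturbed, axis=1)
    norms = np.linalg.norm(original, axis=1) * np.linalg.norm(perturbed, axis=1)
    cosine = dots / norms
    return cosine > threshold, cosine
```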


Table 30. Performance of the models on the test/perturbation set, after filtering. The average standard deviation is $0.0049$


Figure 6. Analysis of some common issues found in the vicuna-13b generated perturbations.


Table 31. ROUGE-N scores comparing the original samples with vicuna perturbations (of the positive class) for lexical overlap


Table 32. ROUGE-N scores comparing the original samples with vicuna perturbations (of the positive class) for syntax overlap
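The ROUGE-N recall used in Tables 31 and 32 reduces to counting shared n-grams. The token-level sketch below (with a hypothetical helper name `rouge_n`) conveys the idea; library implementations such as the `rouge-score` package additionally handle stemming and report precision and F1 alongside recall.

```python
from collections import Counter

def rouge_n(reference, candidate, n=2):
    """ROUGE-N recall: the fraction of reference n-grams that also
    appear in the candidate, with multiplicity."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    ref = ngrams(reference.lower().split())
    cand = ngrams(candidate.lower().split())
    if not ref:
        return 0.0
    # Counter intersection takes the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    return overlap / sum(ref.values())
```

For example, an identical perturbation scores 1.0, while a perturbation sharing no bigrams with the original scores 0.0.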


Figure 7. In this figure, we show how a prepended, semantically informed verified filter added to an NLP system (here, an LLM) can check that safety-critical queries are handled responsibly, e.g. by redirecting the query to a tightly controlled rule-based system instead of a stochastic LLM.