
Verifying the robustness of automatic credibility assessment

Published online by Cambridge University Press:  14 November 2024

Piotr Przybyła*
Affiliation:
Universitat Pompeu Fabra, Barcelona, Spain Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Alexander Shvets
Affiliation:
Universitat Pompeu Fabra, Barcelona, Spain
Horacio Saggion
Affiliation:
Universitat Pompeu Fabra, Barcelona, Spain
* Corresponding author: Piotr Przybyła; Email: piotr.przybyla@upf.edu

Abstract

Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (likely based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having the incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we systematically test the robustness of common text classifiers against available attacking techniques and discover that, indeed, meaning-preserving changes in input text can mislead the models. The approaches we test focus on finding vulnerable spans in text and replacing individual characters or words, taking into account the similarity between the original and replacement content. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use cases of content moderation. The attacked tasks include (1) fact checking and detection of (2) hyperpartisan news, (3) propaganda, and (4) rumours. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions, with attacks on GEMMA being up to 27% more successful than those on BERT. Finally, we manually analyse a subset of adversarial examples and check what kinds of modifications are used in successful attacks.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. An overview of the evaluation of an adversarial attack using BODEGA. For each task, three datasets are available: development ($X_{\text{dev}}$), training ($X_{\text{train}}$), and attack ($X_{\text{attack}}$). During the evaluation of an attack involving Attacker and Victim models from the library of available models, the Attacker takes the text of the $i$th instance from the attack dataset ($x_i$), e.g. a news piece, and modifies it into an adversarial example ($x_i^*$). The Victim model is used to assess the credibility of both the original ($f(x_i)$) and modified text ($f(x_i^*)$). The BODEGA score assesses the quality of an AE, checking the similarity between the original and modified samples ($\text{sim}(x_i,x_i^*)$) as well as the change in the victim’s output ($\text{diff}(f(x_i),f(x_i^*))$).
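To make the evaluation flow in the caption concrete, the following is a minimal Python sketch of how such a per-example score could be combined from its components. The names used here (victim, semantic_sim, character_sim) are illustrative placeholders rather than the actual BODEGA API, and the exact score definition is the one given in the paper.

```python
# Illustrative sketch of the per-example evaluation flow from Figure 1.
# semantic_sim, character_sim and victim are placeholder callables,
# not the actual BODEGA API; the precise score definition is in the paper.

from typing import Callable

def confusion(y_orig: int, y_adv: int) -> float:
    """diff(f(x_i), f(x_i*)): 1.0 if the victim's decision changed, else 0.0."""
    return 1.0 if y_orig != y_adv else 0.0

def bodega_score(x_orig: str, x_adv: str,
                 victim: Callable[[str], int],
                 semantic_sim: Callable[[str, str], float],
                 character_sim: Callable[[str, str], float]) -> float:
    """Combine the change in the victim's output with input similarity."""
    y_orig, y_adv = victim(x_orig), victim(x_adv)
    # An adversarial example only scores highly if it stays close to the
    # original text AND changes the classifier's decision.
    return (confusion(y_orig, y_adv)
            * semantic_sim(x_orig, x_adv)
            * character_sim(x_orig, x_adv))
```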


Table 1. Four datasets used in BODEGA, with the task ID (see descriptions in text), the number of instances in the training, attack and development subsets, and the overall percentage of the positive (non-credible) class


Table 2. Examples of credible and non-credible content in each of the tasks: hyperpartisan news (HN), propaganda recognition (PR), fact checking (FC) and rumour detection (RD). See main text for references to data sources and labelling criteria


Table 3. Performance of the victim classifiers, expressed as F-score over the attack subset


Table 4. The results of adversarial attacks, averaged over all victim classifiers, in four misinformation detection tasks (untargeted). Evaluation measures include BODEGA score, confusion score, semantic score, character score and number of queries to the attacked model per example. The best score in each task is in boldface


Figure 2. Classification performance (F1 score) and vulnerability to targeted attacks (BODEGA score) of models according to their size (parameter count, logarithmic scale), for different tasks.


Figure 3. Results of the targeted attacks (y axis, BODEGA score) plotted against the number of queries necessary (x axis, logarithmic) for various attack methods (symbols) and tasks (colours).


Table 5. A comparison of the results—highest BODEGA score and corresponding number of queries—in the untargeted (U) and targeted (T) scenario for various tasks and victims. The better values (higher BODEGA scores and lower number of queries) are highlighted


Table 6. Number of AEs using different modifications among the best 20 instances (according to BODEGA score) in each task, using BiLSTM as victim and BERT-ATTACK as attacker


Table 7. Some examples of adversarial modifications that were successful (i.e. resulted in changed classifier decision), performed by BERT-ATTACK against BiLSTM, including identifier (mentions in text), task and type of modification. Changes are highlighted in boldface

Supplementary material: File

Przybyła et al. supplementary material (140.4 KB)