
Detection avoidance techniques for large language models

Published online by Cambridge University Press:  10 March 2025

Sinclair Schneider*, Florian Steuber, João A.G. Schneider, and Gabi Dreo Rodosek

Affiliation: Bundeswehr University Munich, Munich, Bavaria, Germany

*Corresponding author: Sinclair Schneider; Email: Sinclair.Schneider@unibw.de

Abstract

The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in an experimental series: systematically varying the generative model's temperature proved shallow-learning detectors to be the least reliable (Experiment 1). Fine-tuning the generative model via reinforcement learning circumvented BERT-based detectors (Experiment 2). Finally, rephrasing led to a >90% evasion of zero-shot detectors such as DetectGPT, while the texts stayed highly similar to the originals (Experiment 3). A comparison with existing work highlights the superior performance of the presented methods. Possible implications for society and further research are discussed.
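
Since DetectGPT is the central zero-shot detector referenced here, a minimal sketch of its scoring idea may be helpful. This is a simplified illustration, not the authors' implementation: perturbation generation (mask filling with a T5 model in the original method) is omitted, and `gpt2` stands in for the scoring model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in scoring model
scorer = AutoModelForCausalLM.from_pretrained("gpt2")

def mean_log_likelihood(text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = scorer(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # negative cross-entropy = mean token log-likelihood

def detectgpt_score(text: str, perturbations: list[str]) -> float:
    # Machine-generated text tends to sit at a local likelihood maximum, so a
    # large likelihood drop for perturbed rewrites (positive score) indicates
    # machine-generated text.
    avg = sum(mean_log_likelihood(p) for p in perturbations) / len(perturbations)
    return mean_log_likelihood(text) - avg
```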

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Data pipeline used for modeling. Note: A comprehensive filtering policy was used (e.g., only tweets from verified users with below-average daily tweet counts; English language; no retweets or quoted tweets, etc.). The resulting dataset forms the basis for the classifier and the generator, respectively. Several models were tested (e.g., pre-trained GPT versions).
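
A hedged sketch of such a filtering policy is shown below; the DataFrame column names (`verified`, `daily_tweets`, `lang`, `is_retweet`, `is_quote`) are illustrative assumptions, not taken from the article.

```python
import pandas as pd

def filter_tweets(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the filtering policy sketched in Figure 1 (column names assumed)."""
    avg_daily = df["daily_tweets"].mean()
    keep = (
        df["verified"]                        # verified users only
        & (df["daily_tweets"] < avg_daily)    # below-average posting frequency
        & (df["lang"] == "en")                # English language
        & ~df["is_retweet"]                   # no retweets
        & ~df["is_quote"]                     # no quoted tweets
    )
    return df[keep]
```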

Table 1. Example tweets generated with different models

Figure 2. Human-based against machine-based word probability distributions. Note: Logarithmic scaling is applied to quantiles on both axes. Machine-based quantiles result from a GPT-2 (1.5B) model using random sampling with a sampling size of 10k. (a) Comparison of machine-based against human-based word probability distributions (red dashed line marks the theoretical perfect mapping). (b) Density distributions reflect the effect of temperature (red dashed line marks the empirical human density distribution).
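
To illustrate the temperature effect shown in panel (b), a minimal sketch of temperature scaling: logits are divided by $ \tau $ before the softmax, so $ \tau >1 $ flattens and $ \tau <1 $ sharpens the word probability distribution.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Convert logits to probabilities at temperature tau."""
    scaled = logits / tau
    scaled -= scaled.max()            # subtract max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

# temperature_softmax(np.array([2.0, 1.0, 0.1]), tau=2.0) yields a flatter
# distribution than tau=0.5 over the same logits.
```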

Figure 3. Detection rates by temperature for sampling sizes, methods, and generator models. Note: Comparison of ACC for varying temperatures $ \tau $ and (a) different sampling sizes (1k–100k tweets) using random sampling and (b) different sampling types, i.e., typical, top-$ k $ ($ k=100 $), nucleus ($ p=0.95 $), and pure random sampling, all with a 10k sampling size. Greedy search is not depicted here, since it always leads to $ ACC>99\% $. Both sampling sizes and strategies result from the same GPT-2 (1.5B) model, whereas panels (c) and (d) depict results for OPT and GPT model architectures, respectively, with various parameter sizes. Sampling sizes and model parameters are reflected by color shading (i.e., the darker, the bigger). For all panels, small ACC values indicate a better performance of the generating model. For colorization, see the online version of this article.
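
The compared decoding strategies map directly onto standard generation parameters; a hedged sketch using the Hugging Face `transformers` API (the prompt, `max_new_tokens`, the temperature value, and the `typical_p` value are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")               # GPT-2 (1.5B)
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
inputs = tok("Breaking news:", return_tensors="pt")

strategies = {
    "random":  dict(do_sample=True, top_k=0),                # pure random sampling
    "top_k":   dict(do_sample=True, top_k=100),
    "nucleus": dict(do_sample=True, top_k=0, top_p=0.95),
    "typical": dict(do_sample=True, top_k=0, typical_p=0.9), # assumed typical_p
    "greedy":  dict(do_sample=False),
}
for name, kwargs in strategies.items():
    if kwargs.get("do_sample"):
        kwargs["temperature"] = 0.8                          # vary tau here
    out = model.generate(**inputs, max_new_tokens=40, **kwargs)
    print(name, tok.decode(out[0], skip_special_tokens=True))
```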

Figure 4. Reinforcement learning reward calculation procedure.
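
One way to realize such a reward calculation, sketched under the assumption that the detector is a binary sequence classifier and that constraint violations receive a fixed negative reward (the exact composition is illustrative, not the authors' formula):

```python
import torch

def compute_reward(text, detector, tokenizer, violates_constraints) -> float:
    """Reward the generator for being classified as human; penalize violations."""
    if violates_constraints(text):         # hard constraints, cf. Figure 5
        return -1.0                        # assumed penalty value
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = detector(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()  # P(human); label 0 assumed
```

In a policy-gradient loop (e.g., PPO), the detector stays frozen while the generator is updated against this reward.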

Figure 5. Ground-truth distributions. Note: Constraints based on ground truth, visualized for (a) the maximal proportion of special characters, (b) the number of repetitions per tweet, (c) the proportion of emojis, (d) the number of emojis per tweet, and (e) the minimal proportion of tokens found in a standard dictionary. For all plots, the dashed line depicts the cutoff value. For all proportions (1st column), the x-axis is log-scaled for visualization purposes; for all frequencies (2nd column), the square root of the actual numbers is depicted on the y-axis.
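
The five constraints can be computed per tweet as simple surface statistics. In this hedged sketch the emoji and special-character patterns and the word list are stand-ins; the cutoffs would come from the empirical ground-truth quantiles shown above rather than from fixed values.

```python
import re
from collections import Counter

EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji ranges
SPECIAL = re.compile(r"[^A-Za-z0-9\s]")

def constraint_stats(tweet: str, dictionary: set) -> dict:
    tokens = tweet.split()
    counts = Counter(t.lower() for t in tokens)
    n_chars, n_tokens = max(len(tweet), 1), max(len(tokens), 1)
    return {
        "special_ratio": len(SPECIAL.findall(tweet)) / n_chars,  # (a)
        "repetitions": sum(c - 1 for c in counts.values()),      # (b)
        "emoji_ratio": len(EMOJI.findall(tweet)) / n_chars,      # (c)
        "emoji_count": len(EMOJI.findall(tweet)),                # (d)
        "dict_ratio": sum(t.lower() in dictionary                # (e)
                          for t in tokens) / n_tokens,
    }
```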

Table 2. Example of the reinforcement learning process

Figure 6. $ {F}_1 $-score comparison. Note: Results are depicted for (a) the Twitter dataset used in this study and (b) the CNN and Daily Mail dataset as a proxy for fake news. Dashed lines depict the mean $ {F}_1 $-scores across models and datasets before and after RL. The distance between the lines illustrates the large RL effect of $ d=-16.48 $, $ 95\%\mathrm{CI}\left[-24.14,-8.49\right] $, demonstrating transferability to different text domains.

Figure 7. Pipeline used for training set generation. Note: Questions from the Human ChatGPT Comparison Corpus (Guo et al., 2023) are answered by Qwen1.5-4B-Chat (Bai et al., 2023) and permuted by t5-3b (Raffel et al., 2020). The log loss (plausibility) is checked by the generative Qwen model itself, while the similarity is checked using the sentence transformer all-MiniLM-L6-v2 (Wang et al., 2020). Acceptability is checked by a DeBERTa (He et al., 2021) model trained on the CoLA dataset (Warstadt et al., 2019).
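
The similarity and acceptability gates of this pipeline can be sketched as follows; the CoLA checkpoint and both thresholds are illustrative assumptions (the caption's DeBERTa model is replaced here by a publicly available CoLA classifier).

```python
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cola = pipeline("text-classification",
                model="textattack/bert-base-uncased-CoLA")  # assumed checkpoint

def passes_gates(original: str, paraphrase: str, sim_min: float = 0.85) -> bool:
    """Keep a paraphrase only if it stays similar and remains grammatical."""
    sim = util.cos_sim(embedder.encode(original),
                       embedder.encode(paraphrase)).item()
    acceptable = cola(paraphrase)[0]["label"] == "LABEL_1"  # assumed label map
    return sim >= sim_min and acceptable
```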

Table 3. Example of training set generation

Table 4. Example of an original answer vs. paraphrases to the question: "Who was Michelangelo?"

Figure 8. Results of the proposed model vs. reference models. Note: Comparison of the results achieved with (a) the proposed model, (b) the Discourse Paraphraser DIPPER by Krishna et al. (2024), and (c) the paraphrasing model by Sadasivan et al. (2024). Sadasivan et al. report values for grammar or text quality (comparable to linguistic acceptability) and content preservation (matched to similarity), both manually labeled on a Likert scale of 1–5 (rescaled here for better comparability but marked by dashed-line representation). These values are only provided together with detection rates for permutations 1–5, i.e., it is unclear at which permutation their "best" result occurred.
