
CoAT: Corpus of artificial texts

Published online by Cambridge University Press:  06 September 2024

Tatiana Shamardina
Affiliation:
ABBYY, Milpitas, CA, USA
Marat Saidov
Affiliation:
HSE University, Moscow, Russia
Alena Fenogenova
Affiliation:
HSE University, Moscow, Russia
Aleksandr Tumanov
Affiliation:
HSE University, Moscow, Russia
Alina Zemlyakova
Affiliation:
HSE University, Moscow, Russia
Anna Lebedeva
Affiliation:
HSE University, Moscow, Russia
Ekaterina Gryaznova
Affiliation:
HSE University, Moscow, Russia
Tatiana Shavrina
Affiliation:
Institute of Linguistics, RAS, Moscow, Russia
Vladislav Mikhailov*
Affiliation:
University of Oslo, Oslo, Norway
Ekaterina Artemova
Affiliation:
Toloka AI
*
Corresponding author: Vladislav Mikhailov; Email: vladism@ifi.uio.no

Abstract

With recent advances in natural language generation, the risks associated with the rapid proliferation and misuse of generative language models for malicious purposes, such as generating fake news and fake scientific article reviews, steadily increase. Artificial text detection (ATD) has emerged as a field that develops resources and computational methods to mitigate these risks. This paper introduces the Corpus of Artificial Texts (CoAT), a large-scale corpus of human-written and machine-generated texts for the Russian language. CoAT spans six domains and comprises outputs from 13 text generation models (TGMs), which differ in the number of parameters, architectural choices, pre-training objectives, and downstream applications. We detail the data collection methodology, conduct a linguistic analysis of the corpus, and present a detailed analysis of ATD experiments with widely used artificial text detectors. The results demonstrate that the detectors perform well on seen TGMs but fail to generalise to unseen TGMs and domains. We also find that identifying the author of a given text is challenging, and that human annotators significantly underperform the detectors. We release CoAT, the codebase, two ATD leaderboards, and other materials used in the paper.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Artificial text detection datasets and benchmarks for non-English languages. Notations: H = human-written texts. M = machine-generated texts


Table 2. General statistics of CoAT by natural language generation task and model


Table 3. Number of texts in each domain and task-specific dataset in CoAT


Table 4. Lexical richness metrics by natural language generation task in CoAT. Notations: % = the average fraction of high-frequency tokens. H = human-written texts. M = machine-generated texts


Table 5. Manually picked examples of text generation errors from the train set. The examples are automatically translated into English for illustration purposes


Table 6. The average values of the stylometric features in the corpus subset. Machine refers to the machine-generated texts


Figure 1. Two-dimensional PCA projection of the corpus subset.
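A projection of this kind can be sketched in a few lines. The snippet below is a minimal, self-contained illustration (not the paper's code) of projecting a feature matrix onto its first two principal components via SVD; the toy data stands in for the stylometric feature vectors.

```python
import numpy as np

def pca_2d(X):
    """Project X (n_samples, n_features) onto its first two
    principal components using the SVD of the centred matrix."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                       # (n_samples, 2)

# toy stand-in for a stylometric feature matrix: 6 texts x 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Z = pca_2d(X)
print(Z.shape)  # (6, 2)
```

By construction, the first projected coordinate captures at least as much variance as the second, which is why such 2-D scatter plots are a common first look at class separability.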


Figure 2. The distribution of the FRE (left) and Cyrillic (right) features between the human-written and machine-generated texts in the corpus subset. The difference between the mean values is statistically significant according to the Mann-Whitney U test, with the p-values equal to 0.002 and 0.0, respectively.
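The significance test named in the caption can be reproduced in outline with SciPy. The sketch below uses made-up readability (FRE) scores purely for illustration; the actual values come from the corpus subset.

```python
from scipy.stats import mannwhitneyu

# hypothetical FRE readability scores for two small samples of texts
human_fre   = [55.2, 48.7, 60.1, 52.3, 58.9]
machine_fre = [70.4, 68.2, 73.1, 66.5, 71.8]

# two-sided Mann-Whitney U test for a difference in distributions
stat, p = mannwhitneyu(human_fre, machine_fre, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```

The Mann-Whitney U test is a natural choice here because it makes no normality assumption about the feature distributions.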


Figure 3. The web interface used for human evaluation on the artificial text detection task.


Table 7. Accuracy scores of the detectors on the artificial text detection task


Figure 4. Macro-F1 scores of the detectors on the artificial text detection task grouped by five quintiles of the text length.
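Macro-F1, the headline metric in these figures and tables, is simply the unweighted mean of per-class F1 scores, so both the human and the machine class count equally regardless of class balance. A minimal pure-Python sketch (toy labels, not the paper's evaluation code):

```python
def macro_f1(y_true, y_pred, labels=("human", "machine")):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec  = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["human", "human", "machine", "machine"]
y_pred = ["human", "machine", "machine", "machine"]
print(macro_f1(y_true, y_pred))
```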


Table 8. Macro-$F_1$ and accuracy scores by natural language generation task


Table 9. Averaged macro-$F_1$ scores by natural language generation task


Table 10. $F_1$ scores on the authorship attribution task by target TGM


Figure 5. Results of testing the detectors’ robustness towards the size of unseen GPT-based text generation models.


Figure 6. Results of testing the detectors’ robustness towards unseen text domains. Notations: Minek = strategic documents; Prozhito = digitalised diaries; RNC = Russian National Corpus.


Table B1. Human performance by model