
OffensEval 2023: Offensive language identification in the age of Large Language Models

Published online by Cambridge University Press:  06 December 2023

Marcos Zampieri*
Affiliation:
George Mason University, Fairfax, VA, USA
Sara Rosenthal
Affiliation:
IBM Research, Yorktown Heights, NY, USA
Preslav Nakov
Affiliation:
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Alphaeus Dmonte
Affiliation:
George Mason University, Fairfax, VA, USA
Tharindu Ranasinghe
Affiliation:
Aston University, Birmingham, UK
*
Corresponding author: Marcos Zampieri; Email: mzampier@gmu.edu

Abstract

The OffensEval shared tasks organized as part of SemEval 2019–2020 were very popular, attracting over 1300 participating teams. The two editions of the shared task helped advance the state of the art in offensive language identification by providing the community with benchmark datasets in Arabic, Danish, English, Greek, and Turkish. The datasets were annotated using the OLID hierarchical taxonomy, which has since become the de facto standard in general offensive language identification research and has been widely used beyond OffensEval. We present a survey of OffensEval and related competitions, and we discuss the main lessons learned. We further evaluate the performance of Large Language Models (LLMs), which have recently revolutionized the field of Natural Language Processing. We use zero-shot prompting with six popular LLMs and zero-shot learning with two task-specific fine-tuned BERT models, and we compare the results against those of the top-performing teams at the OffensEval competitions. Our results show that while some LLMs, such as Flan-T5, achieve competitive performance, LLMs in general lag behind the best OffensEval systems.
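The comparisons above, like the OffensEval competitions themselves, are scored with macro-averaged F1. As a minimal sketch (not the organizers' official scoring code), macro-F1 over the binary OLID level-A labels (OFF vs. NOT) can be computed as follows; the example gold and predicted labels are invented for illustration:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 per label, then take the unweighted mean.

    This treats every label equally regardless of class frequency, which is
    why it is preferred for imbalanced tasks such as offensive language
    identification, where OFF is the minority class.
    """
    labels = sorted(set(gold) | set(pred))
    per_label_f1 = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label_f1.append(f1)
    return sum(per_label_f1) / len(per_label_f1)

# Hypothetical system output on five tweets labeled with OLID level A.
gold = ["OFF", "NOT", "NOT", "OFF", "NOT"]
pred = ["OFF", "NOT", "OFF", "OFF", "NOT"]
print(round(macro_f1(gold, pred), 3))  # prints 0.8
```

In practice, `sklearn.metrics.f1_score(gold, pred, average="macro")` gives the same result; the explicit version above makes the per-label averaging visible.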

Information

Type
Survey Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Several tweets from the original OLID dataset, with their labels for each level of the annotation model (Zampieri et al. 2019a)


Table 2. Distribution of label combinations in OLID (Zampieri et al. 2019b)


Table 3. F1-Macro for the top-10 teams for all three sub-tasks. The best baseline model (CNN) is also presented


Figure 1. Pie chart adapted from Zampieri et al. (2020) showing the models used in sub-task A. "N/A" indicates that the system did not have a description. Under machine learning, we included all approaches based on traditional classifiers such as SVMs and Naive Bayes. Under deep learning, we included approaches based on the neural architectures available at that time, except BERT.


Table 4. Data statistics for OffensEval 2020 sub-task A from Zampieri et al. (2020)


Table 5. Annotated examples for all sub-tasks and languages adapted from Zampieri et al. (2020)


Table 6. Results for the top-10 teams in English sub-task A ordered by macro-averaged F1


Table 7. Macro-F1 scores for the OffensEval 2019 test set. Baseline results are displayed in italics


Table 8. Macro-F1 scores for the OffensEval 2020 English test set. Baseline results are displayed in italics


Table 9. Macro-F1 scores for the OffensEval 2020 Arabic, Greek, and Turkish test sets. Baseline results are displayed in italics