Misspellings in natural language processing: A survey of recent literature

Published online by Cambridge University Press:  25 March 2026

Gianluca Sperduti*
Affiliation:
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy
Alejandro Moreo
Affiliation:
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy
*Corresponding author: Gianluca Sperduti; Email: gianluca.sperduti@isti.cnr.it

Abstract

This survey provides an overview of the challenges of misspellings in natural language processing (NLP). Misspellings are ubiquitous in digital communication, and although humans can generally interpret misspelt text, NLP models frequently struggle to handle it, which degrades performance in common tasks such as text classification and machine translation. In this paper, we reconstruct the history of misspellings as a scientific problem. We then discuss the latest advancements in addressing the challenge of misspellings in NLP. The main mitigation strategies include data augmentation, double-step, character-order-agnostic, and tuple-based methods, among others. This survey also examines dedicated data challenges and competitions intended to spur progress in the field. Critical safety and ethical concerns are also examined, for example, the deliberate use of misspellings to inject malicious messages and hate speech on social networks. The survey also explores psycholinguistic perspectives on how humans process misspellings, which could inform innovative computational techniques for text normalisation and representation. Additionally, the survey explores the challenges that misspellings pose in multilingual contexts. Finally, the misspelling-related challenges and opportunities associated with modern large language models are analysed, including benchmarks, datasets, and the performance of the most prominent language models in the presence of misspellings. This survey provides a comprehensive review of recent research on misspellings and aims to serve as a valuable resource for researchers seeking to get up to speed on this problem within the rapidly evolving landscape of NLP.

Type
Survey Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Figure 1. Publication trends in NLP papers on misspellings (2004–2025).

Table 1. Reference guide for the methods discussed in Section 5, along with tasks and misspellings addressed, type of models, datasets, and metrics used in the evaluation

Figure 2. Conceptualisation of an order-agnostic representation for garbled words. Dotted lines denote garbled variants of the original word (shown at the top). Solid lines denote the order-agnostic representation of a surface-form word. If all characters except the first and last are represented as a set, then the representations of the original word and of its garbled variants coincide.
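
The idea in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the implementation of any method covered by the survey: the function names `order_agnostic_key` and `group_by_key` are hypothetical, and the choice of a `frozenset` for the inner characters simply follows the caption's wording ("represented as a set").

```python
from collections import defaultdict


def order_agnostic_key(word: str):
    """Hypothetical helper: keep the first and last characters in place
    and represent the inner characters as an unordered set."""
    if len(word) < 2:
        # Too short to have distinct first and last characters.
        return (word, frozenset(), word)
    return (word[0], frozenset(word[1:-1]), word[-1])


def group_by_key(words):
    """Group surface forms that share the same order-agnostic key."""
    groups = defaultdict(list)
    for w in words:
        groups[order_agnostic_key(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    # The original word and its garbled variants fall into one group.
    print(group_by_key(["garbled", "grabled", "gabrled", "word"]))
```

Under this representation, a word and any variant obtained by shuffling its inner characters map to the same key. The sketch also makes one cost visible: distinct valid words such as "form" and "from" collide, and a plain set additionally discards how often each inner character occurs. A multiset (for example, a sorted tuple of the inner characters) would be one possible alternative if repetition counts matter.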

Figure 3. Distribution of methods (left) across tasks (right). Flowchart created using SankeyMATIC https://sankeymatic.com (accessed 25/08/2025).

Table 2. Types of misspellings applied in each paper. This table includes only works that used synthetic misspellings and explicitly described the kinds of misspellings used.