In his seminal 1989 article, Matti Rissanen identified three potential pitfalls for the then-nascent field of corpus linguistics: the ‘philologist’s dilemma’ (losing touch with the texts), the ‘God’s truth fallacy’ (uncritically trusting the representativeness of a corpus) and the ‘mystery of vanishing reliability’ (overstating the significance of infrequent phenomena). More than three decades later, Rissanen’s cautionary spirit is more relevant than ever.
Corpus linguistics has become an indispensable methodological framework across subfields of linguistics – from historical and sociolinguistic inquiry to language learning, computational applications and digital humanities. Yet, as corpora have grown in size, complexity and multimodality, the field has also faced renewed challenges: questions of representativeness, data quality, annotation reliability and replicability have re-emerged. Challenges in Corpus Linguistics: Rethinking Corpus Compilation and Analysis, edited by Mark Kaunisto and Marco Schilk, offers a timely and thought-provoking response to these issues. The volume stems from the pre-conference workshop ‘Corpus pitfalls: Dealing with messy data (and other traps for the unwary)’ held at the 42nd ICAME Conference (TU Dortmund, 2021), and it gathers eight chapters by leading scholars who collectively address what Kaunisto terms the ‘pitfalls and perplexities’ of corpus-based research.
In line with the tradition in the Studies in Corpus Linguistics (SCL) series, the volume situates methodological reflection at the core of empirical practice. It does not seek to introduce new corpora or analytical tools; instead, it interrogates the epistemological assumptions that underlie corpus compilation and analysis. Each contribution exemplifies the critical self-reflexivity that defines mature scientific fields, which leads to a coherent and accessible volume that appeals to corpus researchers navigating the ‘messiness’ of real-world data.
The volume opens with an introductory chapter by Mark Kaunisto, ‘From fallacies and pitfalls to solutions and future directions: Navigating the evolving terrain of corpus linguistics’ (pp. 1–8). Kaunisto frames the collection by arguing that Rissanen’s original problems have not been solved but have mutated. The ‘philologist’s dilemma’, for instance, is intensified in the age of big data; where once the challenge was to read all the texts in a million-word corpus, it is now a practical impossibility to manually inspect the vast datasets that underpin modern linguistic research. He argues that while the corpus community has moved far beyond the constraints of small datasets, the expansion to billion-word corpora has paradoxically intensified the danger of analytical distance – scholars now know less, not more, about the provenance and composition of their data. Kaunisto terms this the ‘fallacy of sophisticated technology’ – the mistaken belief that advanced tools guarantee analytical reliability. He also reminds corpus users that sleek interfaces and powerful statistical engines or tools like Sketch Engine or #LancsBox can lull researchers into a false sense of security, which can obscure the messy realities of the underlying data. Importantly, Kaunisto positions the ensuing chapters as a collective attempt to expose such hidden assumptions and to cultivate an ethos of methodological humility. This introduction sets the stage and argues that the need for critical engagement has not diminished but has, in fact, become more crucial for the field’s progress.
In chapter 2, ‘Engaging with bad (meta)data in historical corpus linguistics’ (pp. 9–34), Turo Vartiainen and Tanja Säily provide a series of case studies that serve as cautionary examples. They move beyond general warnings to pinpoint specific ways in which historical data can be misleading. Their example of the adjectival use of key and fun shows how taggers fail to capture historical usage, leading to skewed frequency counts that could support erroneous claims about grammatical innovation. Another example is the investigation of sentence-initial as well in the Corpus of Historical American English (COHA), where an initial query suggested an emerging feature of American English, but a meticulous examination of the metadata revealed that the majority of early hits came from miscategorized Canadian texts. This chapter provides solid arguments for those interested in historical linguistics, echoing the work of scholars like McEnery & Wilson (2001) on the importance of understanding corpus compilation, but with a specific focus on the unique pitfalls of historical datasets. It thereby advocates a level of ‘data hygiene’ that is often overlooked in the rush to analyze large-scale archives. The chapter also exemplifies what the editors promised in the Preface: a candid engagement with the ‘not-so-smooth’ aspects of corpus work. It will resonate with anyone who has wrestled with imperfect legacy datasets.
Mark Kaunisto’s ‘Named entities as potentially problematic items in corpora’ (chapter 3, pp. 35–54) tackles a ubiquitous yet often underestimated source of noise. He convincingly argues that words within named entities may not reflect active linguistic choices and can therefore skew frequency and collocational analyses. The examples provided are alarming. The analysis of lifespan in the British National Corpus (BNC), where 96 percent of its instances come from a single computer manual, is a striking case of skewed dispersion that could mislead a researcher. More subtly, his study of the Japanese borrowing samurai in the Global Web-based English corpus (GloWbE) reveals significant regional differences in the proportion of its use in named entities (e.g. the Samurai Sudoku puzzle), highlighting a potential confound for cross-varietal studies. The chapter’s capstone is its analysis of the phrase absolutely fabulous, whose strong collocational link in British English is shown to be overwhelmingly driven by references to the television show. This chapter is a powerful reminder that a corpus is not a direct window into a language system but a collection of texts. It is a call for a more critical approach to tokenization and for the development of more sophisticated, context-aware named-entity recognition in corpus annotation, a challenge that remains significant for computational linguistics.
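The kind of skewed dispersion described for lifespan can be checked with very little code. The following is a minimal sketch, not a procedure taken from the chapter: the helper name `top_source_share` and the source labels are hypothetical, and the hit list is invented so that one text supplies 96 percent of the instances.

```python
from collections import Counter


def top_source_share(hit_sources):
    """Return (source_id, share) for the text that contributes the most hits."""
    counts = Counter(hit_sources)
    source, n = counts.most_common(1)[0]
    return source, n / len(hit_sources)


# Hypothetical concordance hits, each labelled with the ID of its source text.
hits = ["manual_01"] * 48 + ["novel_07", "news_13"]

source, share = top_source_share(hits)
print(f"{source} supplies {share:.0%} of {len(hits)} hits")
```

A high share for a single source is a signal to read the hits in context before interpreting the overall frequency.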
Chapter 4, ‘Challenges in the compilation, annotation, and analysis of learner corpus data’ by Marcus Callies (pp. 55–67), shifts the focus to the specialized domain of Learner Corpus Research (LCR). Callies moves beyond typical challenges to address profound theoretical issues. He identifies three core problem areas. First, he discusses multilingual practices like code-switching, which, if not properly annotated, can be mistaken for production errors. Second, he addresses lexical bias introduced by elicitation tasks, a long-standing problem in LCR (see Granger et al. 2015). Finally, and most thought-provokingly, Callies critiques the ‘discourse of deficit’ that often underlies error annotation, where innovative interlanguage forms are reflexively marked as errors against a native-speaker norm. He argues for a more nuanced approach that recognizes the systematicity of the learner language, echoing calls from within Second Language Acquisition (e.g. Le Bruyn & Paquot 2021) to move beyond a purely target-language-oriented perspective. This chapter is a crucial contribution, demonstrating that annotation schemes are not neutral but carry theoretical weight, and that uncritical annotation can lead to a distorted view of the language learning process.
In chapter 5, Turo Hiltunen navigates the treacherous terrain of what he calls ‘opportunistic corpora’ – massive digitized archives not originally compiled for linguistic research – in ‘Early newspapers as data for corpus linguistics (and Digital Humanities): Issues in using the British Library Newspapers database as a corpus’ (pp. 68–88). Hiltunen expertly outlines the mismatch between the expectations of corpus linguists (who value balance and structured metadata) and the realities of many Digital Humanities resources. Using the British Library Newspapers database as a case study, he identifies several pitfalls: inadequate search tools, a lack of transparency about text distribution, and pervasive Optical Character Recognition (OCR) errors that can create ghost words or mis-render common ones. This creates a tension between the vast, rich data of such archives and the curated, cleaner data of traditional corpora. Rather than dismissing these resources, Hiltunen offers a pragmatic path forward, advocating for third-party tools like Octavo and for filtering data based on OCR confidence scores. This chapter provides an essential roadmap for researchers, serving as a practical guide for navigating the methodological trade-offs involved.
Stefan Hartmann’s ‘Open Corpus Linguistics – Or how to overcome common problems in dealing with corpus data by adopting open research practices’ (chapter 6, pp. 89–105) provides a forceful argument connecting Rissanen’s problems to the contemporary ‘replication crisis’. Hartmann contends that many studies, particularly those on widely used English corpora like the BNC or the Corpus of Contemporary American English (COCA), are not fully replicable because the data sit behind paywalls. He advocates for an Open Corpus Linguistics that embraces not only openly available data but also open methods, including the sharing of analysis scripts, a practice that supports computational reproducibility. The chapter thoughtfully addresses the significant practical challenges, most notably copyright restrictions, and discusses potential workarounds such as sharing sentence shuffles or using password-protected repositories. This is perhaps the most polemical chapter. Its call for greater transparency extends the arguments of scholars like Gries (2016) on quantitative rigor to the entire research lifecycle, reframing replicability not just as a methodological ideal but as an ethical imperative for a global and equitable scientific community.
In chapter 7, ‘Text length and short texts: An overview of the problem’ (pp. 106–25), Aatu Liimatta tackles a fundamental methodological issue that has become pressing with the rise of social media. He clearly distinguishes between the ‘problem of text length’ (the general confounding effect of length on raw frequencies) and the ‘problem of short texts’ (the mathematical inflation of normalized frequencies in very short texts, which can render comparisons meaningless). This issue is critical for any study of online discourse, where features in a single tweet can appear artificially prominent when normalized. The chapter provides an excellent overview of existing solutions, from simply excluding short texts (which risks losing data) to more sophisticated approaches like ‘lengthwise analysis’. Liimatta’s lucid explanation of this complex statistical problem is a major contribution, providing researchers working with computer-mediated communication with a valuable toolkit of potential solutions and a clear rationale for choosing among them.
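The arithmetic behind the ‘problem of short texts’ can be illustrated with a small worked example. This is a minimal sketch under stated assumptions rather than anything drawn from Liimatta’s chapter: the three texts, their lengths and their feature counts are invented, and `per_thousand` is a hypothetical helper.

```python
def per_thousand(count: int, n_tokens: int) -> float:
    """Frequency normalized to occurrences per 1,000 tokens."""
    return 1000 * count / n_tokens


# Hypothetical texts: (occurrences of a feature, text length in tokens).
texts = {
    "tweet": (1, 12),
    "blog_post": (4, 800),
    "news_article": (6, 1500),
}

# Per-text normalized rates: the single hit in the tweet dominates.
rates = {name: per_thousand(c, n) for name, (c, n) in texts.items()}

# Averaging per-text rates gives the tweet as much weight as the article.
mean_of_rates = sum(rates.values()) / len(rates)

# Pooling counts and lengths first weights each text by its size.
pooled_rate = per_thousand(
    sum(c for c, _ in texts.values()),
    sum(n for _, n in texts.values()),
)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per 1,000 tokens")
print(f"mean of per-text rates: {mean_of_rates:.1f}")
print(f"pooled rate: {pooled_rate:.1f}")
```

With these invented figures, one occurrence in a 12-token tweet yields roughly 83 per 1,000 tokens, against 5 and 4 for the longer texts, so the mean of the per-text rates ends up several times higher than the pooled rate.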
Daniel Ocic Ihrmark’s ‘Corpus genre categories: Issues at the intersection of linguistics and literature’ (chapter 8, pp. 126–41) explores a fascinating pitfall in corpus stylistics. Ihrmark points out that genre is conceptualized very differently by linguists, who often follow the text-external categories of Biber (1988), and literary scholars, for whom genre is an interpretive framework. This discrepancy can lead to flawed conclusions, such as when a literary scholar compares an author to the overly broad ‘fiction’ section of a reference corpus. Such a comparison would be akin to analyzing the style of a sonnet by comparing it to a dataset containing epic poems and grocery lists. Ihrmark argues that for corpus compilation, broader, objective categories are most practical, while the application of more granular, literary genre labels should be left to individual studies. This contribution enriches the volume by extending its scope beyond the empirical to the hermeneutic, echoing Hunston’s (2022) call to re-engage corpus linguistics with textual interpretation. Ihrmark’s plea for methodological pluralism – recognizing that genres are ‘moving targets’ rather than fixed taxonomic entities – will resonate with corpus stylisticians and literary linguists alike.
The volume concludes with chapter 9 (pp. 142–70), ‘Modeling fine-grained sociolinguistic variation: The promises and pitfalls of Twitter corpora and neural word embeddings’, by Filip Miletić, Anne Przewozny-Desriaux and Ludovic Tanguy. This chapter brings the book to the cutting edge, deploying a custom Twitter corpus and BERT-based embeddings to investigate semantic shifts in Quebec English. The authors offer a compelling demonstration of both the power and the peril of these new methods. The model successfully clustered similar uses of target words, greatly facilitating manual analysis. However, the authors detail the numerous ‘false positives’ produced, where clusters were caused not by semantic shifts but by local cultural references or French code-switching. Their solution – a coarse-grained manual annotation of the computationally generated clusters – illustrates the synergistic relationship between automated methods and expert linguistic analysis. It reinforces the book’s central theme: that even the most advanced techniques are best seen as capable assistants for, rather than replacements of, the trained linguist.
Challenges in Corpus Linguistics is a timely, practical and intellectually stimulating volume. The editors have assembled a coherent set of contributions that speak to a single central theme. Anchored by Kaunisto’s opening chapter, the volume keeps its focus on what might be called the epistemology of corpus practice: how corpus data are constructed, validated and interpreted. The volume succeeds in balancing breadth with unity: topics range from historical corpora and learner data to Twitter embeddings, yet all converge on methodological transparency and reflexivity. Each chapter, while methodologically distinct, returns to the central motif of knowing one’s data – a refrain that goes back to Rissanen’s (1989) seminal work.
What distinguishes this book is its problematizing stance: rather than presenting corpus linguistics as a success story of computational progress, it foregrounds its vulnerabilities – messy metadata, genre fluidity, replicability limits and ethical ambiguity. This aligns with broader epistemic shifts in linguistics and social research, where reflexivity and transparency are valued as much as innovation. The volume’s greatest achievement lies in reframing ‘pitfalls’ not as failures but as productive tensions that drive methodological innovation. By foregrounding messiness, uncertainty and ethical awareness, Kaunisto and Schilk remind us that corpus linguistics, far from being a purely technical enterprise, remains a profoundly interpretive science of language in use.
The volume’s scope has a few boundaries worth noting. As the collection stems from a pre-conference workshop at ICAME, an organization primarily devoted to English corpus linguistics, the perspectives are naturally focused on challenges within European and English-centric corpus traditions. This origin explains the volume’s great coherence, though readers who hope to find case studies from non-European or extensively multilingual corpora – which present their own unique sets of pitfalls – will find them outside this book’s specific remit. On a separate note, while the book is an essential read for any serious practitioner, some chapters assume considerable prior familiarity with corpus tools and statistical terminology. For novice readers, short methodological glossaries, like those found in Fundamental Principles of Corpus Linguistics (McEnery & Brezina 2022), could have enhanced accessibility.
In conclusion, Challenges in Corpus Linguistics: Rethinking Corpus Compilation and Analysis is a concise yet rich contribution that encourages corpus linguists to pause and reflect on their own practices. Its central message – that technical sophistication does not replace epistemological vigilance – is both timely and necessary. The collection’s interdisciplinary reach, spanning historical linguistics, learner corpus research, literary studies and computational modeling, ensures its relevance to a wide readership.