On The London–Lund Corpus 2: design, challenges and innovations

Published online by Cambridge University Press: 08 September 2021

NELE PÕLDVERE
Affiliation:
Centre for Languages and Literature, Lund University, Box 201, 221 00 Lund, Sweden; nele.poldvere@englund.lu.se (Lund University and University of Oslo)
VICTORIA JOHANSSON
Affiliation:
Centre for Languages and Literature, Lund University, Box 201, 221 00 Lund, Sweden; victoria.johansson@ling.lu.se
CARITA PARADIS
Affiliation:
Centre for Languages and Literature, Lund University, Box 201, 221 00 Lund, Sweden; carita.paradis@englund.lu.se

Abstract

This article describes and critically examines the challenging task of compiling The London–Lund Corpus 2 (LLC–2) from start to end, accounting for the methodological decisions made in each stage and highlighting the innovations. LLC–2 is a half-a-million-word corpus of contemporary spoken British English with recordings from 2014 to 2019. Its size and design are the same as those of the world's first machine-readable spoken corpus, The London–Lund Corpus of Spoken English with data from the 1950s to 1980s. In this way, LLC–2 allows not only for synchronic investigations of contemporary speech but also for principled diachronic research of spoken language across time. Each stage of the compilation of LLC–2 posed its own challenges, ranging from the design of the corpus, the recruitment of the speakers, transcription, markup and annotation procedures, to the release of the corpus to the international research community. The decisions and solutions represent state-of-the-art practices of spoken corpus compilation with important innovations that enhance the value of LLC–2 for spoken corpus research, such as the availability of both the transcriptions and the corresponding time-aligned audio files in a standard compliant format.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s) 2021

Figure 1. The basic design of LLC–1 (adapted from Greenbaum & Svartvik 1990: 13)

Table 1. The complete design of LLC–2 (CMC = Computer-Mediated Communication)

Figure 2. The distribution of speakers across four age groups in LLC–2

Table 2. The comparison of the number of texts in the London–Lund Corpora