Using web-based data for the study of global English

doi:10.1017/CBO9780511792519.011

8 - Using web-based data for the study of global English

Published online by Cambridge University Press: 05 June 2014

Marianne Hundt

Edited by Manfred Krug and Julia Schlüter

Marianne Hundt: English Department, University of Zurich, Switzerland
Manfred Krug: Otto-Friedrich-Universität Bamberg, Germany
Julia Schlüter: Otto-Friedrich-Universität Bamberg, Germany

Summary

Introduction

According to Biber et al. (1998: 4), a corpus is ‘a large and principled collection of natural texts’ (my emphasis). This definition obviously does not apply to the huge collection of texts that the World Wide Web constitutes, and in narrower corpus-linguistic terms, the web therefore cannot be considered a corpus. However, the data available on the web have been used increasingly in corpus-linguistic investigations. This chapter focuses on why this is the case, how it can be done, and what the gains and limitations of using web-based data for linguistic research are.

There are several reasons why linguists have turned to the World Wide Web as a source of data. For the study of some phenomena, even large corpora comprising 100 million words or more are still not large enough. This holds for most kinds of lexicographic research, but investigating some of the more ephemeral points in English grammar may also necessitate larger sources of data. In addition, the internet has given rise to new text types such as e-mail, chat-room discussions, text messaging, blogs, or interactive internet magazines – text types that are interesting objects of study in themselves (e.g. Herring and Paolillo 2006; Tagliamonte 2008). Another reason for the allure of the World Wide Web is that compiling standard reference corpora takes a long time and considerable financial resources. Moreover, these representative corpora are quickly out of date when it comes to recent or ongoing change; Baker (2009) describes how the internet can be used to supplement existing standard corpora in this respect. Furthermore, apart from the International Corpus of English (ICE), corpus linguistics has largely focused on so-called inner-circle varieties of English, i.e. varieties of English as a first language; within the inner circle, moreover, the focus has been mostly on British (BrE) and American English (AmE). For even slightly more exotic varieties of English – like Bangladeshi or Pakistani English – we do not even have ICE components and are very unlikely to see them in the (near) future. The discussion in this chapter also applies in large part to the recently released Corpus of Global Web-Based English (GloWbE) (see corpus2.byu.edu/glowbe), a web-derived corpus of world Englishes.

Information

Type: Chapter
Information: Research Methods in Language Variation and Change, pp. 158–178

DOI: https://doi.org/10.1017/CBO9780511792519.011

Publisher: Cambridge University Press

Print publication year: 2013
