Published online by Cambridge University Press: 05 June 2014
Introduction
According to Biber et al. (1998: 4), a corpus is ‘a large and principled collection of natural texts’ (my emphasis). This definition obviously does not apply to the huge collection of texts that the World Wide Web constitutes, and in narrower corpus-linguistic terms the web therefore cannot be considered a corpus. However, the data available on the web have been used increasingly in corpus-linguistic investigations. This chapter focuses on why this is the case, how it can be done, and the gains and limitations of using web-based data for linguistic research.
There are several reasons why linguists have turned to the World Wide Web as a source of data. For the study of some phenomena, even large corpora comprising 100 million words or more are still not large enough. This holds for most kinds of lexicographic research, but investigating some of the more ephemeral points in English grammar may also necessitate larger sources of data. In addition, the internet has given rise to new text types such as e-mail, chat-room discussions, text messaging, blogs, or interactive internet magazines – text types that are interesting objects of study in themselves (e.g. Herring and Paolillo 2006; Tagliamonte 2008). Another reason for the allure of the World Wide Web is that it takes a long time and considerable financial resources to compile standard reference corpora. Moreover, these representative corpora are quickly out of date when it comes to recent or ongoing change; Baker (2009) describes how the internet can be used to supplement existing standard corpora in this respect. Furthermore, apart from the International Corpus of English (ICE), corpus linguistics has largely focused on so-called inner-circle varieties of English, i.e. varieties of English as a first language; within the inner circle, the focus has been mostly on British (BrE) and American English (AmE). For even slightly more exotic varieties of English – like Bangladeshi or Pakistani English – we do not even have ICE components and are very unlikely to see them in the (near) future. The discussion in this chapter also applies in large part to the recently released Corpus of Global Web-Based English (GloWbE) (see corpus2.byu.edu/glowbe), a web-derived corpus of world Englishes.