2 - Planning the construction of a corpus
Published online by Cambridge University Press: 03 December 2009
Summary
Before the texts to be included in a corpus are collected, annotated, and analyzed, it is important to plan the construction of the corpus carefully: what size it will be, what types of texts will be included in it, and what population will be sampled to supply the texts that will comprise the corpus. Ultimately, decisions concerning the composition of a corpus will be determined by the planned uses of the corpus. If, for instance, the corpus is to be used primarily for grammatical analysis (e.g. the analysis of relative clauses or the structure of noun phrases), it can consist simply of text excerpts rather than complete texts. On the other hand, if the corpus is intended to permit the study of discourse features, then it will have to contain complete texts.
Deciding how lengthy text samples within a corpus should be is but one of the many methodological considerations that must be addressed before one begins collecting data for inclusion in a corpus. To explore the process of planning a corpus, this chapter will consider the methodological assumptions that guided the compilation of the British National Corpus. Examining the British National Corpus reveals how current corpus planners have overcome the methodological deficiencies of earlier corpora, and raises more general methodological considerations that anyone planning to create a corpus must address.
The British National Corpus
At approximately 100 million words in length, the British National Corpus (BNC) (see table 2.1) is one of the largest corpora ever created.
- Type: Chapter
- Information: English Corpus Linguistics: An Introduction, pp. 30-54
- Publisher: Cambridge University Press
- Print publication year: 2002