This chapter describes the process of creating and annotating a corpus. This process involves, for instance, collecting data (speech and writing), transcribing recorded speech, and adding annotation: markup indicating, for instance, where one speaker's turn in a conversation overlaps another's. While written texts are relatively easy to collect – most writing is readily available in digital formats – speech, especially spontaneous conversation, has to be transcribed, though voice recognition software has made progress in automating the transcription of certain kinds of speech, such as monologues. Other stages of building a corpus are also discussed, ranging from the administrative (keeping records of texts collected) to the practical (transcribing recordings of speech). The chapter concludes with a description of the various kinds of textual markup and linguistic annotation that can be added to texts. Topics discussed include how to create a “header” for a particular text. Headers contain various kinds of information: for a written text, the header would include, for instance, the title of the text, the author(s), and, if the text was published, where it was published. Other textual markup is internal to the text; in a spoken text it would include such information as speaker IDs and the beginnings and ends of overlapping speech.
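As a rough illustration of the header and text-internal markup described above, the following minimal sketch models a header for a written text and a speaker turn with an overlapping stretch. The class names, field names, and the `<overlap>` tag are hypothetical choices made for this example; they do not come from the book or from any particular annotation standard.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextHeader:
    """Hypothetical header for one written text in a corpus."""
    title: str
    authors: List[str]
    published_in: Optional[str] = None  # None if the text is unpublished


@dataclass
class Utterance:
    """One speaker turn in a transcribed conversation."""
    speaker_id: str
    words: List[str]
    # (start, end) word indices, end exclusive, of the stretch that
    # overlaps another speaker's turn; None if there is no overlap.
    overlap: Optional[Tuple[int, int]] = None


def render(utt: Utterance) -> str:
    """Render a turn as text, wrapping any overlapping stretch in tags."""
    words = list(utt.words)
    if utt.overlap is not None:
        start, end = utt.overlap
        # Insert the closing tag first so the start index stays valid.
        words.insert(end, "</overlap>")
        words.insert(start, "<overlap>")
    return f"<{utt.speaker_id}> " + " ".join(words)


header = TextHeader(title="Sample essay", authors=["A. Writer"])
turn = Utterance("A", ["so", "we", "went", "home"], overlap=(2, 4))
print(render(turn))
```

A real corpus would more likely store this information in an established markup scheme; the sketch only shows which pieces of information a header and overlap markup need to carry.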
This chapter focuses on the empirical basis of corpus linguistics, describing how linguistic corpora have played an important role in providing corpus linguists with linguistic evidence to support particular analyses of language. It opens with a discussion of how to define a corpus, and then traces the history of corpus linguistics, noting that as early as the fifteenth century, concordances were created based on the Bible. Later developments included the creation of the Quirk Corpus (print samples of spoken and written English) in 1955 at the Survey of English Usage in University College London, followed (in the 1960s) by the Brown Corpus (edited written American English). There are now online corpora, such as the Corpus of Contemporary American English. Tools for creating and analyzing corpora have also improved considerably: tagging corpora with part-of-speech information can be done with high levels of accuracy. The chapter closes with a description of the many different areas (e.g. lexicology and sociolinguistics) that have benefited from the use of linguistic corpora as well as a sample linguistic analysis demonstrating that corpus-based methodology and the theory of construction grammar can provide evidence that appositives in English are a type of construction.
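A concordance of the kind mentioned above lists each occurrence of a word together with its surrounding context. The sketch below is one simple way to build such a keyword-in-context listing; the function name `kwic`, the `width` parameter, and the tokenization by a regular expression are assumptions made for illustration, not the method of any specific concordancing program.

```python
import re
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match.

    Tokens are lowercased runs of word characters, so punctuation is
    dropped; `width` is the number of context words kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


sample = "In the beginning was the Word, and the Word was with God."
for left, key, right in kwic(sample, "word", width=2):
    print(f"{left:>10} [{key}] {right}")
```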
This chapter describes both the process of creating a corpus and the methodological considerations that guide this process. It opens with a detailed discussion of the planning that went into the building of four different types of corpora: the British National Corpus (BNC), the Corpus of Contemporary American English (COCA), the Corpus of Early English Correspondence (CEEC), and the International Corpus of Learner English (ICLE). The structure of each of these corpora is also discussed: their length, the genres that they contain (e.g. prose fiction, press reportage, blogs, spontaneous conversations, scripted speech), and other pertinent information. Subsequent sections discuss other topics relevant to building a corpus, such as defining exactly what a corpus is (can the web be considered a corpus?); determining the appropriate size of a corpus and the length of the particular texts that the corpus will contain (complete texts versus shorter samples from each text, e.g. 2,000 words); selecting the particular genres to be included in a corpus (e.g. press reportage, technical writing, spontaneous conversations, scripted speech); and ensuring that the writers or speakers whose speech or writing is included are balanced for such variables as gender, ethnicity, and age.
This chapter describes the process of analyzing a corpus. It discusses the general notions of quantitative and qualitative research methodologies within the context of various types of corpus analyses, including Biber’s work on multi-dimensional analysis as well as a more qualitative analysis of Donald Trump’s style of speaking, termed Trump Speak. These analyses are used to illustrate the various stages of doing research in general, and corpus-based research specifically. For instance, the analysis of Trump Speak is linked to a discussion of how to frame a research question; how to select relevant corpora that can be analyzed to pursue the research question; how to “extract” grammatical and lexical information from a corpus, using such tools as concordancing programs; and, finally, how to integrate relevant linguistic theories into the analysis of corpus results, so that the research does more than simply list, for instance, frequency counts. There is also a discussion of “mixed methods”: how to combine quantitative and qualitative research methods in a corpus analysis. This is followed by an example corpus analysis employing statistics such as the chi-square statistic and, additionally, by a description of multi-dimensional analysis in the work of Douglas Biber.
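To make the chi-square statistic mentioned above concrete, the following minimal sketch computes the Pearson chi-square value for a table of observed frequencies, with expected frequencies derived from the row and column totals. The function name `chi_square` and the counts in the example are invented for illustration and are not taken from the book's analysis.

```python
from typing import List


def chi_square(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for a table of observed counts.

    Each expected count is (row total * column total) / grand total.
    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts: occurrences of a construction versus all other
# clauses in two hypothetical sub-corpora (rows).
observed = [[10, 20],
            [20, 10]]
print(round(chi_square(observed), 3))
```

The resulting value would still need to be compared against the chi-square distribution for the table's degrees of freedom, (rows - 1) × (columns - 1), to judge significance; that step is omitted here.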
Corpus linguistics is a research method which draws on authentic language examples, collected and organized into 'corpora', or searchable 'bodies' of data. The method was established in the 1960s, and has rapidly developed since then. Now in its second edition, this book provides a step-by-step guide on how to create and analyze linguistic corpora. It has been extensively updated to reflect the most recent developments in this ever-evolving field, and now covers the empirical foundation of corpus-based research, new methodological considerations that guide the creation of a corpus, new kinds of research that can be conducted on corpora, and the most up-to-date information on how qualitative and quantitative analyses of corpora are conducted. Theoretical approaches are introduced in an accessible, easy-to-read way, and the book is illustrated with a wide range of different linguistic corpora, making it essential reading for researchers and students in a number of subfields of linguistics.