3 - Collecting and computerizing data
Published online by Cambridge University Press: 03 December 2009
Summary
Once the basic outlines of a corpus are determined, it is time to begin the actual creation of the corpus. This is a three-part process, involving the collection, computerization, and annotation of data. This chapter will focus on the first two parts of this process – how to collect and computerize data. The next chapter will focus in detail on the last part of the process: the annotation of a corpus once it has been encoded into computer-readable form.
Collecting data involves recording speech, gathering written texts, obtaining permission from speakers and writers to use their texts, and keeping careful records about the texts collected and the individuals from whom they were obtained. How these collected data are computerized depends upon whether the data are spoken or written. Recordings of speech need to be manually transcribed using either a special cassette tape recorder that can automatically replay segments of a recording, or software that can do the equivalent with a sample of speech that has been converted into digital form. Written texts that are not available in electronic form can be computerized with an optical scanner and accompanying OCR (optical character recognition) software, or (less desirably) they can be retyped manually.
Even though the process of collecting, computerizing, and annotating texts will be discussed as separate stages in this and the next chapter, in many respects the stages are closely connected: after a conversation is recorded, for instance, it may prove more efficient to transcribe it immediately, since whoever made the recording will be available to answer questions about it and to aid in its transcription.
- Type: Chapter
- Book: English Corpus Linguistics: An Introduction, pp. 55-80
- Publisher: Cambridge University Press
- Print publication year: 2002