MARSEC: A Machine-Readable Spoken English Corpus

Peter Roach; Gerry Knowles; Tamas Varadi; Simon Arnfield

doi:10.1017/S0025100300004849

Published online by Cambridge University Press: 06 February 2009

Peter Roach: Affiliation:
Department of Linguistic Science, University of Reading, Reading RG6 2AA, U.K.
Gerry Knowles: Affiliation:
Department of Modern English Language and Linguistics, University of Lancaster, Bailrigg, Lancaster LA1 4YT, U.K.
Tamas Varadi: Affiliation:
Department of Modern English Language and Linguistics, University of Lancaster, Bailrigg, Lancaster LA1 4YT, U.K.
Simon Arnfield: Affiliation:
Department of Linguistic Science, University of Reading, Reading RG6 2AA, U.K.

Extract

The purpose of this paper is to describe a new version of the Spoken English Corpus which will be of interest to phoneticians and other speech scientists. The Spoken English Corpus is a well-known collection of spoken-language texts that was collected and transcribed in the 1980s in a joint project involving IBM UK and the University of Lancaster (Alderson and Knowles forthcoming; Knowles and Taylor 1988). One valuable aspect of it is that the recorded material on which it was based is fairly freely available and the recording quality is generally good. At the time when the recordings were made, the idea of storing all the recorded material in digital form suitable for computer processing was of limited practicality. Although storage on digital tape was certainly feasible, this did not provide rapid computer access. The arrival of optical disk technology, with the possibility of storing very large amounts of digital data on a compact disk at relatively low cost, has brought about a revolution in ideas on database construction and use. It seemed to us that the recordings of the Spoken English Corpus (hereafter SEC) should now be converted into a form which would enable the user to gain access to the acoustic signal without the laborious business of winding through large amounts of tape. Once this was done, we should be able not only to listen to the recordings in a very convenient way, but also to carry out many automatic analyses of the material by computer.

Type: Article
Information: Journal of the International Phonetic Association, Volume 23, Issue 2, December 1993, pp. 47–54

DOI: https://doi.org/10.1017/S0025100300004849
Copyright: Copyright © Journal of the International Phonetic Association 1993

References

Alderson, P. and Knowles, G. (forthcoming). Working with the Spoken English Corpus. London: Longman.

Arnfield, S. (1993). A syntax-based grammar of stress sequences. Proceedings of the IEE Conference on Grammatical Inference, 7.1–7.7. Colchester: University of Essex.

Fourcin, A.J. and Dolmazon, J-M. (1991). Speech knowledge, standards and assessment. Proceedings of the XII International Congress of Phonetic Sciences. Aix-en-Provence. Vol. 5, 430–433.

Garside, R. (1987). The CLAWS word-tagging system. In Garside, R., Leech, G. and Sampson, G. (editors), The Computational Analysis of English, 30–41. London: Longman.

Ghali, N., Arnfield, S. and Roach, P. (1992). Statistical relationships between acoustic and auditory records of intonation. Proceedings of the Institute of Acoustics. 14.6, 207–216.

Knowles, G. (1991). Prosodic labelling: the problem of tone group boundaries. In Johansson, S. and Stenstrom, A-B. (editors), English Computer Corpora: selected papers and research guide, 149–163. Berlin: Mouton de Gruyter.

Knowles, G. (1992). Pitch contours and tones in the Lancaster/IBM Spoken English Corpus. In Leitner, G. (editor), New Dimensions in Corpus Linguistics, 289–299. Berlin: de Gruyter.

Knowles, G. (1993 a). The machine readable Spoken English Corpus. In Aarts, J., de Haan, P. and Oostdijk, N. (editors), English Language Corpora: design, analysis and exploration, 107–119. Amsterdam: Rodopi.

Knowles, G. (1993 b). From text to waveform: converting the Lancaster/IBM Spoken English Corpus into a speech database. In Souter, C. and Atwell, E. (editors), Corpus-based Computational Linguistics, 47–58. Amsterdam: Rodopi.

Knowles, G. and Lawrence, L. (1987). Automatic intonation assignment. In Garside, R., Leech, G. and Sampson, G. (editors), The Computational Analysis of English, 139–148. London: Longman.

Knowles, G. and Taylor, L. (1988). Manual of Information to Accompany the Spoken English Corpus. Lancaster: Unit for Computer Research on the English Language, University of Lancaster.

Moore, J. and Roach, P. (1993). The role of context in the automatic recognition of stressed syllables. Proceedings of Eurospeech, 2, 767–770. Berlin.

Roach, P. (forthcoming). Conversion between prosodic transcription systems: ‘Standard British’ and ToBI. To appear in Speech Communication.

Roach, P. and Arnfield, S. (forthcoming). Linking prosodic transcription to the time dimension. In Leech, G., Myers, G. and Thomas, J. (editors), Computerized Spoken Discourse. London: Longman.

Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J. and Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. Proceedings of the 1992 International Conference on Speech and Language Processing 2, 867–870. Banff, Alberta.

Svartvik, J. and Quirk, R. (editors) (1980). A Corpus of English Conversation. Lund: Lund University Press.

Wichmann, A. (1991). Beginnings, Middles and Ends. Unpublished PhD thesis, University of Lancaster.
