Doing Linguistics with a Corpus: Methodological Considerations for the Everyday User

Jesse Egbert; Tove Larsson; Douglas Biber

doi:10.1017/9781108888790

Series: Elements in Corpus Linguistics

Published online by Cambridge University Press: 13 October 2020

Affiliations

Jesse Egbert: Northern Arizona University
Tove Larsson: Northern Arizona University
Douglas Biber: Northern Arizona University

Summary

Paradoxically, doing corpus linguistics is both easier and harder than it has ever been. On the one hand, it is easier because we have access to more existing corpora, more corpus analysis software tools, and more statistical methods than ever before. On the other hand, reliance on these existing corpora and corpus linguistic methods can create layers of distance between the researcher and the language in a corpus, making it a challenge to do linguistics with a corpus. The goal of this Element is to explore ways to improve how we approach linguistic research questions with quantitative corpus data. We introduce and illustrate the major steps in the research process, including how to: select and evaluate corpora; establish linguistically motivated research questions, observational units, and variables; select linguistically interpretable variables; understand and evaluate existing corpus software tools; adopt minimally sufficient statistical methods; and qualitatively interpret quantitative findings.

Keywords

corpus linguistics; research design; quantitative methods; qualitative methods; statistical methods

Information

Type: Element
Series: Elements in Corpus Linguistics

DOI: https://doi.org/10.1017/9781108888790

Online ISBN: 9781108888790

Publisher: Cambridge University Press

Print publication: 12 November 2020


