Publisher: Cambridge University Press
Online publication date: May 2024
Print publication year: 2024
Online ISBN: 9781108904094
Subjects: Research Methods in Linguistics, Applied Linguistics, Language and Linguistics

Book description

This Element offers intermediate or experienced programmers algorithms for corpus linguistic (CL) programming in Python using dataframes, which provide a fast, efficient, and intuitive set of methods for working with large, complex datasets such as corpora. It demonstrates principles of dataframe programming applied to CL analyses and presents complete algorithms for creating concordances; producing lists of collocates, keywords, and lexical bundles; and performing key feature analysis. An additional algorithm for creating dataframe corpora is also presented, including methods for tokenizing, part-of-speech tagging, and lemmatizing with spaCy. Together, these provide a set of core skills that can be applied to a range of CL research questions, as well as to original analyses not possible with existing corpus software.
