Programming for Corpus Linguistics with Python and Dataframes

Daniel Keller

doi:10.1017/9781108904094

This Element offers intermediate or experienced programmers algorithms for corpus linguistics (CL) programming in Python using dataframes, which provide a fast, efficient, and intuitive set of methods for working with large, complex datasets such as corpora. It demonstrates principles of dataframe programming applied to CL analyses and gives complete algorithms for creating concordances; producing lists of collocates, keywords, and lexical bundles; and performing key feature analysis. An additional algorithm for creating dataframe corpora is also presented, including methods for tokenizing, part-of-speech tagging, and lemmatizing with spaCy. Together, these provide a set of core skills that can be applied to a range of CL research questions, as well as to original analyses not possible with existing corpus software.
