Natural Language Processing for Corpus Linguistics

Jonathan Dunn

doi:10.1017/9781009070447

Series: Elements in Corpus Linguistics

Published online by Cambridge University Press: 04 March 2022

Affiliation: University of Canterbury, Christchurch, New Zealand

Summary

Corpus analysis can be expanded and scaled up by incorporating computational methods from natural language processing. This Element shows how text classification and text similarity models can extend our ability to undertake corpus linguistics across very large corpora. These computational methods are becoming increasingly important as corpora grow too large for more traditional types of linguistic analysis. We draw on five case studies to show how and why to use computational methods, ranging from usage-based grammar to authorship analysis to using social media for corpus-based sociolinguistics. Each section is accompanied by an interactive code notebook that shows how to implement the analysis in Python. A stand-alone Python package is also available to help readers use these methods with their own data. Because large-scale analysis introduces new ethical problems, this Element pairs each new methodology with a discussion of potential ethical implications.

Keywords

computational linguistics, natural language processing, corpus linguistics, text classification, text similarity, usage-based grammar, corpus-based sociolinguistics, computational stylistics, computational syntax

Type: Element

DOI: https://doi.org/10.1017/9781009070447

Online ISBN: 9781009070447

Publisher: Cambridge University Press

Print publication: 31 March 2022
