  • Cited by 23
      • Jonathan Dunn, University of Canterbury, Christchurch, New Zealand
    • Publisher:
      Cambridge University Press
      Publication date:
      04 March 2022
      31 March 2022
      ISBN:
      9781009070447
      9781009074438
      Dimensions:
      (229 x 152 mm)
      Weight & Pages:
      0.15kg, 96 Pages
    • Subjects:
      Research Methods in Linguistics, Applied Linguistics, Language and Linguistics

    Book description

    Corpus analysis can be expanded and scaled up by incorporating computational methods from natural language processing. This Element shows how text classification and text similarity models can extend our ability to undertake corpus linguistics across very large corpora. These computational methods are becoming increasingly important as corpora grow too large for more traditional types of linguistic analysis. We draw on five case studies to show how and why to use computational methods, ranging from usage-based grammar to authorship analysis to using social media for corpus-based sociolinguistics. Each section is accompanied by an interactive code notebook that shows how to implement the analysis in Python. A stand-alone Python package is also available to help readers use these methods with their own data. Because large-scale analysis introduces new ethical problems, this Element pairs each new methodology with a discussion of potential ethical implications.
