Text Analysis in Python for Social Scientists

Discovery and Exploration

Published online by Cambridge University Press:  14 December 2020

Dirk Hovy
Affiliation:
Bocconi University

Summary

Text is everywhere, and it is a fantastic resource for social scientists. However, because it is so abundant, and because language is so variable, it is often difficult to extract the information we want. There is a whole subfield of AI concerned with text analysis (natural language processing). Many of the basic analysis methods developed are now readily available as Python implementations. This Element will teach you when to use which method, the mathematical background of how it works, and the Python code to implement it.
Type: Element
Online ISBN: 9781108873352
Publisher: Cambridge University Press
Print publication: 21 January 2021

