To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Sandra Kübler, Can Liu, and Zeeshan Ali Sayyed
Abstract

We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams as features. This results in a large feature set, of which only a small subset may be a good indicator of sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by sampling reviews and recipes. We show that over-sampling is the best method for boosting performance on the minority classes, but it also causes a severe drop in overall accuracy of at least 6 percentage points.
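The abstract does not spell out the multi-class information gain criterion; as an illustration only, the following is a minimal sketch of information-gain feature ranking over binary presence features against multi-class labels. All function names and the toy data are invented for this example and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, feature):
    """IG of splitting the corpus on presence/absence of one feature,
    measured against the full multi-class label distribution."""
    with_f = [l for d, l in zip(docs, labels) if feature in d]
    without_f = [l for d, l in zip(docs, labels) if feature not in d]
    h_prior = entropy(list(Counter(labels).values()))
    h_cond = 0.0
    for subset in (with_f, without_f):
        if subset:
            h_cond += (len(subset) / len(docs)) * entropy(list(Counter(subset).values()))
    return h_prior - h_cond

def select_top_k(docs, labels, k):
    """Rank all features by information gain and keep the top k."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: information_gain(docs, labels, f), reverse=True)
    return ranked[:k]

# Toy corpus: documents as token sets, three rating classes (1-3).
docs = [{"great", "easy"}, {"great", "tasty"}, {"bland", "easy"},
        {"bland", "salty"}, {"awful", "salty"}, {"awful", "bland"}]
labels = [3, 3, 2, 2, 1, 1]
print(sorted(select_top_k(docs, labels, 2)))
```

On this toy data the two tokens that each pick out a single rating class ("great" and "awful") receive the highest gain, which is the behaviour a multi-class criterion should exhibit; an ensemble of one-vs-rest binary scores would have to aggregate per-class rankings instead.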

Footnotes

This work is based on research supported by the U.S. Office of Naval Research (ONR) via grant #N00014-10-1-0140.

References
Hide All
Agarwal B., and Mittal N., 2012. Categorical probability proportion difference (CPPD): A feature selection method for sentiment classification. In Proceedings of the 2nd Workshop on Sentiment Analysis where AI meets Psychology (SAAIP), Mumbai, India, pp. 1726.
Baccianella S., Esuli A., and Sebastiani F., 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, vol. 10, pp. 2200–4.
Bird S., Klein E., and Loper E. 2009. Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Sebastopol, CA.
Bollen J., Mao H., and Zeng X.-J. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2: 18.
Brank J., Grobelnik M., Milic-Frayling N., and Mladenic D. 2002. Feature selection using linear support vector machines. Technical Report MSR-TR-2002-63, Microsoft Research.
Brants T., 2000. TnT – A statistical part-of-speech tagger. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics and the 6th Conference on Applied Natural Language Processing (ANLP/NAACL), Seattle, WA, pp. 224–31.
Brown P., Della Pietra V., deSouza P., Lai J., and Mercer R., 1992. Class-based n-gram models of natural language. Computational Linguistics 18 (4): 467–79.
Chen J., Huang H., Tian S., and Qu Y., 2009. Feature selection for text classification with Naïve Bayes. Expert Systems with Applications 36 (3): 5432–5.
Crammer K., and Singer Y., 2002. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2: 265–92.
Duric A., and Song F., 2012. Feature selection for sentiment analysis based on content and syntax models. Decision Support Systems 53 (4): 704–11.
Forman G., 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3 : 1289–305.
Forman G. 2004. A pitfall and solution in multi-class feature selection for text classification. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
Glorot X., Bordes A., and Bengio Y., 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, pp. 513–20.
Guyon I., and Elisseeff A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3 : 1157–82.
Joachims T. 1999. Making large-scale SVM learning practical. In Schölkopf B., Burges C., and Smola A. (eds.), Advances in Kernel Methods – Support Vector Learning. MIT Press, Massachusetts Institute of Technology.
Koo T., Carreras X., and Collins M., 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL:HLT), Columbus, OH, pp. 595603.
Kummer O., and Savoy J., 2012. Feature selection in sentiment analysis. In Proceeding of the Conférence en Recherche d’Infomations et Applications (CORIA), Bordeaux, France, pp. 273–84.
Li S., Xia R., Zong C., and Huang C.-R., 2009. A framework of feature selection methods for text categorization. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 692700.
Liang P. 2005. Semi-Supervised Learning for Natural Language. Master’s Thesis, MIT.
Liu B. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Liu C., Guo C., Dakota D., Rajagopalan S., Li W., Kübler S., and Yu N., 2014a. “My curiosity was satisfied, but not in a good way”: Predicting user ratings for online recipes. In Proceedings of the 2nd Workshop on Natural Language Processing for Social Media (SocialNLP), Dublin, Ireland, pp. 1221.
Liu C., Kübler S., and Yu N., 2014b. Feature selection for highly skewed sentiment analysis tasks. In Proceedings of the 2nd Workshop on Natural Language Processing for Social Media (SocialNLP), Dublin, Ireland, pp. 211.
Maas A., Daly R., Pham P., Huang D., Ng A., and Potts C., 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, pp. 142–50.
Maier W., Kübler S., Dakota D., and Whyatt D. 2014. Parsing German: How much morphology do we need? In Proceedings of the 1st Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages (SPMRL-SANCL), Dublin, Ireland, pp. 114.
Mitchell T. 1997. Machine Learning. McGraw-Hill.
Mullen T., and Collier N., 2004. Sentiment analysis using support vector machines with diverse information sources. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 4, Barcelona, Spain, pp. 412–8.
Nakagawa T., Inui K., and Kurohashi S. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 786–94.
Ng A. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
O’Keefe T., and Koprinska I., 2009. Feature selection and weighting methods in sentiment analysis. In Proceedings of the 14th Australasian Document Computing Symposium (ADCS), Sydney, Australia, pp. 6774.
Pang B., and Lee L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain.
Pang B., and Lee L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1–2): 1135.
Pang B., Lee L., and Vaithyanathan S., 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, pp. 7986.
Porter M., 1980. An algorithm for suffix stripping. Program 14 (3): 130–7.
Sadamitsu K., Sekine S., and Yamamoto M., 2008. Sentiment analysis based on probabilistic models using inter-sentence information. In Proceedings of International Conference on Language Resources and Evaluation (LREC), Marrakesh, Morocco, pp. 2892–6.
Santorini B. 1990. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing). Dept. Comput. Inf. Sci., Univ. Pennsylvania.
Severyn A., and Moschitti A., 2015. On the automatic learning of sentiment lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, pp. 1397–402.
Socher R., Pennington J., Huang E. H., Ng A. Y., and Manning C. D., 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, pp. 151–61.
Sun A., Grishman R., and Sekine S., 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, pp. 521–9.
Tkachenko M., and Simanovsky A., 2012. Named entity recognition: Exploring features. In Proceedings of KONVENS 2012, 11th Conference on Natural Language Processing, Vienna, Austria, pp. 118–27.
Wilson T., Wiebe J., and Hoffmann P., 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 347–54.
Yang Y., and Pedersen J., 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), Nashville, TN, pp. 412–20.
Ye Q., Zhang Z., and Law R., 2009. Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Systems with Applications 36 (3): 6527–35.
Yu N., Zhekova D., Liu C., and Kübler S. 2013. Do good recipes need butter? Predicting user ratings of online recipes. In Proceedings of the IJCAI Workshop on Cooking with Computers, Beijing, China.
Zheng Z., Wu X., and Srihari R., 2004. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter 6 (1): 80–9.
Natural Language Engineering (ISSN: 1351-3249; EISSN: 1469-8110)