To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Sandra Kübler, Can Liu, and Zeeshan Ali Sayyed
Abstract

We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams as features. This results in a large feature set, of which only a small subset may be a good indicator of sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by sampling reviews and recipes. We show that over-sampling is the best method for boosting performance on the minority classes, but it also causes a severe drop in overall accuracy of at least 6 percentage points.
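The abstract does not spell out the multi-class information gain criterion; as an illustration only, the following is a minimal sketch of information-gain feature ranking over binary presence features against multi-class labels. All function names and the toy data are invented for this example and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, feature):
    """IG of splitting the corpus on presence/absence of one feature,
    measured against the full multi-class label distribution."""
    with_f = [l for d, l in zip(docs, labels) if feature in d]
    without_f = [l for d, l in zip(docs, labels) if feature not in d]
    h_prior = entropy(list(Counter(labels).values()))
    h_cond = 0.0
    for subset in (with_f, without_f):
        if subset:
            h_cond += (len(subset) / len(docs)) * entropy(list(Counter(subset).values()))
    return h_prior - h_cond

def select_top_k(docs, labels, k):
    """Rank all features by information gain and keep the top k."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: information_gain(docs, labels, f), reverse=True)
    return ranked[:k]

# Toy corpus: documents as token sets, three rating classes (1-3).
docs = [{"great", "easy"}, {"great", "tasty"}, {"bland", "easy"},
        {"bland", "salty"}, {"awful", "salty"}, {"awful", "bland"}]
labels = [3, 3, 2, 2, 1, 1]
print(sorted(select_top_k(docs, labels, 2)))
```

On this toy data the two tokens that each pick out a single rating class ("great" and "awful") receive the highest gain, which is the behaviour a multi-class criterion should exhibit; an ensemble of one-vs-rest binary scores would have to aggregate per-class rankings instead.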

Footnotes

This work is based on research supported by the U.S. Office of Naval Research (ONR) via grant #N00014-10-1-0140.

References
Hide All
Agarwal B., and Mittal N., 2012. Categorical probability proportion difference (CPPD): A feature selection method for sentiment classification. In Proceedings of the 2nd Workshop on Sentiment Analysis where AI meets Psychology (SAAIP), Mumbai, India, pp. 1726.
Baccianella S., Esuli A., and Sebastiani F., 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, vol. 10, pp. 2200–4.
Bird S., Klein E., and Loper E. 2009. Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Sebastopol, CA.
Bollen J., Mao H., and Zeng X.-J. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2: 18.
Brank J., Grobelnik M., Milic-Frayling N., and Mladenic D. 2002. Feature selection using linear support vector machines. Technical Report MSR-TR-2002-63, Microsoft Research.
Brants T., 2000. TnT – A statistical part-of-speech tagger. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics and the 6th Conference on Applied Natural Language Processing (ANLP/NAACL), Seattle, WA, pp. 224–31.
Brown P., Della Pietra V., deSouza P., Lai J., and Mercer R., 1992. Class-based n-gram models of natural language. Computational Linguistics 18 (4): 467–79.
Chen J., Huang H., Tian S., and Qu Y., 2009. Feature selection for text classification with Naïve Bayes. Expert Systems with Applications 36 (3): 5432–5.
Crammer K., and Singer Y., 2002. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2: 265–92.
Duric A., and Song F., 2012. Feature selection for sentiment analysis based on content and syntax models. Decision Support Systems 53 (4): 704–11.
Forman G., 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3 : 1289–305.
Forman G. 2004. A pitfall and solution in multi-class feature selection for text classification. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
Glorot X., Bordes A., and Bengio Y., 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, pp. 513–20.
Guyon I., and Elisseeff A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3 : 1157–82.
Joachims T. 1999. Making large-scale SVM learning practical. In Schölkopf B., Burges C., and Smola A. (eds.), Advances in Kernel Methods – Support Vector Learning. MIT Press, Massachusetts Institute of Technology.
Koo T., Carreras X., and Collins M., 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL:HLT), Columbus, OH, pp. 595603.
Kummer O., and Savoy J., 2012. Feature selection in sentiment analysis. In Proceeding of the Conférence en Recherche d’Infomations et Applications (CORIA), Bordeaux, France, pp. 273–84.
Li S., Xia R., Zong C., and Huang C.-R., 2009. A framework of feature selection methods for text categorization. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 692700.
Liang P. 2005. Semi-Supervised Learning for Natural Language. Master’s Thesis, MIT.
Liu B. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Liu C., Guo C., Dakota D., Rajagopalan S., Li W., Kübler S., and Yu N., 2014a. “My curiosity was satisfied, but not in a good way”: Predicting user ratings for online recipes. In Proceedings of the 2nd Workshop on Natural Language Processing for Social Media (SocialNLP), Dublin, Ireland, pp. 1221.
Liu C., Kübler S., and Yu N., 2014b. Feature selection for highly skewed sentiment analysis tasks. In Proceedings of the 2nd Workshop on Natural Language Processing for Social Media (SocialNLP), Dublin, Ireland, pp. 211.
Maas A., Daly R., Pham P., Huang D., Ng A., and Potts C., 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, pp. 142–50.
Maier W., Kübler S., Dakota D., and Whyatt D. 2014. Parsing German: How much morphology do we need? In Proceedings of the 1st Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages (SPMRL-SANCL), Dublin, Ireland, pp. 114.
Mitchell T. 1997. Machine Learning. McGraw-Hill.
Mullen T., and Collier N., 2004. Sentiment analysis using support vector machines with diverse information sources. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 4, Barcelona, Spain, pp. 412–8.
Nakagawa T., Inui K., and Kurohashi S. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 786–94.
Ng A. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
O’Keefe T., and Koprinska I., 2009. Feature selection and weighting methods in sentiment analysis. In Proceedings of the 14th Australasian Document Computing Symposium (ADCS), Sydney, Australia, pp. 6774.
Pang B., and Lee L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain.
Pang B., and Lee L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1–2): 1135.
Pang B., Lee L., and Vaithyanathan S., 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, pp. 7986.
Porter M., 1980. An algorithm for suffix stripping. Program 14 (3): 130–7.
Sadamitsu K., Sekine S., and Yamamoto M., 2008. Sentiment analysis based on probabilistic models using inter-sentence information. In Proceedings of International Conference on Language Resources and Evaluation (LREC), Marrakesh, Morocco, pp. 2892–6.
Santorini B. 1990. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing). Dept. Comput. Inf. Sci., Univ. Pennsylvania.
Severyn A., and Moschitti A., 2015. On the automatic learning of sentiment lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, pp. 1397–402.
Socher R., Pennington J., Huang E. H., Ng A. Y., and Manning C. D., 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, pp. 151–61.
Sun A., Grishman R., and Sekine S., 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, pp. 521–9.
Tkachenko M., and Simanovsky A., 2012. Named entity recognition: Exploring features. In Proceedings of KONVENS 2012, 11th Conference on Natural Language Processing, Vienna, Austria, pp. 118–27.
Wilson T., Wiebe J., and Hoffmann P., 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 347–54.
Yang Y., and Pedersen J., 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), Nashville, TN, pp. 412–20.
Ye Q., Zhang Z., and Law R., 2009. Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Systems with Applications 36 (3): 6527–35.
Yu N., Zhekova D., Liu C., and Kübler S. 2013. Do good recipes need butter? Predicting user ratings of online recipes. In Proceedings of the IJCAI Workshop on Cooking with Computers, Beijing, China.
Zheng Z., Wu X., and Srihari R., 2004. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter 6 (1): 80–9.
Natural Language Engineering (ISSN: 1351-3249; EISSN: 1469-8110)