Predicting word choice in affective text

M. Gardiner and M. Dras
Abstract

Choosing the best word or phrase for a given context from among candidate near-synonyms, such as slim and skinny, is a difficult language generation problem. In this paper, we describe approaches to solving an instance of this problem, the lexical gap problem, with a particular focus on affect and subjectivity; to do this, we draw on techniques from the sentiment and subjectivity analysis fields. We present a supervised approach to this problem, initially with a unigram model that solidly outperforms the baseline, with a 6.8% increase in accuracy. The results to some extent confirm those from related problems: feature presence outperforms feature frequency, and immediate context features generally outperform wider context features. Somewhat surprisingly, however, the latter does not always hold, and not necessarily where intuition might first suggest; an analysis of the cases where document-level models perform better suggested that, in our corpus, broader features related to the ‘tone’ of the document could be useful, including document sentiment, document author, and a distance metric for weighting the wider lexical context of the gap itself. With these, our best model achieves a 10.1% increase in accuracy, corresponding to a 38% reduction in errors. Moreover, our models improve accuracy not only on affective word choice but also on non-affective word choice.
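The task described above lends itself to a standard supervised formulation. The sketch below is a minimal, hypothetical illustration (not the authors' actual model or feature set) of framing near-synonym choice as classification: sentences containing one of the candidate near-synonyms become training examples labelled with that word, and the surrounding unigrams, encoded by presence rather than frequency, serve as features. The candidate set, helper function, and toy sentences are invented for illustration, and scikit-learn's LinearSVC stands in for whichever classifier one might prefer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# One candidate near-synonym set (the choices for a "lexical gap"); hypothetical.
NEAR_SYNONYMS = {"slim", "skinny"}

def make_example(sentence):
    """Replace the near-synonym with a gap token; return (context, label)."""
    tokens = sentence.lower().split()
    label = next(t for t in tokens if t in NEAR_SYNONYMS)
    context = " ".join("<gap>" if t == label else t for t in tokens)
    return context, label

# Toy training data; a real system would harvest sentences from a corpus.
training_sentences = [
    "she looked slim and elegant in the dress",
    "the stray cat was skinny and nervous",
    "a slim volume of poetry",
    "those skinny jeans are far too tight",
]
contexts, labels = zip(*(make_example(s) for s in training_sentences))

# binary=True encodes feature *presence* rather than frequency.
model = make_pipeline(CountVectorizer(binary=True), LinearSVC())
model.fit(contexts, list(labels))

# Choose the near-synonym to fill a new lexical gap.
print(model.predict(["he stayed <gap> by swimming every morning"])[0])
```

Setting binary=True in the vectoriser mirrors the abstract's observation that feature presence tends to outperform feature frequency; the document-level features the paper finds useful (document sentiment, author, distance-weighted wider context) are not modelled in this sketch.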

Footnotes

The authors would like to thank the anonymous reviewers of the article, and to acknowledge the support of ARC Discovery grant DP0558852.
