
Classifying news versus opinions in newspapers: Linguistic features for domain independence

  • K. R. KRÜGER, A. LUKOWIAK, J. SONNTAG, S. WARZECHA and M. STEDE...

Newspaper text can be broadly divided into the classes ‘opinion’ (editorials, commentary, letters to the editor) and ‘neutral’ (reports). We describe a classification system for performing this separation, which uses a set of linguistically motivated features. Working with various English newspaper corpora, we demonstrate that it significantly outperforms bag-of-lemma and PoS-tag models. We conclude that the linguistic features constitute the best method for achieving robustness against a change of newspaper or domain.
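
As a rough illustration of what a linguistically motivated feature can look like, in contrast to a bag-of-lemma model that counts every word, the following minimal sketch computes a few per-token cue rates for a text. The cue lists, the feature names and the function `linguistic_features` are hypothetical and chosen only for illustration; they are not the feature set described in the article.

```python
import re

# Hypothetical cue lists for illustration only; not the article's feature set.
FIRST_PERSON = {"i", "we", "my", "our", "me", "us"}
MODALS = {"should", "must", "ought", "might", "could", "would"}


def linguistic_features(text: str) -> dict[str, float]:
    """Return per-token rates of a few surface cues that may signal opinion text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "modal_rate": sum(t in MODALS for t in tokens) / n,
        # Count of '!' and '?' characters relative to the number of word tokens.
        "emphasis_rate": sum(text.count(c) for c in "!?") / n,
    }


opinion = "We must act now! Why should our city wait?"
report = "The council approved the budget on Tuesday."
print(linguistic_features(opinion))
print(linguistic_features(report))
```

Because such rates do not depend on topic vocabulary, they could in principle be fed to any standard classifier and carried over to a different newspaper unchanged, which is the kind of robustness the abstract refers to.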


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering