Classifying news versus opinions in newspapers: Linguistic features for domain independence

  • K. R. KRÜGER (a1), A. LUKOWIAK (a1), J. SONNTAG (a1), S. WARZECHA (a1) and M. STEDE (a1)...

Newspaper text can be broadly divided into the classes ‘opinion’ (editorials, commentary, letters to the editor) and ‘neutral’ (reports). We describe a classification system that performs this separation using a set of linguistically motivated features. Working with various English newspaper corpora, we demonstrate that it significantly outperforms bag-of-lemma and PoS-tag models. We conclude that the linguistic features constitute the best method for achieving robustness against change of newspaper or domain.
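To make the contrast with bag-of-lemma models concrete, the sketch below computes a few topic-independent surface cues per text. It is only an illustration under stated assumptions: the cue lists (`FIRST_PERSON`, `MODALS`), the function name `surface_features`, and the example sentences are hypothetical and are not the feature set described in the article.

```python
import re

# Hypothetical cue lists, chosen for illustration only;
# they are NOT the linguistic features used in the article.
FIRST_PERSON = {"i", "we", "my", "our", "me", "us"}
MODALS = {"should", "must", "ought", "might", "could", "would"}


def surface_features(text):
    """Return per-token rates of a few topic-independent surface cues."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "modal_rate": sum(t in MODALS for t in tokens) / n,
        "question_rate": text.count("?") / n,
    }


# Invented example sentences, not taken from any corpus.
opinion = "We must ask: should the council really raise taxes again?"
report = "The council voted on Tuesday to raise taxes by two percent."

print(surface_features(opinion))
print(surface_features(report))
```

Because rates like these do not depend on topic vocabulary, a classifier trained on them may transfer across newspapers more readily than one keyed to specific lemmas, which is the kind of robustness the abstract claims for its own (different, richer) feature set.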

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering