Classifying news versus opinions in newspapers: Linguistic features for domain independence

  • K. R. KRÜGER, A. LUKOWIAK, J. SONNTAG, S. WARZECHA and M. STEDE

Newspaper text can be broadly divided into the classes ‘opinion’ (editorials, commentary, letters to the editor) and ‘neutral’ (reports). We describe a classification system that performs this separation using a set of linguistically motivated features. Working with various English newspaper corpora, we demonstrate that it significantly outperforms bag-of-lemma and PoS-tag models. We conclude that the linguistic features constitute the best method for achieving robustness against a change of newspaper or domain.
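
To illustrate the contrast between topic-bound lexical models and linguistically motivated features, the following minimal sketch computes a few surface rates for a text alongside a plain bag-of-words count. The cue lists (`FIRST_PERSON`, `MODALS`), the three rates, and the function names are hypothetical and chosen for illustration only; they are not the feature set or the system described in the article.

```python
import re

# Hypothetical cue lists, for illustration only; NOT the paper's feature set.
FIRST_PERSON = {"i", "we", "my", "our", "me", "us"}
MODALS = {"should", "must", "ought", "could", "would", "might", "may"}


def tokenize(text):
    """Return lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def linguistic_features(text):
    """Map a text to per-token rates that do not depend on its topic."""
    tokens = tokenize(text)
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "modal_rate": sum(t in MODALS for t in tokens) / n,
        "emphasis_rate": sum(text.count(c) for c in "!?") / n,
    }


def bag_of_words(text):
    """Count each token; the resulting keys are tied to the text's vocabulary."""
    counts = {}
    for token in tokenize(text):
        counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    opinion = "We must act now. Why should our city wait?"
    report = "The council approved the budget on Monday."
    print(linguistic_features(opinion))
    print(linguistic_features(report))
    print(bag_of_words(report))
```

The bag-of-words keys (`council`, `budget`, ...) change with every newspaper and topic, whereas the rate-based features keep the same dimensions across corpora, which is the intuition behind using such features for robustness to a change of domain.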

