Discourse structure and language technology

  • B. WEBBER, M. EGG and V. KORDONI

An increasing number of researchers and practitioners in Natural Language Engineering face the prospect of having to work with entire texts, rather than individual sentences. While it is clear that text must have useful structure, its nature may be less clear, making it more difficult to exploit in applications. This survey of work on discourse structure thus provides a primer on the bases on which discourse is structured, along with some of their formal properties. It then lays out the current state of the art with respect to algorithms for recognizing these different structures, and how these algorithms are currently being used in Language Technology applications. After identifying resources that should prove useful in improving algorithm performance across a range of languages, we conclude by speculating on future discourse structure-enabled technology.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering