Dependency-based n-gram models for general purpose sentence realisation

  • YUQING GUO, HAIFENG WANG and JOSEF VAN GENABITH
Abstract

This paper presents a general-purpose, wide-coverage, probabilistic sentence generator based on dependency n-gram models. Such a generator is of particular interest because many semantic or abstract syntactic input specifications for sentence realisation can be represented as labelled bi-lexical dependencies or typed predicate-argument structures. Our generation method captures the mapping between semantic representations and surface forms by linearising a set of dependencies directly, rather than by applying grammar rules as in more traditional chart-style or unification-based generators. In contrast to conventional n-gram language models over surface word forms, we exploit structural information and various linguistic features inherent in the dependency representations to constrain the generation space and improve generation quality. A series of experiments shows that dependency-based n-gram models generalise well to different languages (English and Chinese) and representations (LFG and CoNLL). Compared with state-of-the-art generation systems, our general-purpose sentence realiser is highly competitive, with the added advantages of being simple, fast, robust and accurate.
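
The idea of linearising a set of dependencies with an n-gram model can be illustrated with a minimal sketch. It is not the authors' system: it assumes a toy bigram model over dependency labels within a single local tree, add-one smoothing, and exhaustive search over permutations. The label names (SUBJ, PRED, OBJ, ADJ), the training sequences and the function names train_bigram, log_prob and linearise are all hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import log

def train_bigram(sequences):
    """Count label bigrams (with sentence-boundary padding) in ordered label sequences."""
    counts, context = {}, {}
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[(prev, cur)] = counts.get((prev, cur), 0) + 1
            context[prev] = context.get(prev, 0) + 1
    return counts, context

def log_prob(order, counts, context, vocab_size, alpha=1.0):
    """Additively smoothed bigram log-probability of one candidate label order."""
    padded = ["<s>"] + list(order) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = counts.get((prev, cur), 0) + alpha
        denominator = context.get(prev, 0) + alpha * vocab_size
        total += log(numerator / denominator)
    return total

def linearise(labels, counts, context, vocab_size):
    """Return the permutation of the unordered labels that the bigram model scores highest."""
    return max(
        permutations(labels),
        key=lambda order: log_prob(order, counts, context, vocab_size),
    )

# Hypothetical training data: observed orders of a head (PRED) and its dependents.
TRAIN = [
    ("SUBJ", "PRED", "OBJ"),
    ("SUBJ", "PRED", "OBJ", "ADJ"),
    ("ADJ", "SUBJ", "PRED"),
]
counts, context = train_bigram(TRAIN)
VOCAB_SIZE = 5  # SUBJ, PRED, OBJ, ADJ and the end marker </s>

# Input is an unordered set of dependencies; output is a proposed surface order.
best = linearise(["OBJ", "PRED", "SUBJ"], counts, context, VOCAB_SIZE)
print(best)
```

Exhaustive permutation is only workable here because a local tree has few dependents; a realistic realiser would need a more efficient search and a richer model than this toy bigram.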

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering