
From UBGs to CFGs: A practical corpus-driven approach


We present a simple and intuitive, unsound, corpus-driven approximation method for turning unification-based grammars (UBGs), such as HPSG, CLE, or PATR-II, into context-free grammars (CFGs). Our research is motivated by the idea that large-scale, hand-written unification grammars can be exploited not only to describe natural language and to obtain a syntactic structure (and perhaps a semantic form), but also to address several very practical tasks. First, the approximated CFG can serve as a cheap recognition pre-filter that speeds up deep parsing. Second, a PCFG trained on the approximated CFG yields an indirect stochastic parsing model for the unification grammar, which gives us an efficient disambiguation model for the unification-based grammar. Third, the method can generate domain-specific subgrammars for application areas such as information extraction or question answering. Finally, it can be used to compile context-free language models that assist the acoustic model of a speech recognizer. The approximation method is unsound in that it does not guarantee a CFG whose language is a true superset of the language accepted by the original unification-based grammar. It is corpus-driven in that it relies on a corpus of parsed sentences and generates broader CFGs when given more input samples. Our open approach can be fine-tuned in different directions, allowing us to monotonically come closer to the original parse trees by shifting more information into the context-free symbols. The approach has been fully implemented in Java.
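To make the corpus-driven idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each parsed sentence has already been reduced to a tree whose node labels are plain context-free symbols; how those symbols are derived from feature structures is not shown. The class `CorpusCfgSketch`, its `Tree` type, and all method names are hypothetical. The sketch reads one production off every local tree, so the rule set can only grow as more parsed samples are added. It then assigns rule probabilities by simple relative frequency, a swapped-in stand-in for the PCFG training mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CorpusCfgSketch {

    /** Hypothetical parse-tree node: a context-free symbol plus its children. */
    static final class Tree {
        final String label;
        final List<Tree> children;

        Tree(String label, List<Tree> children) {
            this.label = label;
            this.children = children;
        }
    }

    /** Builds an inner node with the given children. */
    static Tree node(String label, Tree... children) {
        return new Tree(label, Arrays.asList(children));
    }

    /** Builds a leaf, i.e. a word of the sentence. */
    static Tree leaf(String word) {
        return new Tree(word, new ArrayList<Tree>());
    }

    /** Counts the production "LHS -> RHS" of every local tree below t. */
    static void countRules(Tree t, Map<String, Integer> counts) {
        if (t.children.isEmpty()) {
            return; // words contribute no production of their own
        }
        List<String> rhs = new ArrayList<>();
        for (Tree child : t.children) {
            rhs.add(child.label);
        }
        counts.merge(t.label + " -> " + String.join(" ", rhs), 1, Integer::sum);
        for (Tree child : t.children) {
            countRules(child, counts);
        }
    }

    /** Extracts rule counts from a whole corpus of parsed sentences. */
    static Map<String, Integer> extract(List<Tree> corpus) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Tree t : corpus) {
            countRules(t, counts);
        }
        return counts;
    }

    /** Returns the left-hand side of a rule string such as "NP -> she". */
    static String lhs(String rule) {
        return rule.substring(0, rule.indexOf(" -> "));
    }

    /** Relative frequency: count(rule) / count(all rules with the same LHS). */
    static Map<String, Double> relativeFrequencies(Map<String, Integer> counts) {
        Map<String, Integer> lhsTotals = new LinkedHashMap<>();
        counts.forEach((rule, n) -> lhsTotals.merge(lhs(rule), n, Integer::sum));
        Map<String, Double> probs = new LinkedHashMap<>();
        counts.forEach((rule, n) ->
                probs.put(rule, n / (double) lhsTotals.get(lhs(rule))));
        return probs;
    }

    public static void main(String[] args) {
        // Two toy parse trees standing in for a corpus of parsed sentences.
        Tree t1 = node("S", node("NP", leaf("she")), node("VP", leaf("sleeps")));
        Tree t2 = node("S", node("NP", leaf("he")),
                node("VP", node("V", leaf("sees")), node("NP", leaf("her"))));

        Map<String, Integer> counts = extract(Arrays.asList(t1, t2));
        relativeFrequencies(counts).forEach((rule, p) ->
                System.out.println(rule + "  " + p));
    }
}
```

With only the first toy tree the sketch yields three rules; adding the second raises this to seven, which mirrors the statement that more input samples give a broader CFG. In this toy corpus `S -> NP VP` receives probability 1.0 and each of the three `NP` rules receives 1/3. Making the labels more specific (for example, by encoding additional feature values in the symbol names) would yield a more fine-grained grammar without changing the extraction code.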

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110