The lexicon has come to occupy an increasingly central place in a variety of current linguistic theories, and it is equally important to work in natural language processing. The lexicon – the repository of information about words – has often proved to be a bottleneck in the design of large-scale natural language systems, given the tremendous number of words in the English language, coupled with the constant coinage of new words and shifts in the meanings of existing words. For this reason, there has been growing interest recently in building large-scale lexical knowledge bases automatically, or at least semi-automatically, taking various on-line resources such as machine-readable dictionaries (MRDs) and text corpora as a starting point (see, for instance, the papers in Boguraev and Briscoe, 1989, and Zernik, 1989a). This chapter looks at the task of creating a lexicon from a different perspective, reviewing some of the advances in the understanding of the organization of the lexicon that have emerged from recent work in linguistics and sketching how the results of this work may be used in the design and creation of large-scale lexical knowledge bases that can serve a variety of needs, including those of natural language front ends, machine translation, speech recognition and synthesis, and lexicographers' and translators' workstations.
Although in principle on-line resources such as MRDs and text corpora would seem to provide a wealth of valuable linguistic information that could serve as a foundation for developing a lexical knowledge base, in practice it is often difficult to take full advantage of the information these existing resources contain.
This chapter presents an operational definition of computational lexicography, which is emerging as a discipline in its own right. In the context of one of its primary goals – facilitation of (semi-)automatic construction of lexical knowledge bases (aka computational lexicons) by extracting lexical data from on-line dictionaries – the concerns of dictionary analysis are related to those of lexical semantics. The chapter argues for a particular paradigm of lexicon construction, which relies crucially on having flexible access to fine-grained structural analyses of multiple dictionary sources. To this end, several related issues in computational lexicography are discussed in some detail.
In particular, the notion of structured dictionary representation is exemplified by looking at the wide range of functions encoded, both explicitly and implicitly, in the notations for dictionary entries. This allows the formulation of a framework for exploiting the lexical content of dictionary structure, in part encoded configurationally, for the purpose of streamlining the process of lexical acquisition.
A methodology for populating a lexical knowledge base with knowledge derived from existing lexical resources should not be developed in isolation from a theory of lexical semantics. Rather than promote any particular theory, however, we argue that without a theoretical framework the traditional methods of computational lexicography can hardly go further than highlighting the inadequacies of current dictionaries. We further argue that by reference to a theory that assumes a formal and rich model of the lexicon, dictionaries can be made to reveal – through guided analysis of highly structured isomorphs – a number of lexical semantic relations of relevance to natural language processing, which are only encoded implicitly and are distributed across the entire source.
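To make the idea of a structured, flexibly accessible dictionary representation concrete, the following is a minimal Python sketch of how such an entry might be modelled. The field names and the helper function are illustrative assumptions for exposition, not part of any system described in this chapter.

```python
# A minimal, hypothetical sketch of a structured dictionary entry of the kind
# a lexical knowledge base might hold; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Sense:
    sense_id: str
    definition: str
    pos: str                                               # part of speech, e.g. "n", "v"
    usage_labels: list[str] = field(default_factory=list)  # e.g. ["informal"]
    examples: list[str] = field(default_factory=list)

@dataclass
class Entry:
    headword: str
    pronunciation: str
    senses: list[Sense]
    source: str          # which dictionary the entry was extracted from

# "Flexible access" to the structure: e.g., find every sense of a headword
# carrying a given usage label, across entries extracted from several sources.
def senses_with_label(entries: list[Entry], headword: str, label: str) -> list[Sense]:
    return [s for e in entries if e.headword == headword
              for s in e.senses if label in s.usage_labels]
```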
One of the major resources in the task of building a large-scale lexicon for a natural-language system is the machine-readable dictionary. Serious flaws (from the point of view of the computer as user) have already been documented in the dictionaries being used in machine-readable form in natural language processing, including a lack of systematicity in the lexicographers' treatment of linguistic facts; recurrent omission of explicit statements of essential facts; and variations in lexicographical decisions which, together with ambiguities within entries, militate against successful mapping of one dictionary onto another and hence against optimal extraction of linguistic facts.
Large-scale electronic corpora now allow us to evaluate a dictionary entry realistically by comparing it with evidence of how the word is used in the real world. For various lexical items, an attempt is made to compare the view of word meaning that a corpus offers with the way in which this is presented in the definitions of five dictionaries currently available in machine-readable form and being used in natural language processing (NLP) research; corpus evidence is shown to support apparently incompatible semantic descriptions. Suggestions are offered for the construction of a lexical database entry to facilitate the mapping of such apparently incompatible dictionary entries and the consequent maximization of the useful facts extracted from them.
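As a rough illustration of what mapping apparently incompatible dictionary entries might involve, here is a hypothetical Python sketch that pairs the senses of one dictionary's entry with the most similar senses of another's, using nothing more than lexical overlap between definition texts. The scoring function and the example senses are invented for exposition; a real mapping would also draw on corpus evidence of the kind discussed above.

```python
# A naive, hypothetical illustration of aligning two dictionaries' sense
# inventories for one headword by lexical overlap of their definitions.
def overlap(def_a: str, def_b: str) -> float:
    a, b = set(def_a.lower().split()), set(def_b.lower().split())
    return len(a & b) / max(1, len(a | b))          # Jaccard similarity

def align_senses(senses_a: dict[str, str], senses_b: dict[str, str]) -> dict[str, str]:
    """Pair each sense of dictionary A with its closest sense in dictionary B."""
    return {sa: max(senses_b, key=lambda sb: overlap(da, senses_b[sb]))
            for sa, da in senses_a.items()}

# Example (invented definitions):
# align_senses({"bank.1": "a financial institution that keeps money",
#               "bank.2": "the land alongside a river or lake"},
#              {"bank.A": "an establishment for keeping and lending money",
#               "bank.B": "sloping ground beside a body of water"})
```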
How ‘reliable’ are dictionary definitions?
Writing a dictionary is a salutary and humbling experience. It makes you very aware of the extent of your ignorance in almost every field of human experience.
The structural units of phrasal intonation are frequently orthogonal to the syntactic constituent boundaries that are recognized by traditional grammar and embodied in most current theories of syntax. As a result, much recent work on the relation of intonation to discourse context and information structure has either eschewed syntax entirely (cf. Bolinger, 1972; Cutler and Isard, 1980; Gussenhoven, 1983; Brown and Yule, 1983), or has supplemented traditional syntax with entirely nonsyntactic string-related principles (cf. Cooper and Paccia-Cooper, 1980). Recently, Selkirk (1984) and others have postulated an autonomous level of “intonational structure” for spoken language, distinct from syntactic structure. Structures at this level are plausibly claimed to be related to discourse-related notions, such as “focus”. However, the involvement of two apparently uncoupled levels of structure in Natural Language grammar appears to complicate the path from speech to interpretation unreasonably, and thereby to threaten the feasibility of computational speech recognition and speech synthesis.
In Steedman (1991a), I argue that the notion of intonational structure formalized by Pierrehumbert, Selkirk, and others, can be subsumed under a rather different notion of syntactic surface structure, which emerges from the “Combinatory Categorial” theory of grammar (Steedman, 1987, 1990). This theory engenders surface structure constituents corresponding directly to phonological phrase structure. Moreover, the grammar assigns to these constituents interpretations that directly correspond to what is here called “information structure” – that is, the aspects of discourse-meaning that have variously been termed “topic” and “comment”, “theme” and “rheme”, “given” and “new” information, and/or “presupposition” and “focus”.
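The nonstandard constituents in question arise from two combinatory rules, forward application and forward composition, together with type-raising. The following minimal Python sketch (an illustration of the rules, not Steedman's implementation) shows how type-raising the subject and composing it with the verb yields a constituent such as "Mary prefers" of category S/NP, matching an intonational phrasing like (Mary prefers)(corduroy).

```python
# Minimal sketch of two CCG combinatory rules; categories and example invented
# for illustration of the mechanism described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Cat:
    result: object = None   # another Cat, or None for an atomic category
    arg: object = None
    slash: str = ""         # "/" (forward) or "\" (backward)
    atom: str = ""          # e.g. "S", "NP" for atomic categories

    def __repr__(self):
        return self.atom if self.atom else f"({self.result}{self.slash}{self.arg})"

def atom(name):            return Cat(atom=name)
def fwd(result, argument): return Cat(result=result, arg=argument, slash="/")
def bwd(result, argument): return Cat(result=result, arg=argument, slash="\\")

def forward_apply(x, y):
    """X/Y  Y  =>  X"""
    if x.slash == "/" and x.arg == y:
        return x.result

def forward_compose(x, y):
    """X/Y  Y/Z  =>  X/Z"""
    if x.slash == "/" and y.slash == "/" and x.arg == y.result:
        return fwd(x.result, y.arg)

def type_raise_subject(np, s):
    """NP  =>  S/(S\\NP)"""
    return fwd(s, bwd(s, np))

S, NP = atom("S"), atom("NP")
mary, corduroy = NP, NP
prefers = fwd(bwd(S, NP), NP)                  # transitive verb: (S\NP)/NP

theme = forward_compose(type_raise_subject(mary, S), prefers)   # "Mary prefers": S/NP
print(theme, forward_apply(theme, corduroy))                    # (S/NP) S
```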
Throughout most of the 1960s, 1970s, and 1980s, computational linguistics enjoyed an excellent reputation. A sense of the promise of the work to society was prevalent, and high expectations were justified by solid, steady progress in research.
Nevertheless, by the close of the 1980s, many people openly expressed doubt about progress in the field. Are the problems of human language too hard to be solved even in the next eighty years? What measure(s) (other than the number of papers published) show significant progress in the last ten years? Where is the technology successfully deployed in the military, with revolutionary impact? In what directions should the field move, to ensure progress and avoid stagnation?
The symposium
A symposium on Future Directions in Natural Language Processing was held at Bolt Beranek and Newman, Inc. (BBN), in Cambridge, Massachusetts, from November 29, 1989, to December 1, 1989.
The symposium, sponsored by BBN's Science Development Program, brought together top researchers in a variety of areas to discuss the most significant problems and challenges that will face the field of computational linguistics in the next two to ten years. Speakers were encouraged to present both recently completed and speculative work, and to focus on topics that will have the most impact on the field in the coming decade. They were asked to reconsider unsolved problems of long standing as well as to present new opportunities. The purpose was to contribute to long-range planning by funding agencies, research groups, academic institutions, graduate students, and others interested in computational linguistics.
The thirty-six symposium attendees, who are listed following this preface, were all invited participants.
The purpose of this chapter is to explore the implications of some facts about prosody and intonation for efforts to create more general and higher quality speech technology. It will emphasize parallels between speech synthesis and speech recognition, because I believe that the challenges presented in these two areas exhibit strong similarities and that the best progress will be made by working on both together.
In the area of synthesis, there are now text-to-speech systems that are useful in many practical applications, especially ones in which the users are experienced and motivated. In order to have more general and higher quality synthesis technology it will be desirable (1) to improve the phonetic quality of synthetic speech to the point where it is as easily comprehended as natural speech and where it is fully acceptable to naive or unmotivated listeners, (2) to use expressive variation appropriately to convey the structure and relative importance of information in complex materials, and (3) to model the speech of people of different ages, sexes, and dialects in order to support applications requiring use of multiple voices.
Engineers working on recognition have a long-standing goal of building systems that can handle large-vocabulary continuous speech. To be useful, such systems must either be speaker-independent or, if speaker-dependent, be trainable using a sample of speech that can feasibly be collected and analyzed. Present systems exhibit a strong trade-off between the degree of speaker independence on the one hand and the size of the vocabulary and the branching factor of the grammar on the other.
One of the most delightful features of a small symposium is that it allows for protracted discussions in which many people participate. Ample time for discussion was built into the symposium schedule throughout, but we allocated a special two-hour slot to challenge ourselves to identify the most significant problems capable of being solved in a five- to ten-year period. That they be solvable in that time frame challenges us beyond what we can see, but not beyond what we can reasonably extrapolate. That their solution be significant takes the discussion beyond questions of purely academic interest.
Furthermore, at the suggestion of one of the government representatives, we asked what applications should drive research (much as the application of natural language interfaces to databases drove research in the 1970s and 1980s).
All attendees, including representatives of various governmental agencies, participated in this discussion.
To keep our thoughts large, we construed natural language processing (NLP) as broadly as possible, freely including such areas as lexicography and spoken language processing.
To direct the discussion without focusing it too tightly, we set forth the following questions:
What are the most critical areas for the next seven (plus or minus two) years of natural language processing? (“Critical” is taken to mean that which will produce the greatest impact on the technology.)
What resources are needed (such as people, training, and corpora) to accomplish the goals involved in that work?
What organization is needed (e.g., coordinated efforts, international participation) to accomplish those goals?
What application areas and markets should open up in response to progress toward those goals?
If current natural language understanding systems reason about the world at all, they generally maintain a strict division between the parsing processes and the representation that supports general reasoning about the world. The parsing processes, which include syntactic analysis, some semantic interpretation, and possibly some discourse processing, I will call structural processing, because these processes are primarily concerned with analyzing and determining the linguistic structure of individual sentences. The part of the system that involves representing and reasoning about the world or domain of discourse I will call the knowledge representation. The goal of this chapter is to examine why these two forms of processing are separated, to determine the current advantages and limitations of this approach, and to look to the future to attempt to identify the inherent limitations of the approach. I will point out some fundamental problems with the models as they are defined today and suggest some important directions of research in natural language and knowledge representation. In particular, I will argue that one of the crucial issues facing future natural language systems is the development of knowledge representation formalisms that can effectively handle ambiguity.
It has been well recognized since the early days of the field that representing and reasoning about the world are crucial to the natural language understanding task. Before we examine the main issue of the chapter in detail, let us consider some of the issues that have long been identified as demonstrating this idea. Knowledge about the world can be seen to be necessary in almost every aspect of the understanding task.
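As one way of picturing what a knowledge representation that "handles ambiguity" might look like, the following hypothetical Python sketch carries an ambiguous term forward as a set of candidate senses and lets later reasoning narrow it, rather than forcing an early choice during structural processing. The names and the pruning policy are invented for illustration and are not drawn from any system described in this chapter.

```python
# Hypothetical sketch: keep lexical ambiguity explicit in the representation
# and let world knowledge prune it later.
from dataclasses import dataclass

@dataclass
class Ambiguous:
    candidates: set[str]          # e.g. {"bank/financial", "bank/river"}

    def restrict(self, allowed: set[str]) -> "Ambiguous":
        """Let reasoning narrow the candidate set without committing too early."""
        remaining = self.candidates & allowed
        return Ambiguous(remaining or self.candidates)   # keep all if pruning fails

term = Ambiguous({"bank/financial", "bank/river"})
# A domain axiom such as "deposits are made at financial institutions" would
# license this restriction when interpreting "deposit money at the bank".
print(term.restrict({"bank/financial"}).candidates)      # {'bank/financial'}
```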
Although natural language processing (NLP) has come very far in the last twenty years, the technology has not yet achieved a revolutionary impact on society. Is this because of some fundamental limitation that can never be overcome? Is it because there has not been enough time to refine and apply theoretical work that has already been done?
We believe it is neither. We believe that several critical issues have never been adequately addressed in either theoretical or applied work, and that, because of a number of recent advances that we will discuss, the time is due for great leaps forward in the generality and utility of NLP systems. This paper focuses on roadblocks that seem surmountable within the next ten years.
Rather than presenting new results, this paper identifies the problems that we believe currently block widespread use of computational linguistics, and that can be solved within five to ten years. These are the problems that most need additional research and most deserve the talents and attention of Ph.D. students. We focus on the following areas, which will have maximum impact when combined in software systems:
Knowledge acquisition from natural language (NL) texts of various kinds, from interactions with human beings, and from other sources. Language processing requires lexical, grammatical, semantic, and pragmatic knowledge. Current knowledge acquisition techniques are too slow and too difficult to use on a wide scale or on large problems. Knowledge bases should be many times the size of current ones.
Interaction with multiple underlying systems to give NL systems the utility and flexibility demanded by people using them. Single-application systems are limited in both usefulness and the language that is necessary to communicate with them.