We present a lexical platform that has been developed for the Spanish language. It achieves portability between different computer systems and efficiency, in terms of speed and lexical coverage. A model for the full treatment of Spanish inflectional morphology for verbs, nouns and adjectives is presented. This model permits word formation based solely on morpheme concatenation, driven by a feature-based unification grammar. The run-time lexicon is a collection of allomorphs for both stems and endings. Although not tested, it should be suitable also for other Romance and highly inflected languages. A formalism is also described for encoding a lemma-based lexical source, well suited for expressing linguistic generalizations: inheritance classes, lemma encoding, morpho-graphemic allomorphy rules and limited type-checking. From this source base, we can automatically generate an allomorph indexed dictionary adequate for efficient retrieval and processing. A set of software tools has been implemented around this formalism: lexical base augmenting aids, lexical compilers to build run-time dictionaries and access libraries for them, feature manipulation libraries, unification and pseudo-unification modules, morphological processors, a parsing system, etc. Software interfaces among the different modules and tools are cleanly defined to ease software integration and tool combination in a flexible way. Directions for accessing our e-mail and web demonstration prototypes are also provided. Some figures are given, showing the lexical coverage of our platform compared to some popular spelling checkers.
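The idea of word formation by morpheme concatenation under feature unification can be illustrated with a minimal sketch. It assumes a flat attribute-value feature representation and a toy allomorph lexicon for the verb "contar"; the feature names (`conj`, `stress`, and so on), the lexicon entries and the function names are hypothetical illustrations and are not taken from the platform described above.

```python
from typing import Optional

Features = dict[str, str]


def unify(a: Features, b: Features) -> Optional[Features]:
    """Flat feature unification: fail on a conflicting value, else merge."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None  # feature clash, the two morphemes cannot combine
        merged[key] = value
    return merged


# Hypothetical run-time lexicon: one entry per stem allomorph.
STEMS: dict[str, Features] = {
    "cont": {"lemma": "contar", "conj": "1", "stress": "ending"},
    "cuent": {"lemma": "contar", "conj": "1", "stress": "stem"},
}

# Hypothetical ending allomorphs, each with one or more feature bundles.
ENDINGS: dict[str, list[Features]] = {
    "o": [{"conj": "1", "stress": "stem", "tense": "pres",
           "person": "1", "number": "sg"}],
    "amos": [{"conj": "1", "stress": "ending", "tense": "pres",
              "person": "1", "number": "pl"}],
}


def analyse(word: str) -> list[Features]:
    """Try every stem+ending split and keep the splits whose features unify."""
    results = []
    for i in range(1, len(word)):
        stem, ending = word[:i], word[i:]
        if stem in STEMS and ending in ENDINGS:
            for ending_feats in ENDINGS[ending]:
                merged = unify(STEMS[stem], ending_feats)
                if merged is not None:
                    results.append(merged)
    return results
```

In this sketch `analyse("cuento")` and `analyse("contamos")` each yield one analysis, while the ill-formed "cuentamos" and "conto" are rejected because the `stress` features of stem and ending clash. No spelling rules are applied at run time; the allomorphy is carried entirely by the lexicon entries.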
In this paper, we introduce a method to represent phrase structure grammars for building a large annotated corpus of Korean syntactic trees. Korean is different from English in word order and word compositions. As a result of our study, it turned out that the differences are significant enough to induce meaningful changes in the tree annotation scheme for Korean with respect to the schemes for English. A tree annotation scheme defines the grammar formalism to be assumed, categories to be used, and rules to determine correct parses for unsettled issues in parse construction. Korean is partially free in word order and the essential components such as subjects and objects of a sentence can be omitted with greater freedom than in English. We propose a restricted representation of phrase structure grammar to handle the characteristics of Korean more efficiently. The proposed representation is shown by means of an extensive experiment to gain improvements in parsing time as well as grammar size. We also describe the system named Teb that is a software environment set up with a goal to build a tree annotated corpus of Korean containing more than one million units.
This paper presents a new type of nonlinear discourse structure found to be very common in free English texts. This structure reflects nonlinear presentation of the information and knowledge conveyed by the texts. It is argued that such nonlinearity is representationally and informationally advantageous because it allows one to create smaller, more compact texts. The paper presents a heuristics-based, relatively domain-independent algorithm for computing this new text structure. The paper discusses good quantitative and qualitative performance of the algorithm, and presents the results of the extensive tests on a large volume of free English texts.
Natural language interfaces require dialogue models that allow for robust, habitable and efficient interaction. This paper presents such a model for dialogue management for natural language interfaces. The model is based on empirical studies of human computer interaction in various simple service applications. It is shown that for applications belonging to this class the dialogue can be handled using fairly simple means. The interaction can be modeled in a dialogue grammar with information on the functional role of an utterance as conveyed in the linguistic structure. Focusing is handled using dialogue objects recorded in a dialogue tree representing the constituents of the dialogue. The dialogue objects in the dialogue tree can be accessed by the various modules for interpretation, generation and background system access. Focused entities are modeled as objects or sets of objects, together with related domain concept information, that is, properties of the domain objects. A simple copying principle, where a new dialogue object's focal parameters are instantiated with information from the preceding dialogue object, accounts for most context dependent utterances. The action to be carried out by the interface is determined on the basis of how the objects and related properties are specified. This in turn depends on information presented in the user utterance, context information from the dialogue tree and information in the domain model. The use of dialogue objects facilitates customization to the sublanguage utilized in a specific application. The framework has successfully been applied to various background systems and interaction modalities. The paper presents results from the customization of the dialogue manager to three typed interaction applications, together with results from applying the model to two applications utilizing spoken interaction.
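The copying principle can be sketched in a few lines. The sketch below assumes that a dialogue object carries just two focal parameters, the objects in focus and the properties asked about, and that a parameter left unspecified by the user is represented as `None`; the class name, field names and example utterances are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DialogueObject:
    """Hypothetical dialogue object holding two focal parameters."""
    objects: Optional[list[str]] = None      # domain objects in focus
    properties: Optional[list[str]] = None   # properties asked about


def interpret(utterance: DialogueObject,
              previous: Optional[DialogueObject]) -> DialogueObject:
    """Copy every focal parameter the utterance leaves unspecified
    from the preceding dialogue object."""
    if previous is None:
        return utterance
    return DialogueObject(
        objects=(utterance.objects
                 if utterance.objects is not None else previous.objects),
        properties=(utterance.properties
                    if utterance.properties is not None
                    else previous.properties),
    )


# "What is the price of the Volvo 240?" followed by "And the top speed?"
first = DialogueObject(objects=["Volvo 240"], properties=["price"])
follow_up = DialogueObject(properties=["top speed"])
resolved = interpret(follow_up, first)
```

Here the elliptical follow-up inherits its object from the preceding dialogue object while keeping its own property, which is how a context dependent utterance would be completed under this simplified reading of the principle.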
This special issue presents the state-of-the-art in implemented, general-purpose Natural Language Processing (NLP) systems that use nontrivial Knowledge Representation and Reasoning (KRR). These systems use full-scale implementations of traditional KRR techniques as well as some newer knowledge-related processing mechanisms that have been developed specifically to meet the needs of natural language processing. The papers cover a wide range of natural language inputs, knowledge and formalisms, application domains and processing tasks, illustrating the key role that knowledge representation plays in all types of NLP systems.
We describe the natural language processing and knowledge representation components of B2, a collaborative system that allows medical students to practice their decision-making skills by considering a number of medical cases that differ from each other in a controlled manner. The underlying decision-support model of B2 uses a Bayesian network that captures the results of prior clinical studies of abdominal pain. B2 generates story-problems based on this model and supports natural language queries about the conclusions of the model and the reasoning behind them. B2 benefits from having a single knowledge representation and reasoning component that acts as a blackboard for intertask communication and cooperation. All knowledge is represented using a propositional semantic network formalism, thereby providing a uniform representation to all components. The natural language component is composed of a generalized augmented transition network parser/grammar and a discourse analyzer for managing the natural language interactions. The knowledge representation component supports the natural language component by providing a uniform representation of the content and structure of the interaction, at the parser, discourse, and domain levels. This uniform representation allows distinct tasks, such as dialog management, domain-specific reasoning, and meta-reasoning about the Bayesian network, to all use the same information source, without requiring mediation. This is important because there are queries, such as Why?, whose interpretation and response require information from each of these tasks. By contrast, traditional approaches treat each subtask as a “black-box” with respect to other task components, and have a separate knowledge representation language for each. As a result, they have had much more difficulty providing useful responses.
This paper describes the approach to knowledge representation taken in the LaSIE Information Extraction (IE) system. Unlike many IE systems that skim texts and use large collections of shallow, domain-specific patterns and heuristics to fill in templates, LaSIE attempts a fuller text analysis, first translating individual sentences to a quasi-logical form, and then constructing a weak discourse model of the entire text from which template fills are finally derived. Underpinning the system is a general ‘world model’, represented as a semantic net, which is extended during the processing of a text by adding the classes and instances described in that text. In the paper we describe the system's knowledge representation formalisms, their use in the IE task, and how the knowledge represented in them is acquired, including experiments to extend the system's coverage using the WordNet general purpose semantic network. Preliminary evaluations of our approach, through the Sixth DARPA Message Understanding Conference, indicate comparable performance to shallower approaches. However, we believe its generality and extensibility offer a route towards the higher precision that is required of IE systems if they are to become genuinely usable technologies.
A large collection of texts may be reached through the Internet and this provides a powerful platform from which common-sense knowledge may be gathered. This paper presents a system that contains a core knowledge base structured around WordNet, a lexical database, capable of extracting contextual information from a given input text. Such context information is then used to retrieve other texts from the Internet that relate to that context. When processed by the system, these new texts bring more information that represents an enhanced domain context for the initial text. This is an incremental method for text processing that acquires domain knowledge from other texts. The paper describes the system architecture, its core knowledge base and inference engine, and the acquisition of new knowledge from corpora.
In this paper, we describe NKRL (Narrative Knowledge Representation Language), a language designed for representing, in a standardized way, the semantic content (the ‘meaning’) of complex narrative texts. After having introduced informally the four ‘components’ (specialized sub-languages) of NKRL, we will describe (some of) the data structures proper to each of them, trying to show that the NKRL coding retains the main informational elements of the original narrative expressions. We will then focus on an important subset of NKRL, the so-called AECS sub-language, showing in particular that the operators of this sub-language can be used to represent some sorts of ‘plural’ expressions.
In this article, we give an overview of Natural Language Generation (NLG) from an applied system-building perspective. The article includes a discussion of when NLG techniques should be used; suggestions for carrying out requirements analyses; and a description of the basic NLG tasks of content determination, discourse planning, sentence aggregation, lexicalization, referring expression generation, and linguistic realisation. Throughout, the emphasis is on established techniques that can be used to build simple but practical working systems now. We also provide pointers to techniques in the literature that are appropriate for more complicated scenarios.
Natural language generation is now moving away from research prototypes into more practical applications. Generation functionality is also being asked to play a more significant role in established applications such as machine translation. In both cases, multilingual generation techniques have much to offer. However, the take-up of multilingual generation is being restricted by a critical lack both of large-scale linguistic resources suited to the generation task and of appropriate development environments. This paper describes KPML, a multilingual development environment that offers one possible solution to these problems. KPML aims to provide generation projects with standardized, broad-coverage, reusable resources and a basic engine for using such resources for generation. A variety of focused debugging aids ensure efficient maintenance, while supporting multilingual work such as contrastive language development and automatic merging of independently developed resources. KPML is based on a new, generic approach to multilinguality in resource description that extends significantly beyond previous approaches. The system has already been used in a number of large generation projects and is freely available to the generation community.
This paper describes an accurate and robust text alignment system for structurally different languages. Among structurally different languages such as Japanese and English, there is a limitation on the number of word correspondences that can be statistically acquired. The main reason for this is that the systems of functional (closed) words are quite different in the two languages. The proposed method makes use of two kinds of word correspondences in aligning bilingual texts. One is a bilingual dictionary of general use. The other is the word correspondences that are statistically acquired in the alignment process. Our method gradually determines sentence pairs (anchors) that correspond to each other by relaxing parameters. The method, by combining two kinds of word correspondences, achieves adequate word correspondences for complete alignment. As a result, texts of various lengths and of various genres in structurally different languages can be aligned with high precision. Experimental results show our system outperforms conventional methods for various kinds of Japanese–English texts.
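The gradual determination of anchors by relaxing parameters can be illustrated with a deliberately simplified sketch. It assumes that a sentence-pair similarity matrix has already been computed (for instance from dictionary matches), that the only relaxed parameter is a single score threshold, and that anchors must not cross each other. The statistical acquisition of new word correspondences between passes is omitted, and the function name, threshold schedule and scores are hypothetical, so this is not the authors' algorithm.

```python
def find_anchors(scores: list[list[float]],
                 thresholds: list[float]) -> list[tuple[int, int]]:
    """Select sentence-pair anchors in several passes.

    scores[i][j] is the similarity of source sentence i and target
    sentence j. thresholds is given from strict to relaxed. A pair is
    accepted only if it lies strictly before or strictly after every
    existing anchor in both texts, so anchors never cross or share a
    sentence.
    """
    anchors: list[tuple[int, int]] = []
    for threshold in thresholds:
        candidates = sorted(
            ((scores[i][j], i, j)
             for i in range(len(scores))
             for j in range(len(scores[i]))
             if scores[i][j] >= threshold),
            reverse=True,  # consider the most confident pairs first
        )
        for _, i, j in candidates:
            if all((i < ai and j < aj) or (i > ai and j > aj)
                   for ai, aj in anchors):
                anchors.append((i, j))
    return sorted(anchors)


# Hypothetical similarity scores for three source and three target sentences.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.4, 0.1],
    [0.0, 0.3, 0.8],
]
anchors = find_anchors(scores, thresholds=[0.7, 0.3])
```

With these scores the strict pass fixes the first and last sentence pairs, and the relaxed pass then accepts the weaker middle pair because it fits between the existing anchors. Pairs that would reuse or cross an anchored sentence are rejected even when they clear the relaxed threshold.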
The paper presents background and motivation for a processing model that segments discourse into units that are simple, non-nested clauses, prior to the recognition of clause-internal phrasal constituents, and experimental results in support of this model. One set of results is derived from a statistical reanalysis of the Swedish empirical data in Strangert, Ejerhed and Huber (1993) concerning the linguistic structure of major prosodic units. The other set of results is derived from experiments in segmenting part-of-speech annotated Swedish text corpora into clauses, using a new clause segmentation algorithm. The clause segmented corpus data is taken from the Stockholm Umeå Corpus (SUC), 1 M words of Swedish texts from different genres, part-of-speech annotated by hand, and from the Umeå corpus DAGENS INDUSTRI 1993 (DI93), 5 M words of Swedish financial newspaper text, processed by fully automatic means consisting of tokenizing, lexical analysis, and probabilistic POS tagging. The results of these two experiments show that the proposed clause segmentation algorithm is 96% correct when applied to manually tagged text, and 91% correct when applied to probabilistically tagged text.
We present a model of text analysis for text-to-speech (TTS) synthesis based on (weighted) finite state transducers, which serves as the text analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules, inter alia. To date, the model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese.
The full paper describes an environment for the generation of non-deterministic taggers, currently used for the development of a Spanish lexicon. In relation to previous approaches, our system includes the use of verification tools in order to assure the robustness of the generated taggers. A wide variety of user defined criteria can be applied for checking the exact properties of the system.
In computational linguistics, efficient recognition of phrases is an important prerequisite for many ambitious goals, such as automated extraction of terminology, part-of-speech disambiguation, and automated translation. If one wants to recognize a certain well-defined set of phrases, the question of which type of computational device to use for this task arises. For sets of phrases that are not too complex, as well as for many subtasks of the recognition process, finite state methods are appropriate and favourable because of their efficiency (Gross and Perrin 1989; Silberztein 1993; Tapanainen 1995). However, if very large sets of possibly complex phrases are considered where correct resolution of grammatical structure requires morphological analysis (e.g. verb argument structure, extraposition of relative clauses, etc.), then the design and implementation of an appropriate finite state automaton might turn out to be infeasible in practice due to the immense number of morphological variants to be captured.