A piece of software as complex as a complete natural language generation system is unlikely to be constructed as a monolithic program. In this chapter, we introduce a particular architecture for NLG systems, by which we mean a specification of how the different types of processing are distributed across a number of component modules. As part of this architectural specification, we discuss how these modules interact with each other and we describe the data structures that are passed between the modules.
Introduction
Like other complex software systems, NLG systems are generally easier to build and debug if they are decomposed into distinct, well-defined, and easily integrated modules. This is especially true if the software is being developed by a team rather than by one individual. Modularisation can also make it easier to reuse components amongst different applications and to modify an application. Suppose, for example, we adopt a modularisation where one component is responsible for selecting the information content of a text and another is responsible for expressing this content in some natural language. Provided a well-defined interface between these components is specified, different teams or individuals can work on the two components independently. It may also be possible to reuse the components (in particular the second, less application-dependent component) independently of one another.
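To make the idea of a well-defined interface between such components concrete, here is a minimal sketch in Python of a two-component decomposition. All of the names (Message, select_content, realise) and the toy weather example are invented for illustration; they are not the data structures defined later in this chapter.

```python
# A minimal sketch of a two-module NLG pipeline, assuming an invented
# intermediate "message" representation passed across the interface.
from dataclasses import dataclass

@dataclass
class Message:
    """Language-neutral content unit passed between the two modules."""
    predicate: str   # e.g. "temperature-rise"
    arguments: dict  # e.g. {"location": "Oslo", "amount": "5 degrees"}

def select_content(data: dict) -> list[Message]:
    """Content determination: decide WHAT to say (application-dependent)."""
    messages = []
    if data.get("temp_change", 0) > 0:
        messages.append(Message("temperature-rise",
                                {"location": data["city"],
                                 "amount": f"{data['temp_change']} degrees"}))
    return messages

def realise(messages: list[Message]) -> str:
    """Linguistic realisation: decide HOW to say it (less application-dependent)."""
    sentences = []
    for m in messages:
        if m.predicate == "temperature-rise":
            sentences.append(f"The temperature in {m.arguments['location']} "
                             f"rose by {m.arguments['amount']}.")
    return " ".join(sentences)

print(realise(select_content({"city": "Oslo", "temp_change": 5})))
# -> The temperature in Oslo rose by 5 degrees.
```

Because the two functions communicate only through the Message structure, either side can be replaced or reused independently, which is the point of the modularisation described above.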
This paper focuses on the design methodology of the MultiLingual Dictionary-System (MLDS), a human-oriented tool for assisting translators in the task of translating lexical units, conceived from empirical studies carried out with translators. We describe the model adopted for the representation of multilingual dictionary knowledge. This model allows an enriched exploitation of the lexical-semantic relations extracted from dictionaries. In addition, MLDS is supplied with knowledge about how dictionaries are used in the process of lexical translation, elicited by means of empirical methods and specified in a formal language. The dictionary knowledge and the task-oriented knowledge are used together to offer the translator active, anticipative and intelligent assistance.
This paper describes an approach for constructing a mixture of language models based on
simple statistical notions of semantics using probabilistic models developed for information
retrieval. The approach encapsulates corpus-derived semantic information and is able to model
varying styles of text. Using such information, the corpus texts are clustered in an unsupervised
manner and a mixture of topic-specific language models is automatically created. The principal
contribution of this work is to characterise the document space resulting from information
retrieval techniques and to demonstrate the approach for mixture language modelling. A
comparison is made between manual and automatic clustering in order to elucidate how the
global content information is expressed in the space. We also compare (in terms of association
with manual clustering and language modelling accuracy) alternative term-weighting schemes
and the effect of singular value decomposition dimension reduction (latent semantic analysis).
Test set perplexity results using the British National Corpus indicate that the approach can
improve the potential of statistical language modelling. Using an adaptive procedure, the
conventional model may be tuned to track text data with a slight increase in computational
cost.
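As a rough illustration of the mixture idea only (not the paper's IR-based clustering, term weighting or n-gram models), the sketch below builds smoothed unigram models from two hand-assigned document clusters, whose toy documents are invented, and evaluates words under a fixed-weight mixture.

```python
# A minimal sketch of a mixture of topic-specific language models,
# assuming unigram statistics and pre-assigned document clusters.
from collections import Counter

def train_unigram(docs, vocab, alpha=1.0):
    """Unigram model with add-alpha smoothing over a fixed vocabulary."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_prob(word, topic_models, weights):
    """P(word) under a fixed-weight mixture of topic-specific models."""
    return sum(lam * lm[word] for lam, lm in zip(weights, topic_models))

# Toy corpus: two "topics", each a cluster of tokenised documents.
clusters = [
    [["stocks", "fell", "sharply"], ["markets", "fell"]],
    [["rain", "fell", "overnight"], ["heavy", "rain"]],
]
vocab = {w for c in clusters for d in c for w in d}
models = [train_unigram(c, vocab) for c in clusters]
weights = [0.5, 0.5]

for w in ["fell", "rain", "stocks"]:
    print(w, round(mixture_prob(w, models, weights), 3))
```

In the adaptive setting described in the abstract, the fixed weights would instead be re-estimated as text is processed so that the mixture tracks the current topic.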
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad
coverage grammars. In the simplest case, rules can be ‘read off’ the parse annotations of the corpus, producing either a simple or a probabilistic context-free grammar. Such grammars,
however, can be very large, presenting problems for the subsequent computational costs of
parsing under the grammar. In this paper, we explore ways by which a treebank grammar
can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i)
thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which
has both probabilistic and non-probabilistic variants. Our results show that by a combined
use of these two techniques, a probabilistic context-free grammar can be reduced in size by
62% without any loss in parsing performance, and by 71% to give a gain in recall, but some
loss in precision.
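A minimal sketch of the first of these techniques, frequency thresholding of treebank rules, is given below; the rule counts and the threshold are invented, and the paper's rule-parsing method is not shown.

```python
# A minimal sketch of rule thresholding for treebank-grammar compaction:
# rules seen fewer than `threshold` times are discarded and the remaining
# rule probabilities are renormalised per left-hand side.
from collections import Counter, defaultdict

rule_counts = Counter({
    ("NP", ("DT", "NN")): 5000,
    ("NP", ("NNP",)): 3200,
    ("NP", ("DT", "JJ", "JJ", "JJ", "NN")): 2,     # rare, long rule
    ("VP", ("VBD", "NP")): 4100,
    ("VP", ("VBD", "ADVP", "NP", "PP", "PP")): 1,  # rare, long rule
})

def compact(counts, threshold):
    kept = {r: c for r, c in counts.items() if c >= threshold}
    totals = defaultdict(int)
    for (lhs, _), c in kept.items():
        totals[lhs] += c
    # Renormalised PCFG probabilities over the surviving rules.
    return {r: c / totals[r[0]] for r, c in kept.items()}

pcfg = compact(rule_counts, threshold=10)
print(f"{len(pcfg)} of {len(rule_counts)} rules kept")
for rule, p in pcfg.items():
    print(rule, round(p, 3))
```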
We show how the DOP model can be used for fast and robust context-sensitive processing of
spoken input in a practical spoken dialogue system called OVIS. OVIS (Openbaar Vervoer Informatie Systeem, ‘Public Transport Information System’) is a Dutch spoken-language information system which operates over ordinary telephone lines. The prototype system is
the immediate goal of the NWO Priority Programme ‘Language and Speech Technology’. In
this paper, we extend the original Data-Oriented Parsing (DOP) model to context-sensitive
interpretation of spoken input. The system we describe uses the OVIS corpus (which consists
of 10,000 trees enriched with compositional semantics) to compute from an input word-graph the best utterance together with its meaning. Dialogue context is taken into account
by dividing up the OVIS corpus into context-dependent subcorpora. Each system question
triggers a subcorpus by which the user answer is analysed and interpreted. Our experiments
indicate that the context-sensitive DOP model obtains better accuracy than the original
model, allowing for fast and robust processing of spoken input.
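The context-sensitivity mechanism can be pictured roughly as follows; the question types, answers and the trivial ‘model’ used here are invented stand-ins for the OVIS subcorpora and the DOP analyses.

```python
# A minimal sketch of context-sensitive interpretation via subcorpora:
# each system question type selects the subcorpus (here just a stand-in
# word-set model) against which the user's answer is analysed.
def build_models(corpus_by_question):
    """Pretend 'model' per subcorpus: the set of words seen in user answers."""
    return {q: {w for answer in answers for w in answer.split()}
            for q, answers in corpus_by_question.items()}

corpus_by_question = {
    "ask_departure": ["from amsterdam", "i want to leave from utrecht"],
    "ask_destination": ["to rotterdam please", "to den haag"],
}
models = build_models(corpus_by_question)

def interpret(system_question, user_answer):
    """Analyse the answer only against the subcorpus its question triggers."""
    known = models[system_question]
    return [w for w in user_answer.split() if w in known]

print(interpret("ask_departure", "from utrecht please"))  # -> ['from', 'utrecht']
```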
Most parsing algorithms require phrases that are to be combined to be either contiguous or marked as being ‘extraposed’. The assumption that phrases which are to be combined will be adjacent to one another supports rapid indexing mechanisms: the fact that in most languages items can turn up in unexpected locations cancels out much of the ensuing efficiency. The current paper shows how ‘out of position’ items can be incorporated directly. This leads to efficient parsing even when items turn up having been right-shifted, a state of affairs which makes Johnson and Kay's (1994) notion of ‘sponsorship’ of empty nodes inapplicable.
In this paper we present results concerning the large-scale automatic extraction of pragmatic content from email, using a system based on a phrase-matching approach to speech act detection combined with empirical detection of speech act patterns in corpora. The investigation is supported by analysis of a corpus of 1000 emails; the results show that most speech acts occurring in this corpus can be recognized by the approach.
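As a hedged illustration of the phrase-matching idea (the actual patterns were derived empirically from the email corpus and are not reproduced here), a detector might look like this:

```python
# A minimal sketch of phrase-matching speech act detection for email,
# assuming a small hand-built pattern table.
import re

# Each speech act is associated with a few indicative surface patterns.
PATTERNS = {
    "request": [r"\bcould you\b", r"\bplease (send|let me know)\b", r"\bcan you\b"],
    "commit":  [r"\bi will\b", r"\bi'll\b", r"\bwe will\b"],
    "propose": [r"\bhow about\b", r"\bshall we\b", r"\bwhat if we\b"],
    "deliver": [r"\battached is\b", r"\bplease find attached\b"],
}

def detect_speech_acts(text):
    """Return the set of speech acts whose patterns match the message."""
    text = text.lower()
    return {act for act, pats in PATTERNS.items()
            if any(re.search(p, text) for p in pats)}

msg = "Could you send the report? I'll forward the figures tomorrow."
print(detect_speech_acts(msg))   # e.g. {'request', 'commit'}
```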
A probabilistic parameter reestimation algorithm plays a key role in the automatic acquisition of stochastic grammars. In the case of context-free phrase structure grammars, the inside-outside algorithm is widely used. However, it is not directly applicable to Probabilistic Dependency Grammar (PDG), because PDG is not based on constituents but on a head-dependent relation between pairs of words. This paper presents a reestimation algorithm which is a variation of the inside-outside algorithm adapted to probabilistic dependency grammar. The algorithm can be used either to reestimate the probabilistic parameters of an existing dependency grammar, or to extract a PDG from scratch. Using the algorithm, we have learned a PDG from a part-of-speech-tagged corpus of Korean, which showed about 62.82% dependency accuracy (the percentage of correct dependencies) for unseen test sentences.
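As a generic illustration of what such a reestimation step does (not the paper's specific adaptation of the inside-outside algorithm), the probability of a head-dependent link can be updated from expected counts in EM fashion, where expectations are taken over all analyses of each training sentence under the current model:

```latex
% Generic EM-style update for a dependency link probability; an
% illustrative sketch, not the paper's exact reestimation formula.
\[
  P_{\text{new}}(d \mid h) \;=\;
  \frac{\sum_{s \in \text{corpus}} E_{P_{\text{old}}}\!\left[\, c(h \rightarrow d\,;\, s) \,\right]}
       {\sum_{d'} \sum_{s \in \text{corpus}} E_{P_{\text{old}}}\!\left[\, c(h \rightarrow d'\,;\, s) \,\right]}
\]
```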
An automated tool to assist in the understanding of legacy code components can be useful both in the areas of software reuse and software maintenance. Most previous work in this area has concentrated on functionally-oriented code. Although object-oriented code has been shown to be inherently more reusable than functionally-oriented code, in many cases the eventual reuse of the object-oriented code was not considered during development. A knowledge-based, natural language processing approach to the automated understanding of object-oriented code as an aid to its reuse is described. A system, called the PATRicia system (Program Analysis Tool for Reuse), that implements the approach is examined. The natural language processing/information extraction system that comprises a large part of the PATRicia system is discussed, and the knowledge base of the PATRicia system, in the form of conceptual graphs, is described. Reports provided by natural language generation in the PATRicia system are described.
In this paper, we describe a context-based method to semantically tag unknown proper
nouns (U-PNs) in corpora. Like many others, our system relies on a gazetteer and a set of
context-dependent heuristics to classify proper nouns. However, proper nouns are an open-ended class: when parsing new fragments of a corpus, even in the same language domain, we
can expect that several proper nouns cannot be semantically tagged. The algorithm that we
propose assigns to an unknown PN an entity type based on the analysis of syntactically
and semantically similar contexts already seen in the application corpus. The performance of
the algorithm is evaluated not only in terms of precision, following the tradition of MUC
conferences, but also in terms of information gain, an information theoretic measure that
takes into account the complexity of the classification task.
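For reference, one common formulation of information gain for a classification task is the reduction in entropy of the true classes C given the system's decisions K; the paper's exact definition may differ in detail:

```latex
% One common formulation of information gain for classifier output K
% with respect to true entity types C; illustrative only.
\[
  \mathrm{IG}(C; K) \;=\; H(C) - H(C \mid K)
  \;=\; -\sum_{c} P(c)\log_2 P(c)
        \;+\; \sum_{k} P(k) \sum_{c} P(c \mid k)\log_2 P(c \mid k)
\]
```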
The paper describes SENSE, a word sense disambiguation system which makes use of
multidimensional analogy-based proportions to infer the most likely sense of a word given
its context. The architecture and functioning of the system are illustrated in detail. Results of different experimental settings are given, showing that the system, in spite of its conservative bias, successfully copes with the problem of training data sparseness.
Two tasks involving lexical semantic sense tagging are described. Different task requirements
made it necessary to select different corpora to be tagged and to develop different tagging
interfaces to achieve the desired result. A vocabulary-building task required sequential tagging
of connected text, whereas a word-sense identification task required targeted tagging of many
instances of common polysemous words. Advantages and drawbacks of both are compared.
Many applications need a lexicon that represents semantic information, but acquiring lexical
information is time-consuming. We present a corpus-based bootstrapping algorithm that
assists users in creating domain-specific semantic lexicons quickly. Our algorithm uses a
representative text corpus for the domain and a small set of ‘seed words’ that belong to
a semantic class of interest. The algorithm hypothesizes new words that are also likely to
belong to the semantic class because they occur in the same contexts as the seed words. The
best hypotheses are added to the seed word list dynamically, and the process iterates in a
bootstrapping fashion. When the bootstrapping process halts, a ranked list of hypothesized
category words is presented to a user for review. We used this algorithm to generate a semantic
lexicon for eleven semantic classes associated with the MUC-4 terrorism domain.
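A minimal sketch of such a bootstrapping loop is given below. The scoring function (overlap of context-word counts), the window size and the toy corpus are invented simplifications of the algorithm described in the paper.

```python
# A minimal sketch of seed-word bootstrapping for a semantic lexicon,
# assuming a bag-of-context-words scoring function.
from collections import defaultdict

def context_words(corpus, word, window=2):
    """Collect words appearing within +/- `window` tokens of `word`."""
    ctx = defaultdict(int)
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        ctx[sent[j]] += 1
    return ctx

def bootstrap(corpus, seeds, iterations=3, per_round=2):
    lexicon = set(seeds)
    hypothesised = []          # ranked list presented to the user at the end
    for _ in range(iterations):
        # Pool the contexts of everything currently in the lexicon.
        seed_ctx = defaultdict(int)
        for w in lexicon:
            for c, n in context_words(corpus, w).items():
                seed_ctx[c] += n
        # Score candidates by how much their contexts overlap the pool.
        vocab = {w for sent in corpus for w in sent} - lexicon
        scores = {w: sum(min(n, seed_ctx.get(c, 0))
                         for c, n in context_words(corpus, w).items())
                  for w in vocab}
        best = [w for w in sorted(scores, key=scores.get, reverse=True)[:per_round]
                if scores[w] > 0]
        lexicon.update(best)
        hypothesised.extend(best)
    return hypothesised

corpus = [["the", "bomb", "exploded", "near", "the", "embassy"],
          ["a", "grenade", "exploded", "near", "the", "bank"],
          ["the", "device", "exploded", "near", "a", "bus"]]
print(bootstrap(corpus, seeds=["bomb"]))  # noisy on a toy corpus, e.g. ['device', ...]
```

On a realistic corpus the ranked output would be reviewed by a user, as the abstract describes, since the scoring inevitably hypothesises some noise words alongside genuine category members.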
Resnik and Yarowsky (1997) made a set of observations about the state of the art in automatic
word sense disambiguation and, motivated by those observations, offered several specific
proposals regarding improved evaluation criteria, common training and testing resources,
and the definition of sense inventories. Subsequent discussion of those proposals resulted
in SENSEVAL, the first evaluation exercise for word sense disambiguation (Kilgarriff and
Palmer 2000). This article is a revised and extended version of our 1997 workshop paper,
reviewing its observations and proposals and discussing them in light of the SENSEVAL exercise.
It also includes a new in-depth empirical study of translingually based sense inventories
and distance measures, using statistics collected from native-speaker annotations of 222
polysemous contexts across 12 languages. These data show that monolingual sense distinctions
at most levels of granularity can be effectively captured by translations into some set of second
languages, especially as language family distance increases. In addition, the probability that
a given sense pair will tend to lexicalize differently across languages is shown to correlate
with semantic salience and sense granularity; sense hierarchies automatically generated from
such distance matrices yield results remarkably similar to those created by professional
monolingual lexicographers.
This paper presents a system for automatic verb sense disambiguation using a small corpus
and a Machine-Readable Dictionary (MRD) in Korean. The system learns a set of typical uses
listed in the MRD usage examples for each of the senses of a polysemous verb in the MRD
definitions using verb-object co-occurrences acquired from the corpus. This paper addresses the problem of data sparseness in two ways. First, by extending word similarity measures from direct co-occurrences to co-occurrences of co-occurring words, we can compute similarities even for words that do not co-occur directly, using clusters of co-occurring words. Secondly, we acquire IS-A relations of nouns from the MRD definitions; identifying these IS-A relationships makes it possible to cluster the nouns roughly. Using these methods, two words may be considered
similar even if they do not share any word elements. Experiments show that this method
can learn from a very small training corpus, achieving over an 86% correct disambiguation
performance without any restriction on a word's senses.
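The general idea of judging words similar through the words they co-occur with, rather than through direct co-occurrence, can be sketched as follows; the verb-object pairs and the cosine measure are invented stand-ins for the paper's Korean resources and similarity measures.

```python
# A minimal sketch of similarity via shared co-occurrence contexts: two nouns
# that never co-occur with each other can still be judged similar if the verbs
# they each co-occur with overlap. The verb-object pairs are invented.
from collections import defaultdict
from math import sqrt

# (verb, object-noun) co-occurrence pairs, as might be extracted from a corpus.
pairs = [("drink", "coffee"), ("drink", "tea"), ("brew", "coffee"),
         ("brew", "tea"), ("drink", "water"), ("drive", "car"),
         ("park", "car"), ("drive", "truck")]

vectors = defaultdict(lambda: defaultdict(int))
for verb, noun in pairs:
    vectors[noun][verb] += 1   # profile of each noun: the verbs it occurs with

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# 'tea' and 'coffee' never co-occur with each other, yet share verb contexts.
print(round(cosine(vectors["tea"], vectors["coffee"]), 2))   # 1.0
print(round(cosine(vectors["tea"], vectors["car"]), 2))      # 0.0
```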