Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies
on structured knowledge, which is difficult to produce on a large scale. In this paper, we
propose using bootstrapping to solve this problem: a first topic analysis based on a weakly
structured source of knowledge, a collocation network, is used for learning explicit topic
representations that then support a more precise and reliable topic analysis.
This paper explores the effectiveness of index terms more complex than the single words used
in conventional information retrieval systems. Retrieval is done in two phases: in the first, a
conventional retrieval method (the Okapi system) is used; in the second, complex index terms
such as syntactic relations and single words with part-of-speech information are introduced
to rerank the results of the first phase. We evaluated the effectiveness of the different types
of index terms through experiments using the TREC-7 test collection and 50 queries. The
retrieval effectiveness was improved for 32 out of 50 queries. Based on this investigation,
we then introduce a method to select effective index terms by using a decision tree. Further
experiments with the same test collection showed that retrieval effectiveness was improved in
25 of the 50 queries.
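A second-phase rerank of this kind can be sketched as follows. The linear combination, the weights, and the toy annotated terms are illustrative assumptions, not the Okapi-based model or the term types evaluated in the paper:

```python
# Sketch of two-phase retrieval: a first-phase ranked list is reranked
# using overlap on "complex" index terms (POS-tagged words, head-modifier
# pairs). Weights and term representations are illustrative only.

def complex_terms(terms):
    # In a real system these come from a POS tagger / parser; here we
    # assume terms are already annotated, e.g. "bank/NOUN" or a
    # head-modifier pair like ("interest", "rate").
    return set(terms)

def rerank(first_phase, query_terms, alpha=0.7):
    """first_phase: list of (doc_id, base_score, doc_complex_terms)."""
    q = complex_terms(query_terms)
    reranked = []
    for doc_id, base, doc_terms in first_phase:
        overlap = len(q & complex_terms(doc_terms)) / max(len(q), 1)
        reranked.append((doc_id, alpha * base + (1 - alpha) * overlap))
    return sorted(reranked, key=lambda x: x[1], reverse=True)

query = {"interest/NOUN", ("interest", "rate")}
docs = [
    ("d1", 0.9, {"interest/VERB"}),                      # wrong sense
    ("d2", 0.8, {"interest/NOUN", ("interest", "rate")}),  # matches
]
print(rerank(docs, query))  # d2 now outranks d1
```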
Robustness has traditionally been stressed as a desirable property of any computational model and system. The human natural language interpretation faculty exhibits this property as the ability to deal with odd sentences. However, the difficulty of explaining robustness theoretically within linguistic modelling suggests adopting an empirical notion instead. In this paper, we propose an empirical definition of robustness based on the notion of performance. Furthermore, we present a framework for controlling parser robustness at design time. Control is achieved through two principles: modularisation, typical of software engineering practice, and the availability of domain-adaptable components. The methodology has been adopted in producing CHAOS, a pool of syntactic modules that has been used in real applications. This pool of modules enables a large-scale validation of the notion of empirical robustness, on the one hand, and of the design methodology, on the other, over different corpora and two languages (English and Italian).
We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence
data, and conducting syntactic disambiguation by using the acquired word classes.
We view the clustering problem as that of estimating a class-based probability distribution
specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the
Minimum Description Length (MDL) principle for estimating such a probability model. Our
clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza,
Lai and Mercer (1992). We next propose a syntactic disambiguation method which combines
the use of automatically constructed word classes and that of a hand-made thesaurus. The
overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably
with the accuracies obtained by state-of-the-art disambiguation methods.
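A minimal sketch of MDL-based clustering of this kind follows. The simplified class-based model, the parameter-counting penalty, and the toy greedy merge are illustrative assumptions, not the paper's exact code-length formulation or Brown et al.'s algorithm:

```python
from collections import Counter
from itertools import combinations
from math import log2

def description_length(pairs, classes):
    """Total code length (bits) of co-occurrence data `pairs` under a
    class-based model P(w1, w2) = P(c1, c2) P(w1|c1) P(w2|c2), plus a
    standard (parameters/2) * log2 N penalty for model complexity."""
    N = len(pairs)
    cpair = Counter((classes[a], classes[b]) for a, b in pairs)
    wtot = Counter(w for p in pairs for w in p)
    ctot = Counter()
    for w, n in wtot.items():
        ctot[classes[w]] += n
    data = 0.0
    for (a, b), n in Counter(pairs).items():
        p = (cpair[(classes[a], classes[b])] / N
             * wtot[a] / ctot[classes[a]]
             * wtot[b] / ctot[classes[b]])
        data -= n * log2(p)
    k = len(set(classes.values()))
    params = (k * k - 1) + (len(wtot) - k)
    return data + params / 2 * log2(N)

def greedy_merge(pairs, classes):
    """Repeatedly merge a pair of classes whenever the merge reduces
    the total description length; stop when no merge helps."""
    improved = True
    while improved:
        improved = False
        best = description_length(pairs, classes)
        for c1, c2 in combinations(set(classes.values()), 2):
            trial = {w: (c1 if c == c2 else c) for w, c in classes.items()}
            if description_length(pairs, trial) < best:
                classes, improved = trial, True
                break  # restart the scan from the merged partition
    return classes

pairs = [("eat", "apple"), ("eat", "pear"),
         ("drink", "water"), ("drink", "juice")] * 5
classes = {w: i for i, w in enumerate(sorted({w for p in pairs for w in p}))}
print(greedy_merge(pairs, classes))
```

The MDL trade-off is visible in the two terms: merging classes shrinks the model-cost term but can only increase the data-cost term, so merges stop once the likelihood loss outweighs the parameter savings.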
The TIPSTER Text Summarization Evaluation (SUMMAC) has developed several new
extrinsic and intrinsic methods for evaluating summaries. It has established definitively that
automatic text summarization is very effective in relevance assessment tasks on news articles.
Summaries as short as 17% of full text length sped up decision-making by almost a factor
of 2 with no statistically significant degradation in accuracy. Analysis of feedback forms
filled in after each decision indicated that the intelligibility of present-day machine-generated
summaries is high. Systems that performed most accurately in the production of indicative
and informative topic-related summaries used term frequency and co-occurrence statistics,
and vocabulary overlap comparisons between text passages. However, in the absence of a
topic, these statistical methods do not appear to provide any additional leverage: in the case
of generic summaries, the systems were indistinguishable in accuracy. The paper discusses
some of the tradeoffs and challenges faced by the evaluation, and also lists some of the
lessons learned, impacts, and possible future directions. The evaluation methods used in the
SUMMAC evaluation are of interest to both summarization evaluation as well as evaluation
of other ‘output-related’ NLP technologies, where there may be many potentially acceptable
outputs, with no automatic way to compare them.
Most of the morphological properties of derivational Arabic words are encapsulated in their
corresponding morphological patterns. The morphological pattern is a template that shows
how the word should be decomposed into its constituent morphemes (prefix + stem + suffix),
and at the same time, marks the positions of the radicals comprising the root of
the word. The number of morphological patterns in Arabic is finite and is well below 1000.
Due to these properties, most of the current analysis algorithms concentrate on discovering
the morphological pattern of the input word as a major step in recognizing the type and
category of the word. Unfortunately, this process is non-deterministic in the sense that the
underlying search may sometimes associate more than one morphological pattern
with a given word, all of them satisfying the major lexical constraints. One solution to this
problem is to use a collection of connectionist pattern associators that uniquely associate
each word with its corresponding morphological pattern. This paper describes an LVQ-based
learning pattern association system that uniquely maps a given Arabic word to its
corresponding morphological pattern, and therefore deduces its morphological properties.
The system consists of a collection of heteroassociative models trained using the
LVQ algorithm plus a collection of autoassociative models trained using
backpropagation. Experimental results show that the system is fairly accurate and very
easy to train. The LVQ algorithm was chosen because its training time is very small
compared to that of backpropagation.
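The LVQ1 update rule itself is simple: the nearest prototype moves toward a correctly classified example and away from a misclassified one. A toy sketch follows; the vectors, labels, and learning rate are illustrative, and the paper's actual encodings of Arabic words and patterns are not shown:

```python
import random

def lvq1_train(data, prototypes, lr=0.1, epochs=20, seed=0):
    """LVQ1: pull the nearest prototype toward a correctly classified
    example, push it away from a misclassified one.
    data: list of (vector, label); prototypes: list of [vector, label]."""
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, label in data:
            # nearest prototype by squared Euclidean distance
            p = min(prototypes,
                    key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p[0])))
            sign = 1.0 if p[1] == label else -1.0
            p[0] = [w + sign * lr * (a - w) for w, a in zip(p[0], x)]
    return prototypes

def classify(x, prototypes):
    return min(prototypes,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p[0])))[1]

# Toy data: two well-separated "patterns".
data = [([0.0, 0.1], "A"), ([0.1, 0.0], "A"),
        ([1.0, 0.9], "B"), ([0.9, 1.0], "B")]
protos = [[[0.2, 0.2], "A"], [[0.8, 0.8], "B"]]
lvq1_train(data, protos)
print(classify([0.05, 0.05], protos), classify([0.95, 0.95], protos))
# prints: A B
```

Each update touches only one prototype, which is why LVQ training is so much cheaper than backpropagation through a full network.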
We describe a system for contextually appropriate anaphor and pronoun generation for
Turkish. It uses binding theory and centering theory to model local and nonlocal references.
We describe the rules for Turkish, and their computational treatment. A cascaded method for
anaphor and pronoun generation is proposed for handling pro-drop and discourse constraints
on pronominalization. The system has been tested as a stand-alone nominal expression
generator, and also as a reference planning component of a transfer-based MT system.
This is the first issue of Volume 8, and we thought we would take this opportunity to
bring readers of Natural Language Engineering up to date with various developments
at the journal.
One of the main challenges in question-answering is the potential mismatch between the
expressions in questions and the expressions in texts. While humans appear to use inference
rules such as ‘X writes Y’ implies ‘X is the author of Y’ in answering questions, such rules
are generally unavailable to question-answering systems due to the inherent difficulty in constructing
them. In this paper, we present an unsupervised algorithm for discovering inference
rules from text. Our algorithm is based on an extended version of Harris’ Distributional
Hypothesis, which states that words that occurred in the same contexts tend to be similar.
Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a
parsed corpus. Essentially, if two paths tend to link the same set of words, we hypothesize
that their meanings are similar. We use examples to show that our system discovers many
inference rules easily missed by humans.
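The core idea can be sketched with a toy similarity over path instances. The filler sets, the Jaccard scoring, and the geometric mean used here are illustrative simplifications of the paper's actual slot-based similarity measure:

```python
def slot_fillers(instances):
    """instances: set of (x, y) filler pairs observed with a path.
    Returns the X-slot and Y-slot filler sets."""
    return {x for x, _ in instances}, {y for _, y in instances}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def path_similarity(inst1, inst2):
    # Extended distributional hypothesis: paths that tend to connect
    # the same sets of words tend to have similar meanings.
    # Score each slot separately, then take the geometric mean.
    x1, y1 = slot_fillers(inst1)
    x2, y2 = slot_fillers(inst2)
    return (jaccard(x1, x2) * jaccard(y1, y2)) ** 0.5

# "X writes Y" and "X is the author of Y" share many fillers;
# an unrelated path shares none.
writes = {("Tolstoy", "War and Peace"), ("Orwell", "1984"),
          ("Austen", "Emma")}
author_of = {("Tolstoy", "War and Peace"), ("Orwell", "1984"),
             ("Dickens", "Oliver Twist")}
solves = {("aspirin", "headache"), ("patch", "bug")}
print(path_similarity(writes, author_of))  # high
print(path_similarity(writes, solves))     # 0.0
```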
The Text REtrieval Conference (TREC) question answering track is an effort to bring the
benefits of large-scale evaluation to bear on a question answering (QA) task. The track has
run twice so far, first in TREC-8 and again in TREC-9. In each case, the goal was to retrieve
small snippets of text that contain the actual answer to a question rather than the document
lists traditionally returned by text retrieval systems. The best performing systems were able to
answer about 70% of the questions in TREC-8 and about 65% of the questions in TREC-9.
While the 65% score is a slightly worse result than the TREC-8 scores in absolute terms, it
represents a very significant improvement in question answering systems. The TREC-9 task
was considerably harder than the TREC-8 task because TREC-9 used actual users’ questions
while TREC-8 used questions constructed for the track. Future tracks will continue to
challenge the QA community with more difficult, and more realistic, question answering tasks.
We investigate the problem of complex answers in question answering. Complex answers
consist of several simple answers. We describe the online question answering system SHAPAQA,
and using data from this system we show that the problem of complex answers is quite
common. We define nine types of complex questions, and suggest two approaches, based on
answer frequencies, that allow question answering systems to tackle the problem.
As users struggle to navigate the wealth of on-line information now available, the
need for automated question answering systems becomes more urgent. We need
systems that allow a user to ask a question in everyday language and receive an
answer quickly and succinctly, with sufficient context to validate the answer. Current
search engines can return ranked lists of documents, but they do not deliver answers
to the user.
Question answering systems address this problem. Recent successes have been
reported in a series of question-answering evaluations that started in 1999 as part
of the Text Retrieval Conference (TREC). The best systems are now able to answer
more than two thirds of factual questions in this evaluation.
In this paper, we take a detailed look at the performance of components of an idealized
question answering system on two different tasks: the TREC Question Answering task
and a set of reading comprehension exams. We carry out three types of analysis: inherent
properties of the data, feature analysis, and performance bounds. Based on these analyses
we explain some of the performance results of the current generation of Q/A systems and
make predictions on future work. In particular, we present four findings: (1) Q/A system
performance is correlated with answer repetition; (2) relative overlap scores are more effective
than absolute overlap scores; (3) equivalence classes on scoring functions can be used to
quantify performance bounds; and (4) perfect answer typing still leaves a great deal of
ambiguity for a Q/A system because sentences often contain several items of the same type.
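Finding (2), relative versus absolute overlap, can be illustrated with a toy example. The sentences and the normalization by sentence length are assumptions for illustration, not the scoring functions studied in the paper:

```python
def absolute_overlap(question, sentence):
    return len(set(question) & set(sentence))

def relative_overlap(question, sentence):
    # Normalizing by sentence length penalizes long sentences that
    # match many question words only by chance.
    s = set(sentence)
    return len(set(question) & s) / len(s) if s else 0.0

q = "when did the pilgrims land".split()
short = "the pilgrims landed in 1620".split()
long_ = ("the museum opened when the city council did vote to fund "
         "the pilgrims exhibit and land surveys").split()

# Absolute overlap prefers the long distractor; relative overlap
# prefers the short answer-bearing sentence.
print(absolute_overlap(q, short), absolute_overlap(q, long_))
print(relative_overlap(q, short), relative_overlap(q, long_))
```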
The syntactic structure of a nominal compound must be analyzed before its semantic
interpretation. Syntactic analysis of nominal compounds is also very useful for
NLP applications such as information extraction, since a nominal compound often has a
linguistic structure similar to that of a simple sentence, while expressing the concrete,
composite meaning of an object through several combined nouns. In this paper, we present a novel model
for the structural analysis of nominal compounds that couples linguistic and statistical knowledge
through lexical information. That is, the syntactic relations defined between nouns
(complement-predicate and modifier-head) are obtained from large corpora and then
used to analyze the structures of nominal compounds and identify the underlying relations
between nouns. Experiments show that the model gives good results and can be used effectively
in application systems that do not require deep semantic information.
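For a three-noun compound, a structural analysis of this kind reduces to a bracketing decision driven by pairwise association strengths. A toy sketch follows; the scores and the adjacency-style decision rule are illustrative assumptions, not the paper's corpus-derived model:

```python
# Toy association scores standing in for corpus-derived syntactic
# relations between noun pairs (complement-predicate / modifier-head).
assoc = {
    ("computer", "science"): 0.9,
    ("computer", "department"): 0.2,
    ("science", "department"): 0.7,
}

def score(a, b):
    return assoc.get((a, b), 0.0)

def bracket(n1, n2, n3):
    # Adjacency-style rule: left-branch [[n1 n2] n3] if n1 associates
    # more strongly with n2 than n2 does with n3, else right-branch.
    if score(n1, n2) >= score(n2, n3):
        return ((n1, n2), n3)
    return (n1, (n2, n3))

print(bracket("computer", "science", "department"))
# prints: (('computer', 'science'), 'department')
```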
This paper presents a flexible bottom-up process for incrementally generating several versions of
the same text, building up the core text from its kernel version into other versions varying in
their level of detail. We devise a method for identifying the question/answer relations holding
between the propositions of a text, give rules for characterizing the kernel version of a text,
and provide a procedure, based on causal and temporal expansions of sentences, that
distinguishes these levels of detail semantically according to their importance. This assumes
that a stock of information from the interpreter's knowledge base is available.
The sentence expansion operation is formally defined according to three principles:
(1) the kernel principle ensures that the gist of the information is obtained; (2) the expansion principle defines
an incremental augmentation of a text; and (3) the subsumption principle defines an importance-based
order among the possible details of the information. The system developed allows users
to generate, in a follow-up fashion, their own text version meeting their expectations and
their demands expressed as questions about the text under consideration.
We describe two newly developed computational tools for morphological processing: a
program for analysis of English inflectional morphology, and a morphological generator,
automatically derived from the analyser. The tools are fast, being based on finite-state
techniques, have wide coverage, incorporating data from various corpora and machine-readable
dictionaries, and are robust, in that they are able to deal effectively with unknown words.
The tools are freely available. We evaluate the accuracy and speed of both tools and discuss
a number of practical applications in which they have been put to use.
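The sense in which a generator can be derived automatically from an analyser can be sketched by running a set of suffix rules in both directions. The rules below are crude illustrative toys, far simpler than the wide-coverage finite-state tools described:

```python
# Suffix rules: (inflected suffix, stem suffix, tag). Analysis strips
# the inflected suffix and restores the stem suffix; generation inverts
# the same rules, which is the sense in which a generator comes "for
# free" from the analyser.
RULES = [
    ("ies", "y", "pl"),    # ponies <-> pony
    ("xes", "x", "pl"),    # boxes  <-> box
    ("s",   "",  "pl"),    # cats   <-> cat
    ("ed",  "",  "past"),  # walked <-> walk
]

def analyse(word):
    for infl, stem_end, tag in RULES:
        if word.endswith(infl):
            return word[: len(word) - len(infl)] + stem_end, tag
    return word, "base"

def generate(stem, tag):
    # Apply the first rule for this tag whose stem condition matches.
    for infl, stem_end, t in RULES:
        if t == tag and (stem_end == "" or stem.endswith(stem_end)):
            return stem[: len(stem) - len(stem_end)] + infl
    return stem

print(analyse("ponies"))       # prints: ('pony', 'pl')
print(generate("pony", "pl"))  # prints: ponies
```

Real English inflection needs many more rules plus exception lists for irregular forms, which is exactly the lexical data the described tools draw from corpora and dictionaries.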
Generating text in a hypermedia environment places different demands on a text generation
system than non-interactive environments do. This paper describes some of these
demands, then shows how the architecture of one text generation system, ILEX, has been
shaped by them. The architecture is described in terms of the levels of linguistic representation
used, and the processes which map between them. Particular attention is paid to the processes
of content selection and text structuring.