Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Since information discovery implies examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, such applications require IE systems to have breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as the synthesis of entity profiles and the extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about a person, organization, location, etc. within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, comprising grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and for applications utilizing the engine are presented.
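The abstract does not specify InfoXtract's actual template format, so the following is only an illustrative sketch of how an entity profile might consolidate aliases, relationships and events for a single entity within and across documents; all class and field names are hypothetical, not the authors' design.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an entity profile template. InfoXtract's real
# field inventory is not given in the abstract; names here are illustrative.
@dataclass
class EntityProfile:
    canonical_name: str
    entity_type: str                      # e.g. "PERSON", "ORGANIZATION", "LOCATION"
    aliases: set = field(default_factory=set)
    relationships: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def merge(self, other: "EntityProfile") -> None:
        """Consolidate another mention cluster (e.g. the same entity found
        in a second document) into this profile, pooling aliases,
        relationships and events."""
        self.aliases |= other.aliases | {other.canonical_name}
        self.relationships.extend(other.relationships)
        self.events.extend(other.events)

# Within one document, mentions and anaphoric references resolve to a single
# profile; a profile built from another document is then merged on top.
doc1 = EntityProfile("George W. Bush", "PERSON",
                     aliases={"Bush", "the president"},
                     relationships=[("affiliation", "Republican Party")])
doc2 = EntityProfile("President Bush", "PERSON",
                     events=[("visit", "Buffalo", "2003-04-24")])
doc1.merge(doc2)
```

The single consolidated record, rather than scattered per-sentence SVO triples, is what makes correlation with structured data straightforward.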
Attempting to automatically learn to identify verb complements from natural language corpora without the help of sophisticated linguistic resources such as grammars, parsers or treebanks leads to a significant amount of noise in the data. In machine learning terms, where learning from examples is performed using class-labelled feature-value vectors, noise leads to an imbalanced set of vectors: assuming that the class label takes two values (in this work, complement/non-complement), one class (complements) is heavily underrepresented in the data in comparison to the other. To overcome the drop in accuracy when predicting instances of the rare class caused by this disproportion, we balance the learning data by applying one-sided sampling to the training corpus, thereby reducing the number of non-complement instances. This approach has been used in the past in several domains (image processing, medicine, etc.) but not in natural language processing. To identify the examples that are safe to remove, we use the value difference metric, which proves more suitable for nominal attributes like the ones this work deals with than the Euclidean distance traditionally used in one-sided sampling. We experiment with different learning algorithms that have been widely used and whose performance is well known to the machine learning community: Bayesian learners, instance-based learners and decision trees. Additionally, we present and test a variation of Bayesian belief networks, the COr-BBN (class-oriented Bayesian belief network). Performance improves by up to 22% after balancing the dataset, reaching 73.7% f-measure for the complement class, having made use of only a phrase chunker and basic morphological information for preprocessing.
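As a rough illustration of the balancing step described above (not the authors' implementation), the sketch below computes the value difference metric over nominal attributes and drops majority-class instances whose VDM-nearest neighbour belongs to the minority class, in the spirit of the Tomek-link removal used in one-sided sampling. All function names and the toy feature representation are assumptions.

```python
from collections import Counter, defaultdict

def vdm_tables(X, y, classes):
    """Per-attribute conditional class distributions P(class | attribute=value)."""
    tables = []
    for a in range(len(X[0])):
        counts = defaultdict(Counter)
        for xi, yi in zip(X, y):
            counts[xi[a]][yi] += 1
        tables.append({v: {c: cnt[c] / sum(cnt.values()) for c in classes}
                       for v, cnt in counts.items()})
    return tables

def vdm(x1, x2, tables, classes):
    """Value Difference Metric between two vectors of nominal attributes:
    values are close if they condition similar class distributions."""
    d = 0.0
    for a, (v1, v2) in enumerate(zip(x1, x2)):
        p1, p2 = tables[a].get(v1, {}), tables[a].get(v2, {})
        d += sum((p1.get(c, 0.0) - p2.get(c, 0.0)) ** 2 for c in classes)
    return d

def one_sided_sample(X, y, minority):
    """Keep every minority-class instance; drop each majority-class instance
    whose VDM-nearest neighbour is a minority instance (borderline/noisy,
    as in Tomek-link removal)."""
    classes = sorted(set(y))
    tables = vdm_tables(X, y, classes)
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority:
            keep.append(i)
            continue
        nn = min((j for j in range(len(X)) if j != i),
                 key=lambda j: vdm(xi, X[j], tables, classes))
        if y[nn] != minority:       # safely inside the majority region: keep
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

For example, with nominal chunk-tag features, a 'non-complement' instance that looks identical to the 'complement' instances would be discarded, trimming the majority class toward balance.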
This paper describes CLIME, a web-based legal advisory system with a multilingual natural language interface. CLIME is a ‘proof-of-concept’ system which answers queries relating to ship-building and ship-operating regulations. Its core knowledge source is a set of such regulations encoded as a conceptual domain model and a set of formalised legal inference rules. The system supports retrieval of regulations via the conceptual model, and assessment of the legality of a situation or activity on a ship according to the legal inference rules. The focus of this paper is on the natural language aspects of the system, which help the user to construct semantically complex queries using WYSIWYM technology, allow the system to produce extended and cohesive responses and explanations, and support the whole interaction through a hybrid synchronous/asynchronous dialogue structure. Multilinguality (English and French) is viewed simply as interface localisation: the core representations are language-neutral, and the system can present extended or local interactions in either language at any time. The development of CLIME featured a high degree of client involvement, and the specification, implementation and evaluation of natural language components in this context are also discussed.
This paper describes DialogueView, a tool for annotating dialogues with utterance boundaries, speech repairs, speech act tags, and hierarchical discourse blocks. The tool provides three views of a dialogue: WordView, which shows the transcribed words time-aligned with the audio signal; UtteranceView, which shows the dialogue line-by-line as if it were a script for a movie; and BlockView, which shows an outline of the dialogue. The different views provide different abstractions of what is occurring in the dialogue. Abstraction helps users focus on what is important for different annotation tasks. For example, for annotating speech repairs, utterance boundaries, and overlapping and abandoned utterances, the tool provides the exact timing information. For coding speech act tags and hierarchical discourse structure, a broader context is created by hiding such low-level details, which can still be accessed if needed. We find that the different abstractions allow users to annotate dialogues more quickly without sacrificing accuracy. The tool can be configured to meet the requirements of a variety of annotation schemes.
I am honoured to address you as the new Executive Editor of the journal, a role I took on recently from Professor John Tait. As someone who, along with the other editors and members of the Editorial Board, has the responsibility for the overall quality of the journal, my main goal is to continue actively pursuing the journal objectives and to raise the standards even higher. These objectives are concerned with promoting applied natural language processing (NLP) research in the form of first-class original research and with bridging the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use.
A year ago in this column I wondered aloud whether 2007 was to be the year in which question-answering (QA) really took off in the commercial space. I was provoked to ask that question by the increasing number of Web-based QA systems that were portraying themselves as the Next Thing in search for the masses. There was, in particular, a lot of buzz around the $12.5 million funding deal announced by Powerset. The San Francisco-based company had gained exclusive access to parsing technology from PARC, but hadn't at that point displayed any of its wares to the general public. However, the company was inviting people to sign up for access to Powerset Labs, where, we were told, we would get the opportunity to be the first to play with the technology and to provide feedback to make it better. Since then, occasional screen images of the application have appeared in blog posts and other news items, and a small number of sightings have been claimed by bloggers who were granted privileged access. But Powerset Labs was finally launched at TechCrunch 40 in mid-September. The company's Web site says that they've begun to let people have access to the technology, and that they'll be ‘letting in the next wave of users as soon as possible’. My surf board is ready, but I'm not holding my breath. I signed up in June 2007 and haven't heard a thing since. Comments posted on the Powerset site suggest that there might be quite a few in line before me.