
InfoXtract: A customizable intermediate level information extraction engine

Published online by Cambridge University Press: 01 January 2008

ROHINI K. SRIHARI
Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, State University of New York at Buffalo e-mail: rohini@janyainc.com
WEI LI
Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA e-mail: wei@janyainc.com
THOMAS CORNELL
Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA e-mail: cornell@janyainc.com
CHENG NIU
Affiliation:
Microsoft Research China, 5/F, Beijing Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing 100080, P.R.C. e-mail: cniu@microsoft.com

Abstract

Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Because information discovery involves examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, such applications require IE systems with breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as the synthesis of entity profiles and the extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about an entity (a person, organization, location, etc.) within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, which combines grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and for applications that use the engine are presented.
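
To make two of the tasks named in the abstract more concrete, the following is a minimal Python sketch of an entity profile that accumulates aliases, relationships, and events, and of normalizing a relative time expression against a document date. It is not InfoXtract's implementation or data model: the names `EntityProfile`, `merge_mention`, `normalize_time`, and `RELATIVE_DAY_OFFSETS`, the lookup-table approach, and the sample person and organization are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional, Set

# Hypothetical lookup table: a real system would cover far more expressions.
RELATIVE_DAY_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def normalize_time(expression: str, document_date: date) -> Optional[date]:
    """Resolve a relative time expression against the document date.

    Returns None when the expression is not in the illustrative table.
    """
    offset = RELATIVE_DAY_OFFSETS.get(expression.strip().lower())
    if offset is None:
        return None
    return document_date + timedelta(days=offset)


@dataclass
class EntityProfile:
    """Hypothetical single template that consolidates facts about one entity."""
    name: str
    entity_type: str  # e.g. "person", "organization", "location"
    aliases: Set[str] = field(default_factory=set)
    relationships: Dict[str, Set[str]] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)

    def merge_mention(
        self,
        mention: str,
        relationships: Optional[Dict[str, str]] = None,
        events: Optional[List[str]] = None,
    ) -> None:
        """Fold one mention (alias or anaphor) and its facts into the profile."""
        if mention != self.name:
            self.aliases.add(mention)
        for role, value in (relationships or {}).items():
            self.relationships.setdefault(role, set()).add(value)
        self.events.extend(events or [])


# Illustrative usage with made-up data.
profile = EntityProfile(name="Jane Doe", entity_type="person")
profile.merge_mention("Ms. Doe", relationships={"employer": "Example Corp"})
profile.merge_mention("she", events=["visited Buffalo"])
resolved = normalize_time("yesterday", date(2006, 1, 2))
```

The sketch assumes that coreference resolution has already decided which mentions belong to the same entity; that decision, along with disambiguating a location reference such as Buffalo, is the hard part and is not modeled here.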

Information

Type
Papers
Copyright
Copyright © Cambridge University Press 2006
