Clustering is an unsupervised process through which objects are classified into groups called clusters. In categorization problems, as described in Chapter IV, we are provided with a collection of preclassified training examples, and the task of the system is to learn the descriptions of classes in order to be able to classify a new unlabeled object. In the case of clustering, the problem is to group the given unlabeled collection into meaningful clusters without any prior information. Any labels associated with objects are obtained solely from the data.
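To make the contrast with categorization concrete, the following Python sketch groups a small unlabeled corpus into clusters. It is only an illustration under assumed tools: the toy documents, the TF-IDF representation, and scikit-learn's KMeans are our choices, not anything prescribed by the chapter.

# A minimal document-clustering sketch; no labels are used anywhere.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stocks fell as the central bank raised interest rates",
    "the bank reported record quarterly earnings",
    "the striker scored twice in the championship final",
    "the coach praised the team after the match",
]

# Represent each document as a TF-IDF vector.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Partition the unlabeled documents into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., [0 0 1 1]: finance stories vs. sports stories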
Clustering is useful in a wide range of data analysis fields, including data mining, document retrieval, image segmentation, and pattern classification. In many such problems, little prior information is available about the data, and the decision-maker must make as few assumptions about the data as possible. It is precisely for such cases that the clustering methodology is especially appropriate.
Clustering techniques are described in this chapter in the context of textual data analysis. Section V.1 discusses the various applications of clustering in text analysis domains. Sections V.2 and V.3 address the general clustering problem and present several clustering algorithms. Finally, Section V.4 demonstrates how the clustering algorithms can be adapted to text analysis.
CLUSTERING TASKS IN TEXT ANALYSIS
One application of clustering is the analysis and navigation of large text collections such as Web pages. The basic assumption, called the cluster hypothesis, states that relevant documents tend to be more similar to each other than to nonrelevant ones.
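The hypothesis is easy to illustrate with a crude similarity measure. In the sketch below, word-overlap (Jaccard) similarity stands in for whatever document similarity a real system would use; the documents and the measure are invented for illustration.

# Jaccard similarity: word-set overlap between two documents.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

relevant_1 = "insulin therapy for diabetes patients"
relevant_2 = "outcomes of insulin therapy in diabetes"
nonrelevant = "quarterly earnings of the software company"

# Two relevant documents overlap far more than a relevant and a
# nonrelevant one, which is what the cluster hypothesis predicts.
print(jaccard(relevant_1, relevant_2))   # comparatively high
print(jaccard(relevant_1, nonrelevant))  # zero here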
Core mining operations in text mining systems center on the algorithms that underlie the creation of queries for discovering patterns in document collections. This chapter describes most of the more common – and a few useful but less common – forms of these algorithms. Pattern-discovery algorithms are discussed primarily from a high-level definitional perspective. In addition, we examine the incorporation of background knowledge into text mining query operations. Finally, we briefly treat the topic of text mining query languages.
CORE TEXT MINING OPERATIONS
Core text mining operations consist of various mechanisms for discovering patterns of concept occurrence within a given document collection or subset of a document collection. The three most common types of patterns encountered in text mining are distributions (and proportions), frequent and near frequent sets, and associations.
Typically, when they offer the capability of discovering more than one type of pattern, text mining systems afford users the ability to toggle between displays of the different types of patterns for a given concept or set of concepts. This allows the richest possible exploratory access to the underlying document collection data through a browser.
Distributions
This section defines and discusses some of text mining's most commonly used distributions. We illustrate these in the context of a hypothetical text mining system whose document collection W is composed of newswire stories about world affairs, all of which have been preprocessed with concept labels.
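Although the distributions are defined formally in the chapter, the basic computations are simple. The sketch below assumes documents have already been reduced to sets of concept labels (a stand-in for the preprocessed collection W) and computes a concept proportion and a conditional concept proportion; the labels and values are invented.

# Documents represented as sets of concept labels after preprocessing.
W = [
    {"iran", "oil", "opec"},
    {"iraq", "oil"},
    {"usa", "elections"},
    {"iran", "usa", "diplomacy"},
]

def proportion(collection, concept):
    # Fraction of documents labeled with the concept.
    return sum(concept in doc for doc in collection) / len(collection)

def conditional_proportion(collection, concept, given):
    # Proportion of `concept` among documents labeled with `given`.
    sub = [doc for doc in collection if given in doc]
    return sum(concept in doc for doc in sub) / len(sub)

print(proportion(W, "oil"))                      # 0.5
print(conditional_proportion(W, "oil", "iran"))  # 0.5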
Effective text mining operations are predicated on sophisticated data preprocessing methodologies. Indeed, text mining is arguably so dependent on the various preprocessing techniques that infer or extract structured representations from raw unstructured data sources (or do both) that one might even say text mining is, to a degree, defined by these elaborate preparatory techniques. Certainly, preparing raw unstructured data for text mining requires very different preprocessing techniques from those traditionally encountered in knowledge discovery operations aimed at preparing structured data sources for classic data mining operations.
A large variety of text mining preprocessing techniques exist. All in some way attempt to structure documents – and, by extension, document collections. Quite commonly, different preprocessing techniques are used in tandem to create structured document representations from raw textual data. As a result, some typical combinations of techniques have evolved in preparing unstructured data for text mining.
Two clear ways of categorizing the totality of preparatory document structuring techniques are according to their task and according to the algorithms and formal frameworks that they use.
Task-oriented preprocessing approaches envision the process of creating a structured document representation in terms of tasks and subtasks and usually involve some sort of preparatory goal or problem that needs to be solved, such as extracting titles and authors from a PDF document. Other preprocessing approaches rely on techniques derived from formal methods for analyzing complex phenomena that can also be applied to natural language texts.
Many text mining systems introduced in the late 1990s were developed by computer scientists as part of academic “pure research” projects aimed at exploring the capabilities and performance of the various technical components making up these systems. Most current text mining systems, however – whether developed by academic researchers, commercial software developers, or in-house corporate programmers – are built to focus on specialized applications that answer questions peculiar to a given problem space or industry need. Obviously, such specialized text mining systems are especially well suited to solving problems in academic or commercial activities in which large volumes of textual data must be analyzed in making decisions.
Three areas of analytical inquiry have proven particularly fertile ground for text mining applications. In various areas of corporate finance, bankers, analysts, and consultants have begun leveraging text mining capabilities to sift through vast amounts of textual data with the aims of creating usable forms of business intelligence, noting trends, identifying correlations, and researching references to specific transactions, corporate entities, or persons. In patent research, specialists across industry verticals at some of the world's largest companies and professional services firms apply text mining approaches to investigating patent development strategies and finding ways to better exploit existing corporate patent assets. In the life sciences, researchers are exploring enormous collections of biomedical research reports to identify complex patterns of interaction between proteins.
This chapter discusses prototypical text mining solutions adapted for use in each of these three problem spaces.
Human-centered knowledge discovery places great emphasis on the presentation layer of systems used for data mining. All text mining systems built around a human-centric knowledge discovery paradigm must offer a user robust browsing capabilities as well as the ability to display dense and difficult-to-format patterns of textual data in ways that foster interactive exploration.
A robust text mining system should offer a user control over the shaping of queries by making search parameterization available through both high-level, easy-to-use GUI-based controls and direct, low-level, and relatively unrestricted query language access. Moreover, text mining systems need to offer a user administrative tools to create, modify, and maintain concept hierarchies, concept clusters, and entity profile information.
Text mining systems also rely, to an extraordinary degree, on advanced visualization tools. More on the full gamut of visualization approaches – from the relatively mundane to the highly exotic – relevant for text mining can be found in Chapter X.
BROWSING
Browsing is a term open to broad interpretation. With respect to text mining systems, however, it usually refers to the general front-end framework through which an end user searches, queries, displays, and interacts with embedded or middle-tier knowledge-discovery algorithms.
The software that implements this framework is called a browser. Beyond their ability to allow a user to (a) manipulate the various knowledge discovery algorithms they may operate and (b) explore the resulting patterns, most browsers also generally support functionality to link to some portion of the full text of documents underlying the patterns that these knowledge discovery algorithms may return.
Based on the outcome of the preprocessing stage, we can establish links between entities either by using co-occurrence information (within some lexical unit such as a document, paragraph, or sentence) or by using the semantic relationships between the entities as extracted by the information extraction module (such as family relations, employment relationship, mutual service in the army, etc.). This chapter describes the link analysis techniques that can be applied to results of the preprocessing stage (information extraction, term extraction, and text categorization).
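As a rough illustration of the co-occurrence route, the following sketch counts how often pairs of entities appear together within a lexical unit (here, a sentence); the per-sentence entity lists stand in for the output of a hypothetical upstream extraction module.

from collections import Counter
from itertools import combinations

# Entities found in each sentence by an assumed extraction step.
sentences = [
    ["Alice", "Acme Corp"],
    ["Alice", "Bob"],
    ["Bob", "Acme Corp", "Alice"],
]

links = Counter()
for entities in sentences:
    # Every unordered pair co-occurring in the sentence gains weight.
    for a, b in combinations(sorted(set(entities)), 2):
        links[(a, b)] += 1

for (a, b), weight in links.most_common():
    print(a, "--", b, "weight:", weight)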
A social network is a set of entities (e.g., people, companies, organizations, universities, countries) and a set of relationships between them (e.g., family relationships, various types of communication, business transactions, social interactions, hierarchy relationships, and shared memberships of people in organizations). Visualizing a social network as a graph enables the viewer to see patterns that were not evident before.
We begin with preliminaries from graph theory used throughout the chapter. We next describe the running example of the 9/11 hijackers' network, followed by a brief description of graph layout algorithms. After the concepts of paths and cycles in graphs are presented, the chapter proceeds with a discussion of the notion of centrality and the various ways of computing it. Various algorithms for partitioning and clustering nodes inside the network are then presented, followed by a brief description of finding specific patterns in networks. The chapter concludes with a presentation of three low-cost software packages for performing link analysis.
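To give a flavor of the centrality computations discussed in the chapter, here is a brief sketch using the networkx package on a made-up five-node network; the graph is not data from the chapter's running example.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),  # a tightly knit triangle
    ("c", "d"), ("d", "e"),              # d bridges the triangle to e
])

# Degree centrality: normalized number of direct ties.
print(nx.degree_centrality(G))
# Betweenness centrality: how often a node lies on shortest paths.
print(nx.betweenness_centrality(G))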
Human-centric text mining emphasizes the centrality of user interactivity to the knowledge discovery process. As a consequence, text mining systems need to provide users with a range of tools for interacting with data. For a wide array of tasks, these tools often rely on very simple graphical approaches such as pick lists, drop-down boxes, and radio boxes that have become typical in many generic software applications to support query construction and the basic browsing of potentially interesting patterns.
In large document collections, however, problems of pattern and feature overabundance have led the designers of text mining systems to move toward the creation of more sophisticated visualization approaches to facilitate user interactivity. Indeed, in document collections of even relatively modest size, tens of thousands of identified concepts and thousands of interesting associations can make browsing with simple visual mechanisms such as pick lists all but unworkable. More sophisticated visualization approaches incorporate graphical tools that rely on advances in many different areas of computer and behavioral science research to promote easier and more intensive and iterative exploration of patterns in textual data.
Many of the more mundane activities that allow a user of a text mining system to engage in rudimentary data exploration are supported by a graphical user interface that serves as the type of basic viewer or browser discussed in Chapter VII. A typical basic browsing interface can be seen in Figure X.1.
The related fields of NLP, IE, text categorization, and probabilistic modeling have developed increasingly rapidly in the last few years. New approaches are tried constantly, and new systems are reported by the thousands each year. The fields largely remain an experimental science: a new approach or improvement is conceived, and a system is built, tested, and reported. However, comparatively little work is done in analyzing the results and in comparing systems and approaches with each other. Usually it falls to the authors of a particular system to compare it with other known approaches, and this presents difficulties – both psychological and methodological.
One reason for the dearth of analytical work, apart from the general lack of sound theoretical foundations, is that comparison experiments require software, which is usually either impossible or very costly to obtain. Moreover, the software requires integration, adjustment, and possibly training for any new use, which is also extremely costly in terms of time and human labor.
Therefore, our description of the different possible solutions to the problems described in the first section is by necessity incomplete. There are simply too many reported systems, and there is often no good reason to choose one approach over another. Consequently, we have tried to describe in depth only a small number of systems, chosen to form as broad a selection as possible and to encompass many different approaches. And, of course, the results produced by these systems are state of the art or sufficiently close to it.
This appendix provides an example of a dedicated information extraction language called DIAL (Declarative Information Analysis Language). The purpose of the appendix is to show the general structure of the language and offer some code examples that demonstrate how it can be used to extract concepts and relationships; hence, we will not cover all aspects and details of the language.
The DIAL language is a dedicated information extraction language enabling the user to define concepts whose instances are found in a text body by the DIAL engine. A DIAL concept is a logical entity, which can represent a noun (such as a person, place, or institution), an event (such as a business merger between two companies or the election of a president), or any other entity for which a text pattern can be defined. Instances of concepts are found when the DIAL engine succeeds in matching a concept pattern to part of the text it is processing. Concepts may have attributes, which are properties belonging to the concept whose values are found in the text of the concept instance. For instance, a “Date” concept might have numeric day, month, and year attributes and a string attribute for the day of the week.
A DIAL concept declaration defines the concept's name, attributes, and optionally some additional code common to all instances of the concept.
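DIAL's own syntax is shown later in the appendix; as a language-neutral illustration of the idea, the Python sketch below mimics a "Date" concept whose instances are found by matching a pattern and whose attributes are filled from the matched text. This is an analogy we constructed, not DIAL code, and the pattern assumes a US month/day/year format.

import re
from dataclasses import dataclass

@dataclass
class Date:
    # Attributes of the concept, filled from the matched text.
    weekday: str
    day: int
    month: int
    year: int

PATTERN = re.compile(
    r"(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday),\s+"
    r"(\d{1,2})/(\d{1,2})/(\d{4})"
)

def find_dates(text: str):
    # Yield a concept instance for every match of the pattern.
    for m in PATTERN.finditer(text):
        yield Date(weekday=m.group(1), month=int(m.group(2)),
                   day=int(m.group(3)), year=int(m.group(4)))

for d in find_dates("The merger closed on Friday, 12/15/2006."):
    print(d)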
Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections.
Certainly, text mining derives much of its inspiration and direction from seminal research on data mining. Therefore, it is not surprising to find that text mining and data mining systems evince many high-level architectural similarities. For instance, both types of systems rely on preprocessing routines, pattern-discovery algorithms, and presentation-layer elements such as visualization tools to enhance the browsing of answer sets. Further, text mining adopts many of the specific types of patterns in its core knowledge discovery operations that were first introduced and vetted in data mining research.
Because data mining assumes that data have already been stored in a structured format, much of its preprocessing focus falls on two critical tasks: Scrubbing and normalizing data and creating extensive numbers of table joins. In contrast, for text mining systems, preprocessing operations center on the identification and extraction of representative features for natural language documents.
Several common themes frequently recur in many tasks related to processing and analyzing complex phenomena, including natural language texts. Among these themes are classification schemes, clustering, probabilistic models, and rule-based systems.
This section describes some of these techniques generally, and the next section applies them to the tasks described in Chapter VI.
Research has demonstrated that it is extremely fruitful to model the behavior of complex systems as some form of random process. Probabilistic models often show better accuracy and robustness to noise than categorical models. The ultimate reason for this is not quite clear and is an excellent subject for philosophical debate.
Nevertheless, several probabilistic models have turned out to be especially useful for the different tasks of extracting meaning from natural language texts. Most prominent among these probabilistic approaches are hidden Markov models (HMMs), stochastic context-free grammars (SCFGs), and maximum entropy (ME) models.
HIDDEN MARKOV MODELS
An HMM is a finite-state automaton with stochastic state transitions and symbol emissions (Rabiner 1990). The automaton models a probabilistic generative process. In this process, a sequence of symbols is produced by starting in an initial state, emitting a symbol selected by the state, making a transition to a new state, emitting a symbol selected by the state, and repeating this transition–emission cycle until a designated final state is reached.
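The generative cycle is short enough to write down directly. The following sketch samples symbol sequences from a toy HMM; the states, symbols, and probabilities are invented solely to make the transition-emission loop concrete.

import random

START = "Noun"
TRANSITIONS = {"Noun": [("Verb", 0.9), ("END", 0.1)],
               "Verb": [("Noun", 0.5), ("END", 0.5)]}
EMISSIONS = {"Noun": [("dog", 0.6), ("cat", 0.4)],
             "Verb": [("runs", 0.7), ("sleeps", 0.3)]}

def sample(pairs):
    outcomes, weights = zip(*pairs)
    return random.choices(outcomes, weights=weights)[0]

def generate():
    state, symbols = START, []
    while state != "END":                         # stop at the final state
        symbols.append(sample(EMISSIONS[state]))  # state emits a symbol
        state = sample(TRANSITIONS[state])        # stochastic transition
    return symbols

print(" ".join(generate()))  # e.g., "dog runs cat sleeps"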
The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, although the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.
This book presents a general theory of text mining along with the main techniques behind it. We offer a generalized architecture for text mining and outline the algorithms and data structures typically used by text mining systems.
The book is aimed at advanced undergraduate students, graduate students, academic researchers, and professional practitioners interested in complete coverage of the text mining field. We have included all the topics critical to people who plan to develop text mining systems or to use them.
A mature IE technology would allow rapid creation of extraction systems for new tasks whose performance would approach a human level. Nevertheless, even systems without near-perfect recall and precision can be of real value. In such cases, the results of the IE system would need to be fed into an auditing environment that allows auditors to fix the system's precision errors (an easy task) and recall errors (a much harder one). These types of systems would also be of value in cases in which the information is too vast for users to read all of it; hence, even a partially correct IE system would be preferable to the alternative of not obtaining any potentially relevant information. In general, IE systems are useful if the following conditions are met:
The information to be extracted is specified explicitly and no further inference is needed.
A small number of templates are sufficient to summarize the relevant parts of the document.
The needed information is expressed relatively locally in the text (see Bagga and Biermann 2000).
As a first step in tagging documents for text mining systems, each document is processed to find (i.e., extract) entities and relationships that are likely to be meaningful and content-bearing. The term relationships here denotes facts or events involving certain entities.
By way of example, a possible event might be a company's entering into a joint venture to develop a new drug.
Probably the most common theme in analyzing complex data is the classification, or categorization, of elements. Described abstractly, the task is to classify a given data instance into a prespecified set of categories. Applied to the domain of document management, the task is known as text categorization (TC) – given a set of categories (subjects, topics) and a collection of text documents, the process of finding the correct topic (or topics) for each document.
The study of automated text categorization dates back to the early 1960s (Maron 1961). At that time, its main projected use was for indexing scientific literature by means of controlled vocabulary. It was only in the 1990s that the field fully developed, with the availability of ever-increasing numbers of text documents in digital form and the necessity of organizing them for easier use. Nowadays automated TC is applied in a variety of contexts – from classical automatic or semiautomatic (interactive) indexing of texts to personalized delivery of advertisements, spam filtering, Web page categorization under hierarchical catalogues, automatic generation of metadata, detection of text genre, and many others.
As with many other artificial intelligence (AI) tasks, there are two main approaches to text categorization. The first is the knowledge engineering approach in which the expert's knowledge about the categories is directly encoded into the system either declaratively or in the form of procedural classification rules.
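A knowledge-engineered classifier can be as simple as a set of hand-written keyword rules. The sketch below is a deliberately naive illustration of the approach, with categories and rule terms invented for the example.

# Hand-encoded category knowledge: keyword rules per category.
RULES = {
    "sports": {"match", "team", "score", "coach"},
    "finance": {"stock", "earnings", "market", "bank"},
}

def classify(document: str) -> str:
    words = set(document.lower().split())
    # Choose the category whose rule terms overlap the document most.
    best = max(RULES, key=lambda c: len(RULES[c] & words))
    return best if RULES[best] & words else "unknown"

print(classify("The team celebrated after the match"))  # sports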