In Chapters 1 and 2, we developed the ideas underlying inverted indexes for handling Boolean and proximity queries. Here, we develop techniques that are robust to typographical errors in the query, as well as alternative spellings. In Section 3.1, we develop data structures that aid the search for terms in the vocabulary of an inverted index. In Section 3.2, we study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* seeks documents containing any of the terms automatic, automation, and automated.
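To make this concrete, a sorted vocabulary already supports a trailing wildcard such as automat* via binary search for the prefix range, and a general pattern such as *a*e*i*o*u* can be checked against each term with a regular expression. The Python sketch below is purely illustrative (the vocabulary and helper names are ours, not part of the chapter); Sections 3.1 and 3.2 develop the index structures that make such lookups efficient at scale.

import bisect
import re

def prefix_matches(vocab, prefix):
    # vocab must be sorted; binary search brackets the range of
    # terms beginning with the prefix
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, prefix + '\uffff')
    return vocab[lo:hi]

vocab = sorted(['automate', 'automated', 'automatic', 'automation', 'autumn'])
print(prefix_matches(vocab, 'automat'))
# -> ['automate', 'automated', 'automatic', 'automation']

# The wildcard *a*e*i*o*u* corresponds to the regex .*a.*e.*i.*o.*u.*
vowels_in_order = re.compile('.*a.*e.*i.*o.*u.*')
print([t for t in ['facetious', 'abstemious', 'automatic']
       if vowels_in_order.match(t)])
# -> ['facetious', 'abstemious']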
We then turn to other forms of imprecisely posed queries, focusing on spelling errors in Section 3.3. Users make spelling errors either by accident, or because the term they are searching for (e.g., Herman) has no unambiguous spelling in the collection. We detail a number of techniques for correcting spelling errors in queries, one term at a time as well as for an entire string of query terms. Finally, in Section 3.4 we study a method for seeking vocabulary terms that are phonetically close to the query term(s).
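One standard tool behind term-at-a-time correction is the edit (Levenshtein) distance between a misspelled query term and each vocabulary term, one of the techniques detailed in Section 3.3. Below is a minimal dynamic-programming sketch; the function and variable names are ours.

def edit_distance(s, t):
    # d[i][j] = minimum number of character insertions, deletions,
    # and substitutions needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance('herman', 'hermann'))  # -> 1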
Web crawling is the process by which we gather pages from the Web to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19, we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter, we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider.
The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
Features a crawler must provide
We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.
Thus far, we have dealt with indexes that support Boolean queries: A document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter, we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.
We introduce parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata, such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.
In Section 6.3, we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.
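As a preview, the sketch below computes tf-idf weights (raw term frequency times inverse document frequency) and ranks a toy collection by cosine similarity with a query. The corpus and function names are ours, and Section 6.4 discusses the many weighting variants.

import math
from collections import Counter

docs = [['car', 'insurance', 'auto'],
        ['car', 'car', 'mechanic'],
        ['best', 'auto', 'insurance']]
N = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequencies

def tfidf(terms):
    tf = Counter(terms)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if df[t] > 0}

def norm(v):
    return math.sqrt(sum(w * w for w in v.values())) or 1.0

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return dot / (norm(u) * norm(v))

q = tfidf(['auto', 'insurance'])
for i, d in enumerate(docs):
    print(i, round(cosine(q, tfidf(d)), 3))  # document 0 scores highest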
Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7 develops computational aspects of vector space scoring and related topics.
Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.
Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see, the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 237): Our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
The key input to a clustering algorithm is the distance measure. In Figure 16.1, the distance measure is distance in the two-dimensional (2D) plane.
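K-means, the flat clustering algorithm at the center of Chapter 16, is driven by exactly such a distance measure. The sketch below is a minimal 2D version with random seeding; production implementations add careful seeding and convergence tests, and the names here are ours.

import math
import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return clusters, centroids

points = [(0.1, 0.2), (0.2, 0.1), (4.0, 4.1), (4.2, 3.9), (8.0, 0.1)]
clusters, centroids = kmeans(points, 3)
print(centroids)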
Improving classifier effectiveness has been an area of intensive machine-learning research over the last two decades, and this work has led to a new generation of state-of-the-art classifiers, such as support vector machines, boosted decision trees, regularized logistic regression, neural networks, and random forests. Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification. An SVM is a kind of large-margin classifier: It is a vector-space–based machine-learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1), and then extend the model in Section 15.2 to nonseparable data, multiclass problems, and nonlinear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3: What sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine-learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4).
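As a toy illustration of the linear, separable case of Section 15.1, the snippet below trains a linear SVM with scikit-learn; the library choice, the data, and the parameter settings are our assumptions, not the chapter's.

# Assumes scikit-learn is installed.
from sklearn.svm import LinearSVC

X = [[0.0, 0.1], [0.2, 0.3], [0.9, 1.0], [1.0, 0.8]]  # training vectors
y = [0, 0, 1, 1]                                      # class labels
clf = LinearSVC(C=1.0)  # C trades margin width against training errors
clf.fit(X, y)
print(clf.decision_function([[0.5, 0.5]]))    # signed score w.r.t. the boundary
print(clf.predict([[0.1, 0.0], [0.9, 0.9]]))  # -> [0 1]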
Information retrieval (IR) systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text – by which we mean “raw” text without markup. Databases are designed for querying relational data, sets of records that have values for predefined attributes such as employee number, title, and salary. There are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language, as shown in Table 10.1.
Some highly structured text search problems are most efficiently handled by a relational database; for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query:
select lastname from employees where job_desc like 'invoic%';
may be sufficient to satisfy your information need with high precision and recall.
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging), and output from office suites like OpenOffice that save documents as marked-up text.
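Structured documents are commonly represented in XML. For concreteness, here is a toy XML document queried with Python's standard-library ElementTree; the markup scheme is invented for illustration and is not a structured query language of the kind discussed in this chapter.

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<play>'
    '  <title>Macbeth</title>'
    '  <act number="1">'
    '    <scene number="1">Thunder and lightning.</scene>'
    '  </act>'
    '</play>')
# A simple structural query: the title of any play whose first act
# contains at least one scene
for act in doc.iter('act'):
    if act.get('number') == '1' and act.find('scene') is not None:
        print(doc.findtext('title'))  # -> Macbeth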
As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval (IR) systems. Of course, in that time period, most people also used human travel agents to book their travel. However, during the last decade, relentless optimization of information retrieval effectiveness has driven web search engines to new quality levels at which most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows 2004) found that “92% of Internet users say the Internet is a good place to go for getting everyday information.” To the surprise of many, the field of information retrieval has moved from being a primarily academic discipline to being the basis underlying most people's preferred means of information access. This book presents the scientific underpinnings of this field, at a level accessible to graduate students as well as advanced undergraduates.
Information retrieval did not begin with the Web. In response to various challenges of providing information access, the field of IR evolved to give principled approaches to searching various forms of content. The field began with scientific publications and library records but soon spread to other forms of content, particularly those of information professionals, such as journalists, lawyers, and doctors.
Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering. Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in information retrieval (IR) are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM (cf. Section 16.4, page 335).
This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms, in Sections 17.2 through 17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.
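These four similarity measures are named directly in SciPy's hierarchical clustering routines, which makes a quick experiment easy; this assumes SciPy and NumPy are installed and is not the book's own code.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [4.2, 3.9]])
for method in ('single', 'complete', 'average', 'centroid'):
    Z = linkage(X, method=method)  # bottom-up merge tree (a dendrogram)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut into 2 clusters
    print(method, labels)  # the two tight pairs land in separate clusters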
Chapter 1 introduced the dictionary and the inverted index as the central data structures in information retrieval (IR). In this chapter, we employ a number of compression techniques for dictionary and inverted index that are essential for efficient IR systems.
One benefit of compression is immediately clear. We need less disk space. As we will see, compression ratios of 1:4 are easy to achieve, potentially cutting the cost of storing the index by 75%.
There are two more subtle benefits of compression. The first is increased use of caching. Search systems use some parts of the dictionary and the index much more than others. For example, if we cache the postings list of a frequently used query term t, then the computations necessary for responding to the one-term query t can be entirely done in memory. With compression, we can fit a lot more information into main memory. Instead of having to expend a disk seek when processing a query with t, we instead access its postings list in memory and decompress it. As we will see below, there are simple and efficient decompression methods, so that the penalty of having to decompress the postings list is small. As a result, we are able to decrease the response time of the IR system substantially. Because memory is a more expensive resource than disk space, increased speed owing to caching – rather than decreased space requirements – is often the prime motivator for compression.
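As a preview of how simple such decompression can be, here is a sketch of variable-byte encoding of postings gaps, one classic scheme of the kind developed later in this chapter; the helper names are ours.

def vb_encode(n):
    # 7 payload bits per byte; the high bit marks the final byte of a gap
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_decode(data):
    nums, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b
        else:
            nums.append(128 * n + (b - 128))
            n = 0
    return nums

gaps = [824, 5, 214577]  # gaps between successive docIDs in a postings list
encoded = b''.join(vb_encode(g) for g in gaps)
print(len(encoded), vb_decode(encoded))  # -> 6 [824, 5, 214577]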
In this chapter, we look at how to construct an inverted index. We call this process index construction or indexing; the process or machine that performs it is the indexer. The design of indexing algorithms is governed by hardware constraints. We therefore begin this chapter with a review of the basics of computer hardware that are relevant for indexing. We then introduce blocked sort-based indexing (Section 4.2), an efficient single-machine algorithm designed for static collections that can be viewed as a more scalable version of the basic sort-based indexing algorithm we introduced in Chapter 1. Section 4.3 describes single-pass in-memory indexing, an algorithm that has even better scaling properties because it does not hold the vocabulary in memory. For very large collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines. We discuss this in Section 4.4. Collections with frequent changes require dynamic indexing, introduced in Section 4.5, so that changes in the collection are immediately reflected in the index. Finally, we cover some complicating issues that can arise in indexing, such as security and indexes for ranked retrieval, in Section 4.6.
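The core sort-based idea fits in a few lines: emit (term, docID) pairs, sort them, and group equal terms into postings lists. The toy in-memory version below (our names, our toy collection) ignores the blocking, on-disk runs, and merging that make the real algorithms scale.

from itertools import groupby

docs = {1: 'new home sales top forecasts',
        2: 'home sales rise in july',
        3: 'new home construction'}

pairs = sorted((term, doc_id)
               for doc_id, text in docs.items()
               for term in text.split())
index = {term: sorted({d for _, d in group})
         for term, group in groupby(pairs, key=lambda p: p[0])}
print(index['home'])   # -> [1, 2, 3]
print(index['sales'])  # -> [1, 2]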
Index construction interacts with several topics covered in other chapters. The indexer needs raw text, but documents are encoded in many ways (see Chapter 2). Indexers compress and decompress intermediate files and the final index (see Chapter 5).
In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval (IR) systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions. Users often attempt to address this problem themselves by manually refining a query, as was discussed in Section 1.4; in this chapter, we discuss ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
The methods for tackling this problem split into two major classes: global methods and local methods. Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms. Global methods include:
Query expansion/reformulation with a thesaurus or WordNet (Section 9.2.2); a minimal sketch follows this list
Query expansion via automatic thesaurus generation (Section 9.2.3)
Techniques like spelling correction (discussed in Chapter 3)
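As a sketch of thesaurus-based expansion, the snippet below pulls synonyms from WordNet via NLTK; it assumes NLTK is installed with its WordNet data downloaded, and it is an illustration rather than the method of Section 9.2.2.

# Assumes: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def expand(term):
    # collect synonym lemmas from every WordNet synset of the term
    synonyms = {lemma.replace('_', ' ')
                for synset in wn.synsets(term)
                for lemma in synset.lemma_names()}
    synonyms.discard(term)
    return sorted(synonyms)

print(expand('car'))  # typically includes 'auto' and 'automobile'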
Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:
Relevance feedback (Section 9.1); a sketch of the classic Rocchio update follows this list
Pseudorelevance feedback, also known as blind relevance feedback (Section 9.1.6)
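Relevance feedback is classically implemented with the Rocchio update of Section 9.1: move the query vector toward the centroid of known relevant documents and away from the centroid of known nonrelevant ones. The weights and the NumPy representation below are illustrative choices.

import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q, 0.0)  # negative term weights are clipped to zero

q = np.array([1.0, 0.0, 0.0])
rel = np.array([[0.8, 0.6, 0.0], [0.9, 0.4, 0.1]])
nonrel = np.array([[0.0, 0.0, 1.0]])
print(rocchio(q, rel, nonrel))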
The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter, we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query. We begin by reviewing some basics of the Web as a graph in Section 21.1, then proceed to the technical development of the elements of link analysis for ranking.
Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations among them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from one web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter's tally of in-links. This phenomenon is referred to as link spam.
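One such link-analysis score, PageRank, is developed later in this chapter. Below is a minimal power-iteration sketch with teleportation; the toy graph, the damping value, and the names are ours.

def pagerank(links, d=0.85, iters=50):
    # links maps each node to the list of nodes it points to;
    # every linked node must also appear as a key
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in nodes}  # teleportation mass
        for u in nodes:
            if links[u]:
                share = d * pr[u] / len(links[u])
                for v in links[u]:
                    nxt[v] += share
            else:  # a dangling node spreads its mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr

print(pagerank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}))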
We have seen in the preceding chapters many alternatives in designing an information retrieval (IR) system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? IR has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.
In this chapter, we begin with a discussion of measuring the effectiveness of IR systems (Section 8.1) and the test collections that are most often used for this purpose (Section 8.2). We then present the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results (Section 8.3). This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results (Section 8.4) and discuss developing reliable and informative test collections (Section 8.5).
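To preview the unranked measures of Section 8.3, precision, recall, and the F1 measure can be computed directly from the sets of retrieved and relevant documents; a minimal sketch with our names:

def precision_recall_f1(retrieved, relevant):
    # retrieved, relevant: sets of document IDs
    tp = len(retrieved & relevant)  # relevant documents actually retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 5}))
# -> (0.5, 0.666..., 0.571...)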
We then step back to introduce the notion of user utility, and how it is approximated by the use of document relevance (Section 8.6). The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness.
A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. The language modeling approach to information retrieval (IR) directly models that idea: A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas for document ranking which we saw in Section 6.2 (page 107). Instead of overtly modeling the probability P(R = 1|q, d) of relevance of a document d to a query q, as in the traditional probabilistic approach to IR (Chapter 11), the basic language modeling approach instead builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).
In this chapter, we first introduce the concept of language models (Section 12.1) and then describe the basic and most commonly used language modeling approach to IR, the query likelihood model (Section 12.2). After some comparisons between the language modeling approach and other approaches to IR (Section 12.3), we finish by briefly describing various extensions to the language modeling approach (Section 12.4).
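To preview the query likelihood model of Section 12.2, the sketch below ranks documents by log P(q|Md), smoothing each document model with the collection model by linear interpolation; the mixing weight lam and the toy corpus are our choices.

import math
from collections import Counter

docs = {
    'd1': 'click go the shears boys click click click'.split(),
    'd2': 'metal here boys metal shears'.split(),
}
collection = [t for d in docs.values() for t in d]
cf, T = Counter(collection), len(collection)

def score(query, doc, lam=0.5):
    # log P(q|Md) with P(t|d) = lam*tf/|d| + (1-lam)*cf/T
    tf = Counter(doc)
    s = 0.0
    for t in query:
        p = lam * tf[t] / len(doc) + (1 - lam) * cf[t] / T
        if p == 0.0:
            return float('-inf')  # term unseen even in the collection
        s += math.log(p)
    return s

query = 'shears boys'.split()
print(sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))
# -> ['d2', 'd1'] (the shorter document concentrates its probability mass)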
Language models
Finite automata and language models
What do we mean by a document model generating a query?