In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval (IR) systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions. Users often attempt to address this problem themselves by manually refining a query, as was discussed in Section 1.4; in this chapter, we discuss ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
The methods for tackling this problem split into two major classes: global methods and local methods. Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms. Global methods include:
Query expansion/reformulation with a thesaurus or WordNet (Section 9.2.2); a brief sketch follows this list
Query expansion via automatic thesaurus generation (Section 9.2.3)
Techniques like spelling correction (discussed in Chapter 3)
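To make the thesaurus-based route concrete, here is a minimal sketch (not the book's own method) of naive query expansion using NLTK's WordNet interface. The whitespace tokenization and the choice to take every synonym of every sense are simplifying assumptions; the output for "plane" illustrates exactly the sense-conflation problem described above.

# Minimal sketch of thesaurus-based query expansion via WordNet.
# Assumes NLTK and its 'wordnet' corpus are installed:
#   pip install nltk; python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def expand_query(query):
    """Return the query terms plus WordNet synonyms of each term."""
    expanded = set()
    for term in query.lower().split():       # naive whitespace tokenization
        expanded.add(term)
        for synset in wn.synsets(term):       # every sense of the term
            for lemma in synset.lemmas():     # synonyms within each sense
                expanded.add(lemma.name().replace('_', ' '))
    return expanded

# Mixes 'airplane'/'aeroplane' with woodworking and other senses,
# since no word-sense disambiguation is done:
print(expand_query("plane"))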
Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:
Relevance feedback (Section 9.1); a brief sketch follows this list
Pseudo-relevance feedback, also known as blind relevance feedback (Section 9.1.6)
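As a concrete illustration of relevance feedback, here is a minimal sketch of the classic Rocchio update (developed in Section 9.1), which moves the query vector toward the centroid of known relevant documents and away from the centroid of known nonrelevant ones. The dense-vector representation and the alpha/beta/gamma values are illustrative assumptions, not prescribed settings.

import numpy as np

def rocchio(query_vec, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback over term-weight vectors.
    relevant/nonrelevant are lists of document vectors of the same
    dimension as query_vec."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)     # pull toward relevant centroid
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0) # push away from nonrelevant centroid
    return np.maximum(q, 0.0)  # negative term weights are usually clipped to zero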
The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter, we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query. We begin by reviewing some basics of the Web as a graph in Section 21.1, then proceed to the technical development of the elements of link analysis for ranking.
Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations among them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from one web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter's tally of in-links. This phenomenon is referred to as link spam.
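The chapter develops such link-analysis scores in detail; as a preview, here is a minimal power-iteration sketch of one well-known score, PageRank, which rates a page not by its raw in-link count but by the recursively defined importance of the pages linking to it. The dense adjacency-matrix representation and the damping value are illustrative simplifications.

import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Power iteration for PageRank. adj[i][j] = 1 if page i links to
    page j. Dangling pages (no out-links) are treated as linking to
    every page uniformly."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    denom = np.where(out == 0, 1, out)
    P = np.where(out > 0, A / denom, 1.0 / n)  # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)                    # start from the uniform distribution
    while True:
        r_next = damping * r @ P + (1 - damping) / n  # follow a link, or teleport
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Example: pages 0 and 1 both link to page 2; page 2 links back to page 0.
# pagerank([[0, 0, 1], [0, 0, 1], [1, 0, 0]])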
We have seen in the preceding chapters many alternatives in designing an information retrieval (IR) system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? IR has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.
In this chapter, we begin with a discussion of measuring the effectiveness of IR systems (Section 8.1) and the test collections that are most often used for this purpose (Section 8.2). We then present the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results (Section 8.3). This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results (Section 8.4) and discuss developing reliable and informative test collections (Section 8.5).
We then step back to introduce the notion of user utility, and how it is approximated by the use of document relevance (Section 8.6). The key utility measure is user happiness; speed of response and the size of the index are among the factors that determine it.
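As a preview of the set-based measures of Section 8.3, here is a minimal sketch of precision, recall, and the balanced F measure for unranked retrieval; representing the result set and the relevance judgments as sets of document IDs is a simplifying assumption.

def precision_recall_f1(retrieved, relevant):
    """Set-based effectiveness measures for unranked retrieval.
    retrieved and relevant are sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                  # relevant documents retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean (balanced F)
    return precision, recall, f1

print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.666..., ~0.571)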
A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. The language modeling approach to information retrieval (IR) directly models that idea: A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas for document ranking which we saw in Section 6.2 (page 107). Instead of overtly modeling the probability P(R = 1|q, d) of relevance of a document d to a query q, as in the traditional probabilistic approach to IR (Chapter 11), the basic language modeling approach instead builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).
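Here is a minimal sketch of this scoring idea, assuming a unigram document model with Jelinek-Mercer smoothing against the collection model (one of the smoothing methods discussed later in the chapter); the mixing weight lam=0.5 is an illustrative choice, not a recommended setting.

import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_size, lam=0.5):
    """Score = log P(q | Md) under a smoothed unigram model:
    P(t|d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|C|.
    collection_counts maps term -> collection frequency; collection_size
    is the total number of tokens in the collection."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0
        p_coll = collection_counts.get(t, 0) / collection_size
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float('-inf')   # term unseen in document and collection
        score += math.log(p)
    return score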
In this chapter, we first introduce the concept of language models (Section 12.1) and then describe the basic and most commonly used language modeling approach to IR, the query likelihood model (Section 12.2). After some comparisons between the language modeling approach and other approaches to IR (Section 12.3), we finish by briefly describing various extensions to the language modeling approach (Section 12.4).
Language models
Finite automata and language models
What do we mean by a document model generating a query?
During the discussion of relevance feedback in Section 9.1.2, we observed that if we have some known relevant and nonrelevant documents, then we can straightforwardly start to estimate the probability of a term t appearing in a relevant document P(t|R = 1), and that this could be the basis of a classifier that decides whether documents are relevant or not. In this chapter, we more systematically introduce this probabilistic approach to information retrieval (IR), which provides a different formal basis for a retrieval model and results in different techniques for setting term weights.
Users start with information needs, which they translate into query representations. Similarly, there are documents, which are converted into document representations (the latter differing at least by how text is tokenized, but perhaps containing fundamentally less information, as when a nonpositional index is used). Based on these two representations, a system tries to determine how well documents satisfy information needs. In the Boolean or vector space models of IR, matching is done in a formally defined but semantically imprecise calculus of index terms. Given only a query, an IR system has an uncertain understanding of the information need. Given the query and document representations, a system has an uncertain guess of whether a document has content relevant to the information need. Probability theory provides a principled foundation for such reasoning under uncertainty.
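As a concrete taste of such estimates, here is a minimal sketch computing a term's relevance weight from judged documents, using the add-1/2 smoothed estimates that appear in the binary independence model developed in this chapter; representing each document as a set of terms is a simplifying assumption.

import math

def rsj_weight(term, relevant_docs, nonrelevant_docs):
    """Relevance weight of a term from judged documents, with add-1/2
    smoothing: p = P(term present | relevant),
               u = P(term present | nonrelevant),
               weight = log[ p(1-u) / (u(1-p)) ].
    Each document is a set of terms."""
    vr = sum(1 for d in relevant_docs if term in d)     # relevant docs containing term
    vn = sum(1 for d in nonrelevant_docs if term in d)  # nonrelevant docs containing term
    p = (vr + 0.5) / (len(relevant_docs) + 1)
    u = (vn + 0.5) / (len(nonrelevant_docs) + 1)
    return math.log(p * (1 - u) / (u * (1 - p)))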
Plato's writings are typically in the form of dialogues in which Socrates (born 469 BC) discusses philosophical questions with other characters of his day. Most of these are based on known historical figures, but the dialogues are not factual accounts; they are fictional, and often richly dramatic, products of Plato's philosophical imagination. The Symposium is a particularly dramatic work. It is set at the house of Agathon, a tragic poet celebrating his recent victory in 416 BC at one of the great dramatic festivals. Those present are amongst the intellectual elite of the day. They include an exponent of heroic poetry (Phaedrus), an expert in the laws of various Greek states (Pausanias), a representative of medical expertise (Eryximachus), a comic poet (Aristophanes) and a philosopher (Socrates). The guests participate in a symposium, a drinking party for aristocratic circles, on this occasion designed to honour Agathon's victory. Each guest delivers a speech in praise of eros, ‘passionate love’, or ‘desire’. The final speech is delivered by Alcibiades, a notorious associate of Socrates, who talks openly about his love for Socrates, in particular.
apollodorus: I believe I am quite well prepared to relate the events you are asking me about, for just the other day I happened to be going into Athens from my home in Phalerum when an acquaintance of mine caught sight of me from behind and called after me, jokily,
‘Phalerian! You there, Apollodorus! Wait for me, will you?’
So I stopped and waited.
‘I have just been looking for you, Apollodorus. I wanted to get from you the story about that party of Agathon's with Socrates, Alcibiades and the rest, the time when they were all together at dinner, and to hear what they said in their speeches on the subject of love. Someone else was telling me, who had heard about it from Phoenix, son of Philippus, and he said that you knew about it too. Actually he could not give any clear account of it, so you must tell me. You are in the best position to report the words of your friend. But tell me this first’, he went on. ‘Were you at that party yourself or not?’
‘It certainly looks as if your informant was rather confused’, I replied, ‘if you think the party you are asking about occurred recently enough for me to be there’.