The relations of reducibility and convertibility were defined in Chapters 1 and 2 via contractions of redexes. The present chapter gives alternative definitions, via formal theories with axioms and rules of inference.
These theories will be used later in describing the correspondence between λ and CL precisely, and will help to make the distinction between syntax and semantics clearer in the chapters on models to come. They will also give a more direct meaning to such phrases as ‘add the equation M = N as a new axiom to the definition of = β … ’ (Corollary 3.11.1).
In books on logic, formal theories come in two kinds (at least): Hilbert-style and Gentzen-style. The theories in this chapter will be the former.
Notation 6.1 (Hilbert-style formal theories) A (Hilbert-style) formal theory T consists of three sets: formulas, axioms and rules (of inference). Each rule has one or more premises and one conclusion, and we shall write its premises above a horizontal line and its conclusion under this line; for examples, see the rules in Definition 6.2 below.
If Γ is a set of formulas, a deduction of a formula B from Γ is a tree of formulas, with those at the tops of branches being axioms or members of Γ, the others being deduced from those immediately above them by a rule, and the bottom one being B.
The discussion of models in the last chapter was almost too easy, so simple was the theory CLw. In contrast, the theory λβ has bound variables and rule (ξ), and these make its concept of model much more complex. This chapter will look at that concept from three different viewpoints. The definition of λ-model will be given in 15.3, and two other approaches will be described in Section 15B to help the reader understand the ideas lying behind this definition.
Notation 15.1 In this chapter we shall use the same notation as in 14.1, except that ‘term’ will now mean ‘λ-term’.
The identity-function on a set S will be called I_S here.
The composition, φ ∘ ψ, of given functions φ and ψ, is defined as usual by the equation

(φ ∘ ψ)(a) = φ(ψ(a)),

and its domain is {a : ψ(a) is defined and in the domain of φ}.
If S and S′ are sets, and functions φ : S → S′ and ψ : S′ → S satisfy
(a) ψ ∘ φ = I_S,
then ψ is called a left inverse of φ, and S is called a retract of S′ by φ and ψ, and the pair 〈φ, ψ〉 is called a retraction; see Figure 15:1.
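The definitions above can be made concrete with a small sketch. The sets and the particular maps below are invented for illustration; only the names φ (phi) and ψ (psi) and the defining property ψ ∘ φ = I_S come from the text.

```python
# A toy retraction <phi, psi>: S = {0, 1, 2} sits inside the larger
# set S' = {0, 1, ..., 9}. Both concrete sets are assumptions made
# for this example.

S = {0, 1, 2}
S_prime = set(range(10))

def phi(a):
    """Embed S into S'; here, simply the inclusion map."""
    assert a in S
    return a

def psi(b):
    """Map S' back onto S by clamping; a left inverse of phi."""
    assert b in S_prime
    return min(b, 2)

# psi after phi is the identity on S, so <phi, psi> is a retraction
# and S is a retract of S' by phi and psi.
for a in S:
    assert psi(phi(a)) == a
```

Note that φ ∘ ψ need not be the identity on S′ (here ψ collapses 3, ..., 9 to 2), which is exactly why ψ is only a left inverse.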
This Appendix was contributed by Carol Hindley to [HS86]. We believe its plain common-sense advice is still very valid despite changing fashions in care, and therefore reprint it here.
Combinators make ideal pets.
Housing They should be kept in a suitable axiom-scheme, preferably shaded by Böhm trees. They like plenty of scope for their contractions, and a proved extensionality is ideal for this.
Diet To keep them in strong normal form a diet of mixed free variables should be given twice a day. Bound variables are best avoided as they can lead to contradictions. The exotic R combinator needs a few Church numerals added to its diet to keep it healthy and active.
House-training If they are kept well supplied with parentheses, changed daily (from the left), there should be no problems.
Exercise They can be safely let out to contract and reduce if kept on a long corollary attached to a fixed point theorem, but do watch that they don't get themselves into a logical paradox while playing around it.
Discipline Combinators are generally well behaved but a few rules of inference should be enforced to keep their formal theories equivalent.
Health For those feeling less than weakly equal a check up at a nearby lemma is usually all that is required. In more serious cases a theorem (Church–Rosser is a good general one) should be called in.
Everything done so far has emphasized the close correspondence between λ and CL, in both motivation and results, but only now do we have the tools to describe this correspondence precisely. This is the aim of the present chapter.
The correspondence between the ‘extensional’ equalities will be described first, in Section 9B.
The non-extensional equalities are less straightforward. We have =β in λ-calculus and =w in combinatory logic, and despite their many parallel properties, these differ crucially in that rule (ξ) is admissible in the theory λβ but not in CLw. To get a close correspondence, we must define a new relation in CL to be like β-equality, and a new relation in λ to be like weak equality. The former will be done in Section 9D below. (An account of the latter can be found in [ÇH98].)
Notation 9.1 This chapter is about both λ- and CL-terms, so ‘term’ will never be used without ‘λ-’ or ‘CL-’.
For λ-terms we shall ignore changes of bound variables, and ‘M ≡α N’ will be written as ‘M ≡ N’. (So, in effect, the word ‘λ-term’ will mean ‘α-convertibility class of λ-terms’, i.e. the class of all λ-terms α-convertible to a given one.)
Define
Λ = the class of all (α-convertibility classes of) λ-terms,
Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
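The steps above can be sketched in a few lines. The toy document collection below is invented, and lowercasing stands in for the full range of linguistic preprocessing discussed later in the chapter.

```python
from collections import defaultdict

# Hypothetical toy collection; document IDs are list positions.
docs = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
]

def tokenize(text):
    # Step 2: chop the character stream into tokens (whitespace split).
    return text.split()

def normalize(token):
    # Step 3: a stand-in for linguistic preprocessing (lowercasing only).
    return token.lower()

# Step 4: record, for each term, the documents it occurs in.
index = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for token in tokenize(doc):
        index[normalize(token)].add(doc_id)

# Postings lists are conventionally kept sorted by document ID.
postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["sales"])  # -> [0, 1, 2]
print(postings["july"])   # -> [1, 2]
```

Keeping each postings list sorted is what later enables the fast merge-based intersection of Boolean retrieval.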
In this chapter, we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms that a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, and Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.
Document delineation and character sequence decoding
Obtaining the character sequence in a document
Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters.
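A minimal sketch of this decoding step: the same byte sequence yields different character sequences depending on the encoding assumed, so the encoding must be known (or guessed) before any tokenization can begin. The example string is invented.

```python
# Bytes as they might arrive from a file or web server.
raw = "Caf\u00e9".encode("utf-8")
print(raw)  # b'Caf\xc3\xa9'

# Decoding with the correct encoding recovers the characters.
print(raw.decode("utf-8"))    # Café

# Decoding with the wrong encoding produces mojibake.
print(raw.decode("latin-1"))  # CafÃ©
```

Real systems typically read a declared charset (e.g. from HTTP headers or an XML declaration) or apply a statistical encoding detector before this step.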
The meaning of the term information retrieval (IR) can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: “I'm sorry, I can only look up your order if you can give me your order ID”).
Information retrieval can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data that does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
On page 113, we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1, we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2, we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. Although latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval (IR), it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6, page 343). Understanding its full potential remains an area of active research.
Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter.
Linear algebra review
We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term–document matrix, all entries are in fact non-negative.
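A rank-k approximation of the kind used later in the chapter can be sketched with NumPy's singular value decomposition. The small count matrix below is invented; truncating the SVD to the k largest singular values is the standard construction.

```python
import numpy as np

# A hypothetical 5-term x 4-document count matrix C (values invented).
C = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Singular value decomposition: C = U diag(s) V^T, with s in
# non-increasing order.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Rank-k approximation C_k: keep only the k largest singular values.
k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.matrix_rank(C_k) <= k)        # True
print(round(np.linalg.norm(C - C_k, "fro"), 3))  # approximation error
```

By the Eckart–Young theorem, C_k is the best rank-k approximation to C in the Frobenius norm, which is the property latent semantic indexing relies on.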
Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore and computer and chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
If your standing query is just multicore and computer and chip, you will tend to miss many relevant new articles which use other terms such as multicore processors. To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore or multi-core) and (chip or processor or microprocessor).
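The refined query from this example can be sketched as a simple matching function. The one-line suffix-stripper below is a crude stand-in for a real stemmer, and the sample headlines are invented.

```python
import re

def stem(token):
    # Toy stemmer: strip a trailing 's' so 'processors' matches
    # 'processor'. A real system would use a proper stemming algorithm.
    return token[:-1] if token.endswith("s") else token

def tokens(text):
    return {stem(t) for t in re.findall(r"[a-z\-]+", text.lower())}

def matches(doc):
    # (multicore OR multi-core) AND (chip OR processor OR microprocessor)
    t = tokens(doc)
    return (("multicore" in t or "multi-core" in t)
            and ("chip" in t or "processor" in t or "microprocessor" in t))

print(matches("New multicore processors announced"))        # True
print(matches("Multi-core chip sales rise"))                # True
print(matches("Multicore programming tutorial published"))  # False
```

Running such a function over each day's newly added documents is exactly the periodic execution of a standing query described above.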
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to.