
13 - Web Search

from Part 3 - Building Web Scale Applications

Published online by Cambridge University Press:  05 June 2012

Serge Abiteboul
Affiliation:
INRIA Saclay – Île-de-France
Ioana Manolescu
Affiliation:
INRIA Saclay – Île-de-France
Philippe Rigaux
Affiliation:
Conservatoire National des Arts et Métiers, Paris
Marie-Christine Rousset
Affiliation:
Université de Grenoble, France
Pierre Senellart
Affiliation:
Télécom ParisTech, France

Summary

The World Wide Web holds dozens of billions of freely accessible documents, and that number keeps growing. One of the major issues it raises is how to search these documents effectively and efficiently to find those that best suit a user's need. The purpose of this chapter is to describe the techniques at the core of today's search engines (such as Google, Bing, or Exalead), that is, mostly keyword search in very large collections of text documents. We also briefly touch upon other techniques and research issues that may be of importance in next-generation search engines.

This chapter is organized as follows. In Section 13.1, we briefly recall the Web and the languages and protocols it relies upon. Most of these topics have already been covered earlier in the book, and they are reintroduced here mostly to make the present chapter self-contained. We then present in Section 13.2 the techniques that can be used to retrieve pages from the Web, that is, to crawl it, and to extract text tokens from them. First-generation search engines, exemplified by AltaVista, mostly relied on the classical information retrieval (IR) techniques, applied to text documents, that are described in Section 13.3. The advent of the Web, and more generally the steady growth of document collections managed by institutions of all kinds, has led to extensions of these techniques. We address scalability issues in Section 13.3.3, with a focus on centralized indexing. Distributed approaches are investigated in Chapter 14.

Type
Chapter
Information
Web Data Management, pp. 247-286
Publisher: Cambridge University Press
Print publication year: 2011
