So far, the discussion on distributed systems has been limited to data storage and to a few data management primitives (e.g., write(), read(), and search()). For real applications, one also needs to develop and execute more complex programs that process the available datasets and effectively exploit the available resources.
The naive approach, which consists in fetching all the required data at the Client and applying some processing locally, often loses out in a distributed setting. First, some processing capabilities may not be available locally. Moreover, centralizing all the information and then processing it would simply forgo the advantages brought by a powerful cluster of hundreds or even thousands of machines. We have to distribute the computation. One can consider two main scenarios for data processing in distributed systems.
Distributed processing and workflow: In the first scenario, an application holds large datasets and needs to apply to them some processing that is available on remote sites. In that case, the problem is to send the data to the appropriate locations and then to sequence the remote executions. This workflow scenario is typically implemented using Web services and some high-level coordination language.
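To make this concrete, here is a minimal Python sketch of such a sequenced invocation. The endpoints, payload format, and two-step pipeline are purely hypothetical; a real deployment would rely on a dedicated coordination language or workflow engine rather than hand-written calls.

```python
import requests

# Hypothetical endpoints of two remote services hosting the processing steps.
ANNOTATE_URL = "http://site-a.example.org/annotate"
AGGREGATE_URL = "http://site-b.example.org/aggregate"

def run_workflow(dataset: bytes) -> dict:
    # Step 1: ship the data to the first remote site and run its service there.
    step1 = requests.post(ANNOTATE_URL, data=dataset, timeout=60)
    step1.raise_for_status()

    # Step 2: forward the intermediate result to the second site, sequencing
    # the remote executions into a simple two-stage workflow.
    step2 = requests.post(AGGREGATE_URL, json=step1.json(), timeout=60)
    step2.raise_for_status()
    return step2.json()
```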
Distributed data and MapReduce: In the second scenario, the datasets are already distributed over a number of servers and, in contrast to the previous scenario, we “push” programs to these servers. Indeed, due to network bandwidth constraints, it is often more cost-effective to send a small piece of program from the Client to the servers than to transfer large data volumes to a single Client.
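The following sketch illustrates the idea under simplifying assumptions: the "program" pushed to the servers is just a pair of small functions (a mapper and a reducer, in the word-count style often used to introduce MapReduce), and the distributed partitions are simulated by local lists.

```python
from collections import defaultdict

# The small piece of program shipped to the servers holding the data:
def map_fn(document: str):
    # Emit (word, 1) pairs for each word of a document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word: str, counts):
    # Merge all counts emitted for a given word.
    return (word, sum(counts))

# Hypothetical data already partitioned across three servers (simulated here).
partitions = [
    ["the web is large", "data on the web"],
    ["distributed data management"],
    ["large scale data processing"],
]

# Each server applies map_fn locally to its own partition, so only the small
# intermediate pairs, not the raw data, need to travel over the network.
intermediate = defaultdict(list)
for partition in partitions:
    for doc in partition:
        for word, count in map_fn(doc):
            intermediate[word].append(count)

# The reduce phase merges the per-word counts into the final result.
result = dict(reduce_fn(w, c) for w, c in intermediate.items())
print(result)
```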