Besides languages for extracting information, such as XPath or XQuery, languages for transforming XML documents have been proposed. One of them, XSLT, is very popular. The goal of this PiP is to expose the reader to this aspect of XML and to languages based on tree-pattern rewriting. A presentation of XSLT is beyond the scope of this book. The reader can read the present PiP to get a feel for the standard tasks that are commonly performed with XSLT programs. Of course, realizing the project that is described requires a reasonable understanding of the language. Such an understanding can be obtained, for instance, from the companion Web site of the book at http://webdam.inria.fr/Jorge/. More references on XSLT may be found there.
XSLT is an XML transformation language. Its principles are quite different from those of XQuery, although the two may roughly serve the same purpose: accessing and manipulating XML content and producing an XML-formatted output. In practice, XQuery is used to extract pieces of information from XML documents, whereas XSLT is often used to restructure documents, typically for publishing them in different forms or dialects. We show in the present PiP chapter how XSLT can serve to write simple “wrappers” for XML pages. This takes us back to data integration: to integrate a number of data sources, the first step is typically to wrap them all into a uniform schema. Since most data sources now export XML, the wrapping technique considered here can be used in a wide variety of contexts.
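To make this concrete, here is a minimal sketch (ours, not the project described in this PiP) of how an XSLT stylesheet can be applied programmatically; it assumes the Python lxml library, and the stylesheet, element names, and sample document are purely illustrative.

```python
# Minimal sketch: applying an XSLT "wrapper" to an XML page with lxml.
# The stylesheet and element names below are illustrative only.
from lxml import etree

xslt_doc = etree.XML(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Rewrite every source <item> into a uniform <entry> element -->
  <xsl:template match="/catalog">
    <entries>
      <xsl:apply-templates select="item"/>
    </entries>
  </xsl:template>
  <xsl:template match="item">
    <entry><xsl:value-of select="@name"/></entry>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt_doc)
source = etree.XML(b'<catalog><item name="a"/><item name="b"/></catalog>')
print(etree.tostring(transform(source), pretty_print=True).decode())
```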
In this chapter, we learn how to build an evaluation engine for tree-pattern queries, using the SAX (Simple API for XML) programming model. We thereby follow a dual goal: (i) improve our understanding of XML query languages and (ii) become familiar with SAX, a stream parser for XML, with an event-driven API. Recall that the main features of SAX were presented in Section 1.4.2.
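As a first taste of the event-driven model, the following minimal sketch (ours) uses Python's xml.sax module, the Python counterpart of the Java SAX API recalled in Section 1.4.2, to echo the start-element, character, and end-element events of a small document.

```python
# Minimal sketch of SAX event-driven parsing (Python's xml.sax;
# the book's presentation relies on the Java SAX API).
import xml.sax

class EchoHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def startElement(self, name, attrs):
        print("  " * self.depth + "<" + name + ">")
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        print("  " * self.depth + "</" + name + ">")

    def characters(self, content):
        if content.strip():
            print("  " * self.depth + content.strip())

xml.sax.parseString(b"<a><b>hello</b><c/></a>", EchoHandler())
```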
TREE-PATTERN DIALECTS
We will consider tree-pattern languages of increasing complexity. We introduce them in this section.
C-TP This is the dialect of conjunctive tree-patterns. A C-TP is a tree in which each node is labeled either with an XML element name or with an XML attribute name. C-TP nodes corresponding to attributes are distinguished by prefixing them with @ (e.g., @color). Each node has zero or more children, connected by edges that are labeled either / (with the semantics of child) or // (with the semantics of descendant). Finally, the nodes that one wants to be returned are marked.
As an example, Figure 6.1 shows a simple XML document d where each node is annotated with its preorder number. (Recall the definition of this numbering from Section 4.2.) Figure 6.2 shows a C-TP pattern denoted t1 and the three tuples resulting from “matchings” of t1 into d.
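As a rough illustration of how a C-TP could be represented in a program (a sketch of ours; the field names and the sample pattern are not taken from the book), each node may carry its label, the axis connecting it to its parent, a flag indicating whether its matches are returned, and its children:

```python
# Rough sketch of a C-TP node representation (field names are ours):
# label is an element name or @attribute, axis is "/" (child) or
# "//" (descendant), returned marks the nodes whose matches are output.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TPNode:
    label: str                      # e.g. "book" or "@color"
    axis: str = "/"                 # "/" = child edge, "//" = descendant edge
    returned: bool = False          # True if matches of this node are returned
    children: List["TPNode"] = field(default_factory=list)

# A made-up example pattern: //book with a returned @year attribute
# and a returned title child.
pattern = TPNode("book", axis="//", children=[
    TPNode("@year", returned=True),
    TPNode("title", returned=True),
])
```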
With dozens of billions of freely accessible documents, and a size that keeps increasing, one of the major issues raised by the World Wide Web is that of searching effectively and efficiently through these documents to find those that best suit a user's need. The purpose of this chapter is to describe the techniques that are at the core of today's search engines (such as Google, Bing, or Exalead), that is, mostly keyword search in very large collections of text documents. We also briefly touch upon other techniques and research issues that may be of importance in next-generation search engines.
This chapter is organized as follows. In Section 13.1, we briefly recall the Web and the languages and protocols it relies upon. Most of these topics have already been covered earlier in the book, and their introduction here is mostly intended to make the present chapter self-contained. We then present in Section 13.2 the techniques that can be used to retrieve pages from the Web, that is, to crawl it, and to extract text tokens from them. First-generation search engines, exemplified by AltaVista, mostly relied on the classical information retrieval (IR) techniques, applied to text documents, that are described in Section 13.3. The advent of the Web, and more generally the steady growth of the document collections managed by institutions of all kinds, has led to extensions of these techniques. We address scalability issues in Section 13.3.3, with a focus on centralized indexing. Distributed approaches are investigated in Chapter 14.
So far, the discussion on distributed systems has been limited to data storage and to a few data management primitives (e.g., write(), read(), or search()). For real applications, one also needs to develop and execute more complex programs that process the available data sets and effectively exploit the available resources.
The naive approach, which consists in shipping all the required data to the Client in order to process it locally, often loses in a distributed setting. First, some processing may not be available locally. Moreover, centralizing all the information and then processing it would simply forgo all the advantages brought by a powerful cluster of hundreds or even thousands of machines. We have to use distribution. One can consider two main scenarios for data processing in distributed systems.
Distributed processing and workflow: In the first scenario, an application holds large data sets and needs to apply to them some processes that are available on remote sites. When this is the case, the problem is to send the data to the appropriate locations and then to sequence the remote executions. This workflow scenario is typically implemented using Web services and some high-level coordination language.
Distributed data and MapReduce: In the second scenario, the data sets are already distributed over a number of servers and, in contrast to the previous scenario, we “push” programs to these servers. Indeed, due to network bandwidth issues, it is often more cost-effective to send a small piece of program from the Client to the servers than to transfer large data volumes to a single Client. A toy sketch of this idea follows.
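The following toy sketch (ours; it ignores all distribution machinery) illustrates the “push the program to the data” idea on a word count: each simulated server applies a map function to its local chunk, and only the small partial counts are gathered and reduced.

```python
# Toy sketch of "push the program to the data": a word count where each
# server runs map_chunk() on its local chunk and only the small partial
# counts travel over the network to be reduced.
from collections import Counter

def map_chunk(chunk: str) -> Counter:
    """Runs on the server that stores the chunk."""
    return Counter(chunk.split())

def reduce_counts(partials) -> Counter:
    """Runs wherever the (small) partial counts are gathered."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Data already distributed over three (simulated) servers.
chunks = ["a rose is a rose", "is a rose", "a rose"]
print(reduce_counts(map_chunk(c) for c in chunks))
```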
This chapter proposes some exercises and projects to manipulate and query XML documents in a practical context. The software used in these exercises is eXist, an open-source native XML database that provides an easy-to-use and powerful environment for learning and applying XML languages. We begin with a brief description of how to install eXist and execute some simple operations. eXist provides a graphical interface that is pretty easy to use, so we limit our explanations below to the vital information that can save the absolute beginner some time.
PREREQUISITES
In the following, we assume that you plan to install eXist in your Windows or Linux environment. You need a Java Development Kit (JDK) for running the eXist Java application (version 1.5 at least). If you do not have a JDK already installed, get it from the Sun site (try searching “download JDK 1.5” with Google to obtain an appropriate URL) and follow the instructions to set up your Java environment.
Be sure that you can execute Java applications. This requires the definition of a JAVA_HOME environment variable, pointing to the JDK directory. The PATH variable must also contain an entry for the directory that contains the Java executables, $JAVA_HOME/bin.
Under Windows: Open the Control Panel; run the System application; choose Advanced and then Environment Variables. Create a new variable JAVA_HOME with the appropriate location, and add the %JAVA_HOME%\bin path to the PATH variable.
The Web is a medium of primary interest for companies that change their organization to place it at the core of their operations. It is an easy but boring task to list areas where the Web can be usefully leveraged to improve the functionalities of existing systems. One can cite in particular B2B and B2C (business-to-business or business-to-customer) applications, G2B and G2C (government-to-business or government-to-customer) applications, and digital libraries. Such applications typically require some form of typing to represent data, because they consist of programs that have difficulty dealing with plain HTML text. The exchange and exploitation of business information likewise call for a more powerful approach to Web data management.
This motivated the introduction of a semistructured data model, namely XML, that is well suited to both humans and machines. XML describes content and promotes machine-to-machine communication and data exchange. The design of XML relies on two major goals. First, it is designed as a generic data format, apt to be specialized for a wide range of data usages. In the XML world, for instance, XHTML is seen as a specialized XML dialect for data presentation by Web browsers. Second, XML “documents” are meant to be easily and safely transmitted over the Internet, by including in particular a self-description of their encoding and content.
XML is the language of choice for a generic, scalable, and expressive management of Web data.
This chapter proposes exercises to manipulate and query real-world RDFS ontologies, and especially Yago. Yago was developed at the Max Planck Institute in Saarbrücken in Germany. At the time of this writing, it is the largest ontology of human quality that is freely available. It contains millions of entities such as scientists, and millions of facts about these entities such as where a particular scientist was born. Yago also includes knowledge about the classes and relationships composing it (e.g., a hierarchy of classes and relationships).
EXPLORING AND INSTALLING YAGO
Go to the Yago Web site, http://mpii.de/yago, click on the Demo tab and start the textual browser. This browser allows navigating through the Yago ontology.
Type “Elvis Presley” in the box. Then click on the Elvis Presley link. You will see all properties of Elvis Presley, including his biographic data and his discography.
You can follow other links to explore the ontology. Navigate to the wife of Elvis Presley, Priscilla.
The ontology is held together by a taxonomy of classes. Its top class is called “entity”. Verify this by repeatedly following type and subClassOf links.
Go back to Elvis Presley. Navigate to one of his songs. You will see the date the song was recorded. Can you find all songs together with their record dates? Why would this be a tedious endeavor?
Then, to install Yago on your machine, make sure that you have Java installed and around 5 GB free disk space.
This chapter is an introduction to very large data management in distributed systems. Here, “very large” means a context where gigabytes (1,000 MB = 10^9 bytes) constitute the unit size for measuring data volumes. Terabytes (10^12 bytes) are commonly encountered, and many Web companies and scientific or financial institutions must deal with petabytes (10^15 bytes). In the near future, we can expect exabyte (10^18 bytes) data sets, with the worldwide digital universe roughly estimated (in 2010) at about 1 zettabyte (10^21 bytes).
Distribution is the key to handling very large data sets. Distribution is necessary (but not sufficient) to bring scalability (i.e., the means of maintaining stable performance for steadily growing data collections by adding new resources to the system). However, distribution brings a number of technical problems that make the design and implementation of distributed storage, indexing, and computing a delicate issue. A prominent concern is the risk of failure. In an environment that consists of hundreds or thousands of computers (a common setting for large Web companies), component failures (hardware, network, local systems, disks) are very common, and the system must be ready to cope with them at any moment.
Our presentation covers principles and techniques that recently emerged to handle Web-scale data sets. We examine the extension of traditional storage and indexing methods to large-scale distributed settings. We describe techniques to efficiently process point queries that aim at retrieving a particular object.
In this chapter, we discuss the typing of semistructured data. Typing is the process of describing, with a set of declarative rules or constraints called a schema, a class of XML documents, and verifying that a given document is valid for that class (we also say that this document is valid against the type defined by the schema). This is, for instance, used to define a specific XML vocabulary (XHTML, MathML, RDF, etc.), with its specificities in structure and content, that is used for a given application.
We first present motivations and discuss the kind of typing that is needed. XML data typing is typically based on finite-state automata. Therefore, we recall basic notions of automata, first on words, then on ranked trees, and finally on unranked trees (i.e., essentially on XML). We also present the two main practical languages for describing XML types, DTDs and XML Schema, both of which are endorsed by the W3C. We then briefly describe alternative schema languages with their key features. In a last section, we discuss the typing of graph data.
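To give a flavor of automata-based typing (a sketch of ours, not the actual DTD or XML Schema machinery), a content model such as title, author+, year? is a regular expression over element names, so validating the children of an element amounts to running a finite-state automaton on the sequence of their names:

```python
# Sketch (ours): the content model "title, author+, year?" viewed as a
# regular expression over element names; validating the children of an
# element is running a finite-state automaton on the sequence of names.
import re
import xml.etree.ElementTree as ET

# Element names separated by spaces, whole sequence must match.
CONTENT_MODEL = re.compile(r"^title (author )+(year )?$")

def valid_book(book: ET.Element) -> bool:
    children = "".join(child.tag + " " for child in book)
    return CONTENT_MODEL.match(children) is not None

ok = ET.fromstring("<book><title/><author/><author/><year/></book>")
bad = ET.fromstring("<book><author/><title/></book>")
print(valid_book(ok), valid_book(bad))   # True False
```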
One can also consider the issue of “type checking a program,” that is, verifying that if the input is of a proper input type, the program produces an output that is of a proper output type. In Section 3.5, we provide references to works on program type checking in the context of XML.
MOTIVATING TYPING
Perhaps the main difference with typing in relational systems is that typing is not compulsory for XML.
Mashups are Web applications that integrate and combine data from multiple Web sources to present them in a new way to a user. This chapter shows two different ways to construct mashup applications in practice: Yahoo! Pipes, a graphical user interface for building mashups, and XProc, a W3C language for describing workflows of transformations over XML documents. Pros and cons of either approach will be made clear as one follows the indicated steps. The goal will be to present information about news events, each event being accompanied by its localization displayed on a map. For that purpose, we integrate three sources of information:
A Web feed about current events in the world, in RSS format (e.g., CNN's top stories at http://rss.cnn.com/rss/edition.rss). Any such RSS feed is fine, though English is preferable to ensure precision of the geolocalization.
A geolocalization service. We use information from the GeoNames geographical database, and specifically their RSS to Geo RSS converter, whose API is described at http://www.geonames.org/rss-to-georss-converter.html.
A mapping service. We use Yahoo! Maps.
YAHOO! PIPES: A GRAPHICAL MASHUP EDITOR
Yahoo! Pipes allows creating simple mashup applications (simply called pipes) using a graphical interface based on the construction of a pipeline of boxes connected to each other, each box performing a given operation (fetching information, annotating it, reorganizing it, etc.) until the final output of the pipeline. It can be used by nonprogrammers, though defining complex mashups still requires skill and experience with the platform.
This chapter proposes an introduction to recommendation techniques and suggests some exercises and projects. We do not present a recommendation system in particular but rather focus on the general methodology. As an illustrative example, we will use the MovieLens data set to construct movie recommendations.
The chapter successively introduces recommendation, user-based collaborative filtering, and item-based collaborative filtering. It discusses different parameterizations of these methods and evaluates their results with respect to the quality of the data set. We show how to generate recommendations using SQL queries on the MovieLens data set. Finally, we suggest some projects for students who want to investigate the realm of recommendation systems further.
INTRODUCTION TO RECOMMENDATION SYSTEMS
Given a set of ratings of items by a set of users, a recommendation system produces a list of items for a particular user, possibly in a given context. Such systems are widely used in Web applications. For example, content sites like Yahoo! Movies (movies), Zagat (restaurants), LibraryThing (books), Pandora (music), and StumbleUpon (Web sites) suggest a list of items of interest by predicting the ratings of their users. E-commerce sites such as Amazon (books) or Netflix (movies) use recommendations to suggest new products to their users and construct bundle sales. Usually, they exploit the recent browsing history as a limited context. Finally, advertisement companies need to find a list of advertisements targeted for their users. Some of them, like Google AdSense, rely more on the context (e.g., keywords) than on an estimation of the user's taste based on her/his recent browsing history.
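To make the prediction idea concrete, here is a minimal user-based collaborative filtering sketch (ours, in Python rather than the SQL used later in the chapter): cosine similarity between users, then a similarity-weighted average of the neighbors' ratings. The tiny rating matrix is made up.

```python
# Minimal user-based collaborative filtering sketch: cosine similarity
# between users, then a similarity-weighted average of the neighbors'
# ratings for the target item. The tiny rating matrix is made up.
from math import sqrt

ratings = {                       # user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m4": 5},
    "carol": {"m2": 5, "m3": 2, "m4": 1},
}

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den

def predict(user, item):
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None

print(predict("alice", "m4"))     # predicted rating of movie m4 for alice
```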
In the large-scale file systems presented in Chapter 14, search operations are based on a sequential scan that accesses the whole data set. When it comes to finding a specific object, typically a tiny part of the data volume, direct access is much more efficient than a linear scan. The object is directly obtained using its physical address, which may simply be the offset of the object's location with respect to the beginning of the file, or possibly a more sophisticated addressing mechanism.
An index on a collection C is a structure that maps the key of each object in C to its (physical) address. At an abstract level, it can be viewed as a set of pairs (k,a), called entries, where k is a key and a the address of an object. For the purpose of this chapter, an object is seen as raw (unstructured) data, its structure being of concern to the Client application only. You may want to think, for instance, of a relational tuple, an XML document, a picture or a video file. It may be the case that the key uniquely determines the object, as for keys in the relational model.
An index we consider here supports at least the following operations, which we hereafter call the dictionary operations:
Insertion insert(k,a),
Deletion delete(k),
Key search search(k): a.
If the keys can be linearly ordered, an index may also support range queries of the form range(k1,k2) that retrieve all the keys (and their addresses) in that range.
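As a baseline, the following minimal in-memory sketch (ours; the distributed structures discussed in this chapter are of course far more elaborate) implements the dictionary operations and range queries over a sorted array:

```python
# Minimal in-memory sketch of a dictionary index over (key, address) pairs,
# kept as a sorted array so that search and range queries use binary search.
import bisect

class SortedIndex:
    def __init__(self):
        self.keys = []       # sorted keys
        self.addrs = []      # addresses, aligned with self.keys

    def insert(self, k, a):
        i = bisect.bisect_left(self.keys, k)
        self.keys.insert(i, k)
        self.addrs.insert(i, a)

    def delete(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            del self.keys[i], self.addrs[i]

    def search(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            return self.addrs[i]
        return None

    def range(self, k1, k2):
        lo = bisect.bisect_left(self.keys, k1)
        hi = bisect.bisect_right(self.keys, k2)
        return list(zip(self.keys[lo:hi], self.addrs[lo:hi]))

idx = SortedIndex()
idx.insert("b", 120); idx.insert("a", 40); idx.insert("d", 300)
print(idx.search("a"), idx.range("a", "c"))   # 40 [('a', 40), ('b', 120)]
```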
In previous chapters, we presented algorithms for evaluating XPath queries on XML documents in Ptime with respect to the combined size of the XML data and of the query. In this context, the entire document is assumed to fit within the main memory. However, very large XML documents may not fit in the memory available to the query processor at runtime. Since access to disk-resident data is orders of magnitude slower than access to the main memory, this dramatically changes the problem. When this is the case, performance-wise, the goal is not so much to reduce the algorithmic complexity of query evaluation as to design methods that reduce the number of disk accesses needed to evaluate a given query. The topic of this chapter is the efficient processing of queries over disk-resident XML documents.
We will make extensive use of depth-first tree traversals in this chapter. We briefly recall two classical definitions:
preorder: To traverse a nonempty binary tree in preorder, perform the following operations recursively at each node, starting with the root node: (1) Visit the root, (2) traverse the left subtree, (3) traverse the right subtree.
postorder: To traverse a nonempty binary tree in postorder, perform the following operations recursively at each node, starting with the root node: (1) Traverse the left subtree, (2) traverse the right subtree, (3) visit the root.
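The following minimal sketch (ours) implements both traversals for unranked trees, as found in XML, where the children of a node are simply visited from left to right:

```python
# Minimal sketch of preorder and postorder depth-first traversals, written
# for unranked trees (as in XML): children are visited from left to right.
class Node:
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

def preorder(node, visit):
    visit(node)                      # visit the root first...
    for child in node.children:      # ...then traverse the subtrees
        preorder(child, visit)

def postorder(node, visit):
    for child in node.children:      # traverse the subtrees first...
        postorder(child, visit)
    visit(node)                      # ...then visit the root

tree = Node("a", Node("b", Node("d")), Node("c"))
preorder(tree, lambda n: print(n.label, end=" "))   # a b d c
print()
postorder(tree, lambda n: print(n.label, end=" "))  # d b c a
print()
```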
Figure 4.1 illustrates the issues raised by the evaluation of path queries on disk-resident XML documents.