The vision of the Semantic Web is that of a world-wide distributed architecture where data and services easily interoperate. This vision is not yet a reality in the Web of today, in which, given a particular need, it is difficult to find a resource that is appropriate to it. Also, given a relevant resource, it is not easy to understand what it provides and how to use it. To overcome these limitations, facilitate interoperability, and thereby enable the Semantic Web vision, the key idea is to also publish semantic descriptions of Web resources. These descriptions rely on semantic annotations, typically logical assertions that relate resources to terms in predefined ontologies. This is the topic of this chapter.
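As a purely illustrative sketch of what such an annotation may look like in practice, the following Python fragment asserts two statements about a hypothetical Web resource using the third-party rdflib library; the URIs and class names are invented for the example and belong to no real ontology.

# A minimal sketch of a semantic annotation, using the third-party rdflib
# library (an assumption; the chapter itself is library-agnostic). The URIs
# below are hypothetical examples, not terms from any real ontology.
from rdflib import Graph, Namespace, URIRef, RDF

EX = Namespace("http://example.org/ontology#")   # hypothetical ontology
g = Graph()

# Assert that a Web resource is an instance of the (hypothetical) class
# ex:CourseCatalog, and relate it to a department term of the ontology.
resource = URIRef("http://example.org/univ/catalog2024")
g.add((resource, RDF.type, EX.CourseCatalog))
g.add((resource, EX.offeredBy, EX.ComputerScienceDept))

print(g.serialize(format="turtle"))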
An ontology is a formal description providing human users with a shared understanding of a given domain. The ontologies we consider here can also be interpreted and processed by machines, thanks to a logical semantics that enables reasoning. Ontologies provide the basis for sharing knowledge, and, as such, they are very useful for a number of reasons:
Organizing data. It is very easy to get lost in large collections of documents. An ontology is a natural means of “organizing” (structuring) them, and it thereby facilitates browsing through them to find interesting information. It provides an organization that is flexible and that naturally structures the information in multidimensional ways. For instance, an ontology may allow browsing through the courses offered by a university by topic or department, by quarter or time, by level, and so forth.
The goal of data integration is to provide uniform access to a set of autonomous and possibly heterogeneous data sources in a particular application domain. This is typically what we need when, for instance, querying the deep Web, which is composed of a plethora of databases accessible through Web forms. We would like to be able, with a single query, to find relevant data no matter which database provides it.
A first issue for data integration (that will be ignored here) is social: The owners of a data set may be unwilling to fully share it and reluctant to participate in a data integration system. Also, from a technical viewpoint, difficulty comes from the lack of interoperability between the data sources, which may use a variety of formats, specific query-processing capabilities, and different protocols. However, the real bottleneck for data integration is logical. It comes from the so-called semantic heterogeneity between the data sources: they typically organize data using different schemas, even in the same application domain. For instance, each university or educational institution may choose to model students and teaching programs in its own way. A French university may use the social security number to identify students, with the attributes nom and prenom, whereas the Erasmus database about European students may use a European student number, with the attributes firstname, lastname, and home university.
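To make this heterogeneity concrete, here is a minimal Python sketch (not taken from the book) in which records from the two hypothetical sources above are rewritten into a single mediated schema; the mediated attribute names and the source key names are assumptions made only for the example.

# A minimal sketch of resolving the schema heterogeneity described above:
# records from two hypothetical sources are rewritten into one mediated
# schema. The mediated attribute names (student_id, last_name, first_name)
# are illustrative assumptions, not those used by the book.

def from_french_univ(record):
    # French source: social security number, attributes nom / prenom.
    return {"student_id": record["num_secu"],      # hypothetical key name
            "last_name":  record["nom"],
            "first_name": record["prenom"]}

def from_erasmus(record):
    # Erasmus source: European student number, firstname / lastname.
    return {"student_id": record["european_student_number"],
            "last_name":  record["lastname"],
            "first_name": record["firstname"]}

students = [from_french_univ({"num_secu": "1850775123456",
                              "nom": "Curie", "prenom": "Marie"}),
            from_erasmus({"european_student_number": "EU-42",
                          "lastname": "Bohr", "firstname": "Niels"})]

# A single "query" over the mediated schema, regardless of the source.
print([s["last_name"] for s in students])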
In this chapter, we study data integration in the mediator approach. In this approach, data remain exclusively in data sources and are obtained when the system is queried.
The Internet and the Web have revolutionized access to information. Individuals depend more and more on the Web to find or publish information, download music and movies, and interact with friends on social networking Web sites. Following a parallel trend, companies move more and more toward Web solutions in their daily activity, by using Web services (e.g., agenda) as well as by moving some applications into the cloud (e.g., with Amazon Web services). The growth of this immense information source is witnessed by the number of newly connected people, by the interactions among them facilitated by social networking platforms, and above all by the huge amount of data covering all aspects of human activity. With the Web, information has moved from data isolated in very protected islands (typically relational databases) to information freely available to any machine or any individual connected to the Internet.
Perhaps the best illustration comes from a typical modern Web user. She has information stored on PCs, a personal laptop, and a professional computer, but also possibly on some server at work, on her smartphone, in an e-book, and so on. Also, she maintains information in personal Web sites or social network Web sites. She may store pictures in Picasa, movies in YouTube, bookmarks in Firefox Sync, and the like. So, even an individual is now facing the management of a complex distributed collection of data.
Besides languages to extract information, such as XPath or XQuery, languages for transforming XML documents have been proposed. One of them, XSLT, is very popular. The goal of this PiP is to expose the reader to this aspect of XML and to languages based on tree-pattern rewriting. A presentation of XSLT is beyond the scope of this book. The reader can read the present PiP to get a feeling for standard tasks that are commonly performed with XSLT programs. Of course, realizing the project that is described requires a reasonable understanding of the language. Such an understanding can be obtained, for instance, from the companion Web site of the book, at http://webdam.inria.fr/Jorge/. More references on XSLT may be found there.
XSLT is an XML transformation language. Its principles are quite different from those of XQuery, although the two may roughly serve the same purpose: accessing and manipulating XML content and producing an XML-formatted output. In practice, XQuery is used to extract pieces of information from XML documents, whereas XSLT is often used to restructure documents, typically for publishing them in different forms and different dialects. We show in the present PiP chapter how XSLT can serve to write simple “wrappers” for XML pages. This takes us back to data integration. To integrate a number of data sources, the first step is typically to wrap them all into a uniform schema. Since most data sources now export XML, the wrapping technique considered here can be used in a wide variety of contexts.
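As a small taste of this wrapping style, the following sketch applies a toy XSLT stylesheet from Python, assuming the third-party lxml library is available; the source and target vocabularies are invented for the example and are not those used later in the PiP.

# A minimal sketch of using XSLT as a "wrapper": a toy source document is
# restructured into a uniform target vocabulary. Assumes the third-party
# lxml library; the element names are illustrative only.
from lxml import etree

source = etree.XML("<catalog><item name='XML Primer' price='30'/></catalog>")

stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <books>
      <xsl:for-each select="item">
        <book><title><xsl:value-of select="@name"/></title></book>
      </xsl:for-each>
    </books>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
print(etree.tostring(transform(source), pretty_print=True).decode())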
In this chapter, we learn how to build an evaluation engine for tree-pattern queries, using the SAX (Simple API for XML) programming model. We thereby follow a dual goal: (i) improve our understanding of XML query languages and (ii) become familiar with SAX, a stream parser for XML, with an event-driven API. Recall that the main features of SAX were presented in Section 1.4.2.
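As a reminder of this event-driven style, here is a minimal Python sketch using the standard xml.sax module, standing in for the Java API used elsewhere in the book: the handler reacts to start-element, end-element, and character events while maintaining the path of currently open elements.

# A minimal sketch of SAX's event-driven style, using Python's built-in
# xml.sax module in place of the Java API discussed in the book. The
# handler simply tracks the path of open elements and prints it on text.
import xml.sax

class PathPrinter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []                       # stack of currently open elements

    def startElement(self, name, attrs):     # event: opening tag
        self.path.append(name)

    def endElement(self, name):              # event: closing tag
        self.path.pop()

    def characters(self, content):           # event: text node
        if content.strip():
            print("/".join(self.path), "->", content.strip())

doc = b"<a><b>hello</b><c><b>world</b></c></a>"
xml.sax.parseString(doc, PathPrinter())

Because events are delivered in document order and nothing is kept beyond the stack of open elements, memory consumption stays bounded even for very large inputs, which is precisely what makes SAX attractive for the streaming evaluation built in this chapter.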
TREE-PATTERN DIALECTS
We will consider tree-pattern languages of increasing complexity. We introduce them in this section.
C-TP This is the dialect of conjunctive tree-patterns. A C-TP is a tree, in which each node is labeled either with an XML element name, or with an XML attribute name. C-TP nodes corresponding to attributes are distinguished by prefixing them with @ (e.g., @color). Each node has zero or more children, connected by edges that are labeled either / (with the semantics of child) or // (with the semantics of descendant). Finally, the nodes that one wants to be returned are marked.
As an example, Figure 6.1 shows a simple XML document d where each node is annotated with its preorder number. (Recall the definition of this numbering from Section 4.2.) Figure 6.2 shows a C-TP pattern denoted t1 and the three tuples resulting from “matchings” of t1 into d.
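Before evaluating such patterns, one needs an in-memory representation for them. The following Python sketch shows one possible, deliberately simplified structure; the class and field names are ours, not the book's, and the sample pattern is hypothetical rather than the t1 of Figure 6.2.

# A minimal sketch of a data structure for C-TP patterns (names are ours,
# not the book's). Each node carries a label (possibly an @attribute),
# the edge type that links it to its parent ('/' for child, '//' for
# descendant), a flag telling whether it is a returned node, and children.

class TPNode:
    def __init__(self, label, edge="/", returned=False, children=None):
        self.label = label          # element name, or "@name" for attributes
        self.edge = edge            # '/' (child) or '//' (descendant)
        self.returned = returned    # True if matches of this node are output
        self.children = children or []

    def pretty(self, depth=0):
        mark = " *" if self.returned else ""
        print("  " * depth + self.edge + self.label + mark)
        for child in self.children:
            child.pretty(depth + 1)

# A hypothetical pattern: //book[@year]/title, returning years and titles.
pattern = TPNode("book", edge="//", children=[
    TPNode("@year", returned=True),
    TPNode("title", returned=True),
])
pattern.pretty()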
With its size constantly increasing and now reaching dozens of billions of freely accessible documents, one of the major issues raised by the World Wide Web is that of searching in an effective and efficient way through these documents to find those that best suit a user's need. The purpose of this chapter is to describe the techniques that are at the core of today's search engines (such as Google, Bing, or Exalead), that is, mostly keyword search in very large collections of text documents. We also briefly touch upon other techniques and research issues that may be of importance in next-generation search engines.
This chapter is organized as follows. In Section 13.1, we briefly recall the Web and the languages and protocols it relies upon. Most of these topics have already been covered earlier in the book, and their introduction here is mostly intended to make the present chapter self-contained. We then present in Section 13.2 the techniques that can be used to retrieve pages from the Web, that is, to crawl it, and to extract text tokens from them. First-generation search engines, exemplified by Altavista, mostly relied on the classical information retrieval (IR) techniques, applied to text documents, that are described in Section 13.3. The advent of the Web, and more generally the steady growth of the document collections managed by institutions of all kinds, has led to extensions of these techniques. We address scalability issues in Section 13.3.3, with a focus on centralized indexing. Distributed approaches are investigated in Chapter 14.
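To fix ideas before Section 13.3, here is a minimal Python sketch of the central IR data structure, the inverted index, built over a toy two-document collection; the crude whitespace tokenization is an assumption made only for brevity.

# A minimal sketch of the core IR data structure discussed in Section 13.3:
# an inverted index mapping each token to the identifiers of the documents
# containing it. Tokenization here is a crude whitespace/lowercase split.
from collections import defaultdict

documents = {1: "the cat sat on the mat",
             2: "the dog chased the cat"}

index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    # Conjunctive keyword search: documents containing every query token.
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("the cat"))   # -> {1, 2}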
So far, the discussion on distributed systems has been limited to data storage and to a few data management primitives (e.g., write(), read(), and search()). For real applications, one also needs to develop and execute more complex programs that process the available data sets and effectively exploit the available resources.
The naive approach that consists in getting all the required data at the Client in order to apply some processing locally often loses in a distributed setting. First, some processing may not be available locally. Moreover, centralizing all the information and then processing it would simply miss all the advantages brought by a powerful cluster of hundreds or even thousands of machines. We have to use distribution. One can consider two main scenarios for data processing in distributed systems.
Distributed processing and workflow: In the first one, an application has large data sets at its disposal and needs to apply to them some processes that are available on remote sites. When this is the case, the problem is to send the data to the appropriate locations and then sequence the remote executions. This workflow scenario is typically implemented using Web services and some high-level coordination language.
Distributed data and MapReduce: In a second scenario, the data sets are already distributed over a number of servers, and, in contrast to the previous scenario, we “push” programs to these servers. Indeed, due to network bandwidth issues, it is often more cost-effective to send a small piece of program from the Client to the Servers than to transfer large data volumes to a single Client.
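The following Python sketch simulates the MapReduce model locally, in a single process: a map function emits (key, value) pairs, the pairs are grouped by key (the “shuffle”), and a reduce function aggregates each group. It is only an illustration of the programming model, not of a distributed implementation.

# A minimal, purely local sketch of the MapReduce programming model: a map
# function emits (key, value) pairs, the pairs are grouped by key (the
# "shuffle"), and a reduce function aggregates each group. A real system
# would, as described above, push these functions to the servers holding
# the data; here everything runs in one process for illustration.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

groups = defaultdict(list)                     # shuffle phase
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)

result = [reduce_fn(k, v) for k, v in groups.items()]
print(sorted(result))

In an actual distributed system, the same two functions would be shipped to the servers holding the data, and the shuffle phase would move the intermediate pairs across the network.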
This chapter proposes some exercises and projects to manipulate and query XML documents in a practical context. The software used in these exercises is eXist, an open-source native XML database that provides an easy-to-use and powerful environment for learning and applying XML languages. We begin with a brief description of how to install eXist and execute some simple operations. eXist provides a graphical interface that is pretty easy to use, so we limit our explanations below to the vital information that can save the absolute beginner some time.
PREREQUISITES
In the following, we assume that you plan to install eXist in your Windows or Linux environment. You need a Java Development Kit (JDK) for running the eXist Java application (version 1.5 at least). If you do not have a JDK already installed, get it from the Sun site (try searching for “download JDK 1.5” with Google to obtain an appropriate URL) and follow the instructions to set up your Java environment.
Be sure that you can execute Java applications. This requires the definition of a JAVA_HOME environment variable, pointing to the JDK directory. The PATH variable must also contain an entry for the directory that contains the Java executable, $JAVA_HOME/bin.
Under Windows: Load the configuration panel window; run the System application; choose Advanced and then Environment variables. Create a new variable JAVA_HOME with the appropriate location, and add the %JAVA_HOME%\bin path to the PATH variable.
The Web is a medium of primary interest for companies, which change their organization to place it at the core of their operations. It is an easy but boring task to list areas where the Web can be usefully leveraged to improve the functionality of existing systems. One can cite in particular B2B and B2C (business-to-business or business-to-customer) applications, G2B and G2C (government-to-business or government-to-customer) applications, and digital libraries. Such applications typically require some form of typing to represent data, because they consist of programs that deal with HTML text only with difficulty. The exchange and exploitation of business information likewise call for a more powerful Web data management approach.
This motivated the introduction of a semistructured data model, namely XML, that is well suited both for humans and machines. XML describes content and promotes machine-to-machine communication and data exchange. The design of XML relies on two major goals. First, it is designed as a generic data format, apt to be specialized for a wide range of data usages. In the XML world, for instance, XHTML is seen as a specialized XML dialect for data presentation by Web browsers. Second, XML “documents” are meant to be easily and safely transmitted on the Internet, by including in particular a self-description of their encoding and content.
XML is the language of choice for a generic, scalable, and expressive management of Web data.
This chapter proposes exercises to manipulate and query real-world RDFS ontologies, and especially Yago. Yago was developed at the Max Planck Institute in Saarbrücken in Germany. At the time of this writing, it is the largest ontology of human quality that is freely available. It contains millions of entities such as scientists, and millions of facts about these entities such as where a particular scientist was born. Yago also includes knowledge about the classes and relationships composing it (e.g., a hierarchy of classes and relationships).
EXPLORING AND INSTALLING YAGO
Go to the Yago Web site, http://mpii.de/yago, click on the Demo tab and start the textual browser. This browser allows navigating through the Yago ontology.
Type “Elvis Presley” in the box. Then click on the Elvis Presley link. You will see all properties of Elvis Presley, including his biographic data and his discography.
You can follow other links to explore the ontology. Navigate to the wife of Elvis Presley, Priscilla.
The ontology is held together by a taxonomy of classes. Its top class is called “entity”. Verify this by repeatedly following type and subClassOf links.
Go back to Elvis Presley. Navigate to one of his songs. You will see the date the song was recorded. Can you find all songs together with their record dates? Why would this be a tedious endeavor?
Then, to install Yago on your machine, make sure that you have Java installed and around 5 GB of free disk space.
This chapter is an introduction to very large data management in distributed systems. Here, “very large” means a context where gigabytes (1,000 MB = 10^9 bytes) constitute the unit size for measuring data volumes. Terabytes (10^12 bytes) are commonly encountered, and many Web companies and scientific or financial institutions must deal with petabytes (10^15 bytes). In the near future, we can expect exabyte (10^18 bytes) data sets, with the world-wide digital universe roughly estimated (in 2010) at about 1 zettabyte (10^21 bytes).
Distribution is the key for handling very large data sets. Distribution is necessary (but not sufficient) to bring scalability (i.e., the means of maintaining stable performance for steadily growing data collections by adding new resources to the system). However, distribution brings a number of technical problems that make the design and implementation of distributed storage, indexing, and computing a delicate issue. A prominent concern is the risk of failure. In an environment that consists of hundreds or thousands of computers (a common setting for large Web companies), it becomes very common to face the failure of components (hardware, network, local systems, disks), and the system must be ready to cope with it at any moment.
Our presentation covers principles and techniques that recently emerged to handle Web-scale data sets. We examine the extension of traditional storage and indexing methods to large-scale distributed settings. We describe techniques to efficiently process point queries that aim at retrieving a particular object.
In this chapter, we discuss the typing of semistructured data. Typing is the process of describing, with a set of declarative rules or constraints called a schema, a class of XML documents, and verifying that a given document is valid for that class (we also say that this document is valid against the type defined by the schema). This is, for instance, used to define a specific XML vocabulary (XHTML, MathML, RDF, etc.), with its specificities in structure and content, that is used for a given application.
We first present motivations and discuss the kind of typing that is needed. XML data typing is typically based on finite-state automata. Therefore, we recall basic notions of automata, first on words, then on ranked trees, and finally on unranked trees (i.e., essentially on XML). We also present the two main practical languages for describing XML types, DTDs and XML Schema, both of which are endorsed by the W3C. We then briefly describe alternative schema languages with their key features. In a last section, we discuss the typing of graph data.
One can also consider the issue of “type checking a program,” that is, verifying that if the input is of a proper input type, the program produces an output that is of a proper output type. In Section 3.5, we provide references to works on program type checking in the context of XML.
MOTIVATING TYPING
Perhaps the main difference with typing in relational systems is that typing is not compulsory for XML.
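Validation is therefore an explicit, optional step. As a purely illustrative sketch, the following Python fragment declares a toy DTD and checks two documents against it, assuming the third-party lxml library; the vocabulary is invented for the example and is not one used in the book.

# A minimal sketch of typing an XML document with a DTD and checking
# validity, using the third-party lxml library; the vocabulary is a toy
# example of ours.
import io
from lxml import etree

dtd = etree.DTD(io.StringIO("""
<!ELEMENT catalog (book*)>
<!ELEMENT book (title)>
<!ELEMENT title (#PCDATA)>
"""))

valid = etree.XML("<catalog><book><title>XML</title></book></catalog>")
invalid = etree.XML("<catalog><title>dangling</title></catalog>")

print(dtd.validate(valid))     # True: conforms to the schema
print(dtd.validate(invalid))   # False: <title> not allowed under <catalog>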
Mashups are Web applications that integrate and combine data from multiple Web sources to present them in a new way to a user. This chapter shows two different ways to construct mashup applications in practice: Yahoo! Pipes, a graphical user interface for building mashups, and XProc, a W3C language for describing workflows of transformations over XML documents. Pros and cons of either approach will be made clear as one follows the indicated steps. The goal will be to present information about news events, each event being accompanied by its localization displayed on a map. For that purpose, we integrate three sources of information:
A Web feed about current events in the world, in RSS format (e.g., CNN's top stories at http://rss.cnn.com/rss/edition.rss). Any such RSS feed is fine, though English is preferable to ensure precision of the geolocalization.
A geolocalization service. We use information from the GeoNames geographical database, and specifically their RSS to Geo RSS converter, whose API is described at http://www.geonames.org/rss-to-georss-converter.html.
A mapping service. We use Yahoo! Maps.
YAHOO! PIPES: A GRAPHICAL MASHUP EDITOR
Yahoo! Pipes allows creating simple mashup applications (simply called pipes) using a graphical interface based on the construction of a pipeline of boxes connected to each other, each box performing a given operation (fetching information, annotating it, reorganizing it, etc.) until the final output of the pipeline. It can be used by nonprogrammers, though defining complex mashups still requires skill and experience with the platform.
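For readers who prefer to see the first pipeline step in code, the following Python sketch (standard library only) fetches an RSS feed and extracts the title of each item, roughly what a feed-fetching box does in a pipe; it assumes network access and reuses the CNN feed URL mentioned above.

# A minimal sketch of the first step of the mashup pipeline built in this
# chapter: fetch an RSS feed and extract the title of each item, roughly
# what a "fetch feed" box does in Yahoo! Pipes. Only the standard library
# is used; network access (and the feed URL given above) is assumed.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://rss.cnn.com/rss/edition.rss"   # feed cited above

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

# RSS 2.0 places items under /rss/channel/item.
for item in tree.findall("./channel/item"):
    title = item.findtext("title", default="(no title)")
    print(title)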
This chapter proposes an introduction to recommendation techniques and suggests some exercises and projects. We do not present a recommendation system in particular but rather focus on the general methodology. As an illustrative example, we will use the MovieLens data set to construct movie recommendations.
The chapter successively introduces recommendation, user-based collaborative filtering, and item-based collaborative filtering. It discusses different method parameterizations and evaluates their results with respect to the quality of the data set. We show how to generate recommendations using SQL queries on the MovieLens data set. Finally, we suggest some projects for students who want to investigate further the realm of recommendation systems.
INTRODUCTION TO RECOMMENDATION SYSTEMS
Given a set of ratings of items by a set of users, a recommendation system produces a list of items for a particular user, possibly in a given context. Such systems are widely used in Web applications. For example, content sites like Yahoo! Movies (movies), Zagat (restaurants), LibraryThing (books), Pandora (music), and StumbleUpon (Web sites) suggest lists of items of interest by predicting the ratings of their users. E-commerce sites such as Amazon (books) or Netflix (movies) use recommendations to suggest new products to their users and construct bundle sales. Usually, they exploit the recent browsing history as a limited context. Finally, advertisement companies need to find a list of advertisements targeted for their users. Some of them, like Google AdSense, rely more on the context (e.g., keywords) than on an estimation of the user's taste based on her/his recent browsing history.
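As a first intuition of user-based collaborative filtering (the chapter itself works with SQL queries over MovieLens), the following Python sketch scores unseen items for a user by cosine similarity with other users over a tiny hand-made rating matrix; the data and parameter choices are invented for the example.

# A minimal sketch of user-based collaborative filtering. The chapter
# itself works with SQL over the MovieLens data; here a tiny hand-made
# rating matrix and cosine similarity are used purely for illustration.
from math import sqrt

ratings = {                      # user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 3, "m4": 5},
    "carol": {"m2": 1, "m3": 2, "m4": 4},
}

def cosine(u, v):
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def recommend(user, k=2):
    # Score unseen items by similarity-weighted ratings of the other users.
    sims = {other: cosine(ratings[user], ratings[other])
            for other in ratings if other != user}
    scores = {}
    for other, sim in sims.items():
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))        # e.g., ['m4']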