In Chapter 3, we provided a theoretical overview of the explore-exploit problem and its importance in scoring items for recommender systems, in particular, the connection to the classical multiarmed bandit (MAB) problem. We discussed the Bayesian and the minimax approaches to the MAB problem, including some popular heuristics that are used in practice. However, several additional nuances that arise in recommender systems violate assumptions made in the MAB problem. These include dynamic item pools, nonstationary CTRs, and delayed feedback. This chapter develops new solutions that work well in practice.
For many recommender systems, it is appropriate to score items based on some positive action rate, such as click probabilities (CTR). Such an approach maximizes the total number of actions on recommended items. A simple approach that is often used in practice is to recommend the top-k items with the highest CTR. We shall refer to this as the most-popular recommendation approach, where popularity is measured through item-specific CTR. Although conceptually simple, the most-popular approach is technically nontrivial because the CTRs of items have to be estimated. It also serves as a good baseline for applications where nonpersonalized recommendation is an acceptable solution. Hence we begin in this chapter by developing explore-exploit solutions for the most-popular recommendation problem.
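The estimation step can be sketched with a smoothed, Beta-prior-style CTR estimate followed by top-k selection. This is a minimal illustration with hypothetical counts, function names, and prior parameters, not the book's actual implementation:

```python
import heapq

def estimate_ctr(clicks, views, prior_clicks=5.0, prior_views=100.0):
    """Smoothed CTR estimate: the prior acts like 100 pseudo-views at a
    5 percent click rate, shrinking small-sample items toward the prior
    mean instead of trusting raw click/view ratios."""
    return (clicks + prior_clicks) / (views + prior_views)

def most_popular(stats, k):
    """Return the k item ids with the highest smoothed CTR.
    `stats` maps item id -> (clicks, views)."""
    return heapq.nlargest(k, stats, key=lambda item: estimate_ctr(*stats[item]))

stats = {
    "a": (50, 1000),   # raw CTR 0.050
    "b": (2, 10),      # raw CTR 0.200, but a tiny sample
    "c": (300, 4000),  # raw CTR 0.075
}
print(most_popular(stats, 2))  # smoothing demotes "b" below "c"
```

Item b has the highest raw CTR, but with only ten views that estimate is unreliable; smoothing ranks the well-estimated item c first, which previews the exploration problem treated in this chapter.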
In Section 6.1, we introduce an example application and show the characteristics of most-popular recommendation in this real-life application. We then mathematically define the explore-exploit problem for most-popular recommendation in Section 6.2 and develop a Bayesian solution from first principles in Section 6.3. A number of popular non-Bayesian solutions are reviewed in Section 6.4. Through a series of extensive experiments in Section 6.5, we show that, when the system can be properly modeled using a Bayesian framework, the Bayesian solution performs significantly better than other solutions. Finally, in Section 6.6, we discuss how to address the data sparsity challenge when the set of candidate items is large.
Example Application: Yahoo! Today Module
In Section 5.1.1, we introduced feature modules that are commonly shown on the home pages of web portals. The Today Module on the Yahoo! home page (see Figure 5.1 for a snapshot) is a typical example. The goal for this module is to recommend items (mostly news stories of different types) to maximize user engagement on the home page, usually measured by the total number of clicks.
After developing statistical methods for recommender systems, it is important to evaluate their performance in different application settings. Broadly speaking, there are two kinds of evaluation, depending on whether a recommendation algorithm, or more precisely, the model used in the algorithm, has been deployed to serve users:
1. Predeployment offline evaluation: A new model must show strong signs of performance improvement over existing baselines before being deployed to serve real users. To ascertain the potential of a new model before testing it on real user visits, we compute various performance measures on retrospective (historical) data. We refer to this as offline evaluation. To perform such offline evaluation, we need to log data that record past user-item interactions in the system. Model comparison is performed by computing various offline metrics based on such data.
2. Postdeployment online evaluation: Once the model performance looks promising based on offline metrics, we test it on a small fraction of real user visits. We refer to this as online evaluation. To perform online evaluation, it is typical to run randomized experiments online. A randomized experiment, also referred to as an A/B test or a bucket test in web applications, compares a new method to an existing baseline. It is conducted by assigning two random user or visit populations to the treatment bucket and the control bucket, respectively. The treatment bucket is typically smaller than the control because it serves users according to the new recommendation model that is being tested, whereas the control bucket serves users using the status quo. After running such a bucket test for a certain time period, we gauge model performance by comparing metrics that are computed using data collected from the corresponding buckets.
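A common way to realize the random assignment in such a bucket test is deterministic hashing of the user id, so that a user stays in the same bucket across visits. The sketch below assumes a 5 percent treatment bucket; the function name and hashing scheme are illustrative, not a description of any particular production system:

```python
import hashlib

def assign_bucket(user_id, experiment, treatment_fraction=0.05):
    """Deterministically map a user into 'treatment' or 'control'.
    Hashing the (experiment, user_id) pair keeps the assignment stable
    across visits and independent across concurrent experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if point < treatment_fraction else "control"

buckets = [assign_bucket(f"user{i}", "new-ctr-model") for i in range(10000)]
share = buckets.count("treatment") / len(buckets)
print(round(share, 3))  # close to the 5 percent target
```

Salting the hash with the experiment name means a user who lands in the treatment bucket of one test is not systematically placed in the treatment bucket of the next one.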
In this chapter, we describe several ways to measure the performance of recommendation models and discuss their strengths and weaknesses. We start in Section 4.1 with traditional offline evaluation metrics that measure out-of-sample predictive accuracy on retrospective ratings data. Our use of the term rating is generic and refers to both explicit ratings like star ratings on movies and implicit ratings (also called responses) like clicks on recommended items (we use ratings and responses interchangeably). In Section 4.2, we discuss online evaluation methods, describing both performance metrics and how to set up online bucket tests in a proper way.
The approaches that we discussed in previous chapters typically recommend items to optimize for a single objective, usually clicks on recommended items. However, a click is only the starting point of a user's journey, and subsequent downstream utilities, such as time spent on the website after the click and revenue generated through displaying advertisements on web pages, are also important. Here clicks, time spent, and revenue are three different objectives for which a website may want to optimize. When different objectives have strong positive correlation, maximizing one would automatically maximize others. In reality, this is not a typical scenario. For example, the advertisements on real estate articles tend to generate higher revenue than those on entertainment articles, but users usually click more and spend more time on entertainment articles. In situations like this, it is not feasible to reach optimality for all objectives. The goal, instead, is to find a good trade-off among competing objectives. For example, we maximize revenue such that clicks and time spent are still within 95 percent of the achievable number of clicks and 90 percent of the achievable time spent relative to some normative serving scheme (e.g., CTR maximization). How to set the constraints (the 95 percent and 90 percent thresholds) depends on the business goals and strategies of the website and is application specific. This chapter provides a methodology based on multiobjective programming to optimize a recommender system for a given set of predetermined constraints. Such a formulation can easily incorporate various application-driven requirements.
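The revenue-versus-clicks trade-off just described can be made concrete on a toy two-item instance. The sketch below (with made-up CTR and revenue numbers) chooses the probability of serving the higher-revenue item so as to maximize expected revenue while keeping expected clicks within 95 percent of the click-optimal plan; a simple grid search stands in for the multiobjective programs developed in this chapter:

```python
def best_serving_plan(ctr, rev_per_click, click_floor=0.95, grid=1000):
    """Pick the probability p of serving item 0 (vs. item 1) to maximize
    expected revenue, subject to expected clicks staying above
    `click_floor` times the clicks of the CTR-maximizing plan."""
    def clicks(p):
        return p * ctr[0] + (1 - p) * ctr[1]
    def revenue(p):
        return p * ctr[0] * rev_per_click[0] + (1 - p) * ctr[1] * rev_per_click[1]
    max_clicks = max(clicks(i / grid) for i in range(grid + 1))
    feasible = [i / grid for i in range(grid + 1)
                if clicks(i / grid) >= click_floor * max_clicks]
    return max(feasible, key=revenue)

# Item 0: a real estate article, lower CTR but higher revenue per click.
# Item 1: an entertainment article, higher CTR but lower revenue per click.
p = best_serving_plan(ctr=[0.02, 0.05], rev_per_click=[1.0, 0.2])
print(p)  # serve the real estate article a small fraction of the time
```

The click constraint binds here: expected revenue increases with p, so the optimizer pushes p to the largest value that still preserves 95 percent of the achievable clicks.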
We present two approaches to multiobjective optimization. Section 11.2 describes a segmented approach (originally published in Agarwal et al., 2011a), where users are classified into user segments in a way similar to segmented most-popular recommendation (discussed in Sections 3.3.2 and 6.5.3) and the optimization is at the segment level. Although useful, this segmented approach assumes that users are partitioned into a few coarse, nonoverlapping segments. In many applications, a user is characterized by a high-dimensional feature vector (thousands of dimensions, with a very large number of possible combinations) – the segmented approach fails to provide recommendations at such granular resolutions because decisions are made at coarse user-segment resolutions. Section 11.3 describes a personalized approach (originally published in Agarwal et al., 2012) that improves the segmented approach by optimizing the objectives at the individual-user level.
Recommender systems have to select items for users to optimize for one or more objectives. We introduced several possible objectives in Chapter 1, reviewed classical methods in Chapter 2, described the explore-exploit trade-off and the key ideas to reduce dimensionality of the problem in Chapter 3, and discussed how to evaluate recommendation models in Chapter 4. In this and the five subsequent chapters of the book, we discuss various statistical methods used in some commonly encountered scenarios. In particular, we focus on problem settings where the main objective is to maximize some positive user response to the recommended items. In many application scenarios, clicks on items are the primary response. To maximize clicks, we have to recommend items with high click-through rates (CTRs). Thus, CTR estimation is our main focus. Although we use click and CTR as our primary objectives, other types of positive response (e.g., share, like) can be handled in a similar way. We defer the discussion of multiobjective optimization to Chapter 11.
The choice of statistical methods for a recommendation problem depends on the application. In this chapter, we provide a high-level overview of techniques to be introduced in the next four chapters. We start with an introduction to a variety of different problem settings in Section 5.1 and then describe an example system architecture in Section 5.2 to illustrate how web recommender systems work in practice, along with the role of statistical methods in such systems.
Problem Settings
A typical recommender system is usually implemented as a module on a web page. In this section, we introduce some common recommendation modules, provide details of the application settings, and conclude with a discussion of commonly used statistical methods for these settings.
Common Recommendation Modules
We classify websites into the following four categories: general portals, personal portals, domain-specific sites, and social network sites. Table 5.1 provides a summary.
General portals are websites that provide a wide range of different content. The home pages of content networks like Yahoo!, MSN, and AOL are examples of general portals.
Personal portals are websites that allow users to customize their home pages with desired content. For example, users of My Yahoo! customize their home pages by selecting content feeds from different sources or publishers and arranging the feeds on the pages according to their preferences.
In Chapter 2, we reviewed classical methods to score an item for a user in a given context. In this chapter, we describe scoring methods based on modern techniques, particularly those that are based on explore-exploit methods.
Scoring involves estimating the “value” of an item according to some criteria. Because explicit user intent in most modern-day recommender problems is at best partially observed and often weak, scoring items based on the predicted value of some rating or response is a popular mechanism. For ease of exposition, we assume the response is binary: the “positive” label corresponds to a certain positive user interaction with the item, such as a click, like, or share, whereas “negative” indicates the lack of any such positive interaction. For instance, in a news recommender problem, user clicks on recommended items are the primary response variable. The score of an item is given by its response rate, that is, the expected value of the user response; for a binary response, the response rate by this definition is the probability of a positive response. Throughout this chapter, we use click to refer to a positive response and CTR to refer to the corresponding response rate.
Often the goal in recommender problems is to maximize the total number of positive responses, such as clicks on recommended items. For instance, in a news recommender problem, maximizing the total clicks on the recommended news articles is an important objective. With known item response rates, this is easy to accomplish by always recommending the item with the highest response rate. However, because response rates are not known, a key task is to accurately estimate them for items in the item pool. In Sections 2.3 and 2.4, we considered various supervised learning methods to estimate response rates through feature-based models and collaborative filtering. In this chapter, we show that scoring items in a recommender problem is not purely a supervised learning problem but, more importantly, an explore-exploit problem. We need to achieve a good balance between exploring or experimenting with items that are new or have a small sample size by displaying them in a certain number of user visits, versus exploiting items that are known to have high response rates with high statistical certainty.
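As a concrete illustration of such a balance, the sketch below implements one round of Thompson sampling, one of the Bayesian heuristics reviewed in Chapter 3, with a Beta posterior over each item's CTR. The item names and counts are made up for illustration:

```python
import random

def thompson_pick(stats, rng=random):
    """One round of Thompson sampling: draw a CTR from each item's
    Beta posterior and serve the item with the highest draw. Items with
    few observations have wide posteriors, so they still get explored;
    items with high, well-estimated CTRs win most draws."""
    best, best_draw = None, -1.0
    for item, (clicks, views) in stats.items():
        draw = rng.betavariate(1 + clicks, 1 + views - clicks)
        if draw > best_draw:
            best, best_draw = item, draw
    return best

rng = random.Random(7)
stats = {"new": (0, 0), "old": (100, 2000)}  # "new" is entirely unexplored
picks = [thompson_pick(stats, rng) for _ in range(1000)]
print(picks.count("new"))
```

Because the unexplored item's posterior is uniform, it wins a large share of draws until its counts accumulate; in a real system, `stats` would be updated with the observed clicks and views after each round, so a poorly performing new item is gradually phased out.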
In Chapter 7, we studied personalization through feature-based regression models. We reviewed models based on item-item similarity, user-user similarity, and matrix factorization in Chapter 2. In practice, matrix factorization has better prediction accuracy in warm-start scenarios, that is, when predicting responses for (user, item) pairs where both the user and the item have large numbers of observations in the training data. However, prediction accuracy deteriorates for users (or items) with meagre or no response in the training data. Such cases are often referred to as cold-start scenarios. In Section 8.1, we describe the regression-based latent factor model (RLFM) that extends matrix factorization to leverage available user and item features simultaneously. Such a strategy helps to improve the performance of both cold-start and warm-start scenarios in a single framework. We then develop the model-fitting algorithms in Section 8.2 and illustrate the performance of RLFM on a number of data sets in Section 8.3. Finally, in Section 8.4, we discuss model-fitting strategies to train RLFM on the very large data that are typical in modern web recommendation systems, and we evaluate their performance in Section 8.5.
Regression-Based Latent Factor Model (RLFM)
RLFM extends matrix factorization to leverage features and users’ past responses to items within a single modeling framework (Agarwal and Chen, 2009; Zhang et al., 2011). It provides a principled framework to combine the pros of collaborative filtering and content-based methods. When past response data are not sufficient to determine the latent factors of a user or an item, RLFM estimates them through a regression model based on features; otherwise, the latent factors are estimated as in matrix factorization. The key aspect of RLFM is the ability to seamlessly transition, in the continuum, from cold start to warm start.
Data and Notation. As usual, we use i to denote a user and j to denote an item. Let yij denote the response that user i gives item j. This response can be of various types, as described in Section 2.3.2, including numeric response (e.g., numeric ratings) modeled using the Gaussian response model and binary response (e.g., click or not) modeled using the logistic response model.
In addition to response data, we also have features available for each user and item. Let xi, xj, and xij denote feature vectors for user i, item j, and pair (i, j), respectively.
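With this notation, the RLFM score can be sketched as a feature-based regression term plus an inner product of latent factors, where each factor is a feature-based prediction plus an optional per-user (or per-item) correction. The weights and numbers below are made up, and plain Python lists stand in for the fitted parameter matrices:

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def rlfm_score(x_ij, x_i, x_j, b, G, D, delta_i=None, zeta_j=None):
    """Score of item j for user i: b'x_ij + u_i'v_j, where the latent
    factor u_i is the regression prediction G x_i plus a per-user
    correction delta_i learned when user i has enough response data,
    and similarly v_j = D x_j + zeta_j. With no corrections, the score
    falls back to pure feature-based prediction (the cold-start case)."""
    u = matvec(G, x_i)
    v = matvec(D, x_j)
    if delta_i is not None:
        u = [a + d for a, d in zip(u, delta_i)]
    if zeta_j is not None:
        v = [a + z for a, z in zip(v, zeta_j)]
    return dot(b, x_ij) + dot(u, v)

b = [1.0, 0.5]                # regression weights for pair features
G = [[1.0, 0.0], [0.0, 1.0]]  # user-feature-to-factor map (identity for readability)
D = [[1.0, 0.0], [0.0, 1.0]]  # item-feature-to-factor map
cold = rlfm_score([1.0, 2.0], [0.5, 0.5], [2.0, 0.0], b, G, D)
warm = rlfm_score([1.0, 2.0], [0.5, 0.5], [2.0, 0.0], b, G, D,
                  delta_i=[0.5, 0.0])  # correction learned from user i's responses
print(cold, warm)
```

Under a logistic response model, this score would be passed through a sigmoid to give a click probability; the cold-to-warm transition is just the corrections growing away from zero as response data accumulate.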
Recommender systems are automated computer programs that match items to users in different contexts. Such systems are ubiquitous and have become an integral part of our daily lives. Examples include recommending products to users on a site like Amazon, recommending content to users visiting a website like Yahoo!, recommending movies to users on a site like Netflix, recommending jobs to users on a site like LinkedIn, and so on. The matching algorithms are constructed using large amounts of high-frequency data obtained from past user interactions with items. The algorithms are statistical in nature and involve challenges in areas like sequential decision processes, modeling interactions with very high-dimensional categorical data, and developing scalable statistical methods. New methodologies in this area require close collaboration among computer scientists, machine learners, statisticians, optimization experts, system experts, and, of course, domain experts. It is one of the most exciting applications of big data.
Why We Wrote This Book
Although much has been written about recommender systems in various fields, such as computer science, machine learning, and statistics, focusing on specific aspects of the problem, a comprehensive treatment of all statistical issues and how they are interrelated is lacking. We came to this realization while deploying such systems at Yahoo! and LinkedIn. For instance, much of the focus in statistics and machine learning is on building models that minimize out-of-sample predictive error. However, this does not address all aspects of practical importance. Statistically, a recommender system is a high-dimensional sequential process, and it is equally important to study issues like design of experiments as it is to develop sophisticated statistical models. In fact, the two are closely related – efficient design needs models to tame the curse of dimensionality. Also, most existing work in the literature tends to build models for univariate response, such as movie ratings, purchases, and click rates. With the advent of social media outlets like Facebook, LinkedIn, and Twitter, multiple responses are available. For instance, one may want to model click rates, share rates, and tweet rates simultaneously for a news recommender application. Such multivariate response models are challenging to build. Finally, given the machinery to obtain such multivariate predictions, how does one construct utility functions to make recommendations? Is it more important to optimize share rates relative to click rates?
It is important to recommend items that are tailored to the personal interests and information needs of each individual user. Recommending items based on global popularity is often not sufficient. Some personalization may be achieved through an easy extension to most-popular recommendation by creating user segments based on attributes such as age, gender, and geolocation (e.g., male users between twenty and forty years old living on the East Coast may form a segment) and serving the most-popular items within each segment. However, such a strategy has limitations. When the number of user segments is large, obtaining a reliable estimate of the popularity of items for each segment is difficult because of data sparsity. Also, user visits tend to follow a power law type of distribution – a small fraction of active users tends to visit the site very often, while the rest are sporadic. It is desirable to build custom models for active users who have visited the site several times in the past. For instance, if Mary visited the Yahoo! front page one hundred times in the last week and clearly indicated her preference for baseball news over everything else, a baseball article in our content pool should get precedence for Mary's next visit. For users who visit sporadically, personalization is usually accomplished by pooling data across users who are similar. Defining similarity is the crux of the problem, and it is often done by looking at user features such as demographics, behavioral attributes, and social network information, among other sources of signals. Combining these signals in an accurate fashion is challenging. In addition, many users are in the gray area of being superactive versus sporadic visitors. We need a methodology that can automatically determine the appropriate amount of weighting we should provide to a user's own past interactions and those of similar users.
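The weighting described at the end of this paragraph can be made concrete with a simple shrinkage estimator, sketched below with an assumed prior strength of fifty pseudo-views. The numbers and function name are illustrative; the regression models in this chapter learn such weights from data rather than fixing them by hand:

```python
def personalized_ctr(user_clicks, user_views, segment_ctr, strength=50.0):
    """Blend a user's own click history with a segment-level CTR. The
    prior acts like `strength` pseudo-views at the segment rate: active
    users are dominated by their own data, while sporadic visitors fall
    back toward the segment estimate."""
    return (user_clicks + strength * segment_ctr) / (user_views + strength)

heavy = personalized_ctr(80, 100, segment_ctr=0.05)  # active user: own history dominates
light = personalized_ctr(1, 2, segment_ctr=0.05)     # sporadic user: near the segment rate
print(round(heavy, 3), round(light, 3))
```

Users in the gray area between superactive and sporadic land between the two extremes automatically, which is exactly the behavior a principled methodology should deliver without manual tuning.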
For recommender systems where users, items, and their behavior do not change over time, offline models that are trained based on a one-time dump of historical data may be sufficient. See Chapter 2 for such models. However, in many application settings, new items are frequently added, and user behaviors and interests also change over time. It is important to update recommendation models frequently to capture such nonstationarity.
In this chapter, we focus on feature-based regression, which personalizes recommendations of time-sensitive items by capturing similarities through a regression approach.
Recommender systems (or recommendation systems) are computer programs that recommend the “best” items to users in different contexts. The notion of a best match is typically obtained by optimizing for objectives like total clicks, total revenue, and total sales. Such systems are ubiquitous on the web and form an integral part of our daily lives. Examples include product recommendations to users on an e-commerce site to maximize sales; content recommendations to users visiting a news site to maximize total clicks; movie recommendations to maximize user engagement and increase subscriptions; or job recommendations on a professional network site to maximize job applications. Input to these algorithms typically consists of information about users, items, contexts, and feedback that is obtained when users interact with items.
Figure 1.1 shows an example of a typical web application that is powered by a recommender system. A user uses a web browser to visit a web page. The browser then submits an HTTP request to the web server that hosts the page. To serve recommendations on the page (e.g., popular news stories on a news portal page), the web server makes a call to a recommendation service that retrieves a set of items and renders them on the web page. Such a service typically performs a large number of different types of computations to select the best items. These computations are often a hybrid of both offline and real-time computations, but they must adhere to strict efficiency requirements to ensure quick page load time (typically hundreds of milliseconds). Once the page loads, the user may interact with items through actions like clicks, likes, or shares. Data obtained through such interactions provide a feedback loop to update the parameters of the underlying recommendation algorithm and to improve the performance of the algorithm for future user visits. The frequency of such parameter updates depends on the application. For instance, if items are time sensitive or ephemeral, as in the case of news recommendations, parameter updates must be done frequently (e.g., every few minutes). For other applications where items have a relatively longer lifetime (e.g., movie recommendations), parameter updates can happen less frequently (e.g., daily) without significant degradation in overall performance.
Algorithms that facilitate selection of best items are crucial to the success of recommender systems. This book provides a comprehensive description of statistical and machine learning methods on which we believe such algorithms should be based.