THE PREVIOUS CHAPTER introduced binary classification and associated tasks such as ranking and class probability estimation. In this chapter we will go beyond these basic tasks in a number of ways. Section 3.1 discusses how to handle more than two classes. In Section 3.2 we consider the case of a real-valued target variable. Section 3.3 is devoted to various forms of learning that are either unsupervised or aimed at learning descriptive models.
Handling more than two classes
Certain concepts are fundamentally binary. For instance, the notion of a coverage curve does not easily generalise to more than two classes. We will now consider general issues related to having more than two classes in classification, scoring and class probability estimation. The discussion will address two issues: how to evaluate multi-class performance, and how to build multi-class models out of binary models. The latter is necessary for some models, such as linear classifiers, that are primarily designed to separate two classes. Other models, including decision trees, handle any number of classes quite naturally.
Multi-class classification
Classification tasks with more than two classes are very common. For instance, once a patient has been diagnosed as suffering from a rheumatic disease, the doctor will want to classify him or her further into one of several variants. If we have k classes, performance of a classifier can be assessed using a k-by-k contingency table. Assessing performance is easy if we are interested in the classifier's accuracy, which is still the sum of the descending diagonal of the contingency table, divided by the number of test instances.
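To make the calculation concrete, here is a minimal sketch (my own illustration, not the book's code) of accuracy computed from a k-by-k contingency table; the three-class counts below are hypothetical.

```python
# Entry table[i][j] counts test instances of true class i predicted as class j.
# Accuracy is the sum of the descending diagonal divided by the total count.

def multiclass_accuracy(table):
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return correct / total

# Hypothetical 3-class example: rows are true classes, columns are predictions.
table = [[15,  2,  3],
         [ 7, 15,  8],
         [ 2,  3, 45]]
print(multiclass_accuracy(table))  # (15 + 15 + 45) / 100 = 0.75
```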
IN THIS CHAPTER and the next we take a bird's-eye view of the wide range of different tasks that can be solved with machine learning techniques. ‘Task’ here refers to whatever it is whose performance machine learning is intended to improve (recall the definition of machine learning on p.3), for example, e-mail spam recognition. Since this is a classification task, we need to learn an appropriate classifier from training data. Many different types of classifiers exist: linear classifiers, Bayesian classifiers, distance-based classifiers, to name a few. We will refer to these different types as models; they are the subject of Chapters 4–9. Classification is just one of a range of possible tasks for which we can learn a model: other tasks reviewed in this chapter are class probability estimation and ranking. In the next chapter we will discuss regression, clustering and descriptive modelling. For each of these tasks we will discuss what it is, what variants exist, how performance at the task could be assessed, and how it relates to other tasks. We will start with some general notation that is used in this chapter and throughout the book (see Background 2.1 for the relevant mathematical concepts).
The objects of interest in machine learning are usually referred to as instances. The set of all possible instances is called the instance space, denoted 𝒳 in this book.
MACHINE LEARNING IS a practical subject as much as a computational one. While we may be able to prove that a particular learning algorithm converges to the theoretically optimal model under certain assumptions, we need actual data to investigate, e.g., the extent to which those assumptions are actually satisfied in the domain under consideration, or whether convergence happens quickly enough to be of practical use. We thus evaluate or run particular models or learning algorithms on one or more data sets, obtain a number of measurements and use these to answer particular questions we might be interested in. This broadly characterises what are known as machine learning experiments.
In the natural sciences, an experiment can be seen as a question to nature about a scientific theory. For example, Arthur Eddington's famous 1919 experiment to verify Einstein's theory of general relativity asked the question: Are rays of light bent by gravitational fields produced by large celestial objects such as the Sun? To answer this question, the perceived position of stars was recorded under several conditions including a total solar eclipse. Eddington was able to show that these measurements indeed differed to an extent unexplained by Newtonian physics but consistent with general relativity.
While you don't have to travel to the island of Príncipe to perform machine learning experiments, they bear some similarity to experiments in physics in that machine learning experiments pose questions about models that we try to answer by means of measurements on data.
RULE MODELS ARE the second major type of logical machine learning models. Generally speaking, they offer more flexibility than tree models: for instance, while decision tree branches are mutually exclusive, the potential overlap of rules may give additional information. This flexibility comes at a price, however: while it is very tempting to view a rule as a single, independent piece of information, this is often not adequate because of the way the rules are learned. Particularly in supervised learning, a rule model is more than just a set of rules: the specification of how the rules are to be combined to form predictions is a crucial part of the model.
There are essentially two approaches to supervised rule learning. One is inspired by decision tree learning: find a combination of literals – the body of the rule, which is what we previously called a concept – that covers a sufficiently homogeneous set of examples, and find a label to put in the head of the rule. The second approach goes in the opposite direction: first select a class you want to learn, and then find rule bodies that cover (large subsets of) the examples of that class. The first approach naturally leads to a model consisting of an ordered sequence of rules – a rule list – as will be discussed in Section 6.1. The second approach treats collections of rules as unordered rule sets and is the topic of Section 6.2.
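As a rough sketch of the second approach (my own simplification, with hypothetical representations: examples as dictionaries of feature values, rule bodies as lists of feature = value literals), a sequential-covering learner for a single target class might look like this:

```python
# Hedged sketch of sequential covering: learn one rule for the positive class,
# remove the positives it covers, and repeat until no positives remain.

def covers(body, example):
    return all(example[f] == v for f, v in body)

def learn_one_rule(pos, neg, features):
    """Greedily add literals until no negatives are covered (or nothing helps)."""
    body = []
    while any(covers(body, e) for e in neg):
        best = None
        for f in features:
            for v in (True, False):
                literal = (f, v)
                if literal in body:
                    continue
                p = sum(covers(body + [literal], e) for e in pos)
                n = sum(covers(body + [literal], e) for e in neg)
                if p == 0:
                    continue                      # literal covers no positives
                precision = p / (p + n)
                if best is None or precision > best[0]:
                    best = (precision, literal)
        if best is None:
            break                                 # no literal improves the rule
        body.append(best[1])
    return body

def sequential_covering(pos, neg, features):
    """Return an unordered set of rule bodies for the positive class."""
    rules, remaining = [], list(pos)
    while remaining:
        body = learn_one_rule(remaining, neg, features)
        covered = [e for e in remaining if covers(body, e)]
        if not covered:
            break
        rules.append(body)
        remaining = [e for e in remaining if not covers(body, e)]
    return rules
```

Each learned body covers at least one of the remaining positive examples, so removing the covered positives and repeating yields an unordered rule set for the chosen class, in the spirit of the second approach described above.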
This book started life in the Summer of 2008, when my employer, the University of Bristol, awarded me a one-year research fellowship. I decided to embark on writing a general introduction to machine learning, for two reasons. One was that there was scope for such a book, to complement the many more specialist texts that are available; the other was that through writing I would learn new things – after all, the best way to learn is to teach.
The challenge facing anyone attempting to write an introductory machine learning text is to do justice to the incredible richness of the machine learning field without losing sight of its unifying principles. Put too much emphasis on the diversity of the discipline and you risk ending up with a ‘cookbook’ without much coherence; stress your favourite paradigm too much and you may leave out too much of the other interesting stuff. Partly through a process of trial and error, I arrived at the approach embodied in the book, which is to emphasise both unity and diversity: unity by separate treatment of tasks and features, both of which are common across any machine learning approach but are often taken for granted; and diversity through coverage of a wide range of logical, geometric and probabilistic models.
Clearly, one cannot hope to cover all of machine learning to any reasonable depth within the confines of 400 pages.
HAVING DISCUSSED A VARIETY of tasks in the preceding two chapters, we are now in an excellent position to start discussing machine learning models and algorithms for learning them. This chapter and the next two are devoted to logical models, the hallmark of which is that they use logical expressions to divide the instance space into segments and hence construct grouping models. The goal is to find a segmentation such that the data in each segment is more homogeneous, with respect to the task to be solved. For instance, in classification we aim to find a segmentation such that the instances in each segment are predominantly of one class, while in regression a good segmentation is such that the target variable is a simple function of a small number of predictor variables. There are essentially two kinds of logical models: tree models and rule models. Rule models consist of a collection of implications or if-then rules, where the if-part defines a segment, and the then-part defines the behaviour of the model in this segment. Tree models are a restricted kind of rule model where the if-parts of the rules are organised in a tree structure.
In this chapter we consider methods for learning logical expressions or concepts from examples, which lies at the basis of both tree models and rule models. In concept learning we only learn a description for the positive class, and label everything that doesn't satisfy that description as negative.
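As a small illustration of this idea (a hedged sketch of my own, not the book's notation or example data), one simple concept learner keeps exactly the feature values shared by all positive examples and labels everything that does not satisfy that conjunction as negative:

```python
# Learn the least general conjunction of feature = value literals that
# covers every positive example; anything outside it is labelled negative.

def learn_concept(positives):
    concept = dict(positives[0])
    for example in positives[1:]:
        concept = {f: v for f, v in concept.items() if example.get(f) == v}
    return concept

def predict(concept, instance):
    return all(instance.get(f) == v for f, v in concept.items())

# Hypothetical examples with made-up features.
positives = [{"gills": False, "legs": 4, "teeth": "many"},
             {"gills": False, "legs": 4, "teeth": "few"}]
concept = learn_concept(positives)                       # {"gills": False, "legs": 4}
print(predict(concept, {"gills": True, "legs": 4, "teeth": "many"}))  # False
```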
MACHINE LEARNING IS ALL ABOUT using the right features to build the right models that achieve the right tasks – this is the slogan, visualised in Figure 3 on p.11, with which we ended the Prologue. In essence, features define a ‘language’ in which we describe the relevant objects in our domain, be they e-mails or complex organic molecules. We should not normally have to go back to the domain objects themselves once we have a suitable feature representation, which is why features play such an important role in machine learning. We will take a closer look at them in Section 1.3. A task is an abstract representation of a problem we want to solve regarding those domain objects: the most common form of these is classifying them into two or more classes, but we shall encounter other tasks throughout the book. Many of these tasks can be represented as a mapping from data points to outputs. This mapping or model is itself produced as the output of a machine learning algorithm applied to training data; there is a wide variety of models to choose from, as we shall see in Section 1.2.
We start this chapter by discussing tasks, the problems that can be solved with machine learning. No matter what variety of machine learning models you may encounter, you will find that they are designed to solve one of only a small number of tasks and use only a few different types of features.
Sure, the Internet has many security loopholes, from cyber-attack vulnerability to privacy-intrusion threats. But it does not have a few highly-connected routers in the center of the Internet that an attacker can destroy to disconnect the Internet, which would have fit the description of an “Achilles' heel”. So why would there be rumors that the Internet has an Achilles' heel?
The story started in the late 1990s with an inference result: the Internet topology exhibits a power-law distribution of node degrees. Here, the “topology” of the Internet may mean any of the following:
the graph of webpages connected by hyperlinks (like the one we mentioned in Chapter 3),
the graph of Autonomous Systems (ASs) connected by the physical and business relationships of peering (we will talk more about that in Chapter 13), and
the graph of routers connected by physical links (the focus of this chapter).
For the AS graph and the router graph, the actual distribution of the node degrees (think of the histogram of the degrees of all the nodes) is not clear due to measurement noise. For example, the AS graph data behind the power-law result had more than 50% of its links missing, and Internet exchange points further create many peering links among ASs that such measurements easily miss.
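As an aside on what "exhibits a power-law distribution of node degrees" means in practice, here is a hedged sketch (synthetic data, not real Internet measurements) of the usual check: plot the degree histogram on log-log axes and fit a straight line, whose slope estimates the exponent γ in P(k) ∝ k^(−γ).

```python
import numpy as np
import matplotlib.pyplot as plt

degrees = np.random.zipf(a=2.1, size=10_000)   # synthetic heavy-tailed degree sample
values, counts = np.unique(degrees, return_counts=True)
freq = counts / counts.sum()

# Crude exponent estimate: least-squares line on the log-log histogram.
# (Proper fitting would use maximum likelihood, but this shows the idea.)
slope, intercept = np.polyfit(np.log(values), np.log(freq), 1)
print(f"estimated exponent gamma: {-slope:.2f}")

plt.loglog(values, freq, "o", markersize=3)
plt.xlabel("node degree k")
plt.ylabel("fraction of nodes with degree k")
plt.show()
```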
You pick up your iPhone while waiting in line at a coffee shop. You Google a not-so-famous actor and get linked to a Wikipedia entry listing his recent movies and popular YouTube clips. You check out user reviews on IMDb and pick one, download that movie on BitTorrent or stream it in Netflix. But for some reason the WiFi logo on your phone is gone and you're on 3G. Video quality starts to degrade a little, but you don't know whether it's the video server getting crowded in the cloud or the Internet is congested somewhere. In any case, it costs you $10 per gigabyte, and you decide to stop watching the movie, and instead multitask between sending tweets and calling your friend on Skype, while songs stream from iCloud to your phone. You're happy with the call quality, but get a little irritated when you see that you have no new followers on Twitter.
You've got a typical networked life, an online networked life.
And you might wonder how all these technologies “kind of” work, and why sometimes they don't. Just flip through the table of contents of this book. It's a mixture: some of these questions have well-defined formulations and clear answers while for others there is still a significant gap between the theoretical models and actual practice; a few don't even have widely accepted problem statements. This book is about formulating and answering these 20 questions.
By the end of this chapter, you will count yourself lucky to get as much as a few percent of the advertised speed. Where did the rest go?
A Short Answer
First of all, the terms 3G and 4G can be confusing. There is one track following the standardization body 3GPP called UMTS or WCDMA, and another track in 3GPP2 called CDMA2000. Each also has several versions in between 2G and 3G, often called 2.5G, such as EDGE, EVDO, etc. For 4G, the main track is called Long Term Evolution (LTE), with variants such as LTE light and LTE advanced. Another competing track is called WiMAX. Some refer to evolved versions of 3G, such as HSPA+, as 4G too. All these have created quite a bit of confusion in a consumer's mind as to what really is a 3G technology and what really is a 4G technology.
You might have read that the 3G downlink speed for stationary users should be 7.2 Mbps. But when you try to download an email attachment of 3 MB, it often takes as long as one and a half minutes. You get around 267 kbps, or 3.7% of what you might expect. Who took away the other 96%?
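The arithmetic behind those numbers (using the figures quoted above, and taking 1 MB as 10^6 bytes) is simply:

```python
# Back-of-the-envelope check: a 3 MB attachment downloaded in 90 seconds.
file_size_bits = 3 * 1_000_000 * 8      # 3 MB in bits
download_time_s = 90                    # one and a half minutes
throughput_bps = file_size_bits / download_time_s
print(throughput_bps / 1_000)           # ~267 kbps
print(throughput_bps / 7_200_000)       # ~0.037, i.e. about 3.7% of 7.2 Mbps
```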
Many countries are moving towards LTE. They use a range of techniques to increase the spectral efficiency, defined as the number of bits per second that each Hz of bandwidth can support. These include methods like OFDM and MIMO, mentioned at the end of the last chapter, as well as splitting a large cell into smaller ones.
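To see what spectral efficiency means quantitatively, here is a rough illustration of my own (the bandwidth and SNR figures are hypothetical, not from the text), using the Shannon formula bits/s/Hz = log2(1 + SNR):

```python
import math

bandwidth_hz = 20e6                               # hypothetical 20 MHz channel
snr_db = 15                                       # hypothetical signal-to-noise ratio
snr_linear = 10 ** (snr_db / 10)

spectral_efficiency = math.log2(1 + snr_linear)   # bits per second per Hz
print(spectral_efficiency)                        # ~5.03 bits/s/Hz
print(bandwidth_hz * spectral_efficiency / 1e6)   # ~100 Mbps channel capacity
```

Techniques such as MIMO and smaller cells raise the achievable rate per Hz (for example by improving the effective SNR or adding spatial streams), which is why they appear in the LTE toolbox.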
Almost all of our utility bills are based on the amount we consume: water, electricity, gas, etc. But even though wireless cellular capacity is expensive to provide and difficult to crank up, consumers in some countries like the USA have been enjoying flat-rate buffets for mobile Internet access for many years. Can a restaurant keep offering buffets with the same price if its customers keep doubling their appetites every year? Or will it have to stop at some point?
In April 2010, AT&T announced its usage-based pricing for 3G data users. This was followed in March 2011 by Verizon Wireless for its iPhone and iPad users, and in June 2011 for all of its 3G data users. In July 2011, AT&T started charging fixed broadband users on U-Verse services on the basis of usage too. In March 2012, AT&T announced that those existing customers on unlimited cellular data plans will see their connection speeds throttled significantly once the usage exceeds 5 GB, effectively ending the unlimited data plan. The LTE data plans from both AT&T and Verizon Wireless for the “new iPad” launched soon after no longer offered any type of unlimited data options. In June 2012, Verizon Wireless updated their cellular pricing plans. A customer could have unlimited voice and text in exchange for turning an unlimited data plan to usage-based. AT&T followed with a similar move one month later. What a reversal going from limited voice and unlimited data to unlimited voice and limited data. Similar measures have been pursued, or are being considered, in many other countries around the world for 3G, 4G, and even wired broadband networks.