THE PREVIOUS CHAPTER introduced binary classification and associated tasks such as ranking and class probability estimation. In this chapter we will go beyond these basic tasks in a number of ways. Section 3.1 discusses how to handle more than two classes. In Section 3.2 we consider the case of a real-valued target variable. Section 3.3 is devoted to various forms of learning that are either unsupervised or aimed at learning descriptive models.
Handling more than two classes
Certain concepts are fundamentally binary. For instance, the notion of a coverage curve does not easily generalise to more than two classes. We will now consider general issues related to having more than two classes in classification, scoring and class probability estimation. The discussion will address two issues: how to evaluate multi-class performance, and how to build multi-class models out of binary models. The latter is necessary for some models, such as linear classifiers, that are primarily designed to separate two classes. Other models, including decision trees, handle any number of classes quite naturally.
Multi-class classification
Classification tasks with more than two classes are very common. For instance, once a patient has been diagnosed as suffering from a rheumatic disease, the doctor will want to classify him or her further into one of several variants. If we have k classes, performance of a classifier can be assessed using a k-by-k contingency table. Assessing performance is easy if we are interested in the classifier's accuracy, which is still the sum of the descending diagonal of the contingency table, divided by the number of test instances.
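As a minimal sketch of the accuracy computation described above, the following assumes a hypothetical 3-class contingency table whose rows are actual classes and whose columns are predicted classes. The counts are invented for illustration and do not come from the text.

```python
def accuracy(table):
    """Accuracy from a k-by-k contingency table (rows: actual, columns: predicted)."""
    k = len(table)
    # Correct predictions sit on the descending diagonal.
    correct = sum(table[i][i] for i in range(k))
    # Every test instance appears in exactly one cell.
    total = sum(sum(row) for row in table)
    return correct / total


# Hypothetical counts for three classes (100 test instances in total).
table = [
    [15, 2, 3],
    [7, 15, 8],
    [2, 3, 45],
]

print(accuracy(table))  # (15 + 15 + 45) / 100 = 0.75
```

The off-diagonal cells record which classes are confused with which, information that the single accuracy figure discards.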
IN THIS CHAPTER and the next we take a bird's-eye view of the wide range of different tasks that can be solved with machine learning techniques. ‘Task’ here refers to whatever it is that machine learning is intended to improve performance of (recall the definition of machine learning on p.3), for example, e-mail spam recognition. Since this is a classification task, we need to learn an appropriate classifier from training data. Many different types of classifiers exist: linear classifiers, Bayesian classifiers, distance-based classifiers, to name a few. We will refer to these different types as models; they are the subject of Chapters 4–9. Classification is just one of a range of possible tasks for which we can learn a model: other tasks that will pass the review in this chapter are class probability estimation and ranking. In the next chapter we will discuss regression, clustering and descriptive modelling. For each of these tasks we will discuss what it is, what variants exist, how performance at the task could be assessed, and how it relates to other tasks. We will start with some general notation that is used in this chapter and throughout the book (see Background 2.1 for the relevant mathematical concepts).
The objects of interest in machine learning are usually referred to as instances. The set of all possible instances is called the instance space, denoted in this book.
MACHINE LEARNING IS a practical subject as much as a computational one. While we may be able to prove that a particular learning algorithm converges to the theoretically optimal model under certain assumptions, we need actual data to investigate, e.g., the extent to which those assumptions are actually satisfied in the domain under consideration, or whether convergence happens quickly enough to be of practical use. We thus evaluate or run particular models or learning algorithms on one or more data sets, obtain a number of measurements and use these to answer particular questions we might be interested in. This broadly characterises what is known as machine learning experiments.
In the natural sciences, an experiment can be seen as a question to nature about a scientific theory. For example, Arthur Eddington's famous 1919 experiment to verify Einstein's theory of general relativity asked the question: Are rays of light bent by gravitational fields produced by large celestial objects such as the Sun? To answer this question, the perceived position of stars was recorded under several conditions including a total solar eclipse. Eddington was able to show that these measurements indeed differed to an extent unexplained by Newtonian physics but consistent with general relativity.
While you don't have to travel to the island of Príncipe to perform machine learning experiments, they bear some similarity to experiments in physics in that machine learning experiments pose questions about models that we try to answer by means of measurements on data.
RULE MODELS ARE the second major type of logical machine learning models. Generally speaking, they offer more flexibility than tree models: for instance, while decision tree branches are mutually exclusive, the potential overlap of rules may give additional information. This flexibility comes at a price, however: while it is very tempting to view a rule as a single, independent piece of information, this is often not adequate because of the way the rules are learned. Particularly in supervised learning, a rule model is more than just a set of rules: the specification of how the rules are to be combined to form predictions is a crucial part of the model.
There are essentially two approaches to supervised rule learning. One is inspired by decision tree learning: find a combination of literals – the body of the rule, which is what we previously called a concept – that covers a sufficiently homogeneous set of examples, and find a label to put in the head of the rule. The second approach goes in the opposite direction: first select a class you want to learn, and then find rule bodies that cover (large subsets of) the examples of that class. The first approach naturally leads to a model consisting of an ordered sequence of rules – a rule list – as will be discussed in Section 6.1. The second approach treats collections of rules as unordered rule sets and is the topic of Section 6.2.
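The difference between an ordered rule list and an unordered rule set can be sketched as follows. This is an illustration under stated assumptions rather than the learning algorithms themselves: the rules are written by hand, and the feature names (`length`, `gills`), labels and helper functions are hypothetical.

```python
def predict_rule_list(rules, default, instance):
    """Ordered rule list: the first rule whose body covers the instance fires."""
    for body, label in rules:
        if body(instance):
            return label
    return default  # no rule covers the instance


def covering_rules(rules, instance):
    """Unordered rule set: collect the heads of all rules covering the instance.

    How overlapping heads are combined into one prediction is a separate
    design decision and is deliberately left open here.
    """
    return [label for body, label in rules if body(instance)]


# Hypothetical hand-written rules: (body, head) pairs.
rules = [
    (lambda x: x["length"] >= 5, "negative"),
    (lambda x: x["gills"] == "no", "positive"),
]

instance = {"length": 5, "gills": "no"}
print(predict_rule_list(rules, "negative", instance))  # negative (first match wins)
print(covering_rules(rules, instance))                 # ['negative', 'positive']
```

The same two rules give an unambiguous answer as a list but conflicting answers as a set, which is why the combination scheme has to be treated as part of the model.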
This book started life in the Summer of 2008, when my employer, the University of Bristol, awarded me a one-year research fellowship. I decided to embark on writing a general introduction to machine learning, for two reasons. One was that there was scope for such a book, to complement the many more specialist texts that are available; the other was that through writing I would learn new things – after all, the best way to learn is to teach.
The challenge facing anyone attempting to write an introductory machine learning text is to do justice to the incredible richness of the machine learning field without losing sight of its unifying principles. Put too much emphasis on the diversity of the discipline and you risk ending up with a ‘cookbook’ without much coherence; stress your favourite paradigm too much and you may leave out too much of the other interesting stuff. Partly through a process of trial and error, I arrived at the approach embodied in the book, which is to emphasise both unity and diversity: unity by separate treatment of tasks and features, both of which are common across any machine learning approach but are often taken for granted; and diversity through coverage of a wide range of logical, geometric and probabilistic models.
Clearly, one cannot hope to cover all of machine learning to any reasonable depth within the confines of 400 pages.
HAVING DISCUSSED A VARIETY of tasks in the preceding two chapters, we are now in an excellent position to start discussing machine learning models and algorithms for learning them. This chapter and the next two are devoted to logical models, the hallmark of which is that they use logical expressions to divide the instance space into segments and hence construct grouping models. The goal is to find a segmentation such that the data in each segment is more homogeneous, with respect to the task to be solved. For instance, in classification we aim to find a segmentation such that the instances in each segment are predominantly of one class, while in regression a good segmentation is such that the target variable is a simple function of a small number of predictor variables. There are essentially two kinds of logical models: tree models and rule models. Rule models consist of a collection of implications or if-then rules, where the if-part defines a segment, and the then-part defines the behaviour of the model in this segment. Tree models are a restricted kind of rule model where the if-parts of the rules are organised in a tree structure.
In this chapter we consider methods for learning logical expressions or concepts from examples, which lies at the basis of both tree models and rule models. In concept learning we only learn a description for the positive class, and label everything that doesn't satisfy that description as negative.
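To make the last point concrete, here is a minimal sketch that assumes a concept is a conjunction of feature-value literals stored as a dictionary. The representation, the feature names and the example concept are assumptions made for illustration only.

```python
def satisfies(concept, instance):
    """True if the instance agrees with every literal in the conjunction."""
    return all(instance.get(feature) == value for feature, value in concept.items())


def classify(concept, instance):
    """Concept learning describes only the positive class; the rest is negative."""
    return "positive" if satisfies(concept, instance) else "negative"


# Hypothetical concept: gills = no AND teeth = many.
concept = {"gills": "no", "teeth": "many"}

print(classify(concept, {"gills": "no", "teeth": "many", "length": 4}))   # positive
print(classify(concept, {"gills": "yes", "teeth": "many", "length": 4}))  # negative
```

Features that the concept does not mention, such as `length` here, are simply ignored.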
MACHINE LEARNING IS ALL ABOUT using the right features to build the right models that achieve the right tasks – this is the slogan, visualised in Figure 3 on p.11, with which we ended the Prologue. In essence, features define a ‘language’ in which we describe the relevant objects in our domain, be they e-mails or complex organic molecules. We should not normally have to go back to the domain objects themselves once we have a suitable feature representation, which is why features play such an important role in machine learning. We will take a closer look at them in Section 1.3. A task is an abstract representation of a problem we want to solve regarding those domain objects: the most common form of these is classifying them into two or more classes, but we shall encounter other tasks throughout the book. Many of these tasks can be represented as a mapping from data points to outputs. This mapping or model is itself produced as the output of a machine learning algorithm applied to training data; there is a wide variety of models to choose from, as we shall see in Section 1.2.
We start this chapter by discussing tasks, the problems that can be solved with machine learning. No matter what variety of machine learning models you may encounter, you will find that they are designed to solve one of only a small number of tasks and use only a few different types of features.
We generalize the theory of canonical formulas for K4, the logic of transitive frames, to wK4, the logic of weakly transitive frames. Our main result establishes that each logic over wK4 is axiomatizable by canonical formulas, thus generalizing Zakharyaschev’s theorem for logics over K4. The key new ingredients include the concepts of transitive and strongly cofinal subframes of weakly transitive spaces. This yields, along with the standard notions of subframe and cofinal subframe logics, the new notions of transitive subframe and strongly cofinal subframe logics over wK4. We obtain axiomatizations of all four kinds of subframe logics over wK4. We conclude by giving a number of examples of different kinds of subframe logics over wK4.
We emphasize the role of the choice of vocabulary in formalization of a mathematical area and remark that this is a particular preoccupation of logicians. We use this framework to discuss Kennedy’s notion of ‘formalism freeness’ in the context of various schools in model theory. Then we clarify some of the mathematical issues in recent discussions of purity in the proof of the Desargues proposition. We note that the conclusion of ‘spatial content’ from the Desargues proposition involves arguments which are algebraic and even metamathematical. Hilbert showed that the Desargues proposition implies the coordinatizing ring is associative, which in turn implies the existence of a three-dimensional geometry in which the given plane can be embedded. With W. Howard we give a new proof, removing Hilbert’s ‘detour’ through algebra, of the ‘geometric’ embedding theorem.
Finally, our investigation of purity leads to the conclusion that even the introduction of explicit definitions in a proof can violate purity. We argue that although both involve explicit definition, our proof of the embedding theorem is pure while Hilbert’s is not. Thus the determination of whether an argument is pure turns on the content of the particular proof. Moreover, formalizing the situation does not provide a tool for characterizing purity.
Computing and Language Variation explores dialects and social differences in language computationally, examining topics such as how (and how much) linguistic differences impede intelligibility, how national borders accelerate and direct change, how opinio
We prove that the threshold for the appearance of a k-regular subgraph in G_{n,p} is at most the threshold for the appearance of a non-empty (k+1)-core. This improves a result of Pralat, Verstraete and Wormald [5] and proves a conjecture of Bollobás, Kim and Verstraete [3].
Sure, the Internet has many security loopholes, from cyber-attack vulnerability to privacy-intrusion threats. But it does not have a few highly-connected routers in the center of the Internet that an attacker can destroy to disconnect the Internet, which would have fit the description of an “Achilles' heel”. So why would there be rumors that the Internet has an Achilles' heel?
The story started in the late 1990s with an inference result: the Internet topology exhibits a power-law distribution of node degrees. Here, the “topology” of the Internet may mean any of the following:
the graph of webpages connected by hyperlinks (like the one we mentioned in Chapter 3),
the graph of Autonomous Systems (ASs) connected by the physical and business relationships of peering (we will talk more about that in Chapter 13), and
the graph of routers connected by physical links (the focus of this chapter).
For the AS graph and the router graph, the actual distribution of the node degrees (think of the histogram of the degrees of all the nodes) is not clear due to measurement noise. For example, the AS graph data behind the power-law distribution had more than 50% of links missing. Internet exchange points further lead to many peering links among ASs.
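The degree histogram mentioned above can be computed from an edge list as in the sketch below. The toy graph and the function name are hypothetical, and the example does not model the missing-link problem: it only shows what "distribution of node degrees" means operationally.

```python
from collections import Counter


def degree_histogram(edges):
    """Map each degree to the number of nodes having that degree.

    Nodes are discovered from the undirected edge list, so isolated
    nodes (degree 0) are not represented.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


# Hypothetical toy graph: degrees are a=3, b=2, c=2, d=1.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

print(sorted(degree_histogram(edges).items()))  # [(1, 1), (2, 2), (3, 1)]
```

Because every unobserved link lowers the measured degree of two nodes, a data set with many missing links can yield a noticeably different histogram from the true one.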