Quasi-interpretations are a technique for guaranteeing complexity bounds on first-order functional programs: in particular, with termination orderings, they give a sufficient condition for a program to be executable in polynomial time (Marion and Moyen 2000), which we call the P-criterion here. We study properties of the programs satisfying the P-criterion in order to improve the understanding of its intensional expressive power. Given a program, its blind abstraction is the non-deterministic program obtained by replacing all constructors with the same arity by a single one. A program is blindly polytime if its blind abstraction terminates in polynomial time. We show that all programs satisfying a variant of the P-criterion are in fact blindly polytime. Then we give two extensions of the P-criterion: one relaxing the termination ordering condition and the other (the bounded-value property) giving a necessary and sufficient condition for a program to be polynomial time executable, with memoisation.
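As an illustrative sketch (ours, not the paper's), the blind abstraction of a term can be computed by collapsing every constructor to a canonical one determined solely by its arity, so that only the shape of the term survives:

```python
# Terms of a first-order functional program, represented as
# (constructor_name, argument_tuple) pairs.
def blind(term):
    """Blind abstraction: replace each constructor by a canonical
    constructor of the same arity, keeping only the term's shape."""
    name, args = term
    return ("c%d" % len(args), tuple(blind(a) for a in args))

# Cons(Zero, Nil) and Pair(Nil, Zero) become the same blind term,
# since both have shape c2(c0, c0).
cons_term = ("Cons", (("Zero", ()), ("Nil", ())))
pair_term = ("Pair", (("Nil", ()), ("Zero", ())))
```

Since distinct constructors of equal arity are identified, a program's blind abstraction is in general non-deterministic, as the abstract notes.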
Probabilistic models explicitly take into account uncertainty and deal with our imperfect knowledge of the world. Such models are of fundamental significance in Machine Learning since our understanding of the world will always be limited by our observations and understanding. We will focus initially on using probabilistic models as a kind of expert system.
In Part I, we assume that the model is fully specified. That is, given a model of the environment, how can we use it to answer questions of interest? We will relate the complexity of inferring quantities of interest to the structure of the graph describing the model. In addition, we will describe operations in terms of manipulations on the corresponding graphs. As we will see, provided the graphs are simple tree-like structures, most quantities of interest can be computed efficiently.
Part I deals with manipulating mainly discrete variable distributions and forms the background to all the later material in the book.
In 1965 Erdős conjectured a formula for the maximum number of edges in a k-uniform n-vertex hypergraph without a matching of size s. We prove this conjecture for k = 3 and all s ≥ 1 and n ≥ 4s.
A family of sets 𝒜 is said to be intersecting if A ∩ B ≠ ∅ for all A, B ∈ 𝒜. It is a well-known and simple fact that an intersecting family of subsets of [n] = {1, 2, . . ., n} can contain at most 2^(n−1) sets. Katona, Katona and Katona ask the following question. Suppose instead 𝒜 ⊂ 2^[n] satisfies |𝒜| = 2^(n−1) + i for some fixed i > 0. Create a new family 𝒜_p by choosing each member of 𝒜 independently with some fixed probability p. How do we choose 𝒜 to maximize the probability that 𝒜_p is intersecting? They conjecture that there is a nested sequence of optimal families for i = 1, 2, . . ., 2^(n−1). In this paper, we show that the families [n]^(≥r) = {A ⊂ [n]: |A| ≥ r} are optimal for the appropriate values of i, thereby proving the conjecture for this sequence of values. Moreover, we show that for intermediate values of i there exist optimal families lying between those we have found. It turns out that the optimal families we find simultaneously maximize the number of intersecting subfamilies of each possible order.
Standard compression techniques appear inadequate to solve the problem as they do not preserve intersection properties of subfamilies. Instead, our main tool is a novel compression method, together with a way of ‘compressing subfamilies’, which may be of independent interest.
In Part II we address how to learn a model from data. In particular we will discuss learning a model as a form of inference on an extended distribution, now taking into account the parameters of the model.
Learning a model or model parameters from data forces us to deal with uncertainty since with only limited data we can never be certain which is the ‘correct’ model. We also address how the structure of a model, not just its parameters, can in principle be learned.
In Part II we show how learning can be achieved under simplifying assumptions, such as maximum likelihood, which sets the parameters to the values most likely to reproduce the observed data. We also discuss the problems that arise when, as is often the case, there is missing data.
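As a minimal illustration of the maximum likelihood idea (our sketch, not taken from the book), consider estimating the bias of a coin from observed flips: the likelihood p^h (1−p)^(n−h) of h heads in n flips is maximised at p = h/n, the sample proportion.

```python
import math

def bernoulli_mle(flips):
    """Maximum likelihood estimate of the heads probability
    from a list of 0/1 outcomes: the sample proportion."""
    return sum(flips) / len(flips)

def log_likelihood(p, flips):
    """Log-likelihood of bias p for the observed flips."""
    return sum(math.log(p if f else 1.0 - p) for f in flips)

flips = [1, 1, 1, 0]
p_hat = bernoulli_mle(flips)   # 0.75: no other p scores higher
```

The log-likelihood at p_hat exceeds that of any competing value, which is exactly the "most likely to reproduce the observed data" criterion.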
Together with Part I, Part II prepares the basic material required to embark on understanding models in machine learning, having the tools required to learn models from data and subsequently query them to answer questions of interest.
Sampling methods are popular and well known for approximate inference. In this chapter we give an introduction to the less well known class of deterministic approximation techniques. These have been spectacularly successful in branches of the information sciences and many have their origins in the study of large-scale physical systems.
Introduction
Deterministic approximate inference methods are an alternative to the sampling techniques discussed in Chapter 27. Drawing exact independent samples is typically computationally intractable and assessing the quality of the sample estimates is difficult. In this chapter we discuss some alternatives. The first, Laplace's method, is a simple perturbation technique. The second class of methods are those that produce rigorous bounds on quantities of interest. Such methods are interesting since they provide certain knowledge – it may be sufficient, for example, to show that a marginal probability is greater than 0.1 in order to make an informed decision. A further class of methods are the consistency methods, such as loopy belief propagation. Such methods have revolutionised certain fields, including error correction [197]. It is important to bear in mind that no single approximation technique, deterministic or stochastic, is going to beat all others on all problems, given the same computational resources. In this sense, insight as to the properties of the various approximations is useful in matching an approximation method to the problem at hand.
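As a small one-dimensional illustration of Laplace's method (our sketch; the chapter's development is more general), an integral ∫ exp(−f(x)) dx is approximated by fitting a Gaussian at a mode x0 of the integrand, using the curvature f''(x0) there:

```python
import math

def laplace_log_integral(f, x0, h=1e-4):
    """Laplace approximation to log of the integral of exp(-f(x)) dx,
    expanding around a pre-located mode x0 of the integrand."""
    # numerical second derivative of f at the mode
    f2 = (f(x0 - h) - 2.0 * f(x0) + f(x0 + h)) / h ** 2
    return -f(x0) + 0.5 * math.log(2.0 * math.pi / f2)

# Sanity check: f(x) = x^2/2 makes exp(-f) an unnormalised standard
# Gaussian, whose integral is sqrt(2*pi); here Laplace is exact.
approx = laplace_log_integral(lambda x: 0.5 * x * x, x0=0.0)
```

For a Gaussian integrand the approximation is exact; for other integrands it is a local perturbation estimate, which is what makes it cheap but uncontrolled.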
We can now make a first connection between probability and graph theory. A belief network introduces structure into a probabilistic model by using graphs to represent independence assumptions among the variables. Probability operations such as marginalising and conditioning then correspond to simple operations on the graph, and details about the model can be ‘read’ from the graph. There is also a benefit in terms of computational efficiency. Belief networks cannot capture all possible relations among variables. However, they are natural for representing ‘causal’ relations, and they are a part of the family of graphical models we study further in Chapter 4.
The benefits of structure
It's tempting to think of feeding a mass of undigested data and probability distributions into a computer and getting back good predictions and useful insights in extremely complex environments. Unfortunately, such a naive approach is likely to fail. The number of ways variables can interact is extremely large, so without some sensible assumptions we are unlikely to make a useful model. Independently specifying all the entries of a table p(x1, …, xN) over binary variables xi takes O(2^N) space, which is impractical for more than a handful of variables. This is clearly infeasible in many machine learning and related application areas where we need to deal with distributions over potentially hundreds if not millions of variables. Structure is also important for the computational tractability of inferring quantities of interest.
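A back-of-the-envelope comparison (our illustration) shows how much structure buys: a full joint table over N binary variables needs 2^N entries, while a chain-structured belief network p(x1)·∏ p(xi | xi−1) needs only O(N):

```python
def full_table_entries(n):
    """Entries needed to tabulate a joint over n binary variables."""
    return 2 ** n

def chain_entries(n):
    """Entries for a chain-structured belief network: one prior table
    of size 2, plus n - 1 conditional tables of size 4 each."""
    return 2 + 4 * (n - 1)
```

Already at 20 variables the full table has over a million entries, while the chain needs fewer than a hundred.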
We investigate the existence and uniqueness of the p-means e_p and the median e_1 of a probability measure μ on a Finsler manifold, in relation to the convexity of the support of μ. We prove that e_p is the limit point of a continuous-time gradient flow. Under an additional condition, which is always satisfied for p ≥ 2, a discretization of this path converges to e_p. This provides an algorithm for determining the Finsler center points.
For each of us who appear to have had a successful experiment there are many to whom their own experiments seem barren and negative.
Melvin Calvin, 1961 Nobel Lecture
An experiment is not considered “barren and negative” when it disproves your conjecture: an experiment fails by being inconclusive.
Successful experiments are partly the product of good experimental designs, as described in Chapter 2; there is also an element of luck (or savvy) in choosing a well-behaved problem to study. Furthermore, computational research on algorithms provides unusual opportunities for “tuning” experiments to yield more successful analyses and stronger conclusions. This chapter surveys techniques for building better experiments along these lines.
We start with a discussion of what makes a data set good or bad in this context. The remainder of this section surveys strategies for tweaking experimental designs to yield more successful outcomes.
If tweaks are not sufficient, stronger measures can be taken; Section 6.1 surveys variance reduction techniques, which modify test programs to generate better data, and Section 6.2 describes simulation shortcuts, which produce more data per unit of computation time.
The key idea is to exploit the fact, pointed out in Section 5.1, that the application program that implements an algorithm for practical use is distinct from the test program that describes algorithm performance. The test program need not resemble the application program at all; it is only required to reproduce faithfully the algorithm properties of interest.
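As a toy sketch of the variance-reduction idea (ours; Section 6.1's techniques go well beyond this), running two stand-in cost functions on the same random inputs, known as common random numbers, removes input-to-input noise from the estimate of their cost difference:

```python
import random

def cost_a(xs):
    return sum(xs)                    # stand-in for algorithm A's cost

def cost_b(xs):
    return sum(xs) + 0.1 * len(xs)    # stand-in for algorithm B's cost

def mean_diff_common(trials=200, n=50, seed=1):
    """Estimate E[cost_b - cost_a] by evaluating both costs
    on the *same* random inputs in every trial."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        total += cost_b(xs) - cost_a(xs)
    return total / trials
```

Because the shared random inputs cancel in each difference, the estimator recovers the true gap (here 0.1·n = 5.0) almost exactly, whereas independent inputs for each algorithm would leave the full variance of the costs in the estimate.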
The purpose of computing is insight, not numbers.
Richard Hamming, Numerical Methods for Scientists and Engineers
Some questions:
You are a working programmer given a week to reimplement a data structure that supports client transactions, so that it runs efficiently when scaled up to a much larger client base. Where do you start?
You are an algorithm engineer, building a code repository to hold fast implementations of dynamic multigraphs. You read papers describing asymptotic bounds for several approaches. Which ones do you implement?
You are an operations research consultant, hired to solve a highly constrained facility location problem. You could build the solver from scratch or buy optimization software and tune it for the application. How do you decide?
You are a Ph.D. student who just discovered a new approximation algorithm for graph coloring that will make your career. But you're stuck on the average-case analysis. Is the theorem true? If so, how can you prove it?
You are the adviser to that Ph.D. student, and you are skeptical that the new algorithm can compete with state-of-the-art graph coloring algorithms. How do you find out?
One good way to answer all these questions is: run experiments to gain insight.
This book is about experimental algorithmics, which is the study of algorithms and their performance by experimental means. We interpret the word algorithm very broadly, to include algorithms and data structures, as well as their implementations in source code and machine code.
In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selection amongst them for the purposes of a Calculating Engine. One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.
Ada Byron, Memoir on the Analytical Engine, 1843
This chapter considers an essential question raised by Lady Byron in her famous memoir: How to make it run faster?
This question can be addressed at all levels of the algorithm design hierarchy sketched in Figure 1.1 of Chapter 1, including systems, algorithms, code, and hardware. Here we focus on tuning techniques that lie between the algorithm design and hardware levels. We start with the assumption that the system analysis and abstract algorithm design work has already taken place, and that a basic implementation of an algorithm with good asymptotic performance is in hand. The tuning techniques in this chapter are meant to improve upon the abstract design work, not replace it.
Tuning exploits the gaps between practical experience and the simplifying assumptions necessary to theory, by focusing on constant factors instead of asymptotics, secondary instead of dominant costs, and performance on “typical” inputs rather than theoretical classes. Many of the ideas presented here are known in the folklore under the general rubric of “code tuning.”
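A small example of constant-factor tuning in this spirit (our illustration, not from the chapter): when only the nearest point is needed, comparing squared distances skips a square root per candidate while leaving the O(n) scan, and the answer, unchanged:

```python
import math

def nearest_naive(q, points):
    """Linear scan using true Euclidean distances."""
    return min(points, key=lambda p: math.dist(q, p))

def nearest_tuned(q, points):
    """Same scan over squared distances: order is preserved,
    so the per-candidate sqrt (and the call overhead) disappears."""
    qx, qy = q  # hoist the unpacking out of the loop
    return min(points, key=lambda p: (p[0] - qx) ** 2 + (p[1] - qy) ** 2)
```

This is exactly a secondary-cost optimization: the asymptotic design is untouched, and only measurement can tell whether the saved constant factor matters on real inputs.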
This guidebook is written for anyone – student, researcher, or practitioner – who wants to carry out computational experiments on algorithms (and programs) that yield correct, general, informative, and useful results. (We take the wide view and use the term “algorithm” to mean “algorithm or program” from here on.)
Whether the goal is to predict algorithm performance or to build faster and better algorithms, the experiment-driven methodology outlined in these chapters provides insights into performance that cannot be obtained by purely abstract means or by simple runtime measurements. The past few decades have seen considerable developments in this approach to algorithm design and analysis, both in terms of number of participants and in methodological sophistication.
In this book I have tried to present a snapshot of the state of the art in this field (which is known as experimental algorithmics and empirical algorithmics), at a level suitable for the newcomer to computational experiments. The book is aimed at a reader with some undergraduate computer science experience: you should know how to program, and ideally you have had at least one course in data structures and algorithm analysis. Otherwise, no previous experience is assumed regarding the other topics addressed here, which range widely from architectures and operating systems, to probability theory, to techniques of statistics and data analysis.
A note to academics: The book takes a nuts-and-bolts approach that would be suitable as a main or supplementary text in a seminar-style course on advanced algorithms, experimental algorithmics, algorithm engineering, or experimental methods in computer science.
Strategy without tactics is the slowest route to victory. Tactics without strategy is the noise before defeat.
Sun Tzu, The Art of War
W. I. B. Beveridge, in his classic guidebook for young scientists [7], likens scientific research “to warfare against the unknown”:
The procedure most likely to lead to an advance is to concentrate one's forces on a very restricted sector chosen because the enemy is believed to be weakest there. Weak spots in the defence may be found by preliminary scouting or by tentative attacks.
This chapter is about developing small- and large-scale plans of attack in algorithmic experiments.
To make the discussion concrete, we consider algorithms for the graph coloring (GC) problem. The input is a graph G containing n vertices and m edges. A coloring of G is an assignment of colors to vertices such that no two adjacent vertices have the same color. Figure 2.1 shows an example graph with eight vertices and 10 edges, colored with four colors. The problem is to find a coloring that uses a minimum number of colors – is 4 the minimum in this case?
When restricted to planar graphs, this is the famous map coloring problem, which is to color the regions of a map so that adjacent regions have different colors. Only four colors are needed for any map, but in the general graph problem, as many as n colors may be required.
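For concreteness, a greedy heuristic is the simplest baseline one might experiment with (our sketch; it is not guaranteed to find the minimum number of colors): scan the vertices in some order and give each the smallest color absent from its already-colored neighbors.

```python
def greedy_color(adj):
    """Greedy coloring of a graph given as {vertex: neighbor_list}.
    Uses at most max_degree + 1 colors, possibly more than optimal."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:   # smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color

# A triangle needs three colors, and greedy finds them.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

How far such a heuristic falls short of optimal, and on which input classes, is precisely the kind of question the chapter's experimental plans are designed to answer.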
Really, the slipshod way we deal with data is a disgrace to civilization.
M. J. Moroney, Facts from Figures
Information scientists tell us that data, alone, have no value or meaning [1]. When organized and interpreted, data become information, which is useful for answering factual questions: Which is bigger, X or Y? How many Z's are there? A body of information can be further transformed into knowledge, which reflects understanding of how and why, at a level sufficient to direct choices and make predictions: Which algorithm should I use for this application? How long will it take to run?
Data analysis is a process of inspecting, summarizing, and interpreting a set of data to transform it into something useful: information is the immediate result, and knowledge the ultimate goal.
This chapter surveys some basic techniques of data analysis and illustrates their application to algorithmic questions. Section 7.1 presents techniques for analyzing univariate (one-dimensional) data samples. Section 7.2 surveys techniques for analyzing bivariate data samples, which are expressed as pairs of (X, Y) points. No statistical background is required of the reader.
One chapter is not enough to cover all the data analysis techniques that are useful to algorithmic experiments – something closer to a few bookshelves would be needed. Here we focus on describing a small collection of techniques that address the questions most commonly asked about algorithms, and on knowing which technique to apply in a given scenario.
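As a tiny illustration of both kinds of analysis (ours, using only the standard library): summarizing a univariate sample by its location and spread, and fitting a least-squares line to bivariate (X, Y) pairs.

```python
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly y = 2x

# univariate summaries
loc = statistics.mean(xs)        # location (central tendency)
spread = statistics.stdev(xs)    # sample standard deviation

# bivariate: least-squares slope and intercept for y ~ a*x + b
mx, my = statistics.mean(xs), statistics.mean(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
```

The same two steps, summarize each variable, then model the relationship between pairs, are the pattern Sections 7.1 and 7.2 develop in full.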