All of the algorithms we have come across so far make several (severe) assumptions on the domain. Together with the knowledge we feed into our learning systems, the representation itself and the implementation of algorithms may result in heavy biases. But what if we just look at the objects we are given and their relational properties? Why should we try to discriminate indistinguishable objects instead of interpreting indiscernability as “being-of-the-same-breed” – whatever our current knowledge of different existing breeds is?
At the beginning of the last chapter we discovered that features induce equivalence relations and that equivalence relations create blocks of indiscernible objects, that is, “small groups of similar, equal, or equivalent things”. Any two objects in an equivalence class cannot be distinguished from each other, but two objects from different classes can be well discriminated. For our information systems that usually provide a large number of features, we also have many equivalence relations. Furthermore, any intersection of any subset of such equivalence relations also forms a new equivalence relation. And because equivalence relations are relations, and because relations are sets, it appears to be an interesting idea to consider the intersection of equivalence relations as a much finer and more detailed partitioning of our base set.
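The intersection construction just described can be made concrete with a small sketch (in Python, on an invented toy universe with two illustrative features; the names are not from the text):

```python
from itertools import product

# A toy information system: six objects described by two features.
objects = ["a", "b", "c", "d", "e", "f"]
colour = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "blue"}
size = {"a": "small", "b": "large", "c": "small", "d": "large", "e": "small", "f": "small"}

def partition(universe, feature):
    """Group objects into equivalence classes: same feature value = indiscernible."""
    blocks = {}
    for obj in universe:
        blocks.setdefault(feature[obj], set()).add(obj)
    return sorted(map(frozenset, blocks.values()), key=sorted)

def intersect_partitions(p1, p2):
    """Intersecting two equivalence relations yields a finer partition."""
    return sorted((b1 & b2 for b1, b2 in product(p1, p2) if b1 & b2), key=sorted)

by_colour = partition(objects, colour)            # {a,b,e} and {c,d,f}
by_size = partition(objects, size)                # {a,c,e,f} and {b,d}
finer = intersect_partitions(by_colour, by_size)  # {a,e}, {b}, {c,f}, {d}
```

Each block of the resulting partition contains exactly the objects that agree on both features, so the intersection is always at least as fine as either original partition.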
Knowledge discovery, machine learning, data mining, pattern recognition, and rule invention are all about algorithms that are designed to extract knowledge from data and to describe patterns by rules.
One of the cornerstones of (traditional) artificial intelligence is the assumption that
Intelligent behaviour requires rational, knowledge-based processes of decision and action.
These processes include the acquisition of new knowledge, which we call machine learning or knowledge discovery. However, when talking about knowledge-based systems we first need to explain what we mean by knowledge. If we try to define learning by intelligence, we need to explain intelligence, and if we want to explain intelligence, we need to explain knowledge. Bertrand Russell (1992, 1995) has given a very precise and in our case very helpful (and actually entirely sufficient) definition of knowledge:
Knowledge is the ability to discriminate things from each other.
As a consequence, learning means acquiring the ability to recognise and differentiate between different things. Thus, knowledge acquisition is a process that is initiated and (autonomously) run by a system whose purpose is to learn by itself. L. G. Valiant (1984) said that
Learning means acquiring a program without a programmer.
No software without a program, no program without an algorithm. No algorithm without a theory, and no theory without a clear syntax and semantics. In this chapter we define the fundamental concepts that we need to speak about knowledge discovery in a clear language without too much confusion.
If we try to put all the important information about machine learning in just a small box, it would look like this:
Machine learning
Machine learning is concerned with the problem of inducing a concept from a sample of instances of our domain. Given a classification, the task is to define a mapping that approximates an unknown target function that assigns to each object a target class label.
The outcome is a hypothesis h of a certain quality, and the process of inducing such a hypothesis crucially depends on the representation of our domain.
This rather rough picture is described in detail in the following sections.
First we need to specify what we will be talking about and the terms we will be using.
Representation
Machine learning and knowledge discovery are concerned with the description and discrimination of objects with respect to their properties and/or the classes they belong to.
First we need to ask ourselves how to represent our knowledge of the world. The part of the world that we live in and that we shall reason about is called the domain. To be able to talk about objects of the domain we need to have representations thereof.
Talking about the discovery of knowledge requires us to understand “knowledge” first. In the last chapter we defined knowledge to be what it takes to discriminate different things from each other.
In this chapter we will develop a more formal framework of knowledge structures that enables us to describe the process of discovering new knowledge.
Information is something that may change knowledge, and knowledge is the ability to relate things by putting a structure on them. Knowledge is not made from the things we put into order, and information does not change the things themselves. Rather, knowledge is a set of relations describing things, and information helps us to describe a relation's utility for classifying things. But, then, why do we assume information to be describable by a set of entities, each of which can take a certain number of different states, if we do not know whether there are entities we have never seen before, nor how many states the entities can possibly take? And why do we explain causality by means of probabilistic dependence?
There are many definitions of what knowledge could be, and there are many approaches to knowledge representation formalisms. They all agree that there is knowledge of different qualities: factual knowledge, weak knowledge, procedural knowledge, hard knowledge, motor knowledge, world knowledge, and behavioural knowledge are just a few. A sloppy and weak definition might simply be: knowledge is what we want to acquire by learning.
Describing objects by features is a very common thing to do. For example, many decision support systems use a tree-like representation of cases, where every branch in the tree corresponds to a feature and its observed value. But which features can be used to model a certain concept? What is the shortest and most meaningful rule with which we can describe a distinct set of objects using our knowledge?
In the previous chapter we saw how similarity measures can be used to group objects into (hopefully) meaningful clusters. Given an information system ℑ, we now want to describe a feature's utility with respect to a given object's classification. Relationally speaking, we need to recursively apply those features fᵢ ∈ F that generate a partition on U similar to U/Rₜ; in this way we learn a compressing classifier. It appears to be a good idea to start with the feature that is most “similar” to t: a feature quite similar to the target function can be assumed to carry relevant information with respect to t. And this leads us to the information-theoretic notion of entropy.
Information gain driven classifier learning
While clustering tries to find hierarchies of groups of objects, so-called decision trees represent a hierarchy of feature-induced partitions. Unlike the (unsupervised) similarity measures used in clustering, decision tree learning relies on a target-specific information measure called entropy.
People often try to explain Shannon and Weaver's (1949) information-theoretic measure of entropy by the laws of entropy in thermodynamics.
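The two quantities at work here can be illustrated with a minimal Python sketch using the standard Shannon entropy and the information gain of a feature split (the toy data below are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a class-label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(universe, feature, target):
    """Target entropy minus the expected entropy after splitting on one feature."""
    n = len(universe)
    blocks = {}
    for obj in universe:
        blocks.setdefault(feature[obj], []).append(target[obj])
    remainder = sum(len(b) / n * entropy(b) for b in blocks.values())
    return entropy([target[o] for o in universe]) - remainder

objects = [1, 2, 3, 4]
windy = {1: True, 2: True, 3: False, 4: False}  # perfectly predicts the target
noise = {1: "a", 2: "b", 3: "a", 4: "b"}        # carries no target information
play = {1: "yes", 2: "yes", 3: "no", 4: "no"}

gain_windy = information_gain(objects, windy, play)  # 1.0 bit
gain_noise = information_gain(objects, noise, play)  # 0.0 bits
```

A feature whose induced partition closely matches the target partition yields a high gain, which is exactly the “similarity to t” criterion used to pick the root of a decision tree.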
If we assume knowledge to be what it takes to make rational decisions, then knowledge is not logic. It is not even logic when we assume it is representable in an information system.
Many decisions (and every answer to a question is a decision) are far from being discrete, deterministic, or deductively comprehensible.
Yet, the simplest question we can ask is, “Is x equal to y?” And it takes knowledge in the form of the ability to discern different things from each other to decide whether one should answer “Yes” or “No”.
Over the last decades, machine learning has evolved from theories of reasoning in artificial intelligence into an essential component of software systems. Statistical methods outperform logic-based approaches in most application domains, and with increasing computational power it has become possible to generate and test classifiers. As a more sophisticated approach, ensemble learning implements divide-and-conquer strategies on the learning problem. With the further growth of data collections (e.g., data warehouses), the problem we face is not how to induce a classifier that supports our model assumptions on the data but rather how to understand what kind of information there actually is. In machine learning, this approach is known as knowledge discovery.
Knowledge representation
Because we are used to describing knowledge in the language of terminological logic, we quite often identify knowledge representation with logic models. In Chapter 3 we saw that knowledge can be represented in many, many different ways.
If we are given 5 pebbles, 3 marbles, 4 dice, and 2 keys, then we have 14 little objects of 4 different kinds. We also have 8 objects made of stone, 4 made of wood, and 2 made of metal. And we have 7 toy objects, 2 office tools, and 5 things we have collected during our last walk at the beach.
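The counting above can be replayed in a few lines (a Python sketch; the attribute triples are just one way to encode the three different partitions of the same fourteen objects):

```python
from collections import Counter

# Fourteen objects, each carrying three attributes; each attribute induces
# its own partition of the same set (kind, material, and provenance/use).
items = (
    [("pebble", "stone", "beach")] * 5
    + [("marble", "stone", "toy")] * 3
    + [("die", "wood", "toy")] * 4
    + [("key", "metal", "office")] * 2
)

by_kind = Counter(kind for kind, _, _ in items)          # 4 kinds
by_material = Counter(mat for _, mat, _ in items)        # stone 8, wood 4, metal 2
by_use = Counter(use for _, _, use in items)             # toy 7, office 2, beach 5
```

The same base set supports several incompatible groupings at once; which partition counts as “the” representation depends on the feature we choose to look at.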
In the previous chapters, we saw that relations can be used to represent knowledge about sets of things. We also discovered that learning means to find a suitable set of relations with which we can describe or define concepts (see Definition 2.37). Now we describe a first approach to efficiently discover relational concept descriptions. Our starting point is an information system with a feature-based representation of the objects in our domain.
Concepts as sets of objects
Our working hypothesis is that knowledge is the ability to discriminate things and learning is knowledge acquisition. Therefore,
Learning means to acquire the ability to discriminate different objects from each other.
There are, in general, two different methods to group similar objects together and distinguish them from other groups of entities:
Building sets or classes of objects that we assume to share certain properties by grouping them into the same cluster;
Inducing a concept that serves as a description of a representation class in terms of properties of objects.
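The contrast between the two methods can be sketched on a toy one-dimensional domain (a Python sketch; the data, the similarity threshold, and the interval-shaped concept are all illustrative assumptions, not methods from the text):

```python
# Four objects with a single numeric property (values are invented).
points = {"a": 1.0, "b": 1.2, "c": 5.0, "d": 5.3}

# (1) Clustering: build classes bottom-up from a similarity threshold.
def cluster(pts, eps=1.0):
    groups = []
    for name, x in sorted(pts.items(), key=lambda kv: kv[1]):
        if groups and x - groups[-1][-1][1] <= eps:
            groups[-1].append((name, x))  # close enough: same cluster
        else:
            groups.append([(name, x)])    # too far: start a new cluster
    return [{n for n, _ in g} for g in groups]

# (2) Concept induction: describe a *given* class by a property of its members.
def induce_interval(pts, positives):
    values = [pts[n] for n in positives]
    return min(values), max(values)       # simplest concept: an interval

clusters = cluster(points)                       # [{'a','b'}, {'c','d'}]
concept = induce_interval(points, {"c", "d"})    # (5.0, 5.3)
```

The first route produces groups without any class labels; the second starts from a labelled class and returns an explicit description that can be applied to unseen objects.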
This book is about knowledge discovery. There are many excellent books on machine learning and data mining. And there are many excellent books covering many particular aspects of these areas. Even though all the knowledge we are concerned with in computer science is relational, relational or logic machine learning or knowledge discovery is not that common. Accordingly, there are fewer textbooks on this issue.
This book strongly emphasises knowledge: what it is, how it can be represented, and, finally, how new knowledge can be discovered from what is already known plus a few new observations. Our interpretation of knowledge is based on the notion of “discernability”; all the methods discussed in this book are presented within the same paradigm: we take “learning” to mean acquiring the ability to discriminate between different things. Because things are different if they are not equal, we use a “natural” equivalence to group similar things together and distinguish them from differing things. Equivalence means to have something in common. According to the portion of commonness between things there are certain degrees of equality: things can be exactly the same, they can be the same in most cases or aspects, they can be roughly the same, not really the same, and they can be entirely different. Sometimes, they are even incomparable.
We initiate the theory and applications of biautomata. A biautomaton can read a word alternately from the left and from the right. We assign to each regular language L its canonical biautomaton. This structure plays, among all biautomata recognizing the language L, the same role as the minimal deterministic automaton has among all deterministic automata recognizing the language L. We expect that from the graph structure of this automaton one could decide the membership of a given language for certain significant classes of languages. We present the first two results of this kind: namely, a language L is piecewise testable if and only if the canonical biautomaton of L is acyclic. From this result Simon’s famous characterization of piecewise testable languages easily follows. The second class of languages characterizable by the graph structure of their biautomata are prefix-suffix testable languages.
The goal of computer vision is to extract useful information from images. This has proved a surprisingly challenging task; it has occupied thousands of intelligent and creative minds over the last four decades, and despite this we are still far from being able to build a general-purpose “seeing machine.”
Part of the problem is the complexity of visual data. Consider the image in Figure 1.1. There are hundreds of objects in the scene. Almost none of these are presented in a “typical” pose. Almost all of them are partially occluded. For a computer vision algorithm, it is not even easy to establish where one object ends and another begins. For example, there is almost no change in the image intensity at the boundary between the sky and the white building in the background. However, there is a pronounced change in intensity on the back window of the SUV in the foreground, although there is no object boundary or change in material here.
We might have grown despondent about our chances of developing useful computer vision algorithms if it were not for one thing: we have concrete proof that vision is possible because our own visual systems make light work of complex images such as Figure 1.1. If I ask you to count the trees in this image or to draw a sketch of the street layout, you can do this easily.
In the last chapter we showed that classification with generative models is based on building simple probability models. In particular, we build class-conditional density functions Pr(x|w = k) over the observed data x for each value of the world state w.
In Chapter 3 we introduced several probability distributions that could be used for this purpose, but these were quite limited in scope. For example, it is not realistic to assume that all of the complexities of visual data are well described by the normal distribution. In this chapter, we show how to construct complex probability density functions from elementary ones using the idea of a hidden variable.
As a representative problem we consider face detection; we observe a 60 × 60 RGB image patch, and we would like to decide whether it contains a face or not. To this end, we concatenate the RGB values to form the 10800 × 1 vector x. Our goal is to take the vector x and return a label w ∈ {0, 1} indicating whether it contains background (w = 0) or a face (w = 1). In a real face detection system, we would repeat this procedure for every possible subwindow of an image (Figure 7.1).
We will start with a basic generative approach in which we describe the likelihood of the data in the presence/absence of a face with a normal distribution. We will then extend this model to address its weaknesses.
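A minimal sketch of such a generative classifier (in Python with NumPy, using synthetic stand-in data and a diagonal covariance for tractability; a real system would be trained on labelled image patches, and the distribution parameters here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 60 x 60 RGB training patches, flattened to 10800-vectors.
D = 10800
faces = rng.normal(0.6, 0.1, size=(50, D))       # training patches with w = 1
background = rng.normal(0.3, 0.2, size=(50, D))  # training patches with w = 0

def fit(x):
    """Fit a class-conditional normal with a diagonal covariance."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def log_likelihood(x, mu, var):
    """Log of a diagonal multivariate normal density at x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

mu1, var1 = fit(faces)
mu0, var0 = fit(background)

def classify(patch):
    """Return w = 1 (face) if Pr(x|w=1) > Pr(x|w=0), assuming equal priors."""
    x = patch.reshape(-1)  # concatenate the RGB values into a length-D vector
    return int(log_likelihood(x, mu1, var1) > log_likelihood(x, mu0, var0))
```

The weakness the chapter goes on to address is visible here already: a single normal per class cannot capture the multimodal variation of real faces, which motivates the hidden-variable mixtures introduced next.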
I have been programming in Fortran for more than 25 years, first in FORTRAN IV and somewhat later in FORTRAN 77. In the last decade of the 20th century, I attended, together with a number of colleagues, a course on Fortran 90, given by the late Jan van Oosterwijk at the Technical University of Delft. It was also around this time that I came to know the comp.lang.fortran newsgroup, and I have learned a lot by participating in that friendly community.
In a way, I am a typical Fortran programmer. My background is physics and I learned the task of programming partly during my study, but mostly on the job. In other ways, I am not because I took a fancy to the more esoteric possibilities of programming in general and sought means to apply them in Fortran. I also began writing articles for the ACM Fortran Forum. These articles are the groundwork for this book.
This book will not teach you how to program in Fortran. There are plenty of books dedicated to that ([22], [65]). Instead, the purpose of this book is to show how modern Fortran can be used for modern programming problems, such as how techniques made popular in the world of object-oriented languages like C++ and Java fit neatly into Fortran as it exists today. It even shows some techniques for solving certain programming problems that are not easily achieved in these languages.
This chapter provides a brief overview of modern preprocessing methods for computer vision. In Section 13.1 we introduce methods in which we replace each pixel in the image with a new value. Section 13.2 considers the problem of finding and characterizing edges, corners and interest points in images. In Section 13.3 we discuss visual descriptors; these are low-dimensional vectors that attempt to characterize the interesting aspects of an image region in a compact way. Finally, in Section 13.4 we discuss methods for dimensionality reduction.
Per-pixel transformations
We start our discussion of preprocessing with per-pixel operations: these methods return a single value corresponding to each pixel of the input image. We denote the original 2D array of pixel data as P, where pij is the element at the ith of I rows and the jth of J columns. The element pij is a scalar representing the grayscale intensity. Per-pixel operations return a new 2D array X of the same size as P containing elements xij.
Whitening
The goal of whitening (Figure 13.1) is to provide invariance to fluctuations in the mean intensity level and contrast of the image. Such variation may arise because of a change in ambient lighting intensity, the object reflectance, or the camera gain. To compensate for these factors, the image is transformed so that the resulting pixel values have zero mean and unit variance.
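Whitening amounts to a shift and a rescale per image (a Python/NumPy sketch; the example image values are invented):

```python
import numpy as np

def whiten(P):
    """Per-pixel whitening: zero mean, unit variance across the image."""
    return (P - P.mean()) / P.std()

# A synthetic grayscale image with arbitrary brightness (offset) and contrast (gain).
image = np.array([[10.0, 20.0],
                  [30.0, 40.0]])
X = whiten(image)
# X.mean() is ~0 and X.std() is ~1, whatever the original gain and offset were.
```

Because any affine change of intensity (ambient lighting, reflectance, camera gain) is removed by this transform, two images of the same scene under different exposure settings whiten to the same result.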
In Chapter 11, we discussed models that were structured as chains or trees. In this chapter, we consider models that associate a label with each pixel of an image. Since the unknown quantities are defined on the pixel lattice, models defined on a grid structure are appropriate. In particular, we will consider graphical models in which each label has a direct probabilistic connection to each of its four neighbors. Critically, this means that there are loops in the underlying graphical model and so the dynamic programming and belief propagation approaches of the previous chapter are no longer applicable.
These grid models are predicated on the idea that the pixel provides only very ambiguous information about the associated label. However, certain spatial configurations of labels are known to be more common than others, and we aim to exploit this knowledge to resolve the ambiguity. In this chapter, we describe the relative preference for different configurations of labels with a pairwise Markov random field or MRF. As we shall see, maximum a posteriori inference for pairwise MRFs is tractable in some circumstances using a family of approaches known collectively as graph cuts.
To motivate the grid models, we introduce a representative application. In image denoising we observe a corrupted image in which the intensities at a certain proportion of pixels have been randomly changed to another value according to a uniform distribution (Figure 12.1).
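The trade-off a pairwise MRF encodes can be sketched by writing out the energy of a binary label grid (a Python/NumPy sketch on an invented 3 × 3 example; a real denoising system minimises this energy with graph cuts rather than evaluating candidate labelings by hand):

```python
import numpy as np

def mrf_energy(labels, observed, theta=1.0):
    """Energy of a binary label grid: unary terms penalise disagreement with the
    observed (possibly corrupted) pixels; pairwise Potts terms penalise label
    changes between 4-connected neighbours. Lower energy = more probable."""
    unary = np.sum(labels != observed)
    pairwise = (np.sum(labels[1:, :] != labels[:-1, :])
                + np.sum(labels[:, 1:] != labels[:, :-1]))
    return unary + theta * pairwise

observed = np.array([[0, 0, 1],
                     [0, 0, 0],
                     [0, 0, 0]])  # one pixel corrupted by the noise process

noisy_fit = mrf_energy(observed, observed)                  # fits the data exactly
smooth_fit = mrf_energy(np.zeros_like(observed), observed)  # flips the odd pixel
```

Copying the observation costs nothing in the unary terms but pays twice in the pairwise terms, whereas the all-zero labeling pays a single unary term; the MRF therefore prefers to overrule the isolated corrupted pixel, which is precisely the denoising behaviour we want.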
Interaction with the user via graphical presentation or graphical interfaces is an extremely important aspect of computing. It is also an area that depends heavily on the available hardware and the operating system. While much effort has been put into the various operating environments to hide the hardware aspect, programming a graphical user-interface (GUI) on MS Windows is completely different from doing so on Linux or Mac OS X.
Hardly any programming language defines how you can display results graphically or how to build a graphical user-interface. The common approach is to use a library that hides – if possible or desirable – the specifics of the operating system, so that a portable program results. Some libraries, however, have been designed to take advantage of exactly one operating system, so that the program fits seamlessly in that environment, at the cost of not being portable anymore. At the same time, you should not underestimate the effort and skills required for designing and building a useful and usable GUI [3].
For a Fortran programmer, the situation is a bit complicated: most GUI libraries are written with C, C++, and similar languages in mind. Furthermore, a GUI for a long-running computation should have different properties than one for filling in a database form. This chapter examines a variety of solutions.
Plotting the Results
The first type of graphical interaction to examine is the presentation of the results of a computation. The responses from the user are simple: leaf through the collection of plots – in many cases sequentially – and maybe change a display parameter here and there.