In this section we are going to try to quantify the notion of information. Before we do this, we should be aware that ‘information’ has a special meaning in probability theory, which is not the same as its use in ordinary language. For example, consider the following two statements:
(i) I will eat some food tomorrow.
(ii) The prime minister and leader of the opposition will dance naked in the street tomorrow.
If I ask which of these two statements conveys more information, you will (I hope!) say that it is (ii). Your argument might be that (i) is practically a statement of the obvious (unless I am prone to fasting), whereas (ii) is extremely unlikely. To summarise:
(i) has very high probability and so conveys little information,
(ii) has very low probability and so conveys much information.

Clearly, then, quantity of information is closely related to the element of surprise.
Consider now the following ‘statement’:
(iii) XQWQYK VZXPU VVBGXWQ.
Our immediate reaction to (iii) is that it is meaningless and hence conveys no information. However, from the point of view of English language structure, we should be aware that (iii) has low probability (for example, Q is a rarely occurring letter and is generally followed by U, and (iii) contains scarcely any vowels) and so has a high surprise element.
The above discussion should indicate that the word ‘information’, as it occurs in everyday life, consists of two aspects, ‘surprise’ and ‘meaning’.
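The link between probability and surprise can be made quantitative with the standard 'surprisal' measure of information theory, I(p) = -log₂ p, which assigns little information to near-certain events and a great deal to very unlikely ones. Here is a minimal sketch in Python, with purely illustrative probabilities for statements (i) and (ii):

```python
import math

# Purely illustrative probabilities for statements (i) and (ii).
p_food = 0.999        # (i): eating some food tomorrow is almost certain
p_dance = 1e-9        # (ii): the naked dance is extremely unlikely

def surprisal(p):
    """The standard measure of the 'surprise' of an event of probability p, in bits."""
    return -math.log2(p)

print(f"(i)  surprisal = {surprisal(p_food):.5f} bits")   # about 0.0014 bits
print(f"(ii) surprisal = {surprisal(p_dance):.1f} bits")  # about 29.9 bits
```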
When I wrote the first edition of this book in the early 1990s, it was designed as an undergraduate text which gave a unified introduction to the mathematics of ‘chance’ and ‘information’. I am delighted that many courses (mainly in Australasia and the USA) have adopted the book as a core text, and I have been pleased to receive so much positive feedback from both students and instructors since the book first appeared. For this second edition I have resisted the temptation to expand the existing text, and most of the changes to the first nine chapters are corrections of errors and typos. The main new ingredient is the addition of a further chapter (Chapter 10), which brings a third important concept, that of ‘time’, into play via an introduction to Markov chains and their entropy. The mathematical device for combining time and chance together is called a ‘stochastic process’, and such processes are playing an increasingly important role in mathematical modelling in such diverse (and important) areas as mathematical finance and climate science. Markov chains form a highly accessible subclass of stochastic (random) processes, and nowadays they often appear in first-year courses (at least in British universities). From a pedagogic perspective, the early study of Markov chains also gives students additional insight into the importance of matrices within an applied context, and this theme is stressed heavily in the approach presented here, which is based on courses taught at both Nottingham Trent and Sheffield Universities.
Our experience of the world leads us to conclude that many events are unpredictable and sometimes quite unexpected. These may range from the outcome of seemingly simple games such as tossing a coin and trying to guess whether it will be heads or tails to the sudden collapse of governments or the dramatic fall in prices of shares on the stock market. When we try to interpret such events, it is likely that we will take one of two approaches – we will either shrug our shoulders and say it was due to ‘chance’ or we will argue that we might have been better able to predict, for example, the government's collapse if only we'd had more ‘information’ about the machinations of certain ministers. One of the main aims of this book is to demonstrate that these two concepts of ‘chance’ and ‘information’ are more closely related than you might think. Indeed, when faced with uncertainty our natural tendency is to search for information that will help us to reduce the uncertainty in our own minds; for example, think of the gambler who, about to bet on the outcome of a race, combs the sporting papers beforehand for hints about the form of the jockeys and the horses.
Before we proceed further, we should clarify our understanding of the concept of chance. It may be argued that the tossing of fair, unbiased coins is an ‘intrinsically random’ procedure in that everyone in the world is equally ignorant of whether the result will be heads or tails.
This chapter will be devoted to problems involving counting. Of course, everybody knows how to count, but sometimes this can be quite a tricky business. Consider, for example, the following questions:
(i) In how many different ways can seven identical objects be arranged in a row?
(ii) In how many different ways can a group of three ball bearings be selected from a bag containing eight?
Problems of this type are called combinatorial. If you try to solve them directly by counting all the possible alternatives, you will find this to be a laborious and time-consuming procedure. Fortunately, a number of clever tricks are available which save you from having to do this. The branch of mathematics which develops these is called combinatorics and the purpose of the present chapter is to give a brief introduction to this topic.
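As a quick check on questions (i) and (ii): assuming the seven objects in (i) are distinguishable, the answers are 7! = 5040 and C(8, 3) = 56 (if the objects in (i) really are identical, there is of course only one arrangement). A short Python sketch:

```python
from math import factorial, comb

# (i) Arrangements of seven objects in a row.
#     If the objects are distinguishable there are 7! arrangements;
#     if they really are identical, every arrangement looks the same.
arrangements = factorial(7)   # 5040

# (ii) Ways of choosing 3 ball bearings from a bag of 8 (order irrelevant).
selections = comb(8, 3)       # 56

print(arrangements, selections)
```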
A fundamental concept both in this chapter and the subsequent ones on probability theory proper will be that of an ‘experience’ which can result in several possible ‘outcomes’. Examples of such experiences are:
(a) throwing a die where the possible outcomes are the six faces which can appear,
(b) queueing at a bus-stop where the outcomes consist of the nine different buses, serving different routes, which stop there.
If A and B are two separate experiences, we write A ∘ B to denote the combined experience of A followed by B.
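The key counting fact about combined experiences (the multiplication principle) is that if A has m possible outcomes and B has n, then A ∘ B has m × n. A small sketch using the die and bus-stop examples above, with the bus route labels invented for illustration:

```python
from itertools import product

die_faces = range(1, 7)                        # experience A: the six faces of a die
buses = [f"route {k}" for k in range(1, 10)]   # experience B: nine buses (labels invented)

# The combined experience A ∘ B: a die face followed by a bus.
combined = list(product(die_faces, buses))

print(len(combined))   # 54 = 6 * 9, the multiplication principle
```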
This is designed to be an introductory text for a modern course on the fundamentals of probability and information. It has been written to address the needs of undergraduate mathematics students in the ‘new’ universities and much of it is based on courses developed for the Mathematical Methods for Information Technology degree at the Nottingham Trent University. Bearing in mind that such students do not often have a firm background in traditional mathematics, I have attempted to keep the development of material gently paced and user friendly – at least in the first few chapters. I hope that such an approach will also be of value to mathematics students in ‘old’ universities, as well as students on courses other than honours mathematics who need to understand probabilistic ideas.
I have tried to address in this volume a number of problems which I perceive in the traditional teaching of these subjects. Many students first meet probability theory as part of an introductory course in statistics. As such, they often encounter the subject as a ragbag of different techniques without the same systematic development that they might gain in a course in, say, group theory. Later on, they might have the opportunity to remedy this by taking a final-year course in rigorous measure theoretic probability, but this, if it exists at all, is likely to be an option only. Consequently, many students can graduate with degrees in mathematical sciences, but without a coherent understanding of the mathematics of probability.
Since Chapter 5, we have been concerned only with discrete random variables and their applications, that is, random variables taking values in sets that are either finite or countably infinite. In this chapter, we will extend the concept of random variables to the ‘continuous’ case wherein values are taken in ℝ or some interval of ℝ.
Historically, much of the motivation for the development of ideas about such random variables came from the theory of errors in making measurements. For example, suppose that you want to measure your height. One approach would be to take a long ruler or tape measure and make the measurement directly. Suppose that we get a reading of 5.7 feet. If we are honest, we might argue that this result is unlikely to be very precise – tape measures are notoriously inaccurate and it is very difficult to stand completely still when you are being measured.
To allow for the uncertainty as to our true height we introduce a random variable X to represent our height, and indicate our hesitancy in trusting the tape measure by assigning a number close to 1 to the probability P(X ∈ (5.6, 5.8)), that is we say that our height is between 5.6 feet and 5.8 feet with very high probability. Of course, by using better measuring instruments, we may be able to assign high probabilities for X lying in smaller and smaller intervals, for example (5.645, 5.665); however, since the precise location of any real number requires us to know an infinite decimal expansion, it seems that we cannot assign probabilities of the form P(X = 5.67).
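To make this concrete, here is a minimal sketch assuming, purely for illustration, that the measured height X is modelled by a normal distribution centred on the reading of 5.7 feet; the point is that intervals receive high probability while any single exact value receives probability zero:

```python
from scipy.stats import norm

# Illustrative model only: X ~ Normal(mean = 5.7, standard deviation = 0.03), in feet.
X = norm(loc=5.7, scale=0.03)

# An interval receives positive probability ...
p_interval = X.cdf(5.8) - X.cdf(5.6)
print(f"P(5.6 < X < 5.8) = {p_interval:.4f}")   # roughly 0.999

# ... but any single exact value receives probability zero:
# P(X = 5.67) = 0 for a continuous random variable.
```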
So far in this book we have tended to deal with one (or at most two) random variables at a time. In many concrete situations, we want to study the interaction of ‘chance’ with ‘time’, e.g. the behaviour of shares in a company on the stock market, the spread of an epidemic or the movement of a pollen grain in water (Brownian motion). To model this, we need a family of random variables (all defined on the same probability space), (X(t), t ≥ 0), where X(t) represents, for example, the value of the share at time t.
(X(t), t ≥ 0) is called a (continuous time) stochastic process or random process. The word ‘stochastic’ comes from the Greek for ‘pertaining to chance’. Quite often, we will just use the word ‘process’ for short.
For many studies, both theoretical and practical, we discretise time and replace the continuous interval [0, ∞) with the discrete set ℤ+ = ℕ ∪ {0} or sometimes ℕ. We then have a (discrete time) stochastic process (Xn, n ∈ ℤ+). We will focus entirely on the discrete time case in this chapter.
Note. Be aware that X(t) and Xt (and similarly X(n) and Xn) are used interchangeably in the literature on this subject.
There is no general theory of stochastic processes worth developing at this level. It is usual to focus on certain classes of process which have interesting properties for either theoretical development, practical application, or both of these. We will study Markov chains in this chapter.
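As a foretaste, here is a minimal sketch of simulating a discrete-time Markov chain; the two-state chain and its transition matrix are invented purely for illustration and are not taken from the text:

```python
import numpy as np

# An invented two-state chain (states 0 and 1); P[i][j] is the probability
# of moving from state i to state j in one time step.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

rng = np.random.default_rng(seed=0)

def simulate(n_steps, start=0):
    """Generate X_0, X_1, ..., X_n, where each step depends only on the current state."""
    x = start
    path = [x]
    for _ in range(n_steps):
        x = int(rng.choice(2, p=P[x]))
        path.append(x)
    return path

print(simulate(10))
```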
In this chapter we will be trying to model the transmission of information across channels. We will begin with a very simple model, as is shown in Fig. 7.1, and then build further features into it as the chapter progresses.
The model consists of three components: a source of information, a channel across which the information is transmitted, and a receiver to pick up the information at the other end. For example, the source might be a radio or TV transmitter, the receiver would then be a radio or TV, and the channel the atmosphere through which the broadcast waves travel. Alternatively, the source might be a computer memory, the receiver a computer terminal and the channel the network of wires and processors which connects them. In all cases that we consider, the channel is subject to ‘noise’, that is, uncontrollable random effects which have the undesirable effect of distorting the message, leading to a potential loss of information at the receiver.
The source is modelled by a random variable S whose values {a1, a2, …, an} are called the source alphabet. The law of S is {p1, p2, …, pn}, where pi = P(S = ai). The fact that S is random allows us to include within our model the sender's uncertainty about which message they are going to send. In this context, a message is a succession of symbols from the source alphabet sent out one after the other.
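As a rough illustration (not the model of Fig. 7.1 itself), here is a minimal sketch of a source emitting symbols according to its law, followed by a noisy channel that corrupts each symbol with some small probability; the alphabet, law and error probability are all invented for this example:

```python
import random

# Invented source: alphabet {a1, a2, a3} with law {0.5, 0.3, 0.2}.
alphabet = ["a1", "a2", "a3"]
law = [0.5, 0.3, 0.2]

ERROR_PROB = 0.1   # invented: chance that the channel corrupts a symbol

def send_message(length):
    """A message: a succession of symbols drawn according to the law of S."""
    return random.choices(alphabet, weights=law, k=length)

def noisy_channel(message):
    """Replace each transmitted symbol, with probability ERROR_PROB, by some other symbol."""
    received = []
    for symbol in message:
        if random.random() < ERROR_PROB:
            received.append(random.choice([s for s in alphabet if s != symbol]))
        else:
            received.append(symbol)
    return received

message = send_message(10)
print("sent:    ", message)
print("received:", noisy_channel(message))
```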