I believe, for instance, that it would be very difficult to persuade an intelligent physicist that current statistical practice was sensible, but that there would be much less difficulty with an approach via likelihood and Bayes' theorem.
G. E. P. Box (1962)
As we have noted several times, the idea that probabilities are physically real things, based ultimately on observed frequencies of random variables, underlies most recent expositions of probability theory, which would seem to make it a branch of experimental science. At the end of Chapter 8 we saw some of the difficulties that this view leads us to; in some real physical experiments the distinction between random and nonrandom quantities is so obscure and artificial that you have to resort to black magic in order to force this distinction into the problem at all. But that discussion did not reach into the serious physics of the situation. In this chapter, we take time off for an interlude of physical considerations that show the fundamental difficulty with the notion of ‘random’ experiments.
An interesting correlation
There have always been dissenters from the ‘frequentist’ view who have maintained, with Laplace, that probability theory is properly regarded as the ‘calculus of inductive reasoning’, and is not fundamentally related to random experiments at all. A major purpose of the present work is to demonstrate that probability theory can deal, consistently and usefully, with far more than frequencies in random experiments.
The development of our theory beyond this point, as a practical statistical theory, involves … all the complexities of the use, either of Bayes' law on the one hand, or of those terminological tricks in the theory of likelihood on the other, which seem to avoid the necessity for the use of Bayes' law, but which in reality transfer the responsibility for its use to the working statistician, or the person who ultimately employs his results.
Norbert Wiener (1948)
To the best of our knowledge, Norbert Wiener never actually applied Bayes' theorem in a published work; yet he perceived the logical necessity of its use as soon as one builds beyond the sampling distributions involved in his own statistical work. In the present chapter we examine some of the consequences of failing to use Bayesian methods in some very simple problems, where the paradoxes of Chapter 15 never arise.
In Chapter 16 we noted that the orthodox objections to Bayesian methods were always philosophical or ideological in nature, never examining the actual results that they give, and we expressed astonishment that mathematically competent persons would use such arguments. In order to give a fair comparison, we need to adopt the opposite tactic here, and concentrate on the demonstrable facts that orthodoxians never mention. Since Bayesian methods have been so egregiously misrepresented in the orthodox literature throughout our lifetimes, we must lean over backwards to avoid misrepresenting orthodox methods now; whenever an orthodox method does yield a satisfactory result in some problem, we shall acknowledge that fact, and we shall not deplore its use merely on ideological grounds.
At this point, we return to the job of designing the robot. We have part of its brain designed, and we have seen how it would reason in a few simple problems of hypothesis testing and estimation. In every problem it has solved thus far, the results have either amounted to the same thing as, or were usually demonstrably superior to, those offered in the ‘orthodox’ statistical literature. But it is still not a very versatile reasoning machine, because it has only one means by which it can translate raw information into numerical values of probabilities, the principle of indifference (2.95). Consistency requires it to recognize the relevance of prior information, and so in almost every problem it is faced at the onset with the problem of assigning initial probabilities, whether they are called technically prior probabilities or sampling probabilities. It can use indifference for this if it can break the situation up into mutually exclusive, exhaustive possibilities in such a way that no one of them is preferred to any other by the evidence. But often there will be prior information that does not change the set of possibilities but does give a reason for preferring one possibility to another. What do we do in this case?
Orthodoxy evades this problem by simply ignoring prior information for fixed parameters, and maintaining the fiction that sampling probabilities are known frequencies.
My own impression … is that the mathematical results have outrun their interpretation and that some simple explanation of the force and meaning of the celebrated integral … will one day be found … which will at once render useless all the works hitherto written.
Augustus de Morgan (1838)
Here, de Morgan was expressing his bewilderment at the ‘curiously ubiquitous’ success of methods of inference based on the Gaussian, or normal, ‘error law’ (sampling distribution), even in cases where the law is not at all plausible as a statement of the actual frequencies of the errors. But the explanation was not forthcoming as quickly as he expected.
In the middle 1950s the writer heard an after-dinner speech by Professor Willy Feller, in which he roundly denounced the practice of using Gaussian probability distributions for errors, on the grounds that the frequency distributions of real errors are almost never Gaussian. Yet in spite of Feller's disapproval, we continued to use them, and their ubiquitous success in parameter estimation continued. So, 145 years after de Morgan's remark, the situation was still unchanged, and the same surprise was expressed by George Barnard (1983): ‘Why have we for so long managed with normality assumptions?’
Today we believe that we can, at last, explain (1) the inevitably ubiquitous use, and (2) the ubiquitous success, of the Gaussian error law.
The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.
James Clerk Maxwell (1850)
Suppose some dark night a policeman walks down a street, apparently deserted. Suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion? Let us first take a leisurely look at the general nature of such problems.
Deductive and plausible reasoning
A moment's thought makes it clear that our policeman's conclusion was not a logical deduction from the evidence; for there may have been a perfectly innocent explanation for everything. It might be, for example, that this gentleman was the owner of the jewelry store and he was coming home from a masquerade party, and didn't have the key with him. However, just as he walked by his store, a passing truck threw a stone through the window, and he was only protecting his own property.
We collect here a brief account of the various mathematical conventions used throughout this work, and discuss some basic mathematical issues that arise in probability theory. Careless notation has led to so many erroneous results in the recent literature that we need to find rules of notation and terminology that make it as difficult as possible to commit such errors.
A mathematical notation, like a language, is not an end in itself but only a communication device. Its purpose is best served if the notation, like the language, is allowed to evolve with use. This evolution usually takes the form of abbreviations for whatever expressions recur often, and reducing the number of symbols when their meaning can be read from the context.
But a living, changing language still needs a kind of safe harbor in the form of a fixed set of rules of grammar and orthography, hidden away in a dictionary for use when ambiguities threaten. Likewise, probability theory needs a fixed set of normative rules on which we can fall back in case of doubt. We state here our formal rules of notation and logical hierarchy; all chapters from Chapter 3 on start with these standard forms, and evolve from them. A notation which is so convenient that it is almost a necessity in one chapter might be only confusing in the next; so each separate topic must be allowed its own independent evolution from the standard beginning.
Probably everybody who has been involved in quantitative measurements has found himself in the following situation. You are trying to measure some quantity θ (which might be, for example, the right ascension of Sirius, the mass of a π-meson, the velocity of seismic waves at a depth of 100 km, the melting point of a new organic compound, the elasticity of consumer demand for apples, etc.). But the apparatus or the data taking procedure is always imperfect and so, having made n independent measurements of θ, you have n different results (x1, …, xn). How are you to report what you now know about θ? More specifically, what ‘best’ estimate should you announce, and what accuracy are you entitled to claim?
If these n data values were closely clustered together, making a reasonably smooth, single-peaked histogram, you would accept the solutions given in the previous chapters, and might feel that the problem of drawing conclusions from good data is not very difficult, even without any probability theory. But your data are not nicely clustered: one value, xj, lies far away from the nice cluster made by the other (n − 1) values. How are you to deal with this outlier? What effect does it have on the conclusions that you are entitled to draw about θ?
We have seen in Chapters 4 and 5 how the appearance of astonishing, unexpected data may cause the resurrection of dead hypotheses; it appears that something like that may be at work here.
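To make the outlier question concrete, here is a minimal sketch of how much a single distant value can move a naive estimate. The data values are invented for illustration, and the median is used only as a familiar point of comparison; it is not presented as the treatment this chapter develops.

```python
from statistics import mean, median

def summarize(data):
    """Return (sample mean, sample median) for a list of measurements."""
    return mean(data), median(data)

# Hypothetical measurements of some quantity theta: five clustered values.
cluster = [9.8, 10.1, 10.0, 9.9, 10.2]
# The same data with one value lying far from the cluster.
with_outlier = cluster + [25.0]

mean_clean, median_clean = summarize(cluster)
mean_out, median_out = summarize(with_outlier)

print(f"clustered data : mean={mean_clean:.3f}, median={median_clean:.3f}")
print(f"with outlier   : mean={mean_out:.3f}, median={median_out:.3f}")
```

In this made-up data set the one outlier pulls the mean from about 10.0 to 12.5, while the median moves only to 10.05. That sensitivity is the reason the question of how to treat xj cannot be waved away.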
With all this confounded trafficking in hypotheses about invisible connections with all manner of inconceivable properties, which have checked progress for so many years, I believe it to be most important to open people's eyes to the number of superfluous hypotheses they are making, and would rather exaggerate the opposite view, if need be, than proceed along these false lines.
H. von Helmholtz (1868)
This chapter and Chapter 13 are concerned with the history of the subject rather than its present status. There is a complex and fascinating history before 1900, recounted by Stigler (1986c), but we are concerned now with more recent developments. In the period from about 1900 to 1970, one school of thought dominated the field so completely that it has come to be called ‘orthodox statistics’. It is necessary for us to understand it, because it is what most working statisticians active today were taught, and its ideas are still being taught, and advocated vigorously, in many textbooks and universities.
In Chapter 17 we want to examine the ‘orthodox’ statistical practice thus developed and compare its technical performance with that of the ‘probability as logic’ approach expounded here. But first, to understand this weird course of events, we need to know something about the problems faced then, the sociology that evolved to deal with them, the roles and personalities of the principal figures, and the general attitude toward scientific inference that orthodoxy represents.
A distinction without a difference has been introduced by certain writers who distinguish ‘Point estimation’, meaning some process of arriving at an estimate without regard to its precision, from ‘Interval estimation’ in which the precision of the estimate is to some extent taken into account.
R. A. Fisher (1956)
Probability theory as logic agrees with Fisher in spirit; that is, it gives us automatically both point and interval estimates from a single calculation. The distinction commonly made between hypothesis testing and parameter estimation is considerably greater than that which concerned Fisher; yet it too is, from our point of view, not a real difference. When we have only a small number of discrete hypotheses {H1, …, Hn} to consider, we usually want to pick out a specific one of them as the most likely in that set, in the light of the prior information and data. The cases n = 2 and n = 3 were examined in some detail in Chapter 4, and larger n is in principle a straightforward and rather obvious generalization.
When the hypotheses become very numerous, however, a different approach seems called for. A set of discrete hypotheses can always be classified by assigning one or more numerical indices which identify them, as in Ht (1 ≤ t ≤ n), and if the hypotheses are very numerous one can hardly avoid doing this.
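As a rough illustration of indexed hypotheses Ht (1 ≤ t ≤ n), the sketch below applies Bayes' theorem to a small discrete set and reads both a "point" choice and a set of high-probability hypotheses from the same posterior. The specific setup (five hypotheses Ht stating a success probability of t/6, a uniform prior, and data of 3 successes in 4 trials) and all function names are assumptions made for this example only.

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of mutually exclusive hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("data have zero probability under every hypothesis")
    return [j / evidence for j in joint]

def most_probable(post):
    """Index t (1-based) of the hypothesis with the largest posterior probability."""
    return max(range(len(post)), key=post.__getitem__) + 1

def credible_set(post, mass=0.9):
    """Smallest set of indices t whose posterior probabilities sum to at least `mass`."""
    order = sorted(range(len(post)), key=lambda i: post[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i + 1)
        total += post[i]
        if total >= mass:
            break
    return sorted(chosen)

n = 5
fractions = [t / (n + 1) for t in range(1, n + 1)]  # H_t: success probability t/6
prior = [1.0 / n] * n                               # assumed uniform prior
successes, trials = 3, 4                            # assumed data
likelihood = [f**successes * (1 - f)**(trials - successes) for f in fractions]

post = posterior(prior, likelihood)
print("posterior:", [round(p, 4) for p in post])
print("most probable t:", most_probable(post))
print("90% set of t:", credible_set(post, 0.9))
```

Both the single most probable hypothesis and the set carrying 90% of the probability come out of one calculation, which is the sense in which the point/interval distinction adds nothing new.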
The essence of the present theory is that no probability, direct, prior, or posterior, is simply a frequency.
H. Jeffreys (1939)
We have developed probability theory as a generalized logic of plausible inference which should apply, in principle, to any situation where we do not have enough information to permit deductive reasoning. We have seen it applied successfully in simple prototype examples of nearly all the current problems of inference, including sampling theory, hypothesis testing, and parameter estimation.
Most of probability theory, however, as treated in the past 100 years, has confined attention to a special case of this, in which one tries to predict the results of, or draw inferences from, some experiment that can be repeated indefinitely under what appear to be identical conditions; but which nevertheless persists in giving different results on different trials. Indeed, virtually all application-oriented expositions define probability as meaning ‘limiting frequency in independent repetitions of a random experiment’ rather than as an element of logic. The mathematically oriented often define it more abstractly, merely as an additive measure, without any specific connection to the real world. However, when they turn to applications, they too tend to think of probability in terms of frequency. It is important that we understand the exact relationship between these conventional treatments and the theory being developed here.
Some of these relationships have been seen already; in the preceding five chapters we have shown that probability theory as logic can be applied consistently in many problems of inference that do not fit into the frequentist preconceptions, and so would be considered beyond the scope of probability theory.
We now examine in detail two of the simplest applications of the general decision theory just formulated, and compare the first with the older Neyman–Pearson procedure. The problem of detection of signals in noise is really the same as Laplace's old problem of detecting the presence of unknown systematic influences in celestial mechanics, and Shewhart's (1931) more recent problem of detecting a systematic drift in machine characteristics, in industrial quality control. Statisticians would call the procedure a ‘significance test’. It is unfortunate that the basic identity of all these problems was not more widely recognized, because it forced workers in several different fields to rediscover the same things, with varying degrees of success, over and over again.
As is clear by now, all we really have to do to solve this problem is to take the principles of inference developed in Chapters 2 and 4, and supplement them with the loss function criterion for converting final probabilities into decisions (and, if needed, the maximum entropy principle for assigning priors). However, the literature of this field has been created largely from the standpoint of the original decision theory before this was realized. The existing literature therefore uses a different sort of vocabulary and set of concepts than we have been using up to now. Since it exists, we have no choice but to learn these terms and viewpoints if we want to read the literature of the field.
Entities are not to be multiplied without necessity.
William of Ockham, c. 1330
We have seen in some detail how to conduct inferences – test hypotheses, estimate parameters, predict future observations – within the context of a preassigned model, representing some working hypothesis about the phenomenon being observed. But a scientist must also be concerned with a bigger problem: how to decide between different models when both seem able to account for the facts. Indeed, the progress of science requires comparison of different conceivable models; a false premise built into a model that is never questioned cannot be removed by any amount of new data.
Stated very broadly, the problem is hardly new; some 650 years ago the Franciscan Monk William of Ockham perceived the logical error in the mind projection fallacy. This led him to teach that some religious issues might be settled by reason, but others only by faith. He removed the latter from his discourse, and concentrated on the areas where reason might be applied – just as Bayesians seek to do today when we discard orthodox mind projecting mythology (such as assertions of limiting frequencies in experiments that have never been performed), and concentrate on the things that are meaningful in the real world. His propositions ‘amenable only to faith’ correspond roughly to what we should call non-Aristotelian propositions. His famous epigram quoted above, generally called ‘Ockham's razor’, represents a good start on the principles of reasoning that he needed, and that we still need today.
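One way to see a comparison between two models that can both account for the data is to compute the ratio of their probabilities for the same observations. The sketch below is a hedged toy example, not the author's own: it assumes a coin-tossing setting, a "fair coin" model with no free parameter, and an "unknown bias" model with a uniform prior on the bias. The function names are hypothetical.

```python
from math import factorial

def evidence_fair(n):
    """Probability of a specific sequence of n tosses if the coin is fair."""
    return 0.5 ** n

def evidence_unknown_bias(k, n):
    """Probability of a specific sequence with k heads in n tosses, averaged
    over a uniform prior on the bias f: integral of f^k (1-f)^(n-k) df,
    which equals k! (n-k)! / (n+1)!."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

def bayes_factor(k, n):
    """Ratio of the two evidences; values above 1 favor the fair-coin model."""
    return evidence_fair(n) / evidence_unknown_bias(k, n)

balanced = bayes_factor(5, 10)   # 5 heads in 10 tosses
lopsided = bayes_factor(9, 10)   # 9 heads in 10 tosses

print(f"5 heads of 10: ratio = {balanced:.3f}")
print(f"9 heads of 10: ratio = {lopsided:.3f}")
```

With balanced data the simpler model is favored by a factor of about 2.7, even though the model with the extra parameter can fit the data equally well; with lopsided data the ratio drops to about 0.11 and the extra parameter earns its place.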
Ignorance is preferable to error and he is less remote from the truth who believes nothing than he who believes what is wrong.
Thomas Jefferson (1781)
The problem of translating prior information uniquely into a prior probability assignment represents the as yet unfinished half of probability theory, though the principle of maximum entropy in the preceding chapter provides one important tool. It is unfinished because it has been rejected for many decades by those who were unable to conceive of a probability distribution as representing information; but, just because of that long neglect, many current scientific, engineering, economic, and environmental problems are today calling out for new solutions to this problem, without which important new applications cannot proceed.
What are we trying to do?
It is curious that, even when different workers are in substantially complete agreement on what calculations should be done, they may have radically different views as to what we are actually doing and why we are doing it. For example, there is a large Bayesian community, whose members call themselves ‘subjective Bayesians’, who have settled into a position intermediate between ‘orthodox’ statistics and the theory expounded here. Their members have had, for the most part, standard orthodox training; but then they saw the absurdities in it and defected from the orthodox philosophy, while retaining the habits of orthodox terminology and notation.