We have seen, in Chapter 7, how the great mathematician Leonhard Euler was unable to solve the problem of estimating eight orbital parameters from 75 discrepant observations of the past positions of Jupiter and Saturn. Thinking in terms of deductive logic, he could not even conceive of the principles by which such a problem could be solved. But, 38 years later, Laplace, thinking in terms of probability theory as logic, was in possession of exactly the right principles to resolve the great inequality of Jupiter and Saturn. In this chapter we develop the solution as it would be done today by considering a simpler problem, estimating two parameters from three observations. But our general solution, in matrix notation, will include Laplace's automatically.
Reduction of equations of condition
Suppose we wish to determine the charge e and mass m of the electron. The Millikan oil-drop experiment measures e directly. The deflection of an electron beam in a known electromagnetic field measures the ratio e/m. The deflection of an electron toward a metal plate due to attraction of image charges measures e²/m.
From the results of any two of these experiments we can calculate values of e and m. But all the measurements are subject to error, and the values of e and m obtained from different experiments will not agree. Yet each of the measurements does contain some information relevant to our question that is not contained in the others.
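To see concretely how all three discrepant measurements can be used at once, note that taking logarithms turns each experiment into a linear ‘equation of condition’ in the two unknowns (log e, log m): three equations, two unknowns, solved by least squares. The sketch below uses made-up, deliberately discrepant numbers (not real data) purely to illustrate the reduction:

```python
# Equations of condition after taking logs: each row gives the
# coefficients of (log e, log m) for one experiment.
A = [(1.0, 0.0),    # measures e      ->   log e
     (1.0, -1.0),   # measures e/m    ->   log e - log m
     (2.0, -1.0)]   # measures e^2/m  -> 2 log e - log m

# Hypothetical, slightly discrepant "measured" logarithms
# (illustrative numbers only; consistent values would be 2, -3, -1).
b = [2.003, -2.996, -1.008]

# Normal equations (A^T A) x = A^T b for the two-unknown case,
# solved directly with Cramer's rule.
s11 = sum(a1 * a1 for a1, a2 in A)
s12 = sum(a1 * a2 for a1, a2 in A)
s22 = sum(a2 * a2 for a1, a2 in A)
t1 = sum(a1 * bi for (a1, a2), bi in zip(A, b))
t2 = sum(a2 * bi for (a1, a2), bi in zip(A, b))
det = s11 * s22 - s12 * s12
log_e = (t1 * s22 - t2 * s12) / det
log_m = (s11 * t2 - s12 * t1) / det
print(log_e, log_m)   # near the consistent values 2 and 5
```

The normal equations weight all three experiments together, so the resulting estimates of log e and log m draw on the information in every measurement instead of discarding one of them.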
In several previous discussions we inserted parenthetic remarks to the effect that ‘there is still an essential point missing here, which will be supplied when we take up decision theory’. However, in postponing the topic until now, we have not deprived the reader of a needed technical tool, because the solution of the decision problem was, from our viewpoint, so immediate and intuitive that we did not need to invoke any underlying formal theory.
Inference vs. decision
The situation of appraising inference vs. decision arose as soon as we started applying probability theory to our first problem. When we illustrated the use of Bayes' theorem by sequential testing in Chapter 4, we noted that there is nothing in probability theory per se which could tell us where to put the critical levels at which the robot changes its decision: whether to accept the batch, reject it, or make another test. The location of these critical levels obviously depends in some way on value judgments as well as on probabilities; what are the consequences of making wrong decisions, and what are the costs of making further tests?
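As a rough sketch of how such value judgments enter, suppose (hypothetically) that accepting a bad batch costs 100 units, rejecting a good one costs 20, and a further test costs 1; the critical levels then fall out of minimizing expected loss over the robot's posterior probability that the batch is bad. All the numbers, and the crude treatment of the ‘test again’ option, are illustrative assumptions, not part of the theory developed here:

```python
# Toy sketch (all numbers hypothetical): the critical posterior levels
# come from loss assignments, not from probability theory itself.
L_accept_bad = 100.0   # loss for accepting a bad batch
L_reject_good = 20.0   # loss for rejecting a good batch
cost_test = 1.0        # cost of making one more test

def decide(p_bad):
    # Expected loss of each terminal action, given posterior p_bad.
    accept = p_bad * L_accept_bad
    reject = (1.0 - p_bad) * L_reject_good
    # Crude stand-in for the value of testing again: pay the test cost
    # and (optimistically) assume the extra datum settles the matter.
    test = cost_test
    best = min(accept, reject, test)
    if best == accept:
        return "accept"
    if best == reject:
        return "reject"
    return "test again"

print(decide(0.005))  # accept: expected loss 0.5 beats testing at 1
print(decide(0.5))    # test again: 1 beats accept (50) and reject (10)
print(decide(0.999))  # reject: 0.02 beats accept (99.9) and testing (1)
```

Changing any of the loss assignments moves the critical levels, which is exactly the sense in which they depend on value judgments as well as on probabilities.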
I believe, for instance, that it would be very difficult to persuade an intelligent physicist that current statistical practice was sensible, but that there would be much less difficulty with an approach via likelihood and Bayes' theorem.
G. E. P. Box (1962)
As we have noted several times, the idea that probabilities are physically real things, based ultimately on observed frequencies of random variables, underlies most recent expositions of probability theory, which would seem to make it a branch of experimental science. At the end of Chapter 8 we saw some of the difficulties that this view leads us to; in some real physical experiments the distinction between random and nonrandom quantities is so obscure and artificial that you have to resort to black magic in order to force this distinction into the problem at all. But that discussion did not reach into the serious physics of the situation. In this chapter, we take time off for an interlude of physical considerations that show the fundamental difficulty with the notion of ‘random’ experiments.
An interesting correlation
There have always been dissenters from the ‘frequentist’ view who have maintained, with Laplace, that probability theory is properly regarded as the ‘calculus of inductive reasoning’, and is not fundamentally related to random experiments at all. A major purpose of the present work is to demonstrate that probability theory can deal, consistently and usefully, with far more than frequencies in random experiments.
The development of our theory beyond this point, as a practical statistical theory, involves … all the complexities of the use, either of Bayes' law on the one hand, or of those terminological tricks in the theory of likelihood on the other, which seem to avoid the necessity for the use of Bayes' law, but which in reality transfer the responsibility for its use to the working statistician, or the person who ultimately employs his results.
Norbert Wiener (1948)
To the best of our knowledge, Norbert Wiener never actually applied Bayes' theorem in a published work; yet he perceived the logical necessity of its use as soon as one builds beyond the sampling distributions involved in his own statistical work. In the present chapter we examine some of the consequences of failing to use Bayesian methods in some very simple problems, where the paradoxes of Chapter 15 never arise.
In Chapter 16 we noted that the orthodox objections to Bayesian methods were always philosophical or ideological in nature, never examining the actual results that they give, and we expressed astonishment that mathematically competent persons would use such arguments. In order to give a fair comparison, we need to adopt the opposite tactic here, and concentrate on the demonstrable facts that orthodoxians never mention. Since Bayesian methods have been so egregiously misrepresented in the orthodox literature throughout our lifetimes, we must lean over backwards to avoid misrepresenting orthodox methods now; whenever an orthodox method does yield a satisfactory result in some problem, we shall acknowledge that fact, and we shall not deplore its use merely on ideological grounds.
At this point, we return to the job of designing the robot. We have part of its brain designed, and we have seen how it would reason in a few simple problems of hypothesis testing and estimation. In every problem it has solved thus far, the results have either amounted to the same thing as, or been demonstrably superior to, those offered in the ‘orthodox’ statistical literature. But it is still not a very versatile reasoning machine, because it has only one means by which it can translate raw information into numerical values of probabilities: the principle of indifference (2.95). Consistency requires it to recognize the relevance of prior information, and so in almost every problem it is faced at the outset with the problem of assigning initial probabilities, whether they are technically called prior probabilities or sampling probabilities. It can use indifference for this if it can break the situation up into mutually exclusive, exhaustive possibilities in such a way that no one of them is preferred to any other by the evidence. But often there will be prior information that does not change the set of possibilities, yet does give a reason for preferring one possibility to another. What do we do in this case?
Orthodoxy evades this problem by simply ignoring prior information for fixed parameters, and maintaining the fiction that sampling probabilities are known frequencies.
My own impression … is that the mathematical results have outrun their interpretation and that some simple explanation of the force and meaning of the celebrated integral … will one day be found … which will at once render useless all the works hitherto written.
Augustus de Morgan (1838)
Here, de Morgan was expressing his bewilderment at the ‘curiously ubiquitous’ success of methods of inference based on the Gaussian, or normal, ‘error law’ (sampling distribution), even in cases where the law is not at all plausible as a statement of the actual frequencies of the errors. But the explanation was not forthcoming as quickly as he expected.
In the middle 1950s the writer heard an after-dinner speech by Professor Willy Feller, in which he roundly denounced the practice of using Gaussian probability distributions for errors, on the grounds that the frequency distributions of real errors are almost never Gaussian. Yet in spite of Feller's disapproval, we continued to use them, and their ubiquitous success in parameter estimation continued. So, 145 years after de Morgan's remark, the situation was still unchanged, and the same surprise was expressed by George Barnard (1983): ‘Why have we for so long managed with normality assumptions?’
Today we believe that we can, at last, explain (1) the inevitably ubiquitous use, and (2) the ubiquitous success, of the Gaussian error law.
The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.
James Clerk Maxwell (1850)
Suppose some dark night a policeman walks down a street, apparently deserted. Suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion? Let us first take a leisurely look at the general nature of such problems.
Deductive and plausible reasoning
A moment's thought makes it clear that our policeman's conclusion was not a logical deduction from the evidence; for there may have been a perfectly innocent explanation for everything. It might be, for example, that this gentleman was the owner of the jewelry store and he was coming home from a masquerade party, and didn't have the key with him. However, just as he walked by his store, a passing truck threw a stone through the window, and he was only protecting his own property.
We collect here a brief account of the various mathematical conventions used throughout this work, and discuss some basic mathematical issues that arise in probability theory. Careless notation has led to so many erroneous results in the recent literature that we need to find rules of notation and terminology that make it as difficult as possible to commit such errors.
A mathematical notation, like a language, is not an end in itself but only a communication device. Its purpose is best served if the notation, like the language, is allowed to evolve with use. This evolution usually takes the form of abbreviations for whatever expressions recur often, and reducing the number of symbols when their meaning can be read from the context.
But a living, changing language still needs a kind of safe harbor in the form of a fixed set of rules of grammar and orthography, hidden away in a dictionary for use when ambiguities threaten. Likewise, probability theory needs a fixed set of normative rules on which we can fall back in case of doubt. We state here our formal rules of notation and logical hierarchy; all chapters from Chapter 3 on start with these standard forms, and evolve from them. A notation which is so convenient that it is almost a necessity in one chapter might be only confusing in the next; so each separate topic must be allowed its own independent evolution from the standard beginning.
Probably everybody who has been involved in quantitative measurements has found himself in the following situation. You are trying to measure some quantity θ (which might be, for example, the right ascension of Sirius, the mass of a π-meson, the velocity of seismic waves at a depth of 100 km, the melting point of a new organic compound, the elasticity of consumer demand for apples, etc.). But the apparatus or the data-taking procedure is always imperfect and so, having made n independent measurements of θ, you have n different results (x₁, …, xₙ). How are you to report what you now know about θ? More specifically, what ‘best’ estimate should you announce, and what accuracy are you entitled to claim?
If these n data values were closely clustered together, making a reasonably smooth, single-peaked histogram, you would accept the solutions given in the previous chapters, and might feel that the problem of drawing conclusions from good data is not very difficult, even without any probability theory. But your data are not nicely clustered: one value, xⱼ, lies far away from the nice cluster made by the other (n − 1) values. How are you to deal with this outlier? What effect does it have on the conclusions that you are entitled to draw about θ?
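A minimal numerical illustration (with hypothetical data) of what is at stake: a single stray value can drag the sample mean far from the cluster, while a summary such as the median is barely disturbed:

```python
# Hypothetical data: nine well-clustered measurements of theta plus
# one outlier far from the cluster (illustrative numbers only).
cluster = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1]
data = cluster + [25.0]        # the outlier

mean_cluster = sum(cluster) / len(cluster)
mean_all = sum(data) / len(data)

def median(xs):
    # Middle value of the sorted data (average of the two middle
    # values when the count is even).
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

print(mean_cluster)   # ~10.01: the estimate from the nice cluster alone
print(mean_all)       # 11.51: one outlier shifts the mean by ~1.5
print(median(data))   # 10.05: the median is barely disturbed
```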
We have seen in Chapters 4 and 5 how the appearance of astonishing, unexpected data may cause the resurrection of dead hypotheses; it appears that something like that may be at work here.
With all this confounded trafficking in hypotheses about invisible connections with all manner of inconceivable properties, which have checked progress for so many years, I believe it to be most important to open people's eyes to the number of superfluous hypotheses they are making, and would rather exaggerate the opposite view, if need be, than proceed along these false lines.
H. von Helmholtz (1868)
This chapter and Chapter 13 are concerned with the history of the subject rather than its present status. There is a complex and fascinating history before 1900, recounted by Stigler (1986c), but we are concerned now with more recent developments. In the period from about 1900 to 1970, one school of thought dominated the field so completely that it has come to be called ‘orthodox statistics’. It is necessary for us to understand it, because it is what most working statisticians active today were taught, and its ideas are still being taught, and advocated vigorously, in many textbooks and universities.
In Chapter 17 we want to examine the ‘orthodox’ statistical practice thus developed and compare its technical performance with that of the ‘probability as logic’ approach expounded here. But first, to understand this weird course of events, we need to know something about the problems faced then, the sociology that evolved to deal with them, the roles and personalities of the principal figures, and the general attitude toward scientific inference that orthodoxy represents.
A distinction without a difference has been introduced by certain writers who distinguish ‘Point estimation’, meaning some process of arriving at an estimate without regard to its precision, from ‘Interval estimation’ in which the precision of the estimate is to some extent taken into account.
R. A. Fisher (1956)
Probability theory as logic agrees with Fisher in spirit; that is, it gives us automatically both point and interval estimates from a single calculation. The distinction commonly made between hypothesis testing and parameter estimation is considerably greater than that which concerned Fisher; yet it too is, from our point of view, not a real difference. When we have only a small number of discrete hypotheses {H₁, …, Hₙ} to consider, we usually want to pick out a specific one of them as the most likely in that set, in the light of the prior information and data. The cases n = 2 and n = 3 were examined in some detail in Chapter 4, and larger n is in principle a straightforward and rather obvious generalization.
When the hypotheses become very numerous, however, a different approach seems called for. A set of discrete hypotheses can always be classified by assigning one or more numerical indices which identify them, as in Hₜ (1 ≤ t ≤ n), and if the hypotheses are very numerous one can hardly avoid doing this.
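In the discrete case, picking out the most likely hypothesis in the set is just Bayes' theorem followed by a maximization; the sketch below, with hypothetical priors and likelihoods, shows the normalization and the pick:

```python
# Minimal sketch (hypothetical numbers): posterior over a discrete
# hypothesis set {H_1, ..., H_n} from priors and likelihoods, then
# pick the most probable hypothesis.
priors      = [0.5, 0.3, 0.2]      # P(H_t | I)
likelihoods = [0.01, 0.20, 0.05]   # P(D | H_t, I)

joint = [p * l for p, l in zip(priors, likelihoods)]
z = sum(joint)                     # normalizing constant P(D | I)
posterior = [j / z for j in joint]

best = max(range(len(posterior)), key=lambda t: posterior[t])
print(posterior)                   # sums to 1
print("most probable: H_%d" % (best + 1))
```

Note that a hypothesis favored a priori (H₁ here) can still lose decisively to one the data favor, since the posterior weighs both factors.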
The essence of the present theory is that no probability, direct, prior, or posterior, is simply a frequency.
H. Jeffreys (1939)
We have developed probability theory as a generalized logic of plausible inference which should apply, in principle, to any situation where we do not have enough information to permit deductive reasoning. We have seen it applied successfully in simple prototype examples of nearly all the current problems of inference, including sampling theory, hypothesis testing, and parameter estimation.
Most of probability theory, however, as treated in the past 100 years, has confined attention to a special case of this, in which one tries to predict the results of, or draw inferences from, some experiment that can be repeated indefinitely under what appear to be identical conditions; but which nevertheless persists in giving different results on different trials. Indeed, virtually all application-oriented expositions define probability as meaning ‘limiting frequency in independent repetitions of a random experiment’ rather than as an element of logic. The mathematically oriented often define it more abstractly, merely as an additive measure, without any specific connection to the real world. However, when they turn to applications, they too tend to think of probability in terms of frequency. It is important that we understand the exact relationship between these conventional treatments and the theory being developed here.
Some of these relationships have been seen already; in the preceding five chapters we have shown that probability theory as logic can be applied consistently in many problems of inference that do not fit into the frequentist preconceptions, and so would be considered beyond the scope of probability theory.