In this book, we study a system which perceives the real world. Such a system has to estimate an information source by observation. If the information source is a probability distribution, then the estimation process is called statistical learning, and the system is said to be a statistical model or a learning machine.
Many statistical models have hierarchical layers, hidden variables, collections of modules, or grammatical structures. Such models are nonidentifiable and contain singularities in their parameter spaces: the map from a parameter to a statistical model is not one-to-one, and the Fisher information matrix is not positive definite. Such statistical models are called singular. It has been difficult to examine the learning process of singular models, because there has been no mathematical theory for them.
In this book, we establish a mathematical foundation which enables us to understand the learning process of singular models. This chapter gives an overview of the book before a rigorous mathematical foundation is developed.
Basic concepts in statistical learning
To describe what statistical learning is, we need some basic concepts in probability theory. For the reader who is unfamiliar with probability theory, Section 1.6 summarizes the key results.
Random samples
Let N be a natural number and ℝN be the N-dimensional real Euclidean space. We study the case in which observed data are represented by vectors in ℝN.
In this book, we introduce a fundamental relation between algebraic geometry and statistical learning theory.
Many statistical models and learning machines used in information science, for example, mixtures of probability distributions, neural networks, hidden Markov models, Bayesian networks, stochastic context-free grammars, and topological data analysis, are not regular but singular, because they are nonidentifiable and their Fisher information matrices are singular. In such models, the knowledge to be discovered from examples corresponds to a singularity, hence it has been difficult to develop a mathematical method that enables us to understand statistical estimation and learning processes.
Recently, we established singular learning theory, in which four general formulas are proved for singular statistical models. Firstly, the log likelihood ratio function of any singular model can be represented in a common standard form even if it contains singularities. Secondly, the asymptotic behavior of the evidence or stochastic complexity is clarified, with the result that the learning coefficient is equal to the maximum pole of the zeta function of the statistical model. Thirdly, there exist equations of states that express the universal relation of the Bayes quartet: we can predict the Bayes and Gibbs generalization errors from the Bayes and Gibbs training errors without any knowledge of the true distribution. Fourthly, the symmetry of the generalization and training errors holds for the maximum likelihood and maximum a posteriori estimators. If one-point estimation is applied to statistical learning, the generalization error is equal to the maximum value of a Gaussian process on a real analytic set.
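The second formula can be sketched in symbols. Writing K(w) for the Kullback–Leibler divergence from the true distribution to the model p(x|w), and φ(w) for the prior density, the zeta function and the resulting asymptotic expansion of the stochastic complexity Fₙ take roughly the following form (a hedged summary; the precise regularity conditions and definitions are developed later in the book):

```latex
\zeta(z) = \int K(w)^{z}\,\varphi(w)\,dw , \qquad z \in \mathbb{C}.
```

If the largest pole of ζ(z) is at z = −λ with multiplicity m, then the stochastic complexity behaves asymptotically as

```latex
F_n = n S_n + \lambda \log n - (m-1)\log\log n + O_p(1),
```

where Sₙ is the empirical entropy of the true distribution and λ is the learning coefficient mentioned above.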
In this chapter, we study statistical model evaluation and statistical hypothesis tests in singular learning machines. Firstly, we show that there is no universally optimal learning in general, so model evaluation and hypothesis tests are necessary in statistics. Secondly, we analyze two information criteria, the stochastic complexity and the generalization error, in singular learning machines. Thirdly, we show a method for constructing a statistical hypothesis test when the null hypothesis is a singularity of the alternative hypothesis. Then methods for generating the Bayes a posteriori distribution are introduced; we discuss Markov chain Monte Carlo and variational approximation. In the last part of this chapter, we compare regular and singular learning theories. Regular learning theory is based on the quadratic approximation of the log likelihood ratio function and the central limit theorem on the parameter space, whereas singular learning theory is based on the resolution of singularities and the central limit theorem on the functional space. Mathematically speaking, this book generalizes regular learning theory to singular statistical models.
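As a concrete illustration of how a Bayes a posteriori distribution can be generated in practice, here is a minimal random-walk Metropolis sampler for a toy model (a normal mean with a standard normal prior). The model, the data, and all names are hypothetical choices for illustration, not taken from the text:

```python
import math
import random

random.seed(0)

# Toy data: observations assumed drawn from N(theta, 1) with unknown theta.
data = [0.8, 1.2, 0.5, 1.0, 0.7, 1.1, 0.9, 1.3]

def log_posterior(theta):
    # N(0, 1) prior plus Gaussian log likelihood (additive constants dropped).
    lp = -0.5 * theta ** 2
    for x in data:
        lp += -0.5 * (x - theta) ** 2
    return lp

def metropolis(n_iter=5000, step=0.5):
    theta = 0.0
    samples = []
    for _ in range(n_iter):
        proposal = theta + random.gauss(0.0, step)
        # Accept with probability min(1, posterior ratio).
        log_ratio = log_posterior(proposal) - log_posterior(theta)
        if random.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis()
# Discard a burn-in period, then average the remaining draws.
posterior_mean = sum(samples[1000:]) / len(samples[1000:])
```

For this conjugate toy model the exact posterior mean is n·x̄/(n + 1), so the Monte Carlo average can be checked against a closed form; for singular models no such closed form exists, which is why sampling methods matter.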
Universally optimal learning
There are a lot of statistical estimation methods. One might expect that there is a universally optimal method, which always gives a smaller generalization error than any other method. However, in general, such a method does not exist.
Assumption. Assume that φ(w) is a probability density function on ℝd and that a parameter w is chosen according to φ(w).
It is an article of faith in much applied work that disturbance terms are IID—Independent and Identically Distributed—across observations. Sometimes, this assumption is replaced by other assumptions that are more complicated but equally artificial. For example, when observations are ordered in time, the disturbance terms εt are sometimes assumed to follow an “autoregression,” e.g., εt = λεt−1 + δt, where now λ is a parameter to be estimated, and it is the δt that are IID. However, there is an alternative that should always be kept in mind. Disturbances are DDD—Dependent and Differently Distributed—across subjects. In the autoregression, for example, the δt could easily be DDD, and introducing yet another model would only postpone the moment of truth.
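The autoregression above is easy to simulate, which makes the distinction concrete: the disturbances εt are strongly dependent, yet the innovations δt driving them are IID. A minimal sketch (variable names are our own):

```python
import random

random.seed(42)

def simulate_ar1(lam, n=10000):
    # eps_t = lam * eps_{t-1} + delta_t, with IID N(0, 1) innovations delta_t.
    eps = [0.0]
    for _ in range(n - 1):
        eps.append(lam * eps[-1] + random.gauss(0.0, 1.0))
    return eps

def lag1_autocorr(series):
    # Sample lag-1 autocorrelation.
    m = sum(series) / len(series)
    num = sum((series[t] - m) * (series[t - 1] - m)
              for t in range(1, len(series)))
    den = sum((x - m) ** 2 for x in series)
    return num / den

eps = simulate_ar1(0.7)
rho = lag1_autocorr(eps)  # close to lam = 0.7 for a long series
```

The point of the passage, of course, is that real disturbances need not obey even this dependence structure: the δt themselves could be DDD, and the simulation only shows what the model asserts, not what the world does.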
A second article of faith for many applied workers is that functions are linear with coefficients that are constant across subjects. The alternative is that functions are non-linear, with coefficients (or parameters more generally) that vary across subjects. The dueling acronyms would be LCC (Linear with Constant Coefficients) and NLNC (Non-Linear with Non-constant Coefficients). Some models have “random coefficients,” which only delays the inevitable: coefficients are assumed to be drawn at random from distributions that are constant across subjects. Why would that be so?
These articles of faith have had considerable influence on the applied literature. Therefore, when reading a statistical study, try to find out what kind of statistical analysis got the authors from the data to the conclusions.
Some books are correct. Some are clear. Some are useful. Some are entertaining. Few are even two of these. This book is all four. Statistical Models: Theory and Practice is lucid, candid and insightful, a joy to read. We are fortunate that David Freedman finished this new edition before his death in late 2008. We are deeply saddened by his passing, and we greatly admire the energy and cheer he brought to this volume—and many other projects—during his final months.
This book focuses on half a dozen of the most common tools in applied statistics, presenting them crisply, without jargon or hyperbole. It dissects real applications: a quarter of the book reprints articles from the social and life sciences that hinge on statistical models. It articulates the assumptions necessary for the tools to behave well and identifies the work that the assumptions do. This clarity makes it easier for students and practitioners to see where the methods will be reliable; where they are likely to fail, and how badly; where a different method might work; and where no inference is possible—no matter what tool somebody tries to sell them.
Many texts at this level are little more than bestiaries of methods, presenting dozens of tools with scant explication or insight, a cookbook, numbers-are-numbers approach. “If the left hand side is continuous, use a linear model; fit by least-squares. If the left hand side is discrete, use a logit or probit model; fit by maximum likelihood.”
This book is about regression models and variants like path models, simultaneous-equation models, logits and probits. Regression models can be used for different purposes:
(i) to summarize data,
(ii) to predict the future,
(iii) to predict the results of interventions.
The third—causal inference—is the most interesting and the most slippery. It will be our focus. For background, this section covers some basic principles of study design.
Causal inferences are made from observational studies, natural experiments, and randomized controlled experiments. When using observational (non-experimental) data to make causal inferences, the key problem is confounding. Sometimes this problem is handled by subdividing the study population (stratification, also called cross-tabulation), and sometimes by modeling. These strategies have various strengths and weaknesses, which need to be explored.
In medicine and social science, causal inferences are most solid when based on randomized controlled experiments, where investigators assign subjects at random—by the toss of a coin—to a treatment group or to a control group. Up to random error, the coin balances the two groups with respect to all relevant factors other than treatment. Differences between the treatment group and the control group are therefore due to treatment. That is why causation is relatively easy to infer from experimental data. However, experiments tend to be expensive, and may be impossible for ethical or practical reasons. Then statisticians turn to observational studies.
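The balancing effect of the coin can be checked by simulation. Below is a minimal sketch (the covariate and the population are invented for illustration): random assignment splits a population so that, up to random error, a pre-treatment covariate has nearly the same mean in the treatment and control groups.

```python
import random

random.seed(1)

# Hypothetical study population: each subject has one pre-treatment
# covariate (say, age), drawn here from N(50, 10) for illustration.
ages = [random.gauss(50, 10) for _ in range(10000)]

# Randomized assignment: toss a fair coin for each subject.
treatment, control = [], []
for age in ages:
    (treatment if random.random() < 0.5 else control).append(age)

mean_t = sum(treatment) / len(treatment)
mean_c = sum(control) / len(control)
# Up to random error, the two group means agree, so any post-treatment
# difference between the groups can be attributed to the treatment.
```

With 10,000 subjects the difference in group means is on the order of a few tenths of a year, shrinking as the sample grows; this is what "the coin balances the two groups" means quantitatively.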
I test several hypotheses concerning the origins of political repression in the states of the United States. The hypotheses are drawn from the elitist theory of democracy, which asserts that repression of unpopular political minorities stems from the intolerance of the mass public, the generally more tolerant elites not supporting such repression. Focusing on the repressive legislation adopted by the states during the McCarthy era, I examine the relationships between elite and mass opinion and repressive public policy. Generally it seems that elites, not masses, were responsible for the repression of the era. These findings suggest that the elitist theory of democracy is in need of substantial theoretical reconsideration, as well as further empirical investigation.
Over three decades of research on citizen willingness to “put up with” political differences has led to the conclusion that the U.S. public is remarkably intolerant. Though the particular political minority that is salient enough to attract the wrath of the public may oscillate over time between the Left and the Right (e.g., Sullivan, Piereson, and Marcus 1982), generally, to be much outside the centrist mainstream of U.S. politics is to incur a considerable risk of being the object of mass political intolerance.