This book is about data in many – and sometimes very many – variables and about analysing such data. The book attempts to integrate classical multivariate methods with contemporary methods suitable for high-dimensional data and to present them in a coherent and transparent framework. Writing about ideas that emerged more than a hundred years ago and that have become increasingly relevant again in the last few decades is exciting and challenging. With hindsight, we can reflect on the achievements of those who paved the way, whose methods we apply to ever bigger and more complex data and who will continue to influence our ideas and guide our research. Renewed interest in the classical methods and their extension has led to analyses that give new insight into data and apply to bigger and more complex problems.
There are two players in this book: Theory and Data. Theory advertises its wares to lure Data into revealing its secrets, but Data has its own ideas. Theory wants to provide elegant solutions which answer many but not all of Data's demands, but these lead Data to pose new challenges to Theory. Statistics thrives on interactions between theory and data, and we develop better theory when we ‘listen’ to data. Statisticians often work with experts in other fields and analyse data from many different areas. We, the statisticians, need and benefit from the expertise of our colleagues in the analysis of their data and interpretation of the results of our analysis.
I am not bound to please thee with my answer (William Shakespeare, The Merchant of Venice, 1596–1598).
Introduction
It is not always possible to measure the quantities of interest directly. In psychology, intelligence is a prime example; scores in mathematics, language and literature, or comprehensive tests are used to describe a person's intelligence. From these measurements, a psychologist may want to derive a person's intelligence. Behavioural scientist Charles Spearman is credited with being the originator and pioneer of the classical theory of mental tests, the theory of intelligence and what is now called Factor Analysis. In 1904, Spearman proposed a two-factor theory of intelligence which he extended over a number of decades (see Williams et al., 2003). Since its early days, Factor Analysis has enjoyed great popularity and has become a valuable tool in the analysis of complex data in areas as diverse as behavioural sciences, health sciences and marketing. The appeal of Factor Analysis lies in the ease of use and the recognition that there is an association between the hidden quantities and the measured quantities.
The aim of Factor Analysis is
• to exhibit the relationship between the measured and the underlying variables, and
• to estimate the underlying variables, called the hidden or latent variables.
Although many of the key developments have arisen in the behavioural sciences, Factor Analysis has an important place in statistics. Its model-based nature has invited, and resulted in, many theoretical and statistical advances.
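As a small illustration of these two aims, the following sketch simulates test scores driven by a single hidden variable and then estimates that variable from the scores alone. The loadings, the noise level and the use of scikit-learn's FactorAnalysis are illustrative assumptions, not anything prescribed by the text.

```python
# A minimal sketch of the factor analysis idea on simulated data: one hidden
# variable drives four observed test scores, and we recover it from the
# measurements alone. Loadings and noise level are invented for illustration.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))              # the hidden (latent) variable
loadings = np.array([[0.9, 0.8, 0.7, 0.6]])   # association with the tests
scores = latent @ loadings + 0.4 * rng.normal(size=(n, 4))

fa = FactorAnalysis(n_components=1)
estimated = fa.fit_transform(scores)          # estimated latent variable

# The estimate should correlate strongly (up to sign) with the truth.
print(np.corrcoef(latent.ravel(), estimated.ravel())[0, 1])
```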
Den Samen legen wir in ihre Hände! Ob Glück, ob Unglück aufgeht, lehrt das Ende (Friedrich von Schiller, Wallensteins Tod, 1799). We put the seed in your hands! Whether it develops into fortune or misfortune, only the end can teach us.
Introduction
In the beginning – in 1901 – there was Principal Component Analysis. On our journey through this book we have encountered many different methods for analysing multidimensional data, and many times on this journey, Principal Component Analysis reared its – some might say, ugly – head. About a hundred years after its birth, a renaissance of Principal Component Analysis (PCA) has led to new theoretical and practical advances for high-dimensional data and to SPCA, where S variously refers to simple, supervised and sparse. It seems appropriate, at the end of our journey, to return to where we started and take a fresh look at the developments which have revitalised Principal Component Analysis: the availability of high-dimensional and functional data, the necessity for dimension reduction and feature selection, and new and sparse ways of representing data.
Exciting developments in the analysis of high-dimensional data have been interacting with similar ones in Statistical Learning. It is not clear where analysis of data stops and learning from data starts. An essential part of both is the selection of ‘important’ and ‘relevant’ features or variables.
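As an illustration of the sparse variant, the following sketch contrasts classical and sparse principal components on simulated data with more variables than observations. The use of scikit-learn's PCA and SparsePCA, and the penalty value, are illustrative assumptions.

```python
# A minimal sketch contrasting classical PCA with sparse PCA on simulated
# high-dimensional data (more variables than observations); the penalty
# alpha=1.0 is an arbitrary illustrative choice.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))

pca = PCA(n_components=5).fit(X)
spca = SparsePCA(n_components=5, alpha=1.0, random_state=1).fit(X)

# Sparse PCA drives many loadings to exactly zero, which is what makes it
# useful for feature selection.
print("zero loadings, PCA: ", (np.abs(pca.components_) < 1e-10).mean())
print("zero loadings, SPCA:", (np.abs(spca.components_) < 1e-10).mean())
```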
We live in a new age for statistical inference, where modern scientific technology such as microarrays and fMRI machines routinely produce thousands and sometimes millions of parallel data sets, each with its own estimation or testing problem. Doing thousands of problems at once is more than repeated application of classical methods. Taking an empirical Bayes approach, Bradley Efron, inventor of the bootstrap, shows how information accrues across problems in a way that combines Bayesian and frequentist ideas. Estimation, testing and prediction blend in this framework, producing opportunities for new methodologies of increased power. New difficulties also arise, easily leading to flawed inferences. This book takes a careful look at both the promise and pitfalls of large-scale statistical inference, with particular attention to false discovery rates, the most successful of the new statistical techniques. Emphasis is on the inferential ideas underlying technical developments, illustrated using a large number of real examples.
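To give a concrete flavour of the false discovery rate idea mentioned above, here is a minimal sketch of the Benjamini-Hochberg procedure, one standard FDR method; the simulated mixture of null and non-null p-values and the level q = 0.1 are invented for illustration.

```python
# A minimal sketch of the Benjamini-Hochberg false discovery rate procedure
# on simulated p-values from many parallel tests.
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Boolean mask of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Find the largest k with p_(k) <= (k/m) * q, then reject the k smallest.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

rng = np.random.default_rng(2)
pvals = np.concatenate([rng.uniform(size=950),          # true nulls
                        rng.beta(0.1, 10.0, size=50)])  # non-null signals
print(benjamini_hochberg(pvals, q=0.1).sum(), "discoveries")
```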
There are a number of important questions associated with statistical experiments: when does one given experiment yield more information than another; how can we measure the difference in information; how fast does information accumulate by repeating the experiment? The means of answering such questions has emerged from the work of Wald, Blackwell, LeCam and others and is based on the ideas of risk and deficiency. The present work, which is devoted to the various methods of comparing statistical experiments, is essentially self-contained, requiring only some background in measure theory and functional analysis. Chapters introducing statistical experiments and the necessary convex analysis begin the book and are followed by others on game theory, decision theory and vector lattices. The notion of deficiency, which measures the difference in information between two experiments, is then introduced. The relations between it and other concepts, such as sufficiency, randomisation, distance, ordering, equivalence, completeness and convergence, are explored. This is a comprehensive treatment of the subject and will be an essential reference for mathematical statisticians.
Written by one of the preeminent researchers in the field, this book provides a comprehensive exposition of modern analysis of causation. It shows how causality has grown from a nebulous concept into a mathematical theory with significant applications in the fields of statistics, artificial intelligence, economics, philosophy, cognitive science, and the health and social sciences. Judea Pearl presents and unifies the probabilistic, manipulative, counterfactual, and structural approaches to causation and devises simple mathematical tools for studying the relationships between causal connections and statistical associations. Cited in more than 2,100 scientific publications, it continues to liberate scientists from the traditional molds of statistical thinking. In this revised edition, Judea Pearl elucidates thorny issues, answers readers' questions, and offers a panoramic view of recent advances in this field of research. Causality will be of interest to students and professionals in a wide variety of fields. Judea Pearl received the 2011 Rumelhart Prize from the Cognitive Science Society for his leading research in Artificial Intelligence (AI) and systems.
Students in both the natural and social sciences often seek regression models to explain the frequency of events, such as visits to a doctor, auto accidents or job hiring. This analysis provides a comprehensive account of models and methods to interpret such data. The authors have conducted research in the field for nearly fifteen years and in this work combine theory and practice to make sophisticated methods of analysis accessible to practitioners working with widely different types of data and software. The treatment will be useful to researchers in areas such as applied statistics, econometrics, operations research, actuarial studies, demography, biostatistics, and quantitatively oriented sociology and political science. The book may be used as a reference work on count models or by students seeking an authoritative overview. The analysis is complemented by template programs available on the Internet through the authors' homepages.
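As a flavour of such count models, here is a minimal sketch of a Poisson regression, assuming statsmodels and simulated doctor-visit counts; the covariate and the coefficients are invented for illustration, not taken from the book.

```python
# A minimal sketch of a count regression: simulated doctor-visit counts with
# a log-linear Poisson mean, fitted by a generalised linear model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
age = rng.uniform(20, 80, size=500)
X = sm.add_constant(age)
visits = rng.poisson(np.exp(-1.0 + 0.03 * age))   # log-linear Poisson mean

model = sm.GLM(visits, X, family=sm.families.Poisson()).fit()
print(model.params)    # estimates should lie near (-1.0, 0.03)
```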
This book is about the statistical principles behind the design of effective experiments and focuses on the practical needs of applied statisticians and experimenters engaged in design, implementation and analysis. Emphasising the logical principles of statistical design, rather than mathematical calculation, the authors demonstrate how all available information can be used to extract the clearest answers to many questions. The principles are illustrated with a wide range of examples drawn from real experiments in medicine, industry, agriculture and many experimental disciplines. Numerous exercises are given to help the reader practise techniques and to appreciate the difference that good design can make to an experimental research project. Based on Roger Mead's excellent Design of Experiments, this new edition is thoroughly revised and updated to include modern methods relevant to applications in industry, engineering and modern biology. It also contains seven new chapters on contemporary topics, including restricted randomisation and fractional replication.
Our aim in this book is to explain and illustrate the fundamental statistical concepts required for designing efficient experiments to answer real questions. This book has evolved from a previous book written by the first author. That book was based on 25 years of experience of designing experiments for research scientists and of teaching the concepts of statistical design both to statisticians and to experimenters. The present book is based on a combined total of approximately 100 years of experience of designing experiments for research scientists, and of teaching the concepts of statistical design both to statisticians and to experimenters.
The development of statistical philosophy about the design of experiments has always been dominated by mathematical theory. In contrast, the influence of the availability of vastly improved computing facilities on teaching, textbooks and, most crucially, practical experimentation has been relatively small. The existence of statistical programs capable of analysing the results from any designed experiment does not imply any changes in the main statistical concepts of design. However, developments from these concepts have often been restricted by the earlier need to develop mathematical theory for design in such a way that the results from the designs could be analysed without recourse to computers. The fundamental concepts continually require re-examination and reinterpretation outside the limits implied by classical mathematical theory so that the full range of design possibilities may be considered. The result of the revolution in computing facilities is that the design of experiments should become a much wider and more exciting subject. We hope that this book will display that breadth and excitement.
(a) In an animal feeding experiment, six dietary treatments are to be compared. The diets are all possible combinations of three different levels of molasses at two energy levels. Twenty-four sheep are available and each sheep can be fed a different diet in each of two time periods. It is expected that there will be large differences in nutritional performances between sheep and some systematic differences between the results for the first and second periods. The structure of the experimental units therefore has two blocking classifications, giving a 24 × 2 row-and-column structure. The treatments have a 3 × 2 structure and main effect comparisons, particularly between the three levels of molasses, are the principal area of interest.
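As a small illustration of the 3 × 2 treatment structure in (a), the six diets are simply the crossed combinations of the two factors; the level labels in this sketch are invented.

```python
# A minimal sketch of the 3 x 2 factorial treatment structure: six diets as
# all combinations of molasses level and energy level.
from itertools import product

molasses = ["low", "medium", "high"]
energy = ["low energy", "high energy"]
diets = list(product(molasses, energy))
print(len(diets))    # 6 treatment combinations
for diet in diets:
    print(diet)
```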
(b) An industrial experiment is to be planned to investigate the effects of varying seven factors in a chemical process. Eight treatment combinations can be tested using the same batch of basic material. Eight different batches of material will be available, and it is expected that there may be substantial differences in output for sample units from the different batches. If it is decided to use two levels of each factor, how shall the 64 treatment combinations to be included in the experiment be chosen, and how shall they be allocated to the eight blocks, or batches?
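One standard construction, sketched below under assumed choices, answers both questions in (b): take the half-replicate of the 2^7 factorial defined by the relation I = ABCDEFG, and split its 64 runs into eight blocks of eight by confounding block contrasts generated by ABCE, BCDF and CDEG. Every generalised interaction of these generators involves four factors, so no main effect or two-factor interaction is confounded with blocks; the particular generators are one illustrative choice, not the book's.

```python
# A sketch of a 2^(7-1) design in eight blocks of eight. Assumed choices:
# defining relation I = ABCDEFG; block generators ABCE, BCDF, CDEG (all
# generalised interactions of these have four letters, so blocks confound
# only three- and four-factor interactions).
from itertools import product

blocks = {}
for a, b, c, d, e, f, g in product([-1, 1], repeat=7):
    if a * b * c * d * e * f * g != 1:
        continue                                  # keep the half-replicate
    key = (a * b * c * e, b * c * d * f, c * d * e * g)
    blocks.setdefault(key, []).append((a, b, c, d, e, f, g))

for key, runs in sorted(blocks.items()):
    print(key, len(runs))                         # eight blocks of eight runs
```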
(c) An experiment on absorption of sugar by rabbits is to be designed to compare eight experimental treatments.
(a) An experiment to examine the pattern of variation over time of a particular chemical constituent of blood involved sampling the blood of nine chickens on 25 weekly occasions. The principal interest is in the variation of the chemical over the 25 times, the nine chickens being included to provide replication. The chemical analysis is complex and long, and a set of at most ten blood samples can be analysed concurrently. It is known that there may be substantial differences in the results of the chemical analysis between different sets of samples. How should the 225 samples (25 times for nine chickens) be allocated to sets of ten (or fewer) so that comparisons between the 25 times are made as precise as possible?
(b) In an experiment to compare diets for cows the experimenter has five diet treatments that he wishes to compare. The diets have to be fed to fistulated (surgically prepared) cows so that the performance of the cows can be monitored throughout the application of each diet. Nine such cows are available for sufficient time that four periods of observation can be used for each cow, providing a set of 9 × 4 = 36 observations. Concerned that there will be differences between cows, so that cows have to be treated as a blocking structure, the experimenter had already decided before consulting a statistician that he could only test four of the diet treatments, each cow receiving each diet once.
(a) In an experiment to investigate the effect of training on human-computer-human interactions, six subjects were randomly allocated to each of four training programmes. Subjects were then paired into 12 blocks using two replicates of an unreduced balanced incomplete block design. Each pair carried out a conversation through a computer ‘chat’ program. In addition to several response variables measured on each subject individually, each pair was given a score by an independent observer, for the success of their interaction.
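For concreteness, the 'unreduced' balanced incomplete block design in (a) takes every pair of the four training programmes as a block, and two replicates of the six pairs give the twelve blocks of size two. A minimal sketch, with invented programme labels, follows.

```python
# A minimal sketch of the unreduced balanced incomplete block design for
# four treatments in blocks of size two: all six pairs, replicated twice.
from itertools import combinations
import random

random.seed(0)
programmes = ["P1", "P2", "P3", "P4"]
blocks = list(combinations(programmes, 2)) * 2   # 6 distinct pairs x 2
random.shuffle(blocks)                            # randomise the block order
for i, block in enumerate(blocks, start=1):
    print(f"block {i}: {block}")
```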
We have only a single response representing each block. Can we use this information and, if so, how? If we can, do the block totals contain useful information about the effects of treatments on other responses? Does this affect how we should design the experiment? In particular, for this response, should we have used some blocks with both subjects getting the same treatment?
(b) Eight feeds are to be compared for their effects on the growth of young chickens. The experiment will be carried out using 32 cages, arranged in four brooders, with each brooder having four tiers of two cages. Should the experiment be designed to ensure that each treatment appears once in each brooder and once in each tier, or should we consider the brooder × tier combinations as blocks of size 2 and choose a good design for this setup? Can we do both simultaneously?
Identifying multiple levels in data
In Section 7.3 we considered the analysis for general block–treatment designs. However, in that analysis only the information about treatments from comparisons within blocks was considered.
In an experiment to compare different treatments, each treatment must be applied to several different units. This is because the responses from different units vary, even if the units are treated identically. If each treatment is only applied to a single unit, the results will be ambiguous – we will not be able to distinguish whether a difference in the responses from two units is caused by the different treatments, or is simply due to the inherent differences between the units.
The simplest experimental design to compare t treatments is that in which treatment 1 is applied to n1 units, treatment 2 to n2 units, …, and treatment t to nt units. In many experiments the numbers of units per treatment, n1, n2, …, nt, will be equal, but this is not necessary, or even always desirable. Some treatments may be of greater importance than others, in which case more information will be needed about them, and this will be achieved by increasing the replication for these treatments. This design, in which the only recognisable difference between units is the treatments which are applied to those units, is called a completely randomised design.
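A minimal sketch of constructing such a design follows; the unit and treatment labels are invented, with one treatment deliberately given greater replication.

```python
# A minimal sketch of a completely randomised design: units are shuffled and
# assigned to treatments with (possibly unequal) replication n_i.
import random

random.seed(1)
replication = {"T1": 4, "T2": 4, "T3": 8}   # n_i per treatment
units = list(range(1, sum(replication.values()) + 1))
random.shuffle(units)

allocation = {}
for treatment, n in replication.items():
    allocation[treatment] = sorted(units[:n])
    units = units[n:]
print(allocation)
```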
However, the ambiguity, when each treatment is only applied to a single unit, is not always removed by applying each treatment to multiple units. One treatment might be ‘lucky’ in the selection of units to which it is to be applied, and another might be ‘unlucky’.
Thus far in this book we have considered the design of individual experiments and have been concerned to ensure that each experiment should provide answers to the questions which motivated the experiment as efficiently as possible. In general this has required that the variation in the experimental units be controlled so that the answers provided from each experiment should be as precise as possible. This will frequently require that there should be relatively little variation between the experimental units, i.e. the population from which the units are drawn will be narrowly defined.
However, if the population from which the units are taken is narrowly defined, then it follows logically that the results from the experiment would apply only to that narrowly defined population. This would usually be quite unacceptable to an experimenter, who would hope to convince the wider world that the results from the experiment would apply to a much wider population. For example, determining which variety of rice gives the best results in a highly controlled experiment on the paddy fields of a research institute is only going to be useful if that variety of rice is going to be the best for a large region within which the institute is located, and if farmers in that region believe in the results. Similarly, if a new drug is shown to be an improvement on current practice, through a rigorously controlled clinical trial, the pharmaceutical company which has produced the drug will wish to promote the use of the drug across the whole population of the country, or even of several countries.