Get your facts first, and then you can distort them as much as you please (Mark Twain, 1835–1910).
Introduction
The first part of this book dealt with the three classical problems: finding structure within data, determining relationships between different subsets of variables and dividing data into classes. In Part II, we focus on the first of these problems: finding structure – in particular, groups or factors – in data. The three methods we explore, Cluster Analysis, Factor Analysis and Multidimensional Scaling, are classical in origin and were developed initially in the behavioural sciences. They have since become indispensable tools in areas as diverse as psychology, psychiatry, biology, medicine and marketing, and are now mainstream statistical techniques. We will see that Principal Component Analysis plays an important role in these methods, either as a preliminary step in the analysis or as a special case within a broader framework.
Cluster Analysis is similar to Discriminant Analysis in that one attempts to partition the data into groups. In biology, one might want to determine specific cell subpopulations. In archaeology, researchers have attempted to establish taxonomies of stone tools or funeral objects by applying cluster-analytic techniques. Unlike in Discriminant Analysis, however, we do not know the class membership of any of the observations. The emphasis in Factor Analysis and Multidimensional Scaling is on the interpretability of the data in terms of a small number of meaningful descriptors or dimensions.
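Because no class labels are available, a clustering method must discover the groups from the data alone. The following minimal Python sketch, on simulated and entirely hypothetical data, illustrates the idea with k-means, one of the simplest cluster-analytic techniques; it is a sketch of the general approach, not a method prescribed by this book.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data with unknown group structure (e.g. two cell
# subpopulations); unlike in Discriminant Analysis, no class labels
# are available to the method.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

# A bare-bones k-means loop: alternate between assigning each observation
# to its nearest centre and recomputing each centre as a group mean.
k = 2
centres = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):
    labels = ((X[:, None, :] - centres) ** 2).sum(axis=2).argmin(axis=1)
    centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centres.round(2))  # two recovered group centres, near (0, 0) and (4, 4)
```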
Alles Gescheite ist schon gedacht worden, man muß nur versuchen, es noch einmal zu denken (Johann Wolfgang von Goethe, Wilhelm Meisters Wanderjahre; 1749–1832). Every clever thought has been thought before; we can only try to think it once again.
Introduction
In Chapter 2 we represented a random vector as a linear combination of uncorrelated vectors. We now progress from one random vector to two, but look for correlation between the variables of the first vector and those of the second; in particular, we want to find out which variables are correlated and how strong this relationship is.
In medical diagnostics, for example, we may encounter multivariate measurements obtained from tissue and plasma samples of patients, where the tissue and plasma variables typically differ. A natural question is: What is the relationship between the tissue measurements and the plasma measurements? A strong relationship between a combination of tissue variables and a combination of plasma variables typically indicates that either set of measurements could be used for a particular diagnosis. A very weak relationship between the plasma and tissue variables tells us that the two sets of variables are not equally appropriate for a particular diagnosis.
On the share market, one might want to compare changes in the price of industrial shares and mining shares over a period of time. The time points are the observations, and for each time point, we have two sets of variables: those arising from industrial shares and those arising from mining shares.
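A minimal Python sketch of this idea on simulated, hypothetical share data: the canonical correlations, which measure the strength of the linear relationships between the two sets of variables, are the singular values of the whitened cross-covariance matrix. The market factor and all numbers below are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n time points, p industrial-share and q mining-share
# price changes, driven in part by a common (simulated) market factor.
n, p, q = 500, 4, 3
market = rng.standard_normal((n, 1))
X = market @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
Y = market @ rng.standard_normal((1, q)) + rng.standard_normal((n, q))

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# Canonical correlations are the singular values of the whitened
# cross-covariance matrix Sxx^(-1/2) Sxy Syy^(-1/2).
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
Sxx, Syy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
Sxy = Xc.T @ Yc / (n - 1)
K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)

print(np.linalg.svd(K, compute_uv=False).round(2))
# the first canonical correlation is large; the remaining ones are small
```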
‘That's not a regular rule: you invented it just now.’ ‘It's the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice (Lewis Carroll, Alice's Adventures in Wonderland, 1865).
Introduction
To discriminate means to single out, to recognise and understand differences and to distinguish. Of special interest is discrimination in two-class problems: A tumour is benign or malignant, and the correct diagnosis needs to be obtained. In the finance and credit-risk area, one wants to assess whether a company is likely to go bankrupt in the next few years or whether a client will default on mortgage repayments. To be able to make decisions in these situations, one needs to understand what distinguishes a ‘good’ client from one who is likely to default or go bankrupt.
Discriminant Analysis starts with data for which the classes are known and finds characteristics of the observations that accurately predict each observation's class. One then combines this information into a rule which leads to a partitioning of the observations into disjoint classes. When using Discriminant Analysis for tumour diagnosis, for example, the first step is to determine the variables which best characterise the difference between the benign and malignant groups – based on data for tumours whose status (benign or malignant) is known – and to construct a decision rule based on these variables.
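As an illustration of this two-step programme, the following Python sketch constructs such a rule with Fisher's linear discriminant on simulated, hypothetical two-class data; it is one classical way of combining the class information into a decision rule, not the only one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical labelled data: two variables measured on tumours whose
# status is known (class 0 = benign, class 1 = malignant).
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=100)  # benign
X1 = rng.multivariate_normal([2.0, 1.0], cov, size=100)  # malignant

# Fisher's linear discriminant: the direction w = Sw^(-1) (m1 - m0)
# best separates the projected class means relative to the
# within-class scatter Sw.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0.T) + np.cov(X1.T)
w = np.linalg.solve(Sw, m1 - m0)

# Decision rule: classify by which side of the projected midpoint the
# observation's score w'x falls on.
threshold = w @ (m0 + m1) / 2
def classify(x):
    return int(w @ x > threshold)

print(classify(np.array([1.8, 0.9])))  # 1: predicted malignant
```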
This book is about data in many – and sometimes very many – variables and about analysing such data. The book attempts to integrate classical multivariate methods with contemporary methods suitable for high-dimensional data and to present them in a coherent and transparent framework. Writing about ideas that emerged more than a hundred years ago and that have become increasingly relevant again in the last few decades is exciting and challenging. With hindsight, we can reflect on the achievements of those who paved the way, whose methods we apply to ever bigger and more complex data and who will continue to influence our ideas and guide our research. Renewed interest in the classical methods and their extension has led to analyses that give new insight into data and apply to bigger and more complex problems.
There are two players in this book: Theory and Data. Theory advertises its wares to lure Data into revealing its secrets, but Data has its own ideas. Theory wants to provide elegant solutions which answer many but not all of Data's demands, but these lead Data to pose new challenges to Theory. Statistics thrives on interactions between theory and data, and we develop better theory when we ‘listen’ to data. Statisticians often work with experts in other fields and analyse data from many different areas. We, the statisticians, need and benefit from the expertise of our colleagues in the analysis of their data and interpretation of the results of our analysis.
I am not bound to please thee with my answer (William Shakespeare, The Merchant of Venice, 1596–1598).
Introduction
It is not always possible to measure the quantities of interest directly. In psychology, intelligence is a prime example; scores in mathematics, language and literature, or comprehensive tests, are used to describe a person's intelligence. From these measurements, a psychologist may want to derive a person's intelligence. The behavioural scientist Charles Spearman is credited with being the originator and pioneer of the classical theory of mental tests, the theory of intelligence and what is now called Factor Analysis. In 1904, Spearman proposed a two-factor theory of intelligence, which he extended over a number of decades (see Williams et al., 2003). Since its early days, Factor Analysis has enjoyed great popularity and has become a valuable tool in the analysis of complex data in areas as diverse as the behavioural sciences, health sciences and marketing. The appeal of Factor Analysis lies in its ease of use and in the recognition that there is an association between the hidden quantities and the measured quantities.
The aim of Factor Analysis is
• to exhibit the relationship between the measured and the underlying variables, and
• to estimate the underlying variables, called the hidden or latent variables.
Although many of the key developments have arisen in the behavioural sciences, Factor Analysis has an important place in statistics. Its model-based nature has invited, and resulted in, many theoretical and statistical advances.
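A minimal sketch of both aims, on simulated and hypothetical test-score data, using scikit-learn's FactorAnalysis: the estimated loadings exhibit the relationship between the measured scores and the latent factor, and the factor scores estimate the latent variable itself. The one-factor model and all numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)

# Hypothetical test scores generated from one latent 'intelligence'
# factor g: each of six observed scores loads on g plus unique noise.
n = 300
g = rng.standard_normal(n)                       # latent variable (unobserved)
loadings = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
scores = np.outer(g, loadings) + 0.5 * rng.standard_normal((n, 6))

# Fit a one-factor model: the estimated loadings exhibit the relationship
# between measured and latent variables, and transform() estimates the
# latent variable itself (determined only up to sign).
fa = FactorAnalysis(n_components=1).fit(scores)
print(fa.components_.round(2))                    # estimated loadings
g_hat = fa.transform(scores).ravel()
print(abs(np.corrcoef(g, g_hat)[0, 1]).round(2))  # close to 1
```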
Den Samen legen wir in ihre Hände! Ob Glück, ob Unglück aufgeht, lehrt das Ende (Friedrich von Schiller, Wallensteins Tod, 1799). We put the seed in their hands! Whether it develops into fortune or misfortune, only the end can teach us.
Introduction
In the beginning – in 1901 – there was Principal Component Analysis. On our journey through this book we have encountered many different methods for analysing multidimensional data, and many times on this journey, Principal Component Analysis reared its – some might say, ugly – head. About a hundred years after its birth, a renaissance of Principal Component Analysis (PCA) has led to new theoretical and practical advances for high-dimensional data and to SPCA, where S variously refers to simple, supervised and sparse. It seems appropriate, at the end of our journey, to return to where we started and take a fresh look at the developments which have revitalised Principal Component Analysis. These include the availability of high-dimensional and functional data, the necessity for dimension reduction and feature selection, and new, sparse ways of representing data.
Exciting developments in the analysis of high-dimensional data have been interacting with similar ones in Statistical Learning. It is not clear where analysis of data stops and learning from data starts. An essential part of both is the selection of ‘important’ and ‘relevant’ features or variables.
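To fix ideas, here is a minimal PCA sketch in Python on simulated, hypothetical high-dimensional data (many more variables than observations), computed via the singular value decomposition; the dimensions and noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical high-dimensional data: many more variables than
# observations, with most variance carried by a few directions.
n, d = 50, 200
X = rng.standard_normal((n, 5)) @ rng.standard_normal((5, d)) \
    + 0.1 * rng.standard_normal((n, d))

# PCA via the singular value decomposition of the centred data: the right
# singular vectors are the principal component directions, and the squared
# singular values give the variance explained by each component.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()

print(explained[:6].round(3))   # the first five components dominate
scores = Xc @ Vt[:5].T          # a five-dimensional representation of X
```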
Introduction. We have throughout Chapter II assumed a constant probability underlying the frequency ratios obtained from observation. It is fairly obvious that frequency ratios are often found from material in which the underlying probability is not constant. The statistician should then make use of all available knowledge of the material for appropriate classification into subsets for analysis and comparison. It thus becomes important to consider a set of observations which may be broken into subsets for examination and comparison as to whether the underlying probability seems to be constant from subset to subset. In the separation of a large number of relative frequencies into n subsets according to some appropriate principle of classification, it is useful to make the classification so that the theory of Lexis may be applied. In the theory of Lexis we consider three types of series or distributions, characterized by the following properties:
1. The underlying probability p may remain a constant throughout the whole field of observation. Such a series is called a Bernoulli series, and has been considered in Chapter II.
2. Suppose next that the probability of an event varies from trial to trial within a set of s trials, but that the several probabilities for one set of s trials are identical to those of every other of the n sets of s trials. Then the series is called a Poisson series.
3. Suppose, finally, that the probability is constant within each set of s trials but varies from set to set. Then the series is called a Lexis series.
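A simulation makes the distinction concrete. The following Python sketch (all numbers hypothetical) generates the three kinds of series and compares the observed dispersion of the relative frequencies with the binomial value p(1 − p)/s: the ratio is near 1 for a Bernoulli series, below 1 for a Poisson series and above 1 for a Lexis series.

```python
import numpy as np

rng = np.random.default_rng(4)
n_sets, s = 2000, 100   # hypothetical: n sets of s trials each

# Bernoulli series: the same p throughout; the dispersion of the relative
# frequencies matches the binomial value p(1 - p)/s.
p = 0.3
bernoulli = rng.binomial(s, p, size=n_sets) / s
print(bernoulli.var() / (p * (1 - p) / s))        # close to 1

# Poisson series: p varies within each set, but the same schedule of
# probabilities repeats in every set; dispersion is subnormal.
p_within = rng.uniform(0.1, 0.5, size=s)
poisson = rng.binomial(1, p_within, size=(n_sets, s)).mean(axis=1)
pbar = p_within.mean()
print(poisson.var() / (pbar * (1 - pbar) / s))    # below 1

# Lexis series: p is constant within each set but varies from set to set;
# dispersion is supernormal.
p_between = rng.uniform(0.1, 0.5, size=n_sets)
lexis = rng.binomial(s, p_between) / s
qbar = p_between.mean()
print(lexis.var() / (qbar * (1 - qbar) / s))      # above 1
```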
The binomial description of frequency. In Chapter I attention was directed to the very simple process of finding the relative frequency of occurrence of an event or character among the s cases in question. Let us now conceive of repeating the process of finding relative frequencies on many random samples, each consisting of s items drawn from the same population. To characterize the degree of stability or the degree of dispersion of such a series of relative frequencies is a fundamental statistical problem.
To illustrate, suppose we repeat the throwing of a set of 1,000 coins many times. An observed frequency distribution could then be exhibited with respect to the number of heads obtained in each set of 1,000, or with respect to the relative frequency of heads in sets of 1,000. Such a procedure would be a laborious experimental treatment of the problem of the distribution of relative frequencies from repeated trials. What we seek is a mathematical method of obtaining the theoretical frequency distribution with respect to the number of heads or with respect to the relative frequency of heads in the sets.
To consider a more general problem, suppose we draw many sets of s balls from an urn, one at a time with replacement, and let p be the probability of drawing a white ball in a single trial.
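The mathematical method sought is the binomial distribution: the probability of exactly k successes in s independent trials, each with success probability p, is C(s, k) p^k (1 − p)^(s − k). The following Python sketch (with hypothetical numbers) compares this theoretical value with the laborious experimental treatment described above.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(5)

# The experimental treatment: repeat the set of s = 1,000 coins many
# times and record the number of heads in each set (numbers hypothetical).
s, n_sets, p = 1000, 20000, 0.5
heads = rng.binomial(s, p, size=n_sets)

# The mathematical method: the binomial probability of exactly k
# successes in s trials, P(k) = C(s, k) p^k (1 - p)^(s - k).
def binom_pmf(k):
    return comb(s, k) * p ** k * (1 - p) ** (s - k)

k = 500
print((heads == k).mean())   # observed relative frequency of 500 heads
print(binom_pmf(k))          # theoretical value, about 0.025
```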
Introduction. In Chapter I we have discussed very briefly three different methods of describing frequency distributions of one variable—the purely graphic method, the method of averages and measures of dispersion, and the method of theoretical frequency functions or curves. The weakness and inadequacy of the purely graphic method lie in the fact that it fails to give a numerical description of the distribution. While the method of averages and measures of dispersion gives a numerical description in the form of a summary characterization which is likely to be useful for many statistical purposes, particularly for purposes of comparison, the method is inadequate for some purposes because (1) it does not give a characterization of the distribution in the neighborhood of each point x or in each small interval x to x+dx of the variable, and (2) it does not give a functional relation between the values of the variable x and the corresponding frequencies.
To give a description of the distribution at each small interval x to x+dx and to give a functional relation between the variable x and the frequency or probability we require a third method, which may be described as the “analytical method of describing frequency distributions.” This method uses theoretical frequency functions.
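A minimal Python sketch of the analytical method follows; the normal frequency function and all numbers are assumptions chosen purely for illustration. The value f(x) dx approximates the relative frequency in the small interval from x to x + dx, which is exactly what the summary measures alone cannot supply.

```python
import numpy as np

rng = np.random.default_rng(6)

# The theoretical frequency function here is the normal curve (an
# assumption for illustration only).
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 170.0, 7.5
data = rng.normal(mu, sigma, size=100_000)   # e.g. simulated heights

# f(x) dx approximates the relative frequency in the interval x to x + dx.
x, dx = 175.0, 0.5
observed = ((data >= x) & (data < x + dx)).mean()
predicted = normal_pdf(x + dx / 2, mu, sigma) * dx
print(observed, predicted)   # the two values agree closely
```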
Introduction. In Chapter II we have dealt to some extent with the effects of random sampling fluctuations on relative frequencies. But it is fairly obvious that the interest of the statistician in the effects of sampling fluctuations extends far beyond the fluctuations in relative frequencies. To illustrate, suppose we calculate any statistical measure such as an arithmetic mean, median, standard deviation, correlation coefficient, or parameter of a frequency function from the actual frequencies given by a sample of data. If we then need either to form a judgment as to the stability of such results from sample to sample or to use the results in drawing inferences about the sampled population, the common-sense process of induction involved is much aided by a knowledge of the general order of magnitude of the sampling discrepancies which may reasonably be expected because of the limited size of the sample from which we have calculated our statistical measures.
We may very easily illustrate the nature of the more common problems of sampling by considering the determination of certain characteristics of a race of men. For example, suppose we wish to describe any character such as height, weight, or other measurable attributes among the white males aged 30 in the race. We should almost surely have to construct our science on the basis of results obtained from a sample.
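The order of magnitude of these sampling discrepancies is easily illustrated by simulation. In the Python sketch below, the 'population' and all numbers are hypothetical; the observed fluctuation of the sample mean is compared with the theoretical standard error σ/√n.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical 'population' of heights, simulated as normal purely
# for illustration.
population = rng.normal(170.0, 7.5, size=1_000_000)

# Draw many samples of size n and observe how the sample mean fluctuates;
# the spread of these means is the sampling discrepancy discussed above.
n, n_samples = 100, 5000
samples = rng.choice(population, size=(n_samples, n))
means = samples.mean(axis=1)

print(means.std())                      # observed fluctuation of the mean
print(population.std() / np.sqrt(n))    # theoretical standard error
```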
This book on mathematical statistics is the third of the series of Carus Mathematical Monographs. The purpose of the monographs, admirably expressed by Professor Bliss in the first book of the series, is “to make the essential features of various mathematical theories more accessible and attractive to as many persons as possible who have an interest in mathematics but who may not be specialists in the particular theory presented.”
The problem of making statistical theory available has been changed considerably during the past two or three years by the appearance of a large number of text-books on statistical methods. In the course of preparation of the manuscript of the present volume, the writer felt at one time that perhaps the recent books had covered the ground in such a way as to accomplish the main purposes of the monograph which was in process of preparation. But further consideration gave support to the view that although the recent books on statistical method will serve useful purposes in the teaching and standardization of statistical practice, they have not, in general, gone far toward exposing the nature of the underlying theory, and some of them may even give misleading impressions as to the place and importance of probability theory in statistical analysis.