In the modern world we are often faced with enormous data sets, both in terms of the number of observations n and in terms of the number of variables p. This is of course good news—we have always said the more data we have, the better predictive models we can build. Well, we are there now—we have tons of data, and must figure out how to use it.
Although we can scale up our software to fit the collection of linear and generalized linear models to these behemoths, they are often too modest and can fall way short in terms of predictive power. A need arose for some general-purpose tools that could scale well to these bigger problems, and exploit the large amount of data by fitting a much richer class of functions, almost automatically. Random forests and boosting are two relatively recent innovations that fit the bill, and have become very popular as “out-of-the-box” learning algorithms that enjoy good predictive performance. Random forests are somewhat more automatic than boosting, but can also suffer a small performance hit as a consequence.
These two methods have something in common: they both represent the fitted model by a sum of regression trees. We discuss trees in some detail in Chapter 8. A single regression tree is typically a rather weak prediction model; it is rather amazing that an ensemble of trees leads to the state of the art in black-box predictors!
We can broadly describe both these methods very simply.
Random forest Grow many deep regression trees to randomized versions of the training data, and average them. Here “randomized” is a wide-ranging term, and includes bootstrap sampling and/or subsampling of the observations, as well as subsampling of the variables.
Boosting Repeatedly grow shallow trees to the residuals, and hence build up an additive model consisting of a sum of trees.
The basic mechanism in random forests is variance reduction by averaging. Each deep tree has a high variance, and the averaging brings the variance down. In boosting the basic mechanism is bias reduction, although different flavors include some variance reduction as well. Both methods inherit all the good attributes of trees, most notable of which is variable selection.
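As a rough illustration of the two recipes (a sketch only, not the implementations studied later in the book; the data, tree depths, ensemble sizes and learning rate below are arbitrary placeholder choices):

```python
# Sketch of the two recipes using off-the-shelf regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)
X_test = rng.uniform(-3, 3, size=(100, 5))

def random_forest_predict(X_train, y_train, X_new, n_trees=100, max_features=2):
    """Average many deep trees grown on randomized versions of the data."""
    preds = np.zeros(len(X_new))
    for _ in range(n_trees):
        idx = rng.integers(0, len(X_train), len(X_train))        # bootstrap the observations
        tree = DecisionTreeRegressor(max_features=max_features)  # subsample variables at each split
        tree.fit(X_train[idx], y_train[idx])                     # deep tree: grown without depth limit
        preds += tree.predict(X_new)
    return preds / n_trees                                       # averaging brings the variance down

def boosting_predict(X_train, y_train, X_new, n_trees=100, depth=2, lr=0.1):
    """Repeatedly fit shallow trees to the residuals; the model is their sum."""
    fit = np.zeros(len(X_train))
    pred = np.zeros(len(X_new))
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=depth)            # shallow tree
        tree.fit(X_train, y_train - fit)                         # fit the current residuals
        fit += lr * tree.predict(X_train)                        # shrunken additive update reduces bias
        pred += lr * tree.predict(X_new)
    return pred

rf_predictions = random_forest_predict(X, y, X_test)
boost_predictions = boosting_predict(X, y, X_test)
```

In the first function it is the averaging of many noisy, deep trees that drives the variance down; in the second, each shallow tree chips away at the remaining bias by fitting the current residuals.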
A study of human hearing and the biomechanical processes involved in hearing reveals several non-linear steps, or stages, in the perception of sound. Each of these stages contributes to the eventual mismatch between the subjective features of a sound and its purely physical characteristics.
Put simply, what we think we hear can differ quite significantly from the physical sounds that are actually present (which in turn differ from what might be recorded onto a computer, given the imperfections of microphones and recording technology). By taking into account the various non-linearities in the hearing process, and some of the basic physical characteristics of the ear, nervous system and brain, it becomes possible to begin to explain these discrepancies between perception and physical measurement.
Over the years, science and technology have incrementally improved our ability to understand and model the hearing process using purely physical data. One simple example is that of A-law compression (or the similar µ-law used in some regions of the world), where approximately logarithmic amplitude quantisation replaces the linear quantisation of PCM (pulse code modulation): humans tend to perceive amplitude logarithmically rather than linearly, so A-law quantisation using 8 bits to represent each sample sounds better than linear PCM quantisation using 8 bits (in truth, it can sound better than speech quantised linearly with 12 bits). It thus achieves a higher degree of subjective speech quality than PCM for a given bitrate [4].
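A minimal sketch of the A-law compression curve itself (the real G.711 codec packs a segmented approximation of this curve into 8 bits, a step omitted here) might look as follows, assuming samples already normalised to the range ±1:

```python
# Sketch of the A-law compression/expansion curve for samples in [-1, 1].
import numpy as np

A = 87.6  # standard A-law parameter

def a_law_compress(x):
    ax = np.abs(x)
    y = np.where(ax < 1.0 / A,
                 A * ax / (1.0 + np.log(A)),
                 (1.0 + np.log(np.maximum(A * ax, 1e-12))) / (1.0 + np.log(A)))
    return np.sign(x) * y

def a_law_expand(y):
    ay = np.abs(y)
    x = np.where(ay < 1.0 / (1.0 + np.log(A)),
                 ay * (1.0 + np.log(A)) / A,
                 np.exp(ay * (1.0 + np.log(A)) - 1.0) / A)
    return np.sign(y) * x
```

Because the curve is steep near zero, quiet samples are spread across many more of the available quantisation levels than they would be under linear PCM, matching the roughly logarithmic loudness perception described above.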
Physical processes
A cut-away diagram of the human ear (outer, middle and inner) is shown in Figure 4.1. The outer ear includes the pinna, which filters sound and focuses it into the external auditory canal. Sound then acts upon the eardrum, where it is transmitted and amplified through the middle ear by the three bones, the malleus, incus and stapes, to the oval window, opening on to the cochlea in the inner ear.
The cochlea, a coiled tube, contains an approximately 35 mm long semi-rigid pair of membranes (the basilar and Reissner's membranes) enclosed in a fluid called endolymph [35]. The basilar membrane carries the organ of Corti, which contains hair cells arranged in two groups: approximately 3500 inner and 20 000 outer hair cells.
Chapters 1 to 4 covered the foundations of speech signal processing including the characteristics of audio signals, methods of handling and processing them, the human speech production mechanism and the human auditory system. Chapter 5 then looked in more detail at psychoacoustics – the difference between what a human perceives and what is actually physically present. This chapter will now build upon these foundations as we embark on an exploration of the handling of speech in more depth, in particular in the coding of speech for communications purposes.
The chapter will consider typical speech processing in terms of speech coding and compression (rather than in terms of speech classification and recognition, which we will describe separately in later chapters). We will first consider the important topic of quantisation, which assumes speech to be a general audio waveform (i.e. the technique does not incorporate any specialist knowledge of the characteristics of speech).
Knowledge of speech features and characteristics allows parameterisation of the speech signal, in particular through the important source-filter model. Perhaps the pinnacle of achievement in these approaches is the CELP (codebook excited linear prediction) speech compression technique, which will be discussed in the final section.
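As a taste of what such parameterisation involves, the sketch below fits an all-pole model to a single frame using the autocorrelation method and the Levinson–Durbin recursion. The synthetic frame, frame length and model order p = 10 are illustrative choices only; this is not the CELP algorithm itself.

```python
# Sketch of the source-filter idea: fit an all-pole (LPC) model to one frame.
import numpy as np

def lpc(frame, p=10):
    """Return (a, e): a are prediction-error filter coefficients with a[0] = 1,
    so A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p whitens the frame. Filtering through
    A(z) leaves the residual (the 'source'); 1/A(z) models the vocal tract (the 'filter')."""
    frame = frame * np.hamming(len(frame))
    # Autocorrelation at lags 0..p
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])   # sum_{j=1}^{i-1} a[j] * r[i-j]
        k = -acc / e                                  # reflection coefficient
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a, e = a_new, e * (1.0 - k * k)
    return a, e

# Example: a crude 30 ms 'voiced' frame at 8 kHz (two harmonics plus a little noise).
fs = 8000
t = np.arange(240) / fs
frame = (np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 450 * t)
         + 0.01 * np.random.randn(240))
a, e = lpc(frame, p=10)   # ten coefficients summarise the frame's spectral envelope
```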
Quantisation
As mentioned at the beginning of Chapter 1, audio samples need to be quantised in some way during the conversion from analogue quantities to their representations on computer. In effect, the quantisation process acts to reduce the amount of information stored: the fewer bits used to quantise the signal, the less audio information is preserved.
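To make the effect of the bit count concrete, the following sketch requantises a signal to various word lengths and measures the signal-to-quantisation-noise ratio; the sine wave is simply a stand-in for a real recording, and the bit depths tested are arbitrary choices:

```python
# Sketch: requantise a signal to n bits and measure how much fidelity survives.
import numpy as np

def quantise(x, bits):
    """Uniformly quantise samples in [-1, 1] to the given number of bits."""
    levels = 2 ** (bits - 1)
    return np.clip(np.round(x * levels), -levels, levels - 1) / levels

fs = 16000
t = np.arange(fs) / fs
x = 0.9 * np.sin(2 * np.pi * 440 * t)               # one second of a 440 Hz tone

for bits in (16, 12, 8, 4):
    err = x - quantise(x, bits)
    snr = 10 * np.log10(np.sum(x ** 2) / np.sum(err ** 2))
    print(f"{bits:2d} bits: SNR ~ {snr:5.1f} dB")    # roughly 6 dB per bit
```

The familiar rule of thumb of roughly 6 dB of signal-to-noise ratio per bit drops straight out of such an experiment.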
Most real-world systems are bandwidth (rate) or size constrained, such as an MP3 player only being able to store 4 or 8 Gbytes of audio, or a Bluetooth-connected speaker only being able to replay sound at 44.1 kHz in 16 bits because that is the highest-bandwidth audio signal the Bluetooth wireless link can convey.
Manufacturers of MP3 devices may quote how many songs their devices can store, or how many hours of audio they can contain – both are considered more customer-friendly than specifying memory capacity in Gbytes – however, it is the memory capacity in Gbytes that tends to influence the cost of the device. It is therefore evident that a method of reducing the size of audio recordings is important, since it allows more songs to be stored on a device with smaller memory capacity.
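A rough worked example makes the point; the 4 Gbyte capacity and the bitrates below are illustrative figures, not values quoted in the text:

```python
# Rough worked example: playing time that fits into a given memory capacity.
capacity_bits = 4 * 10 ** 9 * 8                     # a nominal '4 Gbyte' player

for name, bitrate in (("CD-quality PCM (44.1 kHz x 16 bit x 2 channels)", 44100 * 16 * 2),
                      ("MP3 at 128 kbit/s", 128000)):
    hours = capacity_bits / bitrate / 3600
    print(f"{name}: about {hours:.0f} hours")
# Roughly 6 hours of uncompressed PCM versus roughly 69 hours of 128 kbit/s MP3,
# which is why compression matters when quoting 'number of songs' per device.
```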
The emphasis of this book up to now has been on understanding speech, audio and hearing, and using this knowledge to discern rules for handling and processing this type of content. There are many good reasons to take such an approach, not least being that better understanding can lead to better rules and thus better processing. If an engineer is building a speech-based system, it is highly likely that the effectiveness of that system relates to the knowledge of the engineer. Conversely, a lack of understanding on the part of that engineer might lead to eventual problems with the speech system. However, this type of argument holds true only up to a point: it is no longer true if the subtle details of the content (data) become too complex for a human to understand, or when the amount of data that needs to be examined is more extensive than a human can comprehend. To put it another way, given more and more data, of greater and greater complexity, eventually the characteristics of the data exceed the capabilities of human understanding.
It is often said that we live in a data-rich world. This has been driven in part by the enormous decrease in data storage costs over the past few decades (from something like €100,000 per gigabyte in 1980 and €10,000 in 1990, to €10 in 2000 and €0.10 in 2010), and in part by the rapid proliferation of sensors, sensing devices and networks. Today, every smartphone, almost every computer, most new cars, televisions, medical devices, alarm systems and countless other devices include multiple sensors of different types, backed up by the communications technology necessary to disseminate the sensed information.
Sensing data over a wide area can reveal much about the world in general, such as climate change, pollution, human social behaviour and so on. On a smaller scale it can reveal much about the individual – witness targeted website advertisements, sales notifications driven by analysis of shopping patterns, credit ratings driven by past financial behaviour, or job opportunities lost through an inadvertent online presence. Data relating to the world as a whole, as well as to individuals, is increasingly available, and increasingly being ‘mined’ for hidden value.
In Chapter 2 we looked at the general handling, processing and visualisation of audio: vectors or sequences of samples, captured at some particular sample rate, which together represent sound.
In this chapter, we will build upon that foundation and use it to begin to look at (or analyse) speech. There is nothing special about speech from an audio perspective – it is simply a continuous sequence of time-varying amplitudes and tones just like any other sound – it is only when a human hears it and the brain becomes involved that the sound is interpreted as being speech.
There is a famous experiment involving a sentence of something called sinewave speech: a particular sound recording synthesised entirely from sinewaves. Initially, the brain of a listener does not consider this to be speech, and so the signal is unintelligible. However, after the corresponding sentence is heard spoken aloud in a normal way, the listener's brain suddenly ‘realises’ that the signal is in fact speech, and from then on it becomes intelligible. After that the listener does not seem to ‘unlearn’ this ability to understand sinewave speech: subsequent sentences which may be completely unintelligible to others will have become intelligible to this listener [8]. To listen to some sinewave speech, please go to the book website at http://mcloughlin.eu/sws.
There is a point to sinewave speech. It demonstrates that, while speech is just a structured set of modulated frequencies, the combination of these in a certain way has a special meaning to the brain. Music and some naturally occurring sounds also have some inherently speech-like characteristics, but we do not often mistake music for speech. It is likely that there is some kind of decision process in the human hearing system that sends speech-like sounds to one part of the brain for processing (the part that handles speech), and sends other sounds to different parts of the brain. However, there is a lot hidden inside the human brain that we do not understand, and how it handles speech is just one of those grey areas.
Fortunately speech itself is much easier to analyse and understand computationally: the speech signal is easy to capture with a microphone and record on computer. Over the years, speech characteristics have been very well researched, with many specialised analysis, handling and processing methods having been developed for this particular type of audio.
Statistical inference is an unusually wide-ranging discipline, located as it is at the triple-point of mathematics, empirical science, and philosophy. The discipline can be said to date from 1763, with the publication of Bayes’ rule (representing the philosophical side of the subject; the rule's early advocates considered it an argument for the existence of God). The most recent quarter of this 250-year history—from the 1950s to the present—is the “computer age” of our book's title, the time when computation, the traditional bottleneck of statistical applications, became faster and easier by a factor of a million.
The book is an examination of how statistics has evolved over the past sixty years—an aerial view of a vast subject, but seen from the height of a small plane, not a jetliner or satellite. The individual chapters take up a series of influential topics—generalized linear models, survival analysis, the jackknife and bootstrap, false-discovery rates, empirical Bayes, MCMC, neural nets, and a dozen more—describing for each the key methodological developments and their inferential justification.
Needless to say, the role of electronic computation is central to our story. This doesn't mean that every advance was computer-related: a land bridge had opened to a new continent, but not all were eager to cross. Topics such as empirical Bayes and James–Stein estimation could have emerged just as well under the constraints of mechanical computation. Others, like the bootstrap and proportional hazards, were pure-born children of the computer age. Almost all topics in twenty-first-century statistics are now computer-dependent, but it will take our small plane a while to reach the new millennium.
Dictionary definitions of statistical inference tend to equate it with the entire discipline. This has become less satisfactory in the “big data” era of immense computer-based processing algorithms. Here we will attempt, not always consistently, to separate the two aspects of the statistical enterprise: algorithmic developments aimed at specific problem areas, for instance random forests for prediction, as distinct from the inferential arguments offered in their support.