Chapters 1 to 4 covered the foundations of speech signal processing, including the characteristics of audio signals, methods of handling and processing them, the human speech production mechanism and the human auditory system. Chapter 5 then looked in more detail at psychoacoustics – the difference between what a human perceives and what is actually physically present. This chapter builds upon those foundations to explore the handling of speech in more depth, in particular the coding of speech for communications purposes.
The chapter will consider typical speech processing in terms of speech coding and compression (rather than in terms of speech classification and recognition, which we will describe separately in later chapters). We will first consider the important topic of quantisation, which assumes speech to be a general audio waveform (i.e. the technique does not incorporate any specialist knowledge of the characteristics of speech).
Knowledge of speech features and characteristics allows parameterisation of the speech signal, most notably through the important source–filter model. Perhaps the pinnacle of achievement in these approaches is the CELP (codebook excited linear prediction) speech compression technique, which will be discussed in the final section.
Quantisation
As mentioned at the beginning of Chapter 1, audio samples need to be quantised in some way during the conversion from analogue quantities to their representations on computer. In effect, the quantisation process acts to reduce the amount of information stored: the fewer bits used to quantise the signal, the less audio information is preserved.
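As a minimal illustration of this trade-off (a sketch written for this discussion rather than a standard implementation; the function name and the 1 kHz test tone are invented for the example), the following Python snippet uniformly quantises a waveform lying in the range −1 to +1 to a chosen number of bits and reports the resulting signal-to-noise ratio:

import numpy as np

def quantise(signal, bits):
    # Uniformly quantise samples in [-1, 1) to the given bit depth (mid-rise quantiser).
    levels = 2 ** bits                    # number of quantisation levels
    step = 2.0 / levels                   # width of each quantisation interval
    # Map each sample to the index of its interval, then back to the interval centre
    indices = np.clip(np.floor(signal / step), -levels // 2, levels // 2 - 1)
    return (indices + 0.5) * step

# Example (hypothetical test signal): a 1 kHz sine wave sampled at 16 kHz, quantised to 3 bits
fs = 16000
t = np.arange(fs) / fs
x = 0.9 * np.sin(2 * np.pi * 1000 * t)
x3 = quantise(x, 3)
snr_db = 10 * np.log10(np.sum(x ** 2) / np.sum((x - x3) ** 2))
print("SNR at 3 bits: %.1f dB" % snr_db)

Repeating the experiment with more bits shows the quantisation noise falling (and the SNR rising) by roughly 6 dB for every additional bit, which is the sense in which fewer bits preserve less audio information.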
Most real-world systems are bandwidth (rate) or size constrained, such as an MP3 player only being able to store 4 or 8 Gbyte of audio, or a Bluetooth-connected speaker only being able to replay sound at 44.1 kHz in 16 bits because that is the maximum bandwidth audio signal that the Bluetooth wireless link can convey.
Manufacturers of MP3 devices may quote how many songs their devices can store, or how many hours of audio they can contain – both measures are considered more customer-friendly than specifying memory capacity in Gbytes – however, it is the memory capacity in Gbytes that tends to influence the cost of the device. It is therefore evident that a method of reducing the size of audio recordings is important, since it allows more songs to be stored on a device with a smaller memory capacity.
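To make these rate and size constraints concrete, a rough back-of-the-envelope calculation (the 4 Gbyte capacity and the 10:1 compression ratio below are illustrative assumptions, not figures from the text) shows why reducing the size of audio recordings matters; in Python:

# Rough storage arithmetic for uncompressed CD-quality audio (illustrative values only)
fs = 44100          # samples per second
bits = 16           # bits per sample
channels = 2        # stereo

bit_rate = fs * bits * channels            # about 1.41 Mbit/s
bytes_per_minute = bit_rate / 8 * 60       # about 10.6 Mbyte per minute

capacity_bytes = 4 * 10 ** 9               # a hypothetical 4 Gbyte player
minutes = capacity_bytes / bytes_per_minute
print("Uncompressed: about %.0f minutes of audio in 4 Gbyte" % minutes)
# A 10:1 compression ratio, typical of MP3-style coders, would stretch this
# to roughly ten times as many minutes of listening for the same memory.

The uncompressed figure works out at only around six hours of audio, which is why compression, rather than raw memory capacity alone, determines how many songs a device can usefully hold.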
The emphasis of this book up to now has been on understanding speech, audio and hearing, and using this knowledge to discern rules for handling and processing this type of content. There are many good reasons to take such an approach, not least being that better understanding can lead to better rules and thus better processing. If an engineer is building a speech-based system, it is highly likely that the effectiveness of that system relates to the knowledge of the engineer. Conversely, a lack of understanding on the part of that engineer might lead to eventual problems with the speech system. However, this type of argument holds true only up to a point: it is no longer true if the subtle details of the content (data) become too complex for a human to understand, or when the amount of data that needs to be examined is more extensive than a human can comprehend. To put it another way, given more and more data, of greater and greater complexity, eventually the characteristics of the data exceed the capabilities of human understanding.
It is often said that we live in a data-rich world. This has been driven in part by the enormous decrease in data storage costs over the past few decades (from something like €100,000 per gigabyte in 1980, to €10,000 in 1990, €10 in 2000 and €0.10 in 2010), and in part by the rapid proliferation of sensors, sensing devices and networks. Today, every smartphone, almost every computer, most new cars, televisions, medical devices, alarm systems and countless other devices include multiple sensors of different types, backed up by the communications technology necessary to disseminate the sensed information.
Sensing data over a wide area can reveal much about the world in general, such as climate change, pollution, human social behaviour and so on. On a smaller scale it can reveal much about the individual – witness targeted website advertisements, sales notifications driven by analysis of shopping patterns, credit ratings driven by past financial behaviour, or job opportunities lost through an inadvertent online presence. Data relating to the world as a whole, as well as to individuals, is increasingly available, and increasingly being ‘mined’ for hidden value.
In Chapter 2 we looked at the general handling, processing and visualisation of audio: vectors or sequences of samples captured at some particular sample rate, and which together represent sound.
In this chapter, we will build upon that foundation, and use it to begin to look at (or analyse) speech. There is nothing special about speech from an audio perspective – it is simply a continuous sequence of time-varying amplitudes and tones, just like any other sound – it is only when a human hears it and the brain becomes involved that the sound is interpreted as being speech.
There is a famous experiment which demonstrates a sentence of so-called sinewave speech: a sound recording constructed entirely from sinewaves. Initially, the brain of a listener does not consider this to be speech, and so the signal is unintelligible. However, after the corresponding sentence is heard spoken aloud in a normal way, the listener's brain suddenly ‘realises’ that the signal is in fact speech, and from then on it becomes intelligible. After that the listener does not seem to ‘unlearn’ this ability to understand sinewave speech: subsequent sentences which may be completely unintelligible to others will have become intelligible to this listener [8]. To listen to some sinewave speech, please go to the book website at http://mcloughlin.eu/sws.
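Sinewave speech of this kind is typically synthesised by replacing the formants of an utterance with a small number of time-varying sinusoids. The Python sketch below is only a toy illustration of that idea: the three frequency tracks are invented rather than measured from real speech, so the result will not be intelligible, but it shows the construction.

import numpy as np

fs = 16000                              # sample rate in Hz
dur = 1.0                               # duration in seconds
t = np.arange(int(fs * dur)) / fs

# Three hypothetical time-varying frequency tracks standing in for formant tracks
f1 = 500 + 200 * np.sin(2 * np.pi * 2.0 * t)    # roughly 300-700 Hz
f2 = 1500 + 400 * np.sin(2 * np.pi * 1.3 * t)   # roughly 1100-1900 Hz
f3 = 2500 + 300 * np.sin(2 * np.pi * 0.7 * t)   # roughly 2200-2800 Hz

def tone(freq_track):
    # Integrate the instantaneous frequency to obtain phase, then take the sine
    phase = 2 * np.pi * np.cumsum(freq_track) / fs
    return np.sin(phase)

sws = (tone(f1) + 0.7 * tone(f2) + 0.4 * tone(f3)) / 3.0
# 'sws' can now be written to a WAV file or played back to hear the effect.

With real formant tracks extracted from a recorded sentence, this same construction produces the curious signal described above: clearly not natural speech, yet intelligible once the brain has been primed to hear it as speech.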
There is a point to sinewave speech. It demonstrates that, while speech is just a structured set of modulated frequencies, the combination of these in a certain way has a special meaning to the brain. Music and some naturally occurring sounds also have some inherently speech-like characteristics, but we do not often mistake music for speech. It is likely that there is some kind of decision process in the human hearing system that sends speech-like sounds to one part of the brain for processing (the part that handles speech), and sends other sounds to different parts of the brain. However, there is a lot hidden inside the human brain that we do not understand, and how it handles speech is just one of those grey areas.
Fortunately speech itself is much easier to analyse and understand computationally: the speech signal is easy to capture with a microphone and record on computer. Over the years, speech characteristics have been very well researched, with many specialised analysis, handling and processing methods having been developed for this particular type of audio.
Statistical inference is an unusually wide-ranging discipline, located as it is at the triple-point of mathematics, empirical science, and philosophy. The discipline can be said to date from 1763, with the publication of Bayes’ rule (representing the philosophical side of the subject; the rule's early advocates considered it an argument for the existence of God). The most recent quarter of this 250-year history—from the 1950s to the present—is the “computer age” of our book's title, the time when computation, the traditional bottleneck of statistical applications, became faster and easier by a factor of a million.
The book is an examination of how statistics has evolved over the past sixty years—an aerial view of a vast subject, but seen from the height of a small plane, not a jetliner or satellite. The individual chapters take up a series of influential topics—generalized linear models, survival analysis, the jackknife and bootstrap, false-discovery rates, empirical Bayes, MCMC, neural nets, and a dozen more—describing for each the key methodological developments and their inferential justification.
Needless to say, the role of electronic computation is central to our story. This doesn't mean that every advance was computer-related. A land bridge had opened to a new continent but not all were eager to cross. Topics such as empirical Bayes and James–Stein estimation could have emerged just as well under the constraints of mechanical computation. Others, like the bootstrap and proportional hazards, were pure-born children of the computer age. Almost all topics in twenty-first-century statistics are now computer-dependent, but it will take our small plane a while to reach the new millennium.
Dictionary definitions of statistical inference tend to equate it with the entire discipline. This has become less satisfactory in the “big data” era of immense computer-based processing algorithms. Here we will attempt, not always consistently, to separate the two aspects of the statistical enterprise: algorithmic developments aimed at specific problem areas, for instance random forests for prediction, as distinct from the inferential arguments offered in their support.
Something important changed in the world of statistics in the new millennium. Twentieth-century statistics, even after the heated expansion of its late period, could still be contained within the classic Bayesian–frequentist–Fisherian inferential triangle (Figure 14.1). This is not so in the twenty-first century. Some of the topics discussed in Part III—false-discovery rates, post-selection inference, empirical Bayes modeling, the lasso—fit within the triangle but others seem to have escaped, heading south from the frequentist corner, perhaps in the direction of computer science.
The escapees were the large-scale prediction algorithms of Chapters 17–19: neural nets, deep learning, boosting, random forests, and support-vector machines. Notably missing from their development were parametric probability models, the building blocks of classical inference. Prediction algorithms are the media stars of the big-data era. It is worth asking why they have taken center stage and what it means for the future of the statistics discipline.
The why is easy enough: prediction is commercially valuable. Modern equipment has enabled the collection of mountainous data troves, which the “data miners” can then burrow into, extracting valuable information. Moreover, prediction is the simplest use of regression theory (Section 8.4). It can be carried out successfully without probability models, perhaps with the assistance of nonparametric analysis tools such as cross-validation, permutations, and the bootstrap.
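As a minimal sketch of how prediction error can be assessed without any probability model (the synthetic data, least-squares predictor and fold count below are invented purely for illustration), k-fold cross-validation simply refits a prediction rule on part of the data and scores it on the held-out remainder:

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 5, 10                     # samples, features, folds (illustrative choices)
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

folds = np.array_split(rng.permutation(n), k)
errors = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    # Fit ordinary least squares on the training folds only
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    pred = X[test_idx] @ beta
    errors.append(np.mean((y[test_idx] - pred) ** 2))

print("Cross-validated mean squared error: %.3f" % np.mean(errors))

Nothing in this procedure appeals to a parametric model or an optimality criterion; the held-out prediction error itself is the yardstick, which is exactly the spirit in which the prediction community evaluates its algorithms.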
A great amount of ingenuity and experimentation has gone into the development of modern prediction algorithms, with statisticians playing an important but not dominant role. There is no shortage of impressive success stories. In the absence of optimality criteria, either frequentist or Bayesian, the prediction community grades algorithmic excellence on performance within a catalog of often-visited examples such as the spam and digits data sets of Chapters 17 and 18. Meanwhile, “traditional statistics”—probability models, optimality criteria, Bayes priors, asymptotics—has continued successfully along on a parallel track. Pessimistically or optimistically, one can consider this as a bipolar disorder of the field or as a healthy duality that is bound to improve both branches. There are historical and intellectual arguments favoring the optimists’ side of the story.
Audio processing systems have been a part of many people's lives since the invention of the phonograph in the 1870s. The string of innovations sparked by that disruptive technology has culminated in today's portable audio devices such as Apple's iPod, and the ubiquitous MP3 (or similarly compressed) audio files that populate them. These may be listened to on portable devices, on computers, as soundtracks accompanying Blu-ray films and DVDs, and in innumerable other places.
Coincidentally, the 1870s saw a related invention – that of the telephone – which has likewise grown to play a major role in daily life between then and now, and has sparked its own string of innovations down the years. The Scottish-born and educated Alexander Graham Bell was present at the birth of both inventions and contributed to the success of each. He would probably be proud to know, were he still alive today, that two entire industry sectors – telecommunications and infotainment – were spawned by the phonograph and the telephone.
However, after 130 years, something even more unexpected has occurred: the descendants of the phonograph and the descendants of the telephone have converged into a single product called a ‘smartphone’. Dr Bell would probably not recognise the third convergence that made all of this possible: the digital computer – which is precisely what today's smartphone really is. At heart it is simply a very small, portable and capable computer with microphone, loudspeaker, display and wireless connectivity.
Computers and audio
The flexibility of computers means that once sound has been sampled into a digital form, it can be used, processed and reproduced in an infinite variety of ways without further degradation. It is not only computers (big or small) that rely on digital audio: so do CD players, MP3 players (including iPods), digital audio broadcast (DAB) radios, most wireless portable speakers, television and film cameras, and even modern mixing desks for ‘live’ events (and, coincidentally, all of these devices contain tiny embedded computers too). Digital music and sound effects are all around us, impacting our leisure activities (e.g. games, television, videos), our education (e.g. recorded lectures, broadcasts, podcasts) and our work in innumerable ways to influence, motivate and educate us.