In this chapter we briefly survey and illustrate several other speech technologies: speaker identification, voice activity detection, diarization, language identification, and voice cloning.
This chapter gives an overview of the different aspects of speech in terms of articulation and acoustics. We focus in particular on how the speech signal can be decomposed in terms of frequency and time and how speech is represented on a computer.
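As a rough illustration of decomposing a signal in time and frequency, the sketch below cuts a synthetic tone into overlapping frames and computes a magnitude spectrum for each frame with a plain discrete Fourier transform. The sample rate, frame length, hop size, and all function names are assumptions chosen for this example and do not come from the chapter; a practical system would normally use an FFT library and a window function.

```python
import cmath
import math

def make_tone(freq_hz, sample_rate, n_samples):
    """Return a sine wave as a list of float samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def frames(signal, frame_len, hop):
    """Split the signal into overlapping frames (the time axis)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (the frequency axis)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(total))
    return spectrum

# Assumed illustrative values: 8 kHz sampling, 64-sample frames, 50% overlap.
SAMPLE_RATE = 8000
FRAME_LEN = 64
HOP = 32

signal = make_tone(1000, SAMPLE_RATE, 512)
# One spectrum per frame: a small spectrogram stored as a list of lists.
spectrogram = [magnitude_spectrum(f) for f in frames(signal, FRAME_LEN, HOP)]

# The strongest bin in the first frame, converted back to Hz.
first = spectrogram[0]
peak_bin = max(range(len(first)), key=first.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

Each row of `spectrogram` corresponds to one point in time and each column to one frequency band, which is the basic layout behind spectrogram displays.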
This chapter focuses on neural techniques for speech synthesis, reviewing a number of recent systems, for example Deep Voice, Tacotron, VoiceLoop, FastSpeech, Glow-TTS, and VITS. For most of these, we give implementations that students can use to try out the systems. Along the way, we discuss connectionist temporal classification (CTC), an important technique for both synthesis and recognition. The chapter concludes with a discussion of the Coqui toolkit, which includes implementations of a number of these systems.
In this chapter we discuss neural approaches to speech recognition. We begin with hybrid DNN-HMM approaches and then turn to RNN-based systems. Most of the chapter focuses on attention-based systems. We review a number of recent systems, including RNN transducers, Listen, Attend and Spell, Wav2letter, Wav2vec, HuBERT, Conformer, and QuartzNet.
This chapter treats non-neural approaches to automatic speech recognition. We cover classical techniques like dynamic time warping and the structure of HMM- and GMM-HMM-based recognition systems. We spend a fair amount of time on the Kaldi system, as this is still very widely used.
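To make dynamic time warping concrete, here is a minimal sketch of the standard dynamic-programming recurrence. It compares two sequences of plain numbers, which stand in for the per-frame acoustic feature vectors a recognizer would use; the function name and the absolute-difference local cost are assumptions for illustration, not the chapter's implementation.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its element
                cost[i][j - 1],      # b advances, a repeats its element
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]

# A template and a slower rendition of the same pattern align at zero cost.
template = [1, 2, 3]
stretched = [1, 2, 2, 3]
same_shape = dtw_distance(template, stretched)
different = dtw_distance([0, 0], [1, 1])
```

Because elements may be repeated during alignment, the stretched sequence matches the template exactly, which is how a template-based recognizer can tolerate differences in speaking rate.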
This chapter gives an overview of the two technologies the book focuses on: automatic speech recognition and speech synthesis. It also treats some general background topics, including the relevance and irrelevance of linguistic theory; engineering vs. science; and the basic building blocks of language that are relevant for our purposes: sounds, morphemes, sentences, and spelling.
This chapter treats statistical finite-state models: weighted transducers and hidden Markov models. It also treats N-gram models, shows how they can be implemented with statistical finite-state tools, and introduces the OpenFst toolkit.
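For readers new to N-gram models, the following sketch estimates a bigram model by simple counting. It deliberately uses plain Python rather than finite-state tools, so it illustrates only the underlying probabilities, not the finite-state implementation; the toy corpus, the boundary symbols `<s>` and `</s>`, and the function names are all assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with boundaries."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return contexts, bigrams

def bigram_prob(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev), with no smoothing."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

def sentence_prob(contexts, bigrams, sentence):
    """Product of bigram probabilities over the padded sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(contexts, bigrams, prev, word)
    return prob

# Toy corpus, invented for illustration.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
contexts, bigrams = train_bigram(corpus)
p_cat_given_the = bigram_prob(contexts, bigrams, "the", "cat")
p_sentence = sentence_prob(contexts, bigrams, "the cat sat")
```

Unseen bigrams receive probability zero under this estimate, which is why practical models add smoothing or backoff.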
This chapter introduces formal language theory generally and the regular languages and relations more specifically. These are the main building blocks of classical non-neural speech recognition systems. It also covers how these systems can be used for language modeling more generally and introduces the PyFoma finite-state toolkit.
This chapter covers non-neural approaches to speech synthesis. On the theoretical side, we begin with rule-based systems and then go on to treat concatenative and HMM-based systems. Along the way, we give working examples of different kinds of synthesis systems, both examples we code ourselves and publicly available systems such as eSpeak and Festival.
This chapter covers the basic logic and architecture of neural nets. We start with a review of the basic math behind logic gates and simple perceptrons. We then turn to modern neural nets, treating their basic structure and training via backpropagation. We show how all of these can be implemented in PyTorch. We then go on to explain some of the most common neural architectures for speech: convolution, recurrent nodes of various sorts, encoder-decoder models, and attention systems.
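As a small illustration of the link between logic gates and perceptrons, the sketch below trains a single perceptron with the classic perceptron learning rule to reproduce the AND and OR gates. It uses plain Python rather than PyTorch to stay dependency-free, and the function names, learning rate, and epoch count are assumptions for this example rather than the chapter's code.

```python
def step(x):
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if x >= 0 else 0

def predict(weights, bias, inputs):
    """Output of a single perceptron for one input pair."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge weights by the error on each sample."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Truth tables for two linearly separable gates.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR_GATE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

and_w, and_b = train_perceptron(AND_GATE)
or_w, or_b = train_perceptron(OR_GATE)

and_outputs = [predict(and_w, and_b, x) for x, _ in AND_GATE]
or_outputs = [predict(or_w, or_b, x) for x, _ in OR_GATE]
```

A single perceptron can learn these gates because each is linearly separable; XOR is not, which is one motivation for multi-layer networks trained with backpropagation.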
The chapter begins with a discussion of intelligence in simple unicellular organisms, followed by that of animals with complex nervous systems. Surprisingly, even organisms that do not have a central brain can navigate their complex environments, forage, and learn. In organisms with a central nervous system, neurons and synapses in the brain provide the elementary basis of intelligence and memory. Neurons generate action potentials that represent information. Synapses hold memory and control signal transmission between neurons. A key feature of biological neural circuits is plasticity, that is, their ability to modify circuit properties based both on stimuli and on the time intervals between them. This represents one form of learning. The biological brain is not static but continuously evolves based on experience. The field of AI seeks to learn from biological neural circuitry, to emulate aspects of intelligence and learning, and to build physical devices and algorithms that can demonstrate features of animal intelligence. Neuromorphic computing therefore requires a paradigm shift in the design of semiconductors, as well as algorithmic foundations that are built not for perfection but for learning.