This chapter gives an outline of the related fields of phonetics and phonology. A good knowledge of these subjects is essential in speech synthesis because they help bridge the gap between the discrete, linguistic, word-based message and the continuous speech signal. More traditional synthesis techniques relied heavily on phonetic and phonological knowledge, and often implemented theories and modules directly from these fields. Even in modern, heavily data-driven synthesis systems, we still find that phonetics and phonology have a vital role to play in determining how best to implement representations and algorithms.
Articulatory phonetics and speech production
The topic of speech production examines the processes by which humans convert linguistic messages into speech. The converse process, whereby humans determine the message from the speech, is called speech perception. Together these form the backbone of the field known as phonetics.
Regarding speech production, we have what we can describe as a complete but approximate model of this process. That is, in general we know how people use their articulators to produce the various sounds of speech. We emphasise, however, that our knowledge is very approximate; no model as yet can predict with any degree of accuracy what a speech waveform from a particular speaker will look like given some pronunciation input.
This chapter is concerned with the issue of synthesising acoustic representations of prosody. The input to the algorithms described here varies but in general takes the form of the phrasing, stress, prominence and discourse patterns which we introduced in Chapter 6. Hence the complete process of synthesis of prosody can be seen as one whereby we first extract a prosodic-form representation from the text, as described in Chapter 6, and then synthesise an acoustic representation of this form, as described here.
The majority of this chapter focuses on the synthesis of intonation. The main acoustic representation of intonation is the fundamental frequency (F0), such that intonation is often defined as the manipulation of F0 for communicative or linguistic purposes. As we shall see, techniques for synthesising F0 contours are inherently linked to the model of intonation used, so the whole topic of intonation, including theories, models and F0 synthesis, is dealt with here. In addition, we cover the topic of predicting intonation form from text, which was deferred from Chapter 6, since we first require an understanding of theories and models of intonational phenomena before explaining this.
Timing is the second major acoustic representation of prosody. Timing is used to indicate stress (stressed phones are longer than normal), phrasing (phones become noticeably longer immediately prior to a phrase break) and rhythm.
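The two timing effects just mentioned, stress lengthening and pre-boundary (phrase-final) lengthening, can be illustrated with a toy rule-based duration model in the style of classical rule systems. The multiplicative factor values below are invented purely for illustration and are not taken from any published model.

```python
def phone_duration(base_ms, stressed=False, phrase_final=False):
    """Toy rule-based duration model: start from a phone's intrinsic
    (base) duration and apply multiplicative lengthening factors.
    Factor values are illustrative only."""
    duration = base_ms
    if stressed:
        duration *= 1.4   # stressed phones are longer than normal
    if phrase_final:
        duration *= 1.6   # lengthening immediately before a phrase break
    return duration

# A phone with a 100 ms intrinsic duration in three contexts:
print(phone_duration(100))                                     # unstressed, phrase-medial
print(phone_duration(100, stressed=True))                      # stressed
print(phone_duration(100, stressed=True, phrase_final=True))   # stressed and phrase-final
```

Real duration models (e.g. sums-of-products or decision-tree models) combine many more contextual factors, but the multiplicative structure sketched here is the same basic idea.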
Intonation overview
As a working definition, we will take intonation synthesis to be the generation of an F0 contour from higher-level linguistic information.
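Under this working definition, a minimal intonation synthesiser maps accent positions to an F0 curve. The sketch below generates a declining baseline (declination) with a Gaussian bump added for each pitch accent, loosely in the spirit of superpositional models; every constant here is invented for illustration, not taken from any published model.

```python
import math

def f0_contour(duration_s, accents, f0_start=130.0, f0_end=100.0, step=0.01):
    """Toy F0 generator: a linearly declining baseline plus a
    Gaussian-shaped bump for each pitch accent.
    `accents` is a list of (time_s, peak_height_hz) pairs.
    All constants are illustrative only."""
    n = int(duration_s / step) + 1
    contour = []
    for i in range(n):
        t = i * step
        f0 = f0_start + (f0_end - f0_start) * (t / duration_s)  # declination line
        for accent_time, height in accents:
            f0 += height * math.exp(-((t - accent_time) ** 2) / (2 * 0.05 ** 2))
        contour.append((round(t, 2), round(f0, 1)))
    return contour

# Two pitch accents on a 1.5 s phrase, peaking at 0.4 s and 1.1 s:
contour = f0_contour(1.5, accents=[(0.4, 40.0), (1.1, 30.0)])
```

The point of the sketch is the interface: higher-level linguistic information (here, just accent positions and strengths) goes in, and a frame-by-frame F0 contour comes out.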
Speech and hearing are closely linked human abilities. It could be said that human speech is optimised toward the frequency ranges that we hear best, or perhaps that our hearing is optimised around the frequencies used for speaking. However the argument is presented, it should be clear to an engineer working with speech-transmission and speech-processing systems that aspects of both speech and hearing must often be considered together in the field of vocal communications. Nevertheless, both hearing and speech remain complex subjects in their own right, hearing particularly so.
In recent years it has become popular to discuss psychoacoustics in textbooks on both hearing and speech. Psychoacoustics is a term that links the words psycho and acoustics together and that, although it sounds like a description of an auditory-challenged serial killer, actually describes the way the mind processes sound. In particular, it is used to highlight the fact that humans do not always perceive sound in the straightforward ways that knowledge of the physical characteristics of the sound would suggest.
There was a time when using this word at a conference would signal advanced knowledge and familiarity with cutting-edge terminology, especially when it could roll off the tongue naturally. I would imagine speakers, on the night before their keynote address, standing before the mirror in their hotel rooms practising saying the word fluently. These days, however, it is used far too commonly, to describe any aspect of hearing that is processed nonlinearly by the brain. It was a great temptation to use the word in the title of this book.
This chapter contains a number of final topics, which have been left until last because they span many of the topics raised in the previous chapters.
Databases
Data-driven techniques have come to dominate nearly every aspect of text-to-speech in recent years. In addition to being affected by the algorithms themselves, the overall performance of a system is increasingly dominated by the quality of the databases that are used for training. In this section, we therefore examine the issues in database design, collection, labelling and use.
All algorithms are to some extent data-driven; even hand-written rules use some “data”, either explicitly or in a mental representation wherein the developer can imagine examples and how they should be dealt with. The difference between hand-written rules and data-driven techniques lies not in whether data are used, but in how they are used. Most data-driven techniques have an automatic training algorithm, such that they can be trained on the data without the need for human intervention.
Unit-selection databases
Unit selection is arguably the most data-driven technique because little or no processing is performed on the data; rather, the data are simply analysed, cut up and recombined in different sequences. As with other database techniques, the issue of coverage is vital, but in addition there are further issues concerning the actual recordings.
This chapter describes the principle of compression in quantum communication channels. The underlying concept is that it is possible to convey “faithfully” a quantum message with a large number of qubits, while transmitting a compressed version of this message with a reduced number of qubits through the channel. Beyond the mere notion of fidelity, which characterizes the quality of quantum message transmission, the description brings in the new concept of typicality in the space defined by all possible “quantum codewords.” Schumacher's quantum-compression theorem states that, for a qubit source with von Neumann entropy S, the message compression factor R is bounded below by S − ε, where ε is any nonnegative parameter that can be made arbitrarily small for sufficiently long messages (hence, R ≈ S is the best possible compression factor). An original graphical and numerical illustration of the effect of Schumacher's quantum compression, and of the evolution of the typical quantum-codeword subspace with increasing message length, is provided.
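The bound quoted above can be made concrete. For a source with density operator ρ, the von Neumann entropy is S(ρ) = −tr(ρ log₂ ρ), computable from the eigenvalues of ρ. The NumPy sketch below (not code from the text; the example source states are chosen for illustration) shows a qubit source whose entropy is well below 1, and which is therefore compressible.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log2 rho), computed from the eigenvalues of rho."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]   # drop zero eigenvalues (0 log 0 = 0)
    return float(-np.sum(eigvals * np.log2(eigvals)))

# Illustrative source: emits |0> with probability 0.9 and |+> with probability 0.1
ket0 = np.array([1.0, 0.0])
ketplus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = 0.9 * np.outer(ket0, ket0) + 0.1 * np.outer(ketplus, ketplus)

S = von_neumann_entropy(rho)
print(f"S = {S:.3f} qubits per source symbol")  # S < 1, so the message is compressible
```

Per the theorem, long n-qubit messages from this source can be faithfully compressed to roughly nS qubits.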
Quantum data compression and fidelity
In this chapter, we have reached the stage where it is possible to start addressing the issues that are central to information theory, namely, “How efficiently can we code information in a quantum communication channel?” both in terms of economy of means – the concept of data compression – and accuracy of transmission – the concept of message integrity or minimal data error, referred to here as fidelity.
Indeed, the concept of information is familiar to anyone in daily life. Yet the word captures so much that we may doubt whether any real definition, satisfactory to a large majority of either educated or lay people, may ever exist. Etymology may then help to give the word some skeleton. Information comes from the Latin informatio and the related verb informare, meaning: to conceive, to explain, to sketch, to make something understood or known, to get someone knowledgeable about something. Thus, informatio is the action and art of shaping or packaging a piece of knowledge into some sensible form, hopefully complete, intelligible, and unambiguous to the recipient.
With this background in mind, we can conceive of information as taking different forms: a sensory input, an identification pattern, a game or process rule, a set of facts or instructions meant to guide choices or actions, a record for future reference, a message for immediate feedback. Information is thus diversified and conceptually intractable. Let us state clearly, and quite frankly, from the outset: a theory of information is unable to tell what information actually is, or what it may represent in terms of objective value to any of its recipients! As we shall learn through this series of chapters, however, it is possible to measure information scientifically. This measure does not concern the value or usefulness of the information, which remains the ultimate recipient's paradigm.
Speech-processing technology has been a mainstream area of research for more than 50 years. The ultimate goal of speech research is to build systems that mimic (or potentially surpass) human capabilities in understanding, generating and coding speech for a range of human-to-human and human-to-machine interactions.
In the area of speech coding a great deal of success has been achieved in creating systems that significantly reduce the overall bit rate of the speech signal (from of the order of 100 kilobits per second to rates of the order of 8 kilobits per second or less), while maintaining speech intelligibility and quality at levels appropriate for the intended applications. The heart of the modern cellular industry is the 8 kilobit per second speech coder, embedded in VLSI logic on the more than two billion cellphones in use worldwide at the end of 2007.
In the area of speech recognition and understanding by machines, steady progress has enabled systems to become part of everyday life in the form of call centres for the airlines, financial, medical and banking industries, help desks for large businesses, form and report generation for the legal and medical communities, and dictation machines that enable individuals to enter text into machines without having to type the text explicitly.
This chapter is concerned with the measure of information contained in qubits. This can be done only through quantum measurement, an operation that has no counterpart in the classical domain. I shall first describe in detail the case of single-qubit measurements, which shows under which measurement conditions “classical” bits can be retrieved. Next, we consider measurements of higher-order, n-qubit, states. Particular attention is given to the Einstein–Podolsky–Rosen (EPR) or Bell states, which, unlike other joint tensor states, are shown to be entangled. The various single-qubit measurement outcomes from the EPR–Bell states illustrate an effect of causality on the information concerning the other qubit. We then focus on the technique of Bell measurement, which makes it possible to know which Bell state is being measured, yielding two classical bits as the outcome. The property of EPR–Bell state entanglement is exploited in the principle of quantum superdense coding, which makes it possible to transmit classical bits at twice the classical rate, namely through the generation and measurement of a single qubit. Another key application concerns quantum teleportation. It consists of the transmission of quantum states over arbitrary distances, by means of a common EPR–Bell state resource shared by the two channel ends. While quantum teleportation of a qubit is instantaneous, owing to the effect of quantum-state collapse, it is shown that its completion does require the communication of two classical bits, which is itself limited by the speed of light.
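The EPR–Bell correlations mentioned above can be simulated numerically. Measuring both qubits of |Φ⁺⟩ = (|00⟩ + |11⟩)/√2 in the computational basis gives outcome probabilities equal to the squared amplitudes (the Born rule): each individual result is random, yet the two qubits always agree. A NumPy sketch, purely illustrative and not code from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# |Phi+> = (|00> + |11>)/sqrt(2), amplitudes in the basis (|00>, |01>, |10>, |11>)
phi_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

def measure_pair(state, rng):
    """Projective measurement of both qubits in the computational basis:
    outcome probabilities are the squared amplitudes (Born rule)."""
    probs = np.abs(state) ** 2
    outcome = rng.choice(4, p=probs)
    return int(outcome) >> 1, int(outcome) & 1   # (first qubit, second qubit)

results = [measure_pair(phi_plus, rng) for _ in range(1000)]
assert all(a == b for a, b in results)   # outcomes always agree: only 00 or 11
counts = {(0, 0): 0, (1, 1): 0}
for r in results:
    counts[r] += 1
print(counts)   # the two agreeing outcomes occur with roughly equal frequency
```

Measuring the first qubit alone already fixes the second, which is the causal effect on the partner qubit's information described in the chapter.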
This chapter will take us into a world very different from all that we have seen so far concerning Shannon's information theory. As we shall see, it is a strange world made of virtual computers (universal Turing machines) and abstract axioms that can be demonstrated without mathematics, merely by the force of logic, as well as of relatively involved formalism. If the mere evocation of Shannon, of information theory, or of entropy may raise eyebrows in one's professional circle, how much more so that of Kolmogorov complexity! This chapter will remove some of the mystery surrounding “complexity,” also called “algorithmic entropy,” without pretending to uncover it all. Why address such a subject right here, in the middle of our description of Shannon's information theory? Because, as we shall see, algorithmic entropy and Shannon entropy meet conceptually at some point, to the extent of being asymptotically bounded, even though they come from totally uncorrelated basic assumptions! This remarkable convergence between fields must form an integral part of our IT culture, even if this chapter provides only a flavor of it. It may be perceived as somewhat more difficult or demanding than the preceding chapters, but we believe the extra investment is well worth it. In any case, this chapter can be revisited later on, should the reader prefer to keep focused on Shannon's theory and move directly to the next stage, without venturing into the intriguing sidetracks of algorithmic information theory.
The task of text decoding is to take a tokenised sentence and determine the best sequence of words. In many situations this is a classical disambiguation problem: there is one, and only one, correct sequence of words that gave rise to the text, and it is our job to determine this. In other situations, especially where we are dealing with non-natural-language text such as numbers and dates and so on, there may be a few different acceptable word sequences.
So, in general, text decoding in TTS is a process of resolving ambiguity. The ambiguity arises because two or more underlying forms share the same surface form and, given the surface form (i.e. the writing), we need to find which of the underlying forms is the correct one. There are many types of linguistic ambiguity, including word identity, grammatical and semantic, but in TTS we need concentrate only on the types of ambiguity that affect the actual sound produced. So, although there are two words that share the orthographic form bank, both sound the same, and we can therefore ignore this type of ambiguity for TTS purposes. Tokens such as record, by contrast, can be pronounced in two different ways, so this is the type of ambiguity we do need to resolve.
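A homograph like record is typically resolved from grammatical context: as a noun it is stressed on the first syllable, as a verb on the second. A toy sketch of part-of-speech-conditioned pronunciation lookup (the dictionary structure and function names are my own illustration; the pronunciations use ARPAbet symbols with stress digits, as in the CMU Pronouncing Dictionary):

```python
# Toy homograph resolution conditioned on part-of-speech tag (illustrative only).
HOMOGRAPHS = {
    ("record", "NOUN"): "R EH1 K ER0 D",    # 'REcord': stress on first syllable
    ("record", "VERB"): "R IH0 K AO1 R D",  # 'reCORD': stress on second syllable
}

def pronounce(word, pos, default=None):
    """Pick a pronunciation for a (word, POS) pair; non-homographs
    fall through to whatever the main lexicon would provide."""
    return HOMOGRAPHS.get((word.lower(), pos), default)

print(pronounce("record", "NOUN"))   # R EH1 K ER0 D
print(pronounce("record", "VERB"))   # R IH0 K AO1 R D
```

In a real system the POS tag would come from a statistical tagger run over the sentence, which is precisely why this kind of ambiguity is treated as a disambiguation problem.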
In this chapter, we concentrate on resolving ambiguity relating to the verbal component of language.