This appendix provides a brief overview of common data-compression standards used for sounds, text, files, images, and video. The description is meant only as an introduction and makes no pretense of comprehensively defining the actual standards and their current updated versions. The list of selected standards is likewise indicative, and does not reflect the full diversity of those available in the market as freeware, shareware, or under license. It is a tricky endeavor to attempt to describe in a few pages a subject that would fill entire bookshelves. The hope is that the reader will get a flavor of it and be enticed to learn more about this seemingly endless, yet fascinating subject. Why put this whole matter into an appendix, rather than a fully fledged chapter? Because this set of chapters is primarily focused on information theory, not on information standards. While the former provides a universal and slowly evolving background reference, like science, the latter represents practically the reverse. As we shall see through this appendix, however, information standards are extremely sophisticated and "intellectually smart," despite being just an application field for the former. No telecom engineer or scientist can afford to ignore, or will fail to benefit from, this essential fact!
This chapter marks a key turning point in our journey through information-theory land. Heretofore, we have covered only some very basic notions of IT, which have nonetheless led us to grasp the subtle concepts of information and entropy. Here, we are going to make significant steps into the depths of Shannon's theory, and hopefully begin to appreciate its power and elegance. This chapter is somewhat more mathematically demanding, but it is guaranteed not to be significantly more complex than the preceding material. Let's say that there is more ink involved in the equations and the derivation of the key results. But this light investment will prove well worth it when it comes to appreciating the forthcoming chapters!
I will first introduce two more entropy definitions: joint and conditional entropies, just as there are joint and conditional probabilities. This leads to a new fundamental notion, that of mutual information, which is central to IT and to Shannon's various laws. Then I introduce relative entropy, based on the concept of a "distance" between two PDFs. Relative entropy broadens the perspective beyond this chapter, in particular through an (optional) appendix exploration of the second law of thermodynamics, as analyzed in the light of information theory.
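As a minimal numerical illustration of these quantities (a Python sketch using numpy, with toy numbers that are not taken from the text), the following computes the joint entropy H(X,Y), the conditional entropy H(Y|X), the mutual information I(X;Y), and the relative entropy D(p||q) between two discrete PDFs:

# Minimal Python sketch (toy numbers, not from the text): joint entropy H(X,Y),
# conditional entropy H(Y|X), mutual information I(X;Y), and relative entropy D(p||q)
# for discrete distributions, using base-2 logarithms (bits).
import numpy as np

p_xy = np.array([[0.25, 0.25],     # joint PDF p(x, y) over a 2 x 2 event space
                 [0.40, 0.10]])

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y = p_xy.sum(axis=0)             # marginal p(y)

def H(p):                          # Shannon entropy of a discrete PDF (0 log 0 := 0)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_xy = H(p_xy.ravel())             # joint entropy H(X,Y)
H_y_given_x = H_xy - H(p_x)        # chain rule: H(Y|X) = H(X,Y) - H(X)
I_xy = H(p_x) + H(p_y) - H_xy      # mutual information I(X;Y)

def D(p, q):                       # relative entropy (Kullback-Leibler "distance") D(p||q)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

print(H_xy, H_y_given_x, I_xy, D(p_x, p_y))

Both identities used here, the chain rule H(Y|X) = H(X,Y) − H(X) and I(X;Y) = H(X) + H(Y) − H(X,Y), are standard results for joint and conditional entropies.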
Joint and conditional entropies
So far, in this book, the notions of probability distribution and entropy have been associated with single, independent events x, as selected from a discrete source X = {x}.
Chapter 1 enabled us to familiarize ourselves with (or to revisit, or brush up on) the concept of probability. As we have seen, any probability is associated with a given event xi from a given event space S = {x1, x2,…, xN}. The discrete set {p(x1), p(x2),…, p(xN)} represents the probability distribution function, or PDF, which will be the focus of this second chapter.
So far, we have considered single events that can be numbered. These are called discrete events, which correspond to event spaces having a finite size N (no matter how big N may be!). At this stage, we are ready to expand our perspective to consider event spaces having unbounded or infinite sizes (N → ∞). In this case, we can still allocate an integer number to each discrete event, and the PDF, p(xi), remains a function of the discrete variable xi. But we can also conceive that the event corresponds to a real number, for instance in the physical measurement of a quantity such as length, angle, speed, or mass. This represents another infinity of events, each tagged by a real number x. In this case, the PDF, p(x), is a function of the continuous variable x.
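To make the discrete/continuous distinction concrete, here is a small Python sketch (toy values, not from the text): a discrete PDF must sum to unity over its finite event space, while a continuous PDF, such as a Gaussian, must integrate to unity over the real line.

# Sketch (toy example, not from the text): normalization of a discrete PDF
# versus a continuous PDF.
import numpy as np

# Discrete case: a finite event space S = {x1, ..., xN} with probabilities p(xi).
p_discrete = np.array([0.1, 0.2, 0.3, 0.4])
print(p_discrete.sum())                        # sums to 1

# Continuous case: a Gaussian PDF p(x); its integral over x must equal 1.
x, dx = np.linspace(-10.0, 10.0, 20001, retstep=True)
mu, sigma = 0.0, 1.0
p_continuous = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.sum(p_continuous) * dx)               # numerically close to 1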
This chapter is an opportunity to look at the properties of both discrete and continuous PDFs, as well as to acquire a wealth of new conceptual tools!
In this chapter we introduce the three main synthesis techniques which dominated the field up until the late 1980s, collectively known as first-generation techniques. Even though these techniques are used less today, it is still useful to discuss them because, apart from simple historical interest, they give us an understanding of why today's systems are configured the way they are. As an example, we need to know why today's dominant technique of unit selection is used rather than the more-basic approach which would be to generate speech waveforms “from scratch”. Furthermore, modern techniques have been made possible only by vast increases in processing power and memory, so in fact, for applications that require small footprints and low processing cost, the techniques explained here remain quite competitive.
Synthesis specification: the input to the synthesiser
First-generation techniques usually require a quite-detailed, low-level description of what is to be spoken. For purposes of explanation, we will take this to be a phonetic representation for the verbal component, together with a time for each phone and an F0 contour for the whole sentence. The phones will have been generated by a combination of lexical lookup, G2P (grapheme-to-phoneme) rules and post-lexical processing (see Chapter 8), while the timing and F0 contour will have been generated by a classical prosody algorithm of the type described in Chapter 9. It is often convenient to place this information in a new structure called a synthesis specification.
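By way of illustration only, a synthesis specification of this kind might be held in a structure like the following Python sketch; the names (PhoneSpec, SynthesisSpec, duration_ms, f0_contour) are hypothetical and not those of any particular system described in this book.

# Hypothetical sketch of a synthesis specification: a phone sequence with
# per-phone durations plus an F0 contour for the whole sentence.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PhoneSpec:
    phone: str          # phonetic symbol, e.g. from lexical lookup / G2P rules
    duration_ms: float  # duration assigned by the prosody algorithm

@dataclass
class SynthesisSpec:
    phones: List[PhoneSpec]                 # verbal component
    f0_contour: List[Tuple[float, float]]   # (time_ms, f0_hz) points for the sentence

spec = SynthesisSpec(
    phones=[PhoneSpec("h", 60.0), PhoneSpec("e", 90.0),
            PhoneSpec("l", 70.0), PhoneSpec("ou", 140.0)],
    f0_contour=[(0.0, 120.0), (180.0, 140.0), (360.0, 100.0)],
)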
This chapter describes what is generally considered to be one of the most important and historical contributions to the field of quantum computing, namely Shor's factorization algorithm. As its name indicates, this algorithm makes it possible to factorize numbers, that is, to decompose them into a unique product of prime numbers. Classical factorization algorithms developed previously have a complexity, or computing time, that increases exponentially with the size of the number, making the task intractable, if not hopeless, for large numbers. In contrast, Shor's algorithm is able to factor a number of any size in polynomial time, making the factorization problem tractable should a quantum computer ever be realized in the future. Since Shor's algorithm is based on several nonintuitive properties and other mathematical subtleties, this chapter presents a certain level of difficulty. With the previous chapters and tools readily assimilated, and some patience in going through the different preliminary steps required, such a difficulty is, however, quite surmountable. I have sought to make this description of Shor's algorithm as mathematically complete and crack-free as possible, while avoiding some academic considerations that may not be deemed necessary from an engineering perspective. Ultimately, Shor's algorithm comes down to only a few basic instructions. What is conceptually challenging is to grasp why it works so well, and also to feel comfortable with the fact that its implementation actually takes a fair amount of trial and error. The two preliminaries of Shor's algorithm are the phase-estimation and the related order-finding algorithms.
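To give a flavour of why order finding leads to factors, the following Python sketch performs the classical post-processing used in Shor's algorithm, with a brute-force (exponential-time) search standing in for the quantum order-finding step; the example values are illustrative only.

# Illustrative sketch: the classical post-processing behind Shor's algorithm.
# A brute-force (exponential-time) search stands in for the quantum order-finding step.
from math import gcd

def find_order(a, N):
    # smallest r > 0 with a**r = 1 (mod N); requires gcd(a, N) == 1
    r, x = 1, a % N
    while x != 1:
        x = (x * a) % N
        r += 1
    return r

def try_factor(a, N):
    if gcd(a, N) != 1:
        return gcd(a, N)              # lucky guess: a already shares a factor with N
    r = find_order(a, N)
    if r % 2 == 1:
        return None                   # odd order: pick another a and retry
    y = pow(a, r // 2, N)
    if y == N - 1:
        return None                   # trivial square root: retry with another a
    f = gcd(y - 1, N)
    return f if 1 < f < N else None

print(try_factor(7, 15))              # 7 has order 4 mod 15, yielding the factor 3

Shor's algorithm replaces find_order with the quantum phase-estimation and order-finding routines, which is precisely where the polynomial-time advantage comes from; the trial-and-error flavour (odd orders, trivial square roots) is the same in both cases.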
This chapter is concerned with the measurement of quantum states. This requires one to introduce the subtle notion of quantum measurement, an operation that has no counterpart in the classical domain. To this effect, we first need to develop some new tools, starting with Dirac notation, a formalism that is not only very elegant but also relatively simple to handle. The introduction of Dirac notation makes it possible to become familiar with the inner product for quantum states, as well as different properties of operators and states concerning projection, change of basis, unitary transformations, matrix elements, similarity transformations, eigenvalues and eigenstates, spectral decomposition and diagonal representation, matrix trace, and the density operator or matrix. The concept of the density matrix makes it possible to provide a very first and brief hint of the analog of Shannon's entropy in the quantum world, referred to as von Neumann's entropy, to be developed further in Chapter 21. Once we have all the required tools, we can focus on quantum measurement and analyze three different types, referred to as basis-state measurements, projection or von Neumann measurements, and POVM measurements. In particular, POVM measurements are shown to possess a remarkable property of unambiguous quantum state discrimination (UQSD), by which it is possible to derive "absolutely certain" information from unknown system states. The more complex case of quantum measurements in composite systems described by joint or tensor states is then considered.
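As a minimal numerical illustration (a numpy sketch, not part of the chapter), the code below forms the density operator of the single-qubit state |+> = (|0> + |1>)/√2, obtains the probabilities of a computational-basis (von Neumann) measurement from the trace rule p(m) = Tr(P_m ρ), and evaluates the von Neumann entropy for a pure and for a maximally mixed state:

# Minimal numpy sketch: density operator, projective measurement, von Neumann entropy.
import numpy as np

ket0 = np.array([[1.0], [0.0]])
ket1 = np.array([[0.0], [1.0]])
plus = (ket0 + ket1) / np.sqrt(2)          # the |+> state

rho_pure = plus @ plus.conj().T            # density matrix of the pure state |+>

P0 = ket0 @ ket0.conj().T                  # projectors onto the measurement basis
P1 = ket1 @ ket1.conj().T
p0 = np.trace(P0 @ rho_pure).real          # Born rule: p(m) = Tr(P_m rho)
p1 = np.trace(P1 @ rho_pure).real
print(p0, p1)                              # 0.5, 0.5

def von_neumann_entropy(rho):              # S(rho) = -Tr(rho log2 rho), in bits
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return -np.sum(evals * np.log2(evals))

rho_mixed = 0.5 * P0 + 0.5 * P1            # maximally mixed qubit
print(von_neumann_entropy(rho_pure))       # 0 for a pure state
print(von_neumann_entropy(rho_mixed))      # 1 bit for the maximally mixed state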
In the world of telecoms, the term information conveys several levels of meaning. It may concern individual bits, bit sequences, blocks, frames, or packets. It may represent a message payload, or its overhead: the extra information needed by the network nodes to transmit the message payload practically and safely from one end to the other. In many successive stages, this information is encapsulated to form larger blocks corresponding to higher-level network protocols, and then the reverse occurs all the way down to the destination. From a telecom scientist's viewpoint, information represents this uninterrupted flow of bits, together with the network intelligence to process it. Once converted into characters or pixels, the remaining message bits become meaningful or valuable in terms of acquisition, learning, decision, motion, or entertainment. In such a larger network perspective, where information is well under control and delivered with the expected quality of service, what could be today's need for any information theory (IT)?
Indeed, in the telecom research community there seems to be little interest in information theory, based on the valid perception that there is nothing new to worry about. While the occasional evocation of Shannon invariably raises passionate group discussions, the professional focus is on the exploitation of bandwidth and on network-deployment issues.
Chapter 2 described the general handling, processing and visualisation of audio vectors: sequences of samples captured at some particular sample rate, which together represent sound. This chapter will build upon that foundation, and use it to begin to look at speech. There is nothing special about speech from an audio perspective – it is simply like any other sound – it is only when we hear it that our brains begin to interpret a particular signal as being speech. A famous experiment demonstrates this with a sentence of sinewave speech: a particular sound recording constructed from sinewaves. Initially, the brain of a listener does not consider this to be speech, and so the signal is unintelligible. However, after the corresponding sentence is heard spoken aloud in a normal way, the listener's brain suddenly ‘realises’ that the signal is in fact speech, and from then on it becomes intelligible. After that the listener cannot ‘unlearn’ this fact: similar sentences which are generally completely unintelligible to others will be perfectly intelligible to this listener.
Apart from this interpretative behaviour of the human brain, there are sounds within music and other audio that are inherently speech-like in their spectral and temporal characteristics. However, speech itself is a structured set of continuous sounds, by virtue of its production mechanism. Its characteristics are very well researched, and many specialised analysis, handling and processing methods have been developed over the years especially for this narrow class of audio signals.
Initially turning our backs on the computer and on speech processing, this chapter will consider the human speech production apparatus, mechanisms and characteristics.
This is a book about getting computers to read out loud. It is therefore about three things: the process of reading, the process of speaking, and the issues involved in getting computers (as opposed to humans) to do this. This field of study is known both as speech synthesis, that is, the "synthetic" (computer) generation of speech, and as text-to-speech or TTS: the process of converting written text into speech. It complements other language technologies such as speech recognition, which aims to convert speech into text, and machine translation, which converts writing or speech in one language into writing or speech in another.
I am assuming that most readers have heard some synthetic speech in their life. We experience this in a number of situations; some telephone information systems have automated speech response, speech synthesis is often used as an aid to the disabled, and Professor Stephen Hawking has probably contributed more than anyone else to the direct exposure of (one particular type of) synthetic speech. The idea of artificially generated speech has of course been around for a long time – hardly any science-fiction film is complete without a talking computer of some sort. In fact science fiction has had an interesting effect on the field and our impressions of it. Sometimes (less technically aware) people believe that perfect speech synthesis exists because they “heard it on Star Trek”. Often makers of science-fiction films fake the synthesis by using an actor, although usually some processing is added to the voice to make it sound “computerised”.
So far this book has dealt with individual subjects ranging from background material on audio, its handling with Matlab, speech, hearing, and on to the commercially important topics of communications and analysis. Each of these has been relatively self-contained in scope, although the more advanced speech compression methods of the previous chapter did introduce some limited psychoacoustic features.
In this chapter we will progress onward: we will discuss and describe advanced topics that combine processing elements relating to both speech and hearing – computer models of the human hearing system that can be used to influence processing of speech (and audio), and computer models of speech that can be used for analysis and modification of the speech signal.
By far the most important of these topics is introduced first: psychoacoustic modelling, without which MP3 and similar audio formats, and the resultant miniature music players from Creative, Apple, iRiver, and others, would not exist.
Psychoacoustic modelling
Remember that back in Section 4.2 we claimed that this marriage of the art of psychology and the science of acoustics was important in forming a link between the purely physical domain of sound and the experience of a listener? In this section we will examine this claim further, to see why and how that link arises.
It follows that a recording of a sound wave – which is a physical representation of the audio – contains elements that are very relevant to a listener, and elements that are not. At one extreme, some of the recorded sound may be inaudible to a listener.
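As a concrete example of such inaudible elements, psychoacoustic models commonly compare spectral components against an approximation of the absolute threshold of hearing: anything falling below this frequency-dependent level cannot be heard at all. The Python sketch below uses the widely quoted Terhardt-style approximation; the exact constants vary between implementations, so treat it as indicative only.

# Sketch: approximate absolute threshold of hearing (dB SPL) versus frequency.
# Components of a recording lying below this curve are inaudible to a listener.
# Constants follow the commonly quoted Terhardt-style approximation; details vary.
import numpy as np

def abs_threshold_db(f_hz):
    f = np.asarray(f_hz, dtype=float) / 1000.0      # frequency in kHz
    return (3.64 * f**-0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3)**2)
            + 1e-3 * f**4)

for f in (100, 1000, 4000, 10000):
    print(f, "Hz:", round(float(abs_threshold_db(f)), 1), "dB SPL")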
We finish with some general thoughts about the field of speech technology and linguistics and a discussion of future directions.
Speech technology and linguistics
A newcomer to TTS might expect that the relationship between speech technology and linguistics would parallel that between more-traditional types of engineering and physics. For example, in mechanical engineering, machines and engines are built on the basis of principles of dynamics, forces, energy and so on developed in classical physics. It should be clear to a reader with more experience in speech technology that this state of affairs does not hold between the engineering issues that we address in this book and the theoretical field of linguistics. How has this situation come about, and what is the relationship between the two fields?
It is widely acknowledged that researchers in the fields of speech technology and linguistics do not in general work together. This topic is often raised at conferences and is the subject of many a discussion panel or special session. Arguments are put forward to explain the lack of unity, all politely agree that we can learn more from each other, and then both communities go away and do their own thing just as before, such that the gap is even wider by the time the next conference comes around.
The first stated reason for this gap is the “aeroplanes don't flap their wings” argument.
This chapter sets out the basis of quantum information theory (QIT). The central purpose of QIT is to quantify the transmission of either classical or quantum information over quantum channels. The starting point of the QIT description is von Neumann entropy, S(ρ), which represents the quantum counterpart of Shannon's classical entropy, H(X). This definition rests on that of the density operator (or density matrix) of a quantum system, ρ, which plays a role similar to that of the random-event source X in Shannon's theory. As we shall see, there also exists an elegant one-to-one correspondence between the quantum and classical definitions of the entropy variants: relative entropy, joint entropy, conditional entropy, and mutual information. But such a similarity is only apparent. Indeed, one rapidly becomes convinced from a systematic analysis of the entropy's additivity rules that fundamental differences separate the two worlds. The quantum counterpart, for quantum states, of the classical notion of information correlation between two event sources shall be referred to as quantum entanglement. We then define a quantum communication channel, which encodes and decodes classical information into or from quantum states. The analysis shows that the mutual information H(X;Y) between originator and recipient over this communication channel cannot exceed a quantity χ, called the Holevo bound, which itself satisfies χ ≤ H(X), where H(X) is the entropy of the originator's classical information source.
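For concreteness (a toy numpy sketch, not taken from the chapter), the code below evaluates the Holevo bound χ = S(ρ) − Σ p_i S(ρ_i) for a source that emits the two non-orthogonal pure states |0> and |+> with equal probability, and checks that χ ≤ H(X) = 1 bit:

# Toy numpy sketch: Holevo bound for an ensemble of two non-orthogonal pure states.
import numpy as np

def S(rho):                                     # von Neumann entropy, in bits
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return -np.sum(ev * np.log2(ev))

ket0 = np.array([[1.0], [0.0]])
plus = np.array([[1.0], [1.0]]) / np.sqrt(2)

rho0 = ket0 @ ket0.T                            # |0><0|
rho1 = plus @ plus.T                            # |+><+|
p = [0.5, 0.5]                                  # classical source probabilities p(x)

rho = p[0] * rho0 + p[1] * rho1                 # average state of the ensemble
chi = S(rho) - (p[0] * S(rho0) + p[1] * S(rho1))  # Holevo bound (pure states: S(rho_i) = 0)
H_X = -sum(pi * np.log2(pi) for pi in p)        # classical source entropy H(X)

print(chi, H_X, chi <= H_X)                     # chi ~ 0.60 bit <= 1 bit

Because the two states are not orthogonal, χ ≈ 0.60 bit falls strictly below H(X), so no measurement at the recipient's end can recover the full classical bit per symbol.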
We saw in Chapter 13 that, despite the approximations in all the vocal-tract models concerned, the limiting factor in generating high-quality speech is not so much in converting the parameters into speech as in knowing which parameters to use for a given synthesis specification. Determining these by hand-written rules can produce fairly intelligible speech, but the inherent complexities of speech seem to place an upper limit on the quality that can be achieved in this way. The various second-generation synthesis techniques explained in Chapter 14 solve the problem by simply measuring the values from real speech waveforms. Although this is successful to a certain extent, it is not a perfect solution. As we will see in Chapter 16, we can never collect enough data to cover all the effects we wish to synthesize, and often the coverage we have in the database is very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense all we are doing is reordering the original data.
An alternative is to use statistical, machine-learning techniques to infer the specification-to-parameter mapping from data. While this and the concatenative approach can both be described as data-driven, in the concatenative approach we are effectively memorising the data, whereas in the statistical approach we are attempting to learn the general properties of the data.