This chapter will take us into a world very different from everything we have seen so far concerning Shannon's information theory. As we shall see, it is a strange world made of virtual computers (universal Turing machines) and abstract axioms that can be demonstrated without mathematics, merely by the force of logic, as well as of some relatively involved formalism. If the mere evocation of Shannon, of information theory, or of entropy may raise eyebrows in one's professional circle, how much more so that of Kolmogorov complexity! This chapter will remove some of the mystery surrounding this “complexity,” also called “algorithmic entropy,” without pretending to uncover it all. Why address such a subject right here, in the middle of our description of Shannon's information theory? Because, as we shall see, algorithmic entropy and Shannon entropy meet conceptually at some point, to the extent of being asymptotically bounded, even though they stem from totally uncorrelated basic assumptions! This remarkable convergence between fields must become an integral part of our IT culture, even if this chapter provides only a flavor of it. The chapter may be perceived as somewhat more difficult or demanding than the preceding ones, but we believe the extra investment is well worth it. In any case, it can be revisited later on, should the reader prefer to stay focused on Shannon's theory and move directly to the next stage, without venturing into the intriguing sidetracks of algorithmic information theory.
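As a purely numerical aside (not part of the original argument), the short sketch below compares the two quantities for a toy binary source: the compressed length of a sample string serves as a crude, compressor-dependent stand-in for its algorithmic description length, and is set against the Shannon entropy of the source. The Bernoulli source, its parameter, and the use of zlib are assumptions made only for this illustration.

```python
# Illustrative only: compressed length as a crude upper bound on algorithmic
# entropy, compared with the Shannon entropy of the (assumed) Bernoulli source.
import math
import random
import zlib

random.seed(0)
n = 100_000
p = 0.1  # probability of a '1' in the source (an assumption for this example)

# Draw an i.i.d. binary string from the source.
bits = ''.join('1' if random.random() < p else '0' for _ in range(n))

# Shannon entropy of the source, in bits per symbol.
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Compressed size stands in (very loosely) for the algorithmic description length.
compressed_bits = 8 * len(zlib.compress(bits.encode('ascii'), 9))

print(f"Shannon entropy of source : {h:.3f} bits/symbol")
print(f"zlib description length   : {compressed_bits / n:.3f} bits/symbol")
```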
The task of text decoding is to take a tokenised sentence and determine the best sequence of words. In many situations this is a classical disambiguation problem: there is one, and only one, correct sequence of words that gave rise to the text, and it is our job to determine this. In other situations, especially where we are dealing with non-natural-language text such as numbers and dates and so on, there may be a few different acceptable word sequences.
So, in general, text decoding in TTS is a process of resolving ambiguity. The ambiguity arises because two or more underlying forms share the same surface form, and, given the surface form (i.e. the writing), we need to find which of the underlying forms is the correct one. There are many types of linguistic ambiguity, including word identity, grammatical, and semantic ambiguity, but in TTS we need only concentrate on the types of ambiguity that affect the actual sound produced. So, although two words share the orthographic form bank, they are pronounced the same, and we can ignore this type of ambiguity for TTS purposes. Tokens such as record, by contrast, can be pronounced in two different ways, and this is the type of ambiguity we need to resolve.
In this chapter, we concentrate on resolving ambiguity relating to the verbal component of language.
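As a concrete, if simplified, illustration of the kind of disambiguation involved, the sketch below resolves the homograph record using a part-of-speech tag assumed to be supplied by an earlier analysis stage; the lexicon entries, tag names, and pronunciations are hypothetical and purely illustrative.

```python
# A minimal, hypothetical sketch of homograph resolution in text decoding:
# the token "record" has two underlying words with different pronunciations,
# and a part-of-speech tag (assumed to come from earlier analysis) selects
# between them. The lexicon entries and tag names are illustrative only.
HOMOGRAPHS = {
    "record": {
        "NOUN": "R EH1 K ER0 D",    # 'a record'  - stress on the first syllable
        "VERB": "R IH0 K AO1 R D",  # 'to record' - stress on the second syllable
    },
}

def pronounce(token: str, pos_tag: str) -> str:
    """Pick a pronunciation for a token, using POS to resolve homographs."""
    entry = HOMOGRAPHS.get(token.lower())
    if entry is None:
        return "<unambiguous: handled by the ordinary lexicon>"
    # Fall back to an arbitrary reading if the tag does not discriminate.
    return entry.get(pos_tag, next(iter(entry.values())))

print(pronounce("record", "NOUN"))   # R EH1 K ER0 D
print(pronounce("record", "VERB"))   # R IH0 K AO1 R D
```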
A study of human hearing and the biomechanical processes involved in hearing reveals several nonlinear steps, or stages, in the perception of sound. Each of these stages contributes to the eventual mismatch between the subjective features we perceive and the purely physical properties of the sound.
Put simply, what we think we hear is quite significantly different from the physical sounds that may be present (which in turn differ from what would be captured electronically by, for example, a computer). By taking into account the various nonlinearities in the hearing process, and some of the basic physical characteristics of the ear, nervous system, and brain, it is possible to account for the discrepancy.
Over the years, science and technology have incrementally improved our ability to model the hearing process from purely physical data. One simple example is A-law compression (or the similar μ-law used in some regions of the world), in which approximately logarithmic amplitude quantisation replaces the linear quantisation of PCM: humans tend to perceive amplitude logarithmically rather than linearly, so A-law quantisation using 8 bits sounds better than linear PCM quantisation using 8 bits. It thus achieves a higher degree of subjective speech quality than PCM for a given bitrate.
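The sketch below illustrates this effect under stated assumptions: it applies the continuous form of the A-law characteristic (with the conventional parameter A = 87.6) before uniform 8-bit quantisation, and compares the resulting signal-to-noise ratio with that of plain 8-bit linear quantisation for a quiet test tone. A real G.711 codec uses a segmented approximation of this curve, so the figures are indicative rather than bit-exact.

```python
# Sketch of A-law companding followed by uniform 8-bit quantisation, using the
# continuous-curve form of the A-law characteristic (a real G.711 codec uses a
# segmented approximation, so this is illustrative rather than bit-exact).
import numpy as np

A = 87.6  # standard A-law compression parameter

def a_law_compress(x):
    ax = np.maximum(np.abs(x), 1e-12)          # avoid log(0)
    y = np.where(ax < 1.0 / A,
                 A * ax / (1.0 + np.log(A)),
                 (1.0 + np.log(A * ax)) / (1.0 + np.log(A)))
    return np.sign(x) * y

def a_law_expand(y):
    ay = np.abs(y)
    x = np.where(ay < 1.0 / (1.0 + np.log(A)),
                 ay * (1.0 + np.log(A)) / A,
                 np.exp(ay * (1.0 + np.log(A)) - 1.0) / A)
    return np.sign(y) * x

# A quiet 440 Hz tone: small amplitudes are where logarithmic quantisation helps.
fs, levels = 8000, 256
x = 0.02 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

x_alaw = a_law_expand(np.round(a_law_compress(x) * (levels / 2 - 1)) / (levels / 2 - 1))
x_lin = np.round(x * (levels / 2 - 1)) / (levels / 2 - 1)

snr = lambda ref, est: 10 * np.log10(np.mean(ref**2) / np.mean((ref - est)**2))
print(f"8-bit A-law SNR : {snr(x, x_alaw):.1f} dB")
print(f"8-bit linear SNR: {snr(x, x_lin):.1f} dB")   # noticeably lower for quiet signals
```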
Physical processes
The ear, as shown diagrammatically in Figure 4.1, includes the pinna, which filters sound and focuses it into the external auditory canal. Sound then acts upon the eardrum, from where it is transmitted and amplified by three small bones, the malleus, incus and stapes, to the oval window, which opens onto the cochlea.
This chapter introduces the notion of noisy quantum channels and the different types of “quantum noise” that affect qubit messages passed through such channels. The main types of noisy channel reviewed here are the depolarizing, bit-flip, phase-flip, and bit-phase-flip channels. Then the quantum channel capacity χ is defined through the Holevo–Schumacher–Westmoreland (HSW) theorem. This theorem can conceptually be viewed as the elegant quantum counterpart of Shannon's (noisy) channel coding theorem, which was described in Chapter 13. Here, I shall not venture into the complex proof of the HSW theorem but only provide a background illustrating the similarity with its classical counterpart. The resemblance of this capacity χ both with the Holevo bound, as described in Chapter 21, and with the classical mutual information H(X; Y), as described in Chapter 5, is discussed. For advanced reference, a hint is provided as to the meaning of quantum coherent information, a concept that is still not fully explored. Several examples of quantum channel capacity, derived from direct applications of the HSW theorem, along with the solution of the corresponding maximization problem, are provided.
Noisy quantum channels
The notion of “noisiness” in a classical communication channel was first introduced in Chapter 12, when describing channel entropy. Such a channel can be viewed schematically as a probabilistic relation between two random sources, X for the originator, and Y for the recipient.
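To make this probabilistic relation concrete, the short sketch below computes the mutual information H(X; Y) carried by a binary symmetric channel; the flip probability and the uniform input distribution are assumptions chosen only for illustration, not values taken from the chapter.

```python
# Sketch: the classical picture of a noisy channel as a probabilistic relation
# between originator X and recipient Y. A binary symmetric channel with flip
# probability eps and a uniform input (both assumed here) makes the relation
# concrete, and the mutual information H(X;Y) follows from the joint law.
import math

eps = 0.1                      # probability that a transmitted bit is flipped
p_x = {0: 0.5, 1: 0.5}         # originator distribution (assumed uniform)

# Joint distribution p(x, y) = p(x) p(y|x) for the binary symmetric channel.
p_xy = {(x, y): p_x[x] * ((1 - eps) if y == x else eps)
        for x in (0, 1) for y in (0, 1)}
p_y = {y: sum(p_xy[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Mutual information H(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ].
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(f"H(X;Y) = {mi:.4f} bits")   # equals 1 - H2(eps) for a uniform input
```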
This chapter makes us walk a few preliminary, but decisive, steps towards quantum information theory (QIT), which will be the focus of the rest of this book. Here, we shall remain in the classical world, yet get a hint that it is possible to think of a different world where computations may be reversible, namely, without any loss of information. One key realization in this paradigm shift is that “information is physical.” As we shall see, such a nonintuitive and striking conclusion actually results from the age-long paradox of Maxwell's demon in thermodynamics, which eventually found an elegant resolution in Landauer's principle. This principle states that the erasure of a single bit of information requires one to provide an energy that is proportional to log 2, which, as we know from Shannon's theory, is the measure of information and also the entropy of a two-level system with a uniformly distributed source. This consideration brings up the issue of irreversible computation. Logic gates, used at the heart of the CPU in modern computers, are based on such computational irreversibility. I shall then describe the computer's von Neumann architecture, the intimate workings of the ALU processing network, and the elementary logic gates on which the ALU is based. This will also provide some basics of Boolean logic, expanding on Chapter 1, which is key to the logic-gate concepts that follow.
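The contrast between irreversible and reversible logic can be made tangible with a small sketch (my illustration, not the chapter's): the ordinary AND gate collapses several input pairs onto the same output, so the inputs cannot be recovered, whereas the Toffoli gate, a standard example of a reversible gate, permutes the set of three-bit states and thus preserves all the information.

```python
# Illustration (not from the chapter) of logical irreversibility: the AND gate
# maps four input pairs onto two outputs, so its inputs cannot in general be
# recovered and information is erased; the Toffoli (controlled-controlled-NOT)
# gate is a bijection on three bits and therefore loses no information.
from itertools import product

def and_gate(a, b):
    return a & b

def toffoli(a, b, c):
    """Flip the target c if and only if both controls a and b are 1."""
    return (a, b, c ^ (a & b))

# AND: group inputs by their output value to expose the many-to-one collapse.
pre_images = {}
for a, b in product((0, 1), repeat=2):
    pre_images.setdefault(and_gate(a, b), []).append((a, b))
print("AND pre-images:", pre_images)        # output 0 has three distinct pre-images

# Toffoli: every 3-bit output occurs exactly once, i.e. the map is invertible.
images = {toffoli(a, b, c) for a, b, c in product((0, 1), repeat=3)}
print("Toffoli reversible:", len(images) == 2 ** 3)
```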
We saw in Chapter 13 that, while vocal-tract methods can often generate intelligible speech, they seem fundamentally limited in terms of generating natural-sounding speech. We saw that, in the case of formant synthesis, the main limitation is not so much in generating the speech from the parametric representation, but rather in generating these parameters from the input specification created by the text-analysis process. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human-derived rules, no matter how “expert” the rule designer. We face the same problems with articulatory synthesis, and in addition have to deal with the facts that acquiring data is fundamentally difficult and that improving naturalness often necessitates a considerable increase in the complexity of the synthesiser.
A partial solution to the complexities of specification-to-parameter mapping is found in the classical LP technique, whereby we bypassed the issue of generating the vocal-tract parameters explicitly and instead measured them from data. The source parameters, however, were still specified by an explicit model, which was identified as the main source of the unnaturalness.
In this chapter we introduce a set of techniques that attempt to get around these limitations. In a way, these can be viewed as extensions of the classical LP technique in that they use a data-driven approach: the increase in quality, however, largely arises from the abandonment of the over-simplistic impulse/noise source model.
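Before moving on, the sketch below spells out the data-driven measurement step of the classical LP technique mentioned above: it estimates all-pole coefficients from a frame of signal using the autocorrelation method and the Levinson–Durbin recursion. The synthetic AR(2) frame and the chosen order are assumptions standing in for a windowed slice of real speech.

```python
# Sketch of the data-driven step of classical LP: estimate all-pole
# (vocal-tract) coefficients from a windowed frame using the autocorrelation
# method and the Levinson-Durbin recursion. The AR(2) "frame" below is
# synthetic and stands in for a windowed slice of recorded speech.
import numpy as np

def lpc(frame, order):
    """Return LP coefficients [1, a1, ..., a_order] via Levinson-Durbin."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Synthesise a frame from a known all-pole filter driven by white noise.
rng = np.random.default_rng(0)
true_a = np.array([1.0, -1.3, 0.5])           # hypothetical "vocal-tract" filter
e = rng.standard_normal(2048)
frame = np.zeros_like(e)
for n in range(2, len(e)):
    frame[n] = e[n] - true_a[1] * frame[n - 1] - true_a[2] * frame[n - 2]

estimate = lpc(frame * np.hamming(len(frame)), order=2)
print("true coefficients     :", true_a)
print("estimated coefficients:", np.round(estimate, 3))   # close to true_a
```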
Analysis techniques are those used to examine, understand and interpret the content of recorded sound signals. Sometimes these lead to visualisation methods, whilst at other times they may be used in specifying some form of further processing or measurement of the audio.
There is a general set of analysis techniques which are common to all audio signals, and indeed to many forms of data, particularly the traditional methods used for signal processing. We have already met and used the basic technique of decomposing sound into multiple sinusoidal components with the Fast Fourier Transform (FFT), and have considered forming a polynomial equation to replicate audio waveform characteristics through linear prediction (LPC), but there are many other useful techniques we have not yet considered.
Most analysis techniques operate on analysis windows, or frames, of input audio. Most also require that the analysis window is a representative, stationary selection of the signal (stationary in the sense that the signal statistics and frequency distribution do not change appreciably over the duration of the window; otherwise results may be inaccurate). We discussed the stationarity issue in Section 2.5.1, and should note that the choice of analysis window size, as well as the choice of analysis methods used, depends strongly upon the nature of the signal being analysed. Speech, noise and music all have different characteristics, and while many of the same methods can be used in their analysis, knowledge of their characteristics leads to different analysis periods and different parameter ranges in the analysis result.
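As a small illustration of this frame-based way of working, the sketch below splits a signal into short, overlapping windowed frames assumed to be quasi-stationary and computes a spectrum for each; the frame length, overlap and test signal are illustrative choices rather than recommendations.

```python
# Sketch of frame-based (short-time) analysis: split the signal into short,
# overlapping Hamming-windowed frames assumed quasi-stationary, then take an
# FFT of each frame. Frame length, overlap, and the test signal are
# illustrative choices; real values depend on the signal being analysed.
import numpy as np

fs = 16000                                   # sample rate in Hz (assumed)
frame_len = int(0.025 * fs)                  # 25 ms analysis window
hop = int(0.010 * fs)                        # 10 ms hop between frames

# One second of a 440 Hz tone plus a little noise, standing in for real audio.
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.default_rng(0).standard_normal(fs)

window = np.hamming(frame_len)
frames = np.array([x[s:s + frame_len] * window
                   for s in range(0, len(x) - frame_len + 1, hop)])

spectra = np.abs(np.fft.rfft(frames, axis=1))        # one spectrum per frame
freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
print(f"{len(frames)} frames of {frame_len} samples each")
print(f"frame 0 spectral peak near {freqs[spectra[0].argmax()]:.0f} Hz")
```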