This is a book about getting computers to read out loud. It is therefore about three things: the process of reading, the process of speaking, and the issues involved in getting computers (as opposed to humans) to do this. This field of study is known both as speech synthesis, that is, the “synthetic” (computer) generation of speech, and as text-to-speech or TTS: the process of converting written text into speech. It complements other language technologies such as speech recognition, which aims to convert speech into text, and machine translation, which converts writing or speech in one language into writing or speech in another.
I am assuming that most readers have heard some synthetic speech in their life. We experience this in a number of situations: some telephone information systems have automated speech response; speech synthesis is often used as an aid to the disabled; and Professor Stephen Hawking has probably contributed more than anyone else to the direct exposure of (one particular type of) synthetic speech. The idea of artificially generated speech has of course been around for a long time – hardly any science-fiction film is complete without a talking computer of some sort. In fact science fiction has had an interesting effect on the field and our impressions of it. Sometimes (less technically aware) people believe that perfect speech synthesis exists because they “heard it on Star Trek”. Often makers of science-fiction films fake the synthesis by using an actor, although usually some processing is added to the voice to make it sound “computerised”.
So far this book has dealt with individual subjects ranging from background material on audio, its handling with Matlab, speech, hearing, and on to the commercially important topics of communications and analysis. Each of these has been relatively self-contained in scope, although the more advanced speech compression methods of the previous chapter did introduce some limited psychoacoustic features.
In this chapter we will progress onward: we will discuss and describe advanced topics that combine processing elements relating to both speech and hearing – computer models of the human hearing system that can be used to influence processing of speech (and audio), and computer models of speech that can be used for analysis and modification of the speech signal.
By far the most important of these topics is introduced first: psychoacoustic modelling, without which MP3 and similar audio formats, and the resultant miniature music players from Creative, Apple, iRiver, and others, would not exist.
Psychoacoustic modelling
Remember back in Section 4.2 we claimed that this marriage of the art of psychology and the science of acoustics was important in forming a link between the purely physical domain of sound and the experience of a listener? In this section we will examine that link further to see why and how it arises.
It follows that a recording of a physical sound wave – which is a physical representation of the audio – contains elements which are very relevant to a listener, and elements which are not. At one extreme, some of the recorded sound may be inaudible to a listener.
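The claim that parts of a recorded sound may be inaudible can be made concrete with the absolute threshold of hearing: the quietest level at which a tone of a given frequency can be heard at all. A minimal sketch, using Terhardt's widely cited approximation (this particular formula is an illustrative assumption, not taken from this chapter's text):

```python
import numpy as np

def absolute_threshold_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing, in dB SPL."""
    f = np.asarray(f_hz, dtype=float) / 1000.0  # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# A 50 Hz component must be far louder than a 3 kHz one before it is heard,
# so quiet low-frequency content in a recording may be inaudible to a listener.
print(absolute_threshold_db(50.0) > absolute_threshold_db(3000.0))
```

A perceptual coder can discard recorded energy that falls below this curve (and below masking thresholds) without the listener noticing.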
We finish with some general thoughts about the field of speech technology and linguistics and a discussion of future directions.
Speech technology and linguistics
A newcomer to TTS might expect that the relationship between speech technology and linguistics would parallel that between more-traditional types of engineering and physics. For example, in mechanical engineering, machines and engines are built on the basis of principles of dynamics, forces, energy and so on developed in classical physics. It should be clear to a reader with more experience in speech technology that this state of affairs does not hold between the engineering issues that we address in this book and the theoretical field of linguistics. How has this state of affairs come about and what is the relationship between the two fields?
It is widely acknowledged that researchers in the fields of speech technology and linguistics do not in general work together. This topic is often raised at conferences and is the subject of many a discussion panel or special session. Arguments are put forward to explain the lack of unity, all politely agree that we can learn more from each other, and then both communities go away and do their own thing just as before, such that the gap is even wider by the time the next conference comes around.
The first stated reason for this gap is the “aeroplanes don't flap their wings” argument.
This chapter sets the basis of quantum information theory (QIT). The central purpose of QIT is to quantify the transmission of either classical or quantum information over quantum channels. The starting point of the QIT description is von Neumann entropy, S(ρ), which represents the quantum counterpart of Shannon's classical entropy, H(X). Such a definition rests on that of the density operator (or density matrix) of a quantum system, ρ, which plays a role similar to that of the random-events source X in Shannon's theory. As we shall see, there also exists an elegant one-to-one correspondence between the quantum and classical definitions of the entropy variants relative entropy, joint entropy, conditional entropy, and mutual information. But such a similarity is only apparent. Indeed, one becomes rapidly convinced from a systematic analysis of the entropy's additivity rules that fundamental differences separate the two worlds. The quantum counterpart of the classical notion of information correlation between two event sources shall be referred to as quantum entanglement. We then define a quantum communication channel, which encodes and decodes classical information into or from quantum states. The analysis shows that the mutual information H(X;Y) between originator and recipient in this communication channel cannot exceed a quantity χ, called the Holevo bound, which itself satisfies χ ≤ H(X), where H(X) is the entropy of the originator's classical information source.
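The definition of S(ρ) can be illustrated numerically: S(ρ) = −Tr(ρ log₂ ρ), which reduces to the Shannon entropy of the eigenvalues of ρ. A minimal sketch (the two density matrices are illustrative assumptions, chosen for their textbook values):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]  # drop zeros: 0 log 0 = 0 by convention
    return float(-np.sum(evals * np.log2(evals)))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])    # pure qubit state: S = 0
mixed = np.array([[0.5, 0.0], [0.0, 0.5]])   # maximally mixed qubit: S = 1
print(von_neumann_entropy(pure), von_neumann_entropy(mixed))
```

A pure state carries no entropy, while the maximally mixed qubit attains the one-bit maximum, exactly mirroring H(X) for a certain versus a uniformly random classical bit.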
We saw in Chapter 13 that, despite the approximations in all the vocal-tract models concerned, the limiting factor in generating high-quality speech is not so much in converting the parameters into speech, but in knowing which parameters to use for a given synthesis specification. Determining these by hand-written rules can produce fairly intelligible speech, but the inherent complexities of speech seem to place an upper limit on the quality that can be achieved in this way. The various second-generation synthesis techniques explained in Chapter 14 solve the problem by simply measuring the values from real speech waveforms. Although this is successful to a certain extent, it is not a perfect solution. As we will see in Chapter 16, we can never collect enough data to cover all the effects we wish to synthesize, and often the coverage we have in the database is very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense all we are doing is reordering the original data.
An alternative is to use statistical, machine-learning techniques to infer the specification-to-parameter mapping from data. While this and the concatenative approach can both be described as data-driven, in the concatenative approach we are effectively memorising the data, whereas in the statistical approach we are attempting to learn the general properties of the data.
This chapter introduces the fundamentals of the field of signal processing, which studies how signals can be synthesised, analysed and modified. Here, and for the remainder of the book, we use the term signal in a more specific sense than before, in that we take it to mean a waveform that represents a pattern of variation against time. This material describes signals in general, but serves as a precursor to the following chapters which describe the nature of speech signals and how they can be generated, manipulated and modified. This chapter uses the framework of digital signal processing, a widely adopted set of techniques used by engineers to analyse many types of signals.
Analogue signals
A signal is a pattern of variation that encodes information. Signals that encode the variation of information over time can be represented by a time waveform, which is often just called a waveform. Figure 10.1 shows an example speech waveform. The horizontal axis represents time and the vertical axis represents amplitude, hence the figure shows how the amplitude of the signal varies with time. The amplitude in a speech signal can represent diverse physical quantities: for example, the variation in air pressure in front of the mouth, the displacement of the diaphragm of a microphone used to record the speech or the voltage in the wire used to transmit the speech.
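A discrete version of such a waveform is easy to construct: sample an amplitude value at regular intervals of time. A minimal sketch (the 8 kHz rate is typical for telephone speech; the 120 Hz "vowel-like" mixture of harmonics is purely an illustrative assumption):

```python
import numpy as np

fs = 8000                 # sampling rate in Hz, typical for telephone speech
n = np.arange(80)         # 80 sample indices = 10 ms at this rate
t = n / fs                # discrete time axis in seconds

# A crude vowel-like waveform: a 120 Hz fundamental plus two weaker harmonics
x = (1.00 * np.sin(2 * np.pi * 120 * t)
     + 0.50 * np.sin(2 * np.pi * 240 * t)
     + 0.25 * np.sin(2 * np.pi * 360 * t))

print(len(x))             # 80 amplitude values: the waveform as stored digitally
```

Whatever physical quantity the amplitude represents (air pressure, diaphragm displacement, voltage), the stored signal is just this sequence of numbers indexed by time.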
Audio and speech processing systems have steadily risen in importance in the everyday lives of most people in developed countries. From ‘Hi-Fi’ music systems, through radio to portable music players, audio processing is firmly entrenched in providing entertainment to consumers. Digital audio techniques in particular now dominate audio delivery, with CD players, Internet radio, MP3 players and iPods being the systems of choice in many cases. Even within television and film studios, and in mixing desks for ‘live’ events, digital processing now predominates. Music and sound effects are also becoming more prominent within computer games.
Speech processing has equally seen an upward worldwide trend, with the rise of cellular communications, particularly the European GSM (Global System for Mobile communications) standard. GSM is now virtually ubiquitous worldwide, and has seen tremendous adoption even in the world's poorest regions.
Of course, speech has been conveyed digitally over long distance, especially satellite communications links, for many years, but even the legacy telephone network (named POTS for ‘Plain Old Telephone Services’) is now succumbing to digitisation in many countries. The last mile, the several hundred metres of twisted-pair copper wire running to a customer's home, was never designed or deployed with digital technology in mind, and has resisted many attempts over the years to be replaced with optical fibre, Ethernet or wireless links. However, with DSL (digital subscriber line – normally asymmetric, so it is faster in one direction than the other, hence ADSL), even this analogue twisted pair will convey reasonably high-speed digital signals.
The design of wireless devices and networks presents unique challenges due to the combined effects of physical constraints, such as bandwidth and power, and communication errors introduced through channel fading and noise. The limited bandwidth has to be addressed through a compression operation that removes redundant information from the source. Unfortunately, the removal of redundancy makes the transmitted data not only more important but also more sensitive to channel errors. Therefore, it is necessary to complement the source compression operation with the application of an error control code to perform channel coding. The goal of the channel coding operation is to add structured redundancy back to the source-coded data, designed so that errors introduced in the channel can be efficiently detected and corrected. In this chapter we will study the use of cooperation to transmit multimedia traffic (such as voice or video conferencing). This involves studying the performance of schemes with strict delay constraints in which user cooperation is combined with source and channel coding.
Although the source and channel codecs complement each other, their designs have historically been approached independently of each other. The basic assumptions in this design method are that the source encoder presents to the channel encoder a bit stream from which all source redundancy has been removed, and that the channel decoder presents to the source decoder a bit stream in which all channel errors have been corrected.
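The idea of adding structured redundancy so that channel errors can be detected and corrected is captured by even the simplest channel code. A minimal sketch using a rate-1/3 repetition code with majority-vote decoding (an illustrative stand-in, far weaker than the codes used in practice):

```python
def encode(bits):
    """Rate-1/3 repetition code: transmit each bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(received):
    """Majority vote over each group of three received bits."""
    return [int(sum(received[i:i + 3]) >= 2)
            for i in range(0, len(received), 3)]

msg = [1, 0, 1, 1]
tx = encode(msg)
tx[4] ^= 1                       # the channel flips one transmitted bit
print(decode(tx) == msg)         # a single error per group is corrected
```

The cost of this protection is a tripled bit rate; practical channel codes achieve far better error correction per redundant bit, but the trade-off between added redundancy and error resilience is the same.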
Despite the promised gains of cooperative communication demonstrated in many previous works, the impact of cooperation at higher network levels is not yet completely understood. In the previous chapters, it was assumed that the user always has a packet to transmit, which is not generally true in a wireless network. For example, most sources in a network are bursty in nature, which leads to periods of silence in which the users may have no data to transmit. Such a phenomenon may affect important system parameters that are relevant to higher network layers, for example, buffer stability and packet delivery delay. We focus on the multiple access layer in this chapter. Several important questions now arise. Can we design cooperation protocols that take these higher-layer network features into account? Can the gains promised by cooperation at the physical layer be carried over to the multiple access layer? More specifically, what is the impact of cooperation on important multiple access performance metrics such as the stable throughput region and packet delivery delay?
In this chapter, we try to address all of these important questions to demonstrate the possible gains of cooperation at the multiple access layer. We consider a slotted time division multiple access (TDMA) framework in which each time slot is assigned to only one terminal, i.e., orthogonal multiple access. If a user does not have a packet to transmit in its time slot, then this time slot is not utilized.
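The waste caused by bursty sources in such a slotted system is easy to see in simulation. A minimal sketch (the user count, burstiness probability, and frame count are illustrative assumptions, not parameters from this chapter):

```python
import random

random.seed(0)

def tdma_utilization(num_users, p_busy, frames):
    """Fraction of TDMA slots actually used when sources are bursty.
    Each user owns one slot per frame and transmits only if it has a packet."""
    used = slots = 0
    for _ in range(frames):
        for _ in range(num_users):
            slots += 1
            used += random.random() < p_busy   # user has a packet with prob. p_busy
    return used / slots

# With sources busy 30% of the time, roughly 70% of slots go idle,
# since an empty slot cannot be reused by another terminal.
print(tdma_utilization(10, 0.3, 1000))
```

This idle capacity is precisely what cooperative protocols at the multiple access layer can try to exploit, for example by letting a relay use slots their owners leave empty.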
In the previous chapter, the symbol error rate performance of single-relay cooperative communications was analyzed for both the decode-and-forward and amplify-and-forward relaying strategies. This chapter builds upon the results in the previous chapter and generalizes the symbol error rate performance analysis to the multi-relay scenario.
Decode-and-forward relaying will be considered first, followed by the amplify-and-forward case. In both scenarios, exact and approximate expressions for the symbol error rate will be derived. The symbol error rate expressions are then used to characterize an optimal power allocation strategy among the relays and the source node.
Multi-node decode-and-forward protocol
We begin by presenting a class of cooperative decode-and-forward protocols for arbitrary N-relay wireless networks, in which each relay can combine the signal received from the source along with one or more of the signals transmitted by previous relays. Then, we focus on the performance of a general cooperation scenario and present exact symbol error rate (SER) expressions for both M-ary phase shift keying (PSK) and quadrature amplitude modulation (QAM) signalling. We also consider an approximate expression for the SER of a general cooperation scenario that is shown to be tight at high enough SNR. Finally, we study optimal power allocation for this class of cooperative diversity schemes, where optimality is determined in terms of minimizing the SER of the system.
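As a point of reference for the multi-relay expressions, it can help to simulate the non-cooperative baseline they improve upon: the SER of BPSK over a single flat Rayleigh-fading link, for which the closed form 0.5(1 − √(γ̄/(1+γ̄))) is standard. A minimal Monte Carlo sketch (not this chapter's multi-relay derivation; the SNR value is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

def bpsk_rayleigh_ser(avg_snr, n=200_000):
    """Monte Carlo SER of BPSK over a flat Rayleigh fading link (no relays)."""
    # Unit-power Rayleigh channel and unit-power complex Gaussian noise
    h = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    noise = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    bits = rng.integers(0, 2, n)
    x = 2 * bits - 1                              # BPSK symbols +/-1
    y = np.sqrt(avg_snr) * h * x + noise
    detected = (np.real(np.conj(h) * y) > 0).astype(int)   # coherent detection
    return float(np.mean(detected != bits))

g = 10.0
theory = 0.5 * (1 - np.sqrt(g / (1 + g)))         # closed-form single-link SER
print(bpsk_rayleigh_ser(g), theory)
```

The slow decay of this single-link error rate with SNR is exactly what relay diversity combats: with N relays the SER falls off much faster in SNR.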
In wireless communication systems with a single antenna, the channel capacity can be very low and the bit error rate high when fading occurs. Various techniques can be utilized to mitigate fading, e.g., robust modulation, coding and interleaving, error-correcting coding, equalization, and diversity. Different kinds of diversity, such as space, time, frequency, or any combination of them, are possible. Among these diversity techniques, space diversity is of special interest because of its ability to improve performance without sacrificing delay and bandwidth efficiency. Recently, space diversity has been intensively investigated in point-to-point wireless communication systems through the deployment of the MIMO concept together with efficient coding and modulation schemes. In recent years, cooperative communication [109] has been proposed as an alternative communication system that exploits MIMO-like diversity to improve link performance without the requirement of additional antennas. However, most existing work on MIMO systems and cooperative communications is designed on the assumption that the receivers have full knowledge of the channel state information (CSI). In this case, the schemes must incorporate reliable multi-channel estimation, which inevitably increases the cost of frequent retraining and the number of parameters the receivers must estimate. Although channel estimates may be available when the channel changes slowly compared with the symbol rate, they may be impossible to acquire in a fast-fading environment. To develop practical schemes that omit such CSI requirements, we consider in this chapter differential modulations for cooperative communications.
Extending the lifetime of battery-operated devices is a key design issue that allows uninterrupted information exchange among distributed nodes in wireless networks. Cooperative communications enables and leverages effective resource sharing among cooperative nodes. This chapter provides a general framework for lifetime extension of battery-operated devices by exploiting cooperative diversity. The framework efficiently takes advantage of the different locations and energy levels of the distributed nodes. First, a lifetime maximization problem via cooperative nodes is considered and performance analysis for M-ary PSK modulation is provided. With the objective of maximizing the minimum device lifetime under a constraint on bit error rate performance, the optimization problem determines which nodes should cooperate and how much power should be allocated for cooperation. Moreover, the device lifetime is further improved by deploying cooperative relays to help forward information from the distributed nodes in the network. The optimum location and power allocation for each cooperative relay are determined with the aim of maximizing the minimum device lifetime. A suboptimal algorithm is presented to solve the problem with multiple cooperative relays and cooperative nodes.
Introduction
In many applications of wireless networks, extending the lifetime of battery-operated devices is a key design issue that ensures uninterrupted information exchange and alleviates the burden of replenishing batteries. Lifetime extension of battery-limited devices has become an important issue due to the needs of sensor and ad-hoc networks.
In this chapter, we will discuss the shortcomings of conventional point-to-point communications that led to the introduction of a new paradigm for wireless communications, i.e., cooperative communications. We will define what the relay channel is, and in what respects it differs from the direct point-to-point channel. We will also describe several protocols that can be implemented over the relay channel, and discuss the performance of these protocols, assessed in terms of their outage probability and diversity gains.
Cooperative communications
In cooperative communications, independent paths between the user and the base station are generated via the introduction of a relay channel, as illustrated in Figure 4.1. The relay channel can be thought of as an auxiliary channel to the direct channel between the source and destination. A key aspect of the cooperative communication process is the processing performed by the relay on the signal received from the source node. These different processing schemes result in different cooperative communication protocols. Cooperative communication protocols can be generally categorized into fixed relaying schemes and adaptive relaying schemes. In fixed relaying, the channel resources are divided between the source and the relay in a fixed (deterministic) manner. The processing at the relay differs according to the employed protocol. In the fixed amplify-and-forward (AF) relaying protocol, the relay simply scales the received signal and transmits an amplified version of it to the destination.
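The scaling step in AF relaying is usually chosen so that the relay's average transmit power meets its own power budget. A minimal sketch of this standard power-normalizing gain (the notation and the numerical values are illustrative assumptions, not this chapter's):

```python
import numpy as np

def af_gain(P_relay, P_source, h_sr, N0):
    """Amplification factor making the relay's average output power equal P_relay.
    Assumes the relay received y = h_sr * x + n, where E|x|^2 = P_source and
    the noise n has power N0, so E|y|^2 = P_source * |h_sr|^2 + N0."""
    return float(np.sqrt(P_relay / (P_source * abs(h_sr) ** 2 + N0)))

# Illustrative numbers: unit source and relay powers, |h_sr| = 0.8, noise power 0.1
beta = af_gain(1.0, 1.0, 0.8, 0.1)
print(beta)   # the relay forwards beta * y to the destination
```

Note that the relay amplifies its received noise along with the signal, which is the basic performance trade-off of AF relative to decode-and-forward.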
Wireless communications technologies have seen a remarkably fast evolution in the past two decades. Each new generation of wireless devices has brought notable improvements in terms of communication reliability, data rates, device sizes, battery life, and network connectivity. In addition, the increasing homogenization of traffic transport using Internet Protocols is translating into network topologies that are less and less centralized. In recent years, ad-hoc and sensor networks have emerged with many new applications, where a source has to rely on the assistance of other nodes to forward or relay information to a desired destination.
Such a need for cooperation among nodes or users has inspired new thinking and ideas for the design of communication and networking systems, prompting the question of whether cooperation can be used to improve system performance. Answering it means establishing what performance gains cooperative communications and networking can deliver, and how. As a result, a new communication paradigm arose, which had an impact far beyond its original applications to ad-hoc and sensor networks.
First of all, why are cooperative communications in wireless networks possible? Note that the wireless channel is broadcast by nature. Even directional transmission is in fact a kind of broadcast with fewer recipients limited to a certain region. This implies that many nodes or users can “hear” and receive transmissions from a source and can help relay information if needed.