
Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks

Published online by Cambridge University Press:  04 March 2019

Zhaojie Luo*
Affiliation:
Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan
Jinhui Chen
Affiliation:
RIEB, Kobe University, 2-1 Rokkodai, Nada, Kobe 657-8501, Japan
Tetsuya Takiguchi
Affiliation:
Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan
Yasuo Ariki
Affiliation:
Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan
*Corresponding author: Zhaojie Luo. E-mail: luozhaojie@me.cs.scitec.kobe-u.ac.jp

Abstract

In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that can effectively learn a mapping from neutral to emotional speech with limited emotional voice data. Although conventional VC techniques have achieved tremendous success in spectral conversion, the lack of rich representations of the fundamental frequency (F0), which explicitly carries prosody information, remains a major limiting factor for emotional VC. To overcome this limitation, our proposed model applies the cross-wavelet transform (XWT) to synthesize diverse representations of F0 features for emotional VC. The idea is (1) to decompose F0 into representations at different temporal levels using the continuous wavelet transform (CWT); (2) to combine different CWT-F0 features with the XWT to synthesize interaction XWT-F0 features; and (3) to train the emotional VC model on both the CWT-F0 features and the corresponding XWT-F0 features. Moreover, to better measure the similarity between converted and real F0 features, we apply a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, the VAE learns latent representations of the high-dimensional features (CWT-F0, XWT-F0), while the discriminator of the GAN uses the learned feature representations as the basis for a VAE reconstruction objective.
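
To make the pipeline concrete, the following is a minimal NumPy sketch of the core transforms in steps (1) and (2). The complex Morlet mother wavelet, the ten dyadic scales, and the 5-ms frame shift are common choices in CWT-based prosody modelling, not details taken from this paper:

```python
import numpy as np

def cwt_morlet(signal, scales, dt=0.005, omega0=6.0):
    """Continuous wavelet transform with a complex Morlet mother wavelet.
    Returns an array of shape (len(scales), len(signal))."""
    n = len(signal)
    t = np.arange(n) * dt
    out = np.empty((len(scales), n), dtype=complex)
    for k, s in enumerate(scales):
        lags = (t[None, :] - t[:, None]) / s          # lags[i, j] = (t_j - t_i) / s
        psi = np.pi ** -0.25 * np.exp(1j * omega0 * lags - lags ** 2 / 2.0)
        out[k] = (signal[None, :] * np.conj(psi)).sum(axis=1) * dt / np.sqrt(s)
    return out

def xwt(w_src, w_tgt):
    """Cross-wavelet transform: one CWT times the complex conjugate of the
    other.  |xwt| gives the common power and np.angle(xwt) the relative phase."""
    return w_src * np.conj(w_tgt)

# Usage on two interpolated, log-normalized F0 contours:
#   scales = 0.005 * 2.0 ** np.arange(1, 11)   # ten dyadic scales (assumed)
#   w_src = cwt_morlet(f0_src, scales)
#   w_tgt = cwt_morlet(f0_tgt, scales)
#   w_cross = xwt(w_src, w_tgt)                # interaction XWT-F0 features
```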

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2019

Fig. 1. Illustration of the structure of the VAE [17], the GAN [18], and the proposed VA-GAN. Here, x and x' are the input and generated features, z is the latent vector, and t denotes the target features. E, G, and D are the encoder, generative, and discriminative networks, respectively. h is the latent representation produced by the encoder network, and y is a binary output that indicates real/synthesized features.
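
As a concrete starting point, here is a minimal PyTorch sketch of the three networks in Fig. 1. The fully connected layers and their sizes are placeholders (Table 2 gives the actual architectures); the discriminator also exposes its hidden activations, since the abstract states that the discriminator's learned representations feed the VAE reconstruction objective:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):              # E: features x -> (mu, log_var) of z
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())  # h
        self.mu = nn.Linear(256, z_dim)
        self.log_var = nn.Linear(256, z_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class Generator(nn.Module):            # G: latent z -> generated features x'
    def __init__(self, z_dim, x_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                  nn.Linear(256, x_dim))

    def forward(self, z):
        return self.body(z)

class Discriminator(nn.Module):        # D: features -> (y, hidden)
    def __init__(self, x_dim):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.out = nn.Linear(256, 1)

    def forward(self, x):
        hidden = self.features(x)            # reused for the reconstruction loss
        y = torch.sigmoid(self.out(hidden))  # y: real / synthesized
        return y, hidden
```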

Fig. 2. (1) Linear interpolation and log-normalization processing for the source and target emotional voices. (2) The CWT of the normalized log-F0 contours; the top and bottom panels show five examples of the CWT-F0 features decomposed at different levels from the source and target F0, respectively. (3) The first and second panels show the continuous wavelet spectra of the source and target CWT-F0 features, and the bottom panel shows the cross-wavelet spectrum of the two CWT-F0 features. The relative phase relationship is shown as arrows in the cross-wavelet spectrum (in-phase pointing right (→), anti-phase pointing left (←), and source emotional features leading target emotional features by 90° pointing straight down).
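
Step (1) of the caption is standard F0 preprocessing; a minimal sketch, assuming unvoiced frames are marked by F0 = 0 and per-utterance z-normalization in the log domain:

```python
import numpy as np

def preprocess_f0(f0):
    """Linearly interpolate unvoiced (F0 == 0) frames, take the logarithm,
    and z-normalize per utterance.  Returns the normalized contour plus the
    statistics needed to undo the normalization after conversion."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    idx = np.arange(len(f0))
    f0_interp = np.interp(idx, idx[voiced], f0[voiced])  # fill unvoiced gaps
    log_f0 = np.log(f0_interp)
    mean, std = log_f0.mean(), log_f0.std()
    return (log_f0 - mean) / std, mean, std
```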

Fig. 3. (Left) Examples of processing the latent representations for emotion A and emotion B using the variational autoencoder (VAE). (Right) Examples of modifying the emotional voice from emotion A to emotion B. ${\bi \mu}_{e_{A}}$ and ${\bi \mu}_{e_{B}}$ represent the means of the latent attribute representations for emotions $e_{A}$ and $e_{B}$, respectively. $q_{\bm \phi}$ and $p_{\bm \theta}$ denote the encoder and decoder functions, respectively. ${\bf x}_{A}^{i}$ and ${\bf x}_{B}^{i}$ represent speech segments of emotion A and emotion B.
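
As an illustration of the right-hand side of Fig. 3, the modification can be written with the standard latent attribute-vector arithmetic (an assumed formulation; the exact update rule may differ):

$$ {\bf z}_{B} = {\bf z}_{A} - {\bi \mu}_{e_{A}} + {\bi \mu}_{e_{B}}, \qquad \hat{{\bf x}}_{B} = p_{\bm \theta}({\bf z}_{B}), $$

where ${\bi \mu}_{e_{A}}$ and ${\bi \mu}_{e_{B}}$ are estimated by averaging the posterior means of $q_{\bm \phi}$ over the speech segments of each emotion.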

Fig. 4. Illustration of calculating the GAN loss in voice conversion.
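
For reference, the quantity sketched in Fig. 4 is, in the standard GAN formulation [18], the minimax objective below (textbook form in the notation of Fig. 1, not a transcription of the authors' equation); D maximizes it while G minimizes it:

$$ \mathcal{L}_{GAN} = \mathbb{E}_{{\bf t}}\big[\log D({\bf t})\big] + \mathbb{E}_{{\bf z}}\big[\log\big(1 - D(G({\bf z}))\big)\big]. $$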

Fig. 5. Illustration of calculating the VA-GAN loss.
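
Consistent with the abstract (the discriminator's learned representations serve as the basis for the VAE reconstruction objective), the VA-GAN loss of Fig. 5 can be written in the VAE/GAN style of Larsen et al.; the decomposition and weighting below are assumptions rather than the exact formulation:

$$ \mathcal{L}_{VA\text{-}GAN} = D_{KL}\big(q_{\bm \phi}({\bf z}\mid{\bf x}) \,\Vert\, p({\bf z})\big) + \big\Vert D_{h}({\bf t}) - D_{h}(G({\bf z})) \big\Vert^{2} + \mathcal{L}_{GAN}, $$

where $D_{h}(\cdot)$ denotes hidden-layer activations of the discriminator.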

Algorithm 1 Training procedure of VA-GAN
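
Since only the label of Algorithm 1 survives here, the following is a schematic VAE/GAN-style training step consistent with the abstract's description, reusing the Encoder/Generator/Discriminator sketches above. The update order, loss terms, and the gamma weight are assumptions, not the authors' settings:

```python
import torch
import torch.nn.functional as F

def train_step(E, G, D, x, t, opt_e, opt_g, opt_d, gamma=1.0):
    """One schematic VA-GAN update: x are source features, t target features."""
    # 1) Encode and sample z with the reparameterization trick.
    mu, log_var = E(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
    x_fake = G(z)

    # 2) Discriminator step: real targets vs. detached generated features.
    y_real, _ = D(t)
    y_fake, _ = D(x_fake.detach())
    loss_d = -(torch.log(y_real + 1e-8) +
               torch.log(1.0 - y_fake + 1e-8)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 3) Encoder/generator step: KL prior + feature-space reconstruction
    #    (hidden activations of D) + adversarial term.
    y_fake, h_fake = D(x_fake)
    _, h_real = D(t)
    loss_prior = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(1).mean()
    loss_rec = F.mse_loss(h_fake, h_real.detach())
    loss_gan = -torch.log(y_fake + 1e-8).mean()
    loss_eg = loss_prior + gamma * loss_rec + loss_gan
    opt_e.zero_grad(); opt_g.zero_grad(); loss_eg.backward()
    opt_e.step(); opt_g.step()
    return loss_d.item(), loss_eg.item()
```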

Fig. 6. The listening experiment setup for evaluating emotional voices generated with different XWT-F0 features. The small letters at the lower right of CWT-F0 and XWT-F0 indicate the emotion; for instance, (n) and (a) denote neutral and angry, and (n,a) denotes the XWT-F0 features generated from the CWT-F0 features of neutral and angry.

Fig. 7. Similarity of the emotional voices synthesized with different XWT-F0 features to the different emotional voices (neutral, angry, happy, and sad).

Table 1. F0-RMSE results for different emotions. N2A, N2S, and N2H denote the neutral-to-angry, neutral-to-sad, and neutral-to-happy datasets, respectively. (40) and (80) denote the number of training examples. "+" indicates that both the CWT-F0 features and the selected XWT-F0 features are used to train the model. U/V-ER denotes the unvoiced/voiced error rate of the proposed method
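
For reference, the two metrics of Table 1 have standard definitions; a minimal sketch, assuming unvoiced frames are encoded as F0 = 0 (the exact frame selection may differ):

```python
import numpy as np

def f0_metrics(f0_conv, f0_tgt):
    """F0-RMSE over frames voiced in both contours, and the unvoiced/voiced
    error rate (U/V-ER) as the percentage of frames whose voicing decision
    disagrees with the target."""
    conv_v, tgt_v = f0_conv > 0, f0_tgt > 0
    both = conv_v & tgt_v
    rmse = np.sqrt(np.mean((f0_conv[both] - f0_tgt[both]) ** 2))
    uv_er = 100.0 * np.mean(conv_v != tgt_v)
    return rmse, uv_er
```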

Fig. 8. Spectrograms of the source voice, the target voice, and converted voices reconstructed from the spectral and F0 features converted by different methods.

Fig. 9. MOS evaluation of emotional voice conversion.

Table 2. Details of network architectures of Enc, C and D

Table 3. Results of emotion classification for the recorded (original) voices and the voices converted by different methods [%]