Development of a computationally efficient voice conversion system on mobile phones

Published online by Cambridge University Press:  04 January 2019

Shuhua Gao
Affiliation:
Department of Electrical & Computer Engineering, National University of Singapore, Singapore; Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore
Xiaoling Wu
Affiliation:
Department of Electrical & Computer Engineering, National University of Singapore, Singapore; Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore
Cheng Xiang
Affiliation:
Department of Electrical & Computer Engineering, National University of Singapore, Singapore
Dongyan Huang*
Affiliation:
Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore
*Corresponding author: Dongyan Huang. Email: huang@i2r.a-star.edu.sg

Abstract

Voice conversion aims to change a source speaker's voice so that it sounds like that of a target speaker while preserving the linguistic information. Despite the rapid advance of voice conversion algorithms in the last decade, most of them are still too complicated to be accessible to the public. With the popularity of mobile devices, especially smartphones, mobile voice conversion applications are highly desirable: everyone could enjoy high-quality voice mimicry, and people with speech disorders could also potentially benefit. Because computing resources on mobile phones are limited, the major concern is the time efficiency of such an application, which is needed to guarantee a positive user experience. In this paper, we detail the development of a mobile voice conversion system based on the Gaussian mixture model (GMM) and the weighted frequency warping methods. We attempt to boost the computational efficiency by making the best of the hardware characteristics of today's mobile phones, such as parallel computing on multiple cores and advanced vectorization support. Experimental evaluation results indicate that our system achieves acceptable voice conversion performance, while converting a five-second sentence takes slightly more than one second on an iPhone 7.

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
Copyright © The Authors, 2019

Fig. 1. General architecture of a typical voice conversion system.

Fig. 2. Parallel computation during training and conversion phases.

Fig. 3. Multi-core parallelism in the conversion phase. First, an input sentence is partitioned into multiple segments and those segments are converted simultaneously; then, the converted segments are merged smoothly by a weighted sum whose weights follow a logistic function. (a) Split a sentence for parallel conversion; (b) Logistic function for segment merging.
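
The merging step in Fig. 3(b) can be illustrated with a minimal C++ sketch. It assumes that neighbouring segments overlap by a fixed number of samples and that the logistic weight rises from near 0 to near 1 across that overlap; the overlap scheme, the `steepness` value, and the function names are illustrative assumptions, not details taken from the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic weight for sample i of an overlap region of length `overlap`.
// `steepness` is an assumed tuning parameter, not a value from the paper.
double logisticWeight(std::size_t i, std::size_t overlap, double steepness = 10.0) {
    // Map the sample index to x in (-0.5, 0.5), centred on the overlap midpoint.
    double x = (static_cast<double>(i) + 0.5) / static_cast<double>(overlap) - 0.5;
    return 1.0 / (1.0 + std::exp(-steepness * x));
}

// Merge two converted segments whose last/first `overlap` samples
// cover the same stretch of audio (hypothetical overlap scheme).
std::vector<double> mergeSegments(const std::vector<double>& left,
                                  const std::vector<double>& right,
                                  std::size_t overlap) {
    assert(overlap <= left.size() && overlap <= right.size());
    const std::size_t head = left.size() - overlap;

    // Samples of the left segment that precede the overlap are copied as-is.
    std::vector<double> out(left.begin(),
                            left.begin() + static_cast<std::ptrdiff_t>(head));

    // Inside the overlap, fade from the left segment to the right segment.
    for (std::size_t i = 0; i < overlap; ++i) {
        const double w = logisticWeight(i, overlap);
        out.push_back((1.0 - w) * left[head + i] + w * right[i]);
    }

    // Samples of the right segment that follow the overlap are copied as-is.
    out.insert(out.end(),
               right.begin() + static_cast<std::ptrdiff_t>(overlap),
               right.end());
    return out;
}
```

Compared with a hard cut at the segment boundary, a weighted sum of this kind avoids an abrupt jump where two independently converted segments meet.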

Fig. 4. Scalar vs. SIMD operation for multiple additions.
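
The contrast in Fig. 4 can be sketched in C++ as follows. The scalar version performs one addition per loop iteration, while the SIMD version adds four single-precision values per instruction using NEON intrinsics on ARM or SSE intrinsics on x86, with a scalar fallback elsewhere. This is only an illustration of the idea; the function names are hypothetical, and the paper's system may obtain vectorization in a different way (for example, through a numerical library or compiler auto-vectorization).

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// Scalar addition: one element per iteration.
void addScalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// SIMD addition: four elements per instruction where the target supports it.
void addSimd(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    }
#elif defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
#endif
    // Remaining elements (or all of them on targets without NEON/SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same result; the SIMD path simply needs roughly a quarter of the add instructions for long arrays.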

Fig. 5. Overview of software architecture. The core algorithms (compute engine) are implemented in C++ for portability. The graphical user interface (GUI) sits on top of the engine and may be built with different languages/libraries on various platforms. The engine exposes C interfaces to be used by the GUI on four common operating systems: iOS, Android, Windows, and Mac.
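
A C interface over a C++ engine, as described in Fig. 5, is commonly built with an opaque handle and `extern "C"` functions. The sketch below shows that general pattern only. Every name in it (`ConversionEngine`, `VcEngine`, `vc_engine_create`, `vc_engine_convert`, `vc_engine_destroy`) is hypothetical, and the "conversion" is a placeholder that copies its input; none of this is the actual interface of the system.

```cpp
#include <algorithm>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical C++ engine class standing in for the real compute engine.
class ConversionEngine {
public:
    // Placeholder "conversion": returns the input samples unchanged.
    std::vector<float> convert(const float* samples, std::size_t n) const {
        return std::vector<float>(samples, samples + n);
    }
};

extern "C" {

// Opaque handle: GUI code only ever sees a pointer to this incomplete type.
typedef struct VcEngine VcEngine;

// Returns a new engine handle, or a null pointer if allocation fails.
VcEngine* vc_engine_create(void) {
    return reinterpret_cast<VcEngine*>(new (std::nothrow) ConversionEngine());
}

void vc_engine_destroy(VcEngine* engine) {
    delete reinterpret_cast<ConversionEngine*>(engine);
}

// Writes n converted samples to `out`. Returns 0 on success, -1 on failure.
int vc_engine_convert(const VcEngine* engine, const float* in,
                      std::size_t n, float* out) {
    if (engine == nullptr || in == nullptr || out == nullptr) {
        return -1;
    }
    try {
        const ConversionEngine* impl =
            reinterpret_cast<const ConversionEngine*>(engine);
        const std::vector<float> result = impl->convert(in, n);
        std::copy(result.begin(), result.end(), out);
        return 0;
    } catch (...) {
        // C++ exceptions must not propagate across the C boundary.
        return -1;
    }
}

}  // extern "C"
```

A plain C surface of this kind can be called from Swift or Objective-C on iOS, through JNI on Android, and from most desktop GUI toolkits, which is the usual motivation for keeping the engine behind C functions.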

Fig. 6. Workflow of the voice conversion application on mobile phones.

Fig. 7. Outline of the modules in the iOS application.

Fig. 8. Model-View-Controller (MVC) design pattern.

Fig. 9. Running time of the training phase with different configurations (95% confidence interval). From left to right, S+NV: single-threaded C++ with no vectorization, S+V: single-threaded C++ with vectorization, M+NV: multi-threaded C++ without vectorization, M+V: multi-threaded C++ with vectorization, MATLAB: 64-bit MATLAB 2016 with default settings.

Fig. 10. Running time of each stage during the training phase with different configurations (95% confidence interval). S1: speech analysis and feature construction; S2: frame alignment of parallel corpus; S3: conversion function training. Vectorization is enabled for all conditions.

Fig. 11. Running time of the training phase (top) and the conversion phase (bottom) when different numbers of CPU cores are used for multi-core parallelism (95% confidence interval). The CPU under investigation has four cores in total. Vectorization is enabled.

Fig. 12. Training time for various training set sizes and numbers of cores (95% confidence interval). On average, each core corresponds to a training set of 10 utterances. A label n-k means n utterances in the training set and k cores. Vectorization is enabled.

Fig. 13. User interface of the Voichap application on iPhone 7. (a) Login and sign up (b) Recorded audios list (c) Select the target speaker (d) Speaker info & training (e) Source and targets management (f) Speak here.

Fig. 14. A typical usage scenario: how to make yourself sound like President Trump?

Fig. 15. Measured running time of the training phase with respect to different numbers of Gaussian components (95% confidence interval). The training set contains 20 sentences with an average length of around 4.5 s, and the running time is measured over 10 repeated runs. Vectorization and 4-core parallelism are enabled.

Fig. 16. Subjective evaluation of the voice conversion system with respect to the number of Gaussian components m in the GMM. Ten listeners were asked to evaluate the speech similarity and quality. The figure shows the average score over the four possible source-target conversion directions.

Table 1. The mel-cepstral distortion (MCD) of the unconverted source, the traditional GMM method, and the weighted frequency warping (WFW) method.
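
For readers unfamiliar with the MCD metric reported in Table 1, the sketch below computes the per-frame mel-cepstral distortion in its commonly used form, (10 / ln 10) * sqrt(2 * sum over d of (c_d - c'_d)^2), in decibels. It assumes the 0th coefficient (energy) is excluded, which is a frequent convention; the exact coefficient range and frame averaging used in the paper are not stated here, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-frame mel-cepstral distortion in dB between a target frame and a
// converted frame. Index 0 (energy) is skipped, an assumed convention.
double melCepstralDistortion(const std::vector<double>& target,
                             const std::vector<double>& converted) {
    assert(target.size() == converted.size());
    double sumSquares = 0.0;
    for (std::size_t d = 1; d < target.size(); ++d) {
        const double diff = target[d] - converted[d];
        sumSquares += diff * diff;
    }
    return (10.0 / std::log(10.0)) * std::sqrt(2.0 * sumSquares);
}
```

A sentence-level score is typically obtained by averaging this value over all time-aligned frame pairs; lower values indicate that the converted spectrum is closer to the target.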

Fig. 17. Global variances for the natural speech and the speech converted via GMM and WFW.

Fig. 18. Subjective evaluation of the voice conversion system Voichap on iPhone 7 in terms of converted speech quality and similarity (95% confidence interval). Ten volunteer listeners participated in these tests, and scores were given on a 5-point scale. (a) Similarity; (b) Quality.