
Development of a computationally efficient voice conversion system on mobile phones

  • Shuhua Gao, Xiaoling Wu, Cheng Xiang and Dongyan Huang

Abstract

Voice conversion aims to change a source speaker's voice so that it sounds like that of a target speaker while preserving the linguistic content. Despite the rapid advance of voice conversion algorithms in the last decade, most remain too complicated to be accessible to the public. With the popularity of mobile devices, especially smartphones, mobile voice conversion applications are highly desirable, so that everyone can enjoy the pleasure of high-quality voice mimicry and people with speech disorders can also potentially benefit. Given the limited computing resources on mobile phones, the major concern is the time efficiency of such a mobile application, which must be high enough to guarantee a positive user experience. In this paper, we detail the development of a mobile voice conversion system based on the Gaussian mixture model (GMM) and weighted frequency warping methods. We boost computational efficiency by making the best of the hardware characteristics of today's mobile phones, such as parallel computing on multiple cores and advanced vectorization support. Experimental evaluation results indicate that our system achieves acceptable voice conversion performance while converting a five-second sentence in slightly more than one second on an iPhone 7.
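The GMM-based spectral mapping the abstract refers to can be sketched as follows. This is a minimal, hypothetical illustration of the standard joint-density GMM conversion (each mixture models concatenated source/target features, and a source frame is mapped to the posterior-weighted conditional mean of the target features), not the paper's actual implementation; the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def convert_frame(x, weights, means, covs):
    """Map one source feature frame x (shape [d]) to the target domain.

    weights: [M]         mixture weights
    means:   [M, 2d]     joint means (first d dims = source, last d = target)
    covs:    [M, 2d, 2d] joint covariances over z = [x; y]
    """
    d = x.shape[0]
    M = weights.shape[0]
    resp = np.empty(M)             # unnormalized posteriors P(m | x)
    cond_means = np.empty((M, d))  # E[y | x, m] per mixture
    for m in range(M):
        mu_x, mu_y = means[m, :d], means[m, d:]
        S_xx = covs[m, :d, :d]     # source-source covariance block
        S_yx = covs[m, d:, :d]     # target-source cross-covariance block
        diff = x - mu_x
        # Gaussian density of x under the source marginal of mixture m
        _, logdet = np.linalg.slogdet(S_xx)
        maha = diff @ np.linalg.solve(S_xx, diff)
        resp[m] = weights[m] * np.exp(-0.5 * (maha + logdet + d * np.log(2 * np.pi)))
        # Conditional mean of the target features given x and mixture m
        cond_means[m] = mu_y + S_yx @ np.linalg.solve(S_xx, diff)
    resp /= resp.sum()
    # Converted frame: posterior-weighted sum of conditional means
    return resp @ cond_means
```

In a real system this per-frame mapping is exactly the kind of independent, data-parallel work that the paper distributes across cores and vectorizes, since each frame's conversion touches only small, fixed-size matrices.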


Copyright

This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.

Corresponding author

Corresponding author: Dongyan Huang Email: huang@i2r.a-star.edu.sg

APSIPA Transactions on Signal and Information Processing
  • ISSN: 2048-7703
  • EISSN: 2048-7703