
19 - Speech Synthesis: State of the Art and Challenges for the Future

from Part III - Machine Synthesis of Social Signals

Introduction

Speech synthesis (also known as text-to-speech synthesis) is the automatic conversion of natural language text into speech. Speech synthesis has many potential applications: for example, as an aid to people with disabilities (see Challenges for the Future), for generating the output of spoken dialogue systems (Lemon et al., 2006; Georgila et al., 2010), for speech-to-speech translation (Schultz et al., 2006), and for computer games.

Current state-of-the-art speech synthesizers can simulate neutral read-aloud speech (i.e., speech that sounds like text being read out) quite well, in terms of both naturalness and intelligibility (Karaiskos et al., 2008). Nevertheless, many commercial applications that require speech output still rely on prerecorded system prompts rather than synthetic speech. The reason is that, despite much progress in speech synthesis over the last twenty years, state-of-the-art synthetic voices still lack the expressiveness of human voices. On the other hand, using prerecorded speech has several drawbacks. It is an expensive process that often has to start from scratch for each new application. Moreover, if an application needs to be extended with new prompts, the person (usually an actor) who recorded the initial prompts may no longer be available. Furthermore, human recordings cannot be used to generate content on the fly: all the utterances used in an application must be predetermined and recorded in advance, which is not always possible. For example, the number of names in the database of an automatic directory assistance service can be huge, and most such databases are continuously updated. In such cases, speech output is generated by mixing prerecorded speech (for prompts) with synthetic speech (for names) (Georgila et al., 2003), and the result of such a mixture can be quite awkward.
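To make this mixing concrete, below is a minimal sketch (in Python, using only the standard wave module) of how an application might splice a synthesized name clip between prerecorded prompt segments. The file names are hypothetical, and the name clip is assumed to have been produced offline by some TTS engine with the same sample format as the studio prompts.

    import wave

    def concatenate_wavs(segments, out_path):
        # Join WAV files into a single utterance. All inputs must share
        # the same channel count, sample width, and sample rate.
        with wave.open(segments[0], "rb") as first:
            params = first.getparams()
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for path in segments:
                with wave.open(path, "rb") as seg:
                    if seg.getparams()[:3] != params[:3]:
                        raise ValueError(f"sample format mismatch in {path}")
                    out.writeframes(seg.readframes(seg.getnframes()))

    # Hypothetical directory-assistance response:
    # "The number for" + <synthesized name> + "is 555-0123."
    concatenate_wavs(
        ["prompt_the_number_for.wav", "name_tts.wav", "prompt_number.wav"],
        "response.wav",
    )

Note that even when the sample formats match, this kind of naive splicing preserves the acoustic mismatch between the studio-recorded prompts and the synthetic name, which is precisely why the result can sound awkward.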

The discussion above shows that there is great motivation for further advances in the field of speech synthesis. Below we provide an overview of the current state of the art in speech synthesis, and present challenges for future work.

References

Adell, J., Bonafonte, A., & Escudero, D. (2006). Disfluent speech analysis and synthesis: A preliminary approach. In Proceedings of the International Conference on Speech Prosody.
Andersson, S., Georgila, K., Traum, D., Aylett, M., & Clark, R. A. J. (2010). Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proceedings of the International Conference on Speech Prosody.
Andersson, S., Yamagishi, J., & Clark, R. A. J. (2012). Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2), 175–188.
Artstein, R., Traum, D., Alexander, O., et al. (2014). Time-offset interaction with a Holocaust survivor. In Proceedings of the International Conference on Intelligent User Interfaces (pp. 163–168).
Barra-Chicote, R., Yamagishi, J., King, S., Montero, J. M., & Macias-Guarasa, J. (2010). Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, 52(5), 394–404.
Black, A. W. & Lenzo, K. A. (2000). Limited domain synthesis. In Proceedings of the International Conference on Spoken Language Processing (vol. 2, pp. 411–414).
Black, A. W. & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Proceedings of the European Conference on Speech Communication and Technology (pp. 601–604).
Black, A. W. & Tokuda, K. (2005). The Blizzard challenge – 2005: Evaluating corpus-based speech synthesis on common datasets. In Proceedings of the European Conference on Speech Communication and Technology (pp. 77–80).
Bozkurt, B., Ozturk, O., & Dutoit, T. (2003). Text design for TTS speech corpus building using a modified greedy selection. In Proceedings of the European Conference on Speech Communication and Technology (pp. 277–280).
Campbell, N. (2006). Conversational speech synthesis and the need for some laughter. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1171–1178.
Campbell, N. (2007). Towards conversational speech synthesis: Lessons learned from the expressive speech processing project. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 22–27).
DeVault, D., Artstein, R., Benn, G., et al. (2014). SimSensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (pp. 1061–1068).
Douglas-Cowie, E., Cowie, R., & Schröder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion (pp. 39–44).
Georgila, K. (2009). Using integer linear programming for detecting speech disfluencies. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Companion Volume: Short Papers (pp. 109–112).
Georgila, K., Black, A. W., Sagae, K., & Traum, D. (2012). Practical evaluation of human and synthesized speech for virtual human dialogue systems. In Proceedings of the International Conference on Language Resources and Evaluation (pp. 3519–3526).
Georgila, K., Sgarbas, K., Tsopanoglou, A., Fakotakis, N., & Kokkinakis, G. (2003). A speech-based human–computer interaction system for automating directory assistance services. International Journal of Speech Technology (special issue on Speech and Human-Computer Interaction), 6(2), 145–159.
Georgila, K., Wolters, M., Moore, J. D., & Logie, R. H. (2010). The MATCH corpus: A corpus of older and younger users' interactions with spoken dialogue systems. Language Resources and Evaluation, 44(3), 221–261.
Hunt, A. J. & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373–376).
Imai, S., Sumita, K., & Furuichi, C. (1983). Mel log spectrum approximation (MLSA) filter for speech synthesis. Electronics and Communications in Japan, 66(2), 10–18.
Iskarous, K., Goldstein, L. M., Whalen, D. H., Tiede, M. K., & Rubin, P. E. (2003). CASY: The Haskins configurable articulatory synthesizer. In Proceedings of the International Congress of Phonetic Sciences (pp. 185–188).
Karaiskos, V., King, S., Clark, R. A. J., & Mayo, C. (2008). The Blizzard challenge 2008. In Proceedings of the Blizzard Challenge Workshop.
Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3–4), 187–207.
King, S. (2010). A tutorial on HMM speech synthesis. Sadhana – Academy Proceedings in Engineering Sciences, Indian Academy of Sciences.
Kishore, S. P. & Black, A. W. (2003). Unit size in unit selection speech synthesis. In Proceedings of the European Conference on Speech Communication and Technology (pp. 1317–1320).
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67(3), 971–995.
Kominek, J. & Black, A. W. (2004). The CMU ARCTIC speech databases. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 223–224).
Lemon, O., Georgila, K., Henderson, J., & Stuttle, M. (2006). An ISU dialogue system exhibiting reinforcement learning of dialogue policies: Generic slot-filling in the TALK in-car system. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL) – Demonstrations (pp. 119–122).
Ling, Z.-H., Richmond, K., Yamagishi, J., & Wang, R.-H. (2008). Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge. In Proceedings of the Annual Conference of the International Speech Communication Association (pp. 573–576).
Ling, Z.-H. & Wang, R.-H. (2006). HMM-based unit-selection using frame sized speech segments. In Proceedings of the International Conference on Spoken Language Processing (pp. 2034–2037).
Narayanan, S., Alwan, A., & Haker, K. (1997). Toward articulatory-acoustic models for liquid approximants based on MRI and EPG data: Part I, The laterals. Journal of the Acoustical Society of America, 101(2), 1064–1077.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), art. 10.
Pitrelli, J. F., Bakis, R., Eide, E. M., et al. (2006). The IBM expressive text-to-speech synthesis system for American English. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1099–1108.
Qin, L., Ling, Z.-H., Wu, Y.-J., Zhang, B.-F., & Wang, R.-H. (2006). HMM-based emotional speech synthesis using average emotion model. Lecture Notes in Computer Science, 4274, 233–240.
Sagisaka, Y., Kaiki, N., Iwahashi, N., & Mimura, K. (1992). ATR v-TALK speech synthesis system. In Proceedings of the International Conference on Spoken Language Processing (pp. 483–486).
Schultz, T., Black, A. W., Vogel, S., & Woszczyna, M. (2006). Flexible speech translation systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(2), 403–411.
Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013). Parsing with compositional vector grammars. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 455–465).
Stylianou, Y. (1999). Assessment and correction of voice quality variabilities in large speech databases for concatenative speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 377–380).
Sundaram, S. & Narayanan, S. (2002). Spoken language synthesis: Experiments in synthesis of spontaneous monologues. In Proceedings of the IEEE Speech Synthesis Workshop (pp. 203–206).
Székely, É., Cabral, J. P., Abou-Zleikha, M., Cahill, P., & Carson-Berndsen, J. (2012). Evaluating expressive speech synthesis from audiobooks in conversational phrases. In Proceedings of the International Conference on Language Resources and Evaluation (pp. 3335–3339).
Taylor, P. (2009). Text-to-Speech Synthesis. New York: Cambridge University Press.
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1315–1318).
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 173–180).
Von Kempelen, W. (1791). Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine. Vienna: J. V. Degen.
Werner, S., & Hoffmann, R. (2007). Spontaneous speech synthesis by pronunciation variant selection: A comparison to natural speech. In Proceedings of the Annual Conference of the International Speech Communication Association (pp. 1781–1784).
Yamagishi, J., Nose, T., Zen, H., et al. (2009). Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 17(6), 1208–1230.
Yamagishi, J., Usabaev, B., King, S., et al. (2010). Thousands of voices for HMM-based speech synthesis – analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, 18(5), 984–1004.
Yoshimura, T., Masuko, T., Tokuda, K., Kobayashi, T., & Kitamura, T. (1997). Speaker interpolation in HMM-based speech synthesis system. In Proceedings of the European Conference on Speech Communication and Technology (pp. 2523–2526).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1998). Duration modeling for HMM-based speech synthesis. In Proceedings of the International Conference on Spoken Language Processing (pp. 29–32).
Young, S., Evermann, G., Gales, M., et al. (2009). The HTK Book (for HTK version 3.4). Cambridge: Cambridge University Engineering Department.
Zen, H., Nose, T., Yamagishi, J., et al. (2007). The HMM-based speech synthesis system (HTS) version 2.0. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 294–299).
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
Zen, H., Tokuda, K., & Kitamura, T. (2007). Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech & Language, 21(1), 153–173.
Zhang, L., & Renals, S. (2008). Acoustic-articulatory modeling with the trajectory HMM. IEEE Signal Processing Letters, 15, 245–248.