
Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool¹

Published online by Cambridge University Press:  21 June 2017

David R. Hill*
Affiliation:
University of Calgary, Dept. of Computer Science
Craig R. Taube-Schock
Affiliation:
Waikato University, Dept. of Computer Science
Leonard Manzara
Affiliation:
University of Calgary, Dept. of Computer Science

Abstract

The authors have created a complete text-to-speech system based on a tube resonance model of the vocal tract. The tube is controlled by a development of Carré’s “Distinctive Region Model”, which is in turn based on the formant-sensitivity findings of Fant and Pauli (1974). Achieving this goal has involved significant long-term linguistic research, including rhythm and intonation studies, as well as the development of low-level articulatory data and rules to drive the model, together with the necessary tools, parsers, dictionaries and so on. The tools and the current system are available under a General Public License and are described here; the paper gives further references, samples of the speech produced, and figures illustrating the system description.

Résumé

Un système de synthèse vocale complet a été créé par les auteurs, basé sur un modèle de résonance tubulaire du système vocal, et, pour contrôler le tube, sur un développement du modèle aux régions distinctes de René Carré, qui est à son tour basé sur les résultats de Fant and Pauli (1974) au sujet de la sensibilité des formants. Pour atteindre cet objectif, des recherches linguistiques à long terme ont été menées, y compris des études de rythme et d'intonation, ainsi que le développement de données articulatoires de bas niveau et de règles pour faire fonctionner le modèle, ainsi que les outils, les analyseurs syntaxiques, les dictionnaires, etc. Les outils et le système actuel sont disponibles sous une Licence Publique Générale; ils sont décrits ici. D'autres références figurent dans l'article, y compris des exemples de la parole synthétisée et des figures illustrant la description du système.

Type
Articles
Copyright
© Canadian Linguistic Association/Association canadienne de linguistique 2017 

Footnotes

1

Numerous people have contributed support, research, and technical assistance. Individuals directly involved in the synthesizer work are listed at <http://www.gnu.org/software/gnuspeech>. Walter Lawrence, Betsy Uldall and David Abercrombie were early mentors for the first author. René Carré originated the basic DRM idea, based on Fant and Pauli's (1974) research. Dalmazio Brisinda and Steve Nygard ported the synthesis system to the Macintosh. Marcelo Matuda ported it to GNU/Linux GNUStep. The Canadian Natural Sciences and Engineering Research Council supported early work under grant A5261. Suggestions by three anonymous reviewers significantly improved the article.

References

Abercrombie, David. 1964. English phonetic texts. London: Faber and Faber.
Abercrombie, David. 1967. Elements of general phonetics. Edinburgh: Edinburgh University Press.
Allen, George D. 1972a. The location of rhythmic stress beats in English: An experimental study I. Language and Speech 15(1): 72–100.
Allen, George D. 1972b. The location of rhythmic stress beats in English: An experimental study II. Language and Speech 15(2): 179–95.
Allen, Jonathan, Hunnicutt, M. Sharon, and Klatt, Dennis. 1987. From text to speech: The MITalk system. Cambridge: Cambridge University Press.
Alleydog. 2016. Psychology class notes: Sensation and perception. <http://ww.alleydog.com/101notes/s&p.html>. Accessed 2016-09-18.
Birkholz, Peter. 2013. Modeling consonant–vowel coarticulation for articulatory speech synthesis. PLOS ONE 8(4): e60603. <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3628899/>. April 16, accessed 2015-01-24.
Boersma, Paul. 2001. PRAAT: Doing phonetics by computer. GLOT International 5(9/10): 341–347. <http://www.fon.hum.uva.nl/praat/>. Accessed 2015-01-24.
Boersma, Paul and van Heuven, Vincent. n.d. Speak and unSpeak with PRAAT. <http://www.fon.hum.uva.nl/paul/papers/speakUnspeakPraatglot2001.pdf>. Accessed 2015-01-24.
Carré, René and Mrayati, M. 1992. Distinctive regions in acoustic tubes: Speech production modelling. Journal d'Acoustique 5: 141–159.
Cohen, Antonie and ’t Hart, Johan. 1968. On the anatomy of intonation. Lingua 19(1/2): 177–192.
Cook, Perry Raymond. 1990. Identification of control parameters in an articulatory vocal tract model, with applications to the synthesis of singing. Doctoral dissertation, Center for Computer Research on Music and Acoustics, Stanford University. <https://ccrma.stanford.edu/files/papers/stanm68.pdf>. Accessed 2015-01-25.
Cooper, Frank S., Liberman, Alvin M., Borst, J. M., and Gerstman, Lou J. 1952. Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America 24(6): 597–606.
Crystal, David. 1972. The intonation system of English. In Intonation, ed. Bolinger, Dwight D., 110–135. London: Penguin Books.
Delattre, Pierre. 1969. Coarticulation and the locus theory. Studia Linguistica 23(1): 1–26.
Delattre, Pierre, Liberman, Alvin M., and Cooper, Frank S. 1955. Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27(4): 769–773.
van den Doel, Kees, Vogt, Florian, English, R. Elliot, and Fels, Sidney. 2006. Towards articulatory speech synthesis with a dynamic 3D finite element tongue model. In 7th international seminar on speech production, 59–66. Ubatuba, Brazil. <http://www.cs.ubc.ca/~kvdoel/publications/ta.pdf>. Accessed 2015-01-21.
Dudley, Homer. 1939. The vocoder. Bell Laboratories Record 17: 122–126.
Dudley, Homer, Riesz, R. R., and Watkins, S. A. 1939. A synthetic speaker. Journal of the Franklin Institute 227: 739–764.
Dusterhoff, Kurt E. 2000. Synthesizing fundamental frequency using models automatically trained from data. Doctoral dissertation, University of Edinburgh, Edinburgh.
Fant, C. Gunnar M. 1960. Acoustic theory of speech production: With calculations based on x-ray studies of Russian articulations. The Hague: Mouton.
Fant, C. Gunnar M. 1962. OVE II synthesis strategy. 1962 Stockholm Speech Communications Seminar, paper F5.
Fant, C. Gunnar M. and Pauli, S. 1974. Spatial characteristics of vocal tract resonance models. Tech. Rep., KTH, Stockholm. Proceedings of the Stockholm Communication Seminar.
Fels, Sidney, Vogt, Florian, van den Doel, Kees, Lloyd, John E., Stavness, Ian, and Vatikiotis-Bateson, Eric. 2006. Artisynth: A biomechanical simulation platform for the vocal tract and upper airway. Technical Report TR-2006-10, Computer Science Department, University of British Columbia, Vancouver.
Flanagan, James L. 1972. Speech analysis, synthesis, and perception. Berlin: Springer Verlag.
Green, Peter S. 1958. Consonant–vowel transitions: A spectrographic study. Studia Linguistica 12(2): 57–105.
Halliday, Michael A. K. 1970. A course in spoken English: Intonation. Oxford: Oxford University Press.
’t Hart, Johan, Collier, Ren, and Cohen, Antonie. 1990. A perceptual study of intonation. Cambridge University Press.
Haskins. n.d. Haskins laboratory publications. <http://www.haskins.yale.edu/pubs.html>.
Hill, David. 1972. A basis for model building and learning in automatic speech pattern discrimination. Presented at the Machine Perception of Patterns and Pictures Conference No. 13, Institute of Physics, London.
Hill, David. 1978. A program structure for event-based speech synthesis by rules within a flexible segmental framework. International Journal of Man-Machine Studies 10(3): 285–294.
Hill, David, Jassem, Wiktor, and Witten, Ian H. 1979. A statistical approach to the problem of isochrony in spoken British English. In Current issues in linguistic theory, ed. Hollien, Harry and Hollien, Patricia, vol. 9, 285–294. Amsterdam: John Benjamins.
Hill, David R. and Reid, Neal. 1977. An experiment on the perception of intonational features. International Journal of Man-Machine Studies 9(2): 337–347.
Hill, David R., Witten, Ian H., and Jassem, Wiktor. 1977. Some results from a preliminary study of British English speech rhythm. Presented at the 94th meeting of the Acoustical Society of America. <http://pages.cpsc.ucalgary.ca/~hill/papers/>. Accessed 2016-09-26.
Hoffman, Howard S. 1958. Study of some cues in the perception of the voiced stop consonants. Journal of the Acoustical Society of America 30(11): 1035–1041.
Holmes, Jon N., Mattingly, Ignatius G., and Shearme, John N. 1965. Speech synthesis by rules. Language and Speech 7(3): 127–143.
Jassem, Wiktor. 1962. Noise spectra of Swedish, English, and Polish fricatives. Fourth International Congress on Acoustics, Copenhagen, paper G17.
Jassem, Wiktor. 1965. The formants of fricative consonants. Language and Speech 8(1): 1–16.
Jassem, Wiktor, Hill, David R., and Witten, Ian H. 1984. Isochrony in English speech: Its statistical validity and linguistic relevance. In Pattern, process and function in discourse phonology, ed. Gibbon, Davydd, 203–225. Berlin: de Gruyter.
von Kempelen, W. 1791. Le mécanisme de la parole, suivi de la déscription d'une machine parlante. Vienna: J. V. Degen.
Koenig, W., Dunn, H. K., and Lacy, L. Y. 1946. The sound spectrograph. Journal of the Acoustical Society of America 18(1): 19.
Kratzenstein, C. G. 1782. Sur la naissance de la formation des voyelles. Journal of Physics 21: 358–380.
Kuhl, Patricia K. 2000. A new view of language acquisition. Proceedings of the National Academy of Sciences 97(22): 11850–11857.
Kuhl, Patricia K., Conboy, Barbara T., Padden, Denise, Nelson, Tobey, and Pruitt, Jessica. 2005. Early speech perception and later language development: Implications for the “critical period”. Language Learning and Development 1(3/4): 237–264.
Ladefoged, Peter and Broadbent, Donald E. 1957. Information conveyed by vowels. Journal of the Acoustical Society of America 29(1): 98–104.
Lawrence, Walter. 1953. The synthesis of speech from signals which have a low information rate. In Communication theory, ed. Jackson, Willis, chap. 34. London: Butterworths.
Liberman, Alvin M., Ingemann, Frances, Lisker, Leigh, Delattre, Pierre, and Cooper, Frank S. 1959. Minimal rules for synthesizing speech. Journal of the Acoustical Society of America 31(11): 1490–1499.
van Lieshout, Pascal. 2003. PRAAT short tutorial: A basic introduction. University of Toronto, Graduate Department of Speech-Language Pathology, Faculty of Medicine, Oral Dynamics Lab. <http://web.stanford.edu/dept/linguistics/corpora/material/PRAATworkshopmanualv421.pdf>. Accessed 2015-01-24.
Lisker, Leigh. 1957. Minimal cues for separating /w, r, l, y/ in intervocalic position. Word 13(2): 256–267.
Manzara, Leonard. 2005. The tube resonance model speech synthesizer. Presented at the 149th Meeting of the Acoustical Society of America/Canadian Acoustical Association, Vancouver. <https://www.researchgate.net/publication/228877073TheTubeResonanceModelSpeechSynthesizer>. Accessed 2016-09-19.
McCullough, Gretchen. 2014. When your eyes hear better than your ears: The McGurk effect. <http://tinyurl.com/lqbwzjb>.
McGurk, Harry and MacDonald, John. 1976. Hearing lips and seeing voices. Nature 264(5588): 746–748.
O'Connor, Joseph D., Gerstman, I. J., Liberman, Alvin M., Delattre, Pierre C., and Cooper, Frank S. 1957. Acoustic cues for the perception of initial /w, j, r, l/ in English. Word 13(1): 24–43.
O'Shaughnessey, D. 1977. Fundamental frequency by rule for a text-to-speech system. In Proceedings of the international conference on acoustics, speech, and signal processing, 571–574. New York: IEEE.
Palmer, Harold E. and Palmer, Dorothée. 1959. English through actions. London: Longmans Green. [1925; reprint ed. Ralph Cook].
de Pijper, Jan R. 1983. Modelling British English intonation. Dordrecht: Foris Publications.
Pike, Kenneth L. 1945. The intonation of American English. Ann Arbor: University of Michigan Press.
Potter, Ralph, Kopp, George A., and Kopp, Harriet Green. 1966. Visible speech. New York: Dover Publications. [1947. Murray Hill, NJ: Bell Telephone Laboratories].
Shearme, John N. and Holmes, John N. 1962. An experimental study of the classification of sounds in continuous speech according to their distribution in the formant 1–formant 2 plane. In Proceedings of the 4th international congress of phonetic sciences, Helsinki 1961. The Hague: Mouton.
Stevens, Ken N. 1968. On the relations between speech movements and speech perception. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 21: 102–106.
Story, Brad H. 2005. Physiologically-based speech simulation using an enhanced wave-reflection model of the vocal tract. Doctoral dissertation, University of Iowa.
Story, Brad H. 2013. Phrase-level speech simulation with an airway modulation model of speech production. Computer Speech and Language 27(4): 989–1010. Accompanying speech samples at <http://sal-slhs.webhost.uits.arizona.edu/node/30>. Accessed 2016-09-23.
Story, Brad H. and Bunton, K. 2011. Decomposition of vowel and consonant contributions to the time-varying vocal tract shape. Journal of the Acoustical Society of America 129(4): 2456.
Strevens, Peter. 1960. Spectra of fricative noise in human speech. Language and Speech 3(3): 32–49.
Strevens, Peter. 1961. Sibilant sounds of speech. The Dental Practitioner 11(11): 368–378.
Taube-Schock, Craig. 1993. Synthesizing intonation for computer speech output. Master's thesis, Department of Computer Science, University of Calgary, Calgary.
Taylor, Paul. 2009. Text-to-speech synthesis. Cambridge: Cambridge University Press.
Uldall, Elizabeth. 1964. Transitions in fricative noise. Language and Speech 7(1): 13–15.
Wells, John C. 1963. A study of the formants of the pure vowels of British English. Department of phonetics progress report, University College, London, London.
Willems, Nico, Collier, Ren, and ’t Hart, Johan. 1988. A synthesis scheme for British English Intonation. Journal of the Acoustical Society of America 84(4): 1250–1261.
Witten, Ian H. 1977. A flexible scheme for assigning timing and pitch to synthetic speech. Language and Speech 20(3): 240–260.
Yamagishi, Junichi, Richmond, Korin, King, Simon, and many others [sic]. 2007. Hidden Markov model-based speech synthesis. Ms., Centre for Speech Technology Research, University of Edinburgh. Available at <http://homepages.inf.ed.ac.uk/ckiw/rpml/HMMspeechsynthesis.pdf>. Accessed 2015-02-18.