Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool

  • David R. Hill (a1), Craig R. Taube-Schock (a2) and Leonard Manzara (a1)

The authors have created a complete text-to-speech system based on a tube resonance model of the vocal tract. The tube is controlled by a development of Carré’s “Distinctive Region Model”, which is in turn based on the formant-sensitivity findings of Fant and Pauli (1974). Achieving this required significant long-term linguistic research, including rhythm and intonation studies, as well as the development of low-level articulatory data and rules to drive the model, together with the necessary tools, parsers, dictionaries and so on. The tools and the current system are available under a General Public License and are described here; the paper gives further references, samples of the speech produced, and figures illustrating the system description.

Un système de synthèse vocale complet a été créé par les auteurs, basé sur un modèle de résonance tubulaire du système vocal, et, pour contrôler le tube, sur un développement du modèle aux régions distinctes de René Carré, qui est à son tour basé sur les résultats de Fant and Pauli (1974) au sujet de la sensibilité des formants. Pour atteindre cet objectif, des recherches linguistiques à long terme ont été menées, y compris des études de rythme et d'intonation, ainsi que le développement de données articulatoires de bas niveau et de règles pour faire fonctionner le modèle, ainsi que les outils, les analyseurs syntaxiques, les dictionnaires, etc. Les outils et le système actuel sont disponibles sous une Licence Publique Générale; ils sont décrits ici. D'autres références figurent dans l'article, y compris des exemples de la parole synthétisée et des figures illustrant la description du système.


Numerous people have contributed support, research, and technical assistance. Individuals directly involved in the synthesizer work are listed at <>. Walter Lawrence, Betsy Uldall and David Abercrombie were early mentors for the first author. René Carré originated the basic DRM idea, based on Fant and Pauli's (1974) research. Dalmazio Brisinda and Steve Nygard ported the synthesis system to the Macintosh. Marcelo Matuda ported it to GNU/Linux GNUStep. The Canadian Natural Sciences and Engineering Research Council supported early work under grant A5261. Suggestions by three anonymous reviewers significantly improved the article.

Abercrombie, David. 1964. English phonetic texts. London: Faber and Faber.
Abercrombie, David. 1967. Elements of general phonetics. Edinburgh: Edinburgh University Press.
Allen, George D. 1972a. The location of rhythmic stress beats in English: An experimental study I. Language and Speech 15(1): 72–100.
Allen, George D. 1972b. The location of rhythmic stress beats in English: An experimental study II. Language and Speech 15(2): 179–95.
Allen, Jonathan, Hunnicutt, M. Sharon, and Klatt, Dennis. 1987. From text to speech: The MITalk system. Cambridge: Cambridge University Press.
Alleydog. 2016. Psychology class notes: Sensation and perception. <>. Accessed 2016-09-18.
Birkholz, Peter. 2013. Modeling consonant–vowel coarticulation for articulatory speech synthesis. PLOS ONE 8(4): e60603. <>. April 16, accessed 2015-01-24.
Boersma, Paul. 2001. PRAAT: Doing phonetics by computer. GLOT International 5(9/10): 341–347. <>. Accessed 2015-01-24.
Boersma, Paul and van Heuven, Vincent. n.d. Speak and unSpeak with PRAAT. <>. Accessed 2015-01-24.
Carré, René and Mrayati, M. 1992. Distinctive regions in acoustic tubes: Speech production modelling. Journal d'Acoustique 5: 141–159.
Cohen, Antonie and ’t Hart, Johan. 1968. On the anatomy of intonation. Lingua 19(1/2): 177–192.
Cook, Perry Raymond. 1990. Identification of control parameters in an articulatory vocal tract model, with applications to the synthesis of singing. Doctoral dissertation, Center for Computer Research on Music and Acoustics, Stanford University. <>. Accessed 2015-01-25.
Cooper, Frank S., Liberman, Alvin M., Borst, J. M., and Gerstman, Lou J. 1952. Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America 24(6): 597–606.
Crystal, David. 1972. The intonation system of English. In Intonation, ed. Bolinger, Dwight D., 110–135. London: Penguin Books.
Delattre, Pierre. 1969. Coarticulation and the locus theory. Studia Linguistica 23(1): 1–26.
Delattre, Pierre, Liberman, Alvin M., and Cooper, Frank S. 1955. Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27(4): 769–773.
van den Doel, Kees, Vogt, Florian, English, R. Elliot, and Fels, Sidney. 2006. Towards articulatory speech synthesis with a dynamic 3D finite element tongue model. In 7th international seminar on speech production, 59–66. Ubatuba, Brazil. <>. Accessed 2015-01-21.
Dudley, Homer. 1939. The vocoder. Bell Laboratories Record 17: 122–126.
Dudley, Homer, Riesz, R. R., and Watkins, S. A. 1939. A synthetic speaker. Journal of the Franklin Institute 227: 739–764.
Dusterhoff, Kurt E. 2000. Synthesizing fundamental frequency using models automatically trained from data. Doctoral dissertation, University of Edinburgh, Edinburgh.
Fant, C. Gunnar M. 1960. Acoustic theory of speech production: With calculations based on x-ray studies of Russian articulations. The Hague: Mouton.
Fant, C. Gunnar M. 1962. OVE II synthesis strategy. 1962 Stockholm Speech Communications Seminar, paper F5.
Fant, C. Gunnar M. and Pauli, S. 1974. Spatial characteristics of vocal tract resonance models. Tech. Rep., KTH, Stockholm. Proceedings of the Stockholm Communication Seminar.
Fels, Sidney, Vogt, Florian, van den Doel, Kees, Lloyd, John E., Stavness, Ian, and Vatikiotis-Bateson, Eric. 2006. Artisynth: A biomechanical simulation platform for the vocal tract and upper airway. Technical Report TR-2006-10, Computer Science Department, University of British Columbia, Vancouver.
Flanagan, James L. 1972. Speech analysis, synthesis, and perception. Berlin: Springer Verlag.
Green, Peter S. 1958. Consonant–vowel transitions: A spectrographic study. Studia Linguistica 12(2): 57–105.
Halliday, Michael A. K. 1970. A course in spoken English: Intonation. Oxford: Oxford University Press.
’t Hart, Johan, Collier, Ren, and Cohen, Antonie. 1990. A perceptual study of intonation. Cambridge: Cambridge University Press.
Haskins. n.d. Haskins laboratory publications. <>.
Hill, David. 1972. A basis for model building and learning in automatic speech pattern discrimination. Presented at the Machine Perception of Patterns and Pictures Conference No. 13, Institute of Physics, London.
Hill, David. 1978. A program structure for event-based speech synthesis by rules within a flexible segmental framework. International Journal of Man-Machine Studies 10(3): 285–294.
Hill, David, Jassem, Wiktor, and Witten, Ian H. 1979. A statistical approach to the problem of isochrony in spoken British English. In Current issues in linguistic theory, ed. Hollien, Harry and Hollien, Patricia, vol. 9, 285–294. Amsterdam: John Benjamins.
Hill, David R. and Reid, Neal. 1977. An experiment on the perception of intonational features. International Journal of Man-Machine Studies 9(2): 337–347.
Hill, David R., Witten, Ian H., and Jassem, Wiktor. 1977. Some results from a preliminary study of British English speech rhythm. Presented at the 94th meeting of the Acoustical Society of America. <>. Accessed 2016-09-26.
Hoffman, Howard S. 1958. Study of some cues in the perception of the voiced stop consonants. Journal of the Acoustical Society of America 30(11): 1035–1041.
Holmes, John N., Mattingly, Ignatius G., and Shearme, John N. 1965. Speech synthesis by rules. Language and Speech 7(3): 127–143.
Jassem, Wiktor. 1962. Noise spectra of Swedish, English, and Polish fricatives. Fourth International Congress on Acoustics, Copenhagen, paper G17.
Jassem, Wiktor. 1965. The formants of fricative consonants. Language and Speech 8(1): 1–16.
Jassem, Wiktor, Hill, David R., and Witten, Ian H. 1984. Isochrony in English speech: Its statistical validity and linguistic relevance. In Pattern, process and function in discourse phonology, ed. Gibbon, Davydd, 203–225. Berlin: de Gruyter.
von Kempelen, W. 1791. Le mécanisme de la parole, suivi de la déscription d'une machine parlante. Vienna: J. V. Degen.
Koenig, W., Dunn, H. K., and Lacy, L. Y. 1946. The sound spectrograph. Journal of the Acoustical Society of America 18(1): 19.
Kratzenstein, C. G. 1782. Sur la naissance de la formation des voyelles. Journal of Physics 21: 358–380.
Kuhl, Patricia K. 2000. A new view of language acquisition. Proceedings of the National Academy of Sciences 97(22): 11850–11857.
Kuhl, Patricia K., Conboy, Barbara T., Padden, Denise, Nelson, Tobey, and Pruitt, Jessica. 2005. Early speech perception and later language development: Implications for the “critical period”. Language Learning and Development 1(3/4): 237–264.
Ladefoged, Peter and Broadbent, Donald E. 1957. Information conveyed by vowels. Journal of the Acoustical Society of America 29(1): 98–104.
Lawrence, Walter. 1953. The synthesis of speech from signals which have a low information rate. In Communication theory, ed. Jackson, Willis, chap 34. London: Butterworths.
Liberman, Alvin M., Ingemann, Frances, Lisker, Leigh, Delattre, Pierre, and Cooper, Frank S. 1959. Minimal rules for synthesizing speech. Journal of the Acoustical Society of America 31(11): 1490–1499.
van Lieshout, Pascal. 2003. PRAAT short tutorial: A basic introduction. University of Toronto, Graduate Department of Speech-Language Pathology, Faculty of Medicine, Oral Dynamics Lab. <>. Accessed 2015-01-24.
Lisker, Leigh. 1957. Minimal cues for separating /w, r, l, y/ in intervocalic position. Word 13(2): 256–267.
Manzara, Leonard. 2005. The tube resonance model speech synthesizer. Presented at the 149th Meeting of the Acoustical Society of America/Canadian Acoustical Association, Vancouver. <>. Accessed 2016-09-19.
McCullough, Gretchen. 2014. When your eyes hear better than your ears: The McGurk effect. <>.
McGurk, Harry and MacDonald, John. 1976. Hearing lips and seeing voices. Nature 264(5588): 746–748.
O'Connor, Joseph D., Gerstman, I. J., Liberman, Alvin M., Delattre, Pierre C., and Cooper, Frank S. 1957. Acoustic cues for the perception of initial /w, j, r, l/ in English. Word 13(1): 24–43.
O'Shaughnessey, D. 1977. Fundamental frequency by rule for a text-to-speech system. In Proceedings of the international conference on acoustics, speech, and signal processing, 571–574. New York: IEEE.
Palmer, Harold E. and Palmer, Dorothée. 1959. English through actions. London: Longmans Green. [1925; reprint ed. Ralph Cook].
de Pijper, Jan R. 1983. Modelling British English intonation. Dordrecht: Foris Publications.
Pike, Kenneth L. 1945. The intonation of American English. Ann Arbor: University of Michigan Press.
Potter, Ralph, Kopp, George A., and Kopp, Harriet Green. 1966. Visible speech. New York: Dover Publications. [1947. Murray Hill, NJ: Bell Telephone Laboratories].
Shearme, John N. and Holmes, John N. 1962. An experimental study of the classification of sounds in continuous speech according to their distribution in the formant 1–formant 2 plane. In Proceedings of the 4th international congress of phonetic sciences, Helsinki 1961. The Hague: Mouton.
Stevens, Ken N. 1968. On the relations between speech movements and speech perception. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 21: 102–106.
Story, Brad H. 2005. Physiologically-based speech simulation using an enhanced wave-reflection model of the vocal tract. Doctoral dissertation, University of Iowa.
Story, Brad H. 2013. Phrase-level speech simulation with an airway modulation model of speech production. Computer Speech and Language 27(4): 989–1010. Accompanying speech samples at <>. Accessed 2016-09-23.
Story, Brad H. and Bunton, K. 2011. Decomposition of vowel and consonant contributions to the time-varying vocal tract shape. Journal of the Acoustical Society of America 129(4): 2456.
Strevens, Peter. 1960. Spectra of fricative noise in human speech. Language and Speech 3(3): 32–49.
Strevens, Peter. 1961. Sibilant sounds of speech. The Dental Practitioner 11(11): 368–378.
Taube-Schock, Craig. 1993. Synthesizing intonation for computer speech output. Master's thesis, Department of Computer Science, University of Calgary, Calgary.
Taylor, Paul. 2009. Text-to-speech synthesis. Cambridge: Cambridge University Press.
Uldall, Elizabeth. 1964. Transitions in fricative noise. Language and Speech 7(1): 13–15.
Wells, John C. 1963. A study of the formants of the pure vowels of British English. Department of phonetics progress report, University College, London, London.
Willems, Nico, Collier, Ren, and ’t Hart, Johan. 1988. A synthesis scheme for British English Intonation. Journal of the Acoustical Society of America 84(4): 1250–1261.
Witten, Ian H. 1977. A flexible scheme for assigning timing and pitch to synthetic speech. Language and Speech 20(3): 240–260.
Yamagishi, Junichi, Richmond, Korin, King, Simon, and many others [sic]. 2007. Hidden Markov model-based speech synthesis. Ms., Centre for Speech Technology Research, University of Edinburgh. Available at <>. Accessed 2015-02-18.
Canadian Journal of Linguistics/Revue canadienne de linguistique
  • ISSN: 0008-4131
  • EISSN: 1710-1115
  • URL: /core/journals/canadian-journal-of-linguistics-revue-canadienne-de-linguistique