
The Kestrel TTS text normalization system



This paper describes the Kestrel text normalization system, a component of the Google text-to-speech synthesis (TTS) system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is itself not new, Kestrel differs from previous systems in its separation of the initial tokenization and classification phase of analysis from verbalization. Input text is first tokenized and the resulting tokens are classified using WFSTs. As part of the classification, detected semiotic classes (expressions such as currency amounts, dates, times, and measure phrases) are parsed into protocol buffers. The protocol buffers are then verbalized, with possible reordering of the elements, again using WFSTs. This paper describes the architecture of Kestrel and the protocol buffer representations of semiotic classes, and presents some examples of grammars for various languages. We also discuss applications and deployments of Kestrel as part of the Google TTS system, which runs on both server and client side on multiple devices, and is used daily by millions of people in nineteen languages and counting.
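The two-phase design described in the abstract can be illustrated with a toy sketch. This is not Kestrel's implementation: a regular expression stands in for the classification WFSTs, a Python dataclass stands in for a protocol buffer message, and every name here (`Money`, `Word`, `classify`, `verbalize`, `normalize`) is hypothetical. It only shows the separation of tokenization/classification from verbalization, including the reordering step, for one semiotic class (small dollar amounts).

```python
import re
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-in for a protocol buffer message for a money expression.
@dataclass
class Money:
    currency: str
    amount: int


# Ordinary tokens that need no special verbalization.
@dataclass
class Word:
    text: str


Token = Union[Money, Word]

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]

# Toy classifier pattern (assumption): a dollar sign followed by 1-2 digits.
_MONEY = re.compile(r"^\$(\d{1,2})$")

_CURRENCY_NAMES = {"usd": ("dollar", "dollars")}


def number_name(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"


def classify(text: str) -> List[Token]:
    """Phase 1: tokenize on whitespace and parse money tokens into records."""
    tokens: List[Token] = []
    for raw in text.split():
        match = _MONEY.match(raw)
        if match:
            tokens.append(Money(currency="usd", amount=int(match.group(1))))
        else:
            tokens.append(Word(raw))
    return tokens


def verbalize(tokens: List[Token]) -> str:
    """Phase 2: turn structured records into words, reordering as needed."""
    out = []
    for token in tokens:
        if isinstance(token, Money):
            singular, plural = _CURRENCY_NAMES[token.currency]
            unit = singular if token.amount == 1 else plural
            # Reordering: the symbol precedes the amount in writing,
            # but the currency name follows the number in speech.
            out.append(f"{number_name(token.amount)} {unit}")
        else:
            out.append(token.text)
    return " ".join(out)


def normalize(text: str) -> str:
    return verbalize(classify(text))
```

For example, `normalize("It costs $25 today")` yields `"It costs twenty five dollars today"`. Because the intermediate `Money` record carries no surface order, a verbalizer for another language could place the currency name wherever that language requires without changing the classifier.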




