
The cross-linguistic performance of word segmentation models over time

  • Andrew CAINES (a1), Emma ALTMANN-RICHER (a2) and Paula BUTTERY (a1)


We select three word segmentation models with psycholinguistic foundations – transitional probabilities, the diphone-based segmenter, and PUDDLE – which track phoneme co-occurrence and positional frequencies in input strings and, in the case of PUDDLE, build lexical and diphone inventories. The models are evaluated on caregiver utterances in 132 CHILDES corpora representing 28 languages and 11.9 million words. PUDDLE shows the best performance overall, albeit with wide cross-linguistic variation. We explore the reasons for this variation, fitting regression models to performance scores with linguistic properties which capture lexico-phonological characteristics of the input: word length, utterance length, diversity in the lexicon, the frequency of one-word utterances, the regularity of phoneme patterns at word boundaries, and the distribution of diphones in each language. These properties together explain four-tenths of the observed variation in segmentation performance, a strong outcome and a solid foundation for studying further variables which make the segmentation task difficult.
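As an illustration of the first model family named above, the following is a minimal sketch of segmentation by forward transitional probabilities. It is not the implementation evaluated in the article: the function names `train_tp` and `segment`, the toy corpus, and the fixed absolute threshold of 0.75 are all assumptions made for the example.

```python
from collections import Counter


def train_tp(utterances):
    """Estimate forward transitional probabilities P(b | a) from
    unsegmented phoneme sequences (hypothetical helper)."""
    unigrams = Counter()
    bigrams = Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            unigrams[a] += 1          # count a only where it has a successor
            bigrams[(a, b)] += 1
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}


def segment(utt, tp, threshold):
    """Posit a word boundary wherever the transitional probability between
    adjacent phonemes falls below `threshold` (an assumed, fixed cut-off)."""
    if not utt:
        return []
    words = [[utt[0]]]
    for a, b in zip(utt, utt[1:]):
        if tp.get((a, b), 0.0) < threshold:
            words.append([b])         # low predictability: start a new word
        else:
            words[-1].append(b)       # high predictability: extend the word
    return ["".join(w) for w in words]


# Toy corpus: each character stands in for one phoneme.
corpus = ["abcd", "cdab", "abab", "cdcd"]
tp = train_tp(corpus)
print(segment("abcd", tp, threshold=0.75))  # ['ab', 'cd']
```

In this toy corpus the pairs `ab` and `cd` always occur together (probability 1.0), whereas transitions across them are less predictable (0.5), so boundaries fall between the two units. Real child-directed input is far noisier, and the choice of threshold would need to be justified empirically.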


Corresponding author

*Corresponding author: Department of Computer Science & Technology, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK. E-mail:





