
Word segmentation from transcriptions of child-directed speech using lexical and sub-lexical cues

Published online by Cambridge University Press:  12 September 2023

Zébulon GORIELY*, Andrew CAINES and Paula BUTTERY
Affiliation: Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
*Corresponding author: Zébulon Goriely; Email: zg258@cam.ac.uk

Abstract

We compare two frameworks for the segmentation of words in child-directed speech, PHOCUS and MULTICUE. PHOCUS is driven by lexical recognition, whereas MULTICUE combines sub-lexical properties to make boundary decisions, representing differing views of speech processing. We replicate these frameworks, perform novel benchmarking and confirm that both achieve competitive results. We then develop a new framework for segmentation, the DYnamic Programming MULTIple-cue framework (DYMULTI), which combines the strengths of PHOCUS and MULTICUE by considering both sub-lexical and lexical cues when making boundary decisions. DYMULTI achieves state-of-the-art results and outperforms PHOCUS and MULTICUE on 15 of 26 languages in a cross-lingual experiment. As DYMULTI is built on psycholinguistic principles, these results validate it as a robust model of speech segmentation and a contribution to the understanding of language acquisition.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Two indicators following the partial-peak strategy, segmenting the phrase is that a kitty. One segments at an increase in transitional probability and the other segments at a decrease. Letters are used instead of phonemes for clarity.
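The partial-peak strategy in this figure can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the local-minimum boundary rule and the toy corpus below are hypothetical, and the study's increase and decrease indicators are modelled separately rather than as the single decrease rule shown here.

```python
from collections import Counter

def transitional_probs(corpus):
    """Estimate forward transitional probabilities P(b | a) from a list of
    unsegmented utterances (strings of symbols)."""
    bigrams, unigrams = Counter(), Counter()
    for utt in corpus:
        for a, b in zip(utt, utt[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return {ab: n / unigrams[ab[0]] for ab, n in bigrams.items()}

def segment_at_decrease(utt, tp):
    """Insert a boundary wherever the transitional probability dips below
    both of its neighbours (a local minimum), one half of the
    partial-peak idea."""
    probs = [tp.get((a, b), 0.0) for a, b in zip(utt, utt[1:])]
    out = [utt[0]]
    for i in range(1, len(utt)):
        # boundary between utt[i-1] and utt[i] if TP here is a local minimum
        left = probs[i - 2] if i >= 2 else float("inf")
        right = probs[i] if i < len(probs) else float("inf")
        if probs[i - 1] < left and probs[i - 1] < right:
            out.append(" ")
        out.append(utt[i])
    return "".join(out)
```

The mirror-image indicator, segmenting at increases, would simply flip the comparison.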

Table 1. Summary of Models Implemented in this Study

Figure 2. An example of PHOCUS-1S segmenting the utterance andadoggy, using letters as our discrete symbolic unit instead of phonemes for clarity.
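The dynamic-programming search behind this kind of lexicon-driven segmenter can be sketched as follows. This is an illustrative sketch, not PHOCUS's actual scoring: the toy word-score function and its small lexicon are our own assumptions.

```python
import math

def best_segmentation(utt, word_score):
    """Find the highest-scoring segmentation of `utt` by dynamic programming.
    best[i] holds the best log-score over all segmentations of the prefix
    utt[:i]; back[i] records where the last word of that segmentation starts."""
    n = len(utt)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            score = best[j] + math.log(word_score(utt[j:i]))
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n  # follow back-pointers to recover the words
    while i > 0:
        words.append(utt[back[i]:i])
        i = back[i]
    return words[::-1]

def toy_score(word, lexicon=frozenset({"and", "a", "doggy"})):
    """Hypothetical word score: familiar words get a fixed score; unfamiliar
    ones a penalty that grows with length, favouring known-word parses."""
    return 0.3 if word in lexicon else 0.01 ** len(word)
```

With this toy score, `best_segmentation("andadoggy", toy_score)` recovers `["and", "a", "doggy"]`, mirroring the segmentation in the figure.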

Figure 3. An example of DYMULTI segmenting the utterance andadoggy, using letters as our discrete symbolic unit instead of phonemes for clarity.
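One way to picture how lexical and sub-lexical information might be blended at each boundary decision is the sketch below. The α-weighted sum, the vote representation and all names here are our own illustrative assumptions, not DYMULTI's actual scoring function.

```python
import math

def blended_segmentation(utt, word_score, votes, alpha=0.5):
    """DP segmentation in which each candidate word utt[j:i] is scored by a
    weighted sum of a lexical log-score and a sub-lexical boundary vote,
    where votes[i - 1] stands for the fraction of cues favouring a boundary
    after position i - 1 (a hypothetical encoding)."""
    n = len(utt)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            lexical = math.log(word_score(utt[j:i]))
            score = best[j] + alpha * lexical + (1 - alpha) * votes[i - 1]
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:
        words.append(utt[back[i]:i])
        i = back[i]
    return words[::-1]
```

Under this sketch, α = 1 reduces to a purely lexical segmenter and α = 0 lets only the cue votes drive boundary placement, matching the α sweep reported in the figures below.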

Figure 4. The expanded word score function for DYMULTI with the lexical recognition process.

Table 2. First Five Utterances in the BR Corpus

Table 3. Comparison of Reimplemented MULTICUE Models on the BR Corpus

Table 4. Comparison of Reimplemented PHOCUS Models on the BR Corpus

Table 5. Result of a Pairwise Student’s t-test Comparing F1-scores of Segmentation Models

Table 6. Comparison of MULTICUE and DYMULTI Models on the BR Corpus

Figure 5. Word and Lexicon F1-scores (WF, LF) for a selection of the models implemented in this study, calculated over blocks of 200 utterances. Scores are calculated by running each model separately on ten shuffles of the BR corpus and averaging results.

Figure 6. Word and Lexicon F1-scores (WF, LF) for four models using three sets of indicators. MULTICUE-14, MULTICUE-17 and MULTICUE-23 are compared to three DYMULTI models using the same sets of indicators, setting $\alpha = 0, 0.5, 1$. Scores are calculated by running each model separately on ten shuffles of the BR corpus and averaging results.

Figure 7. Word and Lexicon F1-scores (WF, LF) for MULTICUE-23 and DYMULTI-23, setting $\alpha = 0, 0.5, 1$. Scores are calculated by running each model separately on ten different shuffles of the BR corpus and averaging results.

Table 7. Comparison of Computational Models for Word Segmentation

Figure 8. Lexicon F1-scores (LF) for PHOCUS-1S, MULTICUE-17 and DYMULTI-23 with $\alpha = 0$ compared across 26 languages, sorted by LF scores for DYMULTI-23. Scores are calculated by running each model separately on ten shuffles of each transcript and averaging results over the last 5000 utterances of each run, accounting for differing initial learning rates.

Figure 9. The Lexicon F1-scores (LF) that DYMULTI-23 achieves for each language. Languages are grouped by family and LF scores are calculated using 10 runs over different shuffles of each transcript.

Figure 10. The 10 structural language features that best predict DYMULTI-23 LF score. Each point in a row is a language, with the x-value giving the average DYMULTI-23 LF score achieved for that language across 10 runs. Each point is marked with a cross if the language contains the corresponding feature. Languages are grouped by family.

Figure 11. The 10 structural language features that best predict DYMULTI-23 LF score. Each point in a row is a language, with the x-value giving the average DYMULTI-23 LF score achieved for that language across 10 runs. Each point is marked with a cross if the language contains the corresponding feature. Languages are grouped by family, considering only Germanic, Balto-Slavic and Italic languages.

Figure 12. 3-gram conditional entropies of each language transcript used in the study. Languages are grouped by family and are plotted against the Lexicon F1-scores (LF) that DYMULTI-23 achieves in 10 runs over different shuffles of the transcript for that language.

Figure 13. Correlation scores between information-theoretic measures and the average F1-scores that DYMULTI-23 achieves for each language across 10 runs.