
Is there a bilingual disadvantage for word segmentation? A computational modeling approach

Published online by Cambridge University Press: 03 November 2021

Laia FIBLA*
Affiliation:
School of Psychology, The University of East Anglia, UK; Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d’études cognitives, ENS, EHESS, CNRS, PSL University, France
Nuria SEBASTIAN-GALLES
Affiliation:
Center for Brain and Cognition, Universitat Pompeu Fabra, Spain
Alejandrina CRISTIA
Affiliation:
Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d’études cognitives, ENS, EHESS, CNRS, PSL University, France
*Address for correspondence: Laia Fibla, School of Psychology, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK. E-mail: laia.fibla.reixachs@gmail.com

Abstract

Since there are no systematic pauses delimiting words in speech, the problem of word segmentation is formidable even for monolingual infants. We use computational modeling to assess whether word segmentation is substantially harder in a bilingual than a monolingual setting. Seven algorithms representing different cognitive approaches to segmentation are applied to transcriptions of naturalistic input to young children, carefully processed to generate perfectly matched monolingual and bilingual corpora. We vary the overlap in phonology and lexicon experienced by modeling exposure to languages that are more similar (Catalan and Spanish) or more different (English and Spanish). We find that the greatest variation in performance is due to different segmentation algorithms and the second greatest to language, with bilingualism having effects that are smaller than both algorithm and language effects. Implications of these computational results for experimental and modeling approaches to language acquisition are discussed.

Information

Type
Article
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press

Figure 1. Example of language switching every 100 utterances versus every other utterance for Spanish and Catalan. LA represents language A: in our study it could have been either Catalan or English (this example uses Catalan). LB represents language B: Spanish, in this case.
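
The two switching schedules in Figure 1 can be illustrated with a minimal sketch. It assumes each corpus is a plain list of utterances and that the bilingual corpus alternates equal-sized blocks from language A and language B; the function name `interleave` and the `block_size` parameter are hypothetical and not taken from the article.

```python
def interleave(lang_a, lang_b, block_size):
    """Alternate blocks of `block_size` utterances from two corpora.

    block_size=1 switches language every other utterance;
    block_size=100 switches every 100 utterances.
    Assumption: both corpora are truncated to the shorter length so
    that each language contributes the same number of utterances.
    """
    n = min(len(lang_a), len(lang_b))
    mixed = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        mixed.extend(lang_a[start:end])  # block from language A
        mixed.extend(lang_b[start:end])  # block from language B
    return mixed


cat = ["a1", "a2", "a3", "a4"]  # stand-ins for Catalan utterances
spa = ["b1", "b2", "b3", "b4"]  # stand-ins for Spanish utterances
every_other = interleave(cat, spa, block_size=1)
every_two = interleave(cat, spa, block_size=2)
```

With `block_size=1` the result alternates A, B, A, B; larger blocks keep longer same-language runs together.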

Figure 2. Phonologization and concatenation steps. LA represents language A: in our study, it could have been either Catalan or English. LB represents language B: Spanish, in this case.

Table 1. Properties of the Catalan “cat”, Spanish “spa”, English “eng”, and bilingual corpora. Utts indicates the number of utterances, PSWU the percentage of utterances that were single words, WPU indicates the mean number of words per utterance, Tokens and Types refer to words, and MATTR is Moving Average Type to Token Ratio (window size of 10 tokens).
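
As a rough illustration of the measures named in Table 1, the sketch below computes them for a toy corpus. It assumes each utterance is a list of word strings and that MATTR windows run over the whole token stream, crossing utterance boundaries; that windowing choice and the function name `corpus_stats` are assumptions, not details given in the article.

```python
def corpus_stats(utterances, window=10):
    """Return Utts, PSWU, WPU, Tokens, Types and MATTR for a corpus.

    `utterances` is a non-empty list of non-empty lists of words.
    """
    tokens = [word for utt in utterances for word in utt]
    n_utts = len(utterances)
    if len(tokens) < window:
        # Too few tokens for one full window: fall back to plain TTR.
        ratios = [len(set(tokens)) / len(tokens)]
    else:
        # Type/token ratio in every window of `window` consecutive tokens.
        ratios = [
            len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)
        ]
    return {
        "Utts": n_utts,
        "PSWU": 100 * sum(len(u) == 1 for u in utterances) / n_utts,
        "WPU": len(tokens) / n_utts,
        "Tokens": len(tokens),
        "Types": len(set(tokens)),
        "MATTR": sum(ratios) / len(ratios),
    }


toy = [["look"], ["the", "dog"], ["the", "dog", "the", "cat"]]
stats = corpus_stats(toy, window=3)  # small window for the toy corpus
```

The toy call uses a window of 3 only because the example has seven tokens; the caption reports a window of 10.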

Table 2. Summary of the segmentation algorithms included in this work by Name: “utt” utterance baseline, “syll” syllable baseline, “ag” Adaptor Grammar, “dibs” Diphone Based Segmentation, “tprel” transitional probabilities with relative threshold, “tpabs” transitional probabilities with absolute threshold. Type indicates the class of algorithm. Unit indicates how the corpus was unitized: “n/a” not applicable, “syll” boundaries can only be posited between syllables, “phon” boundaries can be posited between phones.
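
To make the two transitional-probability entries more concrete, here is a minimal sketch of one common formulation, which may differ from the implementation used in the article. It assumes forward TPs over syllables, an absolute threshold equal to the mean TP over syllable-pair types, and a relative threshold that posits a boundary at local TP minima; the function names and the treatment of utterance edges are hypothetical.

```python
from collections import Counter


def forward_tps(utterances):
    """Forward TP(a -> b) = count(a b) / count(a), over syllable lists."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        unigrams.update(utt)
        bigrams.update(zip(utt, utt[1:]))
    return {(a, b): n / unigrams[a] for (a, b), n in bigrams.items()}


def segment(utt, tps, mode, threshold=None):
    """Split one non-empty utterance (list of syllables) into words.

    mode="abs": boundary where the TP falls below `threshold`.
    mode="rel": boundary where the TP is lower than both neighbours
    (assumption: a missing neighbour at the edge counts as higher).
    """
    scores = [tps.get(pair, 0.0) for pair in zip(utt, utt[1:])]
    words, current = [], [utt[0]]
    for i, syll in enumerate(utt[1:]):
        if mode == "abs":
            boundary = scores[i] < threshold
        else:
            left = scores[i - 1] if i > 0 else float("inf")
            right = scores[i + 1] if i + 1 < len(scores) else float("inf")
            boundary = scores[i] < left and scores[i] < right
        if boundary:
            words.append(current)
            current = []
        current.append(syll)
    words.append(current)
    return words


corpus = [
    ["pre", "tty", "ba", "by"],
    ["pre", "tty", "do", "ggy"],
    ["ba", "by", "do", "ggy"],
]
tps = forward_tps(corpus)
mean_tp = sum(tps.values()) / len(tps)  # assumed absolute threshold
abs_words = segment(corpus[0], tps, "abs", mean_tp)
rel_words = segment(corpus[0], tps, "rel")
```

In this toy corpus the within-word TPs are 1.0 and the across-word TPs are 0.5, so both thresholds cut the first utterance between "tty" and "ba".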

Table 3. Token Precision and Recall for the 7 algorithms, in the 5 language conditions. The acronyms stand for “eng” English, “spa” Spanish, “cat” Catalan, “utt” utterance baseline, “syll” syllable baseline, “ag” Adaptor Grammar, “dibs” Diphone Based Segmentation, “tpabs” transitional probabilities with absolute threshold, “tprel” transitional probabilities with relative threshold. Only the performance switching every utterance is shown, with overall length matched across the monolingual and the bilingual conditions.
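
Token precision and recall, as reported in Table 3, can be sketched as follows. The sketch assumes the standard definition in which a predicted token counts as correct only when both of its edges coincide with the edges of a gold word; representing each word as a string with one character per phone, and the function names, are illustrative choices rather than details from the article.

```python
def spans(words):
    """Map a segmented utterance to the set of (start, end) word spans."""
    out, pos = set(), 0
    for word in words:
        out.add((pos, pos + len(word)))
        pos += len(word)
    return out


def token_precision_recall(gold_utts, pred_utts):
    """Corpus-level token precision and recall.

    precision = correct tokens / predicted tokens
    recall    = correct tokens / gold tokens
    """
    hits = n_gold = n_pred = 0
    for gold, pred in zip(gold_utts, pred_utts):
        g, p = spans(gold), spans(pred)
        hits += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    return precision, recall


gold = [["the", "dog", "runs"]]
pred = [["thedog", "runs"]]  # undersegmented: one boundary missed
precision, recall = token_precision_recall(gold, pred)
```

Here only "runs" matches a gold word exactly, so one of two predicted tokens is correct (precision 0.5) and one of three gold tokens is recovered (recall 1/3).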

Figure 3. Token F-scores per algorithm (see Table 2 for acronym explanation) and corpus: pink “e” for English, brown “es” for English–Spanish, blue “s” for Spanish, green “sc” for Spanish–Catalan, gold “c” for Catalan. Error bars indicate 2 standard deviations over 10 subparts of the relevant corpus.
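
The F-scores and error bars in Figure 3 can be approximated with the sketch below. It assumes the F-score is the harmonic mean of token precision and recall, that the 10 subparts are contiguous chunks of the corpus, and that the sample standard deviation is used; none of these details is stated in the caption, and the helper names are hypothetical.

```python
from statistics import mean, stdev


def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def split_into_parts(items, n_parts=10):
    """Split a list into `n_parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def error_bar(scores):
    """Return the mean score and a half-width of 2 standard deviations."""
    return mean(scores), 2 * stdev(scores)


example_f = f_score(0.5, 1 / 3)
parts = split_into_parts(list(range(25)), n_parts=10)
center, half_width = error_bar([0.4, 0.6])
```

In practice each subpart would be scored separately and the resulting per-part F-scores passed to `error_bar`.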