1. Introduction
Child-directed speech (CDS), characterised by simplified sentences (Newport et al., Reference Newport, Gleitman and Gleitman2020), exaggerated prosody (Fernald & Simon, Reference Fernald and Simon1984), enhanced acoustic features, and greater variability (Huttenlocher et al., Reference Huttenlocher, Waterfall, Vasilyeva, Vevea and Hedges2010), has been suggested to facilitate early language development. This claim is supported by a body of experimental research demonstrating that infants engaged with CDS exhibit improved performance in various linguistic tasks such as speech sound discrimination (Karzon, Reference Karzon1985; Liu et al., Reference Liu, Kuhl and Tsao2003), word recognition (Singh et al., Reference Singh, Nestor, Parikh and Yull2009), and word learning (Ma et al., Reference Ma, Golinkoff, Houston and Hirsh-Pasek2011) compared to those exposed to adult-directed speech (ADS). The ability to segment continuous speech into meaningful units, such as words, is essential for language acquisition and processing. While behavioural studies suggest that CDS facilitates this process (Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022; Thiessen et al., Reference Thiessen, Hill and Saffran2005), the specific advantages of CDS in segmentation with naturally occurring spontaneous speech are not yet well-established. This paper investigates whether and how CDS facilitates early word segmentation by combining corpus analyses of linguistic properties with computational modelling across multiple segmentation algorithms.
1.1. Linguistic properties of Korean relevant to segmentation
Most research on word segmentation has focused on stress-based, right-branching languages like English (Jusczyk et al., Reference Jusczyk, Houston and Newsome1999). However, Korean presents a valuable contrast due to its lack of lexical stress (Jun, Reference Jun1998; Ko, Reference Ko2024), head-final syntax, and agglutinative morphology, making it an ideal case to explore how universal and language-specific cues interact during segmentation. Unlike English, Korean does not rely on lexical stress as a segmentation cue. Instead, Korean prosody is organised primarily around the Accentual Phrase (AP; Jun, Reference Jun1998), a phrasal-level unit without consistent word-level prominence (Ko, Reference Ko2013, Reference Ko2024). Infants exposed to languages without strong stress cues, such as French, rely heavily on alternative segmentation cues, including syllable-based regularities (Nazzi et al., Reference Nazzi, Iakimova, Bertoncini, Frédonie and Alcantara2006). Korean infants might similarly rely on these syllable-level cues for segmentation.
At the syntactic level, Korean is a head-final language with canonical Subject–Object–Verb (SOV) order. This syntactic structure makes forward transitional probabilities particularly informative, as arguments and modifiers usually precede heads. Korean learners thus demonstrate sensitivity primarily to forward rather than backward transitional probabilities (Onnis & Thiessen, Reference Onnis and Thiessen2013). At the same time, Korean’s flexible word order and frequent omission of arguments, particularly in conversational speech, reduce transitional probability stability, potentially complicating segmentation. CDS, however, with its structurally simpler, shorter utterances and frequent repetition, might partially offset these challenges by increasing regularity and predictability.
Morphologically, Korean’s agglutinative structure poses additional segmentation challenges. Morphemes in Korean are linearly concatenated but can be blurred by phonological processes such as coalescence. Recent computational modelling work, however, suggests morphological complexity plays a modest role compared to algorithmic factors (Loukatou et al., Reference Loukatou, Stoll, Blasi and Cristia2022b). In Korean CDS, the shorter and morphologically simpler words relative to ADS likely mitigate some of these segmentation difficulties. Still, the question remains whether infants initially segment at the word, morpheme, or prosodic unit (e.g., AP) level. In this study, we adopt orthographic words as segmentation targets, following prior research, though we acknowledge this approach may oversimplify segmentation tasks in languages with complex morphology like Korean.
Lexically, Korean CDS frequently employs onomatopoeia, expressive lengthening, and playful linguistic forms, particularly in speech directed towards younger infants (Jo & Ko, Reference Jo and Ko2018). Onomatopoeic words, acoustically salient and frequently repeated, may act as anchor points to aid segmentation, analogous to isolated or familiar words that facilitate segmentation (Bortfeld et al., Reference Bortfeld, Morgan, Golinkoff and Rathbun2005). Thus, Korean provides a distinct linguistic environment for investigating how language-specific structures interact with general CDS modifications to shape segmentation strategies, making it a valuable test case for examining the cross-linguistic applicability of segmentation models.
1.2. Segmentation of speech into words
Word segmentation plays a crucial role in language acquisition and processing, as it enables infants to identify and extract meaningful linguistic units from continuous speech (Goyet et al., Reference Goyet, Nishibayashi and Nazzi2013; Saffran et al., Reference Saffran, Aslin and Newport1996). However, segmenting speech into words presents challenges since natural speech lacks explicit word boundary markers. To overcome these challenges, infants rely on various cues present in the speech input, including stress (Jusczyk et al., Reference Jusczyk, Cutler and Redanz1993; Jusczyk & Aslin, Reference Jusczyk and Aslin1995; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022), phrase-level prosody (Estes & Hurley, Reference Estes and Hurley2013; Jusczyk et al., Reference Jusczyk, Friederici, Wessels, Svenkerud and Jusczyk1993; Shukla et al., Reference Shukla, White and Aslin2011), statistical regularities at the word (Bortfeld et al., Reference Bortfeld, Morgan, Golinkoff and Rathbun2005), syllable (Black & Bergmann, Reference Black and Bergmann2017), and phoneme levels (Mattys & Jusczyk, Reference Mattys and Jusczyk2001; Saffran et al., Reference Saffran, Aslin and Newport1996), and allophonic patterns (Estes, Reference Estes2014; Jusczyk et al., Reference Jusczyk, Houston and Newsome1999).
CDS has been shown to facilitate word segmentation in laboratory-based behavioural research (Hoareau et al., Reference Hoareau, Yeung and Nazzi2019; Thiessen et al., Reference Thiessen, Hill and Saffran2005). However, the specific mechanisms that facilitate segmentation in CDS remain unclear. Its efficacy is often attributed to its unique prosody, such as exaggerated pitch contours and slower speech rates (Cooper & Aslin, Reference Cooper and Aslin1990; Estes & Hurley, Reference Estes and Hurley2013; Fernald & Kuhl, Reference Fernald and Kuhl1987), and musical attributes that function similarly to songs (Ma et al., Reference Ma, Fiveash, Margulis, Behrend and Thompson2020). These features enhance the appeal and effectiveness of CDS, while its higher information value at the prosodic level aids attentional processing (Räsänen et al., Reference Räsänen, Kakouros and Soderstrom2018). These characteristics create an environment conducive to infants processing the statistical regularities in speech, essential for successful word segmentation (Saffran et al., Reference Saffran, Aslin and Newport1996).
While CDS is recognised for its salient acoustic characteristics that benefit children’s language learning (Fernald & Simon, Reference Fernald and Simon1984; Garnica, Reference Garnica1977; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022; Stern et al., Reference Stern, Spieker, Barnett and MacKain1983), it is important to undertake a deeper analysis of input properties beyond phonological and phonetic features. Previous research has primarily focused on CDS features that attract and maintain infants’ attention and enhance speech signals (Fernald & Simon, Reference Fernald and Simon1984; Grieser & Kuhl, Reference Grieser and Kuhl1988). However, relatively few studies have thoroughly examined key linguistic features through corpus analysis, such as the proportion of one-word utterances, hapax legomena, onomatopoeia, and utterance length.
Recent naturalistic studies have begun bridging laboratory findings with real-world language acquisition. For example, Loukatou et al. (Reference Loukatou, Scaff, Demuth, Cristia and Havron2022a) showed that CDS consistently exhibits shorter utterances and lexical simplification across diverse speakers in French and Sesotho, indicating these properties facilitate segmentation regardless of cultural context. However, inconsistencies in the magnitude of CDS advantages across languages remain, potentially due to factors such as morphological complexity and prosodic structure. In agglutinative languages such as Korean, features like syllable-timed rhythm and case-marking particles might uniquely influence CDS segmentation outcomes. By systematically comparing Korean CDS and ADS, this study explores how language-specific structures interact with universal CDS characteristics to shape segmentability outcomes.
1.3. Simulating word segmentation
In the line of research utilising computational simulation to model word segmentation, findings on the advantage of CDS have been mixed. For instance, Batchelder (Reference Batchelder2002) reported a ~10% CDS advantage in English, Spanish, and Japanese using a single algorithm, and Fourtassi et al. (Reference Fourtassi, Borschinger, Johnson and Dupoux2013) similarly found a CDS advantage in English and Japanese, also using a single algorithm. Ludusan et al. (Reference Ludusan, Mazuka, Bernard, Cristia and Dupoux2017) also identified a CDS advantage in Japanese using four algorithms. However, more comprehensive studies have shown smaller advantages: Loukatou et al. (Reference Loukatou, Le Normand and Cristià2019) found only a ~3% CDS advantage in French using 18 algorithms, and Cristia et al. (Reference Cristia, Dupoux, Ratner and Soderstrom2019) reported a similar result in English using nine algorithms.
These discrepancies likely arise from methodological, linguistic, and corpus-related factors. Methodologically, simpler algorithms relying solely on transitional probabilities may be especially sensitive to repetitive features of CDS, potentially inflating its apparent advantage. More complex algorithms that incorporate multiple cues (e.g., diphone information, phonotactic patterns, or rule-based inference) distribute their reliance across features, resulting in more nuanced or modest CDS benefits.
The linguistic structure of the target language also matters. Agglutinative languages with regular morphological patterns (e.g., Korean, Japanese, Turkish) might facilitate segmentation, especially when morpheme boundaries consistently align with clear phonological or syllabic boundaries. However, this advantage may depend on how transparently these boundaries surface in actual speech, particularly in CDS, which tends to have shorter words and simpler morphological forms. In contrast, stress-timed languages like English often rely on stress patterns or function words as segmentation cues, which may not consistently align with word boundaries, making statistical patterns potentially less predictable. Thus, language-specific phonological and morphological structures interact differently with segmentation algorithms, potentially influencing the observed effectiveness of CDS.
Additionally, corpus sampling conditions, including differences in naturalness (e.g., spontaneous versus laboratory-recorded speech) and interactional context, can substantially influence segmentation outcomes. Spontaneously collected CDS in naturalistic contexts may provide more representative and consistent segmentation cues than data elicited in laboratory settings, contributing further variability to cross-study comparisons.
Finally, the age of the children in the corpus plays a role. CDS directed at younger infants often includes shorter utterances and frequent repetition, which support segmentation, whereas speech to older children becomes more syntactically complex and less optimised for segmentation (Chai & Ko, Reference Chai and Ko2025).
Together, these methodological and linguistic factors help explain the divergent findings across studies using different corpora, age groups, and segmentation approaches.
1.4. Linguistic properties of CDS associated with segmentability
Several linguistic properties of CDS consistently support efficient word segmentation, including shorter utterances, shorter words, frequent word repetition, one-word utterances, and enhanced prosodic features (Batchelder, Reference Batchelder2002; Brent & Siskind, Reference Brent and Siskind2001; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022; Newport et al., Reference Newport, Gleitman and Gleitman2020; Thiessen & Saffran, Reference Thiessen and Saffran2003).
Short utterances are thought to facilitate segmentation by providing a simplified and predictable linguistic environment (Bernstein Ratner & Rooney, Reference Bernstein Ratner and Rooney2001; Brent & Siskind, Reference Brent and Siskind2001; Swingley & Humphrey, Reference Swingley and Humphrey2018). Computational modelling has demonstrated that shorter utterances improve segmentation performance (Frank et al., Reference Frank, Goldwater, Griffiths and Tenenbaum2010), indicating that the brevity of CDS may be beneficial for early word learning. Additionally, isolated words, which occur frequently in CDS, offer particularly clear perceptual anchors that help infants segment speech more efficiently (Brent & Siskind, Reference Brent and Siskind2001; Lew-Williams et al., Reference Lew-Williams, Pelucchi and Saffran2011).
Interjections, another frequent feature in spontaneous CDS, might also support segmentation despite lacking grammatical connections to adjacent words. Familiar anchor words like “mommy” or the child’s name are known to facilitate segmentation (Bortfeld et al., Reference Bortfeld, Morgan, Golinkoff and Rathbun2005), but whether interjections provide similar benefits remains unexplored. Variation in how interjections are classified, and in how they are treated when defining one-word utterances, could therefore influence conclusions about the structure and segmentation benefits of CDS.
CDS also consistently exhibits lower lexical diversity compared to ADS (Soderstrom, Reference Soderstrom2007), characterised by frequent repetition and fewer rare words (Goodman et al., Reference Goodman, Dale and Li2008). Higher lexical diversity and more unique words (hapax legomena) typically found in ADS introduce more segmentation complexity, as algorithms or learners encounter less predictable linguistic patterns. In contrast, the repetition and lexical simplicity in CDS strengthen statistical regularities, increasing predictability and ease of segmentation (Thiessen & Saffran, Reference Thiessen and Saffran2003). Variation sets, or repeated words presented in varied contexts, additionally reinforce segmentation cues (Lester et al., Reference Lester, Moran, Küntay, Allen, Pfeiler and Stoll2022). Computational studies have similarly shown that shortening utterances and increasing repetition significantly enhances segmentation performance (Batchelder, Reference Batchelder2002), further emphasising how CDS structurally supports infants’ segmentation abilities.
1.5. Research objectives
This study aims to advance our understanding of word segmentation in Korean, a language that is typologically different from languages like English at multiple linguistic levels. Our primary objectives are (1) to examine how linguistic features and statistical regularities differ between CDS and ADS in a naturalistic Korean speech corpus; (2) to evaluate the performance of word segmentation algorithms in segmenting words from Korean CDS compared to ADS; and (3) to assess the effectiveness of various algorithms in segmenting words from the Korean corpus. We employ computational modelling to investigate how Korean CDS and ADS, each with distinctive properties, interact with different segmentation strategies to influence word segmentation performance. In light of the results, we will discuss the underlying assumptions of each algorithm, and their ability to capture the unique statistical properties of Korean.
To the best of our knowledge, this is the first study to explore word segmentation in Korean CDS and ADS using computational models applied to a corpus of spontaneous speech and to compare the segmentation strategy of algorithms in relation to linguistic properties of Korean.
2. Methods
2.1. Design
This study utilises the Korean Ko corpus (Ko et al., Reference Ko, Jo, On and Zhang2020), which includes transcriptions of natural interactions between Korean mothers and their children (CDS) and adults (ADS). To control for potential corpus-size effects (Montag et al., Reference Montag, Jones and Smith2018), we supplemented the ADS portion of the Ko corpus with additional data from the Call Friend Korean corpus (ADS-CF; Ko, Reference Ko2013). This augmentation ensured that our ADS dataset matched the CDS portion of the Ko corpus more closely in size, helping to rule out the possibility that observed segmentation differences were artefacts of corpus-size disparities.
Previous studies typically used dictionary lookups to convert text into phonemic transcriptions. However, Beech and Swingley (Reference Beech and Swingley2023) show that using actual pronunciations leads to worse segmentation performance than phonemic forms, because many surface forms that are phonologically distinct correspond to the same underlying word, and vice versa. To address this mismatch, we transformed the orthographic form into a phonemic representation and applied phonological rules to derive actual pronunciations in spontaneous speech. While this approach has limitations, such as the inability to replicate the application of optional rules that are often dependent on speaking rate or prosodic phrasing, it represents an advancement over previous methods based on dictionary forms by aligning the input more closely with the phonological reality encountered by children during word segmentation.
2.2. Data and preparation
To simulate word segmentation of the Korean language, we used spontaneous speech from the Ko corpus (Ko et al., Reference Ko, Jo, On and Zhang2020), which includes transcripts of 35 mother–child dyads (14 girls; mean age = 16 months; age range = 6–30 months) during free-play sessions, totalling 23.3 hours of CDS. The same mothers also produced ADS while talking with adult family members or experimenters, totalling 5.8 hours. As the amount of ADS in the Ko corpus was relatively limited, we supplemented our analyses with the ADS-CF corpus (Ko, Reference Ko2013), a collection of 15-minute transcriptions from 100 phone calls between Korean-speaking adults residing in the U.S. These informal peer conversations, held primarily among young adults (mean age = 21.8 years), involve more than 200 speakers across the 100 calls (55 female–female, 28 male–male, and 17 mixed-gender). Although the speech is not face-to-face and varies in demographic background, it offers additional adult-directed input that matches the CDS in size. The Ko corpus, with rich annotations and matched CDS/ADS speakers, served as the primary source for comparing registers.
In the Korean writing system, Hangul, each alphabetical letter corresponds to an underlying phoneme, and each block of characters represents a syllable. This phonemic design allows phonetic output to be derived directly from the orthography, without dictionary lookups. For example, the word “한글” (/hankɯl/), meaning “the Korean alphabet,” consists of two syllables: “한” (/han/) and “글” (/kɯl/). Each character block can contain up to three phonemes, typically in a consonant-vowel (CV) or consonant-vowel-consonant (CVC) structure.
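As a concrete illustration of this property (our own sketch, not code from the study or the KoG2P package), a precomposed Hangul syllable block can be decomposed into its phoneme letters with simple Unicode arithmetic:

```python
# Illustrative sketch: decompose precomposed Hangul syllable blocks
# (U+AC00-U+D7A3) into their constituent jamo (phoneme letters).
# Names and structure are ours, not the paper's pipeline.

CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")     # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals (+ empty)

def decompose(block):
    """Return the jamo letters of one precomposed Hangul syllable block."""
    idx = ord(block) - 0xAC00                # offset into the syllable area
    initial, rest = divmod(idx, 21 * 28)     # 21 vowels x 28 final slots per initial
    medial, final = divmod(rest, 28)
    jamo = [CHOSEONG[initial], JUNGSEONG[medial]]
    if JONGSEONG[final]:                     # final slot may be empty (CV syllable)
        jamo.append(JONGSEONG[final])
    return jamo

# "한글": 한 = ㅎ+ㅏ+ㄴ, 글 = ㄱ+ㅡ+ㄹ
print([decompose(s) for s in "한글"])
```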
We transformed the orthographic transcriptions into phonetic symbols using phonological rules from the KoG2P package (Hong et al., Reference Hong, Ki and Gweon2018; Mun et al., Reference Mun, Kim and Ko2022). The phonetic input resulting from applying phonological rules to the phonemic representation includes adjusted syllable boundaries and other phonological changes to accurately reflect actual pronunciations. The phonologisation process involved applying context-sensitive phonological rules to account for phonological variations and connected speech processes. For example, an application of the h-deletion, optional w-deletion, place assimilation, j-deletion, and resyllabification rules would transform the orthographic form “전화주시는거죠” (/d͡ʑʌn.hwa.cu.si.nɯn.gʌ.d͡ʑjo/, “You will call me, right?”) to the phonetic form [d͡ʑʌ.na.cu.si.nɯŋ.gʌ.d͡ʑo].
2.3. Metrics for assessing linguistic properties of corpus data
Metrics for primary linguistic properties. We systematically quantified several linguistic properties, including utterance and word length, the proportion of hapax legomena, the percentage of monosyllabic words, one-word utterances, onomatopoeic words and word play, as well as interjections. These metrics were specifically chosen due to their theoretical significance in understanding word segmentation and language complexity. The metrics are detailed below.
Utterance length and word length: We measured word and utterance length in both phones (i.e., total number of individual phone segments within a word or utterance) and syllables (i.e., total number of Hangul blocks within a word or utterance). Average lengths were then calculated for each participant at both phonetic and syllabic levels.
Hapax legomena: Hapax legomenon values were calculated for each participant by dividing the number of word types that occur only once by the total number of word types in that participant’s sample.
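For illustration (our own sketch, not the study’s code; the toy sample is invented), this type-based ratio can be computed as:

```python
from collections import Counter

def hapax_ratio(words):
    """Proportion of word types occurring exactly once among all word types."""
    counts = Counter(words)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# "the" and "ball" recur; "red" and "dog" are hapaxes: 2 of 4 types
print(hapax_ratio(["the", "ball", "the", "ball", "red", "dog"]))  # 0.5
```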
Monosyllabic words: We identified monosyllabic words and measured their percentage relative to the total number of words in the Ko Corpus.
One-word utterances: We adopted a conservative approach, excluding words that never occur in combination with other words and interjection words such as discourse fillers (e.g., 어 “uh,” 아 “ah”), affirmative responses (e.g., 네 “yes,” 응 “mm-hmm,” 예 “yes,” 그래 “okay”), negative responses (e.g., 아니 “no”), and non-lexical vocalisations (e.g., 우와 “wow,” 아고 “oh dear,” 어머 “oh my”). These excluded words are collectively categorised as interjections due to their linguistic function.
Interjections: We defined interjections as exclamatory words or phrases that can stand alone or be inserted within a sentence to vividly convey sudden feelings such as surprise, excitement, anger, or disappointment (Ameka, Reference Ameka1992). This definition focuses on their linguistic properties rather than their acoustic features.
Onomatopoeic words and word play: The Ko corpus (Ko et al., Reference Ko, Jo, On and Zhang2020) annotates onomatopoeic words with the symbol “@o” and word play with the symbol “@wp.” To assess the ratio of these words, we tallied the total frequency of words identified with the tag, allowing for redundant counts between the two categories, and divided by the total number of word tokens in each corpus and across speech registers. This measure represents the proportion of onomatopoeic and word play tokens relative to all word tokens.
Metrics for corpus complexity. Expanding on the basic linguistic unit percentages outlined above, we further explored corpus complexity by assessing composite measures of lexical diversity and lexical ambiguity/entropy.
MATTR: To assess lexical diversity, we employed the Moving-Average Type-Token Ratio (MATTR) across a 20-word window (Covington & McFall, Reference Covington and McFall2010). This metric iteratively calculates the average type-token ratio as the window progresses through the corpus. A higher MATTR indicates greater lexical diversity in the speech, whereas a lower MATTR points to a greater degree of repetition.
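A minimal sketch of this windowed calculation (our illustration, not Covington & McFall’s implementation; a toy window of 3 is used only for the demo, whereas the study uses 20):

```python
def mattr(tokens, window=20):
    """Moving-Average Type-Token Ratio: the mean TTR over every
    window of `window` consecutive tokens in the sample."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

# Toy window of 3: windows "a b a", "b a b", "a b c" give TTRs
# 2/3, 2/3 and 1, so MATTR = 7/9
print(mattr(["a", "b", "a", "b", "c"], window=3))
```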
Entropy: We quantified lexical ambiguity using the normalised Shannon entropy (NSE) formula. NSE evaluates the ambiguity or unpredictability of text segmentation by employing a unigram model that assigns probabilities to each possible segmentation based on word frequency. High entropy values indicate greater lexical ambiguity and unpredictability in segmentation, while low entropy values denote more constrained and predictable segmentation outcomes.
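The study derives the probabilities from a unigram model over candidate segmentations; the normalisation step itself can be sketched as follows (illustrative only, with an invented probability distribution standing in for the segmentation probabilities):

```python
import math

def normalised_entropy(probs):
    """Shannon entropy of a probability distribution, normalised by its
    maximum (log2 of the number of outcomes), so 0 = fully predictable
    and 1 = maximally ambiguous."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    k = len(probs)
    return h / math.log2(k) if k > 1 else 0.0

print(normalised_entropy([0.25, 0.25, 0.25, 0.25]))  # 1.0: uniform, maximal ambiguity
print(normalised_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.12: one dominant segmentation
```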
2.4. Word segmentation simulations
We opted to use the WordSeg package (Bernard et al., Reference Bernard, Thiolliere, Saksida, Loukatou, Larsen, Johnson and Cristia2020) to enable direct comparison of our outcomes with those of previous studies that have utilised this package (e.g., Cristia et al., Reference Cristia, Dupoux, Ratner and Soderstrom2019; Loukatou et al., Reference Loukatou, Le Normand and Cristià2019). These studies have reported a minimal advantage of CDS over ADS, which may be attributed to variations in methods and language contexts. To model sub-lexical level segmentation, which focuses on sound sequences between syllables and phones, we employed the transitional probability (TP; Saffran et al., Reference Saffran, Aslin and Newport1996) and diphone-based segmentation (DiBS; Daland & Pierrehumbert, Reference Daland and Pierrehumbert2011) algorithms. For lexical-level segmentation, we utilised PUDDLE (Phonotactics from Utterances Determine Distributional Lexical Elements; Monaghan & Christiansen, Reference Monaghan and Christiansen2010) and the Adaptor Grammar approach (e.g., Cristia et al., Reference Cristia, Dupoux, Ratner and Soderstrom2019; Loukatou et al., Reference Loukatou, Le Normand and Cristià2019; Ludusan et al., Reference Ludusan, Mazuka, Bernard, Cristia and Dupoux2017).
The algorithms selected for this study represent a range of segmentation strategies, from local statistical cues (e.g., transitional probabilities, diphone statistics) to more global, lexicon-driven approaches (e.g., PUDDLE, Adaptor Grammar). Rather than modelling child cognition directly, our aim is to test which types of cues and mechanisms are most sensitive to differences between CDS and ADS in Korean, and which align best with the language’s structural properties. While these algorithms are not intended as fully cognitively plausible models, they help identify which input features (from CDS and ADS) and statistical regularities are most informative for segmentation, offering insight into the types of cues that may be available to learners.
In the following section, we provide a concise summary of each algorithm and outline the specific settings employed in our study, referencing Bernard et al. (Reference Bernard, Thiolliere, Saksida, Loukatou, Larsen, Johnson and Cristia2020) for further descriptions of the algorithms.
Algorithms. Baseline: We conducted a series of baseline simulations at both the phone and syllable levels. These simulations were configured to assign word boundaries with a 50% probability, i.e., inserting a word boundary between unit tokens (phones or syllables) at chance.
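Such a chance baseline can be sketched as follows (our illustration, not the WordSeg implementation; the `seed` parameter is added only for reproducibility):

```python
import random

def random_baseline(units, p=0.5, seed=0):
    """Chance baseline: after each unit (phone or syllable), insert a
    word boundary independently with probability p."""
    rng = random.Random(seed)
    words, word = [], []
    for u in units:
        word.append(u)
        if rng.random() < p:   # coin flip at every unit transition
            words.append(word)
            word = []
    if word:                   # flush any trailing partial word
        words.append(word)
    return words

# Segment a toy six-phone utterance at chance
print(random_baseline(list("abcdef")))
```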
Transitional probabilities: We implemented four TP algorithms varying in transition thresholds (absolute and relative) and direction (forward and backward). Relative TP identifies boundaries based on a relative decrease in TP, while absolute TP sets the boundary at the average TP of the corpus. Forward TP analyses transitions left-to-right, and backward TP right-to-left, without requiring known word boundaries. Each algorithm processed syllable and phone units, yielding eight evaluation sets.
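As a sketch of the absolute forward-TP variant (our own simplification, not the WordSeg implementation; the threshold here averages over bigram types, and the toy syllable corpus is invented):

```python
from collections import Counter

def forward_tp_segment(utterances):
    """Absolute forward-TP segmenter (sketch): posit a boundary wherever
    the unit-to-unit transitional probability falls below the corpus
    mean TP. Utterances are non-empty lists of syllables (or phones)."""
    bigrams, unigrams = Counter(), Counter()
    for utt in utterances:
        unigrams.update(utt[:-1])                  # counts in non-final position
        bigrams.update(zip(utt, utt[1:]))
    tp = {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}
    threshold = sum(tp.values()) / len(tp)         # absolute threshold = mean TP
    segmented = []
    for utt in utterances:
        words, word = [], [utt[0]]
        for a, b in zip(utt, utt[1:]):
            if tp[(a, b)] < threshold:             # low TP -> word boundary
                words.append(word)
                word = []
            word.append(b)
        words.append(word)
        segmented.append(words)
    return segmented

# "ba.bi" always cohere (TP = 1.0); what follows varies (TP = 1/3 each),
# so a boundary is placed after "bi"
print(forward_tp_segment([["ba", "bi", "ku"], ["ba", "bi", "go"], ["ba", "bi", "du"]]))
```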
Diphone-based segmenter: The DiBS algorithm utilises Bayes’ rule to estimate the probability of a word boundary between two phones/syllables. We implemented two versions of DiBS: (1) DiBS-phrasal uses utterances as the chunks for identifying word boundaries, and (2) DiBS-lexical utilises a seed lexicon. DiBS-lexical is still considered a sub-lexical algorithm, as it focuses on the statistical properties of sub-word units, such as diphones, rather than statistical information at the word level. Both DiBS-phrasal and DiBS-lexical processed at both syllable and phone units, resulting in four sets of processes.
Phonotactics from utterances determine distributional lexical elements (PUDDLE): PUDDLE is an utterance-based word segmentation algorithm developed to model the incremental nature of language learning in a psychologically plausible way (Monaghan & Christiansen, Reference Monaghan and Christiansen2010). Rooted in a chunk-based approach, PUDDLE is conceptually related to Christiansen’s broader work on usage-based learning and chunking; it assumes that learners store and process input as chunks, gradually refining their representations as they encounter more data. Notably, this chunk-based perspective is also reflected in approaches like Lieven’s traceback method, which traces children’s utterances back to previously encountered chunks and frame-and-slot patterns in their input (Hartmann et al., Reference Hartmann, Koch and Quick2021; Lieven et al., Reference Lieven, Salomo and Tomasello2009).
Unlike local event-based approaches like DiBS and TP, PUDDLE breaks down utterances into candidate words using three long-term storage buffers: a “lexicon,” onset bigrams, and offset bigrams. The algorithm operates incrementally by (1) scanning utterances and matching sequences to the “lexicon,” (2) storing unmatched utterances as words with frequency and phonetic information, (3) using a phone window to capture word boundaries, (4) comparing new utterances to stored phonetic information for segmentation, and (5) favouring high-frequency words while adding unknown words to the “lexicon.”
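A heavily simplified sketch of this lexicon-driven idea (illustrative only: the real PUDDLE also maintains onset/offset bigram buffers and frequency information, and the example input is invented):

```python
def puddle_sketch(utterances):
    """Toy PUDDLE-style incremental segmenter: split each utterance around
    an already-stored lexical item; any residue, or any unsegmentable
    utterance, is itself stored as a new 'word' for future matching."""
    lexicon, segmentations = [], []
    for utt in utterances:
        segments, rest = [], utt
        for word in sorted(lexicon, key=len, reverse=True):  # prefer long matches
            if word in rest and word != rest:
                pre, _, post = rest.partition(word)          # split at first match
                segments = [s for s in (pre, word, post) if s]
                rest = None
                break
        if rest is not None:                                 # nothing matched
            segments = [rest]
        for s in segments:                                   # grow the lexicon
            if s not in lexicon:
                lexicon.append(s)
        segmentations.append(segments)
    return segmentations, lexicon

# First utterance is stored whole; it then carves up the second one
segs, lex = puddle_sketch(["mommy", "mommyball"])
print(segs)  # [['mommy'], ['mommy', 'ball']]
```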
Adaptor grammar (AG): AG (Goldwater et al., Reference Goldwater, Griffiths and Johnson2009; Johnson et al., Reference Johnson, Griffiths and Goldwater2006; Phillips, Reference Phillips2015) is a computational framework designed for parsing a corpus and inferring the probabilities of rewrite rules that may have generated it. AG assumes sentences consist of words, and words consist of basic units. The framework generates, stores, and rewrites rules through an iterative process: (1) parsing the corpus multiple times to refine sub-rules and prune uneconomical parses; (2) reapplying these parses to generate possible segmentations; and (3) using Minimum Bayes Risk to select the most probable segmentations. This flexible approach allows AG to robustly segment words while adapting to language-specific characteristics.
Evaluations. We initially trained each of the nine unique algorithms (baseline, absolute forward TP, relative forward TP, absolute backward TP, relative backward TP, DiBS-phrasal, DiBS-lexical, PUDDLE, AG) on 140 unique corpus sets, derived from 35 dyads across two registers (CDS and ADS) and two levels of processing units (phones and syllables).
In this study, we analyse both phones and syllables as processing units, given their central role in Korean structure and bottom-up statistical learning. Phones represent the most basic segmental units, whereas syllables are highly salient in perception, production, and orthography, and often align closely with morpheme boundaries. Sensitivity to statistical regularities at these levels is therefore a critical precursor to effective word and morpheme segmentation.
Each algorithm processed the utterances produced by the mothers in the Ko corpus, which were categorised as either CDS or ADS based on the register tag. For evaluation purposes, we prepared a gold corpus that defines the gold standard for target word boundaries in the input utterances, and a unitised corpus used for algorithm training. The orthographic form of the data includes information on syllable and word boundaries, as well as the phones.
Performance evaluation for both phone-level and syllable-level algorithms involved comparing their segmented outputs against the gold standard corpus. Phone-level algorithms are denoted by a “_p” suffix and syllable-level algorithms by a “_s” suffix, with each being supplied with the corresponding unitised corpus at the phone or syllable level. We assessed performance using precision, recall, and F-scores, derived from the comparison of the segmented outputs to the gold corpus.
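As a concrete illustration of the scoring procedure (a minimal sketch, not the actual evaluation script; utterances are represented as lists of words over arbitrary basic units), boundary-level precision, recall, and F-score can be computed by comparing the cut positions implied by a segmenter's output with those in the gold corpus:

```python
def boundary_f_score(gold_utts, seg_utts):
    """Compare word-boundary positions (counted in basic units) in the
    segmented output against the gold standard; utterance edges are
    excluded, since they are given to every algorithm for free."""
    def boundaries(words):
        pos, cuts = 0, set()
        for w in words[:-1]:        # final word contributes no internal cut
            pos += len(w)
            cuts.add(pos)
        return cuts

    tp = fp = fn = 0
    for gold, seg in zip(gold_utts, seg_utts):
        g, s = boundaries(gold), boundaries(seg)
        tp += len(g & s)            # boundaries correctly posited
        fp += len(s - g)            # spurious boundaries (oversegmentation)
        fn += len(g - s)            # missed boundaries (undersegmentation)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

An undersegmenting output scores high precision but low recall, while an oversegmenting one shows the reverse pattern, which is why the F-score serves as the headline measure.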
3. Results
3.1. Linguistic properties of CDS and ADS
We report relevant linguistic properties extracted from the Korean corpus in Table 1, providing means and standard deviations (SDs) accompanied by histograms generated using the skimr package in R (McNamara et al., Reference McNamara, Arino de la Rubia, Zhu, Ellis and Quinn2018). Building upon previous research by Cristia et al. (Reference Cristia, Dupoux, Ratner and Soderstrom2019) and Loukatou et al. (Reference Loukatou, Le Normand and Cristià2019), our study focuses on utterance and word length (measured in phones and syllables), the proportion of hapax legomena (words), one-word utterances, MATTR, and entropy, with the addition of the percentages of monosyllabic words, onomatopoeic words, word play, and interjections.
Table 1. Linguistic properties of the corpus, their mean and standard deviation, M (SD), across registers (CDS, ADS, and ADS Call Friend)

1 Not an exhaustive list.
2 Information not available due to the lack of annotation for this property in the Call Friend corpus.
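For readers less familiar with these diversity measures, MATTR and Shannon entropy can be sketched as follows (an illustrative toy implementation; the window size and tokenisation here are hypothetical, not those of the actual analysis):

```python
import math
from collections import Counter

def mattr(tokens, window=10):
    """Moving-average type-token ratio: the mean TTR over all sliding
    windows of fixed size, which controls for the well-known dependence
    of plain TTR on text length."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the word-frequency distribution;
    lower values indicate more repetitive, predictable input."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(tokens).values())
```

Lower MATTR and lower entropy in CDS would both reflect a smaller, more frequently repeated vocabulary.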
To compare property-related differences between the registers, we fitted a series of linear models for each linguistic property, using the metrics shown in Table 1. The dependent variable in each model was the linguistic property, with corpus register type (CDS, ADS, and ADS-CF) as the fixed factor. We also included the total token number as a covariate to account for differences in corpus size. Table 2 shows regression estimates for linguistic properties, comparing ADS and ADS-CF with CDS as the reference level, as well as the effect of corpus size (represented per 1,000 tokens). A positive estimate in the ADS or ADS-CF columns indicates that the respective register exhibits a higher value for that linguistic property compared to CDS. Negative estimates for corpus size indicate that as corpus size increases, the corresponding linguistic property value decreases.
Table 2. Summary of a series of linear regressions comparing differences in corpus properties categorised by registers (CDS, ADS, ADS-Call-Friend), with corpus size (total tokens) as the covariate. Values represent estimates with standard errors (SE) in parentheses

***p < 0.001; **p < 0.01; *p < 0.05.
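The structure of these models can be illustrated with a toy ordinary-least-squares fit in Python (a minimal sketch of the dummy coding, not the R models actually used; all data values below are invented). Register is coded with CDS as the reference level, and token count enters as a covariate:

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved by Gaussian elimination with partial pivoting."""
    n, k = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for i in range(k):                          # forward elimination
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p], b[i], b[p] = A[p], A[i], b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):                # back substitution
        beta[i] = (b[i] - sum(A[i][c] * beta[c]
                              for c in range(i + 1, k))) / A[i][i]
    return beta

def design_row(register, tokens_per_1000):
    """Dummy-code register with CDS as the reference level:
    [intercept, ADS, ADS-CF, corpus-size covariate]."""
    return [1.0,
            1.0 if register == "ADS" else 0.0,
            1.0 if register == "ADS-CF" else 0.0,
            tokens_per_1000]
```

Under this coding, the fitted ADS and ADS-CF coefficients are exactly the register contrasts against CDS, adjusted for corpus size, mirroring how the estimates in Table 2 are to be read.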
The analysis of different linguistic properties across CDS and ADS reveals distinct patterns. ADS (and ADS-CF) is characterised by longer utterance and word length; greater lexical diversity, as evidenced by higher MATTR; and a larger proportion of hapax legomena. The higher entropy in ADS (Footnote 3) suggests more complex and less predictable language use compared to CDS, which may in part reflect the broader range of topics and socio-environmental contexts present in ADS samples. However, ADS-CF does not differ significantly from CDS in entropy, likely because of linguistic features that the two contexts share, deriving from the social distance between speakers. ADS also shows a lower proportion of one-word utterances than CDS, indicating that speakers are less likely to produce a single word when addressing an adult.
In summary, our findings reveal clear structural differences between CDS and ADS, underscoring how adults adapt their language to support children’s linguistic development. CDS is characterised by simpler, more repetitive patterns that likely cater to the developmental needs of young learners, while ADS is generally more complex and varied, reflecting typical adult communication. It is important to note that the ADS in the Ko corpus includes interactions with both family members and experimenters; these subsets were not separated in our segmentation analysis due to the limited corpus size, which may contribute to the observed complexity in ADS.
3.2. Segmentability of speech
To systematically investigate register-related variations (Figure 1) between CDS and ADS, and to assess differences among segmentation algorithms, we employed a linear mixed-effects regression model. This model analyses the effects of register type (CDS, ADS-CF, and ADS), algorithm type (nine types), and the unitised corpus (at phone-level and syllable-level) on word segmentation performance, quantified by F-scores (Footnote 4). We account for individual differences among dyads by incorporating a by-dyad random intercept into the model. The specific formula used for the model is as follows:
Figure 1. Scatter plot showing the distribution of raw F-scores measuring model performance across speech registers (CDS, ADS), speech processing algorithms, and phone/syllable units in the word segmentation simulations.

f_score ~ algorithm + register + phonological_unit + (1 | dyad).
Post-hoc analyses are conducted to examine the statistical contrasts between the estimated marginal means of each variable level (Figure 2).
Figure 2. Comparison of model-estimated marginal mean F-scores on the word segmentation simulations, across speech registers (CDS, ADS, ADS-CF), algorithms, and unitised type, with 95% confidence intervals. The purple bars represent 95% confidence intervals around each estimate. Red arrows highlight pairwise contrasts between conditions where a statistically significant difference was found.

Register effect. The model output reports a significant main effect of register type (χ2 = 711.38, df = 1, p < .001). The register effect showed overall better word segmentation performance in CDS (emmean = .70) than in ADS (emmean = .63, p < .001) and in ADS-CF (emmean = .62, p < .001), with no significant difference between ADS and ADS-CF (p = .347).
Algorithm differences. The model output reports a significant main effect of algorithm type (χ2 = 2210.52, df = 8, p < .001). Post-hoc comparisons between algorithms revealed that the baseline control measure was significantly lower than the other algorithms (p’s < .001), thus suggesting that the other algorithms were more effective in word segmentation. Notably, among the transitional probability (TP) algorithms, forward probabilities demonstrated greater segmentation performance (p’s < .001) than backward probabilities. Comparisons between the diphone-based (DiBS-phrasal and DiBS-lexical) algorithms showed that the phrasal variant outperforms the lexical variant (p’s < .001), suggesting that phrase-based DiBS is more effective at identifying word boundaries. Adaptor grammar consistently outperformed the other algorithms (p’s < .001), with lexical-DiBS and PUDDLE showing the poorest performance.
Phonological unit. We also found a significant main effect of the phonological unit (χ2 = 2981.28, df = 1, p < .001). Post-hoc comparisons of units showed overall better word segmentation performance for syllables (emmean = .70) than for phones (emmean = .60, p < .001). This syllable advantage aligns with the linguistic properties of the Korean language, which we discuss in detail in Section 4.4.1.
Mediation of corpus properties. This analysis examines whether corpus-level linguistic properties account for the observed differences in word segmentation performance between registers, essentially testing whether these properties mediate the relationship between register and algorithm performance. We investigated the mediation of corpus properties on the register effect in word segmentation performance, separately for phone-level and syllable-level units. For example, we compared the linear mixed model for the phone-based algorithms, with the F-score as the outcome and algorithm and register as predictors, to the same model with corpus properties added as covariates. These covariates included phone-based word and utterance length, and other common properties such as the percentage of hapax legomena, MATTR, entropy, and the percentages of monosyllabic words, isolated words, and interjections. The same modelling approach was applied to the syllable-based algorithms (see Figure 3). The inclusion of corpus properties significantly improved the models (phone: χ2 = 532.32, p < .001; syllable: χ2 = 465.80, p < .001). At the phone level, the inclusion of corpus properties reduced the CDS advantage over ADS from 0.12 to 0.05 (i.e., the difference in F-score emmeans) and over ADS-CF from 0.09 to 0.02, though both advantages remained statistically significant (ps < .05). At the syllable level, the inclusion of corpus properties rendered the CDS advantage over ADS non-significant, decreasing it from 0.09 to 0.02 (p = .062), while the advantage over ADS-CF decreased from 0.06 to 0.02 (p = .012). These results indicate that much of the segmentation advantage observed in CDS can be attributed to underlying corpus properties, highlighting their mediating role in shaping algorithmic performance across registers.
Figure 3. Predicted F-scores (emmeans) for word segmentation models comparing child-directed speech (CDS), adult-directed speech (ADS), and Call Friend ADS (ADS-CF) across base and mediation models using phone-level (left) and syllable-level (right) units. The inclusion of corpus-level linguistic properties as covariates in the mediation models reduces the CDS segmentation advantage over ADS and ADS-CF, indicating a mediating role of corpus properties in register effects. Error bars reflect estimated model uncertainty.

4. Discussion
We conducted a detailed examination of the linguistic properties of Korean CDS and ADS to explore their impact on word segmentation performance, providing new insights into how these characteristics influence segmentation processes. While our finding that CDS shows better segmentability than ADS is consistent with cross-linguistic expectations, since simpler, more repetitive speech is generally easier to segment, Korean presents a particularly interesting case due to its flexible word order and complex, agglutinative morphology. These structural features, along with other Korean-specific cues discussed below, likely interact with corpus properties to shape segmentation outcomes in ways that differ from more commonly studied Indo-European languages. Crucially, we utilised statistical modelling techniques to explore the relationship between linguistic properties and segmentation algorithms in assessing the potential advantages of CDS. This approach allowed us to examine the complex interplay of factors involved in early language processing.
4.1. Differences in linguistic properties of CDS and ADS
The corpus analysis reveals systematic differences between CDS and ADS that are demonstrated by complementary measures of lexical predictability and diversity. ADS is characterised by longer utterances and words, greater lexical diversity (reflected in higher MATTR), and a larger proportion of hapax legomena compared to CDS. These characteristics, along with higher entropy in ADS, suggested a more complex and less predictable linguistic environment, consistent with the broader vocabulary and intricate sentence structures characteristic of adult communication. In contrast, CDS demonstrates lower MATTR values, reduced entropy, and fewer hapax legomena, underscoring a fundamentally more repetitive and predictable linguistic strategy. These patterns indicate that CDS employs a restricted set of vocabulary with frequent repetition, creating a more learnable environment for children by enhancing predictability and supporting statistical learning. The reduced lexical diversity and increased repetitiveness in CDS, as captured by both MATTR and entropy, directly align with findings that children benefit from repeated exposure to vocabulary items, facilitating word acquisition through consistent linguistic input (Ko et al., Reference Ko, Jo and Chai2022). Thus, the interplay of these metrics not only highlights the distinctiveness of CDS but also underscores its adaptive role in language development by providing children with a structured and accessible linguistic environment.
Beyond these core measures of lexical diversity, the register differences extend to specific linguistic features that serve different communicative functions. Interjections are often neglected in linguistic research, typically considered marginal (Ameka, Reference Ameka1992; Dingemanse, Reference Dingemanse2024). In the current study, we found that ADS exhibited a lower percentage of one-word utterances but a higher percentage of isolated interjection words compared to CDS. While ADS may involve more complex language and a higher reliance on interjections as discourse markers, CDS tends to prioritise simplicity and clarity to support children’s language acquisition. This is evident in its fewer interjections, more one-word utterances, and the frequent use of onomatopoeic words and word play, which likely serve as attention-getters, simple directives, or labels for objects or actions. The simplicity of CDS contrasts with the increased complexity of ADS, where a broader vocabulary and more complex sentence structures are employed. This divergence in the linguistic properties of CDS and ADS highlights the inherent adaptability of language to the linguistic needs of interlocutors.
4.2. CDS advantage and mediation effect of corpus properties
Our investigation into word segmentation in Korean, utilising a spontaneous speech corpus, revealed superior results for CDS compared to ADS. These findings align with several prior studies (Batchelder, Reference Batchelder2002; Fourtassi et al., Reference Fourtassi, Borschinger, Johnson and Dupoux2013; Ludusan et al., Reference Ludusan, Mazuka, Bernard, Cristia and Dupoux2017; Ma et al., Reference Ma, Golinkoff, Houston and Hirsh-Pasek2011; Stärk et al., Reference Stärk, Kidd and Frost2022) and represent one of the first attempts to explore this phenomenon in the Korean language. However, it is important to note that there are conflicting reports; for instance, Cristia et al. (Reference Cristia, Dupoux, Ratner and Soderstrom2019) and Loukatou et al. (Reference Loukatou, Le Normand and Cristià2019) reported mixed or minimal advantages of CDS using similar algorithms. Despite replicating Cristia et al.’s (Reference Cristia, Dupoux, Ratner and Soderstrom2019) methods, our analysis of the Korean corpus demonstrated clear segmentation advantages for CDS. These cross-linguistic differences in CDS effectiveness are expected, as different language structures impact segmentability in varying ways. Korean exemplifies this principle particularly well, given its flexible syntax and complex agglutinative morphology, which create unique challenges and opportunities for word segmentation. The language’s SOV word order with flexible constituent placement, combined with rich inflectional morphology where multiple morphemes attach to stems, creates segmentation conditions that differ substantially from previously studied languages.
There are some notable differences between this study and Cristia et al. (Reference Cristia, Dupoux, Ratner and Soderstrom2019), such as language (current work: Korean; Cristia et al., Reference Cristia, Dupoux, Ratner and Soderstrom2019: English) and data size (i.e., based on the number of utterances; current work: ADS = 2,544; CDS = 22,203; Cristia et al., Reference Cristia, Dupoux, Ratner and Soderstrom2019: ADS = 1,772; CDS = 5,320). Although both studies used the same algorithms and matched spontaneous speech corpora, these differences appear to account for the register-based effects on word segmentation observed in our findings. The intrinsic attributes of CDS appear to facilitate word segmentation more effectively than ADS in Korean.
When we controlled for differences in linguistic properties between CDS and ADS, such as utterance length, word length, repetition, and lexical diversity, the segmentation advantage of CDS was reduced or, in the case of syllable-based models, became statistically non-significant. This indicates that the superior segmentability of CDS is not solely due to its register (i.e., being child-directed), but is largely mediated by specific corpus properties that make the input simpler, more repetitive, and more predictable for learners. In other words, we identified specific features of CDS, namely shorter utterances, higher repetition, and lower lexical diversity, that directly facilitate word segmentation. This finding highlights the importance of considering how measurable linguistic properties, rather than just social context, shape the learnability of speech input.
4.3. Comparative analysis of word segmentation algorithms
The interplay between word segmentation mechanisms and the linguistic properties of a language plays a crucial role in shaping how words are segmented from continuous speech. In this study, we selected a range of segmentation algorithms to probe which statistical cues are most informative for Korean, a head-final and morphologically rich language. TP and DiBS algorithms focus on local phonological and statistical regularities, such as syllable and diphone patterns, which are highly relevant given Korean’s predictable syllable structure. PUDDLE and AG algorithms, in contrast, operate at higher levels, incorporating longer-range phonotactic or hierarchical patterns and lexical storage, which may be less cognitively plausible for infants but allow us to test the informativeness of more global or memory-based cues. This comparative approach not only tests which statistical cues align best with Korean’s linguistic structure, but also allows us to consider how closely each algorithm approximates mechanisms that might be accessible to infants or language learners. Algorithms relying on local phonological patterns or simple transitional probabilities may better reflect cognitively plausible strategies available to early learners, whereas those requiring extensive lexical storage or hierarchical rule induction may be less so. This distinction is crucial for evaluating the relevance of different segmentation strategies to real-world language acquisition.
Forward versus backward probabilities: The impact of Korean’s head-final structure. In our study focusing on Korean language processing, we observed intricate interactions between forward and backward TPs and various corpus properties. These findings reflect Korean’s head-final structure, where forward contextual information is generally more predictive and informative than backward information. Our results align with previous evidence that forward TPs hold an advantage over backward TPs for languages like Korean (Onnis & Thiessen, Reference Onnis and Thiessen2013).
In the SOV word order of Korean, the verb occurs sentence-finally, often carrying critical semantic information. Forward TP algorithms, which calculate the probability of a word given the preceding word(s), can effectively capture these predictive cues. For example, in the sentence 나는 사과를 먹었다 na-nun sakwa-lul mek-ess-ta “I-TOPIC apple-ACC eat-PAST-DECL; I ate an apple,” the forward TP from the object 사과를 (sakwa-lul, “apple”) to the verb 먹었다 (mek-ess-ta, “ate”) is likely to be high because objects typically precede verbs, strongly predicting their occurrence. In contrast, backward TP algorithms, which calculate the probability of a word given the following word(s), may be less effective in Korean. Because Korean verbs appear at the end, they offer fewer precise cues to identify preceding words. Although the verb 먹었다 (mek-ess-ta, “ate”) suggests an eating-related context, it provides limited clues to predict specifically which words preceded it, such as 나 (na, “I”) or 사과 (sakwa, “apple”). This predictive relationship also applies at the word level. Consider the noun phrase 학교에 (hakkyo-ey, “at school”), composed of 학교 (hakkyo, “school”) followed by the postposition 에 (ey, “at”). Here, the preceding noun strongly predicts the following postposition, resulting in clear forward transitional probabilities at the word boundary.
The SOV word order and the use of grammatical markers and case particles in Korean create a linguistic environment where the preceding context is more informative for predicting upcoming words than the following context, aligning well with the strengths of forward TP algorithms. This alignment potentially explains their superior performance compared to backward TP algorithms in Korean word segmentation tasks.
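This asymmetry can be made concrete with a toy computation of absolute forward and backward TPs over syllable sequences (a simplified sketch; the simulations also included relative variants and threshold or local-minimum boundary decisions):

```python
from collections import Counter

def transitional_probs(utterances):
    """Absolute TPs over adjacent units.
    Forward:  TP(x -> y) = count(x, y) / count(x as left element)
    Backward: TP(x <- y) = count(x, y) / count(y as right element)
    A word boundary is typically posited where the TP is low."""
    bigrams, left, right = Counter(), Counter(), Counter()
    for utt in utterances:          # each utterance: a list of syllables
        for x, y in zip(utt, utt[1:]):
            bigrams[(x, y)] += 1
            left[x] += 1
            right[y] += 1
    fwd = {bg: c / left[bg[0]] for bg, c in bigrams.items()}
    bwd = {bg: c / right[bg[1]] for bg, c in bigrams.items()}
    return fwd, bwd
```

On a toy corpus where the syllables of sakwa (“apple”) always co-occur but the following particle varies, the forward TP within the noun stays at 1.0 while the TP into the particle drops, flagging the word boundary.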
While TPs can, in theory, be computed at multiple linguistic levels (phonetic, syllabic, morphological, and lexical), our study focuses on the phonetic and syllabic levels, where computational methods and theoretical foundations are most robust. Extending TP-based segmentation to higher levels, such as morphology and words, presents additional complexities, including challenges in consistently defining and classifying boundaries, and thus remains an active area of research.
Phrasal-based versus lexical-based DiBS: The role of phonological cues. The superior performance of the phrasal-based DiBS over the lexical-based approach in Korean highlights the reliance on phonetic and phonological cues such as syllable structure and phonotactics at phrasal edges, rather than on lexical information, for word segmentation. This finding aligns with the phonological properties and the agglutinative nature of the Korean language, which has a relatively simple and consistent syllable structure, but a rich set of phonotactic constraints that govern the permissible combinations of phonemes within and across phrasal boundaries. For example, diphone sequences such as [sa] or [cha] never occur across word or phrasal boundaries due to the coda neutralisation rule applying to laryngeal consonants. On the other hand, sequences such as [ln] or [ng] are allowed only across phrasal boundaries, which are also word boundaries. These phonotactic regularities provide infants with a preliminary guide to word demarcations (Daland & Pierrehumbert, Reference Daland and Pierrehumbert2011). This is likely to be particularly beneficial in Korean, where the AP, the domain of many phonological rules, is also demarcated by intonational cues in speech (Jun, Reference Jun1998) and alignment with strong beats in children’s songs (Ko, Reference Ko2024).
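The phrasal-DiBS logic can be sketched as follows (a simplified toy version in the spirit of DiBS, Daland & Pierrehumbert, Reference Daland and Pierrehumbert2011: phrase edges are treated as known word boundaries, an independence assumption combines the edge statistics, and the `prior` boundary rate is an invented placeholder value):

```python
from collections import Counter

def phrasal_dibs(phrases, prior=0.3):
    """Estimate P(word boundary | diphone x.y) from phrase-edge statistics:
        P(# | x, y) ~= P(x is phrase-final) * P(y is phrase-initial) * prior
                       / P(x, y)
    Diphones that rarely occur phrase-internally but whose halves are
    common at phrase edges receive high boundary probabilities."""
    finals, initials, diphones = Counter(), Counter(), Counter()
    for phones in phrases:              # each phrase: a list of phones
        finals[phones[-1]] += 1
        initials[phones[0]] += 1
        for x, y in zip(phones, phones[1:]):
            diphones[(x, y)] += 1
    n_phrases = len(phrases)
    total = sum(diphones.values())
    return {(x, y): min(1.0, (finals[x] / n_phrases)
                             * (initials[y] / n_phrases)
                             * prior / (c / total))
            for (x, y), c in diphones.items()}
```

A word boundary would then be posited wherever the estimated probability exceeds a decision threshold such as 0.5.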
Our results demonstrate that phonological cues, such as syllable structure and phonotactics, are a highly effective source of information for word segmentation in Korean, especially for infants who, early in development, may not yet rely on lexical-based strategies. These bottom-up cues from the speech input help children discern word boundaries and support early word learning. While our findings are specific to Korean, it is important to note that transitional probabilities and phonological cues at this level are also recognised as valuable segmentation sources in other languages, including English, though the degree of reliance may vary with linguistic structure.
Adaptor grammar advantage. The AG algorithm’s superior performance in Korean word segmentation stems from its use of Pitman-Yor process-based adaptors that hierarchically cache and reuse frequent linguistic structures. This creates a “rich get richer” dynamic where common Korean structures like case markers (이/가, 은/는) become increasingly probable through repeated caching. This hierarchical caching mechanism is particularly effective for Korean’s agglutinative morphology, where morpheme boundaries often align with syllable boundaries, allowing AG to learn predictable attachment patterns without requiring preset linguistic unit definitions. While such alignment could theoretically increase the risk of oversegmentation in morphologically rich languages (Loukatou et al., Reference Loukatou, Le Normand and Cristià2019, Reference Loukatou, Stoll, Blasi and Cristia2022b), AG’s multi-level statistical learning and ability to identify co-occurring morpheme sequences help mitigate this risk by favouring linguistically meaningful units over arbitrary boundaries. The Bayesian nonparametric framework enables flexible adaptation to Korean’s complex morphological structure by automatically discovering optimal segmentation patterns from the data.
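The caching dynamic can be illustrated with a minimal Pitman-Yor sampler (a toy, one-table-per-type simplification of the process, not the AG inference procedure itself; the base vocabulary and parameter values below are invented):

```python
import random

def pitman_yor_sample(n_draws, base_vocab, a=0.5, b=1.0, seed=0):
    """Draw items from a Pitman-Yor process with discount a and
    concentration b over a uniform base distribution. Previously drawn
    items are reused with probability proportional to (count - a),
    so frequent items become ever more likely: 'rich get richer'."""
    rng = random.Random(seed)
    counts, draws = {}, []
    for _ in range(n_draws):
        p_new = (b + a * len(counts)) / (len(draws) + b)
        if not counts or rng.random() < p_new:
            item = rng.choice(base_vocab)        # fresh draw from the base
        else:
            items = list(counts)
            item = rng.choices(items,            # reuse a cached item
                               weights=[counts[i] - a for i in items])[0]
        counts[item] = counts.get(item, 0) + 1
        draws.append(item)
    return draws, counts
```

In AG, the cached units are not atomic symbols but whole subtrees, so frequently re-derived strings, such as a stem plus case marker, come to behave like stored lexical chunks.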
4.4. Cross-linguistic implications
Understanding word segmentation requires recognising that languages differ not only in their structure but also in the cues they make available to learners. Specifically, while Korean’s clear syllable structure and agglutinative morphology make syllables especially salient for segmentation, other languages highlight different cues: French and Spanish listeners benefit from syllable boundaries (Cutler & Carter, Reference Cutler and Carter1987), English relies more on stress patterns (Cutler, Reference Cutler, Reed and Levis2015; Cutler & Norris, Reference Cutler and Norris1988), and languages like Turkish utilise vowel harmony (Hohenberger et al., Reference Hohenberger, Altan, Kaya, Tuncer and Avcu2016) or French employs final lengthening (Welby, Reference Welby2007) as segmentation aids. These differences are not merely theoretical; experimental evidence from artificial-language learning studies shows that when presented with identical speech streams, listeners from different language backgrounds preferentially use the cues most prominent in their native language (Spanish in Toro-Soto et al., Reference Toro-Soto, Rodríguez-Fornells and Sebastián-Gallés2007; English, French, and Dutch in Tyler & Cutler, Reference Tyler and Cutler2009; German and Italian in Ordin & Nespor, Reference Ordin and Nespor2016). Thus, the optimal segmentation strategy is shaped by the typological features of each language. In the following subsections, we detail how Korean’s linguistic properties shape segmentation strategies, focusing on the role of syllables, the influence of prosody and statistical cues, and the importance of one-word utterances in CDS.
Advantages of syllable-level word segmentation in Korean. Korean primarily employs simple syllable structures, such as CV and CVC, which stand in contrast to the more complex syllable forms typical in English. As a syllable-timed language, each syllable in Korean tends to occupy a relatively uniform amount of time, which makes syllable-based processing both natural and effective, closely aligning with the language’s morpho-phonemic attributes. These syllables serve not merely as rhythmic units but often function as morphemes or distinct word parts, thereby enhancing the utility of syllable-based analytical approaches. Each syllable carries significant morphemic information that is crucial for comprehending the language’s morphology and syntax (Kim, Reference Kim2004).
The significance of the syllable as a unit of psychological processing is further exemplified in Korean traditional poetry, such as Sijo, where syllable count plays a critical role analogous to the use of meter in English formal poetry. In linguistic games and wordplay, syllables similarly hold distinct importance. For instance, in 끝말잇기 kkeutmal-ittgi ‘end-word connecting,’ or word chain games, play is strictly based on syllables, in contrast to the segmental-level manipulation in Pig Latin in English (Footnote 5). This emphasis on syllables underpins a richer linguistic interaction and highlights their essential role in shaping Korean linguistic structure and psychological processing. More broadly, this illustrates how children can flexibly learn from the structures that provide the most salient evidence for segmentation in their language, with syllables serving as especially informative units in Korean.
In addition, the poor performance of lexical algorithms, such as the lexical-DiBS and PUDDLE algorithms, in Korean word segmentation may be attributed to their inability to fully capture the statistical cues and linguistic patterns specific to the language. Lexical-DiBS relies on lexical-related information and may struggle with the agglutinative nature of Korean, where words are formed by combining multiple morphemes. PUDDLE’s incremental learning approach may not align well with the statistical properties of Korean, where the agglutinative nature of grammatical markers and particles could potentially bias the algorithm’s lexicon construction. These findings suggest that infants may not rely solely on lexical information or incremental learning strategies for word segmentation in languages with complex morphological structures.
Notably, these structural and psychological properties that make syllables advantageous for word segmentation in Korean are not unique to this language. Similar syllable-morpheme alignment and syllable-based processing advantages are also found in other languages, such as Chinese, suggesting broader cross-linguistic relevance for future research.
Word segmentation in Korean: statistical cues and beyond. Another aspect to consider is the influence of prosody. Korean deviates significantly from languages like English, where lexical stress (Cutler, Reference Cutler, Reed and Levis2015) serves as one important cue among multiple segmentation strategies, including transitional probabilities and phonotactic patterns (Cutler & Norris, Reference Cutler and Norris1988; Saffran et al., Reference Saffran, Aslin and Newport1996; Thiessen & Saffran, Reference Thiessen and Saffran2003). Importantly, both Korean and English provide rich distributional structure that is, in principle, usable for segmentation; however, the relative weighting and availability of specific cues differ across languages. In Korean, the absence of lexical stress precludes stress-based segmentation strategies, potentially increasing learners’ reliance on alternative sources of information, such as lexical repetition and prosodic grouping. From this perspective, utterance-level prosodic organisation itself becomes a primary locus of segmentation-relevant information.
From an interactional perspective, such learner-side pressures plausibly invite systematic adjustments in caregiver speech, a pattern consistent with longstanding accounts of CDS as an accommodated register shaped in response to children’s linguistic capacities (Giles et al., Reference Giles, Taylor and Bourhis1973; Snow, Reference Snow1972, Reference Snow, Snow and Ferguson1977). In Korean, where clause structure is canonically verb-prominent with an SOV order, nouns in CDS are often rendered salient through repetition and flexible ordering, and may surface in utterance-final positions associated with heightened perceptual salience (Ko et al., Reference Ko, Jo and Chai2022). This pattern illustrates how prosodically prominent edges can be populated with salient lexical material, which may in turn strengthen the association between structural position and perceptual prominence. Taken together, these cues point to a segmentation strategy in Korean that is grounded in prosodic phrasing (Kim & Cho, Reference Kim and Cho2009) and edge-prominence (Jun, Reference Jun and Jun2014) rather than in lexical stress, and thus reflect a typologically distinct pathway to early segmentation relative to stress-timed languages such as English.
While the role of onomatopoeic words and word play in word segmentation was not directly examined in the current study, a related study by Chai and Ko (Reference Chai and Ko2025) did not find a positive association between these features and word segmentation performance. This pattern raises the possibility that such features exert their influence primarily through acoustic salience or attentional engagement, rather than by systematically shaping the distributional cues that computational and behavioural models of segmentation typically exploit. Although the current study does not evaluate the specific impact of onomatopoeia, word play, and expressive lengthening on word segmentation directly, their presence in CDS highlights the unique characteristics of speech directed to children and the potential for these features to influence language acquisition through acoustic and attentional mechanisms.
The prevalence of one-word utterances in CDS varies across languages and studies, owing in part to differing criteria for defining these utterances (Stärk et al., Reference Stärk, Kidd and Frost2022). Some studies use inclusive definitions, while others employ stricter criteria (Brent & Siskind, Reference Brent and Siskind2001). The current study adopted a conservative approach for the Korean corpus by excluding interjections from the count of one-word utterances. This exclusion yielded a lower percentage of one-word utterances than reported in French and German studies that used more inclusive criteria; when isolated interjections are included, however, the percentage aligns with that of the German study (Stärk et al., Reference Stärk, Kidd and Frost2022). These variations underscore the importance of explicit definitions for cross-linguistic comparison and for understanding language-specific factors in CDS structure. A refined analysis distinguishing strict single-word utterances from isolated interjections is crucial for understanding their potentially differential effects on word segmentation.
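The effect of the counting criterion can be made concrete with a minimal sketch. The utterances and the interjection list below are hypothetical romanised examples, not items from the Ko corpus; only the contrast between the inclusive and the conservative criterion is at issue.

```python
# Sketch: how the one-word-utterance rate shifts under inclusive vs. strict
# criteria. Utterances and interjections are hypothetical romanised examples.
INTERJECTIONS = {"eung", "eo", "ne", "wa"}  # hypothetical interjection inventory

utterances = [
    ["mamma"],           # single content word
    ["eung"],            # isolated interjection
    ["kkoch", "ipeyo"],  # multi-word utterance
    ["wa"],              # isolated interjection
    ["kangaci"],         # single content word
]

def one_word_rate(utts, exclude_interjections):
    """Proportion of one-word utterances under a given criterion."""
    singles = [u for u in utts if len(u) == 1]
    if exclude_interjections:
        singles = [u for u in singles if u[0] not in INTERJECTIONS]
    return len(singles) / len(utts)

inclusive = one_word_rate(utterances, exclude_interjections=False)
strict = one_word_rate(utterances, exclude_interjections=True)  # conservative
print(f"inclusive: {inclusive:.0%}, strict: {strict:.0%}")  # inclusive: 80%, strict: 40%
```

On this toy input the rate halves under the conservative criterion, which is the kind of gap that makes cross-study comparison hazardous without explicit definitions.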
Beyond methodological considerations, isolated words function as perceptual anchors that facilitate speech segmentation. Research has shown that these “anchor words,” previously heard in isolation, help infants segment novel words from continuous speech (Cunillera et al., Reference Cunillera, Laine and Rodriguez-Fornells2016; Lew-Williams et al., Reference Lew-Williams, Pelucchi and Saffran2011). This anchoring mechanism operates through a bootstrapping process: words encountered in isolation become perceptually familiar and subsequently “pop out” when heard embedded within continuous utterances, providing reference points that help infants identify adjacent word boundaries. This process may be particularly helpful in Korean, given its complex interplay between morphological structure and prosodic units, such as APs. Due to Korean’s agglutinative morphology and the frequent, but not consistent, alignment of content words with AP boundaries, isolated words in Korean CDS likely offer crucial perceptual anchors, aiding infants in parsing the language’s morphological and prosodic complexity.
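The bootstrapping process described above can be sketched in a few lines: a word first encountered in isolation is matched inside a continuous syllable stream, and its edges are proposed as candidate boundaries for the neighbouring material. The syllable strings are hypothetical romanised examples, and the sketch is an illustration of the mechanism, not any model used in the study.

```python
# Sketch of the anchoring mechanism: known isolated words are located in a
# continuous syllable stream, and their edges yield candidate boundaries.
def anchor_boundaries(syllables, anchors):
    """Return boundary indices implied by known anchor words in the stream."""
    boundaries = set()
    for anchor in anchors:
        n = len(anchor)
        for i in range(len(syllables) - n + 1):
            if tuple(syllables[i:i + n]) == tuple(anchor):
                boundaries.add(i)       # boundary before the anchor
                boundaries.add(i + n)   # boundary after the anchor
    return sorted(boundaries)

# Suppose "kkoch" was heard in isolation; now it appears mid-utterance.
stream = ["i", "ge", "kkoch", "i", "ya"]
anchors = [["kkoch"]]
print(anchor_boundaries(stream, anchors))  # [2, 3]
```

The boundaries flanking the familiar word also delimit the adjacent stretches, which is exactly the reference-point role that the behavioural findings attribute to anchor words.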
4.5. Limitations and future avenues
The current study highlights the significance of language-specific structural organisation in shaping segmentation cues, while also revealing limitations that call for further work to disentangle typological effects from methodological variation. Infants universally exploit statistical regularities in broadly similar ways; however, the particular cues that instantiate these regularities differ across languages. Korean infants, for instance, may rely on syllable structure and prosodic boundaries (Jun, Reference Jun1998; Kim & Cho, Reference Kim and Cho2009), whereas English infants draw on lexical stress and strong–weak syllable patterns (Cutler & Norris, Reference Cutler and Norris1988; Jusczyk et al., Reference Jusczyk, Houston and Newsome1999). Other languages highlight different cues, such as vowel harmony in Turkish (Hohenberger et al., Reference Hohenberger, Altan, Kaya, Tuncer and Avcu2016), final lengthening in French (Welby, Reference Welby2007), or pitch movements in Dutch (Johnson & Seidl, Reference Johnson and Seidl2009). Although this study highlights segmentation cues specific to Korean, we did not directly compare these processes across languages. Future research should systematically examine how language-specific structures shape statistical segmentation cues and how infants adapt accordingly.
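The shared statistical regularity underlying these language-specific cues is often modelled as transitional probabilities (TPs) between adjacent syllables, with boundaries posited where TP dips. A minimal sketch over a toy romanised corpus (hypothetical, not data from this study) shows the contrast between within-word and across-word TPs:

```python
from collections import Counter

# Minimal transitional-probability sketch: P(next | current) over adjacent
# syllables, estimated from a toy romanised corpus (hypothetical data).
def transitional_probs(corpus):
    """Estimate P(b | a) for every adjacent syllable pair (a, b)."""
    pair_counts, syll_counts = Counter(), Counter()
    for utt in corpus:
        for a, b in zip(utt, utt[1:]):
            pair_counts[(a, b)] += 1
            syll_counts[a] += 1
    return {pair: c / syll_counts[pair[0]] for pair, c in pair_counts.items()}

corpus = [
    ["eom", "ma", "kkoch"],
    ["eom", "ma", "mul"],
    ["kkoch", "eom", "ma"],
]
tps = transitional_probs(corpus)
# Within-word TP is high: "eom" is always followed by "ma".
# Across-word TPs are lower: "ma" is followed by varying syllables.
print(tps[("eom", "ma")])  # 1.0
```

A TP-based learner would posit a boundary after "ma", where the probability dips; which surface units carry this computation (stressed feet, syllables, AP edges) is precisely what differs across languages.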
Additionally, the study highlights the importance of considering the social context and the nature of the linguistic input when investigating word segmentation. While the data were collected in a semi-naturalistic setting, the ADS corpus was smaller than the CDS corpus. The ADS in the Ko corpus included speech involving both family members and experimenters, but these subsets were not separated in the segmentation analysis because of the small corpus size. As a result, social distance could not be fully controlled when the two types of ADS were merged for the segmentation analyses. Future research should collect larger and more balanced corpora to enable a more fine-grained analysis of the linguistic properties and segmentability of speech across different social contexts and speaker–listener relationships.
While age effects were omitted from the current study due to space constraints and are detailed in a separate study by Chai & Ko (Reference Chai and Ko2025), investigating age-related changes in the statistical properties of CDS and their impact on word segmentation performance could provide valuable insights into the developmental trajectory of language acquisition.
This study’s findings underscore the intricate interplay between the language input and the assumptions and mechanisms of different word segmentation algorithms. Future research should systematically vary linguistic properties like the percentage of monosyllabic words, syllable complexity, and word length distribution to better understand how these factors interact with various algorithms. Additionally, investigating these effects in naturalistic language input across different languages could provide valuable insights into the generalisability of these findings and the potential role of language-specific adaptations in word segmentation strategies.
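Systematically comparing algorithms across manipulated inputs presupposes a shared scoring metric. A common choice in the segmentation literature is boundary precision, recall, and F1 against gold word boundaries; the sketch below illustrates the computation on hypothetical segmentations, and is not the study's own evaluation code.

```python
# Boundary F1 sketch: score a predicted segmentation against gold boundaries.
# Each segmentation is a list of words, each word a list of syllables.
def boundary_set(words):
    """Word-internal boundary positions (in syllables) for a segmentation."""
    positions, i = set(), 0
    for w in words[:-1]:
        i += len(w)
        positions.add(i)
    return positions

def boundary_f1(predicted, gold):
    """Harmonic mean of boundary precision and recall."""
    p_b, g_b = boundary_set(predicted), boundary_set(gold)
    hits = len(p_b & g_b)
    precision = hits / len(p_b) if p_b else 0.0
    recall = hits / len(g_b) if g_b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [["a", "ppa"], ["mul"], ["ju", "e"]]   # gold: appa | mul | jue
pred = [["a"], ["ppa", "mul"], ["ju", "e"]]   # first boundary misplaced
print(boundary_f1(pred, gold))  # 0.5
```

Holding the metric fixed while varying, say, the proportion of monosyllabic words in the input is what would isolate the property-by-algorithm interactions discussed above.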
The definition of words in this study was based on their orthographic representation, specifically the spacing between words in the transcripts. In agglutinative languages like Korean, a single orthographic word can encompass multiple morphemes, conveying what might be several words’ worth of meaning in non-agglutinative languages such as English. Conversely, Korean has a category of nouns called dependent nouns, which are always cliticised but are orthographically represented as independent words. For example, in a phrase like 할 수 있다 hal swu iss-ta “do way exist; can do,” the dependent noun 수 swu must be integrated with a verb to form a meaningful unit. The resulting phrase, though syntactically complex, forms a single AP, a prosodic unit demarcated by intonation. Given the important role APs play in Korean phonology, infants might segment the speech stream based on the AP rather than on individual words or morphemes.
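The gap between the two criteria can be shown directly on the example from the text: the orthographic criterion used in the study treats each space as a boundary, whereas an AP-based criterion would treat the whole dependent-noun construction as one unit (the AP grouping shown is illustrative).

```python
# Orthographic vs. prosodic word counts for a dependent-noun construction.
# The AP grouping is illustrative of the single-AP analysis described above.
phrase = "할 수 있다"                 # hal swu iss-ta, "can do"
orthographic_words = phrase.split()   # spacing-based criterion -> 3 units
ap_units = [phrase.replace(" ", "")]  # single Accentual Phrase -> 1 unit
print(len(orthographic_words), len(ap_units))  # 3 1
```

Depending on which criterion a learner (or a model) adopts, the same utterance thus presents three targets or one, which is why the choice of segmentation unit matters for evaluating performance.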
Future research should identify and validate the optimal segmentation units for Korean, considering how prosody, morphology, and perceptual cues interact. Although we assessed segmentation using adult-defined word boundaries, infants may segment speech using units that do not correspond neatly to traditional linguistic categories. Thus, methodologies such as experimental paradigms that examine infants’ sensitivity to various boundary types or computational models allowing flexible, multi-level segmentation may be needed. Leveraging raw audio data could also offer a more ecologically valid approach, revealing whether infants segment speech according to conventional linguistic units or whether segmentation emerges primarily from statistical and acoustic regularities.
5. Conclusion
This study provides a detailed analysis of word segmentation in Korean, showing that features of CDS, such as shorter, simpler utterances, higher repetition, and more frequent single-word utterances, enhance segmentability compared to ADS. These findings highlight clear structural differences between CDS and ADS, reflecting how adults adapt their language to support children’s linguistic needs. We demonstrate that the segmentation advantage of CDS is mediated by specific linguistic properties, and that these properties interact with segmentation algorithms in language-specific ways; the advantage is thus not merely a product of corpus properties but also of how those properties interact with computational models designed to simulate human linguistic processing. While our results clarify the importance of Korean-specific cues such as syllable structure, morphological complexity, and repetition, we acknowledge that the study does not include direct cross-linguistic comparisons or address all possible segmentation cues, such as morpheme-level boundaries or prosodic units. Future research should investigate which segmentation cues are most reliable for Korean-learning children and how these differ from those in other languages, and should extend the analysis to additional linguistic units and to distinctions within ADS, to deepen our understanding of word segmentation mechanisms in Korean.
Data availability statement
The data and analysis files associated with this study are available through the Open Science Framework at https://osf.io/uadw9/.
Funding statement
This work was supported by the National Research Foundation of Korea (NRF-2025S1A6B5A02004207) and a research fund from Chosun University, 2024.
Competing interests
The authors declare none.