The Alignment between Language Properties and Computational Algorithms Enhances Statistical Word Segmentation: Evidence from Korean Child-Directed Speech

Published online by Cambridge University Press: 29 April 2026

Jun Ho Chai
Affiliation:
Chosun University, Gwangju, Republic of Korea
Seongmin Mun
Affiliation:
Chosun University, Gwangju, Republic of Korea
Eon-Suk Ko*
Affiliation:
Chosun University, Gwangju, Republic of Korea
*Corresponding author: Eon-Suk Ko; Email: eonsukko@chosun.ac.kr

Abstract

This study investigates whether child-directed speech (CDS) exhibits enhanced segmentability compared to adult-directed speech (ADS) and explores how specific linguistic properties of each register influence computational word segmentation performance in Korean. Employing a speaker-matched corpus of naturalistic Korean CDS and ADS, we observed that Korean CDS features shorter utterances and words, lower lexical diversity, fewer hapax legomena and interjections, a greater proportion of onomatopoeia and word play, a higher frequency of one-word utterances, and lower lexical ambiguity than ADS. Computational algorithms revealed significantly higher word segmentation F-scores for CDS than ADS, suggesting that child-oriented linguistic adaptations in CDS facilitate segmentation. This observation is further supported by statistical modelling, which indicates that the enhanced segmentability in CDS is modulated by the linguistic properties of the register. We discuss the nuanced roles of these properties in shaping the performance of segmentation algorithms.
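
The segmentation F-scores mentioned above can be illustrated with a minimal sketch. It assumes a token-level F-score, in which a predicted word counts as correct only when both of its boundaries match the gold segmentation; the abstract does not state which F-score variant the study used, so this definition, the function names, and the toy English utterances are all illustrative assumptions rather than the authors' evaluation code.

```python
def word_spans(words):
    """Return the set of (start, end) offsets, one per word token.

    Each word is any sequence of units (a string of characters, or a
    list of phones or syllables), so len(word) is its length in units.
    """
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_f_score(gold_utterances, predicted_utterances):
    """Return (precision, recall, F) pooled over all utterances.

    Assumption: a predicted token is correct only if the same span
    appears in the gold segmentation of the same utterance.
    """
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_utterances, predicted_utterances):
        # Both segmentations must cover the same underlying units.
        if sum(len(w) for w in gold) != sum(len(w) for w in pred):
            raise ValueError("gold and predicted utterances differ in length")
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (invented): the model undersegments the first utterance.
gold = [["the", "dog"], ["look"]]
predicted = [["thedog"], ["look"]]
precision, recall, f_score = token_f_score(gold, predicted)
print(precision, recall, f_score)
```

In this toy case one of the two predicted tokens is correct and one of the three gold tokens is recovered, so precision is 1/2, recall is 1/3, and the F-score is 0.4. Because words are treated as sequences of units, the same function applies whether the input is unitised as phones or as syllables.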

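Abstract (translated from the Korean)

This study examines whether Korean child-directed speech (CDS) shows enhanced word segmentability relative to adult-directed speech (ADS), and computationally explores how specific linguistic properties of each register affect word segmentation performance. An analysis of a naturalistic Korean CDS and ADS corpus produced by the same speakers showed that Korean CDS, compared with ADS, has shorter utterances and words, lower lexical diversity, and fewer hapax legomena and interjections. By contrast, it has a greater proportion of onomatopoeia and word play, a higher proportion of one-word utterances, and lower lexical ambiguity. Computational algorithms yielded significantly higher word segmentation F-scores for CDS than for ADS, suggesting that the child-oriented linguistic adaptations in CDS facilitate word segmentation. This observation was also supported by statistical modelling, which confirmed that the enhanced segmentability of CDS is modulated by the linguistic properties of the register. We discuss the complex and nuanced roles these properties play in shaping the performance of segmentation algorithms.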

Information

Type
Research Article
Creative Commons
CC BY 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press
Table 1. Linguistic properties of the corpus, reported as mean and standard deviation, M (SD), across registers (CDS, ADS, ADS-call-friend)
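
As a rough illustration of the kind of corpus-level properties summarised in Table 1, the sketch below computes a few of them from tokenised utterances. The specific operationalisations (type-token ratio as the lexical diversity measure, hapax legomena as a proportion of word types, word length counted in units), the function name `corpus_properties`, and the toy data are assumptions made here for illustration; the study's own definitions may differ.

```python
from collections import Counter


def corpus_properties(utterances):
    """Compute simple corpus-level properties.

    `utterances` is a non-empty list of utterances, each a non-empty
    list of word tokens (strings, or sequences of phones or syllables).
    """
    tokens = [word for utterance in utterances for word in utterance]
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    return {
        # Mean number of word tokens per utterance.
        "mean_utterance_length": n_tokens / len(utterances),
        # Mean number of units per word token.
        "mean_word_length": sum(len(w) for w in tokens) / n_tokens,
        # Assumed lexical diversity measure: types divided by tokens.
        "type_token_ratio": n_types / n_tokens,
        # Share of word types that occur exactly once.
        "hapax_proportion": sum(1 for c in counts.values() if c == 1) / n_types,
        # Share of utterances made up of a single word.
        "one_word_utterance_proportion":
            sum(1 for u in utterances if len(u) == 1) / len(utterances),
    }


# Toy data (invented), three short utterances.
toy = [["look"], ["look", "at", "the", "dog"], ["dog"]]
for name, value in corpus_properties(toy).items():
    print(f"{name}: {value:.3f}")
```

Measures such as the type-token ratio depend on corpus size, which is one reason a comparison across registers would control for the total number of tokens.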

Table 2. Summary of a series of linear regressions comparing corpus properties across registers (CDS, ADS, ADS-call-friend), with corpus size (total tokens) as the covariate. Values are estimates, with standard errors (SE) in parentheses.

Figure 1. Scatter plot showing the distribution of raw F-scores measuring model performance across speech registers (CDS, ADS), speech processing algorithms, and phone/syllable units on word segmentation simulations.

Figure 2. Comparison of model-estimated marginal mean F-scores on word segmentation simulations across speech registers (CDS, ADS, ADS-CF), algorithms, and unit type (phone or syllable). Purple bars represent 95% confidence intervals around each estimate. Red arrows highlight pairwise contrasts between conditions where a statistically significant difference was found.

Figure 3. Predicted F-scores (emmeans) for word segmentation models comparing child-directed speech (CDS), adult-directed speech (ADS), and Call Friend ADS (ADS-CF) across base and mediation models using phone-level (left) and syllable-level (right) units. Including corpus-level linguistic properties as covariates in the mediation models reduces the CDS segmentation advantage over ADS and ADS-CF, indicating that corpus properties mediate the register effects. Error bars reflect estimated model uncertainty.