Largest-chunk strategy for syllable-based segmentation

Published online by Cambridge University Press: 17 May 2018

Abstract

We apply the largest-chunk segmentation algorithm to texts whose smallest units are syllables. The algorithm was proposed in Drienkó (2016, 2017a), where it was applied to texts whose smallest units were letters/characters. The present study investigates whether the largest-chunk segmentation strategy yields higher precision of boundary inference when syllables rather than characters are processed. The algorithm looks for successive largest chunks that occur at least twice in the text, where text means a single sequence of characters, without punctuation or spaces. The results are quantified in terms of four precision metrics: Inference Precision, Alignment Precision, Redundancy, and Boundary Variability. We segment CHILDES texts in four languages: English, Hungarian, Mandarin, and Spanish. The data suggest that syllable-based segmentation enhances inference precision. Thus, in a cross-linguistic context, our experiments (i) provide further support for the possible role of a cognitive largest-chunk segmentation strategy, and (ii) point to the syllable as a better unit for segmentation than the letter/phoneme/character.
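The strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm as specified in Table 1 of the article; it only follows the one-sentence description above. The function names `count_occurrences` and `largest_chunk_segment` are hypothetical, and two details are assumptions: overlapping occurrences count as repeats, and a unit that starts no repeated chunk becomes a segment on its own. The input is any sequence of units, so the same code handles characters and syllables.

```python
from typing import List, Sequence, Tuple


def count_occurrences(units: Tuple[str, ...], chunk: Tuple[str, ...]) -> int:
    """Count how often `chunk` occurs in `units` (overlaps allowed: an assumption)."""
    n, m = len(units), len(chunk)
    return sum(1 for i in range(n - m + 1) if units[i:i + m] == chunk)


def largest_chunk_segment(units: Sequence[str]) -> List[Tuple[str, ...]]:
    """Greedy left-to-right segmentation into largest repeated chunks.

    At each position, take the longest chunk starting there that occurs
    at least twice in the whole text. If no chunk of two or more units
    repeats, fall back to a single unit (an assumption of this sketch).
    """
    seq = tuple(units)
    segments: List[Tuple[str, ...]] = []
    pos = 0
    while pos < len(seq):
        best = 1  # fallback: a single unit
        # Try the longest candidate first and stop at the first repeated one.
        for length in range(len(seq) - pos, 1, -1):
            if count_occurrences(seq, seq[pos:pos + length]) >= 2:
                best = length
                break
        segments.append(seq[pos:pos + best])
        pos += best
    return segments


# Character units: the text is one unbroken string.
char_segments = ["".join(s) for s in largest_chunk_segment("babyisbabyit")]
print(char_segments)  # ['babyi', 's', 'babyi', 't']

# Syllable units: the same text, pre-split into syllables by hand.
syll_segments = [" ".join(s) for s in
                 largest_chunk_segment(["ba", "by", "is", "ba", "by", "it"])]
print(syll_segments)  # ['ba by', 'is', 'ba by', 'it']
```

On this toy input, the character-based run places a boundary inside a word (after "babyi"), whereas the syllable-based run cuts only at syllable edges, which here coincide with word boundaries. This is only an illustration of why the choice of unit can matter, not a reproduction of the article's reported values.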

Information

Type: Article
Copyright © UK Cognitive Linguistics Association 2018

Table 1. The Largest-Chunk Segmentation Algorithm

Table 2. Calculating precision values (characters)

Table 3. Precision values for Experiment 1 (Anne)

Table 4. Precision values for Experiment 2 (Miki)

Table 5. Precision values for Experiment 3 (Beijing)

Table 6. Precision values for Experiment 4 (Koki)

Table 7. Precision values for Experiment 5 (Gulliver)

Fig. 1. Precision values for all the texts used in the segmentation experiments. (IP: Inference Precision; R: Redundancy; AP: Alignment Precision; BV: Boundary Variability)

Table 8. IP values across texts

Table 9. R values across texts

Table 10. AP values across texts

Table 11. BV values across texts

Table 12. Details for measuring BV in syllables

Table 13. Calculating precision values (characters)

Table 14. Calculating precision values (syllables)

Table 15. Precision values for the ‘baby is baby it’ example

Table 16. Precision values for the ‘what about what a boot’ example