Rhythms of Phones, Syllables, and Words in Connected Speech

doi:10.1017/9781009295888.012

9 - Rhythms of Phones, Syllables, and Words in Connected Speech

from Section 2 - Acoustic and Sublexical Rhythms

Published online by Cambridge University Press: 23 April 2026

Yuran Zhang, Jiajie Zou, and Nai Ding

Edited by Lars Meyer (Max Planck Institute for Human Cognitive and Brain Sciences) and Antje Strauss (University of Konstanz)

Summary

In speech, linguistic information is encoded in hierarchically organized units such as phones, syllables, and words. In auditory neuroscience, it is widely accepted that syllables in connected speech are quasi-rhythmic, and the rhythmicity makes them suitable to be encoded by theta-band neural oscillations. The rhythmicity of phones or words, however, is more controversial. Here, we analyze the statistical regularity in the duration of phones, syllables, and words, based on large corpora in English and Mandarin Chinese. The coefficient of variation (CV) of unit duration is slightly lower for syllables than phones and words, consistent with the idea that syllables are more rhythmic than phones and words, but the difference is weak. The mean duration of phones, syllables, and words matches the timescales of alpha-, theta-, and delta-band neural oscillations, respectively.

Keywords

duration; stimulus onset asynchrony (SOA); coefficient of variation (CV); syllable rate

Information

Type: Chapter
Information: Rhythms of Speech and Language: Physiology, Cognition, Culture, pp. 147–157

DOI: https://doi.org/10.1017/9781009295888.012

Publisher: Cambridge University Press

Print publication year: 2026
Creative Commons: This content is Open Access and distributed under the terms of the Creative Commons Attribution licence CC-BY-NC 4.0 https://creativecommons.org/cclicenses/

9 Rhythms of Phones, Syllables, and Words in Connected Speech

9.1 Introduction

Whether speech is rhythmic is a controversial topic (Arvaniti, 2009; Goswami and Leong, 2013; Turk and Shattuck-Hufnagel, 2013; see Chapter 11). On the one hand, listeners tend to feel that speech is rhythmic, and linguists have divided languages into syllable-timed and stress-timed languages, in which the syllable or stress is perceived to have a regular rhythm (Dauer, 1983). On the other hand, no study has revealed strict periodicity in any speech unit or feature (Turk and Shattuck-Hufnagel, 2013; see Chapter 14). Therefore, the rhythm of speech cannot arise from strict periodicity in sound features or linguistic units. Since speech is a highly complex signal, the rhythm of speech can be discussed at multiple levels (see Chapter 20 for a discussion of the prosodic hierarchy of speech). Here, we consider the rhythm of three linguistic levels, that is, phones, syllables, and words. Furthermore, we only analyze the duration of a single unit or the interval between the onsets of neighboring units, instead of higher-order rhythms defined by the patterning of the onsets of multiple units. The phone is the basic phonetic unit of speech, and phones can be further divided into vowels and consonants. A syllable typically contains one vowel that may be preceded and/or followed by a few consonants. Phones and syllables are units of speech sound, while morphemes and words are the basic units of meaning. A morpheme is the smallest linguistically defined unit of meaning (Boey, 1975). A word is the basic unit of writing in some languages, such as English, but it is less well defined in other languages, such as Chinese (Tan and Perfetti, 1998).

In the last couple of decades, the rhythm of syllables has received a lot of attention. It has been shown that syllables have a relatively regular duration, and the mean rate of syllables is typically between 4 and 8 Hz (Coupé et al., 2019; Greenberg et al., 2003; Pellegrino et al., 2011; see Chapter 8). Furthermore, it has been suggested that syllables have a relatively reliable acoustic correlate, that is, the speech envelope (Assaneo and Poeppel, 2018; Cummins, 2012; Giraud and Poeppel, 2012; see also Zhang et al., 2023a, for different opinions). The speech envelope refers to the low-frequency (mainly below 30 Hz) changes in acoustic power, and its power spectrum, referred to as the modulation spectrum, peaks between 4 and 8 Hz, corresponding to the mean rate of syllables (Ding et al., 2017; Greenberg et al., 2003). Therefore, the rhythm of syllables, which is roughly equivalent to the rhythm of the speech envelope, can be perceived even without extensive linguistic knowledge, unlike the rhythm of phones (Liberman et al., 1974; see Chapter 11) or words, whose perception requires learning. Although a number of studies have demonstrated that syllables have a relatively regular duration, the statistical regularity of the duration of other linguistic units, such as phones and words, has seldom been investigated. Here, we analyzed the duration of phones, syllables, and words, and tested whether syllables have a more regular duration than phones and words. We analyzed the two languages that have the most users, Chinese and English.

9.2 Methods

9.2.1 Corpus

The eight speech corpora included in this analysis (Table 9.1) were extracted from six speech datasets: DARPA-TIMIT (Garofolo et al., 1993), GigaSpeech (Chen et al., 2021), TED-LIUM (Rousseau et al., 2012), Chinese-TIMIT (Yuan et al., 2017), AISHELL-1 (Bu et al., 2017), and WenetSpeech (Zhang et al., 2022). The selection and processing of the corpora followed a recent study on the syllabic rhythm of speech (Zhang et al., 2023a).

Table 9.1 Speech corpora

9.2.2 Phone Duration

The boundaries of each phone (Figure 9.1) are automatically extracted from the audio and its transcription using the Montreal Forced Aligner (MFA) (McAuliffe et al., 2017). The MFA locates the boundaries between phones with a resolution of 10 ms; that is, the phone duration can only be a multiple of 10 ms. The MFA does not allow phones to overlap in time. This MFA-based method has been validated on the corpora for which manual labels of phone boundaries are available (Zhang et al., 2023a). The duration of a phone is defined as the time difference between phone onset and phone offset. The stimulus onset asynchrony (SOA) of phones is defined as the time difference between the onsets of two adjacent phones. The SOA is affected by the silence period after a phone, but the phone duration is not.
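The two definitions above can be illustrated with a minimal sketch. The interval format, the function names, and the toy alignment are assumptions made for illustration; they are not the output format of the MFA.

```python
from typing import List, Tuple

# Hypothetical interval format: (label, onset in ms, offset in ms).
Interval = Tuple[str, int, int]


def durations(units: List[Interval]) -> List[int]:
    """Duration of each unit: offset minus onset."""
    return [offset - onset for _, onset, offset in units]


def soas(units: List[Interval]) -> List[int]:
    """SOA: time between the onsets of two adjacent units."""
    return [units[i + 1][1] - units[i][1] for i in range(len(units) - 1)]


# Toy alignment on a 10 ms grid, with a 50 ms silence after the second phone.
phones = [("h", 0, 60), ("ai", 60, 200), ("t", 250, 310)]

print(durations(phones))  # [60, 140, 60]
print(soas(phones))       # [60, 190]
```

In this toy example, the silence lengthens the second SOA (190 ms) but leaves the duration of the second phone (140 ms) unchanged.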

Figure 9.1 The speech waveform and the boundaries of words, syllables, and phones. The illustration shows the acoustic waveform of a sentence; vertical lines delineate the boundaries of words, syllables, and phones, with labels below the waveform.

9.2.3 Syllable Duration

The MFA also provides the boundaries between words for English and the boundaries between characters for Mandarin Chinese. In Chinese, since each syllable corresponds to a character, the syllable boundaries are obtained directly from the character boundaries. In English, the syllable boundaries (Figure 9.1) are determined by grouping the phones of each word into syllables based on a dictionary, that is, the Unisyn Lexicon (Fitt, 2001). The duration and SOA of syllables are defined in the same way as for phones.
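The dictionary-based grouping for English can be sketched as follows. This is only an illustration under stated assumptions: the lexicon entry is written here as a list of syllables, each a list of phone labels, which is a hypothetical format and not the actual format of the Unisyn Lexicon, and the phone labels and timings are invented.

```python
from typing import List, Tuple

# Hypothetical interval format: (label, onset in ms, offset in ms).
Interval = Tuple[str, int, int]


def group_into_syllables(
    phones: List[Interval], syllabification: List[List[str]]
) -> List[Interval]:
    """Merge the aligned phones of one word into syllable intervals.

    `syllabification` is an assumed lexicon entry: one list of phone
    labels per syllable, in order, with no empty syllables.
    """
    expected = [label for syllable in syllabification for label in syllable]
    if expected != [label for label, _, _ in phones]:
        raise ValueError("lexicon entry does not match the aligned phones")

    syllables = []
    start = 0
    for syllable in syllabification:
        chunk = phones[start:start + len(syllable)]
        # A syllable spans from the onset of its first phone
        # to the offset of its last phone.
        syllables.append((" ".join(syllable), chunk[0][1], chunk[-1][2]))
        start += len(syllable)
    return syllables


# Toy two-syllable word with invented timings.
word_phones = [("w", 0, 50), ("ao", 50, 170), ("t", 170, 220), ("er", 220, 330)]
entry = [["w", "ao"], ["t", "er"]]

print(group_into_syllables(word_phones, entry))
# [('w ao', 0, 170), ('t er', 170, 330)]
```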

9.2.4 Duration of Theta Syllables

The structure of a syllable has three parts, that is, the onset, the nucleus/peak, and the coda/offset (Greenberg, 1999). In general, the nucleus corresponds to the local maximum in speech intensity within the duration of a syllable. In the neuroscience literature, the concept of a theta syllable has also been proposed, which is the unit between two successive vocalic nuclei (Ghitza, 2013). The concept was proposed to describe the units that are tracked by theta-band neural activity during speech listening. In connected speech, consonants may be produced between two vowels, and parsing the boundaries of syllables requires determining which consonants belong to the offset of the first syllable and which belong to the onset of the second syllable. This syllable parsing problem is especially challenging for languages that allow flexible syllable structures, for example, English, compared with languages that have a highly regular syllable structure, for example, Turkish (Durgunoğlu and Öney, 1999). For theta syllables, however, the consonants between vowels do not need to be parsed into an onset and an offset. Here, we also analyze the statistical regularity in the duration of theta syllables.
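A minimal sketch of how theta-syllable durations might be derived from a phone alignment is given below. It assumes that the onset of each vowel marks the vocalic nucleus, which is consistent with the statement in Section 9.3.1 that the SOA of vowels equals the duration of theta syllables. The vowel set, labels, and timings are invented for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical interval format: (label, onset in ms, offset in ms).
Interval = Tuple[str, int, int]


def theta_syllable_durations(phones: List[Interval], vowels: Set[str]) -> List[int]:
    """Intervals between the onsets of successive vowels.

    Assumption: the vowel onset stands in for the vocalic nucleus, so no
    decision is needed about which syllable an intervocalic consonant
    belongs to.
    """
    vowel_onsets = [onset for label, onset, _ in phones if label in vowels]
    return [b - a for a, b in zip(vowel_onsets, vowel_onsets[1:])]


# Toy alignment with three vowels and invented timings.
aligned = [
    ("w", 0, 50), ("ao", 50, 170), ("t", 170, 220),
    ("er", 220, 330), ("s", 330, 420), ("ih", 420, 500),
]

print(theta_syllable_durations(aligned, {"ao", "er", "ih"}))  # [170, 200]
```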

9.2.5 Word Duration

For English, word boundaries (Figure 9.1) are reported by the MFA. For Chinese, words are not separated in the writing system, and there is no univocal definition of words. Here, word segmentation is achieved using a popular word segmentation tool, that is, Stanza (Qi et al., 2020). The duration and SOA of words are defined in the same way as for phones.

9.3 Statistical Distribution of Phones, Syllables, and Words

9.3.1 Mean Duration and SOA

In the following, for each language, Chinese and English, we report the results that are pooled across corpora, while the results of individual corpora are summarized in Table 9.2. For Chinese, the mean durations for phones, syllables, and words are 98 ms, 213 ms, and 327 ms, respectively. For English, the mean durations for phones, syllables, and words are 83 ms, 211 ms, and 294 ms, respectively. For phones, when vowels and consonants are separately analyzed, the duration of vowels (106 ms for Chinese and 89 ms for English) is slightly longer than the duration of consonants (91 ms for Chinese and 80 ms for English – Table 9.3). When comparing syllables and theta syllables, the mean duration is longer for theta syllables (231 ms for Chinese and 232 ms for English) than for syllables (213 ms for Chinese and 211 ms for English – Table 9.4). As expected, the SOA of a unit is longer than its duration (Table 9.2). For Chinese, the mean SOAs for phones, syllables, and words are 107 ms, 231 ms, and 353 ms, respectively. For English, the mean SOAs for phones, syllables, and words are 90 ms, 226 ms, and 317 ms, respectively (Table 9.2). The SOA for vowels is not shown, but it is the same as the duration of theta syllables reported in the subsequent analysis (Table 9.4).

Table 9.2 Unit duration and the coefficient of variation (CV)

Note: M = mean duration in milliseconds; SD = standard deviation; CV = coefficient of variation

Table 9.3 Duration and CV of vowels and consonants

Table 9.4 Duration and CV of theta syllable

9.3.2 Coefficient of Variation (CV)

The CV is a measure of relative variability, calculated as the ratio of the standard deviation to the mean. A lower CV indicates stronger statistical regularity. For the duration/SOA of a unit, a lower CV indicates stronger rhythmicity. For duration, the CV of syllables (0.39 for Chinese and 0.57 for English) is consistently lower than the CV of phones (0.55 for Chinese and 0.64 for English; binomial test based on the results from the eight corpora, p < 0.01) and the CV of words (0.54 for Chinese and 0.64 for English; see Table 9.2 and Figure 9.2A; binomial test, p < 0.01). For SOA, the CV of syllables (0.59 for Chinese and 0.67 for English) is significantly lower than the CV of phones (0.86 for Chinese and 0.82 for English; binomial test, p < 0.01) and words (0.62 for Chinese and 0.71 for English; binomial test, p < 0.05), while the CV of words is significantly lower than the CV of phones (Table 9.2; Figure 9.2B; binomial test, p < 0.05).
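The CV and the corpus-level binomial test can be sketched as follows. Two choices here are assumptions rather than details stated in the text: the population standard deviation is used for the CV, and the binomial test is one-sided with a chance level of 0.5 per corpus. The duration values are toy data.

```python
from math import comb
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = standard deviation / mean (population SD, an assumption)."""
    return pstdev(values) / mean(values)


def one_sided_binomial_p(successes: int, n: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes, n + 1)
    )


# Toy durations in ms: mean 200, population SD about 31.6.
toy_durations = [150, 200, 250, 200, 200]
print(round(coefficient_of_variation(toy_durations), 3))  # 0.158

# If one unit has the lower CV in all eight corpora:
print(one_sided_binomial_p(8, 8))  # 0.00390625
```

Under these assumptions, a difference that goes in the same direction in all eight corpora gives p = 1/256, which is below 0.01.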

Figure 9.2 CV. (A) CV for unit duration. Each line shows the CV of a corpus. Chinese and English corpora are marked out by dots and triangles, respectively. Pairwise comparisons between phones, syllables, and words are carried out using the binomial test (* p < 0.05, ** p < 0.01). (B) CV for SOA.

9.4 The Rate of Phones, Syllables, and Words

In the neurolinguistic literature, the rate of a linguistic unit is more frequently discussed than the duration/SOA of a unit. The rate of a unit, however, can be defined in several ways. First, it can be defined as the total number of units divided by the total duration of speech recordings. This definition is equivalent to the reciprocal of the mean SOA of the unit, denoted as 1/E(SOA) in this chapter. The rate defined this way is sensitive to the duration of silence periods in the recording, which can be problematic if the recording has long silence periods. To solve this problem, the second definition excludes the silence periods in speech and defines the rate of a unit as the total number of units in a corpus divided by the total duration of units. For this definition, the rate of a unit is simply the reciprocal of the mean unit duration, denoted as 1/E(duration) in this chapter. A third definition takes the reciprocal of the duration of each unit and then averages these reciprocals across units. This measure is denoted as E(1/duration) in this chapter.
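The three definitions can be compared on toy data with a short sketch. The function names and the values are invented for illustration, and one SOA is listed per unit for simplicity.

```python
from statistics import mean
from typing import Sequence


def rate_from_soa(soas_ms: Sequence[float]) -> float:
    """Definition 1: 1/E(SOA), in Hz."""
    return 1000 / mean(soas_ms)


def rate_from_duration(durations_ms: Sequence[float]) -> float:
    """Definition 2: 1/E(duration), in Hz."""
    return 1000 / mean(durations_ms)


def mean_reciprocal_duration(durations_ms: Sequence[float]) -> float:
    """Definition 3: E(1/duration), in Hz."""
    return mean([1000 / d for d in durations_ms])


# Toy units: the second unit is followed by 100 ms of silence,
# so its SOA (300 ms) exceeds its duration (200 ms).
toy_durations_ms = [100, 200, 300]
toy_soas_ms = [100, 300, 300]

print(round(rate_from_soa(toy_soas_ms), 2))                  # 4.29
print(round(rate_from_duration(toy_durations_ms), 2))        # 5.0
print(round(mean_reciprocal_duration(toy_durations_ms), 2))  # 6.11
```

On these toy values, the three definitions give different rates for the same units, with 1/E(SOA) the lowest and E(1/duration) the highest.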

The rates of phones, syllables, and words calculated using these three methods are summarized in Table 9.5. When the units were pooled over all corpora, the rates for phones, syllables, and words calculated using method 1, that is, 1/E(SOA), are 10.4 Hz, 4.4 Hz, and 3.0 Hz, respectively. For method 2, that is, 1/E(duration), the rates for phones, syllables, and words are 11.3 Hz, 4.7 Hz, and 3.3 Hz, respectively. The rate calculated using method 1 is lower than the rate calculated using method 2 since it includes the silence period after a unit. For the third measure, that is, E(1/duration), the rates for phones, syllables, and words are 15.1 Hz, 6.3 Hz, and 5.0 Hz, respectively, higher than the rates calculated using the other two methods. This result is expected: 1/E(1/duration) is the harmonic mean of duration, which is more strongly influenced by small values than the arithmetic mean, that is, E(duration), and is never larger than the arithmetic mean. Consequently, E(1/duration) is never lower than 1/E(duration).
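As a worked check of the relation between the two duration-based measures, the pooled rates reported above can be converted back into mean durations. The sketch below uses only the rates quoted in this section; the dictionary layout and variable names are illustrative.

```python
# Pooled rates in Hz as reported in the text:
# (1/E(duration), E(1/duration)) for each unit.
reported_rates_hz = {
    "phone": (11.3, 15.1),
    "syllable": (4.7, 6.3),
    "word": (3.3, 5.0),
}

implied_means_ms = {}
for unit, (rate_arithmetic, rate_reciprocal) in reported_rates_hz.items():
    arithmetic_ms = 1000 / rate_arithmetic  # E(duration)
    harmonic_ms = 1000 / rate_reciprocal    # 1 / E(1/duration)
    implied_means_ms[unit] = (round(arithmetic_ms, 1), round(harmonic_ms, 1))

print(implied_means_ms)
# {'phone': (88.5, 66.2), 'syllable': (212.8, 158.7), 'word': (303.0, 200.0)}
```

For each unit, the implied harmonic mean duration is shorter than the implied arithmetic mean duration, as the inequality between the two means requires.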

Table 9.5 Unit rate (Hz)

9.5 Discussion

Rhythmicity in speech has motivated theoretical and experimental studies on how neural oscillations encode speech. Although such neuroscientific research is flourishing, relatively few studies have empirically characterized the rhythmicity in speech (Ding et al., 2017; Inbar et al., 2020; Meyer, 2018; Meyer et al., 2016; Stehwien and Meyer, 2022; Tilsen and Arvaniti, 2013; Zhang et al., 2023a; see more in Chapter 16). The rhythmicity of speech, however, is the basis for these neuroscientific studies and therefore deserves more attention. Here, we investigated the mean and variation of the duration of phones, syllables, and words, since the duration information of these units has been used to motivate hypotheses and analysis approaches in neuroscientific studies on speech processing (Coopmans et al., 2022; Giraud and Poeppel, 2012; Kaufeld et al., 2020; Kazanina and Tavano, 2022; Keitel et al., 2018; ten Oever et al., 2022; Zhang et al., 2023b).

Since a word contains one or more syllables and a syllable contains one or more phones, for the three levels of units, the mean duration is longest for words and shortest for phones. The nesting relationship between the three units, however, does not decide which unit has stronger rhythmicity, and the empirical analyses here show that syllables have a statistically significantly lower CV of duration than phones and words, indicating that syllables have a more regular duration. Although the difference is statistically significant, the CV of phone/word duration is only about 1.3 times as large as the CV of syllable duration (Table 9.2). The difference is small compared to the variation of CV across speech corpora that differ in their language or speaking style. For example, the syllable CV tends to be lower for Chinese than for English, indicating higher syllabic rhythmicity for Chinese than for English. Even within the Chinese corpora, the syllable CV is lower for read sentences, that is, Chinese-TIMIT and AISHELL-1, than for spontaneous speech, that is, WenetSpeech (talk).

Roughly speaking, the mean rates for phones, syllables, and words fall into the range of alpha, theta, and delta oscillations, respectively. The mean rate of phones, however, does not contradict the hypothesis that gamma-band oscillations are critical for the neural encoding of phones (Giraud and Poeppel, 2012; Goswami, 2016; Hovsepyan et al., 2020; Hyafil et al., 2015; Meyer, 2018), since the hypothesis focuses on the critical timescales within a phone, for example, the timescales for formant transitions and voicing, instead of the mean rate of phones.
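The correspondence between unit rates and oscillation bands can be made concrete with a small sketch. The band edges below (delta 1–4 Hz, theta 4–8 Hz, alpha 8–13 Hz) are conventional values assumed here for illustration; they are not defined in this chapter, and they vary across studies. The rates are the pooled values for 1/E(SOA) and 1/E(duration) reported in Section 9.4.

```python
# Assumed band edges in Hz (lower bound inclusive, upper bound exclusive).
BANDS_HZ = [("delta", 1.0, 4.0), ("theta", 4.0, 8.0), ("alpha", 8.0, 13.0)]


def band_of(rate_hz: float) -> str:
    """Return the name of the assumed band that contains the rate."""
    for name, low, high in BANDS_HZ:
        if low <= rate_hz < high:
            return name
    return "outside listed bands"


# Pooled rates in Hz reported in Section 9.4.
rates_soa = {"phone": 10.4, "syllable": 4.4, "word": 3.0}       # 1/E(SOA)
rates_duration = {"phone": 11.3, "syllable": 4.7, "word": 3.3}  # 1/E(duration)

print({unit: band_of(rate) for unit, rate in rates_soa.items()})
# {'phone': 'alpha', 'syllable': 'theta', 'word': 'delta'}
print({unit: band_of(rate) for unit, rate in rates_duration.items()})
# {'phone': 'alpha', 'syllable': 'theta', 'word': 'delta'}
```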

Box 9.1 Chapter Overview

Summary

We quantified the statistical regularity in the duration of phones, syllables, and words. The CV is below 0.7 for the three units for most corpora, suggesting that the unit duration is relatively regular. Furthermore, the CV is slightly lower for syllables than for phones and words and is highly variable across corpora.

Implications

According to the mean rate of phones, syllables, and words, if some neural activity can track the onset of each phone, syllable, or word, its frequency will fall into the range of alpha, theta, and delta oscillations, respectively.

Gains

Phones, syllables, and words all have their typical duration. Future studies may also probe how phone, syllable, and word rates vary across speakers, and whether these rhythms are steady or show systematic variations over time.

References

Arvaniti, A. (2009). Rhythm, timing and the timing of rhythm. Phonetica, 66(1–2), 46–63. https://doi.org/10.1159/000208930

Assaneo, M. F., and Poeppel, D. (2018). The coupling between auditory and motor cortices is rate-restricted: Evidence for an intrinsic speech-motor rhythm. Science Advances, 4(2), eaao3842. https://doi.org/10.1126/sciadv.aao3842

Boey, L. K. (1975). An Introduction to Linguistics. Singapore University Press.

Bu, H., Du, J., Na, X., Wu, B., and Zheng, H. (2017). AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Korea (South), 2017, pp. 1–5. https://doi.org/10.1109/ICSDA.2017.8384449

Chen, G., Chai, S., Wang, G.-B., et al. (2021). Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. Proceedings of Interspeech 2021, pp. 3670–3674. https://doi.org/10.21437/Interspeech.2021-1965

Coopmans, C. W., de Hoop, H., Hagoort, P., and Martin, A. E. (2022). Effects of structure and meaning on cortical tracking of linguistic units in naturalistic speech. Neurobiology of Language, 3(3), 386–412. https://doi.org/10.1162/nol_a_00070

Coupé, C., Oh, Y. M., Dediu, D., and Pellegrino, F. (2019). Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances, 5(9), eaaw2594. https://doi.org/10.1126/sciadv.aaw2594

Cummins, F. (2012). Oscillators and syllables: A cautionary note. Frontiers in Psychology, 3. https://doi.org/10.3389/fpsyg.2012.00364

Dauer, R. M. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics, 11(1), 51–62. https://doi.org/10.1016/S0095-4470(19)30776-4

Ding, N., Patel, A. D., Chen, L., et al. (2017). Temporal modulations in speech and music. Neuroscience & Biobehavioral Reviews, 81, 181–187. https://doi.org/10.1016/j.neubiorev.2017.02.011

Durgunoğlu, A. Y., and Öney, B. (1999). A cross-linguistic comparison of phonological awareness and word recognition. Reading and Writing, 11(4), 281–299. https://doi.org/10.1023/A:1008093232622

Fitt, S. (2001). Unisyn lexicon release (Version 1.3). Centre for Speech Technology Research, University of Edinburgh. www.cstr.ed.ac.uk/projects/unisyn/

Garofolo, J. S., Lamel, L. F., Fisher, W. M., et al. (1993). DARPA TIMIT: Acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 (NIST IR 4930). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.IR.4930

Ghitza, O. (2013). The theta-syllable: A unit of speech information defined by cortical function. Frontiers in Psychology, 4, 138. https://doi.org/10.3389/fpsyg.2013.00138

Giraud, A.-L., and Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15(4), 511–517. https://doi.org/10.1038/nn.3063

Goswami, U. (2016). Educational neuroscience: Neural structure-mapping and the promise of oscillations. Current Opinion in Behavioral Sciences, 10, 89–96. https://doi.org/10.1016/j.cobeha.2016.05.011

Goswami, U., and Leong, V. (2013). Speech rhythm and temporal structure: Converging perspectives? Laboratory Phonology, 4(1), 67–92. https://doi.org/10.1515/lp-2013-0004

Greenberg, S. (1999). Speaking in shorthand: A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29(2–4), 159–176. https://doi.org/10.1016/S0167-6393(99)00050-3

Greenberg, S., Carvey, H., Hitchcock, L., and Chang, S. (2003). Temporal properties of spontaneous speech: A syllable-centric perspective. Journal of Phonetics, 31(3–4), 465–485. https://doi.org/10.1016/j.wocn.2003.09.005

Hovsepyan, S., Olasagasti, I., and Giraud, A.-L. (2020). Combining predictive coding and neural oscillations enables online syllable recognition in natural speech. Nature Communications, 11(1), 3117. https://doi.org/10.1038/s41467-020-16956-5

Hyafil, A., Fontolan, L., Kabdebon, C., Gutkin, B., and Giraud, A.-L. (2015). Speech encoding by coupled cortical theta and gamma oscillations. eLife, 4, e06213. https://doi.org/10.7554/eLife.06213

Inbar, M., Grossman, E., and Landau, A. N. (2020). Sequences of intonation units form a ~1 Hz rhythm. Scientific Reports, 10(1), 15846. https://doi.org/10.1038/s41598-020-72739-4

Kaufeld, G., Bosker, H. R., ten Oever, S., et al. (2020). Linguistic structure and meaning organize neural oscillations into a content-specific hierarchy. Journal of Neuroscience, 40(49), 9467–9475. https://doi.org/10.1523/JNEUROSCI.0302-20.2020

Kazanina, N., and Tavano, A. (2022). What neural oscillations can and cannot do for syntactic structure building. Nature Reviews Neuroscience, 24, 113–128. https://doi.org/10.1038/s41583-022-00659-5

Keitel, A., Gross, J., and Kayser, C. (2018). Perceptually relevant speech tracking in auditory and motor cortex reflects distinct linguistic features. PLOS Biology, 16(3), e2004473. https://doi.org/10.1371/journal.pbio.2004473

Liberman, I. Y., Shankweiler, D., Fischer, F. W., and Carter, B. (1974). Explicit syllable and phoneme segmentation in the young child. Journal of Experimental Child Psychology, 18(2), 201–212. https://doi.org/10.1016/0022-0965(74)90101-5

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. Proceedings of Interspeech 2017, pp. 498–502. https://doi.org/10.21437/Interspeech.2017-1386

Meyer, L. (2018). The neural oscillations of speech processing and language comprehension: State of the art and emerging mechanisms. European Journal of Neuroscience, 48(7), 2609–2621. https://doi.org/10.1111/ejn.13748

Meyer, L., Henry, M. J., Gaston, P., Schmuck, N., and Friederici, A. D. (2016). Linguistic bias modulates interpretation of speech via neural delta-band oscillations. Cerebral Cortex, 27(9), 4293–4302. https://doi.org/10.1093/cercor/bhw228

Pellegrino, F., Coupé, C., and Marsico, E. (2011). Across-language perspective on speech information rate. Language, 87(3), 539–558. https://doi.org/10.1353/lan.2011.0057

Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. (2020). Stanza: A Python natural language processing toolkit for many human languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Rousseau, A., Deléglise, P., and Estève, Y. (2012). TED-LIUM: An automatic speech recognition dedicated corpus. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ’12), Istanbul, Turkey, pp. 125–129. European Language Resources Association (ELRA). https://aclanthology.org/L12-1405/

Stehwien, S., and Meyer, L. (2022). Short-term periodicity of prosodic phrasing: Corpus-based evidence. Proceedings of Speech Prosody 2022, pp. 693–698. https://doi.org/10.21437/SpeechProsody.2022-141

Tan, L.-H., and Perfetti, C. A. (1998). Phonological codes as early sources of constraint in Chinese word identification: A review of current discoveries and theoretical accounts. In Leong, C. K. and Tamaoka, K. (Eds.), Cognitive Processing of the Chinese and the Japanese Languages (Vol. 14, pp. 11–46). Amsterdam: Springer. https://doi.org/10.1007/978-94-015-9161-4_2

ten Oever, S., Carta, S., Kaufeld, G., and Martin, A. E. (2022). Neural tracking of phrases in spoken language comprehension is automatic and task-dependent. eLife, 11, e77468. https://doi.org/10.7554/eLife.77468

Tilsen, S., and Arvaniti, A. (2013). Speech rhythm analysis with decomposition of the amplitude envelope: Characterizing rhythmic patterns within and across languages. Journal of the Acoustical Society of America, 134(1), 628–639. https://doi.org/10.1121/1.4807565

Turk, A., and Shattuck-Hufnagel, S. (2013). What is speech rhythm? A commentary on Arvaniti and Rodriquez, Krivokapić, and Goswami and Leong. Laboratory Phonology, 4(1), 93–118. https://doi.org/10.1515/lp-2013-0005

Yuan, J., Ding, H., Liao, S., Zhan, Y., and Liberman, M. (2017). Chinese TIMIT: A TIMIT-like corpus of standard Chinese. 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Korea (South), 2017, pp. 1–5. https://doi.org/10.1109/ICSDA.2017.8384463

Zhang, B., Lv, H., Guo, P., et al. (2022). WENETSPEECH: A 10000+ hours multi-domain Mandarin corpus for speech recognition. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, pp. 6182–6186. https://doi.org/10.1109/ICASSP43922.2022.9746682

Zhang, Y., Zou, J., and Ding, N. (2023a). Acoustic correlates of the syllabic rhythm of speech: Modulation spectrum or local features of the temporal envelope. Neuroscience & Biobehavioral Reviews, 147, 105111. https://doi.org/10.1016/j.neubiorev.2023.105111

Zhang, Y., Zou, J., and Ding, N. (2023b). Complex mapping between neural response frequency and linguistic units in natural speech. Journal of Cognitive Neuroscience, 35(8), 1361–1368. https://doi.org/10.1162/jocn_a_02013
