9.1 Introduction
Whether speech is rhythmic is a controversial topic (Arvaniti, 2009; Goswami and Leong, 2013; Turk and Shattuck-Hufnagel, 2013; see Chapter 11). On the one hand, listeners tend to feel that speech is rhythmic, and linguists have divided languages into syllable-timed and stress-timed languages, in which the syllable or stress is perceived to have regular rhythm (Dauer, 1983). On the other hand, no study has revealed strict periodicity in any speech unit or feature (Turk and Shattuck-Hufnagel, 2013; see Chapter 14). Therefore, the rhythm of speech cannot arise from strict periodicity in sound features or linguistic units. Since speech is a highly complex signal, the rhythm of speech can be discussed at multiple levels (see Chapter 20 for a discussion of the prosodic hierarchy of speech). Here, we consider the rhythm of three linguistic levels, that is, phones, syllables, and words. Furthermore, we only analyze the duration of a single unit or the interval between the onsets of neighboring units, instead of higher-order rhythms defined by the patterning of the onsets of multiple units. Phones are the basic phonetic units of speech and can be classified as vowels or consonants. A syllable typically contains one vowel that may be preceded and/or followed by a few consonants. Phones and syllables are units of speech sound, while morphemes and words are the basic units of meaning. A morpheme is the smallest linguistically defined unit of meaning (Boey, 1975). A word is the basic unit of writing in some languages, such as English, but it is less well defined in other languages, such as Chinese (Tan and Perfetti, 1998).
In the last couple of decades, the rhythm of syllables has received a lot of attention. It has been shown that syllables have a relatively regular duration, and the mean rate of syllables is typically between 4 and 8 Hz (Coupé et al., 2019; Greenberg et al., 2003; Pellegrino et al., 2011; see Chapter 8). Furthermore, it has been suggested that syllables have a relatively reliable acoustic correlate, that is, the speech envelope (Assaneo and Poeppel, 2018; Cummins, 2012; Giraud and Poeppel, 2012; see also Zhang et al., 2023a, for different opinions). The speech envelope refers to the low-frequency (mainly below 30 Hz) changes in acoustic power, and its power spectrum, referred to as the modulation spectrum, peaks between 4 and 8 Hz, corresponding to the mean rate of syllables (Ding et al., 2017; Greenberg et al., 2003). Therefore, the rhythm of syllables, which is roughly equivalent to the rhythm of the speech envelope, can be perceived even without extensive linguistic knowledge, whereas perceiving the rhythm of phones (Liberman et al., 1974; see Chapter 11) or words requires learning. Although a number of studies have demonstrated that syllables have relatively regular duration, the statistical regularity of the duration of other linguistic units such as phones and words has seldom been investigated. Here, we analyzed the duration of phones, syllables, and words, and tested whether syllables have more regular duration than phones and words. We analyzed the two languages with the most users, Chinese and English.
9.2 Methods
9.2.1 Corpus
The eight speech corpora included in this analysis (Table 9.1) were extracted from six speech datasets, that is, DARPA-TIMIT (Garofolo et al., 1993), GigaSpeech (Chen et al., 2021), TED-LIUM (Rousseau et al., 2012), Chinese-TIMIT (Yuan et al., 2017), Aishell-1 (Bu et al., 2017), and WenetSpeech (Zhang et al., 2022). The selection and processing of the corpora followed a recent study on the syllabic rhythm of speech (Zhang et al., 2023a).

9.2.2 Phone Duration
The boundaries of each phone (Figure 9.1) are automatically extracted from the audio and its transcription using the Montreal Forced Aligner (MFA) (McAuliffe et al., 2017). The MFA locates the boundaries between phones with a resolution of 10 ms; that is, the phone duration can only be a multiple of 10 ms. The MFA does not allow phones to overlap in time. This MFA-based method has been validated on the corpora for which manual labels of phone boundaries are available (Zhang et al., 2023a). The duration of a phone is defined as the time difference between phone onset and phone offset. The stimulus onset asynchrony (SOA) of phones is defined as the time difference between the onsets of two adjacent phones. The SOA is affected by the silence period after a phone, but the phone duration is not.
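The two definitions above can be illustrated with a minimal Python sketch. It assumes that the aligner output has already been reduced to a list of (onset, offset) pairs in milliseconds, sorted by onset; the function names and the example values are hypothetical and are not taken from the corpora.

```python
from typing import List, Tuple

# One (onset_ms, offset_ms) pair per phone, sorted by onset.
# Hypothetical values on a 10 ms grid, not corpus data.
Interval = Tuple[int, int]

def durations(intervals: List[Interval]) -> List[int]:
    """Duration of each unit: offset minus onset."""
    return [offset - onset for onset, offset in intervals]

def soas(intervals: List[Interval]) -> List[int]:
    """SOA: onset of the next unit minus onset of the current unit."""
    onsets = [onset for onset, _ in intervals]
    return [nxt - cur for cur, nxt in zip(onsets, onsets[1:])]

# A 150 ms silence follows the second phone.
phones = [(0, 80), (80, 200), (350, 410)]
print(durations(phones))  # [80, 120, 60]
print(soas(phones))       # [80, 270]
```

In this example the silence lengthens the second SOA (270 ms) but leaves the duration of the second phone (120 ms) unchanged.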
Figure 9.1 The speech waveform and the boundaries of words, syllables, and phones.

9.2.3 Syllable Duration
The MFA also provides the boundaries between words for English and the boundaries between characters for Mandarin Chinese. In Chinese, since each syllable corresponds to a character, the syllable boundaries are obtained directly from the character boundaries. In English, the syllable boundaries (Figure 9.1) are determined by grouping the phones of each word into syllables based on a dictionary, that is, the Unisyn Lexicon (Fitt, 2001). The duration and SOA of syllables are defined in the same way as for phones.
9.2.4 Duration of Theta Syllables
The structure of a syllable has three parts, that is, the onset, the nucleus/peak, and the coda/offset (Greenberg, 1999). In general, the nucleus corresponds to the local maximum in speech intensity within the duration of a syllable. In the neuroscience literature, the concept of a theta syllable has also been proposed, which is the unit between two successive vocalic nuclei (Ghitza, 2013). The concept was proposed to describe the units that are tracked by theta-band neural activity during speech listening. In connected speech, consonants may be produced between two vowels, and parsing the boundaries of syllables requires determining which consonants form the offset of the first syllable and which form the onset of the second syllable. This syllable parsing problem is especially challenging for languages that allow flexible syllable structures, for example, English, compared with languages that have a highly regular syllable structure, for example, Turkish (Durgunoğlu and Öney, 1999). For theta syllables, however, the consonants between vowels do not need to be parsed into an onset and an offset. Here, we also analyze the statistical regularity in the duration of theta syllables.
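As an illustration, theta-syllable durations can be sketched as the intervals between successive vowel onsets. This is consistent with the later statement that the SOA of vowels equals the duration of theta syllables. Using the vowel onset as the marker of the vocalic nucleus is an assumption of this sketch, and the phone labels, onsets, and function name below are hypothetical.

```python
from typing import List, Tuple

# (label, onset_ms, is_vowel): hypothetical aligned phones, not corpus data.
Phone = Tuple[str, int, bool]

def theta_syllable_durations(phones: List[Phone]) -> List[int]:
    """Intervals between successive vowel onsets, i.e., the vowel SOA.

    Assumption: the vowel onset stands in for the vocalic nucleus.
    """
    vowel_onsets = [onset for _, onset, is_vowel in phones if is_vowel]
    return [nxt - cur for cur, nxt in zip(vowel_onsets, vowel_onsets[1:])]

phones = [
    ("b", 0, False), ("a", 60, True), ("n", 150, False),
    ("d", 210, False), ("i", 260, True), ("t", 380, False),
    ("o", 450, True),
]
print(theta_syllable_durations(phones))  # [200, 190]
```

Note that the consonants "n" and "d" between the first two vowels never have to be assigned to one syllable or the other.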
9.2.5 Word Duration
For English, word boundaries (Figure 9.1) are reported by the MFA. For Chinese, words are not separated in the writing system, and there is no single agreed-upon definition of words. Here, word segmentation is achieved using a popular word segmentation tool, that is, Stanza (Qi et al., 2020). The duration and SOA of words are defined in the same way as for phones.
9.3 Statistical Distribution of Phones, Syllables, and Words
9.3.1 Mean Duration and SOA
In the following, for each language, Chinese and English, we report the results that are pooled across corpora, while the results of individual corpora are summarized in Table 9.2. For Chinese, the mean durations for phones, syllables, and words are 98 ms, 213 ms, and 327 ms, respectively. For English, the mean durations for phones, syllables, and words are 83 ms, 211 ms, and 294 ms, respectively. For phones, when vowels and consonants are separately analyzed, the duration of vowels (106 ms for Chinese and 89 ms for English) is slightly longer than the duration of consonants (91 ms for Chinese and 80 ms for English – Table 9.3). When comparing syllables and theta syllables, the mean duration is longer for theta syllables (231 ms for Chinese and 232 ms for English) than for syllables (213 ms for Chinese and 211 ms for English – Table 9.4). As expected, the SOA of a unit is longer than its duration (Table 9.2). For Chinese, the mean SOAs for phones, syllables, and words are 107 ms, 231 ms, and 353 ms, respectively. For English, the mean SOAs for phones, syllables, and words are 90 ms, 226 ms, and 317 ms, respectively (Table 9.2). The SOA for vowels is not shown, but it is the same as the duration of theta syllables reported in the subsequent analysis (Table 9.4).

Note: M = mean duration in milliseconds; SD = standard deviation; CV = coefficient of variation


9.3.2 Coefficient of Variation (CV)
The CV is a measure of relative variability, calculated as the ratio of the standard deviation to the mean. A lower CV indicates stronger statistical regularity. For the duration/SOA of a unit, a lower CV indicates stronger rhythmicity. For duration, the CV of syllables (0.39 for Chinese and 0.57 for English) is consistently lower than the CV of phones (0.55 for Chinese and 0.64 for English; binomial test based on the results from the eight corpora, p < 0.01) and the CV of words (0.54 for Chinese and 0.64 for English; see Table 9.2 and Figure 9.2A; binomial test, p < 0.01). For SOA, the CV of syllables (0.59 for Chinese and 0.67 for English) is significantly lower than the CV of phones (0.86 for Chinese and 0.82 for English; binomial test, p < 0.01) and words (0.62 for Chinese and 0.71 for English; binomial test, p < 0.05), while the CV of words is significantly lower than the CV of phones (Table 9.2; Figure 9.2B; binomial test, p < 0.05).
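The CV and the corpus-level binomial test can be sketched as follows. Two points are assumptions of this sketch, since the chapter does not specify them: the standard deviation is computed as the population standard deviation, and the binomial test is one-sided, with each corpus counted as one trial at a chance level of 0.5. The function names are hypothetical.

```python
import math
import statistics
from typing import Sequence

def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = standard deviation / mean (population SD, an assumption)."""
    return statistics.pstdev(values) / statistics.fmean(values)

def binomial_p_one_sided(k: int, n: int) -> float:
    """P(at least k successes in n fair coin flips).

    Here a 'success' is a corpus in which one unit has a lower CV
    than the other; the one-sided form is an assumption.
    """
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical syllable durations in ms, not corpus data.
print(coefficient_of_variation([200, 220, 180, 200]))
print(binomial_p_one_sided(8, 8))  # 0.00390625
print(binomial_p_one_sided(7, 8))  # 0.03515625
```

Under these assumptions, a difference in the same direction in all eight corpora gives p = 1/256, and a difference in seven of eight corpora gives p = 9/256.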
Figure 9.2 CV. (A) CV for unit duration. Each line shows the CV of a corpus. Chinese and English corpora are marked by dots and triangles, respectively. Pairwise comparisons between phones, syllables, and words are carried out using the binomial test (* p < 0.05, ** p < 0.01). (B) CV for SOA.

9.4 The Rate of Phones, Syllables, and Words
In the neurolinguistic literature, the rate of a linguistic unit is more frequently discussed than the duration/SOA of a unit. The rate of a unit, however, can be defined in several ways. First, it can be defined as the total number of units divided by the total duration of speech recordings. This definition is equivalent to the reciprocal of the mean SOA of the unit, denoted as 1/E(SOA) in this chapter. The rate defined this way is sensitive to the duration of silence periods in the recording, which can be problematic if the recording has long silence periods. To solve this problem, the second definition excludes the silence periods in speech and defines the rate of a unit as the total number of units in a corpus divided by the total duration of units. For this definition, the rate of a unit is simply the reciprocal of the mean unit duration, denoted as 1/E(duration) in this chapter. A third definition takes the reciprocal of the duration of each unit and then averages these reciprocals across units. This measure is denoted as E(1/duration) in this chapter.
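The three definitions can be compared side by side in a short Python sketch. The function names and the example durations and SOAs (in seconds) are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
from statistics import fmean
from typing import Sequence

def rate_from_soa(soas_s: Sequence[float]) -> float:
    """Definition 1: 1/E(SOA), which includes silence after units."""
    return 1.0 / fmean(soas_s)

def rate_from_duration(durations_s: Sequence[float]) -> float:
    """Definition 2: 1/E(duration), which excludes silence."""
    return 1.0 / fmean(durations_s)

def mean_reciprocal_duration(durations_s: Sequence[float]) -> float:
    """Definition 3: E(1/duration), the mean of per-unit rates."""
    return fmean([1.0 / d for d in durations_s])

# Hypothetical values in seconds, not corpus data.
durations_s = [0.1, 0.2, 0.4]
soas_s = [0.125, 0.25, 0.5]  # each SOA adds some silence to the duration

print(round(rate_from_soa(soas_s), 2))                  # 3.43
print(round(rate_from_duration(durations_s), 2))        # 4.29
print(round(mean_reciprocal_duration(durations_s), 2))  # 5.83
```

In this toy example the three measures are ordered 1/E(SOA) < 1/E(duration) < E(1/duration), because SOAs include silence and because short units dominate the mean of reciprocals.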
The rates of phones, syllables, and words calculated using these three methods are summarized in Table 9.5. When the units were pooled over all corpora, the rates for phones, syllables, and words calculated using method 1, that is, 1/E(SOA), are 10.4 Hz, 4.4 Hz, and 3.0 Hz, respectively. For method 2, that is, 1/E(duration), the rates for phones, syllables, and words are 11.3 Hz, 4.7 Hz, and 3.3 Hz, respectively. The rate calculated using method 1 is lower than the rate calculated using method 2 since it considers the silence period after a unit. For the third measure, that is, E(1/duration), the rates for phones, syllables, and words are 15.1 Hz, 6.3 Hz, and 5.0 Hz, respectively, higher than the rates calculated using the other two methods. The result is expected, since 1/E(1/duration) is the harmonic mean of duration, which is more strongly influenced by small values than the arithmetic mean, that is, E(duration), and the harmonic mean is never larger than the arithmetic mean (the two are equal only when all durations are identical).
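The relation between methods 2 and 3 can be written out explicitly. The derivation below is a standard argument (the Cauchy-Schwarz inequality), added here for illustration; the symbol d for unit duration is introduced only for this sketch, and the millisecond values are obtained by inverting the pooled phone rates reported above.

```latex
% d > 0 denotes the duration of a unit; E(.) is the mean over units.
\begin{align*}
1 &= \Bigl(E\bigl[\sqrt{d}\cdot\tfrac{1}{\sqrt{d}}\bigr]\Bigr)^{2}
   \le E(d)\,E\!\left(\tfrac{1}{d}\right)
   && \text{(Cauchy-Schwarz)} \\[4pt]
\Rightarrow\quad
E\!\left(\tfrac{1}{d}\right) &\ge \frac{1}{E(d)},
\qquad
\frac{1}{E(1/d)} \le E(d)
   && \text{(harmonic mean} \le \text{arithmetic mean)} \\[4pt]
\text{Phones:}\quad
\frac{1}{15.1\ \text{Hz}} &\approx 66\ \text{ms}
\;<\;
\frac{1}{11.3\ \text{Hz}} \approx 88\ \text{ms}.
\end{align*}
```

Equality holds only when d is constant, so the gap between E(1/duration) and 1/E(duration) itself reflects the variability of unit duration.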

9.5 Discussion
Rhythmicity in speech has motivated theoretical and experimental studies on how neural oscillations encode speech. Although such neuroscientific research is flourishing, relatively few studies have empirically characterized the rhythmicity in speech (Ding et al., 2017; Inbar et al., 2020; Meyer, 2018; Meyer et al., 2016; Stehwien and Meyer, 2022; Tilsen and Arvaniti, 2013; Zhang et al., 2023a; see more in Chapter 16). The rhythmicity of speech, however, is the basis for these neuroscientific studies and therefore deserves more attention. Here, we investigated the mean and variation of the duration of phones, syllables, and words, since the duration information of these units has been used to motivate hypotheses and analysis approaches in neuroscientific studies on speech processing (Coopmans et al., 2022; Giraud and Poeppel, 2012; Kaufeld et al., 2020; Kazanina and Tavano, 2022; Keitel et al., 2018; ten Oever et al., 2022; Zhang et al., 2023b).
Since a word contains one or more syllables and a syllable contains one or more phones, for the three levels of units, the mean duration is longest for words and shortest for phones. The nesting relationship between the three units, however, does not determine which unit has stronger rhythmicity, and the empirical analyses here show that syllables have a statistically significantly lower CV of duration than phones and words, indicating that syllables have more regular duration. Although the difference is statistically significant, the CV of phone/word duration is only about 1.3 times the CV of syllable duration (Table 9.2). The difference is small compared to the variation of CV across speech corpora that differ in their language or speaking style. For example, the syllable CV tends to be lower for Chinese than for English, indicating higher syllabic rhythmicity for Chinese than for English. Even within the Chinese corpora, the syllable CV is lower for read sentences, that is, Chinese-TIMIT and Aishell-1, than for spontaneous speech, that is, WenetSpeech (talk).
Roughly speaking, the mean rates for phones, syllables, and words fall into the range of alpha, theta, and delta oscillations, respectively. The mean rate of phones, however, does not contradict the hypothesis that gamma-band oscillations are critical for the neural encoding of phones (Giraud and Poeppel, 2012; Goswami, 2016; Hovsepyan et al., 2020; Hyafil et al., 2015; Meyer, 2018), since the hypothesis focuses on the critical timescales within a phone, for example, the timescales for formant transitions and voicing, instead of the mean rate of phones.
Summary
We quantified the statistical regularity in the duration of phones, syllables, and words. The CV is below 0.7 for the three units for most corpora, suggesting that the unit duration is relatively regular. Furthermore, the CV is slightly lower for syllables than for phones and words and is highly variable across corpora.
Implications
According to the mean rate of phones, syllables, and words, if some neural activity can track the onset of each phone, syllable, or word, its frequency will fall into the range of alpha, theta, and delta oscillations, respectively.
Gains
Phones, syllables, and words all have their typical duration. Future studies may also probe how phone, syllable, and word rates vary across speakers, and whether these rhythms are steady or show systematic variations over time.







