
Section 2 - Acoustic and Sublexical Rhythms

Published online by Cambridge University Press:  23 April 2026

Lars Meyer
Affiliation:
Max Planck Institute for Human Cognitive and Brain Sciences
Antje Strauss
Affiliation:
University of Konstanz

Figure 8.1 The speech envelope illustrated. The “speech envelope” for a single sentence (shown in black). The temporal fine structure is portrayed in very light gray. Syllable boundaries are indicated by dotted vertical lines. The contour shows the “rate of change” in envelope energy, analogous to “delta features” used in certain ASR applications. Note how coarse the speech envelope is relative to the speech signal’s finer details. The envelope contour pertains to the broadband, unfiltered signal.

Adapted from Oganian and Chang (2019).
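
As a rough illustration of the two contours described in Figure 8.1, the sketch below derives a coarse envelope by full-wave rectification followed by a moving average, then takes the first difference of that envelope as a rate-of-change contour. The smoothing method, the 20 ms window, and the synthetic test signal are all assumptions made for illustration; they are not the procedure used by Oganian and Chang (2019).

```python
import math

def envelope(signal, fs, window_ms=20.0):
    """Coarse amplitude envelope: rectify, then smooth with a centered moving average."""
    half = max(1, int(fs * window_ms / 1000.0) // 2)
    rectified = [abs(x) for x in signal]
    env = []
    for i in range(len(rectified)):
        lo = max(0, i - half)
        hi = min(len(rectified), i + half + 1)
        env.append(sum(rectified[lo:hi]) / (hi - lo))
    return env

def rate_of_change(env, fs):
    """First difference of the envelope, scaled to amplitude units per second."""
    return [(env[i + 1] - env[i]) * fs for i in range(len(env) - 1)]

# Synthetic stand-in for speech (assumed): a 200 Hz carrier whose amplitude
# rises and falls four times per second.
fs = 2000
t = [n / fs for n in range(fs)]  # one second of samples
signal = [
    0.5 * (1 - math.cos(2 * math.pi * 4 * x)) * math.sin(2 * math.pi * 200 * x)
    for x in t
]

env = envelope(signal, fs)
delta = rate_of_change(env, fs)
```

The envelope follows the slow 4 Hz amplitude fluctuation and ignores the 200 Hz carrier, which mirrors the caption's point that the envelope is coarse relative to the fine structure.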

Figure 8.2 Speech waveforms, spectrograms, and modulation spectra across acoustic frequencies. Spectrographic and time domain representations of the single sentence “The most recent geological survey found seismic activity” (Greenberg et al., 1998). The waveforms are plotted on the same amplitude scale, while the scale of the original, unfiltered signal is compressed by a factor of five for illustrative clarity. The frequency axis of the spectrographic display of the channels has been nonlinearly compressed for illustrative purposes. Note the quasi-orthogonal temporal registration of the waveform modulation pattern across frequency channels. On the right are modulation spectra (magnitude component) associated with each of four 1/3-octave channels. The peak of the spectrum (in all but the highest channel) lies between 4 Hz and 6 Hz. Note the large amount of energy in the higher-modulation frequencies associated with the highest-frequency channel. The modulation spectra of the four-channel compound and the original, unfiltered signal are illustrated for comparison (top panel).

From Greenberg et al. (1998).

Figure 8.3 Speech intelligibility and the CMS. The CMS integrates the magnitude and phase components into a single value. The sentence material’s intelligibility for a listening experiment was manipulated by locally time-reversing the speech signal over different segment lengths. As the reversed-segment duration increases beyond 40 ms, intelligibility declines precipitously, as does the magnitude of the CMS. The spectro-temporal properties (and articulatory-acoustic features) also deteriorate appreciably under such conditions.

Reprinted from Greenberg (2022). The figure is an adaptation of one originally published in Greenberg and Arai (2001, 2004).
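
The local time-reversal manipulation mentioned in Figure 8.3 can be sketched as follows. This is a minimal illustration that assumes fixed-length, non-overlapping segments (with a possibly shorter final segment); the function name, the toy signal, and the sampling rate are hypothetical, and the original studies may have handled segment boundaries differently.

```python
def locally_time_reverse(samples, fs, segment_ms):
    """Reverse the samples inside each consecutive segment of segment_ms milliseconds."""
    seg_len = max(1, round(fs * segment_ms / 1000.0))
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy "signal": ten samples at a hypothetical 1 kHz rate, reversed in 4 ms segments.
toy = list(range(10))
reversed_toy = locally_time_reverse(toy, fs=1000, segment_ms=4)
print(reversed_toy)  # [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

Segment order is preserved while the content of each segment runs backward, so longer segments disturb progressively slower parts of the temporal structure. Applying the function twice with the same settings restores the input.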

Figure 9.1 Schematized steady states and fast transit intervals. The speech waveform and the boundaries of words, syllables, and phones.

Figure 9.2(A) CV for unit duration. Each line shows the CV of a corpus. Chinese and English corpora are marked by dots and triangles, respectively. Pairwise comparisons between phones, syllables, and words are carried out using the binomial test (* p < 0.05, ** p < 0.01).

Figure 9.2(B) CV for SOA.
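
For readers who want to reproduce this kind of measure, the sketch below computes a CV for unit durations and for SOAs. It assumes that CV denotes the coefficient of variation (standard deviation divided by the mean) and that SOA is the interval between the onsets of successive units; neither definition is spelled out in the captions. The onset and duration values are invented for illustration and are not corpus data.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (dimensionless)."""
    return statistics.pstdev(values) / statistics.mean(values)

def onset_asynchronies(onsets):
    """Intervals between successive unit onsets, in the same units as the onsets."""
    return [b - a for a, b in zip(onsets, onsets[1:])]

# Hypothetical syllable onsets and durations in milliseconds (not corpus data).
onsets = [0, 180, 420, 600, 850]
durations = [160, 220, 170, 230, 200]

cv_duration = coefficient_of_variation(durations)
cv_soa = coefficient_of_variation(onset_asynchronies(onsets))
```

Because the CV is normalized by the mean, it allows variability to be compared across units with very different typical lengths, such as phones and words.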

Figure 10.1 Schematic display of amplitude envelope. Waveform (black) with a stylized amplitude envelope (dashed line) shifted upwards to increase visibility.

Figure 10.2 Differences in amplitude envelope spectra for two linguistic contrasts. Estimated effects of consonantal length in Italian (top) and pitch accent type in German (bottom), as predicted by the GAMM (left panel) and estimated differences (right panel). The gray band shows the 95% CI of the difference.

Figure 11.1 Illustration of the P-center effect. A sequence of six English words (bad, mad, sad, had, ad, and pad) was produced in sync with a metronome at 2.5 Hz (400 ms between beat onsets) and sounds highly regular. However, the inter-onset intervals between successive word onsets range from a minimum of 312 ms (mad-sad) to a maximum of 534 ms (had-ad), demonstrating the discrepancy between (irregular) acoustics and (regular) perception that is typical of the P-center effect.
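
The arithmetic behind the caption can be checked with a few lines. The sketch below converts the metronome rate to a beat period and compares it with the two extreme inter-onset intervals that the caption reports; the function and variable names are chosen here for illustration.

```python
def beat_period_ms(rate_hz):
    """Interval between beat onsets, in milliseconds, for a metronome at rate_hz."""
    return 1000.0 / rate_hz

period = beat_period_ms(2.5)  # 400.0 ms

# The two extreme inter-onset intervals reported in the caption (ms).
extremes = {"mad-sad": 312, "had-ad": 534}
deviations = {pair: ioi - period for pair, ioi in extremes.items()}
print(deviations)  # {'mad-sad': -88.0, 'had-ad': 134.0}
```

The shortest interval falls 88 ms below the 400 ms beat period and the longest exceeds it by 134 ms, even though the sequence is heard as regular.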

Figure 12.1A. Distal rate effects. (i) A sentence with a target segment (“summer or”) containing a reduced, coarticulated function word (“or”) (speech waveform in gray, target segment in black). (ii) A version of the same sentence manipulated to slow context speech rate. (iii) Subjects asked to repeat the sentence report fewer function words for the slowed context.

Figure 12.1B. The SI hypothesis. (i) An illustration of SI for the normal context speech stimulus from (A). A mean speech rate (μ_1) is computed from the interpretation(s) of context speech. For each candidate interpretation of the target segment (“summer or” and “summer”), knowledge of each syllable’s relative duration is combined with μ_1 to obtain an estimated candidate duration. These estimated candidate durations are compared to the observed (e.g., acoustic) duration (ν). The candidate that best explains what is heard (in this case, the one most probable given the observed duration) is perceived. (ii) An illustration of SI for the slowed context speech stimulus from (A). Here, a slower mean speech rate (μ_2) leads to a different judgment of which candidate interpretation is most probable.

Acoustic time series and envelope images adapted from Peelle and Davis (2012) (licensed under CC BY 3.0).
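
The comparison described in Figure 12.1B can be sketched as a toy computation. This is a minimal illustration under stated assumptions, not the authors' model: it assumes a Gaussian likelihood over durations with a fixed standard deviation, a flat prior over candidates, and invented values for the relative syllable durations, the observed duration ν, and the rates μ_1 and μ_2.

```python
import math

def expected_duration(relative_durations, rate):
    """Expected candidate duration (s): summed relative syllable durations / rate (syll/s)."""
    return sum(relative_durations) / rate

def gaussian_likelihood(observed, expected, sd):
    """Density of the observed duration under a Gaussian centered on the expectation."""
    z = (observed - expected) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def most_probable(candidates, rate, observed, sd=0.05):
    """Candidate whose expected duration best explains the observed one (flat prior)."""
    return max(
        candidates,
        key=lambda name: gaussian_likelihood(
            observed, expected_duration(candidates[name], rate), sd
        ),
    )

# Hypothetical relative syllable durations; the reduced "or" counts as half a syllable.
candidates = {"summer or": [1.0, 1.0, 0.5], "summer": [1.0, 1.0]}
nu = 0.5     # observed duration of the target segment, in seconds (assumed)
mu_1 = 5.0   # normal context rate, in syllables per second (assumed)
mu_2 = 4.0   # slowed context rate, in syllables per second (assumed)

print(most_probable(candidates, mu_1, nu))  # summer or
print(most_probable(candidates, mu_2, nu))  # summer
```

With these invented numbers, the same observed duration is best explained by “summer or” at the normal rate and by “summer” at the slowed rate, which matches the direction of the distal rate effect in Figure 12.1A.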

Figure 12.2A. A hierarchical Bayesian network for word segmentation. Hierarchically organized latent variables representing sequences of words (Word_i), syllables (Syl_i), and phonemes (Φ_i) constitute a generative model of morphosyntactic structure. Hierarchically organized latent variables representing syllabic rate (Rate) and sequences of syllable durations (D_{Syl,i}) and phoneme durations (D_{Φ,i}) constitute a generative model of speech timing. Together, they specify a speech signal as a trajectory in feature space.

Figure 12.2B. Interdependence of contextual and incremental processing. The size of each increment depends on the mean rate, and vice versa. Left and right show two different interdependent sets of increments and mean rate.

Figure 12.2C. The VPSI mechanism. A rate prior enables rate-dependent inference of content. When reliable timing information arrives, it triggers post hoc content re-estimation, which leads to rate re-estimation and a rate posterior. This serves as the rate prior for rate-dependent content inference of the next chunk of speech.

Acoustic time series and envelope images adapted from Peelle and Davis (2012) (licensed under CC BY 3.0).

Figure 13.1 Speech characterization at multiple levels of analysis. Rate (in Hz) or information rate (in bits/s) of seven linguistic features of an example sentence. Features are described from low to high linguistic levels: acoustic temporal modulation rate (in Hz), syllabic rate (in Hz), phonemic rate (in Hz), syllabic information rate (in bits/s), phonemic information rate (in bits/s), static lexical surprise (i.e., word frequency) (in bits/s), and contextual lexical surprise (in bits/s).

Licensed under CC BY.

Figure 14.1 A BLIT demonstration. Illustration of perceptual regimes with visual analyses of acoustic impulse trains (BLITs) at different rates and different domains (see text for details).

Image taken from Albert (2023).

Figure 14.2 Perceptual regimes and syllables. Schematic illustration of the relationship between perceptual regimes and syllabic units. Segmental makeup in terms of sonority is related to the spectral regime with high-frequency oscillations within syllables, while syllabic size is related to the temporal regime with low-frequency oscillations between syllables. The ratio between the low- and high-frequency oscillations in this illustration is arbitrarily set to 1:20. This is a realistic ratio: if syllables are taken to have a typical average duration of 200 ms (5 Hz), the high-frequency oscillation within them would reflect a typical F0 for adult males of 100 Hz. For simplicity, this generalized illustration shows a single rate at each timescale using a steady phase (isochronous repetitions).

Image taken from Albert (2023).

  • Acoustic and Sublexical Rhythms
  • Edited by Lars Meyer, Max Planck Institute for Human Cognitive and Brain Sciences, Antje Strauss, University of Konstanz
  • Book: Rhythms of Speech and Language
  • Online publication: 23 April 2026
  • Chapter DOI: https://doi.org/10.1017/9781009295888.010