13.1 Speech Processing as Information Transmission through Constrained Channels
Speech is crucial in our daily lives, enabling direct interaction and communication. Understanding speech is nevertheless a complex process, owing to the transient and intricately structured nature of the signal. Indeed, speech is a multiplexed signal that can be abstracted at multiple levels of analysis, each with its own complexity, organizational principles, and perceptual units, unfolding at distinct timescales.
How does the brain build the diverse linguistic representational units at different timescales from the speech signal? This process is made even harder by the fleeting nature of speech and the limitations of human memory. Indeed, speech comprehension faces a "now-or-never bottleneck" (Christiansen and Chater, 2016): if listeners do not process relevant information in a fast and incremental fashion, they lose the opportunity to understand it altogether. Therefore, the speed at which information (whether acoustic or linguistic in nature) is conveyed in speech – the information rate – is a more relevant dimension for the brain than the absolute amount of information conveyed (the information value; Coupé et al., 2019). This is because our neurocognitive resources are limited and can only process a certain amount of information in a given amount of time, leading to temporal bottlenecks (Hasson et al., 2008; Honey et al., 2012; Lerner et al., 2011; Vagharchakian et al., 2012). By understanding these bottlenecks and their implications, we can better understand how we process speech.
A potential way of uncovering general principles of speech perception is to describe and determine the temporal constraints that shape processing at each level of speech analysis. In this chapter, we will accordingly show that a meticulous characterization of the various levels of organization found in speech and language, of their temporal constraints, and of their relation to comprehension can provide valuable and novel insights into an individual's speech processing ability.
Speech processing in the human brain can be conceptualized as a process of information integration through channels with limited capacities. The auditory system continuously receives a complex stream of sound waves that needs to be decoded in real time into meaningful information in order to understand the messages conveyed by the speaker. This process involves several stages of analysis, from low-level acoustic processing to high-level semantic interpretation (Christiansen and Chater, 2016; Hickok and Poeppel, 2007; Rosen, 1992). One of the key challenges of speech processing is dealing with the limited capacity of our neural resources, which results from intrinsic biological constraints. This implies that speech signals or speaking situations that do not conform to these constraints result in poor comprehension.
Building on previous work (Coupé et al., 2019; Ghitza, 2013; Gibson et al., 2019; Pellegrino et al., 2011; Reed and Durlach, 1998), we recently proposed to determine how the limited capacity of our neural resources and the complexity of linguistic features in speech constrain our ability to comprehend spoken language (Giroud et al., 2023). We proposed to rely on a concept inherited from information theory (Shannon, 1948): channel capacity. Within this framework, each level of the speech processing hierarchy can be modeled as a transfer of information through a dedicated channel. Channel capacity is defined as the maximum amount of information that can be transmitted through this communication channel without errors or loss, in bits per second (bits/s). It can also be described as a temporal (processing) bottleneck, in which information is processed at a fixed speed (Vagharchakian et al., 2012). Hence, if too much information arrives per unit of time, information transfer is suboptimal or fails. Using such a normative measurement framework allowed for the determination of the multilevel linguistic processing constraints limiting speech comprehension. This suggests that speech perception is hierarchical (Millet et al., 2022), with sequential bottlenecks, each with its own channel capacity. Hereafter, we provide an account of diverse experimental works that have brought insights into the processing bottlenecks involved in speech comprehension. We focus on work spanning multiple levels of analysis, from acoustic to higher-level linguistic features, to determine their respective channel capacities. More precisely, we hereafter characterize the following linguistic features: the speech acoustic timescales; the syllabic timescale; the phonemic timescale; higher-level linguistic timescales such as words, phrases, and sentences; and lexical information rates and contextual information, as derived from deep neural networks (Figure 13.1; see also Giroud et al., 2023).
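To make this framework concrete, channel capacity can be written in Shannon's terms. The following is a standard textbook formulation rather than a quantity estimated in the studies cited above; the symbols r and H below are our illustrative notation for the unit rate and the per-unit information of a given linguistic level:

```latex
% Channel capacity: the maximum mutual information between channel
% input X and output Y over all possible input distributions p(x).
C \;=\; \max_{p(x)} I(X;Y),
\qquad
I(X;Y) \;=\; \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)} .

% For a linguistic level emitting r units per second, each carrying
% H bits on average, lossless transmission requires the incoming
% information rate to stay below capacity:
R \;=\; r \times H \;\le\; C \quad \text{(bits/s)} .
```

On this reading, the speeded-speech experiments reviewed below probe the point at which the incoming rate R exceeds C at a given level of the processing hierarchy.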
Figure 13.1 Speech characterization at multiple levels of analysis. Rate (in Hz) or information rate (in bits/s) of seven linguistic features of an example sentence. Features are described from low to high linguistic levels: acoustic temporal modulation rate (in Hz), syllabic rate (in Hz), phonemic rate (in Hz), syllabic information rate (in bits/s), phonemic information rate (in bits/s), static lexical surprise (i.e., word frequency) (in bits/s), and contextual lexical surprise (in bits/s).

13.2 The Speech Acoustic Timescales
First and foremost, speech is a complex acoustic signal that involves variations in frequency and intensity over time. These low-level acoustic features can be described in terms of spectro-temporal modulations (Elliott and Theunissen, 2009), and are critical for the intelligibility of speech. On the one hand, the spectral (or frequency) dimension is a crucial aspect of the speech signal. It corresponds to the distribution of the energy of the sound signal across the frequency scale (the sound spectrum), and makes it possible to define the different formants of the speech units (in particular, the vowels) and their transitions (Stevens and Klatt, 1974). On the other hand, the temporal dimension of speech sounds is highly relevant for comprehension (Albouy et al., 2020; Shannon et al., 1995; Smith et al., 2002). This second dimension indexes the precise organization of the different elements of speech over time.
When producing speech, the dynamics of the vocal tract articulators are translated into a waveform that displays fluctuations in signal amplitude over time. This pattern is referred to as the speech signal's envelope, and its main temporal modulation is typically situated between 2 and 8 Hz, with an average maximum around 4–5 Hz (Ding et al., 2017; Varnet et al., 2017). Critically, this characteristic range is preserved across speakers, languages, and speaking conditions (Ding et al., 2017; Poeppel and Assaneo, 2020). Speech thus appears to be temporally structured, a feature that the brain might capitalize on to further process relevant information. Multiple temporal modulations are crucial for comprehension, including those within the 1–7 Hz range related to phrases, words, and syllables (Elliott and Theunissen, 2009; Meyer, 2018). Temporal modulations above 12 Hz are linked to specific phonetic features and segmental information (Chapter 9; Christiansen et al., 2009; Drullman et al., 1994; Rosen, 1992; Shannon et al., 1995). Speech signals lacking the naturally occurring envelope temporal modulations are less intelligible (Chi et al., 1999, 2005; Elhilali et al., 2003; Elliott and Theunissen, 2009). Moreover, removing the main temporal fluctuations (2–9 Hz) from spoken stimuli by artificially filtering the signal degrades intelligibility for listeners, and artificially restoring these temporal modulations – by adding brief noise bursts that act as temporal cues exactly where the "acoustic edges" of the original stimuli were – leads to a drastic increase in intelligibility (Doelling et al., 2014; Ghitza, 2012).
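As an illustration of how these envelope statistics are typically measured, here is a minimal sketch using a Hilbert-transform envelope. The broadband (single-band) envelope and the 10 Hz low-pass cutoff are simplifying assumptions; published modulation spectra (e.g., Ding et al., 2017) instead sum envelopes across cochlear subbands:

```python
# A minimal sketch: extract the amplitude envelope of a speech waveform
# and inspect its modulation spectrum (expected to peak around 4-5 Hz).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def envelope_modulation_spectrum(y, sr, lp_cutoff=10.0):
    # Amplitude envelope: magnitude of the analytic (Hilbert) signal.
    env = np.abs(hilbert(y))
    # Keep only the slow amplitude fluctuations (temporal modulations).
    b, a = butter(4, lp_cutoff / (sr / 2), btype="low")
    env = filtfilt(b, a, env)
    # Power of the mean-removed envelope as a function of modulation frequency.
    power = np.abs(np.fft.rfft(env - env.mean())) ** 2
    freqs = np.fft.rfftfreq(env.size, d=1.0 / sr)
    return freqs, power
```

Applied to a natural speech recording (a mono array y at sampling rate sr), the resulting spectrum concentrates its power in the low modulation frequencies described above.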
13.3 Neural Tracking of the Speech Acoustic Dynamics
At the neural level, the auditory cortex effectively represents the speech envelope (Nourski et al., 2009; Shamma, 2001). Theta band (4–8 Hz) neural activity consistently aligns with the speech envelope, which closely approximates the syllabic timescale (Chapter 3; Giraud and Poeppel, 2012; Luo and Poeppel, 2007). However, theta activity primarily encodes acoustic rather than linguistic features (Etard and Reichenbach, 2019). Although crucial for intelligibility, the speech envelope indeed only indirectly reflects the syllabic rate, which is instead landmarked by acoustic onset edges (Oganian and Chang, 2019; Schmidt et al., 2023; Zhang et al., 2023). Neural tracking of the speech envelope is hence a necessary, but not sufficient, condition for comprehension (Ahissar et al., 2001; Brodbeck and Simon, 2020; Kösem et al., 2023). Additionally, while the temporal modulation rate of natural speech is around 5 Hz, neural processes can adapt to acoustic rates up to 15 Hz, beyond which comprehension is hindered (Giroud et al., 2023). This channel capacity associated with the acoustic modulation rate has a strong impact on speech comprehension, and is independent from syllabic or any other linguistic features. Taken together, the above results converge to reveal the central role of the temporal envelope in speech processing.
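Envelope tracking of this kind is often quantified as spectral coherence between a neural signal and the speech envelope. A minimal sketch, assuming an EEG channel and an envelope already resampled to a common rate fs (variable names are ours):

```python
# A minimal sketch of speech-brain coherence in the theta band (4-8 Hz).
# Assumes `eeg` and `env` are 1-D arrays at the same sampling rate `fs`.
import numpy as np
from scipy.signal import coherence

def theta_tracking(eeg, env, fs):
    # Windowed cross-spectral coherence; 4 s windows give ~0.25 Hz resolution.
    freqs, coh = coherence(eeg, env, fs=fs, nperseg=int(4 * fs))
    band = (freqs >= 4.0) & (freqs <= 8.0)
    return coh[band].mean()  # mean theta-band speech-brain coherence
```

In practice such values are compared against surrogates (e.g., mismatched speech-EEG pairings), since coherence estimates are positively biased.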
13.4 The Syllabic Timescale
The hierarchical structure of language implies the existence of different linguistic units that are combined in different ways to create an infinite number of meanings. While there is currently no consensus on the nature of the fundamental unit of speech recognition, it is generally accepted that features described in phonetics are at work during language perception. Two pre-lexical levels of description have been subject to intense neurophysiological investigation due to their relevance for speech perception: phoneme-sized units (either of a phonetic or a phonemic nature) and syllable-sized units (Giraud and Poeppel, 2012; Mesgarani et al., 2014; Poeppel and Assaneo, 2020).
Syllables last between 150 and 300 ms, with an average around 200 ms (Ghitza and Greenberg, 2009; Greenberg, 2001; Rosen, 1992). This corresponds to a rate of 2.5–8 syllables per second in natural settings (Coupé et al., 2019; Kendall, 2013; Pellegrino et al., 2011; Zhang et al., 2023). The syllable is an essential unit of all languages, with regard to acquisition, pathologies, language errors, and psycholinguistic processing (Dolata et al., 2008).
Accordingly, the syllabic timescale is the strongest linguistic determinant of speech comprehension (Giroud et al., 2023). Previous research using speeded speech has provided evidence that beyond 15 syllables per second speech becomes unintelligible (Dupoux and Green, 1997; Foulke and Sticht, 1969; Ghitza, 2014; Giroud et al., 2023; Nourski et al., 2009); 15 Hz would hence be the channel capacity associated with syllabic processing (Giroud et al., 2023). Further work using time-compressed spoken materials showed that compressing natural speech by a factor of three or more impairs comprehension, but that this effect is strongly alleviated by the insertion of periods of silence between the time-compressed speech segments (Ghitza and Greenberg, 2009). In particular, restoring the "syllabicity" of the spoken stimuli (their original temporal structure in terms of syllable rate) seems to be the optimal way to partially restore comprehension of highly compressed speech. Overall, this suggests that the online tracking of individual syllables is a strong prerequisite for speech comprehension.
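For illustration, here is a minimal sketch of this "repackaging" manipulation, assuming the librosa package for pitch-preserving time compression; the 40 ms segment and 80 ms gap durations are illustrative values chosen in the spirit of Ghitza and Greenberg (2009), and the rest is our simplification:

```python
# A minimal sketch: compress speech threefold, then re-insert silent gaps
# between short segments to partially restore the original syllabic timing.
import numpy as np
import librosa

def repackage(y, sr, factor=3.0, seg_ms=40, gap_ms=80):
    # Pitch-preserving time compression (3x faster by default).
    fast = librosa.effects.time_stretch(y, rate=factor)
    seg = int(sr * seg_ms / 1000)
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=fast.dtype)
    # Chop the compressed signal and interleave silences, so that the
    # overall rate of events approaches that of the uncompressed original.
    chunks = [fast[i:i + seg] for i in range(0, len(fast), seg)]
    return np.concatenate([np.concatenate((c, gap)) for c in chunks])
```

Pitch-preserving compression (rather than naive resampling) matters here, since resampling would also shift the spectral content of the stimuli.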
13.5 The Phonemic Timescale
While the syllabic timescale and its neural underpinnings have been investigated in depth, the contribution of the phonemic timescale to speech comprehension, and its neural substrate, are less clear. Phonemes are the smallest linguistic units of speech sounds and represent a generalization, or abstraction, over different phonetic realizations. They are the smallest perceptual units capable of determining the meaning of a word (e.g., beer and peer differ only with respect to their initial phonemes). Phonemes typically last between 60 and 150 ms in natural speech, with the majority being around 50–80 ms (Ghitza and Greenberg, 2009; Rosen, 1992). This corresponds to a rate of approximately 10–15 phonemes per second in natural speech (Studdert-Kennedy, 1986). Phonemes are associated with a processing bottleneck whose channel capacity is ~35 Hz (Giroud et al., 2023). However, the phonemic rate has only a residual impact on speech comprehension (Giroud et al., 2023), which suggests that the online tracking of individual phonemes is not a prerequisite for speech comprehension. Instead, acoustic-phonetic representations of speech are encoded during natural speech perception (Mesgarani et al., 2014; Nourski et al., 2015), with multiple speech sounds being encoded in parallel at any given time, together with their relative order within the speech sequence (Gwilliams et al., 2022).
Phonemes are, however, a relevant unit of representation for speech processing. During their first months, infants are able to discriminate most phonemic contrasts present in multiple languages (Gervain, 2015; Mahmoudzadeh et al., 2013; Moon et al., 2013), and about six months after birth this ability becomes more focused on native phonemes (Kuhl, 2000; Kuhl et al., 2014). Interestingly, this specialization in processing native phonemes is linked to an increase in synchronization of low-gamma band (~25–50 Hz) neural activity (Ortiz-Mantilla et al., 2016; see Menn et al., 2023, for a perspective). In adults, neural dynamics in the low-gamma band are also observed during auditory processing (Lakatos et al., 2005; Lehongre et al., 2011; Morillon et al., 2010, 2012) and notably track the amplitude envelope of speech (Di Liberto et al., 2015; Fontolan et al., 2014; Gross et al., 2013; Lehongre et al., 2011; Lizarazu et al., 2019). Whether this phenomenon reflects phonemic-categorical processing or lower-level acoustic or phonetic processing remains unclear. Work by Marchesotti et al. (2020) provides evidence of the crucial role played by low-gamma band neural dynamics in processing phonemic information during speech perception. In their study, they recorded electroencephalography (EEG) data from dyslexic participants and found that their activity at 30 Hz was lower than that of neurotypical adults. They then used transcranial alternating current stimulation (tACS) to temporarily restore low-gamma neural dynamics in these dyslexic adults. Interestingly, this intervention led to improved phonological processing and reading performance, but only when the stimulation was delivered at 30 Hz (versus 60 Hz), and only in the group of participants with dyslexia. These findings support a connection between low-gamma neural oscillations and phonological processing.
13.6 Higher-Level Linguistic Timescales
Phonemes and syllables are combined to form larger units such as words, phrases, and sentences. The length, variability, and rhythmicity of these higher-level linguistic structures have been investigated (Breen, 2018; Clifton et al., 2006). These (post-)lexical timescales, however, are of the same order of magnitude as prosodic dynamics, making their specific investigation difficult (but see Section 3). In particular, in spoken languages, prosodic information (intonation, pauses) naturally fluctuates around 0.5–3 Hz, which encompasses the phrasal and word-level timescales (Auer et al., 1999; Ghitza, 2017; Inbar et al., 2020; Stehwien and Meyer, 2022). Such speech dynamics are tracked by neural dynamics in the same range, which corresponds to the delta frequency band (Bonhage et al., 2017; Boucher et al., 2019; Bourguignon et al., 2013; Buiatti et al., 2009; Gross et al., 2013; Meyer et al., 2017; Molinaro et al., 2016; Park et al., 2015). The distinctive role of these delta-rate dynamics in the temporal cortex for prosodic tracking and high-level linguistic processes has been documented (Bourguignon et al., 2013; Ding et al., 2016; Keitel et al., 2018; Kösem and van Wassenhove, 2017; Lamekina and Meyer, 2023; Lu et al., 2023; Molinaro and Lizarazu, 2018; Rimmele et al., 2023; Vander Ghinst et al., 2016), but their respective channel capacity remains to be explored. Of note, this phrasal tracking occurs even in the absence of distinct acoustic modulations at the phrasal rate (Ding et al., 2016; Kaufeld et al., 2020; Keitel et al., 2018). However, diverging evidence has led to the proposal to dissociate delta activity reflecting the acoustically driven segmentation of prosodic phrases from delta activity indexing the knowledge-based segmentation of semantic/syntactic phrases (Lu et al., 2023; Meyer, 2018). Currently, the role of delta neural dynamics in speech processing is still vigorously debated (Boucher et al., 2019; Giraud, 2020; Inbar et al., 2020; Kazanina and Tavano, 2023; Lo et al., 2023).
Strikingly, during speech perception, spontaneous finger tapping at the perceived (prosodic) rhythm of speech occurs within the delta range (i.e., at ~2.5 Hz; see Lidji et al., 2011). A similar effect is visible during music perception, with spontaneous movements occurring at the perceived beat, around 0.5–4 Hz (Merchant et al., 2015; Morillon et al., 2019; Rajendran et al., 2018). These findings point toward a preference of attentional and motor systems for the slow (~0.5–3 Hz) temporal dynamics of auditory streams. Accordingly, during speech processing, delta oscillations are visible not only in temporal areas but also in the motor cortex (Giordano et al., 2017). And delta motor cortical dynamics uniquely contribute to both the modulation of auditory processing and comprehension: On the one hand, the tracking of acoustic dynamics by the (left) auditory cortex is principally modulated by motor areas, through delta (and to a lesser extent theta) oscillatory activity (Keitel et al., 2017; Park et al., 2015). On the other hand, in motor areas, both the delta tracking of the phrasal acoustic rate and delta-beta coupling predict speech comprehension (Keitel et al., 2018).
13.7 From Speech Rate to Information Rate
Past work has characterized the properties of language in terms of informational content exchange and transmission, using large cross-linguistic corpora and the information theory framework (Shannon, 1948). In this context, "information" does not refer to the meaning of a message but to its unpredictability or unexpectedness (Coupé et al., 2019; Oh et al., 2015; Pellegrino et al., 2011). Pellegrino and colleagues (Coupé et al., 2019; Pellegrino et al., 2011) investigated how effectively different languages convey information, positing that languages globally share similarities due to the human cognitive architecture. They calculated the syllabic information rate, quantifying the average information per syllable transmitted per second. Their studies revealed that many languages exhibit a comparable channel capacity associated with syllabic processing, as evidenced by similar syllabic information rates. However, different strategies were visible across languages, captured by a trade-off between information density and speech rate (Coupé et al., 2019; Pellegrino et al., 2011). In other words, in some languages, such as Japanese, speakers tend to pronounce many syllables per second (~8), with each syllable being mildly informative, while in other languages, such as Thai, speakers pronounce fewer syllables per second (~5), but each syllable is more informative. Overall, the amount of syllabic information transmitted per second is comparable across languages. This suggests that languages have adapted to fit within the temporal constraints imposed by the processing bottleneck of syllabic information. Extending this research to online speech comprehension, Giroud et al. (2023) showed that both phonemic and syllabic information rates impose a processing bottleneck that significantly limits speech comprehension. However, these informational features were found to have a smaller impact on comprehension than higher-level lexical and supra-lexical information.
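This trade-off can be made concrete with a toy computation: per-syllable information is the Shannon entropy of a language's syllable distribution, and the syllabic information rate is that entropy multiplied by the syllabic rate. The inventories and rates below are illustrative stand-ins, not the corpus estimates of Coupé et al. (2019):

```python
# A toy illustration of the density-by-rate trade-off (illustrative numbers).
import numpy as np

def entropy_bits(p):
    # Shannon entropy in bits; assumes strictly positive probabilities.
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

# Language A: many syllables/s, a small (low-entropy) syllable inventory.
# Language B: fewer syllables/s, a larger, more informative inventory.
p_a = np.ones(50)          # uniform over 50 syllables -> ~5.6 bits/syllable
p_b = np.ones(300)         # uniform over 300 syllables -> ~8.2 bits/syllable
rate_a, rate_b = 8.0, 5.0  # syllables per second

for name, p, r in [("A", p_a, rate_a), ("B", p_b, rate_b)]:
    h = entropy_bits(p)
    print(f"{name}: {h:.1f} bits/syll x {r} syll/s = {h * r:.0f} bits/s")
# Both land in the same ballpark (~40 bits/s), mirroring the cross-language finding.
```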
At the lexical level, listeners take advantage of one of the most striking properties of language: the fact that all words do not have the same probability of being uttered. Indeed, words obey a Zipfian distribution (Zipf, 1935), which characterizes the frequency at which they occur in natural language (as computed from corpora of millions of words). "The" is the most common English word, while "persiflage" occurs only rarely. Word frequency is highly correlated with the mean time needed to recognize a word (Howes and Solomon, 1951). Accordingly, compressed sentences with a higher density of unexpected words are more difficult to understand (Giroud et al., 2023), and the rate at which this lexical information (or "static lexical surprise," derived from word frequency) occurs during speeded speech perception is a major determinant of speech comprehension, independent of the previously described lower-level linguistic features.
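Static lexical surprise follows directly from these Zipfian statistics: the surprise of a word is the negative log-probability of that word, estimated from its corpus frequency. A minimal sketch, assuming the wordfreq package for corpus frequencies; dividing the total bits by an assumed utterance duration to obtain bits/s is our simplification:

```python
# A minimal sketch of static lexical surprise (bits) and its rate (bits/s).
import math
from wordfreq import word_frequency  # corpus-based relative word frequencies

def static_surprise_bits(word, lang="en"):
    p = word_frequency(word, lang)  # ~probability of the word in large corpora
    return -math.log2(p) if p > 0 else float("inf")

words = "the little red riding hood".split()
bits = sum(static_surprise_bits(w) for w in words)
duration_s = 1.5  # assumed utterance duration, for illustration only
print(f"{bits:.1f} bits total -> {bits / duration_s:.1f} bits/s")
```

Frequent words such as "the" contribute few bits, rare words many, so the same sentence duration can carry very different static lexical information rates.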
13.8 Contribution of Contextual Information
The channel capacity of contextual (acoustic or lexical) information – the maximum amount of contextual information that listeners can process per unit of time – can also be determined. Contextual information in speech refers to the additional information from surrounding sounds, words, or sentences that can be used to guide perception and enhance comprehension. For instance, listeners can make use of the acoustic context (e.g., specific acoustic cues such as fundamental frequency or voice-onset time) to adaptively and predictively process speech in specific situations (Idemaru and Holt, 2011; Lamekina and Meyer, 2023; Zhang et al., 2021). Furthermore, not only the nature but also the timing of events is highly relevant for comprehension. For instance, contextual speech rate has been shown to affect the detection of subsequent words (Dilley and Pitt, 2010; Kösem et al., 2018), word segmentation boundaries (Reinisch et al., 2011), and perceived constituent durations (Bosker, 2017).
Listeners also capitalize on contextual lexical information to process speech. Sentences with less expected endings (containing a surprising final word, as in "the little red riding camembert") result in a larger negative deflection of the EEG signal 400 ms after the onset of the closing word: the classical N400 component (Kutas and Hillyard, 1984). Thanks to the ever-growing availability of large language models – deep neural networks trained on language material in an unsupervised way – researchers now have access to models that capture the statistical properties of the language data they are trained on, at different levels of the linguistic hierarchy. This enables a finer characterization of the contextual information contained in large corpora that reflects what listeners should expect during everyday communication situations.
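Large language models make this contextual surprise directly computable. A minimal sketch with GPT-2 via the Hugging Face transformers library; note that GPT-2 scores subword tokens, so obtaining true word-level estimates requires summing the bits of each word's tokens (not shown):

```python
# A minimal sketch: contextual lexical surprise (bits per token) from GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisal_bits(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Log-probability of each token given all preceding tokens.
        log_probs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return [
        (tokenizer.decode(int(tok)), -log_probs[pos, tok].item() / math.log(2))
        for pos, tok in enumerate(ids[0, 1:])  # first token has no context
    ]
```

Applied to "the little red riding hood" versus "the little red riding camembert", the final token of the second variant receives many more bits, which is precisely the kind of quantity the N400 amplitude is taken to track.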
By comparing the neural activity patterns evoked by different linguistic units to the probabilities assigned to those units by large language models, researchers can gain insights into the nature of the mental representations of linguistic features in the brain (Brodbeck et al., 2018; Caucheteux et al., 2021; Donhauser and Baillet, 2020; Frank et al., 2015; Goldstein et al., 2022; Heilbron et al., 2022; Schrimpf et al., 2021). Combining a deep neural network (GPT-2), used to estimate contextual predictions, with neural recordings from participants who were listening to audiobooks, Heilbron et al. (2022) found that brain responses are continuously modulated by linguistic predictions. They observed an impact of contextual predictions at the level of meaning, grammar, words, and speech sounds, and found that high-level predictions can inform low-level ones. Contextual predictions at the word level (i.e., contextual lexical information) extracted from GPT-2 also linearly map onto the brain responses to speech (Caucheteux et al., 2023). Overall, these results link the predictive coding and language processing frameworks into a coherent picture (but see Antonello and Huth, 2022, for a conflicting view).
At the behavioral level, contextual lexical surprise (i.e., the unexpectedness of a word given the sentence context, as estimated with a large language model) strongly impacts the comprehension of compressed sentences (Giroud et al., 2023). Critically, in natural speech at normal speed, the intrinsic statistics associated with contextual lexical information are already close to its channel capacity. This suggests that contextual lexical surprise is an important constraint on the rate at which natural speech unfolds.
13.9 Exploring Linguistic Levels and Their Associated Neural Mechanisms within the Channel Capacity Framework
The work presented here highlights multiple linguistic units and their respective timescales, which are relevant for speech processing. They encompass different levels of organizational description and complexity, from acoustic to supra-lexical. More precisely, we characterized the speech acoustic timescales; the syllabic and phonemic timescales; higher-level linguistic timescales such as words, phrases, and sentences; and lexical information rate and contextual information (Figure 13.1). While it appears that they each contribute individually to speech comprehension, it is likely that they also interact in a complex way in natural speech conditions to define a global channel capacity associated with our comprehension system (Giroud et al., 2023). For instance, high-level contextual lexical information drives lexical access during continuous speech perception (Gwilliams et al., 2022); lexical information modulates, in turn, phonological processing via the maintenance of sub-phonemic details in the auditory cortex over hundreds of milliseconds (Gwilliams et al., 2018); and when prior context constrains lexical processing, sub-lexical representations are inhibited, as they are no longer as important for further processing (Martin, 2016, 2020).
A way forward in deepening our understanding of the neural processes at play during speech comprehension is to develop formal descriptions and measurements of computational units and to test their relevance experimentally. In this view, a path worth exploring is the continued development of increasingly precise models of speech and language processing using artificial neural networks (Arana et al., 2023; Millet et al., 2022). Studying the learned representations of these models can provide insights into meaningful representations for speech comprehension without relying on linguistic concepts (Millet et al., 2022). Previous research has demonstrated that the retrieved model representations have spectro-temporal parameters similar to those measured directly in the human auditory cortex (Riad et al., 2021). In silico models offer several other advantages, including the ability to train them under specific conditions and stimuli and to observe their resulting behaviors (Kanwisher et al., 2023). These models can also be used to make testable predictions and hypotheses about speech processing in the brain, thus guiding the development of new theories of language processing and acquisition. For instance, Caucheteux et al. (2023) determined that large language models' representations of upcoming words can be used to predict brain activity more accurately than representations of preceding words. Moreover, enhancing this algorithm with predictions that span multiple words improves this brain mapping, and these predictions are organized hierarchically, with frontoparietal cortices predicting higher-level, longer-range, and more contextual representations than temporal cortices. Such a result, specifically the exact depth of representations, would have been difficult to predict with such precision solely through theoretical models or experimental paradigms. Another interesting property of these models is that they can be used to select highly specific stimuli (words, sentences) that elicit specific model behavior (e.g., a strong response of some layers, or a suppressed response). These stimuli can then be presented to participants while their brain activity is recorded, in order to probe the neural response of the network supporting language processing (Tuckute et al., 2024).
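The brain-mapping analyses described above typically take the form of linear encoding models: a regularized regression from model activations to neural responses, scored by out-of-sample correlation. A minimal sketch, assuming hypothetical precomputed arrays X (word-by-feature model embeddings) and Y (word-aligned brain responses); a random split is used here for brevity, although contiguous splits are preferable for autocorrelated narrative data:

```python
# A minimal sketch of a linear encoding model from LLM features to brain data.
# Assumes X (n_words x n_features) and Y (n_words x n_channels) are given.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def encoding_score(X, Y, alphas=np.logspace(-2, 6, 9)):
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
    model = RidgeCV(alphas=alphas).fit(X_tr, Y_tr)  # ridge with cross-validated penalty
    Y_hat = model.predict(X_te)
    # Per-channel Pearson correlation between predicted and observed responses.
    num = ((Y_te - Y_te.mean(0)) * (Y_hat - Y_hat.mean(0))).sum(0)
    den = np.sqrt(((Y_te - Y_te.mean(0)) ** 2).sum(0)
                  * ((Y_hat - Y_hat.mean(0)) ** 2).sum(0))
    return num / den
```

Contrasting scores obtained with embeddings of past words versus embeddings enriched with forecasts of upcoming words implements, in spirit, the comparison reported by Caucheteux et al. (2023).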
We believe that having a framework that combines the modeling approach with standard experimental methodologies can lead to new insights into the mechanisms underlying comprehension. To that end, we propose that a sensible extension to previous natural language processing (NLP) studies – which have primarily focused on examining comprehension during listening to spoken utterances that fall within the range of typical everyday communication scenarios – would be to explore speech comprehension through the lens of the channel capacity framework. This approach involves pushing the comprehension system to its limits by presenting listeners with speech signals that are difficult to understand in order to identify the specific acoustic and linguistic features that are crucial for comprehension, their associated channel capacity, and how such processing bottlenecks are implemented in neural dynamics. Large language models offer an unprecedented level of resolution in describing the features of language and speech signals at multiple scales. This opens up new opportunities for researchers to gain a more detailed understanding of which specific features and timescales are crucial for comprehension.
13.10 Conclusion
In conclusion, speech is a highly complex signal structured at various levels of analysis. Because of its multiplexed nature, the computations and neural circuits involved in speech processing are likely to be highly organized in space and time. Throughout this chapter, we have examined the temporal constraints limiting speech comprehension beyond the acoustic level. How these hierarchical bottlenecks are implemented, what determines their channel capacity, and how they interact to efficiently process speech are currently unknown. We have also demonstrated the potential of the channel capacity framework for enhancing our understanding of speech processing in humans. As such, developing a research program aimed at determining the capacity limitations of our cognitive resources for comprehension could be instrumental in developing predictive and remedial strategies for improving comprehension skills. One potential avenue is to tailor speech materials or specific interventions to individual cognitive resources to increase the efficiency of information transmission and encoding, thus reducing miscomprehension.
Summary
Speech is a complex signal that contains different levels of information at distinct timescales, from acoustic to supra-lexical. This chapter highlighted the importance of multiple linguistic features for understanding human comprehension ability. The temporal dynamics of these levels of analysis are discussed, along with how they fit with neural data.
Implications
Each linguistic feature can be expressed as a number of units per second, and its associated channel capacity can be derived. These channel capacities are temporal constraints on speech comprehension and can shape the multiplexed rhythms that are observed in speech and language.
Gains
The approach put forward in this chapter lays the foundation for deeper investigations into how the temporal unfolding of multilevel linguistic features impacts speech comprehension. We encourage the use of a normative framework (the concept of channel capacity) to explore the neural mechanisms of speech and language processing.
