5.1 Introduction
As expert listeners, we do not always notice how complex speech decoding is. Understanding speech means handling a continuous stream in which acoustic boundaries between elements are unclear and approximately 200 words occur per minute, and this information has to be decoded in real time. In the highly sophisticated cascade of processes that makes this achievement possible, one of the first steps is to segment the incoming signal into discrete elements onto which linguistic information can then be mapped. To this aim, listeners exploit a fundamental property of speech: its rhythmicity. Rhythm also characterizes intrinsic brain activity: without the need for any external stimulus, neuronal populations naturally show cyclic variations in excitability at different frequencies (see also Chapter 3). Brain oscillations are able to synchronize to the different timescales in speech, a process thought to be central to discretizing the signal into phoneme-, syllable-, and word-sized chunks. After sketching the importance of rhythm in speech and introducing current oscillatory models of speech perception, this chapter reviews neurophysiological evidence for the coupling of neural oscillations to speech modulations, syllabic rhythm in particular. Our aim is not to provide an exhaustive review of the literature but to draw attention to topics that may help the reader better grasp the neural oscillatory dynamics underlying speech perception. We specifically consider speech rate variations as good candidates for examining brain tracking of speech rhythm, and we also discuss the functional relationship between neural oscillations and speech intelligibility. We finally review some evidence supporting a role of motor regions as temporal predictors for speech parsing.
5.2 The Processing Benefits of (Quasi-)Rhythmic Information in Speech
The (quasi-)rhythmicity of speech is central for listeners to parse the continuous acoustic stream into distinct linguistic elements. Although not strictly periodic, the temporal structure of connected speech shows acoustic cues that occur at fairly regular intervals (see also Chapters 9 and 14). Syllables are produced at a frequency of ~5 Hz across languages (Poeppel and Assaneo, 2020), and stressed syllables marking phrasal units occur approximately every 500 ms, at 2 Hz. These recurrent temporal patterns serve as landmarks for predicting upcoming critical information, thereby facilitating speech segmentation and ultimately comprehension. Early on during language development, listeners exploit this rhythmical organization of prominences in speech to extract phonological, lexical, and morphosyntactic information (Christophe et al., 2003). English infants are, for instance, able to segment disyllabic words from unfolding speech based on stress cues, namely the strong–weak syllabic structure typical of words from their native language (Jusczyk et al., 1993). Sidiras et al. (2017) additionally revealed that 10-year-old Greek-native children better identified two-syllable words in babble noise when these were presented in synchrony with a preceding isochronous rhythmic sequence, that is, when a prominent element was expected to occur. The benefit of rhythm for speech processing is also manifest in adults. Detecting a target phoneme in disyllabic English words is easier if stressed syllables carrying this phoneme are presented at regular temporal intervals (Quené and Port, 2005; see also Roncaglia-Denissen et al., 2013, for evidence on syntactic processing). In the same vein, listening to nonverbal rhythmic sequences improves subsequent phonological processing of nonwords, provided that target phonemes appear on the beat with the rhythmical prime (Cason and Schön, 2012). The study by Ghitza and Greenberg (2009) perfectly illustrates the use of temporal expectations in speech processing. They assessed the intelligibility of sentences time-compressed to one third of their original duration, with silent intervals of varying durations inserted periodically or aperiodically between consecutive speech fragments. Sentence identification was maximal when silences were inserted periodically so as to restore the initial temporal structure of speech. In other words, even though the speech fragments were compressed and of limited intelligibility, hearing them at the time they were expected was sufficient to increase intelligibility (see also Ghitza, 2014; Penn et al., 2018). What counts for successful speech decoding is therefore not only the content of the information but also when this information is likely to occur.
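To make the Ghitza and Greenberg (2009) manipulation concrete, the sketch below shows one way such stimuli could be constructed. It is a minimal illustration, not the authors' exact procedure: it assumes the librosa package for pitch-preserving time compression, and the fragment and silence durations are illustrative (with threefold compression, 40 ms fragments followed by 80 ms of silence approximately restore the original temporal structure).

```python
import numpy as np
import librosa

def compress_and_repackage(path, compression=3.0,
                           fragment_ms=40, silence_ms=80):
    """Time-compress speech and reinsert periodic silent gaps,
    loosely following the Ghitza & Greenberg (2009) paradigm.
    Fragment/silence durations are illustrative choices."""
    y, sr = librosa.load(path, sr=None)
    # Pitch-preserving compression (phase vocoder); rate=3 -> 1/3 duration
    y_fast = librosa.effects.time_stretch(y, rate=compression)
    frag = int(sr * fragment_ms / 1000)
    gap = np.zeros(int(sr * silence_ms / 1000))
    # Chop the compressed signal into short fragments and interleave
    # silences, which slows the overall temporal structure back down
    pieces = [y_fast[i:i + frag] for i in range(0, len(y_fast), frag)]
    out = np.concatenate([np.concatenate((p, gap)) for p in pieces])
    return out, sr
```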
Along this line, the surrounding speech rate can shape phonemic perception, a process known as rate normalization: hearing a sentence at a fast rate biases subsequent perception towards longer speech sounds compared to when the sentence is produced at a slower rate (Bosker, 2017; Bosker and Ghitza, 2018; Dilley and Pitt, 2010; Maslowski et al., 2019; Reinisch, 2016; Reinisch and Sjerps, 2013; see also Lamekina and Meyer, 2022). As an example, Dutch words embedding a vowel ambiguous between short /ɑ/ and long /a:/ are identified as /ta:k/ (“task”) following a fast-rate carrier sentence, but as /tɑk/ (“branch”) after a sentence at a slower rate (Kösem et al., 2018). The rationale here is that listeners form expectations about the occurrence of upcoming syllables based on the syllabic rate of the preceding carrier sentence. At a fast rate, short syllables are expected, leading to an overestimation of the target vowel duration: vowels are perceived as longer than they actually are.
Altogether, these lines of research emphasize the predictive function of (quasi-)rhythmic information in speech. One computational mechanism that has been suggested to underpin the establishment of these predictions is the synchronization of brain rhythmic activity to speech temporal structure.
5.3 Neuronal Oscillations as a Key Mechanism for Speech Parsing
Endogenous oscillatory brain activity (Buzsáki and Draguhn, 2004; Fries, 2005, 2015; Varela et al., 2001) plays a crucial role in speech processing (see Chapter 3). By synchronizing, or entraining, to the quasi-rhythmic modulations in the acoustic signal, neuronal oscillations discretize the incoming stream into “packets” of different sizes that can then be mapped onto linguistic units. Contemporary neurocognitive models of speech processing advocate a close correspondence between the timescales in speech (Rosen, 1992) and auditory cortical activity in the delta (~1–4 Hz), theta (~4–8 Hz), and low-gamma (~25–45 Hz) frequency bands (Ghitza, 2011; Giraud and Poeppel, 2012; Peelle and Davis, 2012; Poeppel, 2003; see also Chapter 9). The phase of delta band oscillations parses the signal into large units such as words and phrases, thus encoding prosodic information, whereas theta activity phase-locks to the low-frequency modulations in the amplitude envelope, which peak at 4–7 Hz and convey syllabic information. Low-gamma amplitude, in turn, encodes faster modulations pertaining to phonetic features such as voicing and formant transitions. The parsing into speech units of varying sizes furthermore operates simultaneously but hierarchically through theta/gamma nesting, whereby the phase of theta oscillations drives gamma amplitude. Substantial evidence from electro- and magnetoencephalography (E/MEG) supports this brain dynamic during speech processing, highlighting how neuronal oscillations at different frequencies work in concert to handle the multi-time resolution of speech. In the following sections, we review the main existing findings on brain-to-speech coupling, focusing on theta band tracking of syllabic information in the case of speech rate variations. We then discuss whether and how oscillatory coupling is modulated by and contributes to speech intelligibility, a topic that is still debated in the literature. Lastly, we review evidence that beyond the auditory cortex, brain motor regions functionally contribute to speech decoding via top-down temporal predictions.
5.3.1 Cortical Tracking of Syllabic Structure: The Case of Speech Rate Variations
Since the coupling of theta band cortical activity to the speech envelope is thought to track syllabic modulations, one way to put it to the test has been to vary the syllable rate of incoming speech. Speech rate variations are ubiquitous in everyday life, and increasing speech rate, either naturally or artificially, is known to hinder speech intelligibility (Gordon-Salant et al., 2014; Guiraud et al., 2018; Janse, 2004; Janse et al., 2003; Park and Jang, 2012). Studies in young normal-hearing listeners reported that 50% intelligibility (i.e., 50% correct identification) was reached when compressed speech was three to four times faster than the normal rate, namely at ~12 syllables/s in German and Dutch and ~16 syllables/s in Mandarin (Meng et al., 2019; Schlueter et al., 2014; Versfeld and Dreschler, 2002). Fast speech therefore increases processing demands, as quantified by higher rated listening effort and larger pupil dilation relative to slower speech (Koch and Janse, 2016; Müller et al., 2019). Although comprehension of accelerated speech is challenging at first, other work has shown that listeners are able to adapt relatively quickly (Adank and Devlin, 2010; Adank and Janse, 2009; Golomb et al., 2007; Peelle and Wingfield, 2005). Speech perception indeed improves after short exposure to artificially speeded sentences, and this adaptation is robust to changes in speakers' characteristics (Dupoux and Green, 1997) and across languages, provided they share some phonological and rhythmic properties (Pallier et al., 1998; Sebastián-Gallés et al., 2000).
Neural tracking of the slow envelope fluctuations may be one of the primary mechanisms the brain uses to adapt to speech rate variations, both within and between speakers. Most E/MEG studies that measured brain coupling to syllable rate changes have accelerated – and sometimes decelerated – speech artificially. In their seminal work, Ahissar et al. (2001) reported that the auditory cortex phase-locked to amplitude modulations in English sentences compressed to 75% and 50% of their original duration, that is, at syllable rates ranging between ~4 and ~6 syllables/s. However, at higher compression ratios (35% and 20%), with syllable rates between ~9 and ~14 syllables/s, neural oscillations no longer coupled to the envelope. These conditions were also the ones showing a drop in sentence intelligibility (see Section 5.3.2). These data seem to fit Ghitza's assumption that speech comprehension is constrained by our capacity to decode syllables within a theta cycle, that is, at a maximal frequency of ~9 Hz (Ghitza, 2011, 2014). The results by Pefkou et al. (2017), however, challenge this view, as oscillations in auditory regions tracked the envelope of French sentences time-compressed by a factor of 3 (up to 10–14 Hz), beyond the upper limit of the canonical theta band (see also Nourski et al., 2009, and Section 5.3.2).
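Phase-locking between cortical activity and the stimulus is commonly assessed with measures such as cerebro-acoustic coherence between the neural signal and the speech amplitude envelope. The sketch below illustrates the general logic under simplifying assumptions (a single neural channel, a wide-band Hilbert envelope, Welch-based coherence); the studies reviewed here use more elaborate, source-level pipelines.

```python
import numpy as np
from scipy.signal import hilbert, coherence, resample

def speech_envelope(audio, fs_audio, fs_out=200):
    """Wide-band amplitude envelope via the Hilbert transform,
    downsampled to the neural sampling rate."""
    env = np.abs(hilbert(audio))
    n_out = int(len(audio) * fs_out / fs_audio)
    return resample(env, n_out)

def cerebro_acoustic_coherence(neural, env, fs, band=(4, 8)):
    """Magnitude-squared coherence between one neural channel and the
    speech envelope (same length and rate), averaged over a band
    (here theta)."""
    f, cxy = coherence(neural, env, fs=fs, nperseg=2 * fs)
    mask = (f >= band[0]) & (f <= band[1])
    return cxy[mask].mean()
```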
Without pushing the limits this far, other studies converge to show that oscillatory activity dynamically adapts to changes in speech rate. In the work by Lizarazu et al. (2019), maximal brain-to-speech coupling in bilateral auditory regions was found at 4.5–7.5 Hz (peaking at 5.6 Hz) for normal-rate sentences in Spanish (~173 words/min). These values slightly decreased to 4–6.5 Hz (peaking at 4.7 Hz) for sentences that were artificially slowed down (~139 words/min), but increased to 5.5–8.5 Hz (peaking at 6.6 Hz) for accelerated sentences (~208 words/min). Note that the syllable rate was unfortunately not provided by the authors, making comparison with other work rather difficult. Given the coupling frequencies reported, however, the syllable rate seemed to remain within the theta band boundaries. Theta/gamma coupling also varied with speech rate in this study. The phase of theta oscillations in the three previously mentioned frequency ranges modulated the amplitude of gamma oscillations at 20–37 Hz for normal-rate speech, 18–34 Hz for decelerated speech, and 23–43 Hz for accelerated speech. Such rate-dependent gamma activity was assumed to reflect the encoding of fine acoustic information in speech.
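Theta phase/gamma amplitude coupling of this kind is typically quantified with phase-amplitude coupling estimators. The following sketch illustrates one common choice, a Canolty-style mean vector length, on a synthetic signal. It is a minimal illustration under our own assumptions, not the pipeline of Lizarazu et al. (2019), and the filter settings are illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def theta_gamma_pac(x, fs, theta=(4, 8), gamma=(25, 45)):
    """Mean-vector-length estimate of phase-amplitude coupling:
    how strongly theta phase modulates gamma amplitude."""
    theta_phase = np.angle(hilbert(bandpass(x, *theta, fs)))
    gamma_amp = np.abs(hilbert(bandpass(x, *gamma, fs)))
    return np.abs(np.mean(gamma_amp * np.exp(1j * theta_phase)))

# Demo: a synthetic signal whose 40 Hz amplitude is driven by 5 Hz phase
fs = 500
t = np.arange(0, 10, 1 / fs)
theta_wave = np.sin(2 * np.pi * 5 * t)
signal = theta_wave + (1 + theta_wave) * 0.3 * np.sin(2 * np.pi * 40 * t)
print(theta_gamma_pac(signal, fs))  # larger than for phase-shuffled data
```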
A further step was taken by Kösem et al. (2018), who nicely demonstrated not only that neural oscillations adjust to speech rate but also that coupling persists after the change, thus biasing subsequent speech comprehension. Echoing the aforementioned behavioral studies on rate normalization (see Section 5.2), their MEG results revealed that listening to a carrier sentence (in Dutch) artificially accelerated at 5.5 Hz entrained neural oscillations at this frequency in a sustained manner, so that an ambiguous vowel in a following normal-rate word was perceived as long (e.g., /a:/). Conversely, when the syllable rate of the carrier sentence was decelerated to 3 Hz, the subsequent vowel was perceived as short (e.g., /ɑ/). This study exemplifies the capacity of our brain to generate predictions based on the temporal pattern of incoming speech, thereby functionally contributing to its decoding.
To the best of our knowledge, studies investigating the neural tracking of natural speech rate changes have been scarce, possibly because of methodological challenges. Yet examining the case of naturally accelerated (or decelerated) speech may prove highly informative for better grasping the oscillatory dynamics underlying naturalistic speech perception (see Alexandrou et al., 2020, for similar arguments). Naturally speaking faster induces additional spectro-temporal changes in the speech signal compared to time compression, in particular by increasing coarticulation and assimilation between speech segments (Berry, 2011; Janse et al., 2003). This complicates speech decoding for listeners (Janse, 2004), and accordingly, the processing of natural and artificial speech rate variations may involve, at least partly, different oscillatory mechanisms. Alexandrou et al. (2018) compared brain oscillatory coupling to naturally produced speech extracts in Finnish, with syllable rates ranging from ~2.6 (slow) to ~4.7 (normal) to ~6.8 (fast) syllables/s. In the delta (2–4 Hz) and theta (4–7 Hz) bands used for analysis, normal-rate speech elicited stronger synchronization than slower and faster speech in the bilateral superior temporal cortex (for delta) and the right paracentral lobule (for theta). Slow speech increased theta coupling in the right parietal cortex with respect to normal-rate speech. In light of the study by Lizarazu et al. (2019), it is quite surprising that delta tracking was not stronger for slow-rate speech than for the two other conditions. Similarly, fast-rate speech did not enhance brain coupling in the high theta band, as one could have expected. One strength of this MEG study is undeniably the use of connected speech spontaneously produced at different rates, whereas most other work examining neural tracking of speech rate variations used isolated sentences or read-aloud texts. One caveat, however, may be that coupling was measured in fixed, generic frequency bands, potentially preventing the researchers from observing changes in coupling frequency as a function of the syllable rate. The average normal and fast rates in this study were both circumscribed to the theta range, towards the lower (~4.7 Hz) and the upper (~6.8 Hz) band limits, respectively. More flexible frequency windows for analysis might have revealed more specific coupling patterns for the two speaking rates (see Keitel et al., 2018). In an MEG study (Hincapié Casas et al., 2021), we used this latter approach to compare neural tracking of French sentences naturally produced at a normal rate (~6.7 syllables/s) or at a fast rate (~9.1 syllables/s), or of sentences time-compressed to the same fast rate. We examined brain-to-speech coupling in two frequency bands defined from the spectral power of the sentence envelopes in each condition, namely 6.25 ± 1 Hz for normal-rate speech and 8.75 ± 1 Hz for fast-rate/compressed speech. Our results showed that neural oscillations in right auditory and motor regions were tuned to natural speech rate variations at frequencies specifically matching the syllabic rate.
No significant coupling was observed for time-compressed speech in the corresponding frequency band (8.75 ± 1 Hz), despite the fact that these sentences had the same rate as natural fast speech. We suggested that cortical oscillations may preferentially align to naturally produced speech rather than to artificially accelerated speech, which is not a physiologically plausible vocal output. To our knowledge, this study is the first to provide evidence for a specific oscillatory signature of natural fast speech compared to artificially accelerated speech (see also Section 5.3.3 for a discussion of motor cortex involvement).
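As a rough illustration of how such condition-specific analysis bands could be derived, the sketch below locates the dominant modulation frequency in a set of sentence envelopes and returns a band centered on it. This is a simplified reconstruction under our own assumptions (Welch spectra, a fixed ±1 Hz half-width), not the exact procedure of the study.

```python
import numpy as np
from scipy.signal import welch

def envelope_band(envelopes, fs, search=(2, 12), half_width=1.0):
    """Locate the dominant modulation frequency across sentence
    envelopes (each a 1-D array at rate fs) and return a
    peak +/- half_width analysis band in Hz."""
    spectra = []
    for env in envelopes:
        # 0.25 Hz resolution; envelopes should be at least 4 s long
        f, pxx = welch(env, fs=fs, nperseg=4 * fs)
        spectra.append(pxx)
    mean_spec = np.mean(spectra, axis=0)
    mask = (f >= search[0]) & (f <= search[1])
    peak = f[mask][np.argmax(mean_spec[mask])]
    return peak - half_width, peak + half_width
```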
Altogether, these findings underscore the need for future studies to consider natural speech rate variations in connected speech to refine neurocognitive models of speech perception. A further issue worth digging into, and one that is actively debated, is whether the rhythmic neural response to the speech envelope truly involves endogenous oscillatory generators or whether it reflects steady-state evoked responses to acoustic landmarks in the speech stream. Evidence for sustained oscillatory activity after speech offset (Kösem et al., 2018; van Bree et al., 2021) and for neural responses at frequencies that are not physically present in the acoustic signal but match mentally constructed linguistic units (Ding et al., 2016; Zoefel and VanRullen, 2016) clearly speaks in favor of the oscillatory model (see Zoefel et al., 2018, for a review; see also Hincapié Casas et al., 2021, for a lack of power increase accompanying enhanced phase-locking to sentences). Other studies nevertheless reported that cortical coupling to speech is mainly driven by a linear convolution of evoked responses at the frequency of acoustic edges in the input (Oganian and Chang, 2019; Oganian et al., 2023; Novembre and Iannetti, 2018; Zou et al., 2021). Future work should systematically attempt to delineate the contributions of oscillatory versus evoked activity to neural speech tracking, bearing in mind that the two may not be mutually exclusive.
5.3.2 Brain-to-Speech Coupling and Intelligibility: A Bidirectional Relationship?
Increasing speaking rate can obviously reduce speech intelligibility, up to a point where the listener is no longer able to comprehend. The question therefore arises as to whether neural tracking of speech varies with intelligibility and contributes to it. A number of E/MEG studies have tackled this issue to better characterize the functional role of neural oscillations in speech decoding. The already mentioned work by Ahissar et al. (2001) showed that cortical coupling to speech significantly declined for time-compressed sentences that were not intelligible. The authors hence proposed that speech comprehension depends on neural tracking, which would only be possible if the speech rate is close to the intrinsic oscillatory frequency of the auditory cortex (see Ghitza, 2011, 2014).
According to Ahissar's interpretation, speech becomes unintelligible because neural tracking fails, meaning that effective brain-to-speech coupling is mandatory for successful speech decoding. The reverse question could, however, also be raised: Does oscillatory coupling fail because speech is not intelligible? To put it differently, does this coupling need intelligible speech to occur? One way to address the latter question is to test whether neural tracking is also found for unintelligible stimuli, which would be evidence that it does not hinge on speech intelligibility. This would in turn provide some insight into the first point, namely whether brain coupling is necessary for speech decoding: neural tracking of unintelligible speech would indeed suggest that this process alone cannot fully account for intelligibility. Addressing this issue fully also requires examining whether speech can be understood in the absence of neural tracking.
5.3.2.1 Does Neural Tracking Need Intelligible Speech to Occur?
In an electrocorticography (ECoG) study, Nourski et al. (2009) reported that Heschl's gyrus only phase-locked to intelligible time-compressed sentences in English, in agreement with Ahissar et al. (2001). However, further analyses revealed that high-frequency (70–250 Hz) power in the same region was modulated by the speech envelope irrespective of compression ratio, that is, even when speech was not understood. This casts doubt on the idea that neural tracking of speech modulations relies on intelligibility. Reconciling these contradictory findings, another ECoG study (Davidesco et al., 2018) showed that the ability of neural oscillations to synchronize to intelligible and/or unintelligible speech may depend on the cortical site along the processing hierarchy. Whereas low-level early-auditory areas, in the vicinity of the primary auditory cortex, tracked time-compressed speech outside the intelligibility range, higher-order areas such as secondary auditory areas in the superior temporal gyrus and the inferior frontal gyrus only synchronized to intelligible speech. E/MEG studies using time compression, time reversal, or vocoding provided further evidence that the brain tracks envelope modulations in both intelligible and unintelligible speech (Luo and Poeppel, 2007; Pefkou et al., 2017; Zoefel and VanRullen, 2016; see Kösem and van Wassenhove, 2016, for a review). Howard and Poeppel (2010) showed that theta band phase in the auditory cortex could equally discriminate between spoken English sentences and their time-reversed, unintelligible counterparts. Converging results were found by Millman et al. (2015), with similar coupling to physically identical sentences that differed solely in intelligibility (see also Baltzell et al., 2017). Kösem et al. (2023) also recently failed to find differential delta/theta neural tracking of noise-vocoded Dutch sentences that were intelligible or not. Altogether, these findings suggest that brain-to-speech coupling does not need intelligible speech to occur, and that it may be driven by acoustic cues rather than by linguistic information per se.
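The discrimination analyses of Luo and Poeppel (2007) and Howard and Poeppel (2010) rest on comparing cross-trial phase consistency within versus across sentences. The sketch below illustrates this logic with a simple dissimilarity index computed from theta-phase time courses; the array layout and normalization are our own simplifying assumptions rather than the published implementation.

```python
import numpy as np

def phase_dissimilarity(phases_a, phases_b):
    """Within-sentence minus across-sentence inter-trial phase
    coherence, from theta-band phase arrays of shape
    (trials, timepoints). Positive values mean the phase pattern
    discriminates the two sentences."""
    def itc(ph):
        # coherence across trials at each timepoint, averaged over time
        return np.abs(np.mean(np.exp(1j * ph), axis=0)).mean()
    within = (itc(phases_a) + itc(phases_b)) / 2
    across = itc(np.vstack([phases_a, phases_b]))
    return within - across
```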
The conclusion is, however, not so straightforward, given evidence that neural tracking of speech varies with intelligibility, as reflected by stronger phase-locking to the envelope for intelligible than for unintelligible speech (Dimitrijevic et al., 2019; Ding et al., 2014; Doelling et al., 2014; Pérez et al., 2015). In the MEG study by Gross et al. (2013), delta/theta phase-locking in bilateral auditory regions was enhanced for intact compared to reversed stories in English. Stronger theta phase/gamma amplitude coupling was additionally observed when speech was intelligible. The same year, Peelle et al. (2013) also reported increased theta coupling in the left temporal cortex for intelligible vocoded English sentences relative to unintelligible ones of the same spectral complexity. They suggested that although neural oscillations can be entrained by speech lacking intelligibility, such a process would depend not only on bottom-up synchronization to the acoustic rhythmic fluctuations but also on higher-level linguistic content. Rimmele et al. (2015) reached a somewhat similar conclusion. In a multi-talker paradigm, they showed similar neural tracking (2–8 Hz) of intact and vocoded sentences as long as participants ignored them. When speech was attended, stronger coupling was found only in the intact condition. According to the authors, cortico-acoustic coupling mainly reflects brain responses to slow acoustic modulations in the envelope, but for attended speech, processing of the fine structure can enhance the temporal precision of tracking (see also Zion Golumbic et al., 2013). Linguistic information therefore seems to affect neural oscillatory coupling, a process that may operate nonlinearly, as revealed by Hauswald et al. (2022). In this study, low-frequency (1–7 Hz) coupling in bilateral middle temporal and left frontal regions followed an inverted U-shaped curve: it was maximal for challenging yet intelligible (moderately vocoded) speech in German, reflecting the enhanced use of speech temporal structure, but dropped to its lowest values for very easily understandable (intact) or barely understandable (strongly vocoded) speech. Note that by using different parameters on the same dataset and by separating periodic from aperiodic components in the coupling spectrum, the research team did not replicate these results exactly (Schmidt et al., 2023). They instead showed a gradual decrease of neural tracking with the decline in speech intelligibility (in line with Gross et al., 2013; Peelle et al., 2013), an effect carried by the aperiodic components. The periodic components exhibited a different pattern: as vocoded speech became less intelligible, neural tracking shifted from the syllabic rate (linguistic level) to the modulation rate of the envelope (acoustic level). This was shown by an increase of the coupling frequency and a narrowing of the frequency bandwidth (i.e., frequency tuning) around the higher frequencies of acoustic modulations for more vocoded, less intelligible speech.
Accordingly, whereas neural oscillations would track both syllabic and acoustic information in the envelope of intact speech (leading to an increased frequency bandwidth), listeners would preferentially rely on acoustic fluctuations when linguistic information such as syllable boundaries is more difficult to extract (see also Verschueren et al., 2022, for acoustic versus linguistic tracking with increasing speech rate). Disentangling the syllable rate from the envelope modulation rate in future studies (see Zhang et al., 2023), including those on speech rate variations, may prove useful to better characterize the complex relationship between brain oscillations and low-level acoustic information on the one hand and higher-level linguistic content on the other.
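For readers unfamiliar with the periodic/aperiodic decomposition mentioned above, the sketch below shows how the center frequency and bandwidth of an oscillatory coupling peak could be extracted with the FOOOF/specparam toolbox. Whether this matches the exact tool and settings of Schmidt et al. (2023) is an assumption on our part; the fit range and peak settings are illustrative.

```python
import numpy as np
from fooof import FOOOF  # pip install fooof (also released as specparam)

def tracking_peak(freqs, coupling_spectrum, fit_range=(2, 12)):
    """Separate periodic from aperiodic components of a (linear)
    coupling spectrum and return the center frequency and bandwidth
    of the largest peak, cf. the frequency-tuning analysis."""
    fm = FOOOF(peak_width_limits=(0.5, 6), max_n_peaks=3)
    fm.fit(freqs, coupling_spectrum, fit_range)
    peaks = np.atleast_2d(fm.get_params('peak_params'))
    if peaks.size == 0:  # no oscillatory peak above the aperiodic fit
        return None
    best = peaks[np.argmax(peaks[:, 1])]  # columns: CF, power, bandwidth
    return {'center_freq': best[0], 'bandwidth': best[2]}
```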
Coming back to our initial question of whether intelligibility is a prerequisite for neural tracking, the reviewed evidence suggests a negative answer. Brain-to-speech coupling is also found for unintelligible speech, and it may thus be largely driven by acoustic modulations in the speech signal. Still, the coupling strength can be enhanced by intelligible speech, suggesting that linguistic content plays an important role in the oscillatory dynamics, which cannot be seen as a purely rhythmic acoustic cortical response.
5.3.2.2 Is Neural Tracking Necessary for Successful Speech Decoding?
Determining whether neural coupling is a limiting factor for comprehension may appear somewhat more complex. The fact that brain oscillatory activity synchronizes to speech that is not intelligible, as reviewed above, provides some arguments that speech does not become intelligible only because neural oscillations phase-lock to its (quasi-)rhythmic fluctuations. Still, this does not allow us to firmly conclude that neural tracking is not a necessary ingredient. We would in fact like to underline that, as far as we are aware, no evidence for intelligible speech without concurrent auditory cortical coupling has been reported. Such coupling could thus be essential, albeit not sufficient, for speech decoding.
Convincing evidence for a causal role of neural tracking in speech processing and intelligibility comes from studies using transcranial alternating current stimulation, or tACS (Kadir et al., 2020; Keshavarzi et al., 2020; Riecke et al., 2018; Wilsch et al., 2018; Zoefel et al., 2018, 2020). Zoefel et al. (2018) showed that tACS (~3 Hz) over left auditory regions decreased the functional magnetic resonance imaging (fMRI) response in the bilateral superior and middle temporal gyri for vocoded yet intelligible speech (i.e., sequences of monosyllabic words at ~3 Hz). Behavioral performance in an irregularity detection task was, however, unaffected. A follow-up study (Zoefel et al., 2020) nonetheless revealed that bilateral, but not left-lateralized, tACS disrupted word report accuracy, suggesting that changes in neural oscillatory activity can impair speech perception. Conversely, enhanced speech comprehension has been reported after stimulation of neural oscillatory activity (Riecke et al., 2018). By applying a current carrying slow envelope modulations (4 Hz) over the bilateral auditory cortex, Riecke et al. could improve Dutch sentence recognition in a two-talker situation. A second experiment confirmed that biasing neural oscillations towards speech temporal structure benefits speech processing. In a single-talker situation, participants were instructed to identify barely understandable sentences from which low-frequency envelope information (<16 Hz) had been removed. When auditory regions were stimulated with a current shaped by exactly this rhythmic information, sentence intelligibility increased, especially when the current preceded the speech stimuli by 375 ms. Using electroencephalography (EEG), the same group then showed that delivering envelope information through vibrotactile stimulation applied to the participants' fingers increased neural speech tracking but failed to improve behavioral outcomes during sentence repetition (Riecke et al., 2019; see their discussion for methodological considerations). Positive evidence was nevertheless provided by Guilleminot and Reichenbach (2022): tactile stimulation delivered at the syllable rate (theta frequency, 4.5 Hz) enhanced both brain-to-speech coupling in bilateral auditory regions and intelligibility of speech in noise.
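As an illustration of the envelope-shaped stimulation logic, the sketch below builds a normalized current waveform from a speech recording: the low-frequency (<16 Hz) envelope, advanced by 375 ms relative to speech onset. This is a conceptual reconstruction under our own assumptions, not the published stimulation protocol; in practice, edge handling and amplitude scaling are managed by the stimulator and safety constraints.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def tacs_waveform(audio, fs, cutoff=16.0, lead_s=0.375, amp_ma=1.0):
    """Envelope-shaped tACS current: low-pass the speech envelope
    (<16 Hz), advance it by 375 ms (the most effective lead reported
    by Riecke et al.), and scale it to the stimulator range."""
    env = np.abs(hilbert(audio))
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    env_lf = filtfilt(b, a, env)
    lead = int(lead_s * fs)
    # advance relative to speech; wrap-around edges would be trimmed
    current = np.roll(env_lf, -lead)
    current -= current.mean()  # zero-mean, as required for tACS
    return amp_ma * current / np.abs(current).max()
```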
Overall, these findings suggest that neural coupling to speech functionally contributes to, and can aid, speech perception. In this view, they open new avenues for improving speech processing in hearing-impaired or dyslexic people, who show altered oscillatory mechanisms (for hearing impairment, see, for example, Decruy et al., 2020; Gillis et al., 2023; Petersen et al., 2017; for dyslexia, Goswami, 2011; Lehongre et al., 2013; Lizarazu et al., 2021b). First studies in this direction revealed that tACS at appropriate frequencies can enhance speech-in-noise perception in hearing-impaired elderly participants (Erkens et al., 2021) as well as phonological processing in dyslexic adults (Marchesotti et al., 2020).
To conclude, the empirical evidence available so far seems to provide arguments that speech comprehension depends on neural tracking, at least to some extent. Brain-to-speech coupling may indeed be necessary, although not sufficient, for intelligibility. On the other hand, as we have seen, neural tracking does not need intelligible speech to occur. Nevertheless, linguistic information can affect the strength of the coupling, suggesting a bidirectional functional relationship between the two.
5.3.3 Motor Regions as Temporal Predictors for Speech Decoding
Low-frequency coupling to speech in auditory regions is also modulated top-down by alpha and beta activity in motor regions. Crucially, the stronger this cortico-cortical coupling, the stronger the auditory neural tracking of the speech input (Assaneo and Poeppel, 2018, for syllable trains at 4.5 Hz; Park et al., 2015, for intact versus reversed speech). Alpha and beta power modulations in fronto-central regions were moreover shown to precede coupling to speech in the auditory cortex, supporting their anticipatory role in lower-level speech processing (Keitel et al., 2017; see also Di Liberto et al., 2018; Hovsepyan et al., 2023). In line with their involvement in predictive coding (Arnal, 2012; Morillon et al., 2014; Schroeder et al., 2010) and temporal processing (Coull, 2015; Schubotz, 2007), motor regions are therefore able to extract the (quasi-)rhythmicity of speech and generate temporal predictions so as to optimize speech sampling.
Motor regions not only modulate auditory cortex activity; they also synchronize their own oscillations to speech fluctuations at the phrasal and syllabic scales. Keitel et al. (2018) reported stronger delta band (0.6–1.3 Hz) left premotor tracking of speech for correctly identified sentences in English. Delta/beta (13–30 Hz) phase/amplitude coupling within this premotor cluster predicted comprehension as well, emphasizing the role of motor regions in predicting the likely occurrence of upcoming words. At the syllabic level, the right motor cortex was shown to track the isochronous rhythm of Chinese monosyllabic words at 4 Hz (Sheng et al., 2019) as well as the fast syllable rate (~9 syllables/s) in French sentences (Hincapié Casas et al., 2021). In the latter case, the motor cortex furthermore increased its connectivity with a left fronto-parieto-temporal network, probably belonging to the speech sensorimotor dorsal stream (Hickok and Poeppel, 2007). Such sensorimotor interactions may reflect the encoding of articulatory information as well as the use of internal models for phonological-articulatory mapping in challenging listening situations (see Chapter 6 for sensorimotor interactions during speech processing). Interestingly, listeners' sensorimotor abilities predict how well motor oscillations align to speech syllabic structure (Assaneo et al., 2019). Coupling of bilateral frontal regions to trains of random syllables at 4.5 Hz was indeed stronger in English participants who better synchronized their own syllable production to the perceived rate. These “high synchronizers” additionally showed microstructural white matter differences, namely stronger left lateralization in a territory of the arcuate fasciculus compared to “low synchronizers.” Such anatomical differences could increase information transfer between the auditory and motor regions underpinning speech perception and production, and consequently improve neural tracking of speech.
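The degree of perception-production synchronization that separates high from low synchronizers is commonly quantified with a phase-locking value between the envelope of the heard syllable train and that of the participant's concurrent production. The sketch below is a minimal version of such a measure; the band around the 4.5 Hz syllable rate and the filter settings are our own illustrative choices, not the published analysis.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def syllable_plv(env_perceived, env_produced, fs, band=(3.5, 5.5)):
    """Phase-locking value between perceived and produced speech
    envelopes (same length and rate) around the syllable rate:
    1 = perfect phase alignment, 0 = no consistent relation."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)],
                  btype="band")
    ph1 = np.angle(hilbert(filtfilt(b, a, env_perceived)))
    ph2 = np.angle(hilbert(filtfilt(b, a, env_produced)))
    return np.abs(np.mean(np.exp(1j * (ph1 - ph2))))
```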
A growing body of evidence therefore converges in favor of a contribution of motor regions to the parsing of speech into syllable- and word/phrase-sized packets, either through top-down predictions or through direct synchronization to the signal's (quasi-)rhythmic fluctuations. More work is nevertheless needed to understand the role of the different motor rhythms, in the delta, theta, and beta bands in particular, in deciphering the temporal structure of speech.
5.4 Concluding Remarks and Future Directions
The reviewed E/MEG findings generally corroborate oscillatory models of speech perception by revealing that low-frequency (theta band) oscillations in the auditory cortex synchronize to syllabic rhythmic information and modulate gamma amplitude, although only a few studies have actually examined this cross-frequency coupling. Theta activity can also adapt to speech rate variations (mostly investigated through artificial manipulation), at least to a certain extent. As we underlined, future work should, however, consider more naturalistic settings, using connected speech naturally produced at different rates and analyzing coupling to global but also local rate changes, to better grasp the underlying oscillatory dynamics and thus advance current neurolinguistic models. It is also striking that, despite some studies assessing neural tracking in bilinguals (Di Liberto et al., 2021; Lizarazu et al., 2021a; Lu et al., 2023; Peña and Melloni, 2012; Pérez et al., 2015; Reetzke et al., 2021), the effects of cross-linguistic differences, regardless of comprehension, have so far been largely ignored in the field. The question of whether neural tracking is sensitive to languages' rhythmic properties per se, and not only to the listener's native language, indeed remains open. Yet investigating this issue could be highly informative for delineating the correspondence between neural oscillations and the timescales in speech, for instance by comparing brain-to-speech coupling for languages belonging to different rhythmic classes. We are only aware of one recent study that tackled this issue. Using EEG, Özer et al. (2023) examined theta band phase-locking to the speech envelope in English and Spanish adults listening to resynthesized, incomprehensible sentences from English (a stress-timed language), Spanish (syllable-timed), and Japanese (mora-timed). Neural tracking was weakest for English, intermediate for Spanish, and strongest for Japanese, irrespective of the participants' mother tongue. Hence, brain-to-speech coupling reflected the rhythmic regularities of the heard languages, with syllabic complexity decreasing from English to Japanese, rather than processing of the native language.
When it comes to the functional relationship between brain coupling and speech intelligibility, the picture gets a bit more complex, with (at least partly) contradictory findings. We again emphasize that, to the best of our knowledge, no study has reported that speech could be intelligible without auditory oscillatory coupling to its envelope. Neural tracking of the signal could therefore be a necessary though insufficient step in speech decoding. Bottom-up acoustic coupling in the auditory cortex is indeed modulated by speech intelligibility and occurs in parallel to, or under the influence of, other top-down processes, some of them arising from the motor cortex. By temporally structuring oscillatory activity in auditory regions and by directly synchronizing their own rhythms to speech modulations, motor regions optimize speech decoding. Although we did not address this issue in the present chapter, higher-level predictions generated from lexico-semantic and syntactic knowledge obviously also enable inferences about the incoming speech signal (e.g., Ding et al., 2016; Kaufeld et al., 2020; Lo et al., 2022; Meyer et al., 2019). Examining the dynamic interplay between predictions of different natures, for neural tracking of speech at different frequencies in distributed brain regions, appears crucial to capture how speech units are hierarchically organized and combined to ultimately comprehend spoken language.
Summary
We reviewed evidence that theta range neural oscillations in auditory regions track syllabic rhythm and adapt to speech rate variations. This coupling contributes to speech intelligibility while at the same time being modulated by linguistic information. By extracting temporal regularities in the signal, motor regions also contribute to these temporal dynamics.
Implications
We draw attention to topics for future research to improve oscillatory models of speech perception. This includes investigating neural tracking of naturally produced connected speech at different rates, also in light of cross-linguistic differences, and examining the interplay between motor and auditory rhythms as well as their interactions with higher-level linguistic predictive processes.
Gains
This non-exhaustive review provides a better understanding of the functional relationship between neural oscillations and speech (quasi-)rhythmic information. Considering important issues such as dynamic coupling to speech rate changes, the influence of linguistic information, and the role of motor regions may prove valuable to refine oscillatory models of speech perception.