18.1 Introduction
Language comprehension is guided by external acoustic cues but may also have an internal, implicit realization, influencing processing from the inside out – that is, endogenously. In particular, there are a number of endogenous constraints that influence the timing and pace of language processing. Some of these constraints may reflect cognitive and neural processing limitations, such as properties of working memory or even the electrophysiology of the brain as such. These context-invariant endogenous constraints interact with the current context of linguistic representations that are built during incremental language processing.
In this chapter, we first review behavioural evidence from psycholinguistics, describing the complex interaction of syntactic, prosodic, and information structures during the integration of words into larger multi-word units across substantial temporal intervals. In the second part, we summarize evidence from event-related brain potentials (ERPs) supporting the idea that temporal processing windows on the order of several seconds shape language production and comprehension endogenously. Lastly, we suggest that the repetitive pace of these integration windows may induce a degree of rhythmicity in language behaviour. We discuss recent evidence that these temporal electrophysiological constraints are driven by underlying endogenous oscillatory activity that acts as a pacemaker for language processing.
18.2 Timing Constraints on Linguistic Processing in Behaviour
The timing of linguistic processing is driven by the incremental computation of linguistic representations, including prosodic, syntactic, and information structures. As language comprehension unfolds over time, the online computation of these structures (e.g., “where” to place a boundary within a syntactic tree) may be influenced by contextual temporal constraints (e.g., “when” to expect a syntactic boundary given the preceding acoustic and abstract context). Here, we first focus on the temporal relationship between (implicit) prosody and syntax, and how they relate to the context-invariant constraints of working memory capacity. We then describe how syntax and prosody further interact with information structure to place contextual constraints on the timing of language processing.
There is a close relationship between syntactic and prosodic structure (for reviews, see Dahan, 2015; Dahan and Ferreira, 2019; for further discussion of the prosody–syntax interface and its interaction with informativity, see Chapter 19). While not strictly one-to-one, syntactic phrase boundaries often co-occur with prosodic boundaries in spoken language (Clifton et al., 2002; Ferreira, 1993; Selkirk, 2011). Thus, prosody is an important cue to syntax: Listeners rely on the perception of prosodic boundaries to infer syntactic structure, particularly in cases of syntactic ambiguity. For example, in the sentence The friend of the moviestar who was sitting on the balcony was under investigation, the relative clause (who was sitting on the balcony) can be attached either “high” in the syntactic tree (to the first noun, the friend) or “low” (to the second noun, the moviestar; see Figure 18.1). During incremental processing of the sentence, comprehenders must choose where the relative clause attaches, thus arriving at different semantic interpretations of the sentence (whether it was the friend or the moviestar who was on the balcony). Studies suggest that this syntactic choice depends on language-specific attachment preferences (e.g., a preference for low attachment in English, as opposed to high attachment in Spanish; Carreiras and Clifton, 1999) as well as on constraints of working memory capacity, discussed further below (Swets et al., 2007; Traxler, 2009).
Importantly, prosodic boundaries following either the first or the second noun have been shown to influence syntactic attachment choices, generally leading to more high attachment when the boundary follows the second noun (Carlson et al., 2001; Clifton et al., 2002; Frazier et al., 2006).¹ However, syntactic choices are not simply driven by the presence or absence of a prosodic boundary or by its acoustic realization (e.g., the absolute duration of a pause or the pitch height of a boundary tone) but rather by the global prosodic structure of the entire sentence (e.g., the duration of a prosodic pause or the height of a boundary tone relative to previous pauses and intonation patterns; Clifton et al., 2002). This suggests that during spoken language comprehension, listeners incrementally build a representation of the prosodic structure,² which then influences the computation of a syntactic structure (parsing) by constraining the temporal location of syntactic boundaries. Thus, it is the relationship between the acoustic characteristics of prosodic boundaries over a long time window (the entire sentence), rather than the individual duration and pitch of each boundary, that is thought to play a critical role in syntactic processing.
Figure 18.1 Relative clause attachment ambiguities. Illustration of the attachment ambiguity highlighting the two possible attachment sites for the relative clause. The relative clause may be attached high in the syntactic tree (bold), combining it with the friend, when there is an (implicit) prosodic boundary after the moviestar. Alternatively, it may be attached low in the syntactic tree (bold), such that it is the moviestar who was sitting on the balcony, when the boundary only occurs later in the sentence.

The influence of prosodic constraints on syntactic parsing is not limited to spoken language. Even in silent reading, comprehenders appear to make syntactic decisions based on an internal, implicit realization of prosodic structure (Fodor, 2002). For example, Swets et al. (2007) appeal to the concept of implicit prosody to explain their findings on the relationship between syntactic attachment preferences and working memory capacity. It had previously been suggested that in cases of syntactic attachment ambiguity (as in the example above), comprehenders with low working memory span prefer low attachment, since the second noun is more recent and accessible than the first noun, which would need to be retrieved from working memory (the recency principle, or late closure; Frazier, 1979; Traxler, 2007). However, Swets et al. found the opposite pattern: Readers with low working memory span were more likely to attach high than readers with high spans. In a follow-up study, they tested whether this effect is mediated by implicit prosodic chunking. Readers with high spans may tend to chunk a larger amount of information together, combining the second noun with the relative clause in a single prosodic phrase; readers with low spans, on the other hand, may chunk less information together, inserting an implicit prosodic boundary between the second noun and the start of the relative clause. These two prosodic patterns would be consistent with a preference to attach low or high, respectively (see Figure 18.1B, C). Swets et al. (2007) provide evidence for this interpretation by presenting their sentences in chunks, forcing all readers to insert a boundary following the second noun, which resulted in high attachment regardless of readers’ working memory span.
Subsequent findings support this idea (Hemforth et al., 2015; Jun and Bishop, 2015; Traxler, 2009), and a large literature suggests that silent reading involves the computation of an implicit prosodic structure (for a review, see Breen, 2014). Thus, even in the absence of overt articulation, comprehenders are thought to compute an internal prosodic structure constrained by context-invariant individual differences in working memory capacity. In particular, working memory capacity may determine the amount of information that is chunked together and therefore the duration of prosodic phrases.
While Swets et al.’s findings contradict initial assumptions about the role of working memory in syntactic ambiguity resolution, they support an indirect effect of individual differences in working memory capacity, mediated by prosodic phrasing. This effect is consistent with the early “sausage-machine” model of parsing (Frazier and Fodor, 1978), which argues that syntactic parsing proceeds in two stages: an initial stage that groups words into chunks of limited size (the Preliminary Phrase Packager), and a second stage that builds a full syntactic analysis based on these chunks (the Sentence Structure Supervisor). In this model, working memory limitations were posited to restrict the initial chunks to approximately six words; it follows that individual differences in working memory capacity would lead to differences in the size of these initial chunks.³ In their discussion, Swets et al. (2007) speculate that the size of the chunks produced in this first stage of parsing (constrained by working memory) may be responsible for the introduction of implicit prosodic boundaries and, subsequently, for syntactic attachment choices. We elaborate further on the context-invariant constraint of working memory capacity on the size and duration of syntactic phrases in the next sections.
One additional contextual constraint placed by prosody and syntax relates to their interaction with another important aspect of language: information structure. Information structure is a term from linguistic pragmatics that describes how sentences organize given information (i.e., what the listener already knows) and new information (i.e., what the listener does not yet know) within the context of a broader discourse or conversation. In many languages, sentences tend to present given information first and new information last. This given-before-new strategy reflects the greater accessibility of given information during production and is thought to facilitate comprehension by helping the listener retrieve concepts from long-term memory before appending new information (Bock and Irwin, 1980; Christianson and Ferreira, 2005; Ferreira and Yoshita, 2003; Haviland and Clark, 1974). However, speakers can deviate from this default information structure, cueing listeners towards focused, important information through prosodic prominence (e.g., JOHN ate the sandwich) or the use of specific syntactic structures (e.g., It was John who ate the sandwich; Carlson et al., 2009; but see Vallduví, 2016, for debates on terminology).
There are cross-linguistic differences in whether speakers mark important information as focused primarily prosodically (e.g., English and Dutch; Breen et al., 2010; Nooteboom and Kruyt, 1987) or syntactically (e.g., Spanish and Italian; Samek-Lodovici, 2005; Zubizarreta, 1998), which also leads listeners to weigh each of these focus cues differently when inferring information structure across languages (Calhoun et al., 2021; Kember et al., 2019; Yan and Calhoun, 2020).
Inferring the upcoming information structure of a sentence as it unfolds places temporal constraints on all levels of language processing. Listeners are thought to predict the timing of important information so that attention and processing resources can be pre-allocated to it (Beier and Ferreira, 2022; Cutler and Fodor, 1979; Kristensen et al., 2013). In addition, information structure has been shown to interact with other key aspects of language comprehension, including semantic (Wang et al., 2009, 2011) and syntactic processing (Wang et al., 2012). Focused words are read for longer (Lowder and Gordon, 2015) and are remembered better (Birch and Garnsey, 1995), while comprehenders may be more likely to engage in “good-enough” processing for already given information (Ferreira and Lowder, 2016). However, more research is needed to establish how listeners use syntactic and prosodic focus cues to incrementally infer upcoming information structure and how this interacts with language comprehension over time.
In summary, as language unfolds over time, processing is constrained by the incremental computation of syntactic, prosodic, and information structures and the temporal relationship between them. During spoken language comprehension, acoustic features are weighed to infer a sentence’s prosodic structure, which then guides syntactic choices. This occurs even in silent reading, where context-invariant individual differences in working memory capacity constrain the duration of (implicit) prosodic and syntactic phrases. Prosody and syntax further serve as cues to upcoming information structure, guiding the comprehender’s attention towards the timing of important information and interacting with all levels of linguistic processing.
18.3 Time-Limited Integration Windows: Evidence from ERPs
In the previous section, we discussed early psycholinguistic work suggesting that phrases or constituents are constrained to a length of around six words (Frazier and Fodor, 1978). However, this constraint may reflect not a limit on the number of words but rather a temporal constraint that keeps the duration of linguistic structures within our processing capacity. For instance, studies of prosodic phrasing provide evidence that the duration of phrases plays an important role in determining where boundaries are placed (for a detailed discussion of prosodic phrasing as a cue for cognitive dynamics, see Chapter 15). In particular, longer phrase durations lead to different overt prosodic phrasing and increase the likelihood of implicit prosodic boundaries (Hirose, 2003; Hwang and Schafer, 2009; Webman-Shafran and Fodor, 2016). Hence, temporal windows may shape the formal aspects of language production and perception.
The proposed six-word limit, assuming an average speech rate of approximately 150 words per minute (Tauroza and Allison, 1990), results in multi-word units of about 2.4 seconds. Likewise, speech production data suggest that spoken language tends to be delivered in chunks of around 2.6 seconds (Vollrath et al., 1992). Notably, this slow time constant corresponds to the maximum duration for which word forms are held in working memory. Baddeley et al. (1975) argued that phonological working memory traces decay after 2–3 seconds. Sachs (1967, 1974) found that after more than 3 seconds of intervening speech, listeners had great difficulty identifying form-related (grammatical) changes but not semantic changes. She proposed that the form and syntactic details of sentences are clear and accessible in working memory for only a few seconds. During that short time window, however, the conveyed message is recoded into a more durable semantic representation.
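The duration estimate above is simple arithmetic: at 150 words per minute each word takes 60/150 = 0.4 seconds, so six words last about 2.4 seconds. A minimal Python sketch, assuming a constant speech rate (a simplification, since real speech rate varies within and across speakers), makes the conversion explicit in both directions. The function names are illustrative only.

```python
def chunk_duration_s(n_words: int, words_per_minute: float) -> float:
    """Expected duration (s) of a chunk of n_words at a constant speech rate."""
    seconds_per_word = 60.0 / words_per_minute
    return n_words * seconds_per_word


def words_per_window(window_s: float, words_per_minute: float) -> float:
    """Number of words that fit into a time window of window_s seconds."""
    return window_s * words_per_minute / 60.0


# Six-word chunks at ~150 words per minute (values taken from the text)
print(round(chunk_duration_s(6, 150), 2))    # 2.4
# Words that fit into a 2.6-second production chunk at the same rate
print(round(words_per_window(2.6, 150), 2))  # 6.5
```

Run in reverse, the same rate implies that the 2.6-second chunks reported for speech production hold roughly six and a half words, close to the six-word estimate.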
Roll et al. (2012) hypothesized that intonational phrases tend to be uttered within time windows of 2–3 seconds so that they can be parsed while the word forms are still active in working memory. A more specific time constant was obtained by having speakers read sentences that could be phrased in different ways. Speakers increased their speech rate for longer phrases, seemingly trying to fit them within a time window of 2.7 seconds. This duration was taken as the ideal duration of intonational phrases. To determine whether prosodic phrasing automatically proceeds in windows of 2.7 seconds during perception while controlling for acoustic factors, silent reading has been investigated using ERPs. In auditory processing, intonational phrase boundaries elicit a positive-going shift in listeners’ ERPs called the closure positive shift (CPS; Steinhauer et al., 1999). A CPS has even been found in silent reading at commas, where readers close phrase boundaries (Steinhauer and Friederici, 2001). Roll et al. (2012) presented participants with sentences word by word to see whether they would spontaneously integrate or chunk them into 2.7-second phrases. Each sentence allowed three possible phrasings, with phrases corresponding to one, two, or three clauses. The sentence material was presented at different rates so that one, two, or all three clauses of each sentence would fall within a time window of 2.7 seconds. Under these conditions, a CPS was observed whenever a possible phrase boundary coincided with a duration of 2.7 seconds from a clause onset (see Figure 18.2). Because of the varying presentation rate, the phrase endings that elicited a CPS thus contained one, two, or three clauses.
An additional control condition with an extra word in each clause allowed the researchers to investigate what happened when clauses exceeded the 2.7-second time window. For these 3.4-second clauses, participants with high and low working memory spans responded differently. Whereas the high-span group showed no difference between longer and shorter clauses, the low-span group exhibited a left parietal negativity for the longer compared to the shorter clauses, suggesting that stimuli exceeding the common time window were more taxing on working memory.
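The logic of the rate manipulation can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the original stimulus code: it assumes three clauses of four words each and a fixed per-word stimulus onset asynchrony (SOA), and it computes the SOA at which the first one, two, or three clauses fill a 2.7-second window. The helper names and clause lengths are invented for illustration.

```python
from typing import List, Optional

WINDOW_S = 2.7  # integration window (s) reported by Roll et al. (2012)


def soa_for_k_clauses(clause_lengths: List[int], k: int,
                      window_s: float = WINDOW_S) -> float:
    """Per-word SOA (s) at which the first k clauses last exactly window_s."""
    return window_s / sum(clause_lengths[:k])


def clause_ending_at_window(clause_lengths: List[int], soa_s: float,
                            window_s: float = WINDOW_S,
                            tol: float = 1e-6) -> Optional[int]:
    """1-based index of the clause whose offset falls window_s after the
    first clause onset, or None if no clause boundary coincides with it."""
    elapsed = 0.0
    for index, n_words in enumerate(clause_lengths, start=1):
        elapsed += n_words * soa_s  # time at the end of this clause
        if abs(elapsed - window_s) < tol:
            return index
    return None


clauses = [4, 4, 4]  # hypothetical sentence: three clauses of four words each
for k in (1, 2, 3):
    soa = soa_for_k_clauses(clauses, k)
    boundary = clause_ending_at_window(clauses, soa)
    print(f"rate {k}: SOA = {soa:.4f} s, window ends after clause {boundary}")

# Control: one extra word per clause at the one-clause rate overshoots the window
print(clause_ending_at_window([5, 5, 5], soa_for_k_clauses(clauses, 1)))  # None
```

Adding one word to each clause at the one-clause rate pushes the first clause boundary beyond the window, which is the situation probed by the control condition.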
Figure 18.2 Presentation rate manipulations and their neural responses. (A) Example sentence consisting of three clauses. (B) Sentences were presented word by word at three different presentation rates; depending on the rate, one, two, or all three clauses of the sentence could fall into a temporal window of 2.7 seconds (adapted from Roll et al., 2012). (C) ERP (i.e., CPS) at those clause boundaries that coincide with a duration of 2.7 seconds; the regularity of the evoked response may reflect the period of low-frequency neural oscillatory activity.

Implicit prosodic phrasing based on temporal processing windows of 2.7 seconds has also been found to influence syntactic parsing, in line with the behavioural evidence of working memory constraints on syntactic processing described above (e.g., Swets et al., 2007). When the time-induced prosodic phrasing pattern mismatches the expected syntactic parse of a sentence, an ERP component related to syntactic reanalysis (the so-called P600; Hagoort et al., 1993) has been found, both in silent reading (Schremm et al., 2015) and in speech processing (Henke and Meyer, 2021). Further evidence for a limited time window for recoding formal language features into semantics comes from an experiment on grammatical agreement (Roll et al., 2013). Two (dis)agreeing words, such as We … run(s), were separated by four different time intervals, created by varying the presentation rate. The temporal distances were below, at, or above 2.7 seconds. At the two faster rates, which resulted in time intervals below 2.7 seconds, disagreeing elements produced the typical left anterior negativity (LAN) found for grammatical disagreement. This was interpreted as the processing of grammatical matching based on the form of the words. At a distance of over 2.7 seconds, the negativity had a right-lateralized distribution, hinting that some other cognitive process, possibly semantically based, was involved in matching the two elements. At 2.7 seconds, there was no effect of agreement, but participants’ working memory span correlated with the amplitude of the negativity at the electrodes displaying the LAN.
Supporting the idea that agreement processing at distances over 2.7 seconds instead involves semantic representations, a sharp divide in response times has been found for acceptability judgments of grammatical agreement at distances above versus below 2.7 seconds. For semantic matching, in contrast, response times increased gradually with presentation rate (Schremm et al., 2016).
Overall, time-driven processing constraints of approximately 2–3 seconds may shape the integration of words into larger multi-word groups. In particular, they may guide the insertion of implicit prosodic boundaries during reading, which allows the parsing of words into a syntactic structure before word traces fade from working memory. Semantic processing of language seems to be differently affected by this temporal constraint.
18.4 Neural Oscillations as Time Constraints?
Behavioural and ERP research both indicate a temporal constraint on the integration of words during language processing. Recently, it has been hypothesized that this temporal limitation of processing windows, revealed by the behavioural and ERP work above, may stem from an underlying periodic electrophysiological substrate (Henke and Meyer, 2021; Meyer et al., 2016), namely neural oscillations (cf. Chapter 3). Indeed, this work suggests that endogenous oscillatory activity may time-limit the duration of multi-word chunks, and oscillatory activity has also been found to constrain acoustically driven chunking (Rimmele et al., 2021). Cycles of low-frequency delta-band oscillations (<4 Hz) have periods of up to several seconds and are thus long enough to accommodate the integration of words into larger multi-word units. In this way, delta-band oscillations may function as an endogenous pacemaker for language processing (Meyer et al., 2020a, 2020b).
Delta-band oscillations have been associated with the tracking of prosody (Bourguignon et al., 2013; Gross et al., 2013) but have also been found to align with linguistic structure that is not apparent in the acoustic signal (Ding et al., 2016; Meyer et al., 2016). Using isochronous stimuli, neural activity within the delta frequency band was found to track the rates of phrases and sentences, even though these could not be inferred from the acoustic input (Ding et al., 2016). Notably, tracking vanished when the phrasal rate fell outside the delta-band range (i.e., was faster; Lo, 2021). Thus, delta-band oscillations may have an endogenous tempo that plays an active role in language processing.
In natural speech, it has proven difficult to investigate an endogenous function of delta-band oscillations because speech contains physical rhythms; even subtle acoustic cues may mask endogenous activity. Therefore, it is necessary to consider a different sensory domain that avoids this confound. In particular, text input does not provide any temporal structure at the chunk level; although punctuation may guide the structuring of the text input, punctuation marks occur in space, not at specific time intervals. Thus, any rhythmic sampling during reading would be endogenously imposed. Accordingly, Henke et al. (2023) examined simultaneous recordings of eye movements and electroencephalography (EEG) during naturalistic reading and observed that fixation durations fluctuate rhythmically at ~1 Hz in coherence with delta-band activity over the occipital cortex.⁴ This suggests that neural activity may impose its preferred processing pace onto reading, independent of physical rhythms in the stimulus. Complementary work on self-paced reading has shown that the behavioural periodicity of reading times (here, ~2 Hz) is temporally aligned to chunk boundaries (as defined by a computational formalization based on dependency annotations and part-of-speech tags) as well as to sentence boundaries (Lo et al., 2023). This alignment suggests that the observed periodicity is indeed related to the formation of multi-word chunks during language processing and may be behaviourally relevant for minimizing memory demands. These findings on reading may also relate to psycholinguistic work on implicit prosodic phrasing (discussed in detail above): Implicit prosody as a chunking strategy may in fact reflect underlying neural processing windows.
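Detecting a rhythm such as the ~1 Hz fluctuation in fixation durations amounts to finding a peak in a power spectrum. The Python sketch below is only a generic illustration of that idea under strong simplifying assumptions. It simulates an evenly sampled series with a built-in 1 Hz modulation plus noise and uses NumPy's FFT to locate the dominant frequency. Real fixation durations are irregularly spaced events, and this is not the analysis pipeline of Henke et al. (2023). The sampling rate, amplitudes, and function name are assumptions.

```python
import numpy as np


def peak_frequency(signal: np.ndarray, fs: float, fmin: float = 0.2) -> float:
    """Frequency (Hz) with the most power above fmin in an evenly sampled signal."""
    centred = signal - signal.mean()                   # remove the mean (DC) component
    power = np.abs(np.fft.rfft(centred)) ** 2          # one-sided power spectrum
    freqs = np.fft.rfftfreq(centred.size, d=1.0 / fs)  # frequency of each FFT bin
    keep = freqs >= fmin                               # ignore very slow drifts
    return float(freqs[keep][np.argmax(power[keep])])


fs = 20.0                     # assumed sampling rate (samples per second)
n = int(120 * fs)             # two minutes of simulated data
t = np.arange(n) / fs
rng = np.random.default_rng(0)
# Hypothetical series (arbitrary units): mean level + 1 Hz modulation + noise
series = 0.25 + 0.03 * np.sin(2 * np.pi * 1.0 * t) + 0.02 * rng.standard_normal(n)
print(peak_frequency(series, fs))  # close to 1.0
```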
In contrast to the previously proposed integration windows of 2.7 seconds, however, the frequencies of 1–2 Hz observed in reading suggest considerably shorter time windows of 0.5–1 second. These shorter and longer time windows could relate to the multi-scale nature of prosody, which comprises both intermediate and full prosodic phrases (Stehwien and Meyer, 2022). In particular, multiple short intermediate phrases may be integrated into longer prosodic phrases. Alternatively, the size of the integration windows may differ across modalities; for instance, the results reported here suggest shorter windows for visual than for auditory language processing. More work is needed to unravel the different time windows and their underlying neural implementation.
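The correspondence between oscillation frequencies and window durations follows from the reciprocal relation between period and frequency, T = 1/f. A short Python check (helper names are illustrative) shows that 1–2 Hz corresponds to cycles of 1–0.5 seconds, whereas a 2.7-second window corresponds to a cycle of roughly 0.37 Hz, which still falls below the 4 Hz upper bound of the delta band.

```python
def period_s(freq_hz: float) -> float:
    """Duration (s) of one cycle of an oscillation at freq_hz: T = 1/f."""
    return 1.0 / freq_hz


def frequency_hz(period: float) -> float:
    """Frequency (Hz) of an oscillation whose cycle lasts `period` seconds."""
    return 1.0 / period


print(period_s(2.0), period_s(1.0))  # 0.5 1.0  (windows suggested by reading data)
print(round(frequency_hz(2.7), 2))   # 0.37     (a 2.7-second integration window)
print(period_s(4.0))                 # 0.25     (upper edge of the delta band)
```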
Returning to the longer integration windows and supporting the idea of an endogenous electrophysiological constraint on processing, the phase of delta-band oscillations was found to predict the termination of multi-word chunks (Meyer et al., 2016) and, more specifically, to enforce chunking after 2.7 seconds (Henke and Meyer, 2021). These studies employed ambiguous sentences similar to those used in the behavioural psycholinguistic studies described above. The results suggest that chunking, and as a consequence attachment, is affected by the phase of delta-band oscillations. Reconsidering the behavioural results from this perspective, it may be that the period of these oscillations subserves (implicit) prosodic phrasing (Glushko et al., 2022) and underlies the different chunking strategies of the high and low working memory span groups described above (i.e., the high-span group integrates more information together, while the low-span group terminates a chunk earlier; Swets et al., 2007). Note that the duration of the chunking window for delta-band oscillations also aligns with the findings on the CPS described above (e.g., Roll et al., 2012). On this basis, it was suggested that the CPS may consist of underlying neural oscillations within the delta band (Meyer et al., 2016); the increased amplitude observed in the ERP may arise from phase-locking of oscillatory activity across trials (Sauseng et al., 2007). To date, however, this remains to be shown (see Inbar et al., 2023, for the first evidence consistent with a relationship between the CPS, prosodic boundaries, and delta-band tracking during naturalistic speech listening).
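The phase-locking account of the ERP can be illustrated with a toy simulation in Python. Under the assumptions stated in the comments (a sinusoidal oscillation at 1/2.7 Hz, 200 trials, Gaussian noise, all values chosen for illustration), every single trial contains an oscillation of the same amplitude, and only its phase relative to the event varies. Averaging across trials, as in ERP analysis, yields a large deflection when phases are aligned and little more than noise when they are random. Inter-trial phase coherence (the length of the mean unit phase vector) quantifies this difference. This is a sketch of the general idea, not a model of the CPS.

```python
import numpy as np


def simulate_trials(phases, freq_hz=1 / 2.7, fs=100.0, dur_s=2.7,
                    noise_sd=0.5, seed=0):
    """Single-trial signals: a unit-amplitude sinusoid with a trial-specific
    phase (radians) plus Gaussian noise. Returns an array (n_trials, n_samples)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur_s * fs)) / fs
    oscillation = np.sin(2 * np.pi * freq_hz * t[None, :] + phases[:, None])
    return oscillation + noise_sd * rng.standard_normal(oscillation.shape)


def erp_range(trials):
    """Peak-to-peak amplitude of the trial-averaged signal (the 'ERP')."""
    erp = trials.mean(axis=0)
    return float(erp.max() - erp.min())


def itpc(phases):
    """Inter-trial phase coherence: 1 = identical phases, near 0 = random phases."""
    return float(np.abs(np.mean(np.exp(1j * phases))))


rng = np.random.default_rng(1)
n_trials = 200
aligned = np.zeros(n_trials)                       # same phase on every trial
scattered = rng.uniform(0.0, 2 * np.pi, n_trials)  # random phase on every trial

# Same single-trial amplitude, very different averages
print(erp_range(simulate_trials(aligned)), erp_range(simulate_trials(scattered)))
print(itpc(aligned), itpc(scattered))
```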
As mentioned previously, delta-band oscillations may also be exogenously guided by prosody (see Chapter 15 for further discussion of exogenous versus endogenous influences within the delta band), aligning their phase to slow modulations in the acoustic speech signal (e.g., Bourguignon et al., 2013; Ding and Simon, 2014; Gross et al., 2013). Strikingly, oscillations were also found to inherit the exogenous stimulus frequency, so that the aligned oscillatory rhythm persists even after stimulus offset, a phenomenon called entrainment (Kösem et al., 2018; Luo and Poeppel, 2007; Obleser and Kayser, 2019). Entrainment thus results from an exogenous (acoustic) cue, which then affects subsequent processing from the inside out. For instance, acoustically presenting participants with repetitive prosodic contours whose duration fell within the delta-band range influenced the perception of subsequently presented sentences that either matched or did not match the duration of the prosodic contours (Lamekina and Meyer, 2023). When the durations matched, processing of the visually presented sentences was facilitated, as indicated by shorter response times. This suggests that the prosodic contours induce a temporal prediction of the upcoming sentence duration, possibly driven by entrained delta-band oscillations. Notably, the facilitatory effect was numerically strongest for prosodic contours of 2.5 seconds compared with shorter (1.5 seconds) or longer contours (3.5 seconds). In line with the role of the delta band as a pacemaker for processing, this pattern suggests that entrainment may be strongest when the exogenous stimulus matches the preferred endogenous pace.
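Why entrainment might be strongest near the preferred pace can be illustrated with a textbook driven phase-oscillator model (the Adler equation); this is a swapped-in illustration, not the model used by Lamekina and Meyer (2023). The phase difference ψ between an internal oscillator and a periodic stimulus evolves as dψ/dt = Δω - K sin ψ, where Δω is the difference between the stimulus and preferred angular frequencies and K is the coupling strength, so stable locking is possible only when |Δω| ≤ K. The Python sketch assumes a preferred period of 2.5 seconds and K = 0.5 rad/s, both hypothetical, and reports a phase-locking value (1 = perfectly locked) for stimulus periods of 1.5, 2.5, and 3.5 seconds.

```python
import math


def locking_strength(stim_period_s, pref_period_s=2.5, coupling=0.5,
                     dt=0.01, t_total=600.0, t_skip=200.0):
    """Phase-locking value (0..1) between a driven phase oscillator and a
    periodic stimulus, computed after discarding the initial transient.
    pref_period_s and coupling are illustrative assumptions."""
    # Angular-frequency difference between stimulus and preferred rhythm (rad/s)
    detuning = 2 * math.pi / stim_period_s - 2 * math.pi / pref_period_s
    psi = 1.0                      # arbitrary initial phase difference (rad)
    sum_cos = sum_sin = 0.0
    n_samples = 0
    for step in range(int(t_total / dt)):
        # Forward-Euler step of d(psi)/dt = detuning - coupling * sin(psi)
        psi += dt * (detuning - coupling * math.sin(psi))
        if step * dt >= t_skip:    # average only after the transient
            sum_cos += math.cos(psi)
            sum_sin += math.sin(psi)
            n_samples += 1
    # Length of the mean unit vector of the phase difference
    return math.hypot(sum_cos, sum_sin) / n_samples


for period in (1.5, 2.5, 3.5):
    print(period, round(locking_strength(period), 2))
```

With these parameters, only the matched 2.5-second stimulus locks; the mismatched periods produce phase slips and lower locking values, which mirrors only qualitatively the numerically largest facilitation for 2.5-second contours.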
While we propose that the period of oscillatory cycles at delta-band frequency imposes a temporal constraint on chunk-level integration, the course of each individual cycle may yet pose another constraint on processing: Excitability of the underlying cells changes along the course of the cycle and hence processing abilities are not uniform. On the neuronal level, there are excitable phases, where processing is facilitated, and unexcitable phases that hinder processing (Lakatos et al., 2005). A promising avenue for future research is to assess whether these cycles of excitability are aligned to the expected informativeness of the incoming language input, following the sentence’s information structure. For example, prosodic and syntactic cues to upcoming information structure may phase-reset ongoing delta-band oscillations so that the more excitable phase of the oscillations will coincide with the expected timing of more important, new information, while the less excitable phase will occur during less important, given information. Further research is needed to address this and similar questions regarding the relationship between oscillatory cycles of excitability and higher-order language processing.
In conclusion, delta-band oscillations may underlie the previously proposed time constraint and serve as an endogenous pacemaker for language processing. In particular, their preferred period may constrain the formation of multi-word chunks, and alignment to an external stimulus may be strongest when the stimulus duration matches their preferred endogenous pace. Additionally, excitability gradients along the cycle may pose yet another timing constraint on the processing of the information structure of the input.
18.5 Discussion
Converging findings from behaviour, ERPs, and oscillatory research point towards temporal constraints on linguistic processing. These constraints can either be context-dependent, resulting from the available linguistic and acoustic information, or context-invariant, representing endogenous processing limitations.
The psycholinguistic literature indicates that (implicit) prosody, syntax, and information structure interact in guiding the formation of multi-word chunks. During speech comprehension, acoustic cues to the prosodic structure guide syntactic parsing through the alignment of prosodic and syntactic boundaries. Even in silent reading, comprehenders use implicit prosodic structure as a chunking strategy to support syntactic parsing; this is in line with ERP evidence on the CPS during silent reading. Prosody and syntax further constrain linguistic processing by acting as cues to the information structure of a sentence, guiding attention and processing resources towards the moments at which important information is expected.
Additionally, studies on (implicit) prosodic phrasing provide evidence for a specific temporal limit on language production and comprehension. In particular, a context-invariant temporal constraint of approximately 2–3 seconds shapes the integration of words into larger multi-word chunks. These temporal integration windows may reflect cognitive and neural processing limitations. At present, no evidence can disentangle the nature of these limitations, that is, whether they result from the decay function of working memory or from low-frequency delta-band oscillations. However, these accounts may not be mutually exclusive: oscillatory cycles could be the electrophysiological unit corresponding to the working memory buffer that binds multiple to-be-integrated words during language comprehension. Indeed, temporal processing constraints have also been related to inter-individual differences in working memory capacity (Roll et al., 2012, 2013). Likewise, from an oscillatory perspective, a fixed cycle duration would be ecologically implausible for language processing. Hence, oscillatory cycles are likely flexible within a certain frequency range and may vary across individuals. Furthermore, it remains open whether this temporal constraint applies in the same way to different levels of processing. While temporal constraints seem to directly affect prosodic and syntactic processing, effects at higher levels of processing, such as semantics or information structure, may only be induced indirectly by lower-level constraints on chunking.
Alternatively, what we have discussed here as a temporal constraint may instead reflect an information constraint. Cross-linguistic research has found similar information rates (information conveyed per second) across languages, with languages that pack more information into each syllable being spoken at slower syllable rates (Coupé et al., 2019; Pellegrino et al., 2011). This suggests a close relationship between speech rate and information density. This information rate may be optimally tuned to our processing capacity. In this way, there may be a limit to the amount of information that can be processed within an integration window, which in turn may determine chunking behaviour. Limitations on information processing would also be consistent with the idea that speakers prefer utterances that convey information uniformly across the speech signal (Entropy Rate Constancy/Uniform Information Density Hypothesis; Genzel and Charniak, 2002; Jaeger, 2010; Xu and Reitter, 2018).
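The trade-off between speech rate and information density can be stated compactly. As a minimal sketch, assuming that information rate (IR, bits per second) is the product of speech rate (SR, syllables per second) and information density (ID, bits per syllable), the information carried by an integration window of duration T_window follows directly; I_window and T_window are our own illustrative symbols, not quantities defined in the cited studies.

```latex
\mathrm{IR} = \mathrm{SR} \times \mathrm{ID},
\qquad
I_{\text{window}} = \mathrm{IR} \times T_{\text{window}}
                  = \mathrm{SR} \times \mathrm{ID} \times T_{\text{window}}
```

If IR is roughly constant across languages, a window of fixed duration would carry a similar amount of information regardless of language, which is one way in which an information limit and a temporal limit could coincide.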
Future studies may address these questions by combining measures from cognitive neuroscience with psycholinguistic paradigms to tease apart the specific neural processes underlying language comprehension (Beier et al., 2021; Meyer et al., 2020b).
Summary
We present evidence for temporal constraints on (implicit) prosody, syntax, and information structure. These constraints may link to cognitive and neural (e.g., oscillatory) processing limitations and interact with the buildup of structure during comprehension. We suggest that these constraints shape linguistic structure at an optimal pace for brain processing.
Implications
Cognitive, neural, and linguistic timing constraints influence language processing, either context-dependently or context-invariantly. These constraints appear to interact with the physical timing of the speech signal in shaping the timing of processing. We thus recommend taking them into account when designing experiments and interpreting findings on speech and language processing.
Gains
We review evidence for context-dependent and context-invariant timing influences on language processing. This may help researchers recognize that behavioural and neural rhythms during language processing, and possibly speech and language as such, are shaped both exogenously by speech timing and endogenously by cognitive, neural, and linguistic constraints.



