Highlights
• L1 and L2 speakers showed evidence of early- and late-stage predictions
• No timing differences between predictions in L1 and L2 speakers
• L1 and L2 speakers are differentially impacted by speech rate
• Increased speech rate specifically impacted L2 late-stage predictions
• Supports research that late-stage prediction is more demanding
1. Introduction
Processing a spoken utterance requires multiple complex mechanisms working together. Sounds must be perceived, meanings must be assigned to those sounds, and these meanings must be combined into something comprehensible. Statistical regularities at each of these steps make it possible to predict likely continuations of the input. A growing body of research shows that first language (L1) speakers can use multiple sources of information to predict upcoming linguistic information before encountering it (for reviews, see Ferreira & Chantavarin, 2018; Huettig, 2015; Huettig & Mani, 2016; Kamide, 2008; Staub, 2015).
When it comes to second language (L2) speakers, early research did not find evidence for prediction (e.g., Grüter et al., 2012; Lew-Williams & Fernald, 2010; Martin et al., 2013). However, more recent research has found that L2 speakers do indeed predict (for reviews, see Hopp, 2022; Kaan & Grüter, 2021). This has led to the general consensus that L2 speakers can make predictions and that the mechanisms involved in prediction are the same for L1 and L2 speakers (e.g., Kaan, 2014). When differences arise between L1 and L2 speakers, they may be explained by individual differences (e.g., proficiency, quality of lexical representations, cognitive resources, processing strategies; Kaan, 2014), methodological factors (e.g., speech rate; Fernandez et al., 2020), or a combination of both. In the current study, we test whether L1 and L2 speakers of English differ in the timing and pattern of eye-movement behavior during semantic prediction, particularly at different stages of prediction, while exploring speech rate as a methodological factor.
1.1. Prediction stages
While it is possible that prediction occurs in a single stage, Pickering and Gambi (2018) have argued that it involves two. On their account, the first stage is an early, rapid, relatively resource-free automatic stage driven by the kind of spreading activation that characterizes semantic priming, while the second is a later, slower, more integrative and resource-demanding non-automatic stage that draws on real-world information. This later, more resource-demanding stage has been argued to be the likely source of differences in L1 and L2 predictive capabilities (Ito & Pickering, 2021). In the next sections, we review evidence for two-stage prediction in L1 speakers, followed by evidence in L2 speakers, and then outline prediction in relation to our exploratory factor, speech rate.
1.1.1. Evidence for two-stage prediction in L1 speakers
As just mentioned, the first stage of prediction is theorized to be a fast, automatized and thus relatively cost-free form of prediction largely rooted in spreading activation. While this type of prediction is rapid, it is not necessarily accurate, given that lexical activation will include associatively related concepts that may not fit the sentence context. First-stage prediction of this kind can be observed using the Visual World Paradigm (VWP). For example, Kukona et al. (2011) found that L1 English participants looked predictively toward pictures of both a crook and a policeman after hearing Toby will arrest, suggesting that these predictive eye movements stem from the verb arrest rather than from the sentence context (a policeman is associatively related to arrest but is not a plausible patient of the arresting). However, unlike Kukona et al. (2011), Gambi et al. (2016) found that after hearing Pingu will ride, L1 children and adults looked only to the appropriate target (horse) and not to the agent-related object (cowboy), suggesting that participants did not make associatively related (and ultimately incorrect) predictions. The authors argued that this different pattern of results may be due to the auditory recordings being quite slow (given that they were used with children), which may have led to unrepresentative predictive eye movements. Therefore, we will explore the contribution of speech rate to both stages of prediction.
The second stage of prediction, on the other hand, is theorized to be a slower, non-automatized and thus resource-demanding form of prediction that involves integrating real-world knowledge or contextual information with the linguistic input. While this type of prediction is slower, it allows predictions to be tailored to a given context and is therefore potentially more accurate than predictions based on the first stage alone. For example, Corps et al. (2022) found evidence for both stages of prediction using the VWP. In their study, young adult participants heard sentences such as I would like to wear the nice… spoken by either a male or a female speaker while viewing two verb-related objects (wearable, e.g., tie or dress) and two non-verb-related objects (non-wearable, e.g., drill or hairdryer). These objects were selected such that one of each type was stereotypically masculine (e.g., drill and tie) and one was stereotypically feminine (e.g., hairdryer and dress). Listeners quickly predicted based on semantic association, evidenced by an equal increase in the proportion of looks toward the two wearable objects (over the non-wearable objects) approximately 500 ms after the onset of wear. However, listeners also made a later prediction based on real-world (gender-stereotype) knowledge, evidenced by looks approximately 640 ms after the onset of wear toward the wearable object that stereotypically matched the gender of the speaker (i.e., the male speaker saying wear led to increased looks to tie over dress, and the female speaker saying wear led to increased looks to dress over tie). This suggests that L1 speakers make first-stage automatic predictions based on spreading activation (wear leading to an equal increase in looks to tie and dress) and second-stage non-automatic predictions based on real-world knowledge (male speakers leading to an increase in looks to tie and female speakers to dress).
Further evidence for two stages of prediction comes from a VWP study comparing younger adults (mean age = 20.35 years) and older adults (mean age = 68.87 years) (Fernandez et al., 2025). In the stimuli, two objects were initially relevant but could be narrowed down to one in a second step. Specifically, participants listened to sentences like The singer played the guitar while viewing a target (guitar), an agent-related object (microphone), a verb-related object (cards) and a distractor object (strawberry). While singer is related to both guitar and microphone, upon hearing the verb played, the prediction could be narrowed to guitar. Both groups showed increased looks to the target guitar and the agent-related microphone after hearing the agent singer, though younger adults did so significantly earlier (during the presentation of the agent singer) than older adults (soon after the onset of the verb played). Both groups then showed later prioritization of guitar over the agent-related microphone, though interestingly, the older adults looked at guitar significantly earlier (within the latter part of the presentation of the verb played) than the younger adults (within the latter part of the presentation of the article preceding the target, the). The authors argued that the increased looks to guitar after hearing singer constituted first-stage prediction based on spreading activation from the agent, and that the prioritization of guitar over microphone constituted second-stage prediction based on tailoring following the verb (i.e., actively ruling out the agent-related object microphone by combining the agent and verb: singer + played). They further argued that the older adults' delayed first-stage prediction stems from age-related decreases in semantic network efficiency and lexical access speed (e.g., Cosgrove et al., 2021, 2023), and that their quicker second-stage prediction was related to age-related increases in real-world knowledge, efficiency in inhibiting irrelevant information and efficiency in shifting resources between targets (Veríssimo et al., 2022; but see, for example, Buckner, 2004; Fjell et al., 2017; Hasher & Zacks, 1988, for contrary evidence).
Thus, there is increasing recognition that different stages and mechanisms may be involved in prediction, and that these stages may be differently affected by individual differences.
1.1.2. Evidence for two-stage prediction in L2 speakers
As previously mentioned, the general consensus is that L1 and L2 prediction mechanisms are the same, and that differences between the groups are most likely due to individual differences and/or methodological factors. Given that processing an L2 is more cognitively demanding (e.g., Segalowitz & Hulstijn, 2009), and that only second-stage prediction is hypothesized to require substantial cognitive resources, it stands to reason that second-stage predictions may be particularly impacted in L2 speakers (e.g., Ito & Pickering, 2021).
Recently, Corps et al. (2023) tested L2 speakers of English using the same items as their L1 study (Corps et al., 2022) and directly compared the timing of the two stages of prediction across groups. As with L1 speakers, Corps et al. (2023) found evidence for two-stage prediction in L2 speakers. Furthermore, when the authors compared the two groups, they found no differences in the timing of first-stage prediction, but found that L2 speakers were delayed in second-stage prediction relative to L1 speakers.
Additional potential evidence for stages of prediction in L2 speakers comes from Peters et al. (2018), who investigated prediction with lower-skilled participants (i.e., participants who scored low on a vocabulary test or self-identified as L2 speakers) and higher-skilled participants (i.e., participants who scored high on a vocabulary test or self-identified as L1 speakers) using the VWP. In their first experiment, Peters et al. used stimuli similar to those of Fernandez et al. (2025) in that the visual array for the predictable items contained two objects related to the agent, which were narrowed down by the verb. For example, participants heard a sentence like The pirate hides the treasure while viewing four objects: a target (treasure), an agent-related object (ship), a verb-related object (bone) and an unrelated distractor (cat). Peters et al. found that both groups made predictions after hearing The pirate, in terms of looks to the target (treasure) and the agent-related object (ship), followed by increased looks toward treasure (and decreased looks toward ship) after hearing hides. While Peters et al. do not frame their findings in terms of stages of prediction, we argue that this finding is evidence of second-stage prediction. Interestingly, and unlike Corps et al., Peters et al. did not find group differences at this potential second stage of prediction, but it is important to note that they could not accurately estimate (or compare) the time at which the groups looked toward the target relative to the agent-related object. Visual inspection of their fixation graphs anecdotally suggests that participants who identified as L1 speakers looked earlier to the target than participants who did not. In addition, Peters et al. did not directly test first-stage prediction (i.e., they did not compare the target object to the verb-related or distractor item), though visual inspection again provides anecdotal evidence of an early divergence between the target and the verb-related and distractor items across groups.
While we hesitate to draw too many parallels between Corps et al. and Peters et al., we believe that both studies provide evidence that prediction involves two stages in L2 speakers: a resource-free first stage based on spreading activation and a more costly second stage involving tailoring based on real-world knowledge of the situation, which may be additionally impacted in a cognitively demanding L2. However, research investigating both stages is limited. In the next section, we discuss how the two stages may be influenced by a property of the stimuli: speech rate.
1.2. Exploratory factor
As previously outlined, the prediction mechanisms employed by L1 and L2 speakers are believed to be qualitatively the same, and when quantitative differences arise across these groups, they are believed to stem from individual or methodological factors (e.g., Kaan, 2014). Thus, we expect both L1 and L2 speakers to generate first- and second-stage predictions, but the timing of their predictions may differ given the different cognitive resource demands of L1 versus L2 language processing. Since speech rate also affects the resource demands of language processing (Müller et al., 2019), it is possible that speech rate drives some of the quantitative differences that have been reported between groups. Therefore, in the current study, we directly explore how speech rate impacts first- and second-stage predictions in both L1 and L2 speakers.
Given that the VWP uses spoken stimuli, it comes as no surprise that the rate at which those stimuli are presented can impact a listener's ability to make predictions. Unfortunately, there seems to be no consensus on what constitutes a "normal" speech rate (reported as anywhere between 2.5 and 8.0 syllables per second; see Fernandez et al., 2020). Additionally, research using the VWP does not consistently report speech rate, and when it is reported, it is often described simply as "normal" (a potentially broad range) or tends to be on the slower side (e.g., 3 syllables per second or less; Fernandez et al., 2020). For example, Kukona (2023) estimated that the speech rate used in the Peters et al. (2018) study investigating the narrowing of prediction based on the verb (e.g., The pirate hides the treasure) was approximately 3.0 syllables per second. Using the VWP, Huettig and Guerra (2019) found that prediction only occurred for L1 speakers when speech rate was "slow" relative to "normal" (the authors do not report the speech rates, but Kukona estimated them at 2 and 4.5 syllables per second, respectively). In a study that more directly manipulated speech rate in the VWP, Fernandez et al. (2020) tested predictive eye movements with sentences presented at 3.5, 4.5, 5.5 and 6 syllables per second. Both college-aged L1 and L2 speakers showed an inverse-U pattern, with predictive eye movements increasing from slower speech rates up to an "optimal" rate (where prediction was highest) and decreasing at faster rates. L1 speakers showed the highest level of prediction at 5.5 syllables per second, while L2 speakers showed the most efficient prediction at rates slightly faster than 3.5 syllables per second (but slower than 4.5 syllables per second). Together, these studies, along with the Gambi et al. (2016) study mentioned previously (which found no evidence of agent-related activation at slow speech rates), suggest that speech rate plays an important role in the VWP and that different speech rates may lead to different patterns of predictive behavior. Therefore, in this study, we include speech rate as a continuous variable in our models, along with its interaction with language (L1/L2) and with language across time.
2. Current study
Research suggests that both L1 and L2 speakers make first-stage predictions similarly, while L2 speakers may be delayed in more costly second-stage predictions. Research also suggests that methodological factors, such as speech rate, may impact the first and/or second stage of prediction. However, research on the stages of prediction is rather limited and often contradictory. Therefore, in the current study, we test first- and second-stage prediction with L1 and L2 speakers of English while investigating the role of speech rate. In addition, we use statistical approaches uniquely suited to investigating these stages: regressions that allow us to model the potential non-linearity of eye-movement patterns and to model speech rate as a continuous variable (generalized additive mixed models, GAMMs; e.g., Porretta et al., 2018; Wieling, 2018; Wood, 2017), as well as a non-parametric bootstrapping approach that allows us to compare divergence times across groups (divergence point analysis, DPA; Stone et al., 2021).
In terms of first-stage predictions, it could be the case that L2 speakers are delayed relative to L1 speakers due to weaker semantic network efficiency, mirroring previous findings that older adults show a delayed onset of first-stage predictions (Fernandez et al., 2025). However, we believe that this effect is (at least to some degree) age-specific. When comparing groups of younger adults, semantic predictions have generally been shown to be quite similar between L1 and L2 speakers, and Corps et al. (2022, 2023) specifically found similar onsets of first-stage prediction in L1 and L2 speakers. Thus, we hypothesize that both L1 and L2 speakers will make first-stage predictions at similar times. In terms of second-stage prediction, we hypothesize, again in line with Corps et al., that L2 speakers will show delayed second-stage predictions relative to L1 speakers, given the greater resource demands of L2 language processing. In relation to speech rate, we hypothesize an inverse-U pattern: prediction (in the form of looks to the target relative to a competitor) will increase up to an "optimal" rate and then decrease at speech rates above that "optimal" rate (Fernandez et al., 2020). Additionally, we believe that this "optimal" rate (the peak of the inverse U) will differ for L1 and L2 speakers. For first-stage predictions, we hypothesize that looks to the target may increase up to rates slightly faster than 3.5 syllables per second for L2 speakers, and up to 5.5 syllables per second for L1 speakers (as found in Fernandez et al., 2020; though in the current study, the fastest speech rate is 4.6 syllables per second). For second-stage predictions, we hypothesize that changes in speech rate may impact both groups and that higher speech rates may be particularly impactful for L2 speakers, given the increased cognitive demands of this stage. For example, the optimal rate may shift for L2 speakers, such that looks to the target start decreasing at rates slower than approximately 3.5 syllables per second.
3. Methods
Sample size was based on a recent webcam-based eye-tracking study investigating verb-based prediction, which found that 40 participants (and 16 items per condition) obtained approximately 90% power (Prystauka et al., 2024).
3.1. Participants
3.1.1. L1 English
Fifty L1 English speakers were recruited from the University of Alberta (Canada) for course credit. Four participants were excluded because they were raised in a bilingual environment, leaving 46 monolingually raised L1 English speakers in the study, none of whom had been exposed to an L2 before the age of 5. All participants reported normal hearing and normal or corrected-to-normal vision. See Table 1 for additional information.
Table 1. Participant information

3.1.2. L2 English
Forty-five L2 English speakers (L1 German) were recruited from the University of Kaiserslautern-Landau (Germany) for course credit or payment (8 euros). All participants reported being monolingually raised L1 German speakers who had not been exposed to an L2 before the age of 5. All participants reported normal hearing and normal or corrected-to-normal vision. See Table 1 for additional information.
3.2. Materials
The sentence stimuli were taken from Holt et al. (2021) and were identical to those used in Fernandez et al. (2025). All sentences followed the same structure (see Table 2): The [agent] [verb] the [critical word]. There were 32 pairs of sentences, each consisting of one predictable and one unpredictable sentence. The pairs were constructed such that the critical word was the same across both sentence types but was either predictable or unpredictable based on the preceding information. Based on the SUBTLEX-UK database of Zipf frequencies (van Heuven et al., 2014), all agents, verbs and critical words were of medium to high frequency. See Table 2 for example items, the Zipf frequencies of the agent, verb and critical word, and the mean syllable count per word.
Table 2. Example stimuli from Holt sentences and item information (range in parentheses)

Each sentence pair had a corresponding visual array consisting of four objects. For the example in Table 2, this included: a target object (guitar), an agent-related object for the predictable sentence (related to the agent but not the verb; microphone), a verb-related object for the predictable sentence (related to the verb but not the agent; cards) and a distractor object (related to neither the agent, the verb nor the target; strawberry). For the unpredictable sentences, all objects were unrelated to the agent and verb. Similar to the sentence words, the object names ranged from medium to high frequency in the SUBTLEX-UK database (M = 4.14, SD = 0.61). Additionally, to avoid phonological overlap, none of the object names shared initial phonemes. All objects were grayscale 300 × 300 pixel JPEG images, and the placement of the objects was rotated and counterbalanced across items (objects provided by Holt et al., 2021).
Predictability was established in an online pre-test with L1 speakers of English, in which participants were presented with truncated sentences (the experimental sentences with the target word removed) together with the four images of the array. Participants were asked to choose the image they believed would most likely come next. Participants chose the target image over the other images 98.08% (SD = 4.39) of the time for the predictable items and 25.26% (SD = 16.54) of the time for the unpredictable items (for additional pre-testing information, see Fernandez et al., 2025).
3.2.1. Auditory information
All items were recorded by a male L1 Scottish-English speaker with a Blue Yeti USB microphone at a sampling rate of 48,000 Hz using Audacity® recording software (Audacity Team, 2021). For ease of post hoc acoustic manipulation, a click track (90 beats per minute) was used to ensure all constituents were spoken within a similar time frame. Each constituent was then duration-normalized to the mean of that constituent across all items using Praat (Boersma & Weenink, 2021), making all corresponding constituents the same length across items (see Table 3 for mean normalized constituent durations). While all items thus had the same total duration, they naturally differed in speech rate. Speech rate was determined by dividing the number of syllables in each item by the item's duration, yielding rates from 2.5 to 4.6 syllables per second (see OSF for a visualization of the distribution; https://osf.io/4ke82/).
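To make this calculation concrete, the per-item rate is simply the syllable count divided by the item duration in seconds. Below is a minimal sketch in R (the language of the reported analyses); the data frame items and its columns n_syllables and duration_ms are hypothetical names for illustration, not part of the original materials.

```r
# Per-item speech rate: syllables divided by duration in seconds.
# `items`, `n_syllables` and `duration_ms` are illustrative names.
items$rate <- items$n_syllables / (items$duration_ms / 1000)

range(items$rate)  # in the current study, item rates spanned 2.5-4.6 syllables/second
hist(items$rate, main = "Item speech rates", xlab = "Syllables per second")
```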
Table 3. Mean normalized constituent durations (ms)

3.3. Apparatus
For both participant groups, eye movements were recorded using an EyeLink 1000 eye-tracker sampling at 1000 Hz, with the head stabilized using a chin rest. Only the right eye was recorded (viewing was binocular). Monitors had a 60 Hz refresh rate and a 1024 × 768 resolution, and participants listened through Philips Bass+ on-ear headphones. The L1 English participants sat approximately 50 cm from a 20″ Dell monitor (model 2009 W1), while the L2 English participants sat approximately 85 cm from a 19″ Dell flat-screen cathode ray tube (CRT) monitor (model P1130).
3.4. Procedure
The same procedure was followed for both L1 and L2 speakers, in the following order: participants provided informed consent, took part in the eye-tracking task, completed a language background questionnaire (LSBQ; Anderson et al., 2018) and finally completed a proficiency test (Oxford Placement Test, Part A; OPT). The study lasted approximately 45 minutes. The eye-tracking task consisted of 107 trials: three practice trials, 32 critical items for this experiment (16 predictable/16 unpredictable) and 72 fillers (experimental items for a study not reported here). The 104 experimental items were divided into two blocks (52 items each) with a break in between.
The eye-tracking task started with a standard 9-point calibration. During the experiment, participants were told there was no time limit and that they could pace the study and take breaks as needed (if a participant took a break, they were recalibrated). There was an additional mandatory break between the two blocks, after which all participants were recalibrated. Instructions were provided in written form on the screen and verbally by the experimenter. Participants were instructed that they would hear a short sentence while viewing an array of objects, and that their task was to select, with the mouse, the object they believed best matched the sentence once the auditory stimulus had finished. Participants were further instructed that they could not click on an object until the sentence had played in its entirety and a green border appeared around the display.
All trials began with a drift correction in the center of the screen. To start the trial, participants were required to fixate the drift-correction point while pressing the space bar. Each trial began with the four objects displayed on the screen for 2000 ms, after which the auditory stimulus was played. After the auditory stimulus ended, the objects remained on the screen for an additional 2000 ms, and then a green border encompassed the display and the mouse cursor appeared. Participants could then click on one of the four objects (clicks had to land within the object for the choice to be recorded), which ended the trial.
3.5. Analysis
To investigate the pattern of looks to the objects and the contribution of speech rate, we used generalized additive mixed models (GAMMs; e.g., Porretta et al., 2018; Wieling, 2018; Wood, 2017), and to investigate the timing of looks to the different objects, we used divergence point analysis (DPA; Stone et al., 2021); both are well suited to VWP analyses. GAMMs model both linear and non-linear relationships and can handle the autocorrelation that is inherent to eye movements across short time spans (i.e., consecutive time bins). However, GAMMs cannot compare effect onsets across groups. Therefore, we additionally investigated the timing of looks using DPA. DPA can likewise handle autocorrelation (since the bootstrapping procedure does not assume a distribution) as well as the Type 1 error that comes from running many tests across a time window.
Similar to Fernandez et al. (2025), analyses included only the predictable items (with the unpredictable items serving as a baseline and ensuring that participants were not employing a strategy that might overestimate prediction). To investigate first-stage prediction, we ran a GAMM and a DPA comparing looks to the target (guitar) versus looks to the verb-related object (cards). Unlike Fernandez et al. (2025), we compared the target to the verb-related object rather than to the distractor object, because we believe this is the more conservative approach: cards should be activated and then ruled out after hearing the verb, while the distractor should not be activated at all. Evidence for first-stage prediction should take the form of rapid spreading activation from singer, which would activate guitar (and microphone) but leave cards disregarded. In terms of eye movements, this should appear as an increase in looks to guitar (and microphone) relative to cards soon after singer is processed. To investigate second-stage prediction, we ran a GAMM and a DPA comparing looks to the target (guitar) versus looks to the agent-related object (microphone), as done in Fernandez et al. (2025) and Peters et al. (2018). As second-stage prediction involves the narrowing of prediction based on additional information, evidence for it should take the form of the selection of guitar and the inhibition of microphone when singer + played are processed in combination. In terms of eye movements, this should appear as an increase in looks to guitar relative to microphone soon after singer + played is processed.
3.5.1. Generalized Additive Mixed Models (GAMMs)
For the GAMM analyses, comparisons were made during the predictive time window (from the onset of the agent to the onset of the target). First-stage prediction compared looks to the target (guitar) versus looks to the verb-related object (cards) in the predictable items; second-stage prediction compared looks to the target (guitar) versus looks to the agent-related object (microphone) in the predictable items. Our main dependent variable was the empirical logit of fixation counts, which is the log-odds of looking toward the target relative to looking at another specific object (e.g., microphone) in the array (Barr, 2008). Data were grouped into 20 ms bins and were weighted to control for eye-movement-based dependencies (Barr, 2008). The empirical logit was submitted to a GAMM, which allows us to model non-linear time-course data (e.g., Porretta et al., 2018; Wieling, 2018; Wood, 2017), using R (R Core Team, 2018). We included a parametric fixed effect of language (L1/L2) as an ordered factor, ordered-factor difference smooth interactions of time × language and speech rate × language, and a three-way tensor product ordered-factor difference smooth interaction of time × speech rate × language. The models included random smooths over time by participant, and random reference/difference smooths of time by item grouped by language (this reduces the Type 1 error rate and increases power; Sóskuthy, 2021). Model residuals were checked for autocorrelation with the itsadug package (van Rij et al., 2022), and autocorrelation was present in both models. Therefore, we refit the same parameters as an autoregressive error model, in which the model estimates parameters under the assumption that neighboring observations are correlated (achieved by including an estimate of the autocorrelation in the model; see Wieling, 2018). After this adjustment, autocorrelation at lag 1 was <0.1 (see OSF for visualizations). Additionally, the effective degrees of freedom were checked for oversmoothing: if the basis dimension check (k value) was significant (p < 0.05, suggesting that the basis dimension was too restricted), k was doubled (Wieling, 2018; see OSF; https://osf.io/4ke82/). If a non-significant check (p > 0.05) could not be reached, the model with the highest p-value was chosen. Significance testing was done by checking both the parametric and smooth terms in the model output and applying a Bonferroni correction to guard against the increased likelihood of Type 1 error from multiple comparisons (four comparisons were made for each model: .05/4 yields an alpha of .013; see Sóskuthy, 2021). This approach to significance testing is comparable to model fitting using AIC comparisons and does not require likelihood ratio tests (see Sóskuthy, 2021).
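To make the model structure concrete, the following is a minimal sketch of this type of analysis in R using mgcv and itsadug. It is not the authors' analysis script (the actual code is available on the OSF); all object and column names (dat, cnt_target, cnt_comp, Time, Rate, Lang, Subject, Item, start_event) are hypothetical, the empirical-logit transform and weights follow Barr (2008), and the random-effects structure is simplified relative to the one described above.

```r
library(mgcv)     # bam() for large GAMMs
library(itsadug)  # autocorrelation utilities

# Hypothetical input: one row per 20 ms bin, with per-bin sample counts
# on the target (cnt_target) and on the competitor object (cnt_comp).
dat$elog <- log((dat$cnt_target + 0.5) / (dat$cnt_comp + 0.5))           # empirical logit
dat$wts  <- 1 / (1 / (dat$cnt_target + 0.5) + 1 / (dat$cnt_comp + 0.5)) # inverse-variance weights

dat$Lang <- as.ordered(dat$Lang)          # ordered factor; reference level = L1
contrasts(dat$Lang) <- "contr.treatment"
# Subject and Item must be factors for the "fs" random smooths below.

m <- bam(elog ~ Lang                                      # parametric group difference
           + s(Time) + s(Time, by = Lang)                 # reference + difference smooths
           + s(Rate) + s(Rate, by = Lang)
           + te(Time, Rate) + te(Time, Rate, by = Lang)   # three-way tensor interaction
           + s(Time, Subject, bs = "fs", m = 1)           # random smooths by participant
           + s(Time, Item, bs = "fs", m = 1),             # random smooths by item
         weights = wts, data = dat)

# Estimate lag-1 autocorrelation and refit as an AR(1) error model.
rho1 <- start_value_rho(m)
m_ar <- bam(formula(m), weights = wts, data = dat,
            rho = rho1, AR.start = dat$start_event)       # TRUE at the first bin of each trial

summary(m_ar)  # parametric and smooth terms, tested against the Bonferroni-adjusted alpha
```

With this coding, the by = Lang difference smooths directly estimate how the L2 curves deviate from the L1 reference curves, which is what licenses reading the L2-L1 contour plots in the Results as difference surfaces.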
3.5.2. Divergence Point Analysis (DPA)
To investigate the timing of looks across the L1 and L2 speakers, we conducted a DPA (Stone et al., 2021). DPA is a non-parametric bootstrapping method that allows comparisons between groups using confidence intervals (CIs). The DPA uses t-tests to compare fixations to two objects, bin by bin, until 10 consecutive 20 ms bins (i.e., 200 ms) are significantly different; the first bin of this run is the divergence point. By resampling the original data (via bootstrapping), 2000 new datasets and their respective divergence points are generated, and then the mean and 95% CI of these divergence points are calculated. As with the GAMMs, we made two comparisons using DPA, though the DPA was run over the whole time window. The first analysis investigated the timing of first-stage prediction and compared looks to the target (guitar) relative to looks to the verb-related object (cards) in the predictable items. The second analysis investigated the timing of second-stage prediction and compared looks to the target (guitar) versus looks to the agent-related object (microphone) in the predictable items.
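For illustration, the bootstrap logic can be sketched in base R as follows. This is a simplified sketch, not the Stone et al. (2021) implementation; the data frame d and its columns (Time, Subject, fix_target, fix_comp, holding per-bin, per-participant fixation proportions) are hypothetical names, and the choice of a paired test is our assumption.

```r
set.seed(42)
bins <- sort(unique(d$Time))  # the 20 ms bins of the analysis window

divergence_point <- function(df) {
  # Per-bin paired t-tests of target vs. competitor fixation proportions.
  p <- sapply(bins, function(b) {
    bin <- df[df$Time == b, ]
    t.test(bin$fix_target, bin$fix_comp, paired = TRUE)$p.value
  })
  runs <- rle(p < .05)
  first <- which(runs$values & runs$lengths >= 10)[1]  # 10 consecutive bins = 200 ms
  if (is.na(first)) return(NA)
  bins[sum(runs$lengths[seq_len(first - 1)]) + 1]      # onset of the significant run
}

# Resample participants with replacement 2000 times and recompute the onset.
boot_dp <- replicate(2000, {
  ids <- sample(unique(d$Subject), replace = TRUE)
  divergence_point(do.call(rbind, lapply(ids, function(i) d[d$Subject == i, ])))
})

mean(boot_dp, na.rm = TRUE)                     # mean divergence point
quantile(boot_dp, c(.025, .975), na.rm = TRUE)  # 95% CI
```

Running this separately for the L1 and L2 data yields per-group means and CIs of the kind reported below; the group comparison is then the distribution of between-group differences in the bootstrapped divergence points.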
4. Results
4.1. Accuracy
Incorrectly answered trials were removed before analysis. The target was chosen in 94.2% of trials (thus, 5.8% of trials were removed): 3.8% were due to selection of the agent-related object, 1.5% to selection of the verb-related object and 0.5% to selection of the distractor. See Figure 1 for a visualization of the fixation proportions for the correctly answered trials.

Figure 1. Mean fixation proportion to all objects across language and sentence type (correctly answered items only).
4.1.1. First-stage prediction
4.1.1.1. GAMM
To investigate first-stage prediction, we tested the empirical logit of looks to the target (guitar) versus looks to the verb-related object (cards) in the predictable items, from the onset of the agent to the onset of the target; see Figure S1 in the Supplementary material for a visualization.
The results from the GAMM can be seen in Table 4.
Table 4. First-stage prediction GAMM output for the empirical logit during the prediction time window. Part A reports the parametric coefficients and Part B reports the smooth terms.

The only significant parameter was the tensor product ordered-factor difference smooth interaction between time and speech rate (F = 16.04, p < .0001). Contour plots were used to visualize this interaction (see Figure 2). The top left plot displays the speech rate × time interaction for the L2 speakers, and the top right plot displays the same interaction for the L1 speakers. In these two plots, pink indicates higher empirical logits (more looks to the target relative to the verb-related object) and green indicates lower empirical logits (fewer looks to the target relative to the verb-related object). The bottom left plot displays the difference between the L2 and L1 speakers: positive values indicate that the L2 speakers have higher empirical logits than the L1 speakers (with the difference increasing the pinker the plot becomes), while negative values indicate that the L1 speakers have higher empirical logits than the L2 speakers (with the difference increasing the greener the plot becomes).

Figure 2. Contour plots for the empirical logit of the three-way interaction between time, speech rate and language group. Top left panel. Contour plot for L2 speakers. Top right panel. Contour plot for L1 speakers. In both of the top panels, negative values (i.e., green) indicate lower empirical logits (looks to the target) and positive values (i.e., pink) indicate greater empirical logits. Bottom left panel. The difference between the L2 and L1 speakers. In the bottom panel, the negative values indicate greater empirical logits (looks to the target) by L1 speakers relative to L2, and positive values indicate greater empirical logits by L2.
What is clear in these graphs is that both L1 and L2 speakers show a negative or near-zero empirical logit (indicating that participants are not looking toward the target) until around 700-800 ms (around the offset of the agent). At this point, the empirical logit becomes positive (indicating that participants are looking at the target), suggesting that they are making first-stage predictions after hearing the agent. To further aid interpretation of the impact of speech rate, we visualized the L2-L1 difference across time at three speeds: 2.6, 3.6 and 4.6 syllables per second (see Figure S2 in the Supplementary material). These values were arbitrarily selected to cover the range of speech rates used in the experiment.
These visualizations suggest that L1 and L2 speakers make similar looks to the target across time at the slower (2.6 syllables per second) and middle (3.6 syllables per second) speech rates. However, at the faster (4.6 syllables per second) rates, L1 speakers make more looks to the target as soon as they hear the agent (as evidenced by the larger empirical logit between 430 and 730 ms), while L2 speakers show an increase in looks to the target later in the difference graph (from approximately 1300 ms to the end of the window). This suggests that at faster speech rates, L1 speakers look to the target earlier (during the agent) and continue to do so until the end of the time window, while L2 speakers look to the target later (at the end of the verb), with their looks continuing to increase as the window unfolds. Overall, both groups show similar patterns of looks to the target at slower and middle speech rates, while at faster rates L1 speakers make more and earlier looks to the target, with L2 speakers seeming to "catch up" by showing later, greater looks to the target at the end of the time window.
4.1.1.2. DPA
To test first-stage prediction, we compared looks to the target (guitar) versus looks to the verb-related object (cards) in the predictable items. The DPA revealed that looks to these objects diverged at 709.25 ms [CI: 620, 920] for L1 speakers and 766.97 ms [CI: 720, 920] for L2 speakers. The difference between the two groups was 57.72 ms [CI: -140, 240]; given that this CI crosses 0, we do not conclude that L1 and L2 speakers differ in the timing of looks to the target relative to the verb-related object (see Figure 3).

Figure 3. Divergence point and 95% confidence intervals superimposed on the fixation proportion of looks to the target and verb-related object.
4.1.2. Second-stage prediction
4.1.2.1. GAMM
To investigate second-stage prediction, we tested the empirical logit of looks to the target (guitar) relative to the agent-related object (microphone) in the predictable items, from the onset of the verb to the onset of the target; see Figure S3 in the Supplementary material for a visualization.
The results from the GAMM can be seen in Table 5.
Table 5. Second-stage prediction GAMM output for the empirical logit during the prediction time window. Part A reports the parametric coefficients and Part B reports the smooth terms.

The ordered-factor difference smooth interaction of speech rate by language was significant (F = 48.75, p < .0001); for a visualization of this interaction, see Figure S4 in the Supplementary material. The tensor product ordered-factor difference smooth interaction between time, speech rate and language was also significant (F = 12.04, p < .0001). Contour plots were used to visualize this interaction (see Figure 4). The top left plot displays the speech rate × time interaction for the L2 speakers, and the top right plot displays the same interaction for the L1 speakers. In these two plots, pink indicates higher empirical logits (more looks to the target) and green indicates lower empirical logits (fewer looks to the target). The bottom left plot displays the difference between the L2 and L1 speakers: positive values indicate that the L2 speakers have higher empirical logits than the L1 speakers, while negative values indicate that the L1 speakers have higher empirical logits than the L2 speakers.

Figure 4. Contour plots for the empirical logit of the three-way interaction between time, speech rate and language group. Top left panel. Contour plot for L2 speakers. Top right panel. Contour plot for L1 speakers. In both of the top panels, negative values (i.e., green) indicate lower empirical logits (looks to the target) and positive values (i.e., pink) indicate greater empirical logits. Bottom left panel. The difference between the L2 and L1 speakers. In the bottom panel, the negative values indicate greater empirical logits (looks to the target) by L1 speakers relative to L2, and positive values indicate greater empirical logits by L2.
What is clear in these graphs is that L1 and L2 speakers show differences in second-stage prediction. L2 speakers show very little evidence of an empirical logit above 0, with looks to the target exceeding looks to the agent-related object only at the very start and very end of the time window, particularly at the fastest speech rates. Meanwhile, L1 speakers show a general increase in looks to the target as speech rate and time increase. To further aid interpretation, we visualized the L2-L1 difference across time at three speeds: 2.6, 3.6 and 4.6 syllables per second (see Figure S5 in the Supplementary material).
These visualizations suggest that at the slower speech rates, and earlier in the time window, both groups show similar patterns of looks, with L2 speakers making slightly more looks to the target toward the end of the window, as evidenced by the larger values (indicating greater looks by the L2 group) in the bottom right corner of the difference graph. As speech rate increases, L1 speakers make more and earlier looks to the target (at both 3.6 and 4.6 syllables per second), as evidenced by the smaller values (indicating greater looks by the L1 group) in the top right corner of the difference graph. Overall, this suggests that at faster speech rates, L1 speakers look to the target while hearing the verb more than L2 speakers do, while L2 speakers show greater and earlier looks to the target at the slower speech rates.
4.1.2.2. DPA
To test second-stage prediction, we compared looks to the target versus the agent-related object in the predictable items. The DPA revealed that looks to these objects diverged at 1299 ms [CI: 1240, 1420] for L1 speakers and 1396 ms [CI: 1320, 1520] for L2 speakers. The difference between the two groups was 122.61 ms [CI: -40, 240]. Given that this CI crosses 0, we do not conclude that L1 and L2 speakers differ in the timing of looks to the target relative to the agent-related object (see Figure 5).

Figure 5. Divergence point and 95% confidence intervals superimposed on the fixation proportion of looks to the target and agent-related object.
5. Discussion
It is well established that L1 speakers make predictions while listening to speech (e.g., Ferreira & Chantavarin, 2018; Huettig, 2015; Huettig & Mani, 2016; Kamide, 2008; Staub, 2015). The general consensus is that L2 speakers are able to make predictions in the same way as L1 speakers, with differences between the groups stemming from individual differences and/or methodological factors (e.g., Hopp, 2022; Kaan & Grüter, 2021). It has also been argued that prediction occurs in two stages: an automatic, relatively cost-free first stage, and a non-automatic, more costly second stage (e.g., Ito & Pickering, 2021; Pickering & Gambi, 2018). Evidence suggests that it is at the late, more costly stage that L2 speakers show differences relative to L1 speakers (e.g., Corps et al., 2023), though this may be modulated by factors such as speech rate (Gambi et al., 2016). In the current study, we therefore investigated both first- and second-stage prediction in L1 and L2 speakers of English while taking into account the role of speech rate. We hypothesized that L1 and L2 speakers would make first-stage predictions at similar times, but that L2 speakers would be delayed in making second-stage predictions. In terms of speech rate, we hypothesized that both groups would show an inverse-U pattern of prediction across speech rates, with "optimal" rates being slower for L2 speakers than for L1 speakers. Further, we hypothesized that the "optimal" rate for L2 speakers in the more costly second stage would be lower than that of the first stage and would start decreasing at slower rates.
To test first-stage prediction, we compared looks to the target versus looks to the verb-related object while participants listened to predictable sentences. Both groups quickly looked toward the target around the start of the verb, with the DPA showing no difference in the timing of looks between the two groups. The GAMM revealed that both groups were making first-stage predictions around the verb onset, between 700 and 800 ms (consistent with the DPA timings). Both groups showed similar patterns of looks to the target at speech rates between approximately 2.5 and 4.0 syllables per second. As the speech rate increased above approximately 4.0 syllables per second, both groups continued to show evidence of prediction; however, L1 speakers made first-stage predictions while processing the agent, whereas L2 speakers made them later, only at the verb offset. This suggests that, for both groups, as speech rate increases, so do looks to the target, which supports previous findings that predictive eye movements increase from slower speech rates (Fernandez et al., 2020). However, in the current study, the "optimal" speech rate for L2 speakers was up to 4.5 syllables per second (unlike Fernandez et al., who found the optimal rate to be around 3.5 syllables per second). This may be because the type of syntactic prediction tested in Fernandez et al. (2020) was more costly than the first-stage prediction in the current study, which would allow a faster "optimal" speech rate here.
To test second-stage prediction, we compared looks to the target versus looks to the agent-related object while participants listened to predictable sentences. Given that the second stage is more cognitively demanding and subject to more conscious control, it may be that when processing a more cognitively demanding L2, there are fewer resources left to invest in second-stage predictions. We found that both groups showed competition between the target and the agent-related object and that, soon after hearing the verb (and before the target was spoken), both groups predictively looked to the appropriate target. Surprisingly, while the divergence points of the two groups were numerically different (1300 ms for the L1 speakers, corresponding approximately to the onset of the article, versus 1400 ms for the L2 speakers, corresponding approximately to the end of the article preceding the target), the DPA did not reveal a reliable group difference. In terms of speech rate, both groups interestingly showed an early preference for the target at the fastest speech rates, indicating looks to the target during the presentation of the agent, at least when it was presented quickly. The L1 speakers showed a very clear increase in prediction as speech rate increased, with looks to the target incrementally increasing with rate. Starting at approximately 4 syllables per second, the L1 group consistently (and increasingly) made predictions, which may indicate that their optimal rate is 4.6 syllables per second or higher. This finding supports Fernandez et al. (2020) in showing that when speech rates are too slow, prediction is less efficient. The L2 group showed very little early evidence of disambiguating the target from the agent-related image and did not show an empirical logit above zero (indicating a preference for the target) until approximately 1300 ms, across all speech rates. Overall, L2 speakers did not show a clear pattern with any optimal speech rate. While the DPA showed that L2 speakers looked to the target before it was spoken, it may be that L2 speakers adopted a "wait and see" strategy (e.g., McMurray et al., 2017; Van Petten & Luka, 2012), in which they wait for more of the sentence to unfold, thus accruing more information, before making a prediction. This strategy may reduce processing costs by limiting the set of active alternative interpretations that then need to be inhibited. However, it is important to note that we did not actively manipulate speech rate, so future research designed to test the impact of speech rate more directly could help elucidate this possibility further.
We believe that these data support previous research demonstrating that both L1 and L2 speakers make first- and second-stage predictions (e.g., Corps et al., 2022, 2023; Ito & Pickering, 2021; Pickering & Gambi, 2018). In terms of first-stage predictions, both L1 and L2 speakers showed similar patterns of looks to the target up to approximately 4 syllables per second. At faster speeds, both groups continued to show increased looks to the target; however, L1 speakers showed earlier looks to the target (during the agent) than L2 speakers (after the verb). In terms of second-stage prediction, there were no timing differences between the groups, but the impact of speech rate on this stage reveals the more nuanced difficulties that L2 speakers face in more cognitively demanding linguistic situations. In particular, L2 speakers showed reduced and more variable predictions, unlike L1 speakers, who showed increased prediction as speech rate increased. That is, when prediction is more cognitively demanding, L1 speakers are able to predict consistently even as speech rate increases, while L2 speakers, who have fewer cognitive resources available, show less, later and more variable prediction. This may reflect a wait-and-see strategy in which L2 speakers wait for more information before committing to a prediction, potentially reducing the costs that accompany second-stage prediction.
As mentioned, and unlike Corps et al. (2023), we did not see overall DPA differences in second-stage prediction between L1 and L2 speakers. We believe there are at least three reasons why these timing differences did not arise. First, second-stage prediction in our items relied on combining verb and agent information, while in Corps et al. (2022, 2023) it relied on stereotypy. It is possible that the latter requires more cognitive resources, providing more opportunity for second-stage differences to emerge. Second, there is a temporal gap between activating competitors and ruling them out in the current study, while in the Corps et al. study there is no such gap: both the activation of relevant objects and the narrowing down of those objects can occur simultaneously at the verb. Simultaneous activation and narrowing may be more cognitively demanding, again making second-stage differences more apparent. Third, Corps et al. (2023) recruited a heterogeneous group of L2 speakers, while the current study recruited only L1 speakers of German. It is possible that similarities between English and German made second-stage prediction less costly.
This raises another aspect worth discussing: the potential influence of proficiency. The impact of proficiency on prediction is not entirely clear, with some researchers finding increased L2 prediction abilities with higher proficiency (e.g., Chambers & Cooke, 2009; Dussias et al., 2013; Hopp, 2013), while other research finds no relationship between proficiency and predictive abilities (Hopp, 2015; Ito et al., 2018; Kaan & Grüter, 2021; Kim & Grüter, 2021; Perdomo & Kaan, 2021). While we collected a measure of proficiency in the current study, we did not include it in our models because (1) it was highly correlated with speaker group and (2) a model including proficiency score did not improve fit over a model without it (see OSF for the correlation and model comparison). While the exact contribution of proficiency is not clear, what is clear is that speakers at the proficiency level tested here (i.e., intermediate to high) are capable of making semantic predictions at multiple speech rates.
6. Conclusions
In this study, we investigated first- and second-stage prediction in L1 and L2 speakers of English while exploring the role of speech rate. We found evidence that both groups make first- and second-stage predictions at similar times. However, speech rate played an important role in both stages. During the first stage, L1 and L2 speakers showed similar patterns of looks to the target at slower speech rates, but at faster rates, L1 speakers showed earlier predictions than L2 speakers. This suggests that L2 speakers may make slightly later first-stage predictions (relative to L1 speakers) at faster speech rates. Additionally, L2 speakers seem to have an "optimal" speech rate of up to 4.5 syllables per second in this stage (faster than that found by Fernandez et al., 2020). During the second stage, L1 speakers showed a clear increase in prediction as speech rate increased, while L2 speakers showed a later, reduced and more variable pattern relative to L1 speakers. This may reflect a wait-and-see approach: when prediction is more costly, L2 speakers may wait for more information before committing to a prediction. This supports the literature arguing that second-stage prediction is more costly and has a greater impact on speakers with fewer available cognitive resources, and it highlights the importance of choosing an appropriate speech rate in VWP research.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S1366728925100515.
Data availability statement
The data and statistical code that support the findings of this study are openly available in OSF at https://osf.io/4ke82/.