Learning a second or additional language (L2/A) in adulthood is a difficult task that entails a great amount of individual variability (e.g., Li et al., 2022). Among the different factors that contribute to the varying success in adult L2/A learning outcomes, individual differences in cognitive abilities have been shown to play a significant role. One cognitive ability that has received considerable attention in L2/A research is working memory (WM), which is the short-term memory system responsible for temporarily storing and processing information while executing cognitive tasks (Baddeley & Hitch, 1974; Cowan, 2005).
A number of cross-sectional studies have provided evidence of a robust, positive association between WM and L2/A learning and processing abilities for both receptive and productive tasks (e.g., Wen & Jackson, 2022). Longitudinal studies examining whether and how individual differences in WM account for variability in L2/A learning, however, are quite limited in number and reach inconsistent results. This is true for both receptive and productive L2/A abilities, although the role of WM seems more prominent for productive abilities over time than for receptive abilities (e.g., Linck et al., 2014). Research has also tended to focus on beginning learners with more form-focused tasks. Thus, the current study aims to contribute to this body of research by examining the role of WM over time for a receptive (acceptability judgment) task as well as a meaning-focused productive (information gap) task, both of which target the same morphosyntactic structure (grammatical gender agreement). In discussing our results, we also offer a post hoc, exploratory analysis of associations between WM and the neurocognitive processing of L2/A gender agreement, with considerations about how further interdisciplinary research across linguistics, psychology, and cognitive neuroscience can deepen our understanding of the variability in L2/A learning outcomes.
Literature review
WM in L2/A abilities
WM is a central component of human cognition, responsible for temporarily storing and processing a small amount of information while executing cognitive tasks, including reasoning, problem solving, and learning (Baddeley, 2000; Baddeley & Hitch, 1974; Cowan, 2005). Over the past decade, WM has received increasing attention among second language acquisition researchers, who are interested in determining whether and how WM impacts L2/A learning. A substantial number of cross-sectional studies report a robust positive association between WM and L2/A learning. For instance, a meta-analysis by Linck et al. (2014) evidenced a small but significant positive association between WM and L2/A receptive (ρ = .242) and productive (ρ = .267) language skills. However, longitudinal studies examining whether and how individual differences in WM capacity predict L2/A learning are limited, and findings are inconsistent. The following subsections critically examine longitudinal research on the role of WM in receptive and productive L2/A abilities. We focus on studies that administered complex WM tasks that require both storage and processing of information, consistent with the definition of WM above (Daneman & Carpenter, 1980; Oswald et al., 2015; Turner & Engle, 1989).
Working memory and receptive L2/A abilities over time
Receptive L2/A ability refers to a learner’s skill in reading and listening to the target language. To date, a small number of longitudinal studies have investigated the role of WM in receptive L2/A abilities, yielding inconclusive results (Faretta-Stutenberg & Morgan-Short, 2018b; Grey et al., 2015; Lado, 2017; Santamaria & Sunderman, 2015; Sanz et al., 2016; Serafini & Sanz, 2016). Among these, two laboratory-based studies of Latin learning report positive associations between WM and receptive L2/A abilities over time. In Sanz et al. (2016), beginner learners of Latin completed input-based practice and received metalinguistic feedback; half of the learners additionally received a metalinguistic grammar lesson prior to beginning practice. WM was assessed via a listening span task, and participants completed written and aural interpretation tasks and a grammaticality judgment task (GJT) to assess assignment of semantic functions to noun phrases in Latin. Results indicated that WM significantly predicted gains in interpreting both written and aural input for learners who did not receive metalinguistic instruction; no such association was found for learners in the metalinguistic instruction group. Lado (2017) utilized the same three receptive tasks at three timepoints (pretest, immediate posttest, and delayed/retention posttest 2 weeks after training). Here, in addition to input-based practice with corrective feedback, half of the learners of Latin received metalinguistic feedback regarding case-marking cues to assign semantic roles to nouns. WM was assessed via a reading span task, and results revealed a positive association between WM and GJT performance on the immediate and delayed posttest exclusively for the group that received metalinguistic feedback. No relationship between WM and interpretation task performance was found for either group.
The remaining longitudinal studies report no consistent relationships between WM and receptive L2/A abilities across participants in different contexts (study abroad, classroom, laboratory – including the metalinguistic instruction condition in Sanz et al., 2016 and the no-metalinguistic feedback group in Lado, 2017) and different proficiency levels. For example, Faretta-Stutenberg and Morgan-Short (2018b) examined intermediate learners of Spanish over the course of one semester, divided into two groups: classroom learners and study-abroad learners. Participants completed an acceptability judgment task (AJT) targeting Spanish word order (while their EEGs were recorded), and WM was measured using a battery of tasks, including the operation span, reading span, and symmetry span tasks. Results revealed no significant relationship between WM capacity and learning gains on the AJT over time in either instructional context. Similarly, Grey et al. (2015) examined advanced learners of Spanish in a study-abroad context, focusing on both grammatical and lexical development throughout a 5-week program. WM was measured via a listening span task, and learning outcomes were assessed using a GJT and a lexical decision task, targeting word order and noun–adjective gender and number agreement. As in the previous study, findings revealed no significant association between WM and gains in grammatical or lexical knowledge over time. Further null effects were reported by Santamaria and Sunderman (2015), who investigated the role of WM in learning French clitic pronouns among beginner classroom learners over 2 weeks, using a picture-matching task and a reading span test. Once again, no relationship was found between WM capacity and receptive grammatical learning over time. Last, Serafini and Sanz (2016) examined beginner, intermediate, and advanced learners of Spanish over the course of one semester across various (morpho)syntactic structures using a GJT and an operation span task. In line with the studies above, WM was not significantly associated with learning gains on the GJT, regardless of learners’ proficiency levels.
In summary, findings from longitudinal studies examining the role of WM in receptive L2/A grammatical learning suggest that WM may play a facilitative role in short-term longitudinal development for absolute beginner learners trained using input-based practice and with corrective feedback (Sanz et al., 2016) and reactive metalinguistic feedback (Lado, 2017). In contrast, WM appears to have limited influence when absolute beginner learners are provided with pre-practice metalinguistic instruction (Sanz et al., 2016), potentially because such explicit conditions reduce the cognitive demands that are associated with form interpretation. Regarding proficiency, results appear to be consistent in finding null effects of WM at intermediate and advanced levels of proficiency (Faretta-Stutenberg & Morgan-Short, 2018b; Grey et al., 2015; Serafini & Sanz, 2016), whereas results for beginner learners remain mixed (positive effect for some conditions in Lado, 2017; Sanz et al., 2016; no effect in Lado, 2017 for the no metalinguistic feedback group, Santamaria & Sunderman, 2015; Sanz et al., 2016 for metalinguistic instruction group; Serafini & Sanz, 2016). However, the limited number of studies, combined with heterogeneous methodologies, task types, and learner proficiency levels, precludes definitive conclusions.
Working memory and productive L2/A abilities over time
Productive L2/A ability refers to a learner’s skill in speaking and writing the target language. As with receptive L2/A abilities, longitudinal research on the role of WM in L2/A productive abilities is scarce. To date, four studies of classroom learners have investigated this relationship (Biedroń et al., 2022; Sagarra, 2017; Santamaria & Sunderman, 2015; Serafini & Sanz, 2016). All reported evidence of a positive association between WM and L2/A grammatical development over time for beginner learners, but results were mixed for intermediate and advanced learners.
Sagarra (2017) investigated the acquisition of various Spanish (morpho)syntactic features (e.g., subject–verb agreement, adjective–noun number and gender agreement, and prepositions) over the course of one semester among beginner learners using a cloze task. WM (assessed via a reading span test) was found to positively modulate L2/A grammar development, as learners with higher WM capacity demonstrated significantly greater learning gains over one semester than those with lower WM. Similarly, Santamaria and Sunderman (2015) found that while WM capacity did not significantly predict receptive L2/A abilities over time, it did significantly predict productive learning gains among beginner learners in an explicit productive task (fill-in-the-blanks activity) involving French clitics.
Serafini and Sanz (2016) explored the role of WM in L2/A grammatical development using an elicited oral imitation task containing half grammatical, half ungrammatical sentences to assess productive grammatical development across different levels of proficiency. Consistent with previous research, although no significant relationship was found between WM and receptive abilities over time in this study, WM was positively associated with productive grammatical development among beginner learners. In contrast, negative relationships between WM and productive ability gains emerged for intermediate and advanced learners.
Biedroń et al. (2022) also examined intermediate learners, investigating the relationship between WM (measured using both a reading span and a listening span task) and productive grammatical development (assessed through multiple-choice, fill-in-the-blank, word paraphrasing, and cloze tasks) over a 6-month period. Results revealed that the listening span, but not the reading span task, significantly predicted grammatical development, though with a small effect size. In contrast with the intermediate learners in Serafini and Sanz (2016), Biedroń et al.’s (2022) results indicated that WM was a weak but statistically reliable predictor of productive L2/A abilities.
In summary, longitudinal research on WM and productive L2/A grammatical abilities suggests that WM may facilitate grammatical development at earlier stages of proficiency (Biedroń et al., 2022; Sagarra, 2017; Santamaria & Sunderman, 2015; Serafini & Sanz, 2016), but its influence may lessen as proficiency increases (Serafini & Sanz, 2016). Importantly, the available evidence is limited to four classroom-based studies that employed non-communicative productive tasks (e.g., multiple-choice, cloze, fill-in-the-blank, and elicited oral imitation tasks). These methodological constraints, combined with the limited evidence base, underscore the need for further research in this area.
Working memory and L2/A processing over time
L2/A processing refers to moment-by-moment computations of language during real-time comprehension (VanPatten & Jegerski, 2010). Longitudinal research investigating the role of WM in L2/A processing is very limited, with, to our knowledge, only one study to date directly addressing this question. Faretta-Stutenberg and Morgan-Short (2018b) examined intermediate learners of Spanish in both classroom and study-abroad settings, focusing on their neural responses to word order violations. Although behavioral measures revealed no significant effects of WM, ERP data indicated that WM was positively associated with morpho-syntactic processing changes among study-abroad, but not classroom, learners. These findings suggest that the relationship between WM and L2/A processing over time may be context-dependent, emerging in immersive environments but not in traditional classroom settings. However, given that this conclusion is based on a single study with a relatively small sample size (N = 30), it should be approached cautiously and should not be generalized without further research.
Synthesis of WM in longitudinal studies
The longitudinal evidence on WM and grammatical L2/A development over time is limited but informative. Existing studies suggest that the long-term contributions of WM to L2/A grammatical development may vary depending on learners’ proficiency level, the type of linguistic ability measured, and the instructional context. Overall, WM appears to facilitate grammatical learning most consistently at early stages of proficiency (e.g., Serafini & Sanz, 2016). Its effect seems to be stronger for productive abilities in form-focused tasks (e.g., Sagarra, 2017; Santamaria & Sunderman, 2015) and, to a lesser extent, for receptive abilities under input-based practice conditions (Lado, 2017; Sanz et al., 2016). Additionally, WM may support morpho-syntactic processing in study-abroad contexts (Faretta-Stutenberg & Morgan-Short, 2018b), though evidence here is very limited.
Despite these emerging patterns, several gaps in the literature remain. Most existing studies have focused on beginner-level learners in classroom or laboratory environments, with relatively little evidence for intermediate or advanced learners, or learners in study abroad contexts. Furthermore, outcome measures have predominantly emphasized form-focused tasks, which may not adequately reflect learners’ ability to use L2/A grammar in communicative, real-world contexts. Robust longitudinal research that addresses these open areas is needed to elucidate whether, how, and for whom WM influences L2/A grammatical development and processing.
Working memory and grammatical gender agreement processing
In addition to the longitudinal research presented, we turn now to cross-sectional studies that have explored relationships between WM and the morphosyntactic linguistic target in this study: grammatical gender agreement. This structure represents a proxy for morpho-syntactic development and was selected in large part because it facilitated examination of productive abilities in a meaning-focused task, as well as online processing of a structure that is present in the L2/A but absent in the L1. The existing cross-sectional work examining WM and grammatical gender agreement abilities also incorporates processing measures (self-paced reading, eye-tracking), providing valuable interdisciplinary insights and points of comparison with the current study.
Sagarra (2007) found evidence of sensitivity to gender violations in self-paced reading among novice learners of Spanish with high WM, but not low WM. The beginner learners in Sagarra and Herschensohn (2010), conversely, did not evidence sensitivity to gender violations in GJT or self-paced reading measures (the authors note a lack of variability in WM scores within this group, making it difficult to determine whether relationships between WM and receptive abilities or processing might otherwise have emerged). For intermediate learners in the same study, WM was associated with greater sensitivity to gender violations. Similar results were obtained by Durand-López (2024), who investigated relationships between WM and processing of gender agreement violations within and across noun phrases among beginner and advanced L2/A learners of Spanish via a GJT and a self-paced reading task. Beginner learners did not show sensitivity to gender disagreement regardless of distance. Advanced learners, however, showed sensitivity to grammatical gender violations in local domains, and higher WM advanced learners were also sensitive to gender disagreement in the more taxing structural distance condition. Keating (2010) also explored gender agreement among advanced L2/A Spanish speakers, examining eye-tracking data for three linear distance conditions (one, four, and seven words between noun and adjective). Individual differences in WM correlated positively with sensitivity to gender agreement errors, suggesting a role for WM among these advanced learners.
Cross-sectional work seems to suggest positive effects for WM among lower proficiency learners, who may need to devote WM resources to morphosyntactic features in the target language (Sagarra, 2007), but perhaps not among beginner learners who have not yet developed any sensitivity to such violations (Durand-López, 2024; Sagarra & Herschensohn, 2010). As proficiency increases, and with it, sensitivity to morphosyntactic violations, the role of WM may change and be devoted to higher level tasks (Durand-López, 2024; Keating, 2010; Sagarra & Herschensohn, 2010). As with the longitudinal studies, research on this topic remains limited, with each of these investigations utilizing distinct methodologies to match specific research questions. Given the limited and heterogeneous evidence base, expanded investigations into the relationship between WM and grammatical gender agreement are needed to reach generalizable conclusions.
Research questions
Although prior research on this topic is relatively limited, existing studies provide an important foundation and valuable insights. Building on these findings, this study addresses the following research questions:
1. Does WM predict L2/A receptive abilities for Spanish grammatical gender agreement over time?
2. Does WM predict L2/A productive abilities for Spanish grammatical gender agreement over time?
Given the compelling results from cross-sectional work examining relationships between WM and grammatical gender agreement processing, we also present a post hoc exploratory analysis that examines the relationship between WM and neural responses for receptive abilities.
Methods
Participants
Participants were native speakers of English studying Spanish as an L2/A at the university level. Participants were screened for language history and experience and had only classroom experience with Spanish, with no significant immersion experience prior to beginning the study. Curricular level was used as the participant selection criterion. During the semester of study, participants were enrolled in at least one intermediate/advanced Spanish course (i.e., equivalent of fifth semester or above, including advanced grammar review, introductory linguistics, and literary analysis) at either their home university (n = 21) or in study-abroad programs in Spanish-speaking countries (n = 15). Although curricular level and background experience were the sole eligibility criteria, participants completed two objective measures of proficiency (details reported in Supplementary Materials Table S1); their proficiency levels fell within a low-intermediate range with significant variability, which is characteristic of intermediate-level learners. Participant background information is presented in Table 1 (note that no significant differences between the at-home and study-abroad groups were observed in any of the mixed effects models; see Supplementary Materials “Analysis” for details).
Table 1. Participant background information

1 All participants were native speakers of English. Some participants reported more than one native language, none of which encoded grammatical gender agreement.
Procedure and tasks
Participants completed three experimental sessions in a research laboratory. The first session was a cognitive assessment that included three measures of WM (described in the following section) as well as a language background questionnaire. On a subsequent date, participants completed a baseline language assessment session, which included measures of their receptive and productive abilities in Spanish. At the end of the semester (approximately 15 weeks after the baseline session), participants completed an identical language assessment session using different, counterbalanced versions of the language tasks. All three sessions included additional measures that fall outside the scope of the present research questions and analyses. Participants provided informed consent at each session and were compensated monetarily for their time.
Working memory tasks
Automated, computerized versions of three externally validated and reliable WM measures were employed during the cognitive assessment session: Operation Span, Reading Span, and Symmetry Span tasks (OSpan, RSpan, SymSpan; Unsworth et al., 2005). Each task requires around 25 minutes to complete and includes both a processing and a storage component. To obtain a comprehensive measure of WM abilities, a composite WM score, based on the overall score (“absolute” scoring, sum of perfect trials, including judgment accuracy and recall), was calculated and used in analyses (Conway et al., 2005).
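To make the “absolute” scoring rule concrete, the following is a minimal sketch, assuming the common reading of that rule in which a set contributes its full size to the score only when every item is recalled in the correct serial position. The function name and the letter sets are hypothetical, and the sketch covers only the recall component, not the processing (judgment accuracy) component mentioned above.

```python
from typing import Sequence


def absolute_span_score(presented: Sequence[Sequence[str]],
                        recalled: Sequence[Sequence[str]]) -> int:
    """Sum the sizes of all sets that were recalled perfectly and in order."""
    score = 0
    for target, response in zip(presented, recalled):
        # A set earns credit only if the full sequence matches exactly.
        if list(target) == list(response):
            score += len(target)
    return score


# Hypothetical data: three sets of to-be-remembered letters.
presented = [["F", "K", "P"], ["Q", "R", "S", "T"], ["H", "J", "L", "N", "Y"]]
recalled = [["F", "K", "P"], ["Q", "S", "R", "T"], ["H", "J", "L", "N", "Y"]]

score = absolute_span_score(presented, recalled)
print(score)  # 3 + 0 + 5 = 8 (the second set has two letters transposed)
```

Under this rule, a single transposition forfeits the entire set, which is why absolute scores are typically lower than partial-credit scores for the same responses.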
In the OSpan, participants see and solve a math operation (processing component, e.g., (1*2) + 1 = ?), then indicate whether their answer matches the number provided on the next screen; following this response, a screen with a single letter appears, followed by another math operation, solution, and letter. After a set of math operations and letters has been presented (ranging from 3 to 7 trials), the participant is asked to recall each letter that appeared during that set, in the order of presentation (storage). The RSpan follows identical procedures, replacing the math operation with a sentence judgment of logical and illogical sentences (e.g., The prosecutor’s dish was lost because it was not based on fact.). Both tasks include a total of 75 trials. In the SymSpan, participants are presented with an image (8 × 8 square matrix with some cells filled in black to form an image) and asked to provide a symmetry judgment (This image is symmetrical: true/false) for the processing component. Following their judgment, participants are presented with a 4 × 4 matrix in which one cell is filled in red. After completing a set (ranging from 2 to 5 trials), participants are prompted to recall the location of each red square, in order of presentation. The task comprises a total of 42 trials. Across the three tasks, the reliability for WM responses based on Cronbach’s alpha was .82, showing high reliability. Performance data for each task and correlations between WM tasks are provided in Supplementary Materials Tables S2 and S3.
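For readers less familiar with the reliability index, the sketch below shows one standard way to compute Cronbach’s alpha, treating each WM task as an “item” and using sample variances. This is an illustration under those assumptions, not a reconstruction of the computation reported here; the function name and the scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list with one score per participant.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(items)
    n = len(items[0])
    # Total score per participant across all items.
    totals = [sum(item[p] for item in items) for p in range(n)]
    sum_item_var = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))


# Hypothetical scores for five participants on three tasks.
ospan = [40, 55, 62, 30, 70]
rspan = [38, 50, 60, 35, 66]
symspan = [20, 28, 30, 15, 33]

alpha = cronbach_alpha([ospan, rspan, symspan])
print(round(alpha, 2))
```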
Information gap activity
During each language assessment session, participants completed a communicative, information gap (Info Gap) activity designed to elicit production of gender agreement on articles and adjectives. Participants were given a large picture of a kitchen containing 12 items of varying sizes and colors (four versions, each containing half feminine, half masculine nouns, were created with the order of administration counterbalanced across participants, and different versions completed by each participant at the baseline and follow-up language assessment sessions). The researcher was seated near the participant with an identical (empty) kitchen image, along with nine options for each item in the participant’s kitchen (each item in three colors and sizes), placed in loose piles on a shared table. Participants were instructed to describe their kitchen aloud to the researcher, using complete sentences in Spanish, so that the researcher could place items in her kitchen to make it match the participant’s kitchen. When necessary, the researcher prompted the participant to provide more information about a particular item without providing any clues regarding the grammatical gender of the item (e.g., ¿Cuál gato? “Which cat?”). The task structure prevented avoidance strategies on adjectives, as participants needed to produce (minimally) one size adjective and one color adjective for the researcher to select the appropriate item among the nine versions to create a matching kitchen (details regarding participant attempts and accuracy by session are provided in Supplementary Materials Table S4). No feedback on production accuracy was provided at any point during the task.
To control for vocabulary knowledge, participants were provided with a vocabulary sheet containing the names and pictures of all 16 possible target nouns and a list of colors, size words, copular verbs, prepositions (above, below, etc.), and the names of non-target items in the kitchen (counter, table, etc.). To avoid gender agreement priming, all nouns on the vocabulary sheet (target items and general kitchen items) were presented without articles. Target nouns were presented in Spanish only, under a picture of the item. Non-target kitchen items as well as adjectives were presented in English with the Spanish equivalent (e.g., counter – encimera). The activity was recorded using a digital voice recorder. Recordings were transcribed; use and accuracy of articles and gender-marked adjectives for each of the 12 target items were coded by two independent raters to obtain proportion correct for each participant (inter-rater reliability = 100%). Full details of the task, with protocol and sample scoring sheet, are available on the Open Science Framework (https://osf.io/867yu).
Acceptability judgment task
Participants also completed an AJT during each language assessment session. The AJT asked participants to read sentences on a computer screen and indicate via mouse click whether each was “good” or “bad” in Spanish while EEG data were recorded (more details on EEG processing data are provided in the Discussion). Stimuli were designed to include vocabulary typically covered during the first three semesters of university-level Spanish study. The AJT included five violation types: two experimental conditions (article-noun and noun-adjective grammatical gender agreement) and three distractor conditions (subject–verb agreement, semantic, and phrase structure). Two equivalent versions of the task were created with 300 sentences each (half correct, evenly distributed across the five conditions, such that neither version contained both the correct and violation version of the same sentence). The task was divided into five blocks so that participants could take short breaks during the task. All participants completed both versions of the task, with the order of task version presentation counterbalanced across participants. Across both task versions, the Spearman–Brown split-half reliability for AJT responses was .95 for articles and .93 for adjectives, showing high reliability.
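As an illustration of the reliability estimate, the sketch below computes a split-half correlation and applies the Spearman–Brown correction, r_SB = 2r / (1 + r). The odd/even split and the response data are assumptions made for the example; the source does not state how the halves were formed.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per participant (odd/even split)."""
    first = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    second = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(first, second))


# Hypothetical accuracy data: four participants, six items each.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
reliability = split_half_reliability(responses)
print(round(reliability, 2))
```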
The experimental conditions were designed to assess participant sensitivity to grammatical gender agreement violations on articles and adjectives. Target nouns were evenly distributed between masculine/feminine, were all inanimate, and all had canonical endings (-o/-a). In the article condition, agreement violations were created by replacing the singular, definite article (e.g., la_fem calculadora_fem versus *el_masc calculadora_fem “the calculator”). In the adjective condition, agreement violations were created by changing the gender of an attributive adjective that appears directly following the target noun (e.g., calculadora_fem rota_fem versus calculadora_fem *roto_masc “broken calculator”). Sample sentences from the experimental conditions are provided in Table 2; a full list of stimuli for the AJT is available on the Open Science Framework (https://osf.io/867yu).
Table 2. AJT experimental stimuli

Note: Bold typeface marks the critical word, that is, the word where violation becomes evident in each sentence. The word that constitutes the violation is indicated with an asterisk (*). For demonstration purposes, the target noun is underlined. Adjective condition sentences contained the invariable third-person possessive determiner “su” instead of a gender-marked article to avoid providing an additional gender cue.
Analysis
First, for WM measures, a composite score was calculated by averaging z-scores from the overall performance on all three tasks – OSpan, RSpan, and SymSpan. Participants performed reasonably well on the tasks, and scores varied across participants (see Supplementary Materials Table S2), indicative of individual differences in WM abilities. Next, for receptive L2/A abilities, accuracy on the AJT was analyzed across both sessions. Mean accuracy was calculated to examine above-chance performance on each linguistic condition – article masculine, article feminine, adjective masculine, and adjective feminine. Last, productive L2/A abilities were assessed through participant accuracy in gender agreement for articles and adjectives during the information gap activity. Production accuracy was calculated by dividing the total number of correct articles or adjectives by the total number of attempted articles or adjectives for each session. As with the AJT, mean accuracy was computed to examine above-chance performance for each linguistic condition.
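The two scoring steps described above can be sketched as follows. This is a minimal illustration, assuming z-scores are computed with the sample standard deviation (the source does not specify) and using hypothetical function names and data.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize scores using the sample mean and sample SD (an assumption)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite_wm(ospan, rspan, symspan):
    """Average each participant's z-scores across the three span tasks."""
    standardized = [z_scores(task) for task in (ospan, rspan, symspan)]
    return [mean(triple) for triple in zip(*standardized)]


def production_accuracy(n_correct, n_attempted):
    """Proportion of attempted gender-marked forms that were correct."""
    if n_attempted == 0:
        return None  # no attempts, so accuracy is undefined
    return n_correct / n_attempted


# Hypothetical span scores for three participants.
composite = composite_wm([10, 20, 30], [5, 10, 15], [2, 4, 6])
print(composite)

# Hypothetical session: 9 correct adjectives out of 12 attempted.
print(production_accuracy(9, 12))
```

Standardizing before averaging keeps a task with a larger raw score range (e.g., 75 trials versus 42) from dominating the composite.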
To address our research questions, two logistic mixed-effects models were fitted to examine the effects of WM on the development of L2/A Spanish grammatical gender agreement over time among intermediate learners. The dependent variable was accuracy on gender agreement for either receptive or productive abilities, and our key predictor variable was WM. Additional predictor variables included Session (1 and 2), Gender (masculine and feminine), and Linguistic Form (article and adjective). We also included Group (study abroad, at-home) as a control variable. To determine the final model structure to answer our research questions, we followed recommended procedures for selecting both random and fixed effects (Bates et al., 2015; Matuschek et al., 2017). Detailed model specification and selection procedures are provided in the Supplementary Materials “Analysis,” and results from the best-fitting models are reported in the Results section below.
Results
RQ1: WM and receptive L2/A abilities over time
Prior to conducting the analysis for our first research question related to WM and L2/A receptive abilities, we examined performance on the AJT (Table 3). Descriptively, participants performed above chance overall, although not consistently for each condition. Judgment accuracy was slightly better at Session 2 than at Session 1 for masculine articles (difference in accuracy: M = 0.07, SD = 0.28, range = −0.60–0.87) and feminine adjectives (M = 0.01, SD = 0.25, range = −0.60–0.77). There were no clear group-level improvements for masculine adjectives (M = 0.00, SD = 0.25, range = −0.57–0.73) or feminine articles (M = 0.00, SD = 0.27, range = −0.83–0.93). Importantly, though, an examination of minimum and maximum change scores suggests substantial variability among participants for all conditions. For performance on the distractor conditions, see Supplementary Materials Table S5.
Table 3. Performance on AJT – receptive task

Results from the logistic mixed-effects model used to address our first research question revealed a significant main effect of Linguistic Form (b = .266, SE = .099, z = 2.68, p = .007, OR = 1.305), indicating that participants were more likely to respond accurately to articles than to adjectives. As illustrated in Figure 1a, the estimated probability of a correct response was approximately 57% for articles and 50% for adjectives. The full model is provided in Supplementary Materials Table S6.
Figure 1. Significant model results for receptive L2/A abilities.

Additionally, there was a significant interaction between WM and Linguistic Form (b = .196, SE = 0.095, z = 2.053, p = .04, OR = 1.216), revealing that the effect of Linguistic Form on accuracy varied as a function of WM (see Figure 1b). Post hoc comparisons using estimated marginal means showed that for participants with lower WM, accuracy did not differ significantly between the article and adjective conditions (OR = 1.12, SE = 0.138, z = .886, p = .376). However, for participants with higher WM, the odds of a correct response were significantly higher in the article condition than in the adjective condition (OR = 1.53, SE = 0.194, z = 3.340, p < .001).
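For readers less familiar with logistic models, the following sketch shows how the statistics reported above relate to one another: the odds ratio is the exponentiated coefficient, the Wald z is the coefficient divided by its standard error, and the two-sided p value follows from the standard normal distribution. The function names are illustrative, and small discrepancies from the reported values are expected because the inputs here are the rounded figures from the text.

```python
import math


def odds_ratio(b):
    """Odds ratio implied by a log-odds coefficient."""
    return math.exp(b)


def wald_z(b, se):
    """Wald z statistic: coefficient divided by its standard error."""
    return b / se


def two_sided_p(z):
    """Two-sided p value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Rounded values reported for the receptive model.
form_or = odds_ratio(0.266)          # main effect of Linguistic Form
form_z = wald_z(0.266, 0.099)
interaction_or = odds_ratio(0.196)   # WM x Linguistic Form interaction
interaction_z = wald_z(0.196, 0.095)

print(round(form_or, 3), round(form_z, 2), round(two_sided_p(form_z), 3))
print(round(interaction_or, 3), round(interaction_z, 2),
      round(two_sided_p(interaction_z), 3))
```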
RQ2: WM and productive L2/A abilities over time
Before answering our second research question regarding WM and L2/A productive abilities, we examined performance on the information gap activity (Table 4). Descriptive results indicated that across both sessions, participants demonstrated higher accuracy for masculine forms (both articles and adjectives) than for feminine forms, for which accuracy was descriptively lower for adjectives than for articles.
Table 4. Performance on information gap activity – productive task

Overall, production accuracy was slightly better at Session 2 than at Session 1 for masculine articles (difference in accuracy: M = 0.02, SD = 0.17, range = −0.50–0.63) and for feminine articles (M = 0.03, SD = 0.25, range = −0.71–0.50) and adjectives (M = 0.11, SD = 0.26, range = −0.54–0.82), but not for masculine adjectives (M = −0.02, SD = 0.14, range = −0.42–0.36). However, an examination of minimum and maximum change scores suggests substantial variability among participants for all conditions.
Results from the logistic mixed-effects model used to address our second research question revealed significant main effects and a significant interaction (the full model is provided in Supplementary Materials Table S7). First, there was a significant main effect of Session (b = −.297, SE = 0.139, z = −2.135, p = .033, OR = 0.743), indicating that learners were more likely to be accurate in Session 2 than in Session 1. As illustrated in Figure 2a, the estimated probability of a correct response increased from 83% in Session 1 to 86% in Session 2.
Figure 2. Significant model results for productive L2/A abilities.

Second, findings showed evidence of a significant main effect of Gender (b = 2.153, SE = 0.121, z = 17.73, p < .001, OR = 8.613). Holding all other variables constant, participants were significantly more likely to produce accurate gender-marked forms when the target noun was masculine than when the target noun was feminine. Specifically, the estimated probability of a correct response was 94% for masculine nouns and 65% for feminine nouns, as illustrated in Figure 2b.
Third, there was a significant main effect of Linguistic Form (b = .603, SE = 0.119, z = 5.057, p < .001, OR = 1.827), indicating that participants produced articles with higher accuracy than adjectives. As illustrated in Figure 2c, the predicted probability of a correct response was 88% for articles and 80% for adjectives.
Last, there was a significant interaction between Gender and Linguistic Form (b = −1.487, SE = 0.237, z = −6.27, p < .001, OR = 0.226), suggesting that the effect of gender differed across linguistic forms (see Figure 2d). Post hoc comparisons revealed that for article production, participants were significantly more likely to produce accurate, gender-marked forms when the target noun was masculine than when the target noun was feminine (OR = 4.1, SE = 0.71, z = 8.136, p < .001). This gender effect was substantially larger for adjectives, where participants were even more likely to use correctly gender-marked adjective forms for masculine nouns than for feminine nouns (OR = 18.1, SE = 3.01, z = 17.440, p < .001).
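As a rough consistency check on the values reported above, the sketch below relates the post hoc odds ratios to the model coefficients. It assumes centered (for example, ±0.5) contrast coding of Gender and Linguistic Form, which the text does not state. Under that assumption, the interaction odds ratio corresponds to the ratio of the two simple-effect odds ratios, and the main-effect odds ratio for Gender corresponds to their geometric mean. Variable names are illustrative, and the inputs are the rounded values from the text.

```python
import math

# Post hoc gender effects (masculine vs. feminine) reported in the text.
or_article = 4.1
or_adjective = 18.1

# Assumption: centered contrast coding. The interaction is then the
# difference of the two simple effects on the log-odds scale, that is,
# the ratio of the simple-effect odds ratios on the odds scale.
interaction_or = or_article / or_adjective
interaction_b = math.log(interaction_or)

# Under the same assumption, the main effect of Gender is the average
# of the two simple effects on the log-odds scale (geometric mean of ORs).
gender_b = (math.log(or_article) + math.log(or_adjective)) / 2
gender_or = math.exp(gender_b)

print(round(interaction_or, 3), round(interaction_b, 3))
print(round(gender_or, 2), round(gender_b, 3))
```

The results land close to the reported interaction (b = −1.487, OR = 0.226) and main effect of Gender (b = 2.153, OR = 8.613), with the small differences attributable to rounding in the published post hoc odds ratios.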
Discussion
The current study aimed to examine the longitudinal role of WM in both receptive and productive L2/A abilities, targeting the same morphosyntactic structure: Spanish grammatical gender agreement. First, with regard to receptive abilities, participants demonstrated very small descriptive improvements in accuracy over time (which varied among participants), although these gains did not reach statistical significance at the group level. Importantly, WM was not found to play a significant role in learning of L2/A Spanish grammatical gender agreement over time. However, results suggest that WM played a role overall when it interacted with linguistic form (regardless of time): learners with higher WM capacity were more likely to be accurate on articles than on adjectives, whereas learners with lower WM capacity showed no difference in accuracy between these forms.
Although it is possible that the group-level change in accuracy on the receptive task was too small for any individual difference to explain (as pointed out by an anonymous reviewer), these findings are largely consistent with previous longitudinal studies. The limited extant literature reports null relationships between WM and receptive L2/A abilities across various (a) learning contexts (e.g., study-abroad: Faretta-Stutenberg & Morgan-Short, 2018b; Grey et al., 2015; classroom: Faretta-Stutenberg & Morgan-Short, 2018b; Santamaria & Sunderman, 2015; Serafini & Sanz, 2016), and (b) proficiency levels (beginner: Santamaria & Sunderman, 2015; Serafini & Sanz, 2016; intermediate: Faretta-Stutenberg & Morgan-Short, 2018b; Serafini & Sanz, 2016; advanced: Grey et al., 2015; Serafini & Sanz, 2016). Indeed, only two previous studies report a role for WM over time for receptive L2/A abilities, both examining absolute beginners trained in a laboratory setting under specific training conditions (Lado, 2017; Sanz et al., 2016).
More generally, when considering overall learning measured cross-sectionally, our results provide nuance to previous meta-analytic findings. Whereas Linck et al. (2014) reported a small but significant positive association between WM and receptive L2/A learning, our results show that this role may be constrained to certain linguistic forms, given that higher WM predicted higher accuracy for agreement on articles than on adjectives. Future research that examines the role of WM for additional linguistic structures will continue to refine our understanding of these relationships.
Second, regarding productive abilities, participants showed descriptive gains over time, which were somewhat more pronounced for gender agreement with feminine nouns. Analyses confirmed that these gains were statistically significant, with a small effect size, indicating modest but meaningful improvement over time. Similar to our findings for receptive L2/A abilities, these gains did not interact with WM. Indeed, no role for WM in productive L2/A learning was observed.
Previous longitudinal studies on WM and L2/A productive abilities have reported a positive role for WM for beginning learners (Sagarra, 2017; Santamaria & Sunderman, 2015; Serafini & Sanz, 2016). For intermediate-level learners (the proficiency level of participants in this study), one study reported a limited, positive relationship between WM and productive performance on a more constrained productive task (Biedroń et al., 2022) and another reported a negative relationship on a more demanding, time-pressured, and constrained productive task (Serafini & Sanz, 2016; although see author note about the “extremely low sample size” in this group, p. 627). Whereas previous studies employed non-meaning-focused tasks, our present task design emphasized communicative goals and spontaneous language production. Taken together, these findings suggest that WM effects may be proficiency- and task-dependent, with a facilitative role for WM emerging most clearly at lower stages of proficiency and under conditions where learners have time to process and manipulate linguistic forms more consciously.
Our results also contrast with findings from studies that have examined the role of WM in production more broadly. For example, Linck et al. (2014) reported a small but significant positive association between WM and L2/A productive outcomes across time. Importantly, although in their meta-analysis they found a role of WM for both high- and low-proficiency learners, they noted that the role of WM across different proficiency levels remains under-researched.
Interestingly, only one previous study to our knowledge has considered both L2/A reception and production with the same linguistic targets among intermediate learners. Serafini and Sanz (2016) did not find a role for WM for receptive abilities and reported a negative relationship with change in productive abilities. In contrast, most previous work and meta-analyses have treated WM as a general predictor of L2/A success, without distinguishing between receptive and productive outcomes. This lack of specificity has made cross-study comparisons challenging and may obscure ability-specific effects of WM. Future research would benefit from more integrated designs that assess both linguistic skills.
The behavioral findings reported here underscore a recurring challenge in the study of individual differences in L2/A learning, as they reveal performance, but do not provide insights into processing in real time. An interdisciplinary approach that integrates applied linguistics, neurolinguistics, and cognitive psychology is a powerful way to disentangle the role that WM plays in the L2/A learning process. In addition to the cross-sectional studies described that provide evidence of relationships between WM and processing of grammatical gender agreement, our previous work with a subset of participants from this same longitudinal study has shown meaningful individual variability in neurocognitive processing. Specifically, we observed that WM influences syntactic processing – but not behavioral outcomes – of Spanish phrase structure, with study-abroad learners exhibiting different ERP patterns as a function of their WM capacity (see Faretta-Stutenberg & Morgan-Short, 2018b).
More relevant to the current study, we have also previously reported ERP effects for L2/A Spanish gender agreement (Faretta-Stutenberg & Morgan-Short, 2018a). In that study, at-home learners did not elicit significant ERP effects at either session for articles or adjectives. In contrast, study-abroad learners elicited a significant fronto-central negativity followed by a posterior P600 for article violations in Session 2, although no significant effects were elicited for articles in Session 1 or adjectives at either time point. However, WM was not examined as an individual difference across sessions and groups in that study.
To address this gap, we conducted a post hoc exploratory analysis examining the relationship between WM and ERP responses for receptive abilities at Sessions 1 and 2. Because the original ERP study was not designed to investigate articles and adjectives by noun gender, we were unable to conduct analyses parallel to those in our behavioral analysis reported above. We instead examined correlations between WM and the magnitudes of the N400 and P600 for articles and adjectives at both sessions. We also computed correlations with two additional measures for articles and adjectives at Sessions 1 and 2: the Response Magnitude Index (RMI), a measure of overall sensitivity to agreement violations within the N400 and P600 time windows, and the Response Dominance Index (RDI), a measure of relative response dominance (negativity in the N400 time window versus positivity in the P600 time window). We want to stress that this is a purely exploratory analysis in which we did not apply any corrections for multiple comparisons. No correlations reached statistical significance. However, interpreting the r values that meet the criterion for at least a small effect size based on Botes et al. (2025), we observed one correlation at Session 1 (a negative correlation between WM and the N400 magnitude for articles) and three at Session 2 (a negative correlation between WM and the P600 magnitude for adjectives, a positive correlation with RMI for adjectives, and a negative correlation with RDI for adjectives; full results in Supplementary Materials Table S8). Interestingly, for adjectives, these trends seem to be somewhat more consistent for negativities/N400 amplitudes, very tentatively suggesting that WM may relate more strongly to early lexical-semantic aspects of agreement processing than to later syntactic processes indexed by the P600.
We emphasize, however, that these interpretations are speculative and intended only to motivate future confirmatory research. Indeed, other cross-sectional studies have examined WM and neurocognitive processing with mixed results that may be moderated by proficiency. For example, Gabriele et al. (2021) found a role for WM in L2/A morphosyntactic processing among beginners, whereas Dong et al. (2023) did not find a role for intermediate to advanced learners. Overall, this line of investigation represents a promising direction for further interdisciplinary studies examining the interplay of WM and neurocognitive processing in L2/A development, both cross-sectionally and longitudinally. Taken together, these exploratory findings illustrate how integrating behavioral and neurocognitive approaches could provide a more nuanced, interdisciplinary understanding of WM in L2/A learning.
Limitations
It is important to consider the limitations of our study. First, our sample size was relatively small, resulting in reduced statistical power and potentially limiting the ability to detect possible relationships. Increasing the number of participants in future research will be essential to improve the reliability and robustness of the findings. Additionally, our sample consisted exclusively of college students, most of whom were female. Future studies should aim to include learners with more diverse educational backgrounds, age ranges, and gender distributions to enhance the generalizability of the findings. Second, the AJT was administered using a Rapid Serial Visual Presentation (RSVP) format due to EEG recording requirements. This format does not reflect a fully natural reading process, and although there is evidence that RSVP at a normal reading rate does not impact comprehension (Ricciardi & Di Nocera, 2017) or alter reading processes (Juola et al., 1982), future work may benefit from incorporating more naturalistic or multimodal comprehension measures. Lastly, the target stimuli in this study were intentionally limited to inanimate nouns with canonical -o/-a endings; future work that explores additional aspects of grammatical gender agreement, including unmarked and deceptively marked nouns, would need to develop a less restrictive stimulus set.
Conclusion
In conclusion, by investigating the longitudinal role of WM in L2/A abilities among intermediate learners, this study addresses an important gap in a largely underexplored area of research. Our findings did not provide evidence that WM contributes to the development of receptive or productive Spanish grammatical gender agreement abilities over time. However, by examining both receptive and productive abilities using the same target structure, this study provides novel insights into the role of WM in L2/A development. The findings suggest that the influence of WM might not be uniform, but rather may depend on multiple factors, including the type of linguistic ability (receptive vs. productive), proficiency level (beginner, intermediate, advanced), and task type (focus on form vs. focus on meaning). Moreover, incorporating fine-grained linguistic variables, such as linguistic form and gender, as factors in the analysis revealed nuanced interaction patterns – specifically, an interaction between WM and linguistic form in L2/A reception, and between linguistic form and gender in L2/A production – that are often obscured in broader analyses.
Future research should continue to explore whether and how WM supports different dimensions of L2/A learning, examining its role as a function of linguistic ability, proficiency, and task type, among others, both cross-sectionally and longitudinally. An interdisciplinary approach integrating multiple methodological frameworks will be essential for advancing toward a more comprehensive model of individual differences in L2/A learning, one that captures both performance outcomes and the cognitive-neural mechanisms that support them. Such efforts will be crucial for refining theoretical accounts of WM in L2/A learning and for deepening our understanding of the cognitive processes that shape L2/A development over time.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0267190526100270.
Acknowledgments
We are grateful to Dr. Alexander Demos for his formative and invaluable advice on data analysis. We would also like to acknowledge Karen Cralli, Erin Kohnke, Erin O’Connor, Luz Orozco, and Nabiha Quadri for their help with data collection. This work was supported by funds and support awarded to Mandy Faretta-Stutenberg, including the Language Learning Dissertation Grant, the Provost’s Award for Graduate Research (University of Illinois Chicago), and the Audrey Lumsden-Kouvel Dissertation Fellowship (Department of Hispanic & Italian Studies, University of Illinois Chicago). Finally, we wish to thank the editors and the anonymous reviewers for their insightful comments and constructive recommendations. Any remaining errors are our own. We have no conflicts of interest to disclose.
CRediT author statement
A.R.G.: Conceptualization, Formal analysis, Data curation, Writing (original draft, review and editing), Visualization (primary), and Validation; M.F.-S.: Conceptualization, Methodology, Investigation, Data curation, Writing (original draft, review and editing), Visualization (supporting), and Funding acquisition; K.M.-S.: Conceptualization, Methodology, Resources, Writing (original draft, review and editing), and Supervision.





