1. Introduction
Over the past 20 years, a growing number of studies have examined how different bilinguals represent and process their languages. One result that has emerged from this work is the importance of age and input – and how these two interact – in the development of a linguistic grammar. Heritage Speakers (HSs) provide critical perspectives on this debate, as their acquisition trajectory is a unique, complex interaction between age and input that typically separates them from other bilinguals (Valdés, 2000). Specifically, while HSs are exposed to their heritage language in naturalistic contexts, particularly in the home, they grow up in a societal environment where the heritage language is not dominant. Formal education is typically in the dominant societal language; input in the heritage language decreases as users spend more time in school and as interactions with speakers of the dominant societal language become more common as children grow older (He, 2011). Thus, at some point in childhood, language input shifts in favor of the dominant societal language. Nonetheless, HSs share with sequential bilinguals (i.e., first-generation immigrants who acquire the societal language as an additional language) the experience of naturalistic input to the target language [Footnote 1] at a young age. However, similar to late second-language learners (e.g., speakers who acquire the target language as an additional language, typically in a classroom context), HS adults interact predominantly in their non-heritage language, and, importantly, receive most of their formal education through that language. Thus, grammatical development in both languages of an HS takes a singular path and reflects a complex interplay between age and input in overall language development.
How exactly HSs differ from and/or overlap with other bilingual groups in their linguistic competence is critical to a richer understanding of the full spectrum of bilingualism. Much of the literature has focused on linguistic divergences between HSs and first-generation immigrants, whose linguistic competence is taken as a baseline (Scontras et al., 2015; Valdés, 2000). However, recent research has suggested that this observed divergence may be reduced or even absent, depending upon the structure(s) being analyzed, the experimental paradigm employed, how researchers measure dominance and whether individual differences are considered. One area of focus has been the domain of morphosyntax, in which consistent divergence in HSs’ production and comprehension has been documented, though studies employing various methodologies have shown varying results: HSs seem to perform close to or even equal to the first-generation immigrants on some tasks (e.g., self-paced reading [Di Pisa et al., 2022; Parshina et al., 2022] and eye-tracking [Fuchs, 2021]), while on others, HSs’ performance is significantly different (e.g., grammaticality judgement tasks [Scontras et al., 2018]; see also Alarcón, 2011; Hao et al., 2023; Montrul et al., 2014).
The source of these divergences between HSs and first-generation immigrants has been a major topic of inquiry in the literature on heritage languages. One common consideration is that performance on different types of tasks (e.g., offline vs online) may be differentially influenced by factors such as metalinguistic awareness and task demands. These may in turn affect HSs differently than other groups, yielding the range of outcomes noted above (for a more thorough overview, see Bayram et al., 2020). Here we focus on another recurring suggestion: that HSs have more difficulty processing speech in their heritage language. This is often assumed, and has been suggested to account for divergences in, among others, comprehension of agreement (e.g., Scontras et al., 2018), discourse-driven use of null versus overt subjects (e.g., Laleko & Polinsky, 2017), and more recently in processing work on code-switching (Salig et al., 2025). The logic is as follows: because of reduced experience with and knowledge of the heritage language, HSs require more cognitive resources for language processing in the heritage language overall, which leaves fewer resources for attending to other aspects of the clause, such as finer-grained morphology (e.g., gender agreement) or elements that require incorporating information at the interfaces (e.g., null subjects).
While this processing-difficulty account comes up regularly as a possible explanation of certain observed linguistic divergences, to our knowledge, the underlying assumption has not been formally tested. Work on the relative amount of cognitive resources required for language processing in a non-dominant language has largely focused on L2 learners, finding that L2 learners do indeed exert more cognitive effort for language processing, relative to (non-heritage) native speakers (see Section 1.1 below for further discussion). Still, while HSs are like L2 learners in being non-dominant speakers of the target language, a wealth of recent work has demonstrated that HSs’ language processing is often qualitatively similar to that of first-generation immigrants (Bayram et al., 2020). This literature makes clear that results from studies on language processing in an L2 relative to an L1 need not generalize to language processing in a heritage language. It is thus critical to address the gap in the literature on cognitive effort in bilingual language processing by investigating cognitive effort during heritage-language processing, enabling us to determine whether assumptions regarding cognitive effort based on work on L2s can indeed be reliably extended to HSs.
1.1. Listening effort and increased cognitive resources
Testing this assumption requires a clearer understanding of how effort, or the cognitive demands required of participants to carry out a given task, changes and adapts under different linguistic and/or complexity conditions. Effortful listening tasks engage attention and working memory, as outlined by the Framework for Understanding Effortful Listening (FUEL) (Pichora-Fuller et al., 2016), which proposes that effortful listening is subject to constraints on the individual’s cognitive ability (e.g., resources) and motivation (see also Colby & McMurray, 2023).
A commonly used behavioral methodology for testing the effects of listening effort is the dual-task study, which assesses the cognitive load caused by a primary task on performance in a secondary task, measured by a decrease in performance on the latter when both are carried out in parallel (Pashler, 1994). Performance on the secondary task – whether in terms of accuracy or reaction time – is taken to be inversely related to the cognitive resources required by the primary task. Dual-task studies are grounded in the assumption that as task demands increase, cognitive effort will also increase: participants need to direct their cognitive resources to multiple input streams, and because cognitive resources are assumed to be finite, increasing the complexity of one task will necessarily deplete the resource pool more rapidly and reduce the resources available for performance of the other task.
In the field of bilingualism, both dual-task studies and physiological measures have been employed to test to what extent production and/or comprehension in an L2 is more effortful than in an L1 and/or dominant language. In one study, Declerck and Kormos (2012) conducted a language production task simultaneously with a non-linguistic finger-tapping task and found that the simultaneous execution of the non-linguistic task significantly reduced the accuracy of L2s of Hungarian in lexical selection, relative to native speakers. The authors found that this effect was modulated by the L2s’ proficiency. Not all dual-task studies have found such effects: Révész et al. (2016) compared dual-task methodology with self-ratings and expert judgments of the cognitive effort required during performance of simple and complex versions of three oral tasks. Participants were English monolingual speakers and English L2s, half of whom carried out the tasks under dual-task conditions, which involved reacting to changes from red to green in the computer background screen. Notably, though results revealed that participants showed higher mean accuracy on the visual secondary task when doing the simple versions of the primary tasks, no group differences were found. Using physiological measures, Francis et al. (2018) found that L1-Dutch L2-English speakers showed higher listening effort in their L2, as indicated by increased pupil dilation in preparation for speech involving English target words. Consistent findings regarding increased pupil size when processing speech in an L2 were found for L1-Italian L2-English speakers relative to native English speakers (Borghini & Hazan, 2018).
1.2. Present study
In the present study, we use dual-task methodology to measure cognitive demands on grammatical processing in heritage Spanish, comparing HSs whose dominant language is English to two other groups drawn from the Spanish-English bilingualism spectrum (Birdsong, 2018; Luk & Bialystok, 2013). The groups crucially differ in their language background. Spanish-raised bilinguals (SRBs) were raised in a Spanish-speaking environment with rich input in Spanish through adolescence. This group is most likely to be Spanish-dominant. By contrast, HSs are early bilinguals raised in Spanish-speaking families but with restricted input, typically limited to the home environment. English-raised bilinguals (ERBs) are late bilinguals whose restricted input to Spanish is typically limited to the classroom environment; they are typically referred to in the literature as L2 learners, but we use the label ERBs to reflect their acquisition trajectory and their place on the bilingualism spectrum. Both HSs and ERBs are more likely to be English-dominant.
In making this comparison, we have three goals. First, we want to test the assumption that divergent linguistic performance between HSs and SRBs on some tasks presented in the literature may be linked to the increased cognitive effort required to process linguistic stimuli in a heritage language. As far as we are aware, this is the first study to directly test this question. Second – in the reverse of our first goal – we want to test whether observed similarities between HSs’ and SRBs’ linguistic performance in prior work may in fact require different amounts of cognitive effort. It is possible that HSs may require more resources to process the same amount of input, even though they may, under some task conditions, ultimately demonstrate the same outcome on independent measures. Finally, we include an ERB group. Although increased cognitive effort in ERBs’ language processing has previously been studied, our third goal is to understand HSs’ linguistic processing in the context of the broader bilingualism spectrum.
The linguistic structures we employ to manipulate linguistic complexity in the present study are relative clauses. This choice is motivated by abundant research showing a clear hierarchy of processing difficulty (for a review, see Lau & Tanaka, 2021), wherein processing cost is greater for object-extracted relative clauses (ORs) than for subject-extracted relative clauses (SRs), across a number of different languages and under many different experimental conditions, such as self-paced reading (Gibson et al., 2005), eye-tracking while reading (Biondo et al., 2023; Cunnings & Fujita, 2023; Staub, 2010), ERPs (Kwon et al., 2013) and pupillometry, with normal adult listeners as well as listeners with language deficits (Chapman & Hallowell, 2021) and typically developing monolingual children (for Spanish see Presotto & Torregrossa, 2024; Torrens, 2024), to name a few. While the reasons for this asymmetry between SRs and ORs are beyond the scope of the present paper, we point the reader to the studies mentioned here for further discussion.
Crucially for the purposes of the present study, research has also identified asymmetries in relative-clause processing and comprehension in adult second-language learners (Chun & Kaan, 2019; O’Grady et al., 2001; Sánchez-Walker, 2012) as well as in HSs across a variety of languages and age groups, including adult HSs of Spanish (Sánchez-Walker, 2019), Russian (Polinsky, 2011) and Korean (O’Grady et al., 2001) and child HSs of Russian (Polinsky, 2011), Turkish (Coşkun Kunduz & Montrul, 2024), Czech (Cilibrasi et al., 2022) and Cantonese (Kidd et al., 2015). We do note that, as in monolingual Mandarin, the asymmetry between SRs and ORs may not be so clear in heritage Mandarin, as suggested by work on child HSs (Jia & Paradis, 2020).
1.3. Research questions & predictions
The present study addresses the following questions. (1) Do HSs or second-language learners (ERBs) require more domain-general cognitive resources for language processing than do speakers who had rich input to the target language through adolescence (SRBs)? (2) Does processing increasingly syntactically complex structures require more domain-general cognitive resources? (3) If increasingly complex syntactic structures do require more domain-general cognitive resources, is there evidence that they require more resources in heritage or L2 language processing (i.e., for HSs or ERBs relative to SRBs); in other words, is there an additive effect of reduced language experience and increased syntactic complexity?
To address these questions, we conduct a dual-task study involving a non-linguistic Motion-Object Tracking (MOT) task (Pylyshyn & Storm, 1988), adapted from Heyselaar and Segaert (2019), and a linguistic picture-matching task. The MOT and the linguistic task both have three levels of difficulty to increasingly tax cognitive resources. We assume, following previous literature, that if increasing the difficulty of the linguistic task depletes more cognitive resources, performance on the MOT should be reduced (e.g., Magnuson & Nusbaum, 2007). Importantly, interpreting the results of the MOT, which is a non-linguistic task, to answer our research questions has the added benefit of avoiding potential complications due to varying experience with literacy in the target language across the different groups, which is a common consideration for studies on HSs in particular. It might also be possible that performance on the dual task is susceptible to naturally occurring individual variation in cognitive resources. We account for this in Section 2.2, using an independent measure of cognitive resources (Temporal Flanker Task) to exclude participants who might constitute outliers in terms of individual cognitive resources.
Question (1) has previously been asked, comparing ERBs to (non-heritage) native speakers, with existing evidence suggesting that language processing in an L2 does indeed require more resources than native-language processing (Borghini & Hazan, 2018; Declerck & Kormos, 2012; Francis et al., 2018; cf. Section 1.1). In the present study, we make a three-way comparison between SRBs, HSs and ERBs of Spanish. If language processing in Spanish requires increased cognitive resources for the HSs and ERBs relative to the SRBs, we expect that the SRBs will be more accurate than the HSs and ERBs on the MOT in the dual-task portion of the study, when the linguistic stimulus is presented concurrently with the MOT and thus requires simultaneous processing and encoding of linguistic information in memory.
Building on this, if increased syntactic complexity requires more cognitive resources (Question (2)), we expect participants’ performance on the MOT to decrease as the level of complexity of the concurrently presented linguistic stimulus increases. Finally, if the effect of increased syntactic complexity on cognitive load is magnified for speakers with reduced input to the language (Question (3)), we expect the effect of syntactic complexity just described to interact with group, specifically leading to a cumulative effect for ERBs and/or HSs.
2. Methods
2.1. Participants
One hundred and sixteen Spanish-English bilinguals participated in the study across two universities in the US. Data from 13 participants were removed: eight participants did not meet the inclusion criteria for any group; four participants had incomplete data due to software issues; one participant was excluded due to their performance on the Flanker task (see below). A total of 103 participants were included in the final analysis. Participants were grouped post hoc into three clusters: SRBs (n = 51; mean age = 22.7, SD = 6.2) who immigrated to the US at or after age 15; heritage Spanish speakers (HSs; n = 31, mean age = 20.2, SD = 3.2) who reported speaking Spanish with one or both parents, and were born and raised in the US or immigrated to the US prior to or at age 6; and English-raised second-language learners of Spanish (ERBs; n = 21; mean age = 20.4, SD = 0.7), who at the time of testing were enrolled in university-level courses with Spanish as the language of instruction. To be able to enroll in these courses, the ERB participants all had to have completed at least 4 years of high school Spanish or four semesters of Spanish instruction at the college level and had to have taken at least one semester of university-level basic Spanish skills classes (speaking or writing). All participants self-assessed their Spanish proficiency (see Section 2.4), but no formal proficiency assessment was included. All participants followed procedures outlined by the IRB of their respective participating sites. SRB and HS participants were recruited and tested at one of the universities and were paid for participation, while ERBs were recruited and tested at the other institution and given course credit.
2.2. Flanker design
The decision to use a dual task to address our research question necessarily implies that performance will be affected by individual differences in cognitive resources. In particular, listeners with a low amount of cognitive resources may show a stronger effect of the dual-task manipulation, regardless of the type of sentence being heard. To address this, we included a Temporal Flanker task (Hazeltine et al., 2011; Kunde, 2003; Schumacher et al., 2011), a variant of an Eriksen flanker task in which distracting and target items are displayed sequentially, measuring conflict resolution ability. This provides a general estimate of the amount of cognitive resources an individual is able to (or willing to) apply to a given problem. We then used Temporal Flanker performance as an exclusion criterion, dropping any participant whose overall accuracy on the task was below 0.5; this excluded one participant.
Each trial of the Flanker task consisted of a fixation cross displayed for 500 msec, followed by five stimuli displayed in sequence. Each stimulus was displayed for 250 msec, with 50 msec between displays. The middle stimulus (position 3) was always the target. Participants were instructed to respond to this middle stimulus on all trials, indicating with a button press whether they saw a blue, green, red or purple circle. Flanker colors either matched or did not match the color of the target (middle) circle to create contrasting pairs – trials consisted of either green and blue circles or red and purple circles (but never green and purple circles). This created four different trial types: blue-green trials where the flanker and target were the same color (congruent) and blue-green trials where the flanker and target were different (incongruent), plus analogous congruent and incongruent red-purple trials.
The Flanker task began with 10 practice trials, followed by 192 experimental trials split evenly into congruent and incongruent conditions, and into blue-green and red-purple colors. Practice trials were entirely random, but experimental trials were controlled such that color type (but not congruence condition) alternated. Participants were instructed to respond as quickly and accurately as possible, but trials had no time limit. Written instructions were provided in Spanish, with optional oral instructions in English.
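The trial structure just described can be illustrated with a short sketch. It is not the study's presentation software: the function name, the random choice of target color, and the assumption that all four flankers share one color are hypothetical choices made only for illustration. The sketch builds 192 trials balanced over color pair and congruency, with the color pair alternating from trial to trial while congruency does not alternate systematically.

```python
import random

COLOR_PAIRS = [("blue", "green"), ("red", "purple")]

def build_flanker_trials(n_trials=192, seed=0):
    """Build a trial list balanced over color pair and congruency.

    Assumptions (not stated in the text): the target color is drawn at
    random within a pair, and all four flankers share a single color.
    """
    rng = random.Random(seed)
    per_pair = n_trials // len(COLOR_PAIRS)  # 96 trials per color pair
    per_cell = per_pair // 2                 # 48 congruent, 48 incongruent
    streams = []
    for pair in COLOR_PAIRS:
        conditions = ["congruent"] * per_cell + ["incongruent"] * per_cell
        rng.shuffle(conditions)  # congruency order is random within a pair
        stream = []
        for condition in conditions:
            target = rng.choice(pair)
            if condition == "congruent":
                flanker = target
            else:
                flanker = pair[0] if target == pair[1] else pair[1]
            # Five sequential displays; position 3 (index 2) is the target.
            sequence = [flanker, flanker, target, flanker, flanker]
            stream.append({"pair": pair, "condition": condition,
                           "target": target, "sequence": sequence})
        streams.append(stream)
    # Interleave the two streams so that the color pair alternates.
    return [trial for pair_trials in zip(*streams) for trial in pair_trials]

trials = build_flanker_trials()
```

Shuffling congruency separately within each color pair keeps the two factors fully crossed, and interleaving the two streams is one simple way to obtain strict alternation of color type.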
Both reaction times to the middle stimulus and button-press accuracy were measured. Typically, participants who are slower or less accurate on incongruent trials in these tasks are thought to have lower cognitive control, as they are unable to resolve the disparity presented by the difference in flankers and target (Eriksen & Eriksen, 1974; Servant & Logan, 2019). However, individual reaction times are dependent on how quickly participants can respond in general. Thus, instead of using raw reaction times to incongruent trials, we computed a difference score between the incongruent and congruent trials for both reaction time and accuracy.
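A minimal sketch of the difference scores and of the accuracy-based exclusion rule from this section follows. The data layout, the function name, the direction of subtraction (incongruent minus congruent), and the restriction of reaction times to correct trials are assumptions for illustration, not a description of the authors' scripts.

```python
from statistics import mean

def flanker_scores(trials, min_accuracy=0.5):
    """Compute per-participant Flanker difference scores.

    trials: list of dicts with keys 'congruent' (bool), 'correct' (bool)
    and 'rt_ms' (float). Both congruency conditions must be present.
    """
    def accuracy(subset):
        return mean(1.0 if t["correct"] else 0.0 for t in subset)

    def mean_correct_rt(subset):
        # Assumption: reaction times are averaged over correct trials only.
        rts = [t["rt_ms"] for t in subset if t["correct"]]
        return mean(rts) if rts else None

    congruent = [t for t in trials if t["congruent"]]
    incongruent = [t for t in trials if not t["congruent"]]
    rt_c, rt_i = mean_correct_rt(congruent), mean_correct_rt(incongruent)
    overall = accuracy(trials)
    return {
        "overall_accuracy": overall,
        # Assumed direction: incongruent minus congruent.
        "accuracy_diff": accuracy(incongruent) - accuracy(congruent),
        "rt_diff_ms": None if None in (rt_c, rt_i) else rt_i - rt_c,
        # Exclusion criterion from the text: overall accuracy below 0.5.
        "excluded": overall < min_accuracy,
    }

# Made-up responses for one hypothetical participant.
toy = [
    {"congruent": True, "correct": True, "rt_ms": 400.0},
    {"congruent": True, "correct": True, "rt_ms": 420.0},
    {"congruent": False, "correct": True, "rt_ms": 500.0},
    {"congruent": False, "correct": False, "rt_ms": 480.0},
]
scores = flanker_scores(toy)
```

Under these assumptions, a larger positive reaction-time difference and a more negative accuracy difference both indicate a larger interference effect, independently of how fast the participant responds overall.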
2.3. Dual task design
The main task consisted of three parts: a (single-task) linguistic picture-matching task in which participants matched candidate images with auditorily presented linguistic stimuli, a (single-task) non-linguistic MOT task in which participants had to keep track of which of several identical moving objects on screen had been marked as targets, and a dual task where participants completed both previously encountered tasks simultaneously. The single-task picture-matching and MOT tasks were always completed first (counterbalanced), followed by the dual-task portion.
The linguistic picture-matching task contained 120 trials. These were all Spanish phrases, heard in the presence of two images. The phrase was always of the form ARTICLE + NOUN + RELATIVE CLAUSE. Phrase difficulty was manipulated by increasing the complexity of relative clauses, from Reduced Relative (RR) to Subject Relative (SR) to Object Relative (OR), as illustrated in (1). [Footnote 2]

Images consisted of paired depictions of transitive actions, performed by two animals (e.g., one might be a bear painting a picture of a tiger, while the other would be a tiger painting a bear; see Figure 1 for illustration of images in a dual-task trial, and see Supplementary Materials for illustration of images in a single-task picture-matching trial). Listeners matched the phrase they heard to the correct image by selecting the left arrow to indicate the left-hand image and the right arrow to indicate the right-hand image. There were 20 total animals and 20 action verbs combined to create 80 different animal-action pairs. Visual stimuli for the linguistic task were custom drawings done by commission. Audio stimuli for the task were recorded in a sound-isolated booth by a first-language speaker of Spanish with a Mexican dialect. Sentences were recorded multiple times, with the best being chosen for experimental stimuli and cleaned using Praat (Boersma & Weenink, 2026). All audio was recorded at 44.1 kHz and presented monaurally using speakers; the average acoustic intensity of the final 240 audio stimuli was 0.004 mean squared amplitude (SD = 0.00001) (equivalent to −24.1 decibels relative to full scale). Though OR sentences were more grammatically complex, this did not result in much of a difference in length relative to the other two conditions – on average, RRs were 2.25 s (SD = 0.26), SRs were 2.32 s (SD = 0.28) and ORs were 2.46 s (SD = 0.22).
Figure 1. Illustration of the progression of a dual-task trial.

The MOT task, adapted from Heyselaar and Segaert (2019), comprised 120 trials of video stimuli (see Figure 1 for illustration in a dual-task trial; see Supplementary Materials for illustration of a single-task MOT trial). On every trial, six blue dots were displayed vertically on screen. Then, some number (either one, two or three) changed to red for one second, then back to blue. Following this, the dots moved around the screen for three seconds. Finally, a single random dot flashed red again. Participants were tasked with determining whether this final dot was part of the set of dots that turned red at the start by selecting the up arrow if they thought it was and the down arrow if they thought it was not. We expected this to be easiest when participants had to track one dot, and most difficult when participants had to track three dots. Both accuracy and reaction time were measured on every trial. Videos were created using custom MATLAB software. This controlled everything from the dimensions of the video file to the size and velocity of the dots, to whether the final dot was a part of the initial set. Code is available in the OSF repository.
The dual task contained 120 trials, in which participants both heard a phrase (with referent images) and saw an MOT video on the screen simultaneously. After both were finished, one of the two (either images relevant to the sentence or the final frame of the video) remained on the screen; participants were instructed to provide a response based on what remained, analogously to the type of response they provided to corresponding single-task trials (Figure 1). Whether the picture-matching or the MOT remained on the screen was randomized on every trial, though across trials, participants responded to each task an equal number of times. Sentences and MOT videos were not reused from their respective single-task variants.
Both the single tasks (MOT and picture-matching) and the dual task were blocked into sections of 30 trials, with an optional break between each block. Instructions for the two single tasks were provided at the onset of the experiment. Instructions for the dual task were provided prior to the onset of dual-task trials.
2.4. Language-dominance questionnaire
To assess language background and language dominance, participants completed a pen-and-paper version of the Bilingual Language Profile (BLP) (Birdsong et al., 2012) and an additional brief questionnaire consisting of selected questions from the Language Experience and Proficiency Questionnaire (LEAP-Q) (Kaushanskaya et al., 2020; Marian et al., 2007). Recent research on bilinguals has shown that language dominance may be a broad construct that includes proficiency, language use, language context and language attitudes, among other factors. Moreover, dominance is often used as a comparison measure between a bilingual’s two languages and can be context-dependent. The BLP offers one way of operationalizing and measuring language dominance (Olson, 2024), and it includes four subcomponents: language history, current language use, language proficiency and language attitudes. Responses to each section are summed and then weighted to arrive at a Spanish-language score and an English-language score for each participant. The BLP calculates dominance by subtracting the Spanish score from the English score. Thus, a more positive number indicates higher English dominance.
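The scoring logic described above can be sketched as follows. The function names and the example module sums are hypothetical. The default weighting factors reflect the published BLP scoring scheme as best we can reconstruct it, so they should be treated as an assumption and checked against the instrument itself before any real use.

```python
# Assumed module weights (verify against the BLP scoring instructions).
BLP_WEIGHTS = {
    "history": 0.454,
    "use": 1.09,
    "proficiency": 2.27,
    "attitudes": 2.27,
}

def language_score(module_sums, weights=BLP_WEIGHTS):
    """Weight each summed module and add the results for one language."""
    return sum(weights[module] * module_sums[module] for module in weights)

def blp_dominance(english_sums, spanish_sums, weights=BLP_WEIGHTS):
    """English score minus Spanish score: positive means English-dominant."""
    return language_score(english_sums, weights) - language_score(spanish_sums, weights)

# Made-up module sums for one hypothetical participant.
english = {"history": 100, "use": 40, "proficiency": 24, "attitudes": 20}
spanish = {"history": 60, "use": 10, "proficiency": 18, "attitudes": 22}
dominance = blp_dominance(english, spanish)
```

For this invented participant the dominance score is positive, which under the orientation used in this study corresponds to English dominance.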
On average, SRBs had a BLP dominance score of 20.1 (SD = 46.8), HSs had a score of 46.9 (SD = 28.1) and ERBs had a score of 129.0 (SD = 33.0). The full distribution of scores for each group is presented in the Supplementary Materials. These results are used only to verify assumptions regarding overall expected language dominance in the three experimental groups. We note that it is not surprising that the BLP assessed some of the SRBs to be English-dominant, given that the BLP weights current language use heavily, and all our participants were current or recent students at US universities, which likely skewed their self-reported use of English in a “typical week.”
Participants additionally completed a receptive vocabulary task that consisted of multiple-choice picture matching to ensure they had sufficient knowledge of the nouns and verbs used in the main task. Participants were presented with each of the target words in written form and with four candidate images per word. They selected which image corresponded to the given word. For nouns, images depicted animals used in the main task; for verbs, images depicted scenes used in the main task, with only the agent of the verb included. Accuracy on this task was very high; the ERBs had the lowest average accuracy on both verbs (93%) and animals (95%), while the SRBs and HSs performed at ceiling on both categories (average accuracy above 98%).
2.5. Procedure
In both study locations, participants were tested in a private space and followed the same procedure. Participants first gave informed consent for participation in the study. They then completed the Flanker task and the receptive vocabulary task. Following this, they completed the main experimental task, which consisted of blocked single-task trials and then blocked dual-task trials. Within each block, participants initiated each trial with a key press, meaning inter-stimulus intervals were self-timed; breaks between experimental blocks were explicitly signaled and self-timed. The Flanker task, the receptive vocabulary task, and the main experimental task were all completed on a laptop in the laboratory, and participants were allowed to adjust the angle of the screen to their comfort. Finally, participants completed the language background questionnaire on paper. The total time of participation in the study lasted approximately 75–90 minutes.
2.6. Analysis approach
We took a general mixed effects modeling approach for analysis. All data were processed and analyzed in R Version 4.2.1 (R Core Team, 2021) using the lme4 package (Bates et al., 2015). For each analysis, the maximal random effect structure supported by the data was included; if a model including random slopes and intercepts did not converge, the random effects structure was reduced until convergence was reached. Random intercepts grouped by participant were prioritized over random intercepts grouped by item when a model with both of these did not converge, to account for typically larger by-participant than by-item variation (Sonderegger, 2023).
When examining Flanker task performance, incorrect responses and responses faster than 150 ms or slower than 2000 ms were removed from analysis, resulting in similar rates of exclusion across the three groups. The remaining responses were separated by congruency of the corresponding trial and averaged per participant.
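The Flanker preprocessing described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R pipeline: the trial fields (`participant`, `congruent`, `correct`, `rt_ms`) and the function name are hypothetical, the averaged measure is assumed to be response time, and keeping responses that fall exactly on the 150 ms and 2000 ms boundaries is an assumption, since the text does not specify this.

```python
from collections import defaultdict
from statistics import mean


def flanker_means(trials, fast_ms=150, slow_ms=2000):
    """Mean response time per (participant, congruency) after trimming.

    Drops incorrect responses and responses faster than `fast_ms` or
    slower than `slow_ms`. Boundary values are kept (an assumption).
    """
    kept = defaultdict(list)
    for t in trials:
        if not t["correct"]:
            continue  # incorrect responses are excluded
        if t["rt_ms"] < fast_ms or t["rt_ms"] > slow_ms:
            continue  # implausibly fast or slow responses are excluded
        kept[(t["participant"], t["congruent"])].append(t["rt_ms"])
    return {key: mean(rts) for key, rts in kept.items()}


# Toy data (invented for illustration only).
trials = [
    {"participant": "p1", "congruent": True, "correct": True, "rt_ms": 400},
    {"participant": "p1", "congruent": True, "correct": True, "rt_ms": 600},
    {"participant": "p1", "congruent": True, "correct": True, "rt_ms": 100},    # too fast
    {"participant": "p1", "congruent": False, "correct": False, "rt_ms": 500},  # incorrect
    {"participant": "p1", "congruent": False, "correct": True, "rt_ms": 700},
    {"participant": "p1", "congruent": False, "correct": True, "rt_ms": 2500},  # too slow
]
result = flanker_means(trials)
```

Averaging within congruency per participant yields one congruent and one incongruent value per person, which is the form needed to compare the two trial types.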
We first analyze results from the single-task portion of the study. Although these do not directly bear on our research questions, this part of the analysis allows us to verify that our difficulty manipulations yielded the desired effect, enabling us to subsequently analyze and interpret the dual-task data with respect to our research questions. When examining MOT performance (alone or in a dual-task context), responses faster than 150 ms and responses outside of two standard deviations of the mean (grouped by participant and by difficulty level) were excluded from analysis, removing 3.6% of the data, which constituted approximately equal rates of exclusion across the three groups and across conditions within each group. Similarly, for the picture-matching task, responses faster than 150 ms or outside of two standard deviations of the mean (grouped by participant and by relative clause type) were excluded from analysis, removing 2.8% of the data, at approximately equal rates across the three groups and across conditions within each group. For the dual-task portion of the study, due to a software malfunction, there was data loss for an additional five SRB participants and one HS participant.
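The two-standard-deviation trimming rule can be illustrated with a short Python sketch. Several details here are assumptions rather than the authors' procedure: the trimmed measure is taken to be response time, the mean and standard deviation are computed over all responses in a participant-by-condition cell before any exclusion, and the field names and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def trim_responses(trials, cond_key, fast_ms=150, n_sd=2.0):
    """Keep trials that are not faster than `fast_ms` and that lie within
    `n_sd` sample standard deviations of their cell mean, where a cell is
    one participant in one condition (e.g. one MOT difficulty level)."""
    cells = defaultdict(list)
    for t in trials:
        cells[(t["participant"], t[cond_key])].append(t["rt_ms"])

    bounds = {}
    for key, rts in cells.items():
        m = mean(rts)
        sd = stdev(rts) if len(rts) > 1 else 0.0
        bounds[key] = (m - n_sd * sd, m + n_sd * sd)

    kept = []
    for t in trials:
        lo, hi = bounds[(t["participant"], t[cond_key])]
        if t["rt_ms"] >= fast_ms and lo <= t["rt_ms"] <= hi:
            kept.append(t)
    return kept


# Toy data (invented): one slow outlier in difficulty 1, one fast response in difficulty 2.
trials = (
    [{"participant": "p1", "difficulty": 1, "rt_ms": 500} for _ in range(9)]
    + [{"participant": "p1", "difficulty": 1, "rt_ms": 3000}]
    + [{"participant": "p1", "difficulty": 2, "rt_ms": rt} for rt in (100, 600, 620, 640)]
)
kept = trim_responses(trials, cond_key="difficulty")
```

Computing the bounds per cell rather than over the whole sample means a participant who is slow overall is not penalized relative to a faster one.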
3. Results
3.1. Single-task results
3.1.1. MOT task
Accuracy on the single-task MOT is visualized in Figure 2. A logistic mixed effects regression model was fitted to the data predicting the log-odds of a correct response by group, difficulty and their interaction, with random intercepts grouped by participant. A similar model with an additional random intercept grouped by item failed to converge. Group was a deviation-coded three-level categorical variable such that the first contrast tested for the divergence of SRBs from the grand mean and the second contrast tested for the divergence of HSs from the grand mean; the predictor was re-leveled here and in all models presented below to check for the significance of the contrast between ERBs and the grand mean (see Footnote 3). Difficulty was a three-level categorical variable forward-difference coded such that the first contrast tested for a difference between accuracy on the MOT when tracking one dot versus two dots, and the second contrast tested for a difference between accuracy on the MOT when tracking two dots versus three dots. The model (Table 1) found main effects of MOT difficulty: as expected, participants were overall more accurate when tracking one dot than two dots (β = 0.93, SE = 0.09, z = 10.54, p < 0.001) and more accurate when tracking two dots than three (β = 0.56, SE = 0.07, z = 7.87, p < 0.001). There were no main effects of group, but the model did identify an interaction effect between group and difficulty, specifically between group and the contrast between tracking one dot versus two dots. Follow-up models suggest group differences in the magnitude of this contrast, though it is significant for all groups (SRBs: β = 2.39, SE = 0.31, z = 7.58, p < 0.001; HSs: β = 2.28, SE = 0.39, z = 5.85, p < 0.001; ERBs: β = 1.02, SE = 0.35, z = 2.89, p = 0.004).
Figure 2. Accuracy on the single-task MOT, grouped by experimental group and by MOT difficulty level. Error bars indicate standard error.

Table 1. Regression output for model fitted to responses to single-task MOT

Note: Model structure: response ~ group × MOT difficulty + (1 | participant). No. observations: 9,079. No. groups: 103. *p ≤ .05, **p ≤ .01, ***p ≤ .001.
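The two contrast-coding schemes described for the group and difficulty predictors can be made concrete with a small Python sketch. The matrices below are the standard textbook forms of deviation coding and forward-difference coding for a three-level factor; whether the authors used exactly these matrices is an assumption, and the function name and toy coefficient values are hypothetical. Exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction as F

# Rows are factor levels; columns are the two contrasts.
# Deviation coding: each coefficient is one level's departure from the grand mean.
DEVIATION = [
    [F(1), F(0)],    # SRBs
    [F(0), F(1)],    # HSs
    [F(-1), F(-1)],  # ERBs (tested by re-leveling the factor)
]

# Forward-difference coding: each coefficient compares a level with the next one.
FORWARD_DIFF = [
    [F(2, 3), F(1, 3)],    # one dot
    [F(-1, 3), F(1, 3)],   # two dots
    [F(-1, 3), F(-2, 3)],  # three dots
]


def cell_means(intercept, betas, coding):
    """Predicted value (log-odds scale) for each level under a coding scheme."""
    return [intercept + sum(c * b for c, b in zip(row, betas)) for row in coding]


# Toy coefficients (invented): intercept 1, contrasts 1/2 and 1/4.
fd = cell_means(F(1), [F(1, 2), F(1, 4)], FORWARD_DIFF)
dev = cell_means(F(1), [F(1, 2), F(1, 4)], DEVIATION)
```

Under forward-difference coding, `fd[0] - fd[1]` recovers the first coefficient and `fd[1] - fd[2]` the second, while the intercept equals the mean of the three levels. Under deviation coding, each coefficient is the distance of its level from that mean, which is why the third level can only be tested after re-leveling.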
3.1.2. Picture-matching task
Accuracy on the picture-matching task for each group is presented in Figure 3. A logistic mixed effects regression model was fitted to the data, predicting the log-odds of a correct response by group, RC type and their interaction, with random intercepts grouped by item (models with random effects grouped by participant did not converge). Group was deviation-coded as in the MOT analysis. RC type was forward-difference coded, with the first contrast testing for the difference between RRs and SRs, and the second contrast testing for the difference between SRs and ORs. The model (see Table 2) found a significant effect of RC type, wherein participants were overall more accurate on RRs than SRs (β = 0.45, SE = 0.11, z = 4.19, p < 0.001) and on SRs than ORs (β = 3.27, SE = 0.07, z = 44.14, p < 0.001). The model also found a main effect of group, wherein the SRBs had an overall higher accuracy than the grand mean (β = 0.77, SE = 0.05, z = 15.02, p < 0.001), as did the HSs (β = 0.20, SE = 0.06, z = 3.39, p < 0.001); the re-leveled model showed that the ERBs were overall significantly less accurate than the grand mean (β = −0.96, SE = 0.05, z = −20.39, p < 0.001). The model also found a significant interaction effect of group and RC type, suggesting that the effect of RC type was different across groups. Indeed, follow-up models show that the contrast between RRs and SRs was significant for all groups, though at different magnitudes and, for the ERBs, in the opposite direction (SRBs: β = 0.80, SE = 0.18, z = 4.45, p < 0.001; HSs: β = 0.87, SE = 0.22, z = 3.86, p < 0.001; ERBs: β = −0.30, SE = 0.14, z = −2.14, p = 0.03), and that the contrast between SRs and ORs was likewise significant for all groups (SRBs: β = 2.57, SE = 0.11, z = 23.35, p < 0.001; HSs: β = 5.07, SE = 0.17, z = 29.52, p < 0.001; ERBs: β = 3.29, SE = 0.14, z = 24.12, p < 0.001).
Figure 3. Accuracy on single-task picture-matching, grouped by experimental group and by RC type. Error bars indicate standard error.

Table 2. Regression output for model fitted to responses to single-task picture-matching

Note: Model structure: response ~ group × RC level + (1 | item). No. observations: 15,332. No. groups: 240. *p ≤ .05, **p ≤ .01, ***p ≤ .001.
3.2. Dual-task results
Data from the dual-task portion of the study were subdivided according to whether, on a given trial, participants responded to the MOT or to the picture-matching task. We start by analyzing the results for response to the MOT, as experimental predictions pertain to these results. For completeness, we also present the analysis of picture-matching responses, though we discuss them only briefly in Section 4.3.
3.2.1. Dual-task MOT
Accuracy on this portion of the task is presented in Figure 4. A logistic mixed effects regression model was fitted, predicting log-odds of a correct response to the MOT in the dual-task portion of the study. Predictors in the model were group, MOT difficulty and RC type, as well as their three-way and pairwise interactions. All categorical variables were coded as before in the single-task analyses. The model included random intercepts grouped by participant.
Figure 4. Accuracy of responses to dual-task MOT trials, grouped by experimental group, MOT difficulty and type of concurrently presented relative clause (RC type).

The model (Table 3) found a main effect of MOT difficulty – consistent with results from the single-task MOT – wherein participants were more accurate when tracking one dot versus two (β = 0.59, SE = 0.09, z = 6.52, p < 0.001) and more accurate when tracking two dots versus three (β = 0.23, SE = 0.08, z = 2.82, p = 0.005). The model also found a main effect of group, whereby SRBs were overall more accurate than the grand mean (β = 0.20, SE = 0.10, z = 1.99, p = 0.047) and the ERBs were overall less accurate than the grand mean (β = −0.26, SE = 0.12, z = −2.16, p = 0.03). Crucially, the model also found a main effect of RC type, in that participants were more accurate on the MOT when the simultaneously presented RC was an SR as compared to when it was an OR (β = 0.26, SE = 0.09, z = 3.01, p = 0.003). The model also identified a significant interaction between group and MOT difficulty; follow-up models indicate that group modulated the magnitude and significance of the MOT difficulty contrasts in a manner consistent with results from the single-task MOT.
Table 3. Regression output for model fitted to responses to dual-task MOT

Note: Model structure: response ~ group × RC level × MOT diff + (1 | participant). No. observations: 5,888. No. groups: 97. *p ≤ .05, **p ≤ .01, ***p ≤ .001.
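Because the models reported here are logistic, each coefficient is a difference on the log-odds scale, and exponentiating it gives an odds ratio. The short Python sketch below converts the reported RC-type coefficient for the dual-task MOT (β = 0.26) into an odds ratio as an aid to interpretation. It is not part of the authors' analysis, the function names are hypothetical, and reading a coefficient as a single odds ratio assumes the other predictors are held at their coded (averaged) values.

```python
import math


def odds_ratio(beta):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(beta)


def probability(log_odds):
    """Convert a log-odds value into a probability (inverse logit)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Reported dual-task MOT coefficient for the SR versus OR contrast.
or_sr_vs_or = odds_ratio(0.26)
```

On this reading, the odds of a correct MOT response were roughly 1.3 times higher when the concurrent relative clause was an SR than when it was an OR.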
3.2.2. Dual-task picture-matching
A logistic regression model was fitted to the data predicting log-odds of a correct response by group, MOT difficulty and RC type (Figure 5). Mixed effects models with random intercepts grouped by participant or item failed to converge or resulted in a singular fit. The model (Table 4) found a significant main effect of group, wherein the SRBs were more accurate than the grand mean (β = 0.72, SE = 0.08, z = 9.28, p < 0.001), the HSs were more accurate than the grand mean (β = 0.21, SE = 0.08, z = 2.54, p = 0.01) and the ERBs were less accurate than the grand mean (β = −0.93, SE = 0.07, z = −13.52, p < 0.001). The model also found a main effect of RC type consistent with the patterns in the single-task picture-matching task: participants were overall more accurate on RRs than SRs (β = 0.65, SE = 0.15, z = 4.19, p < 0.001) and more accurate on SRs than on ORs (β = 2.64, SE = 0.10, z = 25.57, p < 0.001). The model did not find any main or interaction effects involving MOT difficulty. Follow-up models exploring significant interaction effects between group and RC type indicate that the contrast between RRs and SRs was significant for the SRBs (β = 1.27, SE = 0.29, z = 4.43, p < 0.001) and the HSs (β = 0.90, SE = 0.32, z = 2.83, p = 0.005) but not for the ERBs; the contrast between SRs and ORs was significant for all three groups (SRBs: β = 2.61, SE = 0.18, z = 14.46, p < 0.001; HSs: β = 4.10, SE = 0.24, z = 17.00, p < 0.001; ERBs: β = 2.67, SE = 0.18, z = 14.4, p < 0.001).
Figure 5. Accuracy of responses to dual-task picture-matching trials, grouped by experimental group, RC type and difficulty of concurrently presented MOT stimulus (MOT difficulty).

Table 4. Regression output for model fitted to responses to dual-task picture-matching

Note: Model structure: response ~ group × RC level × MOT diff. No. observations: 5,911. *p ≤ .05, **p ≤ .01, ***p ≤ .001.
4. Discussion
4.1. Single-task MOT and picture-matching
Although the single-task results do not directly inform the present research questions, results consistent with our assumptions regarding increasing difficulty in the linguistic task and the MOT presented individually allow us to interpret the results of the dual-task portion of the study more readily. In the single-task MOT, we find an expected decrease in accuracy as the number of tracked dots increases. We find no main effect of group, as expected. We do, however, find an unexpected interaction effect: the effect of MOT difficulty is modulated by group, specifically for the contrast of one versus two dots. Follow-up analyses found that the effect was consistently non-zero across groups, but possibly smaller in magnitude for the ERBs. We speculate that this may be due to a difference in motivation, as the ERBs were participating for extra credit (though participants in all groups were university students and affiliates).
The results of the single-task picture-matching also largely align with our expectations – participants’ accuracy was overall highest on the RRs, followed by SRs, with the lowest accuracy on ORs. A group effect suggesting that the SRBs were more accurate than the HSs and ERBs is also consistent with expectations based on prior work on relative clauses (see Section 1.2). The contrast between SRs and ORs was consistent across groups; however, ERBs differed from the other two groups in that the contrast between RRs and SRs was in the opposite direction. In fact, this group was overall less accurate on RRs and SRs relative to HSs or SRBs (as confirmed by post hoc pairwise comparisons between groups within each RC type; see Supplementary Materials), suggesting that they may not have quite as strongly established the grammatical rules necessary to process these phrases, or that RRs presented a more difficult challenge than we initially intended for this group.
A notable aspect of these results is the low accuracy rates on ORs for HSs and ERBs: while the SRBs also had a decreased accuracy on the ORs, theirs remained above chance-level (0.5), whereas the HSs and ERBs fell below chance-level, which may be suggestive of consistent error or of preference for an SR interpretation. These results echo those of O’Grady et al. (Reference O’Grady, Lee and Choo2001), who found in a picture selection task that HSs and ERBs of Korean often interpreted ORs as SRs. In another picture-matching study, Polinsky (Reference Polinsky2011) found adult HSs were highly accurate on SRs but performed at chance on the ORs.
4.2. Dual-task
Having established that the difficulty manipulation in the linguistic task and the MOT yields expected effects when stimuli are presented individually, we now turn to discussing the dual-task results with respect to the research questions. The first research question we asked was Question (1): Do HSs or second-language learners (ERBs) require more domain-general cognitive resources for language processing than do speakers who had rich input to the target language through adolescence? Accuracy of responses to the dual-task MOT most directly answers this research question, given the following logic. Participants were presented concurrently with an MOT and a Spanish-language linguistic stimulus. Because participants did not know, while the stimuli unfolded, which component of the study they would have to respond to, we assume that they processed and encoded in memory the information from both stimuli, and only subsequently accessed one of them to provide a response. Reduced accuracy on the MOT is therefore thought to be the outcome of the increased cognitive effort required to process the Spanish-language stimulus and to encode it in working memory, which reduces the cognitive resources available for processing and responding to the MOT.
In Section 3.2, we find that the SRBs show the highest overall accuracy on the dual-task MOT, and the ERBs show the lowest overall accuracy. This suggests that the SRBs required the least amount of domain-general cognitive resources for processing the linguistic stimuli. Conversely, the effect for the ERBs indicates that processing in a language learned later in life requires more domain-general cognitive resources than it does for speakers who had rich input to the target language through adolescence. This result is consistent with previous work testing increased listening effort during language comprehension in an L2 (Borghini & Hazan, Reference Borghini and Hazan2018; Declerck & Kormos, Reference Declerck and Kormos2012; Francis et al., Reference Francis, Tigchelaar, Zhang and Zekveld2018).
These results also suggest that HSs were less accurate on the MOT in the dual-task context than the SRBs, providing novel evidence that HSs do indeed require more domain-general cognitive resources for language processing than do speakers who had rich input to the target language through adolescence. As discussed in Section 1, HSs have often been assumed to require more effort in processing a target language than do late-immigrant baseline populations. However, this assumption was largely founded on work on cognitive effort in L2 processing, which need not generalize to heritage-language processing. We provide novel evidence to directly test and validate this assumption. In addition, HSs also performed differently (more accurately) than ERBs, highlighting that, despite their shared limited input to the target language, HSs and ERBs remain distinct bilingual populations, including in terms of the amount of cognitive resources required for language processing. We note that, in comparing bilingual groups to each other, we deliberately remain agnostic as to what might drive the fact that listeners find more complex phrases to be more challenging, or what precise mechanisms require listeners in some groups to exert more effort for parsing than others. An anonymous reviewer suggests it may be relative language co-activation or differences in proficiency that yield these effects. We find both to be plausible candidates, but our study is not designed to investigate the question to this depth – we invite future work to pursue this matter.
One possible complication in the interpretation of the main effect of group in the dual-task MOT results could be the significant interaction between group and MOT difficulty identified in the single-task MOT (Section 3.1) – no group differences were expected in the single-task MOT, which involved no linguistic stimuli. However, follow-up models unpacking the interaction effect in the single-task MOT suggest that the contrasts of one versus two dots and two vs three dots were consistently non-zero for each group, and that what was driving the interaction effect was a smaller magnitude of the effect for the ERB group compared to the other two – indeed, ERBs’ accuracy on two-dot trials was numerically higher than that of the other groups. Thus, if this did have any implications for the dual-task, we would expect the ERBs to have an advantage going into the dual-task MOT, which is contrary to what we find.
Our second research question was Question (2): Does processing increasingly syntactically complex structures require more domain-general resources? Indeed, in addition to the expected main effect of MOT difficulty on accuracy in the dual-task MOT and the group effect discussed above, we also find a main effect of RC type, wherein participants were more accurate on the MOT when the concurrent auditory prompt contained an SR than an OR. In prior work, Heyselaar and Segaert (Reference Heyselaar and Segaert2019) showed that accuracy on an MOT was reduced when conducted while listening to a linguistic stimulus for which a response may later be required; we contribute novel evidence that an increase in the syntactic complexity of this linguistic stimulus requires additional domain-general cognitive resources, at least for the bilingual populations tested here.
Finally, we address our third research question, Question (3): If increasingly complex syntactic structures do require more domain-general cognitive resources, is there evidence that they require even more resources in language processing by HSs or ERBs? In other words, we ask whether there is a cumulative effect of syntactic complexity and reduced language experience. Evidence for this would have been a significant interaction effect between RC type and group on accuracy on the dual-task MOT. However, we did not observe such an effect. This may be due to limited power to detect even a two-way interaction in a 3 × 3 × 3 design. Interaction effects tend to be more variable and thus more difficult to detect, and even in a 2 × 2 design, the sample size needed to achieve 80% power to detect an interaction effect can be as high as 16 times the sample size needed to achieve 80% power for a main effect (Brysbaert, Reference Brysbaert2019; Gelman, Reference Gelman2018). One limitation of our study is that we did not conduct an a priori power analysis, which is all the more reason to exercise caution in interpreting a null effect. Our hope is that the present project will motivate future work and higher-powered studies that may more reliably test whether an increase in syntactic complexity differentially affects bilinguals of different language acquisition trajectories.
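The "16 times" figure cited above can be reproduced with simple arithmetic. The Python sketch below follows the usual reasoning behind that figure for a balanced 2 × 2 design; the assumptions are that all cells share a common standard deviation, that the interaction is half the size of the main effect, and that "equally detectable" means reaching the same ratio of estimate to standard error. The function names and default values are hypothetical, and the sketch is not a power analysis of the present study.

```python
import math


def se_main(sigma, n):
    """SE of a main effect: difference between two halves of the sample (n/2 each)."""
    return math.sqrt(sigma**2 / (n / 2) + sigma**2 / (n / 2))


def se_interaction(sigma, n):
    """SE of an interaction: difference of differences over four cells (n/4 each)."""
    return math.sqrt(4 * sigma**2 / (n / 4))


def sample_size_multiplier(effect_ratio, sigma=1.0, n=400):
    """Factor by which n must grow for the interaction to reach the same
    estimate-to-SE ratio as the main effect.

    effect_ratio is the interaction size divided by the main-effect size.
    """
    se_ratio = se_interaction(sigma, n) / se_main(sigma, n)  # equals 2
    return (se_ratio / effect_ratio) ** 2


multiplier = sample_size_multiplier(effect_ratio=0.5)
```

The interaction's standard error is twice that of the main effect, so an interaction half as large has one quarter of the signal-to-noise ratio; because standard errors shrink with the square root of the sample size, closing that gap requires 4² = 16 times as many observations.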
4.3. Limitations
One possible challenge in interpreting the results of the study is that no effects of MOT difficulty were observed for accuracy on the picture-matching task in the dual-task portion of the study (cf. Section 3.2). This may seem unexpected – if performance on the MOT and the picture-matching indeed draws from the same finite pool of cognitive resources, one might expect an equal effect in both directions, that is, an increase in MOT difficulty might lead to reduced accuracy on the dual-task picture-matching. We do not have a clear picture as to why we do not observe this, but we offer a few considerations. One often-noted limitation of dual-task studies is that they rely on the assumption that participants are equally attentive to both types of stimuli in the dual-task portion. However, it is certainly possible that some participants adopt a strategy in which they attend more to one component of the task, to the detriment of the other (Colby & McMurray, Reference Colby and McMurray2021; Wu et al., Reference Wu, Stangl, Zhang, Perkins and Eilers2016). Our results are consistent with a scenario in which participants attended more to the picture-matching task than the MOT: participants may have perceived success to be more likely on the linguistic task and therefore attended to it more, leading to fewer observable differences in performance on the picture-matching. In a different vein, given varying proficiency levels, it is possible that accuracy on the linguistic task was already rather noisy, and this noisiness may have been exacerbated in the dual-task context, making it more difficult to detect the signal, though we stress that this consideration remains speculative.
A final possible limitation in the present study is that the effect of RC difficulty on performance on the dual-task MOT is evident only in the contrast between SRs and ORs; the contrast between RRs and SRs was not significant. We would like to suggest that this is not entirely surprising, for two reasons that are likely related to each other. The first is that the difference in difficulty between RRs and SRs in the single-task picture-matching is numerically small (cf. Section 3.1), whereas the contrast between SRs and ORs was numerically larger for all groups. It is possible that the contrast in difficulty of processing RRs versus SRs while concurrently processing the MOT stimulus was not large enough to have an observable effect on performance on the MOT. The second reason, which likely informs the first to some degree, is the following. In designing an experimental task with three RC types of increasing difficulty relative to each other, the choice to include SRs and ORs was well motivated (see Section 1.2). However, the choice for the third level – one easier than SRs or one more difficult than ORs – was less clear. On the accessibility hierarchy of relative clauses (Keenan & Comrie, Reference Keenan and Comrie1977), the hypothesized next most difficult RC type to process would be an indirect object relative clause (SR > OR > IOR) (e.g., el oso al que la rana dio el regalo ‘the bear that the frog gave a gift to’), but this would entail much longer auditory stimuli and a third entity in the depicted/described events. Additionally, this kind of construction could be too difficult: if ORs are difficult enough that HSs in some prior studies performed at chance (cf. Section 4.1), it is not clear that performance on indirect object relative clauses would be meaningfully different; in other words, we might observe floor effects. Indeed, this is echoed post hoc in our single-task picture-matching results, where HSs’ (and ERBs’) performance on ORs fell numerically below an average accuracy of 0.5, indicating that some participants were consistently misinterpreting these constructions.
The remaining option for relative clause type then was to include a construction similar in nature to SRs and ORs but in some way easier than an SR (? > SR > OR). The literature does not offer a clear solution here, and the choice of RRs was ultimately motivated by feedback we received during the development of the experimental design. The results of the single-task picture-matching suggest that these constructions were indeed overall easier than the SRs, if not by much, but the lack of studies directly comparing RRs and SRs limited our ability to make a priori predictions regarding the relative difficulty of these two types of RCs.
4.4. Implications
The current study had three goals. First, we wanted to test the assumption that HSs require more cognitive effort to process input in the target language, relative to speakers with rich input to the target language through adolescence. Our results indeed validate this assumption, suggesting it is a plausible explanation for divergences in linguistic performance between HSs and SRBs. Second – and related to the first – we wanted to determine whether the absence of divergence in performance between these groups reported on certain tasks in the literature may in some cases be attributed to additional cognitive effort exerted by HSs to achieve the same outcome as SRBs. Our results again suggest that this is plausible, though it may not be observable in tasks that measure performance on a linguistic task alone. Finally, we set out to understand how HSs fit into the broader bilingualism spectrum by including in our study not only SRBs but also ERBs, for whom listening effort has been previously studied (cf. Section 1.2), providing an additional baseline for comparison. Consistent with prior work, we find that SRBs required the least amount of cognitive resources for processing in Spanish, while ERBs required the most, but crucially all three groups performed differently from each other, reaffirming that HSs’ input, which is early and naturalistic (like that of SRBs) yet variable and restricted in domain (like that of ERBs), leads to outcomes that are distinct from those of both groups.
The results from this study are also relevant for classroom teaching of both HSs and ERBs. There is a large body of literature showing the benefits of task-based language teaching, a methodology that focuses on having learners carry out real-life tasks that push them to use certain structures and vocabulary. Crucial to understanding how to develop tasks is understanding the cognitive demands placed upon learners as they carry them out. Dual-task paradigms such as the one used here can provide direct insight into how different groups of learners are affected by task demands of growing complexity (Révész, Reference Révész2021; Révész et al., Reference Révész, Michel and Gilabert2016).
5. Conclusion
The present study aimed to determine the relative amount of cognitive resources used to process linguistic input for SRBs, HSs and ERBs of Spanish. We used a dual-task methodology, concurrently manipulating the complexity of a non-linguistic task (MOT with one, two or three dots) and a linguistic task (picture-matching involving phrases with reduced relative clauses, subject relative clauses and object relative clauses). We followed prior work in assuming that accuracy on the non-linguistic task may be reduced if attention to the linguistic task requires more cognitive resources due to linguistic complexity and/or experience with the language.
We found that SRBs were overall most accurate on the dual-task MOT and that ERBs were overall least accurate, which we interpret as a three-way difference between SRBs, HSs and ERBs in the amount of cognitive resources required to process the concurrent linguistic stimulus. An additional overall effect of relative clause type across groups indicated that higher complexity of the linguistic stimulus also requires more cognitive resources during language processing. We did not observe an interaction between these effects.
This work replicates prior findings that ERBs require more cognitive resources than speakers with rich input to the target language through adolescence, and it expands our understanding of this phenomenon by demonstrating that HSs also require more cognitive resources for language processing than do SRBs. This validates a common claim in the literature regarding the relative amount of cognitive resources required during heritage-language processing. The implications are twofold: first, some divergences in linguistic performance between HSs and SRBs may indeed be ascribed to fewer resources being available for attending to specific aspects of a linguistic stimulus; second, in some cases, similar performance on a linguistic task between SRBs and HSs may reflect more effort exerted by HSs to achieve the same outcome.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S1366728926101308.
Data availability statement
The data that support the findings of this study are openly available on OSF at https://osf.io/etprk/overview?view_only=19f67921b4eb4ab58299685068ac0a16.
Acknowledgments
For valuable support and discussion, we thank the audiences at the University of Cologne Department of Romance Studies; the UC San Diego Center for Research in Language; the Bilingualism, Mind & Brain Lab; the 14th Heritage Language Research Institute; and the 38th Annual Conference on Human Sentence Processing. For support in preparation of experimental materials and data collection, we thank Luke Whitaker, as well as Jiamin Cheng, Leo Mermelstein, Metehan Oğuz, Damaris Ortega, Anna Runova, Patrick Sherlock and Leslie Yang. All errors are our own.
Competing interests
The authors declare no competing interests.