Highlights
• The impact of L2 exposure on ToM is visible after one academic year
• The impact of L2 exposure on ToM is independent of L2-vocabulary
• Selective attention, switching, and inhibition contribute variably to ToM
• L1-vocabulary and non-verbal reasoning are integral to ToM development
• Individual differences impact cognitive, affective, and conative ToM differentially
1. Introduction
The nature of the relationship between bilingualism and children’s social, cognitive, and linguistic development continues to be debated. Whereas early literature (Ausubel et al., 1980; Hakuta, 1986) warned that exposure to bilingualism might impede these skills, there is now literature supporting bilingualism’s positive effects on cognition (Bialystok & Martin, 2004; Cape et al., 2018; Costa et al., 2008; Genesee, 2004; Hernández et al., 2013) and literature that does not (Antón et al., 2019; Branzi et al., 2016; Hernández et al., 2013; Paap & Greenberg, 2013; Paap et al., 2015). Studies have become more precise about the cognitive skills that bilingualism might enhance (Costa et al., 2009; Bialystok & Craik, 2010) and have discussed the intensity and type of bilingual environment in which positive effects are more likely to materialize (Bialystok & Majumder, 1998; Carlson & Meltzoff, 2008; Hermanto et al., 2012; Kalashnikova & Mattock, 2014).
Focus has also turned to the potential contributions of the myriad variables present in bilingual groups, such as socioeconomic status (SES), caregivers’ educational background, immigrant status, first-language (L1) proficiency, and working memory (WM) (Chamorro & Janke, 2020; Engel de Abreu et al., 2012; Hughes et al., 2005; Nguyen & Astington, 2014; Paap et al., 2015). Still more recently, researchers have examined whether early-observed advantages continue longitudinally (Chamorro & Janke, 2023; Chamorro et al., 2025; Dick et al., 2019; Nichols et al., 2020; Rubio-Fernández & Glucksberg, 2012).
These questions extend to research on bilingual children’s social cognition – the main focus of our study – in particular, their developing Theory-of-Mind (ToM) (Premack & Woodruff, 1978). ToM refers to the ability to understand others’ beliefs, desires, and thoughts and to recognize that these will influence their behavior (Wellman, 2018). As ToM tasks exploit conceptual (constructing several representations) and executive-function (EF) skills (maintaining and toggling between multiple representations and discarding one in favor of another), it is unsurprising that bilingualism might boost aspects of ToM too (Bialystok & Senman, 2004; Farhadian et al., 2010; Goetz, 2003; Kovács, 2009; Schroeder, 2018). Some studies have reported overlaps in neurological activation in participants undertaking ToM and EF tasks (van der Meer et al., 2011), warranting further comparisons between patterns of performance on these different batteries. There is, for example, research that points to EF having a positive effect on ToM in monolinguals (Devine & Hughes, 2014; Doenyas et al., 2018; Hughes & Devine, 2019), and work that has shown EF development to predict ToM development but not vice versa (Carlson et al., 2002), which might suggest that EF skills underpin those utilized in ToM (Marcovitch et al., 2015).
However, the overlap is not necessarily neat and may depend on the ToM task administered (Schlaffke et al., 2015; Sebastian et al., 2012).
With respect to bilinguals, Huang et al. (2023) linked non-verbal WM and cognitive flexibility – but not inhibitory control – to ToM scores in Spanish-English children from low-income families. This and other work has urged researchers to extend the breadth of ToM tasks administered beyond the prototypical cognitive false-belief (FB) tasks (Wellman, 2018) and to monitor the influence on ToM of other variables, such as L1-proficiency (Białecka et al., 2024), a factor strongly associated with ToM in monolinguals (de Villiers & de Villiers, 2014; Milligan et al., 2007).
In this longitudinal study, our primary question is whether children with higher second-language (L2) exposure achieve higher ToM scores. We respond to concerns over the heterogeneity in bilingual groups and the call to utilize different ToM tasks by focusing on bilingually educated children, whose access to bilingualism is restricted to school, and by administering an extensive battery, which assesses understanding of desires, emotions, beliefs, reference, moral-reasoning, lies, and sarcasm (Sotomayor-Enríquez et al., 2023). These concepts comprise chiefly cognitive, affective, or conative components (see below), enabling us to check if performance on these is differentially affected. However, limiting the present participants to those from monolingual homes does more than introduce predictability in terms of L2-exposure. It means that their access to the broader social interactions familiar to multilingually and multiculturally raised children is reduced. Since the cultural diversity that bilingual children experience can affect their sensitivity to others’ emotions (Bukhalenkova et al., 2022; Cheung et al., 2010), minimizing this diversity diminishes another potentially influential factor. In addition, the intensity and duration of L2-exposure are important, so by including L2-exposure as a continuous variable, we monitor these influences on ToM too. As a secondary question, we explore how three attention measures (selective, switching, inhibition) relate to ToM, and finally, we track the influence of a suite of individual differences (non-verbal reasoning [NVR], WM, L1-vocabulary, L2-vocabulary, age, gender, family education, other language[s] at home, onset of L2-exposure, L2-exposure outside school, and exposure to further languages beyond school).
Our introduction starts with literature that has focused on bilingualism in a broad range of bilingual contexts and its relation to ToM, before turning to investigations that narrow down to educational bilingualism and ToM. Where space permits, we include EF literature that relates directly to this paper’s chief ToM focus. The current paper complements Chamorro et al. (2025), who examined educational bilingualism and attention, and found that higher L2-exposure and higher L2-vocabulary consistently predicted higher selective attention, switching, and inhibition scores. It will be particularly interesting to see, therefore, whether the variables that predicted attention scores in this population also predict ToM scores.
1.1. Bilingualism and ToM
This section presents studies of participants with different language histories, highlighting how their degree of dominance is reported. Where information is available, we include which aspects of ToM were measured to see if distinctions emerge with respect to the tasks’ cognitive, affective, or conative components (Shamay-Tsoory & Aharon-Peretz, 2007). When reported, EF measures that were compared with ToM will also be signaled.
Most studies on bilinguals’ ToM development have focused on FB (Yu et al., 2021), which prototypically involves a child being presented with a vignette where they see that, unbeknown to a protagonist, the location/content of an item changes. The task tests whether children grasp that the protagonist will continue to falsely believe that the original location/content is correct (Astington & Gopnik, 1991; Wimmer & Perner, 1983). By 4 years, most children pass simple versions (Wellman et al., 2001). FB tasks are categorized as cognitive as they require children to hold two mental representations (one corresponding with reality and the other not) and to choose the one that conflicts with reality over the one that does not (Carlson et al., 2002). In contrast, affective-ToM tasks test whether children identify emotions, distinguish overt emotional expressions from actual emotional states, and feel compassion for others (Dennis et al., 2013; Feng et al., 2023). Like their strictly cognitive counterparts, affective-ToM tasks incorporate mentalizing, but the additional inferences drawn concern emotions, not beliefs or knowledge. In light of Shamay-Tsoory et al. (2007), who reported a dissociation between cognitive and affective-ToM in adults with contrasting brain traumas, it is possible that these skills are differentially influenced.
FB tasks become more complex by introducing a second protagonist into the vignette and asking ‘what does X believe that Y believes about a circumstance’. Performance on these second-order tasks develops later, from around 6 years (Liddle & Nettle, 2006). A possible reason, beyond the need to manipulate more representations, is that these tasks are more complex linguistically, since their explanation involves several layers of sentence embedding. This increased linguistic complexity makes their results harder to pin to purely cognitive factors. Other cognitive-ToM tasks include physical appearance-reality tasks (Carlson et al., 2002; Flavell et al., 1983), where the challenge is to distinguish between what something looks like and what it actually is, and visual perspective-taking tasks, which test the ability to recognize that an item will look different from alternative viewpoints.
Bialystok and Senman (2004) examined the relation between bilingualism and performance on an appearance-reality task, expecting 4- and 5-year-old bilinguals to outperform monolinguals on the basis of their higher inhibitory-control scores. The bilinguals came from families with recent immigrant status and a range of language backgrounds, and spoke L2-English in their schools/neighborhoods and their L1 at home. Neither their degree of dominance nor the age at which they started to learn English is stated. Shown ‘real’ objects (a sponge looking like a rock) and ‘representational’ ones (a whale functioning as a pen), children were asked what they were. On answering, the objects’ true characteristics were revealed. The questions included (a) what the children first thought the item was, (b) what someone who had not been shown its true characteristics would think it was, and (c) what it really was. With language controlled, bilinguals outperformed monolinguals on (c). In addition, for all children, composite inhibitory-control scores correlated with performance on (c), which made sense conceptually since to succeed on both tasks they had to suppress a salient yet irrelevant representation in favor of a correct one. This finding accords with literature on monolinguals’ ToM, which has linked inhibitory control (alongside other EF skills) to successful execution of ToM tasks (Apperly, 2011).
Other work has pointed to the role of WM for ToM in bilinguals. Nguyen and Astington (2014) examined the relation between WM (Backward Word Span), conflict inhibition (Day/Night, Stroop), and FB (unexpected contents/location). The study compared 3- to 5-year-old English and French monolinguals with English-French bilinguals. The latter had been exposed to English and French before 8 months and were exposed to both languages at least 30% of the time. With bilinguals’ lower vocabulary scores controlled, they outperformed monolinguals on FB and WM but not conflict inhibition. There was also a correlation between FB and WM, which was attributed to WM exploiting similar skills to FB, such as maintenance and manipulation of items.
Bilingualism also predicted performance on a spatial-perspective task in Greenberg et al. (2013), where the verbal element of the task had been greatly reduced in an effort to pin the bilingual advantage more confidently to EF skills. Eight-year-old bilinguals (representing 15 languages), 89% of whom spoke their non-English language at home and 62% of whom spoke English and their other language daily, were compared with monolinguals. The groups did not differ on English vocabulary, NVR, or SES, and it was bilingualism that predicted performance on this purely cognitive task, arguably due to the bilinguals’ better EF skills.
Other studies have found ToM performance in bilinguals to be related to language abilities rather than cognitive ones. Diaz and Farrar (2018) compared Spanish-English bilingual and monolingual 3- to 5-year-olds on FB and appearance-reality. The language histories of the bilingual children were diverse: 61% were exposed to both languages from birth and spoke regularly in both, and those whose access to their least dominant language had come later had been exposed to it for at least a year. With language controlled, a bilingual advantage was found again for (composite) ToM. However, whereas general language ability predicted composite ToM scores for both groups, interference suppression and WM mediated monolinguals’ scores. As the FB and appearance-reality tasks were not analyzed separately, one cannot tell if different factors predicted performance on them. In a longitudinal version of this study, Diaz and Farrar (2017) compared 3- to 5-year-old Spanish-English bilinguals with monolinguals. Again, the bilinguals had varied language histories, with 78% experiencing both languages from birth and the remainder having been exposed to their non-dominant language for at least a year. With vocabulary controlled, bilinguals outperformed monolinguals on composite ToM (and EF) scores at first testing. Expressive vocabulary and metalinguistic awareness predicted bilinguals’ ToM scores, whereas interference suppression and metalinguistic awareness predicted monolinguals’. One year later, the group differences on FB and EF had disappeared, and performance predictors had changed: metalinguistic scores (where the key trial was a synonym task) predicted bilinguals’ ToM, but language (receptive language and complementation) and cognitive-flexibility (Dimensional Change Card Sort, DCCS) scores from the first testing phase predicted monolinguals’ ToM.
Again, use of composite ToM scores prevents any disentangling between FB and appearance-reality with respect to these predictors.
Białecka et al. (2024) looked further at the role of language proficiency on ToM but included an affective component in their tasks. They compared Polish-English 4- to 6-year-old immigrant bilinguals (living in England but also spending time in Poland), with Polish monolinguals. The bilinguals’ age at onset of L2-exposure varied substantially and the Polish-English balance at home is not stated. Of the ToM tasks tested, most were cognitive, but some considered feelings, thereby incorporating affect into the composite score. Children gained points for accuracy and justification of answers, and whereas no differences emerged for accuracy, bilinguals did better with justifications. For all groups, L1-proficiency predicted accuracy but for monolinguals, NVR did, too. However, the strongest predictors for bilinguals’ justification scores – the harder task cognitively and linguistically – were L1- and L2-vocabulary.
Han and Lee (2013) compared cognitive and affective perspective-taking abilities in 4- to 5-year-old bilinguals (Korean-English) and monolinguals (Korean). For the cognitive part, children retold a story from another’s perspective, and for the affective, they decided on a character’s emotions. The bilinguals had comparable Korean and English vocabulary scores. All children lived in South Korea with Korean parents, but the bilinguals were largely born outside of South Korea, attended international kindergartens, and had varied experiences living abroad. Their finding, that bilinguals outperformed monolinguals on the affective but not the cognitive task, is difficult to interpret because scores on the affective task were very high overall, whereas 4-year-olds scored less than 50% correct on the cognitive one, indicating substantial difficulty with it. NVR, WM, or other EF skills were not measured, so the study cannot speak to whether these factors contributed to overall performance or whether cognitive and affective components were differentially affected. Note, for example, that in Cassetta et al. (2018), monolinguals’ performance on inhibitory control and switching predicted (second-order) cognitive-ToM but not affective-ToM. However, Han and Lee’s results might suggest that the generalization that cognitive-ToM precedes affective-ToM, as per Shamay-Tsoory et al. (2007), is too strong and dependent on task difficulty. Success with cognitive second-order FB tasks does seem to precede success with social-faux-pas tasks (Baron-Cohen et al., 1999), where the latter includes complex cognitive and affective components.
Keeping track of the type of ToM task administered is important as the ability to infer thoughts differs from the ability to infer feelings (Healey & Grossman, 2018), and in monolingual adult groups, these abilities are associated with different EF skills. Healey and Grossman, for example, found switching was more relevant to cognitive than affective perspective-taking. Viewed in relation to the literature demonstrating how development of inhibitory control (Grote et al., 2021; Verhagen et al., 2017) and switching (Planckaert et al., 2023), in particular, favors the bilingual child, this is another reason that monolinguals’ and bilinguals’ performance on cognitive and affective-ToM might diverge.
To summarize, we have reviewed the performance of bilingual children with quite disparate language histories on a range of ToM tasks and found that different linguistic and EF predictors emerge not just between monolinguals and bilinguals but between different bilingual types. Given the heterogeneity in bilinguals across studies, a question that arises is whether results could become clearer by forging greater homogeneity in this group. Focusing on educational bilingualism – where L2-access starts at pre-school/primary-education (PE) and is largely restricted to this environment – is one way of extracting a more homogeneous subset, and there are only a few studies that have examined ToM in relation to this group.
1.2. Educational bilingualism and ToM
Goetz (2003) provided an example of educational bilingualism benefitting ToM development. It compared two monolingual preschooler groups (Mandarin, English) with Mandarin-English bilinguals whose access to English began in daycare. Children participated twice (T1, T2) on three cognitive tasks: appearance-reality, perspective-taking, and FB (unexpected contents/transfer). There was a bilingual advantage for overall ToM at T1 but not T2, and for all children, L1-vocabulary correlated positively with ToM at T1 but not T2. A key difference between groups with respect to T1 and T2 was that monolinguals repeated the tasks in the same language, whereas bilinguals completed them in English and then in Mandarin. As Goetz suggests, this might account for why monolinguals’ performance improved, whereas bilinguals’ remained constant. However, it is also possible that an initial boost provided by bilingualism faded quickly. Goetz’s bilingual group does not map sufficiently tightly to ours because although Mandarin was the primary home language, the children had some access to English via family friends and activities such as TV. Buac and Kaushanskaya (2020) accords more closely in this respect. Focusing on 7-year-old children, it compared monolinguals with two bilingual groups with different language histories on an English second-order FB task. One bilingual group had access to two languages before age three, but the other group had only L1-English at home and were introduced to Spanish at school from six, where the language balance was typically 70% English to 30% Spanish. The simultaneous bilinguals struggled most with FB and the factors predicting the groups’ performance differed. For simultaneous bilinguals, whose English scores were significantly the lowest, expressive English language skills predicted FB. For monolinguals, it was WM, yet for the immersion group, interference inhibition and switching scores were the relevant predictors.
One possible interpretation is that the importance of language for FB is central when that language is (relatively) impoverished, but as it improves, aspects of EF take over as key. Comparison between the simultaneous bilingual and immersion groups on FB in Spanish, where language abilities would not have been at stake, might have clarified this.
Cheung et al. (2010) compared FB in 3- to 4-year-old children with the same home language (Cantonese) but with different access to L2-English. Both groups had monolingual families, but one attended an immersion kindergarten, with all activities in English. The other attended a Cantonese-speaking kindergarten with 5 hours of English per week. After 1 year, the immersion children fared better on FB, suggesting that there might be a minimum threshold for L2-exposure to impact ToM, 5 hours per week being insufficient. However, the authors linked the difference to sociolinguistic awareness: the immersion children switched languages completely between kindergarten and home, and it was this difference that propelled sociolinguistic awareness, and in turn, ToM. Indeed, it was sociolinguistic awareness scores that predicted FB performance. Agostini et al.’s (2025) results were more promising with respect to the linguistic threshold for positive ToM effects. They compared 4- to 5-year-old monolingual, bilingual, and L2-learner groups on perspective-taking in a communicative setting on starting school and 6 months later. At first, children’s performance did not differ, but 6 months on, the bilingual and L2-groups outperformed monolinguals. However, none of interference inhibition, cognitive flexibility, or L1-English vocabulary predicted performance on this cognitive task.
These few studies examining ToM in relation to educational bilingualism demonstrate that further inroads could be made by studying a greater number of children in this less-studied and more homogeneous group and by including school L2-exposure as a continuous variable. A longitudinal study on this group would reveal how quickly any advantages that do materialize occur and how they pattern over time. Finally, documenting children’s success on individual ToM concepts rather than on composite scores would enable comparisons between performance on ToM components whose driving forces are cognitive, affective, or conative (see Section 2.2). Our study is situated within this context.
1.3. The present study
Our main aim is to compare the ToM scores of bilingually educated Spanish children with those of monolingual Spanish children longitudinally. We employ Sotomayor-Enríquez et al.’s (2023) task, which builds on the developmental progression of sociocognitive abilities proposed in Wellman and Liu (2004) and includes cognitive, affective, and conative components. A subsidiary aim is to monitor whether attention skills further modulate ToM success. Lastly, our inclusion of a range of individual-difference factors will track their contribution to ToM scores. Our questions and hypotheses are as follows:
RQ1. How does proportion of English at school relate to performance on individual ToM components (desires, emotion, belief, moral-reasoning, reference, lies, sarcasm) longitudinally?
H1. Greater L2-exposure at school will pattern with higher ToM scores at the year’s start, and this association will have strengthened at the year’s end; however, cognitive, affective, and conative ToM components may be differentially affected.
RQ2. To what extent does performance on attention (selective, switching, response inhibition), beyond that of L2-exposure, contribute to ToM scores longitudinally?
H2. Higher attention scores will pattern with higher cognitive ToM scores, but affective and conative ToM components may be differentially affected.
RQ3. Do individual differences (NVR, WM, L1-vocabulary, L2-vocabulary, age, gender, family educational level, other language(s) spoken at home, onset of L2-exposure, L2-exposure beyond school, exposure to further languages beyond school) influence children’s ToM scores?
H3. Higher NVR, L1-vocabulary, and L2-vocabulary scores, in particular, will map to higher ToM scores.
2. Method
2.1. Participants
A total of 231 Spanish children from 10 Madrid schools were recruited.¹ Teachers and parents reported they had no social, cognitive, or linguistic conditions. There were two testing phases: beginning of PE (T0), age 5–6; and end of Year 1 (T1),² age 6–7. Children were grouped as monolingual, bilingual, or immersion according to English exposure at school. The ‘monolinguals’, from four schools (three state, one semi-private), attended non-bilingual schools whose curricula were delivered in Spanish aside from 3 hours of English per week, which amounted to 13.3% of their curricula. The bilinguals, from five schools (three state, two semi-private), attended bilingual schools where 32–41.1% of the curricula were in English. The immersion group, from one British (private) school, had an 82.86% English curriculum, with 4.5 hours of Spanish per week. Table 1 shows each group’s number, gender, and age at T1.
Table 1. English exposure, number, gender distribution, and mean age (SD) by group at T1

2.2. Materials
2.2.1. ToM measures
For ToM, we used the Theory-of-Mind-Booklet Task (Sotomayor-Enríquez et al., 2023), developed for longitudinal studies with children aged 3–12, which includes two booklets.³ The task exists in English and German, so we translated it into (Peninsular) Spanish, piloting it first. It comprises stories and pictures (see Figure 1) that describe a protagonist’s mental state and appearance, as well as physical events, objects, and states, and that capture stable individual differences and developmental changes. It includes binary forced-choice and free-response questions covering a range of concepts. At T0, we administered Booklet 1, with two stories: one about children finding schoolbooks and another about playing in a park. It contains 42 items assessing five concepts: Desires, Emotion, Belief, Reference, and Moral-Reasoning. At T1, children completed Booklet 2, featuring a story about looking for snacks. By using two booklets designed for longitudinal studies, we tracked changes while avoiding practice effects from using the same stories. Booklet 2 includes the ToM concepts assessed in Booklet 1 but adds Lies and Sarcasm, raising the item total to 50.⁴ For each item, children scored 1 if correct and 0 if incorrect. Table 2 illustrates the concepts included and their cognitive/affective/conative categorizations. Cognitive mentalizing refers to ToM tasks requiring cognitive perspective-taking, distinguishing physical appearance from reality and assessment of others’ cognitive beliefs. Affective mentalizing includes recognizing emotions behind facial expressions and, at a more advanced stage, understanding deceptive emotional expressions of states.
Lastly, conative mentalizing requires an understanding of how and why someone can influence the thoughts/feelings of another, such as by empathizing or using sarcasm/irony (Dennis et al., 2013).
Figure 1. False-belief example from Booklet 2.

Table 2. ToM concepts included in each booklet, their score range, and categorization

2.2.2. Attention measures
Children completed the Test-of-Everyday-Attention-for-Children (Manly et al., 1999), which is a standardized and normed clinical battery of attention tests. They undertook four tasks (six measures) assessing three attention types: selective, switching, and response inhibition.
• SkySearch (selective attention: timing): Children locate as many pairs of identical spaceships (targets) as quickly as possible among pairs of different spaceships (distractors). The score reflects timing but takes accuracy into consideration too.
• CreatureCounting (switching: accuracy and timing): Children see rows of ‘creatures’ with arrows pointing up or down inserted between some of them. They count the creatures aloud, switching the way they count based on the arrows’ directions. There are seven trials with two measures: accuracy (number of trials in which creatures are counted correctly) and timing (mean time children take in accurate trials; timing is taken if they get at least three trials correct).
• Walk/Don’t Walk (response inhibition: accuracy): A series of tones plays, and children mark one step along a path for each tone until an unpredictably timed final tone, which ends differently, signals them to stop. The task has 20 trials; the score represents the total number of correct trials.
• OppositeWorlds (switching: timing – congruent and incongruent): Children see paths of numbers, ‘1’ and ‘2’. In the congruent condition, they read the numbers as they are; in the incongruent condition, they read ‘1’ as ‘2’ and ‘2’ as ‘1’. There are four trials with two in each condition (i.e., two timing measures in each condition).
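The CreatureCounting scoring rule described above can be sketched as follows. This is an illustrative reading of the description only, not the published scoring procedure of the test battery; the function name, the trial data, and the choice to return no timing score below the three-trial threshold are assumptions made for the example.

```python
from statistics import mean
from typing import List, Optional, Tuple

# Timing is only scored when at least three trials are counted correctly.
MIN_CORRECT_FOR_TIMING = 3


def score_creature_counting(
    correct: List[bool], seconds: List[float]
) -> Tuple[int, Optional[float]]:
    """Return (accuracy, timing) for one child's trials.

    accuracy: number of trials counted correctly.
    timing:   mean time over the correct trials, or None (assumed here)
              when fewer than three trials are correct.
    """
    if len(correct) != len(seconds):
        raise ValueError("one time is needed per trial")
    accuracy = sum(correct)
    accurate_times = [t for ok, t in zip(correct, seconds) if ok]
    timing = mean(accurate_times) if accuracy >= MIN_CORRECT_FOR_TIMING else None
    return accuracy, timing


# Hypothetical child with seven trials (values invented for illustration).
trials_correct = [True, True, False, True, False, True, True]
trials_seconds = [10.0, 12.0, 15.0, 8.0, 20.0, 11.0, 9.0]
accuracy, timing = score_creature_counting(trials_correct, trials_seconds)
print(accuracy, timing)
```

Keeping accuracy and timing as separate outputs mirrors the two measures the task contributes to the models.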
2.2.3. Individual-difference measures
Standardized measures included NVR, assessed by Raven’s Coloured Progressive Matrices (Raven et al., Reference Raven, Court and Raven1998); WM, assessed by the Forward Digit Span task from the Wechsler Intelligence Scales for Children-Revised (WISC-R; Wechsler, Reference Wechsler1974); and L1- and L2-receptive vocabulary, assessed by the Test de Vocabulario en Imágenes Peabody (PPVT-III; Dunn et al., Reference Dunn, Dunn and Arribas2006) and the British Picture Vocabulary Scales (BPVS3; Dunn & Dunn, Reference Dunn, Dunn and Styles2009), respectively.Footnote 5 Children’s age and gender were recorded, and parents completed a questionnaire at T0 on their educational background, immigrant status, home language(s), children’s onset of L2-exposure, and amount/type of exposure to foreign languages outside school; the questionnaires were collected at school.
2.3. Procedure
With school and parental consent in place, children undertook all tasks at T0 and T1. Testing took place individually in a quiet room, during school hours, over three 30-minute sessions on different days. During the first, children completed Raven’s, then Digit Span, and then BPVS. In the second, they undertook the ToM task and then PPVT, and in the third, the attention test. Tasks were explained and conducted in Spanish, aside from BPVS, which was explained in Spanish and conducted in English.
2.4. Analyses
Analyses were conducted in R (version 4.2.1; R Core Team, 2023) with Generalized Linear Mixed Models using the glmmTMB package. Given that our study’s main focus was to identify how development progressed across ToM concepts, the analyses were conducted at the concept level. The dependent variable was Concept-LevelFootnote 6 score (coded 0/1), which was calculated for the concepts not at ceiling, with ‘concept’ included as a fixed effect. Concepts exceeding the conservative ceiling criterion (≥ 0.95 accuracy) were excluded from inferential modeling to ensure variability.
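The ceiling criterion can be illustrated with a short sketch. It assumes that accuracy is pooled over all responses to a concept, which may differ from how the criterion was applied by group; the function name and the scores are hypothetical.

```python
from typing import Dict, List, Tuple

CEILING = 0.95  # concepts with accuracy at or above this value are excluded


def concepts_for_modeling(
    item_scores: Dict[str, List[int]]
) -> Tuple[Dict[str, float], List[str]]:
    """Compute per-concept accuracy from 0/1 scores and list concepts below ceiling."""
    accuracy = {c: sum(s) / len(s) for c, s in item_scores.items()}
    kept = [c for c, a in accuracy.items() if a < CEILING]
    return accuracy, kept


# Invented 0/1 responses, pooled across children (assumption for illustration).
scores = {
    "Desires": [1] * 39 + [0] * 1,   # 0.975, at ceiling
    "Emotion": [1] * 30 + [0] * 10,  # 0.75
    "Belief": [1] * 24 + [0] * 16,   # 0.60
}
accuracy, kept = concepts_for_modeling(scores)
print(kept)
```

Excluding near-ceiling concepts before fitting leaves only outcomes with enough variability for a binomial model to estimate effects.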
Models were fitted at T0 and longitudinally across T0 and T1 for all ToM concepts included at T0 and T1. T1-only models were run for the two concepts introduced at T1 (Sarcasm, Lies). Additional models were fitted to examine the contributions of attention measures (selective, switching, response inhibition) to ToM.
The main predictor of interest was percentage of English-at-school,Footnote 7 with time (T0, T1) and their interaction included to explore longitudinal changes. Attention measures were added as predictors in a subset of extended models to evaluate their role beyond that of L2-exposure in explaining ToM outcomes. For categorical predictors (i.e., ToM concept), deviation coding was used. Models also included covariates (L1-vocabulary, L2-vocabulary, NVR, WM, age, gender, family education, further language(s) spoken at home, age of first exposure to English, weekly exposure to English, and other languages outside school).Footnote 8 A systematic model selection process was employed, beginning with a full model containing all fixed effects and applying backward elimination based on likelihood ratio tests to simplify models (Plonsky & Ghanbar, Reference Plonsky and Ghanbar2018). The main predictors of interest (English-at-school, time, their interaction) were retained in final models. For each model, the maximal random effects structure that converged was employed, with random intercepts for subjects to account for repeated measures (Barr et al., Reference Barr, Levy, Scheepers and Tily2013). Model assumptions, including uniformity, dispersion, and outliers, were checked through residuals using the DHARMa package (version 0.4.7; Hartig, Reference Hartig2022). Model outputs are in the Supplementary Materials (Tables S5–S11).
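To make the deviation (sum-to-zero) coding of the ‘concept’ factor concrete, the following sketch builds the contrast rows for a four-level factor. It is a generic illustration, not the code used for the analyses; the function name and the choice of the last level as the omitted one are assumptions.

```python
from typing import Dict, List


def deviation_coding(levels: List[str]) -> Dict[str, List[int]]:
    """Return sum-to-zero contrast rows for a categorical factor.

    Each of the first k-1 levels gets its own indicator column; the last
    (omitted) level is coded -1 in every column, so each column sums to zero
    and coefficients are deviations from the grand mean.
    """
    k = len(levels)
    rows = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            rows[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            rows[level] = [-1] * (k - 1)
    return rows


concepts = ["Belief", "Emotion", "Moral-Reasoning", "Reference"]
contrasts = deviation_coding(concepts)
for name, row in contrasts.items():
    print(f"{name:16s} {row}")
```

Under this scheme the omitted level has no coefficient of its own, which is why its deviation has to be derived from the estimates of the other levels.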
3. Results
Preliminary analyses of individual- and demographic-difference variables using ‘Type of School’ as a grouping variable showed groups were matched on NVR, WM, age, and L1-vocabulary. However, there were differences in L2-vocabulary (β = 0.025, SE = 0.002, z = 9.18, p < .001). Children in immersion and bilingual schools had significantly higher vocabulary scores than those in monolingual schools (see Table 3).
Table 3. Age, NVR, WM, and L1- and L2-vocabulary means (SDs) by group at T0

Initial ToM analyses showed children from all three groups scored at or above the 95% ceiling threshold on Desires. Consequently, Desires was excluded from the analyses.Footnote 9 Figure 2 shows the percentage accuracy for each concept by group. Descriptively, it suggests a consistent ordering of difficulty across groups, with children performing best on Desires, followed by Emotion, Belief, and finally Reference and Moral-Reasoning.
Figure 2. Accuracy percentages for each ToM concept by group at T0 (these values are intended for illustrative purposes only and should not be used to draw inferential conclusions).

In the following subsections, we begin with results at T0 to gauge preliminary group differences. This is followed by an analysis of the changes in performance between T0 and T1, once children had completed their first year of PE. We end with the results on attention.
3.1. T0
In our analyses, we modeled Concept-Level-ToM for concepts not at ceiling, using deviation coding for the ‘concept’ factor. This enabled comparisons of each concept’s accuracy relative to the overall mean across all the concepts included. The model resulted in a significant main effect of English-at-school (β = 0.014, SE = 0.004, z = 3.55, p < .001), with greater L2-exposure associated with increased odds of correct responses. Among ToM concepts, Emotion (β = 1.569, SE = 0.170, z = 9.22, p < .001) showed significantly higher accuracy than the grand mean, while Moral-Reasoning (β = −0.563, SE = 0.188, z = −2.99, p = .003) was significantly below it. Belief did not differ significantly from the grand mean (β = −0.111, SE = 0.168, z = −0.66, p = .508). Because deviation coding was used, Reference cannot be directly estimated in the model. However, its value can be derived as the negative sum of the other concept estimates (Reference = −[Belief + Emotion + Moral-Reasoning]). Based on these estimates (β = −0.111, 1.569, and −0.563, respectively), this gives a value of β = −0.895 for Reference, indicating it was the lowest-performing concept overall. There was a significant negative interaction between Emotion and English-at-school (β = −0.018, SE = 0.004, z = −4.33, p < .001), showing that the positive effect of L2-exposure was attenuated for Emotion relative to the overall pattern. No other concept-by-exposure interactions were significant. Additionally, L1-vocabulary (β = 0.240, SE = 0.067, z = 3.56, p < .001) and NVR (β = 0.474, SE = 0.069, z = 6.91, p < .001) were significant predictors of ToM accuracy. English-exposure outside school also contributed positively (β = 0.591, SE = 0.295, z = 2.01, p = .045). Conversely, L2-vocabulary was negatively associated with ToM (β = −0.196, SE = 0.079, z = −2.48, p = .013).
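The arithmetic behind two of the reported values can be checked with a few lines. The sketch recovers the omitted-level estimate from the three reported concept coefficients and converts the English-at-school coefficient to an odds ratio. It assumes a logit link and that the predictor is expressed in unscaled percentage points; neither the variable names nor the ten-point comparison come from the original analyses.

```python
import math

# Reported T0 deviation-coded estimates for the three estimated concepts.
concept_estimates = {"Belief": -0.111, "Emotion": 1.569, "Moral-Reasoning": -0.563}

# Under sum-to-zero coding, the omitted level is minus the sum of the others.
reference_estimate = -sum(concept_estimates.values())

# Reported log-odds change per unit of English-at-school
# (assumed here to be one percentage point).
beta_exposure = 0.014
odds_ratio_per_point = math.exp(beta_exposure)
odds_ratio_per_ten_points = math.exp(10 * beta_exposure)

print(round(reference_estimate, 3))
print(round(odds_ratio_per_point, 4), round(odds_ratio_per_ten_points, 3))
```

Because the model also contains a concept-by-exposure interaction under deviation coding, the main-effect odds ratio is best read as an average across the modeled concepts rather than as the effect for any single one.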
3.2. T0 versus T1
At T1, children across all three groups scored at/above the 95% ceiling threshold on Desires and Emotion, so these concepts were excluded from the main analyses. Figure 3 shows the accuracy percentages for each concept by group.
Figure 3. Accuracy percentages for each ToM concept by group at T1 (these values are intended for illustrative purposes only and should not be used to draw inferential conclusions).

The model fitted to assess progress over time (T0 versus T1) on ToM (Belief, Reference, Moral-Reasoning) showed a significant effect of time (β = 1.001, SE = 0.096, z = 11.37, p < .001), indicating that children’s performance improved overall between T0 and T1. The model also revealed a significant main effect of English-at-school across Belief, Reference, and Moral-Reasoning (β = 0.012, SE = 0.003, z = 4.43, p < .001), indicating that greater English-at-school continued to enhance ToM. Using deviation coding, Concept-Level coefficients reflect how each concept’s accuracy deviates from the overall mean. Accuracy on Belief was significantly higher than the grand mean (β = 0.363, SE = 0.103, z = 3.51, p < .001), while accuracy on Moral-Reasoning did not differ significantly from it (β = −0.032, SE = 0.129, z = −0.24, p = .807). Although Reference, as the omitted level, is not directly estimated, its effect can be derived as the negative of the sum of the other two concept estimates (β ≈ −0.331), showing that performance on Reference was lower than the grand mean. This suggests children performed best on Belief and less well on Reference and Moral-Reasoning. The interaction between concept and time is reflected in the individual concept-time coefficients: Moral-Reasoning showed significantly greater improvement longitudinally than the overall mean (β = 0.575, SE = 0.124, z = 4.65, p < .001), while Belief did not (β = 0.025, SE = 0.113, z = 0.22, p = .826) and Reference still less so. A significant three-way interaction between Moral-Reasoning, English-at-school, and time (β = 0.011, SE = 0.003, z = 4.14, p < .001) suggests that the positive effect of school L2-exposure on Moral-Reasoning became stronger over time. 
Conversely, a negative interaction between English-at-school and time (β = −0.008, SE = 0.002, z = −3.62, p < .001) indicated that while L2-exposure was beneficial overall, this effect slightly decreased between T0 and T1 (see Figure 4). Turning to individual-difference measures, L1-vocabulary (β = 0.207, SE = 0.045, z = 4.58, p < .001) and NVR (β = 0.223, SE = 0.040, z = 5.64, p < .001) were significant predictors of performance on these three cognitive concepts. In contrast, gender was not (p > .05). With respect to L2-vocabulary, although it was no longer negatively associated with ToM, as it had been at T0, it was also not a significant predictor of ToM at the year’s end.
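One way to read the English-at-school by time interaction is to compute the implied exposure slope at each time point. The sketch below assumes that time was treatment-coded (T0 = 0, T1 = 1), which is not stated for this predictor, and it ignores the concept-specific three-way term; the function and variable names are hypothetical.

```python
# Reported longitudinal estimates on the log-odds scale.
BETA_EXPOSURE = 0.012           # English-at-school main effect
BETA_EXPOSURE_BY_TIME = -0.008  # English-at-school x time interaction


def exposure_slope(time_code: int) -> float:
    """Implied slope of English-at-school, assuming time coded 0 (T0) or 1 (T1)."""
    return BETA_EXPOSURE + BETA_EXPOSURE_BY_TIME * time_code


slope_t0 = exposure_slope(0)
slope_t1 = exposure_slope(1)
print(f"T0 slope: {slope_t0:.3f}, T1 slope: {slope_t1:.3f}")
```

Under that assumed coding the slope stays positive at T1 but is smaller than at T0, which matches the description of an exposure advantage that persists while narrowing.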
Figure 4. Interaction effect showing predicted probabilities of correct responses across Belief, Moral-Reasoning, and Reference as a function of English-at-school (%) by time point (T0 versus T1).

The final ToM analyses examined the concepts introduced at T1 (Lies, Sarcasm) by fitting a model to the T1 data for these concepts (see Supplementary Materials, Table S11). English-at-school did not significantly predict performance (β = −0.001, SE = 0.012, z = −0.02, p = .977), and children performed significantly worse on Sarcasm than Lies (β = −4.00, SE = 0.896, z = −4.45, p < .001). Children with higher NVR scores achieved better scores on these items (β = 0.354, SE = 0.121, z = 2.94, p = .003) as did older children (β = 0.679, SE = 0.266, z = 2.55, p = .011). While gender did not yield a significant main effect (p > .05), a significant interaction between concept and gender showed that girls outperformed boys on Sarcasm (β = −0.603, SE = 0.205, z = −3.08, p = .004).
3.3. Attention results
Our secondary question asked whether attention skills (selective, switching, response inhibition) contributed to ToM performance. Thus, we extended our models by including attention measures as predictors at T0 and T1.Footnote 10
First, we examined correlations among attention measures and found no evidence of problematic multicollinearity. The one strong correlation was between the two OppositeWorlds conditions (r = 0.72). This was expected given their shared task structure; however, both were retained because they represent distinct processing demands: the congruent condition requires task execution, while the incongruent one includes switching and inhibition. The full correlation matrix is in the Supplementary Materials (Table S13).
We extended our baseline model to explore the impact of attention measures on ToM at T0 (Desires excluded) and found that only selective attention and switching were significant contributors to ToM scores: children with better SkySearch performance (β = −0.028, SE = 0.010, z = −2.83, p = .004; the negative coefficient is consistent with a time-based score on which lower values indicate better performance) and greater CreatureCounting accuracy (β = 0.073, SE = 0.031, z = 2.37, p = .017) demonstrated superior ToM performance. Consistent with our baseline model, L1-vocabulary (β = 0.173, SE = 0.067, z = 2.56, p = .010), NVR (β = 0.393, SE = 0.069, z = 5.66, p < .001), and English-at-school (β = 0.010, SE = 0.003, z = 2.87, p = .004) remained significant predictors of ToM, and L2-vocabulary (BPVS) was again negatively associated with ToM accuracy (β = −0.205, SE = 0.079, z = −2.60, p = .009).
We then examined progress between T0 and T1. SkySearch remained a significant predictor across the Belief, Moral-Reasoning, and Reference concepts (β = −0.038, SE = 0.007, z = −5.19, p < .001), an effect which strengthened over time (β = 0.033, SE = 0.013, z = 2.44, p = .015) (see Figure 5). Walk/Don’t Walk was a significant positive predictor (β = 0.025, SE = 0.012, z = 2.04, p = .041), suggesting a small but reliable role of response inhibition in ToM performance, and CreatureCounting accuracy showed a positive trend toward significance (β = 0.044, SE = 0.023, z = 1.90, p = .058). However, unlike SkySearch, the interaction of time with CreatureCounting and Walk/Don’t Walk did not reach significance (see Supplementary Materials, Table S9). These findings indicate that at T1, where not just Desires but also Emotion had been excluded from the model due to ceiling effects, selective attention, response inhibition, and switching all contributed toward ToM performance but that this was most marked for selective attention.
Figure 5. Interaction effect showing predicted probabilities of correct responses on ToM as a function of SkySearch performance by time point (T0 versus T1).

Finally, we ran models for the two concepts assessed only at T1 (Lies, Sarcasm). No attention measures reached significance for either of them (p > .05).
4. Discussion
Our investigation asked whether L2-exposure impacted ToM in Spanish children educated but not raised bilingually. Restricting our population in this way, we reduced the effects of a multilingual/multicultural home on ToM (Tiv et al., Reference Tiv, O’Regan and Titone2021), and by including English-at-school as a continuous variable, we established firmer control over the timing and degree of L2-exposure. Children participated at the start and end of their first PE year and included monolinguals, bilinguals, and immersion children. Our foremost aim was to assess the effect of English-at-school on Concept-Level-ToM longitudinally so that we could ascertain whether the effect of English-at-school differed across specific ToM concepts. The second was to examine whether EF predicted ToM beyond L2-exposure, and the third was to monitor individual-difference contributions to ToM (Navarro et al., Reference Navarro, DeLuca and Rossi2022).
To summarize, over time, children improved on all components and the order of difficulty largely reflected that reported in previous literature. English-at-school predicted Concept-Level-ToM at both time points, meaning that children with greater L2-exposure started and finished ahead. However, their initially steep improvement flattened, while children with lower exposure made more gains. With respect to attention, at T0, only selective attention and switching scores predicted Concept-Level-ToM, whereas at T1, all three attention measures did. Yet, the effect was most consistent for selective attention, whose effect also strengthened over time. Neither English-at-school nor any of the three attention measures was linked to Lies and Sarcasm. The key individual-difference predictors for the conflated Reference, Belief, and Moral-Reasoning ToM scores were NVR and L1-vocabulary. Gender was not a predictor for these cognitive ToM tasks. However, girls outperformed boys on Sarcasm, the one conative task. Contrary to expectations, higher L2-vocabulary scores did not predict ToM outcomes at either time point.
4.1. Background variables and ToM-component difficulty
At T0, there were no group differences with respect to age, NVR, WM, or L1-vocabulary, but children with more English-at-school had higher L2-vocabulary. Eighty-five percent of children had attended preschool at their respective school so had already experienced different amounts of English prior to PE, demonstrating that even at preschool, where English activities are less structured, degree of L2-exposure yields visible linguistic consequences.
Our analyses indicated the following ranking of concepts at T0, from easiest to most difficult: Desires>Emotion>Belief>Moral-Reasoning>Reference (see Supplementary Materials, Figures S2 and S3). Emotion, classed as affective, proved easier than all but one of the concepts classed as cognitive, in line with Shamay-Tsoory et al. (Reference Shamay-Tsoory, Shur, Barcai-Goodman, Medlovich, Harari and Levkovitz2007), who reported better performance on affective than cognitive tasks, albeit on tasks designed for older children and adults. Our finding that Desires was easier than Belief, which was easier than Reference, accords with the proposed order of difficulty described in Wellman and Liu (Reference Wellman and Liu2004) and Peterson and Wellman (Reference Peterson and Wellman2019), although our finding that Emotion was easier than Belief is at odds with the latter study’s findings for typical monolingual children and deaf children born to hearing parents. This contrast should be interpreted cautiously, however, and in relation to the different levels of difficulty between these test batteries.
The ranking found at T0 remained at T1, although the gap between Moral-Reasoning and Reference increased. The two new concepts were interspersed on this scale. Lies achieved near-ceiling scores, and Sarcasm, although more difficult than Belief, was understood more easily than Reference, which remained the most challenging. Difficulties with Sarcasm were expected given its conative classification (Dennis et al., Reference Dennis, Simic, Bigler, Abildskov, Agostino, Taylor, Rubin, Vannatta, Gerhardt, Stancin and Yeates2013), and the high scores on Lies can be explained by the particular requirements of this task. Children only had to distinguish a fact from a false state-of-affairs, not infer an intention behind the deception, likening the task to an easier subcomponent of FB (Astington & Gopnik, Reference Astington and Gopnik1991). Their excellent performance makes sense given their age at T1. Sarcasm, however, where children needed to understand that a protagonist was attempting to exert influence over another and that an intended meaning of a statement could conflict with the literal meanings of the words within it, proved one of the most difficult, as per Peterson et al. (Reference Peterson, Wellman and Slaughter2012). With respect to Reference, we think a particular conceptualization of Reference in one vignette accounts for this category’s overall scores. Whereas for most items there was a straightforward spatial relationship to be navigated, one item incorporated an implicit size comparison, leading children to focus on an incorrect reference point from the outset. Their mistakes, therefore, were less about Reference than about how to draw comparisons between sizes when key contextual clues are missed (see ‘Task difficulty’ in Supplementary Materials). With these items – 2 of 6 – omitted, the children’s score improved substantially, suggesting a better understanding of Reference than initially illustrated, placing it in line with Sarcasm.
4.2. English at school and ToM
The fact that English-at-school exhibited an effect at both time points suggests that the spur to ToM provided by educational bilingualism takes effect quickly. Our results accord with Agostini et al. (Reference Agostini, Apperly and Krott2025), whose L2-learners, after 1 year of immersion, also outperformed monolinguals on perspective taking. Relatedly, Listanti et al. (Reference Listanti, Torregrossa, Eisenbeiß and Bongartz2023) found that even by increasing reading in children’s heritage language, first- and second-order FB improved. Taken together, these results suggest that the L2-threshold for influence on one aspect of ToM can be quite low if children are not below 4 years (Cheung et al., Reference Cheung, Mak, Luo and Xiao2010). Practically, our results suggest that at this early stage, it is the bilingual experience itself that has positive consequences for ToM rather than a prescribed level of L2-attainment, meaning all children could profit from early access to an L2. Further support for this proposition is that children’s L2-vocabulary scores did not predict ToM at either time point. The lack of an L2-vocabulary effect is consistent with the cognitive exercise and/or socio-cultural experience of bilingualism being the relevant ToM-boosting conduits as opposed to L2-fluency. Also of note is the negative interaction between English-at-school and time, which indicates that although children with greater L2-exposure continued to score higher than those with less exposure at T1, the magnitude of the advantage diminished. Such a pattern suggests that it was the lower-exposure children who made more gains during this period. Continued tracking will indicate if this reflects an attenuation of the bilingual children’s progress or whether further boosts in the bilinguals’ performance materialize over time. 
The results are all the more interesting given current debates with respect to what causes the bilingualism effect, in particular, Paap (Reference Paap and Schwieter2019), who suggests that the initial spur to EF provided by bilingualism is short-lived and due to the sudden increase in intentional cognitive effort at the incipient stage of (language) learning. Given the purported partial overlap in requirements underlying cognitive-ToM and attention tasks (Wellman et al., Reference Wellman, Cross and Watson2001), one might expect L2-exposure to impact cognitive-ToM and EF progression similarly. However, Chamorro et al. (Reference Chamorro, de la Viña and Janke2025), who examined attention in the same children reported here, found that the positive effects of L2-exposure on selective attention, switching, and response inhibition increased longitudinally. Continued monitoring of these children’s trajectory with respect to both EF and all three ToM components will clarify which, if any, are similarly affected by L2-exposure. Returning to the contribution of English-at-school on ToM in the current study (which excluded Desires at T0 and Desires and Emotion at T1), an L2-exposure effect was present at both time points on the remaining cognitive tasks. This is consistent with Yu et al. (Reference Yu, Kovelman and Wellman2021), who reported a bilingual advantage for ToM in 16 of the 21 studies reviewed, where the majority used FB, a quintessentially cognitive task.
An interesting question is why English-at-school interacted negatively only with Emotion. This task, classed as affective, would draw on competencies beyond those required for purely cognitive tasks, so skills honed by bilingualism that influence cognitive tasks might not generalize to affective ones (Kalbe et al., Reference Kalbe, Schlegel, Sack, Nowak, Dafotakis, Bangard, Shah, Fink and Kessler2010; Yott & Poulin-Dubois, Reference Yott and Poulin-Dubois2016). Indeed, studies on adults with clinical profiles have found behavioral and neurological dissociations between affective- and cognitive-ToM tasks (Healey & Grossman, Reference Healey and Grossman2018; Shamay-Tsoory et al., Reference Shamay-Tsoory, Shur, Barcai-Goodman, Medlovich, Harari and Levkovitz2007). Recall also Cassetta et al. (Reference Cassetta, Pexman and Goghari2018), who reported that inhibitory control and switching predicted performance on cognitive-ToM tasks but not affective ones. It might be that alternative consequences of bilingualism, such as increased sociocultural awareness – a measure we did not include – are more pertinent for affective aspects of ToM development (Cheung et al., Reference Cheung, Mak, Luo and Xiao2010). If true, we would not necessarily expect our monolingually raised bilinguals to exhibit an early advantage for Emotion since their access to the broader cultural experience that might accelerate sociocultural awareness was reduced (Han & Lee, Reference Han and Lee2013; Tiv et al., Reference Tiv, O’Regan and Titone2021). Further monitoring over time with more challenging Emotion tasks would tell.
English-at-school did not predict Lies/Sarcasm scores either. For Lies, all children achieved over 90% accuracy, so the lack of effect is likely due to this near-ceiling performance. However, Sarcasm performance might relate to the further social and empathetic components inherent to this conative task, depending as it does on engagement with social norms and feelings. Children had to interpret superficially contradictory statements, calling upon sophisticated social reasoning skills to forge the link between seemingly antithetical displays of behavior and protagonists’ intended meanings. Enhancement of such social reasoning skills might likewise not materialize in children not brought up multiculturally; further monitoring over time will tell.
4.3. Attention and ToM
The influence of attention grew with time. Whereas at T0, only selective attention and switching predicted ToM, at T1, selective attention, switching, and inhibition did. However, it was only selective attention whose influence strengthened over time.
Selective attention requires participants to attend to relevant stimuli and to ignore distractors. This process is analogous to that needed to choose between correct and incorrect beliefs/visual perspectives and between physical appearances and reality. In all these cognitive tasks, interfering or distracting information must be suppressed in favor of less salient yet correct information. On this basis, one might expect that if L2-exposure boosts selective attention (Adesope et al., Reference Adesope, Lavin, Thompson and Ungerleider2010), the influence of selective attention would be similarly visible on the prototypically cognitive tasks as observed here.
Aspects of switching were also significant for ToM, and the links between switching and ToM per se tie in with previous literature. The cognitive action of language switching is arguably akin to the practice of adapting quickly to changing task requirements (as in DCCS, CreatureCounting, and OppositeWorlds), a similarity which would account for the frequently found advantage for bilinguals on cognitive flexibility (Bialystok, Reference Bialystok2010). What all ToM tasks share is that participants must alternate between different viewpoints, be it a belief, a feeling, or a visual perspective, which points to an overlap between this cognitive action and that underpinning switching tasks. The correspondence we found, therefore, between switching and ToM in bilinguals might be anticipated: if early bilinguals surpass monolinguals on switching, this should translate to some of the mental operations engaged during ToM tasks (Bialystok & Martin, Reference Bialystok and Martin2004; Prior & MacWhinney, Reference Prior and Macwhinney2010). Indeed, Austin et al. (Reference Austin, Groppe and Elsner2014) found that switching predicted ToM in monolingual children, and Buac and Kaushanskaya (Reference Buac and Kaushanskaya2020) reported it as integral to bilingual/immersion children’s FB scores (see also Bialystok & Viswanathan, Reference Bialystok and Viswanathan2009; Carlson & Meltzoff, Reference Carlson and Meltzoff2008). Where our results offer pause for thought, however, is that educational bilingualism does not provide the same opportunities for language switching that sequential or simultaneous bilinguals enjoy, and it is such an environment that has been argued to hone this skill (Bialystok, Reference Bialystok2010). The current bilingual children exhibited heightened switching skills, as reported in Chamorro et al. (Reference Chamorro, de la Viña and Janke2025), in conjunction with a positive association between switching and ToM, despite not having the environment most conducive to its enhancement.
Our results for response inhibition are not entirely in sync with what one might expect if inhibition abilities underpin ToM because this variable only exerted an influence at T1. The positive impact of bilingualism on inhibition is widely supported (Barac et al., Reference Barac, Bialystok, Castro and Sanchez2014; Bialystok & Martin, Reference Bialystok and Martin2004; but see also Bialystok & Viswanathan, Reference Bialystok and Viswanathan2009; Carlson & Meltzoff, Reference Carlson and Meltzoff2008; Duñabeitia et al., Reference Duñabeitia, Hernández, Antón, Macizo, Estévez, Fuentes and Carreiras2014), and indeed Chamorro et al. (Reference Chamorro, de la Viña and Janke2025) found enhanced inhibition in these very bilinguals. Despite this, inhibition’s contribution to the children’s ToM scores at this incipient stage was less central. Recall, too, that Huang et al. (Reference Huang, Baker and Wang2023) reported no relation between inhibition and ToM scores, unlike switching. Still more recently, Agostini et al. (Reference Agostini, Apperly and Krott2024) reported that inhibition had not been integral to a bilingual advantage they found for referential perspective taking. These discrepant results motivate further exploration into why response inhibition showed a weaker association with ToM at this stage than did switching and selective attention.
4.4. Individual differences and ToM
Of all the covariates, the three that made notable contributions were NVR, L1-vocabulary, and gender. L2-vocabulary, however, was conspicuous for its absence. With respect to gender, this variable was of relevance to the Sarcasm task, where girls outperformed boys. Performance on this conative task, which calls on social reasoning and empathic skills for its successful execution, was also unaffected by any of the attention measures or English-at-school. In this respect, it behaved similarly to Emotion, which interacted negatively with English-at-school. Thus, results for the two tasks that are not purely cognitive diverged from those that are: proportion of English-at-school was not relevant to performance at this point of exposure. A future study might explore if this dichotomy between cognitive tasks on the one hand and conative on the other remains over time and extends to a broader range of bilingual contexts.
With respect to L1-vocabulary, our results support the monolingual literature, which has shown that strong vocabulary skills in monolinguals predict ToM (de Villiers & de Villiers, Reference de Villiers and de Villiers2014). Our tasks included extensive explanatory aspects, where children justified decisions. Better justification of decisions is linked to strong language skills (Białecka et al., Reference Białecka, Wodniecka, Muszyńska, Szpak and Haman2024), so the relevance of L1-vocabulary we found throughout makes sense conceptually. The lack of a relation between L2-vocabulary and ToM, however, is unexpected, and contrasts with Chamorro et al. (Reference Chamorro, de la Viña and Janke2025), where this variable predicted performance on selective attention, switching, and inhibition. The apparent independence of ToM from L2-vocabulary at this stage is something to be monitored over time, as it also relates to the question of whether the same factors that promote EF also impact ToM. Lastly, the bearing of NVR on every ToM component indicates the role played by fluid intelligence in these tasks, all of which encompass cognition.
5. Conclusions
Our study has shown that in educational bilingualism, the impact of L2-exposure on ToM is visible immediately and not dependent on L2-vocabulary attainment. This means educators need not worry that these mentalizing benefits are restricted to children who can reach a certain L2-threshold. Aspects of attention also predicted ToM but not uniformly so. The influence of selective attention was the most consistent and its effects strengthened over time. We have argued that the characteristics underlying cognitive aspects of ToM tasks map closely to those underlying attention tasks, already shown by many to be enhanced in bilinguals. However, Emotion, an affective task, and Sarcasm, a conative one, stood out in that they were not affected by L2-exposure and girls did better than boys on Sarcasm. Further work on these components, with a harder Emotion task than the current one, could examine in more depth whether and how L2-exposure and gender interact with cognitive-, affective-, and conative-ToM distinctions. We have also suggested that the qualitatively different experience of educational bilingualism, where children’s exposure to multilingual/cultural experiences is reduced, might affect when and if bilingual gains on affective and conative skills surface. The fact that girls surpassed boys on conative tasks but not cognitive ones, together with the lack of any large systematic gender divide on attention tasks (Grissom & Reyes, Reference Grissom and Reyes2019), is a further reason to conclude that although ToM and EF share some underpinnings, they should not be conflated. In particular, the effects of the characteristics of ToM components that incorporate sentiments and heightened social awareness are unearthed by using concept-level rather than composite scoring. Finally, the consistent influence of NVR and L1-vocabulary on performance across the board reminds us of how integral these variables are to ToM development, irrespective of bilingual exposure.
Supplementary Material
The supplementary material for this article can be found at http://doi.org/10.1017/S1366728926101229.
Data availability statement
The R script used for the analyses is available via the embedded link. The data set forms part of an ongoing longitudinal project. As data collection is still underway, we are unable to share the data set at this time to protect participant confidentiality and preserve the integrity of future analyses. The data set will be made available upon completion of the study.
Acknowledgments
This project was funded by the Comunidad de Madrid (Atracción de Talento Investigador, T1/HUM-19952). We would also like to thank the participating schools, parents, and children.
Competing interests
The authors declare no competing interests.


