The foreign language effect in the illusion of causality: Evidence of absence

Stefano Dalla Bona; Eduardo Navarrete; Michele Vicovaro

doi:10.1017/S1366728926101412

Published online by Cambridge University Press: 13 May 2026

Stefano Dalla Bona: Department of General Psychology, University of Padua, Padua, Italy
Eduardo Navarrete: Department of Developmental Psychology and Socialisation, University of Padua, Padua, Italy
Michele Vicovaro*: Department of General Psychology, University of Padua, Padua, Italy
*Corresponding author: Michele Vicovaro; Email: michele.vicovaro@unipd.it

Abstract

Recent research suggests that a cognitive bias, the illusion of causality, can be attenuated when the task is presented in a foreign language (Díaz-Lago & Matute, 2019a, Quarterly Journal of Experimental Psychology, 72(1), 41–51), supporting the well-known foreign language effect on decision making and reasoning. We conducted a replication study with a large sample (N = 220), determined through a Bayes factor design analysis, but our results did not support the original findings. This finding challenges the generalizability of the foreign language effect on reducing cognitive biases. Additionally, we found that the magnitude of the illusion decreased with increasing years of formal education and was generally weaker among male participants compared to females. These findings emphasize the importance of using samples with balanced demographic characteristics to avoid potential confounds in between-group comparisons. Overall, our study highlights the need for further research to clarify the conditions under which the foreign language effect can influence cognitive biases.

Keywords

causal learning; illusion of causality; foreign language effect; Bayes factor design analysis; reproducibility crisis

Information

Type: Research Article
Information: Bilingualism: Language and Cognition, First View, pp. 1–10

DOI: https://doi.org/10.1017/S1366728926101412
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press or the rights holder(s) must be obtained prior to any commercial use.
Open Practices: Open data; Open materials; Preregistered
Copyright: © The Author(s), 2026. Published by Cambridge University Press

Highlights

• Replication of the foreign language effect on the illusion of causality;
• Bayes factor design analysis to guide the replication effort;
• Illusion of causality not reduced by the unbalanced second language;
• Demographic features mediate the illusion of causality;
• More research needed on foreign language’s impact on cognitive biases.

1. Introduction

1.1. The illusion of causality

From a psychological perspective, causality is naively understood as a relationship between two subsequent events, wherein the latter is regarded as the direct consequence of the former. The ability to comprehend causal relationships is crucial for human survival and represents a capacity in which humans typically outperform other species (Bender, 2020). The mechanisms underlying our perception and detection of causal connections have been extensively studied across various fields of psychology, including visual perception (e.g., Michotte, 2017), reasoning (e.g., Waldmann et al., 2006) and learning (e.g., Dickinson et al., 1984) – highlighting the centrality of causal interpretation as a recurring theme in psychological research. Within the psychology of learning, causality has often been examined from an associative perspective, wherein individuals infer causal relationships based on the repeated co-occurrence of cause and effect events (Wasserman, 1990).

Within this learning framework of how individuals intuitively form causal representations, the illusion of causality emerges as a well-documented cognitive bias (see Moreno Fernández et al., 2023, for a comprehensive and updated summary). This illusion refers to the belief that two events, A and B, are causally related (i.e., A causes B) when no actual causal connection exists. In situations where individuals act as passive observers of multiple co-occurrences between two events, they may overestimate the extent to which these events are causally related, leading to erroneous causal inferences (see also Matute et al., 2022).

The illusion of causality is commonly investigated using the Contingency Learning Task (CLT) paradigm (Moreno Fernández et al., 2023), in which participants observe a sequence of trials, each characterized by the presence or absence of a potential cause (Event A) and an effect (Event B). The CLT structure yields four possible scenarios: (a) both the potential cause and the effect are present, (b) only the potential cause is present, (c) only the effect is present and (d) neither the potential cause nor the effect is present. Each trial corresponds to one of these scenarios, and participants passively observe a randomized sequence of trials, with the experimenter controlling the frequency of each type. Upon completion of all trials, participants are typically asked to evaluate the perceived strength of the causal relationship between the potential cause and the effect on a scale from 0 to 100 (however, recent literature has suggested that the use of a unidirectional 0–100 scale may be theoretically problematic; see Ng et al., 2024).

From a statistical viewpoint, one of the most widely used indices to measure contingency in this context is the ΔP (Perales & Shanks, 2007), an index related to the chi-square (χ²) statistic (Allan, 1980). The ΔP is calculated by subtracting the probability of observing the outcome in the absence of the potential cause, denoted as $P(B_1 \mid A_0)$, from the probability of observing the outcome in the presence of the cause, denoted as $P(B_1 \mid A_1)$ (Jenkins & Ward, 1965):

$$ \Delta P = P(B_1 \mid A_1) - P(B_1 \mid A_0) \tag{1} $$

A large and positive $\Delta P$ value indicates a strong causal effect of A on B, whereas a value of zero signifies no statistical support for the contingency between the potential cause and the effect. This index can serve as a normative benchmark for evaluating human causal learning (Matute et al., 2022; Perales & Shanks, 2007). Theoretically, the overestimation of causal strength, which is the illusion of causality, can occur independently of the actual $\Delta P$ value. However, empirical studies have primarily documented this phenomenon in null contingency conditions ($\Delta P = 0$), where no causal relationship is present (Allan & Jenkins, 1980).
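As a minimal illustration of the ΔP index, the following Python sketch builds a randomized trial sequence from the four trial types and computes ΔP by counting. The function names and the trial frequencies are hypothetical and chosen only to show a null contingency with a high proportion of outcome-present trials; they are not taken from the study's materials.

```python
import random
from collections import Counter

# Trial types in a contingency learning task:
# "a" = cause and effect present, "b" = cause only,
# "c" = effect only, "d" = neither present.

def make_trials(freqs, seed=None):
    """Return a shuffled list of trial labels from a dict of frequencies."""
    trials = [label for label, n in freqs.items() for _ in range(n)]
    random.Random(seed).shuffle(trials)
    return trials

def delta_p(trials):
    """Delta P = P(B1 | A1) - P(B1 | A0), computed from trial labels.

    Assumes at least one cause-present and one cause-absent trial.
    """
    n = Counter(trials)
    p_effect_given_cause = n["a"] / (n["a"] + n["b"])
    p_effect_given_no_cause = n["c"] / (n["c"] + n["d"])
    return p_effect_given_cause - p_effect_given_no_cause

def outcome_density(trials):
    """Proportion of trials in which the effect occurs."""
    n = Counter(trials)
    return (n["a"] + n["c"]) / len(trials)

# Hypothetical frequencies (an assumption for illustration only).
null_high_density = {"a": 15, "b": 5, "c": 15, "d": 5}
trials = make_trials(null_high_density, seed=1)
print(delta_p(trials))          # 0.0
print(outcome_density(trials))  # 0.75
```

With these illustrative counts the effect occurs on 75% of trials whether or not the cause is present, so the normative contingency is zero even though the outcome is frequent.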

This study specifically focuses on the illusion of causality as it arises from a particular distribution of trial frequencies, known as the outcome-density bias (Matute et al., 2015). The bias emerges when the frequency of trials in which the outcome occurs, namely, scenarios where the cause and the outcome are present and scenarios where the outcome is present without the cause, exceeds the frequency of trials in which the outcome does not occur. Despite the $\Delta P$ value remaining at zero, people overestimate the extent of the causal relationship (Alloy & Abramson, 1979), with average judgements often near 50 on a 0–100 Likert scale, indicating a tendency to perceive a causal link when none exists (e.g., Dalla Bona & Vicovaro, 2024; Díaz-Lago & Matute, 2019a, 2019b).

1.2. The foreign language effect

Recent research has increasingly explored whether the illusion of causality can be reduced (Matute et al., 2015). Investigating the conditions under which this illusion diminishes not only serves the practical purpose of informing public education strategies to counteract such biases but also contributes to a deeper understanding of the underlying cognitive processes. In this context, Díaz-Lago and Matute (2019a) found that presenting information in a foreign language can reduce the magnitude of the illusion.

The foreign language effect (FLE) was first described by Keysar et al. (2012), who showed that participants performing a decision-making task in a foreign language (FL) exhibited less biased responses than those completing the same task in their native language (NL). Over the years, the FLE has been replicated across a range of paradigms, including loss-aversion tasks, decision making and moral dilemmas (Circi et al., 2021; Del Maschio et al., 2022). For instance, Costa et al. (2014a) replicated and extended Keysar et al.’s (2012) findings, providing further evidence that reasoning in a FL reduces susceptibility to cognitive biases. In the domain of moral judgement, an area where the FLE has been extensively studied, the use of a FL appears to promote more utilitarian responses (e.g., Costa et al., 2014b). Nonetheless, recent literature on the FLE has shown that this phenomenon does not consistently arise (Hu et al., 2026), and several explanations for this lack of replicability have been proposed, such as differences in task characteristics and the level of emotional engagement elicited by the task (Hayakawa et al., 2019), as well as cultural influences and linguistic similarity between the languages involved (Dylman & Champoux-Larsson, 2020).

The reduction of the illusion of causality through the use of a FL, as shown by Díaz-Lago and Matute (2019a), represents an interesting preliminary finding that may offer valuable insights into the mechanisms underlying this cognitive bias. In a series of two consecutive experiments, the authors observed that participants performing a CLT in a FL exhibited a diminished illusion of causality compared to those using their NL. Although several tentative and post hoc explanations were proposed and discussed, the results were primarily interpreted through the lens of cognitive disfluency. Specifically, the increased difficulty associated with processing information in a non-native language was posited to induce more deliberate and analytical thinking, thereby promoting more normative and less biased judgements.

The mediating role of cognitive disfluency in the illusion of causality was also supported by the results of another study by Díaz-Lago and Matute (2019b), in which it was found that presenting the CLT in a hard-to-read font reduced the magnitude of the illusion. However, Dalla Bona and Vicovaro (2024) recently failed to replicate and extend this finding, challenging the notion that increased task difficulty alone reduces the illusion of causality. Moreover, in some tasks, operating in a FL may consume mental resources, thereby impairing performance due to increased cognitive load (Muda et al., 2023).

These conflicting findings underscore the need for further empirical investigation into the conditions under which the use of a FL may influence causal reasoning. In particular, it remains unclear whether the observed reduction in the causal illusion is primarily driven by processing fluency, as proposed by Díaz-Lago and Matute (2019b), or by alternative mechanisms such as emotional distancing or the context in which the FL was acquired – explanations commonly discussed in the FLE literature (Circi et al., 2021).

The processing fluency hypothesis contends that tasks conducted in a FL are often perceived as more cognitively demanding, which in turn fosters more deliberate and analytical processing, ultimately leading to more normative responses (Oppenheimer, 2008). This mechanistic explanation posits that completing the CLT in a less proficient FL increases perceived task difficulty, thereby eliciting greater cognitive engagement and reducing susceptibility to incorrect causal inferences.

The emotional distance hypothesis posits that FL induces emotional detachment, potentially making decision making more adherent to normative rules and less biased (Del Maschio et al., 2022). In studies on the illusion of causality, a commonly used CLT is the so-called Allergy Task, in which participants assess the causal relationship between taking a medicine and recovering from a disease. This task was used by Díaz-Lago and Matute (2019a) and is also employed in this study. According to the emotional distance hypothesis, the Allergy Task could be less emotionally salient when presented in a FL than in the NL. Although Díaz-Lago and Matute (2019a) described the task as relatively emotionally neutral, other findings suggest that modifying its cover story – thus altering its motivational framing – can significantly influence the magnitude of the illusion (Matute et al., 2022). This indicates that the task is not entirely devoid of emotional content. Given its resemblance to real-life health scenarios, the cover story may elicit affective responses that are attenuated in a FL context. For completeness, it should also be noted that emotional responses may, in some cases, be elicited by task feedback, as suggested by Gao et al. (2015). However, in this study, feedback during the trial phase merely indicated whether the effect was present or absent and did not include emotionally valenced cues such as those used in Gao et al. (2015); therefore, we believe the trial phase is unlikely to trigger substantial emotional engagement.

The context of acquisition hypothesis posits that the conditions under which the NL and FL were acquired influence information processing and judgement (Geipel et al., 2015). In the case of the Allergy Task, its health-related context may activate socio-cultural knowledge and normative expectations about the efficacy of medicines. As these beliefs are typically acquired in the NL, presenting the task in the NL may facilitate access to such norms, potentially biasing participants towards perceiving a causal link between taking a medicine and recovery, even before the associative trials begin. In contrast, presenting the task in a FL may reduce the salience of these expectations, promoting a more objective evaluation of causal contingency.

Another possible explanation of the FLE in CLTs refers to language-dependent memory encoding, by which semantic associations would be generally weaker in a FL than in the NL (Kroll & Stewart, 1994). Participants performing the task in a FL may form weaker associative links during the learning phase, resulting in slower and less robust acquisition of contingencies (Ezrina, 2023). These weaker associations may also decay more rapidly, potentially reducing the perceived frequency of co-occurrence between the potential cause and the outcome, and thus lowering the estimated strength of the causal relationship.

Given these multiple and unaddressed explanatory accounts, we sought to replicate the original study by Díaz-Lago and Matute (2019a) while also adopting an exploratory design aimed at disentangling the underlying mechanisms of the FLE. We believe this hybrid structure strikes an optimal balance between replication and exploration. It enhances the informativeness of the experimental effort regardless of the outcome: if the original finding fails to replicate, the study still provides evidence against its generalizability; if it does replicate (fully or partially), the additional measures offer deeper insights into the psychological processes involved. This approach aligns with current best practices in psychological science where replication is increasingly valued not only for confirming previous findings but also for refining theoretical understanding (McShane et al., 2019).

2. Methods

2.1. Tools for transparency

The study was pre-registered on the Open Science Framework (OSF) at: https://doi.org/10.17605/OSF.IO/PZX9N. The pre-registration includes detailed documentation of the experimental plan, including the study design, hypotheses, exclusion criteria, results of the design analyses, list of planned variables to be measured, analysis plan and the structure of the experimental procedure, along with supporting files in various formats. All study materials, including the experiment script, questionnaires, raw data, R pipeline for data extraction, statistical analysis scripts, design analysis script, flow diagram of the procedure, back-translation results and other supplementary materials, are openly available in the associated OSF project at: https://doi.org/10.17605/OSF.IO/HVGKX.

2.2. Design analysis for the confirmatory hypothesis

Díaz-Lago and Matute (2019a) conducted two experiments to test whether performing a CLT (the Allergy Task) in a FL reduces the illusion of causality. In Experiment 1, 36 participants were assigned to a null contingency outcome-density condition, with 20 participants completing the task in a FL and 16 participants in the NL. In Experiment 2 (N = 80), 40 participants were assigned to the same null contingency condition (20 FL and 20 NL) and 40 to a positive $\Delta P$ condition (20 FL and 20 NL). Across both experiments, the illusion of causality was reduced when the CLT was completed in a FL compared to the NL.

To inform the design of our replication, we re-analysed the publicly available dataset from Díaz-Lago and Matute (2019a), hosted on OSF (https://osf.io/fc9nx/). Specifically, we re-calculated the effect sizes between the NL and FL groups under the null contingency outcome-density conditions. Experiment 1 showed a very large effect size (Cohen’s d = 1.4), while Experiment 2 showed a large effect size (d = 0.8).

We chose to replicate Experiment 2, as its sample characteristics most closely aligned with those of our intended participant pool. Notably, Experiment 1 involved English-speaking Erasmus students who acquired Spanish relatively late in life (Age of Acquisition; AoA: M = 12.61 years, SD = 4.08), whereas Experiment 2 featured Spanish speakers who had learned English at an earlier age (AoA: M = 6.58 years, SD = 4.32). Our sample consisted of native Italian speakers who learned English as a second language in formal educational settings, making Experiment 2 a more appropriate comparison. Unlike the original study, we focused on a two-group comparison (NL vs. FL) within a null contingency condition, omitting the positive contingency condition. This trade-off allowed us to concentrate on the main hypothesis concerning the reduction of the illusion of causality, which occurs in the null contingency condition. The positive contingency condition typically serves only as a control to ensure that any NL–FL differences are specific to the null contingency condition and, thus, to the illusion of causality.

Although the effect sizes observed in Díaz-Lago and Matute (2019a) are large, they contrast with more modest average effect sizes reported in recent meta-analyses of the FLE, which tend to settle around d ≈ 0.2 (e.g., Circi et al., 2021; Del Maschio et al., 2022). However, these meta-analyses predominantly focus on moral dilemma paradigms, which involve distinct decision-making processes from those involved in associative learning tasks. Additionally, response formats differ across paradigms: moral dilemma studies often involve categorical or binary responses, whereas our CLT uses continuous judgement measures (i.e., causal ratings on a 0–100 Likert scale).

Our estimate of the expected effect size was derived using a heuristic approach, detailed further in our pre-registration (https://doi.org/10.17605/OSF.IO/PZX9N). To determine the necessary sample size for our confirmatory analysis, we employed a simulation-based method grounded in distributional analysis of causal judgement data from Dalla Bona and Vicovaro (2024). Like the current study, their experiment was conducted (i) online and (ii) using the PsychoPy software (Peirce et al., 2019). We focused on the group exposed to a null contingency condition with outcome-density bias, without perceptual manipulations. All the participants in that study performed the CLT in Italian, their NL.

Our first step was to identify the most plausible generative distribution underlying the causal judgement data from Dalla Bona and Vicovaro (2024). Using a Cullen and Frey graph (Frey & Cullen, 1995), we determined that judgements followed a truncated normal distribution bounded between 0 and 100. Building on this consideration, we simulated a series of paired distributions for the NL and FL groups, calculating Cohen’s d for each simulated pair. The simulations were parameterized along four key dimensions: (i) simulated values ranged between 0 and 100, using only integers to reflect the discrete nature of Likert-scale responses; (ii) based on the data distribution observed in Dalla Bona and Vicovaro (2024), for the NL group, the mean of the truncated normal distribution was set to vary from 55 to 65, with a fixed SD of 20; (iii) for the FL group, the SD was assumed to be slightly higher (20–25), in line with findings by Díaz-Lago and Matute (2019a); (iv) finally, we defined meaningful differences in means between the NL and the FL groups based on observed response anchoring: in Dalla Bona and Vicovaro (2024), 50% of participants anchored responses on values divisible by five. Thus, we hypothesized that a minimum meaningful shift in means would be of five points (i.e., roughly conceptualizing a shift from one ‘anchor’ to the next one), which corresponds to a small effect size (d ≈ 0.2). Conversely, the maximum plausible shift was set at 20 points (four ‘anchor points’, d ≈ 0.9), which would closely align with the effect reported by Díaz-Lago and Matute (2019a). The simulation yielded a distribution of Cohen’s d values which defined our prior distribution for the expected effect size, ranging from a minimal effect (d ≈ 0.2) to a large effect (d ≈ 0.9). This prior approximates a uniform distribution bounded between 0.2 and 0.9.
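The simulation logic described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' R script (which is available in the OSF project): the function names are hypothetical, truncation is implemented by rejection sampling on rounded values, Cohen's d is computed with a pooled standard deviation, and the FL standard deviation is fixed at 22.5 as a midpoint of the 20–25 range.

```python
import random
import statistics

def truncated_int_normal(mean, sd, n, rng, low=0, high=100):
    """Draw n integer ratings from a normal distribution restricted to
    [low, high], using rejection sampling on the rounded draws."""
    out = []
    while len(out) < n:
        x = round(rng.gauss(mean, sd))
        if low <= x <= high:
            out.append(x)
    return out

def cohens_d(x, y):
    """Cohen's d (x minus y) with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_var ** 0.5

def simulated_d(nl_mean, shift, nl_sd=20.0, fl_sd=22.5, n=20000, seed=0):
    """Effect size for one NL/FL pair: the FL mean is the NL mean minus shift.

    The large n is an assumption made here to get a stable estimate of d.
    """
    rng = random.Random(seed)
    nl = truncated_int_normal(nl_mean, nl_sd, n, rng)
    fl = truncated_int_normal(nl_mean - shift, fl_sd, n, rng)
    return cohens_d(nl, fl)

if __name__ == "__main__":
    # Smallest and largest shifts considered in the text (5 and 20 points).
    for shift in (5, 20):
        print(shift, round(simulated_d(60, shift), 2))
```

Under these assumptions, a five-point shift produces a small standardized effect and a 20-point shift a large one, in line with the range of d values used to define the prior.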

We used this simulated prior in a Bayes factor design analysis (BFDA, see Figure 1), a method used to estimate sample size requirements for Bayes factor (BF) hypothesis testing (Schönbrodt & Wagenmakers, 2018; Stefan et al., 2019). We used the same distribution to define both the design prior and analysis prior for the standardized effect size d under the alternative hypothesis, in accordance with suggestions that both these priors should be aligned with the researchers’ theoretical expectations (Stefan et al., 2019).¹ The null hypothesis was operationalized as a statistical model assuming no differences between groups. Based on 5,000 simulations for each sample size, both under H0 and H1, the BFDA indicated that a sample size of 110 participants per group yields a statistical power of .80 (Figure 1, left panel), while keeping the false positive rate below .05 (Figure 1, right panel).

Figure 1.

Bayes factor design analysis (BFDA) results. The left panel shows the results of BFDA simulations under the alternative hypothesis (H1), with 5,000 simulations per sample size, defined as the number of participants per group. The proportions of BF10 values within specific ranges (indicated by different colours; see the legend) are shown as a function of sample size. Power (PWR) is defined as the proportion of simulations yielding compelling evidence in the expected direction (i.e., BF10 ≥ 3, dark blue bars). A PWR of .80 is achieved with a sample size of 110 participants per group (horizontal dashed white line). The right panel presents the BFDA simulations under the null hypothesis (H0), also with 5,000 simulations per sample size, and reports the proportions of BF01 values. With N = 110 per group, H0 is convincingly supported in more than 95% of simulations (blue bars), and the false positive rate (FPR), defined as the proportion of simulations yielding evidence for H1 when H0 is true, remains below .05 (horizontal dashed white line). Graphics were created in R (R Core Team, 2024), using the ggplot2 package (Wickham, 2016).
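To make the logic of a fixed-n BFDA concrete, the Python sketch below estimates power and false positive rate by simulation. It is a simplified stand-in, not the authors' procedure: the Bayes factor is a large-sample approximation that treats the observed d as normally distributed around the true effect with standard error √(2/n), the prior under H1 is taken as Uniform(0.2, 0.9), and all function names, the unit-variance normal data and the number of simulations are assumptions made for illustration. Its output is therefore not expected to match the reported values exactly.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for its CDF

def cohens_d(x, y):
    """Cohen's d (x minus y) with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss_x = sum((v - mx) ** 2 for v in x)
    ss_y = sum((v - my) ** 2 for v in y)
    pooled_sd = ((ss_x + ss_y) / (nx + ny - 2)) ** 0.5
    return (mx - my) / pooled_sd

def approx_bf10(d_obs, n_per_group, lo=0.2, hi=0.9):
    """Approximate BF10 for H1: delta ~ Uniform(lo, hi) vs H0: delta = 0.

    Assumption: d_obs ~ Normal(delta, se) with se = sqrt(2 / n_per_group).
    """
    se = (2.0 / n_per_group) ** 0.5
    # Marginal likelihood under H1: normal density averaged over the prior.
    m1 = (STD_NORMAL.cdf((hi - d_obs) / se)
          - STD_NORMAL.cdf((lo - d_obs) / se)) / (hi - lo)
    # Likelihood under H0: normal density centred on zero.
    m0 = NormalDist(0.0, se).pdf(d_obs)
    return float("inf") if m0 == 0.0 else m1 / m0

def simulate_rates(n_per_group, n_sims=1000, threshold=3.0, seed=0):
    """Return (power, false positive rate) for BF10 >= threshold."""
    rng = random.Random(seed)
    hits = {"H1": 0, "H0": 0}
    for _ in range(n_sims):
        for label, true_d in (("H1", rng.uniform(0.2, 0.9)), ("H0", 0.0)):
            nl = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
            fl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            if approx_bf10(cohens_d(nl, fl), n_per_group) >= threshold:
                hits[label] += 1
    return hits["H1"] / n_sims, hits["H0"] / n_sims

if __name__ == "__main__":
    power, fpr = simulate_rates(110)
    print(f"power ~ {power:.2f}, false positive rate ~ {fpr:.3f}")
```

Repeating `simulate_rates` over a grid of sample sizes and picking the smallest n that meets the power and false positive targets mirrors the decision shown in Figure 1, under the simplifying assumptions listed above.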

2.3. Data collection and exclusion criteria

In line with the results of the BFDA, we collected data from exactly 220 participants (110 participants per group). This target sample size was reached after applying exclusion criteria to an initial sample of 241 participants, as specified in our pre-registration and partially described in the recruitment form presented to participants. Specifically, participants were excluded if they took part in the experiment more than once (zero participants), read the instructions in less than 10 seconds (four participants), completed the trial section in less than 160 seconds (zero participants) or responded in less than 2 seconds to any of the key questions concerning emotional engagement with the instructions, the causality rating, the previous experience with medicine, task disfluency or the estimated number of trials (zero participants). We also excluded participants who declared English as their native language (six participants), those who indicated that Italian was not their native language (two participants) and those who reported residing in an English-speaking country at the time of the experiment (one participant, who also reported being a native English speaker).

Regarding language proficiency, the pre-registration specified exclusion criteria based on the Common European Framework of Reference for Languages (CEFR) scale.² Specifically, participants who reported an A1 level of English (one participant) were to be excluded, as this level suggests they might not be able to understand the instructions, as were participants who reported a C2 level (one participant), as this level indicates excessive fluency in the foreign language. The C2 participant had already been excluded for completing the instructions in less than 10 seconds. The A1 participant was retained, as their performance on an objective English test administered to all participants at the end of the experimental procedure was 23/25, indicating high proficiency (see the Procedure).

Furthermore, one participant was excluded due to incomplete data, and one participant was excluded for completing the experiment in more than 3 hours (median time to complete the experiment was approximately 15 minutes; Mdn = 14 minutes and 47 seconds). In the FL condition, we applied additional exclusion criteria based on performance on the objective English test: we excluded participants who scored the maximum 25/25 (six participants), as this suggests native-like proficiency, and those who scored below 7/25 (one participant), a performance level consistent with random guessing.
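The threshold-based exclusion rules described in this section can be expressed as a simple filter. The following Python sketch is only an illustration: the `Participant` record and its field names are hypothetical and do not reflect the column names in the study's data files, and criteria that were handled case by case (repeated participation, incomplete data, excessive total duration and self-reported CEFR level) are deliberately left out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Participant:
    # All field names are hypothetical, chosen for readability.
    condition: str                      # "NL" or "FL"
    instruction_time_s: float           # time spent reading instructions
    trial_phase_time_s: float           # time spent on the trial section
    min_key_response_time_s: float      # fastest response to a key question
    native_italian: bool
    native_english: bool
    lives_in_english_speaking_country: bool
    english_test_score: int             # objective test, 0 to 25

def exclusion_reasons(p: Participant) -> List[str]:
    """Return the list of threshold-based criteria a participant violates."""
    reasons = []
    if p.instruction_time_s < 10:
        reasons.append("instructions read in under 10 s")
    if p.trial_phase_time_s < 160:
        reasons.append("trial section completed in under 160 s")
    if p.min_key_response_time_s < 2:
        reasons.append("key question answered in under 2 s")
    if p.native_english:
        reasons.append("native English speaker")
    if not p.native_italian:
        reasons.append("not a native Italian speaker")
    if p.lives_in_english_speaking_country:
        reasons.append("resides in an English-speaking country")
    # Test-score criteria were applied only in the FL condition.
    if p.condition == "FL":
        if p.english_test_score == 25:
            reasons.append("maximum English test score")
        elif p.english_test_score < 7:
            reasons.append("English test score below 7/25")
    return reasons

def apply_exclusions(sample: List[Participant]) -> List[Participant]:
    """Keep only participants with no exclusion reason."""
    return [p for p in sample if not exclusion_reasons(p)]
```

Returning the list of reasons rather than a single flag makes it easy to tabulate how many participants each criterion removes, as reported in the text.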

2.4. Participants

Participants were recruited primarily via the Prolific platform (https://www.prolific.com/), where those who agreed to participate were paid £3 to complete the online experiment (N = 198, N = 182 after exclusions). The remaining participants (N = 43, N = 38 after exclusions) were recruited from the student population at the University of Padua, using the same recruitment form and eligibility criteria as on Prolific.

Our final sample included 220 native Italian speakers who were not native English speakers and who did not live in countries where English is the predominant language at the time of testing. Participants’ ages ranged from 20 to 65 years (M = 32.65 years, SD = 10.61), representing a general adult population with a concentration in early adulthood (1st quartile = 25 years, Mdn = 29 years, 3rd quartile = 39.25 years). This sample was older and more diverse compared to that of the second experiment in Díaz-Lago and Matute’s (2019a) study, in which participants were primarily university students (M = 23.81 years, SD = 9.24). In terms of biological sex distribution, our sample included 102 female and 118 male participants, indicating a roughly balanced composition, in contrast to Díaz-Lago and Matute’s (2019a) second experiment (F = 49, M = 31). Participants’ years of formal education ranged from 8 to 24 years (M = 16.6 years, SD = 2.62). Specifically, two participants had completed only primary school, 27 had completed lower secondary school, 74 held a high school diploma, 80 held a bachelor’s degree and 37 had obtained a master’s degree or higher. Regarding enrolment in a formal education programme, 101 participants were enrolled and 119 were not.

Regarding English certifications, 139 participants (63.2%) reported not holding any English language diploma, while 14 participants (6.4%) had obtained one within the past 2 years and 67 participants (30.5%) had received their diploma more than 2 years prior. Among those reporting diploma levels aligned with the CEFR scale (see Footnote 2) and the International English Language Testing System (IELTS) scale, the most frequently reported level was B2 (equivalent to IELTS 5.5–6), reported by 35 participants. This was followed by 27 participants reporting C1 proficiency (IELTS 7–8). Fewer participants reported levels of B1 (N = 11), A2 (N = 3) or A1 (N = 1). One participant did not recall his or her level. Additionally, three participants gave open-ended responses indicating alternative forms of certification: one reported a TOEFL score of 98 (approximately C1 level), another referenced a Cambridge Proficiency certificate from the 1990s with an unclear score (93/100) and one declared that he or she could not recall any relevant details.

The AoA for the FL in our sample ranged from 3 to 25 years, with a mean of 8.12 years (SD = 3.22). The median AoA was 7 years, with the first quartile at 6 years and the third quartile at 10 years, indicating that participants generally began learning English during childhood. When compared to the AoA reported in Díaz-Lago and Matute’s (2019a) second experiment (M = 6.58 years, SD = 4.32), our sample showed a slightly higher AoA, suggesting marginally later exposure to the FL.

2.5. Procedure

To ensure consistency in the online data collection, participants were instructed to complete the experiment exclusively on a personal computer and to position themselves in a well-lit environment. Prior to beginning the experiment, participants were required to read an informed consent form (see Figure 2, Box 1), which had received ethical approval from the Ethics Committee of the University of Padua (Protocol No. 5010, 3 November 2022). Participation was contingent upon providing informed consent by clicking a response button, thereby affirming their agreement to the terms described.

Figure 2.

Representation of the experimental flow. The diagram shows, through arrows, the sequence of tasks completed by participants. Each white box corresponds to a component of the experimental procedure. The labels ‘ITA’ and ‘ENG’ indicate the language (Italian or English) in which each component was presented.

Participants were automatically and randomly assigned to one of two experimental conditions (NL or FL) via a pseudo-random algorithm embedded in the experimental code. The experiment was hosted on Pavlovia (https://pavlovia.org/) and programmed using PsychoPy (Peirce et al., 2019), with the code compiled into PsychoJS (JavaScript; script available in the supplementary materials). Questionnaires were administered via Pavlovia Surveys (https://pavlovia.org/docs/surveys/overview) and were integrated into the experimental flow (questionnaires are available in the supplementary materials; https://doi.org/10.17605/OSF.IO/PZX9N). Once launched, the experiment occupied the entirety of the computer screen, minimizing the possibility of accessing automatic browser translation features.

The first part of the experiment (Figure 2, Boxes 2–8) was administered in Italian to the participants assigned to the NL condition and in English to the participants assigned to the FL condition. It consisted of seven sequential segments. Together, the first, third and fourth segments (Boxes 2, 4 and 5 in Figure 2) correspond to the classic version of the Allergy Task. In the first segment, participants were introduced to a fictional medical scenario in which they assumed the role of an emergency room doctor (Figure 2, Box 2), tasked with determining whether a causal relationship existed between the administration of a fictitious medication (‘Batatrim’, the potential cause) and recovery from a fictional illness (‘Lindsay Syndrome’, the supposed effect).

In the second segment, participants’ emotional reactions to the presented story were assessed using the Affective Slider (Betella & Verschure, 2016), a digital, continuous-response tool designed to measure valence and arousal (Figure 2, Box 3). This instrument is a modern adaptation of the widely used Self-Assessment Manikin (SAM; Bynion & Feldner, 2020). Participants completed this measure immediately after reading the CLT instruction, responding to the question: ‘How did you feel while reading the story that was presented to you?’ The first slider assessed emotional valence (from sad to happy) and the second slider assessed arousal (from calm to activated), both rated on an analogue scale ranging from 0 to 1. The rationale, based on the emotional distance hypothesis, was that using a FL could attenuate emotional engagement with the task, especially the emotional intensity, thereby modulating the strength of the causal illusion. This emotional assessment phase was specific to our procedure and is not typically part of the CLT.

The third segment presented participants with 40 patient records shown in randomized order, each separated by a 1-second inter-trial interval (Figure 2, Box 4). Each record described one of four combinations of the potential cause and effect: (a) cause present/effect present, (b) cause present/effect absent, (c) cause absent/effect present and (d) cause absent/effect absent. The frequencies of these combinations were set to establish a null contingency condition (ΔP = 0; a = 15, b = 5, c = 15, d = 5) with a high outcome probability, namely a .75 probability that the patient recovered from the syndrome. This combination of frequencies typically elicits an outcome-density bias (Matute et al., 2015). Each record was presented through three horizontally arranged panels, following the typical procedure also employed by Díaz-Lago and Matute (2019a). The top panel, which remained visible throughout the trial, indicated whether the cause was present or absent, that is, whether the patient had taken Batatrim (e.g., ‘The patient has taken the Batatrim’). The middle panel, also visible throughout, posed a predictive question (‘Do you think the patient will overcome the crisis?’), and participants responded by selecting one of two buttons. There was no time limit for this response. Upon submitting a response, a third panel appeared displaying the presence or absence of the effect (e.g., ‘The patient has overcome the crisis’). Participant responses did not affect the effect shown. After a brief interval, the next trial began, and the sequence continued until all 40 trials had been completed.
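The null-contingency design can be verified directly from the four cell frequencies. The following is a minimal sketch, assuming the standard definition ΔP = P(effect | cause) − P(effect | no cause); the function names are illustrative and do not come from the authors' experiment code.

```python
def delta_p(a: int, b: int, c: int, d: int) -> float:
    """Contingency index: P(effect | cause) - P(effect | no cause)."""
    return a / (a + b) - c / (c + d)


def outcome_probability(a: int, b: int, c: int, d: int) -> float:
    """Overall probability of the effect across all trials."""
    return (a + c) / (a + b + c + d)


# Cell frequencies reported in the text
cells = dict(a=15, b=5, c=15, d=5)
print(delta_p(**cells))              # contingency between medication and recovery
print(outcome_probability(**cells))  # overall recovery rate across the 40 records
```

Because recovery is equally likely with and without the medication (15 of 20 records in each case), any causal rating above zero reflects the illusion rather than the programmed contingency.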

In the fourth segment (Figure 2, Box 5), participants rated the perceived causal strength between the medication and patient recovery (‘To what extent do you think that Batatrim has been effective in healing the crises of the patients you have seen?’) using a discrete visual analogue scale ranging from 0 (‘definitely not’) to 100 (‘definitely yes’). This was the main dependent variable. Upon clicking the scale, a cursor appeared allowing participants to select a value, which was then displayed numerically below the scale.

The fifth segment (Figure 2, Box 6) assessed general beliefs about the effectiveness of medications with the following single question: ‘Beyond the specific case of Batatrim, in your PERSONAL experience, how effective are medications in treating diseases IN GENERAL?’ (101-point Likert scale ranging from 0 = ‘definitely not’ to 100 = ‘definitely yes’). This item was intended to probe individual background beliefs that might provoke a stronger causal evaluation in NL. Indeed, according to the context of acquisition hypothesis, these background beliefs could be more easily accessed in NL than in a FL.

In the sixth segment (Figure 2, Box 7), participants estimated how many trials they observed for each of the four cue–outcome pairings (i.e., a, b, c, d scenarios), in randomized order. This measure was designed to assess whether participants’ memory for different event contingencies was modulated by the language of task presentation, consistent with the hypothesis that language of encoding influences memory retrieval processes.

In the seventh segment (Figure 2, Box 8), a single-item measure of perceived task difficulty was presented, adapted from Graf et al. (2018). Participants were asked: ‘How difficult did you find the reading and comprehension activities during the task?’ (7-point Likert scale ranging from 1 = ‘very easy’ to 7 = ‘very difficult’). This measure was intended to capture participants’ subjective experience of cognitive effort during the CLT, under the assumption that lower fluency in a FL increases perceived difficulty and may trigger more deliberative, analytical reasoning.

The second part of the experiment (Figure 2, Box 9), presented in Italian to all participants, consisted of a series of questions regarding demographic and linguistic characteristics. In the demographic section, participants were asked to report their age, biological sex, whether they were currently studying and their total years of formal education. In the linguistic section, participants were asked whether they held an English language certification (and, if so, which one), whether they were native speakers of Italian or English and whether they were living abroad in an English-speaking country at the time of testing. In addition, to assess self-perceived language proficiency in English, participants rated their skills in writing, comprehension, speaking and reading on Likert scales ranging from 1 to 10. These items were adapted from the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian et al., 2007) and were selected to allow direct comparison with the original study of Díaz-Lago and Matute (2019a), where a composite measure summing these four dimensions was reported. Only selected items relevant to the current study were used; the full questionnaire was not administered.

In the third part of the experiment (Figure 2, Box 10), participants completed an objective English proficiency test adapted from a publicly available Cambridge English assessment (https://www.cambridgeenglish.org/test-your-english/general-english/). The test, administered in English to all the participants, included 25 multiple-choice questions, each scored as correct or incorrect, designed to measure grammatical and vocabulary competence. This allowed for an objective evaluation of participants’ proficiency in the FL.

3. Results

Before presenting the main confirmatory analyses, it is important to assess the comparability of the NL and FL groups in key demographic variables and linguistic competence. To assess potential group differences (i.e., NL vs. FL) in demographic variables, such as sex, age and years of education, we conducted a model comparison for every possible combination of the predictors group and recruitment source (i.e., Prolific vs. University), with a series of generalized linear models (GLMs) for the continuous variables. The appropriate distributions of reference were selected based on distributional assessments via Cullen and Frey graphs. For the categorical variable of sex, we applied BF contingency table analyses.³ Results of the model comparisons, evaluations of the best-fitting model, predictive checks and point estimates with their corresponding confidence intervals are provided in the supplementary materials (https://doi.org/10.17605/OSF.IO/PZX9N). This also applies to the other between-group comparisons described in the current section. In brief, the results indicated that participants recruited from the university were younger, had fewer years of formal education and were more frequently female than those recruited via Prolific. However, the distribution of these demographic variables was consistent across the NL and FL groups.

With respect to linguistic competence, a self-reported FL proficiency sum score was computed for each participant,⁴ and subsequent analyses supported the comparability of our sample with that of Díaz-Lago and Matute (2019a). Specifically, the average self-reported FL proficiency sum score in our sample was M = 29.63, SD = 4.86, slightly higher than the mean reported in Díaz-Lago and Matute’s (2019a) second experiment (M = 28.96, SD = 3.37). However, the observed difference between the two samples (d = 0.15) was negligible: a Bayesian t-test comparing the two studies yielded BF01 = 3.81, providing moderate evidence in favour of the null hypothesis (i.e., no meaningful difference between the samples).⁵ As for the objective English test, participants achieved a mean total score of 18.87 out of 25 (Mdn = 20; SD = 4.08), which aligns reasonably well with the self-reported FL proficiency scores (M = 29.63 out of 40), corresponding to roughly three quarters of the total available points on both measures. The correlation between subjective and objective proficiency sum scores was moderate (r = .49), suggesting a fair degree of agreement between perceived and actual language ability. No difference emerged between the NL and FL groups in the objective English test performance.⁶

The main confirmatory hypothesis, pre-registered on OSF, aimed to replicate the effect of language condition (NL vs. FL) on the illusion of causality. Specifically, we hypothesized that participants in the FL condition would exhibit a reduced illusion of causality compared to those in the NL condition. This directional hypothesis informed both our a priori design analysis and sample size determination, conducted using a BFDA approach. Contrary to our expectations, participants in the FL group exhibited a negligibly greater illusion of causality (M = 59.61, SD = 22.92) than those in the NL group (M = 55.76, SD = 22.38; d = −0.17; Figure 3, left panel).⁷ To test the main hypothesis, we computed a BF comparing two models: one corresponding to the alternative hypothesis, which included a group difference in the hypothesized direction, with a custom prior on the standardized effect size placing .95 of its mass between 0.2 and 0.9 (see Footnote 1), and one corresponding to the null hypothesis, which assumed no difference between groups. This analysis yielded strong evidence in favour of the null hypothesis (BF01 = 366.2; Figure 3, top-right panel). To address potential concerns regarding the use of an informed prior, we also conducted a second BF t-test, using the default half-Cauchy prior (truncated to positive values) placed on the standardized effect size under H1, and a point null prior under H0. The results again supported the null hypothesis, yielding a BF01 of 14.47 (Figure 3, top-right panel).
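As a rough check on the reported standardized difference, Cohen's d can be recomputed from the group means and SDs given above. This is a minimal sketch using the pooled-SD form of d; the per-group sample size of 110 is an assumption (an even split of the 220 participants), since the exact group sizes are not stated in this section.

```python
import math


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


# Means and SDs from the text; n = 110 per group is an ASSUMPTION (220 / 2)
d = cohens_d(55.76, 22.38, 110, 59.61, 22.92, 110)  # NL minus FL
print(round(d, 2))
```

The negative sign follows from subtracting the FL mean from the NL mean: the hypothesis predicted a positive value (higher ratings in NL), whereas the observed difference points the other way.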

Figure 3.

Summary of main results. The left panel illustrates the causality ratings observed in the two experimental groups. The top-right panel presents BF evidence from the confirmatory analyses, including results for the main hypothesis tested with both a customized prior and the default Cauchy prior. Additionally, it displays the BF for the best-fitting model that incorporates relevant demographic covariates. The bottom-right panel shows BF evidence supporting the absence of group differences across the exploratory measures. In the two right panels, only the larger of the two possible Bayes factors (BF01 or BF10, see legend) is shown for each model.

We also conducted a Bayesian general linear model comparison to assess the effects of sex, age, education and group (NL vs. FL) on the illusion of causality, evaluating all possible combinations of main effects and interactions. The model with the strongest support included sex and education as additive predictors, yielding a BF of 8.06 relative to the null model (Figure 3, top-right panel). A follow-up Bayesian linear analysis revealed a high probability of a negative relationship between years of formal education and the illusion of causality (median slope = −1.46, 95% CrI [−2.54, −0.35], 99.62% < 0), as well as a high probability of a reduced illusion of causality in males (median difference = −6.92, 95% CrI [−12.89, −0.87], 98.62% < 0). Models including age or group – either alone or in combination – received weaker support (BFs < 1.5), suggesting that these variables contribute minimally to explaining individual differences in the illusion of causality. Detailed results of the model comparisons and the follow-up Bayesian linear analysis are provided in the supplementary materials.

Finally, we conducted Bayesian t-tests comparing group differences on each exploratory measure, using the default Cauchy prior for the alternative hypothesis and a point null prior for the null hypothesis. The resulting BF01 values ranged from 4.43 to 16.43 across the various measures (Figure 3, bottom-right panel). This suggests that performing the CLT in a FL rather than in the NL did not directly affect the perceived task difficulty, the level of emotional involvement, the perceived effectiveness of medications in treating diseases in general or the estimated frequency of each trial type. Detailed results are provided in the supplementary materials.

4. Discussion

Using a sample more than five times larger than that in Díaz-Lago and Matute’s (2019a) second experiment (null contingency condition), we were unable to replicate the main confirmatory findings, which formed the core of our study. This failure to replicate cannot be attributed to flaws in the study design or insufficient statistical power. Our BF comparison was both pre-registered and pre-planned, with the study powered (≥0.80) using a BFDA approach to detect the expected effect sizes. Interestingly, the observed group difference was small and in the opposite direction to the original hypothesis, further contradicting the prediction of a reduction in the illusion of causality when reasoning in a foreign language. Thus, our results provide clear and robust evidence for the absence of a reduction in the illusion of causality in the FL condition. One possible counter-argument is that our sample may not have adequately represented individuals with an intermediate level of English proficiency. Our recruitment strategy targeted participants whose English proficiency was sufficient to understand the task but still low enough to potentially trigger a disfluency effect – presumed to be the mechanism underlying the original study’s findings. If this interpretation is accurate, it is notable that the only directly comparable measure between the two studies – self-reported proficiency – showed no meaningful difference.

One possible explanation for our failure to replicate previous findings is that earlier studies may have been underpowered, an issue that remains widespread in psychological research and contributes significantly to the broader replication crisis. Underpowered studies are more likely to produce inflated effect sizes, false positives and results that fail to replicate (McShane et al., 2019). Nonetheless, we recognize that psychological research is often conducted under constraints of time, funding and access to participants. For this reason, replication of preliminary findings should be considered a critical step before investing resources into more fine-grained investigations of the underlying mechanisms. Our study was designed with this consideration in mind. We adopted a hybrid structure that combined confirmatory replication with exploratory analysis. This dual approach provides a constructive model for advancing psychological science: it allows researchers to contribute robust evidence in support of (or against) a given phenomenon, while also generating informative hypotheses that can guide subsequent confirmatory work. We believe this kind of design, though perhaps not always suitable, can help address the hesitation to replicate, as it allows both for the validation of existing findings and the discovery of new, potentially explanatory variables. In this sense, our study complements the work of Díaz-Lago and Matute (2019a).

Nonetheless, with respect to their original findings, it cannot be entirely ruled out that differences in sample characteristics may have contributed to the failure to replicate the effect. It is possible that undergraduate student samples, such as those tested in the original study, are more susceptible to the FLE than the more heterogeneous population examined in this study. A further point worth acknowledging is that the original experiment was conducted across two sessions, with the second session devoted exclusively to the CLT and administered entirely in the FL. This design feature may have amplified language-switching processes, raising the possibility that the observed effect was driven, at least in part, by language switching itself (e.g., Oganian et al., 2016).

Although the primary confirmatory hypothesis was not replicated, we observed a reduced illusion of causality in males and more educated participants. These results should be interpreted with caution, as they may reflect a systematic preference for the mid–low range of the response scale, indicative of a generally sceptical response style, rather than a genuinely reduced sensitivity to the illusion of causality. However, they point to a valuable direction for future research: ensuring large samples with balanced demographics, to enhance the generalizability of results and avoid potential confounds in between-group comparisons.

To conclude, it is important to note that recent meta-analyses have reported a small FLE primarily in the domain of moral dilemmas (d ≈ 0.2; Circi et al., 2021; Del Maschio et al., 2022). In contrast, despite the balancing of our FL and NL groups for key demographic variables, our results showed no evidence of an FLE in the context of the illusion of causality. This evidence is consistent with the results of studies showing no evidence of an FLE for biases such as outcome bias and the representativeness heuristic (Circi et al., 2021; Del Maschio et al., 2022; Vives et al., 2018). In light of these findings, a tentative explanation could be that the FLE may be limited to moral dilemmas or similar emotionally driven tasks (but see Hu et al., 2026). In moral tasks, the link between bias, emotional resonance and social norms is more direct and influential (Del Maschio et al., 2022). Unlike the task used to assess the illusion of causality, moral dilemmas may more strongly engage the affective system (Greene et al., 2001) and/or make learned norms more salient (Geipel et al., 2015). Using a FL may alter emotional processing or access to social norms, potentially eliciting more normative responses in moral dilemmas, whereas more neutral biases, such as the illusion of causality, remain unaffected.

Data availability statement

The study was pre-registered on the Open Science Framework (OSF) at https://doi.org/10.17605/OSF.IO/PZX9N. All study materials, including raw data, are openly available in the associated OSF project at https://doi.org/10.17605/OSF.IO/HVGKX.

Competing interests

The authors declare no competing interests and thank all participants for their contribution to this study.

Ethics statement

The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. The study received approval from the Ethics Committee for Psychological Research at the University of Padova (approval number: 5010) and informed consent was obtained from all individual participants included in the study.

Disclosure of use of AI tools

The authors declare the use of AI tools to check for errors in the main manuscript and to improve the grammar and clarity of the written text. AI tools were also used to verify the correctness and enhance the clarity of the text, code and tables included in the supplementary materials.

Footnotes

This article has earned badges for transparent research practices: Open Data, Open Materials and Preregistration. For details, see the Data Availability Statement.

¹ The analytical version of the prior under the alternative hypothesis was defined as

$$ f(x)=\left(\frac{1}{1+{e}^{-40\left(x-0.2\right)}}+\frac{1}{1+{e}^{35\left(x-0.9\right)}}-1\right)\cdot \frac{10}{7}, $$

which approximates a uniform distribution with .95 of the mass between 0.2 and 0.9.
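The stated properties of this prior can be checked numerically. The sketch below, which uses a simple trapezoidal rule and an integration range of [−1, 2] chosen here for illustration, evaluates the density from the formula above and estimates both its total mass and the mass falling between 0.2 and 0.9. It is an independent check, not the authors' analysis code.

```python
import math


def prior_density(x: float) -> float:
    """Density from Footnote 1: two logistic edges, rescaled by 10/7."""
    rising = 1.0 / (1.0 + math.exp(-40.0 * (x - 0.2)))   # soft lower edge at 0.2
    falling = 1.0 / (1.0 + math.exp(35.0 * (x - 0.9)))   # soft upper edge at 0.9
    return (rising + falling - 1.0) * 10.0 / 7.0


def integrate(f, lo: float, hi: float, steps: int = 20000) -> float:
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


# The range [-1, 2] is an illustrative choice; the density is negligible outside it
total_mass = integrate(prior_density, -1.0, 2.0)
central_mass = integrate(prior_density, 0.2, 0.9)
print(f"total mass: {total_mass:.3f}")
print(f"mass in [0.2, 0.9]: {central_mass:.3f}")
```

The factor 10/7 is the height of a uniform density on an interval of width 0.7, which is why the function integrates to approximately 1, with the remaining mass sitting in the two soft edges just outside 0.2 and 0.9.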

² CEFR proficiency diploma levels range from A1, indicating minimal proficiency, to C2, indicating native-like proficiency. The complete set of levels, in order of increasing proficiency, is A1, A2, B1, B2, C1 and C2.

³ These BFs were computed in R (R Core Team, 2024), using the BayesFactor package (Morey & Rouder, 2024), and they test the assumption of independence in contingency tables, following the approach proposed by Günel and Dickey (1974).

⁴ To validate the use of a composite score in our dataset, we conducted a confirmatory factor analysis (CFA) to examine whether the four self-assessment items loaded onto a single latent factor of FL proficiency. The model allowed for residual covariance between the reading and comprehension indicators due to their conceptual similarity. The CFA showed excellent model fit (RMSEA = 0.05, TLI = 0.99, CFI > 0.99), with standardized factor loadings ranging from 0.65 to 0.99, indicating strong unidimensionality. Reliability analyses further confirmed the internal consistency of the composite scale, yielding a Cronbach’s alpha of 0.87 and a McDonald’s omega of 0.93.

⁵ The Bayesian t-test was conducted in R (R Core Team, 2024), using the BayesFactor package (Morey & Rouder, 2024). Specifically, we applied the ttestBF() function, which implements the Bayesian independent samples t-test. Under the alternative hypothesis, the standardized effect size (Cohen’s d) was modelled using a Cauchy prior distribution with a scale parameter of √2/2. This prior implies a 50% probability that the effect size lies between ±0.707. The null hypothesis was defined as a point null, corresponding to a spike prior centred at zero, representing the assumption of no effect.
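The 50% statement follows from the Cauchy distribution function: for a zero-centred Cauchy with scale γ, the probability of |δ| < b is (2/π)·arctan(b/γ), which equals one half when b = γ. A minimal sketch of that calculation follows; the function name is illustrative.

```python
import math


def cauchy_central_mass(bound: float, scale: float) -> float:
    """P(-bound < delta < bound) for a zero-centred Cauchy prior with the given scale."""
    return 2.0 * math.atan(bound / scale) / math.pi


scale = math.sqrt(2) / 2  # scale parameter stated in the footnote (about 0.707)
print(cauchy_central_mass(scale, scale))  # prior mass within one scale unit of zero
print(cauchy_central_mass(1.0, scale))    # prior mass within |delta| < 1, for comparison
```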

⁶ To investigate whether the two groups (FL vs. NL) differed in objective language ability, for each of the 25 test items, we constructed 2 × 2 contingency tables comparing group condition and response accuracy (correct vs. incorrect). The analysis revealed that for 24 of the 25 questions, the BF indicated either weak (1 < BF01 < 3; 6 items) or moderate to strong (BF01 ≥ 3; 18 items) evidence supporting independence between condition and item response. Only one item yielded weak evidence against independence (1/3 < BF01 < 1).
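The verbal bins used in this footnote can be written as a small lookup. This is an illustrative sketch only: the labels for BF01 = 1 and for BF01 ≤ 1/3 are assumptions added for completeness, since the footnote does not name them, and H0 here stands for independence (or, elsewhere in the article, no group difference).

```python
def describe_bf01(bf01: float) -> str:
    """Map a BF01 value to the verbal bins used in the footnote."""
    if bf01 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf01 >= 3:
        return "moderate to strong evidence for H0"
    if bf01 > 1:
        return "weak evidence for H0"
    if bf01 == 1:
        return "no evidence either way"  # ASSUMPTION: bin not named in the text
    if bf01 > 1 / 3:
        return "weak evidence against H0"
    return "moderate to strong evidence against H0"  # ASSUMPTION: mirror of the H0 bins


for value in (366.2, 14.47, 5.10, 3.81):  # BF01 values reported in the article
    print(value, "->", describe_bf01(value))
```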

⁷ We additionally conducted a Bayesian t-test comparing University and Prolific participants, yielding BF01 = 5.10, indicating that the two groups did not differ meaningfully on the main dependent variable.

References

Allan, L. G. (1980). A note on measurement of contingency between two binary variables in judgment tasks. Bulletin of the Psychonomic Society, 15(3), 147–149. https://doi.org/10.3758/BF03334492CrossRef Google Scholar

Allan, L. G., & Jenkins, H. M. (1980). The judgment of contingency and the nature of the response alternatives. Canadian Journal of Psychology/Revue Canadienne de Psychologie, 34(1), 1–11. https://doi.org/10.1037/h0081013CrossRef Google Scholar

Alloy, L. B., & Abramson, L. Y. (1979). Judgment of contingency in depressed and nondepressed students: Sadder but wiser? Journal of Experimental Psychology: General, 108(4), 441–485. https://doi.org/10.1037/0096-3445.108.4.441CrossRef Google Scholar PubMed

Bender, A. (2020). What is causal cognition? Frontiers in Psychology, 11, 3. https://doi.org/10.3389/fpsyg.2020.00003CrossRef Google Scholar PubMed

Betella, A., & Verschure, P. F. M. J. (2016). The affective slider: A digital self-assessment scale for the measurement of human emotions. PLoS One, 11(2), e0148037. https://doi.org/10.1371/journal.pone.0148037CrossRef Google Scholar

Bynion, T.-M., & Feldner, M. T. (2020). Self-assessment manikin. In Encyclopedia of personality and individual differences (pp. 4654–4656). Springer. https://doi.org/10.1007/978-3-319-24612-3_77CrossRef Google Scholar

Circi, R., Gatti, D., Russo, V., & Vecchi, T. (2021). The foreign language effect on decision-making: A meta-analysis. Psychonomic Bulletin & Review, 28, 1131–1141. https://doi.org/10.3758/s13423-020-01871-zCrossRef Google Scholar PubMed

Costa, A., Foucart, A., Arnon, I., Aparici, M., & Apesteguia, J. (2014a). Piensa twice: On the foreign language effect in decision making. Cognition, 130(2), 236–254. https://doi.org/10.1016/j.cognition.2013.11.010CrossRef Google Scholar

Costa, A., Foucart, A., Hayakawa, S., Aparici, M., Apesteguia, J., Heafner, J., & Keysar, B. (2014b). Your morals depend on language. PLoS One, 9(4), e94842. https://doi.org/10.1371/journal.pone.0094842CrossRef Google Scholar

Dalla Bona, S., & Vicovaro, M. (2024). Does perceptual disfluency affect the illusion of causality? Quarterly Journal of Experimental Psychology, 77(8), 1727–1744. https://doi.org/10.1177/17470218231220928CrossRef Google Scholar

Del Maschio, N., Crespi, F., Peressotti, F., Abutalebi, J., & Sulpizio, S. (2022). Decision-making depends on language: A meta-analysis of the foreign language effect. Bilingualism: Language and Cognition, 25(4), 617–630. https://doi.org/10.1017/S1366728921001012CrossRef Google Scholar

Díaz-Lago, M., & Matute, H. (2019a). Thinking in a foreign language reduces the causality bias. Quarterly Journal of Experimental Psychology, 72(1), 41–51. https://doi.org/10.1177/1747021818755326CrossRef Google Scholar

Díaz-Lago, M., & Matute, H. (2019b). A hard to read font reduces the causality bias. Judgment and Decision Making, 14(5), 547–554. https://doi.org/10.1017/s1930297500004848CrossRef Google Scholar

Dickinson, A., Shanks, D., & Evenden, J. (1984). Judgement of act-outcome contingency: The role of selective attribution. The Quarterly Journal of Experimental Psychology, 36(1), 29–50. https://doi.org/10.1080/14640748408401502CrossRef Google Scholar

Dylman, A. S., & Champoux-Larsson, M.-F. (2020). It’s (not) all Greek to me: Boundaries of the foreign language effect. Cognition, 196, 104148. https://doi.org/10.1016/j.cognition.2019.104148CrossRef Google Scholar PubMed

Ezrina, E. (2023). Association strength between concepts as the origin of the “foreign language effect”. CUNY Academic Works. https://academicworks.cuny.edu/gc_etds/5378 Google Scholar

Frey, H. C., & Cullen, A. C. (1995). Distribution development for probabilistic exposure assessment. In A&WMA Annual Meeting (Vol. 11, pp. 95–TA42).Google Scholar

Gao, S., Zika, O., Rogers, R. D., & Thierry, G. (2015). Second language feedback abolishes the “hot hand” effect during even-probability gambling. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 35(15), 5983–5989. https://doi.org/10.1523/JNEUROSCI.3622-14.2015CrossRef Google Scholar PubMed

Geipel, J., Hadjichristidis, C., & Surian, L. (2015). How foreign language shapes moral judgment. Journal of Experimental Social Psychology, 59, 8–17. https://doi.org/10.1016/j.jesp.2015.02.001CrossRef Google Scholar

Graf, L. K., Mayer, S., & Landwehr, J. R. (2018). Measuring processing fluency: One versus five items. Journal of Consumer Psychology, 28(3), 393–411. https://doi.org/10.1002/jcpy.1021CrossRef Google Scholar

Greene, J. D., Sommerville, R. B., Nystrom, L. E., Darley, J. M., & Cohen, J. D. (2001). An fMRI investigation of emotional engagement in moral judgment. Science (New York, N.Y.), 293(5537), 2105–2108. https://doi.org/10.1126/science.1062872CrossRef Google Scholar PubMed

Günel, E., & Dickey, J. (1974). Bayes factors for Independence in contingency tables. Biometrika, 61, 545–557.10.1093/biomet/61.3.545CrossRef Google Scholar

Hayakawa, S., Lau, B. K. Y., Holtzmann, S., Costa, A., & Keysar, B. (2019). On the reliability of the foreign language effect on risk-taking. Quarterly Journal of Experimental Psychology, 72(1), 29–40. https://doi.org/10.1177/1747021817742242

Hu, Z., Martín-Luengo, B., & Navarrete, E. (2026). The impact of foreign language on meta-reasoning in moral decisions. Journal of Cognition, 9(1), 2. https://doi.org/10.5334/joc.472

Jenkins, H. M., & Ward, W. C. (1965). Judgment of contingency between responses and outcomes. Psychological Monographs: General and Applied, 79(1), 1–17. https://doi.org/10.1037/h0093874

Keysar, B., Hayakawa, S. L., & An, S. G. (2012). The foreign-language effect: Thinking in a foreign tongue reduces decision biases. Psychological Science, 23(6), 661–668. https://doi.org/10.1177/0956797611432178

Kroll, J. F., & Stewart, E. (1994). Category interference in translation and picture naming: Evidence for asymmetric connections between bilingual memory representations. Journal of Memory and Language, 33(2), 149–174. https://doi.org/10.1006/jmla.1994.1008

Marian, V., Blumenfeld, H. K., & Kaushanskaya, M. (2007). The language experience and proficiency questionnaire (LEAP-Q): Assessing language profiles in bilinguals and multilinguals. Journal of Speech, Language, and Hearing Research, 50, 940–967. https://doi.org/10.1044/1092-4388(2007/067)

Matute, H., Blanco, F., Yarritu, I., Díaz-Lago, M., Vadillo, M. A., & Barberia, I. (2015). Illusions of causality: How they bias our everyday thinking and how they could be reduced. Frontiers in Psychology, 6, 888. https://doi.org/10.3389/fpsyg.2015.00888

Matute, H., Blanco, F., & Moreno-Fernández, M. M. (2022). Causality bias. In Pohl, R. (Ed.), Cognitive illusions: Intriguing phenomena in thinking, judgment, and memory (3rd ed., pp. 108–123). Routledge. https://doi.org/10.4324/9781003154730-9

McShane, B. B., Tackett, J. L., Böckenholt, U., & Gelman, A. (2019). Large-scale replication projects in contemporary psychological research. The American Statistician, 73(1), 99–105. https://doi.org/10.1080/00031305.2018.1505655

Michotte, A. (2017). The perception of causality. Taylor & Francis. https://doi.org/10.4324/9781315519050 (Original work published 1963).

Moreno Fernández, M. M., Blanco Bregón, F., & Matute, H. (2023). Recent advances in the study of the illusion of causality: Theory, methods, and practical implications. Psicológica: Revista de Metodología y Psicología Experimental, 44(2). https://doi.org/10.20350/DIGITALCSIC/15705

Morey, R. D., & Rouder, J. N. (2024). BayesFactor: Computation of Bayes factors for common designs (Version 0.9.12-4.7) [R package]. https://CRAN.R-project.org/package=BayesFactor

Muda, R., Pennycook, G., Hamerski, D., & Białek, M. (2023). People are worse at detecting fake news in their foreign language. Journal of Experimental Psychology: Applied, 29(4), 712–724. https://doi.org/10.1037/xap0000475

Ng, D. W., Lee, J. C., & Lovibond, P. F. (2024). Unidirectional rating scales overestimate the illusory causation phenomenon. Quarterly Journal of Experimental Psychology, 77(3), 551–562. https://doi.org/10.1177/17470218231175003

Oganian, Y., Korn, C. W., & Heekeren, H. R. (2016). Language switching-but not foreign language use per se-reduces the framing effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(1), 140–148. https://doi.org/10.1037/xlm0000161

Oppenheimer, D. M. (2008). The secret life of fluency. Trends in Cognitive Sciences, 12(6), 237–241. https://doi.org/10.1016/j.tics.2008.02.014

Peirce, J., Gray, J. R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., & Lindeløv, J. K. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1), 195–203. https://doi.org/10.3758/s13428-018-01193-y

Perales, J. C., & Shanks, D. (2007). Models of covariation-based causal judgment: A review and synthesis. Psychonomic Bulletin & Review, 14, 577–596. https://doi.org/10.3758/BF03196807

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Schönbrodt, F. D., & Wagenmakers, E. J. (2018). Bayes factor design analysis: Planning for compelling evidence. Psychonomic Bulletin & Review, 25(1), 128–142. https://doi.org/10.3758/s13423-017-1230-y

Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., & Wagenmakers, E. J. (2019). A tutorial on Bayes factor design analysis using an informed prior. Behavior Research Methods, 51(3), 1042–1058. https://doi.org/10.3758/s13428-018-01189-8

Vives, M. L., Aparici, M., & Costa, A. (2018). The limits of the foreign language effect on decision-making: The case of the outcome bias and the representativeness heuristic. PLoS One, 13(9), e0203528. https://doi.org/10.1371/journal.pone.0203528

Waldmann, M. R., Hagmayer, Y., & Blaisdell, A. P. (2006). Beyond the information given: Causal models in learning and reasoning. Current Directions in Psychological Science, 15(6), 307–311. https://doi.org/10.1111/j.1467-8721.2006.00458.x

Wasserman, E. A. (1990). Detecting response-outcome relations: Toward an understanding of the causal texture of the environment. Psychology of Learning and Motivation, 26, 27–82. https://doi.org/10.1016/S0079-7421(08)60051-7

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org. https://doi.org/10.1007/978-3-319-24277-4

Figure 1. Bayes factor design analysis (BFDA) results. The left panel shows the results of BFDA simulations under the alternative hypothesis (H1), with 5,000 simulations per sample size, defined as the number of participants per group. The proportions of BF10 values within specific ranges (indicated by different colours; see the legend) are shown as a function of sample size. Power (PWR) is defined as the proportion of simulations yielding compelling evidence in the expected direction (i.e., BF10 ≥ 3, dark blue bars). A PWR of .80 is achieved with a sample size of 110 participants per group (horizontal dashed white line). The right panel presents the BFDA simulations under the null hypothesis (H0), also with 5,000 simulations per sample size, and reports the proportions of BF01 values. With N = 110 per group, H0 is convincingly supported in more than 95% of simulations (blue bars), and the false positive rate (FPR), defined as the proportion of simulations yielding evidence for H1 when H0 is true, remains below .05 (horizontal dashed white line). Graphics were created in R (R Core Team, 2024), using the ggplot2 package (Wickham, 2016).

Figure 2. Representation of the experimental flow. The diagram shows, through arrows, the sequence of tasks completed by participants. Each white box corresponds to a component of the experimental procedure. The labels ‘ITA’ and ‘ENG’ indicate the language (Italian or English) in which each component was presented.

Figure 3. Summary of main results. The left panel illustrates the causality ratings observed in the two experimental groups. The top-right panel presents BF evidence from the confirmatory analyses, including results for the main hypothesis tested with both a customized prior and the default Cauchy prior. Additionally, it displays the BF for the best-fitting model that incorporates relevant demographic covariates. The bottom-right panel shows BF evidence supporting the absence of group differences across the exploratory measures. In the two right panels, only the larger of the two possible Bayes factors (BF01 or BF10, see legend) is shown for each model.

