Highlights
• Replication of the foreign language effect on the illusion of causality;
• Bayes factor design analysis to guide the replication effort;
• Illusion of causality not reduced by the unbalanced second language;
• Demographic features mediate the illusion of causality;
• More research needed on foreign language’s impact on cognitive biases.
1. Introduction
1.1. The illusion of causality
From a psychological perspective, causality is naively understood as a relationship between two subsequent events, wherein the latter is regarded as the direct consequence of the former. The ability to comprehend causal relationships is crucial for human survival and represents a capacity in which humans typically outperform other species (Bender, Reference Bender2020). The mechanisms underlying our perception and detection of causal connections have been extensively studied across various fields of psychology, including visual perception (e.g., Michotte, Reference Michotte2017), reasoning (e.g., Waldmann et al., Reference Waldmann, Hagmayer and Blaisdell2006) and learning (e.g., Dickinson et al., Reference Dickinson, Shanks and Evenden1984) – highlighting the centrality of causal interpretation as a recurring theme in psychological research. Within the psychology of learning, causality has often been examined from an associative perspective, wherein individuals infer causal relationships based on the repeated co-occurrence of cause and effect events (Wasserman, Reference Wasserman1990).
Within this learning framework of how individuals intuitively form causal representations, the illusion of causality emerges as a well-documented cognitive bias (see Moreno Fernández et al., Reference Moreno Fernández, Blanco Bregón and Matute2023, for a comprehensive and updated summary). This illusion refers to the belief that two events, A and B, are causally related (i.e., A causes B) when no actual causal connection exists. In situations where individuals act as passive observers of multiple co-occurrences between two events, they may overestimate the extent to which these events are causally related, leading to erroneous causal inferences (see also Matute et al., Reference Matute, Blanco, Moreno-Fernández and Pohl2022).
The illusion of causality is commonly investigated using the Contingency Learning Task (CLT) paradigm (Moreno Fernández et al., Reference Moreno Fernández, Blanco Bregón and Matute2023), in which participants observe a sequence of trials, each characterized by the presence or absence of a potential cause (Event A) and an effect (Event B). The CLT structure yields four possible scenarios: (a) both the potential cause and the effect are present, (b) only the potential cause is present, (c) only the effect is present and (d) neither the potential cause nor the effect is present. Each trial corresponds to one of these scenarios, and participants passively observe a randomized sequence of trials, with the experimenter controlling the frequency of each type. Upon completion of all trials, participants are typically asked to evaluate the perceived strength of the causal relationship between the potential cause and the effect on a scale from 0 to 100 (however, recent literature has suggested that the use of a unidirectional 0–100 scale may be theoretically problematic; see Ng et al., Reference Ng, Lee and Lovibond2024).
From a statistical viewpoint, one of the most widely used indices to measure contingency in this context is the ΔP (Perales & Shanks, Reference Perales and Shanks2007), an index related to the chi-square (χ²) statistic (Allan, Reference Allan1980). The ΔP is calculated by subtracting the probability of observing the outcome in the absence of the potential cause, denoted as $ P({B}_1\mid {A}_0) $, from the probability of observing the outcome in the presence of the cause, denoted as $ P({B}_1\mid {A}_1) $ (Jenkins & Ward, Reference Jenkins and Ward1965):
$$ \Delta P=P({B}_1\mid {A}_1)-P({B}_1\mid {A}_0) $$
A large and positive $ \Delta P $ value indicates a strong causal effect of A on B, whereas a value of zero signifies no statistical support for the contingency between the potential cause and the effect. This index can serve as a normative benchmark for evaluating human causal learning (Matute et al., Reference Matute, Blanco, Moreno-Fernández and Pohl2022; Perales & Shanks, Reference Perales and Shanks2007). Theoretically, the overestimation of causal strength, which is the illusion of causality, can occur independently of the actual $ \Delta P $ value. However, empirical studies have primarily documented this phenomenon in null contingency conditions ($ \Delta P=0 $), where no causal relationship is present (Allan & Jenkins, Reference Allan and Jenkins1980).
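As an illustration of the index described above, the following minimal Python sketch computes ΔP from the four trial-type counts (a: cause and effect present; b: only the cause present; c: only the effect present; d: neither present). The function name and the example counts are hypothetical and are not taken from the study materials.

```python
def delta_p(a: int, b: int, c: int, d: int) -> float:
    """Contingency index computed from the four trial-type counts.

    a: cause present, effect present
    b: cause present, effect absent
    c: cause absent, effect present
    d: cause absent, effect absent
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("need at least one cause-present and one cause-absent trial")
    p_effect_given_cause = a / (a + b)        # P(B1 | A1)
    p_effect_given_no_cause = c / (c + d)     # P(B1 | A0)
    return p_effect_given_cause - p_effect_given_no_cause


# Hypothetical counts, chosen only for illustration
positive_contingency = delta_p(16, 4, 4, 16)   # effect is more likely when the cause is present
null_contingency = delta_p(12, 4, 12, 4)       # effect is equally likely with or without the cause
print(positive_contingency, null_contingency)
```

In the second example the outcome occurs on most trials, yet the index is zero because the outcome is equally probable with and without the cause.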
This study specifically focuses on the illusion of causality as it arises from a particular distribution of trial frequencies, known as the outcome-density bias (Matute et al., Reference Matute, Blanco, Yarritu, Díaz-Lago, Vadillo and Barberia2015). The bias emerges when the frequency of trials in which the outcome occurs, namely, scenarios where the cause and the outcome are present and scenarios where the outcome is present without the cause, exceeds the frequency of trials in which the outcome does not occur. Despite the $ \Delta P $ value remaining at zero, people overestimate the extent of the causal relationship (Alloy & Abramson, Reference Alloy and Abramson1979), with average judgements often near 50 on a 0–100 Likert scale, indicating a tendency to perceive a causal link when none exists (e.g., Dalla Bona & Vicovaro, Reference Dalla Bona and Vicovaro2024; Díaz-Lago & Matute, Reference Díaz-Lago and Matute2019a, Reference Díaz-Lago and Matute2019b).
1.2. The foreign language effect
Recent research has increasingly explored whether the illusion of causality can be reduced (Matute et al., Reference Matute, Blanco, Yarritu, Díaz-Lago, Vadillo and Barberia2015). Investigating the conditions under which this illusion diminishes not only serves the practical purpose of informing public education strategies to counteract such biases but also contributes to a deeper understanding of the underlying cognitive processes. In this context, Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a) found that presenting information in a foreign language can reduce the magnitude of the illusion.
The foreign language effect (FLE) was first described by Keysar et al. (Reference Keysar, Hayakawa and An2012), who showed that participants performing a decision-making task in a foreign language (FL) exhibited less biased responses than those completing the same task in their native language (NL). Over the years, the FLE has been replicated across a range of paradigms, including loss-aversion tasks, decision making and moral dilemmas (Circi et al., Reference Circi, Gatti, Russo and Vecchi2021; Del Maschio et al., Reference Del Maschio, Crespi, Peressotti, Abutalebi and Sulpizio2022). For instance, Costa et al. (Reference Costa, Foucart, Arnon, Aparici and Apesteguia2014a) replicated and extended Keysar et al.’s (Reference Keysar, Hayakawa and An2012) findings, providing further evidence that reasoning in a FL reduces susceptibility to cognitive biases. In the domain of moral judgement, an area where the FLE has been extensively studied, the use of a FL appears to promote more utilitarian responses (e.g., Costa et al., Reference Costa, Foucart, Hayakawa, Aparici, Apesteguia, Heafner and Keysar2014b). Nonetheless, recent literature on the FLE has shown that this phenomenon does not consistently arise (Hu et al., Reference Hu, Martín-Luengo and Navarrete2026), and several explanations for this lack of replicability have been proposed, such as differences in task characteristics and the level of emotional engagement elicited by the task (Hayakawa et al., Reference Hayakawa, Lau, Holtzmann, Costa and Keysar2019), as well as cultural influences and linguistic similarity between the languages involved (Dylman & Champoux-Larsson, Reference Dylman and Champoux-Larsson2020).
The reduction of the illusion of causality through the use of a FL, as shown by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a), represents an interesting preliminary finding that may offer valuable insights into the mechanisms underlying this cognitive bias. In a series of two consecutive experiments, the authors observed that participants performing a CLT in a FL exhibited a diminished illusion of causality compared to those using their NL. Although several tentative and post hoc explanations were proposed and discussed, the results were primarily interpreted through the lens of cognitive disfluency. Specifically, the increased difficulty associated with processing information in a non-native language was posited to induce more deliberate and analytical thinking, thereby promoting more normative and less biased judgements.
The mediating role of cognitive disfluency in the illusion of causality was also supported by the results of another study by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019b), in which it was found that presenting the CLT in a hard-to-read font reduced the magnitude of the illusion. However, Dalla Bona and Vicovaro (Reference Dalla Bona and Vicovaro2024) recently failed to replicate and extend this finding, challenging the notion that increased task difficulty alone reduces the illusion of causality. Moreover, in some tasks, operating in a FL may consume mental resources, thereby impairing performance due to increased cognitive load (Muda et al., Reference Muda, Pennycook, Hamerski and Białek2023).
These conflicting findings underscore the need for further empirical investigation into the conditions under which the use of a FL may influence causal reasoning. In particular, it remains unclear whether the observed reduction in the causal illusion is primarily driven by processing fluency, as proposed by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019b), or by alternative mechanisms such as emotional distancing or the context in which the FL was acquired – explanations commonly discussed in the FLE literature (Circi et al., Reference Circi, Gatti, Russo and Vecchi2021).
The processing fluency hypothesis contends that tasks conducted in a FL are often perceived as more cognitively demanding, which in turn fosters more deliberate and analytical processing, ultimately leading to more normative responses (Oppenheimer, Reference Oppenheimer2008). This mechanistic explanation posits that completing the CLT in a less proficient FL increases perceived task difficulty, thereby eliciting greater cognitive engagement and reducing susceptibility to incorrect causal inferences.
The emotional distance hypothesis posits that FL induces emotional detachment, potentially making decision making more adherent to normative rules and less biased (Del Maschio et al., Reference Del Maschio, Crespi, Peressotti, Abutalebi and Sulpizio2022). In studies on the illusion of causality, a commonly used CLT is the so-called Allergy Task, in which participants assess the causal relationship between taking a medicine and recovering from a disease. This task was used by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a) and is also employed in this study. According to the emotional distance hypothesis, the Allergy Task could be less emotionally salient when presented in a FL than in the NL. Although Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a) described the task as relatively emotionally neutral, other findings suggest that modifying its cover story – thus altering its motivational framing – can significantly influence the magnitude of the illusion (Matute et al., Reference Matute, Blanco, Moreno-Fernández and Pohl2022). This indicates that the task is not entirely devoid of emotional content. Given its resemblance to real-life health scenarios, the cover story may elicit affective responses that are attenuated in a FL context. For completeness, it should also be noted that emotional responses may, in some cases, be elicited by task feedback, as suggested by Gao et al. (Reference Gao, Zika, Rogers and Thierry2015). However, in this study, feedback during the trial phase merely indicated whether the effect was present or absent and did not include emotionally valenced cues such as those used in Gao et al. (Reference Gao, Zika, Rogers and Thierry2015); therefore, we believe the trial phase is unlikely to trigger substantial emotional engagement.
The context of acquisition hypothesis posits that the conditions under which the NL and FL were acquired influence information processing and judgement (Geipel et al., Reference Geipel, Hadjichristidis and Surian2015). In the case of the Allergy Task, its health-related context may activate socio-cultural knowledge and normative expectations about the efficacy of medicines. As these beliefs are typically acquired in the NL, presenting the task in the NL may facilitate access to such norms, potentially biasing participants towards perceiving a causal link between taking a medicine and recovery, even before the associative trials begin. In contrast, presenting the task in a FL may reduce the salience of these expectations, promoting a more objective evaluation of causal contingency.
Another possible explanation of the FLE in CLTs refers to language-dependent memory encoding, by which semantic associations would be generally weaker in a FL than in the NL (Kroll & Stewart, Reference Kroll and Stewart1994). Participants performing the task in a FL may form weaker associative links during the learning phase, resulting in slower and less robust acquisition of contingencies (Ezrina, Reference Ezrina2023). These weaker associations may also decay more rapidly, potentially reducing the perceived frequency of co-occurrence between the potential cause and the outcome, and thus lowering the estimated strength of the causal relationship.
Given these multiple and unaddressed explanatory accounts, we sought to replicate the original study by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a) while also adopting an exploratory design aimed at disentangling the underlying mechanisms of the FLE. We believe this hybrid structure strikes an optimal balance between replication and exploration. It enhances the informativeness of the experimental effort regardless of the outcome: if the original finding fails to replicate, the study still provides evidence against its generalizability; if it does replicate (fully or partially), the additional measures offer deeper insights into the psychological processes involved. This approach aligns with current best practices in psychological science where replication is increasingly valued not only for confirming previous findings but also for refining theoretical understanding (McShane et al., Reference McShane, Tackett, Böckenholt and Gelman2019).
2. Methods
2.1. Tools for transparency
The study was pre-registered on the Open Science Framework (OSF) at: https://doi.org/10.17605/OSF.IO/PZX9N. The pre-registration includes detailed documentation of the experimental plan, including the study design, hypotheses, exclusion criteria, results of the design analyses, list of planned variables to be measured, analysis plan and the structure of the experimental procedure, along with supporting files in various formats. All study materials, including the experiment script, questionnaires, raw data, R pipeline for data extraction, statistical analysis scripts, design analysis script, flow diagram of the procedure, back-translation results and other supplementary materials, are openly available in the associated OSF project at: https://doi.org/10.17605/OSF.IO/HVGKX.
2.2. Design analysis for the confirmatory hypothesis
Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a) conducted two experiments to test whether performing a CLT (the Allergy Task) in a FL reduces the illusion of causality. In Experiment 1, 36 participants were assigned to a null contingency outcome-density condition, with 20 participants completing the task in a FL and 16 participants in the NL. In Experiment 2 (N = 80), 40 participants were assigned to the same null contingency condition (20 FL and 20 NL) and 40 to a positive $ \Delta P $ condition (20 FL and 20 NL). Across both experiments, the illusion of causality was reduced when the CLT was completed in a FL compared to the NL.
To inform the design of our replication, we re-analysed the publicly available dataset from Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a), hosted on OSF (https://osf.io/fc9nx/). Specifically, we re-calculated the effect sizes between the NL and FL groups under the null contingency outcome-density conditions. Experiment 1 showed a very large effect size (Cohen’s d = 1.4), while Experiment 2 showed a large effect size (d = 0.8).
We chose to replicate Experiment 2, as its sample characteristics most closely aligned with those of our intended participant pool. Notably, Experiment 1 involved English-speaking Erasmus students who acquired Spanish relatively late in life (Age of Acquisition; AoA: M = 12.61 years, SD = 4.08), whereas Experiment 2 featured Spanish speakers who had learned English at an earlier age (AoA: M = 6.58 years, SD = 4.32). Our sample consisted of native Italian speakers who learned English as a second language in formal educational settings, making Experiment 2 a more appropriate comparison. Unlike the original study, we focused on a two-group comparison (NL vs. FL) within a null contingency condition, omitting the positive contingency condition. This trade-off allowed us to concentrate on the main hypothesis concerning the reduction of the illusion of causality, which occurs in the null contingency condition. The positive contingency condition typically serves only as a control to ensure that any NL–FL differences are specific to the null contingency condition and, thus, to the illusion of causality.
Although the effect sizes observed in Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a) are large, they contrast with more modest average effect sizes reported in recent meta-analyses of the FLE, which tend to settle around d ≈ 0.2 (e.g., Circi et al., Reference Circi, Gatti, Russo and Vecchi2021; Del Maschio et al., Reference Del Maschio, Crespi, Peressotti, Abutalebi and Sulpizio2022). However, these meta-analyses predominantly focus on moral dilemma paradigms, which involve distinct decision-making processes from those involved in associative learning tasks. Additionally, response formats differ across paradigms: moral dilemma studies often involve categorical or binary responses, whereas our CLT uses continuous judgement measures (i.e., causal ratings on a 0–100 Likert scale).
Our estimate of the expected effect size was derived using a heuristic approach, detailed further in our pre-registration (https://doi.org/10.17605/OSF.IO/PZX9N). To determine the necessary sample size for our confirmatory analysis, we employed a simulation-based method grounded in distributional analysis of causal judgement data from Dalla Bona and Vicovaro (Reference Dalla Bona and Vicovaro2024). Like the current study, their experiment was conducted (i) online and (ii) using the PsychoPy software (Peirce et al., Reference Peirce, Gray, Simpson, MacAskill, Höchenberger, Sogo, Kastman and Lindeløv2019). We focused on the group exposed to a null contingency condition with outcome-density bias, without perceptual manipulations. All the participants in that study performed the CLT in Italian, their NL.
Our first step was to identify the most plausible generative distribution underlying the causal judgement data from Dalla Bona and Vicovaro (Reference Dalla Bona and Vicovaro2024). Using a Cullen and Frey graph (Frey & Cullen, Reference Frey and Cullen1995), we determined that judgements followed a truncated normal distribution bounded between 0 and 100. Building on this consideration, we simulated a series of paired distributions for the NL and FL groups, calculating Cohen’s d for each simulated pair. The simulations were parameterized along four key dimensions: (i) simulated values ranged between 0 and 100, using only integers to reflect the discrete nature of Likert-scale responses; (ii) based on the data distribution observed in Dalla Bona and Vicovaro (Reference Dalla Bona and Vicovaro2024), for the NL group, the mean of the truncated normal distribution was set to vary from 55 to 65, with a fixed SD of 20; (iii) for the FL group, the SD was assumed to be slightly higher (20–25), in line with findings by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a); (iv) finally, we defined meaningful differences in means between the NL and the FL groups based on observed response anchoring: in Dalla Bona and Vicovaro (Reference Dalla Bona and Vicovaro2024), 50% of participants anchored responses on values divisible by five. Thus, we hypothesized that a minimum meaningful shift in means would be of five points (i.e., roughly conceptualizing a shift from one ‘anchor’ to the next one), which corresponds to a small effect size (d ≈ 0.2). Conversely, the maximum plausible shift was set at 20 points (four ‘anchor points’, d ≈ 0.9), which would closely align with the effect reported by Díaz-Lago and Matute (Reference Díaz-Lago and Matute2019a). The simulation yielded a distribution of Cohen’s d values which defined our prior distribution for the expected effect size, ranging from a minimal effect (d ≈ 0.2) to a large effect (d ≈ 0.9). 
This prior approximates a uniform distribution bounded between 0.2 and 0.9.
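The simulation logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' design analysis script (which is in R and available on OSF): the uniform sampling of the NL mean, the mean shift and the FL SD within their stated ranges, the rejection-sampling approach, the number of simulated pairs and the per-group size are all assumptions made here for illustration.

```python
import random
import statistics


def sample_truncated_normal(rng, mean, sd, n, low=0, high=100):
    """Draw n integer ratings from a normal distribution restricted to [low, high].

    Simple rejection sampling: each draw is rounded to an integer and
    redrawn if it falls outside the response scale.
    """
    ratings = []
    while len(ratings) < n:
        value = round(rng.gauss(mean, sd))
        if low <= value <= high:
            ratings.append(value)
    return ratings


def cohens_d(group_1, group_2):
    """Cohen's d with a pooled standard deviation (group_1 minus group_2)."""
    n1, n2 = len(group_1), len(group_2)
    var1, var2 = statistics.variance(group_1), statistics.variance(group_2)
    pooled_sd = (((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group_1) - statistics.mean(group_2)) / pooled_sd


def simulate_effect_sizes(n_pairs=500, n_per_group=200, seed=1):
    """Simulate NL/FL rating pairs and return one Cohen's d per pair."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_pairs):
        nl_mean = rng.uniform(55, 65)   # NL mean range given in the text
        shift = rng.uniform(5, 20)      # NL minus FL mean difference (1 to 4 anchors)
        fl_sd = rng.uniform(20, 25)     # FL SD range given in the text
        nl = sample_truncated_normal(rng, nl_mean, 20, n_per_group)
        fl = sample_truncated_normal(rng, nl_mean - shift, fl_sd, n_per_group)
        effects.append(cohens_d(nl, fl))
    return effects


effects = simulate_effect_sizes()
print(min(effects), statistics.median(effects), max(effects))
```

Because each simulated d is estimated from finite samples, the resulting values scatter around the range implied by the chosen mean shifts rather than falling exactly within it.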
We used this simulated prior in a Bayes factor design analysis (BFDA, see Figure 1), a method used to estimate sample size requirements for Bayes factor (BF) hypothesis testing (Schönbrodt & Wagenmakers, Reference Schönbrodt and Wagenmakers2018; Stefan et al., Reference Stefan, Gronau, Schönbrodt and Wagenmakers2019). We used the same distribution to define both the design prior and analysis prior for the standardized effect size d under the alternative hypothesis, in accordance with suggestions that both these priors should be aligned with the researchers’ theoretical expectations (Stefan et al., Reference Stefan, Gronau, Schönbrodt and Wagenmakers2019; see Footnote 1). The null hypothesis was operationalized as a statistical model assuming no differences between groups. Based on 5,000 simulations for each sample size, both under H0 and H1, the BFDA indicated that a sample size of 110 participants per group yields a statistical power of .80 (Figure 1, left panel), while keeping the false positive rate below .05 (Figure 1, right panel).
Figure 1. Bayes factor design analysis (BFDA) results. The left panel shows the results of BFDA simulations under the alternative hypothesis (H1), with 5,000 simulations per sample size, defined as the number of participants per group. The proportions of BF10 values within specific ranges (indicated by different colours; see the legend) are shown as a function of sample size. Power (PWR) is defined as the proportion of simulations yielding compelling evidence in the expected direction (i.e., BF10 ≥ 3, dark blue bars). A PWR of .80 is achieved with a sample size of 110 participants per group (horizontal dashed white line). The right panel presents the BFDA simulations under the null hypothesis (H0), also with 5,000 simulations per sample size, and reports the proportions of BF01 values. With N = 110 per group, H0 is convincingly supported in more than 95% of simulations (blue bars), and the false positive rate (FPR), defined as the proportion of simulations yielding evidence for H1 when H0 is true, remains below .05 (horizontal dashed white line). Graphics were created in R (R Core Team, 2024), using the ggplot2 package (Wickham, Reference Wickham2016).

2.3. Data collection and exclusion criteria
In line with the results of the BFDA, we collected data from exactly 220 participants (110 participants per group). This target sample size was reached after applying exclusion criteria to an initial sample of 241 participants, as specified in our pre-registration and partially described in the recruitment form presented to participants. Specifically, participants were excluded if they took part in the experiment more than once (zero participants), read the instructions in less than 10 seconds (four participants), completed the trial section in less than 160 seconds (zero participants) or responded in less than 2 seconds to any of the key questions concerning emotional engagement with the instructions, the causality rating, the previous experience with medicine, task disfluency or the estimated number of trials (zero participants). We also excluded participants who declared English as their native language (six participants), those who indicated that Italian was not their native language (two participants) and those who reported residing in an English-speaking country at the time of the experiment (one participant, who had also declared being a native English speaker).
Regarding language proficiency, the pre-registration specified exclusion criteria based on the Common European Framework of Reference for Languages (CEFR) scale (see Footnote 2). Specifically, participants who reported an A1 level of English (one participant) were to be excluded, as this would suggest they might not be able to understand the instructions, as were participants who reported a C2 level (one participant), as this would indicate too much fluency in the foreign language. The C2 participant had already been excluded for completing the instructions in less than 10 seconds. The A1 participant was nevertheless retained, as their performance on an objective English test administered to all participants at the end of the experimental procedure was 23/25, indicating high proficiency (see the Procedure).
Furthermore, one participant was excluded due to incomplete data, and one participant was excluded for completing the experiment in more than 3 hours (median time to complete the experiment was approximately 15 minutes; Mdn = 14 minutes and 47 seconds). In the FL condition, we applied additional exclusion criteria based on performance on the objective English test: we excluded participants who scored the maximum 25/25 (six participants), as this suggests native-like proficiency, and those who scored below 7/25 (one participant), a performance level consistent with random guessing.
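The exclusion rules listed above can be summarized as a simple screening function. The Python sketch below is only an illustration: all field names and the data structure are hypothetical and do not mirror the authors' R pipeline or data files, and the thresholds are those reported in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Participant:
    # Every field name is hypothetical and does not mirror the authors' data files.
    condition: str                        # "NL" or "FL"
    n_participations: int = 1
    instructions_time_s: float = 60.0
    trial_phase_time_s: float = 300.0
    fastest_key_response_s: float = 5.0   # fastest answer among the key questions
    native_languages: Tuple[str, ...] = ("Italian",)
    lives_in_english_speaking_country: bool = False
    self_reported_cefr: str = "B2"
    data_complete: bool = True
    total_time_h: float = 0.25
    english_test_score: int = 18          # objective test, out of 25


def exclusion_reasons(p: Participant) -> List[str]:
    """Return every exclusion rule a participant triggers (empty list = retained)."""
    reasons = []
    if p.n_participations > 1:
        reasons.append("took part more than once")
    if p.instructions_time_s < 10:
        reasons.append("read instructions in under 10 s")
    if p.trial_phase_time_s < 160:
        reasons.append("completed trial section in under 160 s")
    if p.fastest_key_response_s < 2:
        reasons.append("answered a key question in under 2 s")
    if "English" in p.native_languages:
        reasons.append("native English speaker")
    if "Italian" not in p.native_languages:
        reasons.append("Italian is not a native language")
    if p.lives_in_english_speaking_country:
        reasons.append("resides in an English-speaking country")
    # Only the C2 rule is encoded here: the text reports that the single A1
    # participant was retained after scoring 23/25 on the objective test.
    if p.self_reported_cefr == "C2":
        reasons.append("self-reported C2 level")
    if not p.data_complete:
        reasons.append("incomplete data")
    if p.total_time_h > 3:
        reasons.append("took more than 3 hours")
    # The objective-test criteria apply to the FL condition only.
    if p.condition == "FL" and (p.english_test_score == 25 or p.english_test_score < 7):
        reasons.append("objective English score at ceiling or below 7/25")
    return reasons


print(exclusion_reasons(Participant("FL", english_test_score=25)))
```

Returning a list of reasons rather than a single flag makes overlapping cases visible, such as a participant who triggers more than one criterion.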
2.4. Participants
Participants were recruited primarily via the Prolific platform (https://www.prolific.com/), where those who agreed to participate were paid £3 to complete the online experiment (N = 198, N = 182 after exclusions). The remaining participants (N = 43, N = 38 after exclusions) were recruited from the student population at the University of Padua, using the same recruitment form and eligibility criteria as on Prolific.
Our final sample included 220 native Italian speakers who were not native English speakers and who did not live in countries where English is the predominant language at the time of testing. Participants’ ages ranged from 20 to 65 years (M = 32.65 years, SD = 10.61), representing a general adult population with a concentration in early adulthood (1st quartile = 25 years, Mdn = 29 years, 3rd quartile = 39.25 years). This sample was older and more diverse compared to that of the second experiment in Díaz-Lago and Matute’s (Reference Díaz-Lago and Matute2019a) study, in which participants were primarily university students (M = 23.81 years, SD = 9.24). In terms of biological sex distribution, our sample included 102 female and 118 male participants, indicating a roughly balanced composition, in contrast to Díaz-Lago and Matute’s (Reference Díaz-Lago and Matute2019a) second experiment (F = 49, M = 31). Participants’ years of formal education ranged from 8 to 24 years (M = 16.6 years, SD = 2.62). Specifically, two participants had completed only primary school, 27 had completed lower secondary school, 74 held a high school diploma, 80 held a bachelor’s degree and 37 had obtained a master’s degree or higher. Regarding enrolment in a formal education programme, 101 participants were enrolled and 119 were not.
Regarding English certifications, 139 participants (63.2%) reported not holding any English language diploma, while 14 participants (6.4%) had obtained one within the past 2 years and 67 participants (30.5%) had received their diploma more than 2 years prior. Among those reporting diploma levels aligned with the CEFR scale (see Footnote 2) and the International English Language Testing System (IELTS) scale, the most frequently reported level was B2 (equivalent to IELTS 5.5–6), reported by 35 participants. This was followed by 27 participants reporting C1 proficiency (IELTS 7–8). Fewer participants reported levels of B1 (N = 11), A2 (N = 3) or A1 (N = 1). One participant did not recall their level. Additionally, three participants gave open-ended responses indicating alternative forms of certification: one reported a TOEFL score of 98 (approximately C1 level), another referenced a Cambridge Proficiency certificate from the 1990s with an unclear score (93/100) and one declared that they could not recall any relevant details.
The AoA for the FL in our sample ranged from 3 to 25 years, with a mean of 8.12 years (SD = 3.22). The median AoA was 7 years, with the first quartile at 6 years and the third quartile at 10 years, indicating that participants generally began learning English during childhood. When compared to the AoA reported in Díaz-Lago and Matute’s (Reference Díaz-Lago and Matute2019a) second experiment (M = 6.58 years, SD = 4.32), our sample showed a slightly higher AoA, suggesting marginally later exposure to the FL.
2.5. Procedure
To ensure consistency in the online data collection, participants were instructed to complete the experiment exclusively on a personal computer and to position themselves in a well-lit environment. Prior to beginning the experiment, participants were required to read an informed consent form (see Figure 2, Box 1), which had received ethical approval from the Ethics Committee of the University of Padua (Protocol No. 5010, 3 November 2022). Participation was contingent upon providing informed consent by clicking a response button, thereby affirming their agreement to the terms described.
Figure 2. Representation of the experimental flow. The diagram shows, through arrows, the sequence of tasks completed by participants. Each white box corresponds to a component of the experimental procedure. The labels ‘ITA’ and ‘ENG’ indicate the language (Italian or English) in which each component was presented.

Participants were automatically and randomly assigned to one of two experimental conditions (NL or FL) via a pseudo-random algorithm embedded in the experimental code. The experiment was hosted on Pavlovia (https://pavlovia.org/) and programmed using PsychoPy (Peirce et al., Reference Peirce, Gray, Simpson, MacAskill, Höchenberger, Sogo, Kastman and Lindeløv2019), with the code compiled into PsychoJS (JavaScript; script available in the supplementary materials). Questionnaires were administered via Pavlovia Surveys (https://pavlovia.org/docs/surveys/overview) and were integrated into the experimental flow (questionnaires are available in the supplementary materials; https://doi.org/10.17605/OSF.IO/PZX9N). Once launched, the experiment occupied the entirety of the computer screen, minimizing the possibility of accessing automatic browser translation features.
The first part of the experiment (Figure 2, Boxes 2–8) was administered in Italian to the participants assigned to the NL condition and in English to the participants assigned to the FL condition. It consisted of seven sequential segments. Together, the first, third and fourth segments (Boxes 2, 4 and 5 in Figure 2) correspond to the classic version of the Allergy Task. In the first segment, participants were introduced to a fictional medical scenario in which they assumed the role of an emergency room doctor (Figure 2, Box 2), tasked with determining whether a causal relationship existed between the administration of a fictitious medication (‘Batatrim’, the potential cause) and recovery from a fictional illness (‘Lindsay Syndrome’, the supposed effect).
In the second segment, participants’ emotional reactions to the presented story were assessed using the Affective Slider (Betella & Verschure, 2016), a digital, continuous-response tool designed to measure valence and arousal (Figure 2, Box 3). This instrument is a modern adaptation of the widely used Self-Assessment Manikin (SAM; Bynion & Feldner, 2020). Participants completed this measure immediately after reading the CLT instructions, responding to the question: ‘How did you feel while reading the story that was presented to you?’ The first slider assessed emotional valence (from sad to happy) and the second slider assessed arousal (from calm to activated), both rated on an analogue scale ranging from 0 to 1. The rationale, based on the emotional distance hypothesis, was that using a FL could attenuate emotional engagement with the task, particularly its intensity, thereby modulating the strength of the causal illusion. This emotional assessment phase was specific to our procedure and is not typically part of the CLT.
The third segment presented participants with 40 patient records shown in randomized order, each separated by a 1-second inter-trial interval (Figure 2, Box 4). Each record described one of four combinations of the potential cause and effect: (a) cause present/effect present, (b) cause present/effect absent, (c) cause absent/effect present and (d) cause absent/effect absent. The frequencies of these combinations were set to establish a null contingency condition (ΔP = 0; a = 15, b = 5, c = 15, d = 5) with a high outcome probability, namely a .75 probability that the patient recovered from the syndrome. This combination of frequencies typically elicits an outcome-density bias (Matute et al., 2015). Each record was presented through three horizontally arranged panels, following the typical procedure also employed by Díaz-Lago and Matute (2019a). The top panel, which remained visible throughout the trial, indicated whether the cause was present or absent, that is, whether or not the patient had taken Batatrim (e.g., ‘The patient has taken the Batatrim’). The middle panel, also visible throughout, posed a predictive question (‘Do you think the patient will overcome the crisis?’), and participants responded by selecting one of two buttons. There was no time limit for this response. Upon submitting a response, a third panel appeared displaying the presence or absence of the effect (e.g., ‘The patient has overcome the crisis’). Participants’ responses did not influence which outcome was shown. After a brief interval, the next trial began, and this sequence repeated until all 40 trials had been completed.
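The null contingency described above can be checked with a short worked example. The following is a minimal sketch, not the authors' PsychoJS code: the function names, the dictionary layout and the fixed seed are illustrative assumptions, and only the four cell frequencies come from the text. It computes ΔP as P(effect | cause) minus P(effect | no cause), the overall outcome probability, and one possible randomized ordering of the 40 records.

```python
from fractions import Fraction
import random

# Cell frequencies reported in the text (a, b, c, d).
CELLS = {
    ("cause present", "effect present"): 15,  # a
    ("cause present", "effect absent"): 5,    # b
    ("cause absent", "effect present"): 15,   # c
    ("cause absent", "effect absent"): 5,     # d
}


def delta_p(a, b, c, d):
    """Contingency index: P(effect | cause) - P(effect | no cause)."""
    return Fraction(a, a + b) - Fraction(c, c + d)


def outcome_probability(a, b, c, d):
    """Overall probability that the effect occurs, regardless of the cause."""
    return Fraction(a + c, a + b + c + d)


def build_trials(cells, seed=None):
    """Expand the cell counts into a shuffled list of trials.

    The seed argument is a hypothetical convenience for reproducibility;
    the source does not describe how the randomization was seeded.
    """
    rng = random.Random(seed)
    trials = [pair for pair, count in cells.items() for _ in range(count)]
    rng.shuffle(trials)
    return trials


a, b, c, d = CELLS.values()
contingency = delta_p(a, b, c, d)
p_outcome = outcome_probability(a, b, c, d)
trials = build_trials(CELLS, seed=1)

print(f"Delta P = {contingency}")
print(f"P(outcome) = {float(p_outcome):.2f}")
print(f"Number of trials = {len(trials)}")
```

Because recovery is equally likely with and without the medication (15 of 20 in both cases), ΔP is exactly zero even though the outcome occurs on three quarters of the trials, which is the condition under which an outcome-density bias is expected.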
In the fourth segment (Figure 2, Box 5), participants rated the perceived causal strength between the medication and patient recovery (‘To what extent do you think that Batatrim has been effective in healing the crises of the patients you have seen?’) using a discrete visual analogue scale ranging from 0 (‘definitely not’) to 100 (‘definitely yes’). This was the main dependent variable. Upon clicking the scale, a cursor appeared allowing participants to select a value, which was then displayed numerically below the scale.
The fifth segment (Figure 2, Box 6) assessed general beliefs about the effectiveness of medications with the following single question: ‘Beyond the specific case of Batatrim, in your PERSONAL experience, how effective are medications in treating diseases IN GENERAL?’ (101-point Likert scale ranging from 0 = ‘definitely not’ to 100 = ‘definitely yes’). This item was intended to probe individual background beliefs that might provoke a stronger causal evaluation in NL. Indeed, according to the context of acquisition hypothesis, these background beliefs could be more easily accessed in NL than in a FL.
In the sixth segment (Figure 2, Box 7), participants estimated how many trials they observed for each of the four cue–outcome pairings (i.e., a, b, c, d scenarios), in randomized order. This measure was designed to assess whether participants’ memory for different event contingencies was modulated by the language of task presentation, consistent with the hypothesis that language of encoding influences memory retrieval processes.
In the seventh segment (Figure 2, Box 8), a single-item measure of perceived task difficulty was presented, adapted from Graf et al. (2018). Participants were asked: ‘How difficult did you find the reading and comprehension activities during the task?’ (7-point Likert scale ranging from 1 = ‘very easy’ to 7 = ‘very difficult’). This measure was intended to capture participants’ subjective experience of cognitive effort during the CLT, under the assumption that lower fluency in a FL increases perceived difficulty and may trigger more deliberative, analytical reasoning.
The second part of the experiment (Figure 2, Box 9), presented in Italian to all participants, consisted of a series of questions regarding demographic and linguistic characteristics. In the demographic section, participants were asked to report their age, biological sex, whether they were currently studying and their total years of formal education. In the linguistic section, participants were asked whether they held an English language certification (and, if so, which one), whether they were native speakers of Italian or English and whether they were living abroad in an English-speaking country at the time of testing. In addition, to assess self-perceived language proficiency in English, participants rated their skills in writing, comprehension, speaking and reading on Likert scales ranging from 1 to 10. These items were adapted from the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian et al., 2007) and were selected to allow direct comparison with the original study of Díaz-Lago and Matute (2019a), where a composite measure summing these four dimensions was reported. Only selected items relevant to the current study were used; the full questionnaire was not administered.
In the third part of the experiment (Figure 2, Box 10), participants completed an objective English proficiency test adapted from a publicly available Cambridge English assessment (https://www.cambridgeenglish.org/test-your-english/general-english/). The test, administered in English to all the participants, included 25 multiple-choice questions, each scored as correct or incorrect, designed to measure grammatical and vocabulary competence. This allowed for an objective evaluation of participants’ proficiency in the FL.
3. Results
Before presenting the main confirmatory analyses, it is important to assess the comparability of the NL and FL groups in key demographic variables and linguistic competence. To assess potential group differences (i.e., NL vs. FL) in demographic variables, such as sex, age and years of education, we conducted a model comparison over every possible combination of the predictors group and recruitment source (i.e., Prolific vs. University), using a series of generalized linear models (GLMs) for the continuous variables. The appropriate reference distributions were selected based on distributional assessments via Cullen and Frey graphs. For the categorical variable of sex, we applied BF contingency table analyses.Footnote 3 Results of the model comparisons, evaluations of the best-fitting model, predictive checks and point estimates with their corresponding confidence intervals are provided in the supplementary materials (https://doi.org/10.17605/OSF.IO/PZX9N). This also applies to the other between-group comparisons described in the current section. In brief, the results indicated that participants recruited from the university were younger, had fewer years of formal education and were more frequently female than those recruited via Prolific. However, the distribution of these demographic variables was consistent across the NL and FL groups.
With respect to linguistic competence, a self-reported FL proficiency sum score was computed for each participant,Footnote 4 and subsequent analyses supported the comparability of our sample with that of Díaz-Lago and Matute (2019a). Specifically, the average self-reported FL proficiency sum score in our sample was M = 29.63, SD = 4.86, slightly higher than the mean reported in Díaz-Lago and Matute’s (2019a) second experiment (M = 28.96, SD = 3.37). However, the observed difference between the two samples (d = 0.15) was negligible: a Bayesian t-test comparing the two studies yielded a BF01 = 3.81, providing moderate evidence in favour of the null hypothesis (i.e., no meaningful difference between the samples).Footnote 5 As for the objective English test, participants achieved a mean total score of 18.87 out of 25 (Mdn = 20; SD = 4.08), which aligns reasonably well with the self-reported FL proficiency scores (M = 29.63 out of 40), corresponding to roughly three quarters of the total available points on both measures. The correlation between subjective and objective proficiency sum scores was moderate (r = .49), suggesting a fair degree of agreement between perceived and actual language ability. No difference emerged between the NL and FL groups in the objective English test performance.Footnote 6
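The "roughly three quarters" statement can be verified with simple arithmetic. The sketch below uses only the means and maximum scores reported in the text; the helper name `proportion_of_maximum` is an illustrative assumption.

```python
def proportion_of_maximum(mean_score, max_score):
    """Express a mean score as a share of the maximum attainable score."""
    return mean_score / max_score


# Objective test: mean of 18.87 out of 25 items.
objective_share = proportion_of_maximum(18.87, 25)

# Self-report: mean sum score of 29.63 out of 40 (four scales with a maximum of 10 each).
self_report_share = proportion_of_maximum(29.63, 40)

print(f"Objective test: {objective_share:.1%} of available points")
print(f"Self-report:    {self_report_share:.1%} of available points")
```

Both shares fall close to 75%, which is the sense in which the two proficiency measures align at the level of sample means. This says nothing about agreement at the individual level, which is what the reported correlation addresses.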
The main confirmatory hypothesis, pre-registered on OSF, aimed to replicate the effect of language condition (NL vs. FL) on the illusion of causality. Specifically, we hypothesized that participants in the FL condition would exhibit a reduced illusion of causality compared to those in the NL condition. This directional hypothesis informed both our a priori design analysis and sample size determination, conducted using a BFDA approach. Contrary to our expectations, participants in the FL group exhibited a slightly greater illusion of causality (M = 59.61, SD = 22.92) than those in the NL group (M = 55.76, SD = 22.38; d = −0.17; Figure 3, left panel).Footnote 7 To test the main hypothesis, we computed a BF comparing two models: one corresponding to the alternative hypothesis, which included a group difference in the hypothesized direction, with a custom prior on the standardized effect size that placed .95 of its mass between 0.2 and 0.9 (see Footnote 1), and one corresponding to the null hypothesis, which assumed no difference between groups. This analysis yielded strong evidence in favour of the null hypothesis (BF01 = 366.2; Figure 3, top-right panel). To address potential concerns regarding the use of an informed prior, we also conducted a second BF t-test, using the default half-Cauchy prior (truncated to positive values) placed on the standardized effect size under H1, and a point null prior under H0. The results again supported the null hypothesis, yielding a BF01 of 14.47 (Figure 3, top-right panel).
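The reported standardized effect size can be approximately recovered from the group means and standard deviations. The sketch below assumes equal group sizes, so that the pooled standard deviation is the root mean square of the two SDs; the group sizes are not stated in this section, and the authors' exact computation may differ. The sign convention (NL minus FL) is also an assumption, chosen so that a positive value would correspond to the hypothesized reduction in the FL group.

```python
import math


def cohens_d_equal_n(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d for two independent groups, assuming equal group sizes.

    With equal n, the pooled SD reduces to sqrt((sd_1^2 + sd_2^2) / 2).
    """
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Group 1 = NL, group 2 = FL (summary statistics from the text).
d = cohens_d_equal_n(mean_1=55.76, sd_1=22.38, mean_2=59.61, sd_2=22.92)

print(f"d (NL - FL) = {d:.2f}")
```

Under these assumptions the result rounds to −0.17, matching the reported value: the difference is small and runs opposite to the predicted direction.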
Figure 3. Summary of main results. The left panel illustrates the causality ratings observed in the two experimental groups. The top-right panel presents BF evidence from the confirmatory analyses, including results for the main hypothesis tested with both a customized prior and the default Cauchy prior. Additionally, it displays the BF for the best-fitting model that incorporates relevant demographic covariates. The bottom-right panel shows BF evidence supporting the absence of group differences across the exploratory measures. In the two right panels, only the larger of the two possible Bayes factors (BF01 or BF10, see legend) is shown for each model.

We also conducted a Bayesian general linear model comparison to assess the effects of sex, age, education and group (NL vs. FL) on the illusion of causality, evaluating all possible combinations of main effects and interactions. The model with the strongest support included sex and education as additive predictors, yielding a BF of 8.06 relative to the null model (Figure 3, top-right panel). A follow-up Bayesian linear analysis revealed a high probability of a negative relationship between years of formal education and the illusion of causality (median slope = −1.46, 95% CrI [−2.54, −0.35], 99.62% < 0), as well as a high probability of a reduced illusion of causality in males (median difference = −6.92, 95% CrI [−12.89, −0.87], 98.62% < 0). Models including age or group – either alone or in combination – received weaker support (BFs < 1.5), suggesting that these variables contribute minimally to explaining individual differences in the illusion of causality. Detailed results of the model comparisons and the follow-up Bayesian linear analysis are provided in the supplementary materials.
Finally, we conducted Bayesian t-tests comparing the two groups on each exploratory measure, using the default Cauchy prior for the alternative hypothesis and a point null prior for the null hypothesis. The resulting BF01 values ranged from 4.43 to 16.43 across the various measures (Figure 3, bottom-right panel). This suggests that performing the CLT in a FL rather than in the NL did not directly affect perceived task difficulty, the level of emotional involvement, the perceived effectiveness of medications in treating diseases in general or the estimated frequency of each trial type. Detailed results are provided in the supplementary materials.
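To make the Bayes factors reported in this section easier to read, the following sketch converts a BF01 into its reciprocal BF10 and into a posterior probability of the null hypothesis. The conversion assumes equal prior odds for H0 and H1; this is an illustrative assumption rather than something stated in the study, and the function names are hypothetical.

```python
def bf10_from_bf01(bf01):
    """BF10 is the reciprocal of BF01."""
    return 1.0 / bf01


def posterior_prob_null(bf01, prior_prob_null=0.5):
    """Posterior probability of H0 given BF01 and a prior probability of H0.

    Posterior odds = Bayes factor * prior odds.
    """
    prior_odds = prior_prob_null / (1.0 - prior_prob_null)
    posterior_odds = bf01 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# BF01 values reported in the Results section.
reported_bf01 = {
    "main hypothesis, custom prior": 366.2,
    "main hypothesis, default prior": 14.47,
    "exploratory measures, lower bound": 4.43,
    "exploratory measures, upper bound": 16.43,
}

for label, bf01 in reported_bf01.items():
    print(
        f"{label}: BF01 = {bf01}, "
        f"BF10 = {bf10_from_bf01(bf01):.4f}, "
        f"P(H0 | data) = {posterior_prob_null(bf01):.3f}"
    )
```

With equal prior odds, a BF01 of 366.2 corresponds to a posterior probability of the null above .99, whereas the weakest exploratory result (BF01 = 4.43) corresponds to roughly .82. A different prior probability for H0 would shift these values accordingly.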
4. Discussion
Using a sample more than five times larger than that in Díaz-Lago and Matute’s (2019a) second experiment (null contingency condition), we were unable to replicate the main confirmatory findings, which formed the core of our study. This failure to replicate cannot be attributed to flaws in the study design or insufficient statistical power. Our BF comparison was both pre-registered and pre-planned, with the study powered (≥0.80) using a BFDA approach to detect the expected effect sizes. Interestingly, the observed group difference was small and in the opposite direction to the original hypothesis, further contradicting the prediction of a reduction in the illusion of causality when reasoning in a foreign language. Thus, our results provide clear and robust evidence for the absence of a reduction in the illusion of causality in the FL condition. One possible counter-argument is that our sample may not have adequately represented individuals with an intermediate level of English proficiency. Our recruitment strategy targeted participants whose English proficiency was sufficient to understand the task but still low enough to potentially trigger a disfluency effect – presumed to be the mechanism underlying the original study’s findings. If this interpretation is accurate, it is notable that the only directly comparable measure between the two studies – self-reported proficiency – showed no meaningful difference.
One possible explanation for our failure to replicate previous findings is that earlier studies may have been underpowered, an issue that remains widespread in psychological research and contributes significantly to the broader replication crisis. Underpowered studies are more likely to produce inflated effect sizes, false positives and results that fail to replicate (McShane et al., 2019). Nonetheless, we recognize that psychological research is often conducted under constraints of time, funding and access to participants. For this reason, replication of preliminary findings should be considered a critical step before investing resources into more fine-grained investigations of the underlying mechanisms. Our study was designed with this consideration in mind. We adopted a hybrid structure that combined confirmatory replication with exploratory analysis. This dual approach provides a constructive model for advancing psychological science: it allows researchers to contribute robust evidence in support of (or against) a given phenomenon, while also generating informative hypotheses that can guide subsequent confirmatory work. We believe this kind of design, though perhaps not always suitable, can help address the hesitation to replicate, as it allows both for the validation of existing findings and the discovery of new, potentially explanatory variables. In this sense, our study complements the work of Díaz-Lago and Matute (2019a).
Nonetheless, with respect to their original findings, it cannot be entirely ruled out that differences in sample characteristics may have contributed to the failure to replicate the effect. It is possible that undergraduate student samples, such as those tested in the original study, are more susceptible to the FLE than the more heterogeneous population examined in this study. A further point worth acknowledging is that the original experiment was conducted across two sessions, with the second session devoted exclusively to the CLT and administered entirely in the FL. This design feature may have amplified language-switching processes, raising the possibility that the observed effect was driven, at least in part, by language switching itself (e.g., Oganian et al., 2016).
Although the primary confirmatory hypothesis was not replicated, we observed a reduced illusion of causality in males and more educated participants. These results should be interpreted with caution, as they may reflect a systematic preference for the mid–low range of the response scale, indicative of a generally sceptical response style, rather than a genuinely reduced susceptibility to the illusion of causality. However, they point to a valuable direction for future research: ensuring large samples with balanced demographics, to enhance the generalizability of results and avoid potential confounds in between-group comparisons.
To conclude, it is important to note that recent meta-analyses have reported a small FLE primarily in the domain of moral dilemmas (d ≈ 0.2; Circi et al., 2021; Del Maschio et al., 2022). In contrast, despite the balancing of our FL and NL groups for key demographic variables, our results showed no evidence of an FLE in the context of the illusion of causality. This evidence is consistent with the results of studies showing no evidence of an FLE for biases such as outcome bias and the representativeness heuristic (Circi et al., 2021; Del Maschio et al., 2022; Vives et al., 2018). In light of these findings, a tentative explanation could be that the FLE may be limited to moral dilemmas or similar emotionally driven tasks (but see Hu et al., 2026). In moral tasks, the link between bias, emotional resonance and social norms is more direct and influential (Del Maschio et al., 2022). Unlike the task used to assess the illusion of causality, moral dilemmas may more strongly engage the affective system (Greene et al., 2001) and/or make learned norms more salient (Geipel et al., 2015). Using a FL may alter emotional processing or access to social norms, potentially eliciting more normative responses in moral dilemmas, whereas more neutral biases, such as the illusion of causality, remain unaffected.
Data availability statement
The study was pre-registered on the Open Science Framework (OSF) at https://doi.org/10.17605/OSF.IO/PZX9N. All study materials, including raw data, are openly available in the associated OSF project at https://doi.org/10.17605/OSF.IO/HVGKX.
Competing interests
The authors declare no competing interests and thank all participants for their contribution to this study.
Ethics statement
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. The study received approval from the Ethics Committee for Psychological Research at the University of Padova (approval number: 5010) and informed consent was obtained from all individual participants included in the study.
Disclosure of use of AI tools
The authors declare the use of AI tools to check for errors in the main manuscript and to improve the grammar and clarity of the written text. AI tools were also used to verify the correctness and enhance the clarity of the text, code and tables included in the supplementary materials.