Different measurements of bilingualism and their effect on performance on a Simon task

Abstract We investigated how operationalizing bilingualism affects the results on a Simon task in a population of monolingual and bilingual native English speakers (N = 166). Bilingualism was measured in different ways within participants, and the measurements were used both as dichotomous and continuous variables. Our results show that the statistical significance and effect size varied across operationalizations. Specifically, the Composite Factor Score (the Language and Social Background Questionnaire’s general score) showed a bilingual disadvantage on reaction times regardless of how it was used (dichotomously or continuously). When dividing participants into monolinguals and bilinguals based on the Nonnative Language Social Use score (a Language and Social Background Questionnaire subscore), differences in accuracy and reaction times were found between the groups, but the Nonnative Language Social Use score did not predict accuracy when used as a continuous variable (only reaction times). Finally, earlier age of acquisition predicted faster reaction times, but only when used on a continuum. Effect sizes were between the small and medium range. No differences on the Simon effect were found. Our results call for caution when comparing studies using different types of measurements, highlight the need for clarity and transparency when describing samples, and stress the need for more research on the operationalization of bilingualism.

In recent years, there has been a lively debate regarding the so-called bilingual advantage in executive functions. It has been suggested that bilingualism causes structural and functional changes in the brain due to the constant control that bilinguals must exert over their languages, which, because they are activated simultaneously (e.g., Marian & Spivey, 2003; Wu & Thierry, 2010), constantly compete for selection (e.g., Bialystok, 2015, 2017; Bialystok, Craik, & Luk, 2012; Li, Legault, & Litcofsky, 2014). The bilingual's task of continuously monitoring the environment, choosing the appropriate language, and inhibiting the other is suggested to lead to domain-general changes in the brain that extend beyond linguistic control (Bialystok, 2009, 2015; Bialystok et al., 2012; Li et al., 2014). Therefore, the rationale is that bilinguals should show an advantage in nonlinguistic tasks recruiting the same executive functions that are used when they control their languages (Bialystok, 2009, 2015, 2017; Bialystok et al., 2012). While there is a large body of research that has found such a bilingual advantage in children (e.g., Blom et al., 2017; Martin-Rhee & Bialystok, 2008; see Barac, Bialystok, Castro, & Sanchez, 2014, for a review) and adults (e.g., Bialystok, Craik, Klein, & Viswanathan, 2004; Costa, Hernández, & Sebastián-Gallés, 2008; Damian, Ye, Oh, & Yang, 2018; see Bialystok et al., 2012), several studies have failed to replicate said results (e.g., Costa, Hernández, Costa-Faidella, & Sebastián-Gallés, 2009; Namazi & Thordardottir, 2010; Paap, Johnson, & Sawi, 2014; Paap, Anders-Jefferson, Mason, Alvarado, & Zimiga, 2018) or have even found a bilingual disadvantage in certain executive functions (e.g., Folke et al., 2016; Paap & Greenberg, 2013; Paap et al., 2017; Papageorgiou, Bright, Tomas, & Filippi, 2018).
The quest to understand which mechanisms could lie behind a potential bilingual advantage, and why it cannot always be found, has led to numerous investigations tackling the issue from different angles (see, for instance, Lehtonen et al., 2018, for a meta-analysis). Despite the extensive number of studies investigating various aspects of executive functioning and potential differences between monolinguals and bilinguals, for instance at the neurological level (see Grundy, Anderson, & Bialystok, 2017, for a review), the debate concerning the existence (and true nature) of the bilingual advantage in executive functions remains, leaving us to wonder why no clear conclusion can be reached.
However, when looking closely at studies where the executive functions of bilinguals are investigated, it becomes clear that the way bilingualism itself is defined and measured varies greatly between different studies (de Bruin, 2019). In a study by Surrain and Luk (2019), where the different labels used to describe bilinguals in the scientific literature in recent years are reviewed, it is clear not only that different studies use different labels or characteristics to describe their bilingual sample but also that those labels are operationalized and measured differently. Not only does the specific facet of bilingualism that is used vary greatly across studies, but the clarity with which the operationalization of bilingualism is described and the extent to which it is measured vary as well (Surrain & Luk, 2019). Clearly, then, there is an enormous lack of consistency in the definition, operationalization, and measurement of bilingualism across studies, which makes the comparison of different results from independent studies problematic, and even impossible in some cases.
This variation may not be surprising considering the complexity of the concept of bilingualism. Important aspects such as proficiency, home versus society language use, language history, and the specific sociolinguistic context (Surrain & Luk, 2019) are all relevant facets of bilingualism that create variability even within bilinguals. For instance, there is a large body of research showing that the type and frequency of code switching in particular (see, for instance, the dynamic restructuring model, Pliatsikas, 2020; and the adaptive control hypothesis, Abutalebi & Green, 2016) lead to functional and structural neurocognitive variations within a given group of bilinguals. Because there are neurological differences between bilingual individuals depending on how they use their languages, it cannot be assumed that all types of bilingualism will potentially affect executive functions in the same way. De Bruin (2019) showed in a review that performance on tasks tapping into executive functions can vary greatly between different types of bilinguals depending on how they are defined, and how their bilingualism is measured. Bilingualism is not a homogenous concept, and differences within bilinguals are an important issue that impedes the direct comparison of results across studies. In addition, if the central characteristic of the participants is operationalized in disparate ways without knowing whether these are equivalent and can be used interchangeably, it could be methodologically problematic to compare different "types" of bilinguals with each other. This has led to an appeal to both define and operationalize bilingualism more rigorously in bilingualism research (e.g., de Bruin, 2019; Poarch & Krott, 2019; Surrain & Luk, 2019).
This methodological issue (of whether different operationalizations of bilingualism are comparable with each other) may be a contributing factor to the contradictory results found in bilingualism research. For instance, in a study by Gathercole et al. (2010), a bilingual advantage was found in elementary school children on a Stroop task. There, participants were classified as bilingual or monolingual based on an extensive background questionnaire where home and school language use was measured. However, in a different study, the participants (also elementary school children) were categorized as bilinguals based on exposure to and proficiency in Spanish and Basque (Duñabeitia et al., 2014). Duñabeitia et al. did not find that bilingual participants performed better on the Stroop task than the monolingual children. Nevertheless, whether these disparate findings are the result of the different ways bilingualism was defined and measured is unclear.
To make matters even more complex, bilingualism is often measured as a categorical variable. However, few people are strictly monolingual or bilingual; rather, most fall somewhere on a range between the two extremes. Thus, it is increasingly argued that bilingualism should be operationalized as a continuous variable in order to better reflect its true nature and thus increase the precision and sensitivity of the operationalization (e.g., Champoux-Larsson & Dylman, 2019; DeLuca, Rothman, Bialystok, & Pliatsikas, 2019; Edwards, 2012; Gullifer et al., 2018; Gullifer & Titone, 2020; Incera & McLennan, 2018; Jylkkä et al., 2017; Kaushanskaya & Prior, 2015; Luk & Bialystok, 2013; Sulpizio, Del Maschio, Del Mauro, Fedeli, & Abutalebi, 2020; Surrain & Luk, 2019; but see Kremin & Byers-Heinlein, 2020, for a suggestion on how to use both categorical and continuous approaches simultaneously based on a factor mixture model and on a grade-of-membership model). Yet, there are no systematic investigations to determine whether this can directly affect the outcomes of studies.
Therefore, in order to take the first step in systematically investigating the effect of defining and measuring bilingualism based on different factors, we conducted a study where we operationalized bilingualism in two different ways (i.e., based on different characteristics of bilingualism, both dichotomously and continuously) and investigated whether these different operationalizations would affect the results on a Simon task within the same participants. Because our aim was to illustrate that different types of operationalization can lead to different results, but not to systematically investigate all possible operationalizations of bilingualism, we chose to use the Language and Social Background Questionnaire (LSBQ; Anderson, Mak, Chahi, & Bialystok, 2018) to measure bilingualism. The LSBQ is a comprehensive and validated tool that suited the purpose of this study well, as it measures several aspects of bilingualism (including proficiency, language use in different contexts, and code switching) and allows bilingualism to be operationalized categorically or on a continuum. The LSBQ also has the advantage of providing different composite scores based on various aspects of bilingualism that cluster together, where each item is weighted according to its relative contribution to the concept. Of particular interest here is the Composite Factor Score, an extensive and comprehensive estimate of bilingualism that comprises all the aspects of language use, including proficiency, that are covered in the LSBQ. Two other composite scores of interest can be found in the LSBQ, namely, the Nonnative Home Use and Proficiency score and the Nonnative Language Social Use score. As language use at home and in society are both important aspects of bilingualism (Surrain & Luk, 2019), these two composite scores are highly relevant when operationalizing bilingualism.
Although both proficiency and code switching are included in at least one of the composite scores from the LSBQ, code switching is particularly interesting on its own due to the evidence that it is associated with structural and functional neurological differences not only between monolinguals and bilinguals but also across different bilinguals (e.g., Abutalebi & Green, 2016; Pliatsikas, 2020), and we therefore chose to investigate it separately. However, an aspect of bilingualism that is relatively often used in research but that is not included in the composite scores of the LSBQ is age of acquisition. Because of its frequency in the bilingual literature, this is also an aspect that we chose to investigate on its own. As for the Simon task, it is a paradigm that has been used in several studies where the executive functions of bilinguals were investigated (e.g., Bialystok et al., 2004, 2005). The Simon task consists of stimuli of two different types (e.g., a square or a circle, or two different letters) shown on the left or the right side of a screen, where the participant's task is to determine which type of stimulus is shown by pressing a key on the right or the left of the keyboard. Of importance, some trials are spatially congruent (e.g., the answer key for a circle is placed on the left side of the keyboard and the circle is shown on the left side of the screen), while other trials are spatially incongruent (e.g., the circle is shown on the right side of the screen). Incongruent trials usually lead to slower response times and more mistakes than congruent trials. The Simon effect is thus the difference in accuracy or reaction time between congruent and incongruent trials. While a bilingual advantage has been found in some studies, meaning that bilinguals showed a reduced Simon effect (e.g., Bialystok et al., 2004, 2005), this advantage is not always replicated (e.g., . 
However, note that the terms advantage and disadvantage, even though they are used extensively in the bilingual-advantage debate, may be misleading. For instance, while slower reaction times are often seen as a disadvantage, they may simply reflect adaptive cognitive mechanisms that can lead to equally good performance overall compared to a group that is faster (e.g., Gullifer & Titone, in press).
Thus, with this study, we aimed to investigate two main questions. First, we investigated whether the definition of bilingualism based on different factors led to different results on the Simon task in terms of accuracy, reaction times, and the Simon effect (measured as the difference between congruent and incongruent trials). Second, we investigated whether the operationalization of bilingualism as a dichotomous versus as a continuous variable affected the interpretation of the results on a Simon task also in terms of accuracy, reaction times, and the Simon effect.

Participants
A total of 166 participants (M age = 37.5, SD age = 10.4; 50% males) were recruited online via Prolific and MechanicalTurk. However, 8 participants failed to fill out the survey properly, making it impossible to evaluate their language profile accurately, and were therefore excluded. The final sample consisted of 158 participants (M age = 37.4, SD age = 10.5; 49.4% males). The participants' highest completed educational level (1 = elementary school or lower, 2 = high school, 3 = professional education, 4 = bachelor's degree, 5 = master's degree, 6 = PhD) was used as an indicator of their socioeconomic status (Mode = 4, n = 79). Forty-two participants reported speaking only English (M age = 38.3, SD age = 9.1; 57.1% males; Mode education = 4, n = 17). The remaining 116 participants (M age = 37.1, SD age = 11; 46.6% males; Mode education = 4, n = 62) reported having English as their first language, and a variety of second languages. The mean reported frequency of use of the most proficient second language on average for speaking, writing, listening, and reading (each on a scale from 0 to 4) was 1.29 (SD = 0.89). For a list of the reported second, third, fourth, and fifth languages, please see Table 1. See the sections below for a detailed description of the sample's language profile based on the different measurements.

Procedure
Members of Prolific or MechanicalTurk self-enrolled in the study in exchange for monetary compensation (1.70 GBP and 1.25 USD, respectively). The Simon task was programmed and presented online via PsyToolkit (Stoet, 2010, 2017). In our version of the task, the letters Q and P were used as stimuli and presented on the left or the right side of the screen for 2000 ms or until an answer was provided. After 12 practice trials, a total of 48 trials were presented, of which half were spatially congruent and half were spatially incongruent (equally distributed across the two stimuli). Participants were instructed to answer as quickly and accurately as possible by pressing the letter P or Q on the keyboard. A fixation cross appeared for 500 ms between trials, either as soon as an answer was provided or after 2000 ms if no answer had been provided. The task was programmed to detect the type of device used by the participants, and only participants using a physical keyboard could complete the experiment.
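As an illustration of the trial structure described above, the following is a minimal Python sketch of a balanced trial list. It is not the PsyToolkit script used in the study. The mapping of Q to the left-hand side and P to the right-hand side (as on a QWERTY keyboard), the shuffled order, and the names `build_trials` and `RESPONSE_SIDE` are assumptions made for the example.

```python
import random

STIMULI = ("Q", "P")
SIDES = ("left", "right")
# Assumption: Q is answered on the left of the keyboard and P on the right,
# as on a QWERTY layout. The source does not spell out this mapping.
RESPONSE_SIDE = {"Q": "left", "P": "right"}


def build_trials(n_trials=48, seed=None):
    """Return a shuffled list of Simon trials balanced over stimulus and side."""
    cells = [(stim, side) for stim in STIMULI for side in SIDES]
    if n_trials % len(cells) != 0:
        raise ValueError("n_trials must be a multiple of 4 to stay balanced")
    reps = n_trials // len(cells)
    trials = [
        {
            "stimulus": stim,
            "side": side,
            # A trial is congruent when the stimulus appears on the same
            # side as its response key.
            "congruent": RESPONSE_SIDE[stim] == side,
        }
        for stim, side in cells
        for _ in range(reps)
    ]
    # Assumption: trial order is randomized; the source does not state this.
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials(48, seed=1)
n_congruent = sum(t["congruent"] for t in trials)
print(len(trials), n_congruent)
```

With 48 trials, each of the four stimulus-by-side cells occurs 12 times, so half of the trials are congruent and the congruent trials are split evenly between the two letters.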
After the Simon task was completed, participants were sent to an online survey where information about their gender, age, and highest completed level of education was collected. Furthermore, in order to measure the participants' language profile, the LSBQ (Anderson et al., 2018) was used. The questionnaire was digitized and presented in Qualtrics. The LSBQ is a comprehensive and validated instrument that allows measuring different facets of bilingualism and provides different types of operationalizations. It consists of three parts, the first one covering demographic and other background information (Part A), the second one covering language background (Part B), and the third part concerning community language use behavior (Part C). In this study, we replaced Part A with our own demographic questions (i.e., gender, age, and education) and used Part B and Part C in their entirety. More specifically, Part B consisted of questions on which language(s) the participant speaks and understands (including English), in which context(s) each language had been acquired, and at what age. Proficiency questions for English followed for speaking, understanding, reading, and writing on a scale from 0 to 100, as well as frequency of use of English for speaking, understanding, reading, and writing on a scale from 0 to 4 (0 = none of the time, 1 = a little of the time, 2 = some of the time, 3 = most of the time, and 4 = all of the time). The proficiency and frequency of use questions were presented again, but for the participant's second language (or, in cases where several second languages were reported, the one that was the most fluent). Participants who only spoke English could skip those questions. As for Part C, questions about language use were first answered on a scale from 0 to 4 (0 = only English, 1 = mostly English, 2 = both languages equally, 3 = mostly the second language, and 4 = only the second language).
Those questions covered which language(s) were heard during different life periods (infancy, preschool age, primary school age, and high school age), which language(s) were used to communicate with different people (parents, siblings, grandparents, other relatives, partner, roommates/other people the person lives with, neighbors, and friends), which language(s) were used in different settings (at home, school, work, social activities, religious activities, hobbies, shopping and other commercial activities, healthcare, or contact with various authorities), and which language(s) were used for various activities (reading, e-mailing, texting, on social media, for writing lists and notes, for watching TV and listening to the radio, watching movies, surfing on the internet, or praying). Finally, a last block of questions asked the participant how often code switching occurred on a scale from 0 to 4 (0 = never, 1 = rarely, 2 = sometimes, 3 = often, and 4 = always) with family members, with friends, and on social media, respectively. From Part B and Part C, different scores can be calculated by using the provided LSBQ Factor Score Calculator (Anderson et al., 2018). We chose the LSBQ in particular as one of its factors is designed specifically to be used either as a continuous variable or as a dichotomous variable, thus suiting the purpose of this study well. Namely, the Composite Factor Score (CFS) is a measurement that includes all questions, weighted according to the validation in Anderson et al. (2018). Another factor that can be computed is the Nonnative Home Use and Proficiency score (HUP), which includes a subset of questions related to second language use and proficiency only (for instance, language used with grandparents, during infancy, proficiency in second language, etc.).
Furthermore, another factor, the Nonnative Language Social Use score (LSU) includes a subset of questions related to second language use in the participant's social life (e.g., at work, when writing e-mails, frequency of code switching with friends, etc.). A last factor, namely, English proficiency, can be computed, but was not used in this study.

Data preparation
For this study, the three scores from the LSBQ described above were used. More specifically, these were the CFS (possible range: -6.582 to 32.32; the higher the score, the more bilingual), the HUP (possible range: -13.9 to 24.163; the higher the score, the more proficient the second language and the more it is used in home settings), and the LSU (possible range: -7.5 to 80.304; the higher the score, the more frequently the second language is used in social settings). In addition, age of acquisition of the second language and code-switching frequency (CS, possible range: 0 to 4, based on the mean of three questions where answers ranged from never to always) were also used to operationalize the participants' language profile since they are frequently used to operationalize bilingualism in bilingual research. Of importance, all the variables were used to divide the participants into two separate groups, but they were also all used as continuous variables. For the CFS, the LSBQ's guidelines were used to create a monolingual and a bilingual group (monolinguals < -3.13: n = 60; bilinguals > 1.23: n = 40; those with a score falling in between these thresholds were excluded: n = 58). As for the HUP and LSU scores, a median split (HUP: Mdn = -4.95; LSU: Mdn = -3.91) was used to create a group of monolinguals (HUP: n = 79, LSU: n = 79) and of bilinguals (HUP: n = 79, LSU: n = 79), as it is a practice that is frequently used in research despite its limitations (MacCallum, Zhang, Preacher, & Rucker, 2002). Participants who reported knowing more than one language (n = 116) were divided into two groups based on the age of acquisition of their L2. Participants who began to acquire their L2 before the age of 5 years were categorized as early bilinguals (n = 40) and those who began acquiring their L2 from the age of 5 years (or older) were categorized as late bilinguals (n = 76).
Age 5 was chosen as the cutoff age based on findings suggesting neurological differences between bilinguals acquiring their L2 before age 5 and those acquiring it afterwards (e.g., Berken, Chai, Chen, Gracco, & Klein, 2016; Bloch et al., 2009). Finally, participants who reported knowing more than one language were also divided into a group of nonswitchers, who on average across the three code-switching questions (with family, with friends, and on social media) reported code switching less frequently than "sometimes" (i.e., mean values below 2: n = 86), and a group of switchers (i.e., mean values of 2 and above: n = 30). For descriptive statistics for each independent variable when treated categorically and continuously, please see Table 2 and Table 3, respectively.
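The grouping rules above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the authors' analysis code: the function names are hypothetical, and the handling of scores exactly equal to the median (assigned here to the lower group) is an assumption, since the source does not say how ties were treated.

```python
from statistics import median


def cfs_group(score, mono_below=-3.13, bi_above=1.23):
    """Classify a Composite Factor Score using the thresholds quoted above."""
    if score < mono_below:
        return "monolingual"
    if score > bi_above:
        return "bilingual"
    return None  # in-between scores were excluded from the group analysis


def median_split(scores):
    """Label each score relative to the sample median (used for HUP and LSU).

    Assumption: scores equal to the median go to the lower group.
    """
    mdn = median(scores)
    return ["bilingual" if s > mdn else "monolingual" for s in scores]


def aoa_group(age_of_acquisition, cutoff=5):
    """Early bilinguals began acquiring their L2 before the cutoff age."""
    return "early" if age_of_acquisition < cutoff else "late"


def switching_group(family, friends, social_media, cutoff=2):
    """Mean of the three 0-4 code-switching items; 2 corresponds to 'sometimes'."""
    mean_cs = (family + friends + social_media) / 3
    return "switcher" if mean_cs >= cutoff else "nonswitcher"


print(cfs_group(-4.0), cfs_group(0.0), cfs_group(2.0))
print(median_split([-6.0, -5.0, -4.0, -3.0]))
print(aoa_group(4), aoa_group(5))
print(switching_group(1, 2, 2), switching_group(2, 2, 2))
```

Note that every rule except the CFS thresholds forces each participant into one of two groups, which is the property the continuous analyses are meant to avoid.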
Both accuracy (number of correct answers) and reaction times (in milliseconds) were analyzed for congruent and incongruent trials. Furthermore, a Simon effect score was calculated for accuracy (difference between congruent and incongruent trials) and reaction times (difference between incongruent and congruent trials), where larger scores represent a larger Simon effect. Given that event-related potentials show that focusing spatial attention on a stimulus and preparing motor action occurs around the time period of the N2-wave (Luck, 2012), answers that occurred within 200 ms were too quick for the participant to have had time to process the stimulus and were considered as mistakes. Trials where the participants did not answer within the time limit of 2000 ms were also considered as mistakes. Only correct answers were included in reaction time analysis.
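The scoring rules in this paragraph can be summarized with a short Python sketch. It assumes a hypothetical per-trial record with the keys `congruent`, `correct`, and `rt` (in ms, or `None` when no answer was given), and it treats "within 200 ms" as strictly below 200 ms, which is an interpretation rather than something the source specifies.

```python
from statistics import mean


def is_scored_correct(trial, min_rt=200, max_rt=2000):
    """Apply the exclusion rules: timeouts and anticipations count as mistakes."""
    rt = trial["rt"]
    if rt is None or rt > max_rt:  # no answer within the time limit
        return False
    if rt < min_rt:  # assumption: "within 200 ms" means rt < 200
        return False
    return bool(trial["correct"])


def score_participant(trials, min_rt=200, max_rt=2000):
    """Return accuracy counts, mean correct RTs, and both Simon effect scores."""
    out = {}
    for label, flag in (("congruent", True), ("incongruent", False)):
        subset = [t for t in trials if t["congruent"] == flag]
        hits = [t for t in subset if is_scored_correct(t, min_rt, max_rt)]
        out["acc_" + label] = len(hits)
        # Only correct answers enter the reaction time average.
        out["rt_" + label] = mean(t["rt"] for t in hits) if hits else None
    # Larger values mean a larger Simon effect for both measures.
    out["simon_acc"] = out["acc_congruent"] - out["acc_incongruent"]
    if out["rt_congruent"] is not None and out["rt_incongruent"] is not None:
        out["simon_rt"] = out["rt_incongruent"] - out["rt_congruent"]
    else:
        out["simon_rt"] = None
    return out


example = [
    {"congruent": True, "correct": True, "rt": 400},
    {"congruent": True, "correct": True, "rt": 150},   # too fast, scored as a mistake
    {"congruent": False, "correct": True, "rt": 500},
    {"congruent": False, "correct": False, "rt": 450},
    {"congruent": False, "correct": True, "rt": None},  # timeout, scored as a mistake
]
scores = score_participant(example)
print(scores)
```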
The different dependent variables described above were tested in individual analyses with each predictor (CFS, HUP, LSU, age of acquisition, and frequency of code switching). Thus, for each independent variable, accuracy for congruent trials, accuracy for incongruent trials, reaction times for congruent trials, reaction times for incongruent trials, the Simon effect based on accuracy, and the Simon effect based on reaction times were analyzed. For the dichotomous independent variables (where two groups were created), mixed-model analyses of variance were performed for the accuracy measurements (congruent and incongruent trials), as well as for the reaction times (congruent and incongruent trials). For increased readability, the main effects of condition (congruent or incongruent), which were not the principal interest of this study, are reported in supplementary materials only. Furthermore, t tests for independent samples were performed for each of the Simon effect measurements (based on accuracy and based on reaction times). In addition, for the categorical variables, t tests were performed on age, Mann-Whitney U tests were performed on education level, and chi-square tests for independence were performed on gender to check whether the groups differed on these variables. As for the continuous independent variables, simple linear regression analyses were conducted. The analyses were performed using JASP version 0.10.2. A summary of the means, standard deviations, p values, and effect sizes for the different variables, groups, and conditions for all analyses is presented in Table 4 at the end of the Results section.
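For readers less familiar with the continuous analyses, the following Python sketch shows what a simple linear regression with one predictor computes. The analyses in the study were run in JASP; this standard-library version is only an illustration, the function name is hypothetical, the toy data are invented for the example, and the p value is omitted because it requires the F distribution.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares with one predictor; returns slope, intercept, R^2, and F."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    df_residual = n - 2
    # With a single predictor, F(1, n - 2) = R^2 * (n - 2) / (1 - R^2).
    f_value = float("inf") if r_squared == 1 else r_squared * df_residual / (1 - r_squared)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "f": f_value,
        "df": (1, df_residual),
    }


# Toy data invented for the example (not from the study).
fit = simple_linear_regression([0, 1, 2, 3, 4], [1, 3, 2, 5, 4])
print(fit)
```

In this design, the predictor would be a bilingualism score (e.g., CFS) and the outcome a per-participant accuracy, reaction time, or Simon effect score; a positive slope for reaction time means that higher scores go with slower responses.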

Dichotomous independent variables

CFS
The main effect of group for accuracy was not significant (monolinguals: M = 44.5, SD = 3.1; bilinguals: M = 44.9, SD = 3.3; F < 1), and neither was the interaction between group and condition (congruent, incongruent: F < 1). As for the reaction times, there was a significant difference between the groups where monolinguals (M = 449 ms, SD = 77) were faster than the bilinguals (M = 484 ms, SD = 90), F(1, 98) = 4.33, p = .04, η² = .04. The interaction between group and condition was not significant for reaction times either (F < 1). None of the analyses for the Simon effect (accuracy or reaction times) were significant (both ts < 1). As for background variables, the t test for age showed that the monolingual group (M = 40.1, SD = 11.4) was significantly older than the bilingual group (M = 33.8, SD = 8.8), t(98) = 2.98, p = .004, d = 0.61. In addition, although both groups had a median of 4 for education, the bilingual group (M = 4.2, SD = 0.99) had a higher level of education than the monolingual group (M = 3.5, SD = 0.89), U = 728, p < .001. The chi-square for gender was not significant, χ²(1, n = 100) = 0.43, p = .51.
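As a rough consistency check on the reported effect size for age, Cohen's d and the corresponding t statistic can be recomputed from the summary statistics above. The sketch below assumes the pooled-standard-deviation version of d and a Student's t test with equal variances; the source does not state which variant was used, and because the inputs are rounded, small deviations from the reported values are expected.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d with a pooled standard deviation (an assumed variant)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


def t_from_d(d, n1, n2):
    """Independent-samples t implied by d under equal variances."""
    return d / sqrt(1 / n1 + 1 / n2)


# Age in the CFS groups: monolinguals (n = 60) vs. bilinguals (n = 40).
d_age = cohens_d(40.1, 11.4, 60, 33.8, 8.8, 40)
t_age = t_from_d(d_age, 60, 40)
print(round(d_age, 2), round(t_age, 2))
```

The recomputed values (about 0.60 and 2.96) are close to the reported d = 0.61 and t(98) = 2.98, which is what one would expect when starting from rounded means and standard deviations.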

HUP
The main effect of group for accuracy was not significant (monolinguals: M = 44.8, SD = 2.8; bilinguals: M = 44.8, SD = 3.5; F < 1), and neither was the interaction between group and condition (congruent or incongruent: F < 1). Neither the main effect of group for reaction times (monolinguals: M = 457 ms, SD = 75; bilinguals: M = 475 ms, SD = 81; F < 1) nor the interaction between group and condition was significant (congruent or incongruent: F < 1). Further, none of the t tests for the Simon effect (accuracy or reaction times) were significant (both ts < 1).

Note: Acc., accuracy. RT, reaction times (in ms). CFS, Composite Factor Score. HUP, Nonnative Home Use and Proficiency score. LSU, Nonnative Language Social Use score. AoA, age of acquisition of the second language. CS, frequency of code switching. Values in bold indicate a significant difference between the groups (dichotomous) or a significant (or approaching significance) predictor (continuous). *p < .05. **p < .01. †approaching significance.

Here again, we took a closer look at the background variables. Age was significantly lower in the early bilinguals group (M = 32.3, SD = 8.1) than in the late bilinguals group (M = 39.6, SD = 11.6), t (114) = 3.56, p < .001, d = 0.7. Education level (both groups with a median of 4) and gender did not differ significantly.

Continuous independent variables

CFS
The model predicting reaction times on the congruent trials was significant, F(1, 156) = 4.62, p = .03, R² = .03, where the higher the CFS was, the longer the reaction times were. As for the model predicting reaction times on the incongruent trials, it approached significance, F(1, 156) = 3.43, p = .066, R² = .02, where the higher the CFS was, the longer the reaction times tended to be. The models for accuracy on the congruent and incongruent trials, as well as the models for the Simon effects (accuracy or reaction times), were not significant (all Fs < 1).

LSU
The model for the reaction times for congruent trials was significant, F(1, 156) = 4.69, p = .032, R² = .03, but only approached significance for reaction times for incongruent trials, F(1, 156) = 3.47, p = .064, R² = .02. In both cases, the higher the LSU score, the longer the reaction times were. The models for accuracy for congruent and incongruent trials, as well as the models for the Simon effects (accuracy or reaction times), were all nonsignificant (all Fs < 1).

Age of acquisition
The model for reaction times for congruent trials was significant, F(1, 114) = 4.41, p = .04, R² = .04, and the model for reaction times for incongruent trials, F(1, 114) = 3.27, p = .073, R² = .03, approached significance, where a higher age of acquisition of a second language led (or tended to lead) to slower reaction times on the Simon task. The models for both types of accuracy and for both types of Simon effects (accuracy and reaction times) were not significant (all Fs < 1).

Code switching
None of the models with frequency of code switching as the predictor were significant (all Fs < 1).

Discussion
In this study, we operationalized bilingualism in different ways and analyzed performance on a Simon task (where a bilingual advantage has at times been found in previous studies) based on various possible independent variables in order to investigate whether operationalizing bilingualism in different ways can affect the results. All predictors were used both as dichotomous variables and on a continuous scale in order to investigate whether dividing participants into distinct groups or treating bilingualism as a continuous variable would affect the results.
Our results showed that one of the predictors (the CFS) was almost consistently associated with slower reaction times for bilinguals. Namely, the more bilingual participants were according to the CFS, the slower they were on congruent trials on the Simon task, and there was a tendency toward a significant effect for the incongruent trials as well. When participants were divided into groups of monolinguals and bilinguals based on the CFS, the bilingual group was significantly slower.
Of interest, while a bilingual advantage on accuracy was found when the groups were based on the LSU score, this effect was not found when accuracy was predicted by the LSU scores on a continuous scale. Furthermore, when the LSU was used as a dichotomous variable, monolinguals were significantly faster than bilinguals for both congruent and incongruent trials. However, when LSU was treated as a continuous variable, slower reaction times were predicted by bilingualism for congruent trials only (the model only approached significance for the incongruent trials). This suggests that, in our sample, the more bilingual participants were based on the LSU score, the slower they responded to congruent trials. However, this effect was not found for incongruent trials.
Finally, while age of acquisition of the second language did not show effects on accuracy or reaction times when used dichotomously (early vs. late bilinguals), age of acquisition did significantly predict reaction times for congruent trials when bilingualism was measured on a continuum. Furthermore, the prediction of age of acquisition on incongruent trials approached significance. As for the HUP score and code switching, they did not predict any differences regardless of how they were used.
Notably, variation in the Simon effect was not predicted by any of the variables, whether it was measured based on accuracy or on reaction times. However, none of the effect sizes found in the significant results were large, most of them being small or halfway toward medium in size. This implies that the results in this study should be interpreted with caution. It is also worth mentioning that, although we label slower reaction times as being a disadvantage, as we pointed out earlier, these slower reaction times may actually reflect a speed-accuracy trade-off as accuracy was higher for the bilinguals. It could be that the bilinguals use different cognitive strategies when performing the task. As Gullifer and Titone (in press) suggest, it could be that the bilinguals use active goal maintenance to a higher degree than monolinguals do when managing conflict, which could manifest itself behaviorally in terms of slower reaction times. This is not necessarily a disadvantage per se, however, especially when accuracy is improved.
Nonetheless, the results of this study demonstrate clearly that operationalization can have an effect on the results. For instance, while we found a bilingual advantage in accuracy and a disadvantage in speed when bilingualism was operationalized dichotomously for the LSU, the disadvantage only appeared for congruent (and not incongruent) trials when the LSU was used as a continuous variable. Even more interesting, while no effects whatsoever were detected when participants were divided into early and late bilinguals, an effect on reaction times for congruent trials was found when age of acquisition was used on a continuum. A possible explanation is that the effect of LSU was found for accuracy when groups were used only due to a Type I error, since dividing a continuous variable into categories increases the risk of a Type I error (Cohen, 1983). As for the effect of LSU on reaction times, it could also be that the effect was driven by the congruent trials only, although no significant interaction effects were found. In contrast, the advantage for bilinguals having acquired their second language earlier (when using age of acquisition as a continuum on congruent trials) disappeared completely when participants were divided into early and late bilinguals. This difference may reflect the more subtle and fine-grained differences that can be found when using a variable on a continuum. Dividing participants into groups based on a continuous variable may lead to as much, if not more, variability within the group as between the groups.
The measurement that was arguably the most consistent was the CFS, where no effects were found on accuracy regardless of whether the variable was used dichotomously or continuously, but where significant effects were found for reaction times regardless of how the variable was used. This consistency may be due to several reasons. One possibility is that the CFS covers several facets of bilingualism, thus making it a robust measurement including many of the aspects of bilingualism that are of importance for performance on a Simon task. In contrast, measurements such as the LSU cover social aspects of language use only, regardless of the participant's reported language skills. This could be a possible explanation as to why a more complete measurement such as the CFS yielded more stable results than, for instance, the LSU. Another possibility is that the groups that are created based on the recommendations of the LSBQ exclude participants in a "gray zone" between monolingualism and bilingualism, thus creating groups with clear boundaries and decreasing the variability within those groups. At the same time, when used continuously, the CFS includes the "gray zone" participants and the advantage of using the variable as a continuum to detect fine-grained effects is preserved.
Another interesting aspect of this study is that, when using a median split on the LSU and HUP (which is a practice that is not necessarily optimal but that is nonetheless used), arbitrary cutoff limits were created. Thus, a bilingual in our sample may have been a monolingual in a different sample since the median will inevitably vary across samples. Here, this became even clearer as some participants ended up being bilingual according to either the LSU or the HUP, but monolingual according to the other factor. Twenty-four participants were categorized as monolingual on the LSU and bilingual on the HUP, and an additional 24 participants were categorized as bilingual on the LSU and monolingual on the HUP. None of the LSU-monolinguals were categorized as bilinguals on the CFS (21 were excluded, 3 were monolinguals). However, 2 of the HUP-monolinguals were categorized as bilinguals according to the CFS (20 were excluded, 2 were monolinguals), which is clearly problematic. These discrepancies not only demonstrate the operationalization issue well, and why creating categories based on arbitrary cutoffs such as a median split makes comparison across studies problematic, but also illustrate the complexity of the bilingual experience when it is measured based on different facets, as well as how important each and every one of these facets is.
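The sample dependence of a median split can be illustrated with a minimal sketch. The scores below are hypothetical values invented for illustration (they are not LSU or HUP data from this study), and the function name `median_split` is our own. The same score is labeled differently depending on the sample in which it occurs.

```python
from statistics import median

def median_split(scores):
    """Label each score 'bilingual' if it lies above the sample median,
    otherwise 'monolingual' (hypothetical labeling rule for illustration)."""
    cutoff = median(scores)
    return ["bilingual" if s > cutoff else "monolingual" for s in scores]

# Hypothetical questionnaire scores. The first participant (score 5.0)
# is placed in two different samples.
sample_a = [5.0, 1.0, 2.0, 3.0, 4.0]   # median = 3.0
sample_b = [5.0, 6.0, 7.0, 8.0, 9.0]   # median = 7.0

label_in_a = median_split(sample_a)[0]
label_in_b = median_split(sample_b)[0]

print(label_in_a)  # label of the participant in sample A
print(label_in_b)  # label of the same participant in sample B
```

Because the cutoff is recomputed from each sample, the classification is a property of the sample and not only of the participant, which is the concern raised above about comparing median-split groups across studies.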
An important note to make is that, according to the CFS values (computed based on the LSBQ), our sample had very few "true" bilinguals, namely, participants who had a score above the suggested cutoff value for bilingualism (i.e., several participants fell in between the monolingual and bilingual categories or in the "gray zone"). Even if the LSBQ is a validated tool and although the CFS was computed using the calculator provided by Anderson et al. (2018), the values in Anderson et al. are from a North American population and the questionnaire has been validated in this population only. It may be that the different items in the questionnaire would load differently on the factors identified by Anderson et al. if they were validated in another population. Although it is beyond the scope of this paper to validate the LSBQ in a different population, we cannot reject the possibility that the LSBQ did not measure the language profile of our population as accurately as for the population used in Anderson et al. However, a large portion of our sample was recruited via MechanicalTurk, where a majority of users are also in North America, and our sample shared several similarities with Anderson et al.'s sample. The variety of second languages spoken by our participants was similar, as was the level of education and the gender distribution, even though our sample was slightly older. Thus, it is probable that our sample is similar to the sample used in the LSBQ validation study. However, since the average CFS, LSU, and HUP are not provided in Anderson et al. (2018), it is difficult to compare the samples directly, and the possibility remains that our sample was significantly different in terms of bilingual experience.
In addition, our participants were recruited exclusively online via crowdsourcing platforms (MechanicalTurk and Prolific). Although samples recruited via such services appear to be as reliable as other more traditional samples and have the advantage of including participants outside of the college and university student population, it may be that a sample recruited in a different context would behave differently. Hauser and Schwarz (2016) showed that MTurkers (participants recruited via MechanicalTurk) pay better attention than non-MTurkers when performing cognitive tests. This could explain at least in part why the accuracy rates in our study were so high. Furthermore, in particular because accuracy rates in our study were so high, the results should be interpreted with caution. The use of statistical analyses such as analysis of variance can be problematic when a ceiling effect is present and can lead to Type I errors (Šimkovic & Träuble, 2019).
Furthermore, there were some demographic differences between the groups that were created, and we cannot rule out that they may have affected the significance, or nonsignificance, of the results. Namely, CFS- and HUP-based bilinguals were younger and reported higher levels of education, but responded more slowly (but more accurately) when grouped based on the CFS. Based on the LSU, monolinguals reported lower levels of education and were predominantly male. Thus, the better accuracy of the bilingual group could be an effect of level of education. As for early bilinguals, they were also significantly younger, but then again, no effect was found between the groups. Only the nonswitchers versus switchers did not differ when it came to age, education, and gender, and no differences were found in terms of performance between those groups.
However, we would like to stress that the primary purpose of the current study was not to add yet another data set to the already ongoing debate of the existence of a bilingual advantage in executive functions. Rather, the main purpose was to explore whether operationalizing bilingualism in different ways could lead to different patterns of results, which is what we found. However, although this study highlights the fact that the specific way bilingualism is operationalized can affect the results and thus the conclusions that are drawn, it is not comprehensive enough to solve this methodological concern. First, the participants of the current study consisted of adults, while the bilingual advantage in executive functions is more consistently found in children (e.g., Bialystok & Viswanathan, 2009; Janus & Bialystok, 2018; Kovács, 2009; Yow et al., 2017) or elderly populations (e.g., Bialystok, Poarch, Luo, & Craik, 2014; Borsa et al., 2018; Cox et al., 2016; Gold, Kim, Johnson, Kryscio, & Smith, 2013). Replicating the current results with additional age groups such as children or elderly populations might therefore lead to further insight in this matter, as will replicating the current study but incorporating additional tasks, particularly tasks that recruit other types of executive functions such as switch cost.
Second, there are other facets of bilingualism that we did not measure in this study. For instance, there are other questionnaires available for measuring bilingualism both for adults (e.g., the Language Experience and Proficiency Questionnaire; Marian, Blumenfeld, & Kaushanskaya, 2007) and children (e.g., the Bilingual Language Experience Calculator; Unsworth, 2013), and other facets such as L2 proficiency or frequency of use of the L2 that were not explored in the current study (although they are included as part of the different LSBQ scores). In addition, there are tests beyond self-reports that may more objectively measure knowledge of a language, or at least some aspects of it such as receptive vocabulary (e.g., LexTALE: Lemhöfer & Broersma, 2012; Peabody Picture Vocabulary Test: Dunn, 2018). As we did not look at the complexity of the social aspect of bilingualism in this study, it should be more thoroughly investigated in future studies as there is growing evidence that the context of use of a bilingual's languages may be a main factor affecting executive functions (de Bruin, 2019; Tiv, Gullifer, Feng, & Titone, in press). It has even been suggested that the potential advantage may emerge from the specific social linguistic context that a bilingual interacts in (e.g., Fan, Liberman, Keysar, & Kinzler, 2015; Hartanto & Yang, 2016; Wu & Thierry, 2013), and investigating the social aspect of bilingualism in addition to linguistic aspects of bilingualism could shed light on the debate and on how to best operationalize bilingualism.
Third, although there is a movement toward treating bilingualism as a continuous variable that is endorsed by many (e.g., Champoux-Larsson & Dylman, 2019; DeLuca et al., 2019; Edwards, 2012; Gullifer et al., 2018; Gullifer & Titone, 2020; Incera & McLennan, 2018; Jylkkä et al., 2017; Kaushanskaya & Prior, 2015; Luk & Bialystok, 2013; Sulpizio et al., 2020; Surrain & Luk, 2019), as Kremin and Byers-Heinlein (2020) point out, there are situations where this may not be possible or preferable (e.g., with small samples). Therefore, giving up a dichotomous classification of bilingualism altogether may not be the answer when it comes to operationalizing bilingualism (Kremin & Byers-Heinlein, 2020). Using a factor mixture model or a grade-of-membership model, which allow for categorization but consider the variation within the categories, would allow analyzing results based on categories, on a continuous scale, or even on both (Kremin & Byers-Heinlein, 2020). The flexibility that such models provide would allow choosing an approach that would be driven by the research question, by the latent concept that is analyzed, and by the sample. Given the complexity of bilingualism, we believe that such a flexible approach would be appropriate to capture the different facets and nuances of the concept. Yet, using such models may not always be possible. Based on the recommendations of MacCallum et al. (2002) concerning the dichotomization of continuous variables, bilingualism should, when possible, be operationalized as a continuous variable given that its nature is continuous. Dichotomizing a continuous variable will almost inevitably lead to misleading results (MacCallum et al., 2002), and serious consideration should be given before dichotomizing bilingualism. This choice should be supported for instance by inspecting the data in order to determine whether participants cluster into defined clusters, or by providing a clear theoretical and conceptual argument for doing so.
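The information lost through dichotomization can be made concrete with a small simulation. The following is only a sketch on simulated data, not a reanalysis of the present data set: the variable names, the effect size of 0.3, the sample size, and the random seed are all assumptions chosen for illustration. It compares the correlation between an outcome and a continuous predictor with the correlation obtained after a median split of that same predictor.

```python
import random
from statistics import mean, median

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)          # fixed seed (assumption) for reproducibility
n = 20000                       # hypothetical number of simulated participants

# Simulated continuous "bilingualism score" and an outcome that depends on it
# linearly (slope 0.3, an arbitrary illustrative value) plus noise.
score = [rng.gauss(0, 1) for _ in range(n)]
outcome = [0.3 * s + rng.gauss(0, 1) for s in score]

# Median split of the same predictor: 1.0 = above median, 0.0 = at or below.
cutoff = median(score)
group = [1.0 if s > cutoff else 0.0 for s in score]

r_continuous = pearson_r(score, outcome)
r_median_split = pearson_r(group, outcome)

print(f"continuous predictor:   r = {r_continuous:.3f}")
print(f"median-split predictor: r = {r_median_split:.3f}")
```

For a normally distributed predictor, a median split is expected to shrink the correlation to roughly 80% of its continuous value, so the second coefficient should be noticeably smaller than the first. This illustrates only the loss of information; it does not address the other problems of dichotomization discussed by MacCallum et al. (2002).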
However, our goal here was to investigate whether operationalization can affect results, making them significant in some cases but not in others when operationalization is slightly modified. In other words, our aim was not to establish a universal or optimal way to operationalize and define bilingualism, but rather to highlight the fact that defining and measuring bilingualism in different ways can lead to different conclusions. This supports the argument that a thorough and clear description of the sample of bilinguals that is investigated is necessary if comparisons are to be made across studies (Surrain & Luk, 2019). We suggest that our results illustrate the complexity and difficulty of measuring a latent construct such as bilingualism, and that future research should take the different nuances of the concept into account. Therefore, researchers should carefully support their operationalization choices based on clearly defined theoretical models and empirical results rather than on arbitrary cutoffs or measurements. We suggest that, as long as it is not established which aspects of bilingualism or of the bilingual experience affect specific cognitive mechanisms, caution should be used when choosing not to measure a specific characteristic. For the time being, it may still be more methodologically sound to use questionnaires that are detailed and extensive (such as the LSBQ), and refrain from methods that are known to be flawed (such as using a median split).
It is important to point out, however, that several of the experimental paradigms used to reveal an advantage in executive functions in bilinguals (such as the Simon task) are not necessarily the most reliable tools to draw conclusions on between-individual differences. While low variability between participants is necessary in order to reliably find group-level effects, which experimental paradigms are designed to do, high between-subjects variability is necessary in order to reliably detect individual differences (Hedge, Powell, & Sumner, 2018), which is often what we aim to do in bilingual research. This suggests that caution should be taken when interpreting results to highlight differences between monolinguals and bilinguals (or between different types of bilinguals) and that methodological changes are necessary when designing tasks and analyzing results if such paradigms are used (see Hedge et al., 2018, for recommendations).
In sum, although the current study cannot settle how bilingualism is best measured, nor which facets of it are the most relevant, it achieves its aim of showing that operationalizing bilingualism in different ways (specifically, using different facets of bilingualism, and whether bilingualism is measured as a continuous or dichotomous variable) does affect the results. Here, we showed that dividing participants into groups instead of using their scores continuously tipped the value from approaching significance or not being significant to showing significant differences between monolinguals and bilinguals. Furthermore, our results also show that differences can (or cannot) be found depending on which aspect of the behavioral data is analyzed (i.e., accuracy vs. reaction times on congruent vs. incongruent trials, or on a score such as the Simon effect). This illustrates quite well how these results, had they come from different samples and been published as independent studies, would have led to inconsistent conclusions where a bilingual advantage in accuracy would have been established in one study, but not in another, and where a bilingual disadvantage in speed would even have been supported by a third one. At the same time, there are several consistencies across the measurements and the dependent variables as well, suggesting that consistency across studies, despite using different types of operationalization, is not impossible. While this specific methodological issue is unlikely to be the sole source of all disparate results on the bilingual advantage in executive functions, it is possibly a contributing factor. Our results therefore stress the need for more research investigating operationalization and measurement of bilingualism.
Future research should investigate which facets of bilingualism correlate together, which lead to similar neurological changes and effects on tasks tapping into executive functions, and which facets can and cannot be compared directly with each other. This study also highlights the need to clearly and transparently report how bilingualism is operationalized in studies investigating bilinguals in order for informed comparisons and meta-analyses to be possible. We believe that the current study contributes to the efforts toward more rigorous and standardised methodologies, and encourages increased openness, awareness, and transparency in the field of bilingual research.