A path to the bilingual advantage: Pairwise matching of individuals

Abstract
Matching participants (as suggested by Hope, 2015) may be one promising option for research on a potential bilingual advantage in executive functions (EF). In this study we first compared performances in three EF-tasks of a naturally heterogeneous sample of monolingual (n = 69, age = 9.0 y) and multilingual children (n = 57, age = 9.3 y). Secondly, we meticulously matched participants pairwise to obtain two highly homogeneous groups to rerun our analysis and investigate a potential bilingual advantage. The initially disadvantaged multilinguals (regarding socioeconomic status and German lexicon size) performed worse in updating and response inhibition, but similarly in interference inhibition. This indicates that superior EF compensate for the detrimental effects of the background variables. After matching children pairwise on age, gender, intelligence, socioeconomic status and German lexicon size, performances became similar except for interference inhibition. Here, an advantage for multilinguals in the form of globally reduced reaction times emerged, indicating a bilingual executive processing advantage.


Introduction
Probably the most hotly debated issue in bilingualism research to date is the possible enhancement of executive functions (EF) through bilingualism (Bialystok, 2015; Paap, Johnson & Sawi, 2015). From the late 1990s onwards, many studies reported an advantage of bilinguals over monolinguals in EF tasks (for meta-analyses, see Adesope, Lavin, Thompson & Ungerleider, 2010; de Bruin, Treccani & Della Sala, 2014), but recent findings cast doubt on the existence of the BILINGUAL ADVANTAGE (for more detail, see Paap et al., 2015). With this study we pursue two goals. First, we compare the EF performance of mono- and multilinguals taken from a large, naturally heterogeneous sample of primary school children. By accepting the multilinguals' disadvantageous position in terms of socioeconomic status (SES) and lexicon size, we can explore the children's abilities as they can be observed in the German school context. Secondly, we try to illustrate the importance of meticulous pairwise matching of participants when comparing multilinguals with monolinguals. Only by pairwise matching can we control for many potentially influential or confounding variables by creating experimental groups that are comparable both on a group and on an individual level.

The bilingual advantage in EF
One explanation for the EF advantage of bilinguals is the continuous, parallel activation of their languages: The inactive language needs to be inhibited to avoid intrusion errors, while the language currently in use needs to be activated by directing attention towards it (Green, 1998; Kroll, Dussias, Bogulski & Kroff, 2012). This constant language control puts high demands on the EF system and therefore constitutes a form of training, which is suggested to yield the bilingual advantage in EF (Bialystok, 2015). Since the development of EF starts in early childhood, investigating the bilingual advantage in primary school children is particularly interesting, as their EF still undergo refinement in terms of processing speed and accuracy (Best & Miller, 2010). Some reviews and meta-analyses support the existence of the bilingual advantage in EF, despite significant variability in the data (Adesope et al., 2010; Barac & Bialystok, 2011; de Bruin et al., 2014; Hilchey & Klein, 2011). EF comprise a family of cognitive control mechanisms with three core functions: inhibition, working memory or updating, and shifting (Diamond, 2013; Miyake & Friedman, 2012). The strongest evidence for the bilingual advantage exists for one subcomponent of inhibition, namely interference inhibition, but not for response inhibition. Tasks assessing response inhibition require control to react to a target stimulus but to withhold a reaction upon presentation of a non-target stimulus. There is hardly any evidence for a bilingual advantage for this EF component (Esposito, Baker-Ward & Mueller, 2013; Martin-Rhee & Bialystok, 2008; but see Bialystok, Barac, Blaye & Poulin-Dubois, 2010).
Bilinguals outperform monolinguals in interference inhibition tasks, since these probably involve the same control mechanism as in bilingual language control (as described above). Besides simple, congruent conditions, they comprise incongruent conditions with additional inhibitory demands. In these conditions, conflicting stimulus information requires focusing on the relevant dimension while inhibiting interference from the irrelevant dimension. Experimental paradigms supporting the bilingual advantage are, for example, the Simon task (Martin-Rhee & Bialystok, 2008; Poarch & van Hell, 2012), Stroop task (Costa, Fuentes & Vivas, 2010) and Flanker task (Costa, Hernandez, Costa-Faidella & Sebastián-Gallés, 2009).
Depending on the result pattern in interference inhibition tasks, Hilchey, Saint-Aubin and Klein (2015) assume either a BILINGUAL INHIBITORY CONTROL ADVANTAGE (BICA) or a BILINGUAL EXECUTIVE PROCESSING ADVANTAGE (BEPA). A superior performance in conditions with additional inhibitory demands (measured by the interference effect, the difference between incongruent and congruent conditions) indicates a BICA. However, since many studies report globally faster reaction times (RT), Hilchey et al. (2015) consider BEPA the predominant finding in research on the bilingual advantage. In line with this, Bialystok (2017) also supports the shift from an advantage solely in inhibition to a more general advantage in EXECUTIVE ATTENTION. This account additionally involves monitoring of attention to detect potential conflicts, which plays an important role in the aforementioned interference inhibition tasks.
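The distinction between the two measures can be illustrated with a minimal sketch. The function name, the data layout and the example RTs below are hypothetical and serve only to show how a global RT and an interference effect are derived from the same trials.

```python
from statistics import mean

def summarize_rts(trials):
    """Summarize one participant's trials, given as (condition, rt_ms) tuples.

    Returns the global mean RT (relevant for a BEPA) and the interference
    effect, i.e. incongruent minus congruent mean RT (relevant for a BICA).
    """
    by_condition = {}
    for condition, rt in trials:
        by_condition.setdefault(condition, []).append(rt)
    congruent = mean(by_condition["congruent"])
    incongruent = mean(by_condition["incongruent"])
    return {
        "global_rt": mean(rt for _, rt in trials),
        "interference_effect": incongruent - congruent,
    }

# Hypothetical RTs in milliseconds, not data from this study.
example = [
    ("congruent", 600), ("congruent", 640),
    ("incongruent", 700), ("incongruent", 740),
]
summary = summarize_rts(example)
```

A group that is faster in both conditions by the same amount would show a lower global RT but an unchanged interference effect, which is the BEPA pattern.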

Doubt about the bilingual advantage
Some researchers argue that the bilingual advantage is "restricted to very specific and undetermined circumstances" (Paap et al., 2015, p. 1). Doubt stems from an initial publication bias for positive results (de Bruin et al., 2014) and the fact that after 2011, more than 80% of the published studies reported null results (Paap et al., 2015). Moreover, studies with large-scale designs and carefully matched groups could not replicate seminal findings (Duñabeitia, Hernández, Antón, Macizo, Estévez, Fuentes & Carreiras, 2014; Paap, Johnson & Sawi, 2014), and the bilingual advantage was detected especially in studies with small sample sizes (n < 30), which increases the chance of false positive results (Paap et al., 2015). However, Hope (2015) suggested, "by reducing samples to ensure that groups are more comparable, (…) smaller studies can be more informative than larger studies, if they are better controlled" (p. 59).
The above-mentioned null results can either indicate the nonexistence of a bilingual advantage or result from experimental manipulations. For example, participants' age seems to affect the outcome, since behavioral effects are predominantly found in children and older adults but not in young adults (Antón, Duñabeitia, Estévez, Hernández, Castillo, Fuentes, Davidson & Carreiras, 2014). The reason is probably that young adults are at the height of their cognitive performance and bilingualism does not constitute an additional boost to EF (Bialystok, Martin & Viswanathan, 2005). Moreover, task demands affect behavioral results, because only conditions with high cognitive demands reveal a bilingual advantage (Costa et al., 2009; Martin-Rhee & Bialystok, 2008). This also applies to EF training studies in general, since intervention and control groups usually differ in the cognitively most demanding conditions (Diamond, 2014).
Another major concern is the uncontrolled influence of confounding factors. SES is at the center of this discussion as it correlates with EF (Carlson, Zelazo & Faja, 2013). Consequently, SES needs to be controlled for when comparing multilinguals with monolinguals to avoid skewed results (Calvo & Bialystok, 2014; Morton & Harper, 2007). In certain contexts, multilingualism is even accompanied by lower SES. This is the case in Germany, where multilingual children more often come from families with a lower SES, referring to both income and educational level (Bildungsberichterstattung, 2014; for other settings, see Engel de Abreu, Cruz-Santos, Tourinho, Martin & Bialystok, 2012; Paap & Greenberg, 2013). In Germany, multilingualism also frequently coincides with migration background (Leerhoff, Rehkämper, Rockmann, Brunner, Gärtner & Wendt, 2013), but if and how migration status interacts with EF remains unclear, as empirical evidence is scarce. Another factor often differing between mono- and multilinguals is the size of the mental lexicon, which is smaller for bilinguals when comparing each language separately (e.g., Bialystok & Luk, 2012). Moreover, multilingual children residing in Germany (especially those with a migration status) show on average lower language proficiency in German (Niklas, Schmiedeler, Pröstler & Schneider, 2011). Therefore, test conditions including long and complex verbal instructions and tasks based on verbal stimuli or culture-specific knowledge may put multilingual children at a disadvantage (Hagmann-Von Arx, Petermann & Grob, 2013).

Pairwise matching
Friedman (2016, p. 2) stated as follows: "This bilingual advantage hypothesis seems straightforward to test: Obtain a sample of bilinguals and appropriately matched controls and test whether they differ on a measure of EF". However, finding the 'right' bilingual sample for a study poses a problem, since multilingualism is itself a complex and continuous variable comprising aspects like language dominance and proficiency, age of acquisition (AoA) etc. (Luk & Bialystok, 2013). Matching participants poses a great challenge, too, because numerous characteristics of an individual potentially influence EF (Diamond, 2013), and multilingualism is just one of them. In many studies participants were MATCHED ON A GROUP-LEVEL. This was done either by checking that mono- and multilingual groups do not differ significantly on a certain number of background variables (e.g., Bialystok & Viswanathan, 2009; Kousaie, Laliberté, López Zunini & Taler, 2015; Scaltritti, Peressotti & Miozzo, 2017) or by excluding individuals of a larger sample to reach comparable groups (Bak, Long, Vega-Mendoza & Sorace, 2016; Stocco & Prat, 2014). This matching method at the group-level is used in studies examining small sample sizes with n < 30 (e.g., Kousaie et al., 2015; Morton & Harper, 2007), as well as large samples with n > 100 (e.g., Duñabeitia et al., 2014). This approach is always applied at a group or grade level, but it ignores the complex set of characteristics on an individual level. In this paper, we demonstrate how to overcome this problem by MATCHING INDIVIDUAL PAIRS of participants on a number of background variables.

Research aim
The discussion on the existence of the bilingual advantage in EF is still lively, and we approached this problem by studying a natural, unselected group of mono- and multilingual children from Germany. Our goal was twofold. First, we compared the children's performance in EF to investigate their abilities in a natural setting. Secondly, we demonstrated how to use pairwise matching on an individual level to investigate the existence of the bilingual advantage in EF. For that purpose, we applied a two-step analysis, i.e., without and with pairwise matching of participants.
We collected data from a larger cohort of third graders in Germany, who were categorized post-hoc as mono- or multilingual (according to reported home language use; for details, see Methods section). In our first analysis without pairwise matching, we compared performance in EF tasks between these large, unmatched groups (further referred to as HETEROGENEOUS GROUPS) accepting all natural group differences in background variables. We hypothesized that if multilingualism does not influence EF, multilingual children should underperform throughout all tasks due to their anticipated disadvantage in the background variables. In a second step, we matched participants pairwise on important influential variables known from the literature (i.e., age, gender, intelligence, lexicon size and SES), creating two HOMOGENEOUS GROUPS and compared their EF-performance. We hypothesized that if a bilingual advantage exists, it should arise in this step, because detrimental effects from confounding variables are eliminated.
Since we have specific hypotheses about the nature of the bilingual advantage, our study comprises three tasks testing the following EF components: interference inhibition, response inhibition and updating. We predict that multilinguals outperform monolinguals in interference but not in response inhibition (Esposito et al., 2013; Martin-Rhee & Bialystok, 2008). This advantage would manifest itself either in lower global RT (indicating a BEPA) or in a lower interference effect (indicating a BICA). Furthermore, we expect an advantage in the more demanding conditions (e.g., of the N-back task measuring updating, or of the interference inhibition task when having to switch between conditions in the mixed block), since only high cognitive demands allow benefits of bilinguals to emerge.

Participants
In total, 168 children who attended third grade in different schools in Germany took part in the study. They were categorized as monolinguals or multilinguals based on their home language use in the following way: the monolingual group (n = 69; 36 female) included children who spoke only German at home and had no further contact with another language. The multilingual group consisted of bilingual (n = 50; 24 female) and trilingual children (n = 7; 3 female). To be included in the multilingual group, the child's family needed to use at least one other language at home besides German, and the child's verbal proficiency in this language had to be at least good (for more detail see section Questionnaire below and Table 1). We will refer to the home language as L1, because for successive multilinguals it is their L1 (their acquisition of German starts only in kindergarten and German is therefore their L2/L3); simultaneous multilinguals have two/three first languages, hence their home language is one of these L1s. We suggest that these selection criteria ensure an active and continuous use of L1, which is probably necessary for the bilingual advantage to develop in this age group. Consequently, we excluded 42 children from further analyses. These comprise three children with an IQ < 70 indicating an intellectual disability (DIMDI, n.d.) and 39 children who could not be assigned to either language group. Exclusion criteria were (a) contact with another language besides German did not occur at home (n = 33), and (b) questionnaires were incomplete (n = 3) or contained conflicting information on the language background (n = 3).
We recruited participants through class teachers who distributed written information about the study among the children's parents. The parents gave written informed consent before the start of the study. All participants had normal or corrected-to-normal vision, were naive with regard to the hypotheses of the study and received small gifts for their participation.

Questionnaires
The parents completed a paper-pencil questionnaire at home (a German and a Turkish version were available) and reported the country of birth of father, mother and child (to determine the migration background). The family's SES was specified by the ISCED (International Standard Classification of Education, UNESCO Institute for Statistics, 2012) for both parents, which reflects the highest school and professional qualification. Parents reported also their children's language background including the languages spoken at home and in kindergarten, the AoA of German, language proficiency and frequency of use at home.
The participating children filled in a child-friendly version of largely the same questionnaire in the first experimental session (for more detail, see below). The children's proficiency in German and in L1 was rated both by themselves and their parents for comprehension and production on a scale of 1 (none) to 4 (very good). The overlap between these ratings was high for German (comprehension: 74% / production: 67%), but lower for L1 (39% / 30%). Faced with this discrepancy between a rather high convergence for ratings of German proficiency but lower convergence for rating L1 verbal abilities, we decided to rely only on the parents' ratings. Regarding the ratings in German, both parents and children might include grades and feedback from teachers influencing their ratings; the ratings in the L1, however, are based on parents' judgements, most likely in comparison to their own language proficiency as speakers of that language and compared to the children's abilities in German. The children, though, may lack this point of reference for comparing their own language abilities in their L1, as they are growing up in a multilingual community rather than with peers of the same L1 only.

EF tasks
Participants sat at a desk in front of a tablet computer in a quiet classroom. All three EF tasks were presented on a Microsoft Surface Pro 2 Tablet with a display size of 25.5 cm x 17 cm and a resolution of 2160 px x 1440 px.

Response inhibition
We administered the Go/Nogo task to measure response inhibition. In this task, we presented two black-and-white line drawings of animals (a goat and a deer from Snodgrass & Vanderwart, 1980), which were visually highly similar, to increase task difficulty. Children were instructed to touch a response bar on the screen with the index finger of their dominant hand when they saw the target picture (Go-condition, e.g., the goat) but to withhold the response in the Nogo-condition (e.g., the deer). Targets were randomly assigned to group sessions and were counterbalanced over participants. We explained the task orally to the group, and pictograms on the tablet illustrated the procedure to ensure that all children understood it independent of their German skills or processing speed. Children completed 10 practice trials with feedback (happy or sad smiley); if fewer than 60% of Go-trials were answered correctly, practice was repeated up to two times. Afterwards participants completed two experimental blocks consisting of 20 trials each (10 Go- and 10 Nogo-trials) without feedback. Items were pseudo-randomized so that no more than two identical items followed each other. Each item (7 cm x 7 cm) appeared in the center of the screen for 2 s, with an inter-stimulus interval of 2 s, allowing a maximum RT of 4 s to meet the requirements of the heterogeneous sample.
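The pseudo-randomization constraint (no more than two identical items in a row) can be sketched as follows. The rejection-sampling approach and all function names are assumptions made for illustration; the text does not specify how the trial orders were actually generated.

```python
import random

def max_run_length(seq):
    """Return the length of the longest run of identical neighbouring items."""
    longest = run = 0
    previous = object()  # sentinel that equals no real item
    for item in seq:
        run = run + 1 if item == previous else 1
        previous = item
        longest = max(longest, run)
    return longest

def pseudo_randomize(items, max_run=2, rng=None):
    """Shuffle items until no item repeats more than max_run times in a row."""
    rng = rng or random.Random()
    order = list(items)
    while True:
        rng.shuffle(order)
        if max_run_length(order) <= max_run:
            return order

# One block as described above: 10 Go- and 10 Nogo-trials.
block = pseudo_randomize(["go"] * 10 + ["nogo"] * 10, rng=random.Random(1))
```

Rejection sampling is simple and keeps every admissible order equally likely, at the cost of an unpredictable number of reshuffles.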

Interference inhibition
To test interference inhibition, we administered an adapted version of the Bivalent Shape Task (BST, Mueller & Esposito, 2014). Children were instructed to sort circles and squares (presented in the center of the screen, Ø 4.2 cm) according to their shape by pressing corresponding buttons on the tablet screen (lower left and right side, Ø 2.1 cm) with their index fingers. The difficulty lies in ignoring the more salient color of the shapes and concentrating on their shape. Depending on the color, the stimuli belonged to three congruency conditions: (a) black outlines in the neutral condition, (b) a red and a blue shape matching the response buttons in color and shape in the congruent condition, and (c) interchanged colors in the incongruent condition, so that the shape but not the color matched the response buttons. Position (right, left) and color (blue, red) of the response buttons were counterbalanced over participants. The experiment consisted of three uniform blocks in a fixed order (neutral, followed by congruent and incongruent) with 20 randomized trials (each item appeared 10 times). Afterwards children completed one mixed block including all congruency conditions that comprised 30 randomized trials with 10 items of each condition. Items appeared immediately after button press or after 3 s, if no answer was given. Between each block, there was a break, but children were encouraged to complete the experiment as fast as they could.
Instructions and examples were given orally to the whole group. Additionally, children completed a practice block on the tablet including 12 randomized trials (four of each condition) with visual feedback in German ("Richtig!", engl. correct, in green or "Falsch", engl. wrong, in red letters).

Updating
As a measure of updating, the children completed two versions of the N-back task: a 1-back followed by a 2-back task. The stimuli, five letters (A, B, O, R and S; 3.7 cm high), were presented one-by-one in the center of the screen. We asked children to press a response bar with the index finger of their dominant hand when the displayed letter was the same as the one presented one trial (1-back) or two trials (2-back) earlier. Letters were presented for 2 s with an inter-stimulus interval of 2 s without feedback. Both tasks comprised two pseudorandomized blocks including 20 items, of which 40% were critical items and no more than two succeeded one another. The procedure was the same as for the Go/Nogo task regarding instructions and practice (10 trials with visual feedback).
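Which trials count as critical items in an N-back sequence can be made explicit with a short sketch. The function name and the example sequence are hypothetical; only the letter set is taken from the task description.

```python
def nback_targets(letters, n):
    """Mark each position as critical (True) if it repeats the letter n trials back."""
    return [i >= n and letters[i] == letters[i - n] for i in range(len(letters))]

# Hypothetical sequence built from the five task letters (A, B, O, R, S).
sequence = list("ABAOOSRS")
one_back = nback_targets(sequence, 1)
two_back = nback_targets(sequence, 2)
```

In this sequence the only 1-back target is the second "O", whereas the 2-back targets are the second "A" and the final "S".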

Intelligence
We used the second part of the Culture Fair Intelligence Test Scale 1 (CFT 1-R, Weiß & Osterland, 2013), which assesses figural reasoning skills. The sum of correct answers from the short version of the subtests Matrix, Series and Classification served as measure of intelligence. The CFT, a culture-independent and language-free test, should not disadvantage children from other cultural backgrounds or with lower German skills.

Phonological short-term memory
The number of correctly repeated non-words from the ZLT-II (Petermann & Daseking, 2012) served as a measure of short-term memory. The test comprises 30 items, which consist of 2 to 6 syllables presented with increasing length (6 items per length). Items were pre-recorded to standardize testing conditions and consisted of meaningless CV-syllables presented with a neutral prosody (equal stress on each syllable) to avoid German-specific structures (e.g., consonant clusters or prosody).

German lexicon
We used the WWT (short version test 2, Glück, 2011) to assess expressive and receptive lexicon size in German. First, a picture-naming task with 40 items (including nouns, verbs and naming opposites of adjectives) served as a measure of expressive lexicon size. Second, the test for receptive lexicon comprised all items answered incorrectly in the preceding picture-naming task. Items were presented via loudspeaker and the child was instructed to touch the corresponding picture out of four on the screen. The number of correct answers in each task served as a measure of expressive or receptive lexicon.

Procedure
This study is part of a larger project conducted in three inclusive primary schools (see for example Czapka, Klassert & Festman, 2019); we report here only the tasks relevant for this study. After receiving the completed parents' questionnaire, three experimental sessions proceeded in the following way: in two group sessions, children filled in the questionnaire and completed the EF tasks and the intelligence test; in a third, individual session, we assessed German lexicon size and short-term memory.
In group sessions (with max. 20 children per session), tasks were explained by one trained experimenter. Depending on group size, each additional experimenter (up to nine per session) supervised between one and at most four children to ensure a quiet atmosphere, correct administration of tests, immediate help in case of technical problems (with the tablet computers) and understanding of instructions despite language, comprehension or processing speed difficulties.

Data analysis: statistical analysis and matching
In the EF tasks, we analyzed only RTs, since performance in terms of accuracy was at ceiling and error rates (compared with Mann-Whitney U tests due to non-normal distributions) did not differ significantly between groups in any task or condition. For the analysis, RTs of correct responses were log-transformed to normalize distributions. We removed outliers in two steps: first, we excluded participants whose performance clearly deviated from the group; secondly, we removed single data points after visual inspection of the RT-distribution for each task (see below).
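The trial-level part of this preprocessing can be sketched as follows. The sketch assumes that RTs are recorded in milliseconds and that the natural logarithm is used; neither is stated explicitly in the text. The function name and example trials are hypothetical, and the bounds are passed in because the cut-offs were set per task by visual inspection.

```python
import math

def log_rts_within_bounds(trials, lower, upper):
    """Keep log-transformed RTs of correct trials that lie within [lower, upper].

    trials: iterable of (rt_ms, correct) tuples.
    lower, upper: task-specific cut-offs on the log(RT) scale.
    """
    kept = []
    for rt_ms, correct in trials:
        if not correct:
            continue  # only correct responses enter the RT analysis
        log_rt = math.log(rt_ms)  # assumption: natural log of RT in ms
        if lower <= log_rt <= upper:
            kept.append(log_rt)
    return kept

# Hypothetical trials, filtered with the Go/Nogo cut-offs reported in the Results.
trials = [(550, True), (900, True), (2600, True), (1000, False)]
kept = log_rts_within_bounds(trials, lower=6.4, upper=7.8)
```

Under these assumptions, the bounds 6.4 and 7.8 correspond to roughly 600 ms and 2440 ms.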
We calculated linear mixed effect models for each task separately with the lme4 package (Douglas et al., 2015) in R version 3.2.2 (R Core Team, 2015). The models included RTs for each trial (level-one units) nested within subjects (level-two units). To calculate group differences, the models included fixed effects for GROUP (monolinguals, multilinguals), the experimental variables and the interactions between them. We added random intercepts for participants and random by-subject slopes for the experimental manipulations (block for the Go/Nogo task, congruency and type for the BST and difficulty for the N-back). This procedure was applied to the heterogeneous and the homogeneous sample.
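As an illustration, the model for the Go/Nogo task could be written as below for trial i of participant j. The notation, the variable coding and the distributional assumptions are ours and are not taken from the original analysis scripts.

```latex
\log(\mathrm{RT}_{ij}) = \beta_0 + \beta_1\,\mathrm{Group}_j + \beta_2\,\mathrm{Block}_{ij}
  + \beta_3\,(\mathrm{Group}_j \times \mathrm{Block}_{ij})
  + u_{0j} + u_{1j}\,\mathrm{Block}_{ij} + \varepsilon_{ij},
\qquad (u_{0j}, u_{1j})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here the terms with coefficients \beta are the fixed effects, u_{0j} is the random intercept for participant j, and u_{1j} is the by-subject random slope for block. The BST and N-back models would replace Block with congruency and block type, or with difficulty, respectively.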
To obtain a homogeneous sample for the second analysis, we matched participants of both groups pairwise on variables known from the literature to possibly influence executive functions. These important covariates were age, gender, intelligence, lexicon size and SES (ISCED mother; cf. Table 2). To avoid multicollinearity due to high correlations between some background variables (cf. Table 3), we chose the mother's ISCED as the measure of SES and expressive lexicon size as the measure of German lexicon size. To match participants, we selected pairs of mono- and multilinguals from the data 1) who did not differ in gender, 2) whose ISCED differed by no more than one point (on a 4-point scale), and 3) whose age, lexicon and intelligence scores were as close as possible (differences within matched pairs amounted for age in months to M = 5.1, SD = 5.2, for lexicon on a scale from 0 to 40 to M = 3.0, SD = 2.4, and for intelligence on a scale from 0 to 45 to M = 2.4, SD = 1.5). A summary of the demographic variables (group means with standard deviations) with group comparisons calculated with unpaired t-tests can be found in Table 1 for the heterogeneous and in Table 2 for the homogeneous groups. Pairwise matching (instead of adding all these variables as covariates to the models) was chosen to make the process of group matching transparent (which is rarely done in the literature), to ensure that children are comparable on an individual level (in contrast to matching on a group-level), to simultaneously control for a set of background variables that impact on EF development and to avoid overfitting the regression models with too many variables.
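The three matching criteria can be expressed as a small procedure. The following is a minimal sketch only: the greedy selection strategy, the weighting of age, lexicon and intelligence in the distance, and all names and example values are assumptions for illustration and do not reproduce the selection actually carried out for this study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Child:
    id: str
    gender: str
    isced: int        # mother's ISCED on a 4-point scale
    age_months: int
    lexicon: int      # expressive lexicon score, 0 to 40
    iq: int           # intelligence sum score, 0 to 45

def distance(a, b):
    """Hypothetical closeness measure; the divisors only put the variables on similar scales."""
    return (abs(a.age_months - b.age_months) / 12
            + abs(a.lexicon - b.lexicon) / 40
            + abs(a.iq - b.iq) / 45)

def match_pairs(multilinguals, monolinguals):
    """Greedily pair children: same gender, ISCED within one point, closest first."""
    candidates = sorted(
        (distance(m, c), m.id, c.id)
        for m in multilinguals
        for c in monolinguals
        if m.gender == c.gender and abs(m.isced - c.isced) <= 1
    )
    used_multi, used_mono, pairs = set(), set(), []
    for _, m_id, c_id in candidates:
        if m_id not in used_multi and c_id not in used_mono:
            pairs.append((m_id, c_id))
            used_multi.add(m_id)
            used_mono.add(c_id)
    return pairs

# Hypothetical children, not data from this study.
multi = [Child("m1", "f", 2, 108, 30, 25), Child("m2", "m", 1, 112, 22, 20)]
mono = [Child("c1", "f", 3, 107, 31, 26), Child("c2", "m", 3, 111, 23, 21),
        Child("c3", "m", 2, 115, 24, 22)]
pairs = match_pairs(multi, mono)
```

In the example, m2 cannot be paired with c2 despite their similar scores, because their ISCED values differ by two points. A greedy strategy does not guarantee the best overall set of pairs; an optimal assignment method could be substituted without changing the eligibility rules.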

Background variables
In Table 1, we show means and standard deviations for the heterogeneous sample (monolinguals vs. multilinguals) on all collected background variables. Multilinguals were significantly older, showed poorer German skills and their parents' ISCED was significantly lower. The heterogeneity of our sample was also reflected in the multilinguals' proficiency in L1, which was lower and more heterogeneous than their proficiency in German, as well as in the wide range of AoA (see Table 1). The frequency with which the multilingual children used their languages at home was also very diverse: 30% reported that they used predominantly one or more languages other than German, 17% spoke mainly German and 32% used both German and another language similarly often (for 21% of the cases no information was given). Most bilinguals spoke Turkish (37%) or Arabic (17%), the others Albanian, Bosnian, Chinese, Edo, English, Greek, Hungarian, Kurdish, Persian, Polish, Punjabi, Russian, Serbian or Spanish. In the trilingual group, children spoke Albanian, Edo, English, Hebrew, Mandinka, Polish, Rumanian, Serbian, Turkish or Zaza. Overall, 83% of the multilingual group and 7% of the monolingual group indicated a migration background, meaning at least one parent or the child was born outside of Germany.

EF-Tasks
An overview of the raw values for the three EF tasks can be found in Figure 1.

Response inhibition (Go/Nogo)
The regression model for the response inhibition task consisted of 2370 observations from 125 participants and is presented in Table 4. Outliers comprised the complete data set of one bilingual who performed below chance and RTs with log(RT) > 7.8 or log(RT) < 6.4 (1.2% of data points, corresponds to about M ± 2 SD). There was significant variance in intercepts across participants. RTs significantly decreased from the first to the second block and multilinguals answered significantly slower than monolinguals.

Interference inhibition (BST)
Table 4 shows the results for the interference inhibition task. 10,435 observations of 124 participants were included in this model. We excluded two multilingual children from the analysis due to below chance performance in one of the uniform blocks. Additionally, single blocks with less than four correct responses and RTs with log(RT) < 5.8 were removed (0.1% of all data points, corresponds to approximately M ± 3 SD). There was significant variance in intercepts across participants. Significant experimental manipulations were block type (RTs increased in the mixed block) and congruency (in comparison to neutral trials, performance on congruent trials was faster and on incongruent trials slower); additionally, incongruency had a smaller effect in mixed compared to uniform blocks. No significant interaction of multilingualism with any variable was found, but there was a marginal effect that multilinguals respond overall faster than monolinguals.

Updating (N-back)
The model for the updating task is presented in Table 4. One bilingual child did not take part in this task and we needed to exclude four multilingual children from the analysis in the 2-back task due to below chance performance. Outliers with log(RT) > 7.9 and log(RT) < 6.3 were removed (0.9% of data points, corresponds to approximately M ± 2 SD). The final data set consisted of 2585 observations from 125 participants. There was significant variance in intercepts across participants and RTs were significantly longer in the 2-back task compared to the 1-back. Overall, multilinguals performed significantly slower than monolinguals and a trend indicates that the increase in difficulty from 1- to 2-back is smaller for multilinguals than monolinguals.

Background variables
Pairwise matching resulted in a subsample of n = 21 children per group, who did not differ in any background variables (cf. Table 2). The multilingual group included n = 3 trilinguals and n = 18 bilinguals. The reported frequency of language use at home was the following: 24% spoke predominantly one or more languages other than German, 33% spoke mainly German and 24% spoke German and another language similarly often (19% are missing data). Four bilinguals spoke Turkish, and the remaining spoke Albanian, Arabic, Bosnian, Chinese, English, Hungarian, Polish, Punjabi, Russian or Spanish. Trilinguals spoke English and Mandinka, Albanian and Polish, and Turkish and Zaza.
Figure 2 displays the raw values for the response inhibition, interference inhibition and updating task of the homogeneous sample.

Response inhibition
For the Go/Nogo task, 805 observations from 42 participants were included in the model (see Table 4). There was significant variance in intercepts across participants and RTs decreased from the first to the second block. Multilingualism did not influence RTs.

Interference inhibition
The final model (see Table 4) for the BST comprises 3594 observations from 42 participants. There was significant variance in the intercepts across participants. Congruency influenced RTs significantly: In comparison to neutral trials, performance on congruent trials was faster and on incongruent trials slower. The influence of block type was marginally significant. Multilinguals responded overall significantly faster than monolinguals, but no interactions reached statistical significance.

Updating
The regression model for the N-back (see Table 4) consists of 899 observations of 42 participants. There was significant variance in the intercepts across participants, but no significant difference between tasks or groups.

Discussion
In this study, we first compared heterogeneous groups of mono- and multilingual children on their performance in three EF tasks, tapping into response inhibition, interference inhibition and updating. When group membership was confounded with differences in SES and German skills, multilinguals showed a disadvantage in the Go/Nogo and N-back task, but similar performance in the BST. After matching participants of both groups pairwise on possibly influential variables (i.e., age, gender, intelligence, SES and German lexicon size), multilinguals outperformed their monolingual peers in the BST; more specifically, they answered faster overall. The differences in the Go/Nogo and the N-back task disappeared: multilinguals were no longer slower in response inhibition and in updating.

Conclusions from the heterogeneous groups
When we 'ignored' between-group differences in age, SES and lexicon size in our first analysis based on the heterogeneous groups, we found no bilingual advantage, but rather a disadvantage in updating and response inhibition. This was expected due to the known detrimental effects of these variables for the multilingual group. This concerns especially their lower SES, which negatively influences EF (Carlson et al., 2013; Morton & Harper, 2007). In general, the parents' SES, as measured by either financial or educational aspects, is related to the amount of stimulating educational input children receive and is therefore a major factor for cognitive development including EF and, for example, lexicon (Calvo & Bialystok, 2014; Carlson et al., 2013). Nevertheless, it is remarkable that multilinguals were apparently able to compensate for these detrimental effects and performed equally on the BST in the first comparison between heterogeneous groups. It seems that they were able to counter the negative influences of the background variables with better cognitive control. This pattern was also reported in Carlson and Meltzoff (2008) for Spanish-English preschool children who did not differ in their EF performance from monolingual peers despite disadvantages in SES, age and verbal ability. In our study, multilinguals have a more difficult starting position due to their parents' lower SES but also because of their poorer German skills, their later start of German acquisition and a disadvantage in the German education system (cf. Fereidooni, 2012). Therefore, one cannot expect a bilingual advantage in executive functions, because the combination of disadvantaging factors (linked to their multilingual migration background) likely overshadows a cognitive advantage. However, their comparable performance indicates that multilinguals can at least compensate for these disadvantages through superior EF.

Conclusions from pairwise matching
The matching procedures in the literature are in general applied on a group level (e.g., Duñabeitia et al., 2014; Kousaie et al., 2015). Our method of matching participants on an individual level, however, considers the distribution of multiple variables in each matched pair of mono- and multilinguals. A more rigorous approach like this was already demanded by other researchers, like Hilchey and Klein (2011): "the extent to which bilingualism is the complete, partial, or apparent cause of these data is an area that warrants further investigative work, and we urge future investigators of the BICA/BEPA hypotheses to be assiduous in their efforts to match monolinguals and bilinguals on plausibly pertinent demographic factors" (p. 643). Our approach for matching participants aimed at controlling for as many important background variables as possible, as it seemed likely that it was a certain combination of background variables that made the difference. By reporting in detail the process of reducing our initial sample of 168 children to two homogeneous groups of 21 mono- and 21 multilinguals, we want to encourage other researchers to reveal their methods of participant selection more clearly. In contrast to other matching strategies that focused, for example, only on reaching non-significant differences between group means, we carefully chose pairs of children who overlapped as well as possible in all our background variables. Thus, this strategy ensured that the language groups consisted of children who were similar on both the group and the individual level. This reduced the sample size tremendously, but "by reducing samples to ensure that groups are more comparable, (…) smaller studies can be more informative than larger studies, if they are better controlled." (Hope, 2015, p. 59).
After closely following these recommendations, our results indeed indicate that this is an appropriate method to approach the bilingual advantage debate: after minimizing the influence of all potentially influential background variables by pairwise matching, a multilingual advantage in the BST emerged and the differences in the other tasks were balanced out.
Matching on lexicon size had an additional effect: namely that we excluded children not only with particularly poor German skills but also with late AoA of German from the sample. Therefore, the remaining multilinguals supposedly underwent more extensive training in their languages, which is important because the proficiency in the single languages (Luk & Bialystok, 2013), their usage (Luk & Bialystok, 2013; Yow & Li, 2015), and the AoA (Kapa & Colombo, 2013) influence EF. Their extensive bilingual experience may have resulted in superior performance in the BST. Crucially, future research should consider bilingual experience more thoroughly, in particular because of the tremendously heterogeneous experiences in countries, speech communities, school settings, and so forth.
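To make the idea of matching on the individual level concrete, the following is a minimal sketch of one possible automated variant: a greedy procedure that only admits same-gender pairs whose differences on every background variable stay within a tolerance, and that forms the closest pairs first. This is not the procedure used in this study; the tolerances, the distance measure, the class and function names, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Child:
    ident: str
    gender: str
    age: float      # years
    iq: float       # intelligence score
    ses: float      # socioeconomic status index
    lexicon: float  # German lexicon size score


# Hypothetical tolerances: maximum absolute difference allowed within a pair.
TOLERANCE = {"age": 0.5, "iq": 5.0, "ses": 1.0, "lexicon": 5.0}


def distance(a, b):
    """Sum of tolerance-scaled differences, or None if the pair is inadmissible."""
    if a.gender != b.gender:
        return None
    total = 0.0
    for name, tol in TOLERANCE.items():
        diff = abs(getattr(a, name) - getattr(b, name))
        if diff > tol:
            return None
        total += diff / tol
    return total


def match_pairs(monolinguals, multilinguals):
    """Greedily form pairs, closest admissible pair first; each child is used once."""
    candidates = []
    for i, multi in enumerate(multilinguals):
        for j, mono in enumerate(monolinguals):
            d = distance(multi, mono)
            if d is not None:
                candidates.append((d, i, j))
    candidates.sort()
    used_multi, used_mono, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_multi and j not in used_mono:
            used_multi.add(i)
            used_mono.add(j)
            pairs.append((multilinguals[i].ident, monolinguals[j].ident))
    return pairs


# Hypothetical children, for illustration only.
mono = [Child("mo1", "f", 9.0, 100, 3, 50), Child("mo2", "m", 9.2, 105, 2, 45),
        Child("mo3", "f", 8.0, 90, 1, 30)]
multi = [Child("mu1", "f", 9.1, 101, 3, 49), Child("mu2", "m", 9.3, 104, 2, 46),
         Child("mu3", "m", 10.5, 120, 4, 60)]
pairs = match_pairs(mono, multi)
```

In this toy example, two pairs are formed and one child per group remains unmatched, which mirrors how strict matching criteria shrink the sample.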

Superior interference inhibition
Since multilinguals outperformed their monolingual peers only in the BST but not in the Go/Nogo task, our results support the dissociation between interference and response inhibition (Luk, Anderson, Craik, Grady & Bialystok, 2010; Martin-Rhee & Bialystok, 2008). The crucial difference between these concepts is that the information in Go/Nogo tasks is not by itself 'conflicting' but requires only execution or suppression of a motor response. In contrast, in the BST, every stimulus contains potentially conflicting information on shape and color that needs to be processed to allow for correct responses. The increased attentional demands due to the potentially conflicting information probably involve the neural networks also required in bilingual language processing: constant attention towards the relevant stimulus feature (or the target language) and inhibition of the irrelevant information (or the non-target language). Thus, speaking multiple languages affects only interference but not response inhibition, a finding consistent with previous literature (Esposito et al., 2013; Martin-Rhee & Bialystok, 2008).
Multilinguals showed globally reduced RT in the BST. This result indicates that not solely inhibition, but rather executive processing is enhanced in multilinguals. Hilchey et al. (2015) term this pattern BILINGUAL EXECUTIVE PROCESSING ADVANTAGE (BEPA) and Bialystok (2017) SUPERIOR EXECUTIVE ATTENTION. Both comprise not only advanced inhibition of irrelevant information but also advanced attention towards relevant information. This interpretation is in accordance with the theory on bilingual language processing: these two aspects of EF, inhibition and attention, are essential for bilinguals to comprehend and produce language (see introduction for details) and their constant usage trains the associated EF components. Consequently, the presence of potentially conflicting information in interference inhibition tasks seems to be the key feature that allows multilinguals to reveal their EF advantage.
If only inhibition was modulated by bilingualism, the advantage would manifest itself in a smaller interference effect and particularly lower RT in incongruent conditions (Hilchey et al. refer to this pattern as BICA). Esposito et al. (2013) found this pattern in the BST when they compared bilingual 4-year-olds with monolingual peers. Differences in age and therefore cognitive development between our sample and Esposito and colleagues' might cause the discrepancy in outcome patterns.
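The distinction between the two patterns can be illustrated with a small sketch. It assumes that the interference effect is computed as the mean incongruent RT minus the mean congruent RT, which is one common operationalization and not necessarily the one used in the models reported here. All RTs and function names are hypothetical.

```python
from statistics import mean


def interference_effect(rts):
    """Mean incongruent RT minus mean congruent RT (ms); assumed definition."""
    return mean(rts["incongruent"]) - mean(rts["congruent"])


def global_rt(rts):
    """Mean RT pooled over all trial types (ms)."""
    return mean(rt for trials in rts.values() for rt in trials)


# Hypothetical RTs in ms, for illustration only.
monolingual = {"congruent": [900, 920], "neutral": [950, 970],
               "incongruent": [1050, 1070]}
multilingual = {"congruent": [850, 870], "neutral": [900, 920],
                "incongruent": [1000, 1020]}

same_interference = (interference_effect(monolingual)
                     == interference_effect(multilingual))
faster_overall = global_rt(multilingual) < global_rt(monolingual)
```

In this toy data set, both groups show the same interference effect while the multilingual group is faster on every trial type, which is the BEPA-like pattern; a BICA-like pattern would instead show a smaller interference effect for multilinguals.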

Updating
Multilinguals answered overall slower in the N-back task, but after matching participants no difference was found. This finding is in line with other studies showing no influence of bilingualism on working memory: for example, in the Listening Span and Pattern Recall Task (Namazi & Thordardottir, 2010), or in Backward Digit Span and Counting Recall (Engel de Abreu, 2011). However, Blom et al. (2014), who used a similar methodology, found opposite results: they compared visuospatial and verbal working memory between monolingual Dutch and Turkish-Dutch children, who had a migration background, lower SES and smaller Dutch vocabulary size (in that way, a sample similar to ours). Initially, both groups performed similarly, but, when SES and vocabulary were controlled, bilinguals outperformed monolinguals. These opposite results may stem from differences in samples or task impurity, which impedes precise measurement of EF due to increasing chances of type I and type II errors (Friedman, 2016). Task impurity results from variance in stimuli characteristics (word, digit, or picture), response modality (motor or verbal) or their dynamics (they assess either the storage capacity, as in span tasks, or the ability to manipulate working memory content, like the N-back task that requires concurrent updating of memory content and dropping of irrelevant information; Conway, Kane, Bunting, Hambrick, Wilhelm & Engle, 2005). These factors might conceal common features of N-back and other working memory tasks and cause the heterogeneity of research findings. Our results do not support a bilingual advantage in updating, but its existence cannot be excluded (Calvo, Ibáñez & García, 2016).

The role of high cognitive demands
Since training studies with EF tasks (Diamond, 2014) and other studies on the bilingual advantage (Costa et al., 2009; Martin-Rhee & Bialystok, 2008) usually found a bilingual advantage in the more demanding conditions, we expected this also in our study. However, neither the increased EF demands from 1- to 2-back nor from uniform to mixed block in the BST yielded a bilingual advantage. Without pairwise matching, no advantage could arise because of the strong influence of the other background variables, which had a negative influence on the multilinguals' performance. With pairwise matching, however, the more difficult experimental conditions did not yield additional challenges for the homogeneous groups anymore. We conclude that matching led to a sample with on average higher mental faculties.
Comparing response latencies on the uniform versus the mixed block in the BST, we observed that the additional difficulty of having to switch between different types of stimuli (neutral, congruent, incongruent) seems to have been compensated by overall longer RTs, particularly on the easier trial types (neutral, congruent), rather than on the more difficult, incongruent condition. This strategy has been observed as well in studies on language switching, in particular in trilingual contexts. When participants had to switch between three languages for naming digits, such mixed conditions were at the expense of the more dominant languages (Festman & Mosca, 2016).

Limitations
In this study, we assessed three different EF components with three tasks. Due to task impurity, however, further studies need to investigate which tasks measure exactly which EF components. Friedman (2016), for example, advises researchers to include multiple tasks and latent variables to ensure precise measurement of EF. Equally important is controlling for influential factors in research on the bilingual advantage. As suggested by Hope (2015), our approach aimed at controlling for as many important background variables as possible. Future research should provide even more insights into their role in EF and their possibly differential impact on mono- versus multilinguals' processing (for one recent example of the differential role of EF on mono- vs. multilinguals' spelling, see Czapka et al., 2019). An alternative approach would be to run regression analyses on even larger samples of mono- and multilinguals and to statistically control for the background variables in one model. The variable we could not balance between language groups was migration status. How it affects EF is still unknown, but further research investigating the interaction between EF and migration status is necessary to exclude uncontrolled influence.

Conclusion
The analysis of our heterogeneous sample of primary school children showed that multilinguals with disadvantages in SES and German skills underperformed in the updating and response inhibition tasks. Since their performance was equal in interference inhibition, we conclude that they compensated for the detrimental effects of the background variables by superior EF. Our comparison of the homogeneous groups supports this conclusion. Following this initial analysis, we meticulously matched participants on age, gender, intelligence, SES and German lexicon size to achieve groups that were comparable on a group and individual level. The performance in the updating and response inhibition tasks was now equal between those matched groups, and in the interference inhibition task, multilinguals showed globally reduced RT, indicating a bilingual executive processing advantage.