Highlights
• Measuring language status dichotomously masks variability between children.
• Multilingual children show greater conceptual vocabulary than monolinguals.
• Linguistic distance, language and context entropy partially explain vocabulary size.
• Medium linguistic distance is related to increased vocabulary.
• Dominant, non-dominant and total vocabulary in multilinguals show a quadratic relation to entropy measures.
1. Introduction
Being able to communicate is fundamental for children’s later life outcomes (Duncan et al., Reference Duncan, Anderson, King, Finders, Schmitt and Purpura2022; Marchman & Fernald, Reference Marchman and Fernald2008). Children learn to communicate within and from their natural everyday environment. However, these everyday environments differ significantly between children, for instance, in the objects children encounter or the characteristics of the people they meet (Bergelson & Aslin, Reference Bergelson and Aslin2017). As a result, children vary in cognition, behaviour, and the ways they interact, cooperate, and work with others (Anderson et al., Reference Anderson, Mak, Keyvani Chahi and Bialystok2018; Brink et al., Reference Brink, Lane and Wellman2015; Neuhauser et al., Reference Neuhauser, Ramseier, Schaub, Burkhardt and Lanfranchi2018). Likewise, the diversity in children’s language and vocabulary development has been linked to their environments, specifically to the quantity and quality of language input they receive from their caregivers (Huttenlocher, Reference Huttenlocher1998; Rowe, Reference Rowe2012; Rowe & Goldin-Meadow, Reference Rowe and Goldin-Meadow2009). The more words children hear, the more words they know (Hart & Risley, Reference Hart and Risley2003) and the more frequently a word is heard, the earlier it is learned (Goodman et al., Reference Goodman, Dale and Li2008). Thus, children’s early vocabulary and language development have a significant impact on their later outcomes, and this development is influenced by their environment.
Children exposed to more than one language experience a communicative environment that differs significantly from that of their monolingual peers (Paradis, Reference Paradis2023; Singh, Reference Singh2021). To communicate successfully, multilingual children must navigate a highly variable environment and adapt their behaviour accordingly (Howard et al., Reference Howard, Carrazza and Woodward2014; Wermelinger et al., Reference Wermelinger, Daum and Gampe2024). Multilinguals’ language exposure is split between the different languages they are exposed to, and, as a result, they receive less input per language. For this reason, multilingual children and adults tend to know fewer words than monolinguals do in at least one of their languages (Bedore et al., Reference Bedore, Peña, García and Cortez2005; Bialystok & Barac, Reference Bialystok and Barac2012), while their combined vocabularies are of comparable size to the vocabularies of monolinguals (Gampe et al., Reference Gampe, Kurthen and Daum2018; Hoff et al., Reference Hoff, Core, Place, Rumiche, Señor and Parra2012; Pearson et al., Reference Pearson, Fernández and Oller1993). To overcome the limitations of single-language vocabulary comparisons and to include possible interaction effects of the multiple languages, researchers advocate measuring total vocabulary (i.e., the vocabularies in each of the languages combined) and conceptual vocabulary (i.e., the number of concepts the child knows in either language; Kohnert, Reference Kohnert2010; Pearson et al., Reference Pearson, Fernández and Oller1993). Nevertheless, a simple accumulator model (Kachergis et al., Reference Kachergis, Marchman and Frank2022), where more input equals greater vocabulary, does not seem to hold up for multilingual children (Sander-Montant et al., Reference Sander-Montant, López Pérez and Byers-Heinlein2023). 
Other factors likewise predict multilingual vocabularies, such as the types of words children learn (Muszyńska et al., Reference Muszyńska, Kołak, Haman, Białecka-Pikul and Otwinowska2024a), the people with whom children interact (e.g., adults or siblings; Hoff et al., Reference Hoff, Rumiche, Burridge, Ribot and Welsh2014) and the number of translation equivalents children use (i.e., words with similar meanings in different languages; Tan et al., Reference Tan, Marchman and Frank2024). Furthermore, multilingual children’s environments are largely heterogeneous (Gullifer & Titone, Reference Gullifer and Titone2020; Titone & Tiv, Reference Titone and Tiv2023). Multilingual children differ depending on the languages they learn, the prestige associated with their languages, or when and how they are exposed to each language (Wermelinger et al., Reference Wermelinger, Daum and Gampe2024). Previous research has not captured this diversity and has often treated multilingual children as one homogeneous group. However, to validly capture multilinguals’ communicative environment, we need to adopt measures that assess different aspects of a multilingual environment and understand multilingualism as a continuous variable rather than the dichotomous distinction between monolingualism and multilingualism (Gullifer & Titone, Reference Gullifer and Titone2020).
In the current project, we characterised the communicative environment of multilingual children using three measures (linguistic distance, language entropy and context entropy) to increase our understanding of the diversity within multilingual environments. We used these measures to explain differences in children’s receptive vocabulary as an early marker of language and communicative development (Huttenlocher, Reference Huttenlocher1998).
2. Linguistic distance
Monolingual children differ in their language development depending on the language they acquire (vocabulary growth, Bleses et al., Reference Bleses, Vach, Slott, Wehberg, Thomsen, Madsen and Basbøll2008; Thordardottir, Reference Thordardottir2005; phonological awareness, Kang, Reference Kang2012; word segmentation, Antovich & Graf Estes, Reference Antovich and Graf Estes2020; Mateu & Sundara, Reference Mateu and Sundara2022; Orena & Polka, Reference Orena and Polka2019). Multilinguals need to develop multiple interdependent language systems (French & Jacquet, Reference French and Jacquet2004) such that speech in one language activates word recognition in both of their languages (Blumenfeld & Marian, Reference Blumenfeld and Marian2013; Lin et al., Reference Lin, Lin and Yeh2023; Von Holzen & Mani, Reference Von Holzen and Mani2012). Furthermore, each language has specific qualities, which may lead to interference effects (Marian et al., Reference Marian, Blumenfeld and Boukrina2008; Tan et al., Reference Tan, Marchman and Frank2024). For example, words may have different meanings in different languages (e.g., gift means present in English but poison in German), or correct sentence construction in one language can lead to errors in other languages. These interactions between languages influence how multilingual children develop languages. For instance, multilingual infants rely more or less on vowels than consonants for lexical processing, depending on the language (Delle Luche et al., Reference Delle Luche, Poltrock, Goslin, New, Floccia and Nazzi2014; Mani & Plunkett, Reference Mani and Plunkett2007; Nishibayashi & Nazzi, Reference Nishibayashi and Nazzi2016).
The linguistic distance between the languages, that is, the level of similarity between languages on the lexical, phonological, and/or grammatical level (Jaekel et al., Reference Jaekel, Ritter and Jaekel2023), is associated with children’s language development. Smaller linguistic distances are associated with larger vocabulary sizes in bilingual toddlers in both of their languages (Gampe et al., Reference Gampe, Endesfelder Quick and Daum2021) and faster development and greater proficiency in adult second language (L2) learners (Jaekel et al., Reference Jaekel, Ritter and Jaekel2023; Van der Slik, Reference Van der Slik2010). Using commonalities between languages, such as cognates, may allow multilinguals to bootstrap their word learning via cross-linguistic transfer (Tan et al., Reference Tan, Marchman and Frank2024; Van der Slik, Reference Van der Slik2010). Cognates are words that are phonologically similar between two spoken languages (e.g., English house and German Haus), as compared to translational equivalents that are non-cognates (e.g., English dog and German Hund). However, phonological similarities (such as those found in cognates) may also reduce the separation between the two languages, leading to interference and delays in language acquisition, as observed in adults (Marian et al., Reference Marian, Blumenfeld and Boukrina2008). In line with this, Squires et al. (Reference Squires, Ohlfest, Santoro and Roberts2020) report that 25% of the investigated studies on children in their systematic review did not find a positive cognate facilitation effect on vocabulary and report that the cognate facilitation effect is influenced by many factors, such as language dominance (Bosma et al., Reference Bosma, Blom, Hoekstra and Versloot2019; Chai & Bao, Reference Chai and Bao2023; Garrido-Pozú, Reference Garrido-Pozú2024; Koutamanis et al., Reference Koutamanis, Kootstra, Dijkstra and Unsworth2025; Poarch & van Hell, Reference Poarch and van Hell2012; Quirk & Cohen, Reference Quirk and Cohen2022; Robinson Anthony et al., Reference Robinson Anthony, Blumenfeld, Potapova and Pruitt-Lord2022), language proficiency (Chai & Bao, Reference Chai and Bao2023) and language exposure (Robinson Anthony et al., Reference Robinson Anthony, Blumenfeld, Potapova and Pruitt-Lord2022). In sum, previous work is mixed and suggests that linguistic distance is associated with both larger and smaller vocabularies.
Linguistic distance can be estimated based on language families (Spolaore & Wacziarg, Reference Spolaore, Wacziarg, Ginsburgh and Weber2016), expert judgments of language characteristics (Dryer & Haspelmath, Reference Dryer and Haspelmath2013), or automated calculations of phonetic similarities (Automatic Similarity Judgement Program, ASJP; Wichmann, Reference Wichmann2020). The current study estimates linguistic distance based on the lexico-phonological similarity indicated in the ASJP database (Søren et al., Reference Søren, Holman and Brown2022; Wichmann, Reference Wichmann2020).
3. Entropy measures
In addition to the linguistic distance between their languages, multilingual children differ in how they are exposed to those languages across social contexts (e.g., caregivers, institutional childcare). For example, children can learn their languages in a compartmentalised context (e.g., a child hears one language at home and another one at institutional childcare) or an integrated context (e.g., a child hears their languages regardless of the communicative context; Gullifer & Titone, Reference Gullifer and Titone2020). Individual differences in how multilinguals use their languages across social contexts may define how they represent, access and control those languages (Abutalebi & Green, Reference Abutalebi and Green2016; Green & Abutalebi, Reference Green and Abutalebi2013). In the current study, we measured the differences in children’s language exposure across social contexts using holistic entropy measures. Entropy is a concept rooted in physics and has been used to quantify uncertainty or diversity (Shannon, Reference Shannon1948). Previous work in psycholinguistics used entropy measures in sentence comprehension research to model how readers or listeners adapt to noisy linguistic input, by quantifying how new information shifts expectations about earlier parts of a sentence (Levy, Reference Levy2008).
Entropy is computed using the following function:
$ H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) $
(Shannon, Reference Shannon1948), where p is the proportion of awake time the child is exposed to each language/context (x), and n is the total number of languages/contexts the child is exposed to. In the current study, we calculated the entropy of children’s language exposure (i.e., language entropy) and contexts (i.e., context entropy). Unlike simple frequency counts, entropy captures both the number of distinct elements (i.e., different languages and contexts a child is exposed to) and their relative probabilities (Cover & Thomas, Reference Cover and Thomas2006), offering a nuanced measure of language and contextual diversity. Entropy increases with the number of distinct elements (i.e., different languages or social contexts), meaning that exposure to more varied input naturally leads to greater measured diversity. Higher entropy values thus reflect a more diverse and less predictable linguistic environment, which has been linked to improved vocabulary development and language outcomes in children (Rowe, Reference Rowe2012). Further, entropy is a symmetric measure, treating all elements without bias. Lastly, it is sensitive to the distribution of elements: entropy is maximised when input is balanced and evenly distributed, reflecting greater unpredictability and complexity (Gullifer & Titone, Reference Gullifer and Titone2020). In the case of exposure to two languages or two contexts, entropy is highest (H(2) = 1.00) if exposure is exactly balanced between the two elements. In the case of exposure to three languages/contexts, the maximum entropy is higher than with two languages and is reached (H(3) = 1.58) when exposure is equally distributed across the three languages (see Figure 1).
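The entropy formula can be illustrated with a minimal Python sketch. It is not the implementation used in the study (which relied on the languageEntropy R package); the function name shannon_entropy and the example proportions are assumptions chosen for illustration only.

```python
import math


def shannon_entropy(proportions):
    """Return the Shannon entropy (in bits) of a list of exposure proportions.

    Hypothetical helper for illustration; not part of any published package.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    entropy = 0.0
    for p in proportions:
        p = p / total  # renormalise so that the proportions sum to 1
        if p > 0:      # by convention, 0 * log2(0) contributes 0
            entropy -= p * math.log2(p)
    return entropy


# Made-up exposure profiles (proportion of awake time per language or context).
balanced_two = shannon_entropy([0.5, 0.5])          # two balanced elements
balanced_three = shannon_entropy([1/3, 1/3, 1/3])   # three balanced elements
skewed_two = shannon_entropy([0.9, 0.1])            # one element dominates

print(round(balanced_two, 2), round(balanced_three, 2), round(skewed_two, 2))
```

Under these assumptions, the balanced two-element profile yields 1.00 and the balanced three-element profile yields 1.58, matching the maxima stated above, while the skewed profile yields a clearly lower value.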

Figure 1. Distribution of entropy for (a) two elements (languages/social contexts) and (b) for three elements. Note that the distributions peak at equal distributions across the given number of elements.
3.1. Language entropy
Language entropy (Gullifer & Titone, Reference Gullifer and Titone2020; Titone & Tiv, Reference Titone and Tiv2023) quantifies the diversity of cumulative everyday language exposure to different languages (Gullifer & Titone, Reference Gullifer and Titone2020). It is estimated as a function of the probability of certain events occurring, that is, when the child is exposed to a specific language. As described in the previous paragraph, higher language entropy values relate to more balanced language use and greater language diversity across contexts. In adults, increased language entropy is associated with increased language proficiency in the non-dominant language (Gullifer et al., Reference Gullifer, Kousaie, Gilbert, Grant, Giroud, Coulter, Klein, Baum, Phillips and Titone2021; Gullifer & Titone, Reference Gullifer and Titone2020). However, while language exposure has been linked to language development in childhood (Hart & Risley, Reference Hart and Risley2003; Huttenlocher, Reference Huttenlocher1998), this relationship is not necessarily linear (Hoff & Ribot, Reference Hoff and Ribot2017; Sander-Montant et al., Reference Sander-Montant, López Pérez and Byers-Heinlein2023). For instance, the number of translation equivalents in a child’s languages can support cross-linguistic word learning (Tan et al., Reference Tan, Marchman and Frank2024). Hence, an increase in language entropy (i.e., more balanced language input) is not automatically associated with an increased vocabulary in a child’s language. Furthermore, the diversity of social contexts in which languages are used may further contribute to children’s language development (Marvin et al., Reference Marvin, Beukelman and Bilyeu2009).
3.2. Context entropy
Language use differs between social contexts (Abdalla, Reference Abdalla2022; Straker, Reference Straker1980). For instance, preschool children use only about one-third of their vocabulary across their home and preschool contexts, while other words are specific to either of the contexts (Marvin et al., Reference Marvin, Beukelman and Bilyeu2009; Muszyńska et al., Reference Muszyńska, Łuniewska, Dynak, Kolak, Lohrum, Otwinowska, Wodniecka and Haman2024b). In contrast, monolinguals hear context-specific labelled words across home and school contexts (Muszyńska et al., Reference Muszyńska, Łuniewska, Dynak, Kolak, Lohrum, Otwinowska, Wodniecka and Haman2024b). The differences in language use between social contexts influence children’s vocabulary, depending on the time spent in these contexts and the subsequent variety of language input children experience (Goodman et al., Reference Goodman, Dale and Li2008; Hart & Risley, Reference Hart and Risley2003). In the current study, we measured the diversity of children’s social contexts with context entropy. In parallel to language entropy, context entropy is estimated by calculating the probability of certain events occurring. In the case of context entropy, these events refer to the time children spend in different social contexts (e.g., primary caregivers, institutional childcare). Thus, a balanced measure of contextual entropy indicates children spending similar amounts of time in their different caregiving contexts.
4. The current study
The current study explored how diversity in multilingual children’s communicative environments relates to their language development. This diversity is indicated by children’s linguistic distance and their language and context entropy. We took an exploratory approach to examining how various aspects of children’s language environment are associated with their receptive vocabulary. Specifically, we addressed two objectives: (1) to characterise diversity in children’s communicative environment using three measures (linguistic distance, language entropy and context entropy), and (2) to relate these measures to children’s language development. We re-analysed existing data from parental questionnaires on children’s language exposure and receptive vocabulary tests from studies conducted in our research group between 2019 and 2024. We first followed the traditional approach and compared the receptive vocabulary of monolingual and multilingual children on the group level. We then associated children’s linguistic distance, language entropy and context entropy with their receptive vocabulary. Multilingual children’s receptive vocabulary, in the two languages they are most exposed to, was measured and reported as vocabulary per language, total vocabulary (i.e., summed across both languages) and conceptual vocabulary (i.e., the number of concepts a child knows in either language).
5. Method
We preregistered the study prior to data analyses (https://osf.io/fhu5e/?view_only=285a0d10bd814d60b88ad0f11c138cc3) and made the collected data and R code available on the Open Science Framework (see Data Availability Statement).
5.1. Participants
We included data from N = 257 children aged 3.8–5.6 years (Mage = 4.54 years, SDage = 0.36); 49% were girls. Children were monolingual Swiss German (n = 112, i.e., not more than 10% input in another language) and multilingual (n = 134). All multilingual children spoke Swiss German (i.e., the societal language) and were exposed to another language for at least 20% of the time (in accordance with Wermelinger et al., Reference Wermelinger, Gampe and Daum2017). Of the multilingual children, 19 were exposed to a third language. The multilingual children were exposed to the following languages: English (n = 36), Italian (n = 35), French (n = 21), Spanish (n = 18), Portuguese (n = 8), Swedish, Greek, Hungarian (n = 4 each), Romansh, Mandarin, Czech (n = 3 each), Croatian, Polish, Russian, Serbian (n = 2 each) and Albanian, Arabic, Danish, Dutch, Slovakian, Tagalog (the Philippines), Tigrinya (Eritrea), Turkish and Vietnamese (n = 1 each). Twelve children were excluded from the analyses because they did not meet the criteria for language exposure.
The children in our sample came from a background of high socio-economic status. The household income of the participants in our sample was above average; 79.5% of caregivers reported above-city average household income, 12.4% reported a household income within the average range and 8.0% reported below-average household income (Bundesamt fuer Statistik, Reference Bundesamt fuer2024). Parental education was high; 73.0% of children came from households in which both caregivers completed higher education at a university or a university of applied sciences and 16.0% of children came from households in which one of the caregivers completed higher education at a university or a university of applied sciences. The other 1% of children came from households in which caregivers completed other types of education, such as apprenticeships or vocational training.
The participants were recruited via the research unit’s database. The database consists of caregivers interested in participating in studies with their children. The children were healthy, born full term (week of gestation > 37), had a birth weight > 2500 g and had no diagnosis of a developmental disorder, according to parental reports. Each child received a certificate and a small present (approximately USD 5) as compensation. Children’s caregivers gave informed consent. The Ethics Committee of the UZH Faculty of Arts and Social Sciences approved the general procedure, which adheres to the ethical standards of the 1964 Declaration of Helsinki and its subsequent amendments.
5.2. Design
This study re-analysed data from three cross-sectional observational studies on monolingual and multilingual children’s communicative development. We used two data sources: a parental questionnaire on children’s cumulative language exposure and a laptop-based test of children’s receptive vocabulary (Gampe et al., Reference Gampe, Kurthen and Daum2018).
In the parental questionnaire, caregivers were asked to report, for each caregiving situation since the child’s third birthday and for each day of the week, each interaction partner, each language, and the amount of time the child spent with this interaction partner and was exposed to each language. Based on this information, children’s language status was determined (monolingual in case of exposure to a second language for less than 10% of awake time and multilingual in case of exposure to a second language for at least 20% of awake time).
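The classification rule can be written out as a short Python sketch. The function name language_status and the label "excluded" for children falling between the two thresholds are assumptions made for illustration; the study itself only states that children who did not meet the exposure criteria were excluded.

```python
def language_status(second_language_share):
    """Classify a child from the share of awake time spent in a second language.

    Hypothetical helper: thresholds follow the text (< 10% monolingual,
    >= 20% multilingual); the "excluded" label is an assumption.
    """
    if not 0.0 <= second_language_share <= 1.0:
        raise ValueError("share must be a proportion between 0 and 1")
    if second_language_share < 0.10:
        return "monolingual"
    if second_language_share >= 0.20:
        return "multilingual"
    return "excluded"  # between the thresholds: criteria not met


# Made-up example shares, not study data.
statuses = [language_status(share) for share in (0.05, 0.15, 0.20, 0.45)]
print(statuses)
```

With these made-up shares, the children are classified as monolingual, excluded, multilingual and multilingual, respectively.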
Receptive vocabulary in the children’s languages was assessed by the BILEX (Gampe et al., Reference Gampe, Kurthen and Daum2018). The BILEX is a reliable and valid method for measuring receptive vocabulary in children aged 3–5 years. In this test, children hear a noun and are asked to correctly match it to one of six simultaneously presented objects across 48 trials. Children’s vocabulary in Swiss German was always assessed as the first task in the studies, and the vocabulary in their second language as the final task. Hence, no randomisation was performed for the measurements of the current study. However, we controlled for possible effects of fatigue in the second administration of the vocabulary test by comparing the children’s performance and reaction times between the two measurements (Heitz, Reference Heitz2014). The results of two dependent t-tests show significant differences in children’s performance, t(158) = 5.99, p < .001, but not in children’s reaction times, t(159) = 1.26, p = .209. Therefore, the difference in performance is unlikely to be explained by differences in reaction times, for instance, due to increased fatigue in completing the second BILEX. Children’s data were included in the current study if data from both the parental questionnaire and the BILEX were available. Children’s BILEX data were discarded in case of values below the second percentile of the norm scores developed within the research unit.
5.3. Measures
5.3.1. Linguistic distance
The children’s linguistic distance measure was based on the lexico-phonological similarity indicated in the ASJP database (Holman et al., Reference Holman, Brown, Wichmann, Müller, Velupillai, Hammarström, Sauppe, Jung, Bakker, Brown, Belyaev, Urban, Mailhammer, List and Egorov2011; Wichmann, Reference Wichmann2020). The ASJP consists of word lists of the 40 most stable lexical items from Swadesh’s 100-item list (Swadesh, Reference Swadesh1955), transcribed into a simplified standard orthography which is phonologically informed and consistent across the different languages (e.g., the English words dog and sun are transcribed as dag and s3n and would be compared to the Swiss-German huN and sunne). We calculated the normalised Levenshtein distance (Bakker et al., Reference Bakker, Müller, Velupillai, Wichmann, Brown, Brown, Egorov, Mailhammer, Grant and Holman2009) from each child’s language to the societal language to estimate the linguistic distance between the languages they are exposed to (Konstantinidis, Reference Konstantinidis2007). The score is calculated as the number of characters that need to change from one word to its equivalent in the other language. The sum of these transformations for all 40 items is the linguistic distance: the higher the linguistic distance, the fewer similarities between languages. Because the societal language consists of a wide variety of dialects and the ASJP offers data on one of those dialects, the linguistic distance obtained approximates the actual distance, as the children in our sample may be exposed to other dialects. If children were exposed to more than two languages, we calculated the linguistic distance for the two languages the child is exposed to most.
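The character-level comparison can be sketched in Python as follows. This is a minimal illustration of a length-normalised Levenshtein distance for the two example word pairs given above, not the ASJP implementation; the function names are hypothetical, and the aggregation across all 40 items (and any further normalisation applied in the ASJP) is not reproduced here.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalised_levenshtein(a, b):
    """Divide the edit distance by the length of the longer word (range 0 to 1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# The two transcribed example pairs from the text (English vs Swiss German).
word_pairs = [("dag", "huN"), ("s3n", "sunne")]
distances = [normalised_levenshtein(x, y) for x, y in word_pairs]
mean_distance = sum(distances) / len(distances)
print(distances, mean_distance)
```

For these two pairs, the non-cognate pair (dag, huN) requires three substitutions and reaches the maximum normalised distance of 1.0, whereas the cognate pair (s3n, sunne) yields 0.6, which illustrates why a higher score indicates fewer similarities.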
5.3.2. Entropy measures
Language and context entropy were computed from the proportion of awake time children spent in certain languages or social contexts, as reported in the parental questionnaire, using the languageEntropy R package (Gullifer & Titone, Reference Gullifer and Titone2018). The entropy values range from 0 to log₂ n, with n being the total number of languages or social contexts the function is computed over (Gullifer & Titone, Reference Gullifer and Titone2020). Higher levels of language and context entropy indicate more balanced language exposure or time spent in different social contexts. When exposure is evenly distributed between two languages or contexts (i.e., 50% each), entropy reaches its maximum value of 1. In the case of three equally balanced languages or contexts (approximately 33% each), the maximum entropy is about 1.58 (see Figure 1). Context entropy was computed across the contexts of primary caregivers, secondary caregivers, institutional childcare and kindergarten.
5.3.3. Vocabulary
Children’s receptive vocabulary in both of their most dominant languages was measured as the number of correctly identified nouns in the BILEX per language and summed up across both languages to obtain a measure of total vocabulary. Furthermore, the conceptual vocabulary was calculated via the number of correctly identified nouns in either language. For the analysis, we took a data-driven approach and considered the language the child is exposed to more than 50% of the time to be their first language (dominant language was not the societal language for n = 42 children). In case a child’s exposure was exactly 50%, we considered the societal language as the child’s first language (n = 1).
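The difference between the vocabulary measures can be made concrete with a small Python sketch. The function name vocabulary_scores, the item labels and the exposure shares are assumptions for illustration; the tie rule (societal language counts as dominant at equal exposure) follows the description above.

```python
def vocabulary_scores(correct_societal, correct_other, share_societal, share_other):
    """Derive vocabulary measures from sets of correctly identified items.

    Hypothetical helper: each set holds the concepts identified correctly in
    one language; shares are proportions of awake time per language.
    """
    # Total vocabulary: items summed across both languages.
    total = len(correct_societal) + len(correct_other)
    # Conceptual vocabulary: concepts known in either language, counted once.
    conceptual = len(correct_societal | correct_other)
    # Dominant language: higher exposure; a tie goes to the societal language.
    societal_is_dominant = share_societal >= share_other
    dominant = correct_societal if societal_is_dominant else correct_other
    non_dominant = correct_other if societal_is_dominant else correct_societal
    return {
        "total": total,
        "conceptual": conceptual,
        "dominant": len(dominant),
        "non_dominant": len(non_dominant),
        "dominant_language": "societal" if societal_is_dominant else "other",
    }


# Made-up example: "dog" is known in both languages, so it counts twice
# towards total vocabulary but only once towards conceptual vocabulary.
scores = vocabulary_scores({"dog", "sun", "house"}, {"dog", "tree"}, 0.4, 0.6)
print(scores)
```

In this made-up example the total vocabulary is 5 and the conceptual vocabulary is 4, and the non-societal language is treated as dominant because it accounts for the larger share of exposure.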
In sum, children’s language status, age and communication environment (language entropy, context entropy and linguistic distance) were derived from parental questionnaires. Linguistic distance is operationalised as a continuous variable based on lexico-phonological similarity and as a categorical variable following an exploratory cluster analysis (see Results). Continuous vocabulary outcomes (total, conceptual, dominant and non-dominant language) were assessed using BILEX (Gampe et al., Reference Gampe, Kurthen and Daum2018). A detailed overview of all variables and their operationalisation is provided in Tables A1 and A2 in the Appendix.
6. Results
We deviated from the preregistered analyses by running group-level analyses (i.e., monolingual versus multilingual) on children’s vocabulary, and by exploring the effects of context entropy on multilingual children and monolingual children separately.
6.1. Group-level analyses
We used independent t-tests to assess the difference in vocabulary scores for the dominant language and conceptual vocabulary between monolingual and multilingual children on the group level to assess whether multilingual children show greater variability in their vocabulary, as often suggested in the literature (Hoff et al., Reference Hoff, Rumiche, Burridge, Ribot and Welsh2014; Hoff & Core, Reference Hoff and Core2013; Lauro et al., Reference Lauro, Core and Hoff2020). The difference in variance between the two groups was assessed with an F-test for Equality of Variance.
Monolingual children scored significantly higher (M = 39.3, SD = 4.2) than multilingual children (M = 37.8, SD = 4.0) in their dominant language, t(236) = 2.49, p = .013, Cohen’s f² = .03. We found a significant difference in variance between monolinguals and multilinguals, F(111,126) = 0.69, p = .049, with higher variance in dominant language vocabulary among multilinguals. Multilingual children scored significantly higher (M = 41.2, SD = 4.0) on conceptual vocabulary than their monolingual peers (M = 39.4, SD = 4.1), t(233) = −3.38, p < .001, Cohen’s f² = .05. However, the variability between the groups showed no significant difference, F(111,130) = 1.07, p = .718 (see Figure 2A,B).

Figure 2. Graphs a and b show group-level analyses comparing the vocabularies of monolingual and multilingual children. Graphs c to f show vocabulary differences in multilingual children by the two identified linguistic distance groups (Low and High).
6.2. Indicators of diversity
To examine diversity in children’s communicative environments and relate this to their vocabulary development, we computed three indicators of diversity (i.e., linguistic distance, language entropy and context entropy) and tested the associations between these indicators and vocabulary outcomes. We used linear models to assess whether these indicators of diversity are related to children’s vocabulary. We standardised all predictors before running the analyses. For visualisation purposes, however, we used the raw scores. Since we did not expect these associations to be linear, we compared the linear models with the respective quadratic models using the R anova() function and model comparison criteria (i.e., Akaike Information Criterion, Bayesian Information Criterion, adjusted R² and χ²) and report the findings of the better-fitting models in Table 1 (for model comparisons, see Tables S6–S13 in Supplementary Materials).
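The logic of comparing a linear with a quadratic model can be sketched in Python. This is not the analysis code of the study (which used R): the sketch fits single-predictor polynomials by ordinary least squares, compares them only on the Akaike Information Criterion in its least-squares form (defined up to an additive constant), and uses made-up data following an inverted-U pattern. All function names and values are assumptions.

```python
import math


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns coefficients, lowest order first."""
    size = degree + 1
    # Augmented normal equations [X'X | X'y] for the polynomial design matrix.
    matrix = [[sum(xi ** (r + c) for xi in x) for c in range(size)]
              + [sum(yi * xi ** r for xi, yi in zip(x, y))]
              for r in range(size)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(matrix[r][col]))
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        for r in range(size):
            if r != col:
                factor = matrix[r][col] / matrix[col][col]
                matrix[r] = [a - factor * b for a, b in zip(matrix[r], matrix[col])]
    return [matrix[r][size] / matrix[r][r] for r in range(size)]


def aic(x, y, coefficients):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    n = len(x)
    rss = sum((yi - sum(c * xi ** k for k, c in enumerate(coefficients))) ** 2
              for xi, yi in zip(x, y))
    k = len(coefficients) + 1  # regression coefficients plus residual variance
    return n * math.log(rss / n) + 2 * k


# Made-up standardised predictor and vocabulary scores (inverted-U pattern).
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [36.3, 37.65, 38.8, 39.8, 40.1, 39.5, 39.2, 37.9, 35.7]

linear = fit_polynomial(x, y, 1)
quadratic = fit_polynomial(x, y, 2)
aic_linear = aic(x, y, linear)
aic_quadratic = aic(x, y, quadratic)
print(aic_linear, aic_quadratic)
```

For this made-up data set, the quadratic model has the lower AIC and a negative coefficient on the squared term, the pattern that would favour reporting the quadratic model under the comparison described above.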
Table 1. Best fitting models

Note: ‘LD’ stands for linguistic distance and ‘LD group’ refers to the low and high linguistic distance group categories. For model comparisons, see Tables S6 to S13 in Supplementary Materials.
6.2.1. Linguistic distance
Linguistic distance was analysed only among multilingual children. It was not normally distributed in our sample. Therefore, we performed a clustering analysis using Mclust in R (Scrucca et al., Reference Scrucca, Fraley, Murphy and Adrian2023) to explore different categories in the data (see Table S3 in Supplementary Materials). We identified two groups, which we use for further analyses: low (M = 72.12, SD = 1.07) and high (M = 93.41, SD = 6.27) linguistic distance. We refer to the linguistic distance as ‘low’ and ‘high’ as it refers to the ASJP computation of level of similarity. The low linguistic distance group (n = 33) includes the following languages: Danish, Dutch, English and Swedish. The high linguistic distance group (n = 100) includes the languages Italian, French, Spanish, Portuguese, Hungarian, Romansh, Czech, Greek, Mandarin, Croatian, Tagalog, Vietnamese, Arabic, Tigrinya, Albanian, Russian and Slovakian. We ran three linear models predicting multilinguals’ dominant language, non-dominant language, conceptual and total vocabulary using linguistic distance group, language entropy and context entropy as predictor variables and children’s age in months as a covariate (see Table 1).
We found a significant effect of linguistic distance on vocabulary size: higher linguistic distance was associated with larger receptive vocabulary in the children’s dominant language, F(2,124) = 8.20, p = .026, Cohen’s f² = .06. We did not find a significant relation between linguistic distance and conceptual and total vocabulary scores or the children’s vocabulary scores in their non-dominant language (all ps > .060). For the results of each model, see Table A3 in the Appendix and Figure 2C–F.
To better understand the relationship between linguistic distance and vocabulary size, we ran non-preregistered exploratory analyses focusing on the multilingual children in the category with higher linguistic distance. We ran linear models with the linguistic distance as a continuous variable on all measures of vocabulary size. For multilingual children in the high linguistic distance group, greater linguistic distance was negatively associated with vocabulary size in the dominant language, F(2, 93) = 5.32, p = .032, adjusted R² = .08, Cohen’s f² = .11. This effect was not found in non-dominant language vocabulary, conceptual and total vocabularies (all ps > .100), see Figure 2C–F and Table A4 in Appendix.
To explore the possible quadratic relationship between linguistic distance and dominant language vocabulary, we ran an additional non-preregistered quadratic model including all multilingual children, predicting their dominant vocabulary based on the continuous measure of the linguistic distance. This model confirmed a quadratic relationship between linguistic distance and children’s receptive vocabulary in their dominant language, F(3, 123) = 7.83, p = .005, adjusted R² = .14, Cohen’s f² = .19.
6.2.2. Language entropy
Language entropy was calculated based on the proportion of children’s awake time spent exposed to different languages. The resulting measure captures the degree of balance in the language exposure, with higher values reflecting a more even distribution of time across languages. For example, when children are exposed to two languages, language entropy ranges from 0 to 1.00, with a value of 1.00 indicating perfectly equal exposure. For children exposed to three languages, entropy can range from 0 to approximately 1.58, where 1.58 represents equal exposure to all three languages.
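The ranges given above are consistent with Shannon entropy computed with a base-2 logarithm over the exposure proportions. The following minimal sketch illustrates that computation under this assumption; the function name exposure_entropy and the example proportions are hypothetical and are not taken from the study’s analysis code.

```python
import math


def exposure_entropy(proportions):
    """Shannon entropy (base 2) of exposure shares across languages or contexts.

    The shares are renormalised to sum to 1; zero shares contribute nothing.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("at least one exposure share must be positive")
    entropy = 0.0
    for p in proportions:
        share = p / total
        if share > 0:
            entropy -= share * math.log2(share)
    return entropy


# Hypothetical exposure profiles (share of awake time per language).
print(exposure_entropy([0.5, 0.5]))        # balanced exposure to two languages
print(exposure_entropy([0.8, 0.2]))        # unbalanced exposure to two languages
print(exposure_entropy([1/3, 1/3, 1/3]))   # balanced exposure to three languages
```

Under this definition, a child who hears a single language has an entropy of 0, and the maximum for k equally weighted languages is log2(k), which gives 1.00 for two and approximately 1.58 for three languages.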
Language entropy was analysed only among multilingual children. We compared several linear regression models, including linear and higher-order terms of language entropy (for best-fitting model, see Table 1). For full details on the fit statistics and comparison of the tested models, see Table S4 in Supplementary Materials. Furthermore, because the results showed different effects on vocabulary for groups with lower and higher linguistic distances, we included these categories as control variables, deviating from the pre-registration.
Language entropy was quadratically associated with dominant vocabulary, F(4,122) = 4.73, p = .018, R² = .14, Cohen’s f² = .20. Higher language entropy is associated with higher vocabulary scores in the dominant language, up to a threshold of 0.94. Beyond this point, higher language entropy is associated with smaller vocabulary scores in the dominant language. Similarly, we found a curvilinear relationship between language entropy and children’s vocabulary in their non-dominant language (F(3, 120) = 6.54, p = .036, R² = .09, Cohen’s f² = .10). Higher language entropy is associated with higher vocabulary scores in the non-dominant language up to an entropy of 1.00, after which higher entropy is associated with smaller vocabulary scores. We found the same effect for multilingual children’s total vocabulary (F(3, 119) = 9.46, p = .003, R² = .17, Cohen’s f² = .24). Higher language entropy is related to higher total vocabulary scores up to an entropy of 0.98, after which higher entropy scores are associated with lower total vocabulary scores.
In contrast, we did not find a significant association between language entropy and children’s conceptual vocabulary scores, p > .050, see Table A5 in the Appendix and Figure 3.
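A threshold of the kind reported for the quadratic models above corresponds to the turning point of a fitted parabola, which lies at minus the linear coefficient divided by twice the quadratic coefficient. The sketch below illustrates this under stated assumptions: the coefficients, mean and standard deviation are invented for demonstration and are not the study’s estimates, and the function names are hypothetical. Because the predictors were standardised, the turning point is first obtained on the standardised scale and then converted back to raw entropy units.

```python
def turning_point(b_linear, b_quadratic):
    """Predictor value at which b0 + b_linear*x + b_quadratic*x**2 peaks or bottoms out."""
    if b_quadratic == 0:
        raise ValueError("a model without a quadratic term has no turning point")
    return -b_linear / (2.0 * b_quadratic)


def to_raw_scale(z_value, mean, sd):
    """Convert a standardised predictor value back to its original units."""
    return mean + sd * z_value


# Invented coefficients from a hypothetical model with a standardised predictor.
b_linear, b_quadratic = 0.5, -1.0      # negative quadratic term: inverted U
z_peak = turning_point(b_linear, b_quadratic)

# Invented sample mean and standard deviation of the raw entropy scores.
raw_peak = to_raw_scale(z_peak, mean=0.85, sd=0.36)
print(f"peak at z = {z_peak:.2f}, raw entropy = {raw_peak:.2f}")
```

A negative quadratic coefficient means that the turning point is a maximum, so predicted vocabulary rises up to that value and falls beyond it.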

Figure 3. Associations of language entropy and children’s vocabulary scores. The solid line shows the predicted vocabulary scores from the statistical model. The shaded area represents the 95% confidence interval around these predictions.
6.2.3. Context entropy
Context entropy was calculated based on the proportion of awake time children spent across different caregiving environments. Entropy values were computed over four contexts: primary caregivers, secondary caregivers, institutional childcare, and kindergarten. The resulting entropy reflects how evenly children’s time was distributed across these contexts, with higher values indicating a more balanced exposure. For example, for children exposed to two different contexts, their context entropy falls anywhere between 0 and 1.00, with 1.00 being exactly balanced exposure to two contexts. For children exposed to three contexts, their entropy lies anywhere between 0 and 1.58, with 1.58 representing their time evenly distributed across three contexts.
The models on context entropy include monolingual and multilingual children. We compared linear regression models with both linear and quadratic terms for context entropy, as preregistered. The model with a linear term for context entropy was the best fit (see Table 1 and Tables S8 and S9 in Supplementary Materials for fit statistics).
The results showed no significant associations between context entropy and either total or conceptual vocabulary across monolingual and multilingual children, F(3,241) = 8.70, p = .288 and F(3,241) = 113.70, p = .154, see Table A6. We ran additional, non-preregistered linear regressions for multilingual children only, to explore whether context entropy is differently related to children’s dominant vocabulary than to their non-dominant vocabulary. For full details on model fit and comparison, see Tables S10–S13 in Supplementary Materials. Context entropy showed a curvilinear relationship to dominant vocabulary for multilingual children, F(3,122) = 5.09, p = .037, R² = .09, Cohen’s f² = .13, meaning that higher context entropy scores are associated with higher vocabulary scores in the children’s dominant language, up to a threshold of 1.03, after which higher context entropy scores are associated with lower vocabulary scores in the children’s dominant language (see Figure 4 and Table A7 in Appendix). We found no significant associations between context entropy and non-dominant vocabulary, or conceptual and total vocabulary (all ps > .05), see Table A7.

Figure 4. Associations of context entropy and children’s vocabulary scores. The solid line shows the predicted vocabulary scores from the statistical model. The shaded area represents the 95% confidence interval around these predictions. Note that the vocabulary scores on the dominant language (c) and the non-dominant language (d) are based on data of multilingual children only.
7. Discussion
Children’s communicative environments shape their development, with multilingual children experiencing distinct environments compared to their monolingual peers. However, research often simplifies language status as a binary variable, overlooking the variability within multilingual environments and its nuanced influence on developmental outcomes (Kremin & Byers-Heinlein, 2021). Here, we quantified the communicative environment of monolingual and multilingual preschool children using three indicators of communicative complexity (linguistic distance, language entropy and context entropy) and investigated the associations between these measures and vocabulary outcomes. In line with previous studies, monolinguals scored higher than multilinguals on a receptive vocabulary task in their dominant language (Bialystok & Barac, 2012). However, when considering multilinguals’ vocabulary in their two most dominant languages, no significant differences were found between multilingual and monolingual children’s conceptual vocabulary scores (Hoff et al., 2012; Pearson et al., 1993). Additionally, we observed greater variability in multilinguals’ language outcomes. This variability appears to be partly due to differences in multilinguals’ linguistic distance, language entropy and context entropy. Our findings indicate that linguistic distance is associated with vocabulary size in children’s dominant but not in their non-dominant language. Regarding language entropy, we found that higher language entropy up to a threshold of around 1.00 is associated with larger vocabulary sizes in total, dominant and non-dominant languages, after which the relationship reverses. As entropy peaks at 1.00 for children exposed to two languages, an entropy value higher than 1.00 implies exposure to at least a third language. Finally, context entropy showed a curvilinear association with dominant language vocabulary among multilingual children, but no significant association with their non-dominant language vocabulary.
7.1. Linguistic distance
Linguistic distance was associated with an increased receptive vocabulary in the dominant language, suggesting that greater separation between languages may enhance vocabulary size or language processing speed (Marian et al., 2008). However, we found a negative association between linguistic distance and vocabulary scores for children within the high linguistic distance group. This group consisted of children speaking a non-Germanic language and Swiss German. In other words, while children with high linguistic distances tend to score better on vocabulary than children whose second language is more closely related to the societal language, further increases in linguistic distance are linked to lower vocabulary scores. To explore this further, we examined correlations between the three indicators of diversity. We found only small associations (all rs < .20), which makes it unlikely that multicollinearity is driving this effect (for models including all three indicators, see Tables S14–S16 in the Supplementary Materials).
Our finding aligns with previous research (Gampe et al., 2021), suggesting that the benefits of higher linguistic distance may diminish beyond a certain point. One possible explanation for this pattern could be the nature of the languages in the low linguistic distance group in our sample, which consists exclusively of Germanic languages. These languages share substantial lexical (53%; Batubara & Widayati, 2022), grammatical and phonological similarities with the societal language (Hansen & Kroonen, 2022). We speculate that this may lead to interference effects that hinder vocabulary acquisition. In contrast, languages with moderate linguistic distance may introduce more distinct structures and facilitate linguistic differentiation, resulting in better vocabulary outcomes. However, as linguistic distance continues to increase, the challenges associated with processing and learning such distinct languages may outweigh the benefits of diversity and could lead to lower vocabulary scores (Squires et al., 2020).
While a positive association between linguistic distance and vocabulary was observed in children’s dominant language, we did not find significant associations between linguistic distance and vocabulary in their non-dominant language. This suggests that linguistic distance may support vocabulary growth in the dominant language to some extent, but it does not necessarily impact vocabulary in a non-dominant language. The finding differs from previous studies that found a negative effect of linguistic distance on learning a second language (Jaekel et al., 2023; Van der Slik, 2010). These studies, however, focused on L2 acquisition later in life. Therefore, rather than one language impacting the acquisition of the other as seen in adult second-language learners, our results suggest that the simultaneous acquisition of both languages in early childhood may lead to a more interconnected process, where the linguistic distance can affect the acquisition of both languages.
In sum, our results suggest that, in early childhood, the acquisition of L1 in multilinguals is affected by the simultaneous acquisition of L2, similar to how L2 acquisition is affected by L1 in adult second-language learners. Nevertheless, the discrepancy found between the dominant and non-dominant language invites further investigation into the factors influencing non-dominant language acquisition, potentially including the role of exposure and language use frequency (e.g., daily exposure to and use of multiple languages or intensive exposure to individual languages during specific periods such as vacations).
7.2. Language entropy
Our analysis revealed that language entropy is not linearly related to vocabulary but follows a curvilinear pattern. Higher language entropy is related to larger vocabulary size, up to a threshold of 0.94 for the dominant language vocabulary, 1.00 for the non-dominant language, and 0.98 for the total vocabulary. Beyond these thresholds, language entropy is associated with lower vocabulary outcomes. This finding aligns with earlier research emphasising the importance of language exposure for vocabulary development (Huttenlocher, 1998; Rowe, 2012), particularly in multilingual contexts (language entropy > 1.00) where reduced exposure to each language can limit vocabulary growth (Hoff & Core, 2013). Our results further suggest that vocabulary acquisition is not solely driven by exposure; other factors, such as the balance between languages, play a significant role (Hoff & Ribot, 2017; Sander-Montant et al., 2023). A language entropy of 1.00 corresponds to the transition in exposure from two to three languages. In the case of bilinguals, that is, children speaking two languages, language entropy ranges from 0.00 to 1.00 and is highest when time is equally split between the two languages. An increase in language entropy means relatively less exposure to the language that is considered dominant and more exposure to the non-dominant language. Our findings, therefore, suggest that increased exposure to the non-dominant language in bilingual children is related to a greater vocabulary in their dominant and non-dominant languages. For the non-dominant language, this relationship is intuitive: more exposure naturally supports greater vocabulary development (Huttenlocher, 1998; Rowe, 2012; Rowe & Goldin-Meadow, 2009).
However, the association is less straightforward for the dominant language, because higher language entropy reflects reduced exposure to that language. One possible explanation for this asymmetry could lie in the interplay between the differences in the strength and stability of linguistic representations between the two languages (Kastenbaum et al., 2018) and the development of meta-linguistic awareness (Cummins, 1978; Huang, 2018). In the dominant language, the representation of words is typically stronger and more stable due to more frequent exposure and use. While meta-linguistic awareness develops across both languages, the stronger representations in the dominant language may allow bilingual children to better leverage this awareness in vocabulary acquisition in their dominant language (Varga, 2021; Zhang et al., 2017) than in their non-dominant language.
The curvilinear relationship between language entropy and multilingual children’s total vocabulary suggests that balanced language exposure supports vocabulary development, up to a threshold of 0.98, after which it declines. In contrast, we did not find a significant association between language entropy and children’s conceptual vocabulary. One possible explanation is that conceptual vocabulary, which reflects children’s knowledge of concepts regardless of the language used to label them, may be less sensitive to the overall balance of language exposure and more influenced by compartmentalisation of language use (Muszyńska et al., 2024b). That is, if certain languages are predominantly used in a specific context or for a specific activity, children may acquire concept-word mappings in a context-dependent manner. This kind of distributional pattern would not be captured by our current entropy measure, which does not account for the functional or contextual separation of languages.
For the multilingual children in our sample, that is, those learning more than two languages, language entropy is greater than 1.00 and is negatively associated with vocabulary size in the dominant language. This may be explained by a reduction in exposure to each language (Pearson et al., 1997; Unsworth, 2013), as a greater language entropy in multilinguals reflects a more balanced exposure across all languages (e.g., around 33% exposure to each of three languages). Additionally, multilingualism places demands on cognitive processing (Bialystok, 2009), which may become more pronounced with the number of languages children are exposed to. In sum, our findings support the idea that there is more to vocabulary acquisition than exposure alone (Byers-Heinlein, 2013), and other factors, such as the number of languages spoken, need to be considered.
7.3. Context entropy
Our investigation into context entropy revealed a positive association with dominant language vocabulary in multilingual children up to a context entropy of 1.03, after which the association becomes negative. This suggests that, initially, introducing a greater variety of social contexts may positively impact the dominant language vocabulary. For most children in our sample, their dominant language is the societal language. When the number of social contexts increases (e.g., through institutionalised daycare), children’s exposure to the societal language increases and diversifies, which may increase their vocabulary (see Table S14 in the Supplementary Materials; Soderstrom et al., 2018; Zaretsky & Lange, 2017). However, it has been suggested that children in institutionalised daycare get less direct attention and child-directed speech from group leaders than at home from caregivers (Peterson & Peterson, 1986). This could affect their vocabulary acquisition (Weisleder & Fernald, 2013). Hence, adding further contexts that offer less child-directed speech, at the expense of time in contexts that offer more, may affect language development negatively, which would explain the downward part of the curvilinear association. In sum, while increasing social context diversity may initially support dominant language vocabulary growth in multilingual children, excessive diversity could hinder development due to possibly reduced exposure to child-directed speech in certain contexts.
7.4. Contextual considerations and implications
The relationships observed in the current study highlight the complex interplay between language environments and vocabulary development, suggesting that the early language environment affects vocabulary differently depending on the language. The linguistic landscape in which this study was conducted, characterised by its linguistic and cultural diversity, offers a unique backdrop that contrasts with more homogeneous multilingual studies on language development (mostly English-Spanish; Francisco et al., 2006; Shiro et al., 2020; followed by Canadian English-French; Comeau et al., 2007; Nicoladis et al., 2009). The diversity of languages and cultures in the country in which this study was conducted likely increases individual differences in vocabulary development, highlighting the need to consider varied linguistic environments in multilingual research. Although this diversity can create variability, the specific traits of our study population may help reduce these differences. The homogeneity of our study population, regarding caregiver education and household income, enabled us to examine other influences on vocabulary. This could explain the smaller differences in vocabulary sizes between multilingual and monolingual children. Higher socio-economic status often correlates with larger vocabularies (Dicataldo & Roch, 2020), which may compensate for the smaller vocabulary sizes typically seen in multilingual preschoolers (Bialystok & Barac, 2012).
8. Limitations
This study offers several strengths, including a multidimensional approach to characterising multilingual environments, comprehensive vocabulary assessment across multiple languages and a linguistically and culturally diverse sample. However, some methodological limitations should be noted. First, the data were drawn from lab-based studies conducted between 2019 and 2024, which applied strict inclusion criteria: children were classified as multilingual only if they received more than 20% exposure to a second language, and as monolingual if exposure was below 10%. As a result, children with low levels of second language exposure were excluded, limiting the generalisability of the findings to those with moderate or high bilingual exposure. Future research should consider treating language exposure as a fully continuous variable and include a broader range of language experiences.
Second, the study assessed receptive vocabulary using nouns only. Prior work suggests that linguistic distance may affect productive and receptive vocabularies differently (Floccia et al., 2018; Kelley & Kohnert, 2012; Potapova et al., 2016), a difference we were unable to examine in this study. Further, receptive vocabulary tests do not provide an accurate representation of children’s linguistic competences (Bogue et al., 2014; Ukrainetz & Blomquist, 2002) and tests that sample nouns only may miss other word types a child may or may not know, which could reduce the precision and generalisability of the test results to overall vocabulary size (Stoeckel et al., 2021). Despite these limitations, this measure was chosen for its strong psychometric properties and its suitability for assessing vocabulary across multiple languages in multilingual children (Gampe et al., 2018).
9. Conclusion
In conclusion, we examined how different indicators of diversity (linguistic distance, language entropy and context entropy) relate to vocabulary outcomes in preschool children. Our findings show that these indicators are associated with vocabulary outcomes in multilingual children, particularly in their dominant language. Linguistic distance was positively associated with dominant language vocabulary scores, up to a threshold, after which linguistic distance shows a negative association. A similar pattern was observed for language entropy with vocabulary in children’s dominant and non-dominant language and for their total vocabulary. Moderate entropy, reflecting balanced exposure to two languages, was associated with larger vocabulary sizes, whereas high entropy levels, indicating exposure to more than two languages, were linked to smaller vocabulary sizes. Similarly, context entropy positively predicted vocabulary outcomes in only the dominant language up to a point, beyond which higher context entropy was related to smaller vocabulary sizes. These results confirm the need to consider multilingualism as a multidimensional, continuous variable and highlight that balance and structure of exposure matter, not just the amount.
Future research should continue to explore these dynamics, particularly in diverse linguistic environments constantly changing due to factors like globalisation and migration, to better understand multilingual development and address the unique needs of children growing up in diverse language environments.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S1366728925100758.
Data availability statement
The data supporting the findings of this study are openly available in Open Science Framework (OSF, https://osf.io/fhu5e/?view_only=285a0d10bd814d60b88ad0f11c138cc3).
Acknowledgements
We thank the Kleine Weltentdecker*innen Lab and the Jacobs Center for Productive Youth Development at Universität Zürich for providing the resources necessary for this research and all children and their caregivers for their participation in our studies.
Competing interests
The authors declare none.
Appendix
Table A1. Operationalisation of independent variables

Table A2. Operationalisation of dependent variables

Table A3. Linguistic distance and vocabulary sizes

Note: LD-group = Linguistic distance group. The reported category is the high linguistic distance group. The reference category is the low linguistic distance group. Due to missing data, sample sizes varied slightly in the different models: n = 127 for the dominant language vocabulary, n = 124 for the non-dominant language vocabulary, n = 131 for the conceptual vocabulary and n = 133 for the total vocabulary.
Table A4. Linguistic distance and vocabulary sizes in high linguistic distance group

Note: Linguistic distance (LD) as a continuous variable. Due to missing data, sample sizes varied slightly in the different models: n = 96 for the dominant language vocabulary, n = 95 for the non-dominant language vocabulary, n = 118 for the conceptual vocabulary and n = 100 for the total vocabulary.
Table A5. Language entropy and vocabulary sizes

Note: LE = Language entropy, LD-group = Linguistic distance group. Reported LD-group is the high linguistic distance group, with the low linguistic distance group as the category of reference. Due to missing data, sample sizes varied slightly in the different models: n = 127 for the dominant language vocabulary, n = 124 for the non-dominant language vocabulary, n = 131 for the conceptual vocabulary and n = 133 for the total vocabulary.
Table A6. Context entropy and vocabulary sizes in monolingual and multilingual children

Note: The reference category for language status was monolingual children.
Table A7. Context entropy and vocabulary sizes in multilingual children only

Note: Due to missing data, sample sizes varied slightly in the different models: n = 127 for the dominant language vocabulary, n = 123 for the non-dominant language vocabulary, n = 129 for the conceptual vocabulary and n = 132 for the total vocabulary. Only the best-fitting models are reported.