Roles of bilingualism and musicianship in resisting semantic or prosodic interference while recognizing emotion in sentences

Listeners can use the way people speak (prosody) or what people say (semantics) to infer vocal emotions. It can be speculated that bilinguals and musicians can better use the former rather than the latter compared to monolinguals and non-musicians. However, the literature to date has offered mixed evidence for this prosodic bias. Bilinguals and musicians are also arguably known for their ability to ignore distractors and can outperform monolinguals and non-musicians when prosodic and semantic cues conflict. In two online experiments, 1041 young adults listened to sentences with either matching or mismatching semantic and prosodic cues to emotions. 526 participants were asked to identify the emotion using the prosody and 515 using the semantics. In both experiments, performance suffered when cues conflicted, and in such conflicts, musicians outperformed non-musicians among bilinguals, but not among monolinguals. This finding supports an increased ability of bilingual musicians to inhibit irrelevant information in speech.


Introduction
Prosody can communicate a speaker's intent, attitude, or emotion with the use of acoustic variables such as pitch, intensity, and duration of speech segments (Botinis et al., 2001; Cutler et al., 1997; Lehiste, 1970). For example, anger is typically characterized by high pitch (often with a descending contour), high intensity levels, and a rapid and variable speech rate (Preti et al., 2016). Children begin to recognize prosody early in infancy (Friend, 2001; Mastropieri & Turkewitz, 1999), which has been linked to better social development, communication skills, and empathy (Baron-Cohen et al., 1985). However, it is not until age 5 that children begin to consistently label a speaker's emotional state using their tone of voice (Aguert et al., 2010, 2013; Sauter et al., 2013).
In daily conversation, the emotional prosody of speech can sometimes conflict with the semantic context (or the choice of words used to portray an emotion). In such cases, the understanding of emotional prosody is vital for understanding the true message of speech. For example, the utterance "What a great day" has positive or happy semantic content. However, if said in a sarcastic tone of voice, it would indicate the speaker's discontent. Thus far, the literature shows that when presented with incongruent semantic and prosodic cues to emotions in spoken sentences and specifically asked to use prosody, 4-year-old children will make judgements about a speaker's emotions based on semantic cues (Friend, 2000; Friend & Bryant, 2017; Morton & Trehub, 2001). However, at 10 years of age, children begin to grant more weight to prosodic (over semantic) cues in such situations (Morton & Trehub, 2001). Thus, at 10 years old there is a shift in the salience of semantic and paralinguistic cues to vocal emotions, moving toward a more adult-like ability. This is surprising given that prosody can be recognized very early in life, before children learn to speak and understand the semantic context of speech (Friend, 2001). One possible interpretation is that young children have a rudimentary understanding of the communicative role of vocal emotions and therefore grant more weight to semantic cues which may be utilized more easily for communication (Morton & Trehub, 2001). This raises the question of how the progressive mastery in a language along with general maturation effects eventually offset the balance between the use of semantic and prosodic cues to emotions.
While children are developing the ability to recognize speech prosody, many are also being exposed to a second language (Grosjean, 2010). Being bilingual or multilingual comes with several advantages. As bilinguals must learn when to use each language depending on the context, they are constantly making linguistic decisions, and may be better at handling conflicting demands. In addition to being able to communicate with more people, research has shown that being bilingual may contribute to better metalinguistic awareness, as well as advantages in several executive functions including inhibition, monitoring, and working memory (Adesope et al., 2010; Bialystok, 2015; Bialystok & Craik, 2010; Christoffels et al., 2013; Kroll & Bialystok, 2013; Yow & Markman, 2015, 2016). Executive functions are defined as high level cognitive control abilities that are involved in mental activity (Lehtonen et al., 2018) that arguably involve the prefrontal and parietal regions of the brain (Chung et al., 2014; Kang et al., 2022). Inhibitory control, or the ability to selectively attend to relevant information while ignoring irrelevant information, is one executive function of particular interest in the current study, as participants are required to attend to one cue while inhibiting another (Bialystok et al., 2005). However, the literature on bilingualism has not always shown an advantage in executive functions, such as working memory, conflict monitoring, and inhibitory control, among bilinguals (Lehtonen et al., 2018; Paap, 2019; Paap & Greenberg, 2013). There are also deficits seen in bilinguals' linguistic abilities across the lifespan (Bailey et al., 2020; Bialystok & Craik, 2010; see meta-analysis by Donnelly et al., 2019). Furthermore, previous research has failed to find a bilingual advantage or found a bilingual disadvantage in some cognitive tasks including metacognitive processing, attentional control, inhibitory control, etc. (e.g., see Folke et al., 2016; Paap & Greenberg, 2013; Paap et al., 2015, 2017, 2018). Thus, this body of knowledge does not point to a clear cognitive or executive functioning advantage for bilinguals, but perhaps specific advantages in prosody perception or reliance on prosody.
It is well established that both languages known by a bilingual are activated in all contexts, requiring selection mechanisms to attend to the appropriate language in a given listening situation (Bialystok, 2017; Lehtonen et al., 2018). Thus, bilinguals are constantly inhibiting one of their competing languages, while actively using the other. To reduce mental load, bilingual children may progressively learn to use a cue that is more consistent across individuals and languages: prosody. In a study where 4-year-old children were presented with sentences with conflicting prosodic and semantic cues, bilinguals showed an earlier ability to use prosodic cues than monolinguals (Yow & Markman, 2011). These results may be interpreted as a bilingual advantage in executive functioning (Bialystok, 1999; Costa et al., 2008; Kovács & Mehler, 2009) but could also be interpreted as a prosodic bias. For example, in Champoux-Larsson and Dylman's (2019) study, when asked to identify the emotion in the content (i.e., semantics) of words while ignoring prosody, bilingual children made more mistakes than monolingual children. When asked to identify the emotion in the prosody while ignoring the content, bilingual children made fewer mistakes than monolingual children and this difference increased both with age and with increased bilingual experience. Thus, bilingual 6- to 9-year-olds demonstrated a prosodic bias whereby they used prosodic cues to detect vocal emotions, even when prosody was the distractor. To delve further into the role of language exposure and the nature of the exposure received, some researchers have used questionnaires, such as the LEAP-Q (Marian et al., 2007), but in Champoux-Larsson and Dylman's (2019) study they used the Language and Social Background Questionnaire (LSQ; Luk & Bialystok, 2013) to rate participants (from 1: exposed to and used only one language to 4: equal exposure to and use of two languages) and classify them as bilingual or monolingual. Thus, these differing methods of measuring bilingualism in the field may explain the lack of consistency in finding a bilingual advantage. While these studies focus on child populations, more recently, Champoux-Larsson and Dylman (2021) found that this prosodic bias may continue into adulthood, but only under some experimental conditions. However, one factor not considered in these studies was the effect of musical training.
Musical training has been shown to have many cognitive benefits. For example, musicians have better working memory (George & Coch, 2011) and executive functions (D'Souza et al., 2018; Moradzadeh et al., 2015) than non-musicians. Musical training might also help the encoding of speech and more globally for processing language (Coffey et al., 2017; Patel, 2011; Shook et al., 2013; Tierney & Kraus, 2013; Tierney et al., 2013). This is not surprising given that music and language share many common features (Besson et al., 2011; Hausen et al., 2013; Peretz et al., 2015), including the communication of emotions (Paquette et al., 2018). Emotions can be recognized in music as their acoustic properties are similar to emotions depicted in speech (Juslin & Laukka, 2003). Musicians are also better than non-musicians at detecting pitch fluctuations in both music and language (Sares et al., 2018; Schön et al., 2004). The fact that these findings hold for both speech and music is promising as it suggests that this musician advantage (in perceiving the "musicality" of speech) may be robust to linguistic (and therefore semantic) influences in the speech materials. In real life, people convey emotions partly through prosody and partly through their choice of words, the latter being arguably more straightforward (Pell & Skorup, 2008; Shakuf et al., 2022). There is also some mixed evidence in the literature that suggests that musicians may be better at recognizing the emotional prosody in speech (Lima & Castro, 2011; Trimmer & Cuddy, 2008), depending on whether emotional intelligence is accounted for or not. To date, it is unknown whether adult musicians and non-musicians differ in their use of prosody versus semantics for emotion processing, but there could well be a musician advantage in parsing these cues when they conflict. Finally, it is important to note that in this field (just like in the bilingualism field), the definition of musicianship is one of the possible reasons why mixed results have been found. For example, George and Coch (2011) defined a musician as someone who has studied music for 9 or more years, began playing prior to age 10, continuously studies the same instrument, and actively studies music. D'Souza et al. (2018) defined a musician as someone who has at least 8 years of experience playing and performing music, began training around 7 years old, and practices regularly. Moradzadeh et al.'s (2015) musicians had an average of 12 years of formal musical training, 90% had music theory training, 83% had ear training, and on average rated themselves 3.25 or having "good" sight-reading ability on a 5-point scale (1 = "beginner" and 5 = "expert"). The advantages that musicians exhibit in different tasks could rely more on some of the aforementioned variables than others, which could explain the differing results.
To the best of our knowledge, very few research groups have looked at the individual and combined effects of bilingualism and musicianship on vocal emotion recognition abilities in a single study and in an adult sample. Bialystok and DePape (2009) used an auditory Stroop task, where listeners were instructed to attend to prosody or to semantics of single words (and not sentences) with an emotional meaning. They found that adult musicians (monolingual) responded more quickly than bilinguals and monolinguals (both non-musicians) in the prosody task, but there was no group difference in the semantics task. Similarly, Graham and Lakshmanan (2018) largely replicated the same design but only included a prosody task. They found that adult musicians (monolingual) had reduced reaction times on incongruent trials and smaller cognitive costs compared to the bilinguals (non-musicians and non-tone second language) but did not differ from monolinguals (non-musicians) or tone language bilinguals. However, neither study looked at the combined effects of bilingualism and musicianship. This is a striking gap given that both factors could facilitate the recognition of emotional prosody: being both a bilingual and musician may have additive effects.
The current study addresses this gap using an orthogonal 2×2 design to examine the contribution of each factor (bilingualism and musicianship) and their possible interaction, in relying on prosody versus semantics (or vice-versa) when recognizing emotions in sentences. More specifically, in situations of conflict, we hypothesized that bilingual adults would either demonstrate a prosodic bias as seen in children (whereby they would outperform monolinguals when asked to use prosody but perform worse than monolinguals when asked to use semantics) or they would demonstrate an inhibitory control advantage making them more resistant than monolinguals to distractors in both tasks. For musicians, the idea of a prosodic bias received mixed evidence so we favored the inhibitory control idea: we hypothesized that musicians would outperform non-musicians both when asked to use prosodic cues and when asked to use semantic cues to emotions, as long as the cues conflicted with one another. To test this, we designed two separate studies to mirror each other, with participants either attending to prosody (Experiment 1) or semantics (Experiment 2) to report the emotion contained in sentences, a sort of emotional Stroop task. Note that these experiments were run with two independent sets of participants for two reasons: 1) to avoid the same participant switching between the two tasks and changing listening/communicative strategies; and 2) to avoid exposing a participant to the same sentence twice (a within-subject design would have halved the number of trials per task with such a constraint).

Participants
A total of 1086 participants across two experiments were recruited through Prolific (https://prolific.co/), an online recruitment platform. Recruitment was open only to specific English-dominant countries (Australia, Ireland, New Zealand, United Kingdom, and United States). Four separate batches were collected for each experiment (for a total of 8 batches): bilingual musicians, bilingual non-musicians, monolingual musicians, and monolingual non-musicians. The batches were based on filters for bilingualism and musicianship available in Prolific. For bilinguals, this meant answering "English" to the question "What is your first language" (just like for monolinguals) and answering "native +1 or native +2 other languages" to the question "Apart from your native language, do you speak any other languages fluently?". For musicianship, this meant answering "Yes. For 5+ years." to the question "Do you play a musical instrument, if so for how many years?" Forty-five participants either had technical difficulties (e.g., downloading the materials or browser issues) or did not complete their respective experiment and were thus excluded from the analyses. None of the participants had concerns about their hearing, but two participants (0.19%) reported having mental health issues (still included). The final sample included 526 participants (N Females = 271, N Males = 253, and N Prefer not to say = 2) in Experiment 1 and 515 participants (N Females = 298 and N Males = 217) in Experiment 2. All participants were between 18 and 41 years old (Experiment 1: M = 25.35, SD = 5.94; Experiment 2: M = 24.23, SD = 4.95).
Within our experimental interface, participants were asked about their language and musical background, and based on these answers (not their answers on the Prolific filters), they were divided into four groups: bilingual musicians (N Experiment 1 = 177, N Experiment 2 = 171), bilingual non-musicians (N Experiment 1 = 114, N Experiment 2 = 101), monolingual musicians (N Experiment 1 = 138, N Experiment 2 = 144), and monolingual non-musicians (N Experiment 1 = 97, N Experiment 2 = 99). Participants were asked "How many languages do you know in total?" and then required to give the name of each language. For each language entered, participants were then asked "At what age did you begin learning this language?", "How proficient are you in this language?", and "In the past year, how much have you used this language in daily life? 0 = Never, 10 = Exclusively." The same questions were then asked, replacing the word language by instrument. The group classification was intentionally simple: monolinguals were participants who reported knowing only one language, English, while bilinguals reported knowing two or more languages (including English, their first language). Similarly, non-musicians were participants who did not play any musical instrument, while musicians reported playing one or more instruments. Note that this is not to deny the considerable variability within these groups. There is a notorious heterogeneity among bilinguals (e.g., de Bruin, 2019; Luk, 2015) and among musicians (Daly & Hall, 2018), so the information we recorded about their age of acquisition, proficiency, and use of each of their languages or musical instruments could allow us to probe further into the roles of bilingualism and musicianship. For example, it is known that early-trained musicians (before age 7) have behavioral benefits in auditory tasks (Bailey et al., 2020) and changes in cortical and sub-cortical networks compared to late-trained musicians (Penhune, 2019; Shenker et al., 2022; Vaquero et al., 2020). In our sample, roughly 30% reported learning their first instrument before age 7. On this basis, one might be tempted to narrow down our musician group definition by one demographic variable (and the same holds for bilingualism). However, given the dangers of dichotomizing such continuous variables (MacCallum et al., 2002) and the many possibilities for this dichotomization, we did not reclassify participants using arbitrary cut-offs (or manually dichotomize participants) from these metrics (i.e., age of acquisition, proficiency, or use). Instead, our variables were based on whether the participants reported a second language or not, or reported playing an instrument or not. In addition, we explored bilingualism and musicianship as continuous variables in regression approaches (see Figures S5 and S6 in Appendix S6).
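The classification rule described above can be illustrated with a minimal sketch. It is not the code used in the study; the `Background` record and the `classify` function are hypothetical names introduced here, and the only logic encoded is the rule stated in the text (two or more reported languages for bilinguals, one or more reported instruments for musicians).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Background:
    """Hypothetical record of one participant's self-reported background."""
    languages: List[str]    # every language the participant reports knowing
    instruments: List[str]  # every instrument the participant reports playing


def classify(bg: Background) -> str:
    """Return one of the four group labels of the 2x2 design."""
    # Bilingual: two or more reported languages; monolingual otherwise.
    language_group = "bilingual" if len(bg.languages) >= 2 else "monolingual"
    # Musician: at least one reported instrument; non-musician otherwise.
    music_group = "musician" if len(bg.instruments) >= 1 else "non-musician"
    return f"{language_group} {music_group}"


if __name__ == "__main__":
    print(classify(Background(["English", "French"], ["piano"])))
    print(classify(Background(["English"], [])))
```

Age of acquisition, proficiency, and use play no role in this rule, which mirrors the decision not to dichotomize those continuous metrics.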

Protocol
The participants recruited were bilinguals, monolinguals, musicians, and non-musicians interested in participating in an online study for compensation. Individuals interested in participating were redirected from Prolific to the experimental interface hosted on Pavlovia (an online platform for behavioural experiments), which was designed using the PsychoPy software (Peirce et al., 2019). All participants provided informed consent online in accordance with the Institutional Review Board at Concordia University (ref: 30013650) and were compensated £3.90 for their participation.
Written instructions were given to explain the task. Participants were asked to adjust the volume of their device to a comfortable level before beginning a practice block. In each experiment, the practice block consisted of 16 trials of auditory stimuli, half of which were congruent (matching semantics and prosody), and the other half were incongruent (differing semantics and prosody). Participants were asked to attend to the prosody of each sentence for those recruited in Experiment 1, or the semantics for those recruited in Experiment 2, such that there was no confusion (or switch) in the goal of the task. After the presentation of each sentence, the participants were asked to click on the word of the emotion that was expressed out of four possible options displayed in four quadrants of the screen: angry (top-left), calm (bottom-right), happy (top-right), or sad (bottom-left). To pass the practice block, the participants had to obtain a minimum of 75% correct (12 out of 16 trials correct). If this was not obtained, participants continued repeating the practice block until 75% was attained. Feedback on performance was provided for practice trials but not for test trials. After completing the practice, participants moved on to the test phase.
In each experiment, the test phase consisted of 144 trials of auditory stimuli split across three blocks (48 trials per block). In each block, half of the trials (24) were congruent, and the other half were incongruent (24). Trials were equally divided into the four emotions: angry, calm, happy, or sad. Participants were presented with audio recordings of the sentences and asked to choose which emotion was expressed out of the four possible buttons (same quadrants as in the practice). Each of the three blocks differed in the way in which the semantic and prosodic cues to emotions were swapped in the incongruent trials (see Figure 1). In the swap valence block, the valence, or positive-negative dimension, of the emotions was swapped (e.g., a semantically angry sentence enacted with a happy prosody). In the swap intensity block, the intensity, or high-low energy dimension, of the emotions was swapped (e.g., a semantically happy sentence enacted with a calm prosody). Finally, in the swap both block, both the intensity and valence of the emotions were swapped (e.g., a semantically angry sentence enacted with a calm prosody). The order in which these three blocks were presented was counterbalanced across participants.
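The three swap rules can be made concrete with a short sketch. It is an illustration rather than the experimental code: the placement of each emotion on the valence and intensity dimensions is inferred from the examples in the text (angry to happy for valence, happy to calm for intensity, angry to calm for both), and the block labels and function names are hypothetical.

```python
# Assumed (valence, intensity) coordinates of the four emotions,
# inferred from the swap examples given in the text.
EMOTIONS = {
    "angry": ("negative", "high"),
    "happy": ("positive", "high"),
    "sad": ("negative", "low"),
    "calm": ("positive", "low"),
}
BY_COORDINATES = {coords: name for name, coords in EMOTIONS.items()}
OPPOSITE = {"negative": "positive", "positive": "negative",
            "high": "low", "low": "high"}
BLOCKS = ("swap_valence", "swap_intensity", "swap_both")  # hypothetical labels


def incongruent_prosody(semantic_emotion: str, block: str) -> str:
    """Return the prosodic emotion paired with a sentence on an incongruent trial."""
    if block not in BLOCKS:
        raise ValueError(f"unknown block type: {block}")
    valence, intensity = EMOTIONS[semantic_emotion]
    if block in ("swap_valence", "swap_both"):
        valence = OPPOSITE[valence]      # flip the positive-negative dimension
    if block in ("swap_intensity", "swap_both"):
        intensity = OPPOSITE[intensity]  # flip the high-low energy dimension
    return BY_COORDINATES[(valence, intensity)]


if __name__ == "__main__":
    for block in BLOCKS:
        for emotion in EMOTIONS:
            print(block, emotion, "->", incongruent_prosody(emotion, block))
```

Under this reading, every semantic emotion has exactly one incongruent prosodic partner per block, so the type of conflict is constant within a block.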

Stimuli
All stimuli were created by the experimenters. They were produced and recorded by four speakers (2 males and 2 females) to generate variability and prevent listeners from learning speaker-specific manners of conveying emotions (either through their voice characteristics or speaking style). The list of 144 sentences can be found in Appendix S1 and contained 36 semantically angry sentences (e.g., "My sister gets on my nerves"), 36 semantically calm sentences (e.g., "Baths are relaxing"), 36 semantically happy sentences (e.g., "Let's go to Disneyland"), and 36 semantically sad sentences (e.g., "His grandmother died"). These four emotions were selected as we wanted one emotion from each quadrant (see Figure 1) to have emotions of both positive and negative valence, and high and low intensity. This also allowed us to do the block type analyses (see Appendix S6). The speakers read each sentence with the prosody of all four emotions to create congruent and incongruent stimuli, resulting in 576 recordings from each speaker. Thus, the full set consisted of 2304 stimuli in total. Of these, 144 were randomly selected for each participant, with no repetition of sentences. Each sentence was between 1.2 and 3.0 seconds long (M = 2.0, SD = 0.3).
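The stimulus counts follow from simple multiplication (144 sentences × 4 prosodies = 576 recordings per speaker; 576 × 4 speakers = 2304). The sketch below reproduces this arithmetic and shows one possible way of drawing 144 trials without repeating a sentence. The speaker labels, sentence identifiers, and the `draw_trials` function are hypothetical, and the random choice of incongruent prosody is a simplification: in the experiment itself the incongruent pairing depended on block type.

```python
import itertools
import random

EMOTIONS = ["angry", "calm", "happy", "sad"]
SPEAKERS = ["F1", "F2", "M1", "M2"]  # hypothetical labels for the four speakers
SENTENCES_PER_EMOTION = 36

# Hypothetical sentence identifiers paired with their semantic emotion.
SENTENCES = [(f"{emotion}_{i:02d}", emotion)
             for emotion in EMOTIONS
             for i in range(SENTENCES_PER_EMOTION)]

# Full stimulus set: every sentence, in every prosody, by every speaker.
STIMULI = [(sentence_id, semantic, prosody, speaker)
           for (sentence_id, semantic), prosody, speaker
           in itertools.product(SENTENCES, EMOTIONS, SPEAKERS)]


def draw_trials(rng: random.Random) -> list:
    """Draw one recording per sentence, half congruent within each emotion."""
    trials = []
    for emotion in EMOTIONS:
        ids = [sid for sid, semantic in SENTENCES if semantic == emotion]
        rng.shuffle(ids)
        for k, sentence_id in enumerate(ids):
            congruent = k < len(ids) // 2
            # Simplification: incongruent prosody is drawn at random here.
            prosody = emotion if congruent else rng.choice(
                [e for e in EMOTIONS if e != emotion])
            trials.append({"sentence": sentence_id, "semantic": emotion,
                           "prosody": prosody, "speaker": rng.choice(SPEAKERS),
                           "congruent": congruent})
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    print(len(SENTENCES), len(STIMULI))
    print(len(draw_trials(random.Random(0))))
```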
We conducted an analysis on the semantics of each sentence using the word2vec algorithm to ensure that each sentence depicted its intended emotion (see Appendix S2 for more details on this analysis). This analysis confirmed that, overall, each set of sentences contained a semantic content that reflected the intended emotion. However, this was somewhat difficult to demonstrate and can perhaps be improved with more advanced packages (Raji & de Melo, 2020). Similarly, we conducted an analysis on the prosody of each sentence, demonstrating that emotions were enacted by the four speakers as expected: angry productions were particularly fast and dynamic in their intensity contours, while sad productions were slow and more stationary; happy productions were particularly high in pitch and well intonated, while sad and calm productions were low and more monotonous. In each metric, however, it is clear that speakers had their own style (see Appendix S3 for more detail) and were only partially consistent with one another in how they conveyed emotions.

Equipment
Given that the present experiments took place online, we did not have rigorous control over the equipment and quality of sound. To address this limitation, we asked participants to indicate the audio device they were using (headphones, earbuds, external speakers, or default output from their PC/laptop), and asked them to rate the quality of their audio from 0-10 (where 0 is poor and 10 is excellent). There were no differences between groups in audio quality, F(3, 1037) = 0.22, p = .881, η² < .001; M = 6.3, SD = 0.86. There were also no differences between groups in the type of audio device used, χ²(9, N = 1041) = 16.52, p = .057, with about 26% of participants listening through headphones, 19% through earbuds, 17% through speakers, and 38% through their default computer output.

Demographic analyses
Separate 2-by-2 ANOVAs (musicianship by bilingualism) were run on the combined data from both studies to analyze whether the groups differed in the language metrics collected (age of acquisition, proficiency, and use) for first language, second language (if applicable), and first instrument (if applicable). Additionally, chi-square tests were used to compare other demographic variables, such as age, sex, employment status, and student status, between groups. Finally, all three metrics related to second language (L2; proficiency, use, and age of acquisition) were correlated with each other and all three metrics related to first instrument (I1; proficiency, use, and age of acquisition) were also correlated with each other.

Performance analyses
The measures of performance focused on sensitivity (d' values) and reaction times. Participants' responses (accuracy) were collapsed into confusion matrices, which were translated into hit and false alarm rates for each emotion. From these rates, we calculated d prime (d') values for each participant, which were then used as the dependent variable in linear mixed effects models to examine the recognition of emotional prosody in Experiment 1 and the recognition of emotional semantics in Experiment 2. There were two between-subject fixed factors, musicianship and bilingualism, where participants were either classified as a musician or as a non-musician and classified as a bilingual or as a monolingual. Finally, there was a within-subject fixed factor of trial type (incongruent or congruent condition). These models always contained random intercepts by subject, and random intercepts by emotion. Chi-square tests were conducted after each fixed term was progressively added to the model to evaluate main effects and interactions. The analyses were run separately for Experiment 1 and Experiment 2. Scores were also analyzed on a trial-by-trial basis (using logistic regressions; see Appendix S4) and the findings were consistent with the main analysis.
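As an illustration of the step from confusion matrix to sensitivity, the sketch below computes d' per emotion as z(hit rate) minus z(false alarm rate). It is a minimal sketch under stated assumptions, not the analysis code of the study: the function name and data layout are hypothetical, and the log-linear correction (adding 0.5 to counts and 1 to totals, so that rates of 0 or 1 stay finite) is an assumption, since the text does not state how extreme rates were handled.

```python
from statistics import NormalDist

EMOTIONS = ["angry", "calm", "happy", "sad"]


def d_prime_per_emotion(confusion: dict) -> dict:
    """Compute d' for each emotion from one participant's confusion matrix.

    confusion[target][response] holds the number of trials on which the
    target emotion was presented and the given response was chosen.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    result = {}
    for emotion in EMOTIONS:
        others = [t for t in EMOTIONS if t != emotion]
        # Hits: target emotion presented and correctly chosen.
        hits = confusion[emotion][emotion]
        signal_trials = sum(confusion[emotion].values())
        # False alarms: emotion chosen although another target was presented.
        false_alarms = sum(confusion[t][emotion] for t in others)
        noise_trials = sum(sum(confusion[t].values()) for t in others)
        # Assumed log-linear correction to avoid infinite z-scores.
        hit_rate = (hits + 0.5) / (signal_trials + 1)
        fa_rate = (false_alarms + 0.5) / (noise_trials + 1)
        result[emotion] = z(hit_rate) - z(fa_rate)
    return result


if __name__ == "__main__":
    perfect = {t: {r: (18 if r == t else 0) for r in EMOTIONS} for t in EMOTIONS}
    print(d_prime_per_emotion(perfect))
```

Computed separately for congruent and incongruent trials, these per-emotion values would be the dependent variable entered into the mixed models, with emotion treated as a random intercept.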
In the aforementioned analysis, the type of incongruency was ignored (i.e., block type was not considered). However, we designed this experiment such that the emotions portrayed by the semantic and prosodic cues were swapped in a particular fashion in each block: valence-based, intensity-based, or both valence and intensity (see section 2.2 and Figure 1). To examine this factor, d' values by block type were also used as the dependent variable in linear mixed effects models to examine the differences in performance by block type and group allocation averaged across the four emotions. For simplicity (i.e., to avoid complex 4-way interactions), we used the interference effect in d' units (congruent-incongruent) as the dependent variable, with musicianship, bilingualism, and block type (swap valence, swap intensity, or swap both) as fixed factors. This model contained random intercepts by subject. See Appendix S6 for the results and discussion of the block type results.
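The interference effect used in the block type analysis is a simple difference score. A minimal sketch follows, assuming that d' is first averaged across the four emotions and then subtracted (congruent minus incongruent); the function name and the example values are hypothetical and are not data from the study.

```python
from statistics import mean


def interference_effect(d_congruent: dict, d_incongruent: dict) -> float:
    """Return mean congruent d' minus mean incongruent d' (in d' units).

    Both arguments map each emotion to a d' value for one participant in
    one block type. Larger values mean more interference from the cue
    that should have been ignored.
    """
    return mean(d_congruent.values()) - mean(d_incongruent.values())


if __name__ == "__main__":
    # Made-up values for illustration only.
    congruent = {"angry": 3.0, "calm": 2.5, "happy": 3.5, "sad": 3.0}
    incongruent = {"angry": 2.0, "calm": 1.5, "happy": 2.5, "sad": 2.0}
    print(interference_effect(congruent, incongruent))
```

One such value per participant and block type yields three observations per participant, which is why this model needs a random intercept by subject only.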
Finally, the logarithm of the reaction time was used as the dependent variable in linear mixed effects models to examine how quickly participants responded, with trial type, musicianship, and bilingualism as fixed factors. This model again contained random intercepts by subject, and random intercepts by emotion. Each model was run using the lme4 package in R (Bates et al., 2015) and was run separately for Experiment 1 and Experiment 2. The emmeans package in R (Lenth, 2023) was used for all post-hoc comparisons with Tukey's HSD adjustment to control for the inflation of Type I error in multiple comparisons.

Demographics
First, we present the demographic data for the total sample (combining Experiments 1 and 2). These are not the main results of the present study; however, given our large sample size, they are valuable in that they may be generalizable to bilinguals and musicians overall (or at least those that can be found online). The means and standard deviations for each language variable (proficiency, use, and age of acquisition of first and second languages) and instrument variables (proficiency, use, and age of acquisition of first instrument) by experiment are presented in Table 1.
As demonstrated in Figure 2 (top two panels) there is diversity in participants' L2s and musical instruments. For their L2, bilingual musicians and bilingual non-musicians did not differ in age of acquisition, F(1, 561) = 0.38, p = .54, η² = 0.00067, or proficiency, F(1, 561) = 0.46, p = .50, η² = 0.00082, but they did differ in their use, F(3, 561) = 11.13, p < .001, η² = 0.0019, as bilingual non-musicians used their L2 more often than bilingual musicians, M Difference = 0.77, SE = 0.23, p < .001. Here again, this finding is not intuitive, and it is not clear whether it is a peculiarity of our online samples or whether this could reflect a generalizable tendency. Previous research has shown that musical training positively impacts second language proficiency (see review by Zeromskaite, 2014; Slevc & Miyake, 2006), so it is rather puzzling as to why it would have an opposite effect for use (especially knowing how correlated proficiency and use are). Perhaps this relates to the amount of free time: by engaging in extracurricular activities, musicians may simply have less time to practice their L2 compared to non-musicians. Next, we analyzed the two musician groups. Monolingual musicians acquired their first instrument 1.04 years (SE = 0.36) later than bilingual musicians, F(1, 628) = 8.44, p = .004, η² = 0.0143. Bilingual musicians were more proficient in their first instrument than monolingual musicians, M Difference = 0.36, SE = 0.15; F(1, 628) = 6.13, p = .014, η² = 0.010. However, they did not differ in use, F(1, 628) = 0.097, p = .76, η² = 0.00016. Thus, bilingual musicians learned their instrument earlier and were more proficient in it than monolingual musicians. Once again, we could not identify any similar finding in the literature, so it is unclear whether these results are a peculiarity of our samples or truly generalizable, but they further support the need to cross-investigate both factors in demographic analyses. We surmise that they might reflect environmental factors (home support, culture, and diligence regarding musicianship) which were not captured here by any other variable.

Language and instrument variable correlations
All three metrics related to L2 (proficiency, use, and age of acquisition) were correlated with each other with an R² above .092, p < .001 (see Figure 2 top left). These relationships held within bilingual musicians and within bilingual non-musicians, R² above .075, p < .001. In contrast, only some of the first instrument (I1) metrics (proficiency, use, and age of acquisition) were correlated with each other. Proficiency was correlated with use and age of acquisition, R² above .061, p < .001 (see Figure 2 top right). These relationships held within bilingual musicians and within monolingual musicians, R² above .041, p < .001. However, use and age of acquisition of first instrument were not correlated, R² = .0044, p = .100, and even though this link existed within bilingual musicians, it was weak, R² = .022, p = .005.

Experiment 1 - Performance in emotional prosody
Figure 3 depicts the d' results of Experiments 1 and 2. As a reminder, for Experiment 1, participants were instructed to respond to the prosody (and ignore semantic cues). There was a main effect of trial type, confirming that d' decreased for incongruent stimuli compared to congruent stimuli, thus demonstrating that the task worked as participants found it challenging to completely ignore semantics (see Table 2 for all model results). There was no main effect of bilingualism, no main effect of musicianship, and no interaction between the two. There was a two-way interaction between trial type and musicianship, no interaction between trial type and bilingualism, and a three-way interaction.
Dissecting the three-way interaction between trial type, bilingualism, and musicianship, there were no differences in performance between any of the groups on the congruent trials, p always above .963;differences were only seen on the incongruent trials.This confirms the idea that the factors of interest (bilingualism and musicianship) acted upon the resistance to semantic interference (i.e., correctly attending to prosody), but not on basic emotion recognition.More precisely, there was a differential effect of musicianship among monolinguals compared to bilinguals in this resistance: bilingual musicians were better able to resist the semantic interference than bilingual non-musicians, p < .001,whereas musicianship had no effect among monolinguals, p = .948.On the other hand, there was no effect of bilingualism among non-musicians, p = .338,or among musicians, p = .993,suggesting that, controlling for musicianship, bilingualism had no role.To summarize, musicians were good at attending to prosody and could thus resist semantic interference compared to non-musicians, but this effect appeared to be driven by bilinguals.However, this interaction may be driven by the bilingual non-musicians exhibiting the poorest performance compared to all other groups.
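For readers less familiar with the metric, the interference effect expressed in d' units (congruent minus incongruent trials) can be sketched as follows. This is only a minimal illustration: it assumes a standard yes/no signal detection formulation with a log-linear correction, and the function names, the correction, and the trial counts are hypothetical choices made here, not the analysis pipeline of the present study.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction (an assumption of this sketch): it keeps both
    # rates strictly between 0 and 1, where the z-transform is finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


def interference_effect(d_congruent, d_incongruent):
    """Congruent minus incongruent d'; smaller values mean less interference."""
    return d_congruent - d_incongruent


# Hypothetical counts for one listener (illustration only, not study data).
d_con = d_prime(hits=45, misses=5, false_alarms=4, correct_rejections=46)
d_inc = d_prime(hits=35, misses=15, false_alarms=10, correct_rejections=40)
print(f"interference effect: {interference_effect(d_con, d_inc):.2f} d' units")
```

Under these assumptions, a group that resists the distracting cue well would show a small difference between the two d' values, which is the quantity compared across groups in the incongruent-trial contrasts.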

Experiment 2 -Performance in emotional semantics
For Experiment 2, participants responded to semantics (and ignored prosodic cues). There was a main effect of trial type, confirming that d' decreased for incongruent stimuli compared to congruent stimuli, thus demonstrating that the task worked as participants found it challenging to completely ignore prosody (see Table 2 for all model results). There was no main effect of bilingualism, but the main effect of musicianship was statistically significant. Additionally, there was no interaction between bilingualism and musicianship. There was no interaction between trial type and musicianship, nor between trial type and bilingualism, but there was a significant three-way interaction.
Mirroring Experiment 1, there were no differences between groups on the congruent trials, p always above .915, but there were group differences on the incongruent trials, confirming the idea that the factors of interest (bilingualism and musicianship) acted upon the resistance to prosodic interference (i.e., correctly attending to semantics). More specifically, there was a differential effect of musicianship among bilinguals and not among monolinguals in this resistance. That is, bilingual musicians were better able to resist the prosodic interference than bilingual non-musicians, p = .0194, whereas musicianship had no role among monolinguals, p > .999. On the other hand, there was no effect of bilingualism among non-musicians, p = .179, or among musicians, p = .983. To summarize, musicians were also better than non-musicians at attending to semantics and thus at resisting prosodic interference, but this effect appeared rather exclusive to bilinguals. However, contrary to the first experiment, this interaction is less driven by the bilingual non-musician group.

Reaction Time
In both experiments, reaction times were longer in incongruent than in congruent trials (see Table A4.2 in Appendix S4.2). Specifically, they were about 3.00 versus 2.87 seconds in Experiment 1 and 3.24 versus 3.00 seconds in Experiment 2 (see Figure 4), but this 130-240 ms difference was not sensitive to group allocation.

Discussion
The goal of the present study was to examine how bilinguals and musicians would recognize vocal emotions based on prosodic or semantic cues, compared to monolinguals and non-musicians. As intended, all groups showed a performance reduction accompanied by a delayed reaction time in incongruent compared to congruent trials. Consistent with the literature, we found a musician advantage in both experiments, whereby musicians were less prone to interference from the distracting cue (be it prosodic or semantic). However, this advantage was only found among musicians who were also bilingual (i.e., in bilingual musicians). As for bilingualism on its own, we failed to observe a prosodic bias like the one seen in children (i.e., advantage in using prosodic cues and disadvantage in ignoring them) and we failed to see a bilingual advantage across the two tasks independent of musicianship. Furthermore, in Experiment 1, the interaction seems to be driven by the poorest performance exhibited by bilingual non-musicians. Taken together, these results do not point to differences in cue weighting across these four groups but rather to differences in executive functioning between musicians and non-musicians that are somehow amplified in bilinguals.
Regarding the protocol as an emotional Stroop task, it worked as expected and successfully created interference in processing in the incongruent trials. This was demonstrated by a reduction in accuracy of 10-20% and a delayed reaction time (by about 200 ms) in incongruent compared to congruent trials. Incongruent trials are of interest as they require listeners to pit two cues against each other, similar to situations of sarcastic speech encountered in everyday life. Previous studies on vocal emotion recognition have shown similar interference effects, where performance suffers and reaction times are delayed (Dupuis & Pichora-Fuller, 2010; Nygaard & Queen, 2008; Wurm et al., 2001). Our findings support the idea that experience with language and music can modulate the degree of confusion or the challenges posed by these sorts of ambiguous communicative modes.

Previous research in children
Based on findings in bilingual children, we had hypothesized that, even in adults, experience with multiple languages would influence the domain (prosody over semantics) primarily recruited in conflicting situations. Indeed, bilingual children begin using prosodic cues earlier than monolingual children (Yow & Markman, 2011) and show a prosodic bias in situations where prosodic and semantic cues to emotions conflict (Champoux-Larsson & Dylman, 2019). The present findings did not replicate the same pattern, suggesting that in young adulthood, bilingualism alone does not lead to greater reliance on prosodic cues. We speculate that with greater cognitive maturation and language development, bilinguals can offset their early bias towards prosody and change their listening strategies to make an appropriate use of emotional cues in speech. However, the current results clearly highlight the importance of controlling for both language and musical experience in these types of designs.

Previous research on the effects of bilingualism and musicianship individually
The present study accounts for both bilingualism and musicianship individually, as well as their combined effects. We added a group of bilingual musicians for a fully orthogonal sampling structure, which is rarely done in studies on vocal emotion recognition. This turned out to be critical as our findings generally support a musician advantage effect that is largely exaggerated among (if not exclusive to) bilinguals. Previous studies looking at each factor separately had revealed a musician advantage in a prosody task (Bialystok & DePape, 2009; Graham & Lakshmanan, 2018) but not in a semantics task (Bialystok & DePape, 2009). We think that this latter discrepancy may be due partly to the rudimentary nature of the semantic material used in these previous studies (i.e., using the words "high" vs "low" and not emotionally loaded sentences). Thus, group differences could have been missed in the role of semantics for reasons related to task complexity. If this interpretation is correct, the musician advantage may be found in either domain (prosody or semantics) but would be easier to observe when placing participants in richer linguistic environments, which would surely have ecological relevance. Notably, however, the musician advantage effect that we observed among bilinguals was slightly smaller in the semantics task than in the prosody task. This difference therefore goes in a direction consistent with the contrast highlighted by Bialystok and DePape (2009). So, the nature of the task is important, but perhaps more important is the need to control for language exposure among musicians and non-musicians.
One interesting avenue to try and make sense of the difference between bilingual musicians and non-musicians comes from insight gleaned from Schwartz and Kroll (2006). In cognitive tasks involving language, both languages are activated and influence performance even if the intention is to process information in one language only, a process referred to as non-selective lexical activation. Schwartz and Kroll (2006) found that non-selectivity is reduced when sentences provide richer semantic context, as if the brain was primed to navigate within a targeted language. In our study, Experiment 2 did guide participants towards semantics in general and could have limited this non-selectivity, but our first experiment would have done precisely the opposite. The fact that bilingual non-musicians (and not bilingual musicians) showed a particularly poor performance in Experiment 1 but less so in Experiment 2 suggests that they may be especially prone to non-selectivity. It follows that music training could mitigate the impact of non-selective lexical activation among bilinguals. Exactly how is unclear, but perhaps it does so by organizing parallel (rather than common) networks for each language separately. This further emphasizes the importance of controlling for both language exposure and musical experience in cognitive tasks involving language.

Previous research on the combined effects of bilingualism and musicianship
In the few studies that did investigate bilingualism and musicianship simultaneously, findings are rather consistent with the present ones. Namely, it is musical training and not bilingualism that is more likely associated with benefits, specifically in task-switching and dual-task performance (Moradzadeh et al., 2015). Furthermore, Schroeder et al. (2016) disentangled a "true" interference effect (neutral minus incongruent trials) from a facilitation effect (congruent minus neutral trials) and from the Simon effect (congruent minus incongruent trials, as in the present study), albeit on a non-linguistic visual-spatial Simon task in the same four groups. They found that bilingual musicians had a smaller Simon effect compared to all other groups, consistent with the present findings. However, bilingual musicians, bilingual non-musicians, and monolingual musicians all had smaller interference effects compared to monolingual non-musicians. There were no differences in facilitation effects once confounding variables such as IQ and age were accounted for. Their results suggest an enhanced ability to suppress interfering cues shared among bilinguals, musicians, and bilingual musicians, but they propose that the Simon effect (congruent minus incongruent) is a more convoluted metric encompassing both facilitation and interference effects, making it harder to interpret. In the present study, we did not include semantically neutral sentences or sentences spoken with a neutral prosody, so we are unable to disentangle these different effects. It would be interesting to see whether the unique advantage of the combined musician and bilingual profile taps more into the facilitation than the interference effect. It is important to note, though, that these studies did not focus on vocal emotion recognition, but rather on executive functioning among these groups. However, based on the results of these studies, we could speculate that the present results may be due to better executive functioning among bilingual musicians.
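The relation between the three contrasts described by Schroeder et al. (2016) can be written out explicitly. As a sketch, let C, N, and I stand for performance on congruent, neutral, and incongruent trials, respectively (symbols introduced here for illustration only, not notation taken from either study):

```latex
\[
\underbrace{C - I}_{\text{Simon effect}}
  \;=\;
\underbrace{(C - N)}_{\text{facilitation}}
  \;+\;
\underbrace{(N - I)}_{\text{interference}}
\]
```

Because the neutral term cancels, a congruent-minus-incongruent score cannot, by itself, indicate how much of the effect is facilitation and how much is interference; a neutral condition is required to separate the two.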

The role of executive functions
While we see differences in performance between groups, they appear to reflect differences in executive functioning rather than differences in cue weighting. A difference in cue weighting would have resulted in bilingual musicians outperforming the other groups on one task and performing worse on the other task. For example, if they weighted prosody more heavily, then bilingual musicians' performance would have been best when asked to use prosody to detect vocal emotions as they would easily ignore anything unrelated to prosody (i.e., the semantic meaning of the sentence). Additionally, their performance would have been worse when prosody served as a distractor because they would still rely on these salient prosodic cues that do not necessarily help in the task of deciphering the semantic content of the sentences. Since we never observed this sort of advantage/disadvantage reversal between the two tasks, we must interpret the results based on elements that were common across the two tasks, hence a general advantage in executive functioning when making judgements about vocal emotions. Bilingual musicians were able to use the correct cue regardless of the task and did not favour one listening strategy over another. This may reflect better response inhibition, cognitive control, or cognitive flexibility that have been previously shown to be advantages associated with being bilingual (Bialystok & Craik, 2010; Costa et al., 2008; Krizman et al., 2012; Wiseheart et al., 2016) or being a musician (Bialystok & DePape, 2009; Strong & Mast, 2019; Zuk et al., 2014). However, previous research has been somewhat inconclusive on whether bilingualism and musicianship have benefits that extend beyond the realm of language and music, respectively, into other executive functions. Neither bilingualism nor musical experience (D'Souza et al., 2018; Lehtonen et al., 2018) has been unequivocally shown to facilitate executive functioning in adults.
Based on the current results, we speculate that this might be partly because the other factor (bilingualism or musicianship) was not controlled for. Given their individual roles, it makes sense that the interaction between these two skills provides additional benefits in executive functioning in certain situations. In simpler terms, the effects may be additive. However, executive functioning was not specifically measured in the present study, so this idea is only one possible interpretation. An alternative interpretation is that the musician advantage in executive functioning transfers to the language domain more easily in bilinguals.

Transfer effects
Overlap between music and language has been noted in acoustic properties (Besson et al., 2011; Hausen et al., 2013; Peretz et al., 2015) and the communication of emotions (Paquette et al., 2018). Also, there is substantial overlap in brain regions that process language and music (Fedorenko et al., 2009; Levitin, 2003; Maess et al., 2001; Patel & Iversen, 2007). So, one could speculate that the benefits of experience in one would transfer to the other. Cross-domain transfer effects have been reported from music to language (Besson et al., 2011; Bidelman et al., 2011; Moreno, 2009; Patel, 2011) and from language to music (Deroche et al., 2019a; Krishnan & Gandour, 2009), but the causality of music training, as opposed to inherent perceptual or cognitive aptitudes, is highly debated (Mankel & Bidelman, 2018; Penhune, 2019; see also McKay, 2021, for a review of this question within the hearing-impaired world). Patel (2011) argues that musical training leads to neuroplasticity in brain networks responsible for speech processing, resulting in better encoding of several features of speech, but this would occur only under certain conditions. Specifically, music training must allow for precise processing and discrimination of auditory information in these networks, connect to emotional rewards, be associated with focused attention, and be repeated frequently, for such transfers to occur. These criteria are all common to language learning and are good reasons why musical training may benefit the acquisition of a second language (Chobert & Besson, 2013). In sum, individuals who receive musical training and learn multiple languages might have a unique opportunity to develop neural networks that are critical to the encoding of certain aspects of speech (perhaps particularly affective cues) necessary to decode emotions in sentences. However, once again, this is only speculative and further research needs to be done to better understand why such a transfer from music training to the language domain would not occur (or not as easily) in monolinguals.

Emotional intelligence
Another variable that may account for, or mediate, some of the current results is emotional intelligence. Not surprisingly, higher emotional intelligence has been linked to better recognition of emotions. Alqarni and Dewaele (2020) found that participants who have higher trait emotional intelligence (i.e., the construct that relies more on perception of one's own emotions) were better at perceiving and interpreting emotions from audiovisual recordings. Crucially, they found that bilinguals had higher trait emotional intelligence than monolinguals. However, the effect sizes for each of these results were small (Cohen's d of about 0.30). Furthermore, Trimmer and Cuddy (2008) found that emotional prosody discrimination was related to emotional intelligence scores but not musical training (contradicting other reports; see Introduction). Also, musical training has not been linked to higher emotional intelligence (Schellenberg, 2011; Trimmer & Cuddy, 2008) and, to our knowledge, there are no studies on emotional intelligence in individuals who are both musicians and bilingual. Thus, if differences in emotional intelligence were a concern for this study, one might have expected them to enhance performance among bilinguals but not musicians, which was not what we observed. Also, we would expect this variable to affect performance on congruent trials as well, whereas group differences were exclusive to incongruent trials here. For these reasons, we suspect that it is unlikely to explain the current results.

Socioeconomic status
We might equally wonder whether socioeconomic status (SES) could partially explain the results, as SES is known to affect findings in bilingualism research in particular. Some studies have found SES to be a potential confound when assessing a bilingual advantage in the Simon task (Morton & Harper, 2007), while others have controlled for SES and continued to find a bilingual advantage in inhibitory control (Emmorey et al., 2008; Filippi et al., 2022; Nair et al., 2017). So, this is an ongoing debate. But perhaps most relevant here, Naeem et al. (2018) found that being a bilingual compared to a monolingual had no effect on performance (in the Simon task) among individuals with high SES, but bilinguals outperformed monolinguals (on both congruent and incongruent trials) among individuals with low SES. As musicians are likely to have higher SES than non-musicians (Swaminathan & Schellenberg, 2018), one could have expected (from an SES-based interpretation) bilingualism to have little role among musicians, but a beneficial role among non-musicians. Again, this is not what we observed, and as for emotional intelligence, such an interpretation would affect both trial types whereas our findings pointed specifically to the incongruent trials. Thus, the present findings do not align easily with an interpretation based on SES differences, though more research should be done to account for this variable.

Limitations
Some limitations to the current study should be acknowledged. Given the nature of online studies, (1) there was generally a lack of control over the stimulus delivery as it was not administered in a controlled environment, (2) the bilingual and musician status of participants and the reliability of their self-reports could be questioned, and (3) the generalizability of online findings should be verified.
Here we respond to each of these concerns.
In response to the first concern, we asked participants to rate the quality of their audio and did not find any group difference in this regard. Also, performance on congruent trials (including reaction times) was overall decent and similar to previous studies (e.g., see Bialystok & DePape, 2009; Champoux-Larsson & Dylman, 2019; Moradzadeh et al., 2015). Thus, poorer audio quality than in a lab or a general lack of interest and attention towards an online task would be unlikely to explain the group difference found in this study.
Second, we relied on participants' self-reports to allocate them as either bilingual or monolingual, and as musician or non-musician. Tomoschuk et al. (2019) found that objective measures of language proficiency (e.g., picture naming or proficiency interviews) are better than self-rating of language proficiency, while other studies found self-report measures to be just as reliable as objective measures (Lim et al., 2008; Shameem, 1998). Thus, this concern is debatable; however, our analytical approach did not rely on precise estimates of age of acquisition, proficiency, and use, since we followed a categorical approach for the groups' definition. In other words, inaccuracies in self-reports would have had no consequence on our conclusion (but would have slightly affected the findings of Appendix S5, where continuous variables were used).
Third, the validity of online studies has been investigated in recent years. As outlined in the review by Chandler and Shapiro (2016), there are notable differences between the general population and online convenience samples. Several issues are relevant here, such as the realization that online samples tend to be younger than the general population and some groups may be either over- or under-represented (i.e., more participants tend to be Caucasian and Asian, and participants tend to be more educated). In this study specifically, we found that bilinguals were younger and more of them were students or unemployed compared to monolinguals. However, the extent to which these characteristics might be biased by being online samples is very difficult to assess. Of note, Eyal et al. (2021) found that the online platform Prolific (the one used here) provided higher quality data in terms of comprehension, attention, and honesty than MTurk (the online platform used in Chandler & Shapiro, 2016). We also see certain advantages to conducting the present study online: having a very large sample size that could reflect the heterogeneity of musicians and bilinguals; and being able to easily recruit English-speaking monolinguals (a fairly difficult thing to do in person in Québec). Thus, we believe that the benefits outweigh the disadvantages of using online platforms in some experimental designs, such as in the present study.

Conclusions and future directions
In conclusion, musical training appears to benefit the recognition of vocal emotions, whether the conflicting information comes from semantic or from prosodic cues, but only among bilinguals. We did not see a difference in cue weighting (e.g., a prosodic bias) when identifying vocal emotions among the groups, as previously seen in bilingual and monolingual children. Instead, differences may be due to enhanced executive functioning in bilingual musicians that results in better performance in incongruent trials. We speculate that this is because the enhanced executive functions of musicians are somehow strengthened, or transfer more easily to the language domain, in bilinguals than they do in monolinguals. This may be due to bilinguals being more flexible in their listening strategies or still figuring out the different ways to resolve conflictual situations of communicative intent.
This research has implications for educational and linguistic fields, but also for clinical areas such as in individuals growing up with degraded hearing. For example, school-aged children with cochlear implants or with hearing aids typically perform worse than their normal-hearing counterparts on tasks of emotional prosody (Barrett et al., 2020; Chatterjee et al., 2015; Deroche et al., 2019b; Lin et al., 2022; Most & Peled, 2007). Deficits in these tasks are often linked to poor pitch perception, but it may well be that these children also develop alternative strategies to recognize emotions in sentences. Some of these strategies could involve a stronger reliance on semantics and weaker reliance on prosody, or a different weighting among prosodic cues (e.g., using temporal and intensity cues more than pitch cues). Thus, understanding the particular circumstances or participant profiles that result in enhanced vocal emotion recognition may be beneficial to understanding how to improve these abilities in hearing-impaired and cochlear-implanted children and adults. Experiments are under way to run this exact paradigm in cochlear implant users.

Figure 1. Three different block types in the test phase. The blue arrows show a swap in valence, the orange arrows show a swap in intensity, and the green arrows show a swap in both intensity and valence.

Figure 2. Demographic data. Top left: Correlations between proficiency and use, or proficiency and age of acquisition for their second language. Top right: Correlations between proficiency and use, or proficiency and age of acquisition of their first instrument. Bottom left: Pie chart of types of second languages. Bottom right: Pie chart of classes of first instruments.

Figure 3. d' results. d' data by group and trial type for Experiment 1 (top left panel) and Experiment 2 (top right panel). Interaction between musicianship and bilingualism on the interference effect (congruent minus incongruent trials) expressed in d' units in Experiment 1 (bottom left panel) and Experiment 2 (bottom right panel), where lower values indicate better performance.

Figure 4. Reaction time results by trial type.Reaction time by trial type shown both with log reaction time and reaction time in seconds in Experiment 1 (top left) and Experiment 2 (top right) and by group in Experiment 1 (bottom left) and Experiment 2 (bottom right).

Table 1. The means and standard deviations for each language and instrument variable by experiment. Note: L1 = first language; L2 = second language; I1 = first instrument.

Table 2. Model results of the linear mixed effects models with d' as the dependent variable.