Introduction
Prosody – systematic variations in pitch, intensity, voice quality, and duration – shapes how listeners interpret a speaker’s message. Beyond its linguistic functions, prosody conveys affective and social information. Attitudinal prosody, in particular, signals interpersonal intentions such as doubt, confidence, or friendliness (Wichmann, 2000) and is essential for understanding meaning beyond the literal words. However, prosodic expressions of social intentions vary across languages, and listeners often misinterpret attitudinal cues in cross-linguistic communication (Chen et al., 2004; Shochi et al., 2006). As multilingual interaction becomes increasingly common, accurately decoding such cues in a second language (L2) has become an important yet challenging skill. The present study, therefore, examines how L2 learners perceive attitudinal prosody and how this process differs from first language (L1) comprehension, adopting an interdisciplinary approach that integrates behavioral, linguistic, and neural evidence. We also consider the potential role of learners’ real-world exposure in shaping their ability to interpret attitudinal prosody in the L2.
Attitudinal prosody in the first and second languages
In the L1, attitudinal prosody efficiently conveys speakers’ social intentions. Both children and adults reliably categorize attitudes such as doubt, confidence, or enthusiasm based solely on prosodic cues, with clear acoustic correlates in pitch, duration, and intensity (Esteve-Gibert & Prieto, 2013; Hellbernd & Sammler, 2016; Hupp et al., 2021). These findings highlight the intuitive, experience-based nature of attitudinal prosody processing in the L1. In contrast, interpreting attitudinal prosody in an L2 is substantially more difficult. Cross-linguistic studies show that languages differ in how they encode prosodic intentions, leading L2 learners to misinterpret attitudes even when acoustic cues are perceived correctly (Chen et al., 2004; Kang & Kermad, 2019; Rilliard et al., 2014; Shochi et al., 2006). For example, some attitudes, such as authority or interrogation, generalize across languages, whereas others, such as expressions of doubt, are difficult for Japanese learners of English to interpret (Shochi et al., 2009). This difficulty may partly reflect differences between English and Japanese in how prosodic prominence and pitch variation signal both linguistic structure and interpersonal stance. These findings suggest that L2 learners must acquire not only prosodic cues but also the language-specific pragmatic mappings linking prosody to communicative intention.
Attitudinal prosody relies on language-specific pragmatic mappings, making its acquisition in an L2 highly dependent on learners’ experience with meaningful exposure. Although higher proficiency reduces reliance on L1-based interpretive strategies (Chen, 2009; Shochi et al., 2016), L2 exposure may be especially critical. Naturalistic exposure allows learners to encounter authentic prosodic realizations of social intentions – patterns rarely taught explicitly in L2 classrooms (Kang et al., 2021) – and provides opportunities to infer how attitudes are encoded in real communicative settings. Behavioral and neurocognitive research shows that immersion supports prosody comprehension (Trofimovich & Baker, 2006) and induces structural and functional adaptations in language-control and socio-cognitive networks (DeLuca et al., 2019; Pliatsikas et al., 2017). Because attitudinal prosody is largely acquired implicitly, the quantity and quality of L2 exposure may be a particularly strong driver of successful interpretation in the L2, relative to learners’ proficiency.
Neural mechanisms of prosody processing
Neuroimaging research has shown that prosody processing relies on a distributed network that integrates auditory, linguistic, and socio-cognitive processes. Linguistic prosody, which supports sentence segmentation, focus marking, and syntactic interpretation, typically engages left-lateralized speech-language areas. Core regions include the left inferior frontal gyrus (IFG) and superior and middle temporal gyri, particularly when prosody is essential for sentence comprehension (Friederici, 2002, 2006; van der Burght et al., 2019). When prosodic cues are redundant or nonessential, processing shifts toward right-hemisphere auditory regions. In contrast, affective and attitudinal prosody additionally involve socio-cognitive systems implicated in inferring speakers’ emotional states, interpersonal stance, or epistemic commitments. Meta-analytic evidence shows that affective prosody activates audio-vocal and limbic regions (e.g. amygdala, anterior insula), whereas linguistic prosody primarily recruits the speech-language network (Belyk & Brown, 2014; Mitchell & Ross, 2013). Critically, attitudinal prosody, which conveys epistemic stance or interpersonal attitudes, engages theory-of-mind (ToM)–related regions such as the medial prefrontal cortex (mPFC), posterior cingulate cortex, and temporo-parietal junction, supporting the inference of communicative intent (Enrici et al., 2019; Hellbernd & Sammler, 2018; Tettamanti et al., 2017). These findings highlight the dual involvement of auditory–prosodic decoding and higher-order socio-cognitive evaluation in processing attitudinal meaning.
Processing prosody in an L2 imposes additional cognitive and linguistic demands. L2 sentence comprehension generally recruits more extensive frontal and temporal regions than L1, including greater engagement of the IFG and other language-control areas (Jeong et al., 2007; Liu & Cao, 2016; Perani & Abutalebi, 2005). Research on L2 linguistic prosody further shows increased activation in the anterior insula and superior frontal sulcus, likely reflecting heightened attentional and working-memory requirements (Gandour et al., 2007). These findings suggest that the difficulty of L2 prosody processing may arise from both perceptual challenges in decoding unfamiliar acoustic patterns and integrative demands in mapping these cues onto language-specific pragmatic meanings.
Despite the relevance of these systems, the neural mechanisms supporting attitudinal prosody in an L2 remain virtually unexplored. Clarifying these mechanisms is essential for determining whether the additional cognitive load in L2 attitudinal prosody is primarily perceptual (e.g. unfamiliar acoustic contours) or integrative (e.g. inferencing epistemic stance from less familiar pragmatic conventions). This question is central to understanding whether L2 learners recruit the same socio-cognitive and frontal networks as in their L1, or whether they rely on additional executive processes – an issue directly addressed in the present study.
Individual differences in L2 prosody perception
A range of individual learner characteristics may shape how attitudinal prosody is processed in a second language. Among these factors, L2 exposure has emerged as a particularly influential contributor, because it provides learners with opportunities to encounter authentic prosodic realizations of social intentions – patterns rarely taught explicitly in L2 classrooms. Behavioral studies show that greater immersion facilitates more accurate interpretation of prosodic and pragmatic cues (Kang et al., 2021; Trofimovich & Baker, 2006). Neurocognitive research also demonstrates that extensive L2 exposure leads to structural and functional adaptations in multiple learning systems. One line of research emphasizes procedural learning mechanisms, reporting experience-dependent expansions in the caudate nucleus, thalamus, and nucleus accumbens during sustained L2 immersion (DeLuca et al., 2019; Pliatsikas et al., 2017). Another line highlights the role of the hippocampal–parahippocampal system, traditionally linked to episodic and contextual learning, in supporting the integration of rich, naturalistic L2 input. Although these frameworks are often presented as distinct, they converge on the idea that real-world experience reshapes the neural systems that support L2 acquisition. The present study builds on this convergence by testing whether attitudinal prosody, a socially grounded, largely implicitly learned aspect of communication, engages both episodic (hippocampal) and procedural (thalamic–striatal) mechanisms in an L2.
Proficiency has also been shown to influence L2 attitudinal prosody processing. Learners with higher proficiency tend to rely less on L1-based interpretive strategies and demonstrate more accurate mappings between prosodic cues and communicative intentions (Chen, 2009; Shochi et al., 2016). Willingness to Communicate (WTC) may play a role as well, as it shapes how often learners engage in interactional situations in which attitudinal meanings can be perceived (Elahi Shirvan et al., 2019; Yashima, 2002). At the same time, prior work has highlighted naturalistic, informal exposure as a particularly important source of socially meaningful prosodic input, suggesting that exposure may function as a key experiential factor in the development of L2 attitudinal prosody sensitivity.
The present study
Despite growing evidence on the neural bases of prosody and the role of experiential factors in L2 learning, no research has examined the neural mechanisms supporting attitudinal prosody in a second language, nor how learner-related variables, especially naturalistic exposure, modulate this process. Previous work has focused primarily on L1 attitudinal prosody or on L2 linguistic prosody, leaving open the question of how L2 listeners integrate prosodic cues with language-specific pragmatic mappings to infer speakers’ social or epistemic intentions. The current fMRI study addresses this gap by comparing attitudinal and linguistic prosody processing in L1 Japanese and L2 English, and by examining how individual differences in L2 experience shape behavioral and neural outcomes. We pursued three research questions:
1. To what extent does L2 attitudinal prosody recruit neural systems that overlap with, or differ from, those engaged in the L1?
2. How does the interaction between prosody type (attitudinal vs. linguistic) and language (L1 vs. L2) modulate neural activation?
3. How do individual differences, particularly naturalistic, informal L2 exposure, shape behavioral and neural responses to attitudinal prosody?
Although the mapping between specific prosodic cues and attitudes may differ between Japanese and English, interpreting attitudinal prosody in either language requires listeners to infer the speaker’s communicative intention. We therefore predicted that L2 attitudinal prosody would recruit intention‐inference and socio‐cognitive systems that overlap with those engaged in the L1. At the same time, we predicted that L2 processing would place greater demands on executive and controlled interpretive processing regions, particularly the left IFG, because learners must map less familiar prosodic cues onto language-specific pragmatic and linguistic conventions. Finally, we predicted that learners with greater naturalistic L2 exposure would show stronger engagement of neural systems involved in experience-based interpretation of attitudinal prosody.
Methods
Participants
Thirty-six Japanese native speakers participated in this fMRI experiment. Six participants were excluded: four due to excessive head motion (more than 5 mm), and two for accuracy below chance level (50%). The remaining 30 participants (aged 18–30 years, M = 20.57, standard deviation [SD] = 2.22; 10 females) were included in the analyses. All participants were right-handed (Edinburgh Handedness Inventory), reported normal or corrected-to-normal vision and normal hearing, and had no history of neurological disorder. Participants had intermediate L2 proficiency based on TOEFL ITP scores (M = 530.27, SD = 27.21), corresponding to the Common European Framework of Reference for Languages (CEFR) B1 level. They began learning English at an average age of 10.5 years (SD = 2.9), primarily through classroom instruction. Listening proficiency was assessed via the Oxford Quick Placement test (British English materials; M = 79.36, SD = 6.35) (Allen, 1992); these scores were used as the proficiency index in the analyses. All participants provided informed consent. The study was approved by the Ethical Committee of Tohoku University.
Individual-difference measures
To examine individual differences in L2 prosody comprehension, participants completed three questionnaires after the experiment.
(1) L2 exposure and use
To capture participants’ habitual exposure to English, we developed a language background questionnaire assessing both the quantity and context of L2 exposure and use. Participants reported how frequently they encountered English across a range of academic and nonacademic contexts using a 6-point scale (0 = “never,” 5 = “every day”). The questionnaire contained 16 items in total, covering listening and speaking across academic (7 items) and nonacademic (9 items) contexts.
• Academic exposure included activities such as attending lectures in English, reading academic articles, preparing presentations, and participating in class discussions.
• Non-academic exposure encompassed watching English-language films or YouTube videos, listening to news or podcasts, playing games, and interacting with foreigners in informal settings.
For analysis, listening and speaking exposure items were combined to derive three indices for each participant: overall exposure (mean across all 16 items), academic exposure (mean across the 7 academic items), and nonacademic exposure (mean across the 9 nonacademic items). These indices, therefore, represent mean scores on a 0–5 scale. The overall exposure score was intended to capture learners’ cumulative experience with English across contexts and was therefore used as the primary predictor in the main individual-difference analyses. Academic and nonacademic exposure were examined in follow-up exploratory analyses to further characterize the nature of the exposure effect by assessing whether it was more strongly associated with one context type than the other. The full questionnaire is provided in the Supplementary Materials.
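As an illustration of how the three indices relate to the item ratings, the following minimal Python sketch computes them for a single participant. The item ordering (academic items first), the function name `exposure_indices`, and the example ratings are assumptions made for illustration; they are not taken from the questionnaire or from the study data.

```python
from statistics import mean

# Assumed layout: the source states only that there are 7 academic and
# 9 nonacademic items, each rated from 0 ("never") to 5 ("every day").
N_ACADEMIC = 7
N_NONACADEMIC = 9


def exposure_indices(ratings):
    """Return overall, academic, and nonacademic exposure as mean ratings (0-5)."""
    if len(ratings) != N_ACADEMIC + N_NONACADEMIC:
        raise ValueError("expected 16 item ratings")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 0-5 scale")
    academic = ratings[:N_ACADEMIC]        # hypothetical: academic items come first
    nonacademic = ratings[N_ACADEMIC:]     # hypothetical: nonacademic items follow
    return {
        "overall": mean(ratings),
        "academic": mean(academic),
        "nonacademic": mean(nonacademic),
    }


# Invented ratings for one illustrative participant (not study data).
example = exposure_indices([0, 1, 0, 1, 0, 0, 1, 2, 1, 0, 3, 0, 1, 0, 0, 2])
print(example)
```

Because the overall index averages all 16 items, it is a weighted combination of the two subscales (weights 7/16 and 9/16) rather than their simple average.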
Overall exposure was low (M = 0.70, SD = 0.40), consistent with the limited opportunities for English use in Japanese contexts. Academic exposure was the lowest (M = 0.48, SD = 0.34), whereas nonacademic exposure was higher and more variable (M = 0.86, SD = 0.64), suggesting that informal media input accounted for most of the participants’ English contact.
(2) Willingness to Communicate
The Japanese version of MacIntyre and Clément’s WTC scale (Yashima, 2002) was used to assess participants’ self-perceived likelihood of initiating communication in English. Using a 0–100 scale, participants rated how likely they would be to initiate a conversation with friends, acquaintances, and strangers. Given that scores below 52 indicate generally low WTC, the mean WTC score in our sample (M = 42.04, SD = 18.9) suggests that many participants were reluctant to initiate communication in English, potentially limiting their opportunities for L2 practice.
(3) Autism-Spectrum Quotient (AQ)
Because attitudinal prosody is often linked to social–pragmatic processing, we administered the AQ questionnaire (Wakabayashi et al., 2006) to account for individual differences in social-cognitive sensitivity. Higher levels of autistic traits have been shown to negatively impact prosody perception (Sinagra & Wiener, 2022) and to affect social cognition (Rice & Redcay, 2016). The AQ questionnaire measures the extent of autistic traits on a scale from 0 to 50, with higher scores indicating more autistic traits. In previous research, neurotypical adults typically score around 15–20 (Baron-Cohen et al., 2001), whereas individuals with autism spectrum conditions often score above 33 in the Japanese context, as validated by Wakabayashi et al. (2006). AQ scores in the present study ranged from 12 to 35 (M = 23.27, SD = 7.3). AQ did not correlate significantly with either English (r = –0.25) or Japanese prosody comprehension accuracy (r = –0.23), and was therefore treated as a control measure rather than a primary predictor.
Stimuli
We created a total of 112 basic English sentences and their Japanese counterparts using high school–level English vocabulary (Yokogawa, 2009), corresponding approximately to the CEFR A2–B1 level. All sentences followed the same syntactic frame – an animate subject, a transitive verb, and an inanimate object – to minimize lexical, syntactic, and semantic variability across conditions. No lexical cues indicating interrogativity or attitude (e.g. question markers, modal verbs, or affective adjectives) were included. Japanese translations were generated via machine translation and subsequently checked and corrected by two Japanese native speakers.
Sentences were recorded by four native speakers of each language (English and Japanese) using a Tascam DR-07C portable audio recorder. Because linguistic prosody (e.g. focus marking and interrogativity) and attitudinal prosody involve different communicative goals, the sentences were recorded separately for the two prosody conditions to ensure that the intended prosodic patterns were produced clearly and naturally. English speakers were native speakers of the North American variety of English, since English education in Japan typically emphasizes the North American variety. Each sentence was produced twice: once with linguistic prosody and once with attitudinal prosody, resulting in a pool of 224 auditory stimuli per language. Using a counterbalanced design, the stimuli were divided into two lists so that each participant heard 112 sentences per language, with each sentence presented in only one prosodic version and no lexical content repeated within a language. For linguistic prosody recordings, speakers produced the sentences with either statement or question intonation and with a prosodic focus on either the subject or object, following the procedure of Gandour et al. (2007). This resulted in four linguistic prosody combinations (statement–subject focus, statement–object focus, question–subject focus, question–object focus), with 14 sentences per combination. For the attitudinal prosody recordings, speakers expressed one of four attitudes while keeping the lexical content unchanged: friendliness, unfriendliness, confidence, or doubt (14 sentences per attitude). Speakers were instructed to convey the intended effect solely through prosody, without altering the segmental content.
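The counterbalancing logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not report how sentences were allocated to the two lists, so the split into a first and second half, the integer sentence identifiers, and the function name `build_lists` are all hypothetical.

```python
def build_lists(n_sentences=112):
    """Build two complementary stimulus lists for one language.

    Each list contains every sentence exactly once; a sentence that is
    attitudinal in list A is linguistic in list B, and vice versa.
    """
    ids = list(range(n_sentences))
    half = n_sentences // 2
    first, second = ids[:half], ids[half:]  # assumed allocation rule
    list_a = [(i, "attitudinal") for i in first] + [(i, "linguistic") for i in second]
    list_b = [(i, "linguistic") for i in first] + [(i, "attitudinal") for i in second]
    return list_a, list_b


list_a, list_b = build_lists()

# Each participant hears 112 items per language, 56 per prosody type.
print(len(list_a), sum(1 for _, p in list_a if p == "attitudinal"))
# Across both lists, all 224 recorded versions are used exactly once.
print(len(set(list_a) | set(list_b)))
```

With 14 sentences per attitude and 14 per linguistic combination, each list contains 56 attitudinal and 56 linguistic items, which matches the 112 sentences heard per language.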
To ensure the quality of the stimuli, recordings were cross-checked during production: for each language, four native speakers evaluated each other’s utterances, and items judged as unclear or ambiguous with respect to the intended attitude were re-recorded. In addition, a perception-based validation was conducted with an independent group of 33 Japanese students, who listened to the English stimuli and identified the conveyed attitudes. The overall mean accuracy was 0.84 (SD = 0.12), indicating that the intended attitudes were generally recognizable.
All audio files were edited and amplitude-normalized using Audacity. To preserve the natural prosodic realization of the utterances, no additional temporal normalization was applied; accordingly, sentence duration varied somewhat across conditions (Appendix A).
The final stimuli were divided into four experimental conditions based on the crossing of prosody type (attitudinal vs. linguistic) and language (English vs. Japanese): (1) Attitudinal English (AE), (2) Linguistic English (LE), (3) Attitudinal Japanese (AJ), and (4) Linguistic Japanese (LJ). To characterize the acoustic properties of the stimuli, we analyzed duration, mean intensity, and fundamental frequency for each condition. Mean intensity and mean F0 did not differ systematically as a function of prosody type or language, whereas F0 range showed a clear effect of prosody type (attitudinal vs. linguistic; p < .001), consistent with the expected pitch variation associated with attitudinal prosody. Summary acoustic values for each condition are provided in Appendix A.
Experimental procedures
Inside the scanner, participants performed a two-alternative forced-choice prosody judgment task. On each trial, two response options appeared on the screen, and participants selected the option that best matched the prosodic information in the spoken sentence by pressing a button as accurately and quickly as possible. A one-second fixation cross preceded the next trial. The task was organized into linguistic and attitudinal prosody blocks. In linguistic prosody blocks, participants categorized either sentence type or focus location, depending on the response labels shown on the screen: (i) Sentence type trials: [“Statement”] vs. [“Question”]. (ii) Focus-location trials: [“Subject”] vs. [“Object”] (i.e. which constituent carried prosodic prominence/narrow focus). Participants were instructed to indicate which constituent (subject or object) was prosodically emphasized, rather than selecting a single “most stressed word.” In attitudinal prosody blocks, participants categorized the speaker’s intended attitude using fixed label pairs: [“Friendly”] vs. [“Unfriendly”] or [“Confident”] vs. [“Doubtful”].
The experiment consisted of four sessions: two in English, followed by two in Japanese. Each language was divided into two sessions to keep the length of each session manageable. This fixed language order was chosen to minimize cross-language interference and to preserve the interpretability of within-language individual-difference analyses. Each session contained alternating blocks of linguistic and attitudinal prosody. To counterbalance block order, half of the participants completed linguistic blocks first, and the others completed attitudinal blocks first. As illustrated in Figure 1, each block consisted of four trials followed by a 15-second rest period, and each session contained 14 blocks (56 trials per session). Trial order within each block was randomized. To counterbalance the stimuli, one group of participants heard half of the sentences in the attitudinal condition and the remaining half in the linguistic condition, whereas the second group heard the opposite assignment. Short rest breaks were provided between sessions. The total task-related scanning time was approximately 44 minutes, after which a T1-weighted structural scan was acquired.
Figure 1. Schematic illustration of the fMRI prosody task. Each block consisted of four trials of either linguistic or attitudinal prosody judgments, followed by a 15-s rest period. Each trial began with a fixation cross (1 s), followed by presentation of an auditory sentence with two response options, to which participants responded via button press. In linguistic blocks, participants categorized sentence type (statement vs. question) or focus location (subject vs. object). In attitudinal blocks, participants categorized speaker attitude (e.g. friendly vs. unfriendly).

The task was presented using E-Prime 3.0 software (Psychology Software Tools, Inc., Pittsburgh, PA, USA). Auditory stimuli were delivered through MRI-compatible active noise–canceling optical headphones (OptoActive; Optoacoustics Ltd., Or Yehuda, Israel), which substantially reduced scanner noise and ensured adequate audibility of the speech stimuli. Prior to entering the scanner, participants completed a practice session of 28 English trials, which did not appear in the main experiment. After completing all fMRI sessions, participants performed the English listening test and completed the three individual-difference questionnaires described in the Individual-difference measures section.
Data acquisition and preprocessing
Functional MRI data were collected on a Philips Achieva dStream 3.0T scanner using a gradient-echo EPI sequence (repetition time [TR] = 2,000 ms, echo time [TE] = 30 ms, flip angle = 80°, 32 axial slices, 3-mm thickness, field of view [FOV] = 192 mm, matrix = 64 × 64). Each session consisted of 333 volumes. A high-resolution T1-weighted structural image (1-mm isotropic) was acquired for anatomical reference. Preprocessing was performed in SPM12 (Wellcome Centre for Human Neuroimaging). Functional images underwent slice-timing correction, realignment, co-registration of the T1 image to the mean functional image, normalization to MNI space, and spatial smoothing with an 8-mm FWHM Gaussian kernel. Motion parameters were inspected for quality assurance, and participants with excessive motion (>5 mm or substantial framewise displacement) were excluded.
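The acquisition parameters can be cross-checked against the task design described under Experimental procedures. The short Python sketch below does this arithmetic; it uses only values stated in the text, and the variable names are illustrative.

```python
TR_S = 2.0                  # repetition time in seconds (2,000 ms)
VOLUMES_PER_SESSION = 333   # volumes acquired per session
N_SESSIONS = 4              # two English sessions, then two Japanese sessions
SESSIONS_PER_LANGUAGE = 2
BLOCKS_PER_SESSION = 14
TRIALS_PER_BLOCK = 4

trials_per_session = BLOCKS_PER_SESSION * TRIALS_PER_BLOCK
trials_per_language = trials_per_session * SESSIONS_PER_LANGUAGE
session_duration_s = VOLUMES_PER_SESSION * TR_S
total_scan_min = N_SESSIONS * session_duration_s / 60

print(trials_per_session)        # trials in one session
print(trials_per_language)       # trials heard in each language
print(session_duration_s / 60)   # minutes per session
print(total_scan_min)            # total task-related scanning time in minutes
```

The result, 56 trials per session, 112 trials per language, and roughly 44.4 minutes of functional scanning, agrees with the 112 sentences per language and the approximately 44 minutes of task-related scanning reported above.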
Data analysis
Behavioral data
Behavioral performance was analyzed to examine how prosody type and language influenced participants’ accuracy and reaction time (RT). For both dependent variables, a 2 × 2 repeated-measures analysis of variance (ANOVA) was conducted with Prosody (attitudinal vs. linguistic) and Language (L1-Japanese vs. L2-English) as within-subject factors, resulting in four conditions (AE, LE, AJ, LJ). Accuracy served as the primary dependent variable, as it provides the most reliable measure of prosody comprehension in sentence-level auditory tasks. RT analyses were conducted on correct trials only, with RT defined as the interval between auditory sentence onset and the button press. Significant main effects and interactions were followed up with Bonferroni-corrected paired-samples t tests. To clarify the Prosody × Language interaction, we compared English and Japanese within each prosody type (AE vs. AJ; LE vs. LJ). The same follow-up procedure was applied to both accuracy and RT analyses. Because sentence duration differed across conditions (Appendix A), RTs are reported as a complementary measure and should not be interpreted as a pure index of task difficulty.
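In a fully within-subject 2 × 2 design, the interaction F(1, n − 1) equals the squared one-sample t statistic computed on each participant’s difference of differences, (AE − LE) − (AJ − LJ). The Python sketch below illustrates this equivalence and the paired follow-up comparison. It is not the analysis code used in the study: the function names and the five-participant accuracy values are invented for illustration, and p-values are omitted because they require the t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Paired-samples t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def interaction_f(ae, le, aj, lj):
    """Prosody x Language interaction F for a 2 x 2 within-subject design.

    Computed as the squared one-sample t on (AE - LE) - (AJ - LJ).
    """
    dod = [(a - l) - (j - k) for a, l, j, k in zip(ae, le, aj, lj)]
    n = len(dod)
    t = mean(dod) / (stdev(dod) / sqrt(n))
    return t * t, (1, n - 1)


# Invented accuracy scores (%) for five hypothetical participants.
ae = [70, 75, 80, 72, 78]
le = [85, 88, 90, 86, 84]
aj = [84, 86, 88, 83, 90]
lj = [86, 85, 91, 88, 87]

f_value, df = interaction_f(ae, le, aj, lj)
t_attitudinal, t_df = paired_t(ae, aj)  # follow-up: AE vs. AJ
print(df, t_df)
```

With 30 participants this yields the F(1, 29) and t(29) degrees of freedom reported in the Results. A Bonferroni correction for the two follow-up comparisons (AE vs. AJ; LE vs. LJ) would multiply each p-value by two, capped at 1.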
To address the individual differences component of the study, we examined whether participants’ behavioral performance – particularly in the attitudinal prosody conditions – was associated with variation in L2 proficiency, WTC, and overall L2 exposure. To further characterize the nature of overall L2 exposure, we then conducted follow-up analyses examining the academic and nonacademic subcomponents of exposure. Pearson correlation analyses were conducted between participants’ accuracy scores and each individual-difference measure after inspecting the correlations among the questionnaire variables.
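For reference, the Pearson correlation and its associated t statistic (with n − 2 degrees of freedom) can be sketched in a few lines of Python. The helper names `pearson_r` and `r_to_t` and the toy vectors are illustrative assumptions rather than part of the reported analysis.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_to_t(r, n):
    """Convert a correlation to its t statistic with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Toy data (not study data): a perfectly linear relationship gives r = 1.
print(pearson_r([1, 2, 3], [2, 4, 6]))

# A correlation of 0.40 in a sample of 30 corresponds to t(28) of about 2.31.
print(round(r_to_t(0.40, 30), 2))
```

The conversion shows how strongly the sample size constrains significance: with 30 participants, correlations need to be moderately large before they can be distinguished from zero.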
Brain data
Brain data were analyzed using a conventional two-level random-effects approach in SPM12. At the first level, each participant’s preprocessed functional images were modeled using a voxel-wise general linear model, with neural activity modeled at the block level. The model included one regressor for each experimental condition (AE, LE, AJ, LJ): a boxcar function time-locked to the onset of each block, with a duration of 28 s (corresponding to four trials per block), convolved with the canonical hemodynamic response function. Rest periods were not explicitly modeled and served as an implicit baseline. Six movement parameters (three translations, three rotations) were included as nuisance regressors, and a 128-s high-pass filter was applied to remove low-frequency drift. For each participant, contrast images for each condition and for the relevant pairwise comparisons were generated.
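To make the block-level model concrete, the sketch below builds a boxcar regressor for one condition and convolves it with a double-gamma response function. It is a simplified stand-in, not SPM12’s implementation: the response function is sampled coarsely at the TR, its shape parameters (6 and 16, with an undershoot ratio of 1/6) are assumed values commonly used for a canonical-style response, and the block onsets are hypothetical because exact onsets are not reported.

```python
from math import exp, gamma

TR_S = 2.0        # repetition time in seconds
N_VOLUMES = 333   # volumes per session
BLOCK_S = 28.0    # modeled block duration in seconds


def boxcar(onsets_s, duration_s, n_volumes, tr_s):
    """Return a 0/1 regressor sampled once per volume."""
    regressor = [0.0] * n_volumes
    for onset in onsets_s:
        start = int(round(onset / tr_s))
        stop = int(round((onset + duration_s) / tr_s))
        for volume in range(start, min(stop, n_volumes)):
            regressor[volume] = 1.0
    return regressor


def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * exp(-t) / gamma(shape)


def double_gamma_hrf(tr_s, length_s=32.0):
    """Coarse double-gamma response, normalized to sum to 1 (assumed parameters)."""
    times = [i * tr_s for i in range(int(length_s / tr_s) + 1)]
    values = [gamma_pdf(t, 6) - gamma_pdf(t, 16) / 6 for t in times]
    total = sum(values)
    return [v / total for v in values]


def convolve(signal, kernel):
    """Causal discrete convolution truncated to the length of the signal."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, k in enumerate(kernel):
            if i + j < len(out):
                out[i + j] += s * k
    return out


# Hypothetical onsets for one condition: seven blocks, one every 86 s
# (two 28-s blocks plus two 15-s rests per cycle of alternating block types).
onsets = [i * 86.0 for i in range(7)]
box = boxcar(onsets, BLOCK_S, N_VOLUMES, TR_S)
predicted = convolve(box, double_gamma_hrf(TR_S))
print(int(sum(box)), len(predicted))
```

Each 28-s block spans 14 volumes at a TR of 2 s, so seven blocks of one condition occupy 98 of the 333 volumes in a session under these assumptions.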
At the second level, group-level analyses were performed on these contrast images. To examine the main effect of prosody type, attitudinal conditions (AE + AJ) were contrasted with linguistic conditions (LE + LJ). To examine the effect of language, L2 English (AE + LE) was contrasted with L1 Japanese (AJ + LJ). The interaction contrast [(AE > LE) > (AJ > LJ)] was used to identify regions showing differential attitudinal–linguistic processing in English relative to Japanese.
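The three contrasts can be written as weight vectors over the four condition regressors. The Python sketch below assumes the column order AE, LE, AJ, LJ, which is an illustrative choice, and uses invented parameter estimates to show how a contrast value is obtained.

```python
CONDITIONS = ("AE", "LE", "AJ", "LJ")  # assumed column order

CONTRASTS = {
    "attitudinal > linguistic": (1, -1, 1, -1),   # (AE + AJ) vs. (LE + LJ)
    "L2 English > L1 Japanese": (1, 1, -1, -1),   # (AE + LE) vs. (AJ + LJ)
    "(AE > LE) > (AJ > LJ)": (1, -1, -1, 1),      # Prosody x Language interaction
}


def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrast_value(weights, betas):
    """Weighted sum of condition estimates for one voxel."""
    return dot(weights, [betas[c] for c in CONDITIONS])


# Invented parameter estimates for a single voxel (not study data).
betas = {"AE": 2.0, "LE": 1.0, "AJ": 1.5, "LJ": 1.0}
interaction = contrast_value(CONTRASTS["(AE > LE) > (AJ > LJ)"], betas)
print(interaction)
```

Each weight vector sums to zero and the three vectors are mutually orthogonal, so the two main effects and the interaction test separate components of the 2 × 2 design.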
The statistical threshold for whole-brain analyses was set at p < .05, family-wise error (FWE) corrected at the voxel level. For regions with a priori hypotheses, specifically the left IFG and the hippocampus, small-volume correction (SVC) was performed using anatomically defined masks, with significance set at p < .05 (voxel-wise, FWE-corrected) following an initial uncorrected threshold of p < .001. To visualize condition-specific activation, parameter estimates were extracted from peak voxels within significant clusters.
Finally, a whole-brain voxel-wise multiple regression analysis was performed to examine how individual differences in L2 experience and performance relate to neural activation in the [AE > LE] contrast. The model included four covariates: L2 listening proficiency (Oxford Placement test scores), overall L2 exposure, L2 WTC, and behavioral accuracy for attitudinal prosody. All predictors were entered simultaneously into the SPM12 design matrix, allowing us to assess the unique contribution of each variable while controlling for the others. This analysis used an initial threshold of p < .001 uncorrected, with cluster-level correction at p < .05.
Results
Behavioral results
A 2 × 2 repeated-measures ANOVA on accuracy revealed a significant Prosody × Language interaction, F(1, 29) = 27.35, p < .001, η² = 0.178. Bonferroni-corrected pairwise comparisons showed that attitudinal prosody was identified less accurately in L2 English, M = 75.54, SD = 6.53, than in L1 Japanese, M = 84.88, SD = 7.54, t(29) = −5.91, p < .001, d = −1.21, whereas linguistic prosody did not differ between languages, t(29) = 0.99, p = 1.00, d = 0.23. These findings indicate a selective L2 disadvantage for attitudinal, but not linguistic prosody, consistent with previous studies showing greater difficulty in interpreting sentence-level attitudes in a second language (Chen et al., 2004; Rilliard et al., 2014; Shochi et al., 2006).
A corresponding 2 × 2 repeated-measures ANOVA on RT revealed a significant main effect of Prosody, F(1, 29) = 13.53, p < .001, η² = 0.318, and a significant Prosody × Language interaction, F(1, 29) = 25.98, p < .001, η² = 0.473, but no main effect of Language, F(1, 29) = 0.44, p = .512, η² = 0.015. Bonferroni-corrected pairwise comparisons showed that participants responded faster to attitudinal prosody in Japanese (AJ: M = 2.41 s, SD = 0.75) than in English (AE: M = 2.88 s, SD = 0.52), t(29) = 3.77, p = .001, d = 0.69. Conversely, participants responded more slowly to linguistic prosody in Japanese (LJ: M = 3.14 s, SD = 0.25) than in English (LE: M = 2.78 s, SD = 0.60), t(29) = −3.43, p = .004, d = −0.63. Because sentence duration differed across conditions, particularly in the Japanese linguistic condition (Appendix A), RTs are reported as a complementary measure and should not be interpreted as a pure index of task difficulty. Accuracy and RT descriptives are presented in Table 1.
Table 1. Descriptive statistics for the behavioral data

Note: Values are means, with standard deviations in parentheses. Accuracy is reported as percentages. Reaction times (RTs) are reported in seconds and were calculated from correct trials only.
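The effect sizes reported for the RT ANOVA can be related to their F statistics. The short sketch below assumes that those values are partial eta squared, an assumption that is numerically consistent with the three RT effects; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom taken from the RT ANOVA reported above.
rt_effects = {
    "Prosody": 13.53,
    "Prosody x Language": 25.98,
    "Language": 0.44,
}
rt_effect_sizes = {
    name: round(partial_eta_squared(f, 1, 29), 3) for name, f in rt_effects.items()
}
```

Under this assumption, the computed values match the reported RT effect sizes of 0.318, 0.473, and 0.015.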
Individual-difference analyses are summarized in Table 2. Pearson correlations revealed that overall L2 exposure was positively associated with accuracy in attitudinal prosody (r = 0.40, p = .031). To further characterize the nature of this relationship, we conducted follow-up analyses on the academic and nonacademic subcomponents of exposure. These analyses showed that only nonacademic exposure was significantly correlated with accuracy (r = 0.46, p = .011). No significant correlations were observed for WTC (r = −0.13) or L2 listening proficiency (r = 0.25).
Table 2. Correlation results for the behavioral data

Note: * indicates p < .05, ** indicates p < .01. Pearson’s r values are shown, with p-values in parentheses. WTC = Willingness to Communicate in the L2 scores.
As a robustness check, we additionally fitted a logistic mixed-effects model for accuracy. This analysis yielded the same significant Language × Prosody interaction pattern as the repeated-measures ANOVA, confirming the main behavioral findings (see Supplementary Materials).
Brain results
Comparison of prosody types (attitudinal vs. linguistic)
Across languages, attitudinal prosody [AE + AJ] elicited significantly greater activation than linguistic prosody [LE + LJ] in the left anterior temporal lobe (ATL), left lateral orbital gyrus (LOG), mPFC, bilateral precuneus, and right Rolandic operculum (Table 3, Figure 2). Conversely, linguistic prosody [LE + LJ] produced significantly greater activation than attitudinal prosody [AE + AJ] in the bilateral inferior parietal lobules and bilateral precentral gyri (Table 3, Figure 3).
Figure 2. Main effect of attitudinal prosody (attitudinal > linguistic).

Figure 3. Main effect of linguistic prosody (linguistic > attitudinal).

Table 3. Comparison of prosody types

Note: Coordinates (x, y, z) are in MNI space. t-values reflect the peak voxel within each cluster. Cluster sizes indicate the number of voxels (2 × 2 × 2 mm³). p < .05, voxel-wise FWE-corrected.
Interaction effect between prosody type and language
To identify regions showing a greater attitudinal–linguistic prosody difference in L2 English than in L1 Japanese, we examined the interaction contrast [(AE > LE) > (AJ > LJ)]. This analysis revealed significant activation in the left pars triangularis of IFG and the left middle frontal gyrus (MFG) (Table 4, Figure 4). These effects indicate that the distinction between attitudinal and linguistic prosody engaged left frontal regions more strongly in English than in Japanese. The reverse interaction [(LE > AE) > (LJ > AJ)] yielded no suprathreshold clusters, even at p < .001, uncorrected (Table 4).
Figure 4. Interaction effect between prosody type and language.

Table 4. Interaction effect between prosody type and language

Note: Coordinates are in MNI space. Peak t-values and cluster sizes (voxels) are reported. Statistical threshold: SVC at p < .05, voxel-wise FWE-corrected (left IFG and left MFG).
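The bracketed contrasts used in this section can be read as weight vectors over the four conditions. The sketch below spells out those weights and, purely as an illustration, applies them to the mean RTs reported in the behavioral results. The condition ordering, dictionary layout, and function name are assumptions, and the neural contrasts were of course computed on activation estimates rather than on RTs.

```python
CONDITIONS = ("AE", "LE", "AJ", "LJ")

# Weight vectors implied by the bracketed contrast notation.
CONTRASTS = {
    # [AE + AJ] > [LE + LJ]
    "attitudinal > linguistic": {"AE": 1, "LE": -1, "AJ": 1, "LJ": -1},
    # [LE + LJ] > [AE + AJ]
    "linguistic > attitudinal": {"AE": -1, "LE": 1, "AJ": -1, "LJ": 1},
    # [(AE > LE) > (AJ > LJ)] = (AE - LE) - (AJ - LJ)
    "interaction": {"AE": 1, "LE": -1, "AJ": -1, "LJ": 1},
    # [(LE > AE) > (LJ > AJ)] = (LE - AE) - (LJ - AJ)
    "reverse interaction": {"AE": -1, "LE": 1, "AJ": 1, "LJ": -1},
}


def apply_contrast(weights, condition_values):
    """Weighted sum of per-condition values for one contrast."""
    return sum(weights[c] * condition_values[c] for c in CONDITIONS)


# Mean RTs in seconds from the behavioral results (illustration only).
rt_means = {"AE": 2.88, "LE": 2.78, "AJ": 2.41, "LJ": 3.14}

rt_interaction = apply_contrast(CONTRASTS["interaction"], rt_means)
rt_reverse = apply_contrast(CONTRASTS["reverse interaction"], rt_means)
```

Each weight vector sums to zero, so a contrast is sensitive only to differences between conditions. The reverse interaction is the sign-flipped interaction, which is why testing it separately asks about the opposite direction of the same difference of differences.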
Individual differences in L2 processing
To examine how learner-related factors influenced neural processing of L2 attitudinal prosody, we conducted a voxel-wise multiple regression analysis using four predictors: overall L2 exposure, L2 proficiency, WTC in L2, and task accuracy. Among these variables, only overall L2 exposure significantly predicted neural activation, revealing two clusters: a thalamic cluster significant at the cluster level (FWE-corrected) and a left hippocampal–parahippocampal cluster significant under SVC (Table 5, Figure 5a and b). This effect remained significant after controlling for proficiency, WTC, and accuracy.
Figure 5. Associations between L2 exposure components and neural activation during L2 attitudinal prosody processing. The brain maps show clusters identified using the overall L2 exposure score. Scatterplots display correlations between neural beta values and nonacademic (teal circles) or academic (gray triangles) exposure. Nonacademic exposure showed significant positive correlations with both regions (thalamus: r = 0.45, p = .013; hippocampus: r = 0.51, p = .004), whereas academic exposure showed no reliable relationships.

Table 5. Language exposure effect on attitudinal prosody processing in L2

Note: Coordinates are in MNI space. Peak t-values and cluster sizes (voxels) are reported. The thalamus cluster survived cluster-level FWE correction (p < .05; cluster-forming threshold p < .001), and the left hippocampal/parahippocampal cluster survived SVC at p < .05, voxel-wise FWE-corrected.
To further characterize the nature of this exposure effect, we conducted follow-up analyses examining academic and nonacademic exposure separately. Scatterplots of beta values extracted from the peak coordinates in each region showed that nonacademic exposure – reflecting naturalistic, outside-of-class contact with English – was significantly correlated with activation in both the thalamus (r = 0.45, p = .013) and the left hippocampus (r = 0.51, p = .004), whereas academic exposure showed no reliable associations (thalamus: r = 0.31, p = .092; hippocampus: r = 0.15, p = .432).
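The p-values attached to the correlations above follow from r and the sample size. The sketch below converts a Pearson r into a two-tailed p-value using the t distribution, assuming n = 30 (as implied by the 29 error degrees of freedom in the ANOVAs). It uses only the standard library and integrates the t density numerically; the function names and the number of integration steps are illustrative choices, not the authors' procedure.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p_from_r(r, n, steps=2000):
    """Two-tailed p-value for a Pearson correlation r from n observations."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and t
    # (steps must be even).
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    # The two tails hold whatever lies outside [-t, t].
    return 1 - 2 * area


# Nonacademic exposure correlations reported above, with n = 30 assumed.
p_thalamus = two_tailed_p_from_r(0.45, 30)
p_hippocampus = two_tailed_p_from_r(0.51, 30)
```

Because the reported r values are rounded to two decimals, p-values recomputed this way can differ slightly in the third decimal from those in the text.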
Discussion
This study investigated the neural mechanisms underlying attitudinal prosody processing in a second language (L2), revealing three major findings. First, attitudinal and linguistic prosody engaged dissociable neural networks across both L1 and L2. Second, interpreting attitudinal prosody in the L2 elicited greater activation for attitudinal compared to linguistic prosody in the left IFG and MFG, indicating increased cognitive-linguistic demands when processing social intentions in a non-native language. Third, individual differences analyses demonstrated that L2 exposure was the only factor that significantly predicted both behavioral accuracy and neural outcomes in L2 attitudinal prosody processing. Together, these findings illustrate both shared and L2-specific neural mechanisms of attitudinal prosody and highlight the role of language exposure in L2 pragmatic development.
Attitudinal prosody processing in the L2
A central aim of this study was to determine why attitudinal prosody is more difficult to interpret in a second language and how this difficulty is reflected in behavioral and neural patterns. The accuracy data provided clear evidence of an L2 disadvantage: participants were significantly less accurate at identifying attitudinal prosody in English than in Japanese, whereas accuracy for linguistic prosody did not differ across languages. These behavioral findings suggest that learners experience greater ambiguity and reduced reliability when interpreting attitudinal prosody in their L2, consistent with previous research demonstrating that sentence-level attitudes are more challenging to identify in a second language (Chen et al., Reference Chen, Gussenhoven and Rietveld2004; Rilliard et al., Reference Rilliard, Erickson, Shochi and de Moraes2014; Shochi et al., Reference Shochi, Aubergé and Rilliard2006).
The neuroimaging results provide converging evidence for this behavioral difficulty. The interaction contrast [(AE > LE) > (AJ > LJ)] revealed increased activation in the left IFG (pars triangularis) during L2 attitudinal prosody processing. This region is a core hub for higher-level sentence processing, supporting controlled evaluation, interpretation, and selection of contextually relevant information (Friederici, Reference Friederici2012). Causal neurostimulation work further demonstrates that intonation shapes semantic and syntactic processing within the IFG (van der Burght et al., Reference van der Burght, Numssen, Schlaak, Goucha and Hartwigsen2023), suggesting that this region provides a functional interface through which prosodic cues influence sentence-level interpretation. Importantly, the IFG was engaged across both languages and prosody types in the present study, reflecting its general role in supporting the use of prosodic information during sentence processing. What distinguishes the L2, however, is the degree of modulation: activation in the pars triangularis differentiated more sharply between attitudinal and linguistic prosody in L2 English, whereas the difference was minimal in L1 Japanese. This pattern is consistent with Oh et al. (Reference Oh, Graham, Ng, Yeh, Chan and Edwards2019), who reported stronger modulation of the pars triangularis in L2 contexts where listeners engage more controlled interpretive processing. Together, these results suggest that L2 listeners rely more heavily on left IFG-mediated controlled processing when interpreting attitudinal prosody, not because the region is uniquely required in the L2, but because its contribution is selectively enhanced for attitudinal, relative to linguistic, prosody in the non-native language.
We also observed increased activity in the left MFG for L2 attitudinal prosody. The MFG is frequently implicated in demanding cognitive tasks involving working memory and executive control (Duncan, Reference Duncan2010; Fedorenko et al., Reference Fedorenko, Duncan and Kanwisher2013). Its involvement here may reflect the need to maintain task-relevant linguistic and prosodic information while evaluating the speaker’s intended social meaning. Notably, the direction of activity in the MFG differed by language: Japanese attitudinal prosody elicited less MFG activation than linguistic prosody, a pattern that may be consistent with relatively reduced executive demands in the L1. In contrast, English attitudinal prosody elicited greater MFG activation than linguistic prosody, which may reflect greater engagement of domain-general executive resources in the L2.
Taken together, these findings show that attitudinal prosody in the L2 imposes greater cognitive demands than in the L1, requiring greater engagement of controlled interpretive processing (IFG) and executive-control (MFG) mechanisms. Because learners are less familiar with the pragmatic and prosodic conventions of the L2, attitudinal cues become more ambiguous and less reliably mapped onto their intended interpersonal meanings.
Main effects of prosody type: Neural dissociation of attitudinal and linguistic prosody
Across both languages, attitudinal prosody elicited robust activation in ToM–related regions associated with mentalizing and speaker-intention inference (Hellbernd & Sammler, Reference Hellbernd and Sammler2018), whereas linguistic prosody recruited dorsal fronto-parietal regions associated with phonological working memory and articulatory planning (Baldo & Dronkers, Reference Baldo and Dronkers2006; Meister et al., Reference Meister, Wilson, Deblieck, Wu and Iacoboni2007). Despite the additional processing demands observed for L2 attitudinal prosody, main effects revealed a stable neural dissociation between attitudinal and linguistic prosody.
Specifically, the main effect of attitudinal prosody, shown in the contrast [AE + AJ] > [LE + LJ], recruited the mPFC, precuneus, left ATL, and left LOG, regions implicated in mentalizing and speaker-intention inference. This pattern aligns with extensive L1 research showing that inferring a speaker’s interpersonal stance engages socio-cognitive mechanisms rather than core language regions (Enrici et al., Reference Enrici, Bara and Adenzato2019; Hellbernd & Sammler, Reference Hellbernd and Sammler2018; Tettamanti et al., Reference Tettamanti, Vaghi, Bara, Cappa, Enrici and Adenzato2017). Our findings extend this evidence to the L2, showing that learners rely on the same ToM-related network when interpreting attitudinal meanings conveyed through prosody. These results suggest that listeners engage inferential mechanisms to process attitudinal prosody across both L1 and L2 contexts, highlighting the inherently social nature of attitudinal prosody.
In contrast, the main effect of linguistic prosody, shown in the contrast [LE + LJ] > [AE + AJ], elicited greater activation in the bilateral inferior parietal lobules (IPL) and precentral gyri, regions associated with phonological working memory and articulatory planning (Baldo & Dronkers, Reference Baldo and Dronkers2006; Meister et al., Reference Meister, Wilson, Deblieck, Wu and Iacoboni2007). These findings suggest that sentence-level linguistic prosody depends more heavily on phonological and articulatory planning resources, consistent with its role in conveying structural information such as sentence modality and focus across both L1 and L2.
Taken together, these findings demonstrate that attitudinal and linguistic prosody recruit distinct neural systems that remain consistent across L1 and L2. While L2 learners experience additional cognitive load for attitudinal prosody, as discussed in the section “Attitudinal prosody processing in the L2,” the basic organization of prosody processing appears broadly preserved across languages, with attitudinal prosody relying more on socio-cognitive mechanisms and linguistic prosody relying more on phonological and articulatory planning mechanisms.
Effect of L2 exposure on attitudinal prosody processing
The individual differences analyses revealed a clear and consistent pattern: L2 exposure was the only factor that reliably predicted both behavioral and neural outcomes during L2 attitudinal prosody processing. Learners who reported higher levels of informal English use, particularly listening and speaking outside the classroom, showed higher accuracy in identifying attitudinal prosody and greater activation in the left hippocampus and thalamus during attitudinal prosody processing. These results indicate that naturalistic L2 contact plays a key role in supporting the interpretation of prosodic cues and their mapping onto intended interpersonal meanings.
The regions that correlated with L2 exposure centered on the hippocampus, with the cluster extending into the parahippocampal gyrus. Rather than interpreting these findings in terms of a traditional declarative–procedural distinction, a more contemporary interpretation highlights the role of the hippocampal–parahippocampal system in binding episodic, auditory, and contextual information during naturalistic learning. This system enables learners to integrate past experiences, including prosodic patterns, speaker-specific cues, affective tone, and social–contextual regularities, into predictive representations that facilitate the interpretation of unfolding speech (Davis & Gaskell, Reference Davis and Gaskell2009; Ranganath & Ritchey, Reference Ranganath and Ritchey2012). The extension of the activation into the parahippocampal gyrus further suggests engagement of scene- and context-construction processes that allow learners to situate incoming auditory input within familiar experiential frameworks (Aminoff et al., Reference Aminoff, Kveraga and Bar2013; Epstein & Kanwisher, Reference Epstein and Kanwisher1998; Ranganath & Ritchey, Reference Ranganath and Ritchey2012). From this perspective, exposure provides the episodic and contextual grounding necessary for real-time interpretation of pragmatic prosody.
The involvement of the thalamus provides further insight into how exposure is associated with L2 speech processing. The thalamus plays a key role in sensory gating, attentional selection, and the routing of auditory information across cortical networks (Bulut & Hagoort, Reference Bulut and Hagoort2024; Halassa & Kastner, Reference Halassa and Kastner2017). In the present study, thalamic activation may reflect experience-dependent modulation of auditory attention to socially relevant prosodic cues. This interpretation is consistent with previous findings showing structural changes in the thalamus in individuals with extensive L2 immersion (DeLuca et al., Reference DeLuca, Rothman, Bialystok and Pliatsikas2019; Pliatsikas et al., Reference Pliatsikas, DeLuca, Moschopoulou and Saddy2017), suggesting that repeated L2 use strengthens subcortical mechanisms that support speech processing.
Behaviorally, distinguishing between types of exposure revealed that nonacademic exposure – informal, socially embedded encounters with English – was more strongly associated with attitudinal prosody accuracy than academic exposure. However, the lower variance of academic exposure in the present sample may have reduced the likelihood of detecting a significant correlation, so the observed difference between academic and nonacademic exposure may partly reflect differences in variability. At the same time, the observed association between nonacademic exposure and attitudinal prosody performance is consistent with claims that socially grounded L2 use provides richer pragmatic and affective cues than formal instruction alone (Jeong et al., Reference Jeong, Li, Suzuki, Sugiura and Kawashima2021; Li & Jeong, Reference Li and Jeong2020; Liu et al., Reference Liu, Jeong, Cui, Dewaele, Okamoto, Suzuki and Sugiura2024). Social interactions expose learners to a broader range of communicative intentions, interpersonal stances, and prosodic contours, all of which are encoded as episodic, multimodal experiences that can later be recruited during speech processing. The hippocampal–parahippocampal–thalamic involvement identified here provides a mechanistic account of why such informal experiences matter: they offer episodic and contextual traces that support the extraction and prediction of pragmatic prosodic meaning. Proficiency, WTC, and behavioral accuracy did not show significant associations with neural activation, highlighting that exposure was the primary factor linked to learners’ attitudinal prosody development.
Finally, the presence of these exposure effects in the Japanese EFL context is noteworthy. Opportunities for immersive, socially sustained L2 use are limited, and learners typically encounter English primarily in instructed settings. Yet the brain still tracked individual differences in exposure, suggesting that even modest, episodic encounters with English – through media, brief social interactions, or incidental listening – are sufficient to be reflected in the neural systems supporting pragmatic prosody perception. These findings highlight both the sensitivity and adaptability of the L2 learning system and underscore the importance of providing learners with richer opportunities for naturalistic, socially grounded L2 experience.
Limitations, pedagogical implications, and future research
This study has several limitations that should be considered when interpreting the results. First, the explicit judgment task may have encouraged top-down strategies, potentially altering how learners relate prosodic cues to semantic and pragmatic interpretations. In addition, prosody type was not manipulated independently of task demands, as linguistic and attitudinal prosody were associated with different categorization requirements and were recorded separately to ensure natural prosodic realization. The present design was intended to compare different task-dependent modes of prosody processing, and the observed differences therefore reflect the combined influence of prosodic features and task-related demands. Although lexical content and syntactic structure were held constant across conditions, the results cannot be attributed to prosodic features alone. More implicit paradigms (e.g. passive listening, mismatch responses, or incidental attitude recognition) would help clarify how attitudinal prosody is processed when attentional demands are minimized. Second, the block design limited item-level analyses, preventing direct comparison of attitudinal subtypes (e.g. confidence, friendliness, doubt) or fine-grained variability across items. Event-related designs would enable more detailed mapping of specific attitudinal categories and their neural correlates. Third, this study focused on a single L1–L2 pairing. Although this provides a clear test case, attitudinal prosody varies cross-linguistically, and future research should examine additional language combinations to assess generalizability.
Despite these considerations, the findings yield several insights into the neural and experiential foundations of L2 attitudinal prosody. First, attitudinal prosody in both L1 and L2 engaged socio-cognitive networks associated with mentalizing, highlighting shared mechanisms for inferring speakers’ intentions. Second, L2 attitudinal prosody elicited additional activation in the left IFG and MFG, reflecting increased executive and control-related demands required to integrate unfamiliar pragmatic cues. Third, informal L2 exposure emerged as the strongest predictor of both behavioral accuracy and neural activation, consistent with the idea that socially grounded, real-world encounters provide learners with richer pragmatic, affective, and prosodic information than formal instruction alone. This aligns with emerging perspectives on social learning, which emphasize that episodic, multimodal encounters – in conversation, media, or everyday listening – serve as critical sources of pragmatic input that strengthen experience-based processing mechanisms.
Because attitudinal prosody is rarely addressed explicitly in L2 curricula, incorporating authentic spoken materials – such as films, interviews, podcasts, or conversational clips – may provide learners with more meaningful access to interpersonal intentions. Activities that guide learners to notice and interpret prosodic cues expressing attitudes (e.g. confidence, friendliness, doubt) through perception-based or awareness-raising tasks may support pragmatic development. Interactive communicative activities that require interpreting or expressing attitudes in context may further reinforce sensitivity to prosodic cues. These practices are consistent with the present finding that socially embedded, naturalistic experience plays a central role in L2 attitudinal prosody acquisition.
Future research should examine how different forms of L2 exposure – such as conversational practice, multimodal input, or targeted instruction – shape learners’ pragmatic prosody skills over time, and how instructional interventions can support the acquisition of socially meaningful prosodic cues. Longitudinal and training studies may clarify how exposure-driven changes in both behavior and neural activation unfold during L2 development, offering new insights into how learners come to interpret interpersonal intentions in multilingual communication.
Conclusion
This study provides the first evidence clarifying the neural mechanisms that support attitudinal prosody in an L2. The findings demonstrate that interpreting attitudinal prosody engages universal socio-cognitive mechanisms as well as additional linguistic and executive processes required when listeners interpret pragmatic cues in a less familiar language. While Theory of Mind–related networks facilitated intention inference across languages, L2 attitudinal prosody was strongly associated with learners’ real-world experience, particularly informal, socially grounded exposure. The observed involvement of the hippocampal–parahippocampal–thalamic system further suggests that such exposure supplies the episodic, contextual, and auditory grounding through which learners construct predictions about speakers’ attitudes.
By integrating behavioral and neural evidence, this study advances theoretical models of pragmatic prosody acquisition and highlights the central role of naturalistic, socially meaningful input in L2 pragmatic development. These findings underscore the importance of providing learners with rich opportunities to encounter attitudinal cues in authentic communicative contexts and contribute to broader discussions on how experiential factors shape the neurocognitive foundations of pragmatic development in a second language.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0267190526100257.
Appendix A. Acoustic properties of the stimuli

Note: Duration and pitch measures (mean F0 and F0 range) are reported with standard deviations in parentheses. Intensity SD values are not reported because occasional silent intervals produced outlier values, but mean intensity was comparable across conditions.










