Highlights
• L1-relevant cues that are only partly useful in L2 reinforce sub-optimal strategies
• Perceptual strategies can be adjusted with as little as 3 hours of training
• Our findings offer a new direction for designing targeted training paradigms
1. Introduction
Acquiring a new language involves learning new linguistic categories and how they map onto continuous variation across multiple acoustic parameters. Mapping sounds to symbols is a challenging task when learning a first language (L1), but it becomes even more difficult when learning a new language because acoustic dimensions can play different roles across languages. For example, native English speakers have not learned to distinguish word meanings based on different pitch contours, which are crucial for conveying meaning in tonal languages (Francis et al., 2008; Hao, 2018; Yip, 2002). If second language (L2) learners have difficulty detecting differences along the crucial dimension, they might instead rely on information that is of secondary importance to native speakers. For example, non-native speakers of English (e.g., L1 Polish and Russian speakers, Bogacka, 2004; Kondaurova & Francis, 2008; L1 Mandarin speakers, Zhang et al., 2015) often use duration rather than spectral cues to distinguish between the English vowels /iː/ and /ɪ/. Similarly, Spanish speakers rely more heavily on onset fundamental frequency (F0) when perceiving English voicing contrasts (Llanos et al., 2013). As a result, an optimal L1 listening strategy might not be as effective for a subsequent L2. Learning an L2 might therefore require people to redirect attention to dimensions that can help distinguish the new speech contrasts and to form new mappings between L2-relevant acoustic dimensions and L2 categories, potentially leading to higher perceived salience of those dimensions (Francis et al., 2000).
Consistent with the predictions of attention-to-dimension models of perceptual learning, Francis and Nusbaum (2002) showed that improvements in distinguishing between consonant-vowel syllables arise from acquiring within-category similarity and between-category distinctiveness. Both processes are said to be driven by attentional shifts – increased attention to the dimensions that best define within-category similarity or differentiate between the categories. Attention might also serve as a perceptual lens, whereby attending to relevant acoustic dimensions (cue enhancement) and disengaging from task-irrelevant dimensions (cue inhibition) brings that information into focus (Kondaurova & Francis, 2010). Most studies exploring the roles of attention to dimensions and dimensional salience in L2 speech acquisition have focused on the segmental level, in the context of specific phonetic contrasts (Baese-Berk, 2019) and minimal pairs (Kondaurova & Francis, 2010; Lehet & Holt, 2020; Liu & Holt, 2015; Wu & Holt, 2022; Zhang et al., 2021). However, these stimuli do not reflect the full range of acoustic mappings learners need to master, as acoustic differences between languages manifest not only at the segmental level but also at the suprasegmental level (i.e., during prosodic categorization, e.g., phrase boundaries, Kuang et al., 2022; word stress, Gordon & Roettger, 2017).
Several aspects of prosody are particularly important in learning an L2, including syllable prominence within words (lexical stress), emphasis of words containing new or important information (contrastive focus) and placement of phrase boundaries. For example, in English, the position of word stress is not fully predictable as it is in fixed-stress languages (e.g., final-syllable stress in standard French, Peperkamp & Dupoux, 2002) and must be remembered as part of the pronunciation of each word (Cutler, 2015). Stressed syllables in English tend to feature greater pitch movement (strong vs weak syllable distinction, Fear et al., 1995), longer duration (Davis & Van Summers, 1989; Lunden, 2017) and increased intensity (Plag et al., 2011). Contrastive focus is characterized by greater pitch change and longer duration of the emphasized word (e.g., Breen et al., 2010), and phrase boundaries in English are associated with lengthening of the pre-boundary segment (word or syllable) and a rapid change in pitch just before the boundary (Beach, 1991; de Pijper & Sanderman, 1994; Streeter, 1978). However, due to differences in cue weighting patterns across languages, English prosody is not necessarily realized in the same manner by L2 English learners. For instance, Chinese learners of English rely on pitch information more than native speakers do when categorizing stress patterns in nonsense words, and among pitch, duration and amplitude, only pitch is a decisive cue for them (Wang, 2008).
In another study, English and Mandarin Chinese native speakers used vowel quality as a primary cue and pitch as a secondary cue when detecting stress patterns in non-words, while pitch was completely disregarded by Russian speakers, who relied more on duration and intensity instead (Chrabaszcz et al., 2014). A similar pattern was found for perception of L2 English words: Mandarin speakers used vowel quality to a similar degree to native English speakers but weighted pitch more highly (Zhang & Francis, 2010). Productions of stressed syllables by native Mandarin speakers were also marked with higher pitch than those of English speakers (Zhang et al., 2008).
Suprasegmental features are important for speech comprehension (Cutler et al., 1997) and are strong predictors of not only degree of perceived foreign accent but also speech intelligibility and comprehensibility (Kang, 2010; Munro & Derwing, 2001; Trofimovich & Baker, 2006). However, L2 prosody perception strategies have received relatively little attention in the literature. Several studies have examined how suprasegmental categories map onto acoustic cues in various languages (Zhang & Francis, 2010; Chrabaszcz et al., 2014; Tremblay et al., 2016; Baek, 2022), but there is almost no research showing how cue weighting in L2 speech prosody changes as a function of L2 learning. One study found changes in word emphasis cue weighting strategies after brief exposure to artificial accents (Jasmin et al., 2023), suggesting that prosody perception strategies are not set in stone but can change rapidly when they are sub-optimal. Another study found that L1 Mandarin speakers with more than 3 years of residence in an English-speaking country made less use of pitch and greater use of duration when categorizing the location of a phrase boundary in English, compared to L1 Mandarin speakers with less than 1 year of residence abroad (Petrova et al., 2023). This suggests that L2 cue weighting strategies can change with extensive language exposure in adulthood. Yet it remains unclear whether short-term targeted training (e.g., 2–3 weeks) can modify L2 cue weighting strategies, and if so, what the consequences might be for prosody perception and production.
If L2 category learning relies on the re-distribution of attention to L2-relevant cues and away from L1-optimal cues, then training the ability to flexibly attend to acoustic dimensions could be a viable strategy for L2 speech learning. How can L2 learners’ attention be redirected from one set of cues to another? One possible approach would be to explicitly instruct participants to pay attention to the specific acoustic dimension relevant to a given language. While such perceptually focused instruction may aid L2 speech learning (e.g., Lee et al., 2020; Yang & Sundara, 2019), this kind of instruction is usually not available in naturalistic learning conditions, neither while learning a first language nor in most cases of second language acquisition (Francis et al., 2000; Francis & Nusbaum, 2002). Moreover, due to differences in cue relevance across languages, learners may not be able to follow such an instruction, either because they are not sensitive to L2-relevant dimensions or because other dimensions are so much more salient that they attract attention instead. For example, Mandarin Chinese speakers struggle to ignore pitch and attend to amplitude in English speech even when explicitly asked to do so (Jasmin et al., 2021). This suggests that because Mandarin speakers are adept at using pitch (their default L1 strategy), they might not be inclined to change a strategy that works well for them. Removing that perceptually overpowering dimension, or reducing its reliability, may be necessary to trigger changes in the cue weighting strategies employed by L2 learners. It remains to be seen whether such training-induced changes to perceptual strategies can lead to observable improvements in L2 speech perception and production.
1.1. Present study
We designed a novel prosody training paradigm to help L2 learners adjust their perceptual strategies to more closely resemble those of native speakers. Since Mandarin Chinese learners of English tend to overweight pitch information in L2 speech perception and production (Zhang et al., 2008) and have trouble disengaging attention from pitch even when explicitly asked to do so (Jasmin et al., 2021), the prosody training featured duration as the target cue and pitch as an uninformative distractor cue. We expected that several days of exposure to acoustically manipulated speech in which pitch was either removed or unreliable would push listeners to use dimensions other than pitch, leading to more flexible perceptual strategies that are less reliant on pitch. We also tested whether any cue weighting shifts generalized across domains (i.e., to music perception). Finally, we asked whether training perceptual strategies can lead to improvements in L2 prosody perception (i.e., the accuracy of categorization of prosodic features) and production.
2. Methods
2.1. Participants
Participants were 60 native Mandarin Chinese speakers (the pre-test data from these participants were previously reported in Kachlicka et al., 2024). Participants were students and working professionals recruited from the SONA participant recruitment platform (Sona Systems, https://www.sona-systems.com/), social media and community groups. The main goal of this study was to offer targeted training to improve participants’ L2 listening strategies. Therefore, we recruited only people who had learned English as a second language and had limited immersion experience (length of residence [LOR] < 18 months). All participants had learned English at school and had to pass a recognized English language test (e.g., IELTS, TOEFL) to enter education in the United Kingdom, but they had little to no experience in daily communication with native speakers of English prior to their arrival in the United Kingdom. Participants were randomly assigned to one of the training groups (Prosody – experimental, Vocabulary – control), and their demographics, language and musical background were balanced across groups (see Table 1).
Table 1. Summary of basic demographics, musical training and language background information

Note: The two individuals who reported their AOA as 11 and 14 had only travelled briefly to English-speaking countries at that time (<1 month) and were not fully immersed in an L2 until they moved to the United Kingdom to study.
2.2. General procedure
Interested participants who responded to study adverts were invited to a short telephone or video call to check their eligibility and explain the study procedures. After giving informed consent, participants began the study. For an overview of the procedure, see Figure 1. Part 1 (Pre-Test) consisted of a series of online tasks (dimension-selective attention task, categorization tasks and speech prosody perception task) and an EEG testing session that took place at Birkbeck University. Part 2 (Training) involved 6 days of online language training (experimental prosody or control vocabulary). After the training, Part 3 (Post-Test) included the same online tasks and a second EEG testing session in the lab. Analyses of the EEG data and the dimension-selective attention task, along with detailed task descriptions, are presented in Supplementary Material (Supplementary Sections S1–S5). Additionally, 6 months after the end of Part 3, participants were invited to complete Part 4 with the same online tasks. Only 42 of the 60 participants completed this stage, and the analysis of these data is presented in Supplementary Material (Supplementary Tables S4–S6). All other analyses presented in the following sections include data from all 60 participants. Experimental procedures were approved by the departmental Ethics Committee at Birkbeck. All participants were reimbursed for their time in cash (at £10 per hour) or its equivalent in course credits.

Figure 1. Schematic of the experimental design. All participants completed Parts 1, 2 and 3 of the study. They completed tasks from Parts 1 and 3 on separate days to avoid fatigue. During the training, they were allowed to take up to 2 days of breaks in total.
2.3. Training
2.3.1. Prosody training (experimental)
The experimental group of Mandarin speakers practised their ability to perceive prosodic information in English speech. The training stimuli were audio recordings of naturalistic speech capturing various prosodic features – phrase boundary, contrastive focus and word stress. Participants performed categorization exercises in which they heard one speech sample at a time and had to categorize it as belonging to one of the two categories displayed on the screen.
2.3.1.1. Stimuli
To create the training materials, we expanded the existing Multidimensional Battery of Prosody Perception (MBOPP, Jasmin et al., 2021) by adding lexical stress stimuli and additional recordings from multiple speakers for the contrastive focus and phrase boundary stimuli. The final corpus consists of recordings made by six professional voice actors (three female, three male) to guarantee high variability during training in speech rate (female voices: 116.24, 108.4 and 127.19 words per minute, mean 117.39; male voices: 175.62, 136.02 and 140.06 words per minute, mean 150.57), pitch (female voices: 229.51, 248.77 and 206.65 Hz, mean 228.31 Hz; male voices: 145.18, 111.27 and 123.83 Hz, mean 126.76 Hz) and age (females: 24, 33 and 43 years, mean 33.33; males: 25, 36 and 46 years, mean 35.67). More details about the stimulus recording session can be found in Supplementary Material (Supplementary Section S7).
The list of stimuli included up to 100 sentences for each of the three prosody tests – 84 sentences for “Phrase Boundary,” 94 for “Contrastive Focus” and 100 for “Lexical Stress” (a complete list of stimuli is available at https://osf.io/vbzt9/). The sentences were arranged into pairs forming target prosody contrasts. For example, for the lexical stress stimuli, one sentence had a word with stress placed on the first syllable (e.g., “Most buildings in the COM-pound are connected by tunnels”) and the other sentence included a word with stress placed on the second syllable (e.g., “Changes to current policies will only com-POUND the problem”). All voice actors were asked to read the sentences aloud using their usual strategies for conveying the above-mentioned prosodic features. The speakers were asked to make sure that their speech sounded as natural as possible while emphasizing the target contrast.
2.3.1.2. Stimuli processing
All audio files were first trimmed to remove any silences at the beginning and end of each recording and downmixed to mono by averaging the existing channels. Next, target words were extracted from the carrier sentences for stress stimuli, and identical portions of recordings were extracted for contrastive focus and phrase boundary stimuli. All samples were then normalized to an equal loudness level by adjusting them to −20 dB RMS relative to the loudest point across recordings (represented as 0 dB) with the “Match loudness” function in Adobe Audition software (Adobe Inc.). This normalization procedure ensured loudness consistency across samples recorded on different devices.
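The loudness-normalization step can be sketched in a few lines; this is a minimal NumPy illustration of RMS normalization to −20 dB relative to full scale (the function name and implementation are ours, not Adobe Audition's actual "Match loudness" algorithm):

```python
import numpy as np

def normalize_rms(signal: np.ndarray, target_db: float = -20.0) -> np.ndarray:
    """Scale `signal` so its RMS level sits at `target_db` dB relative to full scale (0 dB)."""
    rms = np.sqrt(np.mean(signal ** 2))
    target_rms = 10.0 ** (target_db / 20.0)  # -20 dBFS -> linear RMS of 0.1
    return signal * (target_rms / rms)
```

Applying the same target level to every file equates average loudness across recordings made on different devices, regardless of their original recording gain.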
Voice stimuli were then processed using the STRAIGHT voice morphing software (Kawahara & Irino, 2005) by morphing the two target contrast recordings along two acoustic dimensions – F0 and duration. For all samples, F0 was set at 50% (i.e., representing values of F0 halfway between the two contrast recordings), and duration was varied from 0% to 100% in 5% increments. The other dimensions available in STRAIGHT (i.e., aperiodicity, spectrum and frequency) were also set at 50%, such that morphs contained equal amounts of information from the two recordings. The resulting morphs with 0% duration represent one end of the continuum (i.e., the word “com-POUND,” with the second, stressed syllable longer than the first) and those with 100% duration represent the other extreme (the word “COM-pound,” with the first, stressed syllable longer than the second).
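The parameter scheme for one contrast pair can be enumerated compactly; the sketch below is our own helper (not STRAIGHT's actual API) listing the 21 morph settings, with every non-duration dimension pinned at the ambiguous 50% point:

```python
def duration_continuum(step: int = 5) -> list:
    """Morph settings: F0, aperiodicity, spectrum and frequency fixed at 50%;
    duration swept from 0% to 100% in `step`% increments."""
    fixed = {"f0": 50, "aperiodicity": 50, "spectrum": 50, "frequency": 50}
    return [{**fixed, "duration": d} for d in range(0, 101, step)]
```

With the default 5% step this yields 21 morphs per contrast, from one durational extreme (0%) to the other (100%).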
Whispered speech was created by transforming the original recordings with the “Whisper” command from the Praat Vocal Toolkit (Corretge, 2023). After transformation, no traces of voicing sufficient to detect F0 remained in the recordings.
As a last step, we conducted an acoustic similarity analysis across all stimuli (for details, see Supplementary Section S7 in Supplementary Material) to remove target contrasts that could be difficult for listeners to distinguish. The final stimulus set consisted of 5416 focus stimuli, 5572 phrase stimuli and 5076 stress stimuli, which were allocated to appropriate difficulty levels depending on their durational cue content.
2.3.1.3. Adaptive difficulty levels
The main goal of the training was to help participants achieve more English native-like prosody perception strategies. Since Mandarin speakers tend to overweight pitch, the training required participants to use duration for prosodic categorization by either eliminating F0 entirely or removing its efficacy as a cue. We gradually introduced task-irrelevant pitch variation and decreased the size of the duration cue to increase the difficulty of the tasks throughout the training.
Participants started the training by listening to whispered speech, which did not contain any voiced elements. Removing the otherwise salient F0 from the initial levels of training helped listeners focus on duration when making categorical judgements, as no competing information was available. They then moved on to listening to transformed voiced speech. In these samples, the pitch contour was audible and changed naturally within speech, but it was uninformative (i.e., it did not help listeners resolve the target contrast, as its values were set at 50% between the two contrasting recordings; see Supplementary Figures S3 and S4 for spectrograms of example stimuli). Thus, listeners were forced to rely on duration when making categorical decisions. An additional layer of difficulty was introduced by decreasing the step size of the duration cue (see Supplementary Table S8). The easiest levels contained the target contrasts with the largest duration differences (0% vs 100%, 5% vs 95%), while more challenging levels, which included samples closer together in the stimulus space (20% vs 80%, 30% vs 70%), were introduced as participants proceeded through the training.
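Because the duration continuum is symmetric around the ambiguous 50% midpoint, the narrowing contrasts can be generated from a single separation value; a sketch using the example separations mentioned above (the actual level-to-separation mapping is given in Supplementary Table S8):

```python
def contrast_pair(separation: float) -> tuple:
    """A (low, high) duration-morph pair centred on 50%, e.g. separation 40 -> (30, 70)."""
    return (50 - separation / 2, 50 + separation / 2)

# easiest -> hardest, using the separations cited in the text
difficulty_ladder = [contrast_pair(s) for s in (100, 90, 60, 40)]
```

Shrinking the separation pushes the two morphs toward the centre of the stimulus space, making the durational cue progressively harder to use.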
2.3.2. Vocabulary training (control)
The control group practised their English skills with vocabulary-boosting exercises matched in length and intensity with the experimental training. Vocabulary training was chosen to satisfy the requirements of an effective control condition. It was essential to select an active control condition that would not overtly appear as such (e.g., no practice at all, or tasks unrelated to L2 learning, would immediately be recognized as a control condition). Moreover, both training paradigms needed to be equally engaging and motivating for participants. In the vocabulary training, participants could practise their knowledge of English words and learn increasingly challenging vocabulary. Participants were asked to match pictures, words and short phrases with their translations. They saw an image, a word in Chinese or a word in English, and had to click on the appropriate English word to name the object, translate the Chinese word or match the English word with its short definition.
2.3.2.1 Stimuli and difficulty levels
The main vocabulary selection used in the control training was taken from the Vocabulary Test developed by Schmitt and Schmitt (2014). These materials include lists of words arranged by frequency of occurrence, reflecting the order in which L2 learners would most likely acquire the vocabulary (i.e., vocabulary profile; Nation, 2006; Cobb, 2012). For example, words from the 2K level would be learned first by English learners, since these words are the most frequently used; learners would then move on to words from the subsequent groups. The training stimuli included words from the following vocabulary levels: 2K (e.g., birth, dust, operation), 3K (e.g., assist, bother, condemn), academic vocabulary (3K and 4K level words, e.g., anticipate, principle, empirical), 5K (e.g., casual, desolate, fragrant) and 10K (e.g., smoulder, luscious, primeval).
Within each word level, participants completed six different types of tasks with different levels of complexity (Table 2). Introducing various tasks served two functions: to make the training more engaging and to make the tasks more challenging without adding new vocabulary. This aligns with claims that instruction should focus on helping learners develop not only breadth but also depth of word knowledge, with the goal of achieving more automatic use in real-life settings (e.g., Nation, 2013).
Table 2. Description and examples of vocabulary training task

Using different tasks with the same vocabulary facilitated the repetition of new material and allowed participants to reinforce their engagement with particular words in different task contexts (i.e., naming, translation and definition). First, participants completed picture-matching tasks. They either saw one picture and needed to select the word that best described it, or saw three images and had to decide which picture corresponded to the word displayed on the screen. These tasks included only concrete words (i.e., primarily nouns or adjectives) that could be easily depicted with an image. Apart from the target word, each trial included two filler words: a semantic filler (related in meaning, but not a synonym) and a phonological filler (a similar-sounding word, such as a near-homophone or close rhyme). For example, for the target word “shoe,” good filler words could be “sue” (similar-sounding) and “sock” (related meaning – you put both on your feet), whereas “boot” or “sneaker” would not be good choices because they are synonyms of “shoe.” Next, participants moved on to word-to-word translations, where they had to either match one of the English words with one of three Mandarin Chinese translations or select the appropriate English translation from three options given for a single Chinese word. In these tasks, we introduced more abstract words and verbs, and we used the same filler words as in the picture tasks. The last tasks included trials in which participants needed to match English words with short definitions written in English. All the words included in the training were translated into Mandarin Chinese by a linguistically trained native speaker of Mandarin Chinese with several years of professional experience in English-to-Chinese and Chinese-to-English translation.
2.3.3. Training procedure
Both training paradigms were designed and hosted on the Gorilla platform (Anwyl-Irvine et al., 2020). The design included features drawn from video games (e.g., a levels display, colour themes, audio feedback) to maximize participants’ engagement. Participants completed the same tasks every day for 6 days (approximately 30 minutes a day). Although we recommended practising every day, we allowed up to 2 days of breaks during the training (i.e., participants completed six training sessions across eight total days at most). Compliance was monitored daily by the researcher. Each session included 15 blocks of 20 trials (300 trials per day, 1800 trials in total). In the prosody training, there were five blocks for each linguistic feature. Scores and level advancement were computed separately for each linguistic feature (contrastive focus, phrase boundary and lexical stress) so that between-feature differences did not interfere with participants’ overall progress; duration might vary in informativeness across features, which could otherwise have prevented participants from progressing to the next levels if they struggled with one linguistic feature. Participants could always see which feature they were practising in the top left corner of the screen. The order in which the feature blocks were presented was randomized across training sessions and between participants. In the vocabulary training, there were 15 blocks of 20 vocabulary trials per day. For both types of training, the 20 trials presented in each block were randomly selected for each participant from the relevant stimulus set. If a participant did not reach the pass threshold and had to repeat a given level, the algorithm drew a new random selection of 20 trials from that level (i.e., participants did not see the same trials in the same order again).
Both forms of training were adaptive, becoming gradually more difficult as participants performed well, to ensure that the tasks remained challenging. The adaptive regimen was meant to provide balanced difficulty: it allowed participants to move quickly through the easier levels and spend time on challenging tasks, where more practice is needed to elicit change. The levels participants reached by the end of the training might vary, but all participants completed the same amount of training and were consistently challenged. This balance was especially crucial in the Prosody Group, as the perceived salience of acoustic dimensions might vary across participants. With an adaptive procedure, the stimuli each participant heard and the levels they reached reflect the highest difficulty that participant could manage.
After each block, participants received feedback on how well they had performed during that block. If they scored 75% correct or more, they moved up a level; if they scored below 75% (i.e., made more than five mistakes), they had to repeat the current level. This means that although participants completed the same amount of training (i.e., the same number of trials), the number of levels they cleared depended on their performance, since the training was adaptive. Progress carried over across training days. Additionally, immediate feedback was provided on the screen for each trial, and a summary of scores was given at the end of each block. Participants could track their progress within each block via the progress bar displayed at the top of the screen, and could see their current level throughout the training in the top right corner of the screen. Online materials used to run both types of training are available for preview at https://app.gorilla.sc/openmaterials/580460 and https://osf.io/vbzt9/.
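The adaptive progression rule reduces to a few lines of logic; a minimal sketch under the parameters stated above (75% pass threshold, 20-trial blocks, a fresh random draw of trials on every attempt; the function names are ours):

```python
import random

PASS_THRESHOLD = 0.75  # 15/20 correct needed to advance

def next_level(level: int, n_correct: int, n_trials: int = 20) -> int:
    """Move up a level on >= 75% correct; otherwise repeat the current level."""
    return level + 1 if n_correct / n_trials >= PASS_THRESHOLD else level

def draw_block(level_pool: list, n: int = 20, rng=random) -> list:
    """A fresh random sample from the level's stimulus pool, so a repeated
    level never presents the same trials in the same order."""
    return rng.sample(level_pool, n)
```

Because advancement depends only on block accuracy, all participants complete the same number of trials while the level reached tracks individual performance.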
2.4. Outcome measures
2.4.1. Cue weighting tasks
Cue weighting tasks measured participants’ reliance on acoustic cues (pitch vs duration) during categorization of L2 speech prosody (contrastive focus, lexical stress and phrase boundary) and musical beats. In all four categorization tasks, participants were presented with stimuli that varied orthogonally in the extent to which pitch and duration indicated one of two possible interpretations. After listening to each stimulus, participants were asked to categorize the stimuli as belonging to one of two categories: phrase with early or late closure (“If Barbara gives up, the ship” vs “If Barbara gives up the ship”), emphasis on the first or second word (“STUDY music” vs “study MUSIC”), lexical stress on the first vs second syllable (“COMpound” vs “comPOUND”) and musical beats occurring either every two or three notes (“strong–weak” vs “strong–weak–weak” patterns). Speech stimuli were designed to be lexically ambiguous to ensure that participants based their judgements solely on prosodic cues. There were 10 blocks of each categorization task, which were interleaved in the following order: musical beats, contrastive focus, lexical stress and phrase boundary.
Speech stimuli were taken from the MBOPP battery (contrastive focus and phrase boundary; Jasmin et al., 2021) or newly recorded (lexical stress) by the same native Southern British English speaker. The two versions of the same phrase, differing in the location of the prosodic contrast, were morphed together using STRAIGHT (Kawahara & Irino, 2005), with the morphing rates for fundamental frequency (F0, the acoustic correlate of pitch) and duration set at 0%, 33%, 67% and 100% (a recording with both cues at 0% represents one category and at 100% the alternate category, with the interim values linearly interpolated between the two original recordings).
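Crossing the two cues orthogonally yields a 4 × 4 grid of stimulus conditions per contrast; a minimal sketch of that design:

```python
MORPH_RATES = (0, 33, 67, 100)  # percent along each morphing continuum

# every F0 morph rate crossed with every duration morph rate: 16 conditions,
# ranging from fully congruent (0, 0) / (100, 100) to maximally
# conflicting (0, 100) / (100, 0)
stimulus_grid = [{"f0": f0, "duration": dur}
                 for f0 in MORPH_RATES for dur in MORPH_RATES]
```

Because the two cues vary independently across the 16 conditions, each listener's category responses can be regressed on each cue separately to estimate its perceptual weight.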
A similar approach was used to create the musical beats stimuli. Sequences of six four-harmonic complex tones, repeated three times, formed either three-note groupings (“strong–weak–weak” pattern, waltz time) or two-note groupings (“strong–weak” pattern, march time). These groupings were created by increasing the pitch or duration of the first tone relative to the following one or two tones. There were four levels for each acoustic cue. The pitch levels were [C#-A-A-C#-A-A] and [B-A-A-B-A-A] (indicating groups of three) and [B-A-B-A-B-A] and [C#-A-C#-A-C#-A] (indicating groups of two), where A was 440 Hz, B was 493.9 Hz and C# was 554.4 Hz. The duration levels were [200 50 50 200 50 50 ms] and [100 50 50 100 50 50 ms] (indicating groups of three) and [100 50 100 50 100 50 ms] and [200 50 200 50 200 50 ms] (indicating groups of two).
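The beat stimuli can be reconstructed from those parameters; the sketch below synthesizes one maximally waltz-cueing sequence (the sample rate, amplitude scaling and function names are our assumptions; the tone frequencies, durations and repetition structure follow the description above):

```python
import numpy as np

SR = 44100  # assumed sample rate (Hz)
A_HZ, B_HZ, CS_HZ = 440.0, 493.9, 554.4  # tone fundamentals from the text

def complex_tone(freq: float, dur_ms: float, n_harmonics: int = 4) -> np.ndarray:
    """A four-harmonic complex tone, amplitude scaled into [-1, 1]."""
    t = np.arange(int(SR * dur_ms / 1000)) / SR
    tone = sum(np.sin(2 * np.pi * freq * h * t) for h in range(1, n_harmonics + 1))
    return tone / n_harmonics

def beat_sequence(freqs, durs_ms, repeats: int = 3) -> np.ndarray:
    """One six-tone grouping concatenated and repeated, as in the beats task."""
    one_pass = np.concatenate([complex_tone(f, d) for f, d in zip(freqs, durs_ms)])
    return np.tile(one_pass, repeats)

# strongest "groups of three" cue on both pitch and duration:
waltz = beat_sequence([CS_HZ, A_HZ, A_HZ, CS_HZ, A_HZ, A_HZ],
                      [200, 50, 50, 200, 50, 50])
```

Swapping in one of the other pitch or duration level vectors produces the weaker-cue or two-note-grouping versions of the sequence.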
2.4.2. Speech prosody perception
To measure L2 prosody perception, we assessed categorization of contrastive focus, phrase boundary and lexical stress. On each trial, participants heard a spoken phrase or a word and saw two written versions on the screen. Their task was to decide whether the spoken phrase or word sounded most like the phrase or word on the left or the right side of the screen and indicate their responses via button press. We asked them to respond as quickly as possible without making any mistakes to measure their relatively spontaneous and automatic L2 processing (Saito & Plonsky, Reference Saito and Plonsky2019). The final score was calculated as the proportion of correct responses.
The stimuli were 120 phrases selected to capture the contrasts between emphasized vs non-emphasized words, stressed vs unstressed syllables and phrase boundary placement (40 phrases each; the full list of stimuli can be found at: https://osf.io/vbzt9/). All phrases were arranged in pairs representing contrastive instances of a given feature. For example, lexical stress items included pairs of two-syllable words with stress placed on the first syllable (e.g., PROtest) or the second syllable (e.g., proTEST). To avoid familiarity with the voices they would be listening to during the training, the stimuli for this test were recorded by two different voice actors (male and female, both native British English speakers). All the stimuli were recorded in a sound-proof booth at Birkbeck University with a RØDE NT1A large-diaphragm cardioid condenser microphone with shock mount and pop filter and Audacity software (version 3.0.5). The researcher instructed the actors to read the sentences naturally and convey prosodic contrasts as they usually do while speaking.
To make the perception of the prosodic contrast challenging enough to detect individual differences in performance (i.e., to avoid ceiling effects), we used modified versions of these recordings in which the magnitude of the acoustic cues was reduced. We morphed two versions of each contrast with STRAIGHT morphing software (Kawahara & Irino, Reference Kawahara, Irino and Divenyi2005). We time-aligned these phrases, manually marked anchor points (i.e., corresponding portions of both recordings) and generated stimulus continua varying along fundamental frequency (F0) and duration. For the prosody test, we chose the samples with pitch and duration both set at 70% or both set at 30% (i.e., both features suggesting the same interpretation of the stimulus, but with reduced overall amount of informativeness to avoid ceiling effects). Selected stimuli were divided into two versions of the task in which the presence of male and female voices across features was counterbalanced. Trials were organized in three blocks by linguistic features and presented in random order.
2.4.3. Speech prosody production
While many existing studies use highly controlled production tasks (e.g., delayed repetition tasks), where L2 learners can fully focus on accurately producing target phonetic features, these tasks have been criticized. The concern is that such performance may not represent how L2 learners actually use language in more communicatively authentic contexts, where they need to produce intelligible exemplars of the target features while also attending to lexical, grammatical and semantic aspects of language (Saito & Plonsky, Reference Saito and Plonsky2019). To address this concern, we adapted a timed picture description task widely used in speech production research (e.g., Saito, Reference Saito2013) to measure participants’ spontaneous production of English prosody. We designed pictures including prompts comprising phrases reflective of the prosodic contrasts of interest: phrase boundary, contrastive focus and lexical stress. Participants were then tested on their ability to produce target words with accurate use of these prosodic features while primarily focusing on conveying meaning (by describing each picture).
The stimuli were word, phrase or sentence pairs capturing prosodic contrasts embedded into colourful cartoon images. To avoid familiarity effects, the sentences included in the production task differed from the prosodic contrasts included in the cue weighting tasks and the prosody test. The images were designed to correspond thematically with the embedded text and were created with the Canva platform (www.canva.com) and its open graphics library. We created 24 unique images in total, four contrasts per feature. The images were split into two sets, Versions A and B, randomized between testing sessions to avoid training effects and familiarity with picture content.
Participants were asked to describe each picture by using the target phrase or word embedded within a picture. They saw one picture at a time, and they had 10 seconds to prepare (i.e., look at the picture and think about what to say) and then 30 seconds to speak. They saw a timer with a countdown indicating the remaining time on the screen. Before moving on to the task, the researcher presented participants with three examples of picture descriptions to indicate what kind of descriptions they should provide (see Figure 2).

Figure 2. Examples from prosody picture description task instruction. During the testing session, the researcher demonstrated these images and performed sample descriptions as examples. (A) Phrase boundary example: “In this picture, I can see an old lady and two kids. I think they are baking something. At some point, the girl, Jenny, says, ‘I’m going to eat, grandma!’ Most probably, she’s hungry. They all look very happy.” (B) Contrastive focus example: “Two girls in this picture are eating something from their bowls. I think it’s a strawberry yoghurt, not a natural yoghurt, because what they have in their bowls is pink. Pink yoghurt is more likely strawberry. I don’t like strawberry yoghurt.” (C) Lexical stress example: “There is a lady in a pink suit who looks troubled. She doesn’t know how to compress wav files into mp3. Her boss will be angry if she doesn’t do it.” If participants did not know the meaning of the word, they were advised to attempt to describe the picture by including that word in their description, for example, by saying: “There is a lady that is thinking about something. She says ‘compress,’ so she most probably needs to compress something, but she doesn’t know what to do” or even “The picture shows a working woman. I don’t know what the word ‘compress’ means, but the lady is saying it. She’s wearing a pink suit and a nice haircut.” This was to encourage the articulation of all the prosodic contrasts even if participants were not familiar with the target words and their meaning.
2.4.3.1. Speech recordings
Participants’ responses were recorded with a RØDE NT1A large-diaphragm cardioid condenser microphone with shock mount and pop filter and Audacity software (version 3.0.5). For safety reasons during the COVID-19 pandemic, participants were wearing face masks during the testing sessions. Whenever possible, participants were allowed to remove their masks for the duration of the speaking task (less than 10 minutes) and put their masks on for the remainder of the session. Participants who were uncomfortable removing their masks in the lab were provided with standard two-layered surgical masks to keep the speech distortions constant across participants and testing sessions. Surgical masks of this type were shown to cause minor speech distortions in controlled speech recordings (Corey et al., Reference Corey, Jones and Singer2020; Magee et al., Reference Magee, Lewis, Noffs, Reece, Chan, Zaga, Paynter, Birchall, Rojas Azocar, Ediriweera, Kenyon, Caverlé, Schultz and Vogel2020; Nguyen et al., Reference Nguyen, McCabe, Thomas, Purcell, Doble, Novakovic, Chacon and Madill2021).
To evaluate how well participants conveyed English prosody, we extracted from the 30-second recordings only the portions that represented contrasts of interest (e.g., the word “present,” the contrastive phrase “French food” or a fragment of a sentence including a phrase boundary “When the boy leaves[,] the house”). A detailed description of the pre-processing procedure is provided in Supplementary Material (Supplementary Section S8).
2.4.3.2. Speech production ratings
All speech samples were rated by five native Southern British English speakers (all female, Mage = 26, SDage = 2.74, rangeage = 23–28). All raters had previous experience conducting L2 speech ratings and were highly familiar with the variability of native English prosody articulation and of non-native accents. Three were current doctoral students (including one who had obtained some training in L2 linguistics), one was a BA graduate in English literature and one was a working professional. Although they were all naïve listeners, the task involved rating aspects of speech that every native speaker should be familiar with, and their familiarity with these constructs was confirmed by their self-rated understanding of the task (ratings on a 9-point scale where 9 represents very good understanding of the task; Mfocus = 8.6, Mstress = 8.4, Mphrase = 8.2). Raters completed their task in three sessions, each approximately 1–1.5 hours long. The order of blocks across participants and of trials within each block was randomized. More details about the rating procedure are included in Supplementary Material (Supplementary Section S9).
2.4.3.3. Rating scales
We adopted the 9-point rating procedure developed by Flege and colleagues (Flege et al., Reference Flege, Takagi and Mann1995; the procedure was originally created to capture the different developmental stages of English /r/ vs /l/ acquisition by Japanese raters) to measure how well participants can articulate L2 English prosody. All listeners judged the quality of prosodic contrasts by choosing one of the response alternatives ranging from 1 for native-like productions of early boundary/focus/stress to 9 for native-like productions of late boundary/focus/stress (see Supplementary Table S8). The ratings were meant to represent the accuracy of participants’ prosody production. The mid-point of the scale was used for neutral samples that could reflect the interlanguage stage of adult L2 learning (for a similar approach to the acquisition of segmentals, see Saito et al., Reference Saito, Kachlicka, Suzukida, Petrova, Lee and Tierney2022).
2.4.3.4. Final scores
To compute the final production accuracy scores, the lower end of the scale (i.e., 1–5 ratings) was inverted using the 10-N formula so that the native-like productions of both categories were equal to 9 (i.e., the higher the rating, the more native-like their production). Then, we dichotomized the ratings into correct vs incorrect realizations of the prosodic contrasts. To do so, we counted all the trials with ratings above 5 as correct (representing a range of correct pronunciations) and all the trials with ratings equal to or lower than 5 as incorrect (i.e., receiving a neutral rating or representing the opposite end of the contrast). We obtained the final score for each participant by averaging all correct and incorrect responses across all trials and the three most reliable raters (more details about raters’ reliability are provided in Supplementary Section S10 and Supplementary Table S9 of Supplementary Material). That procedure resulted in a single number per participant per feature (the higher the value, the higher the overall accuracy of prosody production).
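A minimal sketch of this scoring pipeline. One detail is left implicit in the text, so it is an assumption here: the 10 − N inversion is applied on trials whose target category sits at the low end of the scale, so that higher ratings always mean a more target-like production before dichotomizing:

```python
def score_trial(rating, target_is_low_end):
    """Return 1 if the trial counts as correct, 0 otherwise.

    rating: 1-9 scale, where 1 = native-like production of one category
    (e.g., early boundary) and 9 = native-like production of the other.
    target_is_low_end: True when the intended category sits at the low end
    of the scale; the rating is then inverted with 10 - N so that higher
    always means more target-like. Ratings above 5 count as correct;
    5 (neutral) or below count as incorrect.
    """
    if target_is_low_end:
        rating = 10 - rating
    return 1 if rating > 5 else 0

def production_score(trials):
    """Average correctness over (rating, target_is_low_end) pairs,
    yielding one accuracy value per participant per feature."""
    scores = [score_trial(r, low) for r, low in trials]
    return sum(scores) / len(scores)
```

In the study this average would additionally be taken across the three most reliable raters; the sketch collapses that step for brevity.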
2.4.3.5. Speech acoustics analysis
Since our analyses showed that the prosody training only led to demonstrable changes in cue weighting strategies during phrase boundary categorization, our acoustic analysis focused solely on the production of phrase boundaries. To assess the use of pitch and duration in conveying speech prosody, we quantified differences in pitch and duration use between tokens with early vs late phrase boundary, that is, the word just before the phrase boundary in the early phrase examples and the corresponding word in the late phrase sentences. The word just before the boundary is where the acoustic difference between the early and late phrase boundaries is the most pronounced (i.e., words just before phrase boundary tend to be longer and have more varied pitch). Durations were computed with a custom MATLAB script based on annotated timestamps following a simple formula where word duration = word offset – word onset. Pitch contours were extracted using Praat (Boersma & Weenink, Reference Boersma and Weenink2023) and then manually corrected to reflect pitch estimates for voiced parts of the word only. Then, we computed standard deviation across every 10-ms voiced segment of each recording and used these values as a measure of pitch variability. A detailed description of the procedure is presented in Supplementary Material (Supplementary Section S11).
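The two acoustic measures can be sketched as follows. This is illustrative only (the actual pipeline used a custom MATLAB script and Praat), and it assumes unvoiced 10-ms frames are marked as NaN in the F0 track:

```python
import numpy as np

def word_duration(onset_s, offset_s):
    """Word duration (s) from annotated timestamps: offset - onset."""
    return offset_s - onset_s

def pitch_variability(f0_track_hz):
    """Standard deviation of F0 estimates sampled every 10 ms,
    restricted to voiced frames (unvoiced frames marked as NaN)."""
    f0 = np.asarray(f0_track_hz, dtype=float)
    voiced = f0[~np.isnan(f0)]
    return float(np.std(voiced))
```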
2.5. Statistical analyses
All statistical analyses were conducted in R (R version 4.4.2). To quantify any changes in the use of acoustic cues across categorization tasks after the training, we used mixed-effects logistic regression models from the lme4 package (Bates et al., Reference Bates, Mächler, Bolker and Walker2015). The trial-by-trial categorical responses (represented as 0 or 1) were used as the dependent variable. The categorical variables representing the testing session (pre-test, post-test) and training group (vocabulary, prosody) were treatment coded with the first variable level serving as a baseline and the second as a group comparison (0 and 1, respectively). The continuous predictors pitch level (1–4) and duration level (1–4) were standardized using the rescale function from the arm package (Gelman et al., Reference Gelman, Su, Yajima, Hill, Pittau, Kerman, Zheng and Dorie2022). The resulting beta coefficients from the models (categorical response ~ testing session × training group × pitch level × duration level + (1 | Participant ID); Models 1–4: four separate categorization models for phrase boundary, contrastive focus, lexical stress and musical beats) represent the change in log odds given an increase of one standard deviation of that variable. Participants’ unique IDs were included as a random intercept. Simpler models without random slopes were interpreted since including random slopes for pitch and duration levels resulted in overfitting. We based our model evaluation on automated warnings from the lme4 package that flag instances of “singular fit” in overparametrized models (Bates et al., Reference Bates, Kliegl, Vasishth and Baayen2018). Across all models, we only removed terms required to allow for a non-singular fit (following the recommendation by Barr et al., Reference Barr, Levy, Scheepers and Tily2013).
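The rescale step follows Gelman's two-standard-deviation standardization: centre the predictor and divide by 2 SD, which puts continuous predictors on a scale roughly comparable to treatment-coded binary predictors. A sketch of the continuous case used here (arm::rescale treats binary inputs differently; ddof=1 gives the sample SD, matching R's sd()):

```python
import numpy as np

def rescale(x):
    """Centre a continuous predictor and divide by two standard
    deviations (Gelman, 2008), as in R's arm::rescale for numeric input."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2 * x.std(ddof=1))

# Pitch levels 1-4, as in the categorization tasks (balanced design assumed).
pitch_level = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
z = rescale(pitch_level)  # centred, with sample SD of 0.5
```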
Mixed-effects models from the lme4 package were also used to test the effects of training on perception of speech prosody (correct response ~ testing session × training group + (1 | Participant ID); Models 5–7: three separate perception models for phrase boundary, contrastive focus and lexical stress). The dependent variables were responses on each trial (0 – incorrect, 1 – correct), and categorical predictors representing the testing session (pre-test, post-test) and training group (vocabulary, prosody) were treatment coded (0 and 1, respectively).
Using linear models for continuous outcomes bounded to the 0–1 interval might result in spurious effects, so we used regression models with a beta distribution for modelling correctness of production (averaged correctness of responses computed from rating data; Models 8–10: three separate production rating models for phrase boundary, contrastive focus and lexical stress). We used an implementation of a mixed-effects regression model with a beta distribution from the glmmTMB package (Brooks et al., Reference Brooks, Kristensen, van Benthem, Magnusson, Berg, Nielsen, Skaug, Mächler and Bolker2017). The categorical variables representing testing time (pre-test, post-test) and training group (vocabulary, prosody) were treatment coded with the first variable level serving as a baseline and the second as a group comparison (0 and 1, respectively), while participants’ unique IDs were included as a random intercept (model formulae were response ~ testing session × training group + (1 | Participant ID) for all three above models).
Finally, a linear mixed-effects model was used to test the effects of training on the use of pitch and duration in conveying phrase boundaries (SD ~ feature/testing session × training group + (1 | Participant ID); Model 11: nested model for pitch and duration use in production). The dependent variable was the standard deviation (SD) of pitch or duration (normalized with an inverse hyperbolic sine transformation). Categorical predictors representing the testing session (pre-test, post-test) and training group (vocabulary, prosody) were treatment coded (0 and 1, respectively), whereas acoustic feature (pitch, duration) was coded with a centred sum contrast (−.5, .5).
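The inverse hyperbolic sine, asinh(x) = ln(x + √(x² + 1)), behaves like a log transform for large values but, unlike ln, is defined at (and near) zero, which makes it suitable for normalizing SD measures that can be very small. NumPy implements it directly:

```python
import numpy as np

def normalize_sd(sd_values):
    """Inverse hyperbolic sine transformation of SD measures.

    For large x, arcsinh(x) is approximately ln(2x), so the transform
    compresses large values like a log while passing smoothly through 0.
    """
    return np.arcsinh(np.asarray(sd_values, dtype=float))
```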
Processed data and analysis scripts (including alternative models with testing time and training group coded using a centred sum contrast) can be found at: https://osf.io/vbzt9/.
3. Results
3.1. Training shifts cue weighting strategies towards stronger reliance on duration in categorizing phrase boundaries
Participants relied on pitch and duration across all categorization tasks (Models 1–4; Table 3, Figure 3), confirming the usefulness of both acoustic cues in conveying information about these categories (pitch, contrastive focus β = 5.350, SE = .140, p < .001; phrase boundary β = 1.654, SE = .080, p < .001; lexical stress β = 5.692, SE = .151, p < .001; and musical beats β = 8.697, SE = .257, p < .001; duration, contrastive focus β = .339, SE = .090, p < .001; phrase boundary β = 2.897, SE = .088, p < .001; lexical stress β = .493, SE = .095, p < .001; and musical beats β = 1.921, SE = .130, p < .001). When comparing the reliance on these cues before and after the training, we found significant differences in duration use during phrase boundary categorization (Model 1; interaction between time, group and duration; β = .424, SE = .169, p = .012) and pitch use during lexical stress categorization (Model 3; interaction between time, group and pitch; β = −1.289, SE = .310, p < .001). To follow up on these interactions, we ran separate regression analyses for each training group for both tasks (see Supplementary Table S11). These post hoc analyses revealed that during phrase boundary categorization, participants from the prosody group relied more on duration after the training compared to before the training (Supplementary Model S17; β = .304, SE = .100, p = .002), while no changes in duration use were observed for the vocabulary group (p > .05). Conversely, when categorizing lexical stress, participants from the vocabulary group increased their reliance on pitch after the training (Supplementary Model S18; β = .962, SE = .228, p < .001), but participants in the prosody group did not change (p > .05). No effects of training were observed for contrastive focus (Model 2) and musical beats categorization (Model 4).
Table 3. Summary of effects across mixed-effects regression models for prosody and musical beats categorization tasks

Note: Categorization models’ formula was as follows: categorical response ~ testing time × training group × pitch level × duration level + (1 | Participant ID) for all four models.

Figure 3. Cue weighting patterns in speech prosody and musical beats categorization tasks. The lines represent the proportion of categorization responses across training groups and sessions, with error bars depicting 95% CI. Participants’ performance is plotted as a function of pitch level (left) and duration level (right) for Vocabulary and Prosody groups to visualize the differences in pitch and duration use during categorization before and after the training. For discussion of observed response patterns, see Supplementary Section S12.
Additional analyses including musical training as a covariate (Supplementary Table S12) are available in the Supplementary Material. These analyses are not discussed in the main text, as they do not alter the interpretation of the core findings.
3.2. Training does not lead to improvements in prosody perception
A significant two-way interaction between time and training group suggested that the effect of time on perception of phrase boundaries differed between the prosody and control training groups (Model 5; β = .309, SE = .154, p = .044; Table 4). Post hoc regression analysis for each training group revealed that this effect was driven by decreased performance accuracy in the Vocabulary group over time (Supplementary Model S23; β = .239, SE = .110, p = .030; see Figure 4 and Supplementary Table S13), with no significant change in performance for the prosody training group (p > .05). There were no effects of training on perception of contrastive focus or lexical stress (Models 6 and 7; p > .05).
Table 4. Summary of effects in mixed-effects regression models for accuracy in prosody perception (Models 5–7) and prosody production (Models 8–10); separate models for phrase boundary, contrastive focus and lexical stress

Note: Models’ formula was as follows: accuracy ~ testing time × training group + (1 | participant ID) for all six models.

Figure 4. Performance accuracy for phrase boundary perception before and after the training. Error bars depict 95% CI.
3.3. Training does not lead to improvements in prosody production
Analysis of prosody production accuracy revealed no improvements following training (Models 8–10; time × group interactions, p > .05; Table 4).
3.4. Participants in the control group increase their use of pitch in conveying speech prosody
Furthermore, results from the mixed-effects regression model for pitch and duration use in production of phrase boundaries (Table 5) indicated that the groups differed in the effect of time on use of acoustic cues (Model 11; feature × time × group interaction, β = −.504, SE = .172, p = .004; Figure 5). Post hoc pairwise comparisons revealed that the three-way interaction was driven by the Vocabulary group using pitch more after the training (Supplementary Model S27, β = .492, SE = .157, pFDR-adjusted = .016). There was also a trend towards more pronounced use of pitch at pre-test by the Prosody group, although this effect did not reach statistical significance (Supplementary Model S31; β = −.341, SE = .169, pFDR-adjusted = .180).
Table 5. Summary of effects in mixed-effects regression model for pitch and duration use in phrase boundary production

Note: Model formula was as follows: SD ~ feature/testing time × training group + (1 | participant ID).

Figure 5. Pitch and duration use during phrase boundary production. Error bars represent standard error of the mean.
4. Discussion
4.1. Perceptual training can change cue weighting strategies
We proposed that a major source of difficulty in learning to perceive and produce L2 speech prosody is that individuals have trouble resisting default strategies inherited from their first language. We predicted that targeted training focused on enhancing the importance of durational information would result in more native-like cue weighting strategies in categorizing stimuli – less biased towards pitch and more towards duration. Consistent with these predictions, we observed increased reliance on durational cues in the experimental group (Prosody training) compared to the control group (Vocabulary training). These effects, however, were limited to perception of phrase boundaries (Model 1), for which duration is the primary cue. In contrast, participants in the Vocabulary group showed stronger reliance on pitch over time in categorizing lexical stress (Model 3) and used pitch more in their speech when conveying phrase boundaries (Model 11), suggesting that in the absence of targeted training, over time listeners tend to solidify the use of the strategy that works well for them. Participants in the Prosody group were able to resist this trend and keep using duration as a secondary cue when categorizing phrase boundaries (Model 1) and did not increase their use of pitch in production thereof (Model 11). These results show that listeners can track the variability across multiple acoustic cues and dynamically adjust speech categorization to reflect detected cue redistribution across tasks. This finding aligns with our predictions derived from speech perception and cue weighting theories (e.g., Francis et al., Reference Francis, Baldwin and Nusbaum2000), suggesting that listeners can shift their relative cue weights, but perhaps more training is needed to overwrite the lifetime influences of L1 on listening strategies across tasks.
It is possible that we did not observe any shifts in duration use in categorizing lexical stress and contrastive focus because duration is not a dominant cue for these contrasts. Training that includes only examples of contrasts for which duration is more relevant than other cues could be more effective. Furthermore, such short exposure to artificial distributions of pitch and duration could have been insufficient to produce more stable refinements, as reliance on pitch, the default strategy for Mandarin speakers, was already quite effective for categorizing these stimuli. Additionally, while participants were doing the training, they were also being exposed to natural speech throughout the day. For phrase boundaries, that natural speech would align with how they were being trained, but for stress and focus, the natural speech they heard would reinforce their default strategy of relying primarily on pitch. It is also possible that over time listeners’ strategies naturally become more biased towards pitch. Perhaps it is simply easier to overcompensate with a familiar strategy than to learn to use a completely new one.
Our results have significant implications for understanding the difficulties faced by second language learners, as they explain why L1-related perceptual strategies are so hard to unlearn. We show that individuals can learn new strategies quickly, provided the relative usefulness of acoustic cues is carefully manipulated to gradually introduce these cues into learners’ perceptual repertoire. This suggests that learning limitations are not tied to, for example, an overall decreased ability to learn in adulthood. Instead, perseveration of sub-optimal perceptual strategies may occur when listeners are exposed only to inputs in which their preferred cue is somewhat useful, even if relying on it is not the ideal solution. Such exposure could provide positive feedback supporting the use of a sub-optimal strategy, and so, without explicit training targeting the underlying issue, learners might continue reinforcing their native strategies, thereby hindering their L2 learning progress. The adaptive training paradigm we introduced here could potentially be adapted to target a range of specific auditory challenges experienced by learners. For example, setting only F3 to be informative could facilitate learning proper strategies for discriminating the /r/ vs /l/ contrast among Japanese speakers (Iverson et al., Reference Iverson, Kuhl, Akahane-Yamada, Diesch, Tohkura, Kettermann and Siebert2003).
One limitation of our design is that we kept the natural pauses between phrase boundaries in the phrase boundary training stimuli. While the intention was to maintain naturalistic speech patterns, these pauses may have inadvertently overshadowed the targeted durational cues. Pause length could be considered a durational cue, but it might also be processed separately (Scott, Reference Scott1982). The pause occurring before a phrase boundary is indeed perceived as a distinct cue and may be assigned different weights in perception. For instance, while pauses have limited relevance in German (Wellmann et al., Reference Wellmann, Holzgrefe, Truckenbrodt, Wartenburger and Höhle2012), Mandarin speakers tend to attribute greater weights to pauses than to durational cues, at least when processing their native language (Yang et al., Reference Yang, Shen, Li and Yang2014). Therefore, the presence of pauses could have somewhat attenuated the effects of prosody training. We recommend that future studies explore the effects of training without pauses, to isolate and emphasize the durational aspects of prosody more effectively – in fact, such a follow-up study is already underway.
Furthermore, individual differences in language proficiency, cognitive abilities and learning styles may influence the efficacy of perceptual training. Participants might have progressed through the training at different speeds and along different learning trajectories. Changes in strategies also do not occur suddenly after the training but happen gradually, while listeners are exposed to speech sampled from the acoustically manipulated stimulus space. Perceptual choices made during the training might depend not only on the current stimulus but also on the context in which the stimulus is presented or the history of participants’ own choices (Bosch et al., Reference Bosch, Fritsche, Ehinger and de Lange2020; Urai et al., Reference Urai, Braun and Donner2017, Reference Urai, De Gee, Tsetsos and Donner2019). While our current dataset is not large enough to support a detailed analysis of trial-by-trial responses for each linguistic feature, such an analysis could help us understand how participants’ behaviour changed throughout the training. We could then determine whether learners continued using their default strategies, relying on pitch even though this dimension was unreliable, or showed signs of change throughout the training. This would be an invaluable insight into the development of second language perceptual strategies. Further tailoring training protocols to accommodate these individual factors could enhance effectiveness and learner engagement.
4.2. The relationship between changes in cue weighting strategies and prosody perception and production
Despite the observed shifts in cue weighting strategies during phrase boundary categorization, these changes did not generalize to overall improvements in L2 prosody perception (Models 5–7) or production (Models 8–10). We did not find any training improvements in L2 prosody perception in the Prosody training group. Establishing representations of L2 categories takes time, so it is possible that more exposure is needed to reliably re-adjust the boundaries between these categories and support more accurate perception.
We also found that duration shifts in categorizing phrase boundaries did not translate to greater use of duration in production (Model 11). This lack of effect is consistent with previous research on segmental speech sound contrasts. While previous research points to higher proficiency leading to overall more native-like patterns of cue use in speech perception (Sakai & Moorman, Reference Sakai and Moorman2018), results for production do not follow the same pattern. For example, in studies comparing perception and production of the voicing contrast in English by Korean learners (Kim & Han, Reference Kim and Han2014; Kong & Yoon, Reference Kong and Yoon2013), L2 learners showed a clear perceptual bias towards using F0, a cue that is highly relevant in their L1 but of secondary importance for native English speakers, yet no such differences were present in production (although Kong & Yoon, Reference Kong and Yoon2013, reported differences in the use of VOT). On the other hand, we found an interaction between pitch use, time and group for phrase boundary production, indicating that the vocabulary training group, but not the prosody training group, increased their use of pitch to convey phrase boundaries over time. Thus, over time listeners may tend to solidify their speech production strategies as they are exposed to language input, but the training the prosody group received – in which pitch was designed not to be a useful cue – may have led them to resist this trend.
Several factors may contribute to the lack of an increase in the use of duration during phrase boundary production within the prosody training group. First, while the perceptual training appeared to be sufficient for participants to resist solidifying their default speech perception strategies, further training that incorporates production could potentially lead to increased use of duration as well. Integrating concurrent production training that reinforces the use of appropriate prosodic cues could facilitate transfer from the perception to the production domain. However, previous research has also suggested that while training perception might have beneficial effects on production, production during training can disrupt perceptual learning (e.g., Baese-Berk, Reference Baese-Berk2019; Baese-Berk & Samuel, Reference Baese-Berk and Samuel2016; Nagle, Reference Nagle2018).
Second, the stage of L2 acquisition at which training is administered may play a critical role in its effectiveness. For example, auditory processing training might be more useful at the initial stages of learning to facilitate the development of new phonetic categories (e.g., Saito et al., Reference Saito, Kachlicka, Suzukida, Petrova, Lee and Tierney2022). However, prosodic contrasts might be conceptually more difficult and require more nuanced knowledge of the L2 than learners have at the initial stages of L2 immersion. As a consequence, if we introduced the perceptual training with prosody features too early, participants may not have been able to benefit from it. For instance, participants who had just arrived in the United Kingdom might not have had enough exposure to native L2 speech to detect differences in prosodic contrasts or become familiar with English intonation. Perhaps such training would be better suited to learners with sufficient exposure to native L2 speech (e.g., a longer LOR). On the other hand, if we introduced the prosody training too late, we might have run the risk of trying to retrain fossilized strategies or of re-learning incorrect or incomplete representations of L2 prosody acquired through years of learning in an L1 country. An earlier study showed that naïve native English learners of Chinese exhibited rapid shifts in perceptual space during the first months of learning, but these shifts plateaued after as little as 3 months (Wiener, Reference Wiener2017). If that is the case, then any attempts to refine listening strategies should occur alongside the regular L2 English teaching curriculum. Unfortunately, without a longitudinal investigation that tracks participants’ learning trajectories from the day they start, it is impossible to discern which skills are most beneficial at which stages of L2 learning.
Third, the implicit nature of the training in our study may have limited learners’ ability to consciously apply newly acquired perceptual strategies across contexts. Since the aim of the training was somewhat concealed (i.e., we did not explicitly inform participants that they should be paying attention to duration), the exercises might have seemed arbitrary. It would be straightforward to explicitly instruct participants to pay attention to the specific acoustic dimension relevant to a given language. However, as explained earlier, such instructions might have been difficult to follow due to the differences in cue relevance across languages. To compensate for any disadvantages arising from these differences, the training would require lengthy explanations of speech acoustics and examples of how these acoustic features are represented in natural speech. Another downside of explicit instruction is that it might only work for certain individuals who have the meta-cognitive skills to intentionally take on new explicit strategies. Instead, we could include a familiarization session that presents several trials with untransformed stimuli to introduce listeners to the target range of acoustic features they will encounter. Previous studies have shown that as little as a single trial can have facilitatory effects on subsequent training (e.g., a single informative presentation was sufficient to enable learning to detect oddly oriented visual elements under difficult conditions; Ahissar, Reference Ahissar1999). Future research should explore the comparative efficacy of implicit versus explicit training modalities and consider hybrid approaches that leverage the advantages of both.
Finally, what appears to be crucial for stable learning effects is an appropriate length of training. There is limited research on L2 prosody acquisition, let alone on prosodic cue weighting. It should be a priority to establish how much training is needed to permanently shift perceptual strategies during prosody perception via targeted training, and to design long-term learning support that retains these strategies over time. Long-term learning support is important because some individuals learn an L2 in the L1 context and do not have enough exposure to L2 speech, or return to their home countries after a period of immersion, significantly limiting the L2 input that could reinforce the strategies they learned while living abroad. It is possible that after participants finish their studies and return to China, they might revert to their L1 strategies (attrition of L2 after returning to the L1 country; Mickan et al., Reference Mickan, McQueen, Brehm and Lemhofer2022). It is, therefore, of paramount importance to design paradigms not only for training but also for long-term retention of newly acquired skills.
4.3. Conclusions
Our results provide evidence that perceptual training can shift cue weighting at the suprasegmental level, extending the existing literature showing similar effects on segmental speech contrasts (Kondaurova & Francis, Reference Kondaurova and Francis2010; Zhang et al., Reference Zhang, Wu and Holt2021). Although the prosody training effects were limited to phrase boundary categorization, it is remarkable that with as little as 3 hours of training, we could change perceptual strategies that took a lifetime to develop. It is also important to emphasize that these strategies were employed during naturalistic and complex speech perception tasks, not during the categorization of isolated syllables. These findings offer a new direction for designing more targeted language training paradigms. More work is needed to establish how to individualize these paradigms and efficiently select stimulus spaces appropriate for refining the L1-specific language difficulties experienced by learners.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S1366728925100850.
Data availability statement
All materials, data and analyses discussed in the manuscript are available via a link to OSF: https://osf.io/vbzt9
Funding statement
This work was supported by a Bloomsbury Studentship awarded to Magdalena Kachlicka and an Economic and Social Research Council [ES/V007955/1] Grant awarded to Adam Tierney.
Competing interests
The authors declare none.