Previous research has demonstrated that adult L2 learners can achieve advanced levels of L2 speech proficiency, particularly when they have extensive opportunities for authentic interaction in L2-speaking environments. Studies have consistently found a positive impact of naturalistic exposure on various aspects of L2 speech, including global oral proficiency, segmental features, and suprasegmental features (Derwing & Munro, Reference Derwing and Munro2013; Saito & Brajot, Reference Saito and Brajot2013; Trofimovich & Baker, Reference Trofimovich and Baker2006). In contrast, learners in foreign language (FL) classroom settings typically receive substantially less authentic L2 input, and their attainment tends to depend on the quantity and quality of instructional exposure (Baumert et al., Reference Baumert, Fleckenstein, Leucht, Köller and Möller2020; Larson-Hall, Reference Larson-Hall2008; Muñoz, Reference Muñoz2006). Yet even among learners with similar L2 exposure and instruction, learning outcomes often vary widely. This variability is not merely of theoretical interest; it has direct implications for communication. Learners who fail to acquire accurate L2 prosodic patterns, such as English word stress, often produce speech that is less intelligible and comprehensible to listeners, which hinders successful communication (Kang et al., Reference Kang, Rubin and Pickering2010; Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012). In FL classroom contexts, pursuing intelligible and comprehensible speech is a realistic and pedagogically meaningful goal, given that achieving native-like pronunciation among those who acquire L2 after early childhood is exceedingly rare (Mora, Reference Mora2025).
From a theoretical standpoint, unexplained variability raises critical questions about the mechanisms of L2 speech learning. According to the revised Speech Learning Model (SLM-r; Flege & Bohn, Reference Flege, Bohn and Wayland2021), L2 phonological attainment is shaped jointly by experience (e.g., quantity and quality of input) and perceptual-cognitive aptitude (e.g., auditory processing). Yet, the model has not specified how much experience or aptitude contributes to individual differences in speech learning outcomes, particularly beyond segmental learning. Addressing this issue is essential for explaining why learners exposed to comparable input diverge markedly in phonological attainment. Empirical answers to this question are also valuable for pedagogy, as they help teachers identify which learners are likely to struggle with L2 prosody and how instruction might effectively support them.
To explain this variability, scholars have increasingly suggested that L2 learning experience alone cannot fully account for differences in L2 speech development. Perceptual-cognitive aptitudes, especially auditory processing (Saito et al., Reference Saito, Kachlicka, Suzukida, Petrova, Lee and Tierney2022) and working memory (Darcy et al., Reference Darcy, Park and Yang2015), have been identified as key predictors of learning success. These aptitude factors may determine how efficiently learners encode, memorize, and retrieve novel phonetic and prosodic information. At the same time, FL classroom studies indicate that the amount of L2 exposure and targeted instruction, both in and outside the classroom, can facilitate measurable gains in pronunciation and prosody, even under input-limited environments (Saito & Hanzawa, Reference Saito and Hanzawa2016).
Taken together, previous research points to a dynamic interplay between experience-related and aptitude-related factors in L2 speech development (Flege & Bohn, Reference Flege, Bohn and Wayland2021). This interplay has rarely been investigated for suprasegmental features, despite their central role in the comprehensibility and intelligibility of L2 speech (Field, Reference Field2005; Saito, Reference Saito2014). This imbalance in the literature, in which segmental acquisition has received far greater attention, limits theoretical progress and leaves educators with little guidance on how to address prosodic difficulties that persist even among advanced learners. To address this gap, the present study investigates how individual differences in L2 learning experience and aptitude jointly predict the acquisition of English word stress among Japanese EFL learners in classroom settings. Theoretically, the study seeks to compare the relative importance of experience and aptitude factors in L2 speech learning and to determine whether the mechanisms proposed by SLM-r (Flege & Bohn, Reference Flege, Bohn and Wayland2021), originally formulated for segmental acquisition, also apply to suprasegmental learning. The findings are expected to clarify why some classroom learners develop greater perceptual sensitivity to, and more fluent processing of, English word stress than others. They should thereby inform pedagogy, both in helping FL learners achieve more intelligible and comprehensible speech and in guiding teachers to design more individualized approaches that support diverse learners in FL classrooms.
Background
Roles of experience-related factors in L2 speech learning
L2 speech learning is a multifaceted process shaped by both experiential and learner-internal factors. One of the latest and most well-known L2 speech acquisition models, the Revised Speech Learning Model (SLM-r; Flege & Bohn, Reference Flege, Bohn and Wayland2021), explains how learners acquire L2 sounds in naturalistic settings. According to the model, L2 learners initially substitute L1 phonetic categories for similar L2 sounds. As learners gain more experience with the L2, they gradually differentiate L1 and L2 sounds using the same cognitive mechanisms employed during L1 acquisition. The model highlights learners’ continuous development of new sound categories, even after puberty, provided they receive sufficient input. This framework emphasizes that input quantity and quality are primary engines of L2 speech development.
Empirical studies broadly support this view but also reveal important nuances. Cross-sectional studies indicate that learners’ proficiency is strongly predicted by length of residence in an L2-speaking environment when the L2 serves as their primary language of communication (e.g., Flege & Liu, Reference Flege and Liu2001), while longitudinal studies (e.g., Mora & Valls-Ferrer, Reference Mora and Valls-Ferrer2012; Segalowitz & Freed, Reference Segalowitz and Freed2004; Sun et al., Reference Sun, Saito and Dewaele2024) suggest uneven developmental trajectories across linguistic domains. Oral fluency and lexicogrammar accuracy and complexity tend to improve rapidly during the first few months of immersion, whereas pronunciation—particularly suprasegmental control—develops more gradually and requires years of L2 exposure. This contrast suggests that input quantity alone cannot fully account for phonological attainment, especially in FL contexts (Suzukida & Saito, Reference Suzukida and Saito2023).
Emerging L2 pronunciation research, particularly studies examining the acquisition of segmental features such as vowel and consonant contrasts, has begun to address this issue by exploring factors beyond input quantity that influence acquisition. However, suprasegmental learning remains comparatively underexplored, and it is unclear whether findings from segmental studies generalize to features such as stress, rhythm, and intonation. This imbalance is critical, as suprasegmental cues play a central role in speech intelligibility and comprehensibility (Anderson-Hsieh et al., Reference Anderson-Hsieh, Johnson and Koehler1992; Field, Reference Field2005; Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012). Trofimovich and Baker’s (Reference Trofimovich and Baker2006) well-cited study, which examined suprasegmental acquisition among Korean learners of English in the U.S., exemplifies this complexity: while some suprasegmental features (e.g., stress timing) were related to experience, others (e.g., peak alignment) were not, suggesting that additional learner-internal factors may constrain improvement even under rich input conditions. These findings collectively point to the need for models that integrate both experience-related and aptitude-related factors to explain how learners develop L2 prosody, a domain that is central to L2 speech comprehensibility but still poorly understood in terms of trainability and aptitude effects.
Roles of aptitude in L2 speech learning
Crucially, SLM-r (Flege & Bohn, Reference Flege, Bohn and Wayland2021) acknowledges that experience alone cannot account for all phonological learning outcomes: individual differences in cognitive-perceptual aptitude mediate how effectively learners make use of input. Building on this theoretical insight, a growing body of research has identified several perceptual-cognitive variables that contribute to L2 speech proficiency and processing efficiency, including working memory (WM) (Darcy et al., Reference Darcy, Park and Yang2015; Kormos, Reference Kormos and Schwieter2015; Miyake & Friedman, Reference Miyake, Friedman, Healy and Bourne1998), attention (Darcy & Mora, Reference Darcy, Mora, Granena, Jackson and Yilmaz2016; Darcy et al., Reference Darcy, Mora and Daidone2014; Segalowitz, Reference Segalowitz2010), inhibitory control (Darcy et al., Reference Darcy, Mora and Daidone2016; Linck et al., Reference Linck, Kroll and Sunderman2009; Miyake & Friedman, Reference Miyake and Friedman2012), and auditory processing (Saito et al., Reference Saito, Sun and Tierney2020, Reference Saito, Suzukida, Tran and Tierney2021, Reference Saito, Kachlicka, Suzukida, Petrova, Lee and Tierney2022, Reference Saito, Kachlicka, Suzukida, Mora-Plaza, Ruan and Tierney2024). These studies collectively suggest that higher cognitive abilities generally facilitate deeper encoding and more efficient utilization of L2 input, thereby accelerating learning.
However, not all aptitude components appear equally influential across linguistic domains or learning stages. WM is closely associated with early advantages in the development of new and robust phonological representations, as greater WM capacity enables learners to retain and utilize larger amounts of information from the input (Darcy et al., Reference Darcy, Park and Yang2015). Specifically, phonological short-term memory, a sub-component of WM, has been found to benefit speaking fluency (Kormos, Reference Kormos and Schwieter2015) as well as the phonological processing of segmental, suprasegmental (word stress), and phonotactic information (Darcy et al., Reference Darcy, Park and Yang2015). This ability is critical in the early phase of phonological learning, when learners must hold new sound sequences in mind and compare them to existing categories.
Meanwhile, domain-general auditory processing ability—the ability to detect fine-grained acoustic differences—has been shown to predict later success in both segmental and suprasegmental learning (Kachlicka et al., Reference Kachlicka, Saito and Tierney2019). In a longitudinal study, Saito et al. (Reference Saito, Sun and Tierney2020) found that although L1 Chinese learners’ initial suprasegmental performance was tied to prior L2 experience, advanced auditory processing emerged as the key determinant for achieving higher word stress and segmental accuracy at the end of the eight months of immersion. Similar impacts of auditory aptitude have been consistently reported across diverse learner populations, including Spanish, Polish, and Chinese learners of English (Saito et al., Reference Saito, Kachlicka, Suzukida, Petrova, Lee and Tierney2022).
Taken together, these findings indicate that refining complex pronunciation features such as segmental contrasts and word stress requires not only sufficient experience but also strong perceptual-cognitive aptitude (see also Doughty, Reference Doughty2019; Li, Reference Li2016; Skehan, Reference Skehan, Weng, Skehan, Biedron, Li and Sparks2019). In input-limited FL contexts, these aptitude factors may exert even greater impacts on individual trajectories of L2 suprasegmental development.
L2 speech learning in foreign language classroom settings
It is important to note that most of the existing literature on L2 speech learning has concerned naturalistic settings. In such contexts, there is ample evidence that earlier initial exposure to an L2 typically facilitates better development, with frequent L2 use positively influencing L2 speech acquisition. However, very few studies have focused on foreign language (FL) classroom settings. According to Muñoz (Reference Muñoz2008), FL settings typically feature (1) limited formal instruction, usually no more than four 50-minute lessons per week; (2) exposure to the L2 primarily through the teacher, with constraints on source, quantity, and quality; (3) minimal peer interaction in the L2; and (4) scarce opportunities for L2 use outside the classroom. It is in such contexts that many adult learners learn an L2; however, much less is known about whether, to what degree, and how L2 speech learning takes place in classroom settings.
Within this “minimal input” environment (Larson-Hall, Reference Larson-Hall2008, p. 36), some researchers have attempted to examine the impacts of L2 input quantity and quality on learner development by comparing different types and lengths of L2 instruction (e.g., Baker-Smemoe et al., Reference Baker-Smemoe, Cundick, Evans, Henrichsen and Dewey2012; Dewey, Reference Dewey2008; Llanes & Muñoz, Reference Llanes and Muñoz2009; Taguchi, Reference Taguchi2008). Existing research has consistently demonstrated that extensive language exposure (Jaekel et al., Reference Jaekel, Schurig, Florian and Ritter2017; Larson-Hall, Reference Larson-Hall2008) and greater out-of-class use (Muñoz, Reference Muñoz2014; Saito & Hanzawa, Reference Saito and Hanzawa2016) facilitate development, but the effect sizes are modest and often moderated by learner characteristics (e.g., Saito & Hanzawa, Reference Saito and Hanzawa2018; Zheng et al., Reference Zheng, Saito and Tierney2022). Saito and Hanzawa (Reference Saito and Hanzawa2016), for example, found that Japanese learners made steady progress in segmental accuracy, speech fluency, and lexicogrammar with more exposure during the initial phase of FL classroom instruction. Notably, segmental development plateaued after the initial phase, while suprasegmentals (e.g., word stress and intonation) showed slower improvement over an academic year. This pattern suggests that while early gains in L2 phonological learning may be largely input-driven, sustained progress depends on more than increased exposure. Given the perceptually complex and less salient nature of suprasegmental cues, quantity of input is necessary but insufficient for mastering L2 prosody; learners’ underlying auditory and cognitive aptitudes are crucial for extracting and internalizing prosodic regularities from limited input (Skehan, Reference Skehan2015).
Subsequent research has highlighted the crucial role of learner aptitudes in FL contexts. Saito et al. (Reference Saito, Suzukida and Sun2019) found that in-class L2 experience and explicit aptitude (associative memory, phonemic coding) predicted initial gains in global comprehensibility, prosody, and fluency, while incidental aptitude (sound sequence recognition) predicted sustained improvement in pronunciation and comprehensibility. This shift is consistent with skill acquisition theory as applied to FL contexts (DeKeyser, Reference DeKeyser, Loewen and Sato2017; Suzuki, Reference Suzuki2023), in which early explicit learning transitions to more automatic, perception-driven processing through consistent practice.
These findings collectively suggest a complex developmental trajectory of L2 speech acquisition in FL settings. While increased input and practice inside and outside classrooms can enhance L2 development, particularly in the early stages (Muñoz, Reference Muñoz2014; Saito & Hanzawa, Reference Saito and Hanzawa2018), learner-internal factors such as perceptual-cognitive aptitude not only constrain ultimate attainment but also shape the trajectory of learning over time (Mora, Reference Mora, Derwing, Munro and Thomson2022, Reference Mora2025).
L2 English word stress
Among suprasegmental features, word stress is particularly important for comprehensibility and intelligibility of L2 speech (Cooper et al., Reference Cooper, Cutler and Wales2002; Kang et al., Reference Kang, Rubin and Pickering2010). Empirical studies consistently demonstrate that accurate stress placement enhances listener understanding, whereas misplaced or absent stress can disrupt communication more than segmental errors (Bond, Reference Bond1999; Hahn, Reference Hahn2004; Saito & Saito, Reference Saito and Saito2017). Nevertheless, acquiring English word stress is notoriously challenging because its realization depends on the interplay of pitch, duration, intensity, and vowel quality (Fry, Reference Fry1958), whereas many other languages encode prosodic prominence differently.
Cross-linguistic studies have demonstrated that learners tend to transfer their L1 prosodic weighting patterns when processing L2 word stress (Guion et al., Reference Guion, Harada and Clark2004; Jasmin et al., 2021; Wang, Reference Wang2008). For instance, Mandarin learners, whose L1 encodes pitch lexically to distinguish tones, tend to rely heavily on pitch cues while underusing duration and intensity when perceiving English word stress. This tendency exemplifies how L1 experience biases auditory attention toward familiar acoustic dimensions and hinders the reallocation of perceptual weightings required for successful L2 suprasegmental acquisition (Petrova et al., Reference Petrova, Jasmin, Saito and Tierney2023).
Japanese learners of English—the main focus of the study—exhibit similar difficulties due to their pitch-accent L1, where variations in pitch (F0) distinguish words, while syllables are otherwise relatively equal in duration and intensity. Therefore, Japanese learners often over-rely on pitch cues, even when other acoustic cues (duration, intensity, vowel quality) are equally important in L2 prosody (Konishi & Kondo, Reference Konishi and Kondo2015). Furthermore, whereas English exhibits vowel reduction and rhythmic alternation between stressed and unstressed syllables, Japanese follows a mora-timed rhythm characterized by uniform mora duration and full vowel articulation (Beckman, Reference Beckman1982). Consequently, Japanese learners often fail to capture syllable contrasts and produce all syllables with relatively equal emphasis and duration, resulting in rhythmically flat speech (Tsujimura, Reference Tsujimura, Librik and Beeler1996; Saito & Saito, Reference Saito and Saito2017).
These strong L1 influences highlight not only language-specific transfer but also the potential contribution of domain-general auditory aptitude in mitigating such effects. Sensitivity to multiple acoustic dimensions may enable learners to redirect attention from familiar to novel prosodic cues. Such aptitude could thus moderate the extent to which L1 experience constrains L2 word stress acquisition. However, research on the role of domain-general auditory aptitude in L2 suprasegmental learning remains limited, particularly in input-constrained FL contexts.
Current study
Building on the preceding discussion, the present study examined how individual differences in experience and aptitude jointly shape Japanese EFL learners’ perception of English word stress. The present study asked the following research question: To what extent do individual differences in aptitude and L2 learning experience predict the ability to perceive English word stress among Japanese EFL learners?
Research to date has demonstrated that cumulative L2 experience and perceptual-cognitive aptitude jointly contribute to L2 speech learning (Flege & Bohn, Reference Flege, Bohn and Wayland2021). However, their relative roles in suprasegmental acquisition remain uncertain, particularly in input-limited FL contexts. The field still lacks integrative evidence showing whether increasing L2 experience can compensate for low aptitude or whether aptitude overrides experience. Addressing this issue is essential for both theoretical progress and pedagogical practice. According to SLM-r (Flege & Bohn, Reference Flege, Bohn and Wayland2021), perceptual-cognitive aptitude, along with input, plays a critical role in determining the rate of success in L2 speech learning. Yet it remains uncertain how much experience or aptitude predicts suprasegmental development. Clarifying this would deepen our understanding of the mechanisms underlying adult L2 speech learning.
Pedagogically, determining what facilitates or constrains suprasegmental acquisition would help educators identify learners who may struggle to develop word stress knowledge in input-limited environments. Such understanding would enable educators to provide the learners who need it with explicit instruction or focused training aimed at intelligible and comprehensible L2 speech.
This study operationalizes experience through the timing and length of FL education as well as the presence of extracurricular training (pronunciation and music). Aptitude is indexed through measures of auditory processing (perceptual level) and working memory (cognitive level). By considering both domains simultaneously, the study aims to determine whether they make unique or overlapping contributions to word stress perception.
One unique approach of the current investigation lies in how L2 speech perception performance was conceptualized and measured. While participants’ perceptual ability was measured via an often-used task format (i.e., forced-choice identification), their performance was evaluated in terms of both accuracy and fluency. To date, most studies have focused on L2 learners’ accuracy ratio during the task. However, as previous work has noted (Segalowitz, Reference Segalowitz2010), it is important to consider not only how fast but also how consistently learners can process L2 knowledge at fast speeds.
To capture this fluency dimension, the present study analyzed RT and its intraindividual variability, indexed by the coefficient of variation (CV), which has been widely employed in L2 studies (e.g., Akamatsu, Reference Akamatsu2008; Hui, Reference Hui2020; Hulstijn et al., Reference Hulstijn, Van Gelderen and Schoonen2009; Lim & Godfroid, Reference Lim and Godfroid2015; Rodgers, Reference Rodgers2011; Saito et al., Reference Saito, Uchihara, Takizawa and Suzukida2025; Suzuki, Reference Suzuki2018). CV is calculated by dividing the standard deviation of an individual’s RTs by their mean RT, providing insight into processing stability. Smaller CV values indicate lower variability in RTs, which reflects greater processing stability (Hui & Godfroid, Reference Hui and Godfroid2021). Thus, RT and CV together offer a comprehensive view of both the speed and stability of L2 processing (Segalowitz, Reference Segalowitz2010; Segalowitz et al., Reference Segalowitz, Segalowitz and Wood1998; Segalowitz & Segalowitz, Reference Segalowitz and Segalowitz1993).
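The CV computation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `rt_fluency` and the example RT values are hypothetical, the sample standard deviation (n − 1) is assumed, and the text does not specify whether RTs from incorrect trials or outliers were removed beforehand.

```python
from statistics import mean, stdev


def rt_fluency(rts_ms):
    """Return (mean RT, CV) for one participant's reaction times in ms.

    CV = SD of the participant's RTs / mean of the participant's RTs.
    Assumption: sample SD (n - 1); the source does not say which SD was used.
    """
    if len(rts_ms) < 2:
        raise ValueError("need at least two RTs to compute a standard deviation")
    m = mean(rts_ms)
    return m, stdev(rts_ms) / m


# Made-up RTs for a single hypothetical participant.
example_rts = [820, 760, 905, 840, 790]
mean_rt, cv = rt_fluency(example_rts)
print(f"mean RT = {mean_rt:.1f} ms, CV = {cv:.3f}")
```

Because CV is a ratio, it is unit-free: a participant who is uniformly slower but equally stable in relative terms would show a higher mean RT with an unchanged CV.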
It was hypothesized that L2 learning experience and perceptual-cognitive abilities (auditory processing and working memory) would contribute equally to the accuracy of Japanese learners’ English word stress perception (correct ratio). Regarding fluency (RT, CV), working memory was expected to demonstrate significant predictive power: automatization requires not only accurate but also prompt and consistent processing of stress patterns, and working memory allows learners to temporarily store and manipulate phonological rules in real time.
Method
Participants
A total of 119 Japanese learners of English from Japanese universities participated in the current study (66 females, 53 males; M age = 22 years; age range = 18–38 years). Their English proficiency levels ranged from CEFR B1 to B2 (Lemhöfer & Broersma, Reference Lemhöfer and Broersma2012), as indicated by their LexTALE scores (M = 58.8%, SD = 8.6; Range = 57.2–60.3%). All participants reported normal hearing and no history of auditory, neurological, or language disorders. To determine a sufficient sample size, a power analysis was performed using G*Power. It indicated that a total sample of N = 118 participants would yield power above 0.8 to detect a medium effect size (f² = 0.15, α = 0.05) for a model including 10 predictors.
To recruit a large number of participants, an electronic flyer was distributed to Japanese universities and posted on various community websites that provide information on participation in psychological experiments in Japan. The flyer explicitly stated the participation requirements: (1) native Japanese speakers studying English as a second language, (2) university students in Japan, and (3) no prior experience of studying abroad or traveling in English-speaking countries for more than one month. These criteria were set to ensure that participants shared a similar linguistic and educational background and had learned the L2 only in FL contexts.
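For reference, the quantities behind such a calculation can be written out as below. This is a sketch that assumes the analysis used the standard fixed-model multiple-regression F test (R² deviation from zero), which the text does not state explicitly; p denotes the number of predictors, N the sample size, and F_nc the cumulative distribution function of the noncentral F distribution.

```latex
\lambda = f^{2} N, \qquad
\mathrm{df}_1 = p, \qquad
\mathrm{df}_2 = N - p - 1, \qquad
\text{power} = 1 - F_{\mathrm{nc}}\!\left(F_{\mathrm{crit}};\ \mathrm{df}_1,\ \mathrm{df}_2,\ \lambda\right),
\quad
F_{\mathrm{crit}} = F^{-1}_{\mathrm{df}_1,\,\mathrm{df}_2}(1 - \alpha)
```

Under these assumptions, the reported values (f² = 0.15, N = 118, p = 10) give λ = 17.7 with degrees of freedom (10, 107).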
Data collection
For the convenience of both the researcher, who was based in the UK, and the students, who lived in different cities in Japan, all data collection (word stress perception and production tests, aptitude tests, a series of perceptual training and questionnaires) was administered via the online data collection platform GORILLA (Anwyl-Irvine et al., Reference Anwyl-Irvine, Massonnié, Flitton, Kirkham and Evershed2020). Participants were asked to use a computer or laptop with stable internet access in a private room and to wear earphones or headphones during the experiment.
Following the procedure of Saito et al. (Reference Saito, Kachlicka, Suzukida, Mora-Plaza, Ruan and Tierney2024), which implemented similar online L2 and auditory processing measures, all participants first completed the screening tasks to confirm that the tasks functioned properly on GORILLA on their computers and that there were no technical problems with their setup before proceeding to the main tasks. Of the 194 individuals who completed the screening, 119 (approximately 61%) met the eligibility criteria and completed all main tasks. Each participant received a personalized ID and GORILLA link and was instructed to complete the experiment in a single sitting. Online progress was monitored remotely using GORILLA’s supervisor tracking function to ensure participant engagement and data integrity throughout the experiment.
To ensure full understanding of the procedures, participants in the main data collection were asked to read an information sheet that explained the detailed task instructions in Japanese. They were given opportunities to ask any questions regarding the experimental procedure. After signing the consent form to participate in the main experiment, they took an auditory processing test with six subtests (one for each dimension), a perception test of word stress, and a background questionnaire.
Screening
Screening tasks were implemented to ensure that participants who applied to take part in the study had access to the required tools (i.e., a PC, headphones, and Google Chrome) and could conduct the experiment in an appropriate environment. Applicants who could not successfully complete the screening were excluded from the study. Given that this study was conducted entirely online and self-paced, the screening process was essential for ensuring participant motivation and an appropriate test-taking environment. Although the original tasks were designed in English, additional instructions were provided in Japanese to minimize construct-irrelevant variance.
The screening involved two tasks: LexTALE (Lemhöfer & Broersma, Reference Lemhöfer and Broersma2012) and the forward and backward digit span working memory tests. LexTALE assessed participants’ vocabulary knowledge, and participants scoring below 70% (low and intermediate proficiency learners) were invited to join the main experiment. This threshold was set to prevent potential ceiling effects, as the target words were likely too familiar to advanced learners, who may have already acquired the relevant aspects of word knowledge.
The forward and backward digit span tests served two purposes. First, they acted as a screening tool to ensure that participants had the cognitive ability (working memory) necessary to complete this type of online experiment reliably. Following Saito et al. (Reference Saito, Kachlicka, Suzukida, Mora-Plaza, Ruan and Tierney2024), participants who failed to recall three- or four-digit sequences would have been excluded to minimize the risk of poor data quality due to cognitive overload or lack of engagement. However, in the current study, no participants fell below this threshold, indicating that all participants possessed adequate concentration and memory resources to complete the online tasks. Second, these tests served as the measure of working memory in later analyses examining how individual differences in working memory contribute to successful word stress acquisition. By ensuring that participants met the necessary linguistic, cognitive, and motivational criteria, this procedure aimed to maximize the validity and reliability of the data collected.
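The two screening criteria can be sketched as a simple eligibility check. Several details here are assumptions rather than the study's procedure: the LexTALE score is computed with the averaged-percent-correct formula from the published LexTALE instructions (40 words, 20 nonwords), the digit-span cut-off `min_span=4` is one reading of "three- or four-digit sequences", and all function and parameter names are hypothetical.

```python
def lextale_score(words_correct, nonwords_correct, n_words=40, n_nonwords=20):
    """Averaged percent correct over words and nonwords (assumed LexTALE scoring)."""
    pct_words = words_correct / n_words * 100
    pct_nonwords = nonwords_correct / n_nonwords * 100
    return (pct_words + pct_nonwords) / 2


def is_eligible(lextale_pct, forward_span, backward_span,
                lextale_cutoff=70.0, min_span=4):
    """Apply both screening criteria described in the text.

    lextale_cutoff: scores below 70% were invited (stated in the text).
    min_span: hypothetical minimum digit-span length; the exact rule is an assumption.
    """
    below_ceiling = lextale_pct < lextale_cutoff
    adequate_memory = min(forward_span, backward_span) >= min_span
    return below_ceiling and adequate_memory


# Hypothetical applicant: 30/40 words and 10/20 nonwords correct, spans of 6 and 4.
score = lextale_score(30, 10)
print(score, is_eligible(score, forward_span=6, backward_span=4))
```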
English word stress measures
Target word items
The perception and production test included 62 target words chosen from the first and second 1000-word families on the BNC/COCA lists (Nation, Reference Nation2017), ensuring that participants were likely familiar with the meanings of the target words. High-frequency words were prioritized over low-frequency words or non-words for two main reasons: (1) their practical relevance to real-life and FL classroom settings and (2) their potential impact on L2 pronunciation accuracy. Flege et al. (Reference Flege, Frieda, Walley and Randazza1998) have shown that more familiar words are often produced more accurately than unfamiliar ones. By focusing on high-frequency words, this study can minimize the influence of lexical factors and allow learners to attend to pronunciation details without being distracted by unfamiliar vocabulary meanings, which could otherwise significantly hinder their performance. Given that Japanese learners often omit or misplace word stress due to the absence of stress patterns in their native language, assessing their knowledge of high-frequency words provides a clear picture of their current proficiency and helps practitioners plan targeted instruction.
To minimize potential confusion, noun-verb pairs that are spelled identically but differ in stress (e.g., REcord as a noun versus reCORD as a verb; permit, contest, object, address, produce, contract) were excluded from the target items. Including noun-verb pairs might shift learners’ focus to lexical or grammatical properties rather than stress patterns, as these pairs are often taught in the context of grammatical function. This shift would detract from the primary goal of the task, which is to assess word stress knowledge. Additionally, compound nouns (e.g., toothbrush, airport) were excluded to avoid assessing familiarity with compounding rules instead of general word stress proficiency. By excluding these word types, the task maintains a clearer focus on learners’ ability to recognize word stress intuitively, free from interference caused by lexical or grammatical knowledge.
Perception test
The perception test was designed to assess participants’ ability to identify stressed syllables in target words. For materials, see Supplementary Information S1. While there was no strict time limit, participants were encouraged to answer as quickly as possible. The test began with two practice questions to familiarize participants with the format. Feedback was given only in practice trials. In the test, participants first heard a native speaker pronounce a word. The word was then displayed on the screen with numbered syllables (1 and 2 for two-syllable words, or 1, 2, and 3 for three-syllable words). Number 1 corresponded to the first syllable, number 2 to the second syllable, and so on. Participants listened carefully, identified the stressed syllable, and clicked the corresponding number on the screen. The task design was adopted from Wang et al.’s (Reference Wang, Spence, Jongman and Sereno1999) study, where participants heard isolated Mandarin words and identified the tone by clicking on numbered buttons. By following this approach, the test ensured a straightforward and intuitive way to evaluate participants’ ability to perceive word stress patterns. Each trial began with a 500 ms fixation cross, followed by auditory presentation of the stimulus (mean duration = 850 ms). Participants then responded without a time limit. The inter-trial interval was 500 ms. The 62-item task lasted approximately 5–6 minutes in total. The results of the perception test were assessed in terms of placement accuracy of word stress. Thus, a binary coding scheme was used; correct answers were awarded one point, and incorrect answers were awarded zero points.
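The binary coding scheme amounts to comparing each clicked syllable number with the stressed syllable of the target word. A minimal sketch follows; the function name, the four-item answer key, and the responses are hypothetical, and reporting accuracy as a proportion of correct items is an assumption about how the "correct ratio" was derived.

```python
def score_responses(responses, answer_key):
    """Score each trial 1 (correct syllable chosen) or 0, and return the proportion correct."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer key must have the same length")
    item_scores = [1 if chosen == target else 0
                   for chosen, target in zip(responses, answer_key)]
    return item_scores, sum(item_scores) / len(item_scores)


# Hypothetical four-item example: the stressed syllable for each word vs. the clicked number.
answer_key = [1, 2, 1, 3]
responses = [1, 2, 2, 3]
scores, accuracy = score_responses(responses, answer_key)
print(scores, accuracy)
```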
Auditory processing measures
To measure the participants’ thresholds for discriminating sounds varying in pitch, formant frequency, amplitude rise time, and duration, as well as their ability to detect and reproduce temporal patterns (sound sequences) in non-verbal auditory stimuli, the procedure and materials were adopted from Kachlicka et al. (Reference Kachlicka, Saito and Tierney2019) and Saito et al. (Reference Saito, Sun and Tierney2020). These auditory processing batteries are freely accessible to teachers and researchers on the SLA speech tools website: http://sla-speech-tools.com/research (Mora-Plaza et al., Reference Mora-Plaza, Saito, Suzukida, Dewaele and Tierney2022). In the current study, participants’ abilities across six different dimensions of auditory processing were assessed in the following order: pitch, formant, amplitude rise time, duration, rhythm, and melody. Assessing these auditory dimensions was crucial for understanding participants’ ability to perceive and process the acoustic features relevant to English word stress.
Perceptual acuity
Participants’ spectral (pitch, formant) and temporal (rise time, duration) acuity were evaluated via psychoacoustic AXB discrimination tests. These tests explicitly required participants to distinguish sounds differing in spectral and temporal cues. Participants were presented with three tones, with an interstimulus interval of 0.5 seconds, and were asked to identify whether the first or third tone differed from the second by selecting either “1” or “3” on the screen. Following the method established by precursor research (Kachlicka et al., Reference Kachlicka, Saito and Tierney2019; Saito et al., Reference Saito, Sun and Tierney2020), thresholds for each discrimination task were calculated using MATLAB scripts. The tasks employed an adaptive threshold procedure (Levitt, Reference Levitt1971), wherein the difficulty of the trials dynamically adjusted based on participants’ responses. Tests began at Level 50. Incorrect responses led to a reduction in difficulty by 10 steps, increasing the difference between stimuli. Conversely, three consecutive correct responses increased difficulty by 10 steps, reducing the difference. Step sizes were adjusted after reversals (points where difficulty increased following a decrease or vice versa), starting with a 10-step change and reducing to five steps on the second reversal, then two steps, and finally one step. The test concluded after 70 trials or eight reversals. The auditory processing score was calculated by averaging the threshold values from the second reversal onward, with lower scores indicating higher sensitivity.
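The adaptive procedure described above can be sketched in code. The following is a minimal Python illustration, not the MATLAB scripts used in the study. It assumes a level scale from 1 to 100 on which a higher level means a larger stimulus difference (an easier trial), and it reads the step schedule as 10 steps up to the first reversal, 5 from the second, 2 from the third, and 1 thereafter. The function names and the `respond` callback are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Tuple


def step_size(n_reversals: int) -> int:
    """Step schedule (assumed reading of the text): 10, then 5, 2, 1."""
    if n_reversals < 2:
        return 10
    if n_reversals == 2:
        return 5
    if n_reversals == 3:
        return 2
    return 1


def run_staircase(
    respond: Callable[[int], bool],
    start: int = 50,
    min_level: int = 1,      # assumed lower bound of the level scale
    max_level: int = 100,    # assumed upper bound of the level scale
    max_trials: int = 70,
    max_reversals: int = 8,
) -> Tuple[float, List[int]]:
    """Run a three-down, one-up track; return (threshold, reversal levels)."""
    level = start
    correct_streak = 0
    last_direction = 0  # +1 = made easier, -1 = made harder, 0 = no move yet
    reversal_levels: List[int] = []

    for _ in range(max_trials):
        direction = 0
        if respond(level):
            correct_streak += 1
            if correct_streak == 3:      # three correct in a row: harder
                correct_streak = 0
                direction = -1
        else:                            # any error: easier
            correct_streak = 0
            direction = +1

        if direction != 0:
            if last_direction != 0 and direction != last_direction:
                reversal_levels.append(level)   # track changed direction
            last_direction = direction
            step = step_size(len(reversal_levels))
            level = min(max_level, max(min_level, level + direction * step))

        if len(reversal_levels) >= max_reversals:
            break

    # Threshold: mean level at reversals from the second reversal onward.
    if len(reversal_levels) >= 2:
        return mean(reversal_levels[1:]), reversal_levels
    return float(level), reversal_levels
```

With a simulated listener who answers correctly whenever the level is at least 30, the track settles close to that boundary, which is the behaviour expected of a threshold estimate in which lower scores indicate higher sensitivity.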
Audiomotor integration
The rhythm reproduction task was designed to assess the participant’s temporal integration ability, while the melody reproduction task was designed to assess their spectral integration ability. The rhythm reproduction task tapped into participants’ ability to accurately recall and reproduce rhythmic sequences, which involve broader levels of temporal information. Participants listened to each rhythmic pattern three times and were then instructed to replicate it three times by pressing the space bar, simulating drumming. The participants’ sequences of hits and rests were then compared to those in the stimulus. The ratio of correct hits and rests determined the rhythm integration scores, reflecting the participants’ accuracy in reproducing the given rhythmic patterns.
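The scoring rule for the rhythm task can be illustrated with a short sketch. It assumes that both the stimulus and the reproduction have already been quantized into the same number of equal time slots, each coded as a hit (1) or a rest (0); the quantization step itself is not described in the text, and the function name is hypothetical.

```python
from typing import Sequence


def rhythm_score(stimulus: Sequence[int], response: Sequence[int]) -> float:
    """Proportion of time slots in which the response matches the stimulus.

    Both sequences use 1 for a hit and 0 for a rest, and are assumed to
    have been aligned to the same number of slots beforehand.
    """
    if len(stimulus) != len(response) or not stimulus:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for s, r in zip(stimulus, response) if s == r)
    return matches / len(stimulus)
```

For example, a reproduction that matches six of eight slots would receive a score of 0.75 under this scheme.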
The melody reproduction task was designed to assess participants’ ability to recall and reproduce sequences of complex tones with varying pitches (Saito et al., Reference Saito, Suzukida, Tran and Tierney2021). A total of ten melodies were used in the task, with each melody composed of seven notes. The fundamental frequencies of these notes corresponded to the first five notes of the major scale: 220 Hz, 246.9 Hz, 277.2 Hz, 311.1 Hz, and 329.6 Hz. The amplitude of each harmonic was equalized. Each melody began on the third note of the scale (277.2 Hz). Participants first listened to a melody played three times. They were then asked to reproduce it using piano boxes displayed on their screen. These piano boxes were vertically aligned and labeled 1 to 5. Each box corresponded to a specific note frequency: box 1 to 220 Hz, box 2 to 246.9 Hz, box 3 to 277.2 Hz, box 4 to 311.1 Hz, and box 5 to 329.6 Hz. Clicking on a piano box played the associated tone, allowing participants to reproduce the melody they had heard.
Working memory measures
As working memory has been demonstrated to play a critical role in L2 acquisition, this study included it as one of the independent variables to investigate its potential impact on the acquisition of word stress.
Based on Baddeley’s (Reference Baddeley, Baddeley, Eysenck and Anderson2020) framework, two key components of working memory were assessed: the phonological loop (a phonological short-term storage) and the central executive (a supervisory system that manages attention, controls information processing, and coordinates cognitive activities). To evaluate these components, forward and backward digit span tasks were employed, following the procedure outlined by Olsthoorn et al. (Reference Olsthoorn, Andringa and Hulstijn2014) and Saito et al. (Reference Saito, Kachlicka, Suzukida, Mora-Plaza, Ruan and Tierney2024). The forward digit span task assessed the phonological loop by requiring participants to recall digits in the order they were presented. The backward digit span task tested the central executive by asking participants to recall digits in reverse order. Digit sequences increased in length from 3 to 8 digits, with four trials at each level. Each digit was displayed for 500 milliseconds, and participants entered their responses using a keyboard. Consistent with Saito et al. (Reference Saito, Kachlicka, Suzukida, Mora-Plaza, Ruan and Tierney2024), working memory for each task was scored based on the longest digit sequence correctly recalled. This method was chosen because it provides a stable index of individual span capacity and minimizes the influence of partial guessing, which can inflate total-correct or partial-credit scores in online experimental settings. The task materials were downloaded from Gorilla Open Materials (Anwyl-Irvine et al., 2020; https://app.gorilla.sc/openmaterials/497080).
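The span-based scoring rule can be expressed compactly. The sketch below assumes that each trial is stored as a pair of digit strings (the presented sequence and the typed response); this data layout and the function name are illustrative rather than taken from the task materials.

```python
from typing import Iterable, Tuple


def span_score(trials: Iterable[Tuple[str, str]], backward: bool = False) -> int:
    """Length of the longest digit sequence recalled without error.

    trials: (presented, response) digit strings for one participant.
    backward: if True, the correct answer is the presented sequence reversed.
    """
    longest = 0
    for presented, response in trials:
        target = presented[::-1] if backward else presented
        if response == target:
            longest = max(longest, len(presented))
    return longest
```

Because only fully correct sequences count, a response with two transposed digits contributes nothing to the score, which is the property that limits the influence of partial guessing.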
Background questionnaire
Individual differences related to participants’ learning experience, current use of English, musical training, and L2 listening and speaking self-confidence were collected using a background questionnaire on which participants self-reported their information. For materials, see Supplementary Information S2.
The questions included (a) the onset of L2 learning; (b) the total years of L2 instruction; (c) the total hours of English learning and use in the classroom per week; (d) the total hours of English learning and use outside the classroom per week; (e) the total minutes of self-study in English per week; (f) the total minutes of English use for entertainment (watching TV or videos in English); (g) the total minutes of language learning app use; (h) past experience of listening training; (i) past experience of speaking training; (j) past experience of any instruction on word stress; and (k) experience of musical training (if yes, how many years).
For analysis, onset and length of L2 instruction were treated as chronological indicators of cumulative experience, while in-class L2 use, outside-class L2 use, self-study, watching media in L2, and learning app use represented weekly exposure measures. Listening training, speaking training, word stress instruction experience, and musical training experience were coded as binary variables (0 = no, 1 = yes). In the case of musical training experience, participants with more than 6 years of formal instruction were coded as 1, and those with less were coded as 0, following prior definitions of a “musician” (e.g., Wayland et al., Reference Wayland, Herrera and Kaan2010).
To enable comparisons across different scales in subsequent analyses, all continuous variables were standardized to z-scores. Additionally, five experience variables (in-class L2 use, outside-class L2 use, self-study, watching media in L2, and learning app use) were transformed using log(x + 1) to account for skewed distributions. These operationalizations ensured that all experience-related factors were placed on comparable scales and could be modeled effectively as predictors of Japanese EFL learners’ word stress perception.
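The two transformations can be illustrated as follows. The sketch assumes that the log(x + 1) transform is applied before standardization and that the sample standard deviation is used; the text does not specify either detail, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def log1p_then_z(values: Sequence[float]) -> List[float]:
    """Apply log(x + 1) to reduce positive skew, then standardize."""
    return zscore([math.log1p(v) for v in values])
```

The + 1 keeps participants who reported zero minutes of an activity in the data, since log(0 + 1) = 0, while the transform compresses the long right tail of the weekly exposure measures.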
Results
Word stress proficiency profiles
The first objective of the statistical analyses was to identify broad patterns of participants’ word stress proficiency through descriptive analysis. Participants’ performance on the perception test was captured using three complementary measures that reflect different aspects of word stress processing: accuracy, reaction time (RT), and the coefficient of variation (CV).
Accuracy was measured at the trial level as a binary variable (0 = incorrect, 1 = correct), and these trial-level data were subsequently used for analyses. RT served as an index of processing speed and was analyzed at the trial level for all responses, following the methodological recommendations of Hui and Jia (Reference Hui and Jia2024). To ensure reliable assessment of response latency, outliers, defined as RTs shorter than 200 ms or more than 3 standard deviations from a participant’s mean, were excluded prior to analysis. RTs were then log-transformed to address positive skewness. CV was used as an index of response stability, capturing response consistency across trials. Following Hui and Godfroid (Reference Hui and Godfroid2021), CV was calculated at the participant level as the ratio of the standard deviation to the mean of log-transformed RTs from correct responses.
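The RT cleaning and CV computation can be sketched for a single participant. The example assumes that the 200 ms floor is applied first and that the mean and standard deviation used for the 3 SD criterion are computed from the remaining trials; this ordering, the tuple layout, and the function names are assumptions made for illustration.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple

Trial = Tuple[float, bool]  # (RT in ms, response correct?)


def trim_rts(trials: Sequence[Trial], floor_ms: float = 200.0,
             sd_cutoff: float = 3.0) -> List[Trial]:
    """Drop RTs below the floor, then RTs beyond sd_cutoff SDs of the mean."""
    kept = [(rt, ok) for rt, ok in trials if rt >= floor_ms]
    rts = [rt for rt, _ in kept]
    m, s = mean(rts), stdev(rts)
    return [(rt, ok) for rt, ok in kept if abs(rt - m) <= sd_cutoff * s]


def cv_log_rt(trials: Sequence[Trial]) -> float:
    """CV = SD / mean of log-transformed RTs, correct responses only."""
    logs = [math.log(rt) for rt, ok in trials if ok]
    return stdev(logs) / mean(logs)
```

A smaller value returned by `cv_log_rt` corresponds to more stable responding across trials.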
According to the Kolmogorov–Smirnov test of normality, log-transformed RTs and CVs did not deviate significantly from a normal distribution (D = .09 and .11, p = .357 and .104, respectively), whereas perception test scores (accuracy) did (D = .17, p = .002).
As shown in Table 1, the descriptive analysis revealed relatively high mean accuracy in the perception test (M = 83.4%, SD = 12.3), indicating that participants generally had a good understanding of word stress patterns for words from the 1st and 2nd 1000-word families.
Descriptive statistics of perception score (accuracy), reaction time (speed), and coefficient of variation (stability)

Note: *smaller CV indicates more stable response.
Because accuracy scores were not normally distributed, correlations among accuracy, speed (logRT), and stability (CV) were examined by submitting participant-level data for the three measures to Spearman correlation analyses with 95% bootstrap confidence intervals.
Accuracy was weakly negatively correlated with both logRT (ρ = −0.21, p = .019, 95% CI [−0.39, −0.03]) and CV (ρ = −0.22, p = .017, 95% CI [−0.39, −0.04]), indicating that participants with higher accuracy tended to respond slightly faster and with more consistent speeds. RT and CV were moderately positively correlated (ρ = 0.55, p < .001, 95% CI [0.40, 0.68]), suggesting that slower responses were associated with greater variability in RTs. These patterns suggest that fluency and stability in word stress perception varied across participants in this input-limited FL context.
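A rank correlation with a bootstrap confidence interval can be computed as sketched below. The example uses a percentile bootstrap over participants with 2,000 resamples and a fixed seed; the specific bootstrap variant and number of resamples used in the study are not stated, so these choices, like the function names, are assumptions.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_ci(x: Sequence[float], y: Sequence[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for rho, resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:  # skip constant resamples
            stats.append(spearman(xs, ys))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper
```

Resampling whole participants keeps each person's accuracy, logRT, and CV values paired, so the interval reflects between-participant sampling variability.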
Predictor variables
To explore the independent variables, including auditory processing, working memory, and experience-related factors, as potential predictors of English word stress acquisition, this study collected relevant data employing auditory processing measures, forward and backward digit span tests, and a background questionnaire.
Auditory processing profiles
The descriptive results of raw scores of auditory processing measures and working memory scores are summarized in Table 2. Participants showed varying levels of auditory processing abilities. To ensure comparability across different auditory processing measures, all scores were standardized and converted to z-scores. Standardized pitch, formant, rise time, and duration discrimination scores were then averaged to create an index of auditory acuity. Similarly, standardized rhythm and melody reproduction scores were averaged to index audiomotor integration. To align the directionality of auditory measures, acuity scores were inverted for subsequent analyses, so that higher values indicate better performance on both indices.
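The construction of the two composite indices can be illustrated with a short sketch. It assumes that each measure is stored as a list with one value per participant, that the sample standard deviation is used for standardization, and that inversion means multiplying the averaged acuity z-score by −1; the dictionary keys and function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite(measures: Dict[str, Sequence[float]]) -> List[float]:
    """Average the z-scores of several measures, participant by participant."""
    standardized = [zscore(v) for v in measures.values()]
    return [mean(per_participant) for per_participant in zip(*standardized)]


def acuity_index(thresholds: Dict[str, Sequence[float]]) -> List[float]:
    """Averaged discrimination thresholds, sign-flipped so higher = better."""
    return [-c for c in composite(thresholds)]


def integration_index(reproduction: Dict[str, Sequence[float]]) -> List[float]:
    """Averaged rhythm and melody reproduction scores (higher = better)."""
    return composite(reproduction)
```

After the sign flip, a participant with low discrimination thresholds receives a high acuity value, so both indices point in the same direction when entered as predictors.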
Descriptive statistics of auditory processing and working memory profiles

Notes: (a) Smaller acuity scores indicate greater sensitivity to a specific acoustic dimension; (b) larger integration scores reflect superior processing of specific acoustic dimensions.
Experience-related variables
The descriptive results of 11 variables related to past and current language experience, as well as musical training experience, are summarized in Table 3. To enable comparisons across different scales in subsequent analyses, all data within these variables were standardized to z-scores. Additionally, inspection of variable distribution revealed that several individual difference measures (in-class L2 use, outside-class L2 use, self-study, watching media in L2, and learning app use) exhibited substantial positive skewness. Consequently, these five variables were transformed using log(x + 1) prior to further analyses. Other continuous variables (AOA, length of L2 instruction) displayed only mild deviations from normality (|skew| < 2) and were thus retained in their standardized form (z-scores) for subsequent analyses.
Descriptive statistics of questionnaire items

Aptitude, L2 learning experience, and word stress proficiency link
The next objective of the statistical analysis was to investigate the relative weights of individual differences in aptitude (auditory processing and working memory) and L2 learning experience in L2 word stress proficiency via factor and multiple regression analysis.
Factor and regression analyses
As our preliminary analyses revealed moderate correlations among several experience-related variables (e.g., age of onset and length of L2 learning; see Supplementary Information S3), an exploratory factor analysis (EFA) was conducted to combine related predictors and address potential multicollinearity. 10 questionnaire items related to L2 learning experience were initially considered for EFA to uncover common factors. Prior to the analysis, Bartlett’s test of sphericity and the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy were examined to assess suitability for factor analysis. The initial overall KMO was 0.50, with particularly low individual KMO values for time using L2 in the classroom (0.33) and outside the classroom (0.42), suggesting that these variables shared little variance with the others. After removing these two variables, the overall KMO improved to 0.57. Individual MSAs ranged from 0.48 (AOA, length of L2 learning) to 0.70 (word stress instruction). Bartlett’s test of sphericity was significant (χ2(28) = 5976, p < .001), suggesting correlations between variables were sufficient for factor analysis.
Subsequently, questionnaire items measured using different scales were standardized. Finally, 8 experience-related variables were submitted to factor analysis using unweighted least squares (ULS) extraction with Varimax rotation. Based on parallel analysis, four factors were retained for interpretation. The four-factor solution explained 52% of the total variance. Factor loadings were generally stronger for length of L2 learning (0.98), listening instruction (0.89), and word stress instruction (0.97), indicating that these variables were well represented by the factors (see Table 4). The identified four factors are summarized in Table 5.
Summary of a four-factor solution based on a factor analysis of experience-related variables

Note: All loadings >.4 were highlighted in bold.
Summary of four-factor solution based on FA of experience-related variables

Note: All eigenvalues >1. L2, second language (i.e., English).
The four-factor solution reflects meaningful clusters of variables. Factor 1 primarily represents the length of English instruction and the onset age of L2 learning. Factor 2 captures past experience of listening and speaking instruction. Factor 3 is dominated by the experience of receiving targeted instruction on word stress. Factor 4 represents autonomous L2 exposure through L2 learning apps or watching media in L2. These four factors, together with in-class L2 use, outside-class L2 use, auditory processing (acuity and integration), working memory, and musical training experience, were subsequently included as predictor variables to evaluate their contribution to participants’ word stress perception.
Prior to running the analyses, several assumptions were assessed: (1) To address multicollinearity, the initial set of 14 variables was reduced to 10 predictors by combining highly correlated experience-related variables through EFA. (2) The normality of residuals was evaluated using both the Shapiro-Wilk test and a Q-Q plot. Although the Shapiro-Wilk test indicated significant deviation from normality (W = 0.84, p < .001), the Q-Q plot revealed that the residuals largely aligned with the reference line, suggesting approximate normality. Given the large sample size and the minor nature of the deviations, the residuals were deemed sufficiently normal for the analysis. (3) The Durbin-Watson statistic (2.0) indicated no significant autocorrelation in the residuals. (4) Homoscedasticity was confirmed by the Breusch-Pagan test (χ² = 11.9, p = .16).
For accuracy scores, a generalized linear mixed-effects model (GLMM) was fitted to the trial-level responses (0 = incorrect, 1 = correct), with participant- and item-level random intercepts. The model revealed significant positive effects of acuity (b = 0.61, p = .001), integration (b = 0.60, p < .001), working memory (b = 1.62, p = .018), and outside-class L2 use (b = 0.28, p = .022). Other experience-related factors were not significant (see Table 6). The fixed effects alone explained 8.1% of the variance (marginal R²), and the fixed and random effects together explained 64.5% (conditional R²).
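For reference, a model of this kind can be written in general form as follows. This is a generic specification of a logistic GLMM with crossed random intercepts, not a transcription of the fitted model; the symbols are notation introduced here for illustration, with y_ij the response of participant i to item j, x_ki the k-th participant-level predictor, and u_i and w_j the participant and item random intercepts.

```latex
\operatorname{logit}\,\Pr(y_{ij} = 1)
  = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki} + u_i + w_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
w_j \sim \mathcal{N}(0, \sigma_w^2)
```

Under this formulation, each coefficient b reported above is a change in the log-odds of a correct response per one-unit (here, one standard deviation) increase in the predictor.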
Summary of GLMM for word stress perception accuracy scores

Note: *indicates p < .05.
Building on the GLMM results, a post-hoc examination of item-level variability was conducted using the model’s random intercepts. Item-level accuracy varied considerably across the stimulus set (Var = 4.00, SD = 2.00), indicating that some words were systematically more difficult than others. The random intercepts showed that trisyllabic words such as entertain, introduce, and disappear were among the most challenging, with large negative intercepts reflecting lower accuracy. In contrast, disyllabic or highly familiar words such as muscle and castle showed high positive intercepts, indicating near-ceiling accuracy rates. Importantly, these item differences occurred even though all words were drawn from high-frequency 1st–2nd 1000-word families (Nation, Reference Nation2017), suggesting that while word familiarity was controlled, factors such as word length and phonological complexity contributed to participants’ perception of English word stress.
The final objective of the study was to determine which learners’ perceptual-cognitive abilities and L2 learning experiences are associated with their fluent and stable processing of word stress. To delve into the speed and stability of word stress identification, participants’ trial-level log-transformed RTs and participant-level CVs on correct responses were analyzed.
Linear mixed-effects modelling of log-transformed RT, with random effects of participants and items, indicated a significant positive effect of in-class L2 use (b = 0.09, p = .004), while other predictors did not reach significance. Conditional R² for this model was 0.63, suggesting that the majority of variance was captured by participant- and item-level random effects (see Table 7).
Summary of LMM for word stress identification speed (log RT)

Note: *indicates p < .05.
Linear regression analyses for CV revealed several significant predictors (e.g., acuity, integration, WM, some experience-related factors, musical training experience, and in-class and outside-class practice), though the model explained a relatively small proportion of variance (R² = 0.06, Adj. R² = 0.06) (see Table 8).
Summary of LM for word stress identification stability (CV)

Note: *indicates p < .05.
These results collectively suggest that perceptual-cognitive aptitude—particularly auditory processing and WM—is the primary driver of accurate word stress perception, while experience-related factors play a more subtle role in processing speed and stability in the current FL context.
Discussion
Focusing on a total of 119 Japanese learners of English in FL settings, the current study investigated the role of aptitude (auditory processing and working memory) and L2 learning and musical training experience in the development of English word stress among Japanese EFL learners. The present analyses provide converging evidence on the factors contributing to learners’ ability to perceive an unfamiliar L2 prosody in an input-limited FL context.
Regarding word stress perception accuracy, the GLMM demonstrated that auditory processing (acuity and integration), working memory, and outside-class L2 use positively predicted accurate word stress identification. This finding aligns with previous research (Sun et al., Reference Sun, Saito and Tierney2021), highlighting that post-pubertal L2 speech outcomes are shaped by not only experience-related factors but also perceptual-cognitive aptitudes. In particular, Flege and Bohn (Reference Flege, Bohn and Wayland2021) emphasize that individual differences in domain-general auditory processing ability represent an important source of variability in L2 phonological learning, as learners must detect and encode fine-grained acoustic cues to establish new phonetic categories. The present findings provide novel empirical support for this perspective: learners with stronger auditory acuity and integration were better able to perceive word stress patterns, even under the constrained input conditions of FL settings. This underscores the critical role of perceptual-cognitive aptitude in suprasegmental acquisition and suggests that the SLM-r may be relevant not only to L2 segmental but also to suprasegmental learning.
For adult learners in FL contexts where authentic L2 input is limited, acquiring L2 prosody is challenging because learners must attend to L2-specific acoustic cues despite strong influence from L1 cue-weighting patterns (McAllister et al., Reference McAllister, Flege and Piske2002). This challenge is significant because accurate suprasegmental perception and production directly impact speech comprehensibility and intelligibility (Cooper et al., Reference Cooper, Cutler and Wales2002; Field, Reference Field2005; Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012; Kang et al., Reference Kang, Rubin and Pickering2010). The current findings point to mechanisms that may mitigate this challenge for some learners: L2 learners’ stronger perceptual-cognitive aptitude and more frequent L2 use outside the classroom help them better notice, analyze, and internalize the acoustic cues necessary for English word stress. This finding supports the auditory precision hypothesis-L2 (Mueller et al., Reference Mueller, Friederici and Männel2012), which posits that precise auditory processing acts as a bottleneck for language learning.
In contrast, the study found that experience within the FL classroom had limited impact on word stress acquisition. While participants demonstrated relatively high accuracy on familiar vocabulary (i.e., the 1st and 2nd 1000-word families; M = 83.4%, SD = 12.3), neither longer L2 learning experience nor targeted instruction experience in listening, speaking, or word stress significantly predicted accuracy. These results indicate that L2 exposure through a typical FL classroom alone is insufficient for acquiring English word stress. Although increased input through FL classrooms has been found to support L2 segmental acquisition (e.g., Muñoz, Reference Muñoz2014; Saito & Hanzawa, Reference Saito and Hanzawa2018), more extensive input through outside-class practice and higher aptitude may be necessary for successful suprasegmental learning.
Another possible explanation for this discrepancy is a lack of attention to word stress during FL classroom instruction in Japan. According to the participants’ self-reports, only 53% of participants had learned about English word stress (even briefly) during school English lessons. Such minimal instruction and practice compared to segmental features may necessitate greater reliance on learners’ aptitude and outside-class practice from the early stages of suprasegmental learning (i.e., stress perception of high-frequency words).
One thing to note is that the model explained substantially more variance when including random effects for participants (Var = 1.23, SD = 1.11) and items (Var = 4.00, SD = 2.00). Post-hoc examination of the item random intercepts revealed that some trisyllabic words (e.g., entertain, introduce, disappear) had lower accuracy rates than shorter, high-frequency words (e.g., muscle, castle, famous). This suggests that, even within very high-frequency words (1k–2k families), lexical length and phonological familiarity contributed to variability in participants’ perception of word stress.
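The relationship between the reported variance components and the two R² values can be checked with a short calculation. The sketch below assumes the commonly used variance-partitioning approach for logistic mixed models, in which the residual variance on the latent scale is fixed at π²/3; whether this exact method was used in the study is an assumption. The fixed-effects variance is not reported in the text, so it is back-calculated here from the reported marginal R² of 8.1%, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

LOGIT_RESIDUAL_VAR = math.pi ** 2 / 3  # latent-scale residual variance, logit link


def glmm_r2(var_fixed: float, random_vars: Sequence[float]) -> Tuple[float, float]:
    """Return (marginal R2, conditional R2) for a logistic GLMM."""
    var_random = sum(random_vars)
    total = var_fixed + var_random + LOGIT_RESIDUAL_VAR
    return var_fixed / total, (var_fixed + var_random) / total


def fixed_var_from_marginal(r2_marginal: float, random_vars: Sequence[float]) -> float:
    """Back-calculate the fixed-effects variance implied by a marginal R2."""
    rest = sum(random_vars) + LOGIT_RESIDUAL_VAR
    return r2_marginal * rest / (1 - r2_marginal)


# Reported random-intercept variances: participants 1.23, items 4.00.
random_vars = [1.23, 4.00]
var_fixed = fixed_var_from_marginal(0.081, random_vars)  # back-calculated, not reported
r2_marginal, r2_conditional = glmm_r2(var_fixed, random_vars)
```

Under these assumptions the conditional R² comes out at approximately 0.645, in line with the 64.5% reported for the model including random effects, and most of that share is attributable to the item-level variance.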
With respect to processing speed and stability, the predictors of RT and CV were less robust, indicating that response speed and stability may be less directly influenced by individual differences in L2 learning experience or aptitude. Although both faster processing and more consistent responses were significantly correlated with more accurate responses, those correlations were weak (ρ = −0.21 and ρ = −0.22, respectively), and substantial variability was observed at both participant and item levels. This suggests that even learners who accurately identified stressed syllables of high-frequency English words often did so without fluent processing. In other words, only a subset of learners with superior aptitude and frequent outside-class L2 use were able to perceive English word stress accurately and efficiently, highlighting the challenges of mastering L2 prosodic features in typical FL contexts.
Overall, word stress acquisition was largely associated with a combination of factors beyond L2 instruction experiences, including auditory processing, working memory, and outside-class L2 use. These findings suggest that in the absence of sufficient targeted instruction and practice of word stress within the FL classroom, learners’ development heavily depends on perceptual-cognitive aptitude and L2 practice in real-life settings, whereas processing speed and consistency are more closely related to the overall consolidation of the skill. An important question that remains unanswered in the present study concerns the extent to which the observed aptitude-related advantages are malleable through focused instruction. Since this study was cross-sectional, it cannot determine whether learners with heightened sensitivity to specific auditory dimensions (e.g., pitch, formant, duration, or rise time) would differentially benefit from instructional interventions that explicitly direct attention to L2-relevant stress cues. To address this, future longitudinal research involving training or interventions is encouraged to examine whether learners’ pre-instructional auditory profiles predict both baseline performance and training gains. Such work will help clarify whether aptitude functions as a stable predictor of L2 prosodic learning or whether its effects can be amplified or attenuated by instruction (cf. Hosaka & Saito, Reference Hosaka and Saitoforthcoming; Kachlicka et al., Reference Kachlicka, Symons, Ruan, Saito, Dick and Tierney2026).
Despite the insights gained, this study has several limitations that should be acknowledged. First, the analyses focused on word stress perception and did not comprehensively examine other important aspects of stress production, such as vowel reduction in unstressed syllables. Japanese speakers often demonstrate great difficulties with vowel reduction and produce longer durations for unstressed vowels. This difficulty likely stems from the interference of L1 perceptual strategies. In Japanese, every syllable is typically stressed equally, which may hinder learners’ ability to attenuate unstressed vowels. Therefore, it would be valuable to assess diverse acoustic properties of L2 learners’ production using acoustic analysis software like Praat and provide deeper insights into how their productions of unstressed syllables differ from those of native speakers.
Second, this study’s findings are based on a specific group of learners (119 Japanese beginner to intermediate [B1–B2] English learners in an FL setting) and on word stress knowledge of high-frequency words. To generalize these findings, future research should replicate the study with more advanced or unfamiliar vocabulary, larger samples, and participants of varying proficiency levels. Considering the number of predictors included in the models, a larger sample size would be preferable to ensure more robust and stable estimates. The influence of aptitude may be even more pronounced when using lower-frequency words, mirroring findings from previous L2 segmental studies where sustained development and greater attainment required more than extended FL learning experience. Therefore, a longitudinal study incorporating more diverse English words and learner profiles would be valuable for establishing a clearer developmental trajectory of L2 suprasegmental acquisition.
Third, a methodological limitation involves the treatment of WM. Although the WM task was implemented during participant screening, its measure was subsequently treated as a fixed effect in the analysis. This approach may therefore overestimate its contribution or fail to capture domain-specific aspects of phonological working memory relevant to suprasegmental learning. Future studies could include more targeted measures of WM to better isolate its role in L2 prosody acquisition.
Fourth, the design of the perception task may have introduced potential confounds. Written words were presented with numbered syllables, which could provide orthographic or syllabification cues that facilitated stress identification independently of participants’ auditory processing abilities. Future studies should consider using purely auditory stimuli without visual cues to more accurately assess the role of perceptual-cognitive aptitude in L2 suprasegmental acquisition.
Lastly, the 10 independent variables included in the multiple regression analysis explained only part of the total variance in the original dataset, suggesting that other factors likely contribute to word stress acquisition. Although exploratory factor analysis was used to reduce multicollinearity among experience-related variables, the overall Kaiser–Meyer–Olkin measure was marginal (0.57), indicating that some factor composites may be unstable. While this study has focused extensively on auditory processing, working memory, and FL experience, other individual differences, such as motivation (Moyer, Reference Moyer1999; Schmidt & Watanabe, Reference Schmidt, Watanabe, Dörnyei and Schmidt2001), anxiety (Baran-Łucarz, Reference Baran-Łucarz2016), and attention control (Mora & Darcy, Reference Mora, Darcy, Isaacs and Trofimovich2016), have also been shown to influence L2 pronunciation learning. Future research should adopt a broader approach by exploring the contributions and interactions of various individual difference variables to gain a more comprehensive understanding of the complex and dynamic nature of L2 pronunciation development.
Conclusion
This study examined how perceptual-cognitive aptitude (auditory processing and working memory) and L2 experience influence Japanese EFL learners’ perception of English word stress. Learners with stronger auditory acuity, higher working memory, and more frequent outside-class L2 use demonstrated better stress perception, supporting the central role of aptitude in L2 suprasegmental learning. The results also suggest that typical FL classroom instruction alone may be insufficient and that lexical characteristics, such as word length and phonological complexity, modulate learners’ performance. Despite limitations in sample size and in the range of predictors included, these findings provide novel evidence on how aptitude and experience jointly contribute to L2 suprasegmental learning and underscore the importance of considering learners’ perceptual-cognitive abilities when designing interventions in FL classrooms.
Replication package
All tasks, including the screening, auditory processing, and English word stress perception tasks, were delivered in Japanese via the online psychology experiment builder Gorilla (Anwyl-Irvine et al., 2020). These tasks are currently shared as Open Materials (https://app.gorilla.sc/openmaterials/1030671). The anonymized experimental data and analysis code of this study are publicly available at https://osf.io/6ku3y/.
Acknowledgements
We gratefully acknowledge the participants, as well as the reviewers and Associate Editor of Applied Psycholinguistics, Melissa Baese-Berk, for their insightful comments on an earlier version of the manuscript.
Funding statement
The study was derived from part of the first author’s (IH) PhD dissertation submitted to University College London. The project was funded by a Language Learning Dissertation Grant awarded to IH and by a Leverhulme Trust Grant (RPG-2024-391) and a UK-ISPF Grant (1185702223) awarded to KS.
Competing interests
The author(s) declare none.
Consent to participate
All participants provided written informed consent prior to data collection.
Ethical considerations
This study was approved by the IOE Research Ethics Committee of University College London (approval no. Z6364106/2024/03/28 social research) on April 8th, 2024.