A LONGITUDINAL INVESTIGATION OF EXPLICIT AND IMPLICIT AUDITORY PROCESSING IN L2 SEGMENTAL AND SUPRASEGMENTAL ACQUISITION

Abstract Precise auditory perception at a subcortical level (neural representation and encoding of sound) has been suggested as a form of implicit L2 aptitude in naturalistic settings. Emerging evidence suggests that such implicit aptitude explains some variance in L2 speech perception and production among adult learners with different first language backgrounds and immersion experience. By examining 46 Chinese learners of English, the current study longitudinally investigated the extent to which explicit and implicit auditory processing ability could predict L2 segmental and prosody acquisition over a 5-month early immersion. According to the results, participants’ L2 gains were associated with more explicit and integrative auditory processing ability (remembering and reproducing music sequences), while the role of implicit, preconscious perception appeared to be negligible at the initial stage of postpubertal L2 speech learning.


INTRODUCTION
It is widely acknowledged that the great individual variability in postpuberty second language (L2) learning success cannot be solely explained by experience factors (i.e., the extent to which learners practice a target language). This is arguably because even when exposed to a similar amount and quality of L2 input, learners differ in the receptive and productive L2 skills they can achieve. One important factor contributing to these individual differences could be perceptual (e.g., phonemic discrimination) and broader cognitive abilities (e.g., working memory) that are involved in the underlying mechanisms of language processing. Variability in these perceptual and broader cognitive foundations could therefore help determine an individual's readiness to learn a language (i.e., language learning aptitude). Measuring these skills could thus enable a further examination of the association between aptitude and L2 learning gains, which is the goal of the current study.

SECOND LANGUAGE LEARNING APTITUDE
Over the past five decades, the role of aptitude in second language acquisition (SLA) has been extensively researched (see Li, 2016 for a review). Originally, aptitude was conceptualized as perceptual and broader cognitive abilities that help determine the success of intentional and explicit L2 learning in classroom settings (i.e., foreign language aptitude). According to the seminal model proposed by Carroll and Sapon (1959), the components of aptitude include phonetic coding, grammatical sensitivity, inductive learning, and rote memory. These abilities are believed to be essential to the explicit processing stages of noticing, patterning, and associating (Li, 2015; Skehan, 2002). Indeed, measured by the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1959), aptitude scores were found to demonstrate moderate correlations with foreign language achievement, especially at the initial learning stage (e.g., Bialystok & Fröhlich, 1978; Sparks et al., 1998). Similar findings have been reported using other widely used aptitude tests based on Carroll's model such as the Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966) and the LLAMA Test (Meara, 2005). For example, by relating PLAB aptitude scores to the final foreign language grades of high school students, Curtin et al. (1983) found that more variance was explained by aptitude among beginner-level students than among advanced-level students.
More recently, researchers have begun to examine the relationship between aptitude and L2 learning in naturalistic settings. In the existing literature, the explanatory power of traditional foreign language aptitude varies according to a range of factors, such as age of onset and the type of linguistic structures being learned or assessed (e.g., Abrahamsson & Hyltenstam, 2008; DeKeyser, 2000; Granena & Long, 2013). In light of the more complex nature of naturalistic L2 learning and processing, a growing number of scholars have emphasized the importance of capturing one's ability to learn a language not only explicitly (i.e., with conscious awareness) but also implicitly (i.e., without conscious awareness) (Doughty, 2019). Thus, in the current study, the former type of ability is labeled as explicit language aptitude and the latter type is labeled as implicit language aptitude.
To capture the language aptitude of both types, a range of instruments have recently been introduced to the SLA field. For example, the Hi-LAB test battery (Linck et al., 2013) was designed to identify the perceptual and broader cognitive abilities that could predict adult learners' L2 listening and reading attainment at a highly advanced level. Out of the 11 domain-general cognitive abilities (underlying general-purpose learning) and domain-specific perceptual abilities (specific to language learning) covered in Hi-LAB, implicit sequence learning ability (measured using serial reaction time), associative memory, and phonological short-term memory were found to be predictors of high-level L2 achievement, whereas the two measures of domain-specific perceptual abilities (phonemic discrimination and categorization) were not.
One domain-general perceptual ability, however, has been largely overlooked by previous investigations of language learning aptitude but nonetheless could play an important role in language acquisition: auditory processing. This refers to the ability to precisely and accurately perceive sound characteristics, which is commonly assessed behaviorally by asking participants to discriminate or reproduce individual acoustic dimensions of sound, or by examining the size, stability, or accuracy of neural responses to sound using neurophysiological measures.

AUDITORY PROCESSING ABILITY AND L1 ACQUISITION IN CHILDREN AND ADULTS
Speech contains information about language structure on many different levels, from acoustic patterns on a time scale of tens of milliseconds that distinguish phonemes, to patterns unfolding over seconds that convey information about conversational turn-taking, with many levels in between. Variation along several acoustic dimensions, including pitch, duration, amplitude, and spectral shape (formant frequency), conveys information about many different aspects of language, including word boundaries (Cutler & Butterfield, 1992), lexical stress (Fear et al., 1995), phrase boundaries (Marslen-Wilson et al., 1992), and information structure (Breen et al., 2010). Robust auditory processing may be facilitated by decreased variability of perceptual input and lead to rapid acquisition of knowledge about phonemic and prosodic categories (Toscano & McMurray, 2010), with potentially beneficial consequences for the learning of language structures on multiple levels.
There is a long history of research in the L1 acquisition literature on the relationship between precise auditory processing and various language skills, including reading and grammatical knowledge. In this literature, auditory processing has been measured in many different ways; here we will focus on three particularly popular techniques. First, the degree of noise in a particular auditory channel can be measured using adaptive discrimination tests (psychoacoustic thresholds). For example, children with specific language impairment (SLI) or dyslexia have been shown to be more likely to have higher thresholds along a number of different auditory dimensions, including frequency (McArthur & Bishop, 2005), duration (Casini et al., 2018), and amplitude rise time (Goswami et al., 2002). In the same vein, adults with psychoacoustic difficulties are more likely to have reading difficulties (Ahissar et al., 2000; Walker et al., 2002). Second, the ability to discriminate or reproduce melodic and rhythmic patterns can be assessed, and these abilities tend to be somewhat poorer in children with poor phonological and reading skills (Flaugnacco et al., 2014; Grube et al., 2012).
The advantage of these behavioral techniques is that they are relatively cost-effective, quick to implement, and simple enough to be performed by children. However, a major disadvantage of these techniques is that they touch on only conscious and attentional processing of sounds (rather than bottom-up, implicit auditory processing) and could also reflect modality-general (i.e., not specific to sound) cognitive skills such as attention (Snowling et al., 2018). One way to attempt to isolate implicit auditory processing is to make use of neural measures of sound processing that are comparatively unaffected by cognitive state. One such measure is an electroencephalographic (EEG) response known as the frequency following response (FFR), which is the third way to assess auditory processing.
The FFR was first recorded by Moushegian et al. (1973). They reported that periodic auditory stimuli give rise to an EEG response that mirrors the frequency content of the stimulus. That is, a pure tone repeating at 400 cycles per second would give rise to a roughly sinusoidal response repeating 400 times a second. The fidelity of the response to the stimulus is such that the difference between vowel sounds presented to participants is identifiable based on the difference in the frequency content of the responses (Galbraith et al., 1998). The FFR is a rapid response, beginning only 10 ms after a sound is presented to a participant. This suggests that its primary generator is not the cerebral cortex, given that 10 ms is not sufficient for sound information to reach the cortex. Indeed, early work suggested that the FFR is produced exclusively by subcortical generators. For example, Smith et al. (1975) showed that cooling of the inferior colliculus (a region in the auditory midbrain) greatly diminished the scalp-recorded FFR, while Sohmer et al. (1977) showed that patients with upper brain-stem lesions show no scalp-recorded FFR. Moreover, Kiren et al. (1994) showed that lesioning the inferior colliculus in cats greatly diminishes the FFR, while cortical lesions have no effect. However, more recent evidence has suggested a more complex set of generators of the FFR. Kuwada et al. (2002), for example, found that deactivating various stations along the auditory neuraxis in rabbits decreased the FFR to a certain extent, and that the source of the FFR seemed to vary with frequency, with a more cortical origin for lower frequencies and a more subcortical origin for higher frequencies.
This view has been borne out by more recent work; while fMRI, for example, has produced evidence for a modest cortical contribution to the FFR, research using high-density EEG suggests a rather low upper-frequency limit for cortical contributions to the FFR, with no contribution whatsoever above 150 Hz. Overall, however, the FFR to a complex sound (such as a speech sound) likely mostly reflects processing within subcortical generators. Supporting this view, White-Schwoch et al. (2017) found that trial-by-trial variability measures in the FFR closely tracked timing jitter in the inferior colliculus in guinea pigs, as measured using extracellular recordings.
That the FFR is driven by generators in areas close to the auditory periphery suggests that the influence of attention and cortical state on the response might be attenuated relative to more cortical responses, and this supposition is borne out by experimental data. For example, the FFR is commonly collected when participants are instructed not to attend to the stimulus but instead to watch an unrelated movie. Indeed, the FFR can even be recorded while participants sleep (Yamada et al., 1977). Moreover, a number of studies have found no effect of attention on the FFR (see Varghese et al., 2015 for experimental data and a clear summary of prior research on this topic). Similarly, there does not seem to be any effect of phonemic categorization on the FFR (Bidelman et al., 2013). However, there is evidence that the FFR can be affected by statistical regularities present in the stimuli. The FFR can, for example, be affected by the novelty of a stimulus relative to its surrounding stimuli (Gao et al., 2014; Slabu et al., 2012) as well as its predictability (Lau et al., 2017). Overall, then, the FFR is likely to be largely a measure of implicit auditory processing, although there may be some small effects of explicit (i.e., attention-driven) processing on the lower frequencies of the response (Holmes et al., 2018).
The FFR is not only of interest to neurophysiologists but also has recently been adopted by cognitive neuroscientists and psychologists interested in the neural underpinnings of individual differences in speech perception and language learning. One reason for this is that there are large individual differences in a number of characteristics of the FFR across participants (including its timing, consistency, and strength of spectral encoding), which are highly replicable within participants (Easwar et al., 2020; Hornickel et al., 2012), with between-session correlations for some metrics reaching r = 0.88. Moreover, these metrics have been shown to relate to individual differences in behavioral measures of auditory processing, suggesting that FFR variability can be a window into variability in auditory skills in the general population. For example, the trial-by-trial consistency of the response is linked to the precision with which individuals are able to synchronize movements to a metronome in both adults (Tierney & Kraus, 2013) and children (Woodruff Carr et al., 2016); the amplitude of the response has been linked to the ability to discriminate stimuli based on amplitude modulations; the strength of the representation of the fundamental frequency has been linked to the precision of pitch perception (Carcagno & Plack, 2011; Krishnan et al., 2010; Marmel et al., 2013); and the accuracy of vowel formant encoding in the FFR has been linked to vowel recognition performance in participants' L1 (Won et al., 2016).
Given that individual differences in the FFR have been linked to variability in auditory abilities, researchers have also used the FFR to investigate the relationship between language skills and auditory processing, both by comparing language impaired and typically developing populations and by investigating individual differences in L1 skills across the adult population. The proposed mechanism by which auditory processing might impact language acquisition is that variable or imprecise neural representation of acoustic attributes, as reflected by decreased FFR phase-locking, could delay the acquisition of knowledge about phonemic and prosodic categories, which could in turn possibly delay the acquisition of other language skills, including phonological awareness and reading. One consistent finding is that the strength of encoding of the fundamental frequency is linked to the ability to perceive speech in background noise, which has been shown in typically developing children (Anderson et al., 2010), young adults (Song et al., 2011), and older adults (Anderson et al., 2011). The FFR has been linked to other language skills as well, particularly reading. For example, Banai et al. (2009) found that phonological awareness and word reading correlated with the strength of encoding of middle harmonics; the participants were a mix of language-impaired and typically developing children, but the analyses related predictors derived from the FFR to continuous variation in outcome measures. Hornickel and Kraus (2013) found that trial-by-trial FFR consistency was linked to reading ability in a large group of children, most of whom were typically developing but a few of whom had been diagnosed as language impaired.
Given that the FFR largely reflects processing in peripheral auditory areas, is mostly unaffected by attention, is linked to individual variation in auditory skills, and has been tied to variation in L1 skills such as speech-in-noise perception and reading, we suggest in the next section that it could be a promising implicit aptitude measure relevant to second language acquisition as well.

AUDITORY PROCESSING ABILITY AND L2 LEARNING IN ADULTS
Recently, researchers have begun to ask whether the link between auditory processing and successful language acquisition could extend to L2 learning. At first, these studies focused on predicting the impact of short-term in-lab training on adults' L2 speech learning. For example, in Lengeris and Hazan (2010), the formant discrimination thresholds of 18 Greek adult learners were related to their training success in English vowel perception and production. After receiving five phonetic training sessions on 14 English vowels over 2 weeks, learners with lower formant discrimination thresholds before the training tended to perform better at English vowel identification and production tasks after the training. However, the correlation analyses did not take pre-training language skills into account. Thus, it is not clear whether formant discrimination acuity could predict the individual differences in learning during the training sessions. Focusing on the suprasegmental aspect of speech perception, Wong and Perrachione (2007) examined auditory processing ability as a predictor of the attainment of nonnative Mandarin tone perception during the training. A group of 17 adult American English speakers who reported zero exposure to tone languages received training sessions on identifying three Mandarin tones (level, rising, and falling). According to the regression analysis results, English speakers who could better identify pitch patterns before the training showed a higher level of attainment across the training (the initial stage of learning). The authors also examined the effects of musical experience and found that musicians were more likely to be successful learners of Mandarin tones than nonmusicians. In this study, the causal effect of auditory processing ability is somewhat clearer, as the link between auditory processing and L2 speech perception learning cannot reflect an influence of language training on auditory skills before the training.
Individual differences in the robustness of the encoding of acoustic information in the FFR have also been shown to predict in-lab nonnative speech perception learning. Chandrasekaran et al. (2012) divided a sample of adult English speakers into two groups, one of which had more consistent neural encoding of pitch changes than the other. It was found that the former group demonstrated a faster learning rate across the nine training sessions and almost doubled the latter group's identification ratio of Mandarin tones at the end of the training. Findings in these studies suggest that both explicit and implicit auditory processing ability could predict the rate of L2 speech learning in laboratory settings.
Based on these studies, it seems that (a) domain-general auditory processing abilities not only act as an essential foundation for L1 acquisition but could also boost the rate of initial L2 learning; and (b) the initial stage of L2 learning in laboratory settings draws upon both explicit and implicit, preconscious auditory processing. However, little research has yet been conducted to investigate the role of auditory processing ability in naturalistic L2 learning in immersion contexts, wherein learners acquire a target language through intensive exposure to meaningful, interactive, and authentic input in a similar fashion to L1 acquisition.
In our previous research, we have proposed auditory processing as one component of explicit and implicit aptitude relevant to every stage of naturalistic L2 speech learning (see Saito et al., 2019). Under this framework, explicit auditory processing is defined as one's ability to process sound with some degree of awareness and attention, which we assess using behavioral tasks where participants are aware of the purpose and make careful judgements (e.g., discrimination and reproduction). Implicit auditory processing is defined as the early encoding of sound features in subcortical regions of the auditory system, which are relatively unaffected by attention; thus, we assess implicit auditory processing using the FFR.
Thus far, we have cross-sectionally examined the extent to which explicit and implicit auditory processing correlated with L2 English speech perception and production in naturalistic settings among adult learners with different L1 backgrounds (Chinese, Japanese, Spanish, Polish) and varied immersion experience (1-20 years) (e.g., Kachlicka et al., 2019; Omote et al., 2017; Saito et al., 2019). Omote et al. (2017) studied the link between implicit auditory processing (i.e., FFR) and English speech perception of 25 adult Japanese speakers with varied immersion experience in the United Kingdom (M = 2.6 years, SD = 3.1). Participants were tested for the perception of English vowel and consonant contrasts with which this population tends to have difficulty (e.g., /r/-/l/ contrast). According to the results, the more consistent participants' neural responses were from trial to trial, and the more robust the representation of the lower frequencies of sound was, the better they performed in the consonant (but not vowel) perception test. Implicit auditory processing was also found to be linked to L2 speech perception (Kachlicka et al., 2019) and to L2 speech production of L1 Polish and L1 Chinese speakers. As for explicit auditory processing ability, both sound discrimination threshold and rhythmic memory were associated with L2 segmental (Kachlicka et al., 2019) and prosodic perception. Rhythmic and melodic memory have also been found to relate to the fluency of L2 speech production.

MOTIVATION FOR THE CURRENT STUDY
One limitation of the previous studies reviewed above is that they were cross-sectional in design: auditory processing and language learning success were both measured after a period of immersion. This means that the causality of the link between auditory processing and language learning cannot be established. In particular, it remains possible that enhanced auditory processing is a consequence of successful L2 learning, rather than a predictor (Krizman et al., 2014; Skoe et al., 2017). In the current study, we examined the link between explicit and implicit auditory processing and L2 speech learning using a longitudinal design. To answer the research question of whether and to what degree explicit and implicit auditory processing ability could predict L2 speech perception gains during the immersion, we assessed, in 46 Chinese learners of English, phonemic and prosodic English speech perception before (Time 1) and after (Time 2) a 5-month period of immersion within their first year in the United Kingdom, and auditory discrimination, melodic/rhythmic memory, and neural encoding of sound before the immersion. Based on the results from previous studies (e.g., Kachlicka et al., 2019), both explicit and implicit aptitude were hypothesized to show predictive power for the L2 speech perception gains.

PARTICIPANTS
A total of 50 Chinese international students were recruited from a few universities in London (majoring in a wide range of subjects including education, engineering, science, linguistics, and management). Forty-six returned for testing at Time 2 (3 males, 43 females, M age = 23.6, Range = 21-29). They were all considered to be late L2 learners of English in the early phase of immersion, as they arrived in the United Kingdom after the age of 21 years (M = 23, Range = 21-28). At Time 1 they had been in the United Kingdom for around 5 months (M = 4.5 months, Range = 3.9-5.3). Prior to immersion, they had received an extensive amount of foreign language education in China (M = 13.5 years, Range = 10-19). According to their IELTS scores (above 6.5), their English proficiency levels were intermediate to advanced. Additionally, 11 students reported various lengths of musical training experience (M = 6.95 years, Range = 0.5-20).

AUDITORY PROCESSING MEASURES
Three auditory processing abilities were tested in the current study, including two behavioral measures assessing explicit auditory processing (with conscious awareness) and one neurophysiological measure tapping into implicit processing (without conscious awareness): (a) sound discrimination threshold, (b) music memory, and (c) neural encoding of sound.

Sound Discrimination Threshold
Following the design in the previous study (Kachlicka et al., 2019), participants' auditory processing threshold was assessed in terms of four different acoustic features of sounds: pitch, formant, duration, and amplitude rise time. During each subtest, participants were asked to listen to a set of three tones (with a half-second interval in between) and identify whether the first or the third one was different from the middle one by pressing the key "1" or "3." For each subtest, a total of 100 target stimuli and one baseline stimulus were prepared consisting of artificial sounds varying along a continuum of the target acoustic feature (pitch, formant, duration, or amplitude rise time), forming 100 stimulus levels. Higher levels along the continuum were linked to a bigger difference between the baseline and target stimulus and, therefore, easier discrimination. Following the adaptive threshold procedure in Levitt (1971), the tests started with level 50 (i.e., with the target stimulus 50 steps away from the baseline stimulus) and the level changed depending on participants' responses. When the response was incorrect, the difference between stimuli in the next trial became wider (initially by 10 steps), which made the discrimination task easier. When two correct responses were made in a row, the difference between stimuli in the next trial was narrowed (initially by 10 steps), making the task more difficult. Prior to the first incorrect response, however, only a single correct response was necessary for the task to become more difficult. Once the direction of step changes reversed, the step size of the change became smaller, first to five, then to two, and finally to one step, which was then retained till the end of the test (e.g., 50 → 40 → 30 → 35 → 35 → 33 → 33 → 34 → 34 → 33 → 33 → 32 → 32 …).
The tests stopped after either 70 trials or eight reversals, and the sound discrimination threshold scores were calculated by averaging the difference levels at which the reversals occurred from the third reversal onward (the third reversal is at level 34 in the preceding example). Lower threshold scores reflect better sensitivity to differences between sounds.
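The adaptive procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function names, the clamping of levels to the range 1-100, and the choice to log the level reached immediately after each reversal (which matches the "level 34" remark in the worked example) are assumptions.

```python
def run_staircase(responses, start=50, min_level=1, max_level=100,
                  max_trials=70, max_reversals=8):
    """Simulate the adaptive track for a sequence of True/False (correct/incorrect) responses.

    Returns (track, reversal_levels). Names and level clamping are illustrative assumptions.
    """
    steps = [10, 5, 2, 1]          # step size shrinks at each reversal, then stays at 1
    step_idx = 0
    level = start
    direction = None               # -1 = harder (narrower difference), +1 = easier
    seen_error = False
    correct_streak = 0
    reversal_levels = []
    track = [level]

    for trial, correct in enumerate(responses):
        if trial >= max_trials or len(reversal_levels) >= max_reversals:
            break
        move = 0
        if correct:
            correct_streak += 1
            # Before the first error a single correct response is enough.
            needed = 2 if seen_error else 1
            if correct_streak >= needed:
                move = -1
                correct_streak = 0
        else:
            seen_error = True
            correct_streak = 0
            move = +1

        if move != 0:
            is_reversal = direction is not None and move != direction
            if is_reversal:
                step_idx = min(step_idx + 1, len(steps) - 1)
            level = max(min_level, min(max_level, level + move * steps[step_idx]))
            if is_reversal:
                # Assumption: log the level reached right after the reversal.
                reversal_levels.append(level)
            direction = move
        track.append(level)

    return track, reversal_levels


def threshold_from_reversals(reversal_levels, first_counted=3):
    """Average the reversal levels from the third reversal onward (None if too few)."""
    counted = reversal_levels[first_counted - 1:]
    if not counted:
        return None
    return sum(counted) / len(counted)
```

Feeding the response pattern correct, correct, incorrect, correct, correct, correct, incorrect, followed by five correct responses reproduces the example track 50, 40, 30, 35, 35, 33, 33, 34, 34, 33, 33, 32, 32, with the third reversal logged at level 34.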
All stimuli were created using custom MATLAB scripts. Unless described differently, all sounds consisted of 500-ms four-harmonic complex tones with fundamental frequency (F0) at 330 Hz and a 15-ms linear ramp at the beginning and end. For the pitch discrimination test, while the baseline stimulus remained at an F0 of 330 Hz, that of the target stimuli ranged from 330.3 to 360 Hz with a step of 0.3 Hz. For the duration discrimination test, the baseline stimulus was 250 ms long whereas the target stimuli ranged from 252.5 to 500 ms with a step of 2.5 ms. For the rise time discrimination test, the baseline stimulus had a rise time of 15 ms and that of the target stimuli ranged from 17.85 to 300 ms with a step of 2.85 ms. For the formant discrimination test, stimuli were complex tones with F0 at 100 Hz, the first formant (F1) at 500 Hz, the third formant (F3) at 2,500 Hz, and harmonics up to 3,000 Hz. The second formant (F2) was 1,500 Hz for the baseline stimulus and 1,502-1,700 Hz with a step of 2 Hz for the target stimuli. To form a composite measure of sound discrimination threshold, scores of all four subtests were averaged.
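Because each continuum has 100 equally spaced levels, a stimulus level maps onto a physical value as baseline plus level times step. The sketch below makes that arithmetic explicit and shows the composite as a plain mean of the four subtest thresholds; the dictionary keys and function names are hypothetical and are not taken from the authors' scripts.

```python
# (baseline value, step per level) for each continuum, as reported in the text.
CONTINUA = {
    "pitch_f0_hz": (330.0, 0.3),
    "duration_ms": (250.0, 2.5),
    "rise_time_ms": (15.0, 2.85),
    "formant_f2_hz": (1500.0, 2.0),
}


def level_to_value(feature, level):
    """Physical value of the target stimulus at a given level (1-100)."""
    if not 1 <= level <= 100:
        raise ValueError("level must be between 1 and 100")
    baseline, step = CONTINUA[feature]
    return baseline + step * level


def composite_threshold(subtest_thresholds):
    """Unweighted mean of the subtest threshold scores (expressed in levels)."""
    values = list(subtest_thresholds.values())
    return sum(values) / len(values)
```

For instance, level 1 on the pitch continuum corresponds to an F0 of about 330.3 Hz and level 100 to about 360 Hz, matching the reported endpoints.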

Music Memory
Participants' ability to remember melodic and rhythmic patterns was measured by assessing how accurately they reproduced a melody or rhythm which they listened to three times.
For the melodic memory test, 10 melodies were prepared as stimuli. Melodies were constructed from a set of five notes, consisting of the first five notes of the major scale, corresponding to frequencies of 220, 246.9, 277.2, 311.1, and 329.6 Hz. Each note was 300 ms in duration with a 50-ms cosine ramp at the beginning and end of the note. The first note of the melody was always the third pitch. Subsequent notes were then randomly chosen to be either one note higher on the scale or one note lower on the scale. This process repeated until all seven notes were chosen. The melody could not descend below 220 Hz or ascend above 329.6 Hz; once the melody reached these limits, the next note was chosen to either be closer to the center of the range or identical to the previous note.
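The melody construction rule amounts to a bounded random walk over five scale positions. The following sketch illustrates it under stated assumptions: the function names are hypothetical, the two options at a range limit (move toward the center or repeat the note) are taken to be equally likely, and the frequencies are copied as listed in the text.

```python
import random

# Frequencies of the five notes as listed in the text (Hz).
NOTE_FREQS_HZ = [220.0, 246.9, 277.2, 311.1, 329.6]


def generate_melody(rng, length=7):
    """Return a list of note indices (0-4) forming one melody."""
    top = len(NOTE_FREQS_HZ) - 1
    idx = 2                      # the first note is always the third pitch
    melody = [idx]
    while len(melody) < length:
        if idx == 0:
            # At the lower limit: repeat the note or move toward the center.
            idx = rng.choice([0, 1])
        elif idx == top:
            # At the upper limit: repeat the note or move toward the center.
            idx = rng.choice([top, top - 1])
        else:
            # Otherwise step one note up or down.
            idx += rng.choice([-1, 1])
        melody.append(idx)
    return melody


def melody_frequencies(melody):
    """Map note indices to their frequencies in Hz."""
    return [NOTE_FREQS_HZ[i] for i in melody]
```

Passing a seeded `random.Random` instance as `rng` makes a stimulus set reproducible, for example `[generate_melody(random.Random(seed)) for seed in range(10)]` for ten melodies.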
Melodies were repeated three times, with a 1-s interval between each repetition. After each of the melodies was played, five boxes numbered 5-1 were shown on the screen and participants were asked to reproduce the seven notes by clicking one box at a time (starting with Box 3); when each of these boxes was clicked the corresponding note was played. Before the test, participants had a chance to listen to an example and practice with the boxes as much as they needed to get familiar with the five pitches. To calculate response accuracy, a 1-to-1 comparison was made between the notes chosen by the participant and the notes in the target melody, and a percentage score was calculated.
For the rhythmic memory test, 10 rhythmic patterns (from Povel & Essens, 1985) were prepared as stimuli. The rhythmic patterns consisted of 16 segments, each 200 ms, containing either a rest or a drum hit. Nine of the segments contained a drum hit, while the remainder contained a rest. Rhythms were each repeated three times with a 600-ms interval in between. Drum hits consisted of a 150-ms conga drum hit sound acquired from freesound.org. After listening to the stimuli, participants were asked to reproduce the rhythm by pressing the space key. The response time of each pressing was recorded and compared with the drum hit segments in the target stimuli. First, the interresponse intervals were quantized by converting them to the nearest multiple of 200 ms. The accuracy of responses was then calculated on a segment-by-segment basis by comparing the content of each segment in the participant's rhythm (i.e., whether it contained a rest or a drum hit) to the corresponding segment in the target rhythm, and a percentage score was calculated. The scores of melodic and rhythmic memory tests were averaged to form a composite measure of music memory.
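The rhythm scoring steps (quantize the interresponse intervals, rebuild a 16-segment pattern, compare segment by segment) can be sketched as follows. Several details are assumptions made for illustration: the first key press is aligned with the first segment, ties are rounded up, presses falling beyond the sixteenth segment are ignored, and all names are hypothetical.

```python
import math


def taps_to_segments(tap_times_ms, n_segments=16, segment_ms=200):
    """Convert key-press times into a 0/1 pattern (1 = drum hit, 0 = rest)."""
    segments = [0] * n_segments
    if not tap_times_ms:
        return segments
    position = 0
    segments[0] = 1              # assumption: first press marks the first segment
    for previous, current in zip(tap_times_ms, tap_times_ms[1:]):
        # Quantize the interresponse interval to the nearest multiple of 200 ms.
        position += math.floor((current - previous) / segment_ms + 0.5)
        if 0 <= position < n_segments:
            segments[position] = 1
    return segments


def percent_match(response_segments, target_segments):
    """Percentage of segments whose content (hit or rest) matches the target."""
    matches = sum(r == t for r, t in zip(response_segments, target_segments))
    return 100.0 * matches / len(target_segments)


def music_memory_composite(melodic_score, rhythmic_score):
    """Composite music memory score: mean of the two percentage scores."""
    return (melodic_score + rhythmic_score) / 2
```

Note that one misplaced press costs two segments under this scheme, because it produces a hit where the target has a rest and leaves a rest where the target has a hit.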

Neural Encoding of Sound
As discrimination tests require conscious assessment of auditory information, the influence of explicit measures of auditory processing on L2 learning could partially reflect individual differences in attention and memory (Snowling et al., 2018). In contrast, the frequency following response to sound (FFR), an electrophysiological response that mirrors the spectro-temporal content of the evoking sound, could be a purer assessment of auditory processing, as it is relatively unaffected by cognitive and attentional state (Varghese et al., 2015). Thus, the neural encoding of spectral and temporal information of a synthesized speech syllable was examined using the FFR.
A 170-ms consonant-vowel syllable /da/ was synthesized as the stimulus using a Klatt-based synthesizer. It began with a short onset burst of 5 ms. Between 5 and 50 ms was the transitional stage where the first, second, and third formants (F1, F2, F3) changed respectively from 400 to 720 Hz, 1,700 to 1,240 Hz, and 2,580 to 2,500 Hz. Then, from 50 to 170 ms, these formants remained stable. Throughout the stimulus, the fundamental frequency (F0) was constant at 100 Hz, while F4, F5, and F6 were constant at 3,300 Hz, 3,750 Hz, and 4,900 Hz, respectively.
During the test, the /da/ sound was presented repeatedly (6,300 times over the course of 20 minutes) at a rate of 4.35 Hz, through insert earphones (ER-3; Etymotic Research) at 80 dB. To enable separate examination of the amplitude envelope and temporal fine structure of speech (Aiken & Picton, 2008), the stimulus was presented at alternating polarities (i.e., every other stimulus was inverted). To collect electrophysiological responses to the stimulus, a montage of five electrodes was placed on the head of participants-one active electrode on the center of the top of the head (i.e., at Cz), two reference electrodes on the left and right earlobes, and two ground electrodes on the forehead. Contact impedance was maintained beneath 20 kΩ. Continuous electrophysiological data were recorded using a BioSemi ActiveTwo EEG system with a sample rate of 16,384 Hz and open filters in ActiView (BioSemi) acquisition software. During the testing session, participants were encouraged to read a book or a magazine of their choice instead of paying attention to the sound. They were also asked to relax their muscles and avoid extraneous body movements.
The neurophysiological analyses were conducted using custom MATLAB scripts. To begin with, recordings were bandpass filtered between 70 and 2,000 Hz using a first-order Butterworth filter to isolate the FFR from the cortical evoked response to sound. Then, the recording was segmented from −30 to 210 ms with respect to stimulus presentation. All trials containing amplitude spikes above 35 μV were rejected as artifacts, and then the first 2,500 artifact-free responses to each stimulus polarity (5,000 total sweeps) were selected for further analysis.
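The artifact rejection and sweep selection steps can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB scripts used in the study; it assumes that the epochs have already been filtered and segmented, that each epoch is a list of amplitudes in μV, and that polarity is coded as +1 or −1. The function name `select_sweeps` and its parameters are hypothetical.

```python
def select_sweeps(epochs, polarities, threshold_uv=35.0, per_polarity=2500):
    """Keep the first `per_polarity` artifact-free epochs for each stimulus polarity.

    epochs: list of epochs, each a list of amplitudes in microvolts (assumed format).
    polarities: list of +1 / -1 codes, one per epoch (assumed coding).
    An epoch is rejected if any sample exceeds the threshold in absolute value.
    """
    kept = {+1: [], -1: []}
    for epoch, polarity in zip(epochs, polarities):
        if max(abs(sample) for sample in epoch) > threshold_uv:
            continue  # amplitude spike: treat the whole epoch as an artifact
        if len(kept[polarity]) < per_polarity:
            kept[polarity].append(epoch)
    return kept
```

With the default settings this yields at most 2,500 sweeps per polarity, that is, 5,000 sweeps in total.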
The accuracy of neural sound encoding was measured using intertrial phase-locking analysis. This analysis reveals the degree of temporal consistency in the response across trials at each frequency. Our use of intertrial phase-locking analysis rather than spectral analysis of the average response was motivated by prior work showing that intertrial phase-locking demonstrates a comparatively greater signal-to-noise ratio (Zhu et al., 2013). For each trial, a Hanning-windowed fast Fourier transform was conducted over a response time window (10-180 ms for F0 encoding, 60-180 ms for F1 and F2 encoding). The outcome of this procedure consists of a complex vector for each trial with information about the amplitude and phase of the neural response. Next, these vectors were converted to unit vectors that retained only the phase information and were averaged. Greater length of the averaged vector indicates similar phases across the unit vectors. Thus, the length was taken as the measure of intertrial phase consistency, which varies from 0 (no consistency/phases uniformly distributed) to 1 (perfect consistency/phases identical across trials). It is worth noting that there was an extra step for the analysis of F1 and F2 encoding before the intertrial phase-locking procedure: the phases of trials corresponding to one polarity were shifted 180 degrees to emphasize the temporal fine structure of the stimulus, which enables analysis of neural encoding of the higher-frequency formants (Aiken & Picton, 2008).
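The intertrial phase-locking computation can be illustrated with a minimal sketch. This is not the MATLAB code used in the study: for simplicity it evaluates a Hann-windowed Fourier coefficient at a single frequency instead of running a full FFT, and it assumes that each trial has already been cut to the relevant response window. All function and parameter names are hypothetical.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann (Hanning) window of length n."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def fourier_coefficient(trial, freq_hz, fs_hz):
    """Hann-windowed Fourier coefficient of one trial at a single frequency."""
    window = hann_window(len(trial))
    return sum(
        w * x * cmath.exp(-2j * math.pi * freq_hz * i / fs_hz)
        for i, (w, x) in enumerate(zip(window, trial))
    )


def intertrial_phase_locking(trials, freq_hz, fs_hz, polarities=None,
                             flip_one_polarity=False):
    """Length of the mean unit phase vector across trials (0 to 1).

    If flip_one_polarity is True, trials with polarity -1 are shifted by
    180 degrees (multiplied by -1), as in the temporal fine structure analysis.
    """
    total = 0j
    for k, trial in enumerate(trials):
        coefficient = fourier_coefficient(trial, freq_hz, fs_hz)
        if flip_one_polarity and polarities is not None and polarities[k] < 0:
            coefficient = -coefficient
        magnitude = abs(coefficient)
        if magnitude > 0.0:
            total += coefficient / magnitude  # keep phase only, discard amplitude
    return abs(total) / len(trials)
```

When every trial carries the same phase at the analyzed frequency, the function returns a value close to 1; when phases are spread evenly around the circle, the unit vectors cancel and the value approaches 0.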
In line with our previous study (Kachlicka et al., 2019), we focused on three frequencies that were particularly important in the evoking sound: 100 Hz (F0), 720 Hz (F1), and 1,240 Hz (F2). Neural encoding of F0 was calculated as the maximum intertrial phase coherence between 80 and 120 Hz, whereas neural encoding of F1 was calculated as the maximum intertrial phase coherence between 680 and 720 Hz, and neural encoding of F2 was calculated as the average of the maximum intertrial phase coherence between 1,180 and 1,220 Hz and the maximum between 1,280 and 1,320 Hz. To obtain a composite score of neural encoding of sound, an average score of the phase-locking consistency at F0, F1, and F2 was calculated for each participant.
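The scoring rules above can be written out as a short sketch. It assumes that phase-locking values are available as dictionaries mapping frequency (Hz) to coherence, one for the F0 analysis and one for the temporal fine structure (F1 and F2) analysis; this data layout and the function names are hypothetical.

```python
def band_maximum(coherence_by_freq, low_hz, high_hz):
    """Maximum phase coherence among frequencies within [low_hz, high_hz]."""
    return max(v for f, v in coherence_by_freq.items() if low_hz <= f <= high_hz)


def composite_encoding_score(envelope_coherence, fine_structure_coherence):
    """Average of the F0, F1, and F2 encoding scores for one participant."""
    f0 = band_maximum(envelope_coherence, 80, 120)
    f1 = band_maximum(fine_structure_coherence, 680, 720)
    # F2: mean of the maxima in the two bands on either side of 1,240 Hz.
    f2 = (band_maximum(fine_structure_coherence, 1180, 1220)
          + band_maximum(fine_structure_coherence, 1280, 1320)) / 2.0
    return (f0 + f1 + f2) / 3.0
```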

L2 AND MUSICAL EXPERIENCE MEASURE
Participants reported their experience of L2 interaction and musical training using an online questionnaire. Although participants' length of residence in the United Kingdom was similar, daily use of the target language varied widely across participants. In the current study, L2 experience was measured by recording participants' interactive L2 use, which could be crucial to L2 speech learning (e.g., Moyer, 2011). A survey was conducted at Time 2 where students reported retrospectively the weekly hours spent on L2 speaking in professional, home, and social settings during the 5-month immersion. The hours were added up to reflect the amount of L2 experience. As for musical experience, 11 participants had received regular formal training by Time 1. In the questionnaire, they provided information about the length of training in years and the focus of training. Due to the small number of participants with musical training experience, the data was encoded categorically (0 = no experience at all, 1 = any experience).

L2 PROFICIENCY MEASURES
To examine the degree of improvement in phonological knowledge of the L2, participants' ability to differentiate English speech contrasts at the segmental and suprasegmental level (i.e., speech perception) was assessed before and after the immersion period.
Following the previous study (Kachlicka et al., 2019), participants were asked to listen to a word or sequence of words and choose the word or phrase that best matched what they heard from two options shown on the screen by pressing the keys "1" (left) or "2" (right). The stimuli were minimal pairs comprising vowel contrasts (e.g., /ae/ vs. /e/), consonant voicing contrasts (e.g., /d/ vs. /t/) and phrases that differed in contrastive focus (i.e., READ books vs. read BOOKS). There were 20 pairs for each contrast. All stimuli were produced by a native speaker of Southern British English. The test was run in MATLAB. The speech perception scores were calculated as the percentage of correct answers out of the 20 trials. Participants' performance on consonant perception at Time 1 was largely at ceiling; only two participants did not achieve a perfect score, and even these participants answered only a single item incorrectly. As a consequence, only data from the vowel and prosody items was analyzed further.
Contrastive focus stimuli were taken from the Multidimensional Battery of Prosody Perception (MBOPP; Jasmin et al., 2020). This test battery consists of minimal pairs of recorded phrases that are identical lexically but differ on a single prosodic feature. The speech morphing software STRAIGHT (Kawahara & Irino, 2005) was used to morph these two phrases onto one another, so that they could be set to differ only in their durational and pitch properties. The duration and pitch cues to the location of contrastive focus were then set to 60% of their original size, in an attempt to avoid ceiling effects.

PROCEDURE
Data was collected in a lab at the Department of Psychological Sciences at Birkbeck, University of London. All auditory processing and speech perception tasks were conducted at Time 1; both speech perception tasks were also conducted at Time 2 using the same materials, together with the EEG test and the survey for L2 and musical experience, but data from the EEG test at Time 2 is not analyzed here. Tasks were administered in the following order: Sound Discrimination Test, Speech Perception Test, Music Memory Test, and Experience Survey. Finally, the FFR was recorded. All instructions were delivered in both English and Chinese to avoid any misunderstandings of the procedure. The testing sessions lasted for approximately 2 hours at Time 1 and for around 1.5 hours at Time 2.

RELIABILITY ANALYSES
The test-retest reliability of all measures was examined by correlating performance at Time 1 and Time 2 for each measure. FFR phase-locking was calculated based on the 45 participants who completed the EEG test at both Time 1 and Time 2 (r = .83, p < .001). The sound discrimination task and the music memory task were not conducted at Time 2 in this study. Thus, we conducted a separate experiment to assess test-retest reliability, in which a total of 30 L1 and L2 English users took the same sound discrimination task and music memory task twice on two consecutive days. According to the correlation analyses, their test-retest performance demonstrated relatively strong associations, i.e., r = .70, p < .001 for sound discrimination threshold, and r = .86, p < .001 for music memory, which can be taken as acceptable (Lance et al., 2006) (for details, see Supplementary Material; see also our full report in Saito, Sun, & Tierney, 2020). As for the speech perception measures, although the reliability of the prosody perception test was acceptable (r = .68, p < .001), the reliability of the vowel perception test was low (r = .47, p < .001).
A total of three independent variables (sound discrimination threshold, music memory, neural encoding of sound) and two dependent variables (L2 vowel and prosody perception scores at Time 2) were entered into the analysis. First, a set of paired-samples t-tests was run on the linguistic measures to show if participants made any significant improvement in L2 knowledge from Time 1 to Time 2. For those measures that demonstrated significant gains, the Time 2 scores were related to independent variables using partial correlation analyses (with Time 1 scores controlled for) to reveal any predictors of L2 gains. Subsequently, multiple regression analyses were conducted, with the auditory processing measures as predictors and Time 2 score as the outcome variable, and with Time 1 scores, L2 and musical experience controlled for.
According to Shapiro-Wilk's test, scores of music memory, neural encoding of sound, L2 experience, and prosody perception at Time 1 were not normally distributed (p < .05). Thus, nonparametric Spearman correlations were conducted. As for the multiple regression analyses, the residuals were normally distributed.

GAINS IN L2 SPEECH PERCEPTION
To investigate whether auditory processing ability can predict the longitudinal development of L2 speech perception, we first examined whether and to what degree participants improved in vowel and prosody perception tasks from Time 1 to Time 2. Given that some participants' performance at Time 1 already reached the ceiling (i.e., 100% correct performance) and had no room for improvement, their data was excluded from the statistical analyses, which left N = 44 for the vowel perception test and N = 31 for the prosody perception test.
As gain scores (Time 2 − Time 1) of both vowel and prosody perception (based on the downsized datasets) were normally distributed according to Shapiro-Wilk's test (p > .1), the Time 1 and Time 2 scores of both speech perception dimensions were submitted to paired-samples t-tests. The results (summarized in Table 1) showed significant gains in prosody perception scores over time, t(30) = 3.22, p = .003, but not in vowel perception scores, t(43) = −0.27, p = .788. Therefore, the subsequent analyses only focused on predictors of L2 prosody perception gain scores based on the N = 31 dataset. According to Plonsky and Oswald's (2014) field-specific benchmarks (d = .60 as small, 1.00 as medium, 1.40 as large), the improvement participants made in prosody perception showed a small effect size (d = .43).
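For readers less familiar with the paired-samples t-test, the statistic can be sketched as below. The scores in the usage note are invented for illustration and are not data from this study; the function name `paired_t` is hypothetical. The degrees of freedom equal the number of participants minus one, which is why the tests above are reported as t(30) for N = 31 and t(43) for N = 44.

```python
import math


def paired_t(time1_scores, time2_scores):
    """Paired-samples t statistic and degrees of freedom for Time 2 minus Time 1."""
    gains = [after - before for before, after in zip(time1_scores, time2_scores)]
    n = len(gains)
    mean_gain = sum(gains) / n
    # Sample variance of the gain scores (n - 1 in the denominator).
    variance = sum((g - mean_gain) ** 2 for g in gains) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mean_gain / standard_error, n - 1
```

For example, `paired_t([50, 60, 70, 80], [55, 70, 75, 90])` operates on four hypothetical participants and therefore returns 3 degrees of freedom alongside the t value.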

AUDITORY PROCESSING ABILITY PROFILES
The descriptive results of the three auditory processing ability measures are summarized in Table 2. Participants showed individual variability in auditory processing abilities at Time 1. To investigate the interrelationships between the three independent variables, a set of Spearman's nonparametric correlation analyses was conducted. To adjust for multiple comparisons, the alpha level was set at .017 using the Bonferroni correction. As shown in Table 3, no significant correlation was found between the three auditory processing measures. More specifically, there was no evidence that (a) the neural encoding of sound, which was assumed to tap into the implicit dimension of auditory processing, was related to the explicit auditory processing measures; and that (b) the two explicit auditory processing measures, sound discrimination threshold and music memory, were associated with each other.

AUDITORY PROCESSING AND GAINS IN L2 PROSODY PERCEPTION
To examine which of the three independent variables (i.e., sound discrimination threshold, music memory, and neural encoding of sound) could predict the gains in L2 prosody perception from Time 1 to Time 2, we conducted a set of nonparametric partial Spearman's correlation analyses. Auditory processing abilities were submitted to the analyses as predictors, with the prosody perception scores at Time 2 as the dependent variable and Time 1 scores controlled for as a covariate. To adjust for multiple comparisons, the alpha level was set at .017 using the Bonferroni correction. As summarized in Table 4, L2 prosody perception scores at Time 2 showed a significant and positive correlation with music memory (r = .456, p = .011), which suggests that music memory was a predictor of L2 prosody perception gains. See Figure 1 for scatterplots displaying the relationship between music memory and L2 prosody perception scores at Time 2. According to Plonsky and Oswald's (2014) field-specific benchmarks (r = .25 as small, .40 as medium, .60 as large), the strength of the correlations indicated that the role of music memory in L2 speech perception development could be considered as "medium." By contrast, implicit auditory processing ability (i.e., neural encoding of sound) and explicit auditory discrimination thresholds did not show correlations with gains in L2 prosody perception.
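A partial Spearman correlation of this kind can be sketched as follows. The sketch assumes the standard first-order partial correlation formula applied to rank-transformed scores (with tied values given their average rank); the statistical software and exact procedure used in the study are not stated in the text, so this is an illustration rather than a reproduction. Function names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_spearman(predictor, outcome, covariate):
    """Spearman correlation of predictor and outcome, controlling for covariate."""
    rx, ry, rz = (average_ranks(v) for v in (predictor, outcome, covariate))
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Bonferroni-adjusted alpha for three comparisons: .05 / 3, reported as .017.
BONFERRONI_ALPHA = 0.05 / 3
```

In the present design, `predictor` would be an auditory processing measure, `outcome` the Time 2 prosody perception score, and `covariate` the Time 1 prosody perception score.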

PREDICTORS OF GAINS IN L2 PROSODY PERCEPTION
To determine the predictors of the gains in L2 prosody perception and the amount of variance they could explain, music memory, Time 1 prosody perception scores, and degree of L2 experience (i.e., weekly hours of L2 speaking) were submitted to stepwise multiple regression analyses as independent variables. Sound discrimination threshold and neural encoding of sound were not entered into the model because they were not correlated with Time 2 prosody perception scores after covarying for Time 1 scores. Participants with musical experience showed better music memory than those without, t(29) = 4.66, p < .001. Therefore, to test whether the correlation between music memory and L2 prosody perception gains was mediated by musical experience, the musical experience measure (0 = no experience at all, 1 = any experience) was also submitted to the stepwise multiple regression analyses as an independent variable. Prosody perception score at Time 2 was submitted as the dependent variable. As shown in Table 5, only music memory remained in the model as a predictor, accounting for 10.5% of the variance in L2 prosody perception at Time 2, apart from the L2 prosody perception Time 1 scores (explaining 52.6% of the variance), whereas the amount of L2 interaction and musical training experience were excluded from the model. According to the interpretations of effect sizes by Plonsky and Oswald (2014), the variance explained by music memory corresponded to a small-to-medium effect size (6.25% < R² < 16%, i.e., between the squares of the r = .25 and r = .40 benchmarks).

DISCUSSION
The current study longitudinally examined the link between domain-general auditory processing (explicit, implicit) and L2 speech learning during the early stage of immersion with a pretest-posttest design. A total of 46 Chinese learners of English were tested for their L2 segmental (vowel and consonant) and prosodic (contrastive focus) speech perception at the beginning and the end of a 5-month immersion in an English-speaking environment (i.e., the United Kingdom). While the 31 participants who performed below ceiling in prosody perception at the beginning of immersion demonstrated significant improvement in prosody perception after immersion, no significant improvement in L2 segmental perception was found at the group level. The learning difficulty of vowels is in line with Munro and Derwing (2008), who found that Mandarin speakers' performance on English vowel production generally stabilized after a half year of immersion. However, our finding of significant gains in L2 prosody but not vowel perception could also just reflect the relative reliability of these two measures, given that the vowel perception test showed particularly low reliability (r = .47). Based on our previous research (Kachlicka et al., 2019; Saito et al., 2019), three types of auditory processing abilities were assessed at Time 1. Behavioral tests assessing relatively explicit auditory processing included measurements of sensitivity to sounds differing in spectral and temporal features (i.e., sound discrimination threshold) and tests of the ability to remember and reproduce melodic and rhythmic patterns of nonverbal audio input (i.e., music memory). Implicit preconscious auditory sensitivity was measured using the frequency following response to sound, an electrophysiological response that reproduces the spectro-temporal characteristics of the evoking stimulus.
According to the correlation analyses, these measures evaluated three independent domains of auditory processing abilities. Among them, the results of multiple regression analysis indicated that music memory was the sole predictor of L2 prosody perception gains, accounting for 10.5% of the variance, even when music training experience was entered as a potential predictor. Implicit auditory processing (neural encoding of sound) and explicit auditory discrimination thresholds, however, did not explain L2 prosody perception gains over a short immersion. It is noteworthy that according to the multiple regression analyses, the amount of authentic and interactive L2 input during immersion did not relate to the extent of L2 speech perception gains. This may reflect the relatively short duration of the L2 exposure here, which may have been too brief for L2 input characteristics to have an effect on speech perception gains. Our findings lend further support to the view that the outcomes of postpubertal L2 speech learning can be influenced not only by experience-related factors but also by individual differences in learners' perceptual-cognitive abilities (Saito, 2019; Saito et al., 2019), by tapping into domain-general perceptual abilities. In what follows, we discuss how individual differences in L2 prosody acquisition could reflect different types of explicit and implicit auditory processing: memory, discrimination, and neural encoding.

EXPLICIT AUDITORY PROCESSING AND L2 SPEECH ACQUISITION
The longitudinal relationship between music memory and L2 prosody perception gains found in this study extends the findings of previous cross-sectional studies, which reported that learners who perform better on L2 speech perception and production tend to have stronger music skills (Kachlicka et al., 2019; Saito et al., 2019). Here, we show that music skills (rhythmic and melodic memory) assessed at the very beginning of a period of immersion can predict subsequent L2 speech learning gains, which suggests that the relationship between auditory processing and L2 learning does not merely reflect the effects of successful language learning, but instead that auditory processing abilities may play a causal role in helping determine the speed of L2 acquisition (cf. Snowling et al., 2018). In other words, learners who can better detect and reproduce the timing and pitch contour of sounds may find it easier to extract pitch and duration information to identify contrastive focus (and, potentially, other prosodic features such as phrase boundaries).
However, we agree that even when auditory processing at an earlier time point predicts language learning success at a later time point, it remains possible that this association is driven by a third factor related to auditory processing, such as socioeconomic status or modality-general cognitive skills (e.g., working memory or attention) (Doughty, 2019; Linck et al., 2013). As a result, the causality of the link between auditory processing and language learning remains an important topic for future research (Mueller et al., 2012), which could be addressed using intervention studies (e.g., Li & DeKeyser, 2017).
We found no correlation between L2 prosody perception gains and sound discrimination thresholds, conflicting with our earlier findings of a link between auditory discrimination and L2 speech perception (Kachlicka et al., 2019). This could reflect the different stages of immersion in these two studies (experienced vs. inexperienced), or the different L1s spoken by the participants (Polish vs. Mandarin). In our recent investigations with adult L2 learners with various lengths of immersion and L1 backgrounds, there is some preliminary evidence that learners' ability to detect acoustic details of sound (measured using sound discrimination tasks) can predict the extent to which L2 learners can continue to improve and attain advanced L2 proficiency, provided that they have ample L2 experience through a longer period of immersion. It is probable that such perceptual acuity plays a crucial role especially in the mid-to-ultimate phases of L2 learning (see Linck et al., 2013), while learners' ability to remember melodic and rhythmic patterns (measured using reproduction tasks) appears to be equally linked to various stages and contexts of L2 learning and attainment (Saito et al., 2021).

IMPLICIT AUDITORY PROCESSING AND L2 SPEECH ACQUISITION
Importantly, a relationship was not identified between implicit auditory processing ability (as assessed using neural encoding of speech) and L2 prosody perception learning, again conflicting with Kachlicka et al. (2019) and Omote et al. (2017), who found a robust relationship between FFR encoding and speech perception after several years of L2 English immersion. There are several possible explanations for this null finding. First, the amount of immersion (5 months) participants had in this study might not be enough for implicit auditory processing ability to show any effects. After receiving more than 10 years of formal L2 instruction in China prior to their arrival in the United Kingdom, learners seemed to rely on explicit processing of the target language within the first year of immersion. Thus, the effects of implicit auditory processing on L2 speech learning may become more evident when participants have accumulated enough immersion at a later phase of naturalistic learning (Granena, 2013; Suzuki & DeKeyser, 2015). Indeed, our work has shown that implicit auditory processing could explain variance among L2 learners with a sufficiently long length of residence in L2-speaking countries (e.g., 5-10 years; Kachlicka et al., 2019). An alternate perspective is that the focus of Kachlicka et al. (2019) and Omote et al. (2017) was on L2 segmental perception while the focus of this study was on suprasegmental perception; thus, the inconsistent findings may reflect the different roles of implicit auditory processing in segmental versus suprasegmental speech learning. According to the results of Saito et al. (2019), while segmental L2 speech production was related to both explicit and implicit auditory processing, suprasegmental production was related only to rhythmic memory.
It is possible that while segmental learning may draw upon the precision of encoding of auditory dimensions, as reflected in the robustness of the FFR, suprasegmental learning may instead draw upon the ability to remember changes in rhythmic and melodic (i.e., durational and pitch) patterns. However, more investigation of both segmental and suprasegmental learning is needed to provide a clearer picture.
A third explanation concerns the extent to which the outcome measures in this study (vowel and prosody perception tasks) required L2 speakers to rely on their implicit language aptitude. Compared to producing spontaneous speech in the target language, differentiating contrasts of words or short phrases is much easier to monitor explicitly and thus likely to relate to explicit aptitude rather than implicit aptitude (cf. Skehan, 2016). Finally, in this small dataset (N = 31), participants' variability in L2 speech learning was limited (e.g., the ceiling effects found in L2 speech perception measures), which may have obscured the predictive power of certain auditory processing measures.
Overall, the results from the current study and previous studies suggest different roles for explicit and implicit auditory processing abilities in L2 speech learning. On the one hand, implicit auditory processing seems to have a more salient effect on the ultimate attainment of certain aspects (arguably those more difficult to master) of L2 speech learning than on the initial learning rate. On the other hand, explicit auditory processing could contribute to various stages and aspects of L2 speech learning.

CONCLUSIONS AND FUTURE DIRECTIONS
The current study is a preliminary longitudinal investigation of the effects of auditory processing ability on L2 speech learning with a pretest-posttest design. Focusing on the L2 speech perception gains from a short phase of early immersion, the results support a predictive role for explicit but not implicit auditory processing in driving gains in prosody perception. Here we acknowledge several methodological limitations and call for future studies to investigate the impact of auditory processing among a larger number of participants with more balanced gender distributions, as well as more varied L1 backgrounds (tonal vs. nontonal), language learning experience (classroom vs. immersion), and proficiency levels. Depending on their L1s (e.g., tonal vs. nontonal), L2 learners may recruit different spectro-temporal cues to extract information from the auditory input (Jasmin et al., 2021). Therefore, it would be interesting to compare the impact of spectral and temporal perception abilities on L2 acquisition among learners with various L1-L2 pairings. Moving forward, future studies could also examine the longitudinal development of various L2 skills and the dynamic interactions between different types of auditory processing, experience, and L2 performance over a longer period of immersion. The development of more measures of auditory processing ability tapping into explicit and implicit dimensions should also be encouraged, and the relationships among these measures should be explored. Although we argue that sound discrimination and music memory tasks draw heavily on explicit processing and FFR on implicit processing, we cannot conclusively rule out the possibility that both explicit and implicit processes contribute to participants' performance in all these tasks.
Developing a wider battery of measures of auditory processing would also enable researchers to begin to gain a clearer picture of the ways in which auditory processing can be fractionated into different skills, and the relative importance of these skills for L2 learning. More reliable measures of skills for L2 learning are also worth exploring, to avoid the lack of variability or gains over time caused by low reliability (such as the vowel perception task in the current study). Finally, more research is needed to test our tentative hypothesis that more precise auditory processing leads to more successful L2 speech learning. One intriguing direction is to further investigate the causal relationship between audition and acquisition by conducting random-assignment intervention studies with control groups to examine whether and to what degree auditory training at the outset of L2 immersion can enhance L2 speech acquisition.

SUPPLEMENTARY MATERIALS
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0272263120000649.