AN EAR FOR LANGUAGE

ABSTRACT This study tested whether individual sensitivity to an auditory perceptual cue called amplitude rise time (ART) facilitates novel word learning. Forty adult native speakers of Polish performed a perceptual task testing their sensitivity to ART, learned associations between nonwords and pictures of common objects, and were subsequently tested on their knowledge with a picture recognition (PR) task. In the PR task participants heard each nonword, followed by either a congruent or an incongruent picture, and had to assess whether the picture matched the nonword. Word learning efficiency was measured by accuracy and reaction time on the PR task and by modulation of the N300 ERP. As predicted, participants with greater sensitivity to ART showed better performance in PR, suggesting that auditory sensitivity indeed facilitates learning of novel words. Contrary to expectations, the N300 was not modulated by sensitivity to ART, suggesting that the behavioral and ERP measures reflect different underlying processes.

INTRODUCTION
What makes some people excel at learning words of a foreign language while others struggle to acquire even basic vocabulary? To date, researchers have identified several factors that characterize good learners of foreign languages. These include so-called language aptitude (Carroll & Sapon, 1959), phonological short-term memory capacity (e.g., Baddeley et al., 1998), inhibitory control (Bartolotti et al., 2011), and musical abilities (Dittinger et al., 2016, 2017). However, to the best of our knowledge, no research to date has shown a relationship between any specific perceptual skill and learning novel words in healthy adults. In this article, we argue that sensitivity to an acoustic cue called amplitude rise time (ART) is relevant for vocabulary acquisition. This acoustic cue is considered important for speech segmentation in an individual's first language (Richardson et al., 2004; Thomson & Goswami, 2009). Given that better segmentation can lead to more efficient encoding of a novel word form (cf. Marecka et al., 2018 for an overview), high sensitivity to ART should make word learning much easier.

AUDITORY SKILLS AND NOVEL WORD LEARNING
Linguistically talented people are thought to have "an ear for language." However, there have been few studies on auditory skills and novel word learning to support this claim. Perhaps the most prominent line of research on this topic concerns the relationship between word learning and the ability to segment speech into linguistic units such as phonemes, rhymes, and syllables. To measure the ability to segment speech, researchers typically use tasks involving the extraction of a phonological unit from a word (e.g., saying the word cat without /k/), pointing to the unit that is shared by two words (e.g., the same phoneme or rhyme), or dividing words into phonemes or syllables (e.g., Farnia & Geva, 2011; Hu & Schuele, 2005; Marecka et al., 2018). Second language (L2) learners and children learning their first language (L1) who score higher in such tasks tend to have larger vocabulary sizes and learn novel words more efficiently in auditory learning tasks (L1; Bowey, 1996, 2001; Farnia & Geva, 2011; Hu, 2003, 2008; Hu & Schuele, 2005; Marecka et al., 2018; Metsala, 1999).
Recently, Marecka et al. (2018) proposed that two different segmentation mechanisms might be at play, depending on whether learners process an already familiar language or a completely new one. It is implicit in the theory that these segmentation mechanisms facilitate auditory word learning: they help with processing the speech signal in the novel word and subsequently acquiring its form. One of the mechanisms, phonological mapping, operates when processing a known language, whereas the other, universal segmentation, operates when processing a novel language.
Phonological mapping relies on existing sublexical phonological representations, that is, representations of syllables, segmental categories (segments or phonemes), and larger sequences of segments in the listener's memory (a.k.a. speech chunks or ngrams; see Jones, 2016; Szewczyk et al., 2018). Within this mechanism learners segment words by pattern matching: they compare the acoustic form they hear against known phonological patterns. As a result, they decompose the stream of speech into a sequence of known phonological units that can then be encoded in memory (see the EPAM-VOC and CLASSIC models: Jones, 2016). The better the quality and the larger the inventory of sublexical phonological representations in the learner's memory, the more efficient the mapping and the better the encoding of the word forms (Jones, 2016). This hypothesis is confirmed by experiments showing that children remember novel words better if those words contain speech chunks they know (Storkel, 2001, 2003). Moreover, computational models show that well-established representations of speech chunks lead to more efficient encoding of word forms in memory (Jones & Witherstone, 2011).
In contrast to the phonological mapping mechanism, the universal segmentation mechanism does not require knowledge of any phonological representations: in this process listeners use language-universal acoustic cues to divide speech into smaller parts such as words, syllables, and subsyllabic elements like phonemes (see Endress & Hauser, 2010). Therefore, the universal segmentation mechanism is especially useful for processing speech in a language in which the learners have few phonological representations. It helps less proficient learners acquire novel word forms by allowing them to divide the speech stream into manageable units that can then be further processed and encoded.
Summing up, research indicates that the ability to segment speech helps auditory learning of novel word forms. Two segmentation mechanisms might be involved in the process: phonological mapping, which depends on the individual's knowledge of phonological structures and thus is related to experience with the phonology of the language, and universal segmentation, which depends on the detection of acoustic cues and thus is likely to be related to individual differences in perceptual skills. In this article we focus on the second mechanism, and in particular on one acoustic cue that might be related to it: sensitivity to temporal amplitude changes, and more specifically to ART. We test whether greater sensitivity to ART (understood as the ability to discriminate between different ARTs) helps individuals learn new words. In the following sections, we define ART and review the studies supporting the notion that temporal amplitude changes, and ARTs specifically, provide cues for universal segmentation and thus facilitate word learning.

Sensitivity to ART as a Predictor of Universal Segmentation and of Word Learning
The speech stream presumably contains a number of acoustic cues that could be relevant to universal speech segmentation, but most of the current research focuses on the cues related to how the amplitude (intensity, perceived as loudness of the sound) changes over time. One cue that receives particular attention is ART. ART is the time a sound takes to reach its maximum amplitude. ART is short when a sound's amplitude rises very rapidly (i.e., the sound gets loud very fast) and long when its amplitude rises more gradually. Tasks used to assess individual sensitivity typically involve deciding if two sounds have the same or different ART (e.g., Hämäläinen et al., 2005), or picking a sound with a different ART (the odd-one-out) from a series of three (e.g., Surányi et al., 2008). The differences in ART between the sounds presented to the participants vary from very large to very small. If participants give correct answers for trials where the differences in ART between the sounds are very small, they have greater sensitivity to ART than participants who can distinguish only between sounds with larger ART differences.
In this study we test the hypothesis that greater sensitivity to ART (i.e., the ability to perceive smaller differences in ART between two sounds) facilitates word learning in a completely foreign language. This hypothesis rests on two premises: (a) greater sensitivity to ART leads to more efficient universal speech segmentation; and (b) more efficient universal speech segmentation leads to faster and more accurate novel word learning. In the following text, we discuss the data supporting each premise and the nascent evidence for the hypothesis derived from them.

Greater Sensitivity to ART Discrimination Leads to More Efficient Universal Speech Segmentation
Accurate perception of ARTs is thought to help speech segmentation and processing. Discriminating between sounds with relatively slow ART (slow ART discrimination) is thought to facilitate segmentation of speech into syllables (Goswami, 2011; Goswami et al., 2002; Richardson et al., 2004). Discriminating between sounds with relatively fast ART (fast ART discrimination) facilitates segmentation of speech into onsets, rhymes, and phonemes (e.g., Goswami, 2011; Goswami et al., 2002; Hämäläinen et al., 2005; McAnally & Stein, 1996; Richardson et al., 2004). Support for the relationship between sensitivity to ART and speech segmentation comes from studies on developmental dyslexia. Dyslexic individuals typically have poor segmentation skills, and they also have problems discriminating sounds with different ARTs (e.g., Goswami et al., 2002; Hämäläinen et al., 2005; McAnally & Stein, 1996; Muneaux et al., 2004; Richardson et al., 2004; Surányi et al., 2008; although see Georgiou et al., 2010 for different results). Furthermore, these two abilities (speech segmentation and ART discrimination) are directly correlated in this population (Hämäläinen et al., 2005; Richardson et al., 2004; Surányi et al., 2008; Thomson & Goswami, 2009). Some studies show that individuals with dyslexia have problems with slow ART discrimination (Goswami et al., 2002; Hämäläinen et al., 2005), while other studies suggest that it is fast changes, that is, fast ART discrimination, that should be problematic (McAnally & Stein, 1996). Importantly, the relationship between ART discrimination and phonological segmentation in dyslexic individuals has been found in many language communities. ART discrimination is a consistent predictor of segmentation skills among English, Spanish, Chinese, Hungarian (Surányi et al., 2008), Finnish (Hämäläinen et al., 2005), and French individuals with dyslexia (Muneaux et al., 2004).¹ This indicates that ART is a language-universal cue to speech segmentation, and thus a very likely cue for universal segmentation.
More Efficient Universal Speech Segmentation Leads to Faster and More Accurate Novel Word Learning

As already mentioned, individuals who score lower on tasks involving segmenting speech into phonemes and syllables learn words more slowly and less accurately (Bowey, 1996, 2001; Farnia & Geva, 2011; Hu, 2003, 2008; Hu & Schuele, 2005; Marecka et al., 2018; Metsala, 1999). In the studies cited previously, the segmentation involved in this relationship could be either universal segmentation or phonological mapping. However, studies looking at the relationship between musical training, speech segmentation abilities, and word learning suggest an involvement of specifically universal segmentation in word learning (Dittinger et al., 2016, 2017; François & Schön, 2011; François et al., 2013). These studies show that musically trained children (Dittinger et al., 2017) and adults (Dittinger et al., 2016) learn words of a completely foreign language better than those without musical training. Following a word learning task, participants with musical training were more accurate in assessing the meaning of the newly learned words. While there is no reason why musicians should have a richer inventory of phonological sequences, it is likely that they have superior sensitivity to acoustic cues, including those involved in segmenting speech.
In sum, the existing evidence shows a link between ART sensitivity and better segmentation of speech, as well as between better segmentation of speech and better word learning. However, it is not clear whether sensitivity to ART contributes to novel word learning. Some indirect evidence has been provided by studies with dyslexic individuals, who typically have poor ART discrimination skills and experience difficulties learning novel words (Alt et al., 2017; Kalashnikova & Burnham, 2016; Kwok & Ellis, 2014). Children with specific language impairment (SLI), a disorder that is often characterized by lower vocabulary scores, have been found to be less sensitive to ART (Cumming et al., 2015). Moreover, correlations have been reported between sensitivity to ART and vocabulary size (Corriveau et al., 2007) as well as between sensitivity to ART and vocabulary learning in a paired-associates laboratory task (Thomson & Goswami, 2009). However, these studies included analyses of heterogeneous groups, composed of both typically developing individuals and those with dyslexia or SLI. Therefore, it is unclear whether the relationship between ART and vocabulary size holds for a typically developing adult population.

CURRENT STUDY
In the current study, we directly addressed the hypothesis that a greater sensitivity to ART is related to more accurate and faster auditory word learning, especially for words that are phonologically novel, that is, have foreign phonological structure and accent. This has never been studied in typical adults.
To assess sensitivity to ART we used odd-one-out discrimination tasks, which estimated participants' thresholds for discriminating fast and slow ART separately. To assess the ability to learn new words, we used a paired associate learning paradigm (see de Groot, 2011 for an overview), in which participants were asked to remember novel words (nonwords) presented auditorily and paired with pictures of known objects. We additionally statistically controlled for other individual differences that, as suggested by earlier research, can independently explain word learning, namely phonological short-term memory as measured with digit span (e.g., Baddeley et al., 1998), and the resistance to interference (the ability to ignore interfering stimuli; see Friedman & Miyake, 2004) as measured by the Simon task (Bartolotti et al., 2011;Rey-Mermet & Gade, 2018). This second skill might be important specifically when participants learn new names of known objects that already have their names in L1, as was the case in the current study. Finally, we explored whether the universal segmentation mechanism is especially relevant for words that are phonologically novel.
Thus, we varied the nonwords' phonological similarity to words in the participants' L1. We manipulated both the nonwords' phonological structure (by using less or more typical phonological sequences to construct them) and the accent with which they were pronounced.
Manipulating the accent and the structure of the words separately was done in an attempt to systematize the term "phonologically foreign." Several studies explore how different factors influence the learning or processing of foreign words, but none define what "foreign" means from a phonological point of view. We argue that words can be foreign in terms of phonological structure (the phonotactic probability of these words) and/or in terms of acoustic features that can be ascribed to accent (the quality of vowels and consonants, as well as prosodic patterns). These two aspects of "foreignness" are strikingly different. While accent is connected more with surface processing of the word (the processing of acoustic features), processing structure might be more linguistic, possibly depending on knowledge of sublexical representations (see, e.g., Jones, 2016). Based on the current literature, we did not have a specific hypothesis as to whether learning words with a foreign accent or a foreign structure would be more strongly associated with ART discrimination. Foreign structure can result in foreign patterns of amplitude rises within the utterance, so the learning of such words might be facilitated by better ART discrimination. However, foreign accent could result in greater reliance on universal segmentation, because participants might have problems identifying known phonemes and syllables within foreign-accented words. In either case, the distinction between accent and structure is important for understanding the potential relationship between ART and word learning.
To measure the accuracy and quality of representations in learning the paired associates, we used a picture recognition (PR) task in which participants were auditorily presented with a nonword followed by a congruent or incongruent picture, and were asked to indicate whether the picture matched the nonword. Apart from collecting behavioral data (accuracy and response latencies in the PR task), we also measured event-related potentials (ERPs) time-locked to the presentation of the pictures. ERPs may be particularly sensitive to the early stages of word acquisition, showing effects of learning even after a single exposure to a new word (Borovsky et al., 2010, 2012; McLaughlin et al., 2004; Osterhout et al., 2006, 2008), and can reveal differences in word learning by less and more skilled learners (Balass et al., 2010; Perfetti et al., 2005).
If greater sensitivity to ART leads to better word learning, participants who are more sensitive to ART should be more accurate and faster on the PR task. This effect should be particularly pronounced for novel words that have a less familiar phonological structure and accent and thus require participants to use the universal segmentation mechanism.
With respect to the ERPs, we focused on the N300 ERP component. The N300 is a fronto-central component occurring 250-400 ms after the stimulus of interest. It is a member of a larger family of ERP components (the N3 complex). These components reflect activity in the occipitotemporal cortex related to cognition and decision making about visual objects (Schendan, 2019). For example, higher (more negative) N300s are typically observed in response to pictures that are semantically incongruent (e.g., a picture of a woman putting a checkerboard into the oven) or semantically unrelated to the preceding stimuli (e.g., Ganis & Kutas, 2003; Mazerolle et al., 2007; Mudrik et al., 2010; West & Holcomb, 2002), compared to congruent or related pictures. The N300 is also more negative to visually presented objects that appear from an atypical perspective or in impoverished form, so that they match reference objects in memory to a lesser extent (Schendan & Ganis, 2015; Schendan & Stern, 2007).
In the present study, if a participant learns the nonword-picture association, the presentation of the nonword should activate memory representations of the object linked with this nonword. When the participant is then shown a picture consistent with the nonword, this picture should reduce the negativity in the N300 amplitude (make it more positive) compared to when the participant did not learn the association or when the participant is shown an incongruent picture. Building upon this logic, we propose that the better the participants learn the association, the more preactivated the concept should be and hence the more reduced the N300 amplitude in response to the congruent pictures. N300 amplitude on the congruent pictures should show how strongly participants activate the meaning upon hearing the word and thus could be considered an index of how well they learned the association between the form and meaning. For pictures incongruent with the newly acquired nonwords, the N300 amplitude should not be reduced no matter how well participants learned the nonword-picture associations because the concept expressed by the incongruent picture will not be activated by the preceding nonword. Thus, the difference between the N300 amplitude for congruent and incongruent pictures (the magnitude of the N300 effect) will provide an index of how strongly participants activated the association to the new word. As such, we consider the modulation of the N300 an additional index of paired associates learning, complementary to the behavioral data.
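The logic above amounts to computing a mean-amplitude difference in the N300 time window. As a minimal sketch, assuming single-channel epochs stored in a NumPy array (the data layout, window bounds, and names are illustrative, not the authors' actual analysis pipeline):

```python
import numpy as np

def n300_effect(epochs, labels, times, window=(0.250, 0.400)):
    """Magnitude of the N300 effect: incongruent minus congruent mean amplitude.

    epochs: (n_trials, n_samples) array from a fronto-central channel;
    labels: 'congruent'/'incongruent' per trial; times: sample times in seconds.
    A more negative return value indicates a larger N300 effect.
    """
    labels = np.array(labels)
    mask = (times >= window[0]) & (times <= window[1])
    cong = epochs[labels == 'congruent'][:, mask].mean()
    incong = epochs[labels == 'incongruent'][:, mask].mean()
    return incong - cong
```

With this sign convention, stronger learning of the association should push the congruent-picture amplitude upward (less negative) and hence make the returned difference more negative.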

PARTICIPANTS
Forty-four native speakers of Polish took part in the experiment. We rejected data from four participants who knew Russian, because half of the stimuli in the experiment were produced with a Russian accent, which was supposed to be unfamiliar to the learners. We included 40 participants (26 female) in the final analyses. The participants were on average 23.3 years old (SD = 2.9, range: 19.0-30.10) and had 15 years of formal education (SD = 2.42, range: 12-25). All participants completed a background questionnaire, in which they assessed their proficiency in all languages they knew on a 9-point Likert scale for four components: reading comprehension, listening comprehension, speaking, and writing. We averaged the scores on these four components for each language. If the average score was above 6, we assumed that the participant knew the language at an upper-intermediate or higher level. Twenty-two participants (55% of the sample) reached this level of proficiency for at least one additional language, but none for Russian. A shortened version of Raven's Advanced Progressive Matrices test (sRAPM) was used to measure participants' fluid intelligence (see Marzecová et al., 2013 for a detailed description and rationale). On average, participants scored 13.18 points out of 18 (SD = 2.65, range: 7-17). All participants passed a hearing screen, in which they were played six pure tones at 30 dB HL. Each tone (250 Hz, 500 Hz, 1,000 Hz, 2,000 Hz, 4,000 Hz, and 8,000 Hz) was tested twice in each ear separately. No participant reported any language impairments or neurological disorders, and all had normal or corrected-to-normal vision. The participants were recruited via a job-hunting Internet portal or through an experimental recruitment system at Jagiellonian University, Kraków. They were paid 40 zł (about 12 dollars) for their participation and signed an informed consent form prior to the experiment.

Sensitivity to ART
Sensitivity to ART was tested using two 3-interval 2-alternative forced choice tasks (3I-2AFC) separately for fast ART and slow ART. Both tasks were administered on a laptop using a custom program coded in Real Basic. In each task participants saw three cartoon owls, which produced tones. On each trial, either the first or the third owl produced a tone that was different from the others in ART. Participants had to choose which of these two tones (the first or the last one) was the odd-one-out. Feedback on the participant's accuracy was provided automatically after each trial.
This task used an adaptive (staircase) procedure to establish an individual's discrimination threshold, that is, the smallest change in ART a participant can detect. We used a combined 1-up 2-down and 1-up 3-down staircase. Under the 1-up 2-down staircase, every time a participant made two consecutive correct responses (i.e., correctly identified the odd-one-out), the task became harder (i.e., the difference between the tones was reduced), and every time they responded incorrectly, the discrimination became easier. After two reversals (a change in direction from getting easier to getting harder or vice versa) had been reached, the staircase changed. Under the 1-up 3-down staircase, every time a participant made three consecutive correct responses, the task became harder, and every time they responded incorrectly, the discrimination became easier. The initial step size of 58 ms was halved after the fourth and sixth reversals to enable the staircase to converge more closely on the participant's threshold. Each task ended after 40 trials. The discrimination threshold for each task was computed as the average of the last three reversals of the staircase procedure.
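The combined staircase described above can be sketched as a simulation. This is illustrative, not the original Real Basic implementation; the `observer` function is a stand-in for a participant's response on each trial.

```python
def run_staircase(observer, start_diff=285.0, step=58.0, n_trials=40):
    """Simulate the combined 1-up 2-down / 1-up 3-down staircase.

    observer(diff) -> True if the odd-one-out is correctly identified at an
    ART difference of `diff` ms. Returns the discrimination threshold,
    i.e., the mean of the last three reversal values.
    """
    diff = start_diff        # current ART difference between target and standards
    correct_run = 0          # consecutive correct responses
    reversals = []           # diff values at each change of direction
    direction = None         # 'down' = getting harder, 'up' = getting easier
    rule = 2                 # 2-down until two reversals, then 3-down

    def record_reversal(value):
        nonlocal step, rule
        reversals.append(value)
        if len(reversals) == 2:
            rule = 3         # switch to the 1-up 3-down staircase
        if len(reversals) in (4, 6):
            step /= 2        # halve step after the 4th and 6th reversals

    for _ in range(n_trials):
        if observer(diff):
            correct_run += 1
            if correct_run >= rule:
                correct_run = 0
                if direction == 'up':
                    record_reversal(diff)
                direction = 'down'
                diff = max(diff - step, 1.0)   # harder: smaller difference
        else:
            correct_run = 0
            if direction == 'down':
                record_reversal(diff)
            direction = 'up'
            diff += step                        # easier: larger difference

    tail = reversals[-3:]
    return sum(tail) / len(tail) if tail else diff
```

For example, a deterministic observer that succeeds whenever the difference is at least 30 ms, `run_staircase(lambda d: d >= 30)`, converges to a threshold in the vicinity of 30 ms.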
Each task contained sinusoidal tones of 450 ms duration with a fixed frequency of 500 Hz. All tones were played at a peak intensity of 75 dB SPL. The tones in the two tasks were manipulated in the following way: (a) in the task testing sensitivity to fast ART, the two standard tones always had a 15 ms linear ART, a 385 ms steady state, and a 50 ms linear fall time. The target tone (odd-one-out) differed from the standard tones in its linear ART. At the start, the ART of the target tone was 300 ms (i.e., the procedure started with the maximal difference of 285 ms) and it got shorter in subsequent trials. (b) In the task testing sensitivity to slow ART, the two standard tones always had a 300 ms linear ART, a 385 ms steady state, and a 50 ms linear fall time. The target tone (odd-one-out) again differed from the standards in linear ART. Initially, the ART of the target tone was 15 ms and it got longer in subsequent trials.
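For illustration, such a stimulus can be synthesized by applying a linear rise/fall envelope to a 500 Hz sinusoid. One assumption to flag: because the total duration is fixed at 450 ms, the steady portion in this sketch shrinks as the rise time grows; the sample rate and function name are also illustrative, not taken from the original materials.

```python
import numpy as np

def make_tone(rise_ms, sr=44100, freq=500.0, total_ms=450, fall_ms=50):
    """500 Hz sinusoid with a linear rise, a steady portion, and a linear fall.

    The steady portion absorbs whatever duration the rise does not use,
    keeping the total duration fixed at 450 ms.
    """
    n = int(sr * total_ms / 1000)
    t = np.arange(n) / sr
    tone = np.sin(2 * np.pi * freq * t)
    rise = int(sr * rise_ms / 1000)
    fall = int(sr * fall_ms / 1000)
    env = np.ones(n)
    env[:rise] = np.linspace(0.0, 1.0, rise, endpoint=False)  # linear ART
    env[n - fall:] = np.linspace(1.0, 0.0, fall)              # linear fall
    return tone * env

standard = make_tone(rise_ms=15)   # standard tone in the fast-ART task
target = make_tone(rise_ms=300)    # initial odd-one-out (285 ms ART difference)
```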

Simon Task
The task was based on paradigms employed by Paap and Sawi (2014) and Bialystok et al. (2004) who used it to measure resistance to interference in bilinguals. The task was presented in the DMDX software (Forster & Forster, 2003).
A trial consisted of a 500 ms blank interval, followed by a fixation cross for 300 ms and then a letter "P" or "Q." Participants had to press the key matching the letter they saw as quickly as possible. The timeout was 1,000 ms, and the letters remained on the screen until the timeout. Throughout the task, the participant's left hand rested on the "Q" key and the right hand on the "P" key of the keyboard. Participants first did a noncritical block (practice session), in which the letters appeared 2.3° above or below the central fixation. This block contained 80 trials. Then they did one critical block, in which the letters were presented either 3.9° to the left or to the right of the central fixation. In the congruent condition of the critical block, the location of the letter on the screen (left or right) matched the location of the response button on the keyboard ("P" appeared on the right side of the screen or "Q" appeared on the left side of the screen). In the incongruent condition, the location of the letter on the screen did not match the location of the response button ("Q" appeared on the right side of the screen or "P" appeared on the left side of the screen). There were 160 trials in the critical block, 80 congruent and 80 incongruent, presented in random order. The Simon effect was calculated as the difference in mean reaction times for correct responses between the incongruent and congruent conditions in the critical block.
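As a sketch, the Simon effect can be computed from trial-level data as follows (the trial record format is illustrative, not the DMDX output format):

```python
def simon_effect(trials):
    """Mean RT difference (incongruent - congruent) over correct critical trials.

    trials: list of dicts with keys 'condition' ('congruent'/'incongruent'),
    'correct' (bool), and 'rt' (reaction time in ms). Incorrect responses
    are excluded, as in the article's scoring.
    """
    def mean_rt(condition):
        rts = [t['rt'] for t in trials
               if t['condition'] == condition and t['correct']]
        return sum(rts) / len(rts)
    return mean_rt('incongruent') - mean_rt('congruent')
```

A positive value indicates the usual interference cost: slower responses when the letter's location conflicts with the response-key location.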

Forward Digit Span
To measure individual differences in short-term memory, we used a forward digit span task. This was a pen-and-paper task, taken from the Polish version of the Wechsler Intelligence Scale for Adults (Brzeziński et al., 2004) and administered as per the standard procedure. The participants were instructed that they would hear strings of digits and would be asked to repeat those digits in the order of presentation. The experimenter read each string aloud once at a pace of approximately one digit per second and asked the participant to repeat the string immediately after. The participant received one point for each string in which all digits were recalled correctly and zero points otherwise. The task always began with two strings of three digits, then two strings of four digits, and so on up to nine digits (overall 14 trials, two per span level). The participants got feedback only for the first two strings (these trials were also included in the final score). The task was terminated if the participant repeated two strings of the same length incorrectly.
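The scoring and termination rule can be sketched as follows (the response encoding is illustrative):

```python
def digit_span_score(responses):
    """Score the forward digit span task.

    responses: booleans (correct/incorrect recall) in presentation order,
    two trials per span level from 3 up to 9 digits (14 trials maximum).
    One point per correctly recalled string; testing stops once both
    trials at the same length fail.
    """
    score = 0
    for i in range(0, len(responses), 2):
        pair = responses[i:i + 2]
        score += sum(pair)          # True counts as 1, False as 0
        if not any(pair):
            break                   # two failures at one length: terminate
    return score
```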

Novel Words
Twelve novel words (nonwords) were created for the purpose of the task. First, on the basis of frequency data from the Polish corpus SUBTLEX-PL (Mandera et al., 2015), we algorithmically created a set of 80 bisyllabic nonwords that were phonotactically possible but maximally differed in terms of phonotactic probability, as measured with ngram frequency. Ngram frequency is the mean (log) frequency of all ngrams, that is, phoneme sequences (bigrams: two-phoneme sequences; trigrams: three-phoneme sequences; and so forth, up to the length of the stimulus minus 1) in a given nonword, weighted for the length of the ngram.² A previous study has shown that this metric is superior to other indices of phonotactic probability (see Szewczyk et al., 2018).
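A simplified version of the ngram-frequency metric might look like the sketch below. The exact weighting scheme is specified in Szewczyk et al. (2018); here, as an assumption, each ngram's log frequency is weighted simply by its length, and the frequency table is a toy dictionary rather than SUBTLEX-PL counts.

```python
import math

def ngram_frequency(phonemes, freq):
    """Length-weighted mean log frequency of all ngrams in a nonword.

    phonemes: sequence of phoneme symbols; freq: dict mapping ngram tuples
    to corpus counts. Ngrams run from bigrams up to length(word) - 1.
    """
    weighted, total_weight = 0.0, 0.0
    for n in range(2, len(phonemes)):                 # bigram .. length-1
        for i in range(len(phonemes) - n + 1):
            gram = tuple(phonemes[i:i + n])
            log_f = math.log(freq.get(gram, 0) + 1)   # +1 smoothing for unseen ngrams
            weighted += n * log_f                     # assumed weight: ngram length
            total_weight += n
    return weighted / total_weight if total_weight else 0.0
```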
From the nonwords created by the algorithm, we chose 12 that had a high ngram frequency (4.66-6.45) and 12 that had a low ngram frequency (2.66-3.88). All the chosen nonwords had a CCVCV (consonant-consonant-vowel-consonant-vowel) or CVCCV structure and were stressed on the penultimate syllable. The nonwords were then recorded by a native speaker of Polish and tested for wordlikeness by 20 native speakers of Polish, who had to indicate on a 5-point Likert scale to what extent a particular nonword sounded like a typical Polish word, with 1 meaning "this could never be a Polish word" and 5 meaning "this could very well be a Polish word." On the basis of the test, from the high ngram frequency nonwords, we chose six that were assessed as most wordlike (M = 4.03, SD = 0.25). These were the native structure items. We also chose six nonwords from the low ngram frequency set that had lowest wordlikeness (M = 2.82, SD = 0.59), which became nonnative structure items. Each of the nonwords selected started with a different phoneme. All items are presented in Appendix A.
All items were then recorded in two versions by a bilingual speaker of Polish and Russian. The speaker recorded each of the 12 items with a Russian accent and, during a separate session, with a Polish accent. Thus, we created two accent versions of the stimuli: native accent (Polish) and nonnative accent (Russian). Each nonword was recorded three times with each accent. All three sets were interchangeably used in the experiment to provide some variance in the acoustic realization of the nonwords.

Pictures
From the SUBTLEX-PL corpus, we chose 12 concrete high-frequency nouns. Next, we searched the Google Image search engine for black-and-white line drawings depicting the nouns and chose three pictures for each concept. To check whether the pictures tended to be named unambiguously, we conducted a pilot study with 17 participants, who received a questionnaire in which they were asked to name all 36 pictures. The accuracy rate was 99.18%, which shows that the names of the pictures were unambiguous. All pictures were adjusted to be perceptually similar in size and were displayed in the center of the screen. Using three versions of the pictures portraying each object was aimed at forcing participants to associate the nonwords with concepts, rather than with specific depictions of these concepts.

Stimuli Lists
We created four stimuli lists. Each contained three nonwords with native accent and native structure, three with native accent and nonnative structure, three with nonnative accent and native structure, and three with nonnative accent and nonnative structure. Because we recorded each nonword in both a native accent version and a nonnative accent version, we counterbalanced the accents across the lists: nonwords produced with a native accent in two of the lists were produced with a nonnative accent in the other two, and vice versa. For each list we randomly assigned nonwords to the concepts depicted in the pictures, so that the associations between nonwords and pictures were different on each of the four lists.
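The counterbalancing scheme can be sketched as follows; the list-building logic and item labels are illustrative stand-ins, not the authors' actual materials:

```python
import random

def build_lists(native_structure, nonnative_structure, concepts, seed=0):
    """Build four counterbalanced stimulus lists.

    native_structure / nonnative_structure: six nonwords each;
    concepts: twelve picture concepts. Within each list, half of each
    structure set gets the native accent and half the nonnative accent;
    the accent assignment swaps across lists so every nonword appears
    in both accents, and nonword-concept pairings are shuffled per list.
    """
    rng = random.Random(seed)
    lists = []
    for swap in (False, True):          # accent counterbalancing scheme
        for _ in range(2):              # two lists per scheme -> four lists
            items = []
            for structure, words in (('native', native_structure),
                                     ('nonnative', nonnative_structure)):
                for j, word in enumerate(words):
                    accent = 'native' if (j < 3) != swap else 'nonnative'
                    items.append({'word': word, 'structure': structure,
                                  'accent': accent})
            mapping = list(concepts)
            rng.shuffle(mapping)        # random nonword-concept assignment
            for item, concept in zip(items, mapping):
                item['concept'] = concept
            lists.append(items)
    return lists
```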

Procedure
The word learning task was presented using the DMDX software. It involved an exposure phase and a test phase. The procedure is graphically represented in Figure 1 and described in more detail in the following text.

Exposure Phase
In the exposure phase, the participants were familiarized with all the stimuli. They were told that they would have to learn 12 new words and their associations with objects, and that they would later be tested on how well they had learned the nonwords.
The exposure phase consisted of four blocks of 108 trials each. Each trial took 2,500 ms. The trial structure is presented in Figure 1 on the left. The participant saw a fixation cross for 500 ms, followed by the presentation of a picture for 1,000 ms. After those 1,000 ms, the participant heard the nonword associated with the picture through the headphones; the picture stayed on the screen for the remaining 1,000 ms. We used three pictures of each object and three different recordings of each nonword. This was done to ensure that the participants did not learn an association between only one particular version of the recording and one particular picture, but instead generalized over three acoustically different versions of the word and built a link between this word and a conceptual representation of the object portrayed in the three different pictures. We paired all versions of the nonword with all pictures of the object, which gave nine different combinations for each picture-nonword association. We presented all nine combinations once per block for all 12 nonword-object associations, yielding 108 trials. The order of trials was fully randomized within each block. There were breaks between the four blocks. The whole exposure phase took around 20 minutes.

FIGURE 1. The word learning task. The left panel represents an exemplar trial of the exposure phase. The panel on the right represents the trial structure of the test phase (Picture Recognition Task), which followed the exposure phase.
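The counterbalancing described above (3 recordings × 3 pictures = 9 combinations per association, times 12 associations = 108 trials per block) can be sketched as follows. The original task was implemented in DMDX, so this is only an illustration; all names are hypothetical.

```python
import itertools
import random

N_ASSOCIATIONS = 12  # nonword-object pairs to be learned
N_RECORDINGS = 3     # acoustic versions of each nonword
N_PICTURES = 3       # depictions of each object

def build_exposure_block(rng):
    """One exposure block: every recording x picture combination of every
    association appears exactly once, in fully randomized order."""
    trials = [
        (assoc, rec, pic)
        for assoc in range(N_ASSOCIATIONS)
        for rec, pic in itertools.product(range(N_RECORDINGS), range(N_PICTURES))
    ]
    rng.shuffle(trials)
    return trials

rng = random.Random(0)
blocks = [build_exposure_block(rng) for _ in range(4)]  # four blocks
print(len(blocks[0]))  # 108 trials per block
```

Enumerating every combination once per block guarantees that no recording-picture pairing is over-represented within a block.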

Testing Phase: Picture Recognition Task
The task consisted of three blocks of 264 trials each. Each trial began with 1,000 ms of blank screen, followed by a fixation cross presented simultaneously with the auditory presentation of a nonword (see Figure 1 on the right). After the offset of the word, the fixation cross stayed on the screen for another 500 ms, and then a picture was presented. The picture was either the one associated with the presented novel word (congruent) or a picture associated with one of the other 11 novel words (incongruent). The participant had to decide whether the picture matched the word they had heard. The timeout for the response was 2,000 ms. There was no feedback during the PR task. Both EEG and behavioral data were collected during the task.
Within each block, each novel word occurred in 11 congruent trials (with different versions of recordings and pictures) and in 11 incongruent trials. Each incongruent trial used a picture associated with a different novel word. The order of trials was randomized within the block. The blocks were divided by short breaks during which the participants were encouraged to rest.
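This design (each of the 12 nonwords in 11 congruent and 11 incongruent trials per block) yields 264 trials per block. A minimal sketch of the pairing logic, with hypothetical names:

```python
import random

N_WORDS = 12
N_CONGRUENT = 11  # congruent trials per word per block

def build_pr_block(rng):
    """One picture recognition block of 12 x (11 + 11) = 264 trials.
    Each trial is (heard_word, shown_picture, trial_type)."""
    trials = []
    for word in range(N_WORDS):
        # Congruent: the picture associated with the word itself.
        trials += [(word, word, "congruent")] * N_CONGRUENT
        # Incongruent: a picture associated with each of the other 11 words.
        others = [w for w in range(N_WORDS) if w != word]
        trials += [(word, other, "incongruent") for other in others]
    rng.shuffle(trials)
    return trials

block = build_pr_block(random.Random(0))
print(len(block))  # 264
```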

PROCEDURE
The testing session took about 3 hours, with a short break in the middle. First, the participants were informed about the aim of the experiment and underwent a hearing screening and a shortened version of the sRAPM, which served as screening measures (see the "Participants" section for details). They then performed the digit span task, the sensitivity-to-ART tasks, and the Simon task. Following the behavioral tests, they were prepared for the EEG recording and underwent the word learning procedure: they were first exposed to the novel words and pictures for 20 minutes and then tested with the PR task. At the end of the session, participants completed the background questionnaire and were paid for their participation.

EEG ACQUISITION
The EEG was recorded at 256 Hz from 32 Ag/AgCl scalp electrodes positioned at the standard 10-20 locations, mounted in an elastic cap, using the BioSemi ActiveTwo system. Electrodes were referenced online to the Common Mode Sense electrode located at C1 and re-referenced off-line to the linked mastoids. The horizontal and vertical electrooculograms were recorded bipolarly using electrodes placed at the outer canthus of each eye (horizontal) and below and above the participant's right eye (vertical). The EEG signal was filtered off-line with an IIR band-pass filter (0.05-25 Hz; low cutoff slope: 24 dB/oct; high cutoff slope: 12 dB/oct).
We used BrainVision Analyzer software to analyze the ERPs in the PR task. From the EEG recordings we extracted epochs of 150 ms before and 800 ms after the onset of the picture. Both correct and incorrect trials were analyzed. The epochs were baseline corrected using the 150-ms prestimulus time window. We removed ocular and other stationary artifacts using independent components analysis (ICA; Delorme et al., 2007;Jung et al., 2000). We then inspected all trials manually to remove any remaining artifacts (9.3% of congruent and 8.9% of incongruent trials in total).
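The epoching and baseline correction described above can be illustrated on a synthetic array (the actual analysis was carried out in BrainVision Analyzer, so this is only a sketch). At 256 Hz, the 150-ms prestimulus window spans roughly 38 samples:

```python
import numpy as np

FS = 256                    # sampling rate (Hz)
PRE_MS, POST_MS = 150, 800  # epoch limits around picture onset
n_pre = int(round(PRE_MS / 1000 * FS))    # ~38 prestimulus samples
n_post = int(round(POST_MS / 1000 * FS))  # ~205 poststimulus samples

def baseline_correct(epochs):
    """epochs: (n_trials, n_channels, n_pre + n_post) array in microvolts.
    Subtract each trial/channel's mean over the prestimulus window."""
    baseline = epochs[:, :, :n_pre].mean(axis=2, keepdims=True)
    return epochs - baseline

rng = np.random.default_rng(0)
epochs = rng.normal(5.0, 2.0, size=(10, 32, n_pre + n_post))  # fake EEG
corrected = baseline_correct(epochs)
# After correction the prestimulus mean of every trial/channel is ~0.
```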
On the basis of a previous study, which used a similar paradigm (presentation of word and then a picture- Mazerolle et al., 2007), we investigated mean voltage amplitudes of the N300 on the frontocentral electrodes (Fz, FC1, FC2, Cz) within the 250-350 ms time window.

DATA ANALYSIS
We created three mixed-effects models to establish the effects of ART on the outcome variables taken from the PR task: RTs, accuracy, and N300 amplitude.3 The models for RT and N300 were mixed-effects linear regressions, calculated using the lmer function in the lme4 package (Bates et al., 2015b). The model for accuracy was a mixed-effects binary logistic regression, also calculated using the lme4 package. For all three models, we used the Satterthwaite approximation for p values, as implemented in the lmerTest package (Kuznetsova et al., 2015). In all models, we started with the maximal random-effects structure. If a model did not converge, we first removed the correlations between random effects and then, in the next step, the random effects with the smallest unique variance, following the recommendation of Bates et al. (2015a). All predictor variables in all models were standardized. For the RT model, only correct responses were included, and the RTs were log-transformed.
For both the RTs and accuracy models, the participant-related fixed effects entered into the model were forward digit span, the mean Simon effect, and the estimates of fast ART and slow ART thresholds. The item-related fixed effects were accent (native vs. nonnative), structure (native, nonnative), and congruence (congruent, incongruent). We also entered interactions between accent and ART thresholds and interactions between structure and ART thresholds.
In the N300 model, we used the same fixed effects as in the preceding analyses, with the addition of the interactions of congruence and ART thresholds, accent and ART thresholds, structure and ART thresholds, and the three-way interaction of accent, congruence, and ART thresholds, as well as the three-way interactions of structure, congruence, and ART thresholds. In this model, the primary effects of interest were interactions with congruence because our index of learning was the reduction of N300 in the congruent trials, as compared to the incongruent trials (see also the last paragraph of the "Introduction" section).
The R scripts containing the specification of all models are available at https://osf.io/ 3fxgc/. All graphs were created in R using the ggplot2 package (Wickham et al., 2018).

PICTURE RECOGNITION TASK: ACCURACY DATA
Mean accuracy in the PR task was 94.9% (range: 75.25%-99.24%; SD = 0.05%). The binary logistic regression model for accuracy in the PR task is presented in Tables 1 (random effects) and 2 (fixed effects). Lower fast ART thresholds, a smaller Simon effect, and a larger digit span were related to greater accuracy in the PR task. The participants were more accurate for items pronounced with a native rather than a nonnative accent and for incongruent rather than congruent trials. Figures 2 and 3 show accuracy as a function of sensitivity to fast ART: Figure 2 shows raw data with a LOESS line fitted, while Figure 3 shows the predictions of the model. It is worth noting that while fast ART and slow ART thresholds were moderately correlated (r = 0.43), the variance inflation factor (VIF) diagnostics for this and all other models reported here indicated appropriately low collinearity (VIF < 1.5).
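For two correlated predictors, the VIF of each can be computed directly from their correlation, so the collinearity check for the two ART thresholds reduces to a one-line calculation:

```python
def vif_two_predictors(r):
    """VIF for either of two predictors whose correlation is r.
    VIF = 1 / (1 - R^2), and with only two predictors R^2 = r^2."""
    return 1.0 / (1.0 - r ** 2)

# Correlation between fast and slow ART thresholds reported in the text.
vif = vif_two_predictors(0.43)
print(round(vif, 2))  # 1.23, comfortably below the 1.5 criterion
```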

PICTURE RECOGNITION TASK: RT BEHAVIORAL DATA
Mean RT for the PR was 679.42 ms (range: 445.64-1,109.41; SD = 151.12). The estimates of the random and fixed effects for the analysis of the response latencies are presented in Tables 3 and 4; significant effects are bolded (* p < .05, ** p < .01, *** p < .001). The results show that participants arrived at the correct decision faster if they had (a) a larger forward digit span, (b) a smaller Simon effect, and (c) greater sensitivity (i.e., a lower threshold) to fast ART. Incongruent trials elicited longer RTs than congruent trials, and items pronounced with a nonnative accent elicited longer RTs than those pronounced with a native accent. The collinearity in the model was appropriately low (VIF below 1.5). Figures 4 and 5 show the RT values as a function of sensitivity to fast ART: Figure 4 shows raw data with the LOESS line fitted, while Figure 5 shows the predictions of the model.

FIGURE 2. Accuracy on the picture recognition task as a function of sensitivity to fast ART. The panels show raw data along with the LOESS line fitted. The x-axis shows fast ART discrimination thresholds, i.e., the smallest perceived differences in the Amplitude Rise Times measured in ms (a lower value is better). The y-axis shows the log-odds ratio of giving a correct response as predicted by the model. The first panel shows the interaction with accent and the second one shows the interaction with structure.

FIGURE 3. Accuracy on the picture recognition task as a function of sensitivity to fast ART. The panels show the regression lines taken from the model along with 95% confidence intervals, marked with ribbons. The x-axis shows fast ART discrimination thresholds, i.e., the smallest perceived differences in the Amplitude Rise Times measured in ms (a lower value is better). The y-axis shows the log-odds ratio of giving a correct response as predicted by the model. The first panel shows the interaction with accent and the second one shows the interaction with structure.

PICTURE RECOGNITION TASK: EEG DATA
We observed a negativity between 250 and 350 ms, peaking at around 300 ms, for the incongruent compared to the congruent trials, which we identified as the N300 component (see Figure 6). Figure 7 presents the difference waves showing the effect of congruence (incongruent minus congruent) for the different word types: native in accent and structure; native in accent but not structure; native in structure but not accent; and nonnative in accent and structure. We analyzed the data from the frontocentral electrodes (Fz, FC1, FC2, Cz), in accordance with a previous study using a similar paradigm (Mazerolle et al., 2007), so as to avoid double-dipping. However, it should be noted that the topography of our effect is more central than in the cited study. Tables 5 and 6 provide the random and fixed effects for the model predicting the N300 amplitude in response to the pictures. The model met the no-collinearity assumption (VIF below 1.5).

FIGURE 4. The RTs on the picture recognition task as a function of sensitivity to fast ART. The panels show raw data along with the LOESS line fitted. The x-axis shows fast ART discrimination thresholds, i.e., the smallest perceived differences in the Amplitude Rise Times measured in ms (a lower value is better). The y-axis shows the RT in ms. The first panel shows the interaction with accent and the second one shows the interaction with structure.

FIGURE 5. The RTs on the picture recognition task as a function of sensitivity to fast ART. The panels show the regression lines taken from the model along with 95% confidence intervals, marked with ribbons. The x-axis shows fast ART discrimination thresholds, i.e., the smallest perceived differences in the Amplitude Rise Times measured in ms (a lower value is better). The y-axis shows the RT in ms. The first panel shows the interaction with accent and the second one shows the interaction with structure.

FIGURE 6. Stimulus-locked grand-averaged waveforms for congruent and incongruent trials at representative midline electrodes Fz, Cz, and Pz (top) with scalp potential difference maps for the N300 component (bottom). Confidence intervals are marked in gray. The shaded vertical stripe corresponds to the time-window of the N300 component.
Pictures congruent with the preceding nonword elicited the N300 component with a reduced amplitude, compared to pictures incongruent with the preceding nonword. Furthermore, participants with a longer digit span had a smaller (less negative) N300 amplitude across the board. Prior to the experiment we hypothesized that participants with better ART sensitivity (i.e., lower threshold) should have a smaller (i.e., more positive) N300 amplitude for the congruent pictures, compared to participants with lower ART sensitivity. Thus, we expected a positive and significant interaction of congruence and ART. This was, however, not the case: the effect of congruence was not modified by ART (see Figure 8).

DISCUSSION
In this article, we asked whether increased sensitivity to basic auditory information is related to greater accuracy and better-quality word representations in language learning. We focused on sensitivity to a single auditory parameter, namely ART, and investigated whether it predicts word learning, especially for words that have a foreign phonological structure and accent. We exposed participants to nonword-picture pairs and then tested them on their knowledge of the pairs with a PR task. We hypothesized that greater sensitivity to ART would predict better accuracy in word learning and a better quality of the representations of the learned words: participants with smaller thresholds for detecting differences in ART would show higher accuracy and shorter reaction times in the PR, as well as more reduced N300 in response to congruent pictures. We predicted that sensitivity to ART would particularly improve learning of novel words with unfamiliar phonological structure and accent, that is, those that require participants to use their universal segmentation mechanism. Additionally, we explored what rate of ART-slow (connected with segmentation into syllables) or fast (connected with segmentation into phonemes)-would be more strongly associated with word learning. There was no prior evidence allowing us to make strong predictions with regard to this question.
Our analyses showed that sensitivity to fast rather than slow ART correlated with improved word learning accuracy. In line with the predictions, sensitivity to fast ART predicted accuracy and reaction times on the PR task. However, contrary to our expectations, sensitivity to ART was equally predictive for learning nonnative and native nonwords. Moreover, the relationship between sensitivity to ART and word learning was visible only in the behavioral measures: the amplitude of the N300 component at the target picture was not affected by individual sensitivity to ART.

FIGURE 8. The lack of interaction between the type of trial (congruent vs. incongruent) and sensitivity to fast ART in the picture recognition task. The lines indicate the values of the N300 amplitude (in microvolts) for congruent (gray) and incongruent (black) trials as a function of sensitivity to fast ART. The ribbons represent the 95% confidence intervals.

SENSITIVITY TO ART IS CONNECTED WITH BETTER PERFORMANCE ON BEHAVIORAL INDICES OF WORD LEARNING
So far, a relationship between sensitivity to ART (or any other single auditory characteristic) and novel vocabulary learning has been hypothesized only in studies of dyslexic individuals and children with SLI (Alt et al., 2017; Cumming et al., 2015; Kalashnikova & Burnham, 2016; Kwok & Ellis, 2014). Furthermore, only two studies tested this relationship directly, and in both, samples of typically developing participants were mixed with dyslexic or SLI children (Corriveau et al., 2007; Thomson & Goswami, 2009). Our study is the first to show that sensitivity to ART is important not only for reading and word processing in children with developmental delays but also continues to be important for word learning in typically developed adults who are already proficient in at least one language.

WORD LEARNING TASK IS INFLUENCED BY SENSITIVITY TO FAST RATHER THAN SLOW ART
Performance on the PR task in our study was connected with sensitivity to fast, but not slow, ART. Previous research suggested that sensitivity to slow amplitude cues such as slow ART might be connected with the ability to segment speech into syllables, while sensitivity to fast amplitude changes, such as fast ART, might be important for discriminating phonemes and phoneme boundaries (Goswami, 2011; Hämäläinen et al., 2005; McAnally & Stein, 1996). Our data suggest that in the typically developing adult population, sensitivity to fast ART might be of greater importance for word learning than sensitivity to slow ART.

THE SENSITIVITY TO FAST ART FACILITATES LEARNING BOTH NATIVE AND NONNATIVE WORDS
In this article we hypothesized that sensitivity to ART influences word learning by facilitating a universal segmentation mechanism. According to our theory (based on Marecka et al., 2018), this mechanism enables segmentation of the word form, particularly for words that have an unfamiliar phonological structure. We hypothesized that segmentation of words with a familiar structure relies on a different mechanism-phonological mapping-which draws on sublexical phonological representations, that is, representations of syllables, phonemes, and n-grams (speech chunks). If this hypothesis were correct, we would see a greater effect of sensitivity to ART on learning words with a nonnative rather than a native accent or structure. However, our results do not support this hypothesis: sensitivity to fast ART predicted the acquisition of both familiar and unfamiliar novel word types to the same extent. There are two possible explanations for this finding. The first is that ART helps in word segmentation regardless of whether the word is native or nonnative. This would mean that there is a single segmentation mechanism that is sensitive to language-universal auditory cues such as ART.
The alternative explanation is that the theory of two segmentation mechanisms is correct, but we do not see evidence for it in our data due to methodological limitations. One of them is that our task did not force participants to learn the word forms in great detail. Because the nonwords acquired in the study were quite distinct, participants could perform the task successfully by making very coarse distinctions between them and, consequently, may not have learned the nonwords in fine-grained detail. It is possible that if the participants had been tested with a task requiring more detailed knowledge of the word form-for example, a production task or a task in which they had to choose the correct form from a set of similar nonwords-the effect of nativeness would be present.
Whatever the explanation ultimately is, our study suggests that sensitivity to ART supports word learning at all stages of word acquisition, even when learning words with a well-known phonological structure.

ERP INDICES OF WORD LEARNING ARE NOT SENSITIVE TO THE EFFECT OF ART
While we found an effect of sensitivity to ART on the behavioral measures-that is, on overall word learning efficiency-we did not see it in the EEG data. The difference in the N300 amplitude evoked by congruent and incongruent pictures was not modulated by participants' sensitivity to fast ART. There are three possible explanations for this result; however, because there are no previous data on this topic, all of them are speculative at this point.
The first explanation is that in all learners, irrespective of their sensitivity to fast ART, presentation of the nonword fully activated the associated concept. In the study, we assumed that the size of the N300 effect would depend on the strength of conceptual activation. However, the present data suggest that presentation of a (non)word may activate the corresponding concept in a binary fashion (i.e., the concept is either fully activated or not activated at all), with no gradation in strength that could correlate with sensitivity to fast ART. This would mean that, contrary to our initial assumptions, the N300 amplitude on congruent pictures cannot be considered an index of conceptual activation.
The second explanation is that the N300 component can capture changes in word learning only at the very initial stages of acquisition, and our measurement occurred rather late, after most of the participants had learned the words very well. On this account, we would expect to see more variability in the N300 for the congruent condition if the participants were tested at the beginning of learning, when the link between word form and meaning was still volatile. In light of this, it would be worthwhile to conduct a further study using a much larger number of items (so as to make the task more difficult) and to investigate the ERPs at the very beginning of the learning process.
The third explanation is that the N300 and the behavioral indices are to some extent dissociable and rely on different cognitive processes and sources of information. Hamm et al. (2002) suggested that N300 might reflect presemantic processing of pictures, related to the categorization of objects. The authors claim that this process is dissociable from semantic processing. Even though there is some evidence against that claim (Draschkow et al., 2018), it is possible that the N300 captures an aspect of conceptual knowledge that does not fully reflect word learning efficiency or that cannot be graded based on individual differences measured behaviorally.

ADDITIONAL FINDINGS
In all our models, we controlled for phonological STM (measured by digit span) and resistance to interference (measured by the Simon task). Both measures influenced the behavioral results. Participants with a longer digit span were more accurate and faster on the PR task. This confirms previous findings that better phonological STM predicts better performance on novel word learning tasks and greater vocabulary size in L1 and L2 learners (e.g., Farnia & Geva, 2011; Gathercole & Baddeley, 1989, 1990; Masoura & Gathercole, 1999; Papagno & Vallar, 1995; Service & Kohonen, 1995).
Furthermore, participants who exhibited a smaller Simon effect (i.e., were more resistant to interference) were also more accurate and faster on the PR task. This suggests that learning words of a different language requires the ability to resist interference efficiently-a domain-general aspect of cognitive control (e.g., Friedman & Miyake, 2004; Rey-Mermet & Gade, 2018). A possible explanation for this finding is that when learners acquire novel word forms for concepts that already have names in their L1, they need to ignore interference from the L1 form. Our results contribute to research on the elementary cognitive processes influencing second language learning and to the field investigating the relation between bilingualism and cognitive control (e.g., Bartolotti et al., 2011; Blumenfeld & Marian, 2011; Teubner-Rhodes et al., 2016).

CONCLUSIONS
The presented research is the first study to show that sensitivity to fast ART can support vocabulary acquisition. In particular, we propose that sensitivity to ART allows for better segmentation of the word form, which translates into better encoding. Our finding paves the way for future studies exploring the elusive "ear for language"-the relationship between auditory skills and word learning. It offers a glimpse into the issue of language talent by showing possible sources of the difference between good and bad language learners. Finally, it suggests the possibility that training basic auditory skills such as sensitivity to ART can improve vocabulary learning in people with second language learning problems.

NOTES
1. It should be noted, however, that this result was not found for the Greek children in the study by Georgiou et al. (2010).
2. For example, for the sequence "#bask#," where # signifies a beginning or end of the word, the sequences would be: "#b," "as," "ba," "sk," "k#," "#ba," "bas," "ask," "sk#," "#bas," "bask," "ask#," "#bask," "bask#."
3. It is theoretically possible that second language experience enhances ART discrimination, so before we ran the analyses, we also calculated correlations between fast ART discrimination scores and L2 age of acquisition, as well as between fast ART discrimination and L2 experience (calculated as age minus L2 age of acquisition). Neither factor explained significant variance in ART (fast ART and age of acquisition: r = −0.13, p = 0.427, 95% CI [−0.44, 0.20]; fast ART and L2 experience: r = −0.05, p = 0.744, 95% CI [−0.37, 0.27]). Thus, at least in this sample, there is no evidence for a relationship between second language experience and ART.