THE CONTRIBUTIONS OF IMPLICIT-STATISTICAL LEARNING APTITUDE TO IMPLICIT SECOND-LANGUAGE KNOWLEDGE

Abstract This study addresses the role of domain-general mechanisms in second-language learning and knowledge using an individual differences approach. We examine the predictive validity of implicit-statistical learning aptitude for implicit second-language knowledge. Participants (n = 131) completed a battery of four aptitude measures and nine grammar tests. Structural equation modeling revealed that only the alternating serial reaction time task (a measure of implicit-statistical learning aptitude) significantly predicted learners’ performance on timed, accuracy-based language tests, but not their performance on reaction-time measures. These results inform ongoing debates about the nature of implicit knowledge in SLA: they lend support to the validity of timed, accuracy-based language tests as measures of implicit knowledge. Auditory and visual statistical learning were correlated with medium strength, while the remaining implicit-statistical learning aptitude measures were not correlated, highlighting the multicomponential nature of implicit-statistical learning aptitude and the corresponding need for a multitest approach to assess its different facets.


INTRODUCTION
Understanding the relationship between implicit (unconscious) learning and knowledge is fundamental to second language acquisition (SLA) theory and pedagogy. In recent years, researchers have turned to measures of language aptitude (an individual's ability to learn language) to better understand the nature of the different types of linguistic knowledge. Results have shown that explicit aptitude predicts the knowledge that results from explicit instruction (Li, 2015, 2016; Skehan, 2015); however, evidence for the effects of implicit-statistical learning aptitude on implicit knowledge has been limited in the field of SLA (compare Granena, 2013; Suzuki & DeKeyser, 2017). In this project, we address two questions related to implicit-statistical learning aptitude and second language (L2) knowledge: (1) whether implicit-statistical learning aptitude is a componential mechanism (convergent validity) and (2) the extent to which different types of implicit-statistical learning tasks predict implicit knowledge (predictive validity). We expand the number of implicit-statistical learning aptitude measures beyond serial reaction time to obtain a more comprehensive assessment of learners' implicit-statistical aptitude. Alongside, we administered a battery of linguistic knowledge tests designed to measure explicit and implicit L2 knowledge. By doing so, we are able to examine how implicit-statistical learning aptitude predicts the development of implicit L2 knowledge.

IMPLICIT-STATISTICAL LEARNING APTITUDE
Implicit-statistical learning denotes one's ability to pick up regularities in the environment (Frost et al., 2019).1 Learners with greater implicit-statistical learning aptitude, for instance, can segment word boundaries (statistical learning) and detect regularities in artificial languages (implicit language learning) better than those with lower implicit-statistical learning ability (for a comprehensive review of the unified framework of implicit-statistical learning, see Christiansen, 2019; Conway & Christiansen, 2006; Perruchet & Pacton, 2006). This process of implicit-statistical learning is presumed to take place incidentally, without instructions to learn or the conscious intention on the part of the learner to do so. Traditionally, implicit-statistical learning ability has been conceptualized as a unified construct in which learning from different modes, such as vision, audition, and touch, is interrelated and a common implicit-statistical learning mechanism governs the extraction of patterns across different modes of input. Recently, however, a growing body of research has shown that implicit-statistical learning may operate differently across modalities and stimuli, yet still be subserved by domain-general computational principles (for reviews, see Arciuli, 2017; Frost et al., 2015; Siegelman et al., 2017a). In this view, implicit-statistical learning is modality and stimulus constrained (as the encoding of information in different modalities relies on different parts of the body and different cortices), but this modality-specific information is subject to domain-general processing principles that invoke shared brain regions. Implicit-statistical learning is thus modality specific at the level of encoding while also obeying domain-general computational principles at a more abstract level.
If implicit-statistical learning is a componential ability, it follows that a more comprehensive approach to measurement is needed that brings together different tasks tapping into different components of implicit-statistical learning. Our first aim, accordingly, is to test the convergent validity of implicit-statistical learning measures by assessing the interrelationships between different measures of implicit-statistical learning. Doing so will inform measurement and help illuminate the theoretical construct of implicit-statistical learning.
These different measures can provide insight into the nature of the learning processes that individuals draw on in different language learning tasks. Specifically, when performance on the linguistic task and the aptitude measure share variance, a common cognitive process (i.e., implicit-statistical learning or procedural memory) can be assumed to guide performance on both tasks. To illustrate, Yi (2018) found that native English speakers' performance on a serial reaction time task predicted (i.e., shared variance with) their phrasal acceptability judgment speed. A similar association for L2 speakers between their explicit aptitude and phrasal acceptability judgment accuracy led the author to conclude that L1 speakers process collocations implicitly and L2 speakers process them more explicitly.
Although the use of implicit-statistical learning aptitude measures in L2 research is rising, there is a need to justify the use of these measures more strongly from both a theoretical and a psychometric perspective. The possibility that implicit-statistical learning may not be a unitary construct highlights the need to motivate the choice of specific aptitude measure(s) and examine their construct validity, with due consideration of the measures' input modality. The questions of convergent validity (correlation with related measures) and divergent validity (dissociation from unrelated measures) have implications for measurement as well as SLA theory. Indeed, if implicit-statistical learning aptitude is to fulfill its promise as a cognitive variable that can explain the learning mechanisms that operate in different L2/foreign language contexts, for different target structures, and for learners of different L2 proficiency levels, valid and reliable measurement will be paramount.
In recent years, some researchers have begun to examine the construct validity of implicit-statistical learning aptitude measures by exploring their relationship to implicit memory (Granena, 2019), procedural memory (Buffington et al., 2021; Buffington & Morgan-Short, 2018), and working memory and explicit learning aptitude (Yi, 2018). For measures of implicit learning aptitude, Granena (2019) found that the serial reaction time task loaded onto a different factor than the LLAMA D in an exploratory factor analysis (EFA), suggesting that the two measures did not converge. Similarly, Yi (2018) reported that the serial reaction time task and LLAMA D were uncorrelated and that the reliability of LLAMA D was low. In a study combining measures of implicit learning aptitude and procedural memory, Buffington et al. (2021) also observed a lack of convergent validity between the alternating serial reaction time (ASRT) task, the Weather Prediction Task, and the Tower of London (TOL) task. These results do not support a unitary view of implicit-statistical learning aptitude or procedural memory. Furthermore, this research is yet to include measures of statistical learning as another approach to the same phenomenon (Christiansen, 2019; Conway & Christiansen, 2006; Perruchet & Pacton, 2006; Reber, 2015). More research is needed to advance our understanding of these important issues. With this study, we aim to advance this research agenda. We consider multiple dimensions of implicit-statistical learning aptitude, their reliabilities, and their interrelationships (convergent validity). Of the various measures used to assess implicit-statistical learning aptitude in SLA and cognitive psychology, we included measures that represent different modes of input streams: visual statistical learning (VSL) for visual input, auditory statistical learning (ASL) for aural input, and the ASRT for motor and visual input. In addition, we included the TOL task in recognition of its wide use in SLA research, alongside the ASRT task, as a measure of procedural memory.

IMPLICIT, AUTOMATIZED EXPLICIT, AND EXPLICIT KNOWLEDGE
It is widely believed that language users possess at least two types of linguistic knowledge: explicit and implicit. Explicit knowledge is conscious and verbalizable knowledge of forms and regularities in the language that can be acquired through instruction. Implicit knowledge is tacit and unconscious linguistic knowledge that is gained mainly through exposure to rich input, and therefore cannot be easily taught. A third type of knowledge, automatized explicit knowledge, denotes explicit knowledge that language users are able to use rapidly, in time-pressured contexts, as a result of their extensive practice with the language. While the use of (nonautomatized) explicit knowledge tends to be slow and effortful, both implicit and automatized explicit knowledge can be deployed rapidly, with little or no conscious effort, during spontaneous communication (DeKeyser, 2003;Ellis, 2005). Consequently, it has been argued that implicit and automatized explicit knowledge are "functionally equivalent" (DeKeyser, 2003), in that it may be impossible to discern between the two in practice.
In a landmark study, Ellis (2005) proposed a set of criteria to guide the design of tests that could provide relatively separate measures of explicit and implicit knowledge. Using principal component analysis, Ellis showed that time-pressured grammar tests that invite a focus on meaning (content creation) or form (linguistic accuracy) loaded onto one component (i.e., an oral production [OP] task, elicited imitation [EI], and a timed grammaticality judgment test [GJT]), which Ellis termed implicit knowledge. Untimed grammar tests that focus learners' attention on form (i.e., ungrammatical items on an untimed GJT and a metalinguistic knowledge test [MKT]) loaded onto a different component, which Ellis labeled explicit knowledge (see Ellis & Loewen, 2007, for a replication of these findings with confirmatory factor analysis). Subsequent studies using factor analysis on similar batteries of language tests also uncovered at least two dimensions of linguistic knowledge, termed explicit and implicit, which was largely consistent with Ellis's initial results (e.g., Bowles, 2011;Kim & Nam, 2017;Spada et al., 2015;Zhang, 2015; but see Gutiérrez, 2013).
The advent of reaction-time measures, however, invited new scrutiny of the construct validity of traditional measures of implicit knowledge such as the EI task and the timed written GJT (compare Ellis, 2005; Suzuki & DeKeyser, 2015; Vafaee et al., 2017). The theoretical debate surrounding this issue centered on the distinction between implicit and automatized explicit knowledge, described previously, and on whether, aside from differences in neural representation, the two types of knowledge can be differentiated behaviorally, in L2 learners' language use. Departing from Ellis (2005), researchers have hypothesized that timed, accuracy-based tests (e.g., EI) may be better suited to tap into learners' automatized explicit knowledge because timed tests do not preclude learners from accessing their explicit knowledge, but merely make it more difficult for learners to do so (DeKeyser, 2003; Suzuki & DeKeyser, 2015). Reaction-time tests such as self-paced reading (SPR), however, require participants to process language in real time, as it unfolds, and could therefore hypothetically be more appropriate for capturing learners' implicit knowledge (Godfroid, 2020; Suzuki & DeKeyser, 2015; Vafaee et al., 2017). In the implicit-statistical learning literature, Christiansen (2019) similarly argued for the use of processing-based measures (e.g., reaction-time tasks) over reflection-based tests (e.g., judgment tasks) to measure the effects of implicit-statistical learning. He did not, however, attribute differences in construct validity to the two test types (i.e., both are assumed to measure largely implicit knowledge, but at different levels of sensitivity or completeness).
Using confirmatory factor analysis, Suzuki (2017) and Vafaee et al. (2017) confirmed that timed, accuracy-based tests and reaction-time tests represent different latent variables, which they interpreted as automatized explicit knowledge and implicit knowledge, respectively. The researchers did not include measures of (nonautomatized) explicit knowledge, however, which leaves the results open to alternative explanations. Specifically, for automatized explicit knowledge to be a practically meaningful construct, it needs to be distinguishable from implicit knowledge and (nonautomatized) explicit knowledge simultaneously, within the same statistical analysis. Doing so requires a more comprehensive approach to measurement, with tests of linguistic knowledge being sampled from across the whole explicit/automatized explicit/implicit knowledge spectrum. Hence, current evidence for the construct validity of reaction-time tasks as measures of implicit knowledge is still preliminary.
More generally, all the previous validation studies have included only a subset of commonly used explicit/implicit knowledge tests in SLA, which limits the generalizability of findings. Differences in test batteries may explain the conflicting findings for tests such as the timed written GJT (see Godfroid et al., 2015). This is because the results of confirmatory factor analysis are based on variance-covariance patterns for the tests included in the analysis, and hence different test combinations may give rise to different statistical solutions. To obtain a more comprehensive picture, Godfroid et al. (2018) synthesized 12 years of test validation research since Ellis (2005) by including all previously used measures in one study: the word monitoring test (WMT), SPR, EI, OP, timed/untimed GJTs in the aural and written modes, and the MKT. The results suggested that both a three-factor model (EI and timed written GJT as "automatized explicit knowledge"; Suzuki & DeKeyser, 2015) and a two-factor model (EI and timed written GJT as "implicit knowledge"; Ellis, 2005) provided a good fit for the data and that the two models did not differ significantly. These results support the viability of a three-way distinction between explicit, automatized explicit, and implicit knowledge. As with all factor analytic research, however, the nature of the latent constructs was left to the researchers' interpretation. Other sources of validity evidence, such as different patterns of aptitude-knowledge associations, examined here, could support the proposed interpretation and bolster the case for the distinction between implicit and automatized explicit knowledge.

CONTRIBUTIONS OF IMPLICIT-STATISTICAL LEARNING APTITUDE TO IMPLICIT KNOWLEDGE
Three studies to date have examined aptitude-knowledge associations in advanced L2 speakers, with a focus on measurement validity. We will review each study in detail because of their relevance to the current research. Granena (2013) compared Spanish L1 and Chinese-Spanish bilinguals' performance on measures of explicit and implicit knowledge, using both agreement and nonagreement structures in Spanish. The participants had acquired Spanish either from birth, early in life, or postpuberty. Granena wanted to know whether the participants' starting age impacted the cognitive processes they drew on for language learning. She found that early and late bilinguals' performance on agreement structures correlated with their implicit-statistical learning aptitude, as measured by a serial reaction time task (early learners) or LLAMA D (late learners). These results suggested that bilinguals who do not acquire the language from birth may still draw on implicit-statistical learning mechanisms, albeit to a lesser extent than native speakers do; hence the bilinguals' greater sensitivity to individual differences in implicit-statistical learning aptitude compared to native speakers.

Suzuki and DeKeyser (2015) compared the construct validity of EI and the WMT as measures of implicit knowledge. L1 Chinese-L2 Japanese participants performed an EI test with a built-in monitoring task. They were asked to listen to and repeat sentences, as is commonly done in an EI test, but in addition, they were asked to monitor the spoken sentences for a given target word (i.e., built-in word monitoring).
The researchers found that performance on the two test components correlated with different criterion variables; specifically, EI correlated with performance on a MKT (a measure of explicit knowledge), whereas the WMT correlated with performance on the serial reaction time task (a measure of implicit-statistical learning aptitude), albeit only in a subgroup of participants who had lived in Japan for at least 2.5 years. Based on these results, the authors concluded that the WMT is a measure of implicit linguistic knowledge, whereas the EI test (traditionally considered a measure of implicit knowledge as well) is best considered a measure of automatized explicit knowledge.
In a follow-up study, Suzuki and DeKeyser (2017) examined the relationships among implicit knowledge, automatized explicit knowledge, implicit-statistical learning aptitude, explicit learning aptitude, and short-term memory. Different from Granena (2013) and Suzuki and DeKeyser (2015), the researchers found no significant association between serial reaction time (a measure of implicit-statistical learning aptitude) and either implicit or automatized explicit knowledge. Rather, they found that advanced Japanese L2 students' performance on LLAMA F (a measure of explicit learning aptitude) predicted their automatized explicit knowledge. The authors also tested the explanatory value of adding a knowledge interface (i.e., a directional path) between automatized explicit and implicit knowledge in the structural equation model (SEM). This path was indeed significant, meaning automatized explicit knowledge predicted implicit knowledge, but the interface model as a whole was not significantly different from a noninterface model that did not include such a path. The researchers interpreted their results as evidence that automatized explicit knowledge directly impacts the acquisition of implicit knowledge (through the interface), and that explicit learning aptitude indirectly facilitated the development of implicit knowledge. Thus, in their study no direct predictors of implicit knowledge were found.
Taken together, Granena (2013) and Suzuki and DeKeyser (2015) found a positive correlation between implicit knowledge test scores (i.e., sensitivity on a WMT) and implicit-statistical learning aptitude, in line with the view that the WMT, a reaction-time measure, may index implicit knowledge. In Suzuki and DeKeyser's (2017) SEM, however, the same implicit-statistical learning aptitude test had no association with the implicit knowledge construct, which was composed of three reaction-time measures, including a WMT (incidentally, none of the three reaction-time measures loaded onto the implicit knowledge factor significantly, which may have signaled a problem with these measures or with the assumption that they were measuring implicit knowledge). Critically, the three studies have only used a very limited set of implicit-statistical learning aptitude measures (serial reaction time and, in Granena's study, LLAMA D) that examine implicit-statistical motor learning and phonetic coding ability, respectively. Given that the implicit-statistical learning construct is modality specific (i.e., implicit-statistical learning can occur in visual, aural, and motor modes), the limited range of implicit-statistical learning aptitude tests in these studies limits the generalizability of the results to the tests with which they were obtained. Another issue concerns the low reliability of aptitude and knowledge measures obtained from reaction time data (Draheim et al., 2019;Rouder & Haaf, 2019), which may obscure any aptitude-knowledge relationships. In recognition of these gaps, we included a battery of four implicit-statistical learning aptitude tests (VSL, ASL, ASRT, and TOL) in order to examine the predictive validity of implicit-statistical learning aptitude for implicit, automatized explicit, and explicit L2 knowledge.

RESEARCH QUESTIONS
In this study, we triangulate performance on a battery of nine linguistic knowledge tests with data from four measures of implicit-statistical learning aptitude with an aim to validate a new and extended set of measures of implicit, automatized explicit, and explicit knowledge. The following research questions guided the study:

1. Convergent validity of implicit-statistical learning aptitude: To what extent do different measures of implicit-statistical learning aptitude interrelate?
2. Predictive validity of implicit-statistical learning aptitude: To what extent do measures of implicit-statistical learning aptitude predict three distinct dimensions of linguistic knowledge, referred to as explicit knowledge, automatized explicit knowledge, and implicit knowledge?

PARTICIPANTS
Participants were 131 nonnative English speakers (Female = 69, Male = 51, Not reported = 11) who were pursuing academic degrees at a large Midwestern university in the United States. The final sample was obtained after excluding 26 participants who completed only one out of the four aptitude tests. Nearly half of the participants were native speakers of Chinese (n = 66). The remaining participants' L1s included Korean, Spanish, Arabic, Russian, Urdu, Malay, Turkish, and French, among others. The participants' average length of residence in an English-speaking country was 41 months (SD = 27.21, range 2-200 months). The participants were highly proficient English speakers with an average TOEFL score of 96.00 (SD = 8.80). Their mean age was 24 years (SD = 4.64) and their average age of arrival in the United States was 20 years old (SD = 5.68). They received $50 as compensation for their time.

Morphological structures:
- Third person -s: *The old woman enjoy reading many different famous novels.
- Mass/count nouns: *The boy had rices in his dinner bowl.
- Comparatives (double marking): *It is more harder to learn Japanese than to learn Spanish.

Syntactical structures:
- Embedded questions: *He wanted to know why had he studied for the exam.
- Be passive: *The flowers were pick last winter for the festival.
- Verb complement (ask, have, need, want): *Jim is told his parents want buying a new house.

Note: Critical region in boldface and underlined.

We also administered four implicit-statistical learning aptitude tests: VSL, ASL, ASRT, and TOL. Table 2 summarizes the characteristics of the nine linguistic and four aptitude tests.

Word Monitoring Task
The WMT is a dual processing task that combines listening comprehension and word-monitoring task demands. Participants first saw a content word (e.g., reading), designated as the target for word monitoring. They were instructed to press a button as soon as they heard the word in a spoken sentence (e.g., The old woman enjoys reading many different famous novels). Importantly, the monitor word was always preceded by one of the six linguistic structures in either a grammatical (e.g., enjoys) or ungrammatical (e.g., enjoy) form. Grammatical sensitivity, that is, slower reaction times on content words when the preceding word is ungrammatical than when it is grammatical, indicated knowledge of the grammatical target structure.

Self-Paced Reading
In the SPR task, participants read a sentence word by word in a self-paced fashion. They progressed to the next word in a sentence by pressing a button. As with the WMT, participants read grammatical and ungrammatical sentences. Evidence for linguistic knowledge was based on grammatical sensitivity, that is, slower reaction times to the ungrammatical version than to the matched, grammatical version of the same sentence. In particular, we analyzed reaction times for the spillover region (i.e., the word or words immediately following the critical region) of each sentence and created a difference score for the ungrammatical and grammatical sentences.
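Scoring for both the WMT and the SPR task reduces to a difference of mean reaction times in the ungrammatical versus grammatical conditions. A minimal sketch of this computation (the function name and the reaction-time values are ours, purely illustrative, not the study's data):

```python
from statistics import mean

def grammatical_sensitivity(rt_ungrammatical, rt_grammatical):
    """Difference score: mean RT in the ungrammatical condition minus
    mean RT in the grammatical condition. Positive values indicate a
    slowdown on (or after) ungrammatical forms, i.e., sensitivity."""
    return mean(rt_ungrammatical) - mean(rt_grammatical)

# Hypothetical spillover-region RTs (ms) for one participant
rt_ungram = [612, 587, 634, 601]
rt_gram = [555, 540, 570, 562]
print(grammatical_sensitivity(rt_ungram, rt_gram))  # -> 51.75
```

The same logic applies to the WMT, with RTs measured on the monitor word rather than the spillover region.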

Oral Production
The OP task was a speaking test where participants had to retell a picture-cued short story that contained multiple tokens of the six target structures. After reading the story two times without a time limit, participants had to retell the story in as much detail as possible in two and a half minutes. The percentage of correct usage of each target structure in all obligatory occasions of use (i.e., obligatory contexts) was used as a dependent variable. Obligatory contexts were defined relative to the participants' own production. Two coders independently coded OP. The reliability of interrater coding (Pearson r) was .96.

Elicited Imitation
Similar to the OP task, the EI was a speaking test where participants were asked to listen to a sentence, judge the semantic plausibility of the sentence, and repeat the sentence in correct English. No explicit instructions directed participants to correct the erroneous part of the sentence. Following Erlam's (2006) scoring system, correct usage in obligatory context was used for analysis.

Grammaticality Judgment Tests
In the GJTs, participants either read or listened to a sentence in a timed or an untimed test condition. The participants were instructed to determine the grammaticality of the sentence. The time limit for each sentence in the timed written and the timed aural GJT was set based on the length of the audio stimuli in the aural GJT. We computed the median audio length of sentences with the same number of words and added 50%. This resulted in a time limit of 4.12 seconds for a seven-word sentence and up to 5.7 seconds for a 14-word sentence for the timed GJTs. Two sets of sentences were created and counterbalanced for grammaticality and each set was rotated between the four tests, resulting in eight sets of sentences in total. In each of the four GJTs (timed written, untimed written, timed aural, untimed aural), one point was given per accurate judgment.
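The time-limit rule for the timed GJTs (median audio duration for sentences of a given length, plus 50%) can be sketched as follows; the function name and durations are hypothetical, not the study's stimuli:

```python
from statistics import median

def gjt_time_limit(audio_durations_sec):
    """Timed-GJT limit for one sentence length: the median audio
    duration of same-length sentences in the aural GJT, plus 50%."""
    return median(audio_durations_sec) * 1.5

# Hypothetical audio durations (s) for seven-word sentences
seven_word_durations = [2.6, 2.75, 2.9]
print(gjt_time_limit(seven_word_durations))
```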

Metalinguistic Knowledge Test
The MKT required participants to read 12 sentences that contained a grammatical error. Their task was to (1) identify the error, (2) correct the error, and (3) explain in as much detail as possible why it was ungrammatical. We only scored the error correction and explanation parts of the test; as such, a total of two points were given per question. The maximum score was 24 and the total score was converted to a percentage. See Appendix S2 in online Supplementary Materials for the scoring rubric.

ASRT
The ASRT (Howard & Howard, 1997) was used to measure implicit-statistical learning aptitude. In the ASRT, participants viewed four empty circles in the middle of a computer screen; one circle at a time filled in as a black circle. The sequence of filled circles followed a pattern that alternated with random (nonpatterned) trials, creating a second-order relationship (e.g., 2r4r3r1r, where r denotes a random position). Participants were instructed to press the key on a keyboard that mirrored the position of the filled circle as quickly and accurately as possible. To capture learning, we calculated the change in reaction time to pattern trials from block 1 to block 10 and subtracted the change in reaction time to random trials from block 1 to block 10. Positive values indicate a greater improvement in sequence learning over the course of the task.
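The ASRT learning score thus contrasts RT improvement on pattern trials with improvement on random trials. A minimal sketch with hypothetical block-mean RTs (the function name is ours):

```python
def asrt_learning_score(pattern_b1, pattern_b10, random_b1, random_b10):
    """(Pattern-trial RT change from block 1 to block 10) minus
    (random-trial RT change over the same blocks). Positive values
    mean pattern trials sped up more than random trials, i.e.,
    sequence-specific learning beyond general practice effects."""
    pattern_change = pattern_b1 - pattern_b10
    random_change = random_b1 - random_b10
    return pattern_change - random_change

# Hypothetical mean RTs (ms): pattern trials speed up by 80 ms,
# random trials by 30 ms, giving a learning score of 50 ms
print(asrt_learning_score(520, 440, 530, 500))  # -> 50
```

Subtracting the random-trial change is the step that separates sequence learning from generic motor speedup.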

Auditory Statistical Learning
The ASL (Siegelman et al., 2018, experiment 1b) served as another implicit-statistical learning task. In the ASL test, participants heard 16 nonverbal, familiar sounds that were randomly organized into eight triplets (sequences of three sounds). Four triplets had a transitional probability of one (i.e., they were fixed) and four triplets had a transitional probability of .33 (i.e., every sound was followed by one of three other sounds, with equal likelihood). Each triplet was repeated 24 times during a continuous familiarization stream.
Participants were asked to listen to the input very carefully as they would be tested on it after the training. The test consisted of 42 trials: 34 four-alternative forced-choice questions measuring recognition of triplets and eight pattern completion trials measuring recall. Performance on the test yielded an accuracy percentage score.

Visual Statistical Learning
The VSL (Siegelman et al., 2017b) was used to measure learners' ability to learn visual patterns implicitly. As the visual counterpart of the ASL, the VSL presented participants with 16 complex visual shapes that were difficult to describe verbally and were randomly organized into eight triplets (sequences of three shapes). The triplets had a transitional probability of one. Each triplet was repeated 24 times during the familiarization phase. In the testing phase, participants completed 42 trials: 34 four-alternative forced-choice items measuring recognition of triplets and eight pattern completion trials measuring recall. Performance on the test yielded an accuracy percentage score.

Tower of London
The TOL (Kaller et al., 2011) was administered to measure learners' implicit-statistical learning ability during nonroutine planning tasks. Participants were presented with two spatial configurations that consisted of three pegs with colored balls on them. These configurations were labeled "Start" and "Goal." The participants' task was to move the colored balls on the pegs in the "Start" configuration to match the "Goal" configuration in the given number of moves. There was a block of four 3-move trials, eight 4-move trials, eight 5-move trials, and eight 6-move trials (Morgan-Short et al., 2014). In what follows, we present the results for overall solution time, which is the sum of initial thinking time and movement execution time; all three measures yielded similar results. To capture learning, we calculated a proportional change score for each block of trials (i.e., 3-move, 4-move, 5-move, and 6-move separately) for each participant using the following computation: (RT on the first trial − RT on the final trial)/RT on the first trial. Positive values indicate a greater improvement in planning ability from the beginning to the end of each block in the experiment.
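The proportional change score above normalizes within-block improvement by each participant's starting speed. A minimal sketch (the function name and solution times are hypothetical):

```python
def tol_change_score(first_trial_rt, final_trial_rt):
    """Proportional improvement within one TOL block:
    (RT on first trial - RT on final trial) / RT on first trial.
    Positive values indicate faster solutions by the end of the block."""
    return (first_trial_rt - final_trial_rt) / first_trial_rt

# Hypothetical overall solution times (s) for one 5-move block:
# 12 s on the first trial, 9 s on the final trial -> 25% improvement
print(tol_change_score(12.0, 9.0))  # -> 0.25
```

Dividing by the first-trial RT makes scores comparable across participants who differ in baseline speed.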

PROCEDURE
Participants met with a trained research assistant for three separate sessions. As seen in Table 3, the first session included the WMT, SPR, timed aural GJT, and untimed aural GJT; the second session started with OP followed by EI, the timed written GJT, untimed written GJT, and MKT; in the last session, participants completed all aptitude tests starting with VSL, and ended with the MLAT 5 (which is not discussed in this article). Sessions 1 and 2 started with the more implicit knowledge measures to minimize the possibility of participants becoming aware of the target features in the implicit tasks.

Descriptive Statistics and Correlations
Overall, 6% of the data were missing and they were missing completely at random (Little's MCAR test: χ² = 1642.159, df = 1744, p = .960). To explore the associations among measures of implicit-statistical learning aptitude and between implicit-statistical learning aptitude and linguistic knowledge, respectively, we calculated descriptive statistics and Spearman correlations (abbreviated as rs in what follows) for all measures of L2 morphosyntactic knowledge and cognitive aptitude. All such analyses were carried out in R version 1.2.1335 (R Core Team, 2018).

Factor Analysis
To address research question 1, "to what extent do different measures of implicit-statistical learning aptitude interrelate (convergent validity)?," we conducted an EFA to explore the associations among the four implicit-statistical learning aptitude measures. The EFA was performed with an oblique rotation (oblimin), which permits factors to correlate with each other. The model was computed using weighted least squares to account for the violation of the multivariate normality assumption for the four tests (Mardia's skewness coefficient was 36.93 with a p-value of 0.012; Mardia's kurtosis coefficient was 2.28 with a p-value of 0.023). Finally, we used a factor loading cutoff criterion of .40 to interpret the factor loadings.
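The factor-extraction logic can be illustrated in simplified form. The sketch below extracts eigenvalues from a correlation matrix and counts factors under the Kaiser criterion (eigenvalue > 1.0); note that the actual analysis used weighted least squares estimation with oblimin rotation, and the correlation matrix below is invented to echo the pattern reported later (only VSL and ASL correlate):

```python
import numpy as np

def kaiser_count(corr):
    """Eigenvalues of a correlation matrix, sorted descending, plus the
    number of factors with eigenvalue > 1.0 (Kaiser criterion)."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return eigvals, int((eigvals > 1.0).sum())

# Hypothetical correlation matrix for ASRT, VSL, ASL, TOL (invented values:
# a .49 VSL-ASL correlation, the remaining entries near zero)
R = np.array([
    [1.00,  0.05,  0.02, -0.10],
    [0.05,  1.00,  0.49,  0.03],
    [0.02,  0.49,  1.00,  0.01],
    [-0.10, 0.03,  0.01,  1.00],
])
eigvals, n_factors = kaiser_count(R)  # largest eigenvalue driven by VSL-ASL
```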
To address research question 2, "to what extent do measures of implicit-statistical learning aptitude predict three distinct dimensions of linguistic knowledge (i.e., explicit knowledge, automatized explicit knowledge, and implicit knowledge) (predictive validity)?," we built confirmatory factor analysis (CFA) and SEM models using the lavaan package in R. To examine the psychometric dimensions underlying the nine linguistic tests, we constructed two CFA models, a two-factor and a three-factor model. These models were specified based on theory and previous empirical findings from CFA studies by Ellis (2005) and Suzuki and DeKeyser (2017). To evaluate the CFA models, we used a model test statistic (chi-square test), standardized residuals (< |1.96|), and three model fit indices (Hu & Bentler, 1999): the comparative fit index (CFI ≥ .96), the root mean square error of approximation (RMSEA ≤ .06), and the standardized root mean square residual (SRMR ≤ .09). We then built an SEM. In combination with a measurement model (CFA), SEM estimates the directional effects of independent variables (measures of implicit-statistical learning aptitude) on the latent dependent variables (the knowledge type constructs). Full-information maximum likelihood estimation was used to evaluate the different models, and robust maximum likelihood was adopted as the estimation method for both the CFA and SEM analyses to account for the violation of the multivariate normality assumption.

Table 4 shows the descriptive statistics for all linguistic and aptitude measures. Participants showed a wide range of abilities in their performance on the linguistic knowledge measures.
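The fit-evaluation step can be sketched as a simple screen against the cutoffs just listed. The RMSEA point estimate uses the standard chi-square-based formula; the function names below are our own:

```python
import math

def rmsea_estimate(chi2, df, n):
    """Standard point estimate of RMSEA from the model chi-square:
    sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def meets_hu_bentler(cfi, rmsea, srmr):
    """Screen against the Hu & Bentler (1999) guidelines cited in the text:
    CFI >= .96, RMSEA <= .06, SRMR <= .09."""
    return cfi >= 0.96 and rmsea <= 0.06 and srmr <= 0.09

# A model whose chi-square equals its degrees of freedom has RMSEA = 0
rmsea_val = rmsea_estimate(chi2=50.0, df=50, n=131)
ok = meets_hu_bentler(cfi=0.97, rmsea=rmsea_val, srmr=0.05)
```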
Reliabilities of the individual differences measures ranged from satisfactory to high and were generally on a par with those reported in previous studies: ASRT intraclass correlation coefficient (ICC) = .96 (this study) and ASRT ICC = .99 (Buffington & Morgan-Short, 2018), VSL α = .75 (this study) and VSL α = .88 (Siegelman et al., 2017b), ASL α = .68 (this study) and ASL α = .73 (Siegelman et al., 2018, Experiment 1b), and TOL ICC = .78 (this study) and TOL split-half reliability = .59 (Buffington & Morgan-Short, 2018).

Correlational Analysis Among Aptitude Measures
To examine the unidimensionality of implicit-statistical learning aptitude and the interrelationships between different aptitude measures, we ran a correlation matrix between the four implicit-statistical learning aptitude measures. Table 5 presents the Spearman correlation matrix of ASRT, VSL, ASL, and TOL. We note a medium correlation between the VSL and ASL tasks (rs = .492, p < .001). At the same time, correlations of the ASRT and TOL with other tasks are low (−.146 ≤ rs ≤ .054). These results suggest that ASL and VSL may tap into a common underlying ability, statistical learning, whereas performance on other measures of implicit-statistical learning aptitude was essentially unrelated. In sum, the correlation analysis provides initial evidence for the lack of convergent validity of measures of implicit-statistical learning aptitude.

Exploratory Factor Analysis
As the second and final step in answering research question 1, we conducted an EFA with the same four measures. The Kaiser-Meyer-Olkin (KMO) measure suggested that, at the group level, the sampling for the analysis was close to the minimum KMO of .50 (KMO = .49). At an individual test level, most tests were near the .50 cutoff point (ASRT = .52; VSL = .49; ASL = .49), with TOL falling a bit short (.43). Despite the low KMO, we decided to keep all measures in the analysis because they were theoretically motivated. Bartlett's test of sphericity, χ²(6) = 31.367, p < .001, indicated that the correlations between tests were sufficiently large for an EFA. Using an eigenvalue cutoff of 1.0, there were three factors that explained a cumulative variance of 72% (the third factor accounted for a substantial increase in the explained variance, that is, 22%, and was thus included even though its eigenvalue was slightly short of 1.0). Table 6 details the factor loadings post rotation using a factor criterion of .40. As can be seen in Table 6, factor 1 represents motor sequence learning (ASRT), factor 2 represents procedural memory (TOL), and the last factor represents statistical learning, with VSL and ASL loading together.
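For reference, the KMO statistic compares zero-order correlations with partial correlations. A compact implementation of the overall KMO (our own sketch, not the software used in the study):

```python
import numpy as np

def overall_kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy.

    KMO = sum r_ij^2 / (sum r_ij^2 + sum q_ij^2) over off-diagonal cells,
    where q_ij are partial correlations obtained from the inverse of the
    correlation matrix."""
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)            # partial correlation matrix
    off = ~np.eye(corr.shape[0], dtype=bool)   # off-diagonal mask
    r2 = np.sum(corr[off] ** 2)
    q2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + q2)

# With only two variables there is nothing to partial out, so the partial
# correlation equals the zero-order correlation and KMO is exactly .50
kmo_2var = overall_kmo(np.array([[1.0, 0.5], [0.5, 1.0]]))
```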

Confirmatory Factor Analysis
To address the second research question, we first constructed measurement models as a part of SEM to examine the number of dimensions in the nine linguistic tests. As seen in Table 2, we specified two CFA models based on SLA theory: a two-factor model distinguishing implicit versus explicit knowledge (Ellis, 2005) and a three-factor model distinguishing implicit versus automatized explicit versus explicit knowledge (an extension of Suzuki & DeKeyser, 2017). The models differed critically with regard to whether the reaction-time tasks (WMT, SPR) and the timed, accuracy-based measures (OP, EI, TAGJT, TWGJT) loaded onto the same factor, "implicit knowledge," in the two-factor solution, or different factors, "implicit knowledge" and "automatized explicit knowledge," in the three-factor solution (see Table 2). The summary of the fit indices for the measurement models in Table 7 suggests that both models fit the data well, meeting general guidelines by Hu and Bentler (1999). At the same time, the two-factor model demonstrated a better fit than the three-factor model, with a smaller Bayesian information criterion (BIC) value (a ΔBIC between 2 and 6 denotes a positive difference in favor of the model with the lower BIC; see Kass & Raftery, 1995).
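The BIC comparison rule from Kass and Raftery (1995) can be made concrete as follows; the banding reflects their conventional interpretation, and the function name and example values are ours:

```python
def bic_comparison(bic_a, bic_b):
    """Interpret a BIC difference between two models following the
    Kass & Raftery (1995) bands referenced in the text: a difference
    of 2-6 counts as positive evidence for the lower-BIC model."""
    delta = abs(bic_a - bic_b)
    favored = "model A" if bic_a < bic_b else "model B"
    if delta < 2:
        strength = "barely worth mentioning"
    elif delta <= 6:
        strength = "positive"
    elif delta <= 10:
        strength = "strong"
    else:
        strength = "very strong"
    return favored, strength

# e.g., a two-factor model with a BIC 4 points below a three-factor model
favored, strength = bic_comparison(1000.0, 1004.0)
```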

Correlational Analysis of Aptitude Measures and Knowledge Measures
Before running the SEM, we first explored the correlations between implicit-statistical learning aptitude and linguistic knowledge measures. Figure 1 contains Spearman correlation coefficients (above the diagonal) of the 13 variables, scatterplots for variable pairs (below the diagonal), and density plots for each variable (on the diagonal). The results suggest that ASRT correlated significantly and positively with the WMT (rs = .335, p = .002) and the TWGJT (rs = .229, p = .024). In contrast, VSL (rs = −.341, p = .001) and, to a lesser extent, ASL (rs = −.184, p = .095) correlated negatively with the WMT. TOL did not correlate significantly with any of the linguistic knowledge measures (−.128 ≤ rs ≤ .069).

Structural Equation Model
As the final step in answering research question 2, we fitted the structural model to the measurement model to examine aptitude-knowledge relationships. In light of the EFA findings where VSL and ASL clustered into a single factor, we built a latent predictor variable called statistical learning (SL), which combined the ASL and VSL. Consequently, we retained three measures of implicit-statistical learning aptitude (SL, TOL, and ASRT) and treated these as predictor variables of different knowledge constructs to examine the aptitude-knowledge relationships. In the measurement model, we allowed the different knowledge constructs (i.e., explicit, automatized explicit, and implicit knowledge) to correlate because they represent different subcomponents of language proficiency and thus we assumed that they would be related. Figures 2 and 3 show the results of the analyses.3 Table 8 details model fit indices for the two-factor and three-factor SEM models. Two out of the four global fit indices, namely the chi-square test and CFI, fell short of the cutoff points proposed by Hu and Bentler (1999); the SRMR was slightly above the .09 threshold. To diagnose any sources of model misspecification, we inspected the modification indices and standardized residuals. In the two-factor SEM model, two modification indices were larger than 3.84, signaling localized areas of potential ill fit. Both modification indices concerned the WMT, which had a low factor loading onto implicit knowledge. The modifications were not implemented, however, as they lacked theoretical justification (i.e., one recommended WMT as an explicit measure [MI = 4.89] and another suggested WMT as an SL measure [MI = 4.65]). No standardized residual for any of the indicators was greater than |1.96| (largest = 1.73). In the three-factor model, 12 modification indices were larger than 3.84.
Based on this information, we modified the model by adding a method effect (error covariance) between EI and OP to account for the fact that EI and OP are both production tasks. Other modification indices lacked a theoretical or methodological underpinning and, hence, were not pursued further. As detailed in Table 8, adding the error covariance changed the global fit of the modified three-factor model mostly positively (i.e., chi-square p value: 0.02 → 0.03; CFI: .843 → .863; lower bound RMSEA: 0.028 → 0.019) but also negatively (i.e., SRMR: 0.094 → 0.095). No standardized residual for any of the indicators was greater than |1.96|; however, the standardized residual for the WMT-ASRT covariance was slightly above the threshold (Std. residual = 1.97), indicating an area of local strain. Taken together, the two-factor model exhibited a better local fit than the three-factor model, which suggested that it represented our data best. Global fit indices were somewhat low, possibly due to sample size limitations, but importantly, the underlying measurement models (CFA) demonstrated a good fit (see Table 7). As such, we proceeded to interpret the parameter estimates of the two-factor model. Table 9 and Figure 2 detail parameter estimates for the two-factor SEM model. As seen in Table 9, the regression path from ASRT to implicit knowledge was significant, r = .258, p = .007. None of the other aptitude measures significantly predicted implicit or explicit knowledge in the model.

SUMMARY OF RESULTS
We aimed to contribute to the theorization of implicit-statistical learning aptitude as an individual differences variable that may be of special importance for attaining an advanced L2 proficiency (Linck et al., 2013). To measure implicit-statistical learning aptitude more comprehensively, we added two new measures, ASL and VSL (Siegelman et al., 2017b, 2018), to the better-known measures of ASRT and TOL. Overall, only ASL and VSL showed a medium-strong correlation (r = .49) and loaded onto the same factor, whereas the remaining measures were not correlated (RQ1). This underlines that implicit-statistical learning aptitude is a multidimensional, multifaceted construct and that input modality is an important facet of the construct. A multitest approach, measuring aptitude in different input streams and task conditions, is best suited to ensure its predictive validity for language learning.
Given the theoretical importance of implicit-statistical learning aptitude, we also examined its predictive validity for implicit language knowledge, using a battery of nine L2 grammar tests. The final SEM consisted of three aptitude measures regressed on a two-factor measurement model: explicit and implicit knowledge. We found that only ASRT predicted implicit knowledge, which was a latent variable composed of timed, accuracy-based measures and reaction-time tasks. These results inform ongoing debates about the nature of implicit knowledge in SLA (Ellis, 2005; Suzuki & DeKeyser, 2017) and do not lend support to the view that reaction time measures are inherently superior for measuring L2 speakers' implicit knowledge (Suzuki & DeKeyser, 2015; Vafaee et al., 2017).

MULTIDIMENSIONAL NATURE OF IMPLICIT-STATISTICAL LEARNING APTITUDE (RQ1)
Research on implicit-statistical learning aptitude can be traced back to different research traditions within cognitive and developmental psychology (Christiansen, 2019). The domain-general mechanisms that enable implicit-statistical learning have been linked to a range of different linguistic behaviors, from speech segmentation and vocabulary acquisition to syntactic processing and literacy development (see Armstrong et al., 2017, for recent theoretical discussions). Given the explanatory power of implicit-statistical learning aptitude in language research, we first examined the convergent validity of different measures used to assess learners' aptitude.
The results of our EFA did not support the unidimensionality of the different implicit-statistical learning aptitude measures (see Table 6). At a descriptive level, bivariate correlations between the different aptitude measures were close to 0, with the exception of ASL and VSL, which showed a .49 correlation. Correspondingly, in the EFA, the three-factor solution indicated that the battery of aptitude tests does not represent a unitary construct of implicit-statistical learning aptitude. Three factors were extracted: Factor 1 [ASRT] = .25; Factor 2 [TOL] = .24; Factor 3 [ASL and VSL] = .22, which together accounted for 72% of the total variance.
The medium-strength correlation between the measures of statistical learning replicated Siegelman et al. (2018, Experiment 2), who reported a .55 correlation between the ASL and VSL. The ASL and VSL are similar in terms of the nature of the embedded statistical regularity, length of training, and the way statistical learning is assessed (Siegelman et al., 2017a, 2018). Given that the tests are similar other than with regard to their input modality, these measures jointly offer a relatively pure test of the role of input modality in statistical learning. The results of the EFA showed that a common underlying ability, statistical learning, accounted for approximately 22% of the variance in participants' ASL and VSL performance, while differences in input modality accounted for some portion of the remaining 78% of variance. Input modality is therefore likely to be an important source of individual differences in statistical learning. These modality-based differences in statistical learning aptitude are relevant to adult L2 learners insofar as learners experience a mix of written and spoken input that may shift according to their instructed or naturalistic learning environments. For instance, Kim and Godfroid (2019, Experiment 2) reported an advantage for visual over auditory input in the L2 acquisition of implicit knowledge of syntax by college-educated adults. While results of correlation research are best interpreted cumulatively, across different research studies, the medium-strong ASL-VSL correlation in the present study is consistent with the view (Arciuli, 2017; Frost et al., 2015; Siegelman et al., 2017b) that statistical learning is a domain-general process that is not uniform across modalities.
Seen in this light, it is interesting that the other assessment of statistical learning in the visual modality, the ASRT, showed no correlation with the VSL (see Table 5). Both tests use nonverbal material to assess an individual's ability to extract transitional probabilities from visual input. The ASRT has an added motor component, which may have contributed to the lack of convergence between the two measures. Additionally, VSL and ASRT may not have correlated because of when learning was assessed. Learning on the ASRT was tracked online, during the task, as a reaction time improvement (speed up) over training. In the ASL and VSL, however, assessment of learning took place offline, in a separate multiple-choice test that came after the training phase. It has been argued that the conscious reflection involved in offline tasks may confound the largely implicit learning that characterizes statistical learning (Christiansen, 2019). Online measures of implicit-statistical learning such as the ASRT, however, may be able to capture learning with a higher resolution (Siegelman et al., 2017b) and a better signal-to-noise ratio. Although more research is needed to evaluate these claims, our results support the superiority of online measurement. Using structural equation modeling, we confirmed the predictive validity of the ASRT for predicting implicit grammar knowledge in a sample of advanced L2 speakers (see the next section on RQ2 for further discussion). Conversely, neither the VSL nor the ASL had predictive validity for L2 implicit grammar knowledge in this study, potentially because the two measures of statistical learning allowed for participants' conscious involvement on the posttests.
To investigate this result in more depth, researchers could reexamine the predictive validity of the ASL and VSL for specific grammar structures in our test battery such as embedded questions or third-person -s, which contain a clear, patterned regularity that lends itself well to statistical learning.
Lastly, the ASL, VSL, and ASRT were unrelated to the TOL. The TOL task finds its origin in research on planning and executive function (Shallice, 1982) and was used in a modified form, as a measure of cognitive skill learning, in Ouellet et al. (2004). Because TOL measures the effects of practice, it can be regarded as a measure of skill acquisition and is assumed to reflect procedural learning (Ouellet et al., 2004) and provide a measure of individual differences in procedural memory ability (e.g., Antoniou et al., 2016; Buffington & Morgan-Short, 2018; Buffington et al., 2021; Ettlinger et al., 2014; Morgan-Short et al., 2014). The contributions of procedural memory to implicit-statistical learning are complex (Batterink et al., 2019; Williams, 2020). Batterink et al. (2019) reported that "a common theme that emerges across implicit learning and statistical learning paradigms is that there is frequently interaction or competition between the declarative and nondeclarative [e.g., procedural] memory systems of the brain…. Even in paradigms that have been specifically designed to isolate 'implicit learning' per se, healthy learners completing these tasks may show behavioral evidence of having acquired both declarative and nondeclarative memory" (p. 485, our addition in brackets). This interaction between declarative and nondeclarative memory in implicit learning tasks could explain the lack of convergent validity between TOL and the other measures of implicit-statistical learning aptitude; that is, measures of implicit-statistical learning may draw on multiple memory systems including, but not limited to, procedural memory. Our results are consistent with Buffington and Morgan-Short (2018) and Buffington et al. (2021), who also reported a lack of correlation between the ASRT and TOL in two samples of university-level research participants (r = −.03, n = 27 and r = .03, n = 99).
The TOL does not involve patterned stimuli like the other three measures in this study, but focuses instead on an individual's improvement (accuracy gains or speed up) in solving spatial problems as a result of practice. The lack of predictive validity for implicit knowledge in advanced L2 speakers creates a need for further research into the learning processes and memory systems engaged by the TOL. TOL is indeed measuring practice, but our results, in addition to those of Buffington and colleagues (2021), do not support the claim that such practice effects reflect an individual's procedural memory learning ability. Further research into the construct validity of the TOL will be necessary. To facilitate future validation efforts, it would be helpful to standardize the use of the TOL task in L2 research. Multiple task versions, with and without repeating trials, as well as with accuracy scores versus with reaction times, are currently used in parallel in SLA, which renders comparisons of results across studies difficult. The TOL-F (published in 2016), an accuracy-based version of the TOL with improved psychometric properties, is still new in L2 research but could be of great value to achieve greater standardization in the field.
On balance, our results suggest that the findings for implicit-statistical learning aptitude do not generalize beyond the measure with which they were obtained. Future researchers will therefore need to continue treating different tests of implicit-statistical learning aptitude as noninterchangeable. For maximum generalizability, it will be important to continue using a multitest approach as exemplified in the present study. Including multiple tests of implicit-statistical learning aptitude will ensure proper representation of the substantive domain and may help researchers steer clear of confirmation bias. Over time, it will also enable researchers to refine their understanding of the different dimensions of implicit-statistical learning aptitude (Siegelman et al., 2017a) and come to a more nuanced understanding of these dimensions' roles, or nonroles, in different L2 learning environments, for learners of different ages and education levels, and with different target structures. Our call for a multitest approach echoes common practice in explicit learning aptitude research, where researchers routinely administer a battery of different tests to language learners to measure their aptitudes (see Kalra et al., 2019; Li, 2015, 2016).

ONLY TIMED, ACCURACY-BASED TESTS SUPPORTED AS MEASURES OF IMPLICIT KNOWLEDGE (RQ2)
This study was conducted against the background of an ongoing debate about how best to measure L2 learners' implicit knowledge. Measures of implicit-statistical learning aptitude can inform the construct validity of different tests (timed, accuracy-based tests and reaction-time tasks) by revealing associations of aptitude with these hypothetical measures of implicit knowledge (DeKeyser, 2012; Granena, 2013). The results of this study support the predictive validity of implicit-statistical learning aptitude (ASRT) for performance on timed language tests, affirming the validity of timed, accuracy-based tests as measures of implicit knowledge (Ellis, 2005). Similar support for the validity of reaction-time-based tests was lacking (cf. Suzuki & DeKeyser, 2017), underscoring that our understanding of reaction-time measures of linguistic knowledge is still at an early stage.
We find these results to be intriguing. The two reaction-time tasks in the study, WMT and SPR, rely on the same mechanism of grammatical sensitivity (i.e., slower responses to ungrammatical than grammatical sentences) to capture an individual's linguistic knowledge. It has been assumed, often without much challenge, that grammatical sensitivity on reaction-time tests operates outside the participants' awareness, and hence may represent the participants' linguistic competence or implicit knowledge (for a critical discussion of this assumption, see Godfroid, 2020;Marsden et al., 2018). But in spite of the underlying similarity between the two tasks, performance on the SPR and the WMT correlated weakly, rs = .178, p = .098 (see Figure 1), and the two tasks loaded poorly onto the implicit knowledge factor in the CFA/SEM analysis (SPR, Std. Est. = 0.225; WMT, Std. Est. = 0.054). This indicates that current models of L2 linguistic knowledge do not account well for participants' performance on reaction-time tasks.
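The grammatical-sensitivity score that both reaction-time tasks rely on is, at its core, a reaction-time difference. A minimal sketch with invented reaction times (the function name and values are hypothetical):

```python
import numpy as np

def grammatical_sensitivity(rt_grammatical, rt_ungrammatical):
    """Difference score: mean RT (ms) on ungrammatical items minus mean RT
    on grammatical items. Positive values indicate slowing at the point of
    the violation, taken as sensitivity to the ungrammaticality."""
    return float(np.mean(rt_ungrammatical) - np.mean(rt_grammatical))

# Hypothetical word-monitoring RTs (ms) for one participant
rt_gram = [480.0, 510.0, 495.0]
rt_ungram = [530.0, 560.0, 545.0]
score = grammatical_sensitivity(rt_gram, rt_ungram)  # 50.0 ms slowdown
```

Because such difference scores subtract two noisy means, they tend to have lower reliability than raw accuracy scores, which is the measurement concern raised in the following paragraph.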
The construct validity of reaction time measures of linguistic knowledge cannot be separated from the instrument reliability. Compared to the accuracy-based tasks in the study, learners' performance on the WMT and SPR (the two reaction time tasks) was somewhat less reliable (see Table 4 for a comprehensive review on the validity and reliability of the nine linguistic measures). This has been a fairly consistent observation for reaction time measures, and in particular reaction time difference measures used in individual differences research (e.g., Draheim et al., 2019;Hedge et al., 2018;Rouder & Haaf, 2019), such as the grammatical sensitivity scores calculated for SPR and WMT in this study. Draheim et al. (2019) pointed out that researchers who work with reaction time difference measures often see one task "dominate" a factor, with other measures loading poorly onto the same factor. This is exactly what happened in the three-factor SEM model, where the implicit knowledge factor accounted perfectly for participants' SPR performance, but did not explain much variance in WMT scores. The three-factor model was abandoned for a simpler, two-factor SEM model, but that model did not account well for either reaction-time measure (see Figure 2 and Appendix S3 in online Supplementary Materials). These results suggest that reaction-time tests of linguistic knowledge are not a homogeneous whole (either inherently or because of lack of internal consistency), in spite of their shared methodological features. Therefore, given the current state of affairs, claims about their construct validity ought to be refined to the level of individual tests, for instance WMT or SPR separately, rather than reaction time measures as a whole.
To illustrate, we performed a post-hoc correlation analysis of the ASRT with WMT and SPR separately. We found that the ASRT correlated significantly and positively with the WMT (Spearman rank, rs = .335, p = .002), mirroring the global result for implicit knowledge (i.e., the latent variable, which was also predicted by the ASRT). SPR did not correlate with the ASRT (rs = −.027, p = .804) or with other measures of implicit-statistical learning aptitude. These results suggest that at the individual-test level, the WMT has some characteristics of a measure of implicit knowledge, consistent with earlier findings from Granena (2013) and Suzuki and DeKeyser (2015). No such evidence for SPR was obtained in this study.
Last but not least, our results revealed a significant association between implicit-statistical learning aptitude (the ASRT) and a latent factor that included four timed, accuracy-based tests (TWGJT, TAGJT, EI, OP). This supported the validity of these measures as implicit knowledge tests (Ellis, 2005). Successful performance on the timed, accuracy-based measures requires fast and accurate processing of targeted grammatical knowledge. The ASRT, however, is an entirely nonlinguistic (nonverbal) task that requires fast and accurate motor responses from participants. To obtain a high aptitude score on the ASRT, participants need to speed up over time as they induce the repeating patterns in the motor sequence. One possible account for the ASRT-implicit knowledge relationship, therefore, is that both measures rely on participants' procedural memory (also see Buffington et al., 2021). On this account, the ASRT derives its validity as a predictor of implicit knowledge because it taps into the same neural substrate as implicit knowledge of language does, namely procedural memory. Similarly to procedural memory representations, implicit knowledge takes time to develop. This may explain why in previous studies, as in the present one, the SRT and ASRT predicted performance in proficient or near-native L2 learners (Granena, 2013; Linck et al., 2013; Suzuki & DeKeyser, 2015; but see Suzuki & DeKeyser, 2017; Tagarelli et al., 2016) or predicted collocational knowledge in L1 speakers and not L2 speakers (Yi, 2018). For researchers who may not have the resources to include multiple measures of implicit-statistical learning, the SRT or ASRT may thus be the best, single-test option to gain insight into the nature of learner processes or linguistic outcomes (also see Kaufman et al., 2010, who referred to the SRT as "the best measure of implicit learning currently available," p. 325).
CONCLUSION
We examined the contributions of implicit-statistical learning aptitude to implicit L2 grammar knowledge. Our results are a part of an ongoing, interdisciplinary research effort, designed to uncover the role of domain-general mechanisms in first and second language acquisition. Implicit-statistical learning aptitude was found to differ along multiple dimensions, suggesting a need for caution when generalizing results from a specific test (e.g., ASRT) to the larger theoretical constructs of implicit learning, statistical learning, and procedural memory because results may be specific to the test with which they were obtained, and the theoretical constructs may not be unitary in nature.
We also adduced support for the validity of timed, accuracy-based knowledge tests (i.e., OP, EI, timed auditory/written GJTs) as measures of implicit knowledge, supporting their use in the language classroom, language assessment, and lab-based language research to assess implicit grammar knowledge. Reaction time measures (i.e., SPR, word monitoring) currently do not enjoy the same level of validity evidence, in spite of their widespread use in lab-based research.
Despite its contributions, this study had some limitations that must be considered when interpreting the results. First, our participants were highly heterogeneous in their L1s, language learning contexts, and length of residence in an English-speaking country. Nearly half of our participants were Chinese, who may have had a jagged profile of explicit and implicit knowledge. Differences in L1 background could invite possible transfer effects (both positive and negative) across the tasks and structures. This study would also have benefited from a larger sample size, both for the EFA and the SEM. Lastly, it will be crucial to establish good test-retest reliability for the different measures of implicit-statistical learning aptitude in future research (see Kalra et al., 2019) to show that these aptitude measures can serve as stable individual differences measures that preserve rank order between individuals over time.
Nonetheless, the results of this study help reconcile different theoretical positions regarding the measurement of L2 implicit knowledge by affirming the validity of timed, accuracy-based tests. They also point to the validity and reliability of reaction-time measures as an important area for future research. We would very much welcome other researchers to advance this research agenda and hope that the test battery developed for this project will help contribute to this goal.

SUPPLEMENTARY MATERIALS
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0272263121000085.

1. We have chosen to adopt the term "implicit-statistical learning" based on Conway and Christiansen (2006), Perruchet and Pacton (2006), Reber (2015), Christiansen (2019), and Rebuschat and Monaghan (2019), in which these authors make arguments for combining the two approaches of implicit learning and statistical learning into one phenomenon due to their similar focus, ancestry, and use of artificial languages.
2. Although knowledge represented in procedural memory is implicit (inaccessible to awareness), both declarative and procedural memory underlie implicit knowledge, suggesting procedural memory and implicit knowledge are related but not isomorphic (Batterink et al., 2019; Ullman, 2020; Williams, 2020).
3. The full covariance matrix with error covariances for each figure is available from the authors upon request.