Aptitude Testing of Diverse Groups

doi:10.1017/9781009076463.008

Part II Aptitude Testing of Diverse Groups

6 Testing Language Aptitude for Chinese Learners of Foreign Languages FLAT-C vs. LLAMA

Introduction

Language aptitude (LA) is one of the most important individual difference factors and plays an important role in language learning success. LA consists of a group of complex abilities that integrate both the cognitive ability and the perceptual ability of an individual and helps them to learn a foreign language faster and more easily (Carroll, Reference Carroll and Diller1981). Aptitude can be measured by LA tests, which are used for many purposes, such as screening for foreign language talents, diagnosing foreign language learning difficulties, and personalized foreign language teaching and learning. Quite a few LA tests have been developed since the 1950s, and among them are the Modern Language Aptitude Test (MLAT) and Pimsleur’s Language Aptitude Battery (PLAB). However, these tests were mainly developed for English native speakers (Wu, Liu, & Jeffrey, Reference Wu, Liu and Jeffrey1993) rather than for speakers of other languages, and any potential limitation because of their dependence on the English language for presentation did not matter. A range of aptitude tests developed since then have followed this L1-English delivery approach.

Given that LA has been little researched in China, despite its importance for Chinese L2 learners, Li (Reference Li2014) and Li and Luo (Reference Li, Luo, Wen, Skehan, Biedroń, Li and Sparks2019) developed the Foreign Language Aptitude Test for Chinese Learners of Foreign Languages (FLAT-C), based on Carroll’s LA theory and Skehan’s (1998) aptitude-linked language learning proposals. A preliminary validation of the FLAT-C has been conducted, and it showed that item reliability for the subtests ranged from 0.84 to 0.96, suggesting that the reliability of the test was satisfactory (Li & Luo, Reference Li, Luo, Wen, Skehan, Biedroń, Li and Sparks2019). However, the comprehensive validity of this new test is far from complete. Consequently, the current study will examine the criterion-related validity of FLAT-C, with LLAMA as an external criterion, as well as the predictive validity of the test, with the final-term English test as the criterion. There will also be some discussion of the construct validity of the test. These investigations could potentially provide empirical evidence for the validation of FLAT-C and promote the study of LA in the Chinese context. The study also explores the importance of the delivery language for an aptitude test.

Theories of Foreign Language Aptitude

Of the various factors influencing foreign language learning, foreign language aptitude has been considered as one of the most important (Skehan, Reference Skehan1998). It refers to “the individual’s initial state and readiness and capacity for learning a foreign language, and probable degree of facility in doing so” (Carroll, Reference Carroll and Diller1981, p. 86). LA is a trait that is widely considered to be partly innate, fairly stable, and relatively enduring (Carroll, Reference Carroll1993, and see Abrahamsson and Smeds, Chapter 8, this volume) and a strong predictor of general L2 proficiency, with different aptitude components demonstrating differential predictive validity for different aspects of learning (Li, Reference Li2016). It has been found that the correlation between LA and second language achievement is generally in the range of 0.4–0.6 (Carroll, Reference Carroll1966).

Although different aptitude models have been proposed, Carroll’s four-factor model is still the most influential and important. Based on the findings of extensive empirical research, Carroll (Reference Carroll and Glaser1965) proposed that LA consists of four sub-components: phonetic coding ability (the ability to discriminate and code unfamiliar sounds so that they can be recalled later); associative memory (the ability to make connections between native language words and their foreign language counterparts); grammatical sensitivity (the ability to identify the functions of words in sentences); and inductive language learning ability (the successful identification and extrapolation of patterns between form and meaning). Based on this model, the MLAT was developed to predict the rate of foreign language learning. Carroll’s foreign language aptitude theory laid the foundation for a large proportion of the subsequent research in this area.

Over the years, other models of foreign language aptitude have been proposed. For example, Skehan (Reference Skehan1998, Reference Skehan and Robinson2002, Reference Skehan, Gass and Mackey2012) proposed an acquisition-based account, in which LA consisted of auditory processing, language analysis, and memory components. In addition, Skehan (Reference Skehan and Robinson2002) divided language development into four macro stages: noticing, patterning, controlling, and lexicalizing, and combined them with different LA sub-components. Noticing is concerned with the initial inroad, the first insight that some aspect of form is worth attention, and phonetic coding and working memory play a role at the noticing stage. Patterning reflects the capacity to detect and manipulate patterns in the target language and requires input to be analyzed and processed, a generalization to be made, and then extension achieved, so language analytic ability could be applied at this stage. The third stage is achieving control, a process where a rule-based generalization, initially handled with difficulty, becomes proceduralist, but Skehan did not elaborate on what LA sub-component(s) might be relevant to the learning in this stage. Lexicalizing is concerned with how the learner is able to go beyond rule-based processing, however fast, and build a lexical system that can be used to underlie real-time performance. Skehan (Reference Skehan and Robinson2002) argued that the memory components of LA were responsible for the learning in this stage. Although Skehan’s claims are theoretically appealing, it is difficult to empirically verify them because the hypothesized stages may not be easily operationalized (Li, Reference Li, Luo, Wen, Skehan, Biedroń, Li and Sparks2019).

Robinson (Reference Robinson and DeKeyser2007) proposed a hypothesis, the Aptitude Complex Hypothesis, in which foreign language aptitude is seen as consisting of several aptitude complexes that are differentially related to foreign language learning under different circumstances. In line with this hypothesis, some foreign language learners might possess strengths in some abilities under specific learning conditions but have the lower ability in others. While this model is, in many respects, an extension of traditional research on aptitude in assuming a hierarchical structure, relating aptitude complexes to instructional options and settings, and assigning learners to memory-oriented and analytically oriented categories, its unique contribution lies in the complexity and meticulousness of analysis of aptitude–treatment interactions.

In addition to the models mentioned above, there have been other foreign language aptitude proposals in recent years, such as the Linguistic Coding Difference Hypothesis model, which supports the componential view of foreign language aptitude and confirms the relationship existing between L1 skills and L2 learning (Sparks et al., Reference Sparks, Patton, Ganschow and Humbach2011); and the Cognitive Ability for Novelty in Language Acquisition – Foreign (CANAL-F) model, which emphasizes the ability to handle novelty and ambiguity when learning a second language (Grigorenko, Sternberg, & Ehrman, Reference Grigorenko, Sternberg and Ehrman2000). In recent years, many researchers have argued that working memory should be regarded as a central component of LA (Harrington & Sawyer, Reference Harrington and Sawyer1992; McLaughlin, Reference McLaughlin1995; Wen, Biedroń, & Skehan, Reference Wen, Biedroń and Skehan2017; Wen & Skehan, Reference Wen and Skehan2011), but only a few empirical studies have been conducted to explore the relationship between working memory and LA (Hummel, Reference Hummel2009; Safar & Kormos, Reference Safar and Kormos2008; Winke, Reference Winke2013; Yalçın, Cecen, & Erçetin, Reference Yalçın, Cecen and Erçetin2016; Yoshimura, Reference Yoshimura, Bonch-Bruevich, Crawford, Hellermann, Higgins and Nguyen2001).

With the development of cognitive psychology, applied linguistics, and cognitive neuroscience, the construct and components of LA are being clarified. However, Carroll’s conception of LA is still the classical model and the basis on which many foreign language aptitude tests have been developed. In the present chapter, our focus will not be on the range of aptitude tests that now exist, but rather we will explore whether Carroll’s account still has relevance and, most centrally, what is the best way to implement it in non-L1-English contexts.

Development of Language Aptitude Tests

First of all, we need to take a more detailed approach to more recent aptitude test developments to gain insights into how we can address this issue. Various LA tests have been developed for different populations. These tests are of two broad categories: a) the aptitude structure proposed by Carroll and b) contemporary research in second language acquisition and cognitive psychology (Skehan, Reference Skehan2015). Most LA tests, such as the MLAT (Carroll & Sapon, Reference Carroll and Sapon1959), PLAB (Pimsleur, Reference Pimsleur1966), and LLAMA (Meara, Reference Meara2005) belong to the first category; others, such as the Cognitive Ability for Novelty in Language Acquisition – Foreign Test (CANAL-FT) (Grigorenko, Sternberg, & Ehrman, Reference Grigorenko, Sternberg and Ehrman2000) and the High-Level Language Aptitude Battery (Hi-LAB) (Doughty, Reference Doughty2019; Linck, Hughes, & Campbell, Reference Linck, Hughes and Campbell2013), belong to the second category.

The MLAT, devised by Carroll & Sapon (Reference Carroll and Sapon1959), consists of five parts (Number Learning, Phonetic Script, Spelling Clues, Words in Sentences, and Paired Associates), which measure three of the sub-components of LA in Carroll’s model. The MLAT has dominated the field of LA for almost 60 years, yet it does not fully measure inductive language learning ability, the fourth sub-component of the model. Inspired by Carroll, Pimsleur (Reference Carroll1966) developed the PLAB for teenagers. Although the PLAB was broadly recognized as a substitute for the MLAT, it actually strengthens the measurement of language analysis ability and also investigates learning interest (motivation) as well as the GPA of the learner. Another test similar to the MLAT is the Defense Language Aptitude Battery (DLAB), which was developed by Petersen and Al-Haik (Reference Petersen and Al-Haik1976) and has been used only by the U.S. Department of Defense. The DLAB primarily measures the ability to learn the grammatical rules within a sequence of systematically varying language materials. It was developed for selection purposes and emphasized predictive validity. Most of the tests from this era measured explicit cognitive abilities, and implicit language learning abilities attracted little attention at that time (Granena, Reference Granena2019).

In slight contrast to these developments, the CANAL-FT is based on a new cognitive theory that emphasizes the ability to handle novelty and ambiguity when learning a second language. It focuses on measuring subjects’ recall and inferencing ability to process and acquire novel linguistic materials in both immediate and delayed situations (Grigorenko, Sternberg, & Ehrman, Reference Grigorenko, Sternberg and Ehrman2000). However, the test has not enjoyed the popularity of the MLAT, perhaps due to its features such as duration (three hours) and format (i.e., paper and pencil but coupled with auditory material), which might make its administration more challenging (Granena, Reference Granena2019).

In recent years, two other LA tests have attracted attention. One is the LLAMA test, which consists of four subtests (Meara, Reference Meara2005). Although the LLAMA was largely based on the MLAT, there are some differences, in at least three ways. First, it is free, computer-based, and widely available. Second, unlike the MLAT, the LLAMA is language independent. It relies on picture stimuli and verbal materials adapted from a British Columbian indigenous language and a Central American language, sources that are intended to be fair and suitable for speakers of any language. The third difference is that one part of the LLAMA, LLAMA D, a sound recognition test measuring the ability to recognize sound sequences, has been proposed (Granena, Reference Granena, Granena and Long2013) to assess implicit learning ability, a construct that is not present in the MLAT. Another recent test, the Hi-LAB, is a computer-based LA test that was developed by Doughty and her colleagues and has been used to predict high-level L2 attainment (Doughty, Reference Doughty2019; Linck, Hughes, & Campbell, Reference Linck, Hughes and Campbell2013). The test consists of 11 cognitive and auditory perceptual abilities and is thought to strengthen the measurement of the implicit learning abilities and memory components (i.e., working memory, long-term memory, short-term memory, associative memory). However, this test has very limited availability to the research community at this time. Accordingly, it is LLAMA that will be the focus of investigation in the present study.

The previously mentioned LA tests (LLAMA apart) have been developed primarily for English native speakers. As a result, anyone taking such a test may have results that are influenced by their English proficiency if they are non-native speakers. For several years, Chinese researchers have been attempting to develop foreign language aptitude tests for Chinese native speakers. Wu, Liu, and Jeffery (Reference Wu, Liu and Jeffrey1993) were among the first Chinese researchers to investigate the foreign language aptitude of Chinese native speakers. They found that foreign language aptitude was the primary factor influencing an individual’s English learning ability. Liu, Liu, and Deng (Reference Liu, Liu and Deng2005) attempted to devise a test measuring grammatical sensitivity for Chinese foreign language learners. Although they devised eight tasks, test-takers’ performance in tasks in Chinese did not predict their English test scores. They argued that the Chinese language and English belonged to different phyla, so grammatical sensitivity toward Chinese would not be related to learners’ English proficiency. In 2006, Liu and Jiang (Reference Liu and Jiang2006) proposed a design for a foreign aptitude test based on the MLAT and PLAB. Although the scores of the total aptitude test that was developed, and the scores of its four subtests, were correlated with test-takers’ performance in the English test, the correlation coefficient $(r = 0.014 - 0.327)$ was not convincing compared with other studies utilizing the MLAT and PLAB, such as Carroll (Reference Carroll1966) $(r = 0.4 - 0.6)$ and Ehrman and Oxford (Reference Ehrman and Oxford1995) $(r = 0.5)$ . Xia (Reference Xia2011) tried to devise two sets of a foreign language aptitude test for Chinese learners of a foreign language. The stimuli in the two tests were based on French and Korean, respectively, and the end-of-term English grades were used as foreign language learning achievement. The results showed that the total score of the French-based aptitude test explained 31.6% of the variance of test-takers’ fourth-semester English final grades, while the Korean-based aptitude test scores did not enter the regression model. Xia concluded that the Korean-based test would not predict students’ foreign language aptitude. Unfortunately, the validity and reliability of the two tests and the correlation data between the two aptitude tests and English achievement test scores were not provided.

These tests, developed by domestic Chinese researchers, lacked a fundamental theoretical basis, and their reliability and validity needed improvement. To meet the need for a foreign language aptitude test for Chinese foreign language learners, Li (Reference Li2014) and Li and Luo (Reference Li, Luo, Wen, Skehan, Biedroń, Li and Sparks2019) developed a new instrument, FLAT-C, based on the MLAT and PLAB. The FLAT-C was developed to measure the four sub-components in Carroll’s LA model. A preliminary validation study for the test showed that the internal reliability of subtests ranged from 0.77 to 0.92, the infit value of items was between 0.7 and 1.3, and there was a strong correlation between 10th-grade students’ total aptitude score and their mid-term English examination score (correlation of 0.612) (Li & Luo, Reference Li, Luo, Wen, Skehan, Biedroń, Li and Sparks2019).

However, the concurrent and criterion-related validation of this test was rather limited. According to Weir (Reference Weir2005), criterion-related validity is an important part of validation. Therefore, the purpose of the present study is to conduct a criterion-related validation study for the FLAT-C, which was based on Carroll’s (Reference Carroll and Diller1981) LA theory and Skehan’s (Reference Skehan1998) foreign language learning theory, as well as to gather some predictive validity data. The two research questions are as follows:

1) What is the concurrent validity of FLAT-C?
2) To what extent does the FLAT-C predict English achievement?

Methodology

Participants

A total of 75 participants $(male = 17, female = 58)$ formed a convenience sample from a university in the southern part of China. The participants had learned English for more than eight years on average. The average age of onset of English learning was 6.3 years old $(Range = 3 - 11)$ and the average age of testing was 18.6 $(Range = 17 - 20)$ . The participants were sophomore undergraduates in the university majoring in Chinese, accounting, finance, or Japanese. All the participants were recruited by the university after the Chinese Gaokao,Footnote ¹ and their average English score in Gaokao was 116.43 ( $Range = 80 - 139$ ; the total score is 150). Based on Liu and Wu’s (Reference Liu and Wu2019) research, while the learners’ English proficiency after Gaokao is equivalent to the Common European Framework of Reference for Languages (CEFR) B1, the sophomores’ English proficiency is equivalent to CEFR B2.Footnote ² However, the participants’ proficiency in the current study was lower than the standard in CSE. This may be because all the participants were from an ordinary university (or second-tier university), whose admission score for recruiting students is lower than that of other key universities.

Instruments

The instruments in the current study include two LA tests, the LLAMA and FLAT-C, and a short questionnaire related to participants’ personal background information.

LLAMA

LLAMA was chosen as the external criterion for the concurrent validation study for the following reasons. First, the stimulus materials in LLAMA are artificial language or symbols, which can decrease the influence of the mother tongue and long-term memory strategies. Second, the LLAMA battery was developed largely based on MLAT. Third, it is a computer-based test that has been used in many research studies. There have been over 700 citations on Google Scholar since 2013 (Rogers et al., Reference Rogers, Meara, Barnett-Legh, Curry and Davie2017). Finally, the LLAMA test takes only 25 minutes and can be scored automatically by computer. The test consists of four subtests. In the current study, the internal consistency of the LLAMA was 0.58.

LLAMA B is a vocabulary learning task that measures learners’ ability to learn a number of vocabulary words in a new language in a short space of time. It assesses the users’ ability to attach unfamiliar names to unfamiliar objects. This test is loosely based on the vocabulary learning subtest (MLAT-5) in the MLAT. Before testing, test-takers are given two minutes to study 20 word–picture associations. During the testing phase, test-takers have to identify the picture for each word within a short time.

LLAMA D is a new test that assesses a skill not measured by the MLAT. The subtest is a timed sound recognition test measuring the ability to recognize a sound sequence. During testing, learners listen to a string of 10 computer-generated sound sequences. The stimuli are only played once and do not provide time for additional study. Then, test-takers are asked to complete a recognition test in which they need to discriminate between old and new sound sequences.

LLAMA E is a sound–symbol correspondence test that was adapted from MLAT. The subtest requires test-takers to learn the relationship between sounds presented auditorily (24 in total) and an unfamiliar writing system. Test-takers have two minutes to learn the associations between sounds and symbols. During testing, test-takers are required to choose one symbol based on two different sounds.

LLAMA F is a grammatical inferencing test that measures the ability to infer the rules of an unknown language. Test-takers are shown 20 pictures and a short sentence in an artificial language describing each picture. They have five minutes to learn the rules for each association. In the testing phase, test-takers are shown a picture and two sentences and are required to choose the correct sentence for each picture.

FLAT-C

The FLAT-C, which was based on Carroll’s (Reference Carroll and Diller1981) LA theory and Skehan’s (Reference Skehan1998) foreign language learning theory, was developed by Li and Luo (Li, Reference Li2014; Li & Luo, Reference Li, Luo, Wen, Skehan, Biedroń, Li and Sparks2019). This test consists of five parts, takes about 55 minutes to complete, and measures all four abilities in Carroll’s LA model. In the current study, the internal consistency of the FLAT-C was 0.65.

Part One: Number Learning corresponds to MLAT subtest 1 and measures test-takers’ auditory and associative memory ability. This test consists of two phases. In the first phase, test-takers study numbers (e.g., 1, 2, 3; 10, 20, 30; 100, 200, 300) represented by meaningless syllables by listening to the recording of these numbers in a novel language. The test-takers are required to study and memorize the pattern and pronunciation of numbers in the new language. In the second phase, test-takers are required to write the numeric forms of the numbers (43 in total) according to the recording, which is played only once.

Part Two: Phonetic Script corresponds to MLAT subtest 2 and measures test-takers phonetic coding ability. This subtest consists of 30 items, which are divided into six blocks, each containing an average of five items. In each item, four symbols correspond to four meaningless syllables. At first, all of the sounds in each block are played one by one, and test-takers are required to study and memorize the association between the sounds and the symbols. Then, for each item, test-takers hear one sound and are required to identify the corresponding symbol.

Part Three: Paired Associates corresponds to MLAT subtest 5 and measures associative memory ability. In this subtest, test-takers are shown 24 words written in an artificial language and their Chinese equivalents. Test-takers have two minutes to study and remember the association between each word and its Chinese equivalent. In the testing phase, test-takers are required to select the corresponding Chinese equivalent for each word from the four options.

Part Four: Words in Sentences corresponds to MLAT subtest 4. It measures grammatical sensitivity and consists of 30 items. For each item, there is one key sentence and a parallel but different sentence containing several marked options. One word/phrase is highlighted in the key sentence and four words are highlighted in the option sentence. Test-takers are required to select the correct word/phrase in the option sentence serving the same grammatical function as the highlighted word in the key sentence. The following is a sample item.

北京是中国的首都。

第一次大手术的成功极大地增强了我们的信心。

A B C D E

Translation:

Beijing is the capital of China.

The success of the first operation greatly enhanced our confidence.

C A B D E

Part Five: Language Analysis corresponds to PLAB subtest 4 and measures inductive language learning ability. This subtest presents words and sentences from the artificial language developed for the test along with their Chinese equivalents. Test-takers need to study the examples first. Next, they are required to infer the underlying grammatical rules governing the sentence for each item. Each item has a sentence in either Chinese or the novel language. Then, they are asked to choose the corresponding sentences expressed in the novel language or Chinese. An example (with options omitted) follows.

下面给出了两个外语单词和一个外语句子以及它们相对应的汉语意思 (Below are two foreign words, one foreign sentence, and their Chinese meanings):

aijo: 妈妈 (mother)

la ponra: 那只狗 (that dog)

aijo la ponram ne: 妈妈喜欢那只狗 (Mother likes that dog.)

现在请你根据上面给出的内容，想想下面的中文句子该如何用这种新的语言来表达 (Now please think about how the following sentence should be expressed in the new language):

那只狗喜欢妈妈. (That dog likes mother.)

Questionnaire

A short questionnaire was designed to collect participants’ background information, focusing on their major, gender, age of onset, age of testing, English score in Gaokao, etc.

Data Collection and Scoring Procedures

A pilot study was conducted in order to establish the viability and appropriateness of the testing procedures. Although the pilot study went well, post-test interviews revealed that subjects’ anxiety over the remaining time for the last two parts of FLAT-C could have interfered with their performance. In order to reduce the subjects’ anxiety, in the main study, subjects were informed 15 minutes before time was up. In addition, the researchers prepared detailed instructions for the LLAMA for subjects so that they would be familiarized with the interface as well as the procedures. Finally, the order of the two tests was adjusted: Subjects would take the FLAT-C first and then take LLAMA subtests in alphabetical order since the FLAT-C is partially in Chinese and could help subjects better understand the test.

In the main study (the present study), participants completed the tests at a convenient office when they were available on any one weekend at the end of 2018. The participants registered for a time slot in advance and then were divided into several groups according to their time. Each group completed the FLAT-C by listening through headphones, answering the items on paper within 55 minutes, and then completing the LLAMA on their computer within 25 minutes. To avoid mental fatigue, there was a ten-minute break between the two tests. After finishing the aptitude tests, participants completed the short questionnaire. It took six weekends for data collection to be completed. All participants were offered a gift as compensation for their time and were told their results after testing. About seven months later, we obtained participants’ final-term English test scores with the help of colleagues in Academic Affairs Offices. The final-term English test is a paper and pencil test, which consists of listening, reading, translation, and writing subtests. While the LLAMA was scored automatically by computer, the FLAT-C was scored manually by one researcher, and second researchers double-checked the results. The final-term English test was scored by participants’ teachers. After data collection and scoring, all the data were inputted into SPSS for further analysis.

Results

Concurrent Validation of FLAT-C

To examine the concurrent validity of the FLAT-C, first, the FLAT-C and LLAMA were compared, not statistically, but based on the proposed test content and test constructs (see Table 6.1).

Table 6.1 Test content and construct of FLAT-C and LLAMA

Construct		Content
	LLAMA	FLAT-C
Associative memory ability	LLAMA B	Part One: Number Learning Part Three: Paired Associates (weakly)
Phonetic coding ability	LLAMA E	Part One: Number Learning (weakly) Part Two: Phonetic Script
Language analysis ability	LLAMA F	Part One: Number Learning (weakly) Part Four: Words in Sentences Part Five: Language Analysis
Sound recognition ability	LLAMA D

Table 6.1 proposes that the four subtests in LLAMA measure four abilities respectively, while the five parts in FLAT-C measure three abilities. Although both FLAT-C and LLAMA were developed in accordance with the constructs of the MLAT, the two tests are a little different from the MLAT. In LLAMA, LLAMA D measures test-takers’ sound recognition ability, which is not present in the MLAT. The other three parts, LLAMA B, LLAMA E, and LLAMA F, target associative memory ability, phonetic coding ability, and language analysis ability, respectively. In contrast to LLAMA, FLAT-C only attempts to measure associative memory ability, phonetic coding ability, and language analysis ability. The MLAT was originally developed to measure the four sub-components in Carroll’s four-factor LA model. In practice, the inductive language learning ability construct was only weakly measured. In order to make up for this deficiency in the MLAT, both LLAMA and FLAT-C strengthened the measurement of inductive language learning ability with LLAMA F and Part Five: Language Analysis in FLAT-C, respectively. Skehan’s three-factor model is used instead of Carroll’s four-factor model in the current study because Skehan (Reference Skehan1998) suggested that language analysis refers to the ability to infer language rules and perform language induction or extrapolation tasks. This integrates grammatical sensitivity and inductive language learning ability, both of which are different aspects of language analysis ability. Grammar sensitivity focuses on specific words that are more implicit and negative, while inductive language learning ability focuses on sentence patterns that are more explicit and positive.

Descriptive analysis, correlation analysis, and factor analysis were used to examine the statistical relationship between LLAMA and FLAT-C. Since the total scores of the FLAT-C, LLAMA and their subtests use essentially different scales, percentages of subtest scores have been used to improve comparability. Table 6.2 shows that, in percentage terms, the total score of FLAT-C $(M = 54.39)$ is significantly higher than that of LLAMA $(M = 41.6)$ $(t = 7.321, p = 0.00, SD = 15.06)$ . Among the subtests of LLAMA, LLAMA E gets the lowest score, while LLAMA F gets the highest scores. However, none of the LLAMA subtests has scored higher than any FLAT-C subtests. In FLAT-C, Phonetic Script and Paired Associates have the highest scores, while Number Learning gets the lowest score.

Table 6.2 Descriptive statistics of FLAT-C and LLAMA scores (%)

	N	Min	Max	Mean	SD
L-B	75	10	85	41.8	19.04
L-D	75	0	86.67	40.21	22.22
L-E	75	0	90	38.93	25.12
L-F	75	0	90	45.08	26.05
L-Total	75	10.67	72	41.6	16.75
NL	75	0	97.67	51.1	22.89
PS	75	33.33	90	59.6	14.73
PA	75	8.3	100	59.67	25.58
WS	75	20	83.33	52.89	12.61
LA	75	13.33	93.33	51.2	18.57
FLAT-C Total	75	27.39	82.8	54.39	12.01

Note: L-B = LLAMA B; L-D = LLAMA D; L-E = LLAMA E; L-F = LLAMA F; L-Total = LLAMA Total; NL = Number Learning; PS = Phonetic Script; PA = Paired Associates; WS = Words in Sentences; LA = Language Analysis

Correlation analysis was conducted to further examine the concurrent validation of the FLAT-C, using Pearson’s r (see Table 6.3). The results show that the FLAT-C Total is significantly correlated with the scores on LLAMA B $(r = 0.32)$ , LLAMA E $(r = 0.48)$ , LLAMA F $(r = 0.34)$ , and LLAMA Total $(r = 0.50)$ . Similarly, the LLAMA Total is significantly correlated with most parts of the five subtests in FLAT-C (with correlations ranging from 0.31 to 0.46). Most of the FLAT-C subtests are significantly correlated with the LLAMA subtests (correlations ranged from 0.27 to 0.46), with the exception that PA is not significantly correlated with any subtest in LLAMA. We have just seen that the total scores of the two batteries correlate at 0.50, and this suggests that the FLAT-C has a degree of concurrent validity with the LLAMA, sharing 25% of the variance. It has to be recognized, though, that this degree of concurrent validity can, at best, be regarded as moderate only.

Table 6.3 Correlation between FLAT-C and LLAMA

	PS	PA	WS	LA	FLAT-C Total	L-B	L-D	L-E	L-F	L-Total
NL	0.39^**	0.33^**	0.40^**	0.02	0.79^**	0.41^**	0.18	0.40^**	0.31^*	0.46^**
PS		0.25^*	0.36^**	0.44^**	0.72^**	0.20	0.27^*	0.46^**	0.24	0.41^**
PA			0.08	−0.13	0.53	0.03	0.02	0.07	0.19	0.12
WS				0.32^**	0.61^**	0.22	0.19	0.28^*	0.19	0.31^*
LA					0.41^**	−0.02	0.14	0.28^*	0.08	0.18
FLAT-C Total						0.32^*	0.25	0.48^**	0.34^**	0.50^**
L-B							0.31^**	0.37^**	0.31^*	0.66^**
L-D								0.46^**	0.25	0.65^**
L-E									0.41^**	0.8^**
L-F										0.74^**

Note: *, p < 0.05; **, p < 0.01; L-B = LLAMA B; L-D = LLAMA D; L-E = LLAMA E; L-F = LLAMA F; L-Total = LLAMA Total; NL = Number Learning; PS = Phonetic Script; PA = Paired Associates; WS = Words in Sentences; LA = Language Analysis

Building on the correlation analysis, factor analysis was conducted on the nine subtests from the two LA tests. The result of this principal component analysis, using the varimax rotation method, shows that the nine subtests from both LLAMA and FLAT-C are best accounted for by three factors (see Table 6.4). Factor 1 explains 23.92% of the variance, Factor 2 explains 21.16% of the variance, while Factor 3 explains 17.03% of the variance. Taken together, the three factors explain 62.10% of the variance.

Table 6.4 Total variance explained

Component	Initial Eigenvalues			Extraction Sums of Squared Loadings			Rotation Sums of Squared Loadings
Component	Total	% of variance	Cumulative %	Total	% of variance	Cumulative %	Total	% of variance	Cumulative %
1	3.07	34.10	34.10	3.07	34.101	34.10	2.15	23.92	23.92
2	1.41	15.67	49.77	1.41	15.67	49.77	1.90	21.16	45.07
3	1.11	12.34	62.10	1.11	12.34	62.10	1.53	17.03	62.10
4	0.88	9.82	71.92
5	0.72	7.96	79.88
6	0.61	6.73	86.61
7	0.50	5.51	92.12
8	0.34	4.20	96.31
9	0.33	3.70	100.00

Note: Extraction Method: Principal Component Analysis

When factor matrices are interpreted, variables with a factor loading greater than 0.40 are usually grouped into one factor (Hatcher, Reference Hatcher1994). According to the rotated solution, five subtests load on the first component with factor loadings between 0.44 and 0.77: LLAMA B (0.77), LLAMA D (0.69), LLAMA E (0.67), LLAMA F (0.59), and Number Learning (0.44). Number Learning, from FLAT-C and the only non-LLAMA test that loads on this factor, is intended to measure associative memory ability, phonetic coding ability (weakly), and language analysis ability (weakly). To recall the detail here, test-takers study numbers, represented by meaningless syllables, by listening to the recording of these numbers in a novel language, and then need to remember the pattern and pronunciation of the numbers. LLAMA is a computer-based LA test to measure associative memory ability, phonetic coding ability, language analysis ability, and sound recognition ability. In this test, all the stimulus materials are meaningless pictures, symbols, or pronunciations in a novel language. Accordingly, it is concluded that both the Number Learning subtest and LLAMA use a novel language so that test-takers are not influenced by their native language. This seems to be the major influence that brings the different subtests together within this factor.

Another four subtests load on the second component, with factor loadings between 0.43 and 0.83: Language Analysis (0.83), Phonetic Script (0.78), Words in Sentences (0.58), and LLAMA E (0.43). Among these subtests, Language Analysis, Phonetic Script, and Words in Sentences are from FLAT-C and measure language analysis ability and phonetic coding ability, while LLAMA E is a sound–symbol correspondence test, adapted from the MLAT, and measures phonetic coding ability. Factor 2 seems to mainly measure language analysis ability and phonetic coding ability, which are two important components in Skehan’s (Reference Skehan1998, Reference Skehan and Robinson2002, Reference Skehan, Gass and Mackey2012) three-factor LA model. Another two subtests load on the third component, with factor loadings between 0.68 and 0.85: Paired Associates (0.85) and Number Learning (0.68) (see Table 6.5). Both subtests are from FLAT-C and mainly measure memory ability, an important component in Skehan’s LA model.

Table 6.5 Rotated component matrix of aptitude subtests

	Factors
	1	2	3
LLAMA B	0.77	−0.05	0.12
LLAMA D	0.69	0.20	−0.15
LLAMA E	0.67	0.43	0.07
LLAMA F	0.59	0.11	0.26
Number Learning	0.44	0.21	0.68
Phonetic Script	0.20	0.78	0.31
Paired Associates	−0.06	0.01	0.85
Words in Sentences	0.23	0.58	0.22
Language Analysis	−0.03	0.83	−0.32

Note: Extraction method: Principal Component Analysis

Rotation method: Varimax with Kaiser normalization

There are some other noteworthy features to these results. The first factor is LLAMA-dominated, with the additional but lower loading on FLAT-C, Number Learning. In other words, these results do not accord with Granena (Reference Granena2019), who reported LLAMA D to fit with a different factor to the other three LLAMA subtests, a factor she interpreted as being more of an implicit aptitude. In the present study, all LLAMA tests are largely accounted for by the first factor, and so it seems appropriate to simply label this factor a LLAMA factor. Then, FLAT-C is distributed over the second and third factors, with Factor 2 emphasizing language and discrimination and Factor 3 focused on associating memory, linking the two FLAT-C tests from this area, one paper and pencil, with a higher loading, and one involving auditory material, with a slightly lower loading. The most striking thing about these results is the degree of separation between the two batteries. One LLAMA subtest (LLAMA E) has loading on the first of the FLAT-C factors, and one FLAT-C subtest has loading on the LLAMA factor (Number Learning). Otherwise, the two batteries, despite their foundation in the MLAT, are surprisingly distinct from one another. We return to this below.

Predictive Validation of FLAT-C

To explore the predictive validity of FLAT-C, the 75 participants’ final-term English test scores were used as indices of their English achievement. A correlational analysis was conducted between the test-takers’ English achievement scores and their FLAT-C and LLAMA scores. The results show that there is a significant correlation between the FLAT-C Total score and final-term English test scores (r = 0.28). At the subtest level, only WS and PS are significantly correlated with English achievement (r = 0.34, r = 0.25). For the LLAMA language battery, the L-Total is not significantly correlated with final-term English test scores; however, one subtest, LLAMA E, is significantly correlated with final-term English test scores (see Table 6.6).

Table 6.6 Correlation between FLAT-C, LLAMA, and English achievement

	NL	PS	PA	WS	LA	FLAT-C Total	L-B	L-D	L-E	L-F	L-Total
English scores	0.17	0.25^*	0.06	0.34^**	0.15	0.28^*	0.05	0.10	0.34^**	0.10	0.22

Note: *, p < 0.05 (two-tailed); **, p < 0.01 (two-tailed); NL = Number Learning, PS = Phonetic Script, PA = Paired Associates, WS = Words in Sentences, LA = Language Analysis; L-B = LLAMA B; L-D = LLAMA D; L-E = LLAMA E; L-F = LLAMA F; L-Total = LLAMA Total; English scores = final-term English test scores

Finally, a stepwise regression analysis procedure was utilized to explore which foreign language aptitude test(s) or which subtest(s) can explain the variance of foreign language achievement. First of all, both FLAT-C Total and LLAMA Total were included as independent variables in the regression equation, the final-term English test achievement was included as the dependent variable, and the stepwise method was used. The results (see Table 6.7) show that only FLAT-C Total enters into the equation and accounts for 8.5% of the variance in English achievement. In order to explore the extent to which subtest(s) in both LLAMA and FLAT-C explain the English achievement, the five subtests in FLAT-C and the four subtests in LLAMA were included as independent variables in the regression equation, the final-term English test achievement was included as the dependent variable, and the stepwise method was used. The results show that both WS and LLAMA E enter into the equation and account for 22.5% of the variance in English achievement, while WS is the strongest predictor and accounts for 17% variance of English achievement.

Table 6.7 Regression model of English achievement

	Model	R	R²	Adjusted R²	Std Error of the Estimate	Unstandardized Coefficients		Standardized Coefficients	t	Sig.
						B	Std. Error	Beta
FLAT-C Total	1	0.29	0.085	0.069	34.79	0.55	0.24	0.29	2.32	0.02
WS	1	0.41	0.17	0.156	33.15	4.02	1.17	0.41	3.44	0.001
WS+	2	0.474	0.225	0.198	32.2	3.37	1.18	0.35	2.85	0.006
L-E						0.35	0.17	0.25	2.02	0.048

Note: WS = Words in Sentences, L-E = LLAMA E

Table 6.7 shows that two of the four subtests, which loaded on Factor 2 (ref. Table 6.5), generate significant predictive validity. Such a result suggests phonetic coding ability and language analysis ability play a more important role in English learning, at least in the current final-term English test.

Discussion

We will start this section by recapitulating the results in this comparison of LLAMA and FLAT-C. The analysis shows that although both FLAT-C and LLAMA were developed based on MLAT, the two LA batteries have several differences in both test content and test constructs. In terms of test content, while each of the four LLAMA subtests measures one sub-component of LA, each aptitude sub-component of FLAT-C seems to be measured by more than one subtest, or one subtest could measure different sub-components of LA. For example, language analysis ability is measured by Number Learning, Words in Sentences, and Language Analysis, while Number Learning measures phonetic coding ability, associative memory ability, and language analysis ability, at least to some extent. Regarding test constructs, LLAMA B, LLAMA E, and LLAMA F require test-takers to use explicit knowledge to learn the associations between symbols and words or phrases, and to identify the rules in order to generalize them in other contexts. In other words, and consistent with Granena (Reference Granena2019), these three subtests measure explicit learning ability. LLAMA D, with only auditory stimuli and no visual materials, requires test-takers to judge whether the sound sequence that is played has been presented before, and the stimuli are played only once and do not give test-takers time to study the rules. This subtest has been considered to measure implicit language learning ability (Granena, Reference Granena, Granena and Long2013, Reference Granena2019), yet the results in the present study seem more consistent with recent studies by Li and Qian (Reference Li and Qian2021) and Suzuki (Reference Suzuki2021), who could not find evidence that LLAMA D measures implicit learning or memory. This suggests that LLAMA D is part of the general LLAMA-defined factor and more concerned with explicit aptitude. All five FLAT-C subtests need explicit cognitive processes to study and memorize pronunciation and words or analyze the underlying rules of sentences. We conclude that they, too, measure explicit learning ability. At least in this respect, the two batteries seem to have a level of consistency and reflect their point of origin.

It seems that the LLAMA test is more difficult than FLAT-C for Chinese foreign language learners, or at least, with the proportion-based scales, lower mean scores are obtained. Assuming that this is a fair assessment, it is interesting to consider why this is the case. Of possible relevance here is that the LLAMA battery is a timed computer-based test in which an artificial language is used. Test-takers are required to match the sounds or symbols with their corresponding pictures or symbols, so it is difficult for test-takers to recognize, study, and remember the sounds, symbols, or pictures within a short period of time. The FLAT-C, Paired Associates, Words in Sentences, and Language Analysis subtests all use items that were designed and developed based on the characteristics of the Chinese language. Paired Associates require test-takers to study and memorize the correspondence between foreign words and their Chinese equivalents; Grammatical Sensitivity requires test-takers to study and identify the grammatical function of a certain Chinese word in the sentence, while Language Analysis requires test-takers to study and infer the expressions of Chinese sentences in an artificial language. Number Learning is the most difficult among the five subtests in FLAT-C. This subtest requires test-takers to study and memorize number spelling rules in an artificial language within a short period and then recombine the numbers, with the intention of measuring both sound discrimination ability and memory ability. The above analysis suggests that test-takers can perform better in a LA test that uses the test-takers’ mother tongue, possibly due to their familiarity with the words and sentences that are used. This result also implies that the decision to base aptitude tests on language learners’ mother tongues has some importance. Even though there is considerable convenience and transportability in using language-neutral aptitude tests, there are advantages in using tests administered in and even based on L1s in order to reduce the mixed effects of LA with other factors.

It is worth noting that several subtests from LLAMA and FLAT-C are correlated significantly and that the FLAT-C Total is significantly correlated with L-Total (0.50). However, few of these subtest correlations are higher than 0.40, and the median cross-battery subtest correlation is only 0.20. This does not provide strong evidence of convergent validity. A Principal Component Analysis run on the nine aptitude subtests (the five FLAT-C subtests and four LLAMA subtests) identified three underlying constructs accounting for 62.10% of the total variance in the data. The first factor, which accounts for 23.92% of the total variance, received significant loadings from five of nine subtests: the four LLAMA subtests (LLAMA B, LLAMA D, LLAMA E, and LLAMA F) and one FLAT-C subtest (Number Learning). These tests have in common the fact that the stimulus in the tests comprises artificial language or symbols, which could be influenced a little by the test-takers’ native language. During testing, the test-takers firstly need to study the artificial language or symbols and then decide which symbol or artificial language corresponds with the recording or picture. As we have seen, and based on Table 6.1, Factor 1 is labeled “LLAMA language aptitude.”

The second factor, which accounts for 21.16% of the additional variance, has significant loadings from four of the nine subtests: three FLAT-C subtests (Phonetic Script, Words in Sentences, and Language Analysis), and one LLAMA subtest (LLAMA E), although it should be noted that this last subtest has a higher loading on the first “LLAMA language aptitude” factor. All the subtests in this factor measure explicit, attention-driven cognitive abilities involving conscious and reflective learning processes. LLAMA E and Phonetic Script require test-takers to listen, memorize, and identify the pronunciations consciously, measuring phonetic coding ability. Words in Sentences requires test-takers to analyze and discriminate sentence components and functions; LLAMA F requires test-takers to infer the rules of an unknown language based on the learned rules. It seems that the two subtests measure language analysis ability. Based on the fact that the four tests loading on the second component are measuring cognitive abilities that involve explicit cognitive processes and phonetic coding ability (see Table 6.1), Factor 2 is labeled “phonetic coding–inductive language analysis ability.”

The third factor, which accounts for 17.03% of the additional variance, has significant loadings on two of the nine subtests: two FLAT-C subtests (Number Learning and Paired Associates). Number Learning requires test-takers to study and memorize the pattern and pronunciation of numbers in the new language. Paired Associates requires test-takers to study and memorize the association between each word and its Chinese equivalent. Based on Table 6.1, Factor 3 was labeled “rote memory.”

The above analysis suggests that while Factor 1 is mainly composed of LLAMA subtests, Factor 2 and Factor 3 mainly belong to FLAT-C, which reveals that although the underlying rationale of the two tests is MLAT and Carroll’s LA theory, FLAT-C and LLAMA are essentially two different LA tests, sharing only 25% of the variance. It seems that FLAT-C mainly measures rote memory, phonetic coding ability, and inductive language learning ability, which require explicit and cognitive consciousness, as well as measuring explicit learning ability. At one level, LLAMA, too, measures similar abilities. But apart from some minor linkages in the factor analysis of sound–symbol association and associative memory, the two batteries seem surprisingly distinct from one another, a separation that merits some discussion.

Two possible influences may contribute to this situation. The first is the obvious one that LLAMA is independent of any L1 and so has been used for any L1–L2 combination. Possibly the lack of L1 use means that interpretation of what has to be done draws in comprehension skills that have an impact on all the subtests and unifies them in this respect. Perhaps, also, as Sparks (Chapter 11, this volume) argues, L1 skills are a foundation for foreign language aptitude, and so lack of opportunity for the use of L1 skills changes the basis on which aptitude tests, in this case LLAMA, are done. A second factor might be that all LLAMA tests have a visual element, and so performance across subtests draws on visual abilities. Yalçın and Erçetin (Reference Yalçın and Erçetin2019) report, in this respect, an involvement of visual working memory in LLAMA subtest performance, at least where lower-proficiency students were involved. This too might contribute to the single-factor loading for LLAMA in this study. Based on this, the construct and validation of LLAMA may merit further exploration, and there is a potential for refining LLAMA further (Bokander & Bylund, Reference Bokander and Bylund2020; Suzuki, Reference Suzuki2021). There is also the implication, if this analysis is valid, that developing foreign language aptitude tests that are L1 neutral may be a more difficult undertaking than was previously thought.

The current study has also explored the predictive validity of the FLAT-C. The correlation analysis (ref. Table 6.6) shows that the FLAT-C is significantly correlated with English achievement (final-term English test; r = 0.28). However, LLAMA is not significantly correlated with English achievement, which suggests FLAT-C has a higher predictivity than LLAMA, at least in this context. Even so, the FLAT-C predictive validity coefficient is not particularly high. Several reasons may account for this. First, previous instruction in English, as well as a number of other variables, such as the opportunity for exposure and motivation, were not controlled in this study and may have reduced the scope for aptitude to have a strong influence. Second, the relatively lower correlation coefficient may also be due to the homogeneity of university learners’ aptitude scores, which is perhaps related to the fact that the participants are probably a more “select” group based on their performance in Gaokao. Heterogeneity of the learner population seems to be a prerequisite for the effect of aptitude to become manifest because one assumption behind the concept of aptitude is that, other things being equal, the wider the range of learners in a study, the more scope there is for learners with the higher aptitude to show that they learn more and faster than others (Li, Reference Li2016). However, the participants in the current study have been selected precisely because they have been successful in education. Third, these learners have had more than eight years of English learning experience, and they are at least at the intermediate level of English proficiency. Robinson (Reference Robinson2005) hypothesized that traditional aptitude measures were only predictive of beginning second language acquisition (SLA), not of high attainment levels (Robinson, Reference Robinson2005). A relevant finding for this is Li (Reference Li2016) who found that LA has less effect in university language classes than in high school language classes. So it may be that LA, at least explicitly measured LA (Li & Qian, Reference Li and Qian2021), is more sensitive to initial stages than later stages of second language learning. Taking these different factors together, it would not be surprising if the scope for predictive validity is attenuated, and that in other, more propitious circumstances, higher validity coefficients might be found. Obviously, this reasoning also applies to the LLAMA predictive validity, which may also have been reduced relative to other circumstances.

The study also found that overall LA, measured by FLAT-C and two particular subtests, Paired Script and Words in Sentence, were significantly correlated with English achievement. This may be because overall aptitude has greater predictive power than discrete components, and sub-components play different roles in various skills. For example, phonetic coding is the weakest predictor for listening comprehension, while language analytic ability appears to be a better predictor of reading comprehension than phonetic coding or rote memory (Li, Reference Li2016). In the current study, the final-term English test is a comprehensive test that includes listening, reading, translation, and writing and measures participants’ comprehensive language ability. This suggests a need for aptitude researchers to clarify, or at least be transparent about, the construct of LA components.

Finally, the study found that only the FLAT-C Total enters into the regression equation, accounting for 8.5% of the variance in English achievement, producing a small effect size (Cohen, Reference Cohen1988). The result is less than the sorts of results discussed in Dörnyei and Skehan (Reference Dörnyei, Skehan, Doughty and Long2003) and Li (Reference Li2016). It suggests that the FLAT-C has the potential to explain a greater proportion of the variance in English achievement with learners of lower proficiency levels and perhaps more controlled instructional conditions. Even so, one has to remember that LA is only one of the important factors influencing foreign language learning (Ehrman & Oxford, Reference Ehrman and Oxford1995; Sparks, Ganschow, & Patton, Reference Sparks, Ganschow and Patton1995). The lower explanatory value may be because of several factors. First, as indicated earlier, the participants in this study are university students who were recruited based on Gaokao and are less varied in LA. Second, although the FLAT-C has been validated to some extent, it has not been standardized. Last, the final-term English test is not a standardized English test. When the total nine subtests from both FLAT-C and LLAMA were included in the regression equation, only Words in Sentences and LLAMA E entered, explaining just 22.5% of the variance in English achievement. Words in Sentences requires test-takers to study the grammatical function of words in the sentences and measure grammatical sensitivity (one part of language analysis ability). LLAMA E, a sound–symbol correspondence test, requires test-takers to study the relationship between sounds presented auditorily and an unfamiliar writing system, which measures phonetic coding ability. The above analysis suggests that while rote memory plays a less important role, phonetic coding ability and language analysis ability are good predictors for English proficiency. This finding is consistent with Li’s (Reference Li2016) study, which found rote memory was the weakest predictor for second language proficiency.

As to why the other subtests do not enter the equation, there are three possible reasons. First, this may be due to the nature of the English test (which is an achievement test). The test was developed by the teachers themselves based on the English curriculum and textbook. And to achieve a good score, students usually prepare and practice for the test by reviewing the textbook. To further examine the relationship between LA and foreign language achievement, a standardized foreign language test should be used to better examine test-takers’ proficiency in a foreign language. Second, both FLAT-C and LLAMA have not been standardized or benefited from wide-ranging validation studies. Until this is done, any claims involving either of the tests will have to be limited in the claims that are made. Thirdly, and developing the theme of better validation, the test-takers in the current study were sampled in one university and have achieved a high level in English, which may influence the testing results and attenuate any correlations that are found. Test-takers from different regions with diverse foreign language proficiency could be sampled to further examine the effectiveness of the two instruments.

Conclusion

The current study has explored the concurrent and criterion-related validity of FLAT-C, with LLAMA as the aptitude test external criterion and the final-term English test as the criterion regarding achievement. First, the results have shown that although the FLAT-C is related to LLAMA, the two LA tests have clear differences. Second, FLAT-C explained 8.5% of the variance in English achievement. The current study provides empirical evidence for the validation of FLAT-C, which is helpful and meaningful to advance the theory, tests, research, and practice of foreign language aptitude in China.

Although the FLAT-C appears to have reasonable concurrent and predictive validity, there is still considerable room for improvement. First, a measure of working memory could be included in the test. Working memory has attracted increasing attention since the multi-component working memory model was proposed by Baddeley (Reference Baddeley1986). Since working memory is closely related to some advanced cognitive activities in humans, for example, reading comprehension, reasoning, etc., this skill has attracted the attention of SLA researchers in recent years. Researchers have confirmed that there is a significant correlation between working memory and second language achievement (Harrington & Sawyer, Reference Harrington and Sawyer1992; Miyake & Freidman, Reference Miyake, Freidman, Healy and Bourne1998; Sagarra, Reference Sagarra, Han and Park2007; Sagarra & Herschensohn, Reference Sagarra and Herschensohn2010; Walter, Reference Walter2004), and researchers have proposed that working memory should be incorporated into a foreign language aptitude model (Miyake & Freidman, Reference Miyake, Freidman, Healy and Bourne1998; Sawyer & Ranta, Reference Sawyer, Ranta and Robinson2001; Skehan, Reference Skehan and Robinson2002). Wen (Reference Wen, Wen, Skehan, Biedroń, Li and Sparks2019) has explored the relationship between working memory and LA and proposed the Phonological/Executive model, which could be useful for extending and revising LA theory and LA test development. At present, working memory has been incorporated and measured in the Hi-LAB (Doughty, Reference Doughty2019), which provides a reference for LA test development. Chinese researchers could further explore the relationship between working memory and LA, examine the nature, function, and measurement of working memory, and incorporate working memory in foreign language aptitude theory and tests.

Second, FLAT-C could be improved by including a measure of implicit LA. Although the study of implicit language learning has a long history in the SLA field, implicit foreign language aptitude has only been studied in recent years. Few research studies have examined and discussed the concept of implicit foreign language aptitude, and so progress in understanding has been slight. However, recent years have seen a change in popularity (DeKeyser, Reference DeKeyser, Wen, Skehan, Biedroń, Li and Sparks2019), and some researchers have tried to explore the measurement of both explicit and implicit foreign language aptitude, and their impact on foreign language learning (Granena, Reference Granena, Granena and Long2013, Reference Granena2016; Li & Qian, Reference Li and Qian2021; Suzuki, Reference Suzuki2021; Suzuki & DeKeyser, Reference Suzuki and DeKeyser2017). This recent work provides a reference for the development of foreign language aptitude tests in China. Future LA tests should develop test items to measure implicit foreign language aptitude, which will be helpful in balancing what aptitude tests cover, for Chinese native speakers and others. Even so, from the present results, especially from the factor analysis, it appears that developing aptitude tests targeting the construct of implicit learning is one of the greatest of challenges.

Third, it is necessary to conduct studies for further validation and standardization of the FLAT-C with larger samples of participants before the measure can be utilized on a large scale. Additional evidence and data need to be collected for validation, reliability analysis, and standardization, all of which will provide important and useful information for LA test development. It is expected that FLAT-C could play a more important role in foreign language teaching (e.g., aptitude–treatment interaction or diagnostic language learning), screening and cultivation of foreign language talents, and foreign language research.

7 Testing Language Aptitude for Recently Arrived Parent–Child Immigrant Dyads

On the Over-Representation of Western Undergraduates in Applied Psychology and SLA Studies

In 2010, psychologists Heinrich, Heine, and Norenzayan published a series of papers in which they criticized the over-representation of Western undergraduates as participants in applied psychology studies. They argued that “people from Western, educated, industrialized, rich and democratic (WEIRD) societies – and particularly American undergraduates – [were] some of the most psychologically unusual people on Earth” (Henrich et al., Reference Henrich, Heine and Norenzayan2010a, p. 29) but were nevertheless largely over-represented in scientific studies, “a randomly selected American undergraduate [being] more than 4,000 times more likely to be a research participant than is a randomly selected person from outside of the West” (Henrich et al., Reference Henrich, Heine and Norenzayan2010b, p. 63).

This conclusion came from their comprehensive review of the (non‑)universality of psychological traits and behaviors, where they discussed the differences between (1) the results of subjects from industrialized societies and subjects from small-scale societies, (2) Western and non-Western participants, (3) contemporary Americans and the rest of the West, and (4) typical contemporary American subjects (i.e., undergraduates or highly educated participants) and other Americans (Henrich et al., Reference Henrich, Heine and Norenzayan2010b). These comparisons showed that (American) undergraduates differed from other populations in traits as varied as visual perception, moral reasoning, or perception of choice. Maybe the most compelling evidence of this outlier status of American undergraduates appears in results from different populations on the Müller-Lyer illusion. In this experiment, participants see two lines that are perceived as being of different lengths even if they are of the same lengthFootnote ¹. In experiments manipulating the length of the two lines until they are perceived to be the same length, American undergraduates and American children scored at the extreme range of the spectrum (needing line A to be a fifth longer than line B before perceiving them as equal) and differed significantly from all other samples, while many other populations were not distinguishable from one another (Segall et al., Reference Segall, Campbell and Herskovits1966, cited by Henrich et al., Reference Henrich, Heine and Norenzayan2010b). In this example, as in many others cited by the authors, American undergraduates are the prototype of a category of “WEIRD” subjects that are not representative of the rest of the world’s population. Note, however, that rather than a simple dichotomy of WEIRD vs. non-WEIRD, we would argue that some participants might share some traits with this prototype but not others (for instance, individuals from Western, educated, industrialized, rich but dictatorial societies) and that we can all be placed on a continuum of “WEIRDness.” Also, as we will discuss at the end of this chapter, depending on their personal trajectories, even people coming from prototypically WEIRD societies might score differently to other “WEIRD participants” after experiencing trauma or other life events, such as migration.

The fact that American undergraduates differ from other populations on traits and behaviors that were considered universal should encourage us to reassess the recruitment of study subjects in applied linguistic studies as well. Indeed, as Bigelow and Tarone (Reference Bigelow and Tarone2004, p. 690) note, “none of the studies published in TESOL Quarterly during the past ten years documents the SLA processes of post-critical period L2 learners who have low L1 literacy” (i.e., individuals who we would probably not consider as typically “WEIRD”). As we will argue below, this over-representation of college students is also compelling as far as research on aptitude (using the LLAMA or other tests) is concerned.

Convenience Sample and Research on Foreign Language Learning Aptitude

Modern research on foreign language aptitude was born with convenience sampling methods. Commissioned and paid for by the U.S. Department of Defense, the first foreign language aptitude test batteries were developed for and tested on young Americans recruited for army service and needing to be trained quickly and efficiently on languages of military importance (Asher, Reference Asher1977; Carroll, Reference Carroll1964, Reference Carroll1973, Reference Carroll and Diller1981; Parry & Child, Reference Parry and Child1990; Petersen & Al-Haik, Reference Petersen and Al-Haik1976). Note that this hasn’t changed much lately: As a matter of fact, new test batteries, such as the Hi-LAB, are still being tested on employees of the U.S. military and from various U.S. government agencies that commissioned the research (Linck et al., Reference Linck, Hughes and Campbell2013, and see Skehan, Chapter 17, this volume).

Of course, research on aptitude does not only consist of the development of tests. During recent decades, aptitude research has shifted from its initial goal of selectionFootnote ² to research aiming to understand individual differences, with the objective of developing methods adapted to each learner’s strengths and weaknesses, or aptitude–treatment interaction (ATI) research (for a recent review, see Granena & Yilmaz, Reference Granena and Yilmaz2018). Note, however, that most, if not all, ATI studies were run either in lab settings or with college students taking foreign language classes as part of their academic course of study.

Another important field of study in aptitude research concerns its relation to the age of onset. Researchers in this paradigm work on foreign language aptitude with the aim of better understanding the advantage of young learners in terms of L2 ultimate attainment (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008, Reference Abrahamsson and Hyltenstam2009; DeKeyser, Reference DeKeyser, Gass and Mackey2012; DeKeyser et al., Reference DeKeyser, Alfi-Shabtay and Ravid2010; Granena, Reference Granena2012; Granena & Long, Reference Granena and Long2013). The authors working on this topic usually follow the assumption that adults who develop high proficiency in an L2 would have, for some reason, kept the ability to learn foreign languages, an ability that others would have lost during childhood or adolescence (Carroll, Reference Carroll1973; Selinker, Reference Selinker1972 cited by Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008). Participants in studies on age and aptitude are usually recruited outside of academia: DeKeyser’s (Reference DeKeyser2000) participants were Hungarian immigrants in the United States recruited via ads, flyers, and word of mouth; Abrahamsson & Hyltenstam’s (Reference Abrahamsson and Hyltenstam2008, Reference Abrahamsson and Hyltenstam2009) participants were Spanish-speaking immigrants in Sweden recruited through advertisements in newspapers and posters on university campuses (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008, p. 491); DeKeyser et al.’s (Reference DeKeyser, Alfi-Shabtay and Ravid2010) participants were Russian immigrants in the United States (Study 1) or Israel (Study 2), also recruited via ads and flyers posted in public places; and Granena’s (Reference Granena2012, Reference Granena2014) participants were similarly recruited via published ads in the Chinese community in Spain. Note that the Spanish-speaking immigrants in Sweden, Russian immigrants in the United States and Israel, and Chinese-speaking immigrants in Spain were all recruited under the condition of having stayed for about a decade in their new country and having an educational level of no less than high school (Granena) or senior high school (Abrahamsson & Hyltenstam). In the study on Russian speakers in Israel and the United States, education was not a recruiting criterion, but the majority of the participants nevertheless had college degrees and were working in high or intermediate positions.

Testing Foreign Language Aptitude

The close-to-exclusive use of undergraduates, highly educated immigrants, or military employees as participants in research on aptitude can be explained in two ways. If convenience is certainly one reason, a potential incompatibility of the tests for other populations could also be an important factor. As DeKeyser (Reference DeKeyser, Wen, Skehan, Biedroń, Li and Sparks2019, p. 320) reminds us, foreign language learning tests were developed with people “with at least a high school education” in mind. The tests themselves might, therefore, not be valid for other, less educated populations. This might particularly be the case for paper-and-pencil tasks, such as the Modern Language Aptitude Test (MLAT) or the Pimsleur Language Aptitude Battery (PLAB), which were developed in the 1960s at the peak of research on aptitude.

In this perspective, the LLAMA tests developed by Meara and colleagues (Meara, Reference Meara2005) in the early 2000s seem to constitute a good alternative. The LLAMA tests are computer-run and picture-based exercises simulating the learning of an artificial language. They can be used by speakers from any L1 because they do not rely on any specific language system, and their user-friendly interface makes them a priori easily usable for any type of population, including children (but see Bokander, Chapter 5, this volume, for a critique of the tests).

The first subtest, LLAMA_B, simulates vocabulary learning. The participant has 120 seconds to learn the names of a set of invented objects (drawings) in an unknown language (training phase). They are then tested on their learning.

LLAMA_D is a sound discrimination and recognition task. It is intended to measure the participant’s ability to recognize oral patterns in an unknown language. Participants first hear a series of sounds (training phase) and then must discriminate between new and previously heard items (test phase).

LLAMA_E measures sound–symbol association ability. For 120 seconds, participants learn the relationship between twenty-two sounds and twenty-two written forms in an artificial scriptural system (training phase). They then hear a word and choose the correct written word between two variants (test phase).

LLAMA_F measures inductive learning ability. For 300 seconds, participants infer the grammatical system of an artificial language with a set of visual and written stimuli (training phase). They then choose the grammatically correct variants out of two new stimuli (test phase).

Testing Aptitude in Inconvenient Samples: The Language Aptitude Outside the Classroom (LAOC) Study

The aim of this chapter is to discuss the use of the LLAMA aptitude tests with non-“typically WEIRD” (i.e., not undergraduates or other highly educated) populations. We report on a recent longitudinal research project investigating age in relation to aptitude and exposure in recently arrived immigrant families.

Participants were 51 Spanish-speaking parents (46 mothers and 5 fathers) and 51 children (27 girls, 24 boys, mean age 10, $SD = 3$ ) who arrived in the United States after 2016 (see Table 7.1). Participants’ countries of origin reflect the current waves of immigration in the United States: Venezuela $(n = 32)$ , the Dominican Republic $(n = 28)$ , and Honduras $(n = 18)$ formed the larger groups, followed by Ecuador $(n = 10)$ , Mexico $(n = 4)$ , Bolivia $(n = 2)$ , El Salvador $(n = 2)$ , Peru $(n = 2)$ , Puerto Rico $(n = 2)$ , and Colombia $(n = 2)$ .

Table 7.1 Description of the sample: Age, gender, and length of residence at T1 (in months) of the adults (first row) and children (second row)

Group	Age	Gender		LoR T1
	Mean (SD)	Females	Males	Mean (SD)
Adults (n = 51)	38 (7)	46	5	20 (14)
Children (n = 51)	10 (3)	27	24	18 (12)
Total (n = 102)	N/A	73	29	19 (13)

At the time of the study, participants were living in households composed of 1–5 adults and 1–5 children, most of them in two Hispanic boroughs of NYC (Queens and South Bronx). With few exceptions, participants’ occupation after immigration could be considered to be lower status than what would be expected from the education they received in their home country. Many participants, who before emigrating were accountants $(n = 14)$ , in charge of marketing/HR $(n = 6)$ , home designers $(n = 4)$ , doctors $(n = 1)$ , psychologists $(n = 2)$ , or engineers $(n = 3)$ were, at time of the study, home attendants $(n = 10)$ , cleaners $(n = 7)$ , or working in restaurants/food-processing industries $(n = 6)$ and other service industries $(n = 10)$ .

The vast majority of the participants (45 of 51 families) had not planned to emigrate and had not prepared for their migration by, for instance, taking extra English classes. As a result, 67% of the adults and 53% of the children reported knowing “only basic words (for instance, numbers, colors, thank you, hello, etc.)” when they arrived in the United States, and 23% of the adults and 20% of the children reported knowing “just enough English for basic daily needs, but with difficulties to express [them]self.” Additionally, 16% of the children (and one adult) reported not knowing any English at the time of arrival.

In the LAOC project, participants’ English proficiency was assessed longitudinally at three equally spaced times (every six months) during a one-year period: (1) as soon as possible after their arrival in the United States, (2) six months (±2 weeks) after this first data collection session, and (3) twelve months (±2 weeks) after the first data collection session. During the first data collection session, participants’ foreign language aptitude (LLAMA test battery) and working memory (backward digit span and Corsi block performance) were assessed to serve as predictors of proficiency development.

For each pair of participants, the adult and child were assessed together by the same experimenter. At T1, participants’ aptitude and working memory were tested on two laptops (one for the adult and one for the child). The experimenter gave the instructions in Spanish and allowed time for participants to ask any clarification questions. She also asked either the child or the adult to reformulate the instructions in their own words when she felt it was needed. Participants were allowed to take breaks whenever necessary, and the experimenter always provided juice and cookies for both the adult and the child.

During the second half of the session, participants’ English proficiency was assessed. While participant 1 (the adult) answered the comprehension task, which focused on tense, on a laptop with headphones, participant 2 (the child) performed two verbal fluency tasks and an oral narrative (frog story) with the experimenter. After reversing the tasks (the child using the laptop for the verbal tense comprehension task, the adult with the experimenter for the oral tasks), parent and child answered a short questionnaire in Spanish about their exposure to English, their anxiety, and other sociodemographic questions. No questions were asked specifically about familiarity with computer use, but all the adults and 46 out of the 51 children reported using the Internet to communicate either in Spanish or English. The first session lasted for 90–120 minutes (about 45–60 minutes for the cognitive tests, and about 45–60 minutes for the English proficiency tests and sociobiographic questionnaire). At T2 and T3, participants completed only the English proficiency tasks and the questionnaire on exposure and anxiety. The T2 and T3 sessions lasted for 45–60 minutes.

Comparing Results from Different Samples on the LLAMA Tests

In the following, we compare the results of the participants in the LAOC study on the LLAMA tests to the results of participants from other studies published in the last decade. To do so, we ran a comprehensive search on Google Scholar for any studies that used the test battery (with the keywords “LLAMA” + “aptitude”) between 2010 and 2020. This search yielded thirty-four studies containing sufficient information on participants, means, and standard deviations to run an analysis of variance (ANOVA) using summary data. To compare the results of the LAOC participants to other studies, we categorized the participants as “adults” (more than 18 years old), “children” (less than 13 years old), and “teenagers” (between 13 and 18 years old).

The majority of studies had been conducted with adult participants: college students (Benson & DeKeyser, Reference Benson and DeKeyser2019; Bokander & Bylund, Reference Bokander and Bylund2020; Chaffee et al., Reference Chaffee, Lou and Noels2019, Reference Chaffee, Lou, Noels and Katz2020; Cox et al., Reference Cox, Lynch, Mendes and Zhai2019; Curcic et al., Reference Curcic, Andringa and Kuiken2019; del Mar Suárez & Gesa, Reference del Mar Suárez and Gesa2019; Drozdova et al., Reference Drozdova, Van Hout and Scharenborg2017; Granena, Reference Granena2019; Hamrick, Reference Hamrick2015; Huang et al., Reference Huang, Loerts and Steinkrauss2020; Ishikawa, Reference Ishikawa2019; Jackson, Reference Jackson, Miller, Martin and Eddington2014; Kachinske, Reference Kachinske2016; Ma et al., Reference Ma, Yao and Zhang2018; Michaud, Reference Michaud2020; Moon, Reference Moon2012; Moorman, Reference Moorman2017; Rodríguez Silva, Reference Rodríguez Silva2017; Rogers et al., Reference Rogers, Meara, Barnett-Legh, Curry and Davie2017; Saito, Reference Saito2017, Reference Saito2019; Saito et al., Reference Saito, Sun and Tierney2018; Tran, Reference Tran2019; Yalçın et al., Reference Yalçın, Çeçen and Erçetin2016; Yilmaz, Reference Yilmaz2013; Yilmaz & Granena, Reference Yilmaz and Granena2016), long-term immigrants (Granena, Reference Granena2012, Reference Granena2016), missionaries (Larson-Hall & Dewey, Reference Larson-Hall and Dewey2012), or adults taking foreign language classes outside the college (Artieda & Muñoz, Reference Artieda and Muñoz2016). The remaining studies tested aptitude in groups of teenagers (del Mar Suárez & Gesa, Reference del Mar Suárez and Gesa2019; Turker et al., Reference Turker, Seither-Preisler, Reiterer and Schneider2019; Yalçın & Spada, Reference Yalçın and Spada2016) and children (Christiner et al., Reference Christiner, Rüdegger and Reiterer2018; Kourtali, Reference Kourtali2018; Kourtali & Révész, Reference Kourtali and Révész2020; Lambelet & Berthele, Reference Lambelet, Berthele, Wen, Skehan, Biedroń, Li and Sparks2019; Rogers et al., Reference Rogers, Meara, Barnett-Legh, Curry and Davie2017; Rüdegger, Reference Rüdegger2017).

ANOVAs were run to determine between the results with the LAOC participants and the results of the participants from the former studies using the rpsychi package for R (Okumura & Okumura, Reference Okumura and Okumura2012). In the case of the studies where different groups were compared (e.g., intervention studies), we included the different groups in the analysis.

LLAMA_B

The LLAMA_B test, the measure of vocabulary learning (or rote memory), has a score range of 0–100. The LAOC study participants’ score was in the lower range of the spectrum, with a mean of 20.92 $(SD = 12.12)$ for the adults $(n = 50)$ and 29.33 $(SD = 13.96)$ for the children $(n = 51)$ . The score of the adults is significantly lower than their children’s $(t (49) = - 3.60, p < .001)$ .

As can be seen in Figure 7.1 and Table 7.2, the LAOC adult participants’ score was not only significantly lower than their own children, but they also scored 33.7 points lower than the average of the adults in the other studies (95% CI [–39.47, –27.93]) and 34.5 points lower than the average of all the participants from the other studies (95% CI [–40.08, –28.89]). The LAOC adult participants even scored significantly lower than the children (–8.37 points lower, 95% CI [–12.80, –3.94]) and the teenagers of the other studies (–23.4 points lower, 95% CI [–28.71, –18.14]).

Figure 7.1 Means and standard deviations of the adults (circles), children (squares), and teenagers (triangles) of the LAOC project (in gray) and 42 other samples (in black) on the LLAMA_B subtest.

Table 7.2 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_B subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

	Comparison to ALL other participants from former studies	Comparison to the children from former studies	Comparison to the teens from former studies	Comparison to the adults from former studies
LAOC adults	Mean diff = −34.48 95% CI [−40.08, −28.89]	Mean diff = −8.37 95% CI [−12.801, −3.943]	Mean diff = −23.4 95% CI [−28.71, −18.14]	Mean diff = −33.70 95% CI [−39.47, −27.93]
LAOC children	Mean diff = −25.88 95% CI [−31.43, −20.34]	Mean diff = 0.01 95% CI [−4.74, 4.76]	Mean diff = −15.04 95% CI [−20.35, −9.73]	Mean diff = −24.27 95% CI [−29.95, −18.59]

In contrast, the LAOC children scored similarly to the average of the children of the other studies (and, as expected, lower than the adults, teenagers, and all other participants; see Table 7.2 for details).

LLAMA_D

The LLAMA_D subtest measures phonemic discrimination; its possible scores range from 0 to 75. The LAOC study’s participants ranged between 0 and 60 for the adults $(mean = 21.9, SD = 14.5, n = 51)$ and between 0 and 50 for the children $(mean = 23.12, SD = 16.56, n = 51)$ . The difference between adults and children was not significant $(t (50) = - 0.69, p = .49)$ (See Figure 7.2).

Figure 7.2 Means and standard deviations of the adults (circles), children (squares), and teenagers (triangles) of the LAOC project (in gray) and 42 other samples (in black) on the LLAMA_D subtest.

The adult and child participants’ scores from the LAOC project showed similar trends when compared to the participants in the other studies: Both groups scored significantly lower than the average of the adults in the other studies (8.7 points lower for the adults, 7.2 points lower for the children) and lower than all the participants when taken together (7.4 points lower for the adults, 6.1 points lower for the children). Both groups scored similarly to the average of children and teenagers from the other studies (see Table 7.3)

Table 7.3 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_D subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

	Comparison to ALL other participants from former studies	Comparison to the CHILDREN from former studies	Comparison to the TEENS from former studies	Comparison to the ADULTS from former studies
LAOC Adults	mean diff = −7.35 95% CI [−11.72, −2.99)	mean diff = −1.54 95% CI [−5.63, 3.52]	mean diff = −4.26 95% CI [−9.51, 1]	mean diff = −8.73 95% CI [−13.06, −4.41]
LAOC Children	mean diff = −6.11 95% CI [−10.45, −1.76]	mean diff = 0.21 95% CI [−4.56, 4.98]	mean diff = −4.27 95% CI [−9.52, 1]	mean diff = −7.24 95% CI [−11.58, −2.91]

LLAMA_E

The LLAMA_E measures sound–symbol association ability with scores ranging from 0 to 100. The LAOC study’s adult participants scored slightly higher than their children (adults: $mean = 43.98, SD = 17.9, n = 51$ ; children: $mean = 35.42, SD = 28.1, n = 50$ ), but this difference was only marginally significant $(t (49) = 2.0, p = .05)$ (see Figure 7.3).

Figure 7.3 Means and standard deviations of the adults (circles), children (squares) and teenagers (triangles) of the LAOC project (in gray) and 39 other samples (in black) on the LLAMA_E subtest.

Both the adults and the children from the LAOC study scored significantly lower than the adults (30.3 and 37.8 points lower, respectively), teenagers (−12.3 and −20.8 points, respectively), and all participants taken together (−24 and −32.8 points, respectively) but did not differ significantly from the children in the other studies (see Table 7.4).

Table 7.4 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_E subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

	Comparison to ALL other participants from former studies	Comparison to the CHILDREN from former studies	Comparison to the TEENS from former studies	Comparison to the ADULTS from former studies
LAOC Adults	mean diff = −24.01 95% CI [−31.06, −16.97]	mean diff = 3.88 95% CI [−5.11, 12.87]	mean diff = −12.26 95% CI [−21.35, −3.17]	mean diff = −30.29 95% CI [−36.51, −24-07]
LAOC Children	mean diff = −32.79 95% CI [−39.91, −25.68]	mean diff = −6.25 95% CI [−15.63, 3.12]	mean diff = −20.82 95% CI [−29.9, −11.74]	mean diff = −37.84 95% CI [−44.69, −30.99]

LLAMA_F

The LLAMA_F measures inductive learning ability; its possible scores range from 0 to 100. The children from the LAOC study scored higher than their parents (adults: $mean = 15.78, SD = 16.6$ , children: $mean = 20.31, SD = 19.5$ ) but this difference was not significant $(t (47) = - 1.45, p = .15)$ (see Figure 7.4).

Figure 7.4 Means and standard deviations of the adults, children, and teenagers of the LAOC project and 46 other samples on the LLAMA_F subtest.

As shown in Table 7.5, the adults of the LAOC project scored similarly to the children from the other studies, and significantly lower than the teenagers (−20.7 points), adults (−40.7 points), and all participants combined (−31.3 points). On the other hand, LAOC children did not score significantly lower than the other children and teenagers but did score lower than the adults (−30.5 points) and all participants combined (−26.6 points).

Table 7.5 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_F subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

	Comparison to ALL other participants from former studies	Comparison to the CHILDREN from former studies	Comparison to the TEENS from former studies	Comparison to the ADULTS from former studies
LAOC Adults	mean diff = −31.26 95% CI [−37.63, −24.89]	mean diff = −4.34 95% CI [−10.47, 1.8]	mean diff = −20.71 95% CI [−28.12, −13.31]	mean diff = −40.72 95% CI [−47.1, −34.34]
LAOC Children	mean diff = −26.63 95% CI [−33.12, −20.15]	mean diff = 0.46 95% CI [−6.39, 7.31]	mean diff = −16.18 95% CI [−23.82, 8.54]	mean diff = −30.51 95% CI [−36.98, −24.02]

These comparisons between the results of the LAOC participants and participants from former studies on the four LLAMA subtests show that, while the children of the LAOC study score similarly to their peers from other studies (and even similar to the teenagers on the LLAMA_B and LLAMA_F subtests), their parents consistently score lower than other adults from former studies, and even lower than the average of children on the LLAMA_B. These differences are summarized in Table 7.6.

Table 7.6 Mean scores for each group of participants on each of the four LLAMA subtests

	LAOC adults	LAOC children	Adults from former studies	Children from former studies	Teens from former studies
LLAMA_B	20.92	29.33	54.62	29.29	42.94
LLAMA_D	21.9	23.12	30.63	22.91	27.39
LLAMA_E	43.98	35.42	74.27	41.67	56.24
LLAMA_F	15.78	20.31	51.79	20.05	36.49

In our view, these results raise questions about the use of LLAMA tests to measure aptitude in non-“typically WEIRD” populations and call for more research with hard-to-reach populations, such as recent immigrants.

As Bigelow and Tarone (Reference Bigelow and Tarone2004) remind us, “SLA theorists routinely generalize about second-language acquisition, but their conclusions may not apply to L2 learners with interrupted educational experiences or low levels of literacy” (p. 698). In light of the results presented above, we would also argue that some SLA results may not apply to L2 learners in the beginning phases of immigration – whatever their level of literacy – particularly when they suffer from immigration stress and/or trauma resulting from, or originating before, leaving their home country and establishing themselves in a new environment.

Although foreign language learning aptitude has consistently shown its predictive power in informal and experimental settings (see Lambelet et al., Reference Lambelet, Berthele, Wen, Skehan, Biedroń, Li and Sparks2019 for a review), the results of the LAOC study in terms of their predictive effect on second language acquisition by recent immigrants show a more nuanced picture. In the remainder of this chapter, we summarize some of the main results regarding the effect of aptitude and other factors in recently arrived Spanish-speaking adult and child immigrants’ development of English proficiency.

Aptitude and Language Learning in Recently Arrived Adult and Child Immigrants

In the LAOC project, English proficiency was assessed longitudinally with three tasks: a short verbal fluency task, an oral narrative (frog story), and a verbal tense listening comprehension task. In this chapter, we focus on the verbal fluency task, but the results are similar for the other two measures of proficiency (see Lambelet [Reference Lambelet2021] for the results regarding lexical diversity of the oral narrative, and Lambelet et al. [in preparation] for the results of the listening comprehension task). Participants’ English verbal fluency was assessed by asking them to generate as many animal names as possible within a one-minute interval. At T2 and T3, their Spanish verbal fluency was also assessed by having them produce as many fruit/vegetables names in Spanish as possible in one minute. This animal naming verbal fluency task is a measure of how accessible the vocabulary is that has been used with bilinguals/L2 learners as a measure of proficiency and/or to document the bilingual lexicon’s organization (see, for instance, Escobar et al., Reference Escobar, Kalashnikova and Escudero2018; Portocarrero et al., Reference Portocarrero, Burright and Donovick2007; Rosselli et al., Reference Rosselli, Ardila and Salvatierra2002).

Adults’ and children’s scores at T1, T2, and T3 are presented in Table 7.7. Both groups show a progression between T1 and T3, and the children scored higher than their parents each time. This pattern of results is similar to our participants’ scores on the other two measures of English proficiency (Lambelet, Reference Lambelet2021). Verbal fluency scores also correlated moderately to highly with the lexical diversity of their oral narratives $(T 1 : r = .64, p < .001, T 2 : r = .63, p < .001, T 3 : r = .66, p < .00)$ and verbal tense listening comprehension $(T 1 : r = .61,, p < .001,, T 2 : r = .41,, p < .001,, T 3 : r = .33,, p = .002)$

Table 7.7 Mean and SD scores of the adults (first row) and children (second row) on the verbal fluency task at the three data collection times

	T1	T2	T3
	Mean (SD)	Mean (SD)	Mean (SD)
Adults	7.4 (3.4)	9.7 (4.2)	10.9 (3.8)
Children	9.1 (4.2)	11.1 (4.4)	13.1 (4.4)

To investigate the predictive effect of aptitude and other cognitive and contextual–affective factors, we fitted backward elimination linear mixed-effects models to the data using the lmer() function of the lme4 package for R (Bates et al., Reference Bates, Maechler and Bolker2012). Data collection time (i.e., progression), length of residence, LLAMA_B, LLAMA_D, LLAMA_E, LLAMA_F, backward digit span, Corsi block span, Spanish verbal fluency, exposure to English, and anxiety when speaking in English were included in the analysis as fixed effects. To control for the influence of household, the dyad was included as a random effect with a random intercept and random slope for time.

Time was treated as factors (T1, T2, T3), while the other independent variables were mean-centered and standardized by group (z scores). At each step of the analysis, the predictor with the largest nonsignificant p-value was removed until the simplest model was found. At each step, the model with the predictor was compared to the model without the predictor to determine the predictor’s significance by using the likelihood ratio test in the ANOVA() function of the lme4 package. The explained variance (marginal and conditional R²) of the best-fitting model was then computed using the rsquaredGLMM() function of the MuMIn package for R.

The results of the best-fitting model (marginal $R^{2} = .57$ , conditional $R^{2} = .80$ ) for the child group are presented in Table 7.8. The findings indicate progression between T1 and T3, and a significant predictive effect of LLAMA_B, exposure to English, anxiety, and Spanish verbal fluency. More precisely, at T2 (intercept), for each increase of one standard deviation on LLAMA_B, the children scored 1.11 units higher on verbal fluency; for each increase of one standard deviation in exposure to English and Spanish verbal fluency, they increased their score by about 1.45 units; and for each increase of one standard deviation in anxiety, their score decreased by 1.35 units.

Table 7.8 Best-fitting model for the child group

	Best-fitting model Children
Fixed effects	Estimate	SE	P
Intercept (T2, all other predictors = 0)	11.14	.51	<.001
T1	−2.01	.49	<.001
T3	1.90	.49	<.001
LLAMA_B (z_score)	1.11	.45	.02
Exposure to English (z_score)	1.45	.55	.01
Anxiety (z_score)	−1.35	.59	.03
Spanish verbal fluency (z_score)	1.46	.50	.006

In the best-fitting model for the adult group $(marginal R^{2} = .52, conditional R^{2} = .78)$ , none of the four aptitude dimensions were significant predictors of verbal fluency development. As Table 7.9 shows, time, backward digit span, and exposure to English are the only significant predictors of verbal fluency. For each increase of one standard deviation in backward digit span, adult scores at T2 (intercept) increased by 1.47 units, and for each increase of one standard deviation in exposure, verbal fluency increased by 2.02 units.

Table 7.9 Best-fitting model for the adult group

	Best-fitting model Adults
Fixed effects	Estimate	SE	p
Intercept (T2, all other predictors = 0)	9.75	.47	<.001
T1	−2.39	.44	<.001
T3	1.15	.44	.01
Backward digit (z_score)	1.47	3.73	<.001
Exposure to English (z_score)	2.02	.39	<.001

These results, and in particular the effect of exposure and the lack of predictive effect of the LLAMA tests in the adult group, are consistent with what was found following the same modeling process on the other measures of English proficiency. For instance, the analysis of the development of the lexical diversity of the oral narratives (frog stories) shows that, in the entire sample, exposure to English, LLAMA_D, LLAMA_E, and length of residence are significant predictors, as well as the interaction of time by group (but not group and time per se), while in the best-fitting model for the adult group, only exposure is a significant predictor; in the child group, length of residence, LLAMA_D, LLAMA_E, and anxiety are significant predictors (Lambelet, Reference Lambelet2021).

These results contrast, nevertheless, with the findings of former studies that investigated aptitude in relation to the age of onset as a predictor of ultimate attainment. For instance, in a study with Hungarians who immigrated to the United States either as adults (late starters) or before the age of 16 (early starters), foreign language aptitude appeared as a predictor of English proficiency in late starters but not in early starters (DeKeyser, 2000). Similarly, in a study with immigrants in the United States and in Israel, correlations were found between foreign language aptitude and English proficiency scores for participants who began their L2 learning between the ages of 18 and 40, but not for early starters or those who began their learning after age 40 (DeKeyser et al., Reference DeKeyser, Alfi-Shabtay and Ravid2010). On the other hand, Abrahamsson & Hyltenstam (Reference Abrahamsson and Hyltenstam2008) found an effect of aptitude in both early and late starters in a study involving 42 Chilean immigrants in Sweden.

As a whole, the results from the three studies reviewed in the last paragraph seem to indicate that aptitude is an important predictor for proficiency development in adult learners, while its effect is less central in young starters. Note, however, that more nuanced results appear in the studies on Chinese immigrants in Spain reported in Granena and Long (Reference Granena and Long2013): While the researchers did find predicted correlations between aptitude and phonology and between aptitude and lexis in the late starter group but not in the early starters, they did not find correlations between aptitude and morphosyntax in any group.

In contrast, in the LAOC project, LLAMA_B was a predictor of proficiency development in the child group, but no effect of aptitude appeared in the adult group. For all the measures of English proficiency, exposure to the language is the main predictive factor of language development in both the child and adult groups.

Discussion and Conclusion

Three main discussion points arise from the analyses presented in this chapter. First, the comparisons between the score on the LLAMA subtests with different populations show that the LLAMA test battery may not be adapted for every population or may assess abilities that are beyond foreign language learning aptitude per se.Footnote ³ Note that the LLAMA test battery has been widely used in SLA research, despite the warning issued by its authors as long ago as 2005 that they were “exploratory versions of on-going research, and […] should NOT be used in high-stakes situations where accuracy and reliability are at a premium.” (Meara, Reference Meara2005, p. 21). Their internal validity has also been questioned recently (Bokander & Bylund, Reference Bokander and Bylund2020) and a new version has been developed and tested (Rogers et al., Chapter 3, this volume). We look forward to learning if this new version of the LLAMA test battery will prove reliable in underserved populations.

From a wider perspective, the results presented in this chapter call for more research with populations that may seem harder to reach and that are therefore poorly represented in SLA and other neighboring disciplines.

Note that in addition to the difficulty in terms of access and recruiting, research with less represented populations is also challenging in terms of analysis and conclusions. In the words of Medin et al. (Reference Medin, Ojalehto, Marin and Bang2017), “[w]hen data are reported from non-WEIRD samples, there is a burden to compare the results with WEIRD populations and explain any differences” (Medin et al., Reference Medin, Ojalehto, Marin and Bang2017). In the case of the analyses presented in this chapter, the differences between the LAOC participants and other participants on the LLAMA tests (as well as the lack of predictive effect of aptitude on language proficiency development in the adults) cannot be explained solely by differences in the educational background; many participants of the LAOC study are actually highly educated, as was the case in the other studies on immigrants’ ultimate attainment, and the children’s scores are, broadly, within the range of scores of the children of the other studies reviewed earlier. This would not be expected if their SES level were the main explanation for the results.

It could, then, be argued that the low results on the LLAMA tests obtained by the LAOC adult participants were caused by environmental disturbances while taking the tests: Participants were tested either in community centers, libraries, or at their own home, not in quiet labs as has been the case in other studies. In our opinion, this may nevertheless be a strength of the LAOC project rather than a weakness: The testing conditions reflect the language learning environments that the participants experience in their immigration journey. The fact that recently arrived immigrants often live in overcrowded homes, unsafe areas, and segregated neighborhoods may actually impact their ability to learn the majority language, and these factors deserve more investigation in SLA research.

In fact, exploring other factors, for example, (post-traumatic) stress, difficulties in concentrating, or cultural differences, could explain the non-WEIRD samples’ results on the various measures we use in SLA research. This includes, but is not limited to, measures of foreign language aptitude. In our opinion, taking such an approach is crucial for a better understanding of the issues at stake when learning a second language in less ideal conditions than the language classroom.

This brings us to the last point we would like to address in this chapter. If the work of Heinrich, Heine, and Norenzayan on the over-representation of WEIRD subjects in psychology (and, by extension, SLA) studies is indeed thought-provoking and raises important issues that need to be addressed in our field, we believe that the complete picture is more complex than a mere distinction between WEIRD and non-WEIRD samples. First, the definition of “Western, educated, industrialized, rich, and democratic” can, and should, be discussed (for instance, which societies are to be considered as Western, or as democratic, can be debated). Second, many of our participants grew up in industrialized/rich societies and were highly educated; they would therefore be expected to score highly on the tests but in fact still recorded surprisingly low scores. Maybe what they lack is not “coming from WEIRD societies” but simply being wired, that is, connected and integrated into the social system they are beginning to navigate.

The LAOC adult participants’ low scores remind us of the results of studies on language learning in individuals suffering from post-traumatic stress disorder (PTSD). Søndergaard (Reference Søndergaard2017), for instance, reports on a study where different memory and IQ dimensions were tested in refugees suffering from PTSD. These subjects did significantly worse on the Benton Visual Retention Test, a measure of the ability to convert visual impressions into episodic memory before consolidating them into long-term memory, compared to other non-PTSD subjectsFootnote ⁴. They also made less progress in the majority language than their peers during the period of the study. More recently, refugee children were found to score much lower than other immigrant children on several measures of working memory/executive functions (Delage & Franck, Reference Delage and Franck2021). The LAOC participants were not tested for PTSD, but several of them shared privately their traumatic experiences before and after immigration, and many of them were visibly worried about immigration status, housing, access to health insurance, and other pressing needs. In the words of Métraux (Reference Métraux2017), their senses might well have been saturated at the time of study because they were finding themselves in what he calls states of survival (états de survie in the original), which could explain their results on the aptitude tests as well as their difficulties with learning English.

We do not want to elaborate further on these hypotheses since they were not the primary aim of the LAOC study, but we would like to argue once again for the need for more research with recently arrived immigrants and other populations for whom learning a second language is both a necessity and a hardship. Working with less represented populations will also help us as researchers to “examine some of [our] own uncritically accepted cultural assumptions and presuppositions, founded in [our] own literacy, on [our] perceptions as researchers” (Bigelow & Tarone, Reference Bigelow and Tarone2004, p. 698), and therefore help us understand the challenges encountered by many immigrants when learning the majority language of their new country of residence. Besides addressing ethnocentrism, understanding the challenges faced by these underserved populations seems particularly important when we look at the numbers involved. Following the latest statistics published by the United Nations, in 2019, 3.5% of the global population (i.e., 272 million people) were international migrants (UN DESA, 2019), many of them to countries where the majority language is not their first language. In light of this, when will we move from convenience samples in SLA and aptitude research and conduct more studies in the real world?

8 Is Language Aptitude Immune to Experience? Divergent Evidence from Bilingualism vs. Blindness

Preliminaries

Research articles on language aptitude, both past and recent, nearly without exception start off with a summative definition of the construct itself, declaring that language aptitude is generally considered to be a largely innate and relatively fixed talent that is relatively independent of other internal and external factors. Strangely enough, and as recently pointed out by several researchers (e.g., Chalmers, Reference Chalmers2017; Li, Reference Li2016; Wen, Biedroń, & Skehan, Reference Wen, Biedroń and Skehan2017), this characterization of the nature and origin of language aptitude has rarely been challenged theoretically, let alone investigated empirically. In their overview, Wen, Biedroń, and Skehan (Reference Wen, Biedroń and Skehan2017) contended that even though research methods have changed significantly in recent years, our knowledge about language aptitude itself “has not developed much at all since it started some 50 years ago,” summarizing that “the concept has remained intact – a relatively fixed trait that is not subject to malleability by later learning experience” (p. 6.). In other words, while empirical research on language aptitude has shifted its focus tremendously during the past 20 years, from the four-componential (black box-like) Carrollian paradigm (e.g., Carroll, Reference Carroll1958, Reference Carroll and Glaser1962, Reference Carroll1973, Reference Carroll and Diller1981; Carroll & Sapon, Reference Carroll and Sapon1959) to the more open-ended (Pandora’s box-like) “aptitude complexes” framework (e.g., Doughty, Reference Doughty2019; Linck et al., Reference Linck, Hughes and Campbell2013; Robinson, Reference Robinson1997; Reference Robinson and Robinson2002; Snow, Reference Snow, Sternberg and Wagner1994; Sparks et al., Reference Sparks, Humbach, Patton and Ganschow2011), the traditional branding of language aptitude as a largely innate and relatively stable trait has stubbornly persisted. Unfortunately, this persistence not only runs the risk of fueling the already next-to-mystical reputation of language aptitude, but it also seems to have turned the concepts of innateness and stability into an ever-growing elephant in the room. Chalmers (Reference Chalmers2017) was right in stating that these issues have been grossly neglected, especially in the light of other developments in the field, and we agree with his conclusion that “with new ways of understanding L2 aptitude more holistically […] and some researchers questioning Carroll’s original thinking […], now seems an appropriate time to revisit the issues of stability and untrainability in L2 aptitude” (p. 93).

In this contribution, we explore the question of whether there is reason to maintain the traditional view of language aptitude as a relatively fixed trait that is resistant to experience, or if it should instead be seen as a rather flexible and acquirable skill. We compare the relative experiential effects of (1) having learned an L2 and having been a long-term functional and fluent bilingual in adulthood with (2) having lived with total visual deprivation for a significant period of life. Both bilingualism and visual loss have been reported to have enhancing effects on language-related as well as non-linguistic cognition, but few studies have focused on their effects on language aptitude specifically, especially in the case of blindness. The chapter closes with a discussion on what it would mean for current views on the role of age of L2 acquisition and critical period(s) if the above-average language aptitude hitherto robustly associated with adult near-native L2 learning should turn out to be nothing but an effect of L2 learning itself.

Background

The Construct of “Language Aptitude” and Its (Lack of) Theoretical Progress

In the early days of second language acquisition (SLA) theory construction, several central scholars engaged in some quite bold speculation on the nature and origin of the unusual talent of a few exceptionally successful post-critical-period L2 learners, carefully hinting at domain-specificity, innateness, plasticity, and other Chomskyan and Lennebergian conceptualizations. For example, Selinker (Reference Selinker1972), one of the pioneers of modern-day SLA, assumed that those rare adults who do become nativelike in their L2 “have somehow reactivated the latent language structure which Lenneberg describes” (p. 212; emphasis in the original). Lamendella (Reference Lamendella1977) proposed that a nativelike adult L2 learner is an individual who has “remained more ‘plastic’ than the average member of our species” (p. 170). Carroll himself, the founder of modern aptitude research, suggested that adults with high language aptitude “are those who have for some reason lost little of the language acquisition ability with which they are natively endowed” (Carroll, Reference Carroll1973, pp. 6–7). In fact, ever since the onset of the Carrollian paradigm in the late 1950s (Carroll, Reference Carroll1958, Reference Carroll and Glaser1962; Carroll & Sapon, Reference Carroll and Sapon1959) and for some 40 years after, language aptitude was repeatedly conceptualized as

an innate, fixed, domain-specific talent for learning a foreign language with relative ease

where “ease” was usually operationalized as “speed.” To the extent this conceptualization was at all modified during this period, any changes or additions were related exclusively to the last and relatively untheoretical part of the definition (underlined above). As language aptitude testing was originally developed for the foreign language learning context, it was long believed that aptitude was an important predictor of success only for explicit learning in formal settings, whereas attitudinal or affective factors were thought to better predict the outcome of implicit, informal language acquisition (see, e.g., Gardner, Reference Gardner1985; Krashen, Reference Krashen, Burt, Dulay and Finocchiaro1977, Reference Krashen and Diller1981; Spolsky, Reference Spolsky1989). However, as research successively revealed that language aptitude was highly predictive also of naturalistic, non-tutored L2 development (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008; DeKeyser, Reference DeKeyser2000; Forsberg-Lundell & Sandgren, Reference Forsberg-Lundell, Sandgren, Granena and Long2013; Granena & Long, Reference Granena and Long2013; Harley & Hart, Reference Harley, Hart and Robinson2002; Robinson, Reference Robinson1997; Skehan, Reference Skehan1989), the conceptualization eventually needed to make room for terms like “acquisition” and “second language” along with “learning” and “foreign language.” As for “relative ease,” with time, “speed” was accompanied by measures of relative long-term achievement, such as “ultimate attainment” and the potentiality of “nativelikeness” in a non-native language (e.g., Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008; DeKeyser, Reference DeKeyser2000; Ioup et al., Reference Ioup, Boustagui, El Tigi and Moselle1994).

These modifications were motivated by emerging empirical evidence at the time and only resulted in extensions of the contextual relevance of language aptitude, while leaving the fundamental nature and origin of the construct intact. Not even Skehan’s (Reference Skehan1998, Reference Skehan and Robinson2002, Reference Skehan, Gass and Mackey2012) staged model – a quite radical rethinking and expansion of the aptitude construct, with the inclusion of various different cognitive components for different kinds of language processing at different acquisitional stages – seemed to necessitate dispensing with the Carrollian assumption that “aptitude does not change with the seasons” (Skehan, Reference Skehan and Robinson2002, p. 79).

It was not until quite recently that the more intrinsic aspects of language aptitude (as expressed in the italicized first half of the definition above) were discussed or challenged (see Chalmers, Reference Chalmers2017; Kormos, Reference Kormos, Granena and Long2013; Singleton, Reference Singleton2017). With the onset of the Robinsonian paradigm (after the work of Peter Robinson; cf. Robinson, Reference Robinson and Robinson2002, Reference Robinson2005, Reference Robinson and DeKeyser2007, Reference Robinson and Pawlak2012), the set of components that Carroll had associated with language aptitude (i.e., phonetic coding ability, grammatical sensitivity, rote learning ability, and inductive language learning ability) were increasingly being replaced by a wider complex of cognitive factors known to play important roles in language learning and in learning in general (e.g., Linck et al., Reference Linck, Hughes and Campbell2013; Robinson, Reference Robinson1997, Reference Robinson and Robinson2002; Snow, Reference Snow, Sternberg and Wagner1994; Sparks et al., Reference Sparks, Humbach, Patton and Ganschow2011). With this shift in focus, the aptitude concept began to transform from being a highly domain-specific, stable talent for learning languages – independent of other individual factors, such as general intelligence, motivation, anxiety, attitudes, and musicality – into “a more dynamic, multifaceted conglomerate of various cognitive skills” (Ameringer et al., Reference Ameringer, Green, Leisser, Turker and Reiterer2018, p. 7), or, as formulated by Wen, Biedroń, and Skehan (Reference Wen, Biedroń and Skehan2017), “something of a hybrid construct related to a number of cognitive factors creating a composite measure regarded as the general capacity to master an L2” (p. 2). In fact, from having been listed (e.g., in introductory SLA textbooks) as one of the individual factors with the strongest predictive power of language learning outcomes (the others being age of acquisition and motivation), language aptitude has, slowly and unobtrusively, become an umbrella term for a variety of individual difference constructs, while at the same time losing much of its status as a specific, well-defined construct of its own.

Another recent game changer has been the embrace of working memory (WM), specifically phonological short-term memory (pSTM), as a central ingredient of the aptitude construct. While some researchers are willing to incorporate WM as one of several cognitive components of language aptitude (e.g., DeKeyser & Koeth, Reference DeKeyser, Koeth and Hinkel2011; Kormos, Reference Kormos, Granena and Long2013; Singleton, Reference Singleton2014, Reference Singleton2017; Turker et al., Reference Turker, Reiterer, Seither-Preisler and Schneider2017), and while some actually argue for a full-on WM-based construct (e.g., Miyake & Friedman, Reference Miyake, Friedman, Healy and Bourne1998; Sawyer & Ranta, Reference Sawyer, Ranta and Robinson2001; Wen, Biedroń, & Skehan, Reference Wen, Biedroń and Skehan2017; Wen & Skehan, Reference Wen and Skehan2011; but see Wen & Skehan, Reference Wen and Skehan2021), others are more cautious, suggesting that although WM is indeed implicated in language learning, the available empirical evidence indicates that “it is better not to consider it an aptitude component because its role is not restricted to language learning” (Li, Reference Li, Wen, Skehan, Biedroń, Li and Sparks2019, p. 86). What is relevant here is that, in recent years, WM has also gone from having been defined as a stable and permanent trait to one that is more often viewed as a relatively flexible capacity susceptible to experience and training (e.g., Holmes, Gathercole, & Dunning, Reference Holmes, Gathercole and Dunning2009; Klingberg, Reference Klingberg2010; Singleton, Reference Singleton2014), although the issue is still under debate.

It should go without saying that if dynamic individual differences, such as WM, motivation, attitudes, identity, affection, personality, emotion, and anxiety, and even linguistic sub-skills like reading, writing, listening, and speaking (Sáfár & Kormos, Reference Sáfár and Kormos2008) are continuously being factored in, then the entire concept of language aptitude will forever be a moving target. Under such circumstances, the definition of language aptitude as a rigid, dispositional, domain-specific trait becomes difficult to uphold, while a more flexible, experiential, domain-general construct will gain validity.

The most common argument against the innateness/stability feature is that language aptitude is potentially sensitive to language learning and bilingualism itself. The argument thus implies that aptitude would not only affect language learning outcomes but would also be affected by activities and processes involved in second language learning and bilingual use. Rather than serving as an independent variable, language aptitude would, under such a paradigm, constitute both an independent and a dependent variable, which would be utterly problematic for theories that depend on aptitude as an explanatory factor, and not as a learning achievement, for example, theories of why some exceptional adults become (near-)nativelike in their L2 (e.g., Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008; DeKeyser, Reference DeKeyser2000; Ioup et al., Reference Ioup, Boustagui, El Tigi and Moselle1994). We will return to this dilemma in the closing section of this chapter.

Next, we give a summary review of the existing (potential) empirical support for the claim that L2 learning/bilingualism has enhancing effects on language aptitude, possibly constituting evidence that language aptitude is, in principle, flexible and a result of experience.

The Empirical Evidence for L2 Learning/Bilingualism Effects on Language Aptitude

Empirical studies that claim to challenge the innateness and fixedness of aptitude are still scarce. However, the studies that do exist have focused on the potential influence from language learning and/or bi-/multilingualism per se on learners’ language-analytic abilities. The basic idea is that, since the act of language learning implies experiencing and practicing the art of analyzing, systematizing, and memorizing linguistic materials (if only unconsciously so), and since the state of bi-/multilingualism implies juggling two or more languages on a regular and long-term basis, a byproduct in the form of enhanced language aptitude is to be expected (cf., e.g., Cenoz, Reference Cenoz2013; Chalmers, Reference Chalmers2017; Herdina & Jessner, Reference Herdina and Jessner2002; Hirosh & Degani, Reference Hirosh and Degani2017; Kormos, Reference Kormos, Granena and Long2013; Singleton, Reference Singleton2017). This reasoning might seem plausible and well-motivated, especially as the last decade has produced a large number of studies that seemed to demonstrate a “bilingual cognitive advantage” in terms of, for example, divergent thinking, enhanced WM and executive control, and even delayed symptoms of dementia (see, e.g., Bialystok, Reference Bialystok2009, Reference Bialystok2016, Reference Bialystok2017). Note, however, that this position has been seriously challenged in recent years by large-scale empirical studies (e.g., Dick et al., Reference Dick, Garcia and Pruden2019; von Bastian, Souza, & Gade, Reference von Bastian, Souza and Gade2016) as well as large meta-analyses of published and unpublished empirical research (e.g., Donnelly, Brooks, & Homer, Reference Donnelly, Brooks and Homer2019; Lehtonen et al., Reference Lehtonen, Soveri and Laine2018; Lowe et al., Reference Lowe, Cho, Goldsmith and Morton2021; Paap, Reference Paap and Schwieter2019; cf. also the discussions in Paap et al., Reference Paap, Mason, Zimiga, Ayala-Silva and Frost2020; Leivada et al., Reference Leivada, Westergaard, Andoni Duñabeitia and Rothman2021). Together, these studies suggest that there is no compelling evidence of enhanced executive functioning in bilinguals. In fact, when Lehtonen et al. (Reference Lehtonen, Soveri and Laine2018) made corrections for publication bias, effect sizes dropped to nearly zero or even became negative (potentially indicating a bilingual disadvantage).

Nevertheless, there are some studies that have investigated whether and how previous language experience has any specific effects on linguistic (rather than general) cognition, in this case language aptitude.Footnote ¹ Three different operationalizations of language experience have been utilized. The first refers to recent experience with foreign language instruction, measured through aptitude pre-tests and post-tests immediately before and after an intervening language course ranging in length from a number of weeks up to a full school year. Typical results from these studies are that there is a significant post-test gain in aptitude (e.g., Chalmers, Reference Chalmers2017; Ganschow & Sparks, Reference Ganschow and Sparks1995; Sáfár & Kormos, Reference Sáfár and Kormos2008; Sparks et al., Reference Sparks, Ganschow, Pohlman, Skinner and Artzer1992, Reference Sparks, Ganschow, Fluharty and Little1995, 1997, Reference Sparks, Artzer and Patton1998; Sparks & Ganschow, Reference Sparks and Ganschow1993) and that initially low-aptitude or at-risk students gain more than others. Chalmers (Reference Chalmers2017) showed also that L2 aptitude can be trained explicitly through various kinds of form-focused linguistic exercises. The main problem with these quasi-experimental studies is that post-tests were distributed at the end of the intervention, which made it impossible to know whether the gains in aptitude would remain stable or if they were only a short-lived recency effect. Cox et al. (Reference Cox, Lynch, Mendes and Zhai2019) are concerned that findings such as these “may not have indicated much more than a practice effect, so it remains important to investigate how bilingualism coming from other language learning experiences relates to aptitude” (p. 480).

The second operationalization of language experience is represented by the long-term effect of being bilingual, generally because of immigration and immersion with no or little formal L2 instruction, or by the long-term effect of having taken foreign language courses earlier in life, or by a combination of the two. Although the results are mixed, some group comparisons or regression analyses showed certain aptitude advantages for bilinguals over monolinguals, for instructed over naturalistic (childhood) bilingualism, and/or for active, advanced, and balanced bilinguals over inactive, less advanced, and unbalanced bilinguals (e.g., Cox et al., Reference Cox, Lynch, Mendes and Zhai2019; Eisenstein, Reference Eisenstein1980; Planchon & Ellis, Reference Planchon and Ellis2014; Rogers et al., Reference Rogers, Meara, Barnett-Legh, Curry and Davie2017; Thompson, Reference Thompson2013). Meanwhile, other studies exhibited no differences at all, or differences in only some sub-components of aptitude (e.g., Cox et al., Reference Cox, Lynch, Mendes and Zhai2019; Harley & Hart, Reference Harley and Hart1997; Sawyer, Reference Sawyer1992). The main problem with these cross-sectional studies is that there is no way of determining the direction of the causality: Do people with already high aptitude make more out of long-term bilingual exposure than people with less aptitude? Do people with an already discovered knack for learning languages opt for language classes more often than others?

The third category of studies, finally, are those that have operationalized language experience in terms of the number of languages a person knows or uses, or how many language classes they are taking or have recently taken, usually through the comparison of language aptitude between L2 learners/bilinguals and L3 learners/trilinguals (or “L3+,” i.e., “three or more languages”). Again, the results are mixed, some showing an advantage in some aptitude components with an increased number of languages (e.g., Ma, Yao, & Zhang Reference Ma, Yao and Zhang2018; Turker et al., Reference Turker2019), and others showing no advantage (Cox et al., Reference Cox, Lynch, Mendes and Zhai2019; Turker et al., Reference Turker, Reiterer, Schneider, Seither-Preisler and Reiterer2018). Interestingly, in the studies by Turker et al. (Reference Turker, Reiterer, Schneider, Seither-Preisler and Reiterer2018, Reference Turker2019), the number of previously learned languages had no effect on their adult participants’ aptitudes, but the researchers did find such an effect in 10–16-year-old participants. Again, there is no way of establishing the direction of causality: The higher aptitude might be an effect of the number of languages being learned, but, in addition to the effects of recency of training, it might just as well be that children and teenagers who experience that they learn languages quite effortlessly tend to take more language classes than those who feel they are not very good at language learning.

Language Processing and Blindness

That blind individuals compensate for their lack of vision through the enhancement of other senses and abilities has been assumed throughout history. For example, having been considered to be the most trustworthy and skilled individuals for such assignments, blind people have served as living databases in both ancient and recent times, memorizing by heart and passing down the teachings of canonical religious scriptures (Amedi et al., Reference Amedi, Raz, Pianka, Malach and Zohary2003). Similarly, the position of court musician was a traditional profession for the blind in ancient China, as well as other ancient cultures, because these people were considered particularly musically talented. What is more, present-day language teachers sometimes testify to the talents of blind individuals in learning foreign languages, especially in terms of pronunciation (personal communication with teachers of Swedish as a Second Language). In fact, a common notion among laypeople seems to be that blind individuals are particularly attentive to and perceptive of sounds and voices, this being the result of their reliance on hearing as a way of compensating for their lack of vision. But is there any scientific truth to these impressions and beliefs?

Since the end of the nineteenth century, empirical studies on the phenomenon of “sensory compensation” (or “perceptual compensation”) in the blind have varied in outcomes, with some confirming the ability of blind individuals to use other senses to compensate for their lack of vision, and others disputing that idea (Miller, Reference Miller1992). Research in first language acquisition in blind children has also produced contradictory results or interpretations. Much of this inconsistency can be explained by methodological problems characteristic of the early blindness studies: Selection criteria were not always sufficiently formulated, which often led to people with varying degrees and ages of onset of blindness ending up in a single participant group, and the use of non-matching control participants or non-identical tasks for blind and sighted participants resulted in unwarranted comparisons (Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2001). In addition, it was not uncommon that some (although not all) blind participants suffered from additional physical or mental challenges (see Rowland, Reference Rowland and Mills1983, Reference Rowland1984).

No consistent results or firm conclusions on the matter could thus be drawn from empirical research until the mid-1990s, when neuropsychology turned its interest to blindness with the aim of obtaining a deeper understanding of neural plasticity and the ability of the human organism to adapt to different sensory interactions with its environment. This field of research recognized the importance of rigorous sample selection criteria that took degree and age of onset of blindness into account, excluded participants with additional disabilities, and focused on congenitally blind individuals in order to minimize influence from any previous visual experience. And indeed, much clearer and more consistent results have been obtained ever since, showing that blind individuals – mainly congenitally blind but in certain cases also individuals who became blind later in life – do compensate for the loss of vision by developing many skills that are less developed in the sighted.

For example, in relation to language, the blind have been shown to develop a superior ability for recognizing words (phonological recognition memory; Raz, Amedi, & Zohary, Reference Raz, Amedi and Zohary2005; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2001) and for being able to immediately repeat auditorily presented lists of words (phonological short-term memory; Loiotile et al., Reference Loiotile, Lane, Omaki and Bedny2020; Röder & Neville, Reference Röder, Neville, Grafman and Robertson2003; Rokem & Ahissar, Reference Rokem and Ahissar2009), to develop a higher sensitivity to speech sounds (Hugdahl et al., Reference Hugdahl, Ek and Takio2004), to process speech at a faster speed (Dietrich, Hertrich, & Ackermann, Reference Dietrich, Hertrich and Ackermann2013; Hugdahl et al., Reference Hugdahl, Ek and Takio2004; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2001; Stevens & Weaver, Reference Stevens and Weaver2005), to have a greater ability for recognizing L1 speech through masking noise (Rokem & Ahissar, Reference Rokem and Ahissar2009), and to grammatically process sentences more efficiently than sighted individuals (Loiotile et al., Reference Loiotile, Lane, Omaki and Bedny2020; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2000).

In the discussion section of this contribution, we will return to the various neurocognitive explanations for such superiority in our effort to make sense of their possible implications for the language aptitude construct.

Aims of the Study

The aptitude data we report here is part of a much larger data set from blind and sighted L1 and L2 speakers.Footnote ² The larger project included tests on language and memory (phonological short-term memory, recognition memory, and episodic memory), word and sentence perception in white noise and cocktail noise, and aspects of speech production. The present report is limited to the two “pure” language aptitude tests that were employed (see below) with the aim of shedding light on the nature of language aptitude in terms of stability/flexibility.

Methods

Participants

As shown in Figure 8.1, a total of 80 adult participants were engaged, 40 of whom were “monolingual” L1 Swedish speakers and 40 of whom were “bilingual” L2 speakers of Swedish (hereafter L1/Monoling and L2/Biling, respectively).Footnote ³ Half of the participants were blind and half were sighted, resulting in 20 blind and 20 sighted each in the L1/Monoling group and the L2/Biling group. Among the blind, about half were either blind from birth (i.e., they were congenitally blind) or had turned blind during early childhood (hereafter Early Blind), whereas the other half had lost their vision as adults (hereafter Late Blind). This grouping of participants yields a 2 (L1/Monoling vs. L2/Biling) × 3(Early Blind vs. Late Blind vs. Sighted) design, through which the relative impact of blindness vs. L2 learning/bilingualism on language aptitude can be investigated.

Figure 8.1 Participants in the present study

Participant groups were comparable in terms of chronological age $(= age at testing)$ $(M_{Age} = 47 years old)$ and educational background and occupational status, and the bilingual groups did not differ in terms of age of L2 acquisition (AoA) $(= age at immigration) (M_{AoA} = 27 years old)$ or length of L2 exposure (LoE) $(= length of residence in Sweden) (M_{LoE} = 21 years)$ . The Early and Late Blind groups naturally differed in terms of age of onset of blindness (AoB) $(M_{AoB} = 2.5 vs . 30 years old)$ and length of blindness (LoB) $(M_{LoB} = 44 vs . 17 years)$ ; see Tables 8.1 and 8.2 for details. The L1 background of the bilinguals was not held constant, but it was restricted to a few language families equally distributed among the participant groups. The most common L1s were Persian, Kurdish (Kurmanji and Sorani), Turkish, Arabic, Amharic, Tigrinya, Polish, Bosnian, and Serbian. Normal hearing was confirmed with an OSCILLA SM910 screening audiometer. The bilingual participants were screened for their L2 proficiency through a 40-item auditory grammaticality judgment test, which was a shortened and simplified version of the test used in various studies from our lab (see, e.g., Abrahamsson, Reference Abrahamsson2012; Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008, Reference Abrahamsson and Hyltenstam2009; Bylund, Hyltenstam, & Abrahamsson, Reference Bylund, Hyltenstam and Abrahamsson2021). Most of the data collection took place in a lab environment at the Centre for Research on Bilingualism at Stockholm University; however, some testing needed to be carried out at the home of the participant (in only a few cases and only with blind participants) or at their workplaces. The conditions were always quiet, and there was no time limit. Data collection took 2–2.5 hours for the L2/Biling participants, and ca. 1.5 hours for the L1/Monoling participants. Before the test session, the participants signed a consent form, and they were remunerated when the test session had ended: The L2/Biling participants received $SEK 300 (ca . € 30)$ , and the L1/Monoling participants $SEK 200 (ca . € 20)$ .Footnote ⁴

Table 8.1 Background data on Early Blind and Late Blind participants on the variable age of onset of blindness (AoB), in years

Language background	Early Blind			Late Blind
Language background	M	SD	Range	M	SD	Range
L1/Monolinguals	2.14	3.00	0–7	33.56	8.21	26–50
L2/Bilinguals	2.59	2.60	0–9	27.11	7.88	19–40

Table 8.2 Background data on Early Blind and Late Blind participants on the variable length of blindness (LoB), in years

Language background	Early Blind			Late Blind
Language background	M	SD	Range	M	SD	Range
L1/Monolinguals	44.41	10.70	29–60	17.67	10.49	6–36
L2/Bilinguals	43.41	11.13	20–58	16.89	9.62	8–31

Language Aptitude Tests

The two aptitude subtests that were used, LAT A and LLAMA D, are the only two purely auditory tests within the Swansea LAT/LLAMA test family. Both are language-independent, and neither is based on visual stimuli.

LAT A

This test used to constitute the first subtest of the Swansea Language Aptitude Test (Swansea LAT, v. 2.0; Meara, Milton, & Lorenzo-Dus, Reference Meara, Milton and Lorenzo-Dus2003), which is the precursor of the current LLAMA test battery (Meara, Reference Meara2005a) that is very much in use among linguists and SLA researchers today. Like the LLAMA tests, the Swansea LAT battery was computer-based, and it consisted of five subtests loosely based on the Carrollian criteria underlying the Modern Language Aptitude Tests (MLAT; Carroll & Sapon, Reference Carroll and Sapon1959). Subtest Lat A was a test of phonetic memory, in which the participants heard and repeated 25 unfamiliar sound stringsFootnote ⁵ ranging from 1 to 13 syllablesFootnote ⁶ in length, one at a time. The original version was a self-judgment test, in which the participants rated their own oral repetition of each item by pressing either a button marked with ☺ (meaning something like “Yes, I did pretty well”) or ☹ (meaning “No, I didn’t do very well”). LAT A was eventually excluded from the test battery by its creators because of low validity and therefore never made it to the next generation of LLAMA tests. Not surprisingly, with the dependent variable being measured through self-assessment, it most definitely failed to measure the actual ability to repeat the items but rather tapped into the test taker’s own perception of their ability to repeat them. Consequently, if people are poor at repeating unfamiliar sound strings, they might also be just as poor at perceiving and judging their own phonetic production and might therefore push the ☺ button too often; likewise, someone who is extremely good at repeating unfamiliar sound strings might have very high standards for what should be judged as an acceptable repetition and may therefore push the ☹ button even when their production is relatively good. Therefore, in the present study, in order to make this test a more valid measure of the actual ability to repeat unfamiliar sound strings, the responses were recorded and later transcribed, analyzed, and scored by one of the authors.

The scoring procedure was as follows. Every stimulus item was divided into syllables, and one point was given for every syllable repeated correctly in the right serial order. The scores were between 1 and 13 points for each item, depending on the number of syllables it contained. By correlating the scoring from the analysis reported here, namely one using a “strict” scoring procedure (i.e., item syllables had to be repeated correctly in the right serial order) with a more “lenient” scoring procedure (where 1 point was given for every correctly repeated syllable regardless of correct serial order), a high intra-rater reliability score was obtained $(r = .95, n = 77, p < .001)$ . The internal consistency of items (‘strict’ scoring) was good (Cronbach’s $α = .88$ ).

LLAMA D

The LLAMA D test is a subtest of the current LLAMA Language Aptitude Tests battery (Meara, Reference Meara2005a). The battery contains four subtests that measure vocabulary learning (LLAMA B), sound recognition (LLAMA D), sound–symbol association (LLAMA E), and grammatical inferencing (LLAMA F). LLAMA D is the only non-visual subtest in the LLAMA battery and is not based on any of the subtests of MLAT. Loosely based on the work of Service (Reference Service1992; Service & Kohonen, Reference Service and Kohonen1995), it is a test of recognition memory of unfamiliarFootnote ⁷ sound sequences with very little support from long-term phonological knowledge. The theoretical basis for this test is that it has been suggested that a key language learning skill is the ability to recognize patterns of spoken language, and that it is an advantage if a pattern of sounds is recognized as being familiar the second time it is heard, thus aiding vocabulary acquisition (Meara, Reference Meara2005b; Speciale, Ellis, & Bywater, Reference Speciale, Ellis and Bywater2004).

The sound sequences (words/phrases) presented to the test taker were between 2 and 5 CV(C) syllables long; 12 sound sequences were 2-syllabic, 13 sequences were 3-syllabic, three sequences were 4-syllabic, and 2 sequences were 5-syllabic. The test taker first listened to 10 target sequences that, immediately after presentation, were to be identified among 30 sequences. Among these 30 items, 10 sequences were the target items previously heard in the first presentation, and 5 of these sequences were also repeated a second time; 15 items were novel and unfamiliar (i.e., distractor) words. The items were presented in a fixed quasi-random order. The test took approximately 3–3.5 minutes to complete. Since half of the participants in this study were blind, all the participants gave their YES/NO answers orally to the test administrator, who pushed the corresponding buttons on the computer. Granena (Reference Granena, Granena and Long2013) and Bokander and Bylund (Reference Bokander and Bylund2020) established an acceptable internal consistency (Cronbach’s $α = .60$ and .64, respectively) for LLAMA D, and Granena (Reference Forsberg-Lundell, Sandgren, Granena and Long2013) also demonstrated a fairly strong test–retest correlation (2-year interval, $r = .60, p < .01$ ).

Results

LAT A

As can be seen in Figure 8.2, the mean scores for the L2/Biling participants were slightly higher than that of the L1/Monoling participants, but a two-tailed ANOVA revealed that a main effect of L2/Bilingualism did not reach statistical significance, $F (1, 71) = 2.82, p = .097$ . However, there was a statistically significant main effect of Blindness, $F (2, 71) = 13.82, p = .001, {η_{p}}^{2} = .28$ . No interaction effect was found, $F (2, 71) = 0.74, p = .48$ . A Bonferroni post-hoc test revealed that Early Blind scored significantly higher than both Late Blind and Sighted $(p \leq .001)$ , and that the mean scores of Late Blind and Sighted did not differ. The 11 highest-scoring participants were all Early Blind (6 L1/Monoling, 5 L2/Biling).

Figure 8.2 Mean overall scores on LAT A, correctly repeated syllables in the correct serial order

The second analysis was based on raw scores (transformed into percentages), where one point was given for every correctly repeated item, irrespective of the length of the items (which otherwise varied between 1 and 13 syllables). The 13 syllable levels were then collapsed into three levels, namely 1–4, 5–8 and 10–13 syllables (see Figure 8.3), and the statistics were run on these figures. Two-tailed ANOVAs showed that for the lower collapsed levels 1–4, there was a main effect of Blindness, $F (2, 71) = 7.41, p = .001, {η_{p}}^{2} = .17$ , but no main effect of L2/Bilingualism, $F (1, 71) = 1.7, p = .202$ , nor an interaction effect, $F (2, 71) = 1.20, p = .306$ . Bonferroni’s post-hoc tests showed that the performances of Early Blind and Late Blind did not differ from each other $(p = 1.0)$ , and that both groups scored significantly higher than the Sighted group ( $p = .005$ and $p = .008$ , respectively).

Figure 8.3 Mean scores (%) of correctly repeated items on LAT A, collapsed 1–4, 5–8, and 10–13-syllable levels

At the 5–8 syllable level, there was a main effect of Blindness, $F (2, 71) = 8.55, p = .001, {η_{p}}^{2} = .19$ , but no main effect of L2/Bilingualism, $F (1, 71) = 0.38, p = .542$ , nor an interaction effect, $F (2, 71) = 0.07, p = .936$ . A Bonferroni post-hoc analysis revealed that Early Blind scored significantly higher than both Late Blind $(p = .006)$ and Sighted $(p = .001)$ , while the scores of Late Blind and Sighted did not differ significantly. In other words, at this more difficult level, the Late Blind group failed to perform at the same level as the Early Blind group.

Finally, at the 10–13 syllable level, statistical methods could not be used as very few participants managed to repeat any items of this length. Only 11 participants managed to perform at this level: 9 participants with early blindness and 2 sighted participants, but no participants with late blindness. As indicated in Figure 8.3, the 1–4 and 10–13 syllable items had time spans of up to 2.42 and 3.01 seconds, respectively, which well exceeds the two-second pSTM capacity of spoken material as proposed by Baddeley (Reference Baddeley2012).Footnote ⁸

LLAMA D

The results from LLAMA D are shown in Figure 8.4. Here, a Kolmogorov–Smirnov test showed that normality was violated. For this reason, non-parametric and parametric techniques were compared, and since the results did not present the same significant group differences, the non-parametric technique was used. A Mann–Whitney U test revealed that there was no significant difference between L1/Monoling and L2/Biling speakers. A Kruskal–Wallis test showed, however, that there was an effect of Blindness, $χ^{2} (2, 80) = 19.57, p = .001$ . No interaction effects were found through two-way testing between-groups ANOVA, $F (2, 74) = 1.27, p = .29$ . A Mann–Whitney U test revealed that the difference between Early Blind and Late Blind was not significant, $U = 135, z = - 1.73, p = .08$ , while the difference between Early Blind and Sighted was significant, $U = 168.00, z = - 4.03, p = .001, r = .51$ , as was the difference between Late Blind and Sighted, $U = 196.50, z = - 2.78, p = .005, r = .36$ . It can be concluded that both the Early Blind and Late Blind groups performed significantly better than Sighted group, and that this test (as was the case with the collapsed 1- to 4-syllable levels in LAT A) seemed to capture a performance advantage for both early blindness and late blindness. Furthermore, a ranking of the performances on LLAMA D showed that of the 16 participants with a result higher than 1 standard deviation from the mean (from 1 to 2.36), 12 were from the Early Blind group (5 L1/Monoling, 7 L2/Biling), three were from the Late Blind group (all L1/Monoling) and one was from the Sighted group (L2/Biling).

Figure 8.4 Mean overall scores on LLAMA D

Discussion

Is Language Aptitude Affected by the Experience of L2 Learning or Bilingualism?

Our data revealed no significant main effects of L2 learning or bilingualism on the two measures of language aptitude. Instead, the L1 and L2 speakers exhibited statistically comparable results on the tests of phonological short-term memory (LAT A) and phonological recognition memory (LLAMA D) of linguistic materials from languages previously unknown to them. In other words, the L2 speakers/bilinguals had not developed a lasting superior phonological or auditory ability as a result of having engaged in successful long-term L2 acquisition and bilingual use. Thus, the present results do not support the suggestion that (phonological/auditory) language aptitude is affected by previous language learning experience or by the experience of being bilingual per se. In effect, these particular results do not support the claim that language aptitude is experiential and flexible in nature, at least not where these two phonological components are concerned.

This conclusion may seem to contradict the few studies referred to earlier that were carried out to investigate the stability vs. flexibility of language aptitude. The studies by Ganschow, Sparks and colleagues in the 1990s, and the more recent studies by Sáfár & Kormos (Reference Sáfár and Kormos2008) and Chalmers (Reference Chalmers2017) all demonstrated some gain in aptitude from language learning activities (shorter or longer foreign language courses) in terms of pre-test/post-test differences in aptitude scores. However, there is nothing in those studies that suggests that an experience-induced enhancement of aptitude will persist over time. In fact, the observed gains in language aptitude may well have been short-lived and due only to the recency of language learning in relation to the timing of aptitude testing. These studies used post-tests at or immediately after the end of a language course – that is, while any potential effects from intensive language-analytical exercise would have still been at their highest (Cox et al., Reference Cox, Lynch, Mendes and Zhai2019). In contrast, the bilinguals in the present study were all long-term residents in the L2 environment $(M_{LoR} = 21 years)$ , which means that any focused, formal (or informal) L2 learning activities (e.g., Swedish courses for immigrants) had been completed a long time ago, and that these participants were fully-fledged L2 speakers/bilinguals at the time of the study. In other words, even if real aptitude flexibility is reflected in differences between pre-test and post-test scores in the context of active language learning (e.g., a language course), these differences might only result from temporary rather than permanent improvements of language aptitude. The studies mentioned above would have benefited from yet another post-test, say, 6–12 months after the actual learning activity, as this later post-test would have revealed either (1) a retraction of aptitude scores to a default (potentially innate) point of departure, or (2) a permanently enhanced and stabilized aptitude level.

These speculations, of course, require extensive empirical testing. If such testing consistently reveals temporary increases in aptitude scores, then aptitude is – in some sense of the word – flexible. However, this result would also imply that flexibility is only temporary and closely tied to the ongoing or recently terminated language-analytic brain exercise and linguistic problem-solving that typifies language learning. Any enhanced aptitude will revert to “normal” when the L2 speaker’s active language learning phase eventually evolves into a less analytical but more functional and fluent bilingual existence.

Is Language Aptitude Affected by the Experience of Blindness?

In contrast to the non-effect of L2 learning experience, the present study demonstrated overwhelming effects of visual deprivation on both pSTM and phonological recognition memory. All measures exhibited strong main effects of blindness, with superior performances by the early-blind participants in particular, supporting the established view that congenital/early blindness has significant enhancing effects on linguistic (in addition to non-linguistic) auditory abilities and cognitive functions, such as pSTM and recognition memory (see, e.g., Raz, Amedi, & Zohary, Reference Raz, Amedi and Zohary2005; Loiotile et al., Reference Loiotile, Lane, Omaki and Bedny2020; Röder & Neville, Reference Röder, Neville, Grafman and Robertson2003; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2001; Rokem & Ahissar, Reference Rokem and Ahissar2009). However, participants with late blindness also performed significantly better than sighted participants, and late-blind individuals performed on par with early-blind individuals both on LLAMA D and at the lowest (1–4-syllable) level of LAT A. At the mid (5–8 syllable) level and onwards of LAT A, late-blind individuals performed like sighted individuals (i.e., less well than early-blind individuals). Only 11 of the 80 participants had a phonological short-term memory capable of handling unfamiliar phonetic strings longer than 3 seconds (10–13 syllables), which is appreciably longer than the two-second upper limit of pSTM in the population at large (Baddely, Reference Baddeley2012), and of these 11, nine had early-onset blindness.Footnote ⁹

These results accord well with the current state of knowledge about cognitive functions and language processing in blind people (for a comprehensive review, see Smeds, Reference Smeds2015, pp. 47–62). As mentioned earlier, blindness is associated with superior abilities for word recognition (i.e., phonological recognition memory; Raz, Amedi, & Zohary, Reference Raz, Amedi and Zohary2005; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2001) and word repetition (phonological short-term memory; Loiotile et al., Reference Loiotile, Lane, Omaki and Bedny2020; Röder & Neville, Reference Röder, Neville, Grafman and Robertson2003; Rokem & Ahissar, Reference Rokem and Ahissar2009), higher sensitivity to speech sounds (Hugdahl et al., Reference Hugdahl, Ek and Takio2004), faster speech processing (Dietrich, Hertrich, & Ackermann, Reference Dietrich, Hertrich and Ackermann2013; Hugdahl et al., Reference Hugdahl, Ek and Takio2004; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2001; Stevens & Weaver, Reference Stevens and Weaver2005), better perception of L1 speech presented in noise (Rokem & Ahissar, Reference Rokem and Ahissar2009), and more efficient grammatical processing of (e.g., garden-path) sentences than in sighted individuals (Loiotile et al., Reference Loiotile, Lane, Omaki and Bedny2020; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2000). Needless to say, none of these superior abilities are innate, as there is no reason to believe that visual deprivation of any kind (including congenital blindness) comes bundled with ready-made enhanced cognitive abilities or an above-average language aptitude. Instead, these enhanced abilities are experiential in origin and are developed as a result of living without vision. They are also flexible in nature, as evidenced by the variation between blind individuals as a function of AoB, among other factors. What is interesting here are the salient neurocognitive marks that the experience of blindness leaves in terms of structural and functional cortical reorganization, and the way the manifested cognitive and linguistic advantages seem to be immediately associated with cross-modal plasticity (Frasnelli et al., Reference Frasnelli, Collignon, Voss and Lepore2011). Bedny (Reference Bedny2017) proposes that human cortices are cognitively pluri-potent, “capable of assuming a wide range of cognitive functions” (p. 637), and it has become more and more evident that the perceptual compensation in blind individuals occurs through the colonization of free wetware in the visual cortex in the occipital lobe (see Amedi et al., Reference Amedi, Raz, Pianka, Malach and Zohary2003, Reference Amedi, Floel, Knecht, Zohary and Cohen2004; Bavelier & Neville, Reference Bavelier and Neville2002; Bedny, Reference Bedny2017; Bedny et al., Reference Bedny, Pascual-Leone, Dodell-Feder, Fedorenko and Saxe2011; Röder, Rösler, & Neville, Reference Röder, Rösler and Neville2000; Tomasello et al., Reference Tomasello, Wennekers, Garagnani and Pulvermüller2019). Thus, with auditory, language, and memory processing going on in the classical cortical areas (including Broca’s area), blind individuals also exhibit parallel activity in the primary as well as secondary visual cortex, making processing faster and more efficient. Thus, together with certain functional reorganizations and plastic alterations that take place in the young blind child’s auditory cortex (Stevens & Weaver, Reference Stevens and Weaver2009), the recruitment of extra cortical space in the occipital lobe – and the parallel processing thus enabled – constitute the neuronal substrate that explains blind individuals’ higher performances on aptitude tests.

The question then arises whether these facts constitute evidence that language aptitude – in principle – emanates from experience? The answer has to be a firm “No.” Blindness is associated with fundamental cortical changes, resulting in effects that cannot be compared with the effects from ordinary practice or learning (e.g., through a language course), which cause no fundamental reorganization of the cortex. In her effort to relate plasticity in blindness to cultural learning and training, Bedny (Reference Bedny2017) explains that even though “learning to recognize visual characters and words is itself a subtype of visual object recognition, […] blindness is a more dramatic experiential change,” adding that “when the change in experience is more dramatic, as in the case of blindness, so too is the change in cortical function” (p. 644). And, naturally, the experience of learning an additional language, or living a bilingual life, could never be remotely as dramatic for the human brain as the experience of navigating life without vision, and neither the cross-modal activation nor the colonization of free wetware is known to occur in the case of L2 learning or bilingualism.

Language Aptitude: A Shared Symptom of Many Underlying Conditions?

Let us assume for a moment that the lack of effects of L2 learning in the present study is, for some reason, unrepresentative and that other studies, such as Sáfár & Kormos (Reference Sáfár and Kormos2008) and Chalmers (Reference Chalmers2017), are more generally correct in concluding that aptitude increases as a function of L2 learning and bi-/multilingual experience. If so, one theoretical objection could be that it is not the language learner’s aptitude per se that is enhanced from such experiences, but rather their language awareness. The relationship between the separate constructs of “language aptitude” and “language awareness” has been the subject of recurring discussions (see, e.g., Hyltenstam, Reference Hyltenstam2021; Jessner, Reference Jessner2006; Ranta, Reference Ranta and Robinson2002; Roehr-Brackin & Tellier, Reference Roehr-Brackin and Tellier2019; Singleton, Reference Singleton2014). Even though language aptitude usually refers to an unconscious sensitivity to language structure (a sensitivity that the individual may be entirely unaware of), while language awareness refers more to explicit knowledge about language (including metalinguistic terminology, in some conceptualizations) and conscious perception and sensitivity in language learning, there is reason to believe that the two constructs overlap to a certain extent. According to Roehr-Brackin & Tellier (Reference Roehr-Brackin and Tellier2019), “researchers either see aptitude as innate and relatively stable and thus as impacting on the development of metalinguistic awareness …, or they assume a bidirectional influence between the two variables, with neither conceptualized as stable” (p. 1115). Language awareness is typically associated with very conscious and determined language learners or language professionals, such as hyper-polyglots (Hyltenstam, Reference Hyltenstam2021), language teachers and interpreters (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008), and, of course, linguists and language scientists. As for the latter category, even the manual of the LLAMA aptitude tests (Meara, Reference Meara2005b) points out that both LLAMA E (a sound–symbol correspondence task) and LLAMA F (a grammatical inferencing task) are “good at identifying analytical linguists” and, in the case of LLAMA E, “particularly those with formal training in phonetics” (p. 14). It thus appears that in the case of language aptitude vs. language awareness, many standardized aptitude subtests tap into both constructs simply because they fail to differentiate between the two, or, at least, parts of them. It seems quite reasonable that two very similar constructs have spill-over effects on each other, and that a certain degree of language awareness would contribute to higher aptitude scores.

However, one relevant question would then be why two such diametrically diverse experiences as L2 learning/bilingualism (or language awareness training) on the one hand, and blindness on the other hand, with fundamentally different neurocognitive underpinnings, result in the same performance on a test of language aptitude. Would it be at all reasonable to assume that L2 learners and blind people end up with the same underlying quality, as suggested by identical scores on an aptitude test? The question becomes even more intriguing when we consider another documented correlate to enhanced language aptitude, namely the occurrence of multiple (right-hemispheric) Heschl’s gyri (HG), a neuro-anatomical structure comprising the primary and parts of the secondary auditory cortex. Turker and colleagues have found robust associations between high speech-imitation skills, high language aptitude, and the occurrence of multiple HG in the right auditory cortices of children, teenagers, and adults (e.g., Turker, Reference Turker2019; Turker et al., Reference Turker, Reiterer, Seither-Preisler and Schneider2017, Reference Turker, Reiterer, Schneider, Seither-Preisler and Reiterer2018; but see Novén et al., Reference Novén, Olsson and Helms2021). In these studies, the occurrence of a single gyrus correlated with particularly low language aptitude, whereas high aptitude was associated with complete posterior duplication of HG. The important thing to point out here is that within an individual, the neuroanatomy of HG is innate and stable, and does not emerge or change as a function of training or experience. Thus, if we add the presence of multiple HG to the list, together with language learning/bilingualism experience, language awareness training, and the experience of blindness, we have – from this highly selective discussion alone – managed to identify no less than four extremely different underlying “conditions” that seem to have only one thing in common: They tend to generate high scores on language aptitude tests. The question then becomes whether it is at all reasonable to assume that such different experiences, with extremely diverse neurocognitive underpinnings, actually lead to one qualitatively common cognitive advantage – namely, an enhanced language aptitude.

Or should a heightened language-analytic ability, as measured by standardized aptitude tests, instead be regarded as a general symptom with a variety of underlying causes, in the same way that a bodily fever, as measured with a thermometer, is not an illness in itself but rather a common symptom of an indefinite number of underlying medical conditions? A thermometer will only reveal whether or not the patient has a heightened body temperature and how high it is, but will indicate nothing about what that temperature is a symptom of. Similarly, could we assume that a high score on an aptitude test is no more than a symptom of one (or several) of many possible underlying causes with different origins?

Another crucial follow-up query would be whether we know for a fact that blind people or people with right twin HG are actually better at learning languages than sighted people or people with a single HG. Put another way, is there any evidence that blind people or those with multiple HG have an actual heightened ability to learn languages? In our original study (Smeds, Reference Smeds2015), a panel of 25 native speakers of Swedish listened to speech samples from 20 blind and 20 sighted L2 speakers and rated them for the degree of foreign accent on a 10-gradient Likert scale (from “1: No foreign accent” to “10: Very heavy foreign accent, impeding comprehensibility”). The results showed no significant differences in accent between the speaker groups $(M \approx 5)$ , neither for elicited speech nor for free speech. In other words, in terms of ultimate attainment, the superior aptitude of the blind participants had not led to a superior long-term L2 speech competence as compared to the sighted L2 participants. So, even though anecdotal evidence from teachers of blind immigrant students’ superior pronunciation skills (personal communications) is probably accurate, there is reason to believe that the blindness advantage is initial and temporary, and that blindness conveys no long-term benefits, at least not to pronunciation.

Does the Stability/Flexibility of Aptitude Matter?

Why should we at all bother about whether language aptitude is dispositional or experiential by origin, or fixed or flexible in nature? Is the issue important in any significant way? If our only interest and purpose with aptitude research were to be able to better predict foreign language students’ prospects of succeeding in language courses by screening them with standardized aptitude tests, then the answer would probably be “No,” especially as more than 50 years of testing practice have demonstrated that the current tests actually predict student performance quite well. For the purpose of prediction, it would not matter much whether an aptitude score reflected a stable ability that was present from birth (either as a traditionally understood talent, or, say, as an effect of twin HG in the right auditory cortex), or if the ability had developed out of experience (from language learning, bilingualism, awareness training, or blindness). As long as the readiness of students for language studies can be correctly predicted, the origin or nature of that readiness should not matter much. However, accurate prediction of readiness would require that language students update their aptitude status every so often through aptitude retesting, because every language course they took could potentially enhance their language-analytic skills. As an aside, it should come as welcome news to foreign language practitioners and language students if aptitude turns out to be trainable.

However, our current knowledge gap is potentially problematic for hypotheses of language acquisition that depend on aptitude being an innate and fixed trait. The situation becomes especially problematic if the experience of language learning itself is assumed to be an important (perhaps even primarily important) source of a learner’s aptitude, notably in situations where language aptitude is meant to explain successful L2 acquisition that has already taken place, which is the case in research on L2 ultimate attainment. As we have noted above, without aptitude pre-testing, there is no way of actually knowing the direction of causality of the correlation between aptitude and language learning success; that is, whether success in language learning is caused by a pre-existing high aptitude or whether a high aptitude has emerged as a byproduct from the very experience of successful L2 learning.

Theories concerned with the impact of age of acquisition and critical period(s) predict that only childhood L2 learners eventually attain nativelike competence and behavior in their second language, whereas the adult learner’s ultimate attainment typically falls short of nativelikeness. However, in the unusual case of adult (near-)nativelike ultimate attainment, high language aptitude always seems to have served as a compensatory factor, allowing for exceptional, even nativelike behavior (Abrahamsson & Hyltenstam, Reference Abrahamsson and Hyltenstam2008; DeKeyser, Reference DeKeyser2000). In other words, an above-average, innate, and fixed talent for language learning plays a crucial role in explaining why some adult learners seem to be able to beat the predictions of a critical period. If, however, the above-average aptitude associated with these exceptional language learners should turn out to be nothing but the result of their successful L2 learning, and not its cause, the theory would fall. For the theory to hold, we would need to be able to trust that our post-hoc measures of aptitude do indeed mirror the level of aptitude that existed pre-L2 acquisition. In other words, and for this reason alone, the importance of investigating the question of stability vs. flexibility reaches well beyond the realm of aptitude theory development.

9 Testing Language Aptitude A Commentary on Batteries and Reanalysis of Constructs

Introduction

The present chapter has two general aims. The first is to survey the range of aptitude batteries and sub-tests that are discussed in the literature, and then to explore how they relate to one another and what emphases each of them contains. To achieve this, the various sub-tests will be located in terms of two dimensions: whether they are domain-specific or domain-general, and whether they require implicit or explicit processes and learning. In addition, how the different domains of sound, working memory, and processing, language and learning are handled in each of the sub-tests will be explored. The major outcome is to assess the implicit assumptions different batteries and tests make; to identify where such batteries duplicate one another; and, more usefully, where there are gaps and scope to develop new aptitude sub-tests.

The second aim is to explore what insights aptitude tests might contribute to theorising about the nature of second language learning. There are many different and contrasting accounts of what second language learning is, and aptitude tests are, potentially, operationalisations of these different accounts – if they are to account for learning, such sub-tests need to reflect the different views about learning processes, such as skill acquisition or statistical learning. The different theoretical accounts will be examined before existing aptitude tests are related to them, indicating clear coverage in some areas, and not very much in others. It is argued that aptitude work, viewed in this way, should be central to second language acquisition and reveal how we can understand and predict it.

Contrasting Perspectives on Developing Aptitude Tests: A Preliminary Survey

An unavoidable starting point for any survey of aptitude tests is to recognise that most aptitude test development and research has been motivated by practical, problem-solving reasons. Of greatest importance here has been the challenge to predict foreign language learning success, or more precisely, rate of second language learning, and this is the typical context for the development of most aptitude batteries. The MLAT (Carroll & Sapon, Reference Carroll and Sapon1959) and Pimsleur’s Language Aptitude Battery (PLAB) (Pimsleur, Reference Pimsleur1966), the two most significant early batteries, were largely used to make predictions of adult and high school students’ language achievement, respectively. Contrastingly, in a military, diplomatic, or other government context, a need is perceived to predict the speed with which personnel can learn foreign languages, particularly at high proficiency levels. The Defense Language Aptitude Battery (DLAB) (Petersen & Al-Haik, Reference Petersen and Al-Haik1976) was produced against such a background, as were the CANAL-F (Grigorenko et al., Reference Grigorenko, Sternberg and Ehrman2002) and the Hi-LAB (Linck et al., Reference Linck, Hughes and Campbell2013; Hughes et al., Chapter 4, this volume).

Given such practically motivated starting points, a first task is to examine these major batteries and aptitude tests (together with some other tests, aptitude or otherwise, that have been used in aptitude research). This review will set the scene for the next section, where the different batteries and tests are analysed in terms of the two relevant dimensions (implicit–explicit and language–cognition). We will look at batteries and tests not simply in terms of content and underlying theory but also the context in which they were developed. The intention is not to repeat descriptions widely available elsewhere, but rather to review the foundation of aptitude tests for the analysis to come.

Carroll (Reference Carroll and Glaser1962) placed considerable emphasis on a job-sample approach to developing aptitude tests. His starting point was to analyse the nature of language learning and the activities on which it is based and then to develop a large number of potential aptitude tests. At the first stage, the aim was to try to cover just about everything that might be considered important in the task of learning a language. Some of these candidate sub-tests differed radically from one another, but in other cases, relatively similar sub-tests were included in the hope that one would have an edge in prediction. The next stage was to try out a large battery of such tests and examine their inter-relationships. The tests were also used with actual language learners to generate validity coefficients. In other words, a large number of possibilities were involved, and then only the most distinctive and predictive sub-tests were retained. No other aptitude battery, before or since, has been so thoroughly or so extensively validated. The final stage in this programme was to build upon the statistical results to develop a theory of foreign language aptitude, which led Carroll to propose his four-factor account: phonetic coding ability, grammatical sensitivity, inductive language learning ability, and associative memory (with this last the only part of the theory which linked in any natural way with the contemporary psychology of the time).

Two other batteries have connections to the MLAT. The first of these, PLAB (Pimsleur, Reference Pimsleur1966) was developed shortly after the MLAT. This test targets high school–age foreign language learners (the MLAT is more focussed on ages from young adults upwards) and reflects Pimsleur’s views that auditory issues are at the root of under-achievement in this area (Pimsleur, Reference Pimsleur and Davies1968). Accordingly, out of three actual aptitude sub-tests (other information is collected on motivation and native language ability) two focus on sound (sound discrimination and sound–symbol association). The other, interestingly, tests inductive language learning, the aptitude factor proposed by Carroll, but not included by him in the MLAT. As with the MLAT, considerable validation work was conducted and usefully covered in the accompanying manual. The PLAB has much less emphasis on learning than the MLAT. Even so, there is little fundamental difference in theory between the two batteries. The main differences were the age-appropriateness of the sub-tests and the commitment to an auditory focus, with this skill (or aptitude) linked to the potential for diagnostic and remedial action. The other related battery is the LLAMA, a test developed at the University of Swansea by Paul Meara (Meara, Reference Meara2005; Rogers et al., Chapter 3, this volume). This battery was modelled on the MLAT, and so its underlying theory is similar to Carroll’s model. The differences are computer administration and lack of dependence on any particular L1 for delivery. Accessibility and lack of cost are also factors. This battery consists of four tests. Two, on paired associates learning and sound–symbol association, are similar to the MLAT. Interestingly LLAMA, like the PLAB, uses an inductive language learning test rather than a grammatical sensitivity test. The final sub-test, the learning of sound, introduces a slightly different dimension to the auditory area and has been argued by Granena (Reference Granena, Granena and Long2013, Reference Granena2019) to tap more implicit learning processes. This battery is probably the most widely used in current aptitude work. Issues remain, however, concerning validation (Bokander & Bylund, Reference Bokander and Bylund2020; Bokander, Chapter 5, this volume).

Only one test battery of note was developed during the 1970s, the DLAB (Petersen & Al-Haik, Reference Petersen and Al-Haik1976). The title of the battery is revealing – it was funded by the US Defence Department. The motivation for this work was the perception that the MLAT was not sufficiently effective at high levels of achievement. Again, there is no fundamental change in theory, and DLAB’s three sub-tests span the areas of sound, language, and learning. The focus in each area was slightly different, though. Sound required accent or stress identification; learning involved learning language rules, then applying them; and language required grammar rules to be inferred. It turned out that the DLAB did not produce a higher validity coefficient than the MLAT. The battery that resulted is restricted and so not widely available, then or now, nor is there much validation information, aside from Petersen and Al-Haik (Reference Petersen and Al-Haik1976). The next battery to consider, the CANAL-F (Grigorenko et al., Reference Grigorenko, Sternberg and Ehrman2002), was also government sponsored, this time with a diplomatic emphasis, and again with a focus on higher-level achievement. The developers were two cognitive psychologists (Grigorenko and Sternberg) and a language specialist (Ehrman). The battery consists of five sub-tests with a clear focus on language and learning (but not overtly sound). It is broader in its treatment of language than other batteries, with meaning and inference involved, as well as language-rule learning. It is also more intricately designed, with a careful manipulation of aural and paper-and-pencil presentation and some integration across the sub-tests. The test is hypothesised to draw on processes such as selective encoding, accidental encoding, selective comparison, selective transfer, and selective combination. There is little general validity information available about the test, beyond Grigorenko et al. (Reference Grigorenko, Sternberg and Ehrman2002), and it has not been widely used except by the originators of the battery.

Next, we come to the most significant development in aptitude testing in recent years – the Hi-LAB (Linck et al., Reference Linck, Hughes and Campbell2013; Hughes et al., Chapter 4, this volume). This, too, was government sponsored and was located within the Center for the Advanced Study of Language (CASL), University of Maryland. It focusses on understanding and predicting high-level foreign language achievement. Considerable resources have been put into in the production of this large test battery, which consists of 12 sub-tests, covering working memory and processing, learning, and sound (but not directly language). The emphasis on detailed sub-processes of working memory and processing speed are distinctive and new, as is the focus on implicit processes (e.g. for learning). Sound and associative learning are more conventional in approach. Some validation information is available (Linck et al., Reference Linck, Hughes and Campbell2013; Hughes et al., Chapter 4, this volume), but the test has not been extensively trialled with non-government populations and across proficiency levels, and it is only selectively available. The battery is strongly influenced by contemporary cognitive psychology and draws upon techniques of measurement developed within that field.

There are also three small-scale tests, essentially sub-tests, that are worth a brief mention. The York Language Analysis test (Green, Reference Green1975) was developed as part of a research study investigating the effectiveness of language laboratories. It is an inductive language learning test, along similar lines to the corresponding PLAB and LLAMA sub-tests. A Chinese University of Hong Kong team developed two tests. The first (Chan et al., Reference Chan and Skehan2011) is a non-word repetition test targeting phonological working memory and phonemic coding ability. It did this by using non-words that were distinctive because they conformed to the phonological structure of the language to be learned (Mandarin or Cantonese). This procedure contrasts with English-based non-words, which are more typical. The other test (Chan & Skehan, Reference Chan and Skehan2011) is an inductive language learning test that takes a different approach to sample the language to be learned and the progression within the language. It is based on Pienemann’s (Reference Pienemann1998) Processability Theory and follows the six stages that he outlined for language development, providing a more principled basis for progression within the test. None of these smaller-scale tests is associated with extensive validation information, and they are not widely used. They do, though, offer slightly different perspectives on foreign language aptitude and are interesting to include in the present inquiry for that reason. In addition, some cognitive psychology tests of implicit learning have been used in aptitude-linked research, such as the Weather Prediction test (Knowlton et al., Reference Knowlton, Mengels and Squire1996) and the Tower of London test (Shallice, Reference Shallice1982). These are not language-focussed but tap the ability to learn probabilistic patterns in data. Some researchers (e.g. Sasaki, Reference Sasaki1996; Kempe & Brooks, Reference Kempe, Brooks, Granena, Jackson and Yilmaz2016) have also explored the relevance of general IQ tests (Cattell & Cattell, Reference Carroll1973) as predictors of language learning success.

Reflecting on these different approaches, one issue to emerge is to consider who the major players are in aptitude research, not so much in terms of individuals, but more in relation to background influences. Carroll (and Sapon), in the development of the MLAT, were academic researchers who obtained funding and who meticulously produced a complete aptitude battery (Reed & Stansfield, Chapter 2, this volume). (One could say similar things of Pimsleur, the author of the LAB.) Otherwise, academic researchers have made rather piecemeal contributions, and very rarely with complete test batteries. Green (Reference Green1975), and Chan and Skehan (Reference Chan and Skehan2011) fall into this category. The exception is the LLAMA (Meara, Reference Meara2005), a complete battery produced by an academic researcher and, indeed, possibly the most widely used aptitude measure in recent years. Otherwise, the major groups who have been interested in aptitude test development have a distinct military or governmental feel, and this has been true for many years. Earlier efforts include the DLAB (Petersen & Al-Haik, Reference Petersen and Al-Haik1976) and CANAL-F (Grigorenko et al., Reference Grigorenko, Sternberg and Ehrman2002), which received development funding from US government agencies. And, of course, most recently, there has been the Hi-LAB, which emerged through work at CASL, itself a government response to perceived lack of foreign language capacities in the US post 9/11.

Important consequences follow from the backgrounds of aptitude test developers. First, there is validation. On reflection, the MLAT, and Pimsleur’s LAB to a considerable extent, were models of how aptitude test batteries should be validated. Considerable numbers of participants were involved with planned variation (within the target populations), and this work was published in detail in accompanying manuals, including extensive information on norms. All information was publicly available and could inform pedagogic decisions. Since then, such thoroughness and public availability have not been matched. It seems that academic aptitude researchers who have developed tests (and even batteries) have not been funded in the same way, and so we frequently have tests made available without adequate validation information, as was pointed out by Bokander and Bylund (Reference Bokander and Bylund2020). One can speculate that large international testing organisations are not interested in developing aptitude batteries because the number of administrations per year would not justify the initial outlay. A second issue concerns restriction and secrecy, and this, too, connects with the funding sources for larger aptitude ventures. The DLAB had strong military links, and it was produced only for use within that context. The CANAL-F battery was also produced in connection with a government agency and was not widely used after its development. More recently, the Hi-LAB, which was the result of a well-funded and extensive research project, has only been made available in a small number of contexts (e.g. Granena, Reference Granena2019). Its validation is impressive (Hughes et al., Chapter 4, this volume) (although perhaps not with the breadth of different groups of the MLAT), but its penetration into wider aptitude research is limited.

The consequence of all this is that the only publicly available validated aptitude battery currently is the MLAT (Carroll & Sapon, Reference Carroll and Sapon1959) which, at the time of writing, is approaching its (pensionable) sixty-fifth birthday. Otherwise, the only major aptitude battery is the LLAMA, but this measure is only partially validated, and, indeed, the most thorough examination demonstrated several shortcomings (Bokander & Bylund, Reference Bokander and Bylund2020; Bylund, Reference Bokander and Bylund2020), an important focus for modifications reported in Rogers et al. (Chapter 3, this volume). Nevertheless, it has been used widely in aptitude research. A reasonable amount of accumulated LLAMA wisdom is available, and this retrospectively provides some sort of foundation for research results to be interpreted. But the field is in urgent need of re-evaluation of existing aptitude instruments, a re-evaluation which can take into account more recent, acquisition-oriented micro research and research that has emerged from the use of Hi-LAB. It seems timely to engage in some degree of re-evaluation rather than continue to follow the same rather limited paths, relying on out-of-date, unvalidated, or restricted test batteries. That is the purpose of the next section of the chapter.

A Framework for Exploring Existing Aptitude Sub-tests

Selection of Aptitude Sub-tests and Methodological Decisions

The analysis so far indicates that there is something of a piecemeal nature to the different positions which have been covered. We have several aptitude batteries in existence now. The problem is that they contribute to fragmentation in our understanding of aptitude, precisely because of their heterogeneity. Inevitably, each has reflected the viewpoint of the developers of that battery or free-standing sub-test. But frequently, these different viewpoints are not easy to relate to one another. What we need is a general framework within which the different batteries and tests can be located and then related to one another. This framework could help us identify what the main focus of a particular battery might be, and equally, which areas potentially relevant to foreign language aptitude are de-emphasised or even omitted. A framework would give us a view of strengths and weaknesses, both of individual batteries and perhaps of the enterprise of aptitude testing as a whole. It might also make it easier to locate where there are gaps in provision.

There is a theoretical motivation for such a framework, also. As we will see in more detail in a later section of the chapter, there are different theoretical positions about the nature of second and foreign language learning, and at a practical level, awareness of these viewpoints could be useful in generating more aptitude tests. But more theoretically, developing aptitude tests consistent with the different theoretical positions and then comparing their effectiveness, possibly in different contexts, might be revealing about the nature of second and foreign language learning itself, so that aptitude research would feed back into theoretical development in second language learning more generally.

The next question is to consider what the nature of such a framework might be. Here, two underlying dimensions will be proposed, and then, separate from that, four domains within which aptitude tests operate. The two dimensions are, first, the contrast between a focus on language versus a focus on general cognition, and second, the contrast between implicit and explicit processes, learning, and memory (Skehan, Reference Skehan, Granena, Jackson and Yilmaz2016). These contrasts should, therefore, create a ‘two-dimensional space’ within which sub-tests can be located, for example, implicit–cognition, explicit–language, and so on. In addition, the four domains (sound, working memory and processing, language, learning) reflect the structures, data, and processing areas that the aptitude sub-tests work upon. The dimensions and domains are viewed separately to enable the possibility that the two dimensions might interact with the four domains of operation, such as whether particular language–cognition and implicit–explicit combinations might be particularly important in certain domains.

Assuming this framework for analysing aptitude tests is useful, the next question concerns the method of investigation. Obviously, the most effective way to proceed would be to have an empirical study that explores all tests, their inter-relationships and effectiveness, and underlying dimensions. Factor analysis would fit the bill quite well in this regard. Equally obvious is that no one has ever attempted this type of study, although there have been some interesting factor analytic studies with subsets of the sub-tests available, such as Li and Qian (Reference Li and Qian2021). For reasons of time and resourcing, perhaps no one ever will. In view of this, the approach taken here is to examine all the sub-tests and rate them on a Language–Cognition scale and an Implicit–Explicit scale. Rating scales were developed for each of the dimensions to achieve this goal. All the sub-tests were rated on these scales by two raters, the author and one other applied linguistics professional, generating inter-rater reliability coefficients of 0.87 for the Implicit–Explicit rating and 0.89 for the Language–Cognition rating. This may not be the ideal way to gather evidence to explore the focus of the different sub-tests, but it is the only one that is practical at this scale.

The database for this investigation consisted of the following aptitude batteries, plus some independently devised aptitude sub-tests and various measures from cognitive psychology generally. Specifically, the batteries are:

The Modern Languages Aptitude Test (five sub-tests)
PLAB (three sub-tests)
LLAMA (four sub-tests)
The DLAB (three sub-tests)
The CANAL-F Battery (five sub-tests)
The Hi-LAB (10 sub-tests)
The York Language Aptitude Test
Chan and Skehan’s Phonological Short-Term Memory (PSTM) test, and their Language Analysis test
Two implicit learning tests (Tower of London, Weather Prediction)

Brief descriptions of all of these are provided in Appendix 1.

The results of the two-dimension ratings are provided in Figure 9.1, which uses the Language vs. Cognition and Implicit vs. Explicit axes to locate the average ratings of the two raters for the range of aptitude and cognitive tests. The labels used to represent the points clarify which aptitude tests are involved in each case. We will consider the different batteries and sub-tests in turn.

Figure 9.1 Two-dimensional view of aptitude sub-tests

Aptitude Batteries: Coverage of Language vs. Cognition and Implicit vs. Explicit Processes

The MLAT sub-tests are, in four out of five cases, around the mid-points of the two dimensions, reflecting implicit and explicit components as well as a balance between language and cognition. There is some degree of spread, but only within fairly narrow limits for MLAT 1, 2, 3, and 5. If not a strong implicit component, this suggests at least some degree of non-explicit learning and processing. The exception is MLAT4, Words in Sentences, which is clearly explicit and language-focussed, indeed one of the most explicit and language-focussed sub-tests in the entire group. Looking at the five sub-tests overall, one could say that the MLAT, while it does have a slight linguistic orientation, is, according to these ratings, not as explicit as it is often portrayed, and perhaps not as tied to particular classroom methodologies as it is often assumed to be.

The PLAB was designed at roughly the same time as the MLAT but does have some differences. There are only three sub-tests to consider. The test is very slightly linguistic in orientation overall, but this is mainly accounted for by PLAB4, Inductive Language Learning. Despite only the three sub-tests, the PLAB covers a greater range on the implicit–explicit dimension, largely because two of the sub-tests, those focussed on sound, are less explicit in nature. The language test, Inductive Language Learning, provides a very slightly greater scope for more implicit processes than does the MLAT Words in Sentences.

The LLAMA battery was designed as an alternative to the MLAT, and so, not surprisingly, there are similarities. All sub-tests have a language orientation, but only one, LLAMA_F, Inferencing, is strongly so. The four sub-tests also provide an interesting range on the implicit-to-explicit dimension, covering quite a span on this dimension. Granena (Reference Granena2019) has argued that LLAMA_D, Phonology Learning, taps implicit processes, but interestingly, the ratings suggest that two other LLAMA sub-tests, LLAMA_E, Sound–Symbol Association, and LLAMA_B, Vocabulary Learning, are slightly below and above the mid-point, respectively. In any case sound, language, and learning are all covered, reflecting the influence of the MLAT.

In some ways, the DLAB is interestingly different from the other batteries. Again, there is a slight focus on language rather than cognition. The battery seems to cover a considerable range in terms of the implicit-to-explicit dimension, but with a considerable split between the two very language-focussed sub-tests and DLAB2, which concerns sound, so there is a large area in the middle range of implicit-to-explicit that is not covered. Perhaps the most distinctive feature is that DLAB3, the Foreign Language Grammar sub-test, seems to reflect a declarative-to-procedural view of learning. Language may figure, therefore, but learning may be more cognitively viewed.

In fact, a declarative-to-procedural perspective is also relevant for the CANAL-F battery, and, indeed, cumulative learning can occur during the course of the administration. The organisation of the test is complex, with paper-and-pencil and auditory material interleaved. The test is integrated also, so memory, and not simply working memory, is pervasively involved. While there is variation between sub-tests on both dimensions, it is striking that they are all placed in the lower right quadrant – a language and explicit processing combination. The clarity of focus here contrasts with the spread and diversity in all the other batteries considered so far.

Finally, we have the Hi-LAB, which presents a considerable contrast to all of the other batteries. Almost all sub-tests fall in the implicit half of Figure 9.1, with most Hi-LAB sub-tests being very clearly so, hovering around 2 in their rating. The exceptions are the Paired Associates sub-test and the Working Memory Task Switching sub-test. Mostly, the orientation is towards cognition rather than language, but there is often a language connection if the sound or verbal working memory or long-term memory priming are concerned. Working memory and processing are heavily emphasised. Conversely, there are only limited language-linked sub-tests, meaning that there is nothing in the quadrant reflecting language and explicit processing. In other words, the Hi-LAB seems a very clear contrast (and complement) to the CANAL-F battery.

The main focus so far has been on complete batteries, but some other sub-tests are shown in Figure 9.1. The Tower of London and the Weather Prediction tasks have, as their origin, general psychological work on implicit learning. There is little language focus and no explicit dimension. They are positioned very clearly in the top-left quadrant in Figure 9.1. In contrast, there are three tests that were developed for purely language aptitude reasons. The York (Green, Reference Green1975) and Pienemann-based tests (Chan & Skehan, Reference Chan and Skehan2011) are clearly located in the bottom-right explicit–language quadrant, and the non-word repetition test (Chan et al., Reference Chan and Skehan2011) is regarded as slightly implicit and slightly cognitive.

The previous paragraphs have described the aptitude tests that we have available. But it is abundantly clear from Figure 9.1 that the discussion has mainly concerned the top-left and bottom-right quadrants (implicit–cognitive and explicit–linguistic) respectively. The top-right (explicit–cognitive) and bottom-left (implicit–linguistic) are hardly represented, and it is interesting to consider what sort of sub-tests might go there and whether these would be of any relevance. Explicit–cognitive would suggest the declarative learning or processing of non-linguistic material, and perhaps this is the area which would be (partly) covered within the sub-sections of a conventional intelligence test, of the sort sometimes used in aptitude work. Implicit–linguistic would, perhaps, develop the approach taken by Reber (Reference Reber1967), who studied the learning of sequences based on linguistically-related material. Possibly, though, there might be scope for discussion regarding what material would be needed to qualify as actually linguistic. In any case, it is striking that these two wide-ranging areas are not represented very much in current aptitude tests.

The final aspect of Figure 9.1 to consider is the focus and amount of coverage of the existing batteries. We will only consider the MLAT, CANAL-F, and Hi-LAB in this respect because these batteries contain five or more sub-tests, giving a reasonable potential for coverage. As we have seen, Hi-LAB is mainly located in the top-left, implicit–cognitive quadrant, while CANAL-F is mainly bottom-right, explicit–language. Both are aptitude batteries and attempt the same task, yet there is little overlap! Doughty (Reference Doughty2019) proposes that the Hi-LAB–MLAT combination might be a good one because the two batteries complement one another. From the present analysis, it might be the case that Hi-LAB and CANAL-F might be an even better combination because of the joint coverage they would produce. Turning to the MLAT, though, we have the greatest coverage by an individual battery, ranging from several sub-tests around the mid-points of each dimension, and then a fairly extreme explicit–language test. This distribution is consistent with Doughty’s suggestion, but with a slightly less explicit–linguistic emphasis. Broadly, then, if one accepts the relevance of the ‘space’ defined by the two dimensions, it is clear that no one test provides adequate coverage and that most batteries are making assumptions about what language aptitude really is, but also what language aptitude is not. A later section will consider exactly this issue but will do so from a more theoretical perspective and be less concerned with the details of measurement.

Domains and Language Aptitude Sub-tests

As we have seen, aptitude tests can be analysed in terms of sound, working memory and processing, language, and learning. This is a meaningful division, but clearly not watertight – sub-tests can, and usually do, involve combinations of these. Any decision is therefore taken to reflect the major focus of a sub-test (e.g. language) even if that does not tell the whole story (with some language-focussed tests – York, for example – also containing learning elements).

The sub-tests focussing on sound are all in, or very close to, the mid-point of the Language–Cognition dimension of Figure 9.1, with little spread along this axis. This may seem slightly surprising, and perhaps there is a case to argue that the language involvement is greater with some sub-tests, particularly if discriminations, for example, are based on knowledge of phonology. In contrast, there is much more dispersion along the Implicit–Explicit dimension. No sub-tests are above the mid-point here, but within the range 1–4 there is considerable coverage. There seems to be a move from phonological learning (LLAMA_D), through tests of sound discrimination or identification (PLAB5, Hi-LAB Hindi and Russian, DLAB 2), to sound–symbol association (LLAMA_E, MLAT1, PLAB6). Finally, there is MLAT3, Spelling Clues, a clever test that requires the use of declarative knowledge and the processing of sound to make effective decisions. This is generally regarded as a relatively easy test, and it would be interesting to see the ideas that generated this test used at a greater level of difficulty, for example, bringing together language and explicit processing with sound. Most of the major batteries are represented in this domain, the one exception being CANAL-F. In fact, sound is used widely in this battery but is never the primary focus in any of the five sub-tests it contains.

Working memory and processing are covered by a large number of tests, eight in total. Strikingly, seven of these are from the Hi-LAB and cover detailed aspects of working memory, both central executive and buffer structures and processes, as well as long-term memory operation. Almost all of these sub-tests are concerned with implicit processing, and mostly across the cognition area. The interesting, and slight, exceptions are the Available Long-Term Memory (ALTM) Synonyms sub-test, which has language connections, and the Task Switching test, which is rated as enabling some explicit processes to be used. The one non-Hi-LAB test in this category is Chan et al.’s (Reference Chan, Skehan and Gong2011) non-word repetition test based on L2 phonology. The conclusion has to be that working memory and processing are amply represented in the Hi-LAB but not particularly in any of the sub-tests from the other batteries, at least as a major focus.

The remaining domains are learning and language, and these two theoretically distinct domains are sometimes difficult to separate in practice. Clear tests of (word) learning exist (MLAT Number Learning, as well as several paired associates sub-tests in other batteries). Implicit learning is also represented (Tower of London, Weather Prediction, Hi-LAB Serial Reaction) – all in cases of a non-linguistic nature. In addition, there are clear language processing tests, such as the MLAT Words in Sentences and CANAL-F, Understanding the Meaning of Passages. In between are several tests concerned with the structure of language where it is not simply the structure of language per se that is involved. In addition, there is scope for learning as the test progresses, and where such learning facilitates faster and more effective work since the stimulus material is cumulative in nature. All the inductive language learning tests are of this type, together with some CANAL-F sub-tests. Many of the language and learning tests are in the lower right quadrant, explicit and language-focussed.

Two additional points are interesting. First, there are two rather dense clusters of sub-tests, with most batteries represented and where the test label locations are separated, for legibility, using the ‘Jitter’ option in R (The R Project for Statistical Computing, 2020). One cluster consists of paired associates learning, with several versions of this same task. The other cluster is inductive language learning, with four different sub-tests (PLAB4, York, LLAMA_F, and the Pienemann-based test), all concerned with the same set of language and learning processes. One could also argue that there is a third cluster, this time of implicit–cognitive tests, represented by Weather Prediction, Tower of London, and one of the Serial Reaction Time sub-tests from Hi-LAB. Second, CANAL-F is particularly interesting in the nature of its different sub-tests. Not only are the sub-tests all located in the same quadrant, but they are also distributed reasonably evenly across this quadrant, even becoming more language oriented as they become more explicit. CANAL-F Learning Neologisms represents a different take on learning because inferencing is required as a precursor to learning itself. This battery has little to offer in any of the other quadrants, but it provides a well-distributed sample of the area where it does focus.

One final observation that can be made most easily here, though it applies elsewhere, is that in most aptitude testing situations, time is precious. Developers have to find ways of extracting as much information as possible in the minimum amount of time. In addition, clarity of separation between different sub-tests requires separation from one another to do justice to the multi-dimensional nature of aptitude. (CANAL-F, with its integrative nature, takes a refreshingly different approach here.) On the other hand, despite these time pressures on efficiency (and humanity for test-takers!), there is the issue that learning, especially, would benefit from longer time involvement if it is to be measured validly. Some delayed testing would also be valuable. But these approaches are not often feasible – the challenge for aptitude test designers is to get vital information quickly. A possible conclusion is that the brief time involvement makes the measurement of learning less effective than would otherwise be the case.

Aptitude and Theory

Implicit in all the discussion so far is the idea that there are important theoretical issues at play in aptitude testing. In this section, we will explore this issue and consider the different theoretical positions as they are captured (or not) by the range of available aptitude sub-tests. But more ambitiously, this discussion will explore what aptitude can illuminate in relation to the fundamental nature of a language learning ability. We will review five theoretical perspectives, some closely associated with existing batteries, others much less so.

The Pragmatic, Carrollian, Statistical Approach: One could be forgiven for thinking that there is little theory in Carroll’s position since it is based on a careful job-sample approach coupled with sophisticated statistical analysis. But the outcome that aptitude concerns processing sound, handling language, and memory/learning has been at the heart of almost all aptitude testing ever since. Theoretically, the assumption is that language is central. Carroll (Reference Carroll1973) speculated that aptitude might result from the differential fading of a first language learning ability (and see Skehan, Reference Skehan1988, who reports evidence from a longitudinal study on first to foreign language learning connections). Learning and handling sound is seen as complementing this central role for language. It has been argued (Krashen, Reference Krashen and Diller1981) that Carroll’s approach is excessively tied to conventional, classroom-based language instruction. In Figure 9.1, it is clear that MLAT sub-tests do have a degree of language focus, but on the implicit–explicit dimension they show quite a bit of spread, suggesting that they may not be locked into any particular instructional method or context, a point developed in Skehan (Reference Skehan1989).

So far, we have focussed on Carroll’s work specifically with foreign language aptitude, but it is important to place this work within a wider project of his. Carroll (Reference Carroll1993) regarded his work on foreign language aptitude as part of an investigation into the nature of general human cognition, including intelligence. Indeed, his major publication was Human Cognitive Abilities, the culmination of decades of work exploring the range of abilities from the perspective of differential psychology and based on his re-analyses of a very large number of datasets that were collected by many researchers. He proposed a three-stratum theory of cognition, with the third stratum a factor of general ability and the first stratum a very large number of specific abilities. Of interest here is the second stratum, which suggests several specialised abilities, including language, reasoning, memory, speed, and several others. The import of Carroll’s research, independent of any theorising about fundamental differences between first and second language learning, is that people vary significantly across a profile of potential abilities. Within this viewpoint, it is natural to consider that some people will have a constellation of abilities suited to more effective language learning and that this set of abilities will be measurable. His theory provides a much more general account of human cognition than his specific views on foreign language aptitude, but it is still relevant for characterising the nature of human abilities and is the wider context for the development of foreign language aptitude tests.

The approach advocated by Richard Sparks (Reference Sparks2012; Chapter 11, this volume) is entirely consistent with this approach. Sparks (Reference Sparks2012) also argues that language is central to language aptitude and that foreign language aptitude can only be understood by relating it to first language learning skills. He reports considerable empirical work in support of this contention. Essentially, this is consistent with Carroll’s view about language abilities since they are pervasive in their effects – as relevant, in his human cognition approach, for first as for second and foreign language learning.

Turning to its impact on practical testing, this pragmatic, language-oriented theoretical foundation is most evident in the MLAT itself. Its sub-tests are addressing a particular learning context – language – but they also fit into the wider structure of human cognition that Carroll (Reference Carroll1993) proposed. The MLAT sub-tests are based on the subset of that wider cognition most clearly implicated in foreign language learning. Just as some people have clusters of abilities that suit them for music, mathematics, or tennis, there are those whose strengths in cognition fit them more effectively for language learning. (This does not mean that others, less gifted in these ways, cannot learn languages, but, as Carroll would argue, they may need more time to reach the same level.) The PLAB and LLAMA have similar foundations.

The conclusion has to be that the Carrollian approach is well represented in aptitude batteries. Not only do the three batteries just highlighted provide sub-tests that cover all the areas in the underlying model, but there are also other sub-tests that attempt to measure the different areas, such as the York test (Green, Reference Green1975), Chan and Skehan’s (Reference Chan and Skehan2011) Pienemann-based test, the associative memory sub-test from the Hi-LAB (Linck et al., Reference Linck, Hughes and Campbell2013), and so on. This leads to a final point of some importance. The MLAT has been associated with outdated language teaching methodologies and is sometimes marginalised as a result. It is important to maintain that the underlying four-factor theory is not methodology-bound, nor is Carroll’s wider account of human cognition, with its three-stratum theory. The foundation is a set of proposals for the differentiated nature of human cognition, and these are not linked exclusively to schooling or any particular methodology. Instead, they are proposed as a basic architecture of cognition.

Selective Fading of Universal Grammar: Another approach to justifying language-as-special is to claim that a generative approach to language is still relevant in the second language acquisition case. In recent years, Meisel (Reference Meisel2011) and Rothman and Slabadova (Reference Rothman and Slabakova2018) have argued strongly for this position, with the discussion exploring how a Universal Grammar (UG) approach is wholly or partially still available, and then explaining the consequences from the more probable case that we are dealing with partial availability. Rothman and Slabadova (Reference Rothman and Slabakova2018) review various approaches that try to offer precision about which generative features are still operative (and presumably, therefore, do not implicate individual differences or foreign language aptitude), and which features are not, (and therefore do or, at least, might). They give the example of Uninterpretable Feature theories (Hawkins & Hattori, Reference Hawkins and Hattori2006), which propose that some features that may have been interpretable in the first language case are no longer so in second language learning. Examples of such features are case and grammatical gender. The relevance here is that there may be individual differences in how second language learners handle such features, and if there is variation, such variation may provide a perspective on aptitude, for example, how to deal with aspects of language acquisition that generative approaches no longer cover. Aptitude sub-tests that are focussed on language structure could focus on such areas for their content.

Arguably, Meisel’s (Reference Meisel2011) approach links more naturally with language aptitude (Skehan, Reference Skehan, Wen, Skehan, Biedroń, Li and Sparks2019) and makes a distinction between a Language Acquisition Device (LAD) and a Language Making Capacity (LMC). The former, which is close to Rothman and Slabadova’s hypothesis of unavailable areas, such as uninterpretable features, in second language development (Hawkins & Hattori, Reference Hawkins and Hattori2006), discusses areas like domain-specific discovery procedures and processing mechanisms, as well as learning mechanisms for non-UG constraints. Sound processing also figures in his proposal. All of these points suggest that, even in the areas where UG is not directly involved, language is still special. The LMC, in contrast, includes general implicit learning, working memory, and general pattern making (areas that figure very strongly in the more cognitive theories covered below). These proposals implicitly offer an agenda for foreign language aptitude test construction and provide structure for the various influences which might impact language learning success. If there were tests available in all these areas (e.g., in differential abilities in handling uninterpretable features, implicit learning, or working memory), one could then explore which of these potential influences have an impact on language learning success. The various possibilities constitute hypotheses, and aptitude testing has the potential to deliver relevant evidence.

It seems reasonable to claim that no aptitude battery straightforwardly addresses these UG-linked proposals in any systematic way. But there are a number of sub-tests from those covered in Figure 9.1 that do have relevance for UG interpretations of second language acquisition. Clearly, the various tests focussing on sound have relevance to Meisel’s LAD. Then, a gap is represented by domain-specific discovery procedures: implicit learning is mentioned, but not implicit learning for language, for which there are no clear aptitude tests at present. Perhaps the closest measures are the variety of inductive language learning tests, where speed of presentation may result in tests drawing on implicit linguistic processes. Chan and Skehan’s test based on Pienemann’s Processability Theory is the closest to exploring processing mechanisms and non-UG factors. Beyond these measures, with Meisel’s LMC, it is clear that a range of Hi-LAB tests is relevant, for example, general implicit learning and working memory. His other component here, general pattern making, might be captured by the Tower of London and Weather Prediction tests. All in all, this is a fragmented but surprisingly interesting collection, not completely thorough, but with a reasonable amount of coverage, through aptitude sub-tests, of the areas highlighted in Meisel’s model.

Second Language Acquisition Based Approaches: There are two related sets of proposals in this section – Skehan’s (Reference Skehan, Granena, Jackson and Yilmaz2016) proposal that putative second language acquisition stages could be the basis for aptitude test development and constructs, and Robinson’s (Reference Robinson and Robinson2002, Reference Robinson2005) suggestions regarding aptitude complexes. Following Klein (Reference Klein1986) and based on second language acquisition research, Skehan (Reference Skehan, Granena, Jackson and Yilmaz2016) proposes that one can identify stages in interlanguage development and, if there are individual differences at any of these stages, one has a candidate starting point for developing an aptitude sub-test. The stages he proposes are shown in Figure 9.2.

Figure 9.2 Stages in interlanguage development

The three macro stages on the right-hand side of Figure 9.2 are concerned with the processing of sound, the capacity to focus on pattern, and proceduralisation so that emerging language can be used fluently and (hopefully) effortlessly, without demanding excessive attention. Clearly, the first two macro stages in Figure 9.2, what can be termed the system development stages, are consistent with what we have learned about second language acquisition. The remaining stages, which involve the achievement of control, are most consistent with the sort of account proposed by Anderson (Reference Anderson2010), with a move from declarative to procedural processing (which connects with the following sections in this chapter). The first group, handling sound, is compatible with explicit and implicit processes. The second group, handling patterns, is also consistent with both types of processes. However, in the case of a declarative-to-procedural flow, central to the third macro stage, there is an assumption not so much of discontinuity between this and the previous macro stages but more that there is a different emphasis. The first two stages are necessary to unlock the potential of the third.

The motivation in examining aptitude in this way is to consider the possibility of individual differences at each stage, which could then suggest that an aptitude test focussing on that stage would have some construct validity. Skehan (Reference Skehan, Granena, Jackson and Yilmaz2016) discusses the way existing aptitude tests cover the stages in this sequence and argues that of the batteries that are available, there is a reasonable sampling in the first two macro stages but not the third, automatisation/proceduralisation. Regarding the first macro stage, sound, there is the range of sound discrimination tests, the various sound–symbol association tests, and, more theoretically, Carroll’s concept of phonemic coding ability. This first macro stage, in turn, brings in the relevance of the working memory tests, which incorporate sound and linguistic elements. Turning to the language-as-pattern set of stages, the closest thing we have are the various tests of inductive language learning ability, all of which probe this area. Given Kempe and Brooks’ (Reference Kempe, Brooks, Granena, Jackson and Yilmaz2016) research, it may be that some IQ tests are relevant – they show that generalising is linked to IQ. In contrast, a focus within a more clearly defined linguistic domain is linked to more typical pattern-oriented language aptitude sub-tests. Turning to the third, proceduralising stage, one reason for weakness in this area of aptitude testing generally is simply time: to measure automatisation would require more time than most aptitude tests are permitted to take, given the pressures on instructional contexts and learners. Perhaps it is the tests from the CANAL-F, with its integrative cumulative nature, that come closest at this stage.

The stages approach has not generated much by way of new aptitude tests. In fact, only two concrete proposals have been made. The first (Chan et al., Reference Chan and Skehan2011) describes a non-word repetition PSTM test in which the non-words are based on the phonological structure of the L2, intended to draw upon phonetic coding ability. The second measure (Chan & Skehan, Reference Chan and Skehan2011) is a test of inductive language learning and follows Pienemann’s (Reference Pienemann1998) account of second language development with its different stages. This second test does mesh a little more with the sorts of sub-processes involved at the macro stage of handling structure.

Robinson (Reference Robinson and Robinson2002) takes a different approach to building on insights from second language acquisition and focusses more on context. He proposes a three-level theory. At the highest level, we have the Aptitude Complexes Hypothesis, which suggests various contexts in which acquisition might be promoted. These contexts include a focus on form, incidental learning (oral), incidental learning (written), and explicit rule learning. The contexts are supported, at the next level down, by ability factors such as noticing, memory for contingent speech, deep semantic processing, memory for contingent text, and metalinguistic rule rehearsal. Pairs of ability factors contribute to features at the aptitude complexes level, for example, the first two (noticing and memory for contingent speech) to focus on form, and the last two to explicit rule learning. Then, at the most detailed level, there are ability-test task components, such as encoding, inferring, comparing, combining, and so on.

Clearly, some of these concepts overlap with Skehan’s stages proposals, such as noticing. In addition, some of them map onto existing aptitude tests, as with incidental learning and metalinguistic rule learning, encoding, and comparing (Grigorenko et al., Reference Grigorenko, Sternberg and Ehrman2002). In other cases, the mapping is not so clear. An important strength of Robinson’s account is that it connects more easily with aptitude–treatment interaction (ATI) approaches since the highest level, aptitude complexes, suggests that different constellations of aptitude components will have importance in different learning contexts. This approach has clear implications for research design with aptitude studies and is consistent with the recent moves to ‘micro’ research.

Declarative to Procedural Learning: All the theoretical approaches we have covered so far have assumed a special place for language in language aptitude. The remaining approaches are a clear contrast to the language-is-special assumption since they view language aptitude as essentially a cognitive ability, with no particular focus on language. One version of this, argued by DeKeyser (Reference DeKeyser, Wen, Skehan, Biedroń, Li and Sparks2019) is that first language acquisition is dependent on implicit processes, and these may involve a special place for language, but post-critical-period learning is qualitatively different. Implicit processes are assumed still to exist but to be much less effective (Ullmann, Reference Ullman, VanPatten and Williams2015; and see Jackson & Maie, Chapter 16, this volume), whereas declarative learning is more efficient. As a result, the process of second and foreign language learning is assumed to largely depend on a declarative-to-procedural sequence and general cognitive abilities.

Taking this approach would suggest that effective foreign language aptitude testing would implicate a range of tests of declarative learning and the declarative-to-procedural transition. Intriguingly, the existing batteries that come closest to satisfying these assumptions are the DLAB and CANAL-F. DLAB3, Foreign Language Grammar (learning rules, then applying them), and DLAB4, Foreign Language Concept Formation (inferring language rules through picture-based information), both seem consistent with this sequence. CANAL-F Part 4 (Sentential Inference) and Part 5 (Learning Language Rules) also bring together fairly explicit material and the opportunity for learning. It is impossible to avoid saying that these are two of the least-used aptitude batteries by researchers. There are also tests of implicit learning derived from cognitive psychology, such as the Weather Prediction and Tower of London tasks, and also sections of the Hi-LAB, such as Serial Reaction Time. But these tests do not really have a prior declarative phase and really claim to assess implicit learning, as opposed to proceduralisation (which would require such an earlier phase). The Weather Prediction and Tower of London tasks have been used as aptitude tests in research by Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019), as were tests of declarative memory. Consistent with DeKeyser’s position, Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019) argue that declarative tests are more effective predictors at lower levels and also in foreign language contexts. Procedural memory tests are reported as more effective at higher proficiency levels and in the second language and naturalistic contexts.

An issue worth discussing at this point concerns the relationship between implicit learning and memory, on the one hand, and proceduralised/automatised learning and memory, on the other. This discussion forms a bridge between the current section, on the declarative to procedural sequence, and the next, on implicit learning. Both sections, declarative-to-procedural and implicit, make the assumption that the respective processes they discuss are distinct from one another. A declarative to procedural or automatised sequence sees conscious and effortful learning slowly replaced by more proceduralised and even automatic performance below the level of consciousness and not requiring attention. In contrast, implicit learning is considered to take place directly and slowly and not to have a declarative, conscious, focussed phase; it simply develops below the level of consciousness.

Theoretically, this difference is clear, as accounts like DeKeyser (Reference DeKeyser, Wen, Skehan, Biedroń, Li and Sparks2019) and Paradis (Reference Paradis2009) make clear. But in practice, there are difficulties in separating the two sets of processes. For example, Suzuki and DeKeyser (Reference Suzuki and DeKeyser2017) probed this distinction and found problems. Central to their examination is the construct of implicit learning as this is currently measured. If implicit learning were a clear construct, it would be possible to operationalise the construct through a series of tests that follow from underlying theory and then inter-correlate with one another in predictable ways. If this provides convergent validity, one would also expect to see lower correlations between such tests and others targeting proceduralisation (varieties of which should themselves inter-correlate reasonably highly). Attempts to do this in the second language field have not been notably successful. Tests of implicit learning show relatively weak inter-correlations (Godfroid & Kim, Reference Godfroid and Kim2021; Li & Qian, Reference Li and Qian2021), while declarative–procedural tests show stronger inter-relationships (Suzuki & DeKeyser, Reference Suzuki and DeKeyser2017). As a result, we are left with difficult questions. We cannot be sure whether there is a unified construct of implicit learning, whether implicit learning exists in different forms, or whether implicit learning can be shown, empirically, to be different from proceduralised learning and memory. These qualifications mean that we have to treat theorising and measurement in this area with care, and so the separation between the present section and the next is a little suspect, even if it does reflect quite a lot of discussion within the field of foreign language aptitude.

Implicit Learning: The previous approach assumed a critical period, a reduction in the effectiveness of implicit processes, and a need to rely on more explicit learning. One could, alternatively, take a Unified Theory perspective (MacWhinney, Reference MacWhinney, Kroll and De Groot2005), and propose only implicit processes, operative in roughly the same way in the first and second language acquisition cases. Again, there would not be anything special about language, and basic learning processes would be essentially the same as in non-language domains.

The argument for the importance of implicit language aptitude has been made strongly in recent years by Granena (Reference Granena2019, Reference Granena2020). She compared the LLAMA tests (B, D, E, and F) with some of the Hi-LAB tests (ALTM, Letter Span, and Serial Reaction Time). She reports a factor analysis that suggests a clear separation between explicit (LLAMA B, E, F) and implicit tests (all the others, including LLAMA_D). She has also related implicit language aptitude to the capacity to respond profitably to feedback, suggesting that implicit aptitude is more related to the effectiveness of implicit feedback (Granena & Yilmaz, Reference Granena and Yilmaz2019). However, Li & Qian (Reference Li and Qian2021) report that tests of implicit language aptitude, including LLAMA_D, do not inter-correlate highly, and that LLAMA_D itself relates more to the other (explicit) LLAMA tests (and see Zhao et al., Chapter 6, this volume, for similar results). As noted elsewhere in this chapter, this means that conclusions about the construct and measurement of implicit aptitude are currently unclear.

The only one of the batteries we have considered that would have relevance to implicit aptitude is the Hi-LAB. This has several sub-tests focussing on working memory (central executive operations and buffer systems), basic cognitive speed, implicit learning, access to long-term memory (LTM), and sound processing. As Figure 9.1 showed, some of these were rated strongly based on implicit processes. In addition, as we have seen, LLAMA_D has been claimed to tap implicit processing and learning (Granena, Reference Granena2019). The implicit learning tests imported from cognitive psychology (Weather Prediction, Tower of London) are also relevant. Obviously missing here is any language involvement, which is, though, consistent with the underlying theoretical viewpoint.

The conclusion seems to be that implicit learning processes, and implicit aptitude, are an important possibility to consider. Even so, the importance of such learning needs to be established (see Jackson & Maie, Chapter 16, this volume, who suggest that it may not be strong in its effects). In addition, the viability of using aptitude tests to measure implicit learning potential (Li & Qian, Reference Li and Qian2021), and the range of contexts in which such a form of aptitude is most effective remain to be established (Buffington & Morgan-Short, Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019), even though there are some encouraging findings.

Declarative, Implicit, and Procedural Learning: We have seen that a range of measures of implicit learning and procedural learning are available. We have also seen that clear and distinct operationalisations of the constructs underlying these positions are not, as yet, a basis for obvious choices. This leads us to consider a hybrid position for the development of non-linguistically oriented aptitude batteries – it may be that it is better to think in terms of the co-existence of two approaches, the explicit and the implicit/procedural. Indeed, one could go further and propose that a hybrid approach could also involve declarative knowledge. This is close to the position argued by Ullmann (Reference Ullman, VanPatten and Williams2015), who suggests that two knowledge sources exist, the declarative and the procedural, and that each has different strengths, weaknesses, and characteristics. The research by Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019) cited earlier, provides a possible example here – declarative knowledge was more relevant for lower levels and foreign language contexts, while procedural knowledge was more relevant for more advanced levels and more naturalistic learning contexts.

It is important to say that this hybrid approach is still consistent with the idea that language is not special – what is being learned is learned through general cognitive abilities. There are implications for aptitude, though, since a wide range of factors would need to be assessed. This perspective is best represented by the Hi-LAB, which effectively covers most of the possibilities here. Possibly the main addition could be CANAL-F, which is also based on cognitive psychology, but with a different emphasis on the nature of learning. This battery could, though, claim to provide the more extensive measurement of a declarative-to-procedural sequence.

The Conclusion

There are three parts to the concluding section. First, the focus is on what we can now say about foreign language aptitude testing, theorising, and research. Then, the issue will be what is needed by way of research and reconceptualisation. Finally, the concern is the relationship between aptitude and wider theory about language learning.

What Can We Now Say?

The broadest generalisation is that, despite the vitality and achievements of recent decades of aptitude research, what we now know is fragmented. The range of data we have available is greater, and the number of studies linked to aptitude battery construction and focussed on micro aspects of acquisition has grown impressively. However, the picture is still incomplete and lacking in overall structure. Several factors contribute to this conclusion.

A major issue concerns the aptitude tests that have been associated with the most recent research and the limitations that follow from their use. Two batteries have dominated in this regard. The Hi-LAB (Linck et al., Reference Linck, Hughes and Campbell2013; Hughes et al., Chapter 4, this volume) has introduced much greater variety into the aptitude sub-tests that have been used. But the test is not widely accessible and has been used with what might be termed restricted populations. Although the published material on validation is impressive (see Hughes et al., Chapter 4, this volume), one would like to see its use with a wider range of learners. The LLAMA, the alternative, has been used in a very large number of studies, and a good deal of accumulated wisdom is the result. But it has not been adequately validated, and indeed there are concerns about its validation (Bokander & Bylund, Reference Bokander and Bylund2020; Bokander, Chapter 5, this volume). Some attempts have been made to address these concerns (Granena, Reference Granena, Granena and Long2013; Rogers et al., Chapter 3, this volume), but it is clear that the sort of validation that was carried out with the MLAT has not been matched. In any case, there may also be the problem that the accumulated wisdom that we do have may not relate in any clear fashion to the revised LLAMA that is described in this volume.

A consequence of this two-battery domination is a lack of broader progress in understanding aptitude structure. For example, the range of micro studies in recent years has been very impressive, and we have improved our understanding of the general impact of aptitude on instruction and feedback (Li, Reference Li2015) and obtained hints about areas of greatest impact (Skehan, Reference Skehan2015). But while individual studies have made valuable contributions, a broader picture has not been possible, partly because of the unsystematic language areas that have been studied, and partly because different aptitude and working memory tests have been used. Above all, there has been a lack of what might be termed aptitude research designs in studies. With the exception of studies such as Granena (Reference Granena2019), which probed relationships between LLAMA and Hi-LAB, and Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019), which explored aptitude-by-proficiency level interactions, studies have tended to have limited focus. Typically, limited populations are researched, or a restricted range of aptitude tests are used, often from one theoretical persuasion. These limitations restrict the power of the claims that can be made. The conclusion is that although much has been learned in recent years, a great deal more needs to be learned. We have been instrument-led rather than construct-led.

What Is Needed?

We can take Doughty’s (Reference Doughty2019) proposals as a starting point. She argues that it is advisable not to focus on just one aptitude test, but that a better option is the combination of the Hi-LAB and MLAT. Essentially, she is proposing that the strengths of the MLAT, namely, that it is appropriate to a range of proficiency levels and has a linguistic focus, are complemented by the Hi-LAB, with its focus on processing and cognition–implicit factors. Her argument is cogent, but it would be good to see an extension that is not bound by these two test batteries.

To a certain extent, the first section of this chapter clarified that one can locate aptitude sub-tests within the two dimensions of language vs. cognition and implicit vs. explicit processes. When existing aptitude sub-tests were analysed in this way, it was interesting that while some areas within the space so defined were well covered, others were not particularly well represented. Three areas seemed particularly lacking:

Cognitive explicit pattern learning
Implicit language pattern learning
Proceduralisation, both of language and cognitive patterns.

In addition, it can be argued that more sub-tests of general processing, particularly speed and LTM access, would be useful (although the relevant sub-tests from Hi-LAB may well be adequate here already). If one accepts the relevance of the two dimensions proposed, these areas are key omissions to the armoury of tests available. They would have particular relevance when one is considering the different accounts of the nature of second and foreign language learning. The opportunity to draw on tests targeting these areas and their potential incorporation into any aptitude battery would broaden the theoretical base in aptitude test design. Essentially, this would take us beyond a situation where one has to, rather ironically, accept a ‘one size fits all’ inflexible battery for measuring individual differences and move towards a situation where tests might be selectable from a validated pool of tests, simultaneously more appropriate for the particular context of use (Robinson, Reference Robinson and Robinson2002) and also more likely to contribute to research design and aptitude theory.

The issue of research design is fundamental because this is the only way to move beyond the fragmentation we currently face. In other words, designing studies appropriately might allow specific research questions to be addressed and simultaneously extend our knowledge of aptitude and the range of available aptitude tests. This applies to macro studies (with larger numbers of participants and with more extended periods and perhaps a wider range of tests), ATI studies (where aptitude can potentially interact with additional variables), and micro studies (concerned with focussed instruction or feedback conditions and possibly shorter time intervals). Regarding macro studies, the study of Granena (Reference Granena2019) is instructive and provides a glimpse into which sub-tests are inter-related and which are not. Interestingly, she drew sub-tests from two of the most significant and widely used batteries of recent years. The approach needs to be extended, perhaps drawing from, as a sampling frame, the two-dimensional arrangement covered in the first major section of this chapter. It might also be beneficial to incorporate some older aptitude sub-tests into such a mix, assuming that they would be available. As a result, we would not only learn about inter-relationships of sub-tests but also underlying aptitude constructs.

Another vital research area is that of potential ATI variables. Theoretically, Robinson (Reference Robinson and Robinson2002) has proposed a set of contexts where particular aptitude configurations are hypothesised to have special importance. Practically, Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019) explored whether explicit aptitude tests, in this case, the MLAT5, Paired Associates, and the Continuous Visual Memory Task, and implicit/procedural tests, such as the Tower of London and Weather Prediction tests, would be particularly important with beginner, foreign language contexts and more advanced, second language contexts. They confirmed that they were, with the declarative tests being more predictive at the lower levels, and the procedural tests predicting more effectively in higher proficiency, second language contexts. The scope for such research is considerable, and what has been done so far is only a beginning. Wider conceptions of aptitude, principled selection of aptitude tests, and a range of variables that might interact with aptitude could be an exciting arena for study. It also holds the prospect of demonstrating that matching learners with contexts (Wesche, Reference Wesche and Diller1981) could lead to more efficient learning.

The final research design area concerns micro studies. A significant number of such studies have appeared in recent years. In almost all these cases, the motivation for the study has been an experimental comparison between either type of instruction or type of feedback, typically the contrast between explicit and implicit in either case. But three issues emerge. The first concerns the selection of aptitude sub-tests in these micro studies. The discussion earlier in this chapter made it clear that choosing appropriate tests is a difficult undertaking, linked to the availability (or lack thereof), and the validity status, of aptitude sub-tests (Bokander, Chapter 5, this volume). It is to be hoped that a more principled basis for selecting aptitude sub-tests will be feasible in the future. Second, with regard to micro studies, there is clearly scope to explore the variable of time. The studies covered in Li (Reference Li2015) and Skehan (Reference Skehan2015), for example, varied in length of experimental condition from 15 minutes to 15 hours, but with all except one study being less than four hours. Comparing effects across such diversity of intervention time is hazardous. We urgently need studies that manipulate time itself as an important variable. Third, the issue of sample size is an important one. Bokander (Chapter 5, this volume) shows that an important proportion of significant correlations between aptitude and performance measures comes from studies with small sample sizes, and that studies with larger sample sizes report significances far less often. This is a worry and suggests very strongly that sample sizes may need to be increased in such research.

How Can Aptitude Research Be Used to Illuminate Theory?

The final area to be discussed is the nature of aptitude theory and, more broadly, what aptitude research might be able to say about the nature of second language learning itself. In a sense, aptitude tests are embodiments of theories of second and foreign language learning ability, and so aptitude research has the potential to be revealing about this important ability. Following the earlier section based on ratings of aptitude sub-tests, there are two basic questions.

Does language aptitude implicate language, and if so, to what extent, and with what underlying theory?
What are the respective roles of explicit, declarative knowledge, explicit learning, and memory relative to implicit knowledge, learning, and memory?

Regarding the first question, a language interpretation would be consistent with batteries such as the MLAT, the PLAB, the DLAB, LLAMA, and CANAL-F, together with the miscellaneous sub-tests that have been developed, such as the York test. So, to the extent that these tests work, and in the main, they do, the case for a language involvement in language aptitude testing is strengthened. We have to accept, though, that the detailed nature of the language linkage is lacking – most aptitude tests have been developed with relatively vague theory. There has been no real basis in any particular linguistic interpretation, whether connected with any underlying post-critical-period capacity for language or, alternatively, the generalised view of human cognitive abilities (Carroll, Reference Carroll1973). Future research will be needed to explore contrasting bases for aptitude test construction to see if any particular viewpoints lead to superior performance.

The second major question concerns the declarative-to-procedural, or explicit-to-implicit contrast. Of course, one issue is the relationship between these two conceptually distinct labels. Part of the problem here is the difficulty in handling, at a measurement level, what is clearer at a conceptual level. Indeed, there are questions as to whether there is a clear, measurable construct of implicit learning or knowledge (Perruchet, Reference Perruchet2021). Still, capitalising on the actual range of measures, the question becomes whether we are dealing with any possibility of progression (declarative to procedural to automatic) or whether there is a unified process, with implicit learning providing a plausible theoretical account of this. If it were possible to develop a range of aptitude tests that overcome these measurement difficulties, we might be able to use foreign language aptitude research to clarify which of these positions is more credible, or whether each of them might be credible in different situations, as Ullmann (Reference Ullman, VanPatten and Williams2015) argues.

In view of these unresolved questions in aptitude theorising, it would probably be wise in aptitude research to take an essentially conservative approach and to avoid using restricted sets of aptitude tests (cf. Doughty, Reference Doughty2019). In other words, where possible, there would be considerable value in using language-based aptitude tests (incorporating language structure, both generatively based and more widely based, along with sound and verbal memory); working memory measures, both language- and non-language-based; general pattern learning, both explicit and implicit; and then wider implicit tests, of learning and brain functioning. We have not had many studies that take a broad perspective, yet if there is to be progress in understanding the nature of aptitude, this approach is unavoidable.

Book contents

Part II - Aptitude Testing of Diverse Groups

Summary

Information

Introduction

Theories of Foreign Language Aptitude

Development of Language Aptitude Tests

Methodology

Participants

Instruments

LLAMA

FLAT-C

Translation:

Questionnaire

Data Collection and Scoring Procedures

Results

Concurrent Validation of FLAT-C

Table 6.1 Test content and construct of FLAT-C and LLAMA

Table 6.2 Descriptive statistics of FLAT-C and LLAMA scores (%)

Table 6.3 Correlation between FLAT-C and LLAMA

Table 6.4 Total variance explained

Table 6.5 Rotated component matrix of aptitude subtests

Predictive Validation of FLAT-C

Table 6.6 Correlation between FLAT-C, LLAMA, and English achievement

Table 6.7 Regression model of English achievement

Discussion

Conclusion

On the Over-Representation of Western Undergraduates in Applied Psychology and SLA Studies

Convenience Sample and Research on Foreign Language Learning Aptitude

Testing Foreign Language Aptitude

Testing Aptitude in Inconvenient Samples: The Language Aptitude Outside the Classroom (LAOC) Study

Table 7.1 Description of the sample: Age, gender, and length of residence at T1 (in months) of the adults (first row) and children (second row)

Comparing Results from Different Samples on the LLAMA Tests

LLAMA_B

Table 7.2 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_B subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

LLAMA_D

Table 7.3 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_D subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

LLAMA_E

Table 7.4 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_E subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

LLAMA_F

Table 7.5 Comparisons between the adults (first row) and children (second row) of the LAOC study and the combination (average) of the other studies on the LLAMA_F subtest: The contrasts are significant if the 95% CIs have a range that does not include 0 (0 indicating no difference)

Table 7.6 Mean scores for each group of participants on each of the four LLAMA subtests

Aptitude and Language Learning in Recently Arrived Adult and Child Immigrants

Table 7.7 Mean and SD scores of the adults (first row) and children (second row) on the verbal fluency task at the three data collection times

Table 7.8 Best-fitting model for the child group

Table 7.9 Best-fitting model for the adult group

Discussion and Conclusion

Preliminaries

Background

The Construct of “Language Aptitude” and Its (Lack of) Theoretical Progress

The Empirical Evidence for L2 Learning/Bilingualism Effects on Language Aptitude

Language Processing and Blindness

Aims of the Study

Methods

Participants

Table 8.1 Background data on Early Blind and Late Blind participants on the variable age of onset of blindness (AoB), in years

Table 8.2 Background data on Early Blind and Late Blind participants on the variable length of blindness (LoB), in years

Language Aptitude Tests

LAT A

LLAMA D

Results

LAT A

LLAMA D

Discussion

Is Language Aptitude Affected by the Experience of L2 Learning or Bilingualism?

Is Language Aptitude Affected by the Experience of Blindness?

Language Aptitude: A Shared Symptom of Many Underlying Conditions?

Does the Stability/Flexibility of Aptitude Matter?

Introduction

Contrasting Perspectives on Developing Aptitude Tests: A Preliminary Survey

A Framework for Exploring Existing Aptitude Sub-tests

Selection of Aptitude Sub-tests and Methodological Decisions

Aptitude Batteries: Coverage of Language vs. Cognition and Implicit vs. Explicit Processes

Domains and Language Aptitude Sub-tests

Aptitude and Theory

The Conclusion

What Can We Now Say?

What Is Needed?

How Can Aptitude Research Be Used to Illuminate Theory?

Footnotes