Introduction
The present chapter has two general aims. The first is to survey the range of aptitude batteries and sub-tests discussed in the literature, and then to explore how they relate to one another and what emphases each contains. To achieve this, the various sub-tests will be located in terms of two dimensions: whether they are domain-specific or domain-general, and whether they require implicit or explicit processes and learning. In addition, how the different domains of sound, working memory and processing, language, and learning are handled in each of the sub-tests will be explored. The major outcome is to assess the implicit assumptions different batteries and tests make; to identify where such batteries duplicate one another; and, more usefully, to identify where there are gaps and scope to develop new aptitude sub-tests.
The second aim is to explore what insights aptitude tests might contribute to theorising about the nature of second language learning. There are many different and contrasting accounts of what second language learning is, and aptitude tests are, potentially, operationalisations of these different accounts – if they are to account for learning, such sub-tests need to reflect the different views about learning processes, such as skill acquisition or statistical learning. The different theoretical accounts will be examined before existing aptitude tests are related to them, indicating clear coverage in some areas, and not very much in others. It is argued that aptitude work, viewed in this way, should be central to second language acquisition and reveal how we can understand and predict it.
Contrasting Perspectives on Developing Aptitude Tests: A Preliminary Survey
An unavoidable starting point for any survey of aptitude tests is to recognise that most aptitude test development and research has been motivated by practical, problem-solving reasons. Of greatest importance here has been the challenge of predicting foreign language learning success, or more precisely, rate of second language learning, and this is the typical context for the development of most aptitude batteries. The MLAT (Carroll & Sapon, 1959) and Pimsleur’s Language Aptitude Battery (PLAB) (Pimsleur, 1966), the two most significant early batteries, were largely used to predict adult and high school students’ language achievement, respectively. By contrast, in a military, diplomatic, or other government context, a need is perceived to predict the speed with which personnel can learn foreign languages, particularly at high proficiency levels. The Defense Language Aptitude Battery (DLAB) (Petersen & Al-Haik, 1976) was produced against such a background, as were the CANAL-F (Grigorenko et al., 2002) and the Hi-LAB (Linck et al., 2013; Hughes et al., Chapter 4, this volume).
Given such practically motivated starting points, a first task is to examine these major batteries and aptitude tests (together with some other tests, aptitude or otherwise, that have been used in aptitude research). This review will set the scene for the next section, where the different batteries and tests are analysed in terms of the two relevant dimensions (implicit–explicit and language–cognition). We will look at batteries and tests not simply in terms of content and underlying theory but also the context in which they were developed. The intention is not to repeat descriptions widely available elsewhere, but rather to review the foundation of aptitude tests for the analysis to come.
Carroll (1962) placed considerable emphasis on a job-sample approach to developing aptitude tests. His starting point was to analyse the nature of language learning and the activities on which it is based and then to develop a large number of potential aptitude tests. At the first stage, the aim was to cover just about everything that might be considered important in the task of learning a language. Some of these candidate sub-tests differed radically from one another, but in other cases, relatively similar sub-tests were included in the hope that one would have an edge in prediction. The next stage was to try out a large battery of such tests and examine their inter-relationships. The tests were also used with actual language learners to generate validity coefficients. In other words, a large number of possibilities were involved, and then only the most distinctive and predictive sub-tests were retained. No other aptitude battery, before or since, has been so thoroughly or so extensively validated. The final stage in this programme was to build upon the statistical results to develop a theory of foreign language aptitude, which led Carroll to propose his four-factor account: phonetic coding ability, grammatical sensitivity, inductive language learning ability, and associative memory (with this last being the only component that linked in any natural way with the psychology of the time).
Two other batteries have connections to the MLAT. The first of these, the PLAB (Pimsleur, 1966), was developed shortly after the MLAT. This test targets high school–age foreign language learners (the MLAT is more focussed on ages from young adults upwards) and reflects Pimsleur’s view that auditory issues are at the root of under-achievement in this area (Pimsleur, 1968). Accordingly, of the three actual aptitude sub-tests (other information is collected on motivation and native language ability), two focus on sound (sound discrimination and sound–symbol association). The other, interestingly, tests inductive language learning, the aptitude factor proposed by Carroll but not included by him in the MLAT. As with the MLAT, considerable validation work was conducted and is usefully covered in the accompanying manual. The PLAB has much less emphasis on learning than the MLAT. Even so, there is little fundamental difference in theory between the two batteries. The main differences were the age-appropriateness of the sub-tests and the commitment to an auditory focus, with this skill (or aptitude) linked to the potential for diagnostic and remedial action. The other related battery is the LLAMA, a test developed at the University of Swansea by Paul Meara (Meara, 2005; Rogers et al., Chapter 3, this volume). This battery was modelled on the MLAT, and so its underlying theory is similar to Carroll’s model. The differences are computer administration and lack of dependence on any particular L1 for delivery. Accessibility and lack of cost are also factors. This battery consists of four tests. Two, on paired associates learning and sound–symbol association, are similar to the MLAT. Interestingly, the LLAMA, like the PLAB, uses an inductive language learning test rather than a grammatical sensitivity test.
The final sub-test, the learning of sound, introduces a slightly different dimension to the auditory area and has been argued by Granena (2013, 2019) to tap more implicit learning processes. This battery is probably the most widely used in current aptitude work. Issues remain, however, concerning validation (Bokander & Bylund, 2020; Bokander, Chapter 5, this volume).
Only one test battery of note was developed during the 1970s, the DLAB (Petersen & Al-Haik, 1976). The title of the battery is revealing – it was funded by the US Defense Department. The motivation for this work was the perception that the MLAT was not sufficiently effective at high levels of achievement. Again, there is no fundamental change in theory, and the DLAB’s three sub-tests span the areas of sound, language, and learning. The focus in each area was slightly different, though. Sound required accent or stress identification; learning involved learning language rules, then applying them; and language required grammar rules to be inferred. It turned out that the DLAB did not produce a higher validity coefficient than the MLAT. The battery that resulted is restricted and so not widely available, then or now, nor is there much validation information, aside from Petersen and Al-Haik (1976). The next battery to consider, the CANAL-F (Grigorenko et al., 2002), was also government sponsored, this time with a diplomatic emphasis, and again with a focus on higher-level achievement. The developers were two cognitive psychologists (Grigorenko and Sternberg) and a language specialist (Ehrman). The battery consists of five sub-tests with a clear focus on language and learning (but not overtly sound). It is broader in its treatment of language than other batteries, with meaning and inference involved, as well as language-rule learning. It is also more intricately designed, with a careful manipulation of aural and paper-and-pencil presentation and some integration across the sub-tests. The test is hypothesised to draw on processes such as selective encoding, accidental encoding, selective comparison, selective transfer, and selective combination. There is little general validity information available about the test, beyond Grigorenko et al. (2002), and it has not been widely used except by the originators of the battery.
Next, we come to the most significant development in aptitude testing in recent years – the Hi-LAB (Linck et al., 2013; Hughes et al., Chapter 4, this volume). This, too, was government sponsored and was located within the Center for the Advanced Study of Language (CASL), University of Maryland. It focusses on understanding and predicting high-level foreign language achievement. Considerable resources have been put into the production of this large test battery, which consists of 12 sub-tests covering working memory and processing, learning, and sound (but not directly language). The emphasis on detailed sub-processes of working memory and processing speed is distinctive and new, as is the focus on implicit processes (e.g. for learning). Sound and associative learning are more conventional in approach. Some validation information is available (Linck et al., 2013; Hughes et al., Chapter 4, this volume), but the test has not been extensively trialled with non-government populations and across proficiency levels, and it is only selectively available. The battery is strongly influenced by contemporary cognitive psychology and draws upon techniques of measurement developed within that field.
There are also three small-scale tests, essentially sub-tests, that are worth a brief mention. The York Language Analysis test (Green, 1975) was developed as part of a research study investigating the effectiveness of language laboratories. It is an inductive language learning test, along similar lines to the corresponding PLAB and LLAMA sub-tests. A Chinese University of Hong Kong team developed two tests. The first (Chan et al., 2011) is a non-word repetition test targeting phonological working memory and phonemic coding ability. It did this by using non-words that were distinctive because they conformed to the phonological structure of the language to be learned (Mandarin or Cantonese). This procedure contrasts with the more typical use of English-based non-words. The other test (Chan & Skehan, 2011) is an inductive language learning test that takes a different approach to sampling the language to be learned and the progression within the language. It is based on Pienemann’s (1998) Processability Theory and follows the six stages that he outlined for language development, providing a more principled basis for progression within the test. None of these smaller-scale tests is associated with extensive validation information, and they are not widely used. They do, though, offer slightly different perspectives on foreign language aptitude and are interesting to include in the present inquiry for that reason. In addition, some cognitive psychology tests of implicit learning have been used in aptitude-linked research, such as the Weather Prediction test (Knowlton et al., 1996) and the Tower of London test (Shallice, 1982). These are not language-focussed but tap the ability to learn probabilistic patterns in data. Some researchers (e.g. Sasaki, 1996; Kempe & Brooks, 2016) have also explored the relevance of general IQ tests (Cattell & Cattell, 1973) as predictors of language learning success.
Reflecting on these different approaches, one issue to emerge is who the major players are in aptitude research, not so much in terms of individuals, but more in relation to background influences. Carroll (and Sapon), in the development of the MLAT, were academic researchers who obtained funding and who meticulously produced a complete aptitude battery (Reed & Stansfield, Chapter 2, this volume). (One could say similar things of Pimsleur, the author of the LAB.) Otherwise, academic researchers have made rather piecemeal contributions, and very rarely with complete test batteries. Green (1975), and Chan and Skehan (2011), fall into this category. The exception is the LLAMA (Meara, 2005), a complete battery produced by an academic researcher and, indeed, possibly the most widely used aptitude measure in recent years. Otherwise, the major groups who have been interested in aptitude test development have a distinct military or governmental feel, and this has been true for many years. Earlier efforts include the DLAB (Petersen & Al-Haik, 1976) and CANAL-F (Grigorenko et al., 2002), which received development funding from US government agencies. And, of course, most recently, there has been the Hi-LAB, which emerged through work at CASL, itself a government response to a perceived lack of foreign language capacity in the US post-9/11.
Important consequences follow from the backgrounds of aptitude test developers. First, there is validation. On reflection, the MLAT, and Pimsleur’s LAB to a considerable extent, were models of how aptitude test batteries should be validated. Considerable numbers of participants were involved, with planned variation within the target populations, and this work was published in detail in accompanying manuals, including extensive information on norms. All information was publicly available and could inform pedagogic decisions. Since then, such thoroughness and public availability have not been matched. It seems that academic aptitude researchers who have developed tests (and even batteries) have not been funded in the same way, and so we frequently have tests made available without adequate validation information, as was pointed out by Bokander and Bylund (2020). One can speculate that large international testing organisations are not interested in developing aptitude batteries because the number of administrations per year would not justify the initial outlay. A second issue concerns restriction and secrecy, and this, too, connects with the funding sources for larger aptitude ventures. The DLAB had strong military links, and it was produced only for use within that context. The CANAL-F battery was also produced in connection with a government agency and was not widely used after its development. More recently, the Hi-LAB, which was the result of a well-funded and extensive research project, has only been made available in a small number of contexts (e.g. Granena, 2019). Its validation is impressive (Hughes et al., Chapter 4, this volume), although perhaps not with the breadth of different groups of the MLAT, but its penetration into wider aptitude research is limited.
The consequence of all this is that the only publicly available validated aptitude battery currently is the MLAT (Carroll & Sapon, 1959), which, at the time of writing, is approaching its (pensionable) sixty-fifth birthday. Otherwise, the only major aptitude battery is the LLAMA, but this measure is only partially validated, and, indeed, the most thorough examination demonstrated several shortcomings (Bokander & Bylund, 2020), an important focus for modifications reported in Rogers et al. (Chapter 3, this volume). Nevertheless, it has been used widely in aptitude research. A reasonable amount of accumulated LLAMA wisdom is available, and this retrospectively provides some sort of foundation for research results to be interpreted. But the field is in urgent need of re-evaluation of existing aptitude instruments, a re-evaluation which can take into account more recent, acquisition-oriented micro research and research that has emerged from the use of the Hi-LAB. It seems timely to engage in some degree of re-evaluation rather than continue to follow the same rather limited paths, relying on out-of-date, unvalidated, or restricted test batteries. That is the purpose of the next section of the chapter.
A Framework for Exploring Existing Aptitude Sub-tests
Selection of Aptitude Sub-tests and Methodological Decisions
The analysis so far indicates that there is something piecemeal about the different positions which have been covered. Several aptitude batteries are now in existence. The problem is that they contribute to fragmentation in our understanding of aptitude, precisely because of their heterogeneity. Inevitably, each has reflected the viewpoint of the developers of that battery or free-standing sub-test. But frequently, these different viewpoints are not easy to relate to one another. What we need is a general framework within which the different batteries and tests can be located and then related to one another. This framework could help us identify what the main focus of a particular battery might be, and equally, which areas potentially relevant to foreign language aptitude are de-emphasised or even omitted. A framework would give us a view of strengths and weaknesses, both of individual batteries and perhaps of the enterprise of aptitude testing as a whole. It might also make it easier to locate where there are gaps in provision.
There is a theoretical motivation for such a framework, also. As we will see in more detail in a later section of the chapter, there are different theoretical positions about the nature of second and foreign language learning, and at a practical level, awareness of these viewpoints could be useful in generating more aptitude tests. But more theoretically, developing aptitude tests consistent with the different theoretical positions and then comparing their effectiveness, possibly in different contexts, might be revealing about the nature of second and foreign language learning itself, so that aptitude research would feed back into theoretical development in second language learning more generally.
The next question is to consider what the nature of such a framework might be. Here, two underlying dimensions will be proposed, and then, separate from that, four domains within which aptitude tests operate. The two dimensions are, first, the contrast between a focus on language versus a focus on general cognition, and second, the contrast between implicit and explicit processes, learning, and memory (Skehan, 2016). These contrasts should, therefore, create a ‘two-dimensional space’ within which sub-tests can be located, for example, implicit–cognition, explicit–language, and so on. In addition, the four domains (sound, working memory and processing, language, learning) reflect the structures, data, and processing areas that the aptitude sub-tests work upon. The dimensions and domains are viewed separately to enable the possibility that the two dimensions might interact with the four domains of operation, such as whether particular language–cognition and implicit–explicit combinations might be particularly important in certain domains.
Assuming this framework for analysing aptitude tests is useful, the next question concerns the method of investigation. Obviously, the most effective way to proceed would be to have an empirical study that explores all tests, their inter-relationships and effectiveness, and underlying dimensions. Factor analysis would fit the bill quite well in this regard. Equally obvious is that no one has ever attempted this type of study, although there have been some interesting factor analytic studies with subsets of the sub-tests available, such as Li and Qian (2021). For reasons of time and resourcing, perhaps no one ever will. In view of this, the approach taken here is to examine all the sub-tests and rate them on a Language–Cognition scale and an Implicit–Explicit scale. Rating scales were developed for each of the dimensions to achieve this goal. All the sub-tests were rated on these scales by two raters, the author and one other applied linguistics professional, generating inter-rater reliability coefficients of 0.87 for the Implicit–Explicit rating and 0.89 for the Language–Cognition rating. This may not be the ideal way to gather evidence to explore the focus of the different sub-tests, but it is the only one that is practical at this scale.
The database for this investigation consisted of the following aptitude batteries, plus some independently devised aptitude sub-tests and various measures from cognitive psychology generally. Specifically, the batteries are:
The Modern Language Aptitude Test (MLAT) (five sub-tests)
PLAB (three sub-tests)
The DLAB (three sub-tests)
The LLAMA (four sub-tests)
The CANAL-F (five sub-tests)
The Hi-LAB (12 sub-tests)
The York Language Aptitude Test
Chan and Skehan’s Phonological Short-Term Memory (PSTM) test, and their Language Analysis test
Two implicit learning tests (Tower of London, Weather Prediction)
Brief descriptions of all of these are provided in Appendix 1.
The results of the two-dimensional ratings are provided in Figure 9.1, which uses the Language vs. Cognition and Implicit vs. Explicit axes to locate the average ratings of the two raters for the range of aptitude and cognitive tests. The labels used to represent the points clarify which aptitude tests are involved in each case. We will consider the different batteries and sub-tests in turn.

Figure 9.1 Two-dimensional view of aptitude sub-tests
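The placement of sub-tests into the quadrants discussed below can be sketched as a simple classification over a pair of averaged ratings. The mid-point value (4 on an assumed 1–7 scale) and the example coordinates are assumptions for illustration; the actual scale range used by the raters is not specified here.

```python
# Sketch of locating a sub-test in the two-dimensional space of Figure 9.1.
# Coordinates are averaged rater scores; the mid-point and the example
# values below are assumptions for illustration only.

MIDPOINT = 4.0  # assumed mid-point of a 1-7 rating scale

def quadrant(implicit_explicit: float, language_cognition: float) -> str:
    """Return the quadrant for a pair of averaged ratings.

    Low implicit_explicit = implicit; low language_cognition = cognition
    (the top-left of the figure); high values = explicit / language.
    """
    vertical = "implicit" if implicit_explicit < MIDPOINT else "explicit"
    horizontal = "cognition" if language_cognition < MIDPOINT else "language"
    return f"{vertical}-{horizontal}"

# Hypothetical placements echoing the discussion in the text
print(quadrant(1.5, 2.0))  # e.g. Tower of London: implicit-cognition
print(quadrant(6.5, 6.0))  # e.g. MLAT4 Words in Sentences: explicit-language
```

A classification of this kind also makes the gap analysis below concrete: sub-tests falling into the explicit–cognition or implicit–language cells are simply rare in the rated set.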
Aptitude Batteries: Coverage of Language vs. Cognition and Implicit vs. Explicit Processes
The MLAT sub-tests are, in four out of five cases, around the mid-points of the two dimensions, reflecting implicit and explicit components as well as a balance between language and cognition. There is some degree of spread, but only within fairly narrow limits for MLAT 1, 2, 3, and 5. This suggests at least some degree of non-explicit learning and processing, if not a strong implicit component. The exception is MLAT4, Words in Sentences, which is clearly explicit and language-focussed, indeed one of the most explicit and language-focussed sub-tests in the entire group. Looking at the five sub-tests overall, one could say that the MLAT, while it does have a slight linguistic orientation, is, according to these ratings, not as explicit as it is often portrayed, and perhaps not as tied to particular classroom methodologies as is often assumed.
The PLAB was designed at roughly the same time as the MLAT but does have some differences. There are only three sub-tests to consider. The test is very slightly linguistic in orientation overall, but this is mainly accounted for by PLAB4, Inductive Language Learning. Despite having only three sub-tests, the PLAB covers a greater range on the implicit–explicit dimension, largely because two of the sub-tests, those focussed on sound, are less explicit in nature. The language test, Inductive Language Learning, provides very slightly greater scope for more implicit processes than does the MLAT Words in Sentences.
The LLAMA battery was designed as an alternative to the MLAT, and so, not surprisingly, there are similarities. All sub-tests have a language orientation, but only one, LLAMA_F, Inferencing, is strongly so. The four sub-tests also cover an interesting range on the implicit-to-explicit dimension. Granena (2019) has argued that LLAMA_D, Phonology Learning, taps implicit processes, but interestingly, the ratings suggest that two other LLAMA sub-tests, LLAMA_E, Sound–Symbol Association, and LLAMA_B, Vocabulary Learning, are slightly below and above the mid-point, respectively. In any case, sound, language, and learning are all covered, reflecting the influence of the MLAT.
In some ways, the DLAB is interestingly different from the other batteries. Again, there is a slight focus on language rather than cognition. The battery seems to cover a considerable range on the implicit-to-explicit dimension, but with a marked split between the two very language-focussed sub-tests and DLAB2, which concerns sound, so there is a large area in the middle range of implicit-to-explicit that is not covered. Perhaps the most distinctive feature is that DLAB3, the Foreign Language Grammar sub-test, seems to reflect a declarative-to-procedural view of learning. Language may figure, therefore, but learning may be viewed more cognitively.
In fact, a declarative-to-procedural perspective is also relevant for the CANAL-F battery, and, indeed, cumulative learning can occur during the course of the administration. The organisation of the test is complex, with paper-and-pencil and auditory material interleaved. The test is integrated also, so memory, and not simply working memory, is pervasively involved. While there is variation between sub-tests on both dimensions, it is striking that they are all placed in the lower right quadrant – a language and explicit processing combination. The clarity of focus here contrasts with the spread and diversity in all the other batteries considered so far.
Finally, we have the Hi-LAB, which presents a considerable contrast to all of the other batteries. Almost all sub-tests fall in the implicit half of Figure 9.1, with most Hi-LAB sub-tests being very clearly so, hovering around 2 in their rating. The exceptions are the Paired Associates sub-test and the Working Memory Task Switching sub-test. Mostly, the orientation is towards cognition rather than language, although there is often a language connection where sound, verbal working memory, or long-term memory priming is concerned. Working memory and processing are heavily emphasised. Conversely, there are only limited language-linked sub-tests, meaning that there is nothing in the quadrant reflecting language and explicit processing. In other words, the Hi-LAB seems a very clear contrast (and complement) to the CANAL-F battery.
The main focus so far has been on complete batteries, but some other sub-tests are shown in Figure 9.1. The Tower of London and the Weather Prediction tasks have, as their origin, general psychological work on implicit learning. There is little language focus and no explicit dimension. They are positioned very clearly in the top-left quadrant in Figure 9.1. In contrast, there are three tests that were developed for purely language aptitude reasons. The York (Green, 1975) and Pienemann-based tests (Chan & Skehan, 2011) are clearly located in the bottom-right explicit–language quadrant, and the non-word repetition test (Chan et al., 2011) is regarded as slightly implicit and slightly cognitive.
The previous paragraphs have described the aptitude tests that we have available. But it is abundantly clear from Figure 9.1 that the discussion has mainly concerned the top-left and bottom-right quadrants (implicit–cognitive and explicit–linguistic, respectively). The top-right (explicit–cognitive) and bottom-left (implicit–linguistic) quadrants are hardly represented, and it is interesting to consider what sort of sub-tests might go there and whether these would be of any relevance. Explicit–cognitive would suggest the declarative learning or processing of non-linguistic material, and perhaps this is the area which would be (partly) covered within the sub-sections of a conventional intelligence test, of the sort sometimes used in aptitude work. Implicit–linguistic would, perhaps, develop the approach taken by Reber (1967), who studied the learning of sequences based on linguistically related material. Possibly, though, there might be scope for discussion regarding what material would qualify as genuinely linguistic. In any case, it is striking that these two wide-ranging areas are not represented very much in current aptitude tests.
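Reber’s paradigm can be made concrete with a brief sketch of how such sequence material is generated. The finite-state grammar below is an invented toy example, not Reber’s original grammar; it simply shows the kind of rule-governed letter strings whose regularities participants pick up implicitly.

```python
# Sketch of the kind of stimulus used in artificial grammar learning
# (Reber-style). The transition table is an invented toy grammar.

import random

# Finite-state grammar: state -> list of (emitted letter, next state);
# a next state of None terminates the string.
GRAMMAR = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 2)],
    2: [("V", 3), ("T", 1)],
    3: [("V", None), ("S", None)],
}

def generate(rng: random.Random) -> str:
    """Generate one grammatical string by a random walk through the grammar."""
    state, letters = 0, []
    while state is not None:
        letter, state = rng.choice(GRAMMAR[state])
        letters.append(letter)
    return "".join(letters)

rng = random.Random(0)  # seeded for reproducibility
strings = [generate(rng) for _ in range(5)]
print(strings)
```

In a typical study, participants memorise grammatical strings like these, then classify new strings as grammatical or not at above-chance levels without being able to state the rules, which is the sense of ‘implicit–linguistic’ intended here.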
The final aspect of Figure 9.1 to consider is the focus and amount of coverage of the existing batteries. We will only consider the MLAT, CANAL-F, and Hi-LAB in this respect because these batteries contain five or more sub-tests, giving a reasonable potential for coverage. As we have seen, Hi-LAB is mainly located in the top-left, implicit–cognitive quadrant, while CANAL-F is mainly bottom-right, explicit–language. Both are aptitude batteries and attempt the same task, yet there is little overlap! Doughty (2019) proposes that the Hi-LAB–MLAT combination might be a good one because the two batteries complement one another. From the present analysis, Hi-LAB and CANAL-F might be an even better combination because of the joint coverage they would produce. Turning to the MLAT, though, we have the greatest coverage by an individual battery, ranging from several sub-tests around the mid-points of each dimension to a fairly extreme explicit–language test. This distribution is consistent with Doughty’s suggestion, but with a slightly less explicit–linguistic emphasis. Broadly, then, if one accepts the relevance of the ‘space’ defined by the two dimensions, it is clear that no one test provides adequate coverage and that most batteries are making assumptions about what language aptitude really is, but also what language aptitude is not. A later section will consider exactly this issue but will do so from a more theoretical perspective and be less concerned with the details of measurement.
Domains and Language Aptitude Sub-tests
As we have seen, aptitude tests can be analysed in terms of sound, working memory and processing, language, and learning. This is a meaningful division, but clearly not watertight – sub-tests can, and usually do, involve combinations of these. Any decision is therefore taken to reflect the major focus of a sub-test (e.g. language) even if that does not tell the whole story (with some language-focussed tests – York, for example – also containing learning elements).
The sub-tests focussing on sound are all in, or very close to, the mid-point of the Language–Cognition dimension of Figure 9.1, with little spread along this axis. This may seem slightly surprising, and perhaps there is a case to argue that the language involvement is greater with some sub-tests, particularly if discriminations, for example, are based on knowledge of phonology. In contrast, there is much more dispersion along the Implicit–Explicit dimension. No sub-tests are above the mid-point here, but within the range 1–4 there is considerable coverage. There seems to be a move from phonological learning (LLAMA_D), through tests of sound discrimination or identification (PLAB5, Hi-LAB Hindi and Russian, DLAB2), to sound–symbol association (LLAMA_E, MLAT1, PLAB6). Finally, there is MLAT3, Spelling Clues, a clever test that requires the use of declarative knowledge and the processing of sound to make effective decisions. This is generally regarded as a relatively easy test, and it would be interesting to see the ideas that generated this test used at a greater level of difficulty, for example, bringing together language and explicit processing with sound. Most of the major batteries are represented in this domain, the one exception being CANAL-F. In fact, sound is used widely in this battery but is never the primary focus in any of the five sub-tests it contains.
Working memory and processing are covered by a large number of tests, eight in total. Strikingly, seven of these are from the Hi-LAB and cover detailed aspects of working memory, both central executive and buffer structures and processes, as well as long-term memory operation. Almost all of these sub-tests are concerned with implicit processing, and mostly across the cognition area. The interesting, and slight, exceptions are the Available Long-Term Memory (ALTM) Synonyms sub-test, which has language connections, and the Task Switching test, which is rated as enabling some explicit processes to be used. The one non-Hi-LAB test in this category is Chan et al.’s (2011) non-word repetition test based on L2 phonology. The conclusion has to be that working memory and processing are amply represented in the Hi-LAB but not particularly in any of the sub-tests from the other batteries, at least as a major focus.
The remaining domains are learning and language, and these two theoretically distinct domains are sometimes difficult to separate in practice. Clear tests of (word) learning exist (MLAT Number Learning, as well as several paired associates sub-tests in other batteries). Implicit learning is also represented (Tower of London, Weather Prediction, Hi-LAB Serial Reaction) – all of a non-linguistic nature. In addition, there are clear language processing tests, such as the MLAT Words in Sentences and CANAL-F Understanding the Meaning of Passages. In between are several tests concerned with the structure of language, but where more than structure per se is involved: there is scope for learning as the test progresses, and such learning facilitates faster and more effective work since the stimulus material is cumulative in nature. All the inductive language learning tests are of this type, together with some CANAL-F sub-tests. Many of the language and learning tests are in the lower right quadrant, explicit and language-focussed.
Two additional points are interesting. First, there are two rather dense clusters of sub-tests, with most batteries represented and where the test label locations are separated, for legibility, using the ‘Jitter’ option in R (The R Project for Statistical Computing, 2020). One cluster consists of paired associates learning, with several versions of this same task. The other cluster is inductive language learning, with four different sub-tests (PLAB4, York, LLAMA_F, and the Pienemann-based test), all concerned with the same set of language and learning processes. One could also argue that there is a third cluster, this time of implicit–cognitive tests, represented by Weather Prediction, Tower of London, and one of the Serial Reaction Time sub-tests from Hi-LAB. Second, CANAL-F is particularly interesting in the nature of its different sub-tests. Not only are the sub-tests all located in the same quadrant, but they are also distributed reasonably evenly across this quadrant, even becoming more language oriented as they become more explicit. CANAL-F Learning Neologisms represents a different take on learning because inferencing is required as a precursor to learning itself. This battery has little to offer in any of the other quadrants, but it provides a well-distributed sample of the area where it does focus.
One final observation that can be made most easily here, though it applies elsewhere, is that in most aptitude testing situations, time is precious. Developers have to find ways of extracting as much information as possible in the minimum amount of time. In addition, sub-tests need to be clearly separated from one another to do justice to the multi-dimensional nature of aptitude. (CANAL-F, with its integrative nature, takes a refreshingly different approach here.) On the other hand, despite these time pressures on efficiency (and humanity for test-takers!), there is the issue that learning, especially, would benefit from longer time involvement if it is to be measured validly. Some delayed testing would also be valuable. But these approaches are not often feasible – the challenge for aptitude test designers is to get vital information quickly. A possible conclusion is that the brief time involvement makes the measurement of learning less effective than would otherwise be the case.
Aptitude and Theory
Implicit in all the discussion so far is the idea that there are important theoretical issues at play in aptitude testing. In this section, we will explore this issue and consider the different theoretical positions as they are captured (or not) by the range of available aptitude sub-tests. But more ambitiously, this discussion will explore what aptitude can illuminate in relation to the fundamental nature of a language learning ability. We will review five theoretical perspectives, some closely associated with existing batteries, others much less so.
The Pragmatic, Carrollian, Statistical Approach: One could be forgiven for thinking that there is little theory in Carroll’s position since it is based on a careful job-sample approach coupled with sophisticated statistical analysis. But the outcome – that aptitude concerns processing sound, handling language, and memory/learning – has been at the heart of almost all aptitude testing ever since. Theoretically, the assumption is that language is central. Carroll (1973) speculated that aptitude might result from the differential fading of a first language learning ability (and see Skehan, 1988, who reports evidence from a longitudinal study on first to foreign language learning connections). Learning and handling sound is seen as complementing this central role for language. It has been argued (Krashen, 1981) that Carroll’s approach is excessively tied to conventional, classroom-based language instruction. In Figure 9.1, it is clear that MLAT sub-tests do have a degree of language focus, but on the implicit–explicit dimension they show quite a bit of spread, suggesting that they may not be locked into any particular instructional method or context, a point developed in Skehan (1989).
So far, we have focussed on Carroll’s work specifically with foreign language aptitude, but it is important to place this work within a wider project of his. Carroll (1993) regarded his work on foreign language aptitude as part of an investigation into the nature of general human cognition, including intelligence. Indeed, his major publication was Human Cognitive Abilities, the culmination of decades of work exploring the range of abilities from the perspective of differential psychology and based on his re-analyses of a very large number of datasets that were collected by many researchers. He proposed a three-stratum theory of cognition, with the third stratum a factor of general ability and the first stratum a very large number of specific abilities. Of interest here is the second stratum, which suggests several specialised abilities, including language, reasoning, memory, speed, and several others. The import of Carroll’s research, independent of any theorising about fundamental differences between first and second language learning, is that people vary significantly across a profile of potential abilities. Within this viewpoint, it is natural to consider that some people will have a constellation of abilities suited to more effective language learning and that this set of abilities will be measurable. His theory provides a much more general account of human cognition than his specific views on foreign language aptitude, but it is still relevant for characterising the nature of human abilities and is the wider context for the development of foreign language aptitude tests.
The approach advocated by Richard Sparks (2012; Chapter 11, this volume) is entirely consistent with this position. Sparks (2012) also argues that language is central to language aptitude and that foreign language aptitude can only be understood by relating it to first language learning skills. He reports considerable empirical work in support of this contention. Essentially, this is consistent with Carroll’s view about language abilities since they are pervasive in their effects – as relevant, in his human cognition approach, for first as for second and foreign language learning.
Turning to its impact on practical testing, this pragmatic, language-oriented theoretical foundation is most evident in the MLAT itself. Its sub-tests are addressing a particular learning context – language – but they also fit into the wider structure of human cognition that Carroll (1993) proposed. The MLAT sub-tests are based on the subset of that wider cognition most clearly implicated in foreign language learning. Just as some people have clusters of abilities that suit them for music, mathematics, or tennis, there are those whose strengths in cognition fit them more effectively for language learning. (This does not mean that others, less gifted in these ways, cannot learn languages, but, as Carroll would argue, they may need more time to reach the same level.) The PLAB and LLAMA have similar foundations.
The conclusion has to be that the Carrollian approach is well represented in aptitude batteries. Not only do the three batteries just highlighted provide sub-tests that cover all the areas in the underlying model, but there are also other sub-tests that attempt to measure the different areas, such as the York test (Green, 1975), Chan and Skehan’s (2011) Pienemann-based test, the associative memory sub-test from the Hi-LAB (Linck et al., 2013), and so on. This leads to a final point of some importance. The MLAT has been associated with outdated language teaching methodologies and is sometimes marginalised as a result. It is important to maintain that the underlying four-factor theory is not methodology-bound, nor is Carroll’s wider account of human cognition, with its three-stratum theory. The foundation is a set of proposals for the differentiated nature of human cognition, and these are not linked exclusively to schooling or any particular methodology. Instead, they are proposed as a basic architecture of cognition.
Selective Fading of Universal Grammar: Another approach to justifying language-as-special is to claim that a generative approach to language is still relevant in the second language acquisition case. In recent years, Meisel (2011) and Rothman and Slabakova (2018) have argued strongly for this position, with the discussion exploring how a Universal Grammar (UG) approach is wholly or partially still available, and then explaining the consequences of the more probable case that we are dealing with partial availability. Rothman and Slabakova (2018) review various approaches that try to offer precision about which generative features are still operative (and presumably, therefore, do not implicate individual differences or foreign language aptitude), and which features are not (and therefore do, or at least might). They give the example of Uninterpretable Feature theories (Hawkins & Hattori, 2006), which propose that some features that may have been interpretable in the first language case are no longer so in second language learning. Examples of such features are case and grammatical gender. The relevance here is that there may be individual differences in how second language learners handle such features, and if there is variation, such variation may provide a perspective on aptitude, for example, how to deal with aspects of language acquisition that generative approaches no longer cover. Aptitude sub-tests that are focussed on language structure could draw on such areas for their content.
Arguably, Meisel’s (2011) approach links more naturally with language aptitude (Skehan, 2019). He makes a distinction between a Language Acquisition Device (LAD) and a Language Making Capacity (LMC). The former, close to Rothman and Slabakova’s hypothesis of unavailable areas in second language development, such as uninterpretable features (Hawkins & Hattori, 2006), covers areas like domain-specific discovery procedures and processing mechanisms, as well as learning mechanisms for non-UG constraints. Sound processing also figures in his proposal. All of these points suggest that, even in the areas where UG is not directly involved, language is still special. The LMC, in contrast, includes general implicit learning, working memory, and general pattern making (areas that figure very strongly in the more cognitive theories covered below). These proposals implicitly offer an agenda for foreign language aptitude test construction and provide structure for the various influences which might impact language learning success. If there were tests available in all these areas (e.g., in differential abilities in handling uninterpretable features, implicit learning, or working memory), one could then explore which of these potential influences have an impact on language learning success. The various possibilities constitute hypotheses, and aptitude testing has the potential to deliver relevant evidence.
It seems reasonable to claim that no aptitude battery straightforwardly addresses these UG-linked proposals in any systematic way. But there are a number of sub-tests from those covered in Figure 9.1 that do have relevance for UG interpretations of second language acquisition. Clearly, the various tests focussing on sound have relevance to Meisel’s LAD. Then, a gap is represented by domain-specific discovery procedures: implicit learning is mentioned, but not implicit learning for language, for which there are no clear aptitude tests at present. Perhaps the closest measures are the variety of inductive language learning tests, where speed of presentation may result in tests drawing on implicit linguistic processes. Chan and Skehan’s test based on Pienemann’s Processability Theory is the closest to exploring processing mechanisms and non-UG factors. Beyond these measures, with Meisel’s LMC, it is clear that a range of Hi-LAB tests is relevant, for example, general implicit learning and working memory. His other component here, general pattern making, might be captured by the Tower of London and Weather Prediction tests. All in all, this is a fragmented but surprisingly interesting collection, not completely thorough, but with a reasonable amount of coverage, through aptitude sub-tests, of the areas highlighted in Meisel’s model.
Second Language Acquisition Based Approaches: There are two related sets of proposals in this section – Skehan’s (2016) proposal that putative second language acquisition stages could be the basis for aptitude test development and constructs, and Robinson’s (2002, 2005) suggestions regarding aptitude complexes. Following Klein (1986) and based on second language acquisition research, Skehan (2016) proposes that one can identify stages in interlanguage development and, if there are individual differences at any of these stages, one has a candidate starting point for developing an aptitude sub-test. The stages he proposes are shown in Figure 9.2.

Figure 9.2 Stages in interlanguage development
The three macro stages on the right-hand side of Figure 9.2 are concerned with the processing of sound, the capacity to focus on pattern, and proceduralisation so that emerging language can be used fluently and (hopefully) effortlessly, without demanding excessive attention. Clearly, the first two macro stages in Figure 9.2, what can be termed the system development stages, are consistent with what we have learned about second language acquisition. The remaining stages, which involve the achievement of control, are most consistent with the sort of account proposed by Anderson (2010), with a move from declarative to procedural processing (which connects with the following sections in this chapter). The first group, handling sound, is compatible with explicit and implicit processes. The second group, handling patterns, is also consistent with both types of processes. However, in the case of a declarative-to-procedural flow, central to the third macro stage, the assumption is not so much of discontinuity with the previous macro stages as of a different emphasis. The first two stages are necessary to unlock the potential of the third.
The motivation in examining aptitude in this way is to consider the possibility of individual differences at each stage, which could then suggest that an aptitude test focussing on that stage would have some construct validity. Skehan (2016) discusses the way existing aptitude tests cover the stages in this sequence and argues that, of the batteries that are available, there is a reasonable sampling in the first two macro stages but not the third, automatisation/proceduralisation. Regarding the first macro stage, sound, there is the range of sound discrimination tests, the various sound–symbol association tests, and, more theoretically, Carroll’s concept of phonemic coding ability. This first macro stage, in turn, brings in the relevance of the working memory tests, which incorporate sound and linguistic elements. Turning to the language-as-pattern set of stages, the closest measures we have are the various tests of inductive language learning ability, all of which probe this area. Given Kempe and Brooks’ (2016) research, it may be that some IQ tests are relevant – they show that generalising is linked to IQ. In contrast, a focus within a more clearly defined linguistic domain is linked to more typical pattern-oriented language aptitude sub-tests. Turning to the third, proceduralising stage, one reason for weakness in this area of aptitude testing generally is simply time: to measure automatisation would require more time than most aptitude tests are permitted to take, given the pressures on instructional contexts and learners. Perhaps it is the tests from the CANAL-F, with its integrative cumulative nature, that come closest at this stage.
The stages approach has not generated much by way of new aptitude tests. In fact, only two concrete proposals have been made. The first (Chan et al., 2011) is a non-word repetition PSTM test in which the non-words are based on the phonological structure of the L2, intended to draw upon phonetic coding ability. The second (Chan & Skehan, 2011) is a test of inductive language learning and follows Pienemann’s (1998) account of second language development with its different stages. This second test does mesh a little more with the sorts of sub-processes involved at the macro stage of handling structure.
Robinson (2002) takes a different approach to building on insights from second language acquisition and focusses more on context. He proposes a three-level theory. At the highest level, we have the Aptitude Complexes Hypothesis, which suggests various contexts in which acquisition might be promoted. These contexts include a focus on form, incidental learning (oral), incidental learning (written), and explicit rule learning. The contexts are supported, at the next level down, by ability factors such as noticing, memory for contingent speech, deep semantic processing, memory for contingent text, and metalinguistic rule rehearsal. Pairs of ability factors contribute to features at the aptitude complexes level, for example, the first two (noticing and memory for contingent speech) to focus on form, and the last two to explicit rule learning. Then, at the most detailed level, there are ability-test task components, such as encoding, inferring, comparing, combining, and so on.
Clearly, some of these concepts overlap with Skehan’s stages proposals, such as noticing. In addition, some of them map onto existing aptitude tests, as with incidental learning and metalinguistic rule learning, encoding, and comparing (Grigorenko et al., 2002). In other cases, the mapping is not so clear. An important strength of Robinson’s account is that it connects more easily with aptitude–treatment interaction (ATI) approaches since the highest level, aptitude complexes, suggests that different constellations of aptitude components will have importance in different learning contexts. This approach has clear implications for research design with aptitude studies and is consistent with the recent moves to ‘micro’ research.
Declarative to Procedural Learning: All the theoretical approaches we have covered so far have assumed a special place for language in language aptitude. The remaining approaches are a clear contrast to the language-is-special assumption since they view language aptitude as essentially a cognitive ability, with no particular focus on language. One version of this, argued by DeKeyser (2019), is that first language acquisition is dependent on implicit processes, and these may involve a special place for language, but post-critical-period learning is qualitatively different. Implicit processes are assumed still to exist but to be much less effective (Ullman, 2015; and see Jackson & Maie, Chapter 16, this volume), whereas declarative learning is more efficient. As a result, the process of second and foreign language learning is assumed to depend largely on a declarative-to-procedural sequence and general cognitive abilities.
Taking this approach would suggest that effective foreign language aptitude testing would implicate a range of tests of declarative learning and the declarative-to-procedural transition. Intriguingly, the existing batteries that come closest to satisfying these assumptions are the DLAB and CANAL-F. DLAB3, Foreign Language Grammar (learning rules, then applying them), and DLAB4, Foreign Language Concept Formation (inferring language rules through picture-based information), both seem consistent with this sequence. CANAL-F Part 4 (Sentential Inference) and Part 5 (Learning Language Rules) also bring together fairly explicit material and the opportunity for learning. Ironically, these are two of the aptitude batteries least used by researchers. There are also tests of implicit learning derived from cognitive psychology, such as the Weather Prediction and Tower of London tasks, and also sections of the Hi-LAB, such as Serial Reaction Time. But these tests do not have a prior declarative phase and really claim to assess implicit learning, as opposed to proceduralisation (which would require such an earlier phase). The Weather Prediction and Tower of London tasks have been used as aptitude tests in research by Buffington and Morgan-Short (2019), as were tests of declarative memory. Consistent with DeKeyser’s position, Buffington and Morgan-Short (2019) argue that declarative tests are more effective predictors at lower levels and also in foreign language contexts. Procedural memory tests are reported as more effective at higher proficiency levels and in second language and naturalistic contexts.
An issue worth discussing at this point concerns the relationship between implicit learning and memory, on the one hand, and proceduralised/automatised learning and memory, on the other. This discussion forms a bridge between the current section, on the declarative to procedural sequence, and the next, on implicit learning. Both sections, declarative-to-procedural and implicit, make the assumption that the respective processes they discuss are distinct from one another. A declarative to procedural or automatised sequence sees conscious and effortful learning slowly replaced by more proceduralised and even automatic performance below the level of consciousness and not requiring attention. In contrast, implicit learning is considered to take place directly and slowly and not to have a declarative, conscious, focussed phase; it simply develops below the level of consciousness.
Theoretically, this difference is clear, as accounts like DeKeyser (2019) and Paradis (2009) show. But in practice, there are difficulties in separating the two sets of processes. For example, Suzuki and DeKeyser (2017) probed this distinction and found problems. Central to their examination is the construct of implicit learning as this is currently measured. If implicit learning were a clear construct, it would be possible to operationalise the construct through a series of tests that follow from underlying theory and then inter-correlate with one another in predictable ways. If this provides convergent validity, one would also expect to see lower correlations between such tests and others targeting proceduralisation (varieties of which should themselves inter-correlate reasonably highly). Attempts to do this in the second language field have not been notably successful. Tests of implicit learning show relatively weak inter-correlations (Godfroid & Kim, 2021; Li & Qian, 2021), while declarative–procedural tests show stronger inter-relationships (Suzuki & DeKeyser, 2017). As a result, we are left with difficult questions. We cannot be sure whether there is a unified construct of implicit learning, whether implicit learning exists in different forms, or whether implicit learning can be shown, empirically, to be different from proceduralised learning and memory. These qualifications mean that we have to treat theorising and measurement in this area with care, and so the separation between the present section and the next is a little suspect, even if it does reflect quite a lot of discussion within the field of foreign language aptitude.
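The convergent–discriminant logic behind this validation strategy can be illustrated with a minimal simulation. The sketch below uses entirely hypothetical score data (not data from any of the studies cited): two tests are assumed to share an 'implicit learning' latent factor and two others a 'proceduralisation' factor, and the check is simply whether same-construct correlations exceed cross-construct ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical number of test-takers

# Simulate scores: two implicit-learning tests share one latent factor,
# two proceduralisation tests share another (illustrative data only).
implicit = rng.normal(size=n)
procedural = rng.normal(size=n)
scores = np.column_stack([
    implicit + rng.normal(size=n),    # implicit test A
    implicit + rng.normal(size=n),    # implicit test B
    procedural + rng.normal(size=n),  # procedural test A
    procedural + rng.normal(size=n),  # procedural test B
])

# Inter-correlation matrix across the four tests
r = np.corrcoef(scores, rowvar=False)

# Convergent validity: same-construct tests should correlate strongly...
convergent = (r[0, 1] + r[2, 3]) / 2
# ...while cross-construct correlations stay low (discriminant validity).
discriminant = np.abs(r[:2, 2:]).mean()
print(convergent > discriminant)
```

The empirical problem the chapter describes is exactly the failure of this pattern for implicit-learning tests: the observed same-construct correlations are too weak to establish a unified construct.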
Implicit Learning: The previous approach assumed a critical period, a reduction in the effectiveness of implicit processes, and a need to rely on more explicit learning. One could, alternatively, take a Unified Theory perspective (MacWhinney, 2005), and propose only implicit processes, operative in roughly the same way in the first and second language acquisition cases. Again, there would not be anything special about language, and basic learning processes would be essentially the same as in non-language domains.
The argument for the importance of implicit language aptitude has been made strongly in recent years by Granena (2019, 2020). She compared the LLAMA tests (B, D, E, and F) with some of the Hi-LAB tests (ALTM, Letter Span, and Serial Reaction Time). She reports a factor analysis that suggests a clear separation between explicit (LLAMA B, E, F) and implicit tests (all the others, including LLAMA_D). She has also related implicit language aptitude to the capacity to respond profitably to feedback, suggesting that implicit aptitude is more related to the effectiveness of implicit feedback (Granena & Yilmaz, 2019). However, Li and Qian (2021) report that tests of implicit language aptitude, including LLAMA_D, do not inter-correlate highly, and that LLAMA_D itself relates more to the other (explicit) LLAMA tests (and see Zhao et al., Chapter 6, this volume, for similar results). As noted elsewhere in this chapter, this means that conclusions about the construct and measurement of implicit aptitude are currently unclear.
The only one of the batteries we have considered that would have relevance to implicit aptitude is the Hi-LAB. This has several sub-tests focussing on working memory (central executive operations and buffer systems), basic cognitive speed, implicit learning, access to long-term memory (LTM), and sound processing. As Figure 9.1 showed, some of these were rated as strongly based on implicit processes. In addition, as we have seen, LLAMA_D has been claimed to tap implicit processing and learning (Granena, 2019). The implicit learning tests imported from cognitive psychology (Weather Prediction, Tower of London) are also relevant. Obviously missing here is any language involvement, which is, though, consistent with the underlying theoretical viewpoint.
The conclusion seems to be that implicit learning processes, and implicit aptitude, are an important possibility to consider. Even so, the importance of such learning needs to be established (see Jackson & Maie, Chapter 16, this volume, who suggest that it may not be strong in its effects). In addition, the viability of using aptitude tests to measure implicit learning potential (Li & Qian, 2021), and the range of contexts in which such a form of aptitude is most effective, remain to be established (Buffington & Morgan-Short, 2019), even though there are some encouraging findings.
Declarative, Implicit, and Procedural Learning: We have seen that a range of measures of implicit learning and procedural learning are available. We have also seen that clear and distinct operationalisations of the constructs underlying these positions do not, as yet, provide a basis for obvious choices. This leads us to consider a hybrid position for the development of non-linguistically oriented aptitude batteries – it may be better to think in terms of the co-existence of two approaches, the explicit and the implicit/procedural. Indeed, one could go further and propose that a hybrid approach could also involve declarative knowledge. This is close to the position argued by Ullman (2015), who suggests that two knowledge sources exist, the declarative and the procedural, and that each has different strengths, weaknesses, and characteristics. The research by Buffington and Morgan-Short (2019), cited earlier, provides a possible example here – declarative knowledge was more relevant for lower levels and foreign language contexts, while procedural knowledge was more relevant for more advanced levels and more naturalistic learning contexts.
It is important to say that this hybrid approach is still consistent with the idea that language is not special – what is being learned is learned through general cognitive abilities. There are implications for aptitude, though, since a wide range of factors would need to be assessed. This perspective is best represented by the Hi-LAB, which effectively covers most of the possibilities here. Possibly the main addition could be CANAL-F, which is also based on cognitive psychology, but with a different emphasis on the nature of learning. This battery could, though, claim to provide the more extensive measurement of a declarative-to-procedural sequence.
The Conclusion
There are three parts to the concluding section. First, the focus is on what we can now say about foreign language aptitude testing, theorising, and research. Then, the issue will be what is needed by way of research and reconceptualisation. Finally, the concern is the relationship between aptitude and wider theory about language learning.
What Can We Now Say?
The broadest generalisation is that, despite the vitality and achievements of recent decades of aptitude research, what we now know is fragmented. The range of data we have available is greater, and the number of studies linked to aptitude battery construction and focussed on micro aspects of acquisition has grown impressively. However, the picture is still incomplete and lacking in overall structure. Several factors contribute to this conclusion.
A major issue concerns the aptitude tests that have been associated with the most recent research and the limitations that follow from their use. Two batteries have dominated in this regard. The Hi-LAB (Linck et al., Reference Linck, Hughes and Campbell2013; Hughes et al., Chapter 4, this volume) has introduced much greater variety into the aptitude sub-tests that have been used. But the test is not widely accessible and has been used with what might be termed restricted populations. Although the published material on validation is impressive (see Hughes et al., Chapter 4, this volume), one would like to see its use with a wider range of learners. The LLAMA, the alternative, has been used in a very large number of studies, and a good deal of accumulated wisdom is the result. But it has not been adequately validated, and indeed there are specific concerns about its validity (Bokander & Bylund, Reference Bokander and Bylund2020; Bokander, Chapter 5, this volume). Some attempts have been made to address these concerns (Granena, Reference Granena, Granena and Long2013; Rogers et al., Chapter 3, this volume), but it is clear that the sort of validation that was carried out with the MLAT has not been matched. In any case, there is also the problem that the accumulated wisdom we do have may not relate in any clear fashion to the revised LLAMA described in this volume.
A consequence of this two-battery domination is a lack of broader progress in understanding aptitude structure. The range of micro studies in recent years has been very impressive, and we have improved our understanding of the general impact of aptitude on instruction and feedback (Li, Reference Li2015) and obtained hints about areas of greatest impact (Skehan, Reference Skehan2015). But while individual studies have made valuable contributions, a broader picture has not been possible, partly because of the unsystematic choice of language areas studied, and partly because different aptitude and working memory tests have been used. Above all, there has been a lack of what might be termed aptitude research designs in studies. With the exception of studies such as Granena (Reference Granena2019), which probed relationships between the LLAMA and the Hi-LAB, and Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019), which explored aptitude-by-proficiency-level interactions, studies have tended to have a limited focus. Typically, limited populations are researched, or a restricted range of aptitude tests is used, often from one theoretical persuasion. These limitations restrict the power of the claims that can be made. The conclusion is that although much has been learned in recent years, a great deal more remains to be learned. We have been instrument-led rather than construct-led.
What Is Needed?
We can take Doughty’s (Reference Doughty2019) proposals as a starting point. She argues that it is advisable not to focus on just one aptitude test, and that a better option is to combine the Hi-LAB and the MLAT. Essentially, she is proposing that the strengths of the MLAT, namely, that it is appropriate to a range of proficiency levels and has a linguistic focus, are complemented by the Hi-LAB, with its focus on processing and on cognitive and implicit factors. Her argument is cogent, but it would be good to see an extension that is not bound by these two test batteries.
To a certain extent, the first section of this chapter clarified that one can locate aptitude sub-tests within the two dimensions of language vs. cognition and implicit vs. explicit processes. When existing aptitude sub-tests were analysed in this way, it was interesting that while some areas within the space so defined were well covered, others were not particularly well represented. Three areas seemed particularly lacking:
Cognitive explicit pattern learning
Implicit language pattern learning
Proceduralisation, both of language and cognitive patterns.
In addition, it can be argued that more sub-tests of general processing, particularly speed and LTM access, would be useful (although the relevant sub-tests from the Hi-LAB may well be adequate here already). If one accepts the relevance of the two dimensions proposed, these areas are key omissions from the armoury of tests available. They would have particular relevance when one is considering the different accounts of the nature of second and foreign language learning. Drawing on tests targeting these areas, and potentially incorporating them into aptitude batteries, would broaden the theoretical base in aptitude test design. Essentially, this would take us beyond a situation where one has to, rather ironically, accept a ‘one size fits all’ inflexible battery for measuring individual differences, and move towards a situation where tests might be selected from a validated pool, simultaneously more appropriate for the particular context of use (Robinson, Reference Robinson and Robinson2002) and also more likely to contribute to research design and aptitude theory.
The issue of research design is fundamental because this is the only way to move beyond the fragmentation we currently face. In other words, designing studies appropriately might allow specific research questions to be addressed while simultaneously extending our knowledge of aptitude and the range of available aptitude tests. This applies to macro studies (with larger numbers of participants, more extended periods, and perhaps a wider range of tests), ATI studies (where aptitude can potentially interact with additional variables), and micro studies (concerned with focussed instruction or feedback conditions and possibly shorter time intervals). Regarding macro studies, Granena (Reference Granena2019) is instructive and provides a glimpse into which sub-tests are inter-related and which are not. Interestingly, she drew sub-tests from two of the most significant and widely used batteries of recent years. The approach needs to be extended, perhaps drawing on, as a sampling frame, the two-dimensional arrangement covered in the first major section of this chapter. It might also be beneficial to incorporate some older aptitude sub-tests into such a mix, assuming that they would be available. As a result, we would learn not only about the inter-relationships of sub-tests but also about underlying aptitude constructs.
Another vital research area is that of potential ATI variables. Theoretically, Robinson (Reference Robinson and Robinson2002) has proposed a set of contexts where particular aptitude configurations are hypothesised to have special importance. Practically, Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Wen, Skehan, Biedron, Li and Sparks2019) explored whether explicit aptitude tests (in this case, the MLAT5, Paired Associates, and the Continuous Visual Memory Task) would be particularly important in beginner, foreign language contexts, and implicit/procedural tests (such as the Tower of London and Weather Prediction tests) in more advanced, second language contexts. They confirmed that this was the case, with the declarative tests being more predictive at the lower levels, and the procedural tests predicting more effectively in higher proficiency, second language contexts. The scope for such research is considerable, and what has been done so far is only a beginning. Wider conceptions of aptitude, principled selection of aptitude tests, and a range of variables that might interact with aptitude could make for an exciting arena of study. It also holds the prospect of demonstrating that matching learners with contexts (Wesche, Reference Wesche and Diller1981) could lead to more efficient learning.
The final research design area concerns micro studies. A significant number of such studies have appeared in recent years. In almost all cases, the motivation for the study has been an experimental comparison between types of instruction or types of feedback, typically a contrast between explicit and implicit conditions. But three issues emerge. The first concerns the selection of aptitude sub-tests in these micro studies. The discussion earlier in this chapter made it clear that choosing appropriate tests is a difficult undertaking, linked to the availability (or lack thereof), and the validity status, of aptitude sub-tests (Bokander, Chapter 5, this volume). It is to be hoped that a more principled basis for selecting aptitude sub-tests will be feasible in the future. Second, with regard to micro studies, there is clearly scope to explore the variable of time. The studies covered in Li (Reference Li2015) and Skehan (Reference Skehan2015), for example, varied in length of experimental condition from 15 minutes to 15 hours, with all except one study being less than four hours. Comparing effects across such diversity of intervention time is hazardous. We urgently need studies that manipulate time itself as an important variable. Third, the issue of sample size is an important one. Bokander (Chapter 5, this volume) shows that a substantial proportion of the significant correlations between aptitude and performance measures comes from studies with small sample sizes, and that studies with larger sample sizes report significant results far less often. This is a worry and suggests very strongly that sample sizes may need to be increased in such research.
How Can Aptitude Research Be Used to Illuminate Theory?
The final area to be discussed is the nature of aptitude theory and, more broadly, what aptitude research might be able to say about the nature of second language learning itself. In a sense, aptitude tests are embodiments of theories of second and foreign language learning ability, and so aptitude research has the potential to be revealing about this important ability. Following the earlier section based on ratings of aptitude sub-tests, there are two basic questions.
Does language aptitude implicate language, and if so, to what extent, and with what underlying theory?
What are the respective roles of explicit, declarative knowledge, explicit learning, and memory relative to implicit knowledge, learning, and memory?
Regarding the first question, a language interpretation would be consistent with batteries such as the MLAT, the PLAB, the DLAB, the LLAMA, and CANAL-F, together with the miscellaneous sub-tests that have been developed, such as the York test. So, to the extent that these tests work – and in the main, they do – the case for a language involvement in language aptitude testing is strengthened. We have to accept, though, that a detailed account of the language linkage is lacking – most aptitude tests have been developed on the basis of relatively vague theory. There has been no real grounding in any particular linguistic interpretation, whether connected with an underlying post-critical-period capacity for language or, alternatively, with a generalised view of human cognitive abilities (Carroll, Reference Carroll1973). Future research will be needed to explore contrasting bases for aptitude test construction, to see whether any particular viewpoints lead to superior predictive performance.
The second major question concerns the declarative-to-procedural, or explicit-to-implicit, contrast. One issue, of course, is the relationship between these two conceptually distinct labels. Part of the problem here is the difficulty in handling, at a measurement level, what is clearer at a conceptual level. Indeed, there are questions as to whether there is a clear, measurable construct of implicit learning or knowledge (Perruchet, Reference Perruchet2021). Still, capitalising on the actual range of measures, the question becomes whether we are dealing with a progression (declarative to procedural to automatic) or with a unified process, with implicit learning providing a plausible theoretical account of the latter. If it were possible to develop a range of aptitude tests that overcome these measurement difficulties, we might be able to use foreign language aptitude research to clarify which of these positions is more credible, or whether each of them might be credible in different situations, as Ullman (Reference Ullman, VanPatten and Williams2015) argues.
In view of these unresolved questions in aptitude theorising, it would probably be wise in aptitude research to take an essentially conservative approach and to avoid using restricted sets of aptitude tests (cf. Doughty, Reference Doughty2019). In other words, where possible, there would be considerable value in using language-based aptitude tests (incorporating language structure, both generatively based and more widely based, along with sound and verbal memory); working memory measures, both language- and non-language-based; general pattern learning, both explicit and implicit; and then wider implicit tests, of learning and brain functioning. We have not had many studies that take a broad perspective, yet if there is to be progress in understanding the nature of aptitude, this approach is unavoidable.

