Introduction
Instrument reliability is often overlooked, as evidenced by how rarely it is reported in research papers (McKay & Plonsky, 2021). Recently, issues surrounding reliability have received increased attention in second language acquisition (SLA), especially in studies employing tasks that elicit response-time (RT) differences (e.g., Buffington et al., 2021). For some research involving cognitive aptitude, a great deal is at stake because of the heavy reliance on RT differences as an individual difference measure (e.g., larger or smaller RT differences indicating stronger ability for an individual). Investigations into inhibitory control (IC) and its relationships with external variables (e.g., proficiency or phonological processing) fall into this category because of how IC is typically measured (e.g., Darcy et al., 2016; Huensch, 2024); it is therefore important for researchers to understand the issues around instrument reliability more thoroughly and to identify solutions that mitigate the limitations inherent to RT-difference measures. While efforts have been made to improve the reliability of RT differences (Hui & Wu, 2024), the extent of the problem and the potential consequences of relying on such measures in research specifically involving IC in SLA have not yet been fully documented and illustrated. In addition, the degree to which model-based approaches could improve the reliability of IC measures for second language (L2) research is not entirely clear, and the implications of any improved reliability for the predictive validity of the measure remain conceptual. Thus, in this paper, we perform a secondary analysis of an open dataset, initially collected to examine the relationship between IC and phonological processing in L2 learners (Huensch, 2024), to fill these gaps. Our additional, broader aim is to raise L2 researchers’ awareness of the reliability issues surrounding measures of cognitive individual differences.
Reliability challenges in RT-based individual differences measures: The case of inhibitory control measures
Generally, the reliability of an instrument refers to the extent to which it consistently yields the same score when used under the same measuring conditions with the same participants (Cohen et al., 2017). This fundamental measurement principle applies across all individual differences research, with particular implications for the RT measures widely used in SLA and cognitive psychology (e.g., Buffington et al., 2021; Maie, 2022). Reliability in correlational/individual differences studies is typically grounded in classical test theory and indicates the extent to which an instrument can rank individuals consistently (Hedge et al., 2018). Two kinds of change can undermine reliability: a reduction in variance between individuals while error variance remains constant, or an increase in error variance while the variance between participants stays the same (Hedge et al., 2018). In other words, two potential sources of low reliability are (a) high measurement error and (b) low between-participant variation. Both of these issues affect numerous individual difference measures in SLA, with RT-difference measures being particularly vulnerable (Hui & Wu, 2024; McKay & Plonsky, 2021).
A high level of measurement error can lead to “non-systematic change between individuals” (Hedge et al., 2018, p. 1167) across testing sessions, thereby undermining reliability. Whether it stems from participant behavior or from item design within the instruments themselves, measurement error introduces variability in the data that is unrelated to the construct being measured, which in turn undermines the validity of the results we observe. Additionally, reliability may suffer when there is little variation between participants or when the sample is particularly homogeneous (Hedge et al., 2018). This limitation is especially critical for correlational or individual differences research that “examines factors that distinguish between individuals within a population (i.e., between-subject variance)” (Hedge et al., 2018, p. 1166). To clarify this point further, when participants exhibit comparable performance on cognitive tasks, their resulting scores cluster together with insufficient differentiation between individuals. This restricted range of scores undermines the instrument’s capacity to produce consistent rankings of individuals across multiple measurements: even small measurement errors can alter participants’ relative standings when the true between-participant differences are minimal. Consequently, correlational analyses using such measures may fail to detect genuine relationships not because such relationships are absent, but because the instrument cannot reliably distinguish between participants’ abilities. This methodological limitation is particularly problematic in individual differences research, where the goal is precisely to identify how variation in, for example, cognitive ability relates to learning outcomes.
When it comes to RT-difference measures (i.e., measures that depend on subtracting the RTs obtained in different conditions), reliability can be low due to what Hedge et al. (2018) termed the “reliability paradox” (e.g., Buffington et al., 2021; Tan & Yap, 2016). These measures typically show robust effects at the group level but poor reliability for individual differences. This occurs, in part, because subtracting condition means (e.g., incongruent minus congruent RTs) inherently reduces between-participant variance, a crucial component of reliability in individual differences research. Moreover, even with many trials per participant, the magnitude of an individual’s RT difference tends to show substantial within-person variability. Consequently, these measures struggle to rank individuals consistently across different subsets of trials, undermining their reliability for capturing individual differences in linguistic knowledge (e.g., Hui & Jia, 2024), processing (e.g., Frinsel & Christiansen, 2024), and cognitive abilities such as procedural memory capacity (e.g., Buffington et al., 2021) and IC (e.g., Huensch, 2024).
For RT-difference tasks that tap into IC, Hedge and colleagues (2018, Studies 1 and 2) examined the test-retest reliability of four response inhibition tasks commonly used in cognitive psychology and neuroscience: the Eriksen flanker task, the Stroop task, the go/no-go task, and the stop-signal task. Calculating intraclass correlation coefficients (ICCs), the authors found that none of the four tasks reached the .80 level considered excellent reliability, and that only two measures marginally met the .60 threshold for substantial reliability. In particular, the Stroop task (in which participants name the ink color of a printed word while inhibiting reading the word itself, which may be a color word that conflicts with its ink color) had ICC values of .60 (session 1) and .67 (session 2), while the go/no-go task had an ICC of .76 in both sessions. Similarly, Hedge et al. (2022), who examined the reliability of multiple executive function tasks across several datasets, also reported that the reliability of behavioral measures, namely RT costs and error costs in conflict tasks such as the flanker and Stroop tasks, is generally low. With ICCs for the RT costs ranging from .38 to .65 in the flanker task and from .38 to .66 in the Stroop task (see supplementary material A in Hedge et al., 2019), the RT costs over a four-week period indicate only low to moderate test-retest reliability. Moreover, for the Simon task/spatial Stroop task (in which participants name the color/meaning of a stimulus while ignoring its location), despite the relatively higher test-retest reliability reported by Hedge et al. (2022), most ICCs still fell short of the .80 threshold across datasets (dataset 1: .74; dataset 3: .60; dataset 5: .67; dataset 6: .72), displaying only a moderate level of reliability. Given these findings, researchers have argued that the unsatisfactory reliability can be attributed to the lack of between-participant variability in the data (Hedge et al., 2018). This idea is supported by the observation that the ICC for RT differences is generally lower than that for the RTs in each component condition (i.e., the congruent and incongruent conditions; Hedge et al., 2022). The calculation of RT differences involves subtracting the RT in one condition from the RT in another, which inherently reduces between-participant variability and, as a result, attenuates reliability (Hedge et al., 2018, 2022).
Given the reduced between-participant variance and the potential measurement errors associated with RT measures (such as participants becoming fatigued when pressing buttons or inattentive to stimuli), these findings of low reliability for many RT-based measures of IC and other cognitive constructs should be alarming for any researcher using RT-difference measures as indicators of individual differences in SLA and beyond. This does not imply, however, that RT-difference measures should be abandoned for indexing individual differences (Hui & Wu, 2024). Rather, it highlights the importance of carefully considering how IC and, more generally, cognitive differences between individuals can be measured with high reliability. With more reliable measures, researchers could reveal more precise and meaningful insights into individual differences in relation to L2 learning and processing. This is especially important for instructed SLA researchers, who often seek to understand which learners benefit most from particular interventions by exploring aptitude × treatment interactions. Such exploration not only reveals how individual differences and external factors work together to influence language learning outcomes but also provides insight into the underlying processes at play (DeKeyser, 2021), which are core foci of SLA research.
In addition to the theoretical rationale for examining RT differences, it should be noted that, from a statistical point of view, RT-difference measures indexing individual differences are often placed on the predictor side of a regression equation, where perfect reliability is assumed. When this assumption is violated, findings regarding the predictive validity of IC for external variables such as phonological processing can be undermined. To illustrate exactly how unreliability can mask important relationships, we present a hypothetical example in which the outcome (y) represents phonological processing and the predictor (x) is IC. For clarity, we vary only the reliability of x and assume what should never be assumed in practice: that phonological processing (y in Figure 1 below) is measured highly reliably (at .90). Several simulated scenarios are visualized in Figure 1, where the reliability of x (IC) decreases from 1.00 to .20 across the panels. The leftmost panel shows the true, strong, positive relationship between the two variables. As the reliability of x decreases, the data points spread more widely along the x-axis, the slope of the regression line flattens, and eventually the slope may no longer differ significantly from zero (no relationship). In other words, the unreliability of IC measures can, depending on its extent, mask any meaningful relationship with phonological processing. It is also worth noting that the baseline here is a strong relationship (left panel); if the true relationship is moderate or weak, there is an even slimmer chance that it can be observed with an unreliable x variable.

Figure 1. Visualization of how the estimated relationship changes as measurement error in the predictor increases.
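To make this demonstration concrete, the following R sketch simulates how adding measurement error to the predictor attenuates the estimated slope. This is not the code used to produce Figure 1; the sample size, effect size, and variable names are illustrative assumptions.

```r
# Minimal simulation of attenuation due to an unreliable predictor.
# Not the code behind Figure 1; values and names are illustrative only.
set.seed(123)
n <- 200
true_ic   <- rnorm(n)                                        # latent IC
true_phon <- 0.7 * true_ic + rnorm(n, sd = sqrt(1 - 0.7^2))  # latent outcome

add_error <- function(true_score, reliability) {
  # Add noise so that var(true) / var(observed) equals the target reliability
  error_sd <- sqrt(var(true_score) * (1 - reliability) / reliability)
  true_score + rnorm(length(true_score), sd = error_sd)
}

y <- add_error(true_phon, reliability = .90)   # outcome measured at .90
for (rel_x in c(1, .8, .6, .4, .2)) {
  x <- add_error(true_ic, rel_x)
  cat(sprintf("reliability of x = %.2f -> estimated slope = %.2f\n",
              rel_x, coef(lm(y ~ x))[2]))
}
```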
While this is a serious issue, there is already some awareness of it in the field. For example, Huensch (2024) rightly pointed out that correlational studies can lose power as a result of the lack of between-participant variability in the inhibition score, an underlying cause of unreliability. To further refine our understanding of these reliability issues, it would be useful to lay bare the extent of the problem, for example, by examining how different IC tasks correlate with each other (or do not). The field would also benefit from knowing whether there are effective solutions, for example, employing more contemporary, model-based approaches to estimate a more reliable IC score for an individual learner.
Although we focus on IC measures in this paper, the reliability challenges we have discussed extend to many RT-based individual difference measures in SLA and cognitive science more broadly. The “reliability paradox” affecting IC tasks similarly impacts many other RT-difference measures, such as those targeting lexical access (Hui et al., 2025; Zhang et al., 2025), syntactic processing (Fang & Wu, 2022), and working memory capacity (Unsworth & Engle, 2005). Understanding and addressing these measurement challenges is thus important not only for improving IC research specifically but also for enhancing the methodological rigor of individual differences research across multiple domains of SLA.
Estimating the reliability of RT differences
For RT-difference measures (e.g., the Stroop task), researchers often follow classical test theory, aggregating trials into a single score by calculating the mean RT in the congruent and incongruent conditions and subtracting one from the other for further analyses. However, this aggregation ignores trial-by-trial variation, attenuating not only reliability but also effect sizes and correlations (Rouder & Haaf, 2019). To address this issue, Rouder and Haaf (2019) advocated mixed-effects models fit to trial-level data. Taking into account both participant and item variability, they reanalyzed the data shared by Hedge et al. (2018) and found much-improved test-retest reliability for both the Stroop and the flanker tasks. In SLA, Hui and Wu (2024) evaluated the effectiveness of model-based approaches by comparing three methods for estimating the reliability of RT differences: computations based on (a) raw RTs, (b) by-participant z-transformed RTs, and (c) model-based estimation. The authors found that the model-based approach can outperform the other two methods, although this advantage is contingent on certain conditions, namely moderate measurement error and a limited number of items. Despite these limitations, their findings underscore the potential of the model-based approach as a promising alternative for estimating the reliability of RT-difference tasks in SLA, potentially enhancing the validity of studies that rely on RT-difference measures.
Despite its advantages, few studies to date have adopted the model-based approach for reliability estimation, and the degree to which it could improve reliability in L2 research remains unclear. Moreover, whether the improved reliability derived from the model-based approach actually influences the predictive power of a measure still lacks sufficient empirical evidence. Thus, the current study aims not only to further the investigation of how the model-based approach could improve reliability but also to examine whether the improved reliability leads to more accurate prediction of an external measure. To achieve this, we situated our study in the context of the relationship between inhibitory control and L2 phonological processing and conducted a secondary analysis of the open dataset shared by Huensch (2022).
Huensch (2024) on inhibitory control and phonological processing
Before proceeding to describe the present study, we provide a brief review of Huensch (2024) to contextualize our secondary analysis. We chose this study because the author commendably made the data publicly available (Huensch, 2022), for which we are truly thankful; this sharing is what enabled the current methodological investigation. The author carried out a preregistered, close replication of Darcy et al. (2016), examining the relationships between IC and phonological processing. Huensch (2024) included two additional IC tasks (a Stroop task and a Simon task, both described above) alongside the retrieval-induced inhibition task originally included in Darcy et al. (2016). These two tasks were added because they are “classic test[s] of prepotent response inhibition” (Huensch, 2024, p. 3), measuring intentional, behavioral resistance to an immediately distracting stimulus, in contrast to the retrieval-induced inhibition task, which “represents unintentional, cognitive, resistance to proactive interference” (Huensch, 2024, p. 2). That is, the two tasks allow researchers to capture additional, and perhaps different, aspects of IC that are hypothesized to have a stronger relationship with production skills but that had not been examined in the initial study.

The retrieval-induced inhibition task contained three phases. In the first phase, participants were instructed to memorize 18 words grouped into three categories. In the second phase, they were prompted to recall three words from two of the three categories, and in the third phase, they were asked to identify whether the words presented had appeared in the first phase. Assessment of IC involved three trial conditions: (a) the practiced condition, (b) the inhibited condition (words not practiced but belonging to a practiced category), and (c) the control condition (words from the category that was not practiced), with the latter two being the critical conditions. The retrieval-induced inhibition score was calculated by dividing the median RT for items in the inhibited condition by the median RT for items in the control condition; a score greater than 1 indicates greater IC, with higher values reflecting stronger inhibition (Huensch, 2024). For the Simon task, each participant’s score was computed by subtracting the mean RT for congruent items (where the location of the text stimulus on the screen matched that of the key to be pressed) from the mean RT for incongruent items. Similarly, the Stroop score was computed for each participant by subtracting the mean RT for neutral items (where the stimulus was a string of symbols) from the mean RT for incongruent items (where the stimulus was a color word that did not match its ink color). For both the Simon and the Stroop tasks, lower scores indicate better IC, as they reflect faster responses to incongruent items (Huensch, 2024). All three scores were calculated on correct responses only, with incorrect responses excluded during data preprocessing. For phonological processing, Huensch (2024) followed the operationalization of Darcy et al. (2016).
L2 learners’ phonological perception was assessed by the speeded ABX categorization task with Spanish vowel and consonant contrasts /e-ei̯/ and /d-ɾ/ as critical experimental items, while their production was measured by the delayed sentence repetition task involving the same contrasts.
Data collected from 58 participants did not replicate the findings of the initial study. Specifically, Spearman partial correlation analyses showed no statistically significant relationships between retrieval-induced inhibition and vowel/consonant perception or production, and hierarchical regression revealed that inhibition was not a significant predictor of vowel perception accuracy. As for the additional IC tasks, neither the Stroop task nor the Simon task displayed a clear relationship with vowel or consonant perception and production, mirroring the retrieval-induced inhibition task. Overall, Huensch (2024) concluded that “no strong, clear, or consistent relationship emerges between inhibitory control and L2 perception/production skills” (p. 17).
Huensch (2024) argued that the discrepant results may reflect the possibilities that (1) the relationship between IC and L2 phonological processing is weak, if not null; (2) the inhibition tasks used may not effectively capture individual differences in inhibition; and (3) variations in study features could have influenced the results. Regardless of the reason, methodologically, the author acknowledged the challenge of reliably measuring IC given limited between-participant variability, which could in turn reduce statistical power and produce the null results observed.
Huensch’s insights, combined with the model-based approaches tested by Hui and Wu (2024) and Rouder and Haaf (2019), motivated us to apply a model-based approach to the data in Huensch (2024), which could potentially mitigate the limitations stemming from the unreliability of IC tasks. This approach offers the promise of more precise reliability estimates for these tasks and better accounts for the inherent variability in RT-based measures. If successful, it can strengthen the case for a weak, if not null, relationship between IC and phonological processing. Moreover, Huensch’s (2022) dataset offers a valuable opportunity to reexamine the correlations between different IC measures. This investigation can potentially disentangle the confounding statistical issue (i.e., lack of between-participant variability) from the low correlations observed in the literature (Hedge et al., 2018; Rey-Mermet et al., 2018) and speak to the important question of whether inhibition as captured by different tasks is a unified construct (Rouder & Haaf, 2019). Lastly, as mentioned earlier, the potential for enhancing the predictive power of these inhibition tasks through improved reliability has not been well supported by empirical evidence. In other words, there is a lack of robust evidence on whether these tasks would predict L2 phonological processing more accurately if their reliability were enhanced. This gap leaves uncertainty regarding the extent to which boosting reliability could lead to better predictive outcomes for these tasks.
Thus, the current study extends the work of both Huensch (2024) and Hui and Wu (2024) by applying the model-based approach to the RT data shared in Huensch (2022). By doing so, we hope to provide a more precise estimation of the reliability of these tasks, thereby contributing to a clearer understanding of the relationships not only among different IC measures but also between IC and L2 speech processing.
The present study
Building upon the work of Huensch (2024) and Hui and Wu (2024), we conducted a secondary analysis of the data shared by Huensch (2022) using a model-based approach to estimate inhibition scores. We formulated three specific research questions (RQs):
RQ1: To what extent does the reliability of the three RT-based IC measures (retrieval-induced inhibition, Simon, and Stroop tasks) improve when adopting a model-based approach?
RQ2: To what extent do correlations between the three IC tasks differ when using the more reliable indices compared to traditional scoring methods?
RQ3: What are the relationships between IC measures and L2 phonological processing, based on the more reliable IC scores?
In line with principles of open science and to facilitate replication and extension of this work, all R code used for data analysis is made publicly available in the Open Science Framework (OSF; https://osf.io/bng82/).
Data set
This study utilized the data set shared by Huensch (2022), publicly available on OSF (https://osf.io/fxzvj/); the associated substantive publication is Huensch (2024). We started with the raw data sets for each of the six tasks administered (e.g., Stroop.csv) within the zipped folder “Data and Analysis Code.zip.” In Huensch (2024), the author employed three IC tasks, each with its own scoring method (a schematic R sketch of these computations follows the list):
1. Retrieval-induced inhibition task: Scores were calculated by dividing the median RT for inhibited items by the median RT for control items.
2. Simon task: Scores were derived by log-transforming the difference between median RTs for the congruent condition (where stimulus location matched response side) and the incongruent condition (where stimulus location conflicted with response side).
3. Stroop task: Scores were computed by log-transforming the difference between median RTs for the neutral condition (color patches) and the incongruent condition (color words printed in mismatching ink colors).
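The following R sketch illustrates, under our assumptions, how such aggregate scores could be computed. The data objects (rii_data, simon_data, stroop_data) and column names (subject, condition, rt) are placeholders rather than the variables in the shared files.

```r
# Schematic, aggregate-then-subtract scoring of the three IC tasks.
# Data objects and column names are hypothetical placeholders.
library(dplyr)
library(tidyr)

median_by_condition <- function(d) {
  d |>
    group_by(subject, condition) |>
    summarise(med_rt = median(rt), .groups = "drop") |>
    pivot_wider(names_from = condition, values_from = med_rt)
}

rii_scores    <- median_by_condition(rii_data)    |> mutate(score = inhibited / control)
simon_scores  <- median_by_condition(simon_data)  |> mutate(score = log(incongruent - congruent))
stroop_scores <- median_by_condition(stroop_data) |> mutate(score = log(incongruent - neutral))
# The log() calls take the "log-transformed difference" wording literally; we
# could not verify from the prose whether the log applies to the difference
# or to each median before subtraction.
```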
RQ1: Reliability of IC measures
Methods
Data preparation
We first applied accuracy-based screening following Huensch (2024), removing all trials with incorrect responses. Each data set was then split into odd-numbered and even-numbered halves. For the Simon and Stroop tasks, which featured randomized trial orders for each participant, we restructured the data to ensure comparability: we reordered trials so that only those with identical content (i.e., the same location and text for the Simon task, and the same color and text for the Stroop task) were treated as matched pairs across halves. In the Simon task, this meant pairing trials featuring boxes of the same color in the same screen location; for the Stroop task, we paired trials presenting the same word in the same ink color.
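A hedged sketch of the screening and splitting step follows. The column names (subject, accuracy, trial_order) are placeholders, and the content-based re-pairing for the Simon and Stroop tasks is noted but not implemented.

```r
# Accuracy screening and odd/even split, per participant.
# Column names are hypothetical; re-pairing trials by content (Simon/Stroop)
# would require an additional matching step not shown here.
library(dplyr)

prepare_halves <- function(d) {
  d |>
    filter(accuracy == 1) |>                     # keep correct responses only
    arrange(subject, trial_order) |>
    group_by(subject) |>
    mutate(half = if_else(row_number() %% 2 == 1, "odd", "even")) |>
    ungroup()
}
```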
Approach 1 (Huensch’s method)
For the retrieval-induced inhibition task, we calculated the RT ratio for each item. For the Simon and Stroop tasks, we computed log-transformed RT differences. For each participant, these values were then aggregated into median scores (i.e., by-participant analyses only), following Huensch (2024). Split-half reliability was then estimated between the two halved data sets using two methods. First, we calculated Pearson correlation coefficients using the cor.test() function (stats package; R Core Team, 2021). Second, to account for potential outliers and nonnormal distributions, we computed the more robust percentage bend correlation coefficients (Wilcox, 1994) using the pbcor() function (WRS2 package; Mair & Wilcox, 2020).
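A minimal sketch of the split-half estimation follows, assuming two identically ordered vectors of by-participant scores (odd_scores, even_scores), one per half; the vector names are placeholders.

```r
# Split-half reliability for Approach 1: Pearson and percentage bend correlations.
# odd_scores and even_scores are hypothetical vectors of one score per
# participant, ordered identically in both halves.
library(WRS2)

pearson_rel <- cor.test(odd_scores, even_scores)   # stats::cor.test()
robust_rel  <- pbcor(odd_scores, even_scores)      # WRS2::pbcor(), robust to outliers

pearson_rel$estimate   # Pearson split-half coefficient
robust_rel$cor         # percentage bend coefficient
```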
Approaches 2 and 3 (Model-based methods)
For both model-based approaches, we fit linear mixed-effects models to each half of the data. These models included trial type as a fixed effect, with item and participant as random effects, allowing for random slopes for trial type (Baayen et al., 2008). The key difference between the two approaches lies in the outcome variable: Approach 2 used log-transformed RT (log[RT]), whereas Approach 3 used inverse-transformed RT (–1/RT). In both cases, we maintained a maximal random-effects structure (Barr et al., 2013), using the nloptwrap optimizer (optimx package; Nash & Varadhan, 2011) and the partial Bayesian method (blme package; Chung et al., 2013) to address convergence issues, following Hui and Wu (2024). After fitting the models, we extracted the by-participant random slopes and computed Pearson and robust correlations between the two halved data sets to assess reliability in the same way as in Approach 1.
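The sketch below shows one way such models could be specified with the blme package; the random-effects structure is simplified relative to a fully maximal specification, and the object and column names (task_data, rt, trial_type, subject, item, half) are placeholders rather than those in the analysis code shared on OSF.

```r
# Model-based reliability estimation (Approaches 2 and 3), sketched.
# Data object and column names are hypothetical; the random-effects
# structure here is a simplified stand-in for the maximal structure.
library(blme)   # blmer(): partially Bayesian mixed models; loads lme4

fit_half <- function(d, transform = c("log", "inverse")) {
  transform <- match.arg(transform)
  d$dv <- if (transform == "log") log(d$rt) else -1 / d$rt
  blmer(dv ~ trial_type + (1 + trial_type | subject) + (1 | item),
        data = d,
        control = lmerControl(optimizer = "nloptwrap"))
}

m_odd  <- fit_half(subset(task_data, half == "odd"),  transform = "log")
m_even <- fit_half(subset(task_data, half == "even"), transform = "log")

re_odd  <- ranef(m_odd)$subject    # per-participant intercepts and slopes
re_even <- ranef(m_even)$subject
stopifnot(identical(rownames(re_odd), rownames(re_even)))  # same participant order
cor.test(re_odd[, 2], re_even[, 2])  # column 2 holds the trial-type slope
```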
We chose to explore two data transformation methods, logarithmic and inverse, for several reasons. First, there is no universally optimal transformation for nonnormal data (see Maie et al., 2024); the choice can depend on the specific characteristics of the data and the nature of the research question. Second, logarithmic and inverse transformations are among the most commonly used methods in RT research, each with strengths in addressing different distributional issues (Jiang, 2013). By including both, we can assess the robustness of our findings across analytical approaches. Finally, comparing the two methods allows us to demonstrate how the choice of data transformation may or may not influence the results, providing insight into the methodological considerations researchers should keep in mind when analyzing RT data.
We selected the optimal transformation method based on which yielded the highest split-half reliability coefficient. While this approach helps identify the most reliable method for each task, we acknowledge the inherent subjectivity in this selection process. To address this limitation, we report all transformations tested and their resulting reliability coefficients, allowing readers to evaluate the magnitude of improvements across different approaches.
Results
Table 1 presents the correlation coefficients for the three IC tasks under the three computational approaches. A striking observation is the substantial variation in reliability estimates for the same task depending on the transformation and computational approach used. This variability is evident across all three tasks, with reliability coefficients ranging from near zero (indicating no reliability) to .73 (approaching acceptable reliability).
Table 1. Split-half correlations for the three RT-based inhibition tasks data sets

Note: Boldfaced values are the highest reliability coefficients observed for each task across all computational approaches.
a Coefficients were corrected from negative values using the method described by Krus and Helmstadter (1993) and used in Buffington et al. (2021).
Consistent with previous research (e.g., Hui & Wu, 2024; Rouder & Haaf, 2019), all three IC tasks demonstrated an improvement in reliability of .20 to .50 when model-based approaches were applied. This increase essentially rescued the Simon and Stroop tasks: it brought their reliability coefficients from unacceptably low levels (.14 to .34) to values approaching or exceeding .70, reaching what Brown (2014) considers the minimum threshold for acceptable reliability. These results also align with findings from Rouder and Haaf (2019), who reported a similar increase of approximately .20 in test-retest reliability when using model estimates rather than non-model-based methods in their reanalysis of Hedge et al.’s (2018) Stroop and flanker task data.
Moreover, it is important to note that the optimal data transformation differed between tasks. In the case of the Simon task, an inverse transformation performed best, while a log transformation was optimal for the Stroop task. The observed differences in optimal transformations across tasks show that researchers must be transparent about transformation selection criteria and ideally preregister their analytical decisions. Without such safeguards, there is indeed a risk of researchers trying multiple methods and selectively reporting only those yielding favorable results. In our study, we report all transformations tested to provide full transparency.
Also, despite consistent improvement across tasks, not all measures reached acceptable levels of reliability. This suggests that the model-based approach is useful to various degrees, highlighting the need for task-specific considerations in reliability analysis.
RQ2: Correlations between IC tasks
Although researchers commonly use the umbrella term “inhibitory control,” this term encompasses distinct cognitive processes that likely rely on different neural mechanisms and serve different functions. The three tasks examined in this study target distinctly different aspects of IC: retrieval-induced inhibition measures unintentional resistance to proactive interference with a strong memory component; the Stroop task assesses the ability to suppress automatic responses in a linguistic context; and the Simon task evaluates domain-general spatial response inhibition. Given these substantial differences in what each task measures, strong correlations between them would not be expected even with perfect measurement. However, traditional measurement approaches may underestimate any existing relationships due to reliability issues. Our model-based approach aims to reduce measurement error to reveal the true extent of relationships (or lack thereof) between these different aspects of IC. Rather than expecting strong correlations, our goal is to determine whether more reliable measurement might reveal modest relationships that were previously obscured by measurement error, or confirm that these distinct aspects of IC function independently.
Methods
To address RQ2, we investigated the correlations among the three IC tasks using two approaches. The first approach used the original scores calculated following Huensch (2024), while the second used the model-based individual random slopes identified as most reliable in RQ1. We considered the latter methodologically preferable a priori on the basis of established psychometric principles: a more reliable measurement provides more accurate estimates of the true relationships between constructs by minimizing the attenuating effects of measurement error. As shown in our reliability analyses (see Table 1), the approach with the highest reliability coefficient for each task was used to potentially reveal the true between-task relationships.
We used the cor.test() function for Pearson correlation coefficients and the pbcor() function from the WRS2 package for the more robust percentage bend correlation coefficients. This dual correlation analysis strategy allowed us to assess the relationships between tasks while accounting for potential outliers or nonnormality in the data.
Results
The correlation analyses revealed marked differences between the two approaches in assessing the relationships among the three IC tasks. Tables 2 and 3 present correlation matrices from Huensch’s original scoring methods and the most reliable model-based individual random slopes, respectively.
Table 2. The correlation between tasks based on Huensch’s scoring methods

Table 3. The correlation between tasks based on the most reliable model-based individual random slopes

Using Huensch’s original method, we observed consistently low correlations between tasks, with all coefficients falling below .10 (see Table 2). These results suggest little to no relationship between the three IC measures when using traditional scoring methods. Given that these tasks are designed to measure different aspects of IC, low correlations between them may align with theoretical expectations. Nevertheless, the lack of correlation raises questions about whether IC should be conceptualized as a unified IC construct with related subcomponents or as fundamentally distinct cognitive processes that share only nomenclature.
In contrast, the model-based approach yielded notably higher correlations, revealing previously undetected relationships between the tasks (see Table 3). Significant positive correlations emerged between the retrieval-induced inhibition and Simon tasks (r = .29 to .33, p = .01 to .03) and between the Simon and Stroop tasks (r = .26 to .27, p = .04 to .05). The correlation between the retrieval-induced inhibition and Stroop tasks, while higher than in the original method, did not reach statistical significance (r = .18 to .24, p = .18 to .24).
Figure 2 provides a visual comparison of the correlation coefficients and their 95% confidence intervals between three IC tasks, contrasting the results based on Huensch’s (2024) scoring methods and the most reliable model-based random slopes. This visualization clearly illustrates the enhanced inter-task relationships revealed by the model-based approach.

Figure 2. Comparison of correlation coefficients and 95% confidence intervals between original and model-based methods for IC tasks.
Note: This figure is based on the Pearson correlation, as the results from the robust correlation method are similar to those of the Pearson correlation method.
These findings highlight the potential impact of the analytical approach on the observed relationships between IC measures. The model-based approach uncovered moderate correlations between tasks that were not apparent using traditional scoring methods, suggesting that it may provide a more sensitive measure of the shared variance between different IC tasks. This improved sensitivity could have important implications for our understanding of IC as a construct and its measurement in L2 research.
RQ3: IC predicting phonological processing
Methods
To address RQ3, we investigated whether the lack of a significant relationship between IC and L2 phonological processing reported by Huensch (2024) persisted when using the most reliable IC measures derived from model-based individual random slopes. We employed two analytical approaches.
Nonparametric Spearman partial correlation analysis
We conducted analyses relating the three IC tasks’ results to the phonological scores described in Huensch (2024). For each task, we ran two analyses: one using Huensch’s original RT-difference scores and another using the model-based estimates of individual random slopes that yielded the highest reliability in RQ1.
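For illustration, the sketch below computes a Spearman partial correlation with the ppcor package; this package is our choice for the sketch, not necessarily the implementation in the original or shared analysis code, and the variable names (including the covariate labeled proficiency) are placeholders.

```r
# Spearman partial correlation between an IC score and a phonological score,
# controlling for a covariate (labeled 'proficiency' here as an assumption).
# ppcor is our choice for this sketch; variable names are placeholders.
library(ppcor)

pcor.test(x = ic_score, y = phon_score, z = proficiency, method = "spearman")
```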
Hierarchical regression analysis
We ran two sets of analyses. First, we followed Huensch’s original analysis to establish a baseline for comparison and then ran the same models using the individual random slopes generated from the most reliable model-based approach for the retrieval-induced inhibition task. In the second set of analyses, we included both the Stroop and Simon tasks in the model selection, given that they showed the greatest improvements in reliability; that is, we conducted comprehensive hierarchical regression analyses that included individual random slopes from these tasks to predict vowel perception error rates. All predictors were standardized using the scale() function to aid interpretation.
The analysis began with all IC measures entered as predictors in the first step. Before model selection, we conducted model diagnostics and removed outliers (three to four) to ensure compliance with model assumptions (see Footnote 1). At each subsequent step, we identified and removed the predictor with the highest p-value and reran the model diagnostics to verify that no assumptions were violated. We also tested whether removing the predictor resulted in a significantly worse model fit via a likelihood ratio test using the anova() function in R. The best model was the one with optimal fit that remained parsimonious. This approach allowed us to evaluate whether different IC measures predict vowel perception differently and whether the relationship between the IC measures and vowel perception holds across the various analytical methods.
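A hedged sketch of this backward selection procedure is given below. The variable names (vowel_error, proficiency, rii, simon, stroop) are placeholders, and which predictor is dropped at each step depends on the actual p-values.

```r
# Backward selection sketch: standardize predictors, fit the full model,
# drop the weakest predictor, and test whether model fit worsens.
# Variable and data names are hypothetical placeholders.
d <- analysis_data
vars <- c("proficiency", "rii", "simon", "stroop")
d[vars] <- lapply(d[vars], function(v) as.numeric(scale(v)))

full <- lm(vowel_error ~ proficiency + rii + simon + stroop, data = d)
summary(full)                        # inspect p-values; drop the largest
par(mfrow = c(2, 2)); plot(full)     # diagnostics before and after each step

reduced <- update(full, . ~ . - rii) # 'rii' dropped here purely for illustration
anova(reduced, full)                 # nested-model comparison of the removal
```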
To ensure the robustness of our final models, we conducted three sensitivity analyses: two robust regression models and a bootstrap analysis. The robust regression analyses employed the rlm() function from the MASS package (Venables & Ripley, 2002) to handle potentially influential observations and the lm_robust() function from the estimatr package to address potential heteroscedasticity. Additionally, we implemented a bootstrap analysis with 1,000 resamples using lm.boot() from the simpleboot package to obtain estimates that do not rely on parametric assumptions about the error distribution.
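The sketch below mirrors these three checks on a hypothetical final model; the formula shown is a placeholder for whichever model survived selection, and the data object is assumed from the previous sketch.

```r
# Sensitivity checks: robust regression (MASS::rlm), heteroscedasticity-
# consistent errors (estimatr::lm_robust), and a nonparametric bootstrap
# (simpleboot::lm.boot). The model formula and data are placeholders.
library(MASS)
library(estimatr)
library(simpleboot)

final_lm   <- lm(vowel_error ~ proficiency + stroop, data = d)
robust_fit <- rlm(vowel_error ~ proficiency + stroop, data = d)
hc_fit     <- lm_robust(vowel_error ~ proficiency + stroop, data = d)
boot_fit   <- lm.boot(final_lm, R = 1000)

summary(robust_fit)
summary(hc_fit)
perc(boot_fit, p = c(.025, .975))    # bootstrap percentile intervals
```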
Results
Partial correlations
Table 4 compares partial correlations between IC measures and L2 phonological processing using Huensch’s original scores and the most reliable model-based random slopes. Despite slight increases in some correlations using the model-based approach, all correlations remained weak (below .25), suggesting no substantial improvement in the relationship between IC and phonological processing measures.
Table 4. Partial correlations with Huensch’s (2024) score and the most reliable model-based individual random slopes.

Hierarchical regression
We first ran Huensch’s (2024) models with the original retrieval-induced inhibition measure and then with the slightly improved measure derived from the model-based approach. The analysis with the more reliable measure showed a marginally nonsignificant effect of the retrieval-induced inhibition task, and the variance in the outcome explained by the model remained tiny (adjusted R² = .07, β = –.03, t = –2.00, p = .051).
We then expanded upon Huensch’s (2024) original analysis by incorporating all three IC tasks, using both the original and the model-based IC scores. In both cases, the Stroop task survived model selection but was not a significant predictor. Table 5 presents the primary results of the basic models and of the expanded analyses (see the appendix for a comparison between the standard model results and the sensitivity analyses).
Table 5. Hierarchical regression results with retrieval-induced inhibition predicting vowel perception error rates while controlling for proficiency with different methods

Discussion
In this study, we conducted a secondary analysis of data from Huensch (2022, 2024), applying the methods tested in Hui and Wu (2024) to examine their potential to improve task reliability estimates and their implications for subsequent analyses. Our investigation yielded three key findings. First, model-based approaches could enhance reliability compared to methods using raw RT differences, particularly for the Stroop and Simon tasks and when paired with specific data transformations. Second, the model-based random slopes for the three tasks showed numerically higher correlations between tasks than the traditional approaches did. Lastly, the improved task reliability only partially extended to analyses of the relationships between IC and phonological processing: compared to the original methods in Huensch (2024), applying the model-based approach made no meaningful difference in either the correlation analyses or the hierarchical regression analyses, indicating that the null effects reported in Huensch (2024) were not likely due to the unreliability of the IC tasks.
Addressing RQ1, we found that model-based approaches generally improved reliability, particularly for the Simon and Stroop tasks. While these tasks showed low reliability under the standard approaches (confirming Huensch, 2024, and Hedge et al., 2018, 2022), the model-based approach increased reliability by .20 to .50 across tasks, with both the Stroop and Simon tasks reaching near-acceptable levels (.72–.73). These results align with those reported by Rouder and Haaf (2019) (i.e., from .55 to .72 for the full set of Stroop tasks), showing that model-based approaches can improve the reliability of IC tasks when they are used as predictors of L2 learning outcomes.
However, the retrieval-induced inhibition task showed limited improvement, remaining at a low level of reliability (.30–.36). This persistent issue suggests that some measures may be inherently more susceptible to measurement error and reduced between-participant variability, regardless of the statistical approach used. As Hui and Wu (2024) noted, not all datasets benefit equally from model-based approaches, with item variability, for example, potentially moderating their effectiveness. These divergent results highlight the importance of carefully selecting data analysis procedures in individual differences research in SLA and beyond. As demonstrated in our analysis and supported by Hui and Wu (2024), model-based approaches yield greater benefits when applied to tasks with substantial item variability. For tasks like the Stroop and Simon that involve stimuli of varying difficulty, model-based approaches are better positioned to partition this variance. To determine whether the model-based approach could be helpful, researchers can (1) consider the task structure and item characteristics, (2) evaluate the sample size and number of trials, and (3) conduct pilot reliability assessments before full-scale implementation. Moreover, a potential explanation for the retrieval-induced inhibition task’s resistance to improvement may lie in its distinct calculation method: unlike the Stroop and Simon tasks, which use an RT difference (e.g., in the Stroop task, IC score = incongruent RT – neutral RT), this task expresses IC as a ratio (IC score = inhibited RT / control RT). This computational difference calls for further methodological work investigating the optimal approaches to computing an IC score for different tasks and their implications for subsequent analyses, for example, for the predictive validity of the measure.
Notably, the effectiveness of the model-based approach was somewhat moderated by the data transformation employed, in the sense that the optimal transformation (log vs. inverse) varied by task: log transformation proved most useful for the Stroop task, while the Simon task benefited most from inverse transformation. This finding raises the question of how to determine the “best” analytical approach when multiple transformations are available. In our study, we used reliability coefficients as the primary selection criterion, with higher values indicating better measurement precision. However, this approach requires careful handling to avoid the potential pitfalls of researcher degrees of freedom.
In SLA, response and processing times are often transformed for statistical analyses, but the choice of transformation is not always justified, and few studies conduct sensitivity analyses involving more than one transformation method. Notable exceptions include Maie et al. (2024), who demonstrated the impact of arbitrary analytical choices (e.g., transformation) on a study’s results, and Wu and Toda Cosi (2025), who showed that certain cases do not benefit from standard transformation methods and require alternative modeling approaches. In the present context of reliability estimation, we encourage researchers to consider the resulting levels of reliability, in addition to model diagnostics, when making transformation decisions. That said, researchers must be transparent in their selection and must not abuse researcher degrees of freedom to arrive at desirable results. Concretely, they should (1) preregister their analytical plans, including transformation methods, before data collection; (2) report all transformations tested rather than only the “optimal” one; and (3) establish clear criteria for what constitutes a meaningful reliability improvement before the analysis begins. More methodological investigations should also be carried out to help applied researchers make informed decisions when selecting an appropriate approach. We suggest the following steps.
1. Begin with theoretical considerations about the distribution of your RT data. Log transformations are typically more appropriate for positively skewed distributions (Feng et al., 2014), while inverse transformations may better handle extreme outliers (Özdemir & Çavuș, 2016).
2. Apply prescreening criteria before selecting transformations. For example, evaluate whether transformations effectively normalize residuals and check variance homogeneity using diagnostic plots.
3. Report reliability estimates for all transformations tested, not just the optimal one.
4. Conduct sensitivity analyses to determine whether your conclusion remains stable across different transformation approaches (a minimal sketch of such a check follows this list).
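For step 4, one minimal check is to loop over the candidate transformations and recompute the split-half reliability under each. The sketch below reuses the hypothetical fit_half() helper and task_data object from the earlier model-based sketch; both are assumptions, not objects from the shared analysis code.

```r
# Transformation sensitivity check: does the split-half reliability (and the
# resulting conclusion) remain stable across log and inverse transformations?
# fit_half() and task_data are the hypothetical objects defined earlier.
for (tr in c("log", "inverse")) {
  m_odd  <- fit_half(subset(task_data, half == "odd"),  transform = tr)
  m_even <- fit_half(subset(task_data, half == "even"), transform = tr)
  r <- cor(ranef(m_odd)$subject[, 2], ranef(m_even)$subject[, 2])
  cat(sprintf("%-8s transformation: split-half r = %.2f\n", tr, r))
}
```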
Although the model-based approach improved the reliability of the individual measures (as shown in RQ1), it revealed only slightly stronger relationships between the tasks. Using Huensch’s original method, correlations between tasks were consistently low (below .10). In contrast, the model-based approach yielded notably higher correlations, with significant positive correlations between the retrieval-induced inhibition and Simon tasks (r = .29 to .33) and between the Simon and Stroop tasks (r = .26 to .27). This suggests that the three tasks may be tapping into somewhat different aspects of IC. The result has theoretical implications for the construct of IC, which has sometimes been, at least implicitly, conceptualized as unidimensional. Indeed, when justifying the addition of the Stroop and Simon tasks, Huensch (2024) argued that the three tasks measure distinct aspects of IC: the retrieval-induced inhibition task taps into unintentional resistance to proactive interference with a strong language processing component; the Stroop task examines language-oriented but not language-focused response inhibition; and the Simon task assesses domain-general inhibition. The low inter-task correlations require researchers using IC measures to be very specific about the targeted subconstruct of IC, because findings can depend on task selection. More generally, this specificity matters because one goal of instructed SLA research is to examine how specific individual differences interact with treatment to influence language learning (DeKeyser, 2021).
Despite the improved reliability of the IC measures, the relationship between IC and L2 phonological processing remained weak to moderate according to the partial correlation analyses and the hierarchical regression analyses with multiple sensitivity checks (RQ3). This finding further confirms the null effects reported in Huensch (2024). In addition, although our inclusion of the Stroop and Simon tasks revealed some role for the Stroop task in accounting for vowel perception, the effects were not significant and the variance explained was almost negligible. On the surface, our model-based approach did not change the conclusion drawn by Huensch (2024), which may lead some to question its usefulness. At the same time, the key contribution of this secondary analysis is that, with the more reliable measures derived from a mixed-effects model, researchers can rule out the possibility that the null effects resulted from low reliability. In other words, these findings are methodologically important because they demonstrate that improving the reliability of predictor measures through a model-based approach can address confounding statistical issues (Hedge et al., 2018; Rey-Mermet et al., 2018) and provide a clearer assessment of these measures’ true predictive power, or lack thereof.
These findings have significant implications for the broader field of SLA research beyond IC studies. First, they highlight the importance of measurement reliability when investigating individual differences in cognitive abilities that may influence language acquisition. Researchers exploring cognitive predictors of learning outcomes should prioritize reliable measurement to better uncover genuine relationships. Second, our work demonstrates how advanced statistical methods can be productively applied to existing datasets in SLA, allowing researchers to extract additional insights from previously collected data—a practice that aligns with growing emphasis on open science and resource efficiency in our field. Finally, the methodological advances demonstrated here extend beyond IC to any SLA research involving RT-difference measures, including studies of lexical access, syntactic processing, and language comprehension, offering new analytical tools to enhance the rigor of future investigations across diverse domains of SLA research.
Conclusion
Throughout this article, we have repeatedly underscored the importance of considering measurement reliability when studying individual differences in cognitive abilities and their relationships to language learning. As suggested by previous research (Buffington et al., 2021; Hui & Wu, 2024), the low reliability of a task can be partially attributed to computational methods, and more robust approaches, such as the model-based method employed in this study, may provide an alternative for L2 researchers. It is important to keep in mind that reliability is a prerequisite for validity (Davis, 1992; McKay & Plonsky, 2021) and represents a cornerstone of quantitative research. Because RT-difference measures can be very unreliable, as many external factors can influence the data (e.g., handedness, physical difficulties, and coordination; Hui & Jia, 2024), and much SLA research relies on RT data, researchers should not turn a blind eye to the issues surrounding reliability.
Competing interests
We have no known conflict of interest to disclose.
Appendix: Sensitivity check for RQ3
Comparison of hierarchical regression results from different sensitivity analysis methods with retrieval-induced inhibition predicting vowel perception error rates while controlling for proficiency with different methods.
