Parallel and identical test–retest reliability of the Tower of London test – Freiburg version

Abstract Objectives: The Tower of London – Freiburg version (TOL-F) was developed in three parallel-test versions (A, B, and C) that only differ in their physical appearance by interchanged ball colors, but not in their cognitive demands. We addressed the question whether the test–retest reliability of an identical problem set differs from the parallel test–retest reliability of a structurally identical problem set with a marginally different physical appearance. Methods: Reliabilities were assessed in two samples of young adults over a 1-week interval: In the parallel test–retest sample (n = 93; 49 female), half of the participants accomplished version A at the first session and version B at the second session, while the other half started with version B in the first session and continued with A in the second session. In the identical test–retest sample (n = 86; 48 female), half of the participants performed on version A in both the first and the second session, while the other half went through the same procedure with version B. Results: For overall planning accuracy, intraclass correlation coefficients for absolute agreement were r = .501 for the parallel test–retest and r = .605 for the identical test–retest sample, with Pearson correlations of r = .559 and r = .708 respectively. Greatest lower bound estimates of reliability were adequate to high in the two samples (ranging between .765 and .854) confirming previous studies. Conclusions: Although the TOL-F revealed only moderate intraclass correlations for absolute agreement, it showed some of the highest psychometric indices compared to repeated assessments with other TOL tests.

In the context of cognitive tasks, reliability indexes the stability of measurements and mainly features two aspects: (i) the task's internal consistency and split-half reliability, which reflect the degree to which all items of the task measure the same underlying construct, and (ii) the consistency between repeated measurements of identical (test-retest reliability) or highly similar versions (parallel test-retest reliability) of the task. In the present study, we focus on the latter aspect by studying the test-retest and parallel test-retest reliability of the TOL. Previous studies have mainly reported the Pearson correlation coefficient. However, it is no longer considered an ideal measure of identical and parallel test-retest reliability, as it only captures the relative consistency but not the absolute agreement of test scores over repeated measurements. For absolute agreement, total score variance is taken into account, including not only the variance between two measurements but also within the sample (McGraw & Wong, 1996). There is a growing consensus towards the use of different forms of the intraclass correlation coefficient (ICC; see Shrout & Fleiss, 1979; McGraw & Wong, 1996), as these may inform about both the relative consistency [ICC(3,1)] and the absolute agreement [ICC(2,1)] between the repeated measurements. Tyburski, Kerestey, Kerestey, Radoń, & Mueller (2021) have recently provided a comprehensive overview of identical and parallel test-retest reliability studies of TOL versions, which also lists four studies on adults that reported ICCs. It is noticeable that, with the exception of one study (Köstering, Nitschke, Schumacher, Weiller, & Kaller, 2015; ICC(2,1) = 0.69 for accuracy in terms of total optimal solutions), the ICCs for the performance parameters remained considerably below the desired requirement of at least 0.5, indicating poor reliability (Portney & Watkins, 2000).
More specifically, for outcomes that consider the number of problems solved, Lemay, Bédard, Rouleau, & Tremblay (2004)  This observation of low replicability of TOL measurements is neither new nor surprising when seen in the wider context of similar findings for other tasks measuring higher-order executive functions (Burgess, 1997; Rabbitt, 1997). Planning as a prototypical higher-order executive function reflects the mental generation and evaluation of potential solution alternatives in novel problem situations. This novelty aspect particularly hampers the test-retest reliability assessment of planning tasks, as novelty is not given in a second measurement with identical problem items, and practice effects are likely to occur (Rabbitt, 1997; Strauss, Sherman, & Spreen, 2006). One way to avoid using identical items for the second measurement is to use an alternative or parallel-test version. Accordingly, Calamia, Markon, & Tranel (2012) observed that the use of alternate forms helps to decrease the size of practice effects, although it does not necessarily increase reliability. In a meta-analysis of test-retest correlations of instruments typically used in neuropsychological assessment (Calamia, Markon, & Tranel, 2013), for a majority of tests the application of a parallel form at retesting was associated with a decrease in the test-retest correlation in comparison to retesting the identical form. Although the magnitude of the differences was generally Δr = .1 or less, according to the authors, psychometric properties like difficulty can differ significantly between test versions.
In this respect, it is an open question whether the test-retest reliability of an identical problem set differs from the parallel test-retest reliability of an alternative but highly similar problem set. One instrument that could be used to systematically address this question is the TOL-Freiburg version (Kaller, Unterrainer, Kaiser, Weisbrod, & Aschenbrenner, 2012a). This planning test has a sufficiently high internal consistency (glb = .73 and .76; Kaller et al., 2016; Unterrainer et al., 2019) and hence fulfills basic psychometric requirements. It was developed in three parallel-test versions (A, B, and C), whose problems are identical in structure, but whose physical appearance differs due to permutations of ball colors. Thus, these versions should impose identical cognitive demands, since structural problem parameters like search depth and goal hierarchy were kept completely identical (Kaller, Rahm, Köstering, & Unterrainer, 2011; Kaller, Unterrainer, & Stahl, 2012b). Köstering et al. (2015) already assessed test-retest reliability of the TOL-Freiburg using version A at the first and B at the second session over a 1-week interval. Pearson correlation (r = .739) and ICC for absolute agreement (r = .690) yielded adequate test-retest reliabilities. As participants in that study performed two different versions and the sample size was rather small (n = 27), here we addressed the question whether the test-retest reliability of an identical problem set (versions A-A and B-B) differs from the parallel test-retest reliability (versions A-B and B-A) on the basis of two larger samples.

Participants
The study comprised two separate, completely non-overlapping samples including only participants who had no previous experience with the TOL test.
For the parallel test-retest sample, we originally recruited 103 young participants who predominantly held high school degrees or were currently studying. Inclusion criteria were sufficient German language skills to ensure comprehension of task instructions, age between 18 and 26 years, and normal or corrected-to-normal vision. Exclusion criteria were current/past psychiatric or neurological disease, psychotropic medication, and color blindness. Depressive symptoms, crystallized intelligence, and fluid intelligence were assessed with the Beck Depression Inventory-II (BDI-II; Beck, Steer, & Brown, 1996), a German vocabulary test (Mehrfachwahl-Wortschatz-Intelligenztest or MWT-B; Lehrl, 2005), and the Advanced Progressive Matrices (short version, German adaptation and norming; Bulheller & Häcker, 1998), respectively. Due to elevated depression scores (BDI score above 14), ten subjects had to be excluded. The final parallel test-retest sample (n = 93; 49 female) had a mean age of 21.9 years (SD = 1.95; range 18.33-25.92). Participants were compensated with 20€ for both sessions. In the parallel test-retest sample, half of the participants completed version A at the first session and version B at the second session, while the other half started with version B in the first session and continued with A in the second session.
For the identical test-retest sample, 93 young participants who predominantly held high school degrees or were currently studying were recruited, applying the same inclusion/exclusion criteria, screening for depressive symptoms, crystallized and fluid intelligence tests, and compensation as for the parallel test-retest sample. On the basis of these criteria, seven participants had to be excluded, resulting in a final identical test-retest sample of 86 participants (48 female) with a mean age of 22.01 years (SD = 2.32; range 18.08-26.42). In the identical test-retest sample, half of the participants performed version A in both the first and the second session, while the other half went through the same procedure with version B. Table 1 provides a comparative overview of both samples.
The study was approved by local ethics authorities (EK-Freiburg nr. 479/19). Data acquisition complied with local institutional research standards for human research and was completed in accordance with the Helsinki Declaration.

Tower of London – Freiburg Version (TOL-F)
All participants were tested individually in a quiet room with the TOL-F (Kaller et al., 2012a). The TOL-F is a computerized pseudo-realistic representation of the TOL's originally wooden configuration and is implemented in the Vienna Test System (https://marketplace.schuhfried.com/de/tol). Individual problem items consist of a start and a goal state that are presented in the lower and upper halves of the computer screen, respectively. To transform the start state into the goal state, participants operate the TOL-F via touch screen: a ball is picked up simply by touching it with a finger. The selected ball is then encircled by a transparent whitish corona and can be moved by touching the respective rod. Subjects are instructed to transform the start state into the goal state in the minimum number of moves, which is shown to the left of the start state. Written instructions inform that only one ball may be moved at a time, that balls cannot be placed beside the rods, that only the top-most ball can be moved in case several balls are stacked on a rod, and that the rods differ in their capacities of accommodating one, two, or three balls at maximum. The computer program does not allow breaking these rules, but records any attempts to do so. Instructions further emphasize that problems have to be solved in the minimum number of moves and that participants should always plan the problem solution ahead before starting with movement execution.
For assessment of individual planning ability with the TOL-F, overall planning accuracy, defined as the sum of problems that were correctly solved in the minimum number of moves, is regarded as the primary outcome variable of interest. The TOL-F provides three different levels of minimum moves (four, five, and six move problems, eight of each, presented in increasing minimum number of moves) resulting in an overall planning accuracy of 24 problems at maximum (corresponding to 100 percent). A one-minute time limit per trial was implemented, as in the original study of Shallice (1982).
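The scoring rule described above can be illustrated with a minimal Python sketch. The `Trial` record and its field names are hypothetical and only serve this illustration; they do not reflect the data format of the Vienna Test System. The sketch assumes that a problem counts toward overall planning accuracy only if the goal state was reached within the time limit and the number of executed moves equals the minimum number of moves.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record of one TOL-F problem (illustration only)."""
    min_moves: int    # minimum number of moves for the problem (4, 5, or 6)
    moves_made: int   # number of moves the participant executed
    solved: bool      # goal state reached within the time limit


def planning_accuracy(trials):
    """Return (number of optimally solved problems, percentage of all problems)."""
    correct = sum(1 for t in trials if t.solved and t.moves_made == t.min_moves)
    return correct, 100.0 * correct / len(trials)


# Made-up example: 24 problems (eight each with 4, 5, and 6 minimum moves).
# All 4- and 5-move problems are solved optimally; every 6-move problem
# is solved, but with two surplus moves.
example = (
    [Trial(4, 4, True) for _ in range(8)]
    + [Trial(5, 5, True) for _ in range(8)]
    + [Trial(6, 8, True) for _ in range(8)]
)
score, percent = planning_accuracy(example)
```

In this made-up example, 16 of the 24 problems are solved in the minimum number of moves, so the eight problems solved with surplus moves do not contribute to the accuracy score.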
The TOL-F features three parallel-test versions, A, B, and C, consisting of the same set of problems, which are color-permuted; that is, ball colors are interchanged (cf. Berg & Byrd, 2002). Thus, across parallel-test versions, problems are structurally identical, while their physical appearance is different. As already described in the Participants section, in the parallel test-retest sample, half of the participants accomplished version A at the first session and version B at the second session, while the other half started with version B in the first session and continued with A in the second session. In the identical test-retest sample, half of the participants performed version A in both the first and the second session, while the other half went through the same procedure with version B.

Analyses
First, to compare planning accuracy between the two samples and to assess changes over the two time points, a repeated-measures ANOVA (RM-ANOVA) was calculated with the within-subject factor session (1 versus 2) and the between-subjects factor group (parallel test-retest versus identical test-retest). For assessing parallel and identical test-retest reliabilities, ICCs for absolute agreement [ICC(2,1)] and relative consistency [ICC(3,1)] were computed using two-way random effects models following Shrout and Fleiss (1979). For comparability with previous studies, we also report Pearson product-moment correlations as indices of identical/parallel test-retest reliability as well as the glb (greatest lower bound estimate) as an index of internal consistency.
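To make the distinction between the two ICC forms concrete, the following Python sketch computes ICC(2,1) and ICC(3,1) from the mean squares of a two-way layout (participants by sessions), using the single-measure formulas given by Shrout and Fleiss (1979). It is an illustration under stated assumptions, not the analysis code of this study: the function name `icc_two_way` and the example scores are made up.

```python
def icc_two_way(scores):
    """Return (ICC(2,1), ICC(3,1)) for per-participant rows of session scores."""
    n = len(scores)       # number of participants (rows)
    k = len(scores[0])    # number of sessions (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way decomposition
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: session mean differences enter the denominator
    icc_agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    # Relative consistency: session mean differences are ignored
    icc_consistency = (msr - mse) / (msr + (k - 1) * mse)
    return icc_agreement, icc_consistency


# Made-up scores: every participant improves by exactly two problems.
shifted = [[10, 12], [12, 14], [14, 16], [16, 18], [18, 20]]
agreement, consistency = icc_two_way(shifted)
```

In this made-up example the rank order is perfectly preserved, so ICC(3,1) equals 1, whereas the uniform practice gain of two problems lowers ICC(2,1) to 20/24, or about .83. This is the sense in which absolute agreement is sensitive to systematic session effects while relative consistency is not.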
To additionally check whether the version with which testing started was associated with a different learning process, the RM-ANOVA described above was carried out separately for both samples, supplemented with the between-subjects factor start (A versus B).
In the identical test-retest sample, there was also a significant main effect for session (F(1, 84) = 50.595, p < .001; partial η² = .376), but not for start (F(1, 84) = 0.013, p = .911; partial η² = .000), or for the interaction (F(1, 84) = 1.298, p = .258; partial η² = .015). In both samples, performance increased from the first to the second session, but there was no difference between starting with version A versus B and no interaction of starting version with learning across repeated assessments.
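As a plausibility check, the reported effect sizes can be recovered from the F values and their degrees of freedom, because partial η² = SS_effect / (SS_effect + SS_error) can be rewritten as (F · df_effect) / (F · df_effect + df_error). The short Python sketch below applies this identity; the function name is chosen for illustration only.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values reported for the identical test-retest sample
eta_session = partial_eta_squared(50.595, 1, 84)
eta_start = partial_eta_squared(0.013, 1, 84)
eta_interaction = partial_eta_squared(1.298, 1, 84)
```

Rounded to three decimals, these values are .376, .000, and .015, matching the effect sizes reported above.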

Internal consistency (glbs)
The greatest lower bound estimations for the parallel test-retest sample were 0.765 and 0.854 for session 1 and session 2, respectively. In the identical test-retest sample, glbs were 0.806 and 0.817 for session 1 and 2, respectively.
To check whether the numerically higher reliability in the identical test-retest sample may be related to differences in variance, we also compared the variance of overall performance between groups with Levene's test. Group variances did not differ significantly in either session, in line with the assumed equality of variance (Session 1: F(1, 177) = 0.378, p = .539; Session 2: F(1, 177) = 0.000, p = .991). The same held for the between-session difference score: according to Levene's test, its variance did not differ significantly between groups (F(1, 177) = 2.119, p = .147).
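For readers unfamiliar with the procedure, the following Python sketch shows how a Levene statistic is obtained: it is a one-way ANOVA F computed on the absolute deviations of each score from its group center. The sketch assumes mean-centering (Levene's original variant); whether the analysis reported here used mean- or median-centering is not stated, and the function name and example data are made up. The p value would additionally require the F distribution, which is omitted here.

```python
def levene_w(*groups):
    """Return (W, (df1, df2)) of Levene's test with mean-centering (assumed variant)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviations of each score from its own group mean
    z = []
    for g in groups:
        center = sum(g) / len(g)
        z.append([abs(x - center) for x in g])

    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total

    # One-way ANOVA on the absolute deviations
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    w = (n_total - k) / (k - 1) * between / within
    return w, (k - 1, n_total - k)


# Made-up data: the second group has twice the spread of the first.
w_stat, dfs = levene_w([1, 2, 3, 4, 5], [0, 2, 4, 6, 8])
```

With two groups of 93 and 86 participants, the degrees of freedom are (2 - 1, 179 - 2) = (1, 177), as in the tests reported above.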

Discussion
This study examined parallel and identical test-retest reliability of the TOL-F and revealed the following results: On the one hand, reliability was numerically higher for repeated assessment with the identical version compared to the parallel-test version. On the other hand, we found higher ICC absolute agreement measures than in most previous TOL studies. Except for the study by Köstering et al. (2015), no ICC(2,1) values for absolute agreement above .45 have been published so far for any TOL version (Tyburski et al., 2021). With ICCs(2,1) of .501 and .605 for parallel test-retest and identical test-retest reliability, the present study exceeds these values. For both results, however, it must be noted that the confidence intervals in the current study are rather large. Thus, the plausible ranges of the true reliability values do not indicate a significant difference, either between the parallel and the identical test-retest versions or in comparison with previous studies.
The main question of this study was the comparison of retesting an identical versus an alternative version. In line with the results of Calamia et al. (2013), the identical versions achieved numerically higher reliability than the alternative versions. Calamia et al. call for an ideally psychometrically identical alternative version, although this is not the case for many measurements (Calamia et al., 2012). TOL-F versions A and B consist of the same set of problems; only the ball colors are interchanged (cf. Berg & Byrd, 2002). Thus, we concluded that the color permutation should correspond to the idea of an ideal parallel version and at least reduce item-specific learning. General learning of the task remains, but that should always be the case. Numerically, it seems that the exchange of colors can lead to different reliabilities. Nevertheless, this conclusion is restricted by the overall performance difference between the two groups. It was confirmed again that the TOL-F problem set, which is balanced according to known structural problem parameters (Kaller et al., 2011), can exhibit satisfactory psychometric properties and even exceed internal consistencies established earlier (Kaller et al., 2012b; Unterrainer et al., 2019). However, the present ICC values can only be rated as "moderate" (ranging between .5 and .75) according to the criteria of Portney and Watkins (2000). This probably reflects the double-faced nature of executive functions and reliability. In their meta-analysis of some of the most widely used neuropsychological tests, Calamia et al. (2013) demonstrated that tests of executive functions (EF) had poorer test-retest reliabilities (r < .70) compared to other measures of cognitive performance. One explanation was the assumption that complex EF tasks involve multiple cognitive processes, which makes them more susceptible to performance variability in repeated testing (Delis, Kramer, Kaplan, & Holdnak, 2004).
In other words, the intended measurements of higher-order cognitive functions such as planning can also be strongly influenced by basic ongoing processes such as attention or working memory. Another explanation for lower reliability could be a learning effect that affects the second measurement: According to Strauss et al. (2006), practice effects on EF tests can lead to a restriction of range in test scores, which in turn results in lower test-retest correlations. However, this assumption is only partly consistent with the present data and the analyses of Calamia et al. (2012, 2013). Probably it is not the size of the practice effect but individual changes in the rankings between the two measurement points that explain the different reliabilities (individual change of position in the second measurement; Duff, 2012). In very homogeneous samples, as in our study, the range in test scores is more restricted than in representative samples of the population (Strauss et al., 2006). Participants of the same age with similar cognitive abilities can be expected to show less variance in performance than a more heterogeneous group with large age and educational differences. Lower variances render the same ranking in the second measurement less likely and thus may also lead to lower reliability scores. But how can the noticeably higher ICCs of about Δr = .2 in the study by Köstering et al. (2015) be explained? After all, this study used the same TOL version as in the current parallel test-retest sample (A-B versus B-A), the test interval was identical, and the participants were students of the same age with similar intelligence scores and were recruited using the same exclusion and inclusion criteria. Apart from random sample variance, the extreme value adjustment in Köstering et al. may be an explanation.
Since they studied a small sample (n = 29), they had to omit two cases deviating more than 2.5 standard deviations from the mean z-standardized between-session difference score to obtain reasonably normally distributed data. The two outliers were at the negative end of the distribution. This means that their performance in the second measurement was in the opposite direction of the whole group, which showed better performance in the second measurement. Duff (2012) has described how impressively test-retest reliabilities decrease when second measurements go in the contrary direction. The sample in the present study, which was more than three times larger, produced an acceptable normal distribution of the data per se, so that all values at both ends of the scale were included.

Limitations
A clear constraint is the rather homogeneous sample. A broader sample in terms of age and education would presumably allow the reliabilities to be increased even further and would offer better generalizability to the population. In addition, the inclusion of patient groups would be desirable. Although a total of more than 180 subjects were tested across both samples, the sample size is still below Watson's (2004) recommendation of at least 300 participants. In order to better quantify learning effects, several retests with different time intervals should be conducted. The overall performance difference between the two groups was an undesired outcome and might be related to the time period of data collection. While the parallel test-retest assessment was finalized before the COVID-19 pandemic, the identical test-retest reliability measurements took place during the pandemic. Testing conditions therefore were slightly different due to the need to wear a face mask and to keep a greater interpersonal distance. In addition, one may speculate that due to reduced social contact and suspended face-to-face teaching, students may have been in a generally poorer mental and emotional condition during this time.

Conclusion
Even though the reliabilities obtained were only moderate, as can commonly be observed with EF, the present study showed some of the highest psychometric indices for the TOL test. The small difference in reliability values between identical and parallel versions speaks in favor of using the same version, as this allows us to expect more stable results over two measurement points.