Prognostic Value of Learning and Retention Measures from the Free and Cued Selective Reminding Test to Identify Incident Mild Cognitive Impairment

Abstract
Objective: To compare the predictive validity of learning and retention measures from the picture version of the Free and Cued Selective Reminding Test with Immediate Recall (pFCSRT + IR) for identifying incident mild cognitive impairment (MCI). Methods: Learning was defined by the sum of free recall (FR) and retention by delayed free recall (DFR) tested 15–20 min later. A total of 1422 Baltimore Longitudinal Study of Aging (BLSA) participants (mean age 69.6 years, 54% male, mean 16.7 years of education) without dementia or MCI received the pFCSRT + IR at baseline and were followed longitudinally. Cox proportional hazards models were used to evaluate the effect of baseline learning and retention on risk of MCI. Results: In total, 187 participants developed MCI over a median of 8.1 years of follow-up. FR and DFR each predicted incident MCI after adjusting for age, sex, and education. Each also independently predicted incident MCI in the presence of the other, with similar effect sizes: a decrease of around 20% in the hazard of MCI for each one standard deviation increase in FR or DFR. Conclusion: The practice of preferring retention over learning to predict incident MCI should be reconsidered. The decision to include retention should be guided by time constraints and patient burden.


INTRODUCTION
Over several decades, the identification of individuals at high risk for Alzheimer's disease (AD) has been both a priority and a challenge. Neuropsychology has been at the forefront of this effort (Butters, Delis, & Lucas, 1995; Salmon & Bondi, 2009; Bondi et al., 2017; Han et al., 2017). The neuropsychological tests that most consistently and accurately predict incident AD are tests of learning and retention (Albert et al., 2001; Linn et al., 1995; Elias et al., 2000; Jacobs et al., 1995; Tierney et al., 1996; Chen et al., 2001). The earliest studies focused on the inability to learn new information by testing immediate recall of word lists (Miller et al., 1971; Weingartner et al., 1981). Subsequent studies demonstrated that, in addition to the hallmark learning deficit, retention tested at delays from 10 to 30 min was also impaired (Larrabee et al., 1993; Moss, Albert, Butters, & Payne, 1986; Welsh, Butters, Hughes, Mohs, & Heyman, 1991). In some studies, learning measures outperformed retention measures in predicting incident AD, whereas the reverse was observed in other studies. Nonetheless, both learning and retention deficits are present in preclinical AD (Bondi et al., 1994; Grober et al., 2000).
Previously, we argued that the retention deficit in AD is best examined with memory tests like the Free and Cued Selective Reminding Test (FCSRT) that control initial encoding in order to obtain maximum learning, the basis for subsequent retention (Grober & Kawas, 1997). Measuring retention of inadequately learned material can lead to contradictory results, as studies of forgetting have shown (e.g., Becker et al., 1987; Moss et al., 1986). In that study, AD participants in the preclinical stage recalled significantly fewer words than matched controls, indicating an impairment of learning; their retention, measured by percent retained of initial learning, was identical to that of controls. A retention deficit was documented three years later for AD participants but not for controls, whose retention was still perfect. We concluded that a retention deficit was not yet present in preclinical AD at a point when the hallmark learning deficit could already be documented (Grober & Kawas, 1997).
We revisited this issue by examining the trajectories of decline in learning and retention among Baltimore Longitudinal Study of Aging (BLSA) participants who were clinically normal at baseline and went on to develop AD dementia over 10 years of follow-up (Grober et al., 2008). Learning was defined by the sum of free recall (FR) over the three test trials on the picture version of the test with immediate recall (pFCSRT + IR). Retention was defined by delayed free recall (DFR) tested 15-20 min after learning. Learning and retention displayed similar profiles of decline in the years prior to the clinical diagnosis of AD, with a first acceleration of decline (change point) at 6.6-7.3 years prior to diagnosis and a second at 1.9-2.9 years prior to diagnosis. The change points for learning and retention were not significantly different. Retention defined by savings, the percent of learning that was retained, had only one change point, at 5.3 years. These analyses included only persons who prospectively developed AD dementia and did not address predictive validity.
In predictive validity studies, the conventional practice has been to rely on retention rather than learning for identifying mild cognitive impairment (MCI) and dementia (Welsh et al., 1991; Jack et al., 2019). Because delayed testing adds to patient and tester burden, choosing retention over learning requires evidence that retention is the better predictor. We sought to clarify this issue by comparing the predictive validity of learning and retention measures for the identification of BLSA participants who developed incident MCI over 10 years of follow-up. If retention (DFR) adds predictive value beyond learning (FR), that would justify measuring it. Cox proportional hazards models were used to address this question.

Participants
The primary analyses were based on data from 1422 BLSA participants without MCI at baseline who underwent longitudinal assessments with the pFCSRT + IR between July 1985 and December 2015. At enrollment, BLSA participants meet rigorous screening criteria for health and functional status. All participants had at least one follow-up assessment after baseline. All available visits were included in the analysis. The event being modeled was incident MCI. All analyses were repeated on a subsample of 1283 participants for whom apolipoprotein E (APOE) genotype was available to determine whether APOE ϵ4 carriers were at increased risk of incident MCI. The BLSA study is approved by the local institutional review board, and all participants gave written informed consent before each assessment.

MCI Diagnosis
Clinical and neuropsychological data from each participant were reviewed at a consensus case conference if their Clinical Dementia Rating score (CDR; Morris, 1993) was greater than or equal to 0.5 or if they made more than three errors on the Blessed Information-Memory-Concentration (BIMC) Test (Blessed et al., 1968). MCI was defined using the Petersen criteria (Petersen et al., 1999). Diagnoses of dementia and clinical AD were based on criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders, third edition, revised (American Psychiatric Association, 1987) and by the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association (McKhann, Drachman, Folstein, et al., 1984). Diagnoses relied on clinical history, informant report, and a broad battery of neurocognitive tests that included pFCSRT + IR scores.
Because knowledge of scores on the pFCSRT + IR may have compromised the test's independence as a predictor, we performed a sensitivity analysis using an alternative definition of cognitive impairment: four or more errors on the BIMC Test. BIMC scores of 5 to 8 indicate very high risk for incident AD (Katzman et al., 1989). Analyses were repeated using this alternative end point.
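The two cut points described above can be written as simple predicates. The following is a minimal sketch that assumes each assessment supplies a CDR score and a BIMC error count; the function names are hypothetical and are not taken from the BLSA protocol.

```python
def needs_consensus_review(cdr: float, bimc_errors: int) -> bool:
    """Trigger for consensus case conference review, as stated in the text:
    CDR of 0.5 or greater, or more than three BIMC errors."""
    return cdr >= 0.5 or bimc_errors > 3


def meets_bimc_end_point(bimc_errors: int) -> bool:
    """Alternative end point for the sensitivity analysis:
    four or more BIMC errors."""
    return bimc_errors >= 4


# Illustrative values only (not study data).
print(needs_consensus_review(cdr=0.0, bimc_errors=3))  # neither criterion met
print(needs_consensus_review(cdr=0.5, bimc_errors=0))  # CDR criterion met
print(meets_bimc_end_point(4))                         # end point reached
```

Note that the BIMC review trigger (more than three errors) and the alternative end point (four or more errors) describe the same threshold on an integer error count.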

pFCSRT + IR
Before the pFCSRT + IR was administered (Grober and Buschke, 1987), the 16 line drawings used in the test were presented for naming. The study phase followed, in which participants were asked to search a card containing four of the drawings (e.g., grapes) for an item that goes with a unique category cue (e.g., fruit). After all four items were identified, immediate recall of just those four items was tested by FR followed by cued recall for missed items. When cued recall failed, the participant was told the name of the item. The study phase was repeated until all 16 drawings had been presented. The test phase consisted of three trials of FR, each followed by cued recall for items not retrieved by FR. The learning measure was the sum of FR (maximum = 48). The retention measure was DFR tested 15-20 min after learning without re-presentation of the items (maximum = 16). Retention defined by the savings method was not used as a predictor because its decline occurred later in the predementia phase and so would not be sensitive to early disease (Grober & Kawas, 1997; Grober et al., 2019).
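The two predictors defined above can be derived from raw trial counts as follows. This is a minimal sketch, assuming the number of items freely recalled on each of the three test trials and at the delay has been recorded; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

N_ITEMS = 16        # line drawings in the test
N_TEST_TRIALS = 3   # free recall trials in the test phase


@dataclass(frozen=True)
class PfcsrtScores:
    learning: int   # sum of FR over the three test trials (maximum 48)
    retention: int  # DFR after the 15-20 min delay (maximum 16)


def score_pfcsrt(free_recall_by_trial: Sequence[int],
                 delayed_free_recall: int) -> PfcsrtScores:
    """Compute the learning (FR sum) and retention (DFR) measures."""
    if len(free_recall_by_trial) != N_TEST_TRIALS:
        raise ValueError("expected free recall counts for three test trials")
    for count in (*free_recall_by_trial, delayed_free_recall):
        if not 0 <= count <= N_ITEMS:
            raise ValueError("recall counts must lie between 0 and 16")
    return PfcsrtScores(learning=sum(free_recall_by_trial),
                        retention=delayed_free_recall)


# Illustrative values only (not study data).
scores = score_pfcsrt([8, 10, 11], delayed_free_recall=9)
print(scores.learning, scores.retention)  # prints "29 9"
```

Cued recall is deliberately left out of the sketch because neither predictor used in this study depends on it.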

Statistical Analysis
The outcome we modeled was time to incident MCI within 10 years of baseline pFCSRT + IR. Some individuals were never observed to have MCI; they were assessed as normal at one assessment and as having dementia at the next. For those individuals, we assumed that the onset of MCI was unobserved and occurred prior to or simultaneously with the onset of dementia. As a proxy for time to MCI in those individuals, we used time to dementia. Cox proportional hazards models were used to evaluate the effect of baseline learning (FR) and retention (DFR) on risk of MCI, adjusting for age, sex, and education. The effects of the predictors were reported as hazard ratios: for continuous variables, the ratio of the hazard rates of MCI incidence corresponding to a 1-unit difference in the predictor; for categorical variables, the ratio of the hazard rates in the exposed versus the reference group (e.g., APOE ϵ4 carriers versus noncarriers). Because learning and retention scores from the same test are highly correlated (r = 0.65), we sought to determine if each score made an independent contribution to prediction of incident MCI. A sensitivity analysis further adjusting for APOE ϵ4 genotype was performed to evaluate whether APOE ϵ4 genotype (ϵ4 carriers vs. noncarriers) altered the findings. As a further sensitivity analysis, these analyses were repeated using four or more errors on the BIMC as the outcome. The partial likelihood ratio test (Cox, 1972) was used to compare nested models (model with DFR and FR vs. FR only, adjusting for covariates). A measure of explained variation, defined as the ratio of distance measures between the survival processes and the fitted survival curves with and without predictors in the model, was also reported (Schemper & Henderson, 2000).
The hazard ratios for incident MCI using learning (model 1) and retention (model 2) in separate models and in the same analysis (model 3), adjusted for covariates, are shown in Table 2. Age was a significant predictor of incident MCI in all models. Both FR and DFR were significant predictors of incident MCI in separate models: for each SD increase in FR, risk of MCI decreased (HR = 0.66); for each SD increase in DFR, MCI risk also decreased (HR = 0.68). When both FR and DFR were examined simultaneously, both FR (HR = 0.77) and DFR (HR = 0.81) remained significant, and there was no significant difference between the magnitudes of the two effects (p = 0.755). Using the measure of explained variation, the percent explained by FR and covariates was 16.1; the addition of DFR increased the explained variation to 17.0, an increase of 0.9 percentage points. For the comparison of adding DFR to the model with FR and covariates, the partial likelihood ratio test showed that the addition of DFR significantly improved the model fit (p = 0.018). The results were not materially different when four or more errors on the BIMC was the outcome event being modeled in 1303 eligible participants who had BIMC <= 3 at baseline and a follow-up assessment. Using this end point, 261 incident cases developed (Table 3).
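The construction of the time-to-event outcome, including the use of time to dementia as a proxy when MCI was never observed, can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the visit representation, the status labels, and the function name are hypothetical, and censoring at the last event-free assessment within the 10-year window is an assumption, since the text does not specify the censoring rule.

```python
from typing import Iterable, Tuple

# A visit is (years since baseline, status); the status labels are hypothetical.
Visit = Tuple[float, str]


def time_to_incident_mci(visits: Iterable[Visit],
                         horizon: float = 10.0) -> Tuple[float, bool]:
    """Return (time in years, event indicator) for one participant.

    The first visit with status "mci" counts as the event. A first visit with
    status "dementia" and no earlier MCI is treated as a proxy event time,
    following the rule described in the text. Participants with no event
    within the horizon are censored at their last event-free visit
    (an assumption of this sketch).
    """
    last_event_free = 0.0
    for years, status in sorted(visits):
        if years > horizon:
            break
        if status in ("mci", "dementia"):
            return years, True
        last_event_free = years
    return last_event_free, False


# Illustrative values only (not study data).
print(time_to_incident_mci([(0.0, "normal"), (2.0, "normal"), (4.0, "mci")]))
print(time_to_incident_mci([(0.0, "normal"), (3.0, "dementia")]))
print(time_to_incident_mci([(0.0, "normal"), (9.0, "normal"), (12.0, "mci")]))
```

The resulting (time, event) pairs are the inputs a Cox proportional hazards routine would take alongside the baseline covariates.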
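To make the reported quantities easier to read, the sketch below converts a hazard ratio into a percent change in hazard and computes the p value of a likelihood ratio test for two nested models that differ by one parameter. The function names are hypothetical, and no model is fitted here; any log partial likelihood values passed in would have to come from fitted Cox models.

```python
import math


def percent_change_in_hazard(hazard_ratio: float) -> float:
    """Percent change in hazard implied by a hazard ratio (negative = decrease)."""
    return (hazard_ratio - 1.0) * 100.0


def hazard_ratio_per_sd(beta_per_unit: float, sd: float) -> float:
    """Rescale a Cox coefficient per raw-score unit to a hazard ratio per SD."""
    return math.exp(beta_per_unit * sd)


def lrt_p_value_1df(loglik_reduced: float, loglik_full: float) -> float:
    """Likelihood ratio test for nested models differing by one parameter.

    The statistic 2 * (loglik_full - loglik_reduced) is referred to a
    chi-square distribution with 1 degree of freedom, whose upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return math.erfc(math.sqrt(statistic / 2.0))


# Hazard ratios from model 3 as reported in the text.
print(round(percent_change_in_hazard(0.77), 1))  # prints -23.0 (FR)
print(round(percent_change_in_hazard(0.81), 1))  # prints -19.0 (DFR)
```

Read this way, the model 3 hazard ratios correspond to decreases of 23% and 19% in the hazard of MCI per SD increase in FR and DFR, respectively.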
Of the 1283 participants who had APOE information, 36% of the incident MCI participants were APOE ϵ4 carriers compared to 25% of the non-cases (p = 0.005). APOE ϵ4 carriers developed incident MCI at more than twice the rate of noncarriers (HR = 2.75). The addition of APOE status as a covariate in the Cox models did not materially change the HRs for FR or DFR for predicting incident MCI (Table 4).

DISCUSSION
We sought to determine whether the conventional practice of relying on retention rather than learning for identifying MCI was justified when learning and retention were measured with the pFCSRT + IR. We compared their predictive validity for risk of incident MCI among 1422 BLSA participants who

Age was a significant predictor in all models, whereas sex and years of education were not significant predictors in any model. Analyses were repeated with four or more errors on the BIMC as the end point as a sensitivity analysis to avoid diagnostic circularity given inclusion of pFCSRT + IR in diagnostic consensus conferences. The results were not materially different. APOE ϵ4 carriers developed incident MCI at rates in line with published studies (Dang et al., 2018). Importantly, adjusting for APOE ϵ4 allele did not diminish the relationship between FR or DFR and incident MCI. The incidence rate for MCI was 1.75% per year, which is low in comparison to the incidence rate in other longitudinal studies (Kantarci et al., 2013; Machulda et al., 2013). The low incidence rate most likely reflects the strict health and functional criteria at BLSA enrollment and the continuous enrollment of BLSA participants.
The FCSRT has been widely used to identify prevalent dementia, incident dementia and AD, and MCI (Auriacombe et al., 2010; Di Stefano et al., 2015; Derby et al., 2013; Katz et al., 2012; Sarazin et al., 2007). pFCSRT + IR measures, specifically FR, total recall (TR; sum of FR and cued recall), and their combination (FR + TR), are components of the preclinical Alzheimer's disease clinical composite (PACC) for detecting cognitive change, which also includes Logical Memory, the Digit Symbol Substitution Test, and the Mini Mental State Exam (Papp et al., 2017). When the PACC was administered annually to 277 clinically normal participants in the Harvard Aging Brain Study (HABS) grouped according to threshold levels of amyloid imaging and followed for up to 5 years, all combinations including FR resulted in a larger magnitude of effect for differences between Aβ groups over three and five years of follow-up than any other PACC component. FR, alone or combined with total recall, was the only individual component to show differences between the Aβ+ group who progressed to CDR 0.5 and those who remained stable.
The failure of clinical trials targeted to decreasing the accumulation of Aβ pathology in cognitively normal adults prompted examination of cognitive decline that occurs within the normal range of the amyloid imaging tracer, 18F-florbetapir (Insel et al., 2020). Continuous levels of the tracer were associated with the individual PACC components of 4432 cognitively unimpaired adults screened for inclusion in the A4 trial. The magnitude of the decrease in FR and FR + TR scores at subclinical levels of tracer uptake (standardized uptake value ratio, SUVR = 1.10) compared to normal levels (SUVR = 0.78) was more than twice that of the other PACC components, with a larger magnitude of effect than the PACC itself. Though the decline in pFCSRT + IR performance in the subthreshold range of Aβ was small, it marks the start of the episodic memory impairment that is the hallmark of AD.
The decline of FR in the preclinical course of AD was associated with the progression of neurofibrillary tangle (NFT) pathology defined by Braak stage in 300+ cases from the Washington University clinic-neuropathologic cohort (Grober et al., 2021b). Compared with cases with limited NFT pathology (Braak stage 0 and I), FR of cases with Braak stage III pathology was significantly lower and continued to decline at similar rates in successive Braak stages. Unlike FR, Mini Mental State Exam and CDR sum of boxes scores did not decline until Braak stage IV. We suggest that FR performance may be useful in predicting tau positivity in observational studies and in clinical trials (Grober et al., 2021b).
Other studies have compared measures of learning and retention as predictors of incident AD, with varying results. Differences in the patterns of results can be observed for the same test. When the California Verbal Learning Test (CVLT) measures were used to predict incident dementia in 133 participants without dementia at baseline, neither the short nor long delay measures improved prediction over learning, as measured by the sum of FR over five trials (Bondi et al., 1999). In another comparison of CVLT measures in predicting MCI or dementia, learning was the most powerful predictor of all the measures, but predictive value was enhanced by adding delayed story recall (Rabin et al., 2009). The varying results even for the same test are not surprising when the factors that determine predictive value are considered: the stage of an individual with respect to the multiyear process of cognitive decline that precedes dementia (Bilgel et al., 2014); the psychometric properties of the particular test being used (Grober, Ocepek-Welikson, & Teresi, 2009); and the composition of the sample that does not go on to develop dementia.
The most recent example of neuropsychology's contribution to the identification of individuals at high risk of AD examined trajectories of 35 neuropsychological tests in an APOE-ϵ4-enriched cohort of 784 cognitively normal participants to determine how far in advance of incident MCI cognitive decline can be identified (Caselli et al., 2020). Sixty-five participants developed amnestic MCI during an average follow-up of 9.5 years at a mean age of 73. The rate of decline of 34 of the 35 tests was steeper among MCI converters relative to nonconverters following the inflection point when performance of the two groups diverged. Multiple episodic memory tests (Auditory Verbal Learning Test and Selective Reminding Test) displayed the earliest inflection points, nearly 20 years in advance of MCI diagnosis, with retention decline beginning a year earlier than learning decline (age 54 versus 55). These findings challenge the current disease model of preclinical AD wherein cognition begins to decline after sufficient amyloid and tau deposition has occurred (Sperling et al., 2011; Jack et al., 2013).
The strength of our data set is the sizable and well-characterized cohort of incident MCI cases and the large number of assessments available over more than 20 years of follow-up. However, generalizability is limited due to the high educational level of the BLSA cohort. The similarity of the findings based on clinical conference diagnosis, where circularity is possible, and an outcome based on a BIMC cut point for impairment adds confidence that our findings are not an artifact of our diagnostic procedures.
The benefits of testing retention after initial learning may depend on the particular memory test being used and the stage of disease. Our results using the pFCSRT + IR suggest that the practice of preferring retention over learning to predict incident MCI merits reconsideration, since both independently predict the outcome in the presence of the other with a similar magnitude of effect. Adding retention to the model that included learning increased the explained variation by about 1%. Thus, the decision to include DFR in the assessment may depend on the setting. In the clinic, if there is time to extend the assessment by 20 min to capture DFR, the additional information may be warranted. In a telephone or web-based assessment where time may be more limited and patient burden a greater concern, adding DFR may be inadvisable given its marginal enhancement in predicting MCI.

FUNDING STATEMENT
This study was supported in part by the Intramural Research Program, National Institute on Aging, NIH and the NIH: 2PO1 AG003949.

CONFLICTS OF INTEREST
The FCSRT + IR is copyrighted by the Albert Einstein College of Medicine and is made freely available for noncommercial purposes. Dr. Ellen Grober receives a small percentage of any royalties on the FCSRT + IR when it is used for commercial purposes.
Dr. Cuiling has no disclosures. Dr. Susan Resnick has no disclosures other than being an employee of the NIA. Dr. Claudia Kawas has no disclosures. Dr. Melissa Kitner-Triolo has no disclosures.