Low rate of performance validity failures among individuals with bipolar disorder

Abstract Objective: Assessing performance validity is imperative in both clinical and research contexts as data interpretation presupposes adequate participation from examinees. Performance validity tests (PVTs) are utilized to identify instances in which results cannot be interpreted at face value. This study explored the hit rates for two frequently used PVTs in a research sample of individuals with and without histories of bipolar disorder (BD). Method: As part of an ongoing longitudinal study of individuals with BD, we examined the performance of 736 individuals with BD and 255 individuals with no history of mental health disorder on the Test of Memory Malingering (TOMM) and the California Verbal Learning Test forced choice trial (CVLT-FC) at three time points. Results: Undiagnosed individuals demonstrated 100% pass rate on PVTs and individuals with BD passed over 98% of the time. A mixed effects model adjusting for relevant demographic variables revealed no significant difference in TOMM scores between the groups, a = .07, SE = .07, p = .31. On the CVLT-FC, no clinically significant differences were observed (ps < .001). Conclusions: Perfect PVT scores were obtained by the majority of individuals, with no differences in failure rates between groups. The tests have approximately >98% specificity in BD and 100% specificity among non-diagnosed individuals. Further, nearly 90% of individuals with BD obtained perfect scores on both measures, a trend observed at each time point.


Introduction
Assessing protocol validity for performance-based measures is imperative in both clinical and research contexts as interpretation of data presupposes adequate participation on the part of examinees. Performance validity tests (PVT) are therefore recommended to identify instances where one cannot interpret obtained data at face value (Bush et al., 2014).
Among the most frequently used measures of performance validity are the Test of Memory Malingering (TOMM; Tombaugh, 1996) and the optional forced-choice recognition trial of the California Verbal Learning Test-second edition (CVLT-FC; Delis et al., 2000). These measures employ a forced-choice recognition memory paradigm and given the large number of stimuli, particularly in the case of the TOMM, give the impression that they are harder than they are (Tombaugh, 1996). As a result, clinicians become concerned that those who obtain low scores on the task are more likely to be performing poorly for reasons other than genuine cognitive impairment. Each measure has demonstrated strong psychometric properties across settings and study designs (Martin et al., 2020;Schwartz et al., 2016). As per practice guidelines to minimize false positive diagnoses (Boone, 2013), each test employs conservative cut scores that maximize specificity and positive predictive power. As part of this study, additional cutoffs for the TOMM were explored to increase sensitivity.
Cognitive difficulties in learning and memory, verbal fluency, executive functioning, working memory, and attention, for example, have been associated with bipolar disoder (BD) (Chang et al., 2012;Grande et al., 2016). However, a debate exists as to whether such cognitive weaknesses represent manifestations of disease driven mechanisms, poor effort on cognitive testing due to variable mood states, or blatant malingering (Gierok et al., 2005;Lieberman et al., 2016;Moritz et al., 2017). While performance validity is commonly examined among specific populations, such as those with mild traumatic brain injury (TBI) (Carone & Bush, 2013), it is rarely a focus among individuals with affective disorders such as BD. Until there is adequate research establishing the rate with which affective disorders alone influence the rate of failures on PVTs, hypothesizing that poor performance on these measures is a function of these diagnoses is merely speculative and relies primarily on conjecture (Considine et al., 2011).
Previous researchers have examined performance validity among individuals with diagnoses of major depressive disorder (MDD) as well as those who endorse a high level of anxiety. In a small study of 20 individuals diagnosed with MDD and high levels of depression measured by the Beck Depression Inventory, only two individuals scored below the 45 cutoff on the TOMM (Yanez, Fremouw, Tennant, Strunk, & Coker, 2006). Their results suggest that the TOMM can be used with severely depressed individuals, although authors do suggest slight caution given the fact that performance was not above the cutoff for all participants. In a more recent study of individuals with MDD, all study participants including 45 patients with MDD and 32 healthy controls performed above the cutoff on trial 2 (100% in each group, Considine et al., 2011). Among older community-dwelling adults with both anxiety and depression symptoms, all participants within the study scored above the 45 cutoff regardless of mood symptoms (Ashendorf, Constantinou, & McCaffrey, 2004).
This study examines the rate of failures on two common PVTs, the TOMM and CVLT-FC, across multiple time points among a large sample of well-educated individuals with BD and individuals without psychiatric diagnosis involved in a longitudinal research study. Given the state and trait effects of mood states such as depression, including multiple time points allows for examination of potential patterns over time. We explore the rate with which people with BD fail these tests and, given that recognition memory presented in a forced-choice format is fairly simple, whether obtained scores even approach failing scores. Given existing literature, we expect that individuals with mood disorders will demonstrate similar performance to healthy controls on both PVTs. Lastly, we explore different cutoffs for the TOMM (trial 1 and 2) and CVLT-FC to increase sensitivity and to better inform clinical practice.

Participants
This study includes a subset (N = 991) of participants enrolled in the Prechter Longitudinal Study of Bipolar Disorder at the University of Michigan (McInnis et al., 2018) between October of 2010 and December of 2018. The larger longitudinal study examines biological, phenotypic, and cognitive outcomes associated with BD (see McInnis et al., 2018 for a full description). In the current study, a subset of participants of the larger cohort included those who completed performance validity measures at a baseline assessment, 1 year following the baseline, and again 5 years following the baseline.
Demographic and diagnostic information for participants can be found in Table 1. The two groups were well matched on sex and ethnicity. The BD group was older (M = 39.13, SD = 13.6, ages 18 to 84) than the healthy control group (M = 35.72, SD = 15.5, ages 18 to 77), t(397.07) = −3.12, p = .002. The healthy control group was more educated (M = 16.05, SD = 2.0) than the BD group (M = 15.29, SD = 2.1), t(462.01) = 5.158, p < .001. The BD group had a significantly higher percentage of individuals who identified as White (M = 86.7%) compared to the healthy control group (M = 71.4%), t(301.76) = −5.023, p < .001. The healthy control group had higher IQ scores on average (M = 112.95, SD = 11.7) when compared to the BD group (M = 108.77, SD = 12.4), t(989) = 4.713, p < .001. Racial identification notwithstanding, these demographic differences were of relatively small effect size (age g = 0.24, education g = 0.37, race g = 0.41, and IQ g = 0.35). Lastly, there was a significant difference in depression scores (Hamilton Depression Rating Scale [HDRS]), F (3) = 125.212, p = <.001, and mania scores (Young Mania Rating Scale [YMRS]), F (3) = 47.280, p = <.001, between the groups, with the BD group having higher scores compared to the healthy control group; however, participants were not assessed if acutely manic per study protocol (i.e., YMRS score above 30) given concerns with ability to provide informed consent and need for medical intervention.

Study procedure
Recruitment for the study occurred through advertisements in community mental health centers, in an outpatient specialty psychiatric clinic, in an inpatient psychiatric clinic, in local newspapers, at community outreach events, and on the Internet. All participants provided informed consent, and each participant received incentive payment for participation in the longitudinal study. The Institutional Review Board of the University of Michigan Medical School approved the study (IRBMED: HUM00000606).
Once participants gave informed consent, all participants completed an in-person baseline assessment. Clinicians evaluated all participants during the baseline assessment to confirm Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV) diagnoses using the Diagnostic Interview for Genetic Studies (DIGS, Nurnberger et al., 1994). To establish a consensus diagnosis and BD subtype (bipolar disorder I (BD I), bipolar disorder II (BD II), bipolar disorder not otherwise specified (BD NOS), and Schizoaffective bipolar disorder (Schizoaffective BD)) for participants, three authors evaluated medical records and data from the DIGS. The study excluded participants with active substance dependence or neurological disease at the time of enrollment. Clinicians administered mood measures were given at each time point. The research-defined neuropsychological test battery included the TOMM and CVLT-FC as performance validity measures as well as a measure of estimated IQ. Choice of performance validity measures were determined by data available from a fixed research battery that was part of a larger study from which our sample was derived.

Performance validity measures Test of memory malingering
The TOMM (Tombaugh, 1996) is a 50-item recognition test designed to detect feigned, exaggerated, or fabricated memory impairment. Initial validation studies of the TOMM (see Rees et al., 1998;Tombaugh, 1996) indicated that this instrument is a useful test for detecting exaggerated or malingered memory impairment. Using a criterion cutoff score of <45 on trial 2 (i.e., ≥90% correct responding), specificity values exceeded 91% for clinical cases and 100% for nonclinical cases (see Rees et al., 1998;Tombaugh, 1996). This test was initially validated with a clinical sample, including five major groups (no cognitive impairment, cognitive impairment, aphasia, TBI, and dementia), and while there was no specific psychiatric group those with depression were linked into the cognitive impairment group which also included Korsakoff's syndrome, Parkinson's disease, Post Traumatic Stress Disorder (PTSD), and Huntington's disease. The cognitive impairment group obtained a mean score of 43.9 (5.3) for trial 1 and a mean score of 48.6 (3.1) for trial 2, suggesting that performance on the TOMM is very resistant to different types of severe cognitive impairment (Tombaugh, 1996). While it is traditional to use the cutoff score of <45 on trial 2 as suggested in the manual, this study also examined alternative cutoffs for trial 2 based on existing literature (e.g., Erdodi et al., 2018;Martin et al., 2020) in addition to examining trial 1 data as part of secondary analyses.

California verbal learning test-second edition (CVLT-II), forced choice trial
The CVLT-II (Delis et al., 2000) is a measure of episodic memory and verbal learning. Participants are read a list of 16 words over five trials and are asked to recall the list after each repetition of the list. The measure also consists of a distractor list, short and long-delay free recall, short and long-delay category-cued recall trials, a delayed recognition trial, and lastly, a delayed forced choice trial. During the delayed forced-choice trial, the technician presents two words to the examinee and they must choose the one that was in the original list. Distracter words are either concrete nouns, as are the target words, or are abstract nouns. Traditionally, the forced choice trial is an embedded performance validity measure in which more than one error signals possible problematic engagement (Schwartz et al., 2016). As described by Delis et al. (2000), over 94% of the CVLT-II normative sample obtained perfect scores on the forced choice recognition test, suggesting that a low score on this trial may indicate insufficient effort and that possible indications for a low score may include serious psychiatric disturbance such as psychosis, severe pain, mental delirium, or conscious or unconscious exaggeration of memory deficits.

Mood measures Hamilton depression rating scale
The HAMD-17 (Hamilton, 1960) is a 17-item measure designed to assess current (i.e., in the past 1 week) symptoms of depression. Scores between 10 and 13 represent mild symptoms of depression, 14-17 represents mild to moderate symptoms, and scores above 17 represent moderate to severe symptoms.

Young mania rating scale
The YMRS (Young et al., 1978) contains 11 items and quantifies current (i.e., in the past 1 week) symptoms of mania. Additionally, clinicians include information derived from clinical observation. Scores between 13 and 19 represent minimal symptoms of mania, 20-25 represent mild symptoms, 26-37 represent moderate symptoms, and 38-60 represent severe symptoms.

Cognitive measures
Within the Prechter Longitudinal Study of Bipolar Disorder, participants received a research-defined neuropsychological evaluation that focused on areas known to be adversely impacted in BD, including memory, attention and executive functioning, psychomotor speed, and emotion processing (Langenecker et al., 2010). Standard data reduction techniques were utilized, given the large number of dependent variables, using conceptually, statistically, and theoretically derived variables (see Langenecker et al., 2010 for a full description). All test scores with negative scale properties (lower numbers reflect better performance) were inverted. Eight factor scores, based on raw scores, are utilized, including auditory memory (as measured by the CVLT-II (Delis et al., 2000)), visual memory (Rey-Osterrieth Complex Figure Test (Meyers and Meyers, 1995)), fine motor dexterity (Purdue Pegboard test (Lezak, 1995)), verbal fluency and processing speed (FAS/Phonemic fluency of the Controlled Oral Word Association Test (Benton and Hamsher, 1976), Category/Animal fluency (Benton and Hamsher, 1976), Digit Symbol-Coding from the Wechsler Adult Intelligence Scale-Third Edition (Wechsler, 1997), Stroop Color and Word Test (Golden, 1978), and Trail Making Test Form B from Trail Making Test-Parts A and B (Armitage, 1946)), conceptual reasoning and set-shifting (Wisconsin Card Sorting Test (Grant and Berg, 1948) and Parametric Go/No-go (Langenecker et al., 2005)), processing speed with interference resolution (Trail Making Test Forms A and B, Digit Symbol, Stroop Color Word Test, and Parametric Go/ No-go), inhibitory control (Parametric Go/No-go), and emotion processing (Emotion Perception Test (Green and Allen, 1997) and Facial Emotion Perception Test (Rapport et al., 2002)). The scores utilized from the CVLT-II included total learning over five trials, short delay free recall, short delay cued recall, long delay free recall, long delay cued recall, and recognition hits. Scores from Rey-Osterrieth Complex Figure Test included immediate recall, delayed recall, and recognition. On the fine motor task, scores on the Purdue Pegboard test included performance with the dominant hand, nondominant hand, and bimanual (using both hands). On emotion processing measures, errors (inverted) were utilized from the Emotion Perception Test along with accuracy and response time (inverted) from the Facial Emotion Perception Test. Regarding the Stroop Color Word Test, performance from the Stroop Word Condition, the Stroop Color Condition, and the Interference Condition were used. Correct categories and perseverative errors (inverted) were used from the Wisconsin Card Sort test. Mean accuracy for target trials, mean target response times (inverted), and mean accuracy for inhibitory trials were used from the Parametric Go/No-go task.

Wechsler abbreviated scale intelligence
The WASI (Wechsler, 1999) is a standardized measure that provides a Full-Scale Intellectual Quotient (FSIQ) and contains four subtests: Vocabulary, Similarities, Matrix Reasoning, and Block Design. The average reliability coefficient for the FSIQ-4 is 0.98 and 0.96 for the FSIQ-2. Average test-retest reliabilities are 0.92 and 0.88, respectively.

Method of analyses
The primary analyses included examining the frequencies with which participants, across diagnostic groups and time periods, obtained adequate scores on the TOMM and CVLT-FC. We calculated percentages for individuals that received specific score combinations (i.e., a 50 on the TOMM and a 16 on the CVLT-FC at baseline) on both the TOMM and CVLT-FC at each of the three time points.
This study used independent-samples t-tests to analyze demographic variables using version 24 of the Statistical Package for the Social Sciences (SPSS). This study also used independent-samples t-tests to compare mean scores on the TOMM trial 1 and trial 2 and CVLT-FC between the two primary groups (i.e., all individuals with BD and healthy controls) at each of the three time points as well as between each of the bipolar groups at each time point. While the primary focus of the analyses is on trial 2 of the TOMM, secondary analyses were also conducted on available TOMM trial 1 data as well. Furthermore, independent-samples t-tests were utilized among euthymic individuals (HDRS < 8 and YMRS < 8) who had passed symptom validity tests (TOMM trial 2 > 45 and CLVT-FC > 14) to compare neurocognitive test performance between the two groups (BD and controls). Additionally, we analyzed data using the lme4 package from the R statistical software program (Bates et al., 2015;R Core Team, 2018). The study conducted mixed effects models to analyze whether TOMM trial 2 scores differed between individuals with BD and healthy controls with data from three different time points. While the data are skewed, the model is quite robust and only requires that residuals are normally distributed (Schielzeth et al., 2020). The outcome variable was the TOMM score. The study used fixed effect of diagnostic group with time point, HAMD-17, YMRS, and WASI IQ as covariates and individual participants as a random effect.

Mood measures
As expected, individuals with BD had significantly higher scores on both mood measures compared to the healthy controls at each of the three time points; all ps < .01. Table 1 includes these results.

Performance validity results
Only five of 1316 total TOMM trial 2 scores fell below the clinical cut off of <45 on trial 2. These five scores belonged to four participants from the BD group (three with a diagnosis of BD I and one with a diagnosis of BD NOS). The performances of this small group of participants are further addressed in the discussion section. Regarding the CVLT-FC, 84 participants provided a total of 126 errors consisting of 88 concrete errors and 38 abstract errors. Table 5 contains the number of individuals who performed at various cutoffs on the CVLT-FC at three time points. Cutoff scores were determined using existing literature (e.g., Erdodi et al., 2018;Schwartz et al., 2016). One cutoff was based on a systematic review conducted by Schwartz and colleagues (2016) in which they applied a cutoff of 14 on the forced choice trial (sensitivity 50% and specificity 93%). Another, more stringent cutoff, was based on a study conducted with a group of mixed TBI individuals (75% mTBI) (Erdodi et al., 2018). In that study, they used a cutoff of 15 on the forced choice trial (sensitivity 56% and specificity 92%). Given the longitudinal nature of the study and attrition over time, not all participants completed the CVLT-FC over the three time points as can be seen in Table 5. Table 5 also contains the number of individuals who performed at various cutoffs on the TOMM (trial 1 and 2) at multiple time points.
This study compared the performance on the TOMM trials 1 and 2 and the CVLT-FC between individuals with BD I, BD II, BD NOS, and Schizoaffective BD ( Table 2). As discussed in the methods section, the primary focus was on TOMM trial 2 scores based on the manual cutoff and the more commonly used CVLT-FC cutoff. However, TOMM trial 1 data were used when available as well as additional cutoffs for secondary analyses to explore alternative cutoffs that could increase the sensitivity of the measure. While the CVLT-FC and TOMM trial 2 scores were examined at three time points, TOMM trial 1 data was only available for one time point. There were no statistically or clinically significant differences between the groups on the TOMM trial 1 or 2 at any of the three time points. While the Schizoaffective BD group and the BD II group demonstrated a statistically significant difference in performance on the CVLT-FC at one time point (p = .044), this does not represent a clinically significant difference as the means differed by less than one point. Additionally, after correcting for multiple tests, this would no longer be considered statistically significant either. Table 3 summarizes the performance validity comparisons between the combined BD group and the healthy control group. The combined BD group of individuals did not demonstrate significant differences in their performance compared to that of healthy controls on the TOMM trial 2 across all three time points. While the healthy control group demonstrated statistically significant differences in their performance compared to the BD group on the CVLT-FC and TOMM trial 1 during baseline (p < .001) and the 1-year follow-up (p < .001), these differences do not represent clinically significant differences. Results from the mixed effects model adjusting for mood and IQ revealed no significant difference in TOMM trial 2 scores between the healthy control group and the BD group, a = .07, SE = .07, p = .31. The groups demonstrated incredibly similar performance (BD M = 49.85, SD = .926 and HC M = 49.93, SD = .307). When this model was re-rerun without adjusting for mood using the HAMD-17 and YMRS, there were still no significant differences between the healthy control group and individuals with BD. Additionally, Table 4 outlines the percentage of participants who obtained specific score combinations on the TOMM trial 2 and CVLT-FC at each time point.

Neurocognitive test results
After accounting for the impact of mood symptoms and performance validity scores, the euthymic BD group exhibited poorer performance than the control group on most cognitive factors, including fine motor control t(262) = −4.019, p < .001, auditory memory t(265) = −2.768, p = .006, verbal fluency and processing speed t(256) = −4.848, p < .001, processing speed with interference resolution t(255) = −5.498, p < .001, and inhibitory control t(259) = −5.317, p < .001. Table 6 includes the means and standard deviations between the two groups for each of the eight factor scores.

Discussion
Across all time points and among BD and controls, the rate of PVT failures was exceedingly low. None of the 255 undiagnosed individuals failed a single performance validity test at any of the assessment periods, whereas the 736 individuals with BD failed less than 2% of the time, and inconsistently over time. In fact, nearly 90% of individuals with BD obtained perfect scores on both performance validity measures at each time point.
Among those with imperfect scores, individuals typically only missed one item on either test. In fact, the mean scores on the CVLT-FC differed by less than one point between the BD group and the healthy control group. Such a negligible difference is not considered clinically meaningful. Such perfect and nearly perfect scores are far from the more stringent standards for failure of each test, indicating that each PVT is robust to possible momentary attentional lapses. Other research samples show a similar pattern of perfect, or nearly perfect scores (Rees et al., 1998;Tombaugh, 1997). Dementia notwithstanding, this study contributes to the literature indicating that having a clinical diagnosis is insufficient to account for failures on these PVT's (Etherton et al., 2005;Hunt et al., 2014;Kirk et al., 2011). The commonly employed cut scores for these measures are therefore sufficiently conservative that clinicians can interpret failing scores with high positive predictive power and specificity.
These results provide compelling evidence that the mere diagnosis of BD is insufficient to account for failing scores on measures of performance validity. The results are consistent with existing literature (Duncan, 2005;Gierok et al., 2005) and helped to extend this literature with the use of multiple PVTs, a much larger sample, and the inclusion of scores across multiple time points. Clinicians obtaining substandard scores for examinees, therefore, must account for low scores on these measures from factors other than a BD diagnosis. Further, based on the results of this study, authors encourage future discussion on potentially more useful cutoff scores among patients with BD. Specifically, an alternative cutoff for the TOMM trial 2 could be beneficial for clinical practice while maintaining high specificity and strong sensitivity. Others have suggested alternative cutoffs (e.g., Denning, 2012), and this study suggests that the cutoff for the TOMM trial 2 could be <46 which would increase sensitivity and represent a conservative change. A less conservative practice to consider would be to have clinicians utilize a cutoff of <49 which would provide higher sensitivity to poor effort while maintaining adequate specificity.
This study has several limitations that subsequent research can address. Given the longitudinal nature of the larger study the data were utilized from, attrition occurred over time as would be expected, and scores were not available for each participant at each of the three time points for the TOMM trials 1 and 2 and the CVLT-FC. Also, these findings pertain to well-educated young to early older adults who were mostly Caucasian and volunteered in a research context. As such, these results will need replication in clinical settings with older, less educated, more ethnically diverse individuals in order to provide more generalizability of this data. Each participant received incentive payment for their participation alone as opposed to performing at a particular level on cognitive testing. The fact that there was no incentive to perform well could have also affected participants' performance on cognitive testing. Additionally, individuals did not complete testing if, at the time of the neurocognitive evaluation, they were acutely manic. While Vrabie et al. (2015) concluded that assessing those in mania is challenging due to the associated distractibility and impulsivity, it is plausible that acute mania affects performance on PVTs. This is a subject for subsequent investigations to explore. There has also been an emerging literature regarding PVTs in research settings, and in particular PVT failures, though this literature has yielded mix results with most studies having focused on a college population with higher failure rates attributed to using more liberal cutoffs (Roye et al., 2019). While we cannot assume that all or our participants were putting forth maximal effort, our study sample was older compared to those previously reported and we also utilized the recommended cutoffs with rates similar to those reported by other studies.
Another limitation of this study concerns the use of forcedchoice recognition PVTs, which are easily passed by individuals with significant neuropathology outside dementia. Readers should not assume that failure of any PVT is unexplainable by BD, especially PVTs embedded within existing cognitive measures. Accordingly, these conclusions do not fully address effortful processing, investment in performing well, motivation, or level    of engagement. It should also be noted that while PVT sensitivity good, it is not 100%. It is possible that existing PVT's, both embedded and stand-alone, do not have the sensitivity required for all clinical circumstances. Further research should explore other PVTs and address these factors.
In conclusion, having a diagnosis of BD alone is not sufficient to explain failures on two widely used forced-choice recognition PVTs. These PVTs in particular require minimal time to administer and speak to the interpretability of a testing battery. Additionally, insufficient effort does not explain away or address reduced neurocognitive functioning associated with individuals with BD and this lends credibility to our groups previous extensive work and current findings showing poorer neurocognitive function in BD compared to controls on measures of memory, executive functioning, and motor abilities. Thus, disease driven mechanisms associated with cognitive weaknesses experienced by individuals with affective disorders such as BD should serve as a focus for research and clinical intervention.