Efficacy of attention bias modification training for depressed adults: a randomized clinical trial

Background: This study examined the efficacy of attention bias modification training (ABMT) for the treatment of depression.
Methods: In this randomized clinical trial, 145 adults (77% female, 62% white) with at least moderate depression severity [i.e., self-reported Quick Inventory of Depressive Symptomatology (QIDS-SR) ≥ 13] and a negative attention bias were randomized to active ABMT, sham ABMT, or assessments only. The training consisted of two in-clinic and three (brief) at-home ABMT sessions per week for 4 weeks (2244 training trials total). The pre-registered primary outcome was change in QIDS-SR. Secondary outcomes were the 17-item Hamilton Depression Rating Scale (HRSD) and anhedonic depression and anxious arousal from the Mood and Anxiety Symptom Questionnaire (MASQ). Primary and secondary outcomes were administered at baseline and at four weekly assessments during ABMT.
Results: Intent-to-treat analyses indicated that, relative to assessment-only, active ABMT significantly reduced QIDS-SR and HRSD scores by an additional 0.62 ± 0.23 (p = 0.008, d = −0.57) and 0.74 ± 0.31 (p = 0.021, d = −0.49) points per week. Similar results were observed for active v. sham ABMT: a greater symptom reduction of 0.44 ± 0.24 QIDS-SR (p = 0.067, d = −0.41) and 0.69 ± 0.32 HRSD (p = 0.033, d = −0.42) points per week. Sham ABMT did not significantly differ from the assessment-only condition. No significant differences were observed for the MASQ scales.
Conclusion: Depressed adults with at least a modest negative attention bias benefited from active ABMT.

8. Data analytic approach
9. Results of original, pre-registered data analytic plan
10. Training Compliance
11. Results of linear mixed effects modeling for the active versus sham ABMT comparison
    Table SA2. Linear Mixed Effect Modeling Output (with treatment contrasts against sham ABMT).
12. Table of Effect Sizes for ABMT
    Table SA3. Effect size (d) for each ABM comparison for the primary and secondary outcomes.

The analysis reports are RMarkdown documents that contain the R code, and its associated output, used to generate the results presented in this manuscript. The reports also contain additional follow-up analyses (e.g., sensitivity analyses) that were performed but not reported due to space limitations. All primary results presented in the manuscript can be cross-checked against the results presented in those reports. The supplemental materials (below) also contain key findings from these analysis reports so that the interested reader does not have to search through the files in the Dataverse.
All analyses reported in the article and supplemental materials were implemented in R (version 4.0). 1 Our code made extensive use of the tidyverse 2 packages dplyr, purrr, and tidyr for general data extraction and transformation. The itrak package was used to process eye-tracking data and compute attention bias metrics. The lme4 3 package was used to fit linear mixed effects regression models, and the lmerTest 4 package was used to calculate inferential statistics. The DHARMa 5 package was used to plot residuals and test assumptions of the linear mixed effects models. Figures were generated using the effects, 6 ggplot2, 7 and patchwork 8 packages.

Reliability for the QIDS-SR
The QIDS-SR Cronbach's alphas ranged from .23 to .78 across assessments. The poor internal consistency of the QIDS-SR at baseline is likely an artifact of a truncated distribution associated with the inclusion criteria.

Additional attention bias assessment details

Before the presentation of practice trials, participants completed a 13-point calibration routine in order to map eye position to screen coordinates. The calibration was then validated with a 13-point retest routine and was accepted if there was less than 0.5 degrees of visual angle between the initial calibration and the subsequent validation. After the initial calibration, eye-tracking quality was maintained through drift checks between trials: participants were required to maintain their gaze on the fixation cross before the next set of stimuli was presented, and if fixation was not detected, a single-point drift-correction procedure was initiated. All stimuli were normalized to mean luminance (12.0 cd/m²), so stimulus pairs did not differ in luminance. The location and presentation of the IAPS and POFA stimulus pairs varied randomly throughout the task.

Split-half reliability of attention bias assessment outcomes
Bootstrapped split-half reliability estimates for the attention bias assessment outcomes are provided below in Table SA1.
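For readers unfamiliar with the procedure, the following is a minimal sketch of how a bootstrapped split-half reliability estimate can be computed from trial-level data. The data frame and column names (trials, id, bias) are hypothetical; the study's actual implementation is in the analysis reports in the Dataverse.

```r
library(dplyr)

## Bootstrapped split-half reliability: on each iteration, randomly split each
## participant's trials into two halves, compute the bias score in each half,
## correlate the halves across participants, and apply the Spearman-Brown
## correction. Assumes a trial-level data frame with columns id and bias.
split_half_boot <- function(trials, n_boot = 1000) {
  r <- replicate(n_boot, {
    halves <- trials %>%
      group_by(id) %>%
      mutate(half = sample(rep(1:2, length.out = n()))) %>%
      group_by(id, half) %>%
      summarise(score = mean(bias), .groups = "drop")
    wide <- tidyr::pivot_wider(halves, names_from = half,
                               values_from = score, names_prefix = "h")
    cor(wide$h1, wide$h2, use = "complete.obs")
  })
  sb <- 2 * r / (1 + r)  # Spearman-Brown corrected coefficients
  c(estimate = mean(sb), quantile(sb, c(.025, .975)))
}
```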

Additional in-lab attention bias modification training details
The in-lab training consisted of 198 trials (22 trials x 9 blocks) and lasted approximately 20 min. Participants were seated in an illuminated room (12.0 cd/m²), 45 cm from the computer screen. Before the task, participants' position and distance from the screen were assessed by an in-house automated procedure using a webcam connected to the training computer. This information was presented to the participant as a visual parallax, and the participant could progress only once they were aligned with the webcam and their face was detected.
If position and distance were not both maintained for a 10-sec window, the initial calibration attempt was considered a failure. After two failed attempts or a total duration of 60 sec, a behavioral version of the same task was employed instead. The parallax was followed by a nine-point calibration routine used to map eye position to screen coordinates. After completing calibration, participants were informed that the task would soon begin and that all instructions would be presented on the monitor.
Participants were instructed to view the images naturally. They were also instructed to look at the fixation cross prior to each trial in order to standardize the starting location of their gaze. The task began with a series of 20 practice trials using IAPS and POFA images not included among the test stimuli. Each trial began with the appearance of a central fixation cross (FC). Participants were required to maintain gaze on the central fixation cross (subtending 1° x 1° of visual angle) for 1500 msec. Immediately following fixation, a stimulus pair appeared for either 3 (POFA) or 4.5 (IAPS) sec. The location of the stimuli (left or right side of the visual field) was randomized with equal frequency. Following stimulus offset, a probe (a single or double asterisk) appeared in place of one of the stimuli. Participants were asked to indicate the type of probe by pressing "8" for one asterisk and "9" for two asterisks.
After their response, the probe disappeared before the next trial began. IAPS and POFA stimuli appeared a total of 12 and 10 times per block, respectively. The 3rd and 6th blocks were followed by a self-paced break.

Implementation of the Study Blind
The randomization sequences generated for stratified blocks were stored in a private backend data structure connected to a public frontend Shiny dashboard. Researchers entered the participant's ID and clicked an "Enroll" button, which wrote a unique random 3-character string (e.g., J89) to the participant's record in REDCap without ever displaying the actual treatment assignment, or even a dummy variable (e.g., A, B, C) for that assignment. The blind was not broken until all data were collected and a data-blind analysis had been completed.
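The sketch below illustrates this masking logic in simplified form. It is hypothetical R code, not the study's actual Shiny/REDCap implementation: assignments live in a private backend table, and enrolling a participant returns only an opaque token.

```r
set.seed(1)  # hypothetical seed, for illustration only

arms <- c("active", "sham", "assessment_only")

## One permuted-block sequence per stratum (block size 3: one of each arm)
make_sequence <- function(n_blocks) {
  unlist(replicate(n_blocks, sample(arms), simplify = FALSE))
}

backend  <- list(stratum_a = make_sequence(20), stratum_b = make_sequence(20))
counters <- list(stratum_a = 0, stratum_b = 0)
private_log <- data.frame()

## "Enroll" returns only an opaque 3-character token (e.g., "J89"); the arm is
## recorded in the private backend log and never shown to the front end.
## (A uniqueness check on tokens is omitted for brevity.)
enroll <- function(id, stratum) {
  counters[[stratum]] <<- counters[[stratum]] + 1
  arm   <- backend[[stratum]][counters[[stratum]]]
  token <- paste0(sample(LETTERS, 1),
                  paste(sample(0:9, 2, replace = TRUE), collapse = ""))
  private_log <<- rbind(private_log, data.frame(id, stratum, arm, token))
  token
}

enroll("P001", "stratum_a")  # returns something like "J89"
```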

Data analytic approach
Model specification. In specifying the random effects of the models, we followed the "keep it maximal" approach advocated by Barr and colleagues by first attempting to fit a model with a random intercept for each participant, a random slope over time for each participant, and the correlation between the two. 10 In the event that this model had issues with singularity or non-convergence, we had planned to simplify the random effects specification by assuming independence (no correlation) between intercept and slope and, if this also resulted in a problematic fit, to omit the random slope, leaving only the random intercept. Simplification proved unnecessary for the models presented in the manuscript; all of the effects presented in Table 2 were derived from models that included a correlated random slope and random intercept.
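As an illustration, this progression of random-effects specifications could be expressed in lme4 as follows. The variable names (qids, week, group, id) and data frame are hypothetical; the study's actual code is in the analysis reports in the Dataverse.

```r
library(lme4)
library(lmerTest)  # provides Satterthwaite t-tests for lmer fixed effects

## Maximal specification: correlated random intercept and slope per participant
m_max <- lmer(qids ~ group * week + (1 + week | id), data = dat)

## Fallback 1 (if singular or non-convergent): uncorrelated intercept and slope
m_uncor <- lmer(qids ~ group * week + (1 + week || id), data = dat)

## Fallback 2: random intercept only
m_int <- lmer(qids ~ group * week + (1 | id), data = dat)

isSingular(m_max)  # the check that guides the planned simplification
```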
T-tests of model coefficients were performed using Satterthwaite's method as implemented in the lmerTest package. 4 In addition, as a sensitivity analysis, we report below the results of a simple ANCOVA predicting training differences at the final time point only, covarying for pre-training baseline differences. This is a more traditional outcome than rate of change, but it is potentially biased because it excludes participants who dropped out of the study before the 4-week time point, whereas the rate-of-change analysis makes use of all available follow-up data.
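A minimal sketch of this sensitivity analysis, assuming a hypothetical wide-format data frame with one row per completer:

```r
## Completers-only ANCOVA: symptom severity at week 4 as a function of
## training group, adjusting for pre-training baseline severity
ancova <- lm(qids_wk4 ~ qids_bl + group, data = dat_wide)
summary(ancova)  # group coefficients give adjusted endpoint differences
```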

Linear model assumptions.
After fitting a regression model for each outcome, we examined linear model assumptions (e.g., normally distributed outcomes, homoscedasticity, normally distributed residual errors). Models for two of the outcomes, the MASQ anhedonic depression and anxious arousal subscales, significantly violated these assumptions owing to substantial negative and positive skew of these respective outcomes. A square-root transform of anxious arousal, and reversal of the anhedonic depression scale followed by a square-root transform, remedied these problems (Bartlett, 1947).

Choice of modeling approach.
Our pre-registered plan specified an unconstrained longitudinal data analysis (LDA) model, in which the baseline measurement is treated as part of the response vector and the experimental groups are allowed to differ at baseline. One problem with this approach is that, logically speaking, a baseline measurement cannot be a response to treatment, but this model specification permits precisely that, attributing variance in pre-training symptom severity to future training assignment. 13 This is also equivalent to testing for a significant baseline difference, a practice that has been rightly discouraged in the CONSORT statement on the grounds that it is unnecessary in the context of a randomized trial and potentially misleading. 14 Any group differences at baseline must be due to chance, so it is uninformative to test whether they are statistically significant. Two common alternatives are the constrained LDA (cLDA), which also treats baseline as part of the response vector but constrains the groups to be equal at baseline, and ANCOVA, which treats baseline as a covariate.

The differences between these approaches are largely philosophical, since they tend to yield similar conclusions and even identical effect estimates under certain conditions (e.g., ANCOVA and cLDA are equivalent when there are no missing data). However, the question of which approach to use is not merely academic; Coffman et al. show an example in which p values could range from 0.006 to 0.15 depending on the analysis method used, and they review evidence that both the ANCOVA and cLDA approaches are superior to the unconstrained LDA model that we preregistered. 12 Therefore, we thought it prudent to critically evaluate these modeling approaches and to specify, prior to data unblinding, which one should take precedence if there was no consensus regarding whether the null hypothesis should be rejected.
We did this in the context of a data-blind analysis, in which we fit models using simulated group assignments (see 1.0-jds-blind-data-plan.html in the associated dataverse: https://doi.org/10.18738/T8/UWKEFM). This exercise made it clear that both the pre-registered LDA and the purportedly superior cLDA approaches, which both treat baseline as part of the response vector, violated an assumption that baseline and post-baseline values are jointly multivariate normally distributed. 15 This happened because the QIDS-SR was used as an inclusion/exclusion criterion, which resulted in a baseline distribution that was skewed and truncated at the eligibility cutoff with far less variance than the subsequent measurements.
On the other hand, the standard ANCOVA approach - in which experimental groups are allowed to have different intercepts (i.e., a "main" effect of training) as well as different slopes (i.e., an interaction effect of training by time), and baseline is used as a covariate - loses information about the rate of change between the start of training and the first post-baseline measurement at 1 week. Relatedly, information about group equivalence at the start of training is also lost, and our blind data analysis demonstrated that the standard ANCOVA failed to model the group mean trendlines as originating from a point of equivalence, even though our simulated group assignments had been chosen to ensure baseline equivalence. Given that we predefined our primary outcome as the difference in rate of change from baseline (as opposed to a treatment difference at the final time point), the inability of this approach to accurately estimate group slopes starting from baseline was unacceptable.
Our solution was to apply the constraining logic of the cLDA - equivalence of groups prior to the start of training is built into the model as a given - but treat baseline values as a covariate rather than as part of the dependent variable. Given that we determined this model specification to be logical and statistically sound, the results presented in the main body of the text correspond to our preferred approach. We also include the results from the preregistered approach below. In this case, both approaches yielded very similar conclusions regarding group differences in the rate of change.
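To make the distinction concrete, the sketch below shows one way these three specifications could be encoded in lme4. The variable names (qids, qids_bl, week coded 0 at the start of training, group, id) and data frames are hypothetical; the study's actual code is in the Dataverse reports.

```r
library(lme4)

## Pre-registered unconstrained LDA: baseline is part of the response vector
## and groups may differ at baseline (group main effect included)
lda <- lmer(qids ~ group * week + (1 + week | id), data = dat_all)

## Standard ANCOVA-style model: post-baseline responses only, baseline as a
## covariate; separate group intercepts mean the fitted trendlines need not
## originate from a common point at the start of training
anc <- lmer(qids ~ qids_bl + group * week + (1 + week | id), data = dat_post)

## Preferred specification: baseline as a covariate, with the group main
## effect omitted so all groups share one intercept (i.e., equivalence at the
## start of training is built in); groups differ only in slope (week:group)
pref <- lmer(qids ~ qids_bl + week + week:group + (1 + week | id),
             data = dat_post)
```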

Adherence to target training level.
Participants were scheduled for a total of 8 in-clinic trainings. The very first training, which occurred at enrollment following the fMRI scan, was abbreviated to 66 trials; otherwise, in-clinic trainings were 198 trials in length. This amounts to an in-clinic training target of 66 + (7 * 198) = 1452 trials. Participants were also asked to complete a total of 12 at-home trainings, which amounts to a target of 12 * 66 = 792 trials. Percent adherence was therefore calculated for each person as a fraction of 1452 + 792 = 2244 trials. The median participant was 74% (IQR = 44-85%) adherent to the active training protocol and 74% (IQR = 47-91%) adherent to the sham training protocol.

Results of linear mixed effects modeling for the active versus sham ABMT comparison

Table SA2 presents the results of the linear mixed effects modeling using sham ABMT as the comparison condition. These models are identical to the models presented in Table 2, except that the treatment contrasts specify sham ABMT as the reference level instead of assessment only. We prioritized the active ABMT vs. assessment-only comparisons in the main document because those are the contrasts that we were powered to detect and that we indicated would be our primary outcome in our pre-registration. We present the sham-referenced contrasts here for completeness.

Target Engagement Analyses
14.1 Definition of target engagement. As noted in our methods, a generalized linear mixed effects model examined whether ABMT, time, and their interaction were associated with change in a trial-level binomial variable indicating whether gaze was directed primarily (> 50%) toward or away from sad stimuli. During peer review of this manuscript, it was suggested that we consider using the original criterion (37.5% of trials with a negative bias) as the test for target engagement. Following this suggestion would mean creating a binomial outcome indicating whether participants would still meet the eligibility criterion. In the end, we decided against this idea because the 37.5% criterion is not an inclusion criterion for high bias; it is an exclusion criterion for low bias, intended to ensure that there is room for bias to improve, not to mark the cutoff between a healthy and an unhealthy amount of bias. Thus, we examined change in the odds of gaze being directed primarily toward sad stimuli as our primary measure of target engagement.
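A sketch of such a trial-level logistic model in lme4 (hypothetical variable and data frame names; our actual implementation is described in the analysis reports below):

```r
library(lme4)

## Trial-level binomial outcome: gaze_sad = 1 if gaze was directed primarily
## (> 50%) toward the sad stimulus on that trial, 0 otherwise
te_mod <- glmer(gaze_sad ~ group * week + (1 + week | id),
                family = binomial, data = trials)
summary(te_mod)  # group:week terms test change in the odds of a sad-biased trial
```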
14.2 Exploratory target engagement analyses. The project's data repository (https://doi.org/10.18738/T8/UWKEFM) contains detailed reports of our target engagement analyses. Specifically, 1.02-jds-gaze-bias-blind-analysis-plan.html contains our rationale for how we selected our primary target engagement outcome using data-blind analyses. Document 1.21-jds-bias-primary-analysis.html describes how we implemented our target engagement analyses.
We then conducted exploratory analyses using other eye gaze and behavioral metrics of attention bias (documented in 1.22-jds-bias-secondary-analysis-RT.html and 1.23-jds-bias-secondary-analysis-itrak.html). Notably, most traditional metrics of attention bias did not significantly change with ABMT; however, a trial-level metric, attention bias variability, did appear to change with active ABMT relative to the assessment-only condition. Because this was an exploratory analysis, it is not included in the main outcome paper.

Treatment effects covarying for adherence
To further understand the relationship between training adherence and treatment outcome, we conducted post-hoc analyses among individuals who completed either active or sham ABMT, examining how training group and adherence rate were associated with post-treatment symptom severity while controlling for baseline symptom severity. Greater adherence predicted greater symptom reduction: for example, a 10% increase in adherence was associated with an additional 0.5-point (SE = 0.2, p = .035) decrease in QIDS-SR score. After controlling for adherence, the active group showed a significantly greater decrease in QIDS-SR score than the sham group (b = 1.74, SE = 0.87, p = .049).
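One way such a post-hoc model could be specified (hypothetical variable names; not the study's exact code):

```r
## Active and sham completers only: post-treatment severity as a function of
## percent adherence and training group, adjusting for baseline severity
adh_mod <- lm(qids_post ~ qids_bl + adherence + group,
              data = subset(dat_wide, group %in% c("active", "sham")))
summary(adh_mod)
```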

Exploring the relation between gender and treatment response
Although these analyses were underpowered, we conducted exploratory analyses to further examine the association between gender and treatment response to ABMT. We observed no significant gender x group x week or gender x week interactions, nor a main effect of gender, in our primary outcome analyses. In a follow-up subgroup analysis restricted to women, treatment effects appeared larger: the active ABMT condition significantly differed from both the sham ABMT (p = .032, d = -0.58) and assessment-only (p = .007, d = -0.70) conditions. In the original analyses with the full sample, the active ABMT condition outperformed the assessment-only condition (p = .008, d = -0.57) but not the sham ABMT condition (p = .067, d = -0.41).
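For completeness, these exploratory models might take the following form (hypothetical variable names, following the mixed-model sketches above):

```r
library(lme4)

## Three-way moderation test: gender x group x week
mod_gender <- lmer(qids ~ gender * group * week + (1 + week | id), data = dat)

## Subgroup analysis: refit the primary constrained model in women only
mod_women <- lmer(qids ~ qids_bl + week + week:group + (1 + week | id),
                  data = subset(dat_post, gender == "woman"))
```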