The Negative Effects Questionnaire: psychometric properties of an instrument for assessing negative effects in psychological treatments

Abstract
Background: Psychological treatments provide many benefits for patients with psychiatric disorders, but research also suggests that negative effects might occur from the interventions involved. The Negative Effects Questionnaire (NEQ) has previously been developed as a way of determining the occurrence and characteristics of such incidents, consisting of 32 items and six factors. However, the NEQ has yet to be examined using modern test theory, which could help to improve the understanding of how well the instrument works psychometrically. Aims: The current study investigated the reliability and validity of the NEQ from both a person and item perspective, establishing goodness-of-fit, item bias, and scale precision. Method: The NEQ was distributed to 564 patients in five clinical trials at post-treatment. Data were analysed using Rasch analysis, i.e. a modern test theory application. Results: The NEQ (1) exhibits fairness in testing across sociodemographics, (2) shows comparable validity for a final and condensed scale of 20 instead of 32 items, (3) uses a rating scale that advances monotonically in steps of 0 to 4, and (4) is suitable for monitoring negative effects at the item level. Conclusions: The NEQ is proposed as a useful instrument for investigating negative effects in psychological treatments, and its newer, shorter format could facilitate its use in clinical and research settings. However, further research is needed to explore the relationship between negative effects and treatment outcome, as well as to test it in more diverse patient populations.


Introduction
Psychological treatments can provide patients with effective means of overcoming mental distress and increasing their well-being (McHugh and Barlow, 2010). Research on the efficacy of evidence-based approaches, such as cognitive behaviour therapy (CBT), suggests that a large number of patients improve thanks to the interventions they receive (Hofmann et al., 2012). However, not everyone seems to benefit, with only half of the patients being regarded as responders at post-treatment and follow-up (Loerinc et al., 2015). Similarly, several investigations suggest that [...] useful in clinical and research settings. In particular, such a method makes it possible to detect item bias and to explore whether each item performs in a comparable way across sociodemographics. A similar study was conducted using Rasch analysis for the Depression, Anxiety and Stress Scales (Lovibond and Lovibond, 1995), suggesting that a number of items could be removed and that it was not supported as a general instrument for mental distress (Shea et al., 2009).
The aim of the current study was thus twofold: to explore the response categories of the NEQ to see if they are of incremental scale steps, i.e. 0-4, and to examine the response pattern and goodness-of-fit between persons and items. The overall objective was to determine the usefulness of the NEQ as a way of exploring negative effects in psychological treatments.

Participants
Participants were recruited from five clinical trials of spider phobia, perfectionism, social anxiety disorder, and loneliness (N = 564). Each case involved self-referrals and the studies were advertised in Sweden via national and regional newspapers and radio shows, social media, posters and flyers. A complete overview of the sociodemographics and clinical variables at pre-treatment is given in Table 1. Because not every clinical trial requested the same type of information from the participants, there was some degree of systematic missing data, e.g. living with someone, prior psychological treatment, and prior or ongoing psychotropic medication. Also, due to publication issues, symptom severity was not possible to present for one of the clinical trials. In addition, one of the clinical trials was included as part of the exploratory factor analysis of the NEQ, i.e. social anxiety disorder (n = 189) (Rozental et al., 2016).

Treatment and therapists
The psychological treatments that were administered in the clinical trials consisted of CBT, delivered in various formats: face-to-face, virtual reality, and via the internet, with or without guidance from a therapist, or by support on demand. The therapists were master's degree students who had undergone basic clinical training or more experienced therapists in advanced clinical training (i.e. psychotherapists in training). As for the internet conditions, participants received weekly modules consisting of both reading material and exercises, comparable to a self-help book (Andersson, 2016). The psychological treatments ranged from one session to 9 weeks; shortest for spider phobia and longest for social anxiety disorder.

Procedure
The participants filled out their sociodemographics and several outcome measures during the recruitment process before being assessed for eligibility. This was performed on a secure online interface using an auto-generated identification code, such as 1234abcd, thereby ensuring anonymity and minimizing data loss (Vlaescu et al., 2016). Upon completing their treatment, the participants answered the outcome measures again, with the addition of the NEQ (Rozental et al., 2016). The only exception was the clinical trial of spider phobia, where paper and pencil were used.

Measures
The Negative Effects Questionnaire. The NEQ was developed by Rozental et al. (2016) with the aim of investigating the occurrence and characteristics of negative effects in psychological treatments. The process of developing the instrument is described in detail in the original study. The exploratory factor analysis resulted in a rotated factor-solution with 32 items and the following six factors: symptoms, quality, dependency, stigma, hopelessness, and failure. The NEQ was found to have a good internal consistency, α for the full instrument .95, range .72 to .93 for the six separate factors. The instrument also consists of one open-ended question in order to capture other negative effects that are not included among the items, but this was not explored in the current study.

Table 1 (Continued)
[...]    1 (1.0)    4 (2.6)    4 (2.1)    0 (0.0)    0 (0.0)    9 (1.6)
Sick leave    1 (1.0)    3 (1.9)    3 (1.6)    2 (2.7)    0 (0.0)    9 (1.6)
Other    0 (0.0)    4 (2.6)    0 (0.0)    2 (2.7)    0 (0.0)    6 (1.1)
Clinical severity, mean (SD)
Patient Health Questionnaire -9 items    2.6 (3.5)    9.6 (5.5)    8.7 (4.8)    9.8 (5.0)    -    7.9 (5.5) d
Category not applicable in n = 3; b based on n = 561; c based on n = 405; d based on n = 518.

Outcome measures
Each clinical trial included in the current study distributed a primary outcome measure selected by relevance, for instance the Spider Phobia Questionnaire (Muris and Merckelbach, 1996). Several secondary outcome measures were also administered: the Patient Health Questionnaire -9 items (PHQ-9; Löwe et al., 2004), the Generalized Anxiety Disorder -7 items (GAD-7; Spitzer et al., 2006), and the Brunnsviken Brief Quality of Life Scale (BBQ; Lindner et al., 2016). These are, however, only presented descriptively in Table 1 for an overview of the sample.

Statistical analysis
In order to investigate and evaluate the validity of the internal structure and response processes of the NEQ, Rasch analysis was applied, following the same steps as described in Lerdal et al. (2016). The software WINSTEPS, version 3.91.0.0, was used for all analyses, implementing a rating scale model as all of the items in the NEQ are scored using the same rating scale categories. Rasch analysis converts the patterns of raw scores from the NEQ into item and person equal-interval measures simultaneously, using a logarithmic transformation of the odds of the responses (Bond and Fox, 2013). The converted item measures are then used to determine whether the items reflect a single unidimensional construct, which is often viewed as crucial in terms of validity in both classical and modern test theory (Spector, 1992). In a similar manner, the converted person measures are used to evaluate person response validity and the precision of the scale. The psychometric properties of the NEQ rating scale categories were initially examined using the following criteria: (a) a minimum of 10 responses per step category, (b) average measures for each step category that advance monotonically, and (c) outfit Mean Square (MnSq) values less than 2.0 for the step category calibrations (Linacre, 2002). If these criteria were not met, rating scale categories would be collapsed or deleted, in line with the literature (Linacre, 2004).
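To make the rating scale model more concrete, the following is a minimal sketch of how category probabilities for a 0-4 item could be computed under the Andrich rating scale formulation. It is an illustration only and does not reproduce the WINSTEPS estimation procedure; the function names and all parameter values (person measure, item difficulty, step thresholds) are hypothetical and not estimates from the current study.

```python
import math

def rsm_category_probs(theta, delta, thresholds):
    """Category probabilities under the Andrich rating scale model.

    theta: person measure (logits); delta: item difficulty (logits);
    thresholds: step calibrations tau_1..tau_m shared by all items.
    Returns the probabilities of responding in categories 0..m.
    """
    # Log-numerator for category k is the running sum of (theta - delta - tau_j);
    # category 0 has a log-numerator of 0 by definition.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, delta, thresholds):
    """Model-expected rating for one person on one item."""
    probs = rsm_category_probs(theta, delta, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical step thresholds for a 0-4 scale (illustration only).
taus = [-1.5, -0.5, 0.5, 1.5]
probs = rsm_category_probs(theta=0.0, delta=0.0, thresholds=taus)
```

Because the same thresholds are shared by every item, only one set of step calibrations is needed for the whole instrument, which is the property that motivates a rating scale model when all items use the same response categories.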
Evidence of internal structure of the NEQ was then further investigated by monitoring the item goodness-of-fit statistics. WINSTEPS generates both MnSq residuals and standardized z-values for each of the items of the NEQ. The goodness-of-fit statistics indicate the degree of match between actual responses on the items and expected responses from the Rasch model assertions (Bond and Fox, 2013). Goodness-of-fit was evaluated by infit statistics, as they are viewed as more sensitive to item performance and also more informative when exploring internal scale validity (Wright and Masters, 1982; Bond and Fox, 2013). Furthermore, the MnSq fit statistic is preferable for item goodness-of-fit with polytomous data as it is less sensitive to sample size (Smith et al., 2008). The current study chose a sample-size adjusted criterion for item goodness-of-fit, set at infit MnSq values between 0.7 and 1.3 for the NEQ (Smith et al., 2008). If one or more items did not demonstrate acceptable goodness-of-fit to the model, the items would be removed from the analysis and the iteration process would be repeated until all items met the criterion of acceptable goodness-of-fit.
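As a rough illustration of the fit statistics described above, the sketch below computes infit and outfit MnSq for a single item from observed ratings, model-expected ratings, and model variances, and then applies the 0.7 to 1.3 infit criterion. It uses the standard textbook definitions (infit as the information-weighted mean square, outfit as the unweighted mean of squared standardized residuals); the function names and the input values are hypothetical and are not taken from the NEQ data.

```python
def infit_outfit_mnsq(observed, expected, variance):
    """Return (infit, outfit) mean-square statistics for one item.

    observed: ratings given by each person;
    expected: model-expected ratings for the same persons;
    variance: model variance of each rating.
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Infit: squared residuals weighted by information (sum over sum).
    infit = sum(sq_resid) / sum(variance)
    # Outfit: plain average of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(sq_resid)
    return infit, outfit

def item_fits(infit, lower=0.7, upper=1.3):
    """Apply the infit MnSq criterion used for item goodness-of-fit."""
    return lower <= infit <= upper

# Hypothetical values for four respondents (illustration only).
infit, outfit = infit_outfit_mnsq(
    observed=[1, 0, 2, 3],
    expected=[1.2, 0.4, 1.8, 2.5],
    variance=[0.8, 0.5, 0.9, 0.7],
)
```

In an iterative procedure, items failing `item_fits` would be dropped and the model re-estimated before the statistics are computed again; the re-estimation step itself is outside the scope of this sketch.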
In order to evaluate the unidimensionality of the NEQ, a principal component analysis of the residuals was also performed (Linacre, 2005). The criterion for unidimensionality was that at least 50% of the total variance should be explained by the first latent variable (Raîche, 2005), and that no more than 5% should be explained by the largest secondary dimension, with an associated eigenvalue of no more than 2.0, which is an indication of lack of multi-dimensionality.
Evidence of person response validity was then evaluated by monitoring the person goodness-of-fit statistics. The criterion for evaluating person goodness-of-fit was to reject infit MnSq values >1.4 associated with a z-value >2. It was also accepted that 5% of the sample may fail to demonstrate acceptable goodness-of-fit by chance, without a serious threat to validity (Patomella et al., 2006).
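The person-fit rule can be expressed as a simple filter. The sketch below flags persons whose infit MnSq exceeds 1.4 together with a z-value above 2, and checks whether the share of flagged persons stays within the 5% tolerance. The function names, the identifiers, and the fit values are hypothetical; in practice the statistics would come from the Rasch software output.

```python
def misfitting_persons(fit_stats, mnsq_cutoff=1.4, z_cutoff=2.0):
    """Return the ids of persons who fail the person goodness-of-fit criterion.

    fit_stats maps a person id to a tuple (infit MnSq, standardized z-value).
    A person is flagged only if BOTH cutoffs are exceeded.
    """
    return [
        pid
        for pid, (mnsq, z) in fit_stats.items()
        if mnsq > mnsq_cutoff and z > z_cutoff
    ]

def within_chance_tolerance(n_misfit, n_total, tolerance=0.05):
    """True if the proportion of misfitting persons is at most the tolerance."""
    return n_misfit / n_total <= tolerance

# Hypothetical person-fit output (illustration only).
example_stats = {
    "p1": (1.6, 2.4),   # exceeds both cutoffs, so flagged
    "p2": (1.5, 1.2),   # high MnSq but z-value not above 2
    "p3": (0.9, -0.3),  # acceptable fit
}
flagged = misfitting_persons(example_stats)
```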
In order to monitor the precision of the converted measures, the person and item separation indices were calculated (Fisher, 1992). The person separation index reflects the number of statistically different strata that the test can identify in the sample, considering the range and precision of the individual person and item estimates. In a similar way, the item separation index reflects the number of statistically different strata that the sample can identify among the items. An index above 1.5 would ensure that the NEQ could differentiate at least two different groups in the sample/among the items.
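The link between a separation index and the number of distinguishable groups can be sketched as follows. The code assumes the commonly used definitions: separation G is the ratio of the error-adjusted ("true") standard deviation of the measures to their root mean square error, the number of strata is (4G + 1) / 3, and the corresponding reliability is G² / (1 + G²). The function names and input values are hypothetical, and the exact variance conventions of a given software package may differ.

```python
import math
from statistics import pvariance

def separation_index(measures, standard_errors):
    """Separation G = true SD / RMSE for a set of Rasch measures."""
    observed_var = pvariance(measures)
    # Mean square error: average of the squared standard errors.
    mse = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    # Error-adjusted variance, floored at zero.
    true_var = max(observed_var - mse, 0.0)
    return math.sqrt(true_var / mse)

def strata(separation):
    """Number of statistically distinct levels implied by a separation index."""
    return (4 * separation + 1) / 3

def reliability(separation):
    """Reliability coefficient corresponding to a separation index."""
    return separation ** 2 / (1 + separation ** 2)

# Hypothetical measures (logits) and standard errors (illustration only).
g = separation_index([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
levels_at_threshold = strata(1.5)  # (4 * 1.5 + 1) / 3, i.e. just over two strata
```

Under these assumptions, an index of 1.5 corresponds to slightly more than two strata, which matches the threshold stated above for differentiating at least two groups.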
Finally, a number of Differential Item Functioning (DIF) analyses were performed in order to explore the stability of the response patterns of the NEQ items across sociodemographics, giving further support of validity in relation to internal structure and potential unfairness in testing. This was conducted because it is crucial that an instrument is not biased with regard to any sociodemographics that may otherwise compromise the converted measures, question the validity of the instrument, and influence the interpretation of subsequent findings. The magnitude of DIF was evaluated using the Mantel-Haenszel statistic for polytomous scales using log-odds estimators (Mantel, 1963).
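The idea behind a Mantel-Haenszel DIF estimate can be sketched in a simplified form. The example below is an assumption-laden illustration for the dichotomous case only (e.g. an item endorsed versus not endorsed), whereas the analysis described above used the polytomous extension; the function name, the group labels, and the counts are hypothetical. Respondents are first matched into strata of comparable overall level, and a common log-odds ratio is then pooled across strata. A value near zero suggests that the item functions similarly in both groups.

```python
import math

def mantel_haenszel_log_odds(strata_tables):
    """Pooled Mantel-Haenszel log-odds ratio across matched strata.

    strata_tables: list of tuples (a, b, c, d) per stratum, where
      a = reference group, item endorsed
      b = reference group, item not endorsed
      c = focal group, item endorsed
      d = focal group, item not endorsed
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata_tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty strata
        numerator += a * d / n
        denominator += b * c / n
    return math.log(numerator / denominator)

# Hypothetical counts (illustration only): identical odds in both groups.
no_dif = mantel_haenszel_log_odds([(10, 10, 5, 5), (8, 4, 4, 2)])
# Hypothetical counts in which the reference group endorses the item more often.
some_dif = mantel_haenszel_log_odds([(20, 10, 10, 20)])
```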

Overall response pattern
When the categorical responses from the NEQ were evaluated, all the criteria were met. All rating scale categories were used, the categories advanced monotonically, and the outfit MnSq values for the step category calibrations ranged from 0.89 to 1.11. Only 281 participants out of 564 scored any of the items of the NEQ, and a total of 86% of the person-item data matrix were non-responses, i.e. empty cells (see Table 1). The following item and person validity analyses were thus performed with a limited number of data records, as only 50.9% of the sample reported having experienced any negative effect of their psychological treatment.

Item goodness-of-fit
The first iteration generating item goodness-of-fit statistics for the 32 items revealed that six items did not meet the criterion for item goodness-of-fit (see Table 2). By removing these items, the next iteration revealed that an additional four items did not meet the criterion and were thus removed. In the third iteration, two more items were removed. Hence, after the third iteration and the removal of 12 items in total (37.5%), the remaining 20 items on the NEQ demonstrated acceptable item fit to the Rasch model assertions. For an overview of the frequencies and average negative impact of each item in the final scale, see Table 3.

Principal component analysis
Following the removal of the twelve items demonstrating misfit, the principal component analysis revealed that the first component explained 62.5% of the total variance, which exceeded the criterion of at least 50% required in order to establish unidimensionality (see Table 2). The second dimension explained an additional 6.3% of the variance, associated with an eigenvalue of 3.37, which exceeds the criteria set. By monitoring the item residual loadings, items 15, 11, 3 and 1 loaded more strongly on one component, while items 18, 4, 16, 12 and 20 loaded more strongly on another (see Table 4).

Person response validity
When evaluating the person response validity, twelve of the 264 participants (4.6%) did not demonstrate acceptable goodness-of-fit to the Rasch model in their responses to the NEQ, which [...] Table 2. The person-separation index for the original version of the NEQ, i.e. with 32 items, was 0.89. Moreover, the item-separation index (N = 281) was 2.01. After deletion of the 12 NEQ items demonstrating misfit to the Rasch model, the person-separation index increased to 1.08, and the item-separation index scale (N = 264) to 2.61 (see Table 2).
The DIF analyses revealed that all of the 20 remaining items of the NEQ functioned in a similar manner across sociodemographics (see Table 2), supporting fairness in testing.
The person-item map is presented in Fig. 1. Items reflecting negative effects more frequently experienced by the sample are placed at the lower end of the continuum, and items reflecting negative effects less frequently experienced by the sample are placed at the higher end of the continuum. In a similar way, participants with fewer experiences of negative effects are placed at the lower end of the continuum, and participants with more experiences of negative effects are placed at the higher end of the continuum.

Discussion
The current study is the first to examine the psychometric properties of an instrument for determining negative effects of psychological treatments using Rasch analysis. In contrast to prior investigations, which have relied on classical test theory (Ladwig et al., 2014; Parker et al., 2013; Rozental et al., 2016), this has enabled an additional investigation of the reliability and validity of persons and items (Waugh and Chapman, 2004), providing a more comprehensive understanding of how negative effects might be assessed. The results suggest that the NEQ exhibits fairness in testing, i.e. it does not demonstrate any bias in terms of the participants' sociodemographics. This important finding suggests that the instrument should yield comparable measures across respondents regardless of gender, age, civil status, educational level, and type of employment, as items are functioning in a similar manner. Also, out of the original 32 items of the NEQ, 12 could be removed as they did not meet the criterion for goodness-of-fit, resulting in a final scale of 20 items that can be downloaded and used for free in clinical and research settings: www.neqscale.com. Reviewing these items indicates that the factor failure is no longer included in the instrument, which may be because it accounted for less than 3% of the variance in Rozental et al. (2016). From a theoretical perspective, it is also uncertain if failure reflects a poor outcome rather than actually experiencing these negative effects during treatment, making it reasonable to exclude the items belonging to this factor from the NEQ. As for the rest of the items that were removed from the instrument, these were primarily related to dependency and quality, and to a lesser extent hopelessness and symptoms. Albeit not as clear, it could be argued that these items are unrelated to the underlying construct of negative effects or that there is a considerable overlap between them and the items that were retained. For instance, 'I did not have confidence in my treatment' (Item 24) may possibly capture the same concept as 'I felt that my expectations for the treatment were not fulfilled' (Item 27), the latter being excluded. However, it is also important to note that although 12 items did not demonstrate acceptable fit to the Rasch measurement model, indicating that these items showed too much unexpected variation in their scores to contribute to one underlying measurable construct, they may still add important information about negative effects. Still, it seems that a final scale of 20 items is reliable and could be easier to administer compared with the total scale of 32 items, which should help researchers and therapists to monitor negative effects on a more regular basis.
The rating scale of the instrument also seems to function equally across items, i.e. advancing monotonically, suggesting that the incremental steps of 0 to 4 are appropriate. Several item residuals did, however, load on two components, implying possible multi-dimensionality in the instrument. In relation to the factors obtained by Rozental et al. (2016), the first component is associated with symptoms, while the other is linked to four separate factors. The reason for this finding is unclear and prior research has not discussed the dimensionality or hierarchy of negative effects. Nonetheless, one plausible explanation could be that it reflects a distinction between the subjective experiences of incidents occurring during treatment, e.g. more anxiety, and implications that are interpersonal or social in character, such as dependency and stigma. Strupp and Hadley (1977) considered this issue in their tripartite perspective of psychological treatments, proposing that positive and negative effects might be judged differently by the patient, the therapist, and significant others. Another explanation may be that some negative effects are short term, as in experiencing more unpleasant feelings during treatment, while others are long term, as in believing things cannot improve. This notion has been raised by Castonguay et al. (2010), pointing to the fact that some interventions will never be perceived as particularly pleasant to the patient, even though they are seen as beneficial in the long run. Differentiating those negative effects that are enduring from those that are transient is thus an important research endeavour, preferably by assessing such instances both during treatment and at long-term follow-up. Future studies should also explore if other approaches to examine the instrument (e.g. multi-dimensional Rasch modelling) could yield additional and better solutions to measure negative effects. 
Still, the findings from the current study indicate that a majority of the items function well enough together to explain a large proportion of the variance and that they also yield acceptable person-fit statistics, which is an important aspect of measurement validity. Given that research on negative effects of psychological treatments is still a fairly new and unexplored territory, psychometric issues such as multi-dimensionality are nevertheless important to consider in order to move the field forward.
As for the rate of negative effects, the number of participants reporting negative effects in the current study was 50.9%, consistent with 58.7% among patients in a psychiatric setting who responded to the INEP (Rheker et al., 2017). However, this number varies significantly between investigations, with rates as high as 92.9% among patients with obsessive-compulsive disorder who were assessed with the Side-effects of Psychotherapy Scale in a study by Moritz et al. (2015), and as low as 5.2% in a national survey by Crawford et al. (2016) probing for 'lasting bad effects from the treatment'. Hence, different types of assessments and patients will generate different ratios, making it difficult to determine which estimate is more accurate and to compare it across investigations. One of the advantages of implementing Rasch analysis is, however, the possibility to go beyond just frequencies or levels of symptoms, adjusting for both aspects within a sample. In other words, a person experiencing a large impact on a limited number of items that are rarely perceived among the sample will generate a higher measure of negative effects, compared with a person who is experiencing a moderate impact on a larger number of items that are more often experienced among the sample. Taking this into account, the results from the current study suggest that the instrument has an acceptable person goodness-of-fit, but that it is inappropriate for differentiating distinct subgroups with regard to their experiences of negative effects. This is caused by a relatively large individual standard error associated with each individual measure, as most participants only endorsed a limited number of items (see Table 4). The NEQ is therefore restricted in detecting changes or differences within a specific sample based on their person measures, but is probably suitable for examining differences between clinical trials, settings or interventions by monitoring item difficulty calibrations, e.g. the rate and impact of a particular item, such as placement along the continuum. Further research will help to provide a better estimate of negative effects in psychological treatments by administering the NEQ on a more regular basis during the treatment period, but also by including it in more diverse patient populations.
There are some limitations that need to be addressed when interpreting the results. First, even though the sample was relatively heterogeneous with regard to their sociodemographics, the inclusion and exclusion criteria of each clinical trial may have affected the generalizability of the findings. A majority of the participants were female, middle-aged, in a relationship, having a university degree, and either students or employed, which may have affected the negative effects that were reported. Second, in terms of symptom severity at pre-treatment, the participants were, on average, sub-clinical, at least with regard to the PHQ-9, the GAD-7 and the BBQ, suggesting that they were relatively high functioning. It is possible that another sample would have responded differently on the NEQ, for instance patients with more severe psychiatric disorders than those included in the current study, such as personality disorders, recurrent depression, or eating disorders. Third, given that the participants received psychological treatments mostly administered via the internet or virtual reality, it might be that other formats or theoretical orientations than CBT could result in different negative effects, hence affecting such issues as what items to retain and the principal component analysis. Distributing the NEQ to even more diverse patient populations is thus needed and important to fully understand the occurrence and characteristics of negative effects of psychological treatments. Thus, until further research has been conducted, some caution is warranted in terms of interpreting the results from the 20-item version of the NEQ in other formats than the internet, as well as for patients with more severe psychiatric disorders. Fourth, distributing an instrument to patients on a single occasion is problematic, especially concerning incidents that may have been experienced as negative by patients. It is possible that the negative effects that were reported were affected by recall bias, primacy-recency effects, and social desirability (Krosnick, 1999), resulting in less valid responses. Future research could therefore include the NEQ on at least one more occasion during the treatment period, for instance at mid-assessment. This should also be accompanied by an investigation of its relationship with outcome, i.e. whether or not such incidents affect the long-term benefits.