Screening for depression in primary care: a Rasch analysis of the PHQ-9

Aims and method To explore the modern psychometric properties of the Patient Health Questionnaire (PHQ-9), we used the Rasch analysis in a sample of 767 primary care patients with depression.
Results The analysis highlighted dependency issues between items 1 and 2 ('Little interest or pleasure in doing things' and 'Feeling down, depressed, or hopeless'), and items 3 and 4 ('Trouble falling or staying asleep, or sleeping too much' and 'Feeling tired or having little energy'). Items 1 and 2 displayed an over-discrimination, suggesting their potential redundancy within the complete item set.
Clinical implications In its current format the PHQ-9 displays some problems with regard to its measurement structure among a sample of primary care patients. These problems can be addressed by removing potentially redundant items to deliver a stable screening tool. The results also lend support for the PHQ-2 to be used as a screening tool in a primary care setting.


Declaration of interest None.

ORIGINAL PAPERS
Mike Horton, Amanda E. Perry

Measure
The PHQ-9 is a nine-item measure of depressive symptoms containing four somatic items (items 3, 4, 5 and 8) and five items relating to thoughts and feelings linked to depressed mood (items 1, 2, 6, 7 and 9). Respondents are asked to report on the frequency of their symptoms during the past 2 weeks using four response categories: 0 (not at all), 1 (several days), 2 (more than half the days) and 3 (nearly every day). The total score ranges from 0 to 27, and the recommended cut-off point to classify clinical depression is a score of 10. The PHQ-9 has consistently demonstrated robust psychometric properties, reliability and validity in adult populations. [17][18][19]

Procedures
Administration of the PHQ-9 was conducted at baseline (prior to randomisation), and at 4, 12 and 24 months post-randomisation as part of a battery of tests for the larger REEACT study during the recruitment period. Trained interviewers at each of the four trial sites (York, Manchester, Bristol and Sheffield) read or asked participants to self-report their responses on the PHQ-9 items and recorded the responses. The data included all screened participants (n = 767).
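The scoring rule described under Measure (nine items scored 0 to 3, a total of 0 to 27, and a cut-off of 10 or over) can be sketched in Python as follows. This is an illustrative sketch only; the function names are hypothetical and are not part of the study's procedures.

```python
PHQ9_CUTOFF = 10  # recommended cut-off point reported in the text


def phq9_total(responses):
    """Sum nine item responses, each coded 0 (not at all) to 3 (nearly every day)."""
    if len(responses) != 9:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be 0, 1, 2 or 3")
    return sum(responses)


def screens_positive(responses, cutoff=PHQ9_CUTOFF):
    """Return True when the total score is at or above the cut-off."""
    return phq9_total(responses) >= cutoff


# Hypothetical respondent, for illustration only.
example = [2, 2, 1, 1, 1, 1, 1, 1, 0]
print(phq9_total(example), screens_positive(example))
```

Input validation is included because a total is only meaningful when all nine items are answered within the four response categories.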

Rasch analysis
Analysis was conducted on the entire baseline sample (n = 767) in RUMM2030 software (www.rummlab.com.au), using the partial credit (unrestricted) parameterisation of the model. 20 Assessing the internal construct validity of the PHQ-9 involved investigating the individual item thresholds, the overall scale fit, individual item fit to the measurement model and issues relating to the local independency assumptions. The response structure across all items was investigated to assess whether it was working as intended. This was explored by assessing the thresholds at the cross-over points between adjacent response categories, which should remain ordered in a logical pattern. 16

The overall scale fit statistics provide a summary measure of how the scale conforms to Rasch model expectations. Reliability indices are delivered in the form of a person separation index (PSI) and Cronbach's alpha. Analysis was also conducted at the individual item level, where adequate fit is indicated by non-significant chi-squared test values and z-standardised fit residual statistics within ±2.5. 21 High positive fit residual values indicate a misfit to model expectations, while high negative fit residuals may suggest item redundancy. 22

A residual correlation matrix was used to explore the extent of dependency between items within the scale. Dependency occurs when the response to one item has a direct effect on the response to another item within the scale, conditional on the level of depression. If present, this is liable to artificially inflate reliability indices and to create spurious multidimensionality within an item set. 22,23 Residual correlations above a value of 0.2 indicate a significant level of dependency between items, 24 although values above 0.1 have been used to identify potential response dependency. 23 Response dependency can be accounted for by grouping the dependent items together into 'sub-tests' within the analysis framework.
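As an illustration of the partial credit parameterisation and the threshold-ordering check described above, the following minimal Python sketch computes the category probabilities for a single four-category item. It is not the RUMM2030 implementation; the function names and the example threshold values are hypothetical and chosen only for illustration.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Partial credit model probabilities for categories 0..m of one item.

    theta: person location in logits.
    thresholds: the m threshold locations (logits) between adjacent categories.
    """
    # Numerator exponent for category x is the sum of (theta - tau_k), k = 1..x;
    # the empty sum for category 0 is zero.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(exponents)
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_are_ordered(thresholds):
    """True when each threshold lies above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


# Hypothetical thresholds for a 0-3 item, in logits.
example_thresholds = [-1.0, 0.0, 1.0]
print(thresholds_are_ordered(example_thresholds))
print(pcm_category_probabilities(0.0, example_thresholds))
```

At a person location equal to a threshold, the two adjacent categories are equally likely, which is why thresholds are described as the cross-over points between response categories.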
The unidimensionality of the scale was assessed using a t-test procedure 25 whereby the percentage of significant individual t-tests should not be above 5%. In practice, the lower bound of a binomial confidence interval should overlap the 5% level to indicate an acceptable absence of multidimensionality within the scale. 15 The targeting of a scale to the study sample is assessed by investigating the relative logit locations of the item threshold distribution and person location distribution. The primary purpose of scales may differ, but for a well-targeted measure, the mean person location should not deviate too much from the mean item difficulty (fixed at 0 logits). 15 This distribution is also reflected in the Person-Item Threshold Distribution plot available in the RUMM2030 software.
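The decision rule for the t-test procedure can be sketched as follows. This sketch assumes a normal-approximation binomial confidence interval; the exact interval computed by RUMM2030 may differ, and the counts used in the example are hypothetical rather than taken from the study.

```python
import math


def proportion_ci(n_significant, n_tests, z=1.96):
    """Proportion of significant t-tests with an approximate 95% interval."""
    p = n_significant / n_tests
    half_width = z * math.sqrt(p * (1.0 - p) / n_tests)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def unidimensionality_supported(n_significant, n_tests, level=0.05):
    """True when the lower bound of the interval reaches the 5% level."""
    _, lower, _ = proportion_ci(n_significant, n_tests)
    return lower <= level


# Hypothetical counts, for illustration only.
print(proportion_ci(46, 767))
print(unidimensionality_supported(46, 767))
print(unidimensionality_supported(80, 767))
```

The rule tolerates an observed proportion slightly above 5%, provided that the interval around it still includes the 5% level.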
The second element of the analysis involved the exploration of ways to account for any misfit found within the scale and offers further insight into the contributions of each individual item. Iterations of the analyses involved a combination of item removal or sub-testing to account for response dependency.

Results
Part one: PHQ-9 assessment
The sample of 767 contained a full range of scores (0-27) with no missing data (median 16; interquartile range 7). Over 90% (n = 695) scored above the cut-off point for clinical depression (i.e. a score of 10 or over).

Thresholds of individual items
None of the items displayed reversed/disordered thresholds, meaning that the original PHQ-9 response structure appears to be functioning as intended. The threshold marking the lower boundary of the scale is the transition from response category 0 (not at all) to response category 1 (several days) on item 2. The threshold marking the upper boundary of the scale is the transition from response category 2 (more than half the days) to category 3 (nearly every day) on item 9.

Initial fit to the Rasch measurement model
All iterations of the analysis are summarised in Tables 1a and 1b.

Summary fit statistics
The summary statistics of the initial analysis (Table 1a, 1b) suggested some misfit within the scale as indicated by a significant χ² item-trait interaction term and a high item fit residual standard deviation. The series of t-tests suggest that the item set was not unidimensional; however, this can also be heavily influenced by the response dependency within an item set. This led to an exploration of the individual item fit.
The initial analysis fit statistics for each individual item are presented in Table 2. This indicates that items 1 and 2 are problematic in terms of the χ² fit statistic and items 2 and 3 are problematic in terms of their fit residuals.
Item 2 appears to be the most problematic item. It displays a high negative residual and an over-discriminating response pattern, indicating a possible redundancy or dependency within the item set.

Local independency
Two aspects of local independency were investigated. First the residual correlation matrix was assessed to identify response dependencies between items. At a correlation indication level of 0.1, dependencies were indicated between items 1 and 2 (r = 0.25), items 3 and 4 (r = 0.14) and items 2 and 6 (r = 0.17).
Second, the t-test results (Table 1a, 1b) indicated some evidence of multidimensionality, which could be caused by the response dependency found within the scale.
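The residual correlations reported above can be checked against the two thresholds given in the Rasch analysis section (0.1 for potential dependency and 0.2 for a significant level of dependency). The sketch below is illustrative only; the labels and function name are hypothetical.

```python
# Residual correlations as reported in the text, keyed by item pair.
reported_correlations = {(1, 2): 0.25, (3, 4): 0.14, (2, 6): 0.17}


def classify_dependency(r, potential=0.1, significant=0.2):
    """Label a residual correlation against the two thresholds."""
    if r > significant:
        return "significant"
    if r > potential:
        return "potential"
    return "none"


for pair, r in reported_correlations.items():
    print(pair, r, classify_dependency(r))
```

Under these thresholds only the pair of items 1 and 2 exceeds the stricter 0.2 level; the other two pairs are flagged at the 0.1 level only.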

Targeting
The person-item threshold distribution for the PHQ-9 scale is shown in Fig. 1. The scale appeared to be well targeted to this clinical sample, with the mean person location slightly higher than the mean item location. This indicates that this sample was displaying a higher average depression level than that represented by the scale (Fig. 1).
The results of the initial analysis show some potential problems with the scale among the current sample, although these problems are not entirely consistent with previous research in other populations.

Part two: accounting for misfit within the model
The analysis iterations emphasise the problems that were apparent in the initial analysis (Table 1a, 1b). Thresholds remained ordered in all of the analyses, suggesting that all response categories are appropriate to this sample. The most parsimonious analysis iteration was analysis H, which has no specific problems despite the χ² probability displaying significance. The overall significant χ² value is likely to be affected by the reasonably large sample size. This is also the case for all of the other analyses, but individual problems were identified within analyses A-G.
For analysis H, when the overall χ² value is adjusted based on a sample size of 500 (available within RUMM2030), P = 0.473. Within all analyses, once the response dependency has been taken into account the scale displays an acceptable level of unidimensionality. This suggests that the initial apparent multidimensionality is due to the response dependency that is present within the scale.

PHQ-2
Items 1 and 2 were identified as displaying response dependency, and are potentially redundant when used in conjunction with all other PHQ-9 items. It should be noted that these two items make up the PHQ-2 short form. Additional analysis was carried out on the PHQ-2; results are reported in Table 1a and 1b. Fit statistics displayed a degree of misfit, with the overall fit statistics similar to the better PHQ-9 analysis iterations. Also, to compare the PHQ-2 short form and the complete PHQ-9, individual person estimates were derived from each version of the scale (when anchored to the same metric). A t-test was then used to compare the PHQ-9 and PHQ-2 estimates for each individual. When test-based extremes are removed, 4 out of 757 (0.53%) individuals display person estimates that are significantly different at P = 0.05. When sub-test-based extremes are removed (i.e. those that were at the floor or ceiling of the shorter PHQ-2 scale), 2 out of 620 (0.32%) individuals display person estimates that are significantly different at P = 0.05. This same analysis also allows for equivalent levels of depression to be estimated on both the PHQ-9 and the PHQ-2. This revealed that the PHQ-9 cut-off point of 10 is equivalent to a PHQ-2 value of 2.705. When rounded to the nearest whole number, this is equivalent to the PHQ-2 cut-off point suggested by Kroenke et al 10 as the optimal cut-off point for depression screening purposes. These findings offer support for the use of PHQ-2 as a screening tool, as the person estimates of level of depression provided by the PHQ-2 do not significantly differ from the estimates provided by the PHQ-9.
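The percentages and the rounding step in the PHQ-2 comparison above can be reproduced with simple arithmetic. The sketch below uses only the counts and the equivalent value reported in the text; the variable and function names are hypothetical.

```python
def percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Counts as reported in the text.
differing_after_test_extremes = percent(4, 757)
differing_after_subtest_extremes = percent(2, 620)

# PHQ-2 value equivalent to the PHQ-9 cut-off point of 10, as reported in the text.
phq2_equivalent = 2.705
phq2_cutoff = round(phq2_equivalent)

print(f"{differing_after_test_extremes:.2f}%")
print(f"{differing_after_subtest_extremes:.2f}%")
print(phq2_cutoff)
```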

Discussion
The Rasch modelling process provides an integrated framework to explore different measurement characteristics of a scale. This integrated approach emphasises the relationship between the scale items and an assumed underlying latent construct. The Rasch model has a number of assumptions, including that of a unidimensional structure, which assumes that all of the items within a scale contribute to measuring the same underlying construct. Any deviation from this measurement structure will be identified through a series of fit statistics. 15 Rasch analysis is a form of item response theory, as it is based around the interaction of how people respond to individual items within a scale. However, it is also often seen as a separate entity owing to the differences in the epistemological approach. Andrich 26 has broadly described these two approaches as the 'statistical modelling' paradigm (item response theory) and the 'experimental measurement' paradigm (Rasch), and has argued that the paradigms are incompatible, despite their apparent similarities. A distinctive feature of Rasch modelling is that the model is considered a formal representation of proper measurement and data are examined against this formal model, whereas with a statistical modelling approach the best model is sought to describe the data. 16

To our knowledge this is the first attempt to apply Rasch models to a sample of primary care patients with varying levels of depression. Research in other areas has suggested different models for the PHQ-9, including the PHQ-2, which has been extensively developed to reduce the burden of time taken to identify people who may be experiencing depression.
The main finding of the study suggests that when all items of the PHQ-9 are taken together, then it contains items where response dependency is present among a sample with current depression. This is particularly apparent between items 1 and 2 and, to a lesser extent, between items 3 and 4. Both of these dependencies make sense conceptually, as the contents of items 1 and 2 are linked to the major symptoms of depression, whereas items 3 and 4 are concerned with issues of sleep and tiredness. It should be noted that items 1 and 2 are the two items that make up the PHQ-2 short form. This raises a question about the potential duplication of clinical information when items 1 and 2 are used alongside the other items within the PHQ-9.
Previous research 9 found only three items of the PHQ-9 in their final solution: 1, 2 and 4. These core symptoms create the core ICD-10 criteria which link to the diagnosis of depression. 9 Our research suggested that item 2 is over-discriminating and is potentially redundant in the existing scale. Conversely, this means that it could be seen as a good summary item for the rest of the scale items, thus offering support for the use of the PHQ-2 as a screening tool. However, within the Rasch measurement model framework, the PHQ-2 items appear to be problematic when administered alongside the other items of the PHQ-9. The PHQ-2 has been advocated by some researchers as the preferred model of screening, with nurses reporting high satisfaction with an average screening process and reporting time of 1-2 min. 11 Despite satisfaction on a practical clinical level, the sensitivity and specificity of the PHQ-2 with a sample of drug users in the community was shown to be poor in relation to the PHQ-9 in people with moderate clinical depression. 12

The measurement properties of the PHQ-9 have also been explored by attempting to generate fit to the Rasch model through combinations of removal of mis-fitting items and sub-testing to account for dependency between items. This analysis procedure contributes towards the further understanding of the relationship between items of the scale. In this case, this analysis emphasised the findings of the initial analysis. In summary, ordered response thresholds were never an issue: response dependency was apparent between the PHQ-2 items (items 1, 2), and between the sleep and tiredness items (items 3, 4). The initial apparent multidimensionality appears to be due to this dependency, and items 1 and 2 also over-discriminate within the PHQ-9.
Our finding is contrary to that of previous research, 8 which identified one mis-fitting item (item 8) that included contrasting symptoms. Williams et al argue that including both poles of the diagnostic criterion is confusing and is likely to contribute to item misfit. Consequently, they suggest that splitting items such as 'poor appetite or overeating' is likely to reduce cognitive demands, improve the psychometric properties, enhance specificity and minimise the costs associated with follow-up examinations of those who screen positive. 8 However, from a psychometric point of view, splitting these items is likely to result in some dependency within the measure as the response to 'poor appetite' is unlikely to be independent from the response to 'overeating'.
Overall, it is interesting to note that the four research studies using Rasch analysis have produced different models for the PHQ-9 with different populations. This point demonstrates the importance of validity within psychometric testing and variability of results across different groups of patients. For this reason, clinicians must weigh up the pros and cons of alternative cut-off points to determine the best fit for their circumstances.
In the present study, the most parsimonious analysis involved the removal of items 1 and 2, and accounting for the dependency between items 3 and 4. Following these amendments, the fit to the Rasch model appeared to be adequate. The scale appears to be well targeted to this particular sample, but the reported reliability values are not sufficient for the PHQ-9 to be used as an outcome measure for individual-level use. However, as the primary function of the PHQ-9 is as a screening tool rather than as an outcome measure, the reliability of the scale is probably sufficient.

Study limitations
The sample of participants may not fully represent the diverse characteristics found within the wider population, for three reasons. First, patients with the most serious depression in this study are likely to be more severely impaired. Second, the sample was screened into the study without external validation. Third, the purposive sampling may influence the findings of the analysis, as a sample with depression would affirm items relating to symptomatic depression. Therefore, the apparent redundancy of the PHQ-2 items may be due to the sample inclusion criteria. Nonetheless, the conceptual redundancy still holds, regardless of the sampling. An improved strategy to assess the properties of the PHQ-9 would be to administer it to a clinically validated sample of patients with depression; however, it was beyond the scope of this study to collect the data in this way.
In its current format the PHQ-9 displays some problems with regard to its measurement structure among a sample with depression. However, these problems can be addressed to deliver a stable screening tool. The results also offer support for the PHQ-2 short form as a screening tool.

Unlike many fields in medicine, most clinical outcomes in psychiatry are not directly observable and cannot be captured with diagnostic tests such as blood work or imaging. In recent years, the importance of the routine use of clinical outcome assessments (patient-reported outcomes, clinician-reported outcomes, observer-reported outcomes and performance outcomes) for measuring the symptoms of disease and treatment outcomes has been increasingly emphasised. 1 Clinical outcome assessments such as the Patient Health Questionnaire-9 (PHQ-9) 2 are now commonly used in clinical research and practice to provide an assessment of a patient's severity of mood and improvement in response to treatment. 3 More broadly, as the demand increases for a broad range of mental health services to be patient-centred, clinical outcome assessments are used to capture outcomes such as sustained symptom reduction, return to full functioning and optimal patient well-being. 4 To optimise mental healthcare, clinical outcome assessments used in psychiatry should be shown to be fit for purpose. They should appropriately capture the concept of interest (e.g. depression) in the context of use (e.g.

Summary This commentary argues the importance of robust, meaningful assessment of clinical and functional outcomes in psychiatry. Outcome assessments should be fit for the purpose of measuring relevant concepts of interest in specific clinical settings. As well, the measurement model selected to develop and test assessments can be critical for guiding care. Three types of measurement models are presented: classical test theory, item response theory, and Rasch measurement theory. To optimise current diagnostic and treatment practices in psychiatry, careful consideration of these models is warranted.