Utility of PHQ-2, PHQ-8 and PHQ-9 for detecting major depression in primary health care: a validation study in Spain

Background Primary health care (PHC) professionals may play a crucial role in improving early diagnosis of depressive disorders. However, only 50% of cases are detected in PHC. The most widely used screening instrument for major depression is the Patient Health Questionnaire (PHQ), including the two-, eight- and nine-item versions. Surprisingly, there is neither enough evidence about the validity of PHQ in PHC patients in Spain nor indications about how to interpret the total scores. This study aimed to gather validity evidence to support the use of the three PHQ versions to screen for major depression in PHC in Spain. Additionally, the present study provided information for helping professionals to choose the best PHQ version according to the context. Methods The sample was composed of 2579 participants from 22 Spanish PHC centers participating in the EIRA-3 study. The reliability and validity of the three PHQ versions for Spanish PHC patients were assessed based on responses to the questionnaire. Results The PHQ-8 and PHQ-9 showed high internal consistency. The results obtained confirm the theoretically expected relationship between PHQ results and anxiety, social support and health-related QoL. A single-factor solution was confirmed. Regarding to the level of agreement with the CIDI interview (used as the criterion), our results indicate that the PHQ has a good discrimination power. The optimal cut-off values were: ⩾2 for PHQ-2, ⩾7 for PHQ-8 and ⩾8 for PHQ-9. Conclusions PHQ is a good and valuable tool for detecting major depression in PHC patients in Spain.


Introduction
Depression is a common mental disorder affecting a high percentage of the population. In 2019, 5.02% of the world population suffered from depression (Institute of Health Metrics & Evaluation, 2019), with this rate having increased considerably during the COVID-19 pandemic (Bueno-Notivol et al., 2021). PHC professionals may play a crucial role in improving early diagnosis and management of depressive disorders for two reasons. Firstly, major depression is the mental disorder with the highest prevalence (around 10%) among Primary Health Care patients (PHC) (Craven & Bland, 2013;Serrano-Blanco et al., 2010). Secondly, around 83% of the population had used PHC services over the last 12 months (Macinko, de Andrade, de Souza Junior, & Lima-Costa, 2019). Spain is the country with the largest volume of PHC visits in Europe. In Spain, PHC is organized into around 3000 PHC centers (Government of Spain, 2021). However, it has been estimated that only 22-31% of patients with major depression received a correct diagnosis in Primary Health Care (Aznar-Lou et al., 2018; Fernández et al., 2010).
Having in mind this scenario and the current situation of saturation of mental health services, it is crucial that PHC professionals are provided with the instruments necessary to detect depression in PHC patients (Ferenchick, Ramanuj, & Pincus, 2019).
Among the available tools, semi-structured or structured interviews are the most accurate instruments for detecting depression (Brugha, Bebbington, & Jenkins, 1999). However, this is a timeconsuming tool not useful in routine PHC practice. In addition, the final diagnosis may vary according to the screening instrument employed. Thus, according to the results reported by Levis et al. (2018), patients undergoing the Mini-International Neuropsychiatric Interview (MINI) (Sheehan et al., 1998) were more likely to receive a diagnosis of major depression than those who underwent the Composite International Diagnostic Interview (CIDI) (WHO, 1997), although the CIDI provided a deeper diagnosis of depression.
With regard to questionnaires, the most widely used screening tool for detecting major depression in PHC is the Patient Health Questionnaire (PHQ) (Ferenchick et al., 2019;Maurer, Raymond, & Davis, 2018). The PHQ is derived from the New Procedure for Diagnosing Mental Disorders in Primary Care (PRIME-MD study), which was originally developed to detect depression, anxiety, alcohol abuse, somatoform disorder, and eating disorders in PHC (Spitzer et al., 1994). The PHQ was originally a nine-item questionnaire that was developed to assess major depression in PHC patients in the UK, showing adequate internal consistency (α = 0.89; Kroenke, Spitzer, and Williams, 2001). According to the authors, total scores can be interpreted using two different strategies: a cut-off value, which determines whether the person has or not a diagnosis of depression, or a diagnostic algorithm, which requires a score ⩾ 2 in five items, including item 1 or 2. However, according to previous studies, the algorithm has poorer psychometric properties (pooled sensitivity 35%, pooled specificity 95%), as compared to the PHQ-9 cut-off point ⩾ 10 .
Since PHQ-9 was created, many studies, including systematic reviews and meta-analyses, have been published to evaluate its utility from different perspectives (Aslan et al., 2020;Costantini et al., 2021;Diez-Quevedo, Rangil, Sanchez-Planell, Kroenke, & Spitzer, 2001;Gelaye et al., 2013;He et al., 2020;Levis et al., 2017Levis et al., , 2020Manea, Gilbody, & McMillan, 2012;Muñoz-Navarro et al., 2017;Wu et al., 2020). Most studies were aimed at establishing the optimal cut-off point for detecting major depression. For instance, an individual participant data meta-analysis showed that the most widely used cut-off point (PHQ-9 ⩾10) had a pooled sensitivity of 67% and a pooled specificity of 86% . However, evidence about the validity of these cut-off values is not consistent. A meta-analysis based on 18 validation studies did not reveal substantial differences in the pooled sensitivity and specificity of the different PHQ-9 cut-off points (8 to 11) when using different standardized interviews based on DSM or ICD-10 for diagnosis of depression as a criterion (Manea et al., 2012). In that sense, the authors of PHQ-9 recommend using a different cut-off point based on the population to be assessed (Kroenke, Spitzer, Williams, & Löwe, 2010). This recommendation is consistent with current consensus about validity, where validation is described as a continuous process of gathering evidence to support the interpretation of scores for the specific purposes of the test (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, 2014).
Shorter versions of the PHQ-9 have also been used in PHC: PHQ-2 (Kroenke, Spitzer, & Williams, 2003) and PHQ-8 (Kroenke et al., 2009). PHQ-2 contains the first two items of PHQ-9 (depressed mood and anhedonia) and is a very brief prescreening tool. The PHQ-8 consists of the first eight items of the PHQ-9 (omitting the item about suicidal ideation) in an attempt to avoid the problems reported about the inaccuracy of this item in assessing suicide risk, especially in non-psychiatric populations (Razykov, Ziegelstein, Whooley, & Thombs, 2012;Walker et al., 2011). Both, PHQ-2 (cut-off >3: sensitivity 83%, specificity 92%) and PHQ-8 (cut-off ⩾10: sensitivity of 100%, specificity 95%) showed adequate validity evidence regarding the relationship with a criterion in the American population (Kroenke et al., 2003(Kroenke et al., , 2009. Despite the range of studies available on the utility of PHQ, few studies have explored the psychometric properties of the Spanish versions of PHQ-2, PHQ-8 and PHQ-9 in PHC in Spain. The Spanish version of PHQ-9 showed good sensitivity (84%) and specificity (92%) in hospitalized patients in Spain, as compared to the diagnosis established by a mental health professional (Diez-Quevedo et al., 2001). Two additional studies confirmed the utility of PHQ-9 in assessing depression in PHC when administered telephonically (Pinto-Meza, Serrano-Blanco, Peñarrubia, Blanco, & Haro, 2005), and when the Structured Clinical Interview for DSM Disorders (SCID-I) is used as a criterion (Muñoz-Navarro et al., 2017). The studies conducted in other Spanish-speaking populations in Latin America raise concern about linguistic and cultural differences (Aslan et al., 2020;Baader et al., 2012;Urtasun et al., 2019;Wulsin, Somoza, & Heck, 2002). Regarding PHQ-2, a study of Spanish pregnant women who attended in PHC concluded that PHQ-2 (cut-off point ⩾2) had a good sensitivity (84.5%) and specificity (79.5%), taking PHQ-9 as a criterion . No studies were found assessing the psychometric properties of PHQ-8 in the Spanish population. Therefore, further validity evidence is needed to support the use of PHQ-2, PHQ-8 and PHQ-9 to assess major depression and establish the optimal cutoff points for PHC patients in Spain. This would allow PHC professionals identify patients at risk of depression using a rapid, easy-to-use method. The purpose of screening is to ensure the adequate management of depression, which is performed based on a stepped care and collaboration model between primary care and mental health services (Ministry of Health, Social Services & Equality, 2014).
This study aimed to gather validity evidence supporting the use of PHQ-2, PHQ-8 and PHQ-9 as a tool for assessing major depression in the Spanish PHC population. To such purpose, a validation study was conducted where different sources of validity were combined and integrated to evaluate whether or not the total scores obtained in the three PHQ versions could be interpreted to detect major depression in PHC.

Design and study setting
A cross-sectional multicenter study was carried out in 22 PHC centers in six regions of Spain. PHC centers were recruited in the context of the EIRA-3 study (Zabaleta-del-Olmo et al., 2021), a randomized controlled hybrid type II preventive trial conducted in 25 PHC centers from seven regions of Spain. The centers included in the EIRA-3 study were located in Andalusia (n = 2); the Basque country (n = 3); Aragon (n = 4); the Balearic Islands (n = 4); Castile and Leon (n = 4); Catalonia (n = 4) and;
Galicia (n = 4). However, three PHC centers from the Basque country were excluded from the study since they did not screen for baseline depression using a CIDI interview.

Participants
Participants were recruited from PHC centers between February 2017 and January 2018. Eligible participants included subjects aged 45-75 years with at least two lifestyle risk factors (low physical activity, smoking consumption and/or unhealthy diet). Participants were excluded if they had severe mental illness or cognitive impairment, advanced serious physical illness, were not autonomous for daily activities, were involved in a home health care program, were receiving cancer treatment or palliative care, and did not live in the area during the study. A total of 4387 participants were assessed for eligibility. Of them, 69.9% (n = 3062) provided informed consent and met the inclusion criteria. In total, 2.5% of participants (n = 78) were excluded due to incomplete PHQ-9 responses and 13.2% (n = 405) due to incomplete CIDI responses. The final sample was composed of 2579 participants. The total sample was randomly divided into three subsamples to conduct separate analysis for PHQ-2, PHQ-8 and PHQ-9. Subsamples were composed of 859, 860 and 860 participants for PHQ-2, PHQ-8 and PHQ-9, respectively. For PHQ-8 and PHQ-9, the two latter subsamples were additionally and randomly divided into two groups for crossed validation; it is detailed below in the description of the analysis conducted for assessing dimensionality. Table 1 shows the sociodemographic characteristic of each subsample, which were not significantly across groups.

Procedure
Participants were recruited by PHC professionals during routine visits or by telephone, and by informative posters and local advertisements displayed in the waiting room of PHC centers. The participants who met the inclusion criteria and were interested in participating in the study read the information sheet and signed informed consent. Then, participants were invited telephonically to attend an assessment session. All data were collected all in once at baseline by local trained personnel, coordinated at central level, through a face-to-face interview (60 min approximately) which was recorded in an electronic data collection booklet specifically created for the EIRA-3 study.
The study protocol was approved by the Research Ethics Committee of the IDIAP Jordi Gol (approval number P16/025) and the local ethics committees of each participating Autonomous Communities. The protocol was registered at ClinicalTrials.gov, NCT03136211.

Sociodemographic data
The sociodemographic questionnaire collected information about sex, age, education level, marital status, and employment status.

PHQ-9
The PHQ-9 is a 9-item self-administered questionnaire Kroenke & Spitzer, 2002) created according to the diagnostic criteria for major depression disorder proposed in the 4 th version of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). PHQ-9 explores the presence of the symptoms described in Table 2 over the past two weeks. Each PHQ-9 item contains four Likert-response categories, ranging from 0 'Not at all' to 3 'Nearly every day'. The total score ranges between 0 to 27 points, with the original cut-off point set at ⩾10 to determine the presence of major depression . Higher scores indicate more depressive symptoms. McDonald's omega coefficient for the PHQ-9 was 0.89 in PHC Spanish patients (Muñoz-Navarro et al., 2017). PHQ-9 has also been proposed as a diagnostic tool based on a diagnostic algorithm .

PHQ-2
The PHQ-2 consists of the first two items of PHQ-9 exploring the presence of depressed mood and anhedonia over the past two weeks, scoring from 0 'not at all' to 3 'nearly every day' (Kroenke et al., 2003). Total score ranges from 0 to 6. The Note. *p < 0.05; **p < 0.01; ***p < 0.001.

PHQ-8
The PHQ-8 contains the first eight items of the PHQ-9 (Kroenke & Spitzer, 2002), exploring the presence of depressive symptoms over the past two weeks with each item scoring from 0 'not at all' to 3 'nearly every day'. Total score ranges from 0 to 24 with a proposed cut-off point ⩾ 10 to detect major depression . Cronbach's alpha coefficient was 0.81 for outpatients recruited at three major public hospitals in Bolivia (Schantz et al., 2017) and 0.92 for the general population of Puerto Rican adults (Pagán-Torres, González-Rivera, & Rosario-Hernández, 2020).
The CIDI -Module E The CIDI-module E is a standardized diagnostic interview created by the WHO to assesses major depression (WHO, 1997). The CIDI includes the first two screening questions assessing anhedonia and depressive mood during the last 12 months for a period of two consecutive weeks. When at least one of these two screening questions is affirmatively answered, 31 additional items are asked including both yes/no and free-text answers. According to a recent systematic review, the CIDI is the second most common diagnostic interview used in primary care to screen for depression using the PHQ-9 (Costantini et al., 2021). The CIDI showed excellent inter-rater reliability (> 0.90) in most of the diagnoses. Test-retest reliability reached good-to-excellent Kappa indexes for most of the CIDI modules, including the depression module (Wittchen, 1994). The CIDI-depression module showed an Area Under the Curve (AUC) of 0.75, a sensitivity of 53.3%, and a specificity of 93.7% using SCID as a criterion (Haro et al., 2006).

The 25-item version of the Hopkins Symptom Checklist (HSCL-25)
The HSCL-25 is a self-administered 25-item questionnaire (Nabbe et al., 2019) that assesses anxiety (items 1-10) and depression (items 11-25). Each item offers four Likert response options, ranging from 1 'Not at all' to 4 'Extremely'. The total score is calculated by dividing the sum of the scores of all items by 25 (all items), 10 (anxiety dimension) or 15 (depression dimension);

Generalized Anxiety Disorder Scale (GAD-7)
The GAD-7 is a 7-item scale measuring symptoms of generalized anxiety (Spitzer, Kroenke, Williams, & Löwe, 2006). Each GAD-7 item contains four Likert-response options, ranging from 0 'Not at all' to 3 'Nearly every day'. Total score ranges from 0 to 21 points. Higher scores indicate more symptoms of anxiety. Cronbach's alpha coefficient was 0.94 in the Spanish general population (Garcia-Campayo et al., 2010).
Functional social support questionnaire (DUKE-UNC-11) The DUKE-UNC-11 is a multidimensional questionnaire assessing functional social support (Broadhead, Gehlbach, de Gruy, & Kaplan, 1988). The DUKE-UNC-11 consists of 11 items with five Likert-response options ranging from 1 'much less than I want' to 5 'as much as I want'. The total score ranges from 11 to 55 points, where higher scores indicate more functional social support. Cronbach's alpha coefficient was 0.92 for Spanish PHC patients (Bellón, Delgado Sánchez, Luna del Castillo, & Lardelli Claret, 1996).

Statistical analysis
All statistical analyses were carried out using the statistical package SPSS (V 26) and JASP software (V 0.14.1.0). To evaluate the psychometric properties of PHQ-2, PHQ-8 and PHQ-9, the total sample was randomly divided into three subsamples. The internal consistency of PHQ-8 and PHQ-9 scores was assessed according to McDonald's omega coefficient and the 95% confidence intervals (CI). Validity evidence based on internal structure and on relations to other variables was collected following the Standards for Educational and Psychological Testing guidelines (AERA et al., 2014). First, the dimensionality of the instrument was explored by conducting an exploratory factor analysis (EFA) and a confirmatory factor analysis (CFA). For such purpose, the PHQ-8 and PHQ-9 subsamples were randomly divided into two with the same number of participants to perform crossed validation among samples. Kaiser-Meyer-Olkin (KMO) and Bartlett's tests were used to assess the appropriateness of applying EFA to PHQ-8 and PHQ-9 data sets. Principal Axis Factoring extraction method and Varimax rotation were used to perform EFA. Parallel analysis was used to decide the number of factors to be extracted. Regarding CFA, it was performed using the maximum likelihood method. The Goodness of Fix Index (GFI), the Comparative Fit Index (CFI), and the Root Mean Square Error of Approximation (RMSEA) were used to estimate the goodness-of-fit of the model. Values of GFI and CFI higher than 0.90, and of RMSEA lower than 0.05 were considered appropriate (Hu & Bentler, 1999;McDonald & Ho, 2002).

Measurement invariance was assessed by multi-group analysis,
where participants were compared according to variables previously described in the literature as potential sources of bias when using PHQ. Specifically, the structure of the construct was compared across groups according to sex, age and educational level (Bellón et al., 2011;González-Blanch et al., 2018;Patel et al., 2019). To assess invariance between groups, the change in chisquare value (Δχ 2 ) and its p value, CFI (ΔCFI) and RMSEA (ΔRMSEA) values were calculated. Four models were tested sequentially, from the least to the most restrictive level of invariance (configural, metrics, strong and, strict). Invariance between groups was settled when the p value of Δχ 2 was non-significant, and when the change of the RMSEA and CFI values was lower than 0.015 and 0.01 respectively (Chen, 2007). Associations with other variables were explored by: (i) obtaining evidence about convergence between variables by calculating correlations between the three versions of the PHQ and the theoretically related variables (HCLS-25, GAD-7, DUKE-UNC-11 and EQ-5D-3); and (ii) analyzing the level of agreement between the scores obtained in the three versions of the PHQ and in the criterion (the CIDI) through Receiver Operating Characteristic (ROC) curve analysis. We expected a positive correlation between the different PHQ versions and the instrument measuring similar constructs (HCLS-25 and GAD-7), as well as a negative correlation between PHQ versions and the other instruments measuring related but distinct variables (DUKE-UNC-11 and EQ-5D-3). AUC was calculated and interpreted as follows: 0.5-0.6 'no discrimination'; 0.6−0.7 'low discrimination'; 0.7-0.8 'acceptable discrimination'; 0.8-0.9 'good discrimination'; > 0.9 'excellent discrimination' (Muñiz, 2018). Moreover, sensitivity rates, specificity rates, positive and negative likelihood ratios, and positive and negative predictive values were extracted. Youden's index was calculated to determine the optimal cut-off value for a good sensitivity/specificity balance (Youden, 1950). Convergence between the PHQ-9 diagnostic algorithm and the optimal PHQ-9 cut-off according to Youden's index was calculated using Cohen's kappa coefficient index (Cohen, 1960). Table 2 shows descriptive statistics for PHQ-2, PHQ-8 and PHQ-9. McDonald's omega coefficient (ω) was 0.83 (95% CI 0.81-0.84) for the PHQ-8 and 0.84 (95% CI 0.82-0.85) for the PHQ-9. The items in the two versions showed discrimination indexes above the established minimum, which was 0.30.

Descriptive statistics and internal consistency
Validity evidence based on internal structure for PHQ-8 and PHQ-9 KMO (greater than 0.7) and Bartlett's test ( p < 0.001) indicated the adequacy of applying EFA to PHQ-8 and PHQ-9.
Regarding AFE, a one-factor solution was extracted based on parallel analysis calculated over the first half of each subsample (see Table 3). Using the second half of each subsample, the one-factor model showed a reasonable fit of the model to the data for both PHQ-8 (GFI = 0.936; CFI = 0.899; RMSEA = 0.10) and PHQ-9 (GFI = 0.947; CFI = 0.932; RMSEA = 0.08). Regarding to multigroup analysis by sex, age and educational level, results confirmed invariance between groups in all cases, at least at configural level (see online Supplementary Tables A1 and A2 in supplementary material). It is worth noting that χ 2 becomes usually significant due to its sensitivity to sample size.

Convergence between PHQ-2, PHQ-8 and PHQ-9 and related variables
The postulated relationship between the three PHQ versions and theoretically related variables was confirmed. Correlation between the three versions and both HSCL-25 and GAD-7 were positive (⩾ 0.50) and significant (see Table 4). Correlations with DUKE-UNC-11 and HR-QoL index were negative (ranging from −0.30 to −0.50) and significant. PHQ-2 was the version with the highest correlation with theoretically related variables.
Inconsistencies between PHQ-9 cut-off points and the CIDI There were five participants with major depression according to the CIDI and PHQ-9 scores lower than 6. Exploration of these inconsistencies indicated that differences in final diagnosis derived from two situations: participants who answered 'not at all' to a specific item in PHQ-9 but 'yes' to the equivalent item in CIDI, i.e. participants who were not consistent in their responses; and participants who answered 'several days' to a specific item of PHQ-9 and 'yes' in the equivalent items of the CIDI. To confirm the conclusions formulated, analyses were replicated by excluding participants with inconsistent diagnoses. Results did not change for the PHQ-9 except for ROC analysis, where the AUC increased to 0.95 (95% CI 0.931-0.970; S.E. = 0.010; p < 0.0001). According to Youden's index, the optimal cut-off value was the same (PHQ-9 ⩾8) but sensitivity increased (J = 0.82; sensitivity 100%; specificity 82%).

Discussion
To the best of our knowledge, this is the first study to gather validity evidence supporting the use of PHQ-2, PHQ-8, and PHQ-9 as a tool to detect major depression in the Spanish PHC population. The present study suggests the optimal cut-off value for a good sensitivity/specificity balance were ⩾ 2 for PHQ-2, ⩾7 for PHQ-8 and ⩾8 for PHQ-9.
Considering the information provided by the CIDI, the prevalence of major depression in our sample was 4.8%, which is consistent with a recent study in PHC attendees in Spain (Vieta et al., 2021) but lower than rates found in other studies focused on PHC (Craven & Bland, 2013;Serrano-Blanco et al., 2010). However, as the aim of the present study was to collect validity evidence to Note. *p < 0.05; **p < 0.01; ***p < 0.001. Since the total sample was divided into subsamples, correlations between scores in the other variables measured are not included. support the use of PHQ, we cannot formulate conclusions about the prevalence, which will be addressed in future research studies. In terms of psychometric properties, the PHQ-8 and the PHQ-9 showed high reliability as measured by the internal consistency of their scores, which is consistent with the results of previous studies assessing the validity of PHQ-9 in PHC in Spain (Muñoz-Navarro et al., 2017) and other Spanish-speaking countries from Latin America (Aslan et al., 2020;Schantz et al., 2017;Scoppetta et al., 2021).
Regarding validity evidence based on internal structure, a single-factor solution structure was supported, which is in agreement with previous studies in PHC patients in Spain (González-Blanch et al., 2018) and other Spanish-speaking populations (Alpizar, Laganá, Plunkett, & French, 2018;Aslan et al., 2020;Pagán-Torres et al., 2020). The one-factor solution was suggested in the exploratory phase of the analysis and confirmed in the confirmatory phase, where it showed reasonable goodness-of-fit indexes of the model in relation to data for both the PHQ-8 and PHQ-9. These results are similar to those reported in previous studies (González-Blanch et al., 2018;Patel et al., 2019). In addition, invariance across groups divided by sex, age and educational was confirmed, as it occurred in previous studies (González-Blanch et al., 2018;Patel et al., 2019). Validity evidence based on relationships with other variables confirmed the expected results, i.e. positive and significant relationships with anxiety and depression, and negative and significant associations with social support and H-RQoL. Additionally, ROC curve analysis allowed to establish, with good discrimination values, the optimal cut-off point using the CIDI as the criterion. According to the results, the optimal cut-off value for PHQ-2 is ⩾ 2, detecting 88% of PHC patients with major depressive disorders. Although the original validation study recommended a cut-off point ⩾ 3 (sensitivity 86%, specificity 90%), the authors used in that case an independent structured mental health professional interview as a criterion (Kroenke et al., 2003). In contrast, previous studies establish ⩾ 2 as the optimal cut-off values as well (Arroll et al., 2010;Rodríguez-Muñoz et al., 2017). In the present study, the originally proposed cut-off point (⩾3) had lower sensitivity (68%) and specificity (85%), as compared to the cut-off point of the original validation study. Thus, the cut-off point ⩾ 2 might be more effective in detecting major depression in the PHC setting, considering the CIDI as the criterion. With regard to PHQ-8, the optimal cut-off point in the present study was ⩾ 7 (sensitivity 86%, specificity 81%). This cut-off point detects 86% of cases of major depressive disorders. A recent meta-analysis based on fully-structured interviews as the criterion proposed the cut-off points of ⩾ 7 and ⩾ 8 as the values with the optimal sensitivity/specificity balance . Using the CIDI as the criterion, Arroll et al. (2010) also pointed to the need of considering a cut-off value lower than 10.
Regarding the PHQ-9 diagnostic algorithm, our results are in line with previous studies revealing high specificity but low sensitivity (Manea, Gilbody, & McMillan, 2015). In that case, results also converge with previous findings of the present study and suggest lower cut-off values than in studies conducted by other authors. This could be explained by the criterion used: the CIDI. According to previous studies, the CIDI yields a deeper diagnosis of major depression compared to the MINI interview, which diagnoses twice as many cases of major depression than the CIDI (Levis et al., 2018). Taking into account that PHQ is used to detect major depression in clinical practice, the cut-off values employed must maximize sensitivity with an adequate specifity.
The study has several strengths. First, the large size and heterogeneity of the sample, as participants were recruited from diverse PHC centers over the country, which guarantees the representativeness of the sample. Second, a rigorous methodology was used based on the Standards for Educational and Psychological Testing (AERA et al., 2014). Third, it should be emphasized that participants were evaluated by an external unit and that all instruments were administered the same day. Regarding limitations, on the one hand, participants were recruited in the context of the EIRA-3 study, which is focused on contacting participants with unhealthy behaviors. Thus, our study sample differs from that used in the validation study of the original version of the PHQ-9  in two main aspects. The first one is co-occurrence of unhealthy behaviors, and the second the age of participants, which were older in our study . Although this could have increased the probability that patients suffered from depression, the prevalence of depression was similar or even lower than expected (Institute of Health Metrics & Evaluation, 2019). In addition, the health behaviors evaluated are the most common in PHC, so participants were representative of the Spanish PHC population (Government of Spain, 2021; Gómez-Gómez et al., 2020). However, the scope of our study goes beyond estimating the prevalence of depression. Thus, this study also aimed at assessing the utility of the PHQ and providing evidence supporting its utility by showing its convergence with CIDI results. Hence, diagnoses established based on the CIDI were generally confirmed by the PHQ, independently of the characteristics and/or age of patients (which potential influence, if exists, would be true and not a bias as shown by invariance results). On the other hand, the instruments were administered by adjusting the model to the specific needs of each PHC center, using both, self-administered questionnaires and face-to-face interviews.

Implications for practice
Having in mind the results of the study, and considering both, the adequate psychometric properties of the three versions and the context of the Spanish health system, where the opportunities of assessment are limited, our recommendations are: (1) using PHQ-2 with a cut-off point ⩾ 2 as a pre-screening instrument; (2) when PHQ-2 indicates major depression, applying PHQ-8 (with a cut-off point ⩾ 7) or PHQ-9 (with a cut-off point ⩾ 8). The decision of using PHQ-8 or PHQ-9 should be made considering that the only difference between them is item 9, which was created to assess suicidal ideation and self-harm. In the psychiatric setting, item 9 is used as an indicator of suicide risk , as it is considered a strong predictor of suicide attempt in psychiatry practice (Simon et al., 2016). However, several studies have found that it overestimates suicide risk in different populations and settings (Na et al., 2018;Razykov et al., 2012Razykov et al., , 2013Suarez et al., 2015;Walker et al., 2011). For instance, a study conducted in 841 patients with depression revealed that item 9 detected almost three times more patients at risk of suicide (41.1%) than a specific scale created to assess suicide risk (13.4%) (Na et al., 2018). Similar results were found in non-psychiatric patients such as veterans attended in PHC (Corson, Gerrity, & Dobscha, 2004) and in patients with medical conditions such as heart disease (Razykov et al., 2012;Suarez et al., 2015); cancer (Walker et al., 2011); and systemic sclerosis (Razykov, Hudson, Baron, & Thombs, 2013). In all these studies, only a small proportion of participants responding affirmatively to item 9 had suicidal ideation or planned to commit suicide. Considering that item 9 has a high rate of false positives, especially in non-psychiatric populations, it can be concluded that PHQ-8 is more effective in assessing major depression in non-psychiatric settings, in large population surveys or large epidemiological studies in which the study staff or the researchers cannot cope with affirmative responses, and in clinical and research settings where follow-up of positive responses may be delayed (Kroenke et al., 2010). However, PHQ-9 would be an adequate choice on suspicion of suicidal ideation in a personal evaluation where the professional could intervene somehow. Finally, we do not recommend the PHQ-9 diagnostic algorithm to be used to assess major depression in the Spanish PHC setting as it only detects 29 to 49% of cases of major depression.
Based on the results of the present study, it can be concluded that PHQ is a good and valuable tool for screening and assessing major depression in PHC patients in Spain. Having adequate tools for detecting major depression could contribute to the early detection, implementation of prevention policies and programs, and administration of adequate and early treatments. Future studies could use a mixed methods approach to collect qualitative evidence related to the response processes developed by participants to respond the questionnaire.
Supplementary material. The supplementary material for this article can be found at https://doi.org/10.1017/S0033291722002835