Revising the self-report strengths and difficulties questionnaire for cross-country comparisons of adolescent mental health problems: the SDQ-R

Aims The Strengths and Difficulties Questionnaire (SDQ) has been used in many epidemiological studies to assess adolescent mental health problems, but cross-country comparisons of the self-report SDQ are scarce and so far failed to find a good-fitting, common, invariant measurement model across countries. The present study aims to evaluate and establish a version of the self-report SDQ that allows for a valid cross-country comparison of adolescent self-reported mental health problems. Methods Using the Health Behaviour in School-aged Children study, the measurement model and measurement invariance of the 20 items of the self-report SDQ measuring adolescent mental health problems were evaluated. Nationally representative samples of 11-, 13- and 15-year old adolescents (n = 33 233) from seven countries of different regions in Europe (Bulgaria, Germany, Greece, the Netherlands, Poland, Romania, Slovenia) were used. Results In order to establish a good-fitting and common measurement model, the five reverse worded items of the self-report SDQ had to be removed. Using this revised version of the self-report SDQ, the SDQ-R, partial measurement invariance was established, indicating that latent factor means assessing conduct problems, emotional symptoms, peer relationships problems and hyperactivity-inattention problems could be validly compared across the countries in this study. Results showed that adolescents in Greece scored relatively low on almost all problem subscales, whereas adolescents in Poland scored relatively high on almost all problem subscales. Adolescents in the Netherlands reported the most divergent profile of mental health problems with the lowest levels of conduct problems, low levels of emotional symptoms and peer relationship problems, but the highest levels of hyperactivity-inattention problems. Conclusions With six factor loadings being non-invariant, partial measurement invariance was established, indicating that the 15-item SDQ-R could be used in our cross-country comparison of adolescent mental health problems. To move the field of internationally comparative research on adolescent mental health forward, studies should test the applicability of the SDQ-R in other countries in- and outside Europe, continue to develop the SDQ-R as a cross-country invariant measure of adolescent mental health, and examine explanations for the found country differences in adolescent mental health problems.


Introduction
Worldwide, a significant percentage of adolescents experience mental health problems (Polanczyk et al., 2015). As these problems are likely to continue into adulthood (Rutter et al., 2006), mental health promotion efforts in adolescence are a global public health priority (Patel et al., 2007). To advance population-based knowledge of adolescent mental health, cross-country comparisons are essential (Achenbach et al., 2012). There is clear evidence of cross-country variation in adolescent subjective well-being (e.g., life satisfaction) in Europe (Bradshaw and Richardson, 2009;Klocke et al., 2014;Inchley et al., 2016), but global prevalence data on adolescent mental health problems are scarce (Erskine et al., 2017).
The Strengths and Difficulties Questionnaire (SDQ) (Goodman, 1997) is one of the most frequently used instruments to assess mental health problems (i.e., emotional, behavioural and relational problems) in adolescents. It has been included in epidemiological studies in various individual countries to assess population levels of adolescent mental health problems (see http://www.sdqinfo.org). However, cross-country comparisons based on the self-report SDQ are scarce and faced methodological challenges which are lined out below (Ravens-Sieberer et al., 2008;Essau et al., 2012;Ortuño-Sierra et al., 2015;Stevanovic et al., 2015;De Vries et al., 2018).
First, samples that are compared should be nationally representative and sample characteristics, sampling methods and data collection methods should be comparable across countries (Achenbach et al., 2012). However, this is often not the case in the available cross-country literature. Of the few cross-country studies that used the self-report SDQ, some only included adolescents from specific regions within countries (Essau et al., 2012;Stevanovic et al., 2015), they compared national samples with different gender or age distributions (Essau et al., 2012;Ortuño-Sierra et al., 2015;Stevanovic et al., 2015), or they compared national samples that were collected with different sampling methods (i.e., schoolv. household-based surveys) (Ravens-Sieberer et al., 2008), or data collection methods (e.g., collective or individual questionnaire administration) (Ravens-Sieberer et al., 2008;Ortuño-Sierra et al., 2015), that each may impact estimates of adolescent mental health problems (e.g., Vollebergh et al., 2006). Thus, it is not clear whether the crosscountry variation observed in these studies reflect actual or methodological differences in adolescent mental health problems between countries (Achenbach et al., 2012).
Second, to make valid comparisons, studies should test whether the structure of the underlying constructs measured by the SDQ (a common measurement model), and the meanings ascribed to these underlying constructs (measurement invariance) are comparable across countries. Only some of the cross-country studies on the self-report SDQ tested the (meaning of the) underlying constructs of the SDQ. These studies either did not find a common measurement model across different countries (Stevanovic et al., 2015), or had to allow correlated residuals between items (Ortuño-Sierra et al., 2015) to establish a common measurement model. Such modifications may however not replicate in different data sets (Kyriazos, 2018). Often, model fit issues were related to the five reverse worded items of the SDQ: they cross-loaded on the prosocial behaviour subscale or negatively affected the overall model fit (Essau et al., 2012;Ortuño-Sierra et al., 2015). Those studies that did establish a common measurement model did not find evidence for measurement invariance (Essau et al., 2012) or established partial measurement invariance (Ortuño-Sierra et al., 2015).
Because of these challenges, it has been argued that the selfreport SDQ in its present form is not suitable for cross-country comparisons (Stevanovic et al., 2017) and needs to be revised (Essau et al., 2012;Ortuño-Sierra et al., 2015;Stevanovic et al., 2015). More specifically, it has been suggested that the reverse worded items of the SDQ should be re-worded or removed (Essau et al., 2012). Also, it has been argued that the measurement model should be examined in countries across different regions in-and outside Europe (Ortuño-Sierra et al., 2015;De Vries et al., 2018).
The present study sets out to evaluate and establish a version of the self-report SDQ that can be used to validly compare mental health problems of 11-, 13-and 15-year old adolescents across seven European countries. We attempt to overcome the former methodological challenges by (1) using national representative samples of adolescents with similar sample characteristics, assessed with similar sampling and data collection methods in seven countries of different regions in Europe (Bulgaria, Germany, Greece, the Netherlands, Poland, Romania, Slovenia), (2) establishing a goodfitting, common measurement model using cross-validation to assess the replicability of model modifications, and (3) testing the invariance of this common measurement model.

Participants
Data on the self-report SDQ from the Health Behaviour in School-aged Children (HBSC) study that were collected in the 2005(Poland), 2009/2010 and 2013/ 2014 (Bulgaria, the Netherlands, Slovenia, Romania) surveys were used. HBSC is a cross-sectional, school-based survey that is conducted every 4 years across more than 40 countries in Europe and North America. Using a standardised research protocol, self-report questionnaires are administered to nationally representative samples of 11-, 13-and 15-year-olds in the classroom. Samples are drawn using cluster sampling, with schools or school classes as primary sampling units. School response rates varied by country but were >80% in all countries except in the Netherlands (49%). At the student-participant level, response rates ranged from 78 to 94%. More information can be found elsewhere (Currie et al., 2010. In the Netherlands (2005( /2010( /2014( ) and Bulgaria (2005( /2014 self-report SDQ data were collected in multiple HBSC surveys. Results showing that the measurement model of the SDQ is invariant across these timepoints in the Netherlands (Duinhof et al., 2015) and Bulgaria (Appendix The total sample consisted of 33 233 11-, 13-and 15-year old adolescents, 51% were girls (ranging between 47.7 and 53.3% across countries). No significant ( p > 0.001) gender and age distribution differences were found across the country samples. Adolescents who did not fill in the SDQ (n = 279, 0.8% of the total sample) were excluded from the analyses. For the remaining samples, missing item responses ranged from 0.1 to 3.3%.

Measures
Adolescents filled in the self-report SDQ (Goodman, 1997) in their national language. The self-report SDQ is a 25-item questionnaire for 11-17 year olds. It consists of four subscales measuring mental health problems (conduct problems, emotional symptoms, peer relationship problems, hyperactivity-inattention problems) and one subscale measuring strengths (prosocial behaviour). In the present study, data were only available for the problem subscales. Each subscale comprises five items that are scored on a three-point ordinal Likert scale (0 = 'Not true', 1 = 'Somewhat true', 2 = 'Certainly true'). Items are phrased in the direction of their subscales, with higher scores indicating higher problem levels, except for five reverse worded items: obedient, has good friend, generally liked, thinks before acting and good attention. The exact wording of the items and abbreviations used in this study can be found in Appendix B.
Adolescents indicated their gender by responding to the question: 'Are you a boy or a girl?'. Age was determined based upon the participant's month and year of birth and the date of survey administration.

Analytical strategy
Analyses were performed in Mplus 8.2 (Muthén and Muthén, 2017), using the weighted least squares mean and variance adjusted estimator and the theta parameterisation. Analyses were corrected for cluster effects of adolescents in the same school.
Step 1: Establishing a common measurement model To establish a common measurement model we collated the data from all countries. A common measurement model was only established if the model showed an acceptable to good fit in this total sample and in each individual country. Based on preliminary analyses (see Appendix C) and findings from previous cross-country comparisons supporting a first-order five-factor model (Essau et al., 2012;Ortuño-Sierra et al., 2015), a first-order model with four correlated factors measuring mental health problems was used as a starting point.
Using confirmatory factor analysis (CFA), a common measurement model was established considering the following guidelines. First, to find a parsimonious common measurement model that corresponds to the theoretical structure introduced by Goodman (1997) and to protect against non-theory-driven model modifications that might not replicate in other samples (Hermida, 2015;Kyriazos, 2018), correlated item residuals and cross-loadings of items were not allowed. Second, items with nonsignificant factor loadings and/or standardised factor loadings below 0.40 were considered unacceptable (Ford et al., 1986). When supported by previous empirical findings, these items were removed. Third, the overall model fit was evaluated (acceptable fit = RMSEA <0.08 and CFI >0.90; good fit = RMSEA <0.05, CFI >0.96) (Browne and Cudeck, 1992;Hu and Bentler, 1999;Yu, 2002). If models did not show acceptable fits, model modification indices (MI) were consulted to review misspecified model parameters. As MI may be driven by characteristics of the sample on which the measurement model is tested (Byrne, 2012), crossvalidation was used. Of the total sample, 9/10th was used to test and modify models using MI while a random 1/10th was used for validation purposes. Only if a good fitting model was established in both the test and validation set, validation was ended.
Internal consistency of the problem subscales was examined as a quality indicator of the final common measurement model using the ordinal alpha coefficient. Ordinal alpha values above 0.70 were considered acceptable (Nunnally and Bernstein, 1967;Gadermann et al., 2012).
Step 2: Invariance testing To make valid cross-country comparisons, a common measurement model should be established (configural invariance), items should have invariant relationships to their latent factors across countries (metric invariance) and adolescents in different countries should report invariant average scores on the items (scalar invariance). The three-step method testing configural, metric and scalar models was used. First, a configural model with factor loadings and thresholds freely estimated across countries was tested. Second, a metric model with factor loadings constrained equal across countries was examined. Third, a scalar model with factor loadings and thresholds constrained equal across countries was tested.
If invariance tests indicated a lack of metric or scalar invariance, partial invariance can be established and latent means scores can still be compared across countries (Steinmetz, 2013;Bowen and Masa, 2015). Partial measurement invariance was established by freeing the factor loading/threshold of one item at the time, starting with the factor loading/threshold with the highest MI (Dimitrov, 2010). Our analyses showed that only MI accompanied by a meaningful expected parameter change increased model fit. Hence, both values were inspected to identify non-invariant item factor loadings/thresholds. Changes in CFI values (ΔCFI ⩾ −0.010) and RMSEA values (ΔRMSEA ⩾ 0.015) compared to the configural or metric model were used to evaluate whether (partial) invariance criteria were met (Cheung and Rensvold, 2002;Chen, 2007). Following Dimitrov (2010), partial measurement invariance was established if <20% of the factor loadings and thresholds were non-invariant across all countries.
Step 3: Cross-country comparisons If (partial) measurement invariance was established, latent means were compared across countries. Since significant latent mean differences are easy to find in large samples, we applied a stringent significance level ( p < 0.001) and examined the substantially of the latent mean differences by evaluating the size of the standardised latent mean differences using Cohen's d (Cohen, 1988). In multi-group CFA, Mplus by default fixes the means of the latent variables in the first group to zero. Bulgaria was arbitrarily set as the reference country.

Results
Step 1: Establishing a common measurement model Table 1 shows the fit indices of the models tested to establish a common measurement model. Models testing and validating the firstorder four-factor model failed to demonstrate acceptable CFI values (Table 1; Model 1 and 2). The reverse worded item obedient was not related to the conduct problem subscale in both the first (β = 0.01, p = 0.52, R 2 = 0.00), and second model (β = −0.03, p = 0.17, R 2 = 0.00). Standardised factor loadings of the other reverse worded items belonging to the peer relationship problems and hyperactivity-inattention problems subscales were unsatisfactory low with standardised factor loadings below 0.40. Only in the validation model good attention loaded just satisfactory on the hyperactivity-inattention problems subscale (β = −0.41, R 2 = 0.17).
To increase model fit, the non-significant item obedient was removed. Model 3 and 4 show that after removing this item CFI values remained unacceptably low. Similar to Model 1 and 2, factor loadings of the remaining reverse worded items were unsatisfactorily low (β < 0.40), and only a small proportion of their variance was explained by their corresponding latent factors (R 2 range = 0.08-0.13). The MIs of Model 3 and 4 also indicated problems with the reverse worded items. In both models, the two largest MIs suggested to correlate the residuals of the reverse worded items belonging to the same subscale (peer relationship problems or hyperactivity-inattention problems). Given these findings, our aim to establish a parsimonious common measurement model that replicates in future studies, and the numerous studies indicating problems with the reverse worded items (e.g., Essau et al., 2012;Ortuño-Sierra et al., 2015), we decided to remove the remaining reverse worded items. This resulted in a good model fit (Table 1; Model 5 and 6), with CFI values nearing 0.96 and RMSEA values below 0.05. Therefore we took this model without the reverse worded items as the final measurement model and tested its fit on the total sample and in each individual country. The final common measurement model showed a good fit on the total sample (Table 1; Model 7) and acceptable factor loadings were found for all items (β > 0.40) ( Table 4). The final measurement model reached an acceptable to good fit in Bulgaria, the Netherlands, Germany, Greece and Poland, with CFI values near 0.96 and RMSEA values below or close to 0.05, and an acceptable model fit in Slovenia and Romania, with CFI values of or above 0.90, and RMSEA below 0.08 (Table 1). Except for the items steals (β = 0.33) and prefers adult (β = 0.34) in Poland, in all countries, items loaded satisfactorily on their latent factors (β > 0.40). Table 2 shows that in all countries, the emotional symptoms and hyperactivity-inattention subscales showed acceptable internal consistencies (α close to or above 0.70). The conduct problems subscale showed acceptable internal consistencies in most countries, with Greece and Slovenia reporting ordinal α values slightly below 0.70. Only in Poland, an unsatisfactorily low ordinal α value was found for the conduct problem subscale (α = 0.60). In all countries, the peer relationship problems subscale had a low internal consistency.
Step 2: Invariance testing Measurement invariance was tested across countries (Table 3). The configural model (i.e., the common measurement model), with no equality constraints across countries, showed an acceptable fit to the data. Constraining factor loadings equal across countries decreased the model fit (ΔCFI ⩾ 0.010), showing that latent factors had no equivalent relationships with all items across countries and that metric invariance was not supported. After the factor loadings of six items were set free in specific countries (see footnote Table 3), partial metric invariance was established. After establishing partial metric invariance, we tested for scalar invariance. Scalar invariance was found (ΔCFI = −0.006). With sixfactor loadings being non-invariant of the total 45 parameters in the measurement model (i.e., 15 factor loadings and 30 thresholds), the observed percentage of non-invariance across all countries was 13.3%. The resulting final partially invariant model showed an acceptable fit to the data (Table 3; Model 4). Table 4 shows that in the final partially invariant model all items loaded satisfactorily on their latent factors (βs > 0.40). Only in Poland the prefers adult item loaded unsatisfactorily low (β = 0.33) on the peer relationship problems subscale and the fights and steals items loaded just satisfactorily (βs = 0.40/ 0.41) on the conduct problems subscale. The final model included a warning about the high correlations between the latent factors in Romania. The results of Table 4 support this warning, especially the correlation between peer relationship problems and conduct problems is exceptionally high in Romania (r = 0.98). In general, latent factor intercorrelations were high (see Table 4), indicating that models with less factors might be a better fit to the data. However, additional CFA analyses testing a one-factor solution (measuring general mental health problems) and a two-factor solution (measuring internalising and externalising mental health problems) did not support this (see Appendix D).
Step 3: Cross-country comparisons To describe cross-country differences in adolescent mental health problems, Table 5 displays country rankings for each problem subscale based on the unstandardised latent mean differences, with Bulgaria as the reference country. Higher rankings indicate higher latent mean levels of adolescents' self-reported problems. Setting other countries as the reference country resulted in similar rankings. To evaluate the substantiality of these cross-country differences, Table 5 also includes standardised latent mean differences (d). Only significant ( p < 0.001) and substantial (d > 0.20) latent mean differences were considered indicative of cross-country differences. Adolescents in Poland reported the highest levels of emotional symptoms and conduct problems. Adolescents in Greece reported the lowest levels of emotional symptoms (together with adolescents in Bulgaria), peer relationship problems and hyperactivity-inattention problems. Adolescents in Bulgaria, Germany and Slovenia reported the highest levels of peer relationship problems. Adolescents in the Netherlands reported the lowest levels of conduct problems, but the highest levels of hyperactivity-inattention problems.

Discussion
By applying cross-validation and using nationally representative samples of seven countries of different European regions assessed with similar sampling and data collection methods, this study established a revised version for the problem subscales of the selfreport SDQ (SDQ-R). To construct this good-fitting, common measurement model, the five reverse worded items of the selfreport SDQ had to be removed. The SDQ-R was found to have a sufficient amount of invariant items, indicating that adolescent mental health problems could be validly compared across the seven countries in this study. By establishing the SDQ-R, this study contributes to the scarce literature on the cross-cultural validity of scales that examine adolescent mental health problems (Stevanovic et al., 2017). Our findings are in line with previous internationally comparative studies, that also indicated problems with the five reverse worded items of the SDQ (Essau et al., 2012;Ortuño-Sierra et al., 2015). The removal of the reverse worded items led to a common measurement model that showed an acceptable to good fit in each individual country. The finding that the reverse worded items had no significant or substantial relationships with their underlying latent factors might be explained by a methodological phenomenon called reversal ambiguity (Weijters and Baumgartner, 2012). Adolescents may not interpret the reverse worded items as opposites of the construct being measured and thus agree with both the reverse and positively worded items of the SDQ subscales. To illustrate, adolescents may agree with both the reverse worded item 'I have one good friend or more' and the positively worded item 'Other children or young people pick on me or bully me' of the peer relationship problems subscale. It is also possible that the reverse worded items tap into a different construct (e.g., Van de Looij-Jansen et al., 2011), and do not adequately measure a positive equivalence of mental health problems. Both these explanations substantiate our decision to remove the reverse worded items in order to establish the SDQ-R.
Notwithstanding the former, invariance tests indicated that the factor loadings of the fights, lies, clingy, prefers adult, fidgety and distractible items were non-invariant across all countries. Except for the fidgety item, these findings are in accordance with results from previous cross-country comparisons (Essau et al., 2012;Ortuño-Sierra et al., 2015). As such, partial measurement invariance is established, which means that latent means can still be compared across countries (Steinmetz, 2013). To facilitate the interpretation of latent mean differences we presented crosscountry rankings.
Looking at the cross-country rankings found in this study, previous studies on cross-country variation in adolescents' subjective well-being found highly similar country rankings, with Greece and the Netherlands at the top and Poland at the bottom (Bradshaw and Richardson, 2009;Klocke et al., 2014;Inchley et al., 2016). Whereas a recent meta-analysis found no cross-country variation in adolescents' attention-deficit/hyperactivity disorders (ADHD) (Willcutt, 2012), this study found clear cross-country differences in adolescent self-reported hyperactivity-inattention problems. Interestingly, while Dutch adolescents reported the lowest levels of conduct problems and low levels of emotional symptoms and  Note. All factor loadings, explained variance (R 2 ), and correlations between latent factors were significant at p < 0.001. EMO, emotional symptoms; COND, conduct problems; HYP, hyperactivity-inattention problems; PEER, peer relationship problems. peer relationship problems, they reported by far the highest levels of hyperactivity-inattention problems. Future studies are encouraged to further investigate the found country differences in adolescent mental health problems.

Epidemiology and Psychiatric Sciences
In evaluating the SDQ-R some limitations should be considered. First, this study included data from different HBSC surveys. Although a recent trend analysis in the Netherlands based on the self-report SDQ revealed rather stable mental health problem levels over a 10-year period (Duinhof et al., 2015), we cannot exclude the possibility that our country rankings to some extent reflect time interval differences. Second, by removing the reverse worded items, the SDQ-R measures slightly different concepts of conduct problems, peer relationship problems and hyperactivity-inattention than the original self-report SDQ. To illustrate, the original hyperactivity-inattention subscale was designed to represent the three behavioural dimensions of a DSM-IV diagnoses of ADHD (American Psychiatric Association, 2013) and includes items measuring hyperactivity, inattention and impulsiveness (Goodman, 2001). By removing the reverse worded item from the hyperactivity-inattention problems subscale, the impulsiveness dimension of ADHD is not included in the SDQ-R anymore, and only one item taps into the inattention dimension. However, more generally, being a brief instrument for assessing adolescent mental health problems, one can debate whether the multidimensional nature of ADHD can be captured adequately by the SDQ at all (e.g., Garrido et al., 2018).
Third, the three-step method of invariance testing requires a referent indicator to identify the model (Muthén and Muthén, 2017), that is assumed to be perfectly invariant across groups. Non-invariant referent indicators may negatively impact the model fit and affect the results of invariance testing (Cheung and Rensvold, 1999;Johnson et al., 2009). A sensitivity analyses were conducted to make sure that the choice for the referent indicator did not influence the results negatively. For these, we ran several metric models by setting items consecutively as the referent indicator. The default setting (the first item as the referent indicator) showed one of the best model fits, and we continued with this metric model. Fourth, there is a debate about whether factor loadings and thresholds should be tested separately or in tandem to establish measurement invariance. We choose to test factor loadings and thresholds separately as this approach is less conservative and more explicit about the source of non-invariance (Bowen and Masa, 2015). Finally, CFA is known to produce inflated latent factor correlations if cross-loadings are meaningfully departing from zero in the population (Asparouhov and Muthén, 2009;Garrido et al., 2018). For example, in Romania, the MI suggested a cross-loading between the distractible item of the hyperactivity-inattention problems subscale and the peer relationship problems subscale. In CFA, such nonzero crossloadings are fixed to zero, which may have been an overly stringent requirement for Romania, and resulted in overestimated latent factor intercorrelations. Thus, the latent factor correlations in this study need to be interpreted with care.

Conclusion
Cross-country comparison using the SDQ have the great potential to advance our understanding of adolescent mental health. It can inform and drive global and national intervention and prevention efforts. The present study introduces a revised version of the selfreport SDQ, the SDQ-R, that allowed for a valid comparison of adolescent mental health problems across seven countries of different regions in Europe. Mental health was relatively high in Greece, relatively low in Poland and most divergent in the Netherlands. To build our knowledge of adolescent mental health in-and outside Europe, future studies should further test the applicability of the SDQ-R, and further develop the self-report SDQ-R as a cross-country invariant measure of adolescent mental health problems.
Availability of data and materials. Data of the HBSC study is available upon request. For more information, see http://www.hbsc.org.