Implementation measurement in global mental health: Results from a modified Delphi panel and investigator survey

Limited guidance exists to support investigators in the choice, adaptation, validation and use of implementation measures for global mental health implementation research. Our objectives were to develop consensus on best practices for implementation measurement and to identify strengths and opportunities in current practice. We convened seven expert panelists. Participants rated approaches to measure adaptation and validation according to appropriateness and feasibility. Follow-up interviews were conducted and a group discussion was held. We then surveyed investigators who have used quantitative implementation measures in global mental health implementation research. Participants described their use of implementation measures, including approaches to adaptation and validation, alongside challenges and opportunities. Panelists agreed that investigators could rely on evidence of a measure's validity, reliability and dimensionality from similar contexts. Panelists did not reach consensus on whether to establish the pragmatic qualities of measures in novel settings. Survey respondents (n = 28) most commonly reported using the Consolidated Framework for Implementation Research Inner Setting Measures (n = 9) and the Program Sustainability Assessment Tool (n = 5). All reported adapting measures to their settings; only two reported validating their measures. These results will inform guidance for implementation measurement in support of mental health services in diverse global settings.


Introduction
Mental, neurological and substance-use (MNS) disorders are the leading causes of disability globally, yet most people in need of treatment for MNS disorders never receive care (Thornicroft et al., 2017; Pathare et al., 2018; Vos et al., 2020). Effective, affordable, scalable and sustainable services are needed to bridge this global gap (Lancet Global Mental Health Group et al., 2007). A broad range of preventive and treatment interventions for high-burden MNS conditions have demonstrated promising cost-effectiveness in both high- and low-resource settings (Patel et al., 2016); in response, researchers and funders alike have called for an increased scientific focus on strengthening intervention implementation and scale-up, particularly in low- and middle-income countries (LMICs), through the application of the methods of implementation science (Betancourt and Chambers, 2016). The primary aim of implementation science is to design and test ways to promote and sustain the delivery of evidence-based practices in routine healthcare (Eccles and Mittman, 2006). These implementation strategies target specific aspects of the environment of service delivery, or of the intervention providers or of the intervention itself, all with the goal of improving uptake and sustainment. Implementation success is assessed through a range of implementation outcomes, including acceptability, adoption, appropriateness, cost, feasibility, fidelity, penetration and sustainability (Proctor et al., 2011). For example, if unhelpful attitudes or beliefs among clinic staff are thought to be hindering implementation of evidence-based mental health care, the use of peer influencers or opinion leaders might be considered as an implementation strategy to improve provider acceptance of mental health services. Application of implementation science methods to the field of global mental health has grown rapidly in recent years (Wagenaar et al., 2020).
This growth has outpaced the development and validation of pragmatic tools for implementation measurement in diverse global settings. As with any science, valid measurement is critical to the utility and reproducibility of implementation research (Lewis et al., 2015). For example, many implementation studies begin with an assessment of the multi-level contextual determinants of implementation effectiveness (Damschroder et al., 2009). These determinants can inform the choice of implementation strategies; they are also useful for understanding the process of implementation, and they may moderate or mediate intervention effects (Waltz et al., 2019). Measurement of implementation outcomes is also critical to judging the effectiveness of implementation strategies. While some implementation constructs may be manifest, or measured through observable indicators (e.g., rate of provider service delivery as an indicator of penetration) (Willmeroth et al., 2019), many are latent, implying some level of self-report (e.g., provider acceptability). Many quantitative measures of latent implementation constructs exist and have been identified and catalogued through systematic review; relatively few, however, have been assessed for validity or have documented strong psychometric properties, though the number of measures with strong psychometric properties is increasing (Khadjesari et al., 2020; Mettert et al., 2020). Even fewer measures have been assessed for their pragmatic qualities, including burden, length, reliability and sensitivity to change (Hull et al., 2022). Importantly, almost all extant, validated, pragmatic, quantitative implementation measures were developed for use in high-income countries (Lewis et al., 2015). These implementation measures, and their corresponding theories, models and frameworks, may need to be appropriately translated, adapted and validated for use in diverse global contexts (Means et al., 2020).
To date, most implementation studies by global mental health researchers have relied exclusively on qualitative assessment, with relatively few using quantitative implementation measures (Wagenaar et al., 2020). Though qualitative methods are a crucial part of implementation science, valid quantitative measurement allows for larger studies and improves study rigor and reproducibility (Palinkas et al., 2011; Palinkas, 2014). Investigators have several factors to consider when choosing quantitative measures for use (in addition to whether an appropriate measure exists), including different aspects of measure validity and reliability, as well as each measure's pragmatic qualities (e.g., length, cost) (Powell et al., 2017). Given that almost all existing implementation measures were developed for use in high-resource settings, global mental health researchers must carefully consider the validity and appropriateness of each measure in their setting. There are several distinct approaches available for establishing validity and other measure characteristics in novel settings (Boateng et al., 2018). Table 1 describes these characteristics and approaches in detail and notes which approaches are designed to assess which characteristics. For example, cross-cultural validity can be established using translation, back-translation, expert advice and pre-testing.
Limited guidance exists to support global mental health services investigators in the choice and use of quantitative implementation measures, or in the choice and use of approaches to adapt and validate those measures. Our objectives in this project were to (1) bring together a panel of experts to better understand and develop consensus on best practices for implementation measurement, with a particular focus on mental health implementation research in LMICs, and (2) survey investigators applying these measures to identify strengths and opportunities in current practice.

Participants
We used purposive sampling to select and invite a panel of experts at the intersection of implementation science, psychometrics and global mental health, starting from a list generated by members of the study team. Specifically, we approached experts in our extended professional networks who we knew had experience with developing, adapting or validating implementation measures for use in global mental health research. We recruited eight panel members (see Supplementary Material for a full list of panel participants). One panel member withdrew between the first and second panel discussions.

Delphi process
The goal of our modified Delphi process was to develop consensus among the panel members on: (1) prioritization of different types of measure validity, reliability and pragmatic qualities for assessment and confirmation when using measures under different circumstances and in different settings (see Table 1 for definitions of each quality); (2) feasibility and utility of different measure validation approaches (see Table 1 for definitions of each approach) and (3) a minimal set of validation approaches for use when applying implementation measures in new contexts and settings. We followed the steps of a conventional Delphi process, including an exploratory phase, a first round of quantitative questionnaires, analysis/summation and results discussion (Avella, 2016). A preliminary discussion was held in March 2020 to orient panelists to the Delphi process. Questionnaires were then distributed and completed electronically. Questionnaire responses were aggregated and anonymized, and summary statistics of responses were presented to the panel. Following the distribution of the questionnaire analysis, available panel members were convened virtually to review the results and, if possible, achieve consensus on recommendations.
Questionnaires included three sections (see Supplementary Material). In the first section, panel members were given different measurement scenarios (e.g., use of an implementation measure developed in a US context to assess the same construct in a novel, lower-resource context) and were asked which types of measurement characteristics (e.g., different types of validity, reliability or pragmatic qualities; Table 1) need to be established prior to measure use in a novel context. In the second section, panel members rated distinct validation strategies (e.g., informal expert elicitation, pilot survey with subsequent real-world outcomes; Table 1) on nine dimensions of rigor, feasibility and resource intensiveness. Finally, in the third section, panel members proposed a minimal set of validation strategies that researchers could use under most circumstances when applying an implementation measure in a diverse new setting.
One author (KD) had access to the questionnaire responses and interview data and completed all analyses (Linstone and Turoff, 1975). To maintain confidentiality and promote the rigor of the process, no identifying information was shared with other members of the research team or expert panel. Results draw from all questionnaire and interview responses as well as discussion during the second-round call. CK moderated, and LA attended but did not contribute to, both rounds of panel discussion.
The aim was to achieve a reasonable degree of consensus among panel members. No a priori target for degree of consensus was set for this study, and a full consensus-based approach was not pursued. This was done for reasons of appropriateness and feasibility; in particular, there are only a small number of experts at the intersection of global mental health and implementation measurement worldwide, and ongoing travel restrictions and social distancing measures related to the COVID-19 pandemic meant in-person consensus-building activities were impossible at the time. Though we did not use a quantitative threshold (e.g., calculating an agreement statistic or holding a formal vote) to assess consensus, we did bring the expert panel together for a Zoom-based discussion of the summary of their questionnaire results, with a particular focus on areas of divergence. Panel members agreed with the synthesis of results and concluded that the rankings of results within each subsection were acceptable and reflected their judgement.
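Although this study did not apply a quantitative consensus threshold, one convention sometimes used in Delphi studies is to declare consensus when the interquartile range (IQR) of panel ratings falls within one point on the rating scale. The sketch below illustrates that rule on hypothetical seven-member panel ratings; the `has_consensus` function, the one-point threshold and the example data are illustrative assumptions, not part of this study's methods.

```python
import numpy as np

def has_consensus(ratings, max_iqr=1.0):
    """Flag consensus when the interquartile range of panel ratings
    is at most max_iqr points (a common, but arbitrary, Delphi rule)."""
    q1, q3 = np.percentile(ratings, [25, 75])
    return (q3 - q1) <= max_iqr

# Hypothetical 5-point appropriateness ratings from seven panelists
tight = [4, 4, 5, 4, 5, 4, 5]   # clustered ratings -> consensus
split = [1, 5, 2, 5, 1, 4, 2]   # polarized ratings -> no consensus
print(has_consensus(tight), has_consensus(split))  # True False
```

A threshold like this trades nuance for transparency; the panel here instead resolved areas of divergence through structured discussion.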

Participants
We also conducted a survey of global mental health researchers to understand current practice in implementation measurement. We searched NIH RePORTER and the Grand Challenges Canada website on May 18, 2020, for descriptions of funded implementation research studies related to mental health services in LMIC settings (see Supplementary Material for the NIH RePORTER search strategy). The names and contact information for the lead principal investigator for each study, as well as study descriptions, were abstracted into a sampling frame. One of three authors (C.G.K., K.D., L.A.) screened each study and associated principal investigator for inclusion; studies were excluded if they were not conducted in an LMIC or were not related to mental health. We contacted all remaining principal investigators and invited them to participate in a structured online survey related to the measurement of implementation processes and outcomes in their study. Principal investigators could also nominate a study team member or collaborator (someone who was directly involved in the implementation measurement component of the study) to participate in their place. Between NIH RePORTER, Grand Challenges Canada and this snowball sampling approach, we anticipated reaching most investigators with experience leading formal global mental health implementation research. Contacted investigators were sent a reminder email if they did not initially respond to the online questionnaire within a 2-week period, and a final reminder was sent 2 weeks later. Survey recruitment and data collection occurred from July to November 2020.

Survey measures
We designed the survey to assess: (1) the scope and nature of global mental health implementation research conducted by each investigator, (2) the range of implementation process and outcome measures used by investigators across any of their implementation studies and (3) the study setting, population, sample size, types of measure adaptation or validation used if any, assessment of measure performance and any recommendations for measure improvement.

Analysis
Categorical responses were summarized using simple descriptive statistics at the level of the respondent. Open-text responses were reviewed for recurring themes or approaches to adaptation and validation.
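Respondent-level tabulation of this kind amounts to counting each measure at most once per respondent. The sketch below illustrates the idea on fabricated records; the respondent data and measure labels are hypothetical and do not reproduce the study's dataset.

```python
from collections import Counter

# Hypothetical respondent-level records: one row per investigator,
# listing the implementation measures they reported using
responses = [
    {"id": 1, "measures": ["CFIR Inner Setting", "PSAT"]},
    {"id": 2, "measures": ["CFIR Inner Setting"]},
    {"id": 3, "measures": ["AIM/IAM/FIM", "PSAT"]},
]

# Count each measure once per respondent, then report n and percent
counts = Counter(m for r in responses for m in set(r["measures"]))
n = len(responses)
for measure, k in counts.most_common():
    print(f"{measure}: n = {k} ({100 * k / n:.0f}%)")
```

Deduplicating with `set` per respondent is what keeps the statistics "at the level of the respondent" rather than at the level of individual mentions.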

Research ethics
The Human Subjects Division of the University of Washington determined that both components of this study qualified for exemption status under 45 CFR 46.101 (b).

Expert panel
Section 1: Measure characteristics

There was substantial concordance across panel members indicating it was reasonable to rely on evidence of most measure characteristics that had been established in similar contexts (e.g., another low-resource setting) without needing to establish those characteristics in every new setting (Supplementary Material, Section 1). This was true for all types of measure validity, reliability and dimensionality, except for cross-cultural validity (i.e., adequate adaptation for and performance in a new context), which was judged important to establish in each new setting. In contrast, there was limited agreement on the need to establish the pragmatic qualities of measures in each new setting. Though qualities like measure cost, length, ease of completion and assessor burden were judged unnecessary to establish in new settings if already established in similar settings, qualities related to how the measure would be used (e.g., whether it would inform decision-making, whether it fit with organizational activities) were felt to be important to establish in each new setting.
Panel members were then asked whether it was ever possible to rely on evidence of measure characteristics that had been established in other settings, even settings that were substantially different (e.g., a high-income country). Respondents indicated that if investigators established the face validity of an implementation measure in a new setting (for example, through informal expert review and a small pilot use with confirmatory factor analysis), it would not then be necessary to conduct an intensive validation process. Respondents suggested that because implementation measures were not used directly to guide patient care, the stakes were lower than for other measures (e.g., diagnostic or screening tools), and correspondingly the bar for validation was lower.
Panel members were also asked how they would choose between different hypothetical implementation measures based on their pragmatic qualities, assuming the hypothetical measures were equally valid. Respondents scored nearly all pragmatic qualities as important in making this decision, though acceptability, ease of completion, cost and language accessibility were rated as the most important qualities (Table 2). In follow-up conversations with panel members, nearly all highlighted measure length as a key issue with current implementation measures, raising concerns related to respondent fatigue, assessor fatigue and artificial inflation of internal consistency. Respondents also felt that the results from most currently available measures were difficult to interpret, and that this was holding back their use and applicability. They suggested that the inclusion of quantitative thresholds and other guidance on how to judge what measure scores "mean" would be beneficial.
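The concern that length artificially inflates internal consistency follows directly from classical test theory: holding item quality constant, Cronbach's alpha rises mechanically with the number of items (the Spearman-Brown relationship). The simulation below is a minimal sketch of this effect using synthetic data; the sample size, item counts and noise levels are arbitrary assumptions, not values from this study.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Simulate respondents whose item scores share one latent trait plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))

def simulate_items(k):
    # k equally noisy parallel items loading on the same latent trait
    return latent + rng.normal(size=(500, k))

alpha_short = cronbach_alpha(simulate_items(4))
alpha_long = cronbach_alpha(simulate_items(12))
# Identical item quality, but alpha rises with length (Spearman-Brown)
print(f"4 items: {alpha_short:.2f}, 12 items: {alpha_long:.2f}")
```

This is why a high alpha on a long measure is weak evidence of quality: the longer form scores higher even though each item is no more informative.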

Section 2: Validation strategies
Respondents identified a trade-off between the rigor of different validation approaches and their resource-intensiveness (Supplementary Material, Section 2). The two survey-based validation strategies, one using other established measures and the other using subsequent real-world outcomes for validation, were judged to be the most rigorous as well as the most expensive and time-consuming. Respondents rated the two forms of expert elicitation (informal and formal) as moderately or highly feasible and inexpensive, but there was no agreement on the assumed rigor of the results. Translation/back-translation scored consistently and moderately on all dimensions. Respondents disagreed most about the vignette-based strategy; they did not agree on the amount of time and resources required, nor on whether it was feasible to develop vignettes that could provide high-confidence results in diverse low-resource settings. One respondent cautioned that developing good vignettes for community mental health programs could be hampered by the fact that these services are often uncommon in low-resource settings, and thus there is no "gold standard" program to which one can refer. Instead, vignettes must use hypothetical examples that take longer to explain and may produce unreliable results.
Section 3: Package of validation strategies

Translation/back-translation was the most frequently recommended strategy, followed by informal expert elicitation. No other strategy was recommended by more than two respondents. Several respondents struggled with the tension between cost and rigor and wondered whether a minimal set of validation strategies might be feasible in most situations but ultimately insufficient for establishing validity. Most respondents suggested that using a combination of validation strategies was the most appropriate approach; nearly all respondents argued that strategies should be "fit for purpose" and only as rigorous and complex as necessary. Respondents also debated the most appropriate approach to disseminating guidance on implementation measurement to mental health services researchers across diverse global settings. One respondent argued for the provision of step-by-step guidance, while another cautioned against offering overly prescriptive guidance to LMIC-based investigators.
Complete Delphi panel results are presented in the Supplementary Material.

Table 2. Delphi panel pragmatic quality importance ratings (mean rating)

Acceptable: 3.8
Completed with ease (How hard is the measure to complete?): 3.8
Cost (Is the measure free to use?): 3.8
Uses accessible language (What is the reading level of the measure?): 3.8
Appropriate (Does use of the measure interfere with service implementation?): 3.4
Length (How many items does the measure have?): 3.2
Informs clinical or organizational decision-making (Are the measure findings actionable?): 3.0
Fits organizational activities (Does the measure map to actual services?): 2.8
Assessor burden, training (How much training is required to learn how to administer the measure?): 2.8
Assessor burden, interpretation (Does the measure have clear cut-offs, instructions for handling missing data and generating summary scores?): 2.8
Offers relative advantage over existing methods (Is the measure better than other approaches to assessment of the same construct?): 2.4

Investigator survey
We invited 107 investigators to participate in the survey or to suggest other investigators for participation. Sixty-two investigators responded. We sent survey links to 45 investigators who indicated interest in participation. Thirty-eight investigators started the survey. Table 3 presents the characteristics of the 28 investigators who completed the survey. The majority (61%) were based in the United States, most (82%) were at universities or other academic institutions and almost all (96%) were focused on research as opposed to clinical service delivery or program implementation. Investigators had been involved in a mean of 2.2 implementation studies related to mental health. Table 4 describes the usage of implementation measures reported by at least two investigators in LMIC settings. The most commonly used implementation measures included the Consolidated Framework for Implementation Research Inner Setting measures (n = 7) (Fernandez et al., 2018), the Program Sustainability Assessment Tool (n = 5) (Luke et al., 2014) and the Acceptability of Intervention Measure, Intervention Appropriateness Measure and Feasibility of Intervention Measure (n = 5) (Weiner et al., 2017). Measures were most commonly used prior to intervention implementation (n = 18) or mid-implementation (n = 18) as opposed to post-implementation (n = 7) and were most often used to assess contextual determinants of implementation effectiveness (n = 20) rather than to assess implementation outcomes (n = 9). Providers were the most common group sampled (n = 25), followed by clients (n = 9). Measures were used in a diverse range of contexts across Latin America, Sub-Saharan Africa, Eastern Europe and South/Southeast Asia. Adaptation approaches were generally limited to translation and back-translation (n = 23) and stakeholder feedback (n = 16), and only one investigator reported conducting any measure validation prior to use (pilot testing). Limited response variability, positive response bias, measure length and item relevance were the most common challenges reported.
Other measures reported as used by individual investigators included the Implementation Leadership Scale (Aarons et al., 2014), the Theory of Planned Behavior measures (Ajzen, 2011), the Feelings Thermometer (Alwin, 1997), the System Usability Scale (Lewis, 2018), the Organizational Social Context scale (Glisson et al., 2008), several intervention-specific fidelity scales and several measures developed specifically for individual studies.

Discussion
This study sought to improve quantitative implementation measurement in the field of global mental health by generating consensus recommendations on best practices for measure choice and validation and by surveying the field to understand current practice. Our expert panel concluded that pragmatic concerns are key to choosing between measures and validation approaches. They noted that many quantitative implementation measures are lengthy and identified a trade-off between resources and rigor in the various approaches available for adapting and validating implementation measures in diverse global settings. However, they concluded that in many cases it is sufficient for investigators to establish the face validity of an implementation measure in a new setting through some combination of reviewing the use of that measure in a similar setting, convening an informal expert and stakeholder panel, conducting translation and back-translation and piloting the measure to confirm its dimensionality and internal reliability. Though confirming the predictive validity of a measure by correlating it with subsequent real-world outcomes would be the gold standard for measure validation, panel members felt this was unnecessary prior to using most implementation measures. Survey results suggested that though several implementation measures have been used or are in use in global mental health studies across a variety of levels and study phases, almost none have been formally validated as part of those studies.
Quantitative measures must be reliable, valid and practical to be useful for implementation research or practice, yet comprehensive reviews of published implementation measures have noted that the field faces several major issues. These include the poor distribution of quantitative measures across implementation constructs and analytic levels; a lack of measures with strong psychometric qualities; measure synonymy (different measure items are used to measure the same construct), homonymy (the same measure items are sometimes used to measure different constructs) and instability (measure items are often changed with each use) and the reality that many implementation measures exhibit poor pragmatic qualities (Lewis et al., 2018). Nevertheless, a growing number of strong implementation measures do exist; the challenge for investigators in diverse global settings is choosing and adapting these measures, or developing new ones, and ensuring that they perform well. Notably, the Psychometric and Pragmatic Evidence Rating Scale has been developed through stakeholder consensus to provide clear criteria for measure quality, both to inform measure development and measure choice (Stanick et al., 2019). In addition, domain-specific resources are increasingly available to support investigators in choosing between manifest and latent indicators of implementation processes and outcomes, including the HIV Implementation Outcomes Crosswalk (Li et al., 2020).
Several key limitations should be noted. Our expert panel consisted of only seven members, reflecting the relatively small number of individuals with intersecting expertise in global mental health, implementation science and psychometrics. In response, we opted for breadth over depth and sought to reach panel consensus across a wide range of issues related to measure use and validation, rather than for one or two key questions. Our Delphi panel size is considered acceptable for non-statistical analysis (Rowe and Wright, 1999). All panel procedures were carried out during the first 6 months of the COVID-19 pandemic, meaning procedures were remote and sometimes asynchronous. For our survey, we sampled investigators from NIH RePORTER and Grand Challenges Canada; these are two of the most prolific funders of global mental health implementation research, though this approach likely biased our sample toward investigators based in North America. To mitigate this risk, we used snowball sampling to attempt to identify and recruit other investigators who would otherwise have been missed. Our overall response rate was low, which again may reflect the small number of individuals actively using quantitative measures in their global mental health implementation studies; many investigators we contacted declined to participate because they were not using quantitative implementation measures.
Despite these limitations, our findings may directly support the growing field of global mental health implementation research. We have used our results to compile a set of guidance documents for investigators planning to quantitatively measure latent implementation processes and outcomes in diverse global settings. These include a compendium of available measures across implementation constructs and detailed descriptions of common adaptation and validation approaches. This guidance should facilitate rigorous and replicable implementation research in an area of high need, though it is not intended to be prescriptive, and local investigators are encouraged to adapt and apply the guidance only where it is useful. Moving forward, as the quantity and quality of implementation measures designed for use in diverse global contexts increase (Aldridge et al., 2022), the standards for measure adaptation and validation may also shift. Less emphasis may be placed on establishing measure validity for the sake of scientific rigor, with a corresponding increased emphasis on measures' pragmatic qualities and capacity to inform real-world health service delivery.
Open peer review. To view the open peer review materials for this article, please visit http://doi.org/10.1017/gmh.2023.63.

Supplementary material. The supplementary material for this article can be found at https://doi.org/10.1017/gmh.2023.63.

Table 4. Implementation measure usage and adaptation/validation approaches

Note: Measures reported as used by only one investigator, or used only in a high-income country setting, are not included in Table 4. Responses related to the Acceptability of Intervention, Intervention Appropriateness and Feasibility of Intervention Measures were collapsed across the scales as there was complete overlap within respondents for these measures. Responses related to the Applied Mental Health Research implementation measures, which include client-, provider-, organizational- and policy-level scales for several implementation outcomes and contextual determinants, were collapsed for the same reason. AIM, Acceptability of Intervention Measure; AMHR/mhIST, Applied Mental Health Research/Mental Health Implementation Science Tool; CFIR, Consolidated Framework for Implementation Research; EBPAS, Evidence-Based Practice Attitude Scale; FIM, Feasibility of Intervention Measure; IAM, Intervention Appropriateness Measure; ORIC, Organizational Readiness for Implementing Change; PSAT, Program Sustainability Assessment Tool.

Table 1. Implementation measure characteristics mapped to measure assessment approaches

Table 2. Delphi panel pragmatic qualities importance ratings