
Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting

Published online by Cambridge University Press:  08 July 2025

Lily Cooke*
Affiliation:
Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK; Rehabilitation and Human Performance, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Tarisai Bere
Affiliation:
College of Health Sciences, University of Zimbabwe, Harare, Zimbabwe
Amelia Stanton
Affiliation:
Department of Psychological and Brain Sciences, Boston University, Boston, Massachusetts, USA
Walter Mangezi
Affiliation:
College of Health Sciences, University of Zimbabwe, Harare, Zimbabwe
Steven A. Safren
Affiliation:
Department of Psychology, University of Miami, Miami, Florida, USA
Tsitsi Mawere
Affiliation:
College of Health Sciences, University of Zimbabwe, Harare, Zimbabwe
Lena Skovgaard Andersen
Affiliation:
Department of Public Health, University of Copenhagen, Copenhagen, Denmark
Christina Psaros
Affiliation:
Department of Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
Samantha M. McKetchnie
Affiliation:
Department of Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA; Harvard University Center for AIDS Research, Harvard University, Boston, Massachusetts, USA
Meghana Vagwala
Affiliation:
Department of Psychiatry, Harvard University, Boston, Massachusetts, USA
Kia-Chong Chua
Affiliation:
Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
Conall O’Cleirigh
Affiliation:
Department of Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA; Department of Psychiatry, Harvard University, Boston, Massachusetts, USA; Harvard University Center for AIDS Research, Harvard University, Boston, Massachusetts, USA
Aya Mitani
Affiliation:
Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
Melanie Abas
Affiliation:
Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
* Corresponding author: Lily Cooke; Email: k2481125@kcl.ac.uk

Abstract

Problem-solving therapy (PST) is a brief psychological intervention often implemented for depression. Currently, there are no tools with well-evidenced reliability for measuring PST fidelity. This pilot study aimed to measure the inter-rater reliability and agreement of the Problem-Solving Therapy Fidelity (PROOF) scale, which comprises a binary 14-item adherence subscale and an 8-item competence subscale. Transcripts were drawn from the TENDAI trial, a Zimbabwe-based PST intervention for depression and medication adherence. Seven transcripts were each rated by seven specialists, and two transcripts were each rated by two non-specialists. Inter-rater agreement was assessed using percent agreement, and inter-rater reliability was assessed using Gwet’s AC1. The PROOF subscales demonstrated promising inter-rater agreement among specialists (adherence = 90.4%, competence = 82.5%) and non-specialists (adherence = 92.9%, competence = 68.8%). Inter-rater reliability analyses yielded Gwet’s AC1 values of 0.411–0.778 for adherence and 0.619–0.959 for competence among specialists, and 0.529–1.00 for adherence among non-specialists. The PROOF scale has the potential to fill the current gap in fidelity tools for PST delivery.
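The two statistics reported in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not code from the paper: it computes percent agreement and Gwet’s AC1 for two raters scoring the same binary (0/1) fidelity items. For binary data, Gwet’s chance-agreement term is 2π(1−π), where π is the mean prevalence of the “1” category across both raters.

```python
# Hypothetical sketch (not from the paper): percent agreement and Gwet's AC1
# for two raters scoring the same binary (0/1) fidelity items.

def agreement_stats(ratings_a, ratings_b):
    """Return (percent_agreement, gwets_ac1) for two equal-length lists of 0/1 ratings."""
    n = len(ratings_a)
    # Observed agreement: share of items both raters scored identically.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Mean prevalence of the '1' category across both raters.
    pi = (sum(ratings_a) + sum(ratings_b)) / (2 * n)
    # Chance agreement for binary data under Gwet's model.
    pe = 2 * pi * (1 - pi)
    ac1 = (pa - pe) / (1 - pe)
    return pa * 100, ac1

# Example: raters agree on 4 of 5 items, so percent agreement is 80.0
# and AC1 is about 0.60.
pa, ac1 = agreement_stats([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
```

Unlike Cohen’s kappa, AC1 remains well behaved when one category dominates (high or low prevalence), which is a common reason for preferring it in fidelity studies where most items are marked “done.”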

Information

Type
Rapid Communication
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. TENDAI and enhanced-usual-care intervention design.


Table 1. Summary of six TENDAI sessions and booster session


Table 2. Inter-rater reliability and agreement of adherence and competence specialist ratings

Supplementary material: File

Cooke et al. supplementary material (File, 25.2 KB)

Author comment: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR1

Comments

No accompanying comment.

Review: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

Thank you for the invitation to review this interesting manuscript describing the development and preliminary reliability of a new tool to assess fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting. The study targets an important topic, addressing both task-sharing and ensuring the high quality of services delivered. The methodology appears sound, and the findings are encouraging, adding to the impact in the field of global mental health. See a few suggestions for consideration:

1. While the authors have provided background on the source of the transcripts used for the current study, some more information in the current paper on the components of the intervention would be useful for understanding the scale described later.

2. Additionally, could a link to, or a table of, the components of the TENDAI interventionist manual referenced on page 6 be provided?

3. While the raters all have their individual qualifications, if the authors added a line about the extent of their involvement in the TENDAI trial or their experience of working on PST interventions, it would make for a much clearer read (page 6).

4. On page 7, a mock trial is referred to, the aim of which was to familiarize raters with the rating process. It is not clear who conducted the mock trial or how feedback from the trials was collected.

5. While the authors acknowledge the work done by the ENACT team in designing a competence measure for non-specialists, they have also referred to their PROOF scale as filling a long-sought-after gap in PST fidelity measures. However, the authors would benefit from referencing the work done by the EQUIP team in developing a measure to assess competencies in PST interventions through the medium of role plays (https://journals.lww.com/invn/fulltext/2021/19010/development_of_a_tool_to_assess_competencies_of.14.aspx, https://www.cambridge.org/core/journals/bjpsych-open/article/perspectives-on-competencybased-feedback-for-training-nonspecialists-to-deliver-psychological-interventions-multisite-qualitative-study-of-the-equip-competencybased-approach/EE4201B24BE4ED7FE50E66725090E409, https://www.thelancet.com/journals/lanpsy/article/PIIS2215-0366(24)00183-4/abstract).

Review: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

This is a useful paper demonstrating the utility and reliability of a measure of fidelity of the delivery of Problem Solving Therapy (PST). It establishes the psychometric properties of the PROOF measure for a particular implementation protocol and context, and demonstrates an appropriate methodology for documenting such properties in PST-related treatment elsewhere.

The Introduction establishes the case for the importance of measuring the fidelity of implementation of scalable psychological interventions and the Method then appropriately documents the process by which this was tackled in this instance. I think some information on the process of producing and analysing transcripts of sessions needs to be presented (given the loss of information that would result from this process compared to, for instance, use of video recordings). In the data analysis sub-section, I think readers would benefit from a clearer elaboration of the distinction and purpose of the respective indicators inter-rater agreement and inter-rater reliability (and the specification in that section of ‘by-item analysis’ - was not all analysis by item?).

The Results are presented concisely and in line with the documented methodology. Given that the content/focus of items is of obvious interest to the reader, I think some information on item focus is relevant to be included in Table 1 (rather than being consigned to the Supplementary Material). Reference is made to broad focus of items in the Discussion, so I think a column indicating in a single word the focus of the numbered item would be both appropriate and feasible.

The Discussion is rather limited in scope and links back to only one published paper. I think a broader elaboration of issues in the quality assurance of scalable psychological interventions is warranted here. In particular, I see a major constraint of the current study to be its focus on a single session of a manualised PST protocol in a single context. It is not unreasonable to expect higher levels of reliability in judgements of interventionist behaviour when the scope of the assessment is so narrow. Is the presumption that researchers (and service quality assurers) should develop bespoke fidelity assessments for each session of a manualised intervention in each intervention context? Or is the goal a more generalisable, replicable measure that addresses core PST competences as demonstrated across multiple intervention sessions (and, potentially, across multiple settings)? I consider that this is an important question that the ‘forward plan’ for research touches on, but does not really address in a manner that stands to shape debate and practice.

The manuscript would benefit also from correction of a few grammatical lapses/expression issues including:

Impact Statement: l. 8: ‘their [sic] inter-rater reliability to date’ - tools don’t measure their own reliability; l. 11: ‘This study has provided’ - better to say ‘provides’?; l. 24: ‘long sought-after gap in PST fidelity measurements’ - I don’t think gaps are long sought after!

Abstract: l. 30: ‘brief psychological intervention often for depression’ - missing word; l. 34: ‘This pilot study aims to’ - better ‘aimed to’?; l. 49: better to avoid jargon in this concluding sentence about the ‘tool gap’: ‘The PROOF scale has the potential to fill the fidelity tool gap within PST delivery.’

Intro. l. 5-9: I think these opening sentences need unpacking slightly, with one or two relevant citations given; l. 51-56: ‘Adherence and Competence Scale (PST-PAC), have not reported reliability or focus exclusively on specific PST interventions or specific settings, such as primary care (Hegel et al. 2004)’ - this sentence is confusing given it references many diverse elements; ‘The demand for this PST fidelity gap to be met is undeniable, as the WHO has called for the expansion of brief psychological interventions (World Health Organization 2016)’ - sadly many facts are now denied, so I am not sure the phrase ‘is undeniable’ is the best choice of words.

Results. l. 48: ‘Inter-Rater Reliability and Agreement for Specialist Ratings’ - isn’t it ‘of Specialist Ratings’, or perhaps even ‘of ratings by specialists’? [and similarly for the non-specialist sub-head]; l. 51: ‘%. By-item inter-rater agreement analysis showed a 56.5-100% agreement range with 6/14 items representing complete agreement (100%) and 10/14 items having >90% agreement’ - I think this can be more clearly worded by simply noting something of the form ‘...of the 14 items, 6 showed complete agreement (100%) and 10 greater than 90% agreement’.

Discussion. l. 51: ‘the slightly weaker agreement’ - I don’t think a scientific paper benefits from the addition of ‘slightly’ here.

Conclusion. l. 44: ‘demonstrating sufficient performance in their current states’ - I found ‘states’ confusing - ‘current form’?

Review: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR4

Conflict of interest statement

Reviewer declares none.

Comments

Abstract: The statement about specialists’ and non-specialists’ ratings needs clarification - what does ‘same seven transcripts’ mean in relation to ‘same two transcripts’?

Page 6: Lines 3-4: Description of the process is unclear - Is the first session of the trial to introduce PST and psychoeducation of depression?

Page 6: Lines 5-8: Suggest rewriting the sentence to avoid split infinitives - ‘A fully crossed design...was chosen.. scale..., whereby all transcripts were assessed by all raters.’

Page 6: Lines 9-11 Unclear what is meant by iterative trials...? This description does not appear to be about trials in a formal sense and likely refers to “sessions”. Who delivered these sessions in the pre-pilot study? Were these individuals trained in PST?

Page 6: Lines 14-15: The expression suggests that some of the seven specialists were involved in the pre-pilot work - this should be clarified, as agreement about items is then not completely independent! This is not an issue as Gwet’s AC1 was used in place of Cohen’s Kappa.

Page 6: Lines 22-23: Is there a rationale that explains why non-specialists rated only two of the seven transcripts? How does this impact the claim that the tool is for use by non-specialists?

General Comment:

The MS provides an interesting account of the inter-rater reliability of the PROOF tool. The title is misleading and should perhaps indicate that it is focused on inter-rater reliability.

The MS would also benefit from better describing the context of specialists and non-specialists in which the tool was tested, including the parent study cohort.

PROOF Tool:

While binary scoring assists in reducing complex scoring, it may also mask the nuances contained in the Competence Subscale. All of these items are scored as 0, 1, or 2. It is conceivable that a person could straddle two or more of these domains! Could the authors comment on whether the binary approach inflates the level of agreement achieved by the simple fact that a 0-2 range has been reduced to Yes/No?

Review: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR5

Conflict of interest statement

Reviewer declares none.

Comments

Thank you for the opportunity to re-review this manuscript, which is well written. I have only one observation on the supplementary material: for the competence scale, the authors decided to use a binary approach: “Competence items were initially scored as ‘0’ (not demonstrated), ‘1’ (partially demonstrated), and ‘2’ (demonstrated well). During the pre-pilot phase, the competence items were revised to a binary rating scheme of ‘0’ (not demonstrated) and ‘1’ (demonstrated) to increase feasibility of use in the local context”. But this should be reflected in Table 2 in the left column (0 and 1), to align with the options in the right-hand column (done and not done).

Recommendation: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR6

Comments

Dear Professor Abas

Regarding your paper “Development and preliminary reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting”, we have received feedback from the peer review process, and there was consensus that the manuscript requires minor revisions. Kindly address all revisions; we look forward to receiving a revised paper.

Decision: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R0/PR7

Comments

No accompanying comment.

Author comment: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R1/PR8

Comments

No accompanying comment.

Review: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R1/PR9

Conflict of interest statement

Reviewer declares none.

Comments

I have reviewed the changes made in the manuscript in response to the reviewers’ comments and am pleased to recommend acceptance of the revision.

Review: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R1/PR10

Conflict of interest statement

Reviewer declares none.

Comments

No further comments

Recommendation: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R1/PR11

Comments

No accompanying comment.

Decision: Development and preliminary inter-rater reliability of the new PROOF tool to measure fidelity of problem-solving therapy for depression delivered by non-specialists in a low-resource African setting — R1/PR12

Comments

No accompanying comment.