Introduction
Depression and anxiety are prevalent mental disorders frequently encountered in both the general population and primary care settings (Hirschfeld, 2001; Sartorius et al., 1996). Although evidence-based psychological therapies have been developed to treat a wide range of mental disorders, access to such treatments remains limited in many countries, including Japan. In response to this treatment gap, the United Kingdom’s Talking Therapies for Anxiety and Depression (TTAD) model – a large-scale framework launched in 2008 – has been implemented to deliver psychological therapies, particularly to improve access to cost-effective cognitive behavioural therapy (CBT) for treating depression and anxiety (Clark, 2011). The effectiveness of the TTAD programme at providing rapid and accessible psychological support for common mental health problems has been demonstrated using patient-reported outcome measures (Clark, 2011; Clark et al., 2009; NHS England, 2024a).
In Japan, mental healthcare is covered by public health insurance (national health insurance and social insurance). However, psychotherapy, including CBT, is generally not covered and is typically provided on a self-funded basis, regardless of whether it is delivered by certified psychologists. Therefore, the system for delivering psychological therapies remains limited.
To bridge this gap, a research group at the authors’ university has established a two-year training course for CBT practitioners in Japan, informed by the TTAD model. The course consists of structured CBT lectures followed by ongoing case-based supervision, which aims to support CBT delivery in routine clinical practice. To date, 146 practitioners, including psychologists, doctors, nurses, and mental health and social welfare professionals, have completed this training programme.
To support the ongoing delivery of evidence-based practice, an online database was created in 2017 to systematically collect psychological data on patients treated by graduates of this training course. The collected data include patient-reported outcomes for depression and anxiety, measured using the Patient Health Questionnaire-9 (PHQ-9) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire, respectively. By September 2024, 31 practitioners had entered data on 290 patients. In this study, we analysed these routinely collected clinical data as an interim evaluation to describe the current status of CBT outcomes in real-world clinical practice in Japan.
Method
Data collection
This study collected clinical data on the depression and anxiety symptom scale scores of patients receiving high-intensity, one-on-one, face-to-face CBT, irrespective of their specific diagnoses. Treatment was delivered by practitioners in a disorder-specific manner, tailored to each patient’s primary clinical presentation and individual treatment needs. Therefore, the range of target disorders, presenting problems, and treatment foci varied by patient. Disorder-specific outcome measures were selected and collected by therapists based on each patient’s primary presenting problem and CBT was delivered accordingly. In the Japanese clinical context, disorder-specific CBT is practised based on Japanese-language understanding through case formulation and tailored behavioural experiments, with cultural responsiveness embedded naturally in routine clinical practice rather than through explicit protocol-level modification.
Patient data were entered into the online database by the participating CBT practitioners (see the ‘Practitioners’ subsection below) and analysed once a sufficient number of cases had been accumulated. The database was designed to capture only general measures of depressive and anxiety symptoms, and data were obtained from the PHQ-9 and GAD-7 during routine clinical sessions rather than through separate assessments. To facilitate data entry and sustained clinician engagement, the range of variables recorded in the database was intentionally kept to a minimum.
In clinical research, outcome indicators are typically derived from questionnaire measures that assess changes in symptoms most closely related to the target disorder. For example, trials of depression treatment primarily employ depression-specific outcome measures, whereas trials targeting post-traumatic stress disorder (PTSD) predominantly use PTSD-specific measures. However, depressive and anxiety symptoms represent transdiagnostic features that commonly co-occur across a wide range of psychiatric disorders. Although these symptoms may occur as primary disorders, they frequently emerge as secondary features within the context of other target conditions. This fact made it essential to assess depressive and anxiety symptom severity not only in treatments explicitly targeting depressive disorder or generalised anxiety disorder, but also in cases in which disorder-specific CBT was delivered based on the patient’s primary clinical presentation (Clark et al., 2018).
Variables
The variables entered into the database were age, sex, and the depression and anxiety scale scores from the PHQ-9 and GAD-7, respectively. In both these scales, lower scores indicate lower severity of depression and anxiety symptoms.
Measurement tools
Japanese version of the PHQ-9
The PHQ-9, developed by Kroenke et al. (2001), has been the most widely used depression measure since 2001 because of its brevity and the inclusion of nine criteria for major depressive disorder. The PHQ-9 scores range from 0 to 27. A cut-off of ≥10 is recommended to distinguish clinical from non-clinical populations.
The original version of the PHQ-9 has been translated into several languages and validated in numerous studies worldwide (Kroenke et al., 2001). The UK National Institute for Health and Care Excellence depression guidelines (National Institute for Health and Care Excellence, 2022) and the DSM-5 for depressive disorders (American Psychiatric Association, 2013) recommend using the PHQ-9 to measure depression severity. Several studies in Japan have aimed to derive a Japanese version of the PHQ-9 to measure major depressive disorder (Muramatsu et al., 2009; Muramatsu et al., 2021). The sensitivity and specificity of the Japanese version of the PHQ-9 are reported to be 80.5% and 89.5%, respectively (Muramatsu et al., 2018).
Japanese version of the GAD-7
In 2006, Spitzer et al. developed the GAD-7, a seven-item self-rated questionnaire for identifying generalised anxiety disorder (Spitzer et al., 2006). The GAD-7 scores range from 0 to 21. The sensitivity and specificity are 89.0% and 82.0%, respectively, with an optimal cut-off value of 10 (Spitzer et al., 2006). The Japanese version of the GAD-7 used in this study was translated by Muramatsu and verified by Kroenke and Spitzer using the back-translation method (Muramatsu et al., 2025).
In clinical practice, practitioners assess patients using disorder-specific measures in addition to the PHQ-9 and GAD-7. However, the practitioners who participated in this study were instructed to enter only the PHQ-9 and GAD-7 scores into the database, as these measures were administered consistently across all cases. This approach was important to reduce the complexity of data entry and encourage therapist compliance with database submission. The scores from the disorder-specific measures were managed separately by each therapist as part of routine clinical care.
Sample and sample size
The target sample size was approximately 300 people aged 13 years or older (i.e. junior high school students and above). This target was based on the annual number of patients presenting with depressive and anxiety symptoms at the CBT centre of the authors’ university hospital (approximately 100). Given the two-year study period, it was estimated that data could be obtained from approximately 200 patients at the authors’ university, with an additional 50 patients per year (approximately 100 over two years) recruited from other institutions. The target was therefore set according to the number of patients expected to attend over the study period rather than an a priori statistical power calculation. The sample was not randomly drawn from the general population; it consisted of individuals who lived within commuting distance of the university hospital, voluntarily sought CBT, and were able to afford the cost of treatment, as therapy was provided on a self-pay basis.
Practitioners
A total of 31 practitioners participated in the study between May 2017 and May 2024. To ensure the quality of treatment, all the practitioners were nationally licensed professionals. They were 17 clinical psychologists, six doctors, five nurses, two pharmacists, and one mental health social worker. In Japan, only psychotherapy provided by medical doctors, primarily psychiatrists, is covered by public health insurance. To avoid a situation in which patients may preferentially request medical doctors as therapists, all treatment at our institution is provided on a self-funded basis.
Although the ages and clinical experience of the practitioners varied, all had completed the two-year CBT training course and were continuing to learn about CBT at university. While working in their primary workplaces, most were additionally employed to provide CBT at our university hospital one to three days per week. As our institution is an educational setting that emphasises collaborative learning and professional development, all practitioners were paid the same hourly rate.
In our CBT training course, lectures on CBT were offered in the first half of each academic year, followed by weekly case study supervisions in the second half (15 times in total). Attendance was mandatory for current trainees, while graduates participated voluntarily. When beginners were responsible for sessions, they were accompanied by a supervisor. During case study supervisions, beginners were given priority in discussing cases, and the participants split into small groups to discuss the case or receive advice from experienced participants.
As noted above, when a new patient was assigned, the practitioners entered the patient’s age, sex, and depression and anxiety scale scores into the online database. The PHQ-9 and GAD-7 scores were collected as treatment progressed and added into the database. We provided each practitioner with a URL for data entry and requested that they register an account. The timing of questionnaire administration was left to the discretion of the practitioners, who were advised to follow standard procedures.
Outcome definitions based on the TTAD
Caseness, recovery, and reliable improvement were defined in accordance with the TTAD outcome framework. The participants were considered to be ‘in caseness’ at baseline if they had a PHQ-9 score of ≥10 or a GAD-7 score of ≥8. Consistent with the TTAD definitions, caseness did not require the participants to exceed the threshold on both measures simultaneously. ‘Recovery’ was defined among the participants who met the caseness criteria at baseline. A participant was classified as recovered if, at the end of treatment, their symptom scores fell below the caseness thresholds on both measures: PHQ-9 <10 and GAD-7 <8. Thus, recovery reflects the transition from caseness at baseline to non-caseness at the end of treatment across both the depression and anxiety symptom measures. ‘Reliable improvement’ was defined at the participant level, based on the reliable change criteria for each measure, and integrated across the PHQ-9 and GAD-7 in line with TTAD guidance. A participant was classified as having achieved reliable improvement if there was a reduction of at least 6 points on the PHQ-9 and/or at least 4 points on the GAD-7 and there was no reliable deterioration on the other measure (defined as an increase of ≥6 points on the PHQ-9 or ≥4 points on the GAD-7). Reliable improvement did not require the participants to meet the caseness criteria at baseline.
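The definitions above can be expressed as a short classification sketch. The Python below is illustrative only and is not the code used in the study; all function and constant names are assumptions introduced here, and the thresholds are those stated in the text.

```python
# Illustrative sketch of the TTAD-style outcome definitions described above.
# All names are hypothetical; the thresholds are taken from the text.

PHQ9_CASENESS = 10  # PHQ-9 score >= 10 counts as caseness
GAD7_CASENESS = 8   # GAD-7 score >= 8 counts as caseness
PHQ9_RELIABLE = 6   # reliable change threshold on the PHQ-9
GAD7_RELIABLE = 4   # reliable change threshold on the GAD-7


def is_case(phq9, gad7):
    """Caseness: at or above the threshold on either measure."""
    return phq9 >= PHQ9_CASENESS or gad7 >= GAD7_CASENESS


def recovered(phq_pre, gad_pre, phq_post, gad_post):
    """Recovery: a case at baseline, below both thresholds at end of treatment."""
    return is_case(phq_pre, gad_pre) and not is_case(phq_post, gad_post)


def reliably_deteriorated(phq_pre, gad_pre, phq_post, gad_post):
    """Reliable deterioration: a reliable increase on either measure."""
    return (phq_post - phq_pre >= PHQ9_RELIABLE
            or gad_post - gad_pre >= GAD7_RELIABLE)


def reliably_improved(phq_pre, gad_pre, phq_post, gad_post):
    """Reliable decrease on at least one measure, no reliable increase on the other."""
    improved = (phq_pre - phq_post >= PHQ9_RELIABLE
                or gad_pre - gad_post >= GAD7_RELIABLE)
    return improved and not reliably_deteriorated(
        phq_pre, gad_pre, phq_post, gad_post)
```

Under these assumptions, a participant moving from PHQ-9 15 and GAD-7 12 at baseline to PHQ-9 8 and GAD-7 6 at the end of treatment would be classified as both recovered and reliably improved, whereas a 6-point PHQ-9 reduction accompanied by a 4-point GAD-7 increase would not count as reliable improvement.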
Definitions and eligibility of recovery-related outcomes
Because reliable improvement is defined in terms of exceeding established reliable change thresholds, it is not possible for those participants with low baseline scores to achieve reliable improvement on a given measure. Specifically, reliable improvement on the PHQ-9 is not possible when the baseline PHQ-9 scores are <6 and reliable improvement on the GAD-7 is not possible when the baseline GAD-7 scores are <4. Accordingly, the participants were included in the reliable improvement analyses only if reliable improvement was possible on at least one measure at baseline (i.e. PHQ-9 ≥6 and/or GAD-7 ≥4). The participants with PHQ-9 <6 and GAD-7 <4 at baseline were excluded from the reliable improvement analyses, as reliable improvement was not definitionally possible on either measure.
‘Reliable recovery’ was defined as the achievement of both recovery and reliable improvement. Specifically, a participant was classified as having achieved reliable recovery if they met the criteria for caseness at baseline, met the criteria for recovery at the end of treatment, and demonstrated reliable improvement, as defined above. Thus, reliable recovery represents a more stringent outcome reflecting both clinically meaningful symptom reduction and a transition from caseness to non-caseness, consistent with the TTAD outcome definitions.
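The eligibility rule and the composite outcome can be sketched in a few lines. This Python is a hedged illustration rather than the study's analysis code; the function names are hypothetical, and the numeric thresholds are those given in the text.

```python
# Hypothetical helper names; thresholds are those stated in the text.

def eligible_for_reliable_improvement(phq_pre, gad_pre):
    """A drop of 6 (PHQ-9) or 4 (GAD-7) must be arithmetically possible."""
    return phq_pre >= 6 or gad_pre >= 4


def reliable_recovery(phq_pre, gad_pre, phq_post, gad_post):
    """Caseness at baseline, non-caseness at end, plus reliable improvement."""
    case_pre = phq_pre >= 10 or gad_pre >= 8
    case_post = phq_post >= 10 or gad_post >= 8
    improved = (phq_pre - phq_post >= 6) or (gad_pre - gad_post >= 4)
    deteriorated = (phq_post - phq_pre >= 6) or (gad_post - gad_pre >= 4)
    return case_pre and not case_post and improved and not deteriorated
```

For example, a participant whose PHQ-9 score moves from 10 to 9 while the GAD-7 score stays at 7 crosses the caseness threshold and therefore meets the recovery criterion, but the 1-point change is too small to count as reliable improvement, so the sketch does not classify this case as reliable recovery.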
Statistical analysis
Analyses were descriptive in nature. Data from patients who provided at least two PHQ-9 or GAD-7 scores, one at baseline and one at a subsequent assessment, were included in the analyses. For patients who discontinued treatment before completion, the last available score was used as the end-of-treatment assessment.
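The inclusion rule and the use of the last available score can be illustrated as follows. This is a minimal sketch, assuming that each patient's scores on a given measure are held as a list in assessment order; the function name and data layout are assumptions, not a description of the study database.

```python
def baseline_and_final(scores):
    """Return (baseline, last available score), or None if fewer than two scores.

    `scores` is assumed to hold one measure's recorded scores in assessment order.
    """
    if len(scores) < 2:
        return None  # a single score gives no change to analyse
    return scores[0], scores[-1]
```

With this rule, a patient with PHQ-9 scores of 14, 11 and 7 contributes the pair (14, 7), and a patient who stopped treatment after the second assessment contributes the first and second scores.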
Baseline clinical characteristics were summarised using means and standard deviations for continuous variables and frequencies and percentages for categorical variables. Recovery-related outcomes were classified according to established TTAD definitions. The proportions of patients meeting the criteria for recovery, reliable improvement, and reliable recovery were calculated using outcome-specific denominators, consistent with TTAD reporting conventions. All the analyses were conducted using SPSS Statistics version 29.0 (IBM, Armonk, NY, USA).
Results
In total, 290 patients were enrolled in this study. Of these, 49 were excluded because although they attended multiple sessions, they did not complete any additional measurements, resulting in only one score being recorded and thus making analysis impossible. The data on the remaining 241 patients were analysed.
Table 1 summarises the baseline clinical characteristics of the study sample (N=241). The mean age of the participants was 34.6 years (SD=13.4). Of the participants, 99 (41.1%) were male and 142 (58.9%) were female. The mean baseline scores were 11.5 (SD=6.4) on the PHQ-9 and 10.0 (SD=5.2) on the GAD-7. Based on established criteria, 178 participants (73.9%) met the criteria for caseness at baseline, defined as a PHQ-9 score of ≥10 or a GAD-7 score of ≥8.
Table 1. Baseline clinical characteristics

Caseness: PHQ-9 ≥10 or GAD-7 ≥8 at baseline.
With respect to age, the largest proportions of the participants were aged 20–29 years (25.6%), 30–39 years (24.9%), and 40–49 years (22.0%), followed by those aged 13–19 years (13.3%), 50–59 years (9.1%), and 60 years or older (4.1%). At baseline, the PHQ-9 scores were distributed as follows: 15.4% scored 0–4, 26.1% scored 5–9, 26.1% scored 10–14, and 32.4% scored 15–27. The GAD-7 scores were distributed as follows: 17.8% scored 0–4, 24.5% scored 5–9, 37.3% scored 10–14, and 20.3% scored 15–21.
Next, we examined the rates of recovery, reliable improvement, reliable deterioration, and reliable recovery (Fig. 1) in accordance with the TTAD outcome definitions. Of the 241 participants, 178 met the criteria for caseness at baseline. Among these participants, 75 met the criteria for recovery. Recovery was defined as a transition from caseness at baseline (PHQ-9 ≥10 or GAD-7 ≥8) to non-caseness at the end of treatment (PHQ-9 <10 and GAD-7 <8) on both measures. Reliable improvement was assessed among those participants for whom reliable change was theoretically possible at baseline (PHQ-9 ≥6 and/or GAD-7 ≥4; n=218). Of these, 103 participants demonstrated reliable improvement, defined as a reduction of ≥6 points on the PHQ-9 or ≥4 points on the GAD-7, in the absence of reliable deterioration on the other measure. Reliable deterioration was defined as an increase of ≥6 points on the PHQ-9 or ≥4 points on the GAD-7, and was observed in 20 participants out of the total sample (n=241). Finally, 65 participants achieved reliable recovery, defined as meeting the criteria for both recovery and reliable improvement, reflecting a combination of clinically meaningful symptom change and a transition from caseness to non-caseness.
Figure 1. Recovery-related outcomes according to TTAD definitions.

Table 2 summarises the recovery-related outcomes. Recovery was observed in 75 of the 178 participants with baseline caseness, reliable improvement was observed in 103 of the 218 eligible participants, and reliable recovery was observed in 65 of the 178 participants with baseline caseness.
Table 2. Recovery-related outcomes

Caseness at baseline was defined as PHQ-9 ≥10 or GAD-7 ≥8.
Percentages were calculated using outcome-specific denominators.
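The outcome-specific denominators can be made explicit with a short calculation. The Python below simply recomputes percentages from the counts reported in the text; the variable and function names are assumptions introduced for illustration.

```python
def percentage(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# (numerator, denominator) pairs as reported in the Results
outcomes = {
    "recovery": (75, 178),               # among participants with baseline caseness
    "reliable_improvement": (103, 218),  # among participants eligible for reliable change
    "reliable_recovery": (65, 178),      # among participants with baseline caseness
    "reliable_deterioration": (20, 241), # among the total analysed sample
}

rates = {name: percentage(n, d) for name, (n, d) in outcomes.items()}
```

Each rate is therefore read against its own denominator: 178 for the caseness-based outcomes, 218 for reliable improvement, and 241 for reliable deterioration.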
Discussion
In Japan, our research group has applied the TTAD model within a university-based CBT training programme initiated in 2010, through which practitioners have been trained and clinical outcome data have been systematically accumulated. The present study draws on these routinely collected data to examine recovery-related outcomes using established TTAD definitions. This study also recognises that depressive and anxiety symptoms are common across a range of psychiatric conditions. Accordingly, changes in depressive and anxiety symptoms were evaluated in patients who received disorder-specific CBT tailored to their primary clinical presentation. Symptom change was assessed using the PHQ-9 for depression and the GAD-7 for anxiety, consistent with TTAD outcome reporting practices.
At baseline, approximately three-quarters of the participants met the criteria for caseness, indicating that the sample included a substantial proportion of individuals with clinically meaningful symptom severity. A proportion of the participants improved over the course of treatment. Specifically, 42.1% of the participants who met the caseness criteria at baseline transitioned to non-caseness at the end of treatment, while nearly half (47.2%) of the participants for whom reliable change was possible at baseline met the criteria for reliable improvement. In addition, 36.5% of the participants with baseline caseness met the criteria for reliable recovery. The discrepancy between the rates of reliable improvement and recovery has significant clinical implications (Clark et al., 2018; NHS England, 2024a). Although many patients exhibit improvement, a considerable proportion do not transition to non-caseness. This may reflect the chronic nature of some mental disorders, in which a substantial number of patients experience prolonged illness. In the United Kingdom, 66.4% of referrals who completed a course of treatment showed reliable improvement (NHS England, 2024b). Reliable recovery rates varied widely between services, ranging from 23.9% to 56.5% (SD=8.0%) (Clark, 2011). These findings underscore the need for additional interventions and long-term follow-up to promote sustained recovery (Stochl et al., 2022).
The observed pattern of outcomes in this study highlights the importance of distinguishing between different recovery-related indicators. Recovery reflects a transition from caseness to non-caseness and represents a clinically meaningful change in symptom status. By contrast, reliable improvement captures statistically reliable symptom change regardless of whether post-treatment scores fall below clinical thresholds. The higher frequency of reliable improvement relative to recovery is consistent with this conceptual distinction and suggests that symptom reduction may occur without necessarily meeting the non-caseness criteria. This distinction is reflected in the use of outcome-specific denominators: reliable improvement was calculated among all participants for whom reliable change was possible at baseline (n=218), irrespective of caseness, whereas recovery and reliable recovery were restricted to those participants who met the caseness criteria at baseline (n=178). Importantly, the participants who did not meet the caseness criteria at baseline were not excluded from the reliable improvement analyses, provided that their baseline scores allowed for reliable change, reflecting the inclusive nature of the change-based outcome metrics (NHS England, 2024a). In addition, a minority of participants demonstrated reliable deterioration (8.3% of the total sample), highlighting the importance of monitoring not only improvement but also potential symptom worsening in routine clinical practice.
The recovery and reliable recovery rates observed in this study should be interpreted as descriptive indicators within this specific clinical context rather than as estimates directly comparable to those reported for TTAD services. The present clinical context differs from that of TTAD services in several key respects, including the absence of publicly funded psychotherapy, the greater financial burden on patients, and variability in treatment duration and assessment timing, which may have introduced selection bias and may limit the generalisability of the findings. Differences in baseline symptom severity, patient characteristics, service structure, and treatment delivery may also contribute to the variation in outcome patterns across settings.
Accordingly, caution is warranted when comparing outcomes across settings.
This study has several limitations. First, data entry was based on the voluntary cooperation of practitioners. Reminder emails were sent regularly, and data were entered to the extent possible. When two or more outcome assessments were available, the difference between the initial and final assessments was used as the change score. However, the number of CBT sessions varied across patients. The database used in this study did not include the number of CBT sessions as a variable. Therefore, the exact number of sessions received by each patient could not be determined. Although the exact number of sessions was not recorded, high-intensity CBT at our institution is typically delivered within a framework of approximately 16–20 sessions, consistent with NICE guidance (National Institute for Health and Care Excellence, 2022).
Furthermore, the practitioners were asked to enter only symptom scale scores that were routinely collected in clinical practice. Therefore, it was not feasible to implement formal data verification procedures or systematic checks to ensure data quality. Consequently, the quality assurance of the data may have been limited. Future research will require more standardised data management and quality control procedures to enhance the reliability of routinely collected clinical data.
Another important limitation of this study is the absence of a control group. This study was not conducted as a randomised controlled trial but was designed as a descriptive evaluation using clinical data routinely collected in real-world practice in accordance with the TTAD framework. Although this represents one of the study’s limitations, estimating the causal contributions of specific factors to treatment outcomes was not within the scope of this study.
Moreover, the target sample size in the present study was not determined a priori based on statistical power considerations. In light of this limitation, a post hoc sensitivity analysis was conducted. Specifically, given the paired pre–post design of the study, a post hoc power analysis indicated that the achieved sample size (N=241) provided 80% power to detect a small effect size (Cohen’s d=0.18) at a two-tailed significance level of α=.05.
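The reported sensitivity figure can be checked approximately with a normal approximation to the paired test. The sketch below is an assumption-laden illustration, not the authors' procedure: it uses the standard normal distribution in place of the noncentral t distribution, treats Cohen's d as the mean pre–post difference divided by the standard deviation of the differences, and uses a hypothetical function name.

```python
from math import sqrt
from statistics import NormalDist


def paired_power_normal_approx(d, n, alpha=0.05):
    """Approximate two-tailed power for a paired design (normal approximation)."""
    z = NormalDist()                      # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)     # two-tailed critical value
    shift = d * sqrt(n)                   # expected test statistic under the alternative
    # Probability of exceeding the critical value in either tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power = paired_power_normal_approx(d=0.18, n=241)
```

Under these assumptions the approximation gives a power close to 0.80 for d=0.18 and N=241, in line with the value stated above; an exact calculation based on the noncentral t distribution would differ only slightly at this sample size.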
Additionally, all the CBT sessions in this study were provided on a self-pay basis. Therefore, it cannot be ruled out that socioeconomic status influenced both the treatment outcomes and the number of sessions required to achieve recovery (Finegan et al., 2018; Richardson et al., 2025; Stochl et al., 2022).
Moreover, although disorder-specific CBT models are used in clinical practice, the outcome evaluation in this pilot study relied solely on the PHQ-9 and GAD-7. Therefore, it remains unclear how improvements in disorder-specific symptom measures were related to changes observed in these generic depression and anxiety scales. We did not administer measures such as Patient Experience Questionnaires, which assess patients’ subjective experiences of treatment. In particular, patients’ subjective evaluations of whether they feel their symptoms have improved may be an important indicator of treatment benefit, complementing reductions in symptom scale scores.
Finally, although the effectiveness of CBT is widely recognised in Japan, its implementation rate in medical institutions remains limited. While this study implemented evidence-based CBT and systematically accumulated patient-reported outcome data to examine patterns of improvement, establishing a more standardised framework for data collection remains an important task for future research.
In conclusion, this study provided a descriptive examination of recovery-related outcomes following CBT in a real-world Japanese clinical setting using established TTAD definitions. By applying standardised criteria for recovery, reliable improvement, and reliable recovery, this study presents outcome distributions within the sample in a transparent manner. These findings should be interpreted as descriptive indicators rather than as estimates of treatment effectiveness. Although our institution has implemented elements of the TTAD programme through therapist training and routine data collection, access to CBT in Japan remains limited due to the structural characteristics of the healthcare system. Nevertheless, the real-world clinical data presented here suggest potential for the further development of CBT services in Japan.
Data availability statement
The data that support the findings of this study are available in figshare at doi: 10.6084/m9.figshare.28706669
Acknowledgements
This paper reflects on the development and evolution of CBT for depression and anxiety, with particular attention to its application and implementation in routine clinical practice within the TTAD framework. We would like to express our sincere gratitude to the many patients from whom we have learned so much. We are also deeply thankful to the clinicians who face clinical challenges with dedication every day and who generously contributed to data collection for this study.
Author contributions
Chihiro Sutoh: Conceptualization, Methodology; Noriko Numata: Statistics, Writing – original draft; Yijing Bai: Statistics (equal); Shinnosuke Okada: Data acquisition; Kumiko Muramatsu: review and editing (equal); Eiji Shimizu: Conceptualization (equal), Methodology (equal), review and editing (equal).
Financial support
This research was supported by AMED under Grant Number JP23rea522113.
Competing interests
The authors declare no competing interests.
Ethical standards
This study was approved by the appropriate research ethics committee (no. 3368). Patient consent was obtained for the use of their symptom scale scores from the administered questionnaires. Patients receiving CBT were asked to participate in the study; those who agreed had their symptom scale scores collected during routine clinical care and entered into the online database. The informed consent process was waived by introducing an opt-out system announced on posters and on the facility’s website, which allowed patients to refuse to participate. The study was conducted in accordance with the Declaration of Helsinki.


