Replicating behavioural insights in health: a quasi-experimental Phase 2 trial of integrating descriptive social norms and institutional cost in SMS reminders to reduce missed hospital appointments

Pelle Guldborg Hansen; Raoni Demnitz; Jesper Enøe Elbæk; Sidsen Bruun; Caroline Drøgemüller Gundersen

doi:10.1017/bpp.2026.10030

Published online by Cambridge University Press: 23 April 2026

Pelle Guldborg Hansen*: Science Studies, Roskilde University, Roskilde, Denmark; iNudgeyou – The Applied Behavioural Science Centre, Copenhagen, Denmark
Raoni Demnitz: iNudgeyou – The Applied Behavioural Science Centre, Copenhagen, Denmark
Jesper Enøe Elbæk: iNudgeyou – The Applied Behavioural Science Centre, Copenhagen, Denmark
Sidsen Bruun: Science Studies, Roskilde University, Roskilde, Denmark; iNudgeyou – The Applied Behavioural Science Centre, Copenhagen, Denmark
Caroline Drøgemüller Gundersen: iNudgeyou – The Applied Behavioural Science Centre, Copenhagen, Denmark
*Corresponding author: Pelle Guldborg Hansen; Email: pgh@ruc.dk

Abstract

Missed hospital appointments (Do Not Attend [DNAs]) undermine healthcare efficiency and access. A high-profile study found that adding descriptive social norms (DSNs) or specific institutional cost (SIC) messages to SMS reminders could substantially reduce DNAs. This prompts optimism that integrating behavioural insights, besides reminders themselves, offers a cost-effective approach to mitigate DNAs. However, subsequent similar interventions have reported heterogeneous findings, echoing broader debates on recent meta-analyses about how to evaluate such findings. We address this issue by framing Behavioural Insights as Applied Science, which structures validation in three phases inspired by clinical research. We treat the aforementioned study as a Phase 1 proof of concept and conduct a Phase 2 replication under comparable operational conditions in a quasi-experimental, time-blocked field trial at South-western Jutland Hospital (20,867 appointments) across Cardiology, Endocrinology and Pulmonology. Patients received SMS reminders rotating every 2 months between a standard message, DSN framing or SIC framing. Neither DSN nor SIC reduced DNAs overall. SIC increased cancellations (OR = 1.41, p < 0.001) but not DNAs; DSN reduced DNAs in Cardiology (OR = 0.76, p = 0.027), while SIC increased DNAs in Endocrinology (OR = 1.31, p = 0.021). Our findings underscore the importance of applying a systematic approach in the evaluation of Behavioural Insights.

Keywords

behavioural insights; missed appointments; public health; replication study; SMS reminders

Information

Type: Article
Information: Behavioural Public Policy, First View, pp. 1–25

DOI: https://doi.org/10.1017/bpp.2026.10030
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2026. Published by Cambridge University Press.

Introduction

Missed medical appointments – commonly referred to as ‘no-shows’ or ‘Do Not Attend’ (DNAs) – represent a persistent and costly challenge to healthcare systems worldwide. DNAs result in underutilised clinical capacity, inefficient resource allocation and reduced access to care for other patients (Dantas et al., 2018). In the UK, the National Health Service (NHS) estimates the annual cost of missed general practitioner appointments alone at over £216 million (Werner et al., 2023), while in the United States, missed visits are estimated to cost the health system US$150 billion annually (£112 billion) (Mooney, 2024). Similar patterns appear in low- and middle-income countries; for example, specialty outpatient clinics in Santa Maria, Brazil, reported an average DNA rate of 9.4% from 2016 to 2021, resulting in substantial costs for municipal health budgets (Jacobi et al., 2023). Comparable inefficiencies have been documented in Kenya, Nigeria and other LMIC settings (Opon et al., 2021; Fatoye et al., 2024; Pacheco & Leal de Souza, 2025). Likewise, across high-income countries such as Denmark, DNAs remain a significant source of inefficiency and strain on healthcare resources (Nørgård et al., 2025).

Beyond the financial burden, persistent non-attendance disrupts workflows, creates scheduling inefficiencies and increases administrative effort, often leading to staff frustration when time slots remain unused (Groden et al., 2021; Nørgård et al., 2025). Improving attendance enhances predictability, enabling clinicians to plan caseloads, allocate resources more efficiently and focus on patients who need care (Boone et al., 2022). Even small improvements can have cascading system-level effects; for example, an AI-driven scheduling intervention improved capacity utilisation by 6% (Toker et al., 2024). These pressures make reducing DNAs a key policy priority. Because DNAs ultimately result from individual behaviour, addressing them effectively requires interventions grounded in behavioural as well as operational insights.

The intuitive response to DNAs is often to fine patients. However, evidence suggests that such penalties have a poor track record, as they do not reduce DNAs relative to non-fined patients (Blæhr et al., 2018). In addition, they affect patients disproportionately, not only because patients of lower socioeconomic status (SES) might have to reduce spending on essential items to pay fines, but also because higher-SES patients might have the knowledge or personal connections to ‘navigate’ the system and appeal to have fines removed, furthering inequities. The study described in this paper evaluates an alternative approach for policymakers and hospital administrators that is intended to be more equitable and effective.

The use of behavioural insights to reduce DNAs

Forgetfulness is the most frequently reported reason for DNA (Kaplan-Lewis & Percac-Lima, 2013; Parsons et al., 2021; Wilson & Winnard, 2022; Charton et al., 2024). In a retrospective study of 5,604 patients, 927 (16.5%) missed their appointments, with forgetting cited as the most common reason (n = 97, 35.5%) (Kaplan-Lewis & Percac-Lima, 2013). A systematic meta-review of 20 studies similarly found that forgetfulness was the most frequent cause of non-attendance in 5 of them (Parsons et al., 2021). A review of 21 NHS studies published between 2016 and 2021 also reported forgetfulness as the primary reason for missed appointments (Wilson & Winnard, 2022).

To address this issue, healthcare providers have increasingly turned to reminders, typically delivered a few days before a scheduled appointment. Such reminders are low-cost, scalable interventions that may be conceptualised as ‘behavioural interventions’ from the perspective of this journal. More generally, behavioural interventions can be defined as suggestions for new policy initiatives or procedures, or changes to existing ones, that result from systematically applying theoretical and methodological insights from the behavioural sciences to policy practices – from identifying and defining the challenges, through analysing their nature, to testing and suggesting potential interventions for policy improvements dealing with these challenges (Hansen, 2019; Hallsworth, 2023).

Some interventions are found post-hoc to align with these insights and are referred to as behaviourally aligned; other interventions are inspired by these insights and referred to as behaviourally informed; and yet others are tested according to these insights and referred to as behaviourally tested (Sousa et al., 2016). From this perspective, the use of reminders can be conceptualised as originally behaviourally aligned but increasingly behaviourally tested, aimed at mitigating problems that arise from humans’ limited attention, such as forgetfulness, overlooking and relegation of information (Hansen, 2019; Werner et al., 2023). One explanation for the increasing tendency to test behavioural interventions in the domain of public health is the natural compatibility of methods from the behavioural sciences – such as randomised controlled trials – with those traditionally used in clinical research and healthcare.

Today a substantial body of evidence, including meta-analyses and systematic reviews, has demonstrated that mobile phone reminders are generally effective at improving appointment adherence (Gurol-Urganci et al., 2013). A Cochrane review concluded that text messaging increases attendance compared to no reminder, with an estimated improvement in attendance rates of 11–12 percentage points across studies (Gurol-Urganci et al., 2013). Systematically reviewing the various channels available, Stubbs et al. (2012) find that letter reminders, on average across 7 studies, reduced DNAs by 7.6%, while the more cost-effective technological channel of SMS reminders, across 12 studies, on average reduced DNAs by 8.6%. Another systematic review confirms the effect of SMS reminders and adds that the effect is unchanged whether reminders are used in the primary health sector or in hospital outpatient clinics, whether the reminder is sent 24, 48 or 72 hours prior to the appointment, and whether it is sent to younger or older patients (Eriksen & Kjellberg, 2013). A third meta-study finds reminders to be even more effective, with an average reduction in DNAs of 34%, independent of whether reminders were sent the day before or the week before the appointment (Hasvold & Wootton, 2011). However, this study also finds that automatic calls and SMS reminders reduced DNAs on average by 29%, while the less cost-effective approach of reminding people by manually calling them reduced DNAs by 39% on average, thus suggesting that there is more to DNAs than merely forgetting.

This point is confirmed by research showing that patients’ tendency to attend or miss appointments – besides the psychological factors of forgetfulness and limited attention – correlates with structural factors such as transportation or work commitments (Werner et al., 2023) as well as demographic and clinical factors, such as younger age, male sex and mental health diagnoses (Nørgård et al., 2025). Still, as such structural, demographic and clinical factors are difficult to change, applied behavioural science has tended to study how cost-effective framing of the content of reminders may ‘nudge’ patients to keep their appointments without applying additional coercion or financial incentives that would otherwise exacerbate the unequal influence of structural, demographic and clinical factors. A notable example in this stream of research is Hallsworth et al. (2015), who found that simply including DSNs (descriptive social norms; ‘9 out of 10 patients attend’) reduced DNAs and, perhaps more surprisingly, that including the SIC of a missed appointment (specific institutional costs; ‘Not attending costs the NHS £160’) significantly reduced DNAs compared to a standard reminder. Other experiments have corroborated and extended the point that there is more to DNAs than merely forgetting. For instance, Berliner Senderey et al. (2020) conducted a large-scale A/B testing trial in Israel, finding that emotionally resonant messages – particularly those invoking guilt – reduced the DNA rate by nearly 7 percentage points compared to control groups. Similarly, Groden et al. (2021) found that behavioural messaging through appointment cards and signage in a US primary care clinic improved both visit adherence and long-term retention among high-cost, high-need patients.

While these results suggest that interventions based on integrating behavioural insights in reminders to reduce DNAs can yield real and measurable improvements in health care delivery when implemented at scale, there is also a need for caution. Teo et al. (2023) have cautioned against assuming the universal applicability of such nudges, particularly in complex care environments such as mental health. In a pragmatic trial at a U.S. Veterans Affairs medical centre, they found that incorporating persuasive nudges into appointment letters failed to produce statistically significant reductions in DNAs. More generally, recent meta-reviews (e.g., Hummel & Maedche, 2019; Mertens et al., 2022) and subsequent methodological critiques (Maier et al., 2022; Szaszi et al., 2022) have sparked debate about the actual effectiveness of applying behavioural insights and nudging in public policy. However, apart from the confusion in these meta-reviews about inclusion criteria, this debate – even where distinctions are made between categories of interventions (e.g., defaults, simplification, social reference) – tends to evaluate behavioural interventions as a unified class, independently of their relation to the specific behavioural problems they aim to address or the psychological mechanisms that supposedly mediate effects (Szaszi et al., 2018; reiterated in Szaszi et al., 2022). The latter is particularly problematic as behavioural interventions do not constitute a single, uniform class of interventions with universal effects, but rather a diverse set of insights that operate through different psychological mechanisms, which work differently relative to different contexts and behavioural problems (Hansen, 2019). Thus, attempting to evaluate ‘the average effectiveness of behavioural interventions’ is like attempting to evaluate ‘the average effectiveness of medication’ without regard for whether the medications differ and whether a study matches a medication with the relevant disease (Hansen, as cited in ScienceDK, 2022).

BIAS: a systematic approach for testing behavioural insights concepts

In the wider literature on Behavioural Insights and nudging, the debates about the effectiveness of behavioural interventions have led to an emerging discussion that is dividing into two streams of reception.¹ The first, which might be termed scientific universalism, seeks generalisable insights and adheres to the nomothetic foundations of behavioural science. The second, a growing stance that might be termed contextualism, adopts a more idiographic position, asserting that contextual factors not only may moderate outcomes but are even a precondition for the very efficacy of behavioural interventions – such that their real-world effectiveness cannot be understood apart from the cultural and subjective perceptions of those nudged. Proponents of contextualism often point to debates arising from recent meta-reviews and methodological critiques as evidence in their favour.

At our research centre, we adhere to the former stream for reasons of consistency with the behavioural sciences that go beyond the scope of this paper. Here it suffices to say that we have developed and follow a structured research approach that conceptualises and advances Behavioural Insights as Applied Science (BIAS). This approach draws explicit parallels to methodologies from medical and clinical research when it comes to evidence-building.² Rather than treating Behavioural Insights as a collection of universal and generalisable persuasive techniques that work independently of the nature of behavioural problems, BIAS frames the field as a cumulative and systematic science, focused on identifying, developing and rigorously testing Behavioural Insight Concepts (BICs) relative to such problems; but rather than treating behavioural problems merely as a matter of contextualism, it works on the assumption that there exist generic behavioural problems sharing an underlying psychological structure.

To elaborate, a BIC is a structured intervention concept that integrates behavioural insights and is developed to address a specific generic behavioural problem, e.g., missed appointments, and that may be adapted further to specific domains, e.g., DNAs in healthcare settings. BIAS insists, like Comparative Effectiveness Research (CER) in the clinical sciences (see Greenfield & Rich, 2012, for CER principles), that interventions must be evaluated in relation to which treatment will work best, in which patient, and under what circumstances. However, the diagnosed or described behavioural problem is regarded as having an underlying generic psychological structure – parallel to diseases in medical research – and as comprising the most crucial component of these circumstances. Like CER, BIAS insists on evaluating BICs in relation to the type of behavioural problem they target, rather than pooling everything into generic meta-estimates of averages across domains. Typically, the development of a BIC is embedded within a broader diagnostic framework, in our case BASIC (Hansen, 2019), which ensures that interventions are grounded in a behavioural understanding of the problem context rather than merely assumed. Without a proper diagnosis, ‘researchers risk testing interventions that either appear effective but cannot be scaled or sustained due to poorly understood mechanisms or fail entirely without a clear explanation’ (Osman et al., 2020). The implication is that resources are wasted.

In particular, the BIAS approach currently structures the testing of BICs into three sequential experimental phases, closely aligned with the logic of clinical trials:

1. Phase 1: Providing Experimental Proof of Concept. The first phase establishes efficacy, that is, it establishes whether the BIC can affect behaviour under controlled conditions, providing initial evidence that the underlying behavioural insights translate into measurable change.
2. Phase 2: Confirming Efficacy through Replication. The second phase evaluates whether the BIC reliably produces effects across similar, if not identical, instances of the same underlying behavioural problem, rather than in a single controlled setting. The aim is not to test every possible context, but to establish whether the mechanism embedded in the BIC is robust enough to support generalisation beyond the original trial. This step addresses concerns about replicability without collapsing into contextualism: variation across domains or populations is considered only to the extent that it tests whether the BIC addresses the same generic behavioural problem through the same underlying mechanism. A successful Phase 2 replication shows that the concept’s efficacy is not an artefact of the original trial’s design, setting or statistical luck (cf. Spiegelhalter, 2019), thereby justifying progression to Phase 3, where moderators, context interactions and other refinements are examined systematically.
3. Phase 3: Studying Moderations, Moderators and Comparative Dimensions. The third phase examines the conditions and refinements under which a BIC works, addressing both theoretical and practical considerations for policy implementation. This includes systematically identifying moderators across what we call the ‘3Cs’ – cognition (e.g., individual differences such as age, gender, cognitive style), context (institutional settings, delivery mechanisms) and culture (shared norms and values shaping interpretation of interventions). In addition to moderators, Phase 3 investigates moderations, encompassing both variations of the intervention and targeted refinements to enhance performance, which, if significant, may be treated as sub-BICs in their own right. Beyond mechanism-level analysis, Phase 3 expands the evidential base by comparing the BIC with alternative interventions (akin to comparative effectiveness in medicine), assessing cost–benefit profiles, potential side effects and ethical acceptability. Finally, this phase allows exploration of sub-hypotheses that can inform theory development and guide adaptive design in future applications.

A central motivation behind BIAS is to address the mixed findings in the Behavioural Insights literature and practice. We assume that such mixed results often arise not simply from contextual variability but from the absence of proper behavioural diagnosis prior to intervention. Without diagnosis, practitioners risk overlooking the fact that many behavioural challenges – such as DNAs – are instances of generic behavioural problems that recur across contexts. When it is not established whether an intervention aligns with the actual behavioural problem at play, interventions may target the wrong problem, leading to inconsistent outcomes. These inconsistencies are then often attributed to ‘contextual factors’, while meta-analyses, which typically pool diverse interventions as a single class, conclude limited or no effects on average. BIAS therefore positions behavioural diagnosis as a foundational step. By embedding diagnosis and structured testing within a systematic framework, BIAS aims to improve the validity and replicability of BICs and strengthen their cumulative contribution to public policy. In particular, BIAS provides an evidential hierarchy that serves to coordinate expectations and inform meta-reviews, much as is done in clinical research.

A quasi-experimental Phase 2 trial of SMS reminder framing to reduce missed hospital appointments

Returning to health policy, Hallsworth et al. (2015) is widely regarded as a landmark study in applying Behavioural Insights to healthcare. Conducted as two large-scale randomised controlled trials involving approximately 19,800 outpatient appointments in the UK, the study tested whether integrating behavioural insights into SMS reminders could reduce DNAs. Trial 1 (four arms; N ≈ 10,000) compared a standard reminder (control) with three variations: a DSN (‘9 out of 10 patients attend’), a message highlighting the SIC of a missed appointment to the NHS (‘Not attending costs the NHS £160’) and an ‘Easy Call’ message facilitating cancellation. The cost-framed message significantly reduced DNAs from 11.1% to 8.4% (OR 0.74, 95% CI [0.61, 0.89], p < 0.01), while the DSN message showed no effect on DNAs but slightly increased cancellations (8.8% to 10.5%, OR 1.23, 95% CI [1.02, 1.48], p = 0.03). Trial 2 (four arms; N ≈ 9,800) retained the SIC message as the reference and introduced two new insights: empathy (‘Please be fair to others waiting’) and accountability (reminding patients that a missed appointment would be recorded), along with a general cost message. Again, the SIC message produced the lowest DNA rate (8.2%), compared to 9.9% for the general cost framing (OR = 1.22) and 10.7% for empathy (OR = 1.33).³
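As a rough check on how the rates and odds ratios quoted above relate, the sketch below computes unadjusted odds ratios directly from the rounded percentages. It is an illustration only: the function names are hypothetical, and the published estimates presumably come from appointment-level data (and possibly adjusted models), so the values computed here should only approximate the reported ones.

```python
def odds(p: float) -> float:
    """Convert a proportion (0 < p < 1) into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_treatment: float, p_reference: float) -> float:
    """Unadjusted odds ratio of a treatment arm versus a reference arm."""
    return odds(p_treatment) / odds(p_reference)


# Rounded rates as quoted in the paragraph above: (treatment, reference).
comparisons = {
    "Trial 1, DNAs: SIC vs control": (0.084, 0.111),
    "Trial 1, cancellations: DSN vs control": (0.105, 0.088),
    "Trial 2, DNAs: general cost vs SIC": (0.099, 0.082),
    "Trial 2, DNAs: empathy vs SIC": (0.107, 0.082),
}

for label, (p_t, p_r) in comparisons.items():
    print(f"{label}: unadjusted OR = {odds_ratio(p_t, p_r):.2f}")
```

Because the inputs are rounded to one decimal place, small differences in the second decimal of the odds ratio relative to the reported values are to be expected.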

In the present context, we take experiment 1 in Hallsworth et al. (2015) to provide Phase 1 experimental proof of concept that the sub-BIC of integrating the SIC of missed appointments into the well-tested BIC of SMS reminders can successfully reduce DNAs, whereas integrating a DSN did not show any effect on DNAs, though it slightly increased cancellations. Hallsworth et al. do not explain at length the diagnostic assumptions behind the addition of these sub-BICs. Relative to the SIC, it is said that ‘The patient may not be aware that missing an appointment incurs a cost. Even if they are, the cost may not be salient, since it is likely to be seen as just an “opportunity cost” – i.e., the loss of an opportunity to do something more productive’ and that there is emerging evidence highlighting the SIC’s influence on behaviour. Relative to a DSN, it is said that ‘There is evidence from other fields that people overestimate the extent to which others perform acts that are non-optimal (or which cause harm to others), and that correcting these perceptions can change behaviour’. Still, we take these explanations to be well-aligned with diagnostic assumptions that may be identified using BASIC (Hansen & OECD, 2019), which associates a DSN with influencing Choice through a preference for social conformity, and the SIC with supporting Determination relative to keeping one’s appointment by leveraging social norms about not imposing cost on the collective, amplified by its specificity.

From the perspective of BIAS, however, the evidential base for these sub-BICs remains incomplete. Hallsworth et al. provide Phase 1 evidence of efficacy of highlighting the SIC in SMS messages under controlled conditions, but as experiment 2 did not include a control, only retaining the SIC message as a reference, it does not count as a Phase 2 replication. Thus, before making claims about robust efficacy, progressing to Phase 3 (where moderators, refinements and comparative dimensions are more systematically explored), or implementing the intervention at scale, BIAS insists on establishing whether these effects replicate in a new trial that targets the same generic behavioural problem through the same hypothesised mechanism, but in a similar operational context. Phase 2 replication is thus not an exercise in contextual exploration, but a test of actual efficacy and, in turn, diagnostic robustness: do the sub-BICs perform consistently when applied to another health system with comparable reminder infrastructure and appointment processes, given the same generic problem and comparable delivery mechanisms? Confirming or disconfirming this assumption is critical because, if the effects do not replicate, the efficacy of highlighting the SIC as well as the diagnostic inference underlying it could be called into question.

The present study addresses this gap by conducting a quasi-experimental Phase 2 trial at a major Danish hospital. Using a large sample across three outpatient departments, we examine the impact of SMS reminders framed either with a DSN or the SIC relative to a standard reminder. This design allows us to evaluate the robustness of the high-profile sub-BIC of highlighting the SIC under real-world conditions and to contribute to the cumulative science of Behavioural Insights by adhering to the structured testing approach advocated by BIAS.

Methods

As our experiment assesses whether previously observed causal effects replicate under comparable operational conditions, the structure of this methods section is adapted from Cronbach’s (1982) and Shadish et al.’s (2002) concept of UTOS – units that receive the conditions being contrasted, the treatments themselves, observations made on the units and the settings in which the study is conducted. Rearranging these and adding the goal of the experiment (as determined by its phase), we get the mnemonic GUSTO, which stands for Goals, Units, Setting, Treatment and Observations, and which reports on these elements in the sequence most suitable for laying out the experiment. Including ‘Goals’ reflects BIAS’s emphasis on an evidential hierarchy: each phase serves a distinct purpose in cumulative testing (proof of concept, replication or exploration). Structuring the methods section around GUSTO ensures transparency about how design choices align with the phase-specific goal and the overarching process of building robust evidence.

Goal

The experiment is a Phase 2 replication, testing whether the sub-BICs of DSNs and institutional costs, respectively – when integrated into the established BIC of text reminders – affect patient behaviour in the setting of hospital appointments. Specifically, the study examines whether these framings increase advance cancellations and thereby, or independently, reduce DNAs, the consequence of the underlying behavioural problem of choosing not to attend.

Units

The experimental unit was the individual outpatient appointment scheduled within eligible departments. The trial took place at Sydvestjydsk Sygehus (South-western Jutland Hospital, hereafter SVS), a regional hospital in Southern Denmark serving approximately 390,000 outpatient visits annually. Only departments using the scheduling system Bookplan were included, as this system allows for customised SMS reminders – a requirement for implementing the treatment conditions – unlike the national Nem-SMS system, which is restricted to 160 characters (see Figure 1). All included patients had attended at least one prior appointment, ensuring both phone number registration in Bookplan and consent for SMS communication. Patient confidentiality was preserved through anonymisation of CPR numbers, and ethical clearance was granted by Danish Regions (the coordinating body for Denmark’s five regional health authorities) and by SVS.

Figure 1.

The figure shows the messages for different groups within the Cardiology department as they were designed to appear on the phone of patients.

These represent sub-BICs (significant moderations) of the established BIC of text reminders. Assignments rotated every 2 months (see Table 1). The experiment was discontinued on 16 March 2020 following nationwide COVID-19-related changes that paused SMS systems and reprioritised hospital operations.

Table 1.

Assignment of hospital departments to intervention conditions (Social Norm, Cost, Control) across three consecutive periods during the experimental phase

Patients in all conditions received at least one SMS reminder 2 days or less before an appointment, and only these were modified. Messages sent earlier retained the hospital’s standard text to ensure consistency across conditions and avoid message overlap. Although patients might receive more than one reminder text within 2 days before the appointment, they were not given different messages to avoid overlap with other treatment conditions. Departmental staff implemented the schedule, and compliance was verified via internal reporting to the research team.
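A time-blocked rotation of this kind can be sketched as a lookup from department and appointment date to message condition. The sketch below is illustrative only: the calendar-aligned 2-month blocks, the Latin-square order and the names `condition_for` and `period_index` are assumptions made for the example, and the actual assignment is the one reported in Table 1.

```python
from datetime import date

# ASSUMPTION: illustrative labels and order; the real schedule is in Table 1.
CONDITIONS = ["Control", "DSN", "SIC"]
DEPARTMENTS = ["Cardiology", "Endocrinology", "Pulmonary"]

# Study window as stated in the text.
START = date(2019, 10, 1)
END = date(2020, 3, 16)


def period_index(d: date) -> int:
    """Return the 0-based 2-month block containing date d.

    ASSUMPTION: blocks are aligned to calendar months
    (Oct-Nov, Dec-Jan, Feb-Mar).
    """
    months_elapsed = (d.year - START.year) * 12 + (d.month - START.month)
    return months_elapsed // 2


def condition_for(department: str, d: date) -> str:
    """Look up the (hypothetical) condition for a department on a date."""
    if not (START <= d <= END):
        raise ValueError("date outside the study window")
    offset = DEPARTMENTS.index(department)
    # Latin-square rotation: each department sees each condition once,
    # and no two departments share a condition within a block.
    return CONDITIONS[(offset + period_index(d)) % len(CONDITIONS)]


if __name__ == "__main__":
    for dept in DEPARTMENTS:
        blocks = [condition_for(dept, date(2019, 10, 15)),
                  condition_for(dept, date(2019, 12, 15)),
                  condition_for(dept, date(2020, 2, 15))]
        print(dept, blocks)
```

Because condition is fully determined by department and calendar block in such a design, any seasonal or organisational change that coincides with a block is confounded with treatment within that department.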

The dataset included all outpatient appointments in the three departments between 1 October 2019 and 16 March 2020. For each appointment, the following variables were extracted, with 20,867 appointments retained in the final sample (see Figure 2). Table 2 summarises the distribution by treatment group and gender.

Figure 2.

Criteria used to filter the dataset, resulting in 20,867 data points.

Table 2.

This table summarises the total number of appointments included in the study (N = 20,867) across three hospital departments – Endocrinology, Pulmonary and Cardiology – by treatment group and gender

Observations (outcome measures)

Two primary outcomes were evaluated:

• DV1: Advance Cancellation Rate – The proportion of appointments cancelled by patients at least 1 day before the scheduled date, reflecting behavioural compliance with the request to reschedule rather than default to inaction.
• DV2: No-show Rate (DNAs) – The proportion of appointments where patients neither attended nor cancelled in advance, representing the behavioural problem of non-attendance.

Both outcomes allow us to evaluate whether the effects of interventions grounded in these assumed mechanisms – social conformity for DSN and strengthened determination through highlighting the SIC – replicate when applied to a new instance of the same behavioural problem under real-world hospital conditions. We have summarised the experimental design of this study in Figure 3.

Figure 3.

Experimental design.

Results

We hypothesised that including a DSN message in the SMS reminder would increase behavioural compliance by influencing decisions to attend or cancel in advance, as reflected in higher advance cancellation rates and/or lower DNA rates compared to the Control group (H1). Alternatively, we hypothesised that emphasising the SIC of missed appointments to the healthcare system would produce similar effects relative to the Control group (H2). To evaluate these hypotheses, we conducted a series of logistic regression analyses using cancellation and DNA as outcome variables. In Analysis 1, we compare outcomes between patients in the DSN group and the Control group, and between the SIC group and the Control group.

As noted in the Introduction, demographic variation and departmental differences may influence observed effects. To account for this and ensure robustness, we conduct two additional analyses. In Analysis 2, we include patient age and gender as covariates. In Analysis 3, we examine whether treatment effects are consistent across the three participating departments by segmenting the analysis by specialty.

Analysis 1

To test whether cancellation and DNA rates differed between treatment groups and the Control group, we estimated logistic regression models comparing outcomes for patients in the DSN (Treatment 2) and SIC groups (Treatment 3) relative to the Control group (Treatment 1). Formally, this analysis can be summarised as follows for each outcome variable:

\begin{equation*} \log\left(\frac{p}{1 - p}\right) = \beta_{0} + \beta_{1} \cdot \mathrm{DSN}_{i} + \beta_{2} \cdot \mathrm{SIC}_{i} \end{equation*}

Percentages for the different outcomes by treatment variables are shown in Figures 4 and 5, and the results for the inferential statistics are shown in Table 3.
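With an intercept and one indicator per treatment, the Analysis 1 model is saturated for three groups, so its maximum-likelihood estimates coincide with the group log-odds and each odds ratio equals the corresponding 2 × 2 cross-product ratio. The minimal sketch below illustrates this with the usual Wald interval. The counts are hypothetical placeholders, not the study's data, and the function name `saturated_logit` is an assumption made for the example.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def saturated_logit(counts, reference="Control", z=1.959964):
    """Closed-form fit of logit(p) = b0 + sum_k b_k * 1[group == k].

    counts maps group name -> (events, n). Returns the intercept and,
    for each non-reference group, the coefficient, odds ratio and
    Wald 95% confidence interval.
    """
    e0, n0 = counts[reference]
    beta0 = logit(e0 / n0)
    results = {}
    for name, (e, n) in counts.items():
        if name == reference:
            continue
        beta = logit(e / n) - beta0
        # Wald standard error of a log odds ratio from a 2 x 2 table.
        se = math.sqrt(1 / e + 1 / (n - e) + 1 / e0 + 1 / (n0 - e0))
        results[name] = {
            "beta": beta,
            "odds_ratio": math.exp(beta),
            "ci95": (math.exp(beta - z * se), math.exp(beta + z * se)),
        }
    return beta0, results


# HYPOTHETICAL counts for illustration only (events, appointments).
counts = {"Control": (171, 7000), "DSN": (180, 7000), "SIC": (239, 7000)}
beta0, fit = saturated_logit(counts)

if __name__ == "__main__":
    for name, est in fit.items():
        low, high = est["ci95"]
        print(f"{name}: OR = {est['odds_ratio']:.2f}, "
              f"95% CI [{low:.2f}, {high:.2f}]")
```

The shortcut no longer applies once continuous covariates such as age enter the model, as in Analysis 2, where the coefficients have to be estimated iteratively.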

Figure 4.

Cancellation rate by treatment group. Error bars represent Wilson score 95% CIs.

Figure 5.

DNA rate by treatment group. Error bars represent Wilson score 95% CIs.
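For readers unfamiliar with the Wilson score interval used for the error bars in Figures 4 and 5, the following minimal sketch computes it for a single proportion. The function name `wilson_interval` and the example counts are assumptions made for illustration and are not taken from the study data.

```python
import math


def wilson_interval(events: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion.

    events: number of appointments with the outcome (e.g. DNA)
    n:      total number of appointments in the group
    z:      standard-normal quantile (1.959964 for a 95% interval)
    """
    if n <= 0 or not (0 <= events <= n):
        raise ValueError("need n > 0 and 0 <= events <= n")
    p_hat = events / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only.
    low, high = wilson_interval(events=171, n=7000)
    print(f"95% Wilson CI: [{low:.4f}, {high:.4f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] and behaves well at the low event rates typical of cancellations and DNAs.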

Table 3.

This table presents logistic regression results showing the effect of each treatment on the odds of cancellation and no-show. Each treatment condition was compared to the Control condition. Only the effect of Cost on Cancellation rate was statistically significant

Cancellation rates

No significant difference in advance cancellation rates was observed between the Control group and the DSN group. Patients in the SIC group had an advance cancellation rate of 3.42%, compared to 2.45% in the Control group (OR = 1.41, 95% CI [1.16, 1.72], p < 0.001). This corresponds to a 41% increase in odds of cancelling in advance, although the absolute difference was under 1 percentage point.
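The relation between the reported percentages, the odds ratio and the absolute difference can be checked directly from the two rates given above. This is a small arithmetic sketch; the helper names are assumptions, and the inputs are the rounded rates reported in the text, so the result can differ slightly from a model fitted on the raw counts.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_ratio(p_treatment: float, p_control: float) -> float:
    """Odds ratio of treatment versus control from two proportions."""
    return odds(p_treatment) / odds(p_control)


# Rates as reported in the text (rounded to two decimals).
p_sic = 0.0342
p_control = 0.0245

or_sic = odds_ratio(p_sic, p_control)
abs_diff_pp = (p_sic - p_control) * 100.0   # percentage points
relative_risk = p_sic / p_control

if __name__ == "__main__":
    print(f"odds ratio          = {or_sic:.2f}")
    print(f"absolute difference = {abs_diff_pp:.2f} percentage points")
    print(f"relative risk       = {relative_risk:.2f}")
```

At event rates this low the odds ratio and the relative risk are nearly identical, which is why a 41% increase in odds sits alongside an absolute change of less than 1 percentage point.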

DNA rates

No statistically significant differences in DNA rates were observed between the DSN group and the Control group, between the SIC group and the Control group, or between the two treatment groups.

In sum, highlighting the SIC significantly increased advance cancellations, whereas neither treatment significantly reduced DNAs.

Analysis 2

To test whether demographic variables influenced the outcomes, the analyses above were repeated with age and gender included as covariates in the model for each outcome variable. More specifically:

\begin{equation*} \log\left(\frac{p}{1 - p}\right) = \beta_{0} + \beta_{1} \cdot \mathrm{DSN}_{i} + \beta_{2} \cdot \mathrm{Age}_{i} + \beta_{3} \cdot \mathrm{Gender}_{i} \end{equation*}

\begin{equation*} \log\left(\frac{p}{1 - p}\right) = \beta_{0} + \beta_{1} \cdot \mathrm{SIC}_{i} + \beta_{2} \cdot \mathrm{Age}_{i} + \beta_{3} \cdot \mathrm{Gender}_{i} \end{equation*}

Figures 6 and 7 present outcome percentages by gender, and Table 4 provides the corresponding inferential statistics. Inclusion of demographic variables did not alter treatment effects; therefore, we report only the demographic results below.

Figure 6.

Cancellation rate by sex. Error bars represent 95% CIs for the model-standardised mean probability; because estimates are averaged over a large sample with low event rates, the CIs are very narrow.

Figure 7.

DNA rate by sex. Error bars represent 95% CIs for the model-standardised mean probability; because estimates are averaged over a large sample with low event rates, the CIs are very narrow.

Table 4.

This table shows the logistic regression results, controlling for age and gender, by treatment model. Odds ratios, confidence intervals and significance are reported for each predictor variable

Cancellation rates

Older age was associated with significantly lower odds of cancelling in advance (OR = 0.98, 95% CI [0.97, 0.98], p < 0.01), indicating that each additional year of age slightly reduced the likelihood of cancellation. No significant differences in cancellation rates were observed for gender.
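Because age enters the model linearly on the log-odds scale, a per-year odds ratio compounds multiplicatively over larger age differences. The sketch below shows what the reported point estimate and confidence limits would imply over a decade. It assumes the linear-in-logit form of the model above and uses the rounded values from the text, so rounding error is amplified by the exponentiation; the function name is an assumption made for the example.

```python
def odds_ratio_over(or_per_year: float, years: float) -> float:
    """Odds ratio implied for an age difference of `years`,
    given a per-year odds ratio from a linear-in-logit model."""
    return or_per_year ** years


# Reported per-year estimate and 95% CI for advance cancellation.
or_point, or_low, or_high = 0.98, 0.97, 0.98

ten_year_point = odds_ratio_over(or_point, 10)
ten_year_low = odds_ratio_over(or_low, 10)
ten_year_high = odds_ratio_over(or_high, 10)

if __name__ == "__main__":
    print(f"10-year OR = {ten_year_point:.2f} "
          f"(from CI limits: {ten_year_low:.2f} to {ten_year_high:.2f})")
```

On these rounded figures, a patient 10 years older would have roughly 18% lower odds of cancelling in advance, which is a less negligible difference than the per-year estimate suggests at first glance.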

DNA rates

Older patients were also significantly less likely to DNA (OR = 0.98, 95% CI [0.96, 0.99], p < 0.01), with the odds of DNA decreasing gradually with age.

Male patients were significantly more likely to miss appointments without cancelling: the DNA rate was 5.13% for men vs 3.62% for women (OR = 1.56, 95% CI [1.32, 1.84], p < 0.001).

These demographic patterns did not modify treatment effects, which remained consistent with Analysis 1.⁴

Analysis 3

To examine whether the results from Analysis 1 remained consistent across clinical subspecialties, we repeated the logistic regression analyses separately for each department.

Cardiology department

Cancellation rates

In the Cardiology Department, patients in the SIC group had an advance cancellation rate of 2.04%, compared to 0.84% in the Control group (OR = 2.47, 95% CI [1.47, 4.15], p < 0.001), indicating that integrating the SIC in the SMS reminder increased the likelihood of cancelling in advance. No other significant findings were observed.

DNA rates

Patients in the DSN group had a significantly lower DNA rate of 4.03% compared to 5.23% in the Control group (OR = 0.76, 95% CI [0.60, 0.97], p = 0.027), indicating that integrating the DSN in the SMS reminder reduced DNAs in this department. No other significant findings were observed.

Endocrinology department

Cancellation rates

In the Endocrinology Department, patients in the DSN group had an advance cancellation rate of 4.76%, compared to 3.04% in the Control group (OR = 1.59, 95% CI [1.22, 2.09], p < 0.001), suggesting that integrating the DSN in the SMS reminder effectively increased advance cancellations in this department.

Patients in the SIC group had a cancellation rate of 4.46%, also significantly higher than in the Control group (OR = 1.49, 95% CI [1.16, 1.91], p < 0.01). No other significant findings were observed.

DNA rates

Patients in the SIC group had a DNA rate of 5.09%, which was significantly higher than the 3.94% in the Control group (OR = 1.31, 95% CI [1.04, 1.64], p = 0.021), suggesting that highlighting the SIC increased DNA rates in this department. No other significant findings were observed.

Pulmonary department

Cancellation rates

In the Pulmonary Department, patients in the SIC group had an advance cancellation rate of 2.97%, compared to 4.68% in the Control group (OR = 0.62, 95% CI [0.41, 0.95], p = 0.029), indicating that integrating the SIC in the SMS reminder reduced the likelihood of advance cancellations in this department. No other significant findings were observed.

DNA rates

No significant differences in no-show rates were observed across groups.

Summary of analysis 3

Treatment effects varied across departments, with no consistent pattern across specialties. Figures 8 and 9 summarise the descriptive statistics of department-wise DNA and cancellation rates and Table 5 summarises the inferential statistics.

Figure 8.

Cancellation rate by Department and Treatment group. Error bars represent Wilson score 95% CIs.

Figure 9.

DNA rate by Department and Treatment group. Error bars represent Wilson score 95% CIs.

Table 5.

This table summarises logistic regression outcomes by department, showing how DSN and SIC treatments affected odds of cancellation and DNA relative to control

Discussion

This study illustrates why Phase 2 diagnostic replication is essential for advancing BIAS. Hallsworth et al. (2015) provided Phase 1 proof of concept for integrating the SIC and a DSN into SMS reminders for hospital appointments, reporting that SIC messages significantly reduced DNAs from 11.1% to 8.4% in Trial 1, while DSN messages had no aggregate effect but slightly increased cancellations. Although the study included two trials involving approximately 19,800 appointments, the second trial did not function as a replication but as an exploratory refinement: it retained the SIC message as the reference and replaced the DSN with new framings (empathy and accountability), thereby shifting the intervention logic rather than testing robustness.

However, the evidential base for these interventions remains incomplete without systematic Phase 2 replication. Behavioural interventions are not designed to eliminate behavioural problems entirely but to mitigate them by shifting probabilities of action under real-world conditions. Whether such mitigation occurs depends on whether the BIC effectively addresses the behavioural problem via the assumed mechanisms driving the problem, and whether those mechanisms remain operative alongside other constraints. Phase 2 serves this purpose: moving beyond initial proof of concept to test whether a BIC and its diagnostic assumptions demonstrate efficacy across new instances of the same underlying behavioural problem. This logic contrasts with interpretations that attribute mixed results primarily to contextual idiosyncrasies, such as cultural differences. While context may matter, premature contextual explanations risk obscuring a more fundamental question: does the BIC align with the correct behavioural diagnosis and perform consistently under comparable operational conditions? Establishing this is a necessary step before progressing to Phase 3, where moderators, moderations and contextual interactions can be examined systematically.

Across more than 20,000 outpatient appointments in three hospital departments, neither integrating a DSN nor highlighting the SIC produced consistent improvements in patient attendance relative to the standard reminder. This contrasts with Hallsworth et al. (2015), who observed significant reductions in DNA rates for highlighting the SIC in SMS reminders in their Trial 1. Overall, in our study, highlighting the SIC significantly increased advance cancellations, but the absolute effect size was small and did not translate into fewer missed appointments (DNAs). By comparison, Hallsworth et al. reported that highlighting the SIC reduced DNA rates from 11.1% to 8.4% in their first trial (OR = 0.74), while integrating a DSN showed no aggregate effect. In our study, integrating a DSN similarly showed no significant effect in the aggregate analysis (Analysis 1), although a modest reduction in DNAs was observed in the Cardiology department (Analysis 3). Conversely, in Endocrinology, highlighting the SIC was associated with a slight increase in DNAs despite higher cancellation rates, while in Pulmonology, it reduced cancellations rather than increasing them. Table 6 compares the original study by Hallsworth et al. with our replication.

Table 6.

This table compares Hallsworth et al. (2015) (Experiment 1) with our replication

^a Statistically different from Control.

These mixed patterns suggest that while our design mirrored the original intervention’s logic, its effects did not replicate under comparable operational conditions. However, the demographic associations observed in our study – older patients less likely to miss or cancel, and men more likely to DNA – align with both prior research (Nørgård et al., 2025) and the patterns reported by Hallsworth et al. (2015). This consistency supports the interpretation that the present study functioned as a valid Phase 2 replication within the same diagnostic frame, despite divergent results on treatment efficacy. This highlights the need for systematic Phase 2 testing before progressing to Phase 3 explorations of contextual moderators, moderations and comparative effectiveness.

Strengths and limitations

This study has several strengths that enhance its relevance for both research and practice. It uses a large sample of more than 20,000 hospital appointments across three distinct specialties, providing robust statistical power and variation in patient populations. The quasi-experimental design, implemented under routine operational conditions, ensures high ecological validity and policy relevance. The design differed from the randomised trials reported by Hallsworth et al., but the time-blocked allocation helped maintain comparability and internal validity while accommodating hospital scheduling constraints.

At the same time, certain limitations warrant caution. The absence of randomisation means that unobserved temporal or organisational factors could have influenced outcomes, although the time-blocked design mitigates some of this risk. Additionally, the study did not capture structural or psychological variables – such as transport constraints, comorbidities or planning habits – that may affect responsiveness to reminder framings. Finally, treatment fidelity was maintained to mirror the original intervention design, focusing only on message content. While this is appropriate for Phase 2 replication, it limits conclusions about whether additional operational features – such as interactive links or multiple prompts, which would belong in Phase 3 testing – could improve effectiveness. These limitations underline the importance of further Phase 2 replications and controlled comparisons before broader implementation or progression to Phase 3 refinement testing.

In the Introduction, we wrote that missed medical appointments might be due to forgetting. However, what patients report may reflect surface-level explanations that mask deeper issues. Several authors have suggested that forgetfulness may sometimes serve as a socially desirable explanation, masking more uncomfortable or stigmatised reasons such as anxiety, ambivalence about treatment or dissatisfaction with care (Quinn et al., 2008; Parsons et al., 2021). Patients might prefer to describe their absence as simple forgetfulness rather than admit to financial strain, emotional distress or mistrust in the healthcare system. Recognising this possibility is important, as it suggests that the apparent simplicity of ‘forgetting’ may conceal deeper barriers that require more nuanced and supportive interventions.

We might never be able to detect the real reasons behind why patients ‘forget’. However, with advances in artificial intelligence, healthcare providers now have access to more predictive and data-driven methods that might help BI professionals design and test more nuanced and perhaps better interventions. One promising example is Deep Learning, a branch of machine learning that uses multi-layered neural networks to automatically detect patterns in large datasets (Dashtban & Li, 2022). Dashtban & Li (2022) combined electronic patient records with socioeconomic and environmental data to predict non-attendance in the healthcare system. Based on that data, they were able to categorise the risk of DNA for different patients. This approach outperformed traditional statistical models.

The point is that this method allows us to ask more nuanced questions, such as whether a subset of patients at risk for DNA are more likely to attend if they are called in advance, which allows staff to address barriers such as transportation or personal challenges – something that text reminders alone cannot achieve (Dashtban & Li, 2022). BI professionals should at least experiment with using AI methods to assist them in their work of designing interventions to be tested.

Policy implications

From a policy perspective, these findings argue for caution. While highlighting the SIC significantly increased advance cancellations in the aggregate analyses, it did not reduce DNAs overall and produced inconsistent patterns across departments, including a small but statistically significant increase in DNAs in one specialty. Consistent with Hallsworth et al. (2015), integrating a DSN in the SMS reminder showed no aggregate effect and only a limited benefit within a single department. Taken together, these results do not justify the broad implementation of either sub-BIC at scale. We stress to policymakers and hospital administrators that the lack of replication here does not mean that behavioural interventions for reducing DNAs do not work. Instead, our results support a ‘stop-and-hold’ position: before such interventions are embedded into routine practice, further Phase 2 replications are needed to determine whether these effects are robust, or whether their variability reflects underlying diagnostic limitations or competing constraints that limit behavioural responsiveness.

Although we found no consistent evidence that the interventions reduce DNAs, implementing them would probably not yield worse outcomes than current practice (i.e., sending generic appointment reminders). However, we hope that policymakers and hospital administrators who wish to implement behavioural interventions to reduce DNAs – such as the ones described in this paper – reach out to BI professionals to test interventions first and thereby avoid unnecessary spending on the infrastructure needed to put these interventions in place, especially in health systems operating under tight efficiency constraints.

Diagnostic implications

Our findings, considered alongside prior evidence, suggest two plausible explanations for the inconsistent effects observed in this Phase 2 trial. First, the performance of integrating a DSN or highlighting the SIC may depend on operational features beyond message content, such as message timing, interactivity or opportunities for personalisation. Studies reporting stronger effects often used designs that incorporated these elements: Berliner Senderey et al. (2020), for example, included confirmation links and obligation-based language, while Groden et al. (2021) required patients to sign and record their next appointment. Such enhancements may amplify the intended mechanism – or, as discussed below, they may engage a different mechanism altogether. Compared to these designs, our intervention – a single SMS sent 2 days prior without additional features – may not have incorporated elements that strengthen the relevant behavioural mechanisms.

A second explanation concerns the nature of the behavioural problem. DNAs are rarely a matter of isolated choice; they involve sustaining a prior intention in the face of competing goals. Framings that integrate a DSN or reference the SIC appeal to conformity or abstract responsibility but do little to reinforce individual commitment to act. Evidence from prior studies aligns with this interpretation: the interactive features in Berliner Senderey et al. (2020) and the signing requirement in Groden et al. (2021) may both operate by activating a stronger commitment rather than by appealing to social conformity or cost imposed on the collective. Similarly, Hallsworth et al. (2015) observed higher DNA rates among patients whose previous appointments had been cancelled, suggesting that commitment fragility predicts DNAs. Finally, unmeasured structural or psychological factors – such as competing goals, psychological challenges or differences in planning habits – may also account for the absence of consistent effects, although these were beyond the scope of the present analysis. If the commitment hypothesis proves promising, interventions that actively reinforce determination – such as confirmation requests or commitment prompts – would warrant new Phase 1 testing before progressing to replication.

Advancing BI as applied science

These findings underscore the value of approaching Behavioural Insights through a structured evidential hierarchy, as advocated by BIAS, rather than ad hoc experimentation or meta-analyses that aggregate heterogeneous interventions. If this framework had been adopted systematically following Hallsworth et al. (2015), subsequent efforts could have prioritised Phase 2 diagnostic replications – such as the one presented here – before progressing to Phase 3 activities such as exploring moderators, making cross-intervention comparisons or pooling results in meta-analyses. Instead, the field often moved prematurely to implementation and broad claims of generalisability, leaving key diagnostic assumptions untested. The BIAS approach addresses this problem by situating each BIC within a staged process of validation: Phase 1 establishes proof of concept, Phase 2 tests robustness across new instances of the same behavioural problem, and only after consistent evidence emerges does Phase 3 examine moderators, refinements and comparative effectiveness. This sequencing matters: without Phase 2 replication, it is impossible to know whether null or mixed effects reflect contextual idiosyncrasies, suboptimal operational features or a deeper mismatch between mechanism and behavioural problem. More broadly, adopting a BIAS framework would allow meta-analyses to aggregate evidence within clearly defined diagnostic categories rather than collapsing diverse interventions into a single estimate of ‘average effectiveness’. We hope that future research and policy evaluation in Behavioural Insights will increasingly adopt such structured approaches – both to strengthen the cumulative science of the field and to enable meta-analyses that produce meaningful, problem-specific insights rather than misleading generalisations.

AI declaration

The authors confirm that AI tools (specifically our AI colleague ‘Sally’ – a ChatGPT-based agent) were used to assist with language refinement and phrasing during the check of the final draft of this manuscript, the cover letter and this declaration. All content (research decisions, analysis and narrative interpretations) is the original work of the authors. Also, in line with Cambridge University Press policy:

• AI tools have not been listed as authors.
• All use of generative AI has been transparently declared.
• The authors remain fully accountable for the integrity of the work.

Footnotes

¹ The confusion stems from meta-analyses that pool very different studies under broad labels like ‘behavioural interventions’ or ‘nudges’, without checking whether they address a clearly defined behavioural problem. For example, Hummel & Maedche (2019) and Mertens et al. (2022) include self-labelled nudges that lack a behavioural diagnosis linking problem, mechanism and intervention. This creates inclusion and categorisation errors, mixing fundamentally different interventions and grouping results by surface features instead of mechanisms. Maier et al. (2022) and Szaszi et al. (2022) then read this heterogeneity as weak or inconsistent effects, mistaking methodological noise for conceptual weakness.

² Although BIAS draws explicit parallels to the phased structure of clinical research, it does not replicate these phases in a one-to-one fashion. In clinical research, diseases have relatively stable underlying pathologies that can be studied in vitro or under highly controlled conditions without fundamentally changing the phenomenon. This allows Phase 1 laboratory or tightly controlled clinical studies to establish efficacy before moving to larger and more heterogeneous populations in Phase 2 and beyond. Behavioural problems, by contrast, are situationally constituted: they arise from the interaction between human cognition, the immediate choice architecture and the surrounding operational constraints. Attempting to ‘bring them into the lab’ typically alters or removes the very conditions that define the problem. For example, a DNA in a hospital setting is not just a decision in isolation; it is embedded in scheduling systems, institutional routines, social expectations and competing priorities, all of which contribute to its persistence. This difference in the stability of the underlying phenomenon is the reason BIAS departs from the medical model by inserting a distinct Phase 2 between proof of concept and exploration of moderators. In medicine, the underlying condition remains the same when moving from Phase 1 to Phase 2, so replication across similar cases is less critical at that stage. In behavioural science, however, because the ‘diagnosis’ of a generic behavioural problem cannot be fully established independently of its real-world setting, Phase 2 functions as a diagnostic robustness test: it evaluates whether a BIC performs consistently across similar, if not identical, instances of the same behavioural problem with comparable delivery mechanisms. Only once this replication step confirms the mechanism’s stability does BIAS progress to Phase 3, where broader context variation, moderators and refinements are systematically explored.

³ It is worth noting that in both trials, the DSN and SIC messages were not tested in isolation but included the operational element of an ‘Easy Call’ feature – a prompt and contact number facilitating cancellation. This constitutes a sub-BIC in its own right, potentially influencing both cancellation and DNA rates independently of the social norm or cost framing. As such, the effects attributed to DSN or SIC in Hallsworth et al. (2015) may partly reflect the impact of this facilitation element. Nonetheless, in the present study, we follow Hallsworth et al.’s own interpretation, treating the effects as attributable to the DSN or SIC framings as reported, in order to preserve comparability with their conclusions.

⁴ We attempted time-adjusted, department-specific estimates using pooled logistic models with Department × Intervention interactions and Period fixed effects (plus age/sex). However, many department × period × intervention cells had very few cancellations/no-shows, inflating standard errors, widening confidence intervals and sometimes causing quasi/complete separation. These estimates were therefore unstable and are omitted; we instead report within-department comparisons unadjusted for time.

References

Berliner Senderey, A., Kornitzer, T., Lawrence, G., Zysman, H., Hallak, Y., Ariely, D. and Balicer, R. (2020), ‘It’s how you say it: systematic A/B testing of digital messaging cut hospital no-show rates’, PLOS ONE, 15(6): e0234817. https://doi.org/10.1371/journal.pone.0234817CrossRef Google Scholar PubMed

Blæhr, E. E., Væggemose, U. and Søgaard, R. (2018), ‘Effectiveness and cost-effectiveness of fining non-attendance at public hospitals: a randomised controlled trial from Danish outpatient clinics’, BMJ Open, 8(4): e019969. https://doi.org/10.1136/bmjopen-2017-019969CrossRef Google Scholar PubMed

Boone, C. E., Celhay, P., Gertler, P., Gracner, T. and Rodriguez, J. (2022), ‘How scheduling systems with automated appointment reminders improve health clinic efficiency’, Journal of Health Economics, 82: 102598. https://doi.org/10.1016/j.jhealeco.2022.102598CrossRef Google Scholar PubMed

Charton, L., Gatier, F., Delacour, C. and Lépine, C. (2024), ‘From statistics to stories: understanding the complex landscape of missed medical appointments. A mixed-methods pilot study’, BJGP Open, 8(3): BJGPO.2024.0007. https://doi.org/10.3399/BJGPO.2024.0007CrossRef Google Scholar PubMed

Cronbach, L. J. (1982), Designing Evaluations of Educational and Social Programs. Jossey-Bass.Google Scholar

Dantas, L. F., Fleck, J. L., Oliveira, F. L. C. and Hamacher, S. (2018), ‘No shows in appointment scheduling – a systematic literature review’, Health Policy, 122(4): 412–421. https://doi.org/10.1016/j.healthpol.2018.02.002CrossRef Google Scholar PubMed

Dashtban, M. and Li, W. (2022), ‘Predicting non-attendance in hospital outpatient appointments using deep learning approach’, Health Systems, 11(3): 189–210. https://doi.org/10.1080/20476965.2021.1924085CrossRef Google Scholar PubMed

Eriksen, M. and Kjellberg, J. (2013), Nedbringelse Af Udeblivelser I Sundhedsvæsenet: International Litteraturstudie. KORA.Google Scholar

Fatoye, F., Afolabi, O. E., Gebrye, T., Oyewole, O. O., Fasuyi, F. and Mbada, C. E. (2024), ‘Missed physiotherapy appointments and their influence on cost, efficiency and patient outcomes in Nigeria’, Annali Di Igiene, 36(1): 3–14. https://doi.org/10.7416/ai.2023.2586

Greenfield, S. and Rich, E. (2012), ‘Welcome to the Journal of Comparative Effectiveness Research’, Journal of Comparative Effectiveness Research, 1(1): 1–3. https://doi.org/10.2217/cer.11.13

Groden, P., Capellini, A., Levine, E., Wajnberg, A., Duenas, M., Sow, S., Ortega, B., Medder, N. and Kishore, S. (2021), ‘The success of behavioral economics in improving patient retention within an intensive primary care practice’, BMC Family Practice, 22(1): 253. https://doi.org/10.1186/s12875-021-01593-8

Gurol-Urganci, I., de Jongh, T., Vodopivec-Jamsek, V., Atun, R. and Car, J. (2013), ‘Mobile phone messaging reminders for attendance at healthcare appointments’, Cochrane Database of Systematic Reviews, 12: CD007458. https://doi.org/10.1002/14651858.CD007458.pub3

Hallsworth, M., Berry, D., Sanders, M., Sallis, A., King, D., Vlaev, I. and Darzi, A. (2015), ‘Stating appointment costs in SMS reminders reduces missed hospital appointments: findings from two randomised controlled trials’, PLOS ONE, 10(9): e0137306. https://doi.org/10.1371/journal.pone.0137306

Hallsworth, M. (2023), ‘A manifesto for applying behavioural science’, Nature Human Behaviour, 7(10): 1527–1529. https://doi.org/10.1038/s41562-023-01555-3

Hansen, P. G. OECD (2019), Tools and Ethics for Applied Behavioural Insights: The BASIC Toolkit. Organisation for Economic Co-operation and Development, OECD. https://doi.org/10.1787/9ea76a8f-en

Hasvold, P. E. and Wootton, R. (2011), ‘Use of telephone and SMS reminders to improve attendance at hospital appointments: a systematic review’, Journal of Telemedicine and Telecare, 17(7): 358–364. https://doi.org/10.1258/jtt.2011.110707

Hummel, D. and Maedche, A. (2019), ‘How effective is nudging? A quantitative review on the effect sizes and limits of empirical nudging studies’, Journal of Behavioral and Experimental Economics, 80: 47–58. https://doi.org/10.1016/j.socec.2019.03.005

Jacobi, E. R. T., Jacobi, L. F., Souza, A. M., Dorneles, T. D. C. and Coronel, D. A. (2023). Valores financeiros que deixaram de ser repassados ao município de Santa Maria – RS em razão do absenteísmo de pacientes, em consultas médicas especializadas de 2016 à 2021. [Conference paper]. 10º Congresso Internacional em Saúde, Universidade Regional do Noroeste do Estado do Rio Grande do Sul. https://www.publicacoeseventos.unijui.edu.br/index.php/conintsau/article/download/23176/21917

Kaplan-Lewis, E. and Percac-Lima, S. (2013), ‘No-show to primary care appointments: why patients do not come’, Journal of Primary Care & Community Health, 4(4): 251–255. https://doi.org/10.1177/2150131913498513

Maier, M., Bartoš, F., Stanley, T. D., Shanks, D. R., Harris, A. J. and Wagenmakers, E. J. (2022), ‘No evidence for nudging after adjusting for publication bias’, Proceedings of the National Academy of Sciences, 119(31): e2200300119. https://doi.org/10.1073/pnas.2200300119

Mertens, S., Herberz, M., Hahnel, U. J. J. and Brosch, T. (2022), ‘The effectiveness of nudging: a meta-analysis of choice architecture interventions across behavioral domains’, Proceedings of the National Academy of Sciences, 119(1): e2107346118. https://doi.org/10.1073/pnas.2107346118

Mooney, J. (2024, August 29). The cost of missed medical appointments: a hidden burden on healthcare. TransLoc Blog. https://transloc.com/blog/the-cost-of-missed-medical-appointments-a-hidden-burden-on-healthcare/

Nørgård, B. M., Iachina, M., Ammentorp, J., Schwalbe, D. M., Waidtløw, K. Y., Richardt, L. and Sodemann, M. (2025), ‘Non-attendance in hospital appointments based on data from the entire region of Southern Denmark: descriptive analyses and predictive factors’, Clinical Epidemiology, 17: 303–314. https://doi.org/10.2147/CLEP.S512971

Opon, S. O., Tenambergen, W. M. and Njoroge, K. M. (2021), ‘Influence of organizational and access factors on adherence to appointments in antenatal clinics at county referral hospitals, Kenya’, PAMJ – One Health, 5(17). https://doi.org/10.11604/pamj-oh.2021.5.17.26710

Osman, M., McLachlan, S., Fenton, N., Neil, M., Löfstedt, R. and Meder, B. (2020), ‘Learning from behavioural changes that fail’, Trends in Cognitive Sciences, 24(12): 969–980. https://doi.org/10.1016/j.tics.2020.09.009

Pacheco, A. D. O. and Leal de Souza, Â. R. (2025), ‘Custos no setor público: o custo do absenteísmo nas unidades básicas de saúde’, Cadernos Gestão Pública E Cidadania, 30: e92013. https://doi.org/10.12660/cgpc.v30.92013

Parsons, J., Bryce, C. and Atherton, H. (2021), ‘Which patients miss appointments with general practice and the reasons why: a systematic review’, The British Journal of General Practice, 71(707): e406–e412. https://doi.org/10.3399/BJGP.2020.1017

Quinn, G., Detman, L. and Bell-Ellison, B. (2008), ‘Missed appointments in perinatal care: response variations in quantitative versus qualitative instruments’, The Journal of Medical Practice Management, 23(5): 307–313.

Science, D. K. (2022, February 3), Ny analyse: sådan nudges vi bedst til at ændre adfærd. Videnskab.dk. https://videnskab.dk/kultur-samfund/ny-analyse-saadan-nudges-vi-bedst-til-at-aendre-adfaerd/

Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002), Experimental and Quasi-experimental Designs for Generalized Causal Inference. Houghton Mifflin.

Sousa, L. J., Ciriolo, E., Rafael, R. V. D. and Troussard, X. (2016), Behavioural Insights Applied to Policy-European Report 2016. European Commission, Joint Research Centre. https://doi.org/10.2760/903938

Spiegelhalter, D. (2019), The Art of Statistics: How to Learn from Data. London, UK: Pelican.

Stubbs, N. D., Geraci, S. A., Stephenson, P. L., Jones, D. B. and Roth, B. J. (2012), ‘Methods to reduce outpatient non-attendance’, The American Journal of the Medical Sciences, 344(3): 211–219. https://doi.org/10.1097/MAJ.0b013e31824997c6

Szaszi, B., Palinkas, A., Palfi, B., Szollosi, A. and Aczel, B. (2018), ‘A systematic scoping review of the choice architecture movement: toward understanding when and why nudges work’, Journal of Behavioral Decision Making, 31(3): 355–366. https://doi.org/10.1002/bdm.2035

Szaszi, B., Higney, A., Charlton, A., Gelman, A., Ziano, I., Aczel, B. and Tipton, E. (2022), ‘No reason to expect large and consistent effects of nudge interventions’, Proceedings of the National Academy of Sciences, 119(31): e2200732119. https://doi.org/10.1073/pnas.2200732119

Teo, A. R., Niederhausen, M., Handley, R., Metcalf, E. E., Call, A. A., Jacob, R. L., Zikmund-Fisher, B. J., Dobscha, S. K. and Kaboli, P. J. (2023), ‘Using nudges to reduce missed appointments in primary care and mental health: a pragmatic trial’, Journal of General Internal Medicine, 38(Suppl 3): S894–S904. https://doi.org/10.1007/s11606-023-08131-5

Toker, K., Ataş, K., Mayadağlı, A., Görmezoğlu, Z., Tuncay, I. and Kazancioglu, R. (2024), ‘A solution to reduce the impact of patients’ no-show behavior on hospital operating costs: artificial intelligence-based appointment system’, Healthcare, 12(21): 2161. https://doi.org/10.3390/healthcare12212161

Werner, K., Alsuhaibani, S. A., Alsukait, R. F., Alshehri, R., Herbst, C. H., Alhajji, M. and Lin, T. K. (2023), ‘Behavioural economic interventions to reduce health care appointment non-attendance: a systematic review and meta-analysis’, BMC Health Services Research, 23(1): 1136. https://doi.org/10.1186/s12913-023-10059-9

Wilson, R. and Winnard, Y. (2022), ‘Causes, impacts and possible mitigation of non-attendance of appointments within the National Health Service: a literature review’, Journal of Health, Organization and Management, 36: 892–911. https://doi.org/10.1108/JHOM-11-2021-0425

Figure 1. The figure shows the messages for different groups within the Cardiology department as they were designed to appear on the phone of patients.

Table 1. Assignment of hospital departments to intervention conditions (Social Norm, Cost, Control) across three consecutive periods during the experimental phase

Figure 2. Criteria used to filter the dataset, resulting in 20,867 data points.

Table 2. This table summarises the total number of appointments included in the study (N = 20,867) across three hospital departments (Endocrinology, Pulmonary and Cardiology) by treatment group and gender

Figure 3. Experimental design.

Figure 4. Cancellation rate by treatment group. Error bars represent Wilson score 95% CIs.

Figure 5. DNA rate by treatment group. Error bars represent Wilson score 95% CIs.

Table 3. This table presents logistic regression results showing the effect of each treatment on the odds of cancellation and no-show. Each treatment condition was compared to the Control condition. Only the effect of Cost on Cancellation rate was statistically significant

Figure 6. Cancellation rate by sex. Error bars represent 95% CIs for the model-standardised mean probability; because estimates are averaged over a large sample with low event rates, the CIs are very narrow.

Figure 7. DNA rate by sex. Error bars represent 95% CIs for the model-standardised mean probability; because estimates are averaged over a large sample with low event rates, the CIs are very narrow.

Table 4. This table shows the logistic regression results, controlling for age and gender, by treatment model. Odds ratios, confidence intervals and significance are reported for each predictor variable

Figure 8. Cancellation rate by Department and Treatment group. Error bars represent Wilson score 95% CIs.

Figure 9. DNA rate by Department and Treatment group. Error bars represent Wilson score 95% CIs.

Table 5. This table summarises logistic regression outcomes by department, showing how DSN and SIC treatments affected odds of cancellation and DNA relative to control

Table 6. This table compares Hallsworth et al. (2015) (Experiment 1) with our replication

