43.1 Overview
Melodic intonation therapy (MIT) is a music-based treatment for aphasia that has been in use for about 50 years, implemented in numerous languages and adapted in a variety of ways (Albert et al., 1973; Van der Meulen et al., 2016; Zhang et al., 2021; Zumbansen et al., 2014a). Research addressing the efficacy of MIT has mostly consisted of case studies, case series, and other single-subject designs, but there are also a small number of randomized controlled trials (RCTs) as well as some systematic reviews and meta-analyses (Haro-Martínez et al., 2021; Popescu et al., 2022; Van der Meulen et al., 2012, 2014, 2016; Zumbansen and Tremblay, 2019). Despite the evidence, it is still not clear which of the numerous elements of the treatment are important, for whom they work best, and how much improvement can be expected for a given person with aphasia (PWA) when MIT is implemented. This chapter provides a brief review of what MIT is and the evidence for its effectiveness, and then addresses why and for whom it works. The main focus of the chapter is on MIT’s treatment ingredients, with particular attention to the rhythm and timing aspects of the treatment. Finally, a framework is presented for systematically considering the ingredients of MIT, both in clinical practice and research.
43.2 What Is MIT?
MIT was first presented in the early 1970s as a treatment for severe nonfluent aphasia. Based on the long-standing clinical observation that people with aphasia can often produce more words when singing than when speaking, MIT was developed to support patients in singing short phrases in a simplified way (Albert et al., 1973; Sparks et al., 1974). The standard MIT protocol incorporates basic musical elements – including singing on a limited number of pitches (intoning), a simplified rhythm, and tapping the left hand along with the rhythm – to facilitate the production of target utterances. The treatment protocol consists of a structured series of steps, with decreasing clinician support over time. For each target utterance, the PWA progresses from watching and listening while the clinician hums and then intones the utterance, to intoning in unison with the clinician, and through a series of steps that ends with the PWA intoning the utterance in response to a question from the clinician. Throughout these steps, the clinician guides the PWA to tap their left hand to the rhythm of the utterance (Sparks and Holland, 1976). Eventually, the PWA is encouraged to transition from intoning to using exaggerated prosody, and then to regular spoken language. The original goal of MIT was improved propositional speech, that is, an increased ability to use spoken language in a functional and communicative way (Helm-Estabrooks and Albert, 2004; Sparks, 2008).
43.2.1 Effectiveness Evidence: Does It Work?
For decades after MIT was developed, the research studies evaluating its effectiveness were virtually all case studies or case series. Some of the early studies identified factors that differentiated treatment responders from nonresponders, showing improvement in at least some of the participants (Naeser and Helm-Estabrooks, 1985; Sparks et al., 1974). Later small-sample and case studies continued to show treatment effects of MIT, demonstrating effects on trained utterances, generalized improvement on standardized measures of language function, or generalization to propositional speech tasks such as conversation or picture description (e.g., Curtis et al., 2020; Hough, 2010; Wilson et al., 2006; Zipse et al., 2012; Zumbansen et al., 2014b). However, these studies were often not well controlled, and some were conducted in the subacute phase, when spontaneous neurological recovery was likely a contributing factor to the observed improvements (Popescu et al., 2022).
In more recent years, research studies considered to be higher-level evidence, including RCTs, systematic reviews, and meta-analyses, have addressed the efficacy of MIT. Van der Meulen et al. conducted two RCTs including participants with severe nonfluent aphasia, the first with PWAs in the subacute phase and the second with PWAs in the chronic phase (2014, 2016). In the subacute phase, PWAs treated with MIT showed significantly more improvement on repetition of both trained and untrained utterances compared to deferred-treatment control participants, and they also showed evidence of better generalization to functional communication. In the chronic phase, the participants treated with MIT showed more improvement on trained utterances than the no-treatment control group but did not show robust evidence of generalization to other utterances or tasks, and gains were not maintained at a six-week follow-up. In a systematic review of four RCTs, Haro-Martínez et al. (2021) found that MIT results in improved repetition but not auditory comprehension, and there is less robust evidence that it leads to improved functional communication. In the most comprehensive meta-analysis to date, Popescu et al. (2022) considered group-level data from RCTs as well as individual participant data from case reports. They found that MIT produces small positive effects, mostly on repetition tasks, with greater effects for trained than untrained utterances. The authors concluded that MIT alone may not substantially improve everyday communication.
In summary, the evidence for MIT is rather modest. However, a noteworthy finding from the analysis conducted by Popescu et al. (2022) is that the treatment effect sizes are substantially larger for single-subject design studies than for RCTs, and effect sizes are larger for modified treatment protocols compared to the original protocol. The authors very reasonably interpret this as illustrating the importance of well-controlled studies to account for spontaneous recovery and other factors. Another possible interpretation, though, is that these results highlight the importance of adjusting the MIT protocol to the individual PWA. In contrast to the more rigid protocols typically used in RCTs, single-subject designs allow more flexibility for the clinician to respond to the behavior and preferences of the clients, as in clinical practice. This responsiveness is likely critical when treating complex behavior in a heterogeneous population.
43.2.2 Target Population: For Whom Does It Work?
Aphasia varies widely across individuals in terms of subtype, severity, etiology, and time post-onset. Other comorbid conditions also affect how aphasia is managed and which treatments are considered appropriate. Although there has not been extensive research on the factors associated with a meaningful treatment response to MIT, good treatment candidates are considered to be PWAs with a single left-hemisphere lesion resulting in nonfluent aphasia with severely limited verbal output, poor repetition even at the single-word level, relatively good auditory comprehension, poor articulation, good emotional stability, and adequate attentional capacity (American Academy of Neurology, 1994; Sparks, 2008). Consistent with this, most treatment studies have focused on people with nonfluent aphasia, particularly moderate to severe Broca’s aphasia, though a minority of studies have included a wider variety of PWAs (García-Casares et al., 2022; Haro-Martínez et al., 2021). However, the response to MIT is variable even across individuals who fit the treatment criteria (Van der Meulen et al., 2016).
One critical variable is whether a PWA also has apraxia of speech (AOS), a motor speech-planning disorder that manifests in slow, distorted speech with atypical prosody, and sometimes audible or visible articulatory groping. In contrast to aphasia, a disorder of language, AOS affects the planning and programming of the motor commands for speech articulation. While this is a reasonably clear distinction in theory, in practice the two conditions can be challenging to differentiate, particularly when aphasia results in phonemic errors, and the conditions often co-occur (Basilakos et al., 2017; Strand et al., 2014). The developers of MIT initially reported that good responders exhibited “restricted but clearly articulated stereotype-like speech,” indicating that these initial patients may not have had AOS (Sparks et al., 1974, p. 312). The researchers did note, though, that this particular criterion should be interpreted with caution due to the small sample size in their study, and the treatment criteria have evolved over time to specify that the best candidates show “diminished articulatory agility and effortful initiation of speech production” – a description that applies to AOS (American Academy of Neurology, 1994, p. 566). It is noteworthy that at the time MIT was developed, “apraxia of speech” was a new term, coined by Darley et al., and AOS was not widely accepted as a disorder separate from aphasia (Darley et al., 1969). It therefore was not considered as a factor for treatment eligibility.
To date, most studies evaluating music-based treatments for aphasia more broadly include PWAs with a concomitant motor speech disorder, usually AOS (Zumbansen and Tremblay, 2019). In fact, the odds of a music-based intervention resulting in improved speech outcomes are approximately 21 times higher, and the odds of the treatment resulting in improved language outcomes about four times higher, in people with aphasia and a motor speech disorder compared to those with aphasia alone (Zumbansen and Tremblay, 2019). Furthermore, many of the elements included in MIT are commonly used in treatments for AOS, including slow rate, regular rhythm, and tapping or other means to enact pacing (Ballard et al., 2015; Wambaugh and Martinez, 2000). Adaptations of the standard MIT protocol, even when being used with the rationale of treating aphasia, often include additional elements suited to treating AOS, such as articulatory-kinematic approaches and/or the use of automated phrases to induce a correct production of a target word (e.g., Zhang et al., 2021). It seems, then, that having both AOS and aphasia may make one a particularly good candidate for MIT, although this depends on how exactly the treatment is used.
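For readers less familiar with odds ratios, the comparison reported above can be written out using the standard definition. The symbols are introduced here for illustration only: p_AM is the proportion of participants with aphasia plus a motor speech disorder who improved, and p_A is the proportion of participants with aphasia alone who improved. This is not a reconstruction of how the cited analysis was computed.

```latex
\[
\mathrm{OR} \;=\; \frac{p_{AM} / (1 - p_{AM})}{p_{A} / (1 - p_{A})}
\]
% Purely hypothetical proportions, chosen only to illustrate the scale:
% p_{AM} = 0.90 and p_{A} = 0.30 give
\[
\mathrm{OR} \;=\; \frac{0.90 / 0.10}{0.30 / 0.70} \;=\; \frac{9}{0.4286} \;\approx\; 21
\]
```

Note that an odds ratio of 21 does not mean improvement is 21 times more probable. In the hypothetical example, the probability of improvement is only three times higher (0.90 versus 0.30).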
43.2.3 Outcome Measures: What Does “Working” Mean?
In addition to determining individual patient factors, it is also important to consider what “working” means. Across the literature, MIT has been used with different treatment goals and outcome measures. MIT was created as a treatment to stimulate the language system, with the aim of improving spontaneous, propositional language production (Sparks and Holland, 1976). It has also been used to train specific utterances, with the aim of creating a set of almost automatic phrases the PWA can use in functional ways; Zumbansen et al. have referred to this as “palliative” use (2014a; also see Section 43.3.8.2). Indeed, a number of research studies have taken this approach, quite likely because it offers advantages in terms of research design and experimental control (e.g., Hough, 2010; Stahl et al., 2013). A third approach is to teach MIT as a strategy: The music-based elements are taught as a facilitation technique the PWA can use when they experience difficulty in everyday communication (see Zumbansen et al., 2014a, for an in-depth consideration of different treatment goals). This third approach is used in a French adaptation of MIT, thérapie mélodique et rythmée (TMR) (Van Eeckhout and Bhatt, 1984).
These different treatment goals are reflected in different outcome measures (Van der Meulen et al., 2016). For example, if MIT is being used to improve propositional language or as a compensatory technique to circumvent difficulties, outcomes should be evaluated with a measure that reflects improvement of functional communication. This may include analyses of language samples (using correct information units, CIUs, or similar, as in Schlaug et al., 2008) or self or caregiver report of functional changes. Alternatively or in addition, some studies have used standard language measures to evaluate improvement in naming, repetition, auditory comprehension, or other core language functions (e.g., Belin et al., 1996; Sparks et al., 1974). An important consideration is that standardized language assessments are typically developed to diagnose and classify aphasia rather than to document change, and they may not offer adequate sensitivity for this latter purpose. For studies evaluating MIT for palliative use, performance on trained phrases may be the primary outcome measure. Finally, some studies use measures at multiple levels – repetition of trained phrases, repetition of untrained phrases, and measures of functional language use (e.g., Curtis et al., 2020; Zipse et al., 2012). This last approach is the most thorough for documenting improvement at any level and evaluating generalization.
43.2.4 Treatment Theory: How Does It Work?
The developers of MIT proposed that exaggerated speech melody and left-hand tapping would engage the right hemisphere, since the right hemisphere was viewed as responsible for processing music and prosody, and since left-hand tapping requires activation of right-hemisphere motor areas. The right-hemisphere engagement would reduce reliance on the lesioned left hemisphere, but ultimately support recovery of left-hemisphere control of spoken language (Albert et al., 1973; Helm-Estabrooks and Albert, 2004; Sparks et al., 1974). This idea was bolstered by an early finding that good responders to MIT did not have right-hemisphere lesions, while poor responders sometimes did (Naeser and Helm-Estabrooks, 1985).
More recently, it has been debated quite extensively in the literature whether neuroplastic changes in the right or the left hemisphere are correlated with MIT-induced behavioral changes. There is evidence for both structural and functional changes in the right hemisphere of PWAs who have responded to MIT, which the researchers interpreted as lending support to the idea that MIT spurs increased right-hemisphere involvement in spoken language production (e.g., Schlaug et al., 2008; Tabei et al., 2016; Wan et al., 2014). However, other neuroimaging studies have found no increase in right-hemisphere activation and an increase in left-hemisphere activation, especially perilesional areas, during repetition of phrases using MIT techniques (Belin et al., 1996; Breier et al., 2010; Laine et al., 1994). This has been viewed as evidence that spared left-hemisphere tissue supports MIT’s treatment effect. A recent systematic review concluded that the neural underpinnings of MIT-based improvement are complex and varied across individuals (García-Casares et al., 2022). It seems likely that neuroplastic changes are determined in part by how much left hemisphere is spared. If enough left-hemisphere tissue remains, the perilesional tissue may support MIT-mediated recovery, while if the lesion is very large, homologous right-hemisphere areas may offer the only potential for recovery (Wan et al., 2014; Zumbansen et al., 2014a). (The idea that the neural basis of aphasia recovery depends on lesion size has previously been discussed in the context of other treatment approaches, e.g., Crosson et al., 2005.)
In addition to theories that address neuroplastic reorganization, other mechanisms for MIT-induced language recovery have been suggested (see Merrett et al., 2014, for an excellent review). These include activation of shared features of language and music, engagement of the mirror neuron system, and improvements to mood and motivation. These various theories are not necessarily distinct; for example, activating musical elements that overlap with aspects of language may drive neuroplastic change, recalling Luria’s idea of intersystemic reorganization (1970), where a damaged neural system is coupled with an intact one to promote improvement of the weaker system. The idea that MIT leverages prosody as a link between music and language is perhaps the most discussed theory, whether framed in terms of neurophysiology or cognitive models.
Within the field of rehabilitation more broadly, there have been efforts in the past decade to specify treatments in terms of their ingredients and how these ingredients effect change. Conceptual models of treatment theory and specification, and their application in clinical practice, are active areas of scholarship with implications for how clinicians conceive of, describe, apply, and measure the outcomes of interventions (Hart et al., 2019; Whyte et al., 2014; Zanca et al., 2019; also see Section 43.4.1). Accordingly, the following section considers the various treatment ingredients of MIT and how each of these may work to facilitate spoken-language production and promote aphasia recovery.
43.3 The Treatment Ingredients of MIT
Speculation about why and how MIT works has abounded because the treatment has many different elements, making multiple mechanisms of action possible. Some elements may work in parallel, synergistically, or even at odds with one another. Understanding these ingredients is of critical importance for tailoring the treatment to a particular individual. In the following sections, key treatment ingredients of MIT are discussed, including variations in how they are used and why they may work. Throughout, it is important to keep in mind the difference between the immediate effects of these techniques to promote language production in the moment versus generalization to spoken-language production beyond the treatment encounter – that is, facilitation effects versus treatment effects (Zumbansen et al., 2014a).
43.3.1 Rhythm
There is growing evidence that rhythm is a particularly important aspect of MIT. Using a beat-based rhythm as a facilitating technique, Stahl et al. (2011) concluded that PWAs do not benefit from singing more than from rhythmic speech during a repetition task, while Kershenbaum et al. (2019) found that rhythmic speech was more facilitating than singing in their sample of PWAs. Rhythm also seems to enhance treatment effects. In a scripted sentence-learning paradigm that shared some features of MIT, Quique et al. (2022) demonstrated that sentence learning was improved by the addition of rhythmic beats to the audio track (all conditions were spoken, with none sung or intoned). Studies comparing melodic (sung) treatment conditions to rhythmically spoken ones have found comparable gains in trained utterances across the two conditions (Stahl et al., 2013; Zumbansen et al., 2014b; also see Boucher et al., 2001, for a related finding). However, generalization to functional communication may be better with melodic than with rhythmically spoken treatment. Zumbansen et al. (2014b) used a within-subjects design to compare three treatment conditions: melodic (rhythmically intoned, with hand tapping), rhythmically spoken (with hand tapping), and normally spoken. All participants made significant gains on trained phrases with all three treatments, but generalization to functional communication was only found after the melodic treatment.
Of course, “rhythmic” production can be realized in different ways. Early versions of MIT used quarter and eighth notes, respectively, for stressed and unstressed syllables (Sparks, 2008). Later descriptions by the originators of MIT specified that productions should follow the “stress and rhythm patterns associated with normal speech” (Helm-Estabrooks and Albert, 2004, p. 224). TMR, a French version of MIT, uses two durations: longer for naturally stressed syllables and syllables of function words (which tend to be omitted in people with Broca’s aphasia), and shorter for all other syllables (Van Eeckhout and Bhatt, 1984; Zumbansen et al., 2014b). In contrast, some instantiations of MIT present all syllables with equal duration, lending the utterance a steady, metronomic quality (e.g., Curtis et al., 2020). Many studies, though, do not comment on the rhythm used. (See Chapters 30, 33, and 40 for a consideration of how the rhythmic class of a language affects spoken-language learning in speakers without aphasia.)
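To make the contrast between these rhythm schemes concrete, the following minimal sketch assigns a relative duration to each syllable of a phrase under three of the schemes described above. It is illustrative only: the `Syllable` class, the scheme names, the example phrase and its stress and function-word markings, and the 2:1 long-to-short ratio used for the TMR-style scheme are all assumptions introduced here, not details taken from any published protocol. The 2:1 ratio for quarter versus eighth notes follows from standard music notation.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    text: str
    stressed: bool
    functor: bool = False  # True if the syllable belongs to a function word


def relative_durations(syllables, scheme, long=1.0, short=0.5):
    """Return one relative duration (in beats) per syllable for a timing scheme."""
    if scheme == "quarter_eighth":
        # Stressed syllables get the long value, unstressed the short value.
        return [long if s.stressed else short for s in syllables]
    if scheme == "tmr":
        # TMR-style: long for stressed syllables AND function-word syllables.
        return [long if (s.stressed or s.functor) else short for s in syllables]
    if scheme == "equal":
        # Metronomic: every syllable has the same duration.
        return [long for _ in syllables]
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical target phrase with illustrative stress/functor markings.
phrase = [
    Syllable("I", stressed=False, functor=True),
    Syllable("want", stressed=True),
    Syllable("some", stressed=False, functor=True),
    Syllable("wa", stressed=True),
    Syllable("ter", stressed=False),
]

for scheme in ("quarter_eighth", "tmr", "equal"):
    print(scheme, relative_durations(phrase, scheme))
```

Under these assumptions, the function-word syllables are short in the quarter/eighth scheme but lengthened in the TMR-style scheme, which is the difference the text highlights.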
Even when slightly more complex rhythms are used, the utterances likely adhere to a metrical pattern because each target utterance is repeated multiple times, both within and across steps. For example, the PWA has multiple attempts to intone the utterance in unison with the clinician (unison step). If successful, the PWA moves on to the next step, where they intone in unison with the clinician, who “fades out” and leaves the PWA to complete the utterance on their own (unison fade; Helm-Estabrooks and Albert, 2004). Across these two steps, the target utterance is typically repeated several times. When a motor action is cyclically repeated, it tends to adhere to a hierarchical timing structure, creating metrical levels (Cummins and Port, 1998; Murton et al., 2017).
Rhythmic production, repetition, and the meter that emerges result in expectation. Indeed, London has defined meter as “a stable and recurring pattern of hierarchically structured temporal expectations” (2002, p. 529). Regular timing of stressed syllables, adhering to a metrical structure, helps listeners predict when these syllables will occur and direct their attention accordingly (Pitt and Samuel, 1990). This predictability makes it easier for listeners to entrain to the stimulus, aligning their motor actions – in this case, speech – with an external stimulus. In fact, speech stimuli that follow a simple metrical structure have been shown to facilitate accurate spoken-language production in people with aphasia and/or AOS (Aichert et al., 2019; Kershenbaum et al., 2024). At a neurophysiological level, rhythmically regular speech input may promote the entrainment of intrinsic neural oscillatory activity to the speech signal, allowing for the coupling of perception and production (Haegens and Zion Golumbic, 2018; Large and Jones, 1999; see Chapters 3, 5, and 6). This may help with the organization of motor commands, perhaps targeting AOS in particular. Notably, treatments using metronome pacing, hand tapping, and other rhythmic elements have been described for AOS (Brendel and Ziegler, 2008; Mauszycki and Wambaugh, 2008). A metrically regular context may also serve to prime words with lexical stress that fits the pattern, aiding retrieval of phonological word forms – an area of impairment across all subtypes of aphasia. This facilitation is analogous to how lyrics are better recalled when sung than when spoken (Kasdan and Kiran, 2018).
43.3.2 Rate
Rhythm is one way in which timing can be manipulated in MIT, and rate is another. The originators of MIT specified that a slow rate should be used (Helm-Estabrooks and Albert, 2004; Sparks, 2008). Singing naturally tends to slow the rate of articulation (Racette, 2006; Stahl and Kotz, 2014). Laughlin et al. (1979) compared the number of phrases correctly produced during MIT under three conditions: spoken syllables less than one second in duration, intoned syllables 1.5 seconds in duration, and intoned syllables two seconds in duration. The best results were observed with the slowest rate, a much slower rate than typically reported in MIT treatment studies. Rates reported in the literature include one second/syllable (Merrett et al., 2019), 750 ms/syllable (Curtis et al., 2020), and 600 or 1,200 ms/syllable for unstressed non-functor syllables versus stressed syllables or functors, respectively (Zumbansen et al., 2014b). Many studies, however, do not specify the rate used.
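Because the rates above are reported as durations per syllable, it can help to convert them to syllables per second and to see how long a short target phrase would last at each rate. The sketch below does this arithmetic for the fixed per-syllable durations mentioned in the text. The function names and the four-syllable example phrase are illustrative assumptions.

```python
def syllables_per_second(ms_per_syllable):
    """Convert a per-syllable duration in milliseconds to a rate in syllables/s."""
    return 1000.0 / ms_per_syllable


def phrase_duration_s(n_syllables, ms_per_syllable):
    """Total duration in seconds of a phrase produced at a constant syllable duration."""
    return n_syllables * ms_per_syllable / 1000.0


# Per-syllable durations taken from the studies cited in the text.
reported_ms = {
    "Laughlin et al. (1979), slowest condition": 2000,
    "Laughlin et al. (1979), middle condition": 1500,
    "Merrett et al. (2019)": 1000,
    "Curtis et al. (2020)": 750,
}

N_SYLLABLES = 4  # hypothetical four-syllable target phrase

for label, ms in reported_ms.items():
    rate = syllables_per_second(ms)
    total = phrase_duration_s(N_SYLLABLES, ms)
    print(f"{label}: {rate:.2f} syll/s, {N_SYLLABLES}-syllable phrase = {total:.1f} s")
```

For example, a four-syllable phrase lasts 3 seconds at 750 ms per syllable but 8 seconds at two seconds per syllable, which illustrates how much slower the best-performing condition in Laughlin et al. was than the rates more typically reported.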
A slow rate is helpful because it allows for more processing time for perceiving the model utterance, as well as for planning and executing articulatory commands (Stahl and Kotz, 2014). This extra time may be beneficial when the right hemisphere is more involved in controlling spoken-language production, since this hemisphere is thought to operate at a slower timescale than the left (Poeppel, 2003; Zatorre, 2001). A slow rate may affect both language processing and speech planning, though the latter may be particularly important: A number of studies that use a slow rate with good effects, either as a treatment or facilitation technique, have outcome measures that evaluate speech accuracy, such as percentage of syllables correct (Boucher et al., 2001; Kershenbaum et al., 2019; Laughlin et al., 1979; Stahl et al., 2011). Furthermore, along with a regular rhythm, a slowed rate has been shown to increase articulatory accuracy in people with AOS, who typically also have aphasia (Mauszycki and Wambaugh, 2008; Wambaugh and Martinez, 2000).
43.3.3 Tapping
Another timing-related element of MIT is tapping. In the standard MIT protocol, the clinician sits across from the PWA, with their right hand over the PWA’s left hand, so that the clinician can pick up the patient’s hand and tap it on the tabletop. Protocols differ in whether tapping occurs once for each syllable or only on stressed syllables (Helm-Estabrooks and Albert, 2004; Sparks and Holland, 1976). As the PWA advances through the MIT levels and learns the procedure, the clinician may discontinue guiding the patient’s tapping and simply monitor that they continue to tap correctly (Sparks, 2008). Other variations of MIT (e.g., TMR) allow for tapping with other parts of the body or do not include tapping at all (Zumbansen et al., 2014a). No-tapping variations of MIT were developed in response to observations of individual patients; there is preliminary evidence that tapping may not always be beneficial in MIT, and some PWAs respond better without it (Curtis et al., 2020; Hough, 2010). As with rate and rhythm, many studies do not provide a description of how tapping was implemented.
Tapping is thought to work by activating motor regions in the right hemisphere. In particular, the cortical representation of the articulators is near regions representing the hand. Tapping may prime a sensorimotor network that integrates perception and action and is useful for producing spoken language (Gentilucci and Dalla Volta, 2008; Norton et al., 2009; Schlaug et al., 2008). Related to this, tapping to an isochronous or otherwise predictable rhythm may act to promote pacing, shown to have benefits for motor speech disorders (Brendel and Ziegler, 2008). Furthermore, tapping can be viewed as a simplified gesture, specifically beat gesture, and gestural planning is closely related to prosodic planning (Rohrer et al., 2023). Emphasizing prosody with gesture may promote spoken-language production.
43.3.4 Pitch Modulation
Pitch modulation has often been viewed as the core feature of MIT (Norton et al., 2009). As noted above, though, the rhythmic elements of the treatment have been more strongly linked to the facilitation associated with MIT-based techniques. Pitch modulation has been used in various ways in different adaptations of MIT, with the most common versions using either two notes or simple melodies. Specified intervals for the two-note variations have included a minor third or a fourth (Norton et al., 2009; Zumbansen et al., 2014b). In English, stressed syllables are produced on the higher pitch and unstressed on the lower, while the dominant French adaptation of MIT, TMR, uses the higher pitch to highlight functor words (typically omitted by people with Broca’s aphasia) as well as stressed syllables (Albert et al., 1973; Van Eeckhout and Bhatt, 1984). Another key consideration related to pitch modulation is whether and how Sprechgesang (“speech song”) is implemented. This technique uses an exaggerated rhythm but speech-like intonation and is introduced in the later stages of MIT (Level III or IV, depending on the protocol used), as a way to transition from intoning to normal speech (Helm-Estabrooks and Albert, 2004; Sparks, 2008; Sparks and Holland, 1976). Some MIT treatment studies use Sprechgesang (Hough, 2010; Van der Meulen et al., 2014, 2016; Zhang et al., 2021) while others do not (Curtis et al., 2020; Wilson et al., 2006; Zumbansen et al., 2014b).
Like many of the other elements of MIT, pitch modulation was included to promote right-hemisphere involvement, since the right hemisphere was seen as dominant in processing music and prosody (Sparks et al., 1974). Singing has also been noted to slow articulatory rate, as mentioned above (Section 43.3.2), and promote constant voicing across syllables, both of which have been mentioned as potentially important (Helm-Estabrooks and Albert, 2004; Norton et al., 2009). These manipulations of rate and voicing may primarily affect motor speech rather than language. On the other hand, in modified MIT protocols that include more complex melodies and even instrumental accompaniment, the melody may serve as a mnemonic device to aid in retrieval of the phonological word forms, rather than to facilitate motor speech function (Baker, 2000).
43.3.5 Unison Production
Speaking along with another person or a recording, commonly referred to as unison or choral production, is a component of a number of aphasia treatments besides MIT, whether spoken or sung (Kershenbaum et al., 2024). It is also used to promote fluent, accurate productions in other clinical populations, including people who stutter and people with Parkinson’s disease (Juste et al., 2018; Kiefte and Armson, 2008; Ritto et al., 2016). The facilitating effect of unison production for PWAs has been documented across a few studies (Fridriksson et al., 2012; Kershenbaum et al., 2019, 2024; Racette, 2006). In the aphasia research literature more generally, unison production has been implemented in various ways. In some paradigms, PWAs listen to an utterance and then repeat it in unison (such that the target utterance is known in advance, e.g., Kershenbaum et al., 2019), while in others, the PWAs speak along with utterances that are at least initially unfamiliar (such that they are quickly repeating what they hear, i.e., shadowing; Fridriksson et al., 2012). Another important variable is whether the model utterance is audio-only or includes a view of the speaker’s mouth, as the latter can be facilitating (Fridriksson et al., 2012). MIT relies on repetition so that the target utterance is known, and the clinician typically sits opposite the PWA so their face is visible during articulation.
Explanations for why speaking in unison is helpful for PWAs have included perception–action interactions, possibly involving mirror neurons, and synchrony with a model (Racette, 2006; see Chapter 6). Cummins (2003) has examined synchronous speech in neurotypical adults rather extensively, noting how precisely speakers can align their speech with one another. He has highlighted entrainment of oscillatory neural activity as a potential mechanism for speech entrainment, though he allowed that models need to better account for the complex time structure of typical speech (Cummins, 2009, 2012). More recently, the ability of PWAs to synchronize their speech with a spoken model has been examined under different prosodic timing conditions: typical conversational timing, and beat-based metrical timing, where stressed syllables occur at regular intervals. Both PWAs and control speakers align their speech more precisely with the model in the metrical condition, and also produce a higher percentage of accurate syllables in this condition (Kershenbaum et al., 2024). The metrical condition, similar to MIT, uses a predictable timing structure that does not rely as heavily on linguistic knowledge (e.g., syntax). This simple, predictable timing structure may allow PWAs to achieve better sensorimotor synchrony with the spoken model and more fully benefit from unison production. More work remains to be done to understand entrainment as a potential facilitating mechanism for PWAs.
43.3.6 Treatment Intensity
A key feature of MIT, as originally conceived, was its intensity: Sparks (2008) stipulated that treatment should take place twice daily. Numerous treatment studies have used high-intensity treatment, for example 1.5 hours/day, five days/week, for approximately 15 weeks (Schlaug et al., 2008); 30 minutes/day, five days/week, for eight weeks (Zhang et al., 2021); and five hours/week for six weeks (Van der Meulen et al., 2014, 2016). Three one-hour sessions/week is a fairly common intensity in MIT treatment studies (Curtis et al., 2020; Hough, 2010; Zumbansen et al., 2014b). Even this is considerably greater than what is typically used in aphasia treatment in clinical practice, at least in the US, and raises the question of whether promising evidence for MIT is driven in part by the high intensities used in treatment studies (Cavanaugh et al., 2021). In fact, across two RCTs, one in the subacute and one in the chronic phase of recovery, intensity was predictive of treatment response to MIT even when age, aphasia severity, and a variety of baseline language measures were not, although time post-onset was also predictive in the subacute phase (Van der Meulen et al., 2014, 2016; although see also Zumbansen and Tremblay, 2019).
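The schedules cited above are easier to compare when converted to cumulative scheduled hours. The following is a minimal sketch of that arithmetic, not an analysis from any of the cited studies. The function name and the dictionary layout are assumptions introduced here for illustration. The figures are the scheduled hours reported above, with "approximately 15 weeks" treated as exactly 15.

```python
# Illustrative only: cumulative scheduled hours for the regimens cited above.
# The function name and dictionary layout are assumptions made for this sketch.


def total_hours(hours_per_week: float, weeks: float) -> float:
    """Return the cumulative scheduled treatment time in hours."""
    return hours_per_week * weeks


# (hours per week, number of weeks) for each cited schedule
regimens = {
    "Schlaug et al. (2008)": (1.5 * 5, 15),  # 1.5 h/day, 5 days/week, ~15 weeks
    "Zhang et al. (2021)": (0.5 * 5, 8),     # 30 min/day, 5 days/week, 8 weeks
    "Van der Meulen et al. (2014, 2016)": (5.0, 6),  # 5 h/week, 6 weeks
}

totals = {
    study: total_hours(hours_per_week, weeks)
    for study, (hours_per_week, weeks) in regimens.items()
}

if __name__ == "__main__":
    for study, hours in totals.items():
        print(f"{study}: {hours:g} scheduled hours in total")
```

Under these assumptions the three schedules come to 112.5, 20, and 30 hours, respectively, so studies described as high intensity can still differ several-fold in total dose. The three-sessions-per-week schedule is omitted because no duration is given for it here. These are scheduled totals and may differ from the treatment actually delivered.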
The importance of treatment intensity is one of the principles of neuroplasticity in rehabilitation, as described by Kleim and Jones (2008): Sufficient intensity is required to induce neuroplasticity, and adequate treatment duration is critical for behavioral change (“use it and improve it”).
In this regard, the highly facilitating nature of MIT is critical: It makes intensive practice of spoken language both possible and palatable for many people with severe aphasia. Clinical researchers have noted that patients often find MIT motivating and even enjoyable (Merrett et al., 2014; Sparks et al., 1974; Van der Meulen et al., 2012). For PWAs who struggle to say single words, the feeling of fluently intoning phrases is likely rewarding. Van der Meulen et al. (2016) noted that although aphasia treatment studies with higher intensity generally have greater attrition (Brady et al., 2016), no participants dropped out of their own RCT. In addition, the structure and repetition of MIT may make it suitable for home practice, a potentially important way to increase intensity in the face of limited provider availability and high treatment costs.
43.3.7 Gradual Progression
Another key feature of MIT is gradual progression through steps of increasing difficulty. Depending on the protocol used, MIT may include three or four levels of difficulty with progressively longer and more challenging utterances. Within each of the levels, the PWA proceeds through several steps for each utterance, with decreasing clinician support (Helm-Estabrooks and Albert, 2004; Sparks, 2008). For example, in Level I (or II, depending on the protocol), there are five steps: (1) the clinician hums the utterance and intones it twice while tapping the PWA’s hand; (2) the clinician and PWA intone the utterance in unison while tapping; (3) the clinician and PWA intone in unison while tapping, but the clinician “fades out” partway through the utterance; (4) the clinician intones the utterance and the PWA repeats it immediately, with tapping for both productions; (5) the clinician intones a question and the PWA responds with the target utterance, with hand tapping. Some studies have used simplified protocols with just one level (e.g., Curtis et al., 2020). However, gradual progression through steps with decreasing clinician support is central to MIT.
One major source of variation in treatment response is likely to be exactly how clinician support is implemented, what types of cues are used to facilitate utterance production, and when and how they are discontinued across the steps. For example, one MIT protocol specifies that in the unison-fade step (step 3 above), the clinician should not lip-synch the utterance after fading out (Helm-Estabrooks and Albert, 2004). This visual input can be very facilitating, so variation in how and whether it is used is likely quite important (Fridriksson et al., 2012). The same protocol specifies that utterances not produced after four attempts should be discontinued. In practice, clinicians often use facilitation techniques such as placement cues, other articulatory-kinematic approaches, or written letters or words to help the PWA segment the target if they initially struggle with an utterance (e.g., Dunham and Newhoff, 1979). Such approaches may particularly address AOS.
Gradual progression through steps of increasing difficulty may work through errorless learning, a cognitive rehabilitation approach that is widely used to support learning in people with amnesia and that has also been used to treat PWAs (Middleton and Schwartz, 2012). In this approach, successful retrieval of a target behavior is practiced repeatedly to promote implicit, Hebbian learning of a desired response. In addition, the progression of MIT provides a motivating context for many repetitions of each utterance. The importance of repetition is a principle of neuroplasticity in rehabilitation (Kleim and Jones, 2008).
43.3.8 Utterances Targeted in Treatment
43.3.8.1 Types of Utterances
MIT was rooted in clinical observations that PWAs can often sing familiar songs when unable to speak, but this benefit seems due in part to song familiarity (Hébert et al., 2003; Straube et al., 2008). A couple of studies have included utterance familiarity or automaticity as a variable (Hough, 2010; Stahl et al., 2011). Stahl et al. (2011) compared unison production of formulaic versus non-formulaic lyrics, where the former consisted of common social phrases judged to be automatized, while the latter were syntactically correct phrases likely to be novel. They found an advantage for formulaic lyrics, measured in percentage of syllables correct.
There is evidence that the production of formulaic utterances is supported by right-hemisphere and subcortical structures, making these utterances easier for people with left-hemisphere lesions and aphasia to produce (Van Lancker Sidtis, 2012; Van Lancker Sidtis and Postman, 2006). Stahl et al. (2013) speculated that intensive training of formulaic phrases may result in increased right-hemisphere activation, consistent with the idea that MIT works by shifting spoken-language processing to the right hemisphere (see Section 43.2.4). Another explanation for the advantage of formulaic utterances concerns motor speech: People with AOS typically show an advantage for automatic utterances relative to volitional ones (West et al., 2005).
43.3.8.2 Variety of Utterances
MIT was conceived of as a way to stimulate the language system, such that a wide variety of utterances should be used to avoid practice effects for particular utterances (Sparks, 2008). Indeed, studies that have documented generalization to functional communication have often used many different utterances in treatment (Curtis et al., 2020; Zipse et al., 2012; Zumbansen et al., 2014b). In aphasia treatment more generally, using a variety of treatment targets and tasks has been associated with improved generalization (Nadeau et al., 2008).
In some cases, MIT may be used to train a relatively small set of utterances in people with severe aphasia, not with the aim of stimulating the language system more generally but rather to build a limited repertoire of functional phrases. As noted above (see Section 43.2.3), Zumbansen et al. (2014a) have called this “palliative” use. In essence, the goal is to select utterances and train them until they become formulaic. This is more of a compensatory strategy than a true rehabilitative one. In this case, the treatment rationale would not justify using a wide variety of utterances in treatment.
43.4 Putting the Ingredients Together
MIT is a complex treatment with many potential active ingredients, each of which may interact with others. For example, unison production may be most facilitating at a slow, metrically regular rate that allows the PWA to more easily align their speech with the spoken model. The multifaceted nature of MIT can offer multiple opportunities for a mechanism to work: The slow rate, melodic line, formulaic phrases, and left-hand tapping may all increase engagement of right-hemisphere networks. In addition, some of the many ingredients may be able to work in various ways, potentially activating left-hemisphere perilesional tissue or right-hemisphere regions homologous to left-hemisphere speech and language areas. This may increase the potential that the treatment benefits a given PWA, regardless of whether they have a smaller left-hemisphere lesion with the potential for recovery mediated by left-hemisphere perilesional tissue, or whether the lesion is so large that any recovery must rely on right-hemisphere areas (Merrett et al., 2014).
MIT may be especially powerful due to its ability to simultaneously address aphasia and AOS in an integrated way. The criteria for MIT candidacy describe severe Broca’s aphasia, which commonly co-occurs with AOS. These two conditions may interact. Failure to retrieve a phonological word form (an aphasic impairment) obviously affects the ability to prepare a speech motor plan for that word. However, once the phonological word form is activated – perhaps more weakly or intermittently than in a speaker without aphasia – any delay or disruption in motor speech planning may result in a missed opportunity to take advantage of the instance of successful lexical retrieval.
43.4.1 MIT as a Framework: What Works for Whom?
What works best on average is not necessarily what works best for a particular individual, especially in a heterogeneous population such as PWAs (even within the subpopulation of individuals who meet the criteria for MIT; see Section 43.2.2). In providing the first detailed protocol for MIT, the originators noted this came with the risk that it would be followed too rigidly, and they encouraged clinicians to modify the protocol to meet each individual’s needs, stating, “The clinical awareness and skill of the person providing clinical service should influence the form of that service” (Sparks and Holland, 1976, p. 288). Along these lines, Merrett et al. (2019) note the tension between standardization and customization of a treatment approach, and advocate for “standardized customization” (p. 431). MIT will likely be most powerful if used as a framework, with specific ingredients selected and adjusted for each individual.
Characterizing treatments is complex in rehabilitation more generally. Interventions are behavioral, relying heavily on the interaction between patients and providers, and are typically focused on increasing function and life participation, which results in widely varied goals based on patients’ needs. This creates challenges for clinical effectiveness research, practice, and training (Hart et al., 2019). To define rehabilitation interventions in a way that allows for a balance of customization (tailoring the treatment to the patient’s needs) and standardization (following a fixed, replicable protocol), a taxonomy is needed (Hart et al., 2019). While a variety of different frameworks have been proposed, the rehabilitation treatment specification system (RTSS) offers the advantage of defining interventions in the context of treatment theory: Which aspect of function is the treatment targeting, which mechanism(s) of action can effect this change, and which treatment ingredients will be used (Van Stan et al., 2021; Whyte et al., 2014)? Applying the RTSS to MIT is a promising way to organize the evidence pertaining to each ingredient, and highlight gaps for future research to address.
43.4.2 Research Design Considerations
Reviews of the evidence for MIT’s effectiveness often suggest that more and larger-scale RCTs are needed. Before undertaking such trials, critical individual factors that affect treatment responsiveness need to be better understood. Researchers can take advantage of the heterogeneity inherent to the population of PWAs by using single-subject and within-group research designs to understand varied treatment responses, relate these to individual factors, and explore mechanisms of action (Merrett et al., 2019; Thompson, 2006). MIT treatment studies often describe the treatment protocol in a very cursory way, particularly in studies that include an imaging component. This makes it difficult to understand how the treatment worked. Treatment studies must carefully specify which ingredients are used and how they are implemented so that work can be replicated and variation across studies can be accounted for in later meta-analyses. Meta-analyses using individual participant data (IPD) are a useful means for combining data across studies while offering the flexibility to account for variation in participants and treatment factors (Popescu et al., 2022; Zumbansen and Tremblay, 2019).
43.5 Conclusions
The facilitating techniques used in MIT can be striking in their ability to promote spoken-language production in people with severe aphasia. As a result, MIT has attracted considerable attention from clinical researchers. The numerous treatment ingredients incorporated into MIT make it a potentially flexible and powerful approach, but the heterogeneity inherent in the aphasia population, the range of different MIT protocols used, and the lack of specificity regarding how they are implemented have resulted in relatively weak evidence regarding which components work best for whom. A treatment taxonomy that specifies treatment targets, ingredients, and their mechanisms of action is a promising tool to better investigate MIT and implement it in clinical practice.
Summary
MIT is a music-based aphasia treatment with numerous potentially active treatment ingredients. Evidence suggests the rhythm and timing of the utterances used, along with unison production, are particularly important. A treatment taxonomy specifying treatment targets, ingredients, and mechanisms of action is a promising tool to understand MIT.
Implications
Evidence from studies of MIT suggests that the rhythm and timing aspects of spoken language can be clinically important variables. A slow rate and regular rhythm facilitate spoken-language production in people with nonfluent aphasia. These manipulations may support unison production, another component of MIT.
Gains
Understanding how, why, and for whom MIT works provides insight into how language interacts with motor speech planning. Neuroimaging work adds information about the brain regions and neural circuits capable of supporting these functions.