The Physiology of Rhythm

doi:10.1017/9781009295888.002

Section 1 - The Physiology of Rhythm

Published online by Cambridge University Press: 23 April 2026

Edited by

Lars Meyer and

Antje Strauss

Show author details

Lars Meyer: Affiliation:
Max Planck Institute for Human Cognitive and Brain Sciences
Antje Strauss: Affiliation:
University of Konstanz

Book contents

Information

Type: Chapter
Information: Rhythms of Speech and Language
Physiology, Cognition, Culture
, pp. 1 - 128

DOI: https://doi.org/10.1017/9781009295888.002 [Opens in a new window]

Publisher: Cambridge University Press

Print publication year: 2026
Creative Commons: This content is Open Access and distributed under the terms of the Creative Commons Attribution licence CC-BY-NC 4.0 https://creativecommons.org/cclicenses/

Section 1 The Physiology of Rhythm

1 Body-Grounded Speech Rhythm: Ubiquitous Interactions between Speech, Breathing, and Limb Movements

Be thou assured, if words be made of breath,

And breath of life, I have no life to breath

What thou hast said to me.

William Shakespeare, Hamlet

1.1 Introduction

Have you ever wondered, if we didn’t need to inhale and exhale regularly, would we speak without pausing, constructing sentences of infinite length? Would the variations in speech amplitude and fundamental frequency resemble what we observe today? Might our conversational turns become more extended? In exploring these questions, any speaker is likely to acknowledge that their manner of speaking, particularly the rhythmic organization of speech and conversational turns, is significantly influenced by breathing. As briefly outlined in Section 1.2, respiration is a complex physiological system greatly influenced by behavioral and environmental factors. Acquiring the skill of speaking also entails acquiring a particular control over breathing. The rhythm of breathing is significantly molded by speech production, yet the imperative to breathe also exerts a profound influence on speech rhythm at different levels.

The complexity of the influence of breathing on speech becomes even more apparent when one speaks while concurrently performing another motor activity. As indicated by previous research, limb motion may have a specific impact on real-time spoken language. Section 1.3 delves into the repercussions of two distinct forms of limb motion that are prevalent during spoken communication. First, numerous studies have underscored the significance of co-speech gestures in spoken communication, and various pieces of evidence highlight the structuring or influence of these gestures on speech rhythm. Second, speech often occurs during various noncommunicative activities that involve motions of our hands and/or other parts of our body. As such, speech is part of a dynamic sensorimotor system working as a whole. Interactions among various processes occur at neurophysiological, biomechanical, and thus behavioral levels, resulting in coordinated temporal patterns.

Sections 1.4 and 1.5 address the tripartite relationship between speech, breathing, and limb movements. Breathing is indeed a central aspect of limb motion, speech, and, more recently, cognition research. The reciprocal influences between limbs and speech may occur through the respiratory system, as suggested recently. The connections between limb motion’s motor control, physiology, and spoken language present new challenges and perspectives that will also be discussed.

1.2 Speech Breathing: Where Linguistic and Physiological Rhythms Meet

1.2.1 A Brief Description of the Breathing System

Breathing is a gas exchange mechanism between the air and the bloodstream (oxygen in and carbon dioxide out) through a cyclical series of inhalations and exhalations. It is supported by a physiological system involving peripheral organs and a complex neural control that interacts with other behavioral functions (Hoit et al., Reference Hoit, Weismer and Story2021).

Gas exchanges occur in the lungs within the thoracic cavity. Inhalation involves the contraction of intercostal muscles and the diaphragm, expanding lung volume (Figure 1.1A, left). As the lung volume increases, keeping the air quantity constant, negative pressure in the rib cage draws air into the lungs. At rest, exhalation is initiated passively as the relaxation of external intercostal muscles and the diaphragm leads to a reduction in lung volume. During active processes such as speech, additional muscles, including internal intercostal and abdominal muscles, contribute to controlled air expulsion. The larynx, comprising the epiglottis and vocal cords, plays a predominant role in controlling the flow of exhaled air during speech (Hoit et al., Reference Hoit, Weismer and Story2021).

Figure 1.1

Illustration of breathing process and control

(A)

Mechanism of inspiration and expiration phases;

An illustration of the mechanism of breathing in two side by side panels. See long description.

Figure 1.1(A) Long description

Part A: Top. 2 line drawings of the human torso are labeled Inspiration and Expiration, respectively. Left. The labels include thoracic cavity expands, external intercostal muscle contracts and diaphragm contracts. Right. The labels include Thoracic cavity reduces, External intercostal muscles and Diaphragm relax. Bottom: A line graph of displacement versus time plots two fluctuating lines for rest and read speech.

(B)

top: Centers of the brain involved in breathing control; (B) bottom: Example of rib cage breathing movement at rest (tidal breathing) and speech breathing (read speech) monitored using inductive plethysmography.

Figure 1.1(B) Long description

Part B: An illustration of a human brain with a rectangular box marking the medulla and pons. The marked region is expanded below to show Pneumotaxic center, Apneustic center, Pontine, respiratory center, Dorsal respirator, Pons, Ventral respiratory center, and medulla. Three arrows originate from the ventral respiratory center and point at the illustrations, including accessory respiratory muscles, internal intercostal muscle and external intercostal muscle and diaphragm.

Figures 1.1A and B adapted from openstax.org under the Creative Commons Attribution 4.0 International License. Legends have been slightly modified for readability reasons and for consistency with the text.

The control of breathing is influenced by various nervous system structures (Shea, Reference Shea1996; Hoit et al., Reference Hoit, Weismer and Story2021) and governed by metabolic and behavioral needs. Under automatic control, a central pattern generator (CPG) in the brainstem (Figure 1.1B) regulates rhythmic inhalation and exhalation patterns, involving the medullary respiratory center (MRC) and the pontine respiratory center (PRC). During inhalation, the MRC’s dorsal and ventral respiratory centers stimulate the contraction of the diaphragm and of the external intercostal muscles. The ventral center also inhibits muscles for exhalation. Both centers receive information from the PRC that comprises the pneumotaxic and apneustic centers. The PRC sends inhibitory and stimulatory signals to the MRC, regulating the duration and intensity of breathing phases. The CPG communicates with muscles through spinal cord phrenic, intercostal, and abdominal motoneurons. Cortical centers controlling voluntary breathing send messages to the CPG that also receives feedback from peripheral structures, central chemoreceptors, and lung stretch receptors, impacting the regulation of respiration (Del Negro et al., Reference Del Negro, Funk and Feldman2018; Ben-Tal et al., Reference Ben-Tal, Wang and Leite2019).

Subcortical structures and cortical areas associated with various behaviors also significantly impact respiration regulation in humans, rendering breathing sensitive and adaptive to sensory stimuli, emotional responses, and changes in cognitive or physical activity levels. Breathing adapts particularly to behaviors requiring airflow, such as speech sound production, and those increasing muscle oxygenation needs, such as limb movements. The unique link between speech and breathing arises from the animated structures of the vocal tract, shaping inspiratory and respiratory flows into articulated sounds.

1.2.2 Insights into Recording and Analyzing Speech Breathing

Breathing can be monitored using different devices, such as the pneumotachograph, which directly measures the airflow in the respiratory system. Pneumotachographs are commonly employed in medical and physiological research using face masks and have been adapted to record speech breathing (Rothenberg, Reference Rothenberg1977; Ghio and Teston, Reference Ghio and Teston2004). However, speaking with a facial mask is unnatural and alters speech articulation as well as breathing (Shea, Reference Shea1996). Alternative methods, such as inductance plethysmography, record variations in thoracic and abdominal volumes thanks to elastic straps with insulated wires around the speaker’s abdomen and chest. During inhalation and exhalation, the cross-sectional areas of the rib cage and abdomen expand and compress, altering the self-inductance of the coils and the frequency of their oscillation, which is converted into a digital waveform. The shape of the waveform is proportional to the inhaled or exhaled breath volume (see Figure 1.1B). Despite its sensitivity to motion artifacts, inductance plethysmography is suitable for estimating breathing in various tasks, including speech (Caretti et al., Reference Caretti, Pullen, Premo and Kuhlmann1994; Clarenbach et al., Reference Clarenbach, Senn, Brack, Kohler and Bloch2005). It provides displacement data over time and allows the extraction of parameters such as respiratory rate and amplitude. Calibration is necessary for converting amplitude into volumetric units (Banzett et al., Reference Banzett, Mahan, Garner, Brughera and Loring1995; McKenna and Huber, Reference McKenna and Huber2019). The symmetry between inhalation and exhalation durations is a crucial parameter for studying speech breathing, characterized by unique specificities compared to silent breathing at rest or in different activities (Fuchs and Rochet-Capellan, Reference Fuchs and Rochet-Capellan2021).

1.2.3 Speech Breathing as a Multidetermined Rhythm

During speech, the regulation of breathing is intertwined with the planning of speech motor functions, involving a dynamic interaction between the forebrain, particularly cortical structures for speech production, and subcortical structures (Fuchs and Rochet-Capellan, Reference Fuchs and Rochet-Capellan2021). This interaction results in significant alterations in breathing patterns, notably a pronounced asymmetry between rapid inhalation and slow exhalation during speech (see Figure 1.1B). Variability in speech breathing cycles, in terms of duration and volume of air inhaled and exhaled, is influenced by linguistic, cognitive, and interactive factors (Conrad and Schönle, Reference Conrad and Schönle1979; McFarland, Reference McFarland2001; Fuchs and Rochet-Capellan, Reference Fuchs and Rochet-Capellan2021).

For example, in reading and spontaneous speech, inhalation duration and depth correlate with the upcoming utterance’s length, although correlations are smaller in spontaneous speech (Sperry and Klich, Reference Sperry and Klich1992; Winkworth et al., Reference Winkworth, Davis, Ellis and Adams1994, Reference Winkworth, Davis, Adams and Ellis1995; Rochet-Capellan and Fuchs, Reference Rochet-Capellan and Fuchs2013). The flexibility of breathing in adapting to speech demands allows for variability in extemporaneous speech, enabling the expression of communicative cues such as F0 modulation (Fuchs et al., Reference Fuchs, Reichel and Rochet-Capellan2015).

During speech, inhalation phases are coordinated with syntax. Readers tend to breathe predominantly at syntactic boundaries, and the amplitude of the breathing cycle correlates with the type of boundary, such as greater amplitude before a new paragraph than before a new sentence in the same paragraph (Conrad et al., Reference Conrad, Thalacker and Schönle1983). As speech rate increases, the frequency of breathing pauses decreases, defining a “syntactic tempo” observed in both text reading and spontaneous speech (Grosjean and Collins, Reference Grosjean and Collins1979; Rochet-Capellan and Fuchs, Reference Rochet-Capellan and Fuchs2013; Werner et al., Reference Werner, Trouvain and Möbius2022).

In conversation, breathing becomes intricately linked to turn-taking, with partners displaying coordinated breathing at turn-taking events and the shape of breathing cycles specific to conversational events (McFarland, Reference McFarland2001; Rochet-Capellan and Fuchs, Reference Rochet-Capellan and Fuchs2014). Holding one’s breath in a silent breathing cycle may signal intentions related to turn-taking, influencing the perception of speech pauses (Wlodarczak and Heldner, Reference Wlodarczak and Heldner2020). Ultimately, the presence of inbreath noises significantly influences the perception of speech pauses (MacIntyre and Scott, Reference MacIntyre and Scott2022). These noises are influenced by the configuration of the vocal tract, suggesting that breathing may acoustically convey specific information that merits further investigation (Werner et al., Reference Werner, Fuchs and Trouvain2023). This critical information can influence the behavior of a dialogue partner. Moreover, breathing interactions between communicative partners appear to develop early, as observed in mother–infant interactions (McFarland et al., Reference McFarland, Fortin and Polka2020).

The adaptability of the breathing system is crucial to support spoken communication. However, as discussed in Fuchs and Rochet-Capellan (Reference Fuchs and Rochet-Capellan2021), spoken communication is also constrained by the intricate interplay between breathing, cognitive abilities, and lung volume capacity. Consequently, speech breathing is highly individualized, exhibiting significant variations among speakers.

1.2.4 Speech Breathing as a Speaker-Specific Tempo

The exploration of individual breathing characteristics began with an examination of tidal breathing. Dejours noted variations in ventilation characteristics among participants even under identical conditions, leading to the concept of “ventilatory personality” (Dejours, Reference Dejours1966; Shea and Guz, Reference Shea and Guz1992). Consistency in breathing patterns over time was confirmed by Bennchetrit et al. (Reference Bennchetrit, Shea and Dinh1989), with monozygotic twins displaying more similar tidal breathing than random or dizygotic pairs, hinting at potential genetic or physiological influences (Kawakami et al., Reference Kawakami, Yamamoto, Yoshikawa and Shida1984; Shea et al., Reference Shea, Horner, Benchetrit and Guz1990). However, task-specific variations in breathing individuality exist, with stability observed within tasks such as tidal breathing or exercise but not consistently between them (Eisele et al., Reference Eisele, Wuyam and Savourey1992; Besleaga et al., Reference Besleaga, Blum and Briot2016).

Speech breathing is influenced by various speaker-specific parameters. Aging, for example, leads to declines in breathing capacities, affecting speech production, with older adults requiring higher lung volumes for speech initiation (Hoit and Hixon, Reference Hoit and Hixon1987; Sperry and Klich, Reference Sperry and Klich1992; Huber, Reference Huber2008). Factors such as weight, stature, age, sex, developmental processes, or diseases such as Parkinson’s or asthma also impact speech breathing (Hoit and Hixon, Reference Hoit and Hixon1986, Reference Hoit and Hixon1987; Loudon et al., Reference Loudon, Lee and Holcomb1988; Hoit et al., Reference Hoit, Hixon, Watson and Morgan1990; Solomon and Hixon, Reference Solomon and Hixon1993; Boucher and Lalonde, Reference Boucher and Lalonde2015), suggesting that breathing is determined by speaker-specific factors related to both bodily and cognitive constraints.

Despite ample evidence of speaker-specific breathing profiles during speech, recent work by Serré et al. (2020) also highlights individual consistency in speech breathing. They found that while the period of the speech breathing cycle differs among participants, it remains remarkably consistent for the same individual, even during light exercise. This underscores the high variability and specificity of speech breathing in each speaker.

Speech rhythm is also influenced by limb movements co-occurring when speaking. This relation between speech and limb movements may involve breathing, as discussed in Section 1.3.

1.3 The Influence of Limb Movements on Speech Rhythm

Limb movements are ubiquitous in spoken communication. Communicative limb movements, especially those executed with the arms and hands, are intrinsic to human conversation. Gestures accompanying speech, in particular hand gestures, have been extensively studied and exhibit a multilayered relationship with speech (see also Chapter 4 in this volume). At the level of speech production, representational gestures can help lexical access and speech fluency, thereby influencing speech rate and acting as essential pacing cues during conversations. At the sensorimotor level, the coordination between upper-body movements and vocalization is a key step of language development and still observed in adulthood.

1.3.1 Gestures as an Essential Component of Speech Production

Gestures, and especially hand gestures, appear to serve different purposes in spoken communication. The multitude of connections between gestures and speech has sparked numerous studies and hypotheses exploring the interplay between these two systems and their roles in communication and language.

1.3.1.1 Effect of Gestures on Speech Conceptualization

Speech planning and speech fluency are tightly linked. Numerous studies emphasize the supportive role of gestures in accessing lexical items or concepts during speech production (Graham and Heywood, Reference Graham and Heywood1975; Morsella and Krauss, Reference Morsella and Krauss2004; Hostetter et al., Reference Hostetter, Alibali and Kita2007; Hoetjes et al., Reference Hoetjes, Krahmer and Swerts2014; Cravotta et al., Reference Cravotta, Busà and Prieto2019). These studies looked at the effect of gesturing (compared to not being able to gesture) on lexical content, but also on the number of pauses, speech rate, and prosody. The lexical retrieval hypothesis, supported by Krauss (Reference Krauss1998) and Ruiter (Reference Ruiter1998), posits that gestures play a crucial role in lexical access, facilitating access to concepts and words in memory. This theory echoes studies highlighting the neural connections between semantic comprehension of actions and corresponding sensorimotor circuits (Pulvermüller and Fadiga, Reference Pulvermüller and Fadiga2010; Bidet-Ildei et al., Reference Bidet-Ildei, Beauprez and Badets2020). Besides aiding memory retrieval, gestures help organize information for verbal encoding, especially spatio-motoric details (Alibali et al., Reference Alibali, Kita and Young2000; Kita et al., Reference Kita, Alibali and Chu2017). Gestures support the activation of spatio-motoric information and problem-solving in both concrete and abstract thinking, potentially influencing the course of speaking, facilitating novel ideas and affecting discourse orientation. Through this process, gestures influence the packaging of verbal information into clauses. For example, when describing the manner (e.g., rolling) and path (e.g., down) of a motion with two separate gestures, participants tend to produce a two-clause verbal description, while a single gesture leads to a single-clause verbal description (see Kita et al., Reference Kita, Alibali and Chu2017 for references). This role attributed to gestures may have correlates in speech rhythms that require further investigation.

Various hypotheses surround the roles of gestures – whether they serve conceptual or communicative purposes – prompted by the question of whether speech and gestures are distinct or stem from a unified internal representation. Beyond their communicative role, gestures serve pragmatic functions, supporting interpersonal coordination in spoken conversations.

1.3.1.2 Role of Gestures in Conversational Stream

The role of nonverbal communication in dialogue is briefly highlighted here, emphasizing gestures’ contribution to the rhythm of interpersonal verbal exchanges. Gestures serve as visual cues in conversations and convey a variety of information supporting interpersonal convergence and coordination (Holler, Reference Holler2022). They often precede the verbal message or convey additional information without interrupting speech. Specifically, gestures are integral to backchannels and turn-taking processes (Wagner et al., Reference Wagner, Malisz and Kopp2014), which are indispensable for dialogue fluidity and pace. Gestures, and nonverbal communication in general, offer real-time feedback to speakers, aiding in understanding listener processing without disrupting ongoing discourse. This mechanism allows speakers to assess communication success, aiding anticipation of semantic shifts, topic changes, and turn-taking events (Holler and Levinson, Reference Holler and Levinson2019). For instance, Ter Bekke et al. (Reference Ter Bekke, Drijvers and Holler2024) found quicker responses to questions accompanied by iconic gestures. Electroencephalographic studies show that brain responses in dyadic interactions are modulated by the processing of nonverbal cues (Zhang et al., Reference Zhang, Frassinelli, Tuomainen, Skipper and Vigliocco2021). More specifically, beat gestures facilitate semantic processing (Wang and Chu, Reference Wang and Chu2013) and syntactic parsing (Obermeier and Gunter, Reference Obermeier and Gunter2014; Biau et al., Reference Biau, Fromont and Soto‐Faraco2018).

Co-speech gestures significantly influence the timing of spoken communication, affecting thinking, speech production, and conversational tempo. Coordination between speech and co-speech gestures is proposed to be rooted in the conceptual phase, relying on common meaning (McNeill, Reference McNeill1992). Additionally, sensorimotor connections between articulatory and limb control, especially hand and mouth, may underpin coordination between speech and co-speech gestures.

1.3.2 Speech and Co-speech Gestures Acting in Synchrony

Regardless of speech content, sensorimotor connections are at play in the coordination between speech gestures and co-speech gestures. These connections emerge early during development, in the course of speech acquisition. Similar coordination patterns are observed between noncommunicative vocalization and limb movements during adulthood, suggesting that the hand and the mouth share common control mechanisms.

1.3.2.1 Manual and Vocal Development towards Spoken Language

Low-level motor links appear early in ontogenesis and may establish the connection between speech and co-speech gestures (see also Chapter 38). Fetal initiation of facial touching, particularly around the mouth, begins between 12 and 15 weeks of gestation (Gallagher, Reference Gallagher, Bermúdez, Eilan and Marcel1995; Fagard, Reference Fagard2013). At birth, about one-third of hand movements result in hand–mouth contact, with coordination advancing rapidly, especially when the child is hungry (Gallagher, Reference Gallagher, Bermúdez, Eilan and Marcel1995). Rhythmic coupling between limb motion and vocal tract control for vocalization typically begins around six months, serving as a basis for hand-to-mouth coordination development (Iverson and Thelen, Reference Iverson and Thelen1999). In particular, a peak in the frequency of occurrence of rhythmic hand activities, such as rattle shaking, is observed at the onset of canonical babbling (see Iverson and Wozniak, Reference Iverson and Wozniak2007). These concurrent activities also influence each other, with babbling accompanied by hand movements displaying longer syllable lengths than babbling without hand movements (Ejiri and Masataka, Reference Ejiri and Masataka1999). Repetitive entrainments between hand movements and the vocal tract may set the stage for the speech–gesture system’s emergence in the last month of the first year, featuring more controlled gestures such as pointing, followed by spoken words a few weeks later. Gestures play an important role in both lexical and syntactic development and are combined with speech to convey meaning through more and more complex morphosyntax (Iverson and Goldin-Meadow, Reference Iverson and Goldin-Meadow2005; Özçalışkan and Goldin-Meadow, Reference Özçalışkan and Goldin-Meadow2005). Low-level sensorimotor coupling persists in adults, providing insights into the enduring link between co-speech gestures and speech in communication.

1.3.2.2 Speech–Gesture Synchrony: A Somatosensory Phenomenon?

Pointing gestures have been a focal point in the speech–gesture coordination investigation (Levelt et al., Reference Levelt, Richardson and La Heij1985; Rochet-Capellan et al., Reference Rochet-Capellan, Laboissière, Galván and Schwartz2008; Chu and Hagoort, Reference Chu and Hagoort2014). Bidirectional adaptations between speech and pointing gestures have been observed during perturbation of either gestures or speech, preserving their synchronization (Chu and Hagoort, Reference Chu and Hagoort2014). More specifically, Rochet-Capellan et al. (Reference Rochet-Capellan, Laboissière, Galván and Schwartz2008) found that mutual adaptation occurs between speech and pointing gestures, such as to synchronize the lexical stress with the gesture apex and pointing phase. Another type of gestures related to speech prosody are beat gestures. Beat gestures are rhythmic, baton-like gestures that have been shown to modulate prosodic prominence (Krahmer and Swerts, Reference Krahmer and Swerts2007; Pouw et al., Reference Pouw, Harrison and Dixon2020a; see also Section 1.4 of this chapter).

In spontaneous narration, McNeill (Reference McNeill1992) found gestures and speech synchrony to be unaffected by delayed auditory feedback (DAF) but disrupted in scripted situations. These findings align with the Growth Point hypothesis suggesting that synchronization is rooted in the conceptual phase. According to Pouw and Dixon (Reference Pouw and Dixon2019), the coupling between gestures and speech is even more stable under DAF conditions, which suggests that “gesture with its own intrinsic dynamics (and thus different entrainment to DAF) can be utilized to resist DAF in speech production” (p. 27).

Coordination patterns between speech and hand motion have also been found through noncommunicative movements. Studies of spatiotemporal coupling between vocalizations and finger tapping (Kelso et al., Reference Kelso, Tuller, Harris and MacNeilage1983; Parrell et al., Reference Parrell, Goldstein, Lee and Byrd2014; Zelic et al., Reference Zelic, Kim and Davis2015) have found some evidence of hand and mouth being controlled as a single coordinative structure. Their results suggest that the coupling between prosody and hand movements may rely on broad motor control and somatosensory feedback not specific to speech, as it persists during silent speech. Finally, the hand and the mouth seem to be controlled by a common motor command (Gentilucci and Volta, Reference Gentilucci and Volta2008): during the co-occurrence of object picking and vocalizing, the lip aperture and the first formant increase with object size. Further support for this hypothesis comes from neurophysiological findings about Broca’s area, believed to exert control over both the mouth and the hand (Fadiga et al., Reference Fadiga, Craighero and D’Ausilio2009). The sensorimotor connections and common rhythmic patterns between speech and limb movements have been shown to play a key role in speech rehabilitation for patients with Parkinson’s disease and patients with stuttering symptoms (see also Chapter 45).

However, the generalization of synchronization patterns of natural speech with hand movements remains incompletely understood (Parrell et al., Reference Parrell, Goldstein, Lee and Byrd2014). While cognitive processes related to spoken language or motor-level origins for speech–limb coordination are open questions, recent exploration suggests breathing may play a significant role in this coordination.

1.4 Breathing: A Bridge between Speech and Limb Movements

Section 1.2 of this chapter delved into the dual nature of breathing, serving both automatic and voluntary functions, and highlighted its adaptability to diverse mental and physical states. The coordination of breathing with speech involves balancing linguistic, communicative, and physiological factors, where ventilation constraints shape spoken communication rhythms. Simultaneous speech and limb movements introduce complexity, prompting adjustments in breathing and vocalizations. This complexity is evident in physical exercise studies exploring the interplay between speech, breathing, and limb movement. Breathing serves as a shared resource linking speech and body motion, motivating comprehensive empirical and theoretical approaches to understand their interaction. The intricate relationship between speech, breathing, and limb movements fosters various research perspectives, which are briefly outlined in this last section.

1.4.1 Talking and Breathing during Physical Effort

Engaging in physical exercise imposes constraints on breathing via limb movements, creating a unique context for exploring the interaction between speech breathing and limb actions. In the absence of speech during exercise, the breathing cycle is limited by the oxygen requirements of the muscles. The respiratory system dynamically adjusts to meet the escalating demand for oxygen, revealing an intricate interplay between central and peripheral mechanisms (see Shevtsova et al., Reference Shevtsova, Marchenko and Bezdudnaya2019). Behaviorally, the initiation of exercise results in a swift elevation of respiration frequency and tidal volume (Whipp et al., Reference Whipp, Ward, Lamarra, Davis and Wasserman1982). An intriguing question arises: Does this observed pattern persist when individuals converse during exercise?

1.4.1.1 Effect of Exercise on Speech Breathing and Pauses

When speaking during physical exercise, speech breathing must accommodate the heightened oxygenation needs of the body. Unlike speaking at rest, speaking during physical activity involves shorter exhalations, increased airflow during exhaling, and longer, deeper inhalations (Hixon et al., Reference Hixon, Goldman and Mead1973, Reference Hixon, Mead and Goldman1976). Consequently, the syntactic structure of speech undergoes modifications. Baker et al. (Reference Baker, Hipp and Alessio2008) examined the impact of different physical effort intensities in cycling on a reading speech task. They observed an increase in the number of non-syntactic pauses, a reduction in syllable count within breath groups, and an increase in articulation rate. Fuchs et al. (Reference Fuchs, Reichel and Rochet-Capellan2015) found that while engaged in semi-spontaneous speech, individuals exhibited a heightened frequency of breathing when cycling compared to speaking alone. Moreover, this trend became more pronounced with an increased level of biking effort. Trouvain and Truong (Reference Trouvain and Truong2015) investigated reading speech before and after volitional exhaustion on a treadmill. They found that both the duration and the intensity of inbreath segments increase from before to after, and a tendency to observe a higher count of inbreath tokens per second in the post-phase compared to the pre-phase.

1.4.1.2 Effect of Speech on Exercise Breathing

Despite changes in speech organization and in particular pauses induced by physical effort, there is an adaptation in breathing patterns to preserve speech quality. Interestingly, the elevation in ventilation during vocalization is less noticeable, and the respiratory frequency is lower, in comparison to exercising without speaking (Doust and Patrick, Reference Doust and Patrick1981). However, there is a limit to sustaining speech quality when physiological demands reach a certain threshold. In sports studies, the talk test instrumentalized speech during exercise to evaluate and guide exercise training intensity for each individual. The exercise intensity at which individuals initially “experience difficulty speaking comfortably” (utilizing various speech stimuli) serves as a dependable indicator of the ventilatory threshold in healthy individuals (see Foster et al., Reference Foster, Porcari and Ault2018). Recent research indicates that the point at which normal speech can no longer be maintained aligns with the onset of metabolic acidosis during escalating exercise intensity (De Lucca et al., Reference De Lucca, de Oliveira, Foster and Carminatti2021). While sports studies traditionally overlook speech changes induced by physical exertion, recent research in language and cognitive sciences has witnessed a growing interest in investigating these phenomena.

1.4.1.3 Effect of Exercise on Speech Signals

The intricate relationship between speech breathing and physical effort extends to speech acoustics, where parameters change with increased exertion, including a rise in average intensity and fundamental frequency (F0) during biking (Mohler, Reference Mohler1982; Fuchs et al., Reference Fuchs, Reichel and Rochet-Capellan2015; Weston et al., Reference Weston, Fuchs and Rochet-Capellan2020; Serré et al., Reference Serré, Dohen, Fuchs, Gerber and Rochet-Capellan2022) or treadmill tasks (Primov-Fever et al., Reference Primov-Fever, Lidor, Meckel and Amir2013; Trouvain and Truong, Reference Trouvain and Truong2015). Trouvain and Truong proposed that this F0 increase may be attributed to heightened subglottal pressure during physical exertion. Conflicting findings on F0 changes during limb movements and exertion levels highlight significant inter-speaker variability, presenting diverse and sometimes opposing behaviors (Godin and Hansen, Reference Godin and Hansen2008, Reference Godin and Hansen2015; Weston et al., Reference Weston, Fuchs and Rochet-Capellan2020). The relationship between exertion level and F0 in speech is complex. Johannes et al. (Reference Johannes, Wittels and Enne2007) observed that F0 is primarily affected by exercise near exhaustion. These varied outcomes suggest that the impact of physical stress on speech acoustics is neither linear nor constant, emphasizing the need for a detailed examination at smaller scales. Crucial factors include the unique articulation space for each consonant, affecting sensitivity to subglottal pressure fluctuations, and considerable variability in limb movement dynamics influenced by biomechanical mechanisms (Godin and Hansen, Reference Godin and Hansen2011) that may also involve breathing (Pouw et al., Reference Pouw, De Jonge-Hoekstra, Harrison, Paxton and Dixon2020b).

1.4.2 Multilayer Connections between Speech, Breathing, and Limb Movement

Understanding the intricate interplay of speech, breathing, and limb movement is essential to understand their coexistence in facilitating human communication. This involves exploring their contributions across the evolutionary and developmental aspects of spoken language. Previous research indicates the involvement of various factors, including low-level biomechanical as well as physiological and cognitive aspects. Additionally, learning from sensorimotor and social experiences may contribute to shaping the influence of physiological rhythms on communicative ones.

1.4.2.1 A Biomechanical Anchor of Speech–Breathing–Limb Interaction

Pouw et al. (Reference Pouw, Harrison and Dixon2020a, Reference Pouw, De Jonge-Hoekstra, Harrison, Paxton and Dixon2020b) found correlates of arm motions in vocalizations. In particular, when participants imitated a beat gesture in a rhythmic fashion during syllable production or spontaneous speech, the amplitude envelope and F0 increased close to the deceleration peak of the arm movement. The authors introduce the hypothesis of a biomechanical interplay between arm breathing and speech in relation to anatomical constraints between the arms and the breathing system. While moving the arms, thoracic muscles are activated to maintain posture and support limb movements. This activation triggers rib cage displacements, impacting the lungs. When the arms move such as in beat gestures, this induces changes in the respiratory system, leading to an increase in subglottal pressure, and changes in acoustical parameters of co-occurring vocalizations. In line with this idea, Serré et al. (Reference Serré, Dohen, Fuchs, Gerber and Rochet-Capellan2022) observed a correlation between the amplitude envelope of speech acoustic peaks and leg deceleration events during biking, further supporting possible biomechanical interplays involving body movement, breathing, and speech production, not limited to the arms.

Recently, Pouw and Fuchs (Reference Pouw and Fuchs2022) argue that this biomechanical entanglement extends beyond immediate physiological effects, potentially influencing coordinated speech patterns. They propose that this link may have evolutionary significance, contributing to the development of finely tuned vocalizations in species with a diverse coupling of breathing and limb movements, particularly in bipedal species where upper bodies are liberated during locomotion.

However, as much as the biomechanical entanglement of limb, respiratory, and vocal systems can explain the impact of limb movements on speech acoustics, and may contribute to speech and gesture coordination, it can’t account for the entire variability of speech. If speech, breathing, and limb movements obviously share the same physical body substrate, and if motion in one part of this body might spread to others through biomechanical constraints, they also share the same brain as well as sensorimotor and social experiences.

1.4.2.2 The Interplay between Cognitive and Physiological Factors

Distinguishing between the effect of cognitive and physiological factors on speech is not always easy, and hypotheses assuming only one cause often end in a chicken-and-egg loop. For example, in ontogenesis, researchers noted a correlation between the growth of lung capacity and the increasing length of utterances (Bouchet and Lalonde, Reference Boucher and Lalonde2015). However, at the same time, cognitive abilities are also changing. In adulthood, there is no clear evidence that speakers with higher lung capacity are likely to produce longer utterances (Heldner et al., Reference Heldner, Carlsson and Wlodarczak2019). While exploring the involvement of the forebrain in individual tidal breathing patterns, Shea et al. (Reference Shea, Horner, Benchetrit and Guz1990) noted varying cycle shapes among individuals with similar body types and observed diverse body types exhibiting similar cycle shapes. Moreover, motor-respiratory coordination in humans is highly flexible, sensitive to factors such as cognitive effort (Hessler and Amazeen, Reference Hessler and Amazeen2014). New MRC patterns can also be intentionally learned to improve energy consumption during specific physical effort (Hessler and Amazeen, Reference Hessler and Amazeen2014). Furthermore, the modification of speech breathing through learning is evident, as seen in actors and actresses (Master et al., Reference Master, Guzman, Azócar, Muñoz and Bortnem2015). Breathing is also intertwined with cognition, attention processes (Allen et al., Reference Allen, Varga and Heck2023), as well as various behavioral factors related to motor, emotional, physiological, and cognitive processes (Shea, Reference Shea1996; Grassman et al., Reference Grassmann, Vlemincx, Von Leupoldt, Mittelstädt and Van den Bergh2016; Park et al., Reference Park, Barnoud and Trang2020). In this respect, the breathing profiles are markers of ongoing activities of the body and mind as well as speaker specificities and training. The study of the impact of practices such as swimming, free diving, or yoga on breathing during speech and the organization of pauses could bring forth interesting insights regarding the effects of training.

Investigating speech breathing consistency over time and across different limb movements (leg versus arm biking), Serré et al. found that speech breathing cycles are consistent over time and across tasks, despite a significant effect of the limb deceleration on speech acoustics observed in the same dataset (Reference Serré, Dohen, Fuchs, Gerber and Rochet-Capellan2022). The temporal scale of this latter biomechanical aspect is less than a second, while speech breathing cycles take place over three to four seconds on average. The entanglement of these different scales suggests a complex interaction between multilevel rhythms shaping spoken communication (Pouw et al., Reference Pouw, Proksch and Drijvers2021). The emergence of such rhythms might depend on the speaker’s environment and peculiarities such as breathing capacities or spontaneous speech rate, embedded in speech control across individual experience.

1.4.2.3 Breathing as a Bridge Built through Grounded Experiences

Speech, breathing, and limb movements exhibit an interconnected evolution in both phylogenesis and ontogenesis (see also Chapter 7). This coevolution stems from anatomical overlap and connections among the three systems, further supported by the intricate development of peripheral and central nervous systems. This development occurs through concurrent activation in daily communicative experiences from birth. Despite their interdependence, research has predominantly explored speech, breathing, and limb movements separately or in pairs. On the one hand, studies on breathing and limb movements are often driven by sports and physiological approaches, to investigate energy expenditure and expert motor control. On the other hand, fields related to linguistics and cognition focus on the relationship between speech and breathing, neglecting limb movements, with minimal consideration for breathing. As evidenced in this chapter, breathing is a resource intrinsic to speech and limb movements and adapts to speech via utterance planning and to limb movements via afferent connections and chemoreceptors that transmit metabolic needs. Since limb and speech movements frequently co-occur, the brain may have developed specific abilities to control them jointly (Bernstein, Reference Bernstein1967; Latash et al., Reference Latash, Scholz and Schöner2007) and the shape of our evolved behaviors might be strongly determined by physiological factors. Coordinative structures between speech, breathing, and limbs might be learned in a specific way through experiences. Adopting an integrated view of this three-bodies issue might shed light on the complex nature of speech rhythms and their grounding in breathing and body movements.

1.5 Conclusion

The chapter discusses the intertwined relationship between breathing, limb movements, and speech rhythms, drawing on interdisciplinary evidence. It reveals that speech is organized into breath groups, influenced by cognitive, physical, and speaker-specific factors. Limb movements affect speech production by influencing conceptual, executive, and interactive aspects. Recent research suggests biomechanical connections between limb movements and speech acoustics via the respiratory system. This triadic interaction remains understudied, despite its potential significance in understanding creativity and communication. Holistic approaches hold promise for uncovering vital insights into the collaborative work of speech, breathing, and limb movements. However, studying the three-level interaction among complex systems poses challenges, necessitating interdisciplinary methods and expertise. Research perspectives on the interaction between speech, breathing, and limb movements may finally benefit from delving into more ecological as well as longitudinal investigations.

Box 1.1Chapter Overview

Summary

Research has shown that (1) speech is structured in breath groups, delineated by inhalation pauses aligned with syntactic as well as communicative events; (2) this coordination is yet subject to adaptation based on both cognitive and physical factors as well as speaker-specific influences; and (3) gestures and other limb movements influence speech rhythms by impacting conceptual, executive, and interactive aspects of speech production.

Implications

Speech, breathing, and limb movements exhibit an interconnected evolution in both phylogenesis and ontogenesis. This coevolution stems from anatomical overlap and connections among the three systems, further supported by the intricate development of peripheral and central nervous systems. As such, the brain likely integrates concurrent leg or arm movements into speech planning and speech motor control early on.

Gains

Despite their interdependence, research has predominantly explored speech, breathing, and limb movements separately or in pairs. Adopting an integrated view might shed light on the complex nature of speech rhythms and their grounding in breathing and body movements. Holistic approaches hold promise for uncovering vital insights, leading to more ecological investigations.

2 Jaw-Opening Patterns and Their Correspondence with Syllable Stress Patterns

2.1 Introduction

This chapter approaches the topic of language rhythm from an articulatory point of view. It is divided into six sections where we present and discuss data on articulation, acoustics, and perception, with a focus on the jaw. The proposal is that articulatory patterns, specifically how much the jaw opens for each syllable, reflect the abstract hierarchical metrical structure of a spoken language (see, for example, work by Liberman and Prince, Reference Liberman and Prince1977; Selkirk, Reference Selkirk1982; Hayes, Reference Hayes1995); it is these metrically derived patterns of syllable prominence and phrasing, observed in the articulatory patterns, that provide a basis for language “rhythm.” In an English utterance, the metrical hierarchy manifests itself such that each syllable has an n-ary degree of prominence/stress, depending on its position in the hierarchy, with the nuclear stress syllable (or the emphasized/focused syllable) having the largest stress value, then subsequently the phrasally stressed and foot-stressed syllables, respectively, with the reduced syllables having relatively little jaw displacement.

2.1.1 Utterance Prominence and Acoustic Characteristics

In English, there is lexical stress, such that a word with more than one syllable has one syllable that receives more stress than others. The position of the stress is lexically fixed such that with the word linguistics, for example, the stressed syllable is the second one. Phrasal stress/prominence, in a sense, is like word stress: that is, with word stress, one syllable in the word receives more stress/prominence than the others; with phrase stress, one word (syllable) within the phrase receives more stress/prominence than the others.Footnote ¹

In a simple utterance, such as I like dates, the word dates probably receives the largest amount of prominence. In English, the syllable/word in an utterance with the largest prominence is referred to as having utterance stress, also referred to as nuclear stress, with the default nuclear stress/utterance stress on the last content word of the utterance (e.g., Cole et al., Reference Cole, Hualde and Smith2019). In a more complex utterance, such as I saw five bright highlights in the sky tonight, there are perhaps three phrases: (I saw)(five bright highlights)(in the sky tonight). The second phrase (five bright highlights) consists of two smaller units (five bright) and (highlights), which we refer to as foot units. We suggest that for this utterance, the first member of each foot is more prominent than the second member (e.g., Erickson et al., Reference Erickson, Suemitsu, Shibuya and Tiede2012, Reference Erickson, Kawahara, Shibuya, Suemitsu and Tiede2014).

In English, speakers generally have choices about which words to group together in a phrase, and which word in that phrase gets the most stress. But the utterance prominence rule seems to be that no matter how the syllables are grouped into phrases, there will be only one syllable in the phrase that will be the most prominent. It is this pattern of prominence/stress that underlies English “rhythm.”

A hypothesis is that the abstract metrical hierarchical organization of an English utterance is realized such that each syllable in an utterance has an n-ary degree of prominence/stress; the nuclear stress syllable has the largest stress value, and then subsequently the phrasally stressed and foot-stressed syllables, respectively. The acoustic consequences of stress/prominence in English tend to be increased duration, increased intensity, increased or decreased fundamental frequency (F0) (e.g., H* or L* patterns), and more extreme formants (e.g., Fry, Reference Fry1955; Lehiste, Reference Lehiste1970; Cooper et al., Reference Cooper, Eady and Mueller1985; Beckman, Reference Beckman1986; Turk and Sawusch, Reference Turk and Sawusch1996; Kochanski et al., Reference Kochanski, Grabe, Coleman and Rosner2005).

2.1.2 Utterance Prominence Patterns and Articulation

In terms of articulation, the proposal in this chapter is that the amount of jaw displacement for each syllable (i.e., how much the jaw opens for each syllable) is commensurate to the amount of prominence/stress for that syllable (e.g., Erickson et al., Reference Erickson, Suemitsu, Shibuya and Tiede2012). The hypothesis explored is that the patterns of varying amounts of jaw displacement provide a window into the metrical organization of spoken language; that is, we implement/translate the abstract metrical rhythm of our spoken language in terms of how much we open our jaw for each syllable in an utterance.

The jaw, thus, is a prosodic articulator, in addition, of course, to the larynx. Here we focus on the jaw. This chapter has the following sections: (2.2) A review of findings about jaw displacement and emphasis; (2.3) a review of jaw displacement and utterance prominence patterns; (2.4) the relation between segment articulation and syllable articulation; (2.5) new articulatory, acoustic, and perceptual findings about jaw displacement and how these relate to utterance prominence and phrase boundary patterns in American English (AE) utterances; and (2.6) jaw and phrasal stress in other languages and applications for language teaching, along with plans for future research.

2.2 Jaw Displacement and Emphasis

Jaw displacement increases with prominence, including contrastive emphasis, not only for low vowels (e.g., Kent and Netsell, Reference Kent and Netsell1971; Stone, Reference Stone1981; Summers, Reference Summers1987; Macchi, Reference Macchi1988; Westbury and Fujimura, Reference Westbury and Fujimura1989; Beckman and Edwards, Reference Beckman, Edwards and Keating1994; de Jong, Reference de Jong1995; Erickson, Reference Erickson1998, Reference Erickson2002, Reference Erickson2004; Harrington et al., Reference Harrington, Fletcher, Beckman, Broe and Pierrehumbert2000; Menezes, Reference Menezes2003, Reference Menezes2004) but also for high vowels (Harrington et al., Reference Harrington, Fletcher, Beckman, Broe and Pierrehumbert2000; Erickson, Reference Erickson2002) and mid vowels (Erickson, Reference Erickson2002). Increased jaw displacement with increased prominence has also been reported for French (Loevenbruck, Reference Loevenbruck1999; Tabain, Reference Tabain2003) and Japanese (Erickson et al. Reference Erickson, Hashi and Maekawa2000).

An acoustic consequence of increased jaw displacement for emphasis is formant (F) changes. Jaw lowering changes the size and shape of the vocal tract; in order to produce the same phonological vowel with a larger jaw opening, the tongue must move accordingly. Erickson (Reference Erickson2002) reported that when a vowel is emphasized, the jaw lowers and the tongue of necessity also changes position – more up and forward for high vowels, more low and back for low vowels. For emphasized high vowels, F2 becomes higher while F1 tends to lower; for emphasized low vowels, F2 lowers while F1 raises, resulting in the emphasized vowels positioned at the more extreme edges of the vowel triangle. Findings of increased F1 with increased jaw displacement for low vowels have also been reported by, for example, Menezes (Reference Menezes2003).

Here we report on a pilot articulatory and perceptual study of emphasis in English (Erickson et al., Reference Erickson, Kim and Kawahara2015). Jaw displacement patterns were examined for two North American English speakers (one male, one female) for the utterance Pam said bat that fat cat at that mat, spoken in five different emphasis conditions: emphasis on bat, that, fat, cat, and no emphasis. Note that vowel quality also affects jaw displacement (i.e., a low vowel has about 4 mm lower jaw displacement than a high vowel) (Menezes and Erickson, Reference Menezes and Erickson2013; Williams et al., Reference Williams, Erickson and Ozaki2013); thus, all vowels in the utterance must be phonologically the same in order to see the effects of utterance prominence. For tracking the jaw movement, we used electromagnetic articulography (EMA), where a sensor was glued to the middle of the two lower incisor teeth just above the gumline. MView (Tiede, Reference Tiede2010) was used to measure the point in each vowel where the jaw was at its maximally lowest position from the bite plane.

Figure 2.1 shows the average amount of jaw displacement for each of the syllables in the five different emphasis conditions. The height of each bar represents the jaw displacement (mm) for each syllable. Notice that the emphasized word (underlined in capital letters) always has the largest jaw displacement in the utterance. As reported in Kim et al. (Reference Kim, Erickson and Lee2015), t-tests with Bonferroni post hoc tests showed that the difference was significant for that and fat for both speakers, but not for bat for A05, while neither of the two speakers showed a significant difference for cat.

Figure 2.1

Jaw displacement values for different emphasis conditions.

Jaw displacement values (mm) for each syllable in the utterance Pam said bat that fat cat at that mat, spoken in five utterance conditions by a single speaker. The emphasized words, from top to bottom, are BAT, THAT, FAT, CAT, and no emphasis.

Five bar graphs with error bars plot jaw displacement in millimeters for utterances including Pam, Said, Bat, that, fat, cat, at, that, and mat. See long description.

Figure 2.1 Long description

The highest to lowest mean jaw displacement values for various utterances are as follows. First. 28.5 for Bat, 27 for Cat, 26 for Fat, 25 for that and mat, 24.5 for that and at, 22 for pam and 20 for said. Second. 31 for That, 24 for bat, 25 for Cat, 23 for fat, 23 for fat, 22 for At, 21.5 for That and mat, 20.9 for pam and 18 for Said. Third. 29 for Fat, 26 for Cat, 24.8 for bat. 23 for That. 23 for Mat. 23.5 for At. 21 for Pam. 18.8 for Said. Fourth. 30 for Cat. 27 for Fat. 24.8 for Bat. 24 for At. 23.2 for Mat. 23 for That. 21 for Pam. 19 for Said. Fifth. 27.2 for Cat. 27 for Fat. 25.2 for Bat. 24 for That. 23.5 for Mat. 23 for Pam. 23 for At. 19 for Said. The values are estimated.

Erickson et al. (2015)

The bottom panel of Figure 2.1 shows the jaw displacement pattern for the utterance type spoken without emphasis on any word. Notice that in the no-emphasis utterance, there is still one word, cat, which shows the largest jaw displacement in the utterance, suggesting that nuclear stress is on cat. Notice also the strong–weak pattern of jaw displacement for the first two pairs of words (Pam said, bat that) in the utterance; if the second member is emphasized, we then see a weak–strong pattern. But, for the third pair of words, fat cat, unless fat is emphasized, we see a weak–strong pattern of jaw displacement such that cat has more jaw displacement than fat. We suggest that this is because for this speaker, the nuclear stress is on cat. These jaw displacement patterns suggest that in some inherent way, jaw articulation patterns form a framework for English rhythmic prominence patterns.

What about perception? Do jaw displacement patterns affect listeners’ perception of prominence? In order to test the hypothesis that the patterns of jaw displacement “match/reflect” the prominence patterns of an utterance, an online rapid prosodic transcription (RPT) listening test was done (Cole et al., Reference Cole, Hualde and Smith2019). A total of 50 listeners (18 for the 26 tokens of A03, and 32 for the 24 tokens of A05) were asked to evaluate each token twice: first to mark with a vertical line between each word where they heard phrase breaks, and second to underline which word or words “stood out” more than the others, that is, which word seemed louder, longer, or higher pitched. Listeners could listen to each token as many times as they wished. Results of the RPT perception test indicated a significant relation (p <. 001) between jaw displacement and perceived prominence for both speakers (r=0.60 for A03; r=0.68 for A05). We also see a significant, but less strong, relation between jaw displacement and perceived boundaries, with one speaker (A03, r=0.43, p < 0.001) showing a slightly stronger relation than the other (A05, r=0.18, p < 0.05). The interaction between phrase boundaries and prominence, both in terms of perception and articulation, is explored further in the following sections.Footnote ²

Given that jaw displacement increases for emphasized words, in the next section, we explore further how jaw displacement patterns vary with utterance prominence patterns. The hypothesis is that we will see patterns of increased jaw displacement that correspond to, for example, foot stress, phrasal stress, and utterance nuclear stress.

2.3 Jaw Displacement and Utterance Prominence Patterns

As seen in the previous sections, not only does jaw displacement increase for emphasis, it also increases for utterance (nuclear) stress. Here we report on findings from earlier studies (e.g., Erickson et al., Reference Erickson, Suemitsu, Shibuya and Tiede2012, Reference Erickson, Kawahara, Shibuya, Suemitsu and Tiede2014; Huang and Erickson, Reference Huang and Erickson2019; Erickson and Niebuhr, Reference Erickson and Niebuhr2023). Articulatory and acoustic recordings by a number of AE speakers were made for the utterance I saw five bright highlights in the sky tonight. This utterance was chosen because the vowels in the content words are all /aɪ/ diphthongs (except for the word saw in the phrase I saw, and this phrase was excluded from the analysis); also, as briefly described above, the utterance has the two phrases five bright highlights and in the sky tonight. The first phrase consists of two two-word feet: five bright and highlights. (The first phrase could be referred to as an accent phrase, or an intermediate phrase, but here we refer to it simply as a phrase.) Maximum jaw displacement was measured during the vowel /a/ of the diphthong.

Figure 2.2 (top) shows jaw tracings of an AE speaker producing the utterance (I saw) five bright highlights in the sky tonight. The valleys indicate when the jaw is lowered/open, as measured from the occlusal (bite) plane. We observe that for the top jaw tracing (gray arrows), (1) there are valleys (jaw openings) for each syllable; (2) the depth of the valleys varies, even though the vowel is always /aɪ/; and (3) the biggest valley (jaw displacement) occurs for sky, then for high of highlights, and then for five of five bright.

Figure 2.2

Samples of default and intent nuclear stress.

Jaw tracings for AE speakers for the utterance I saw five bright highlights in the sky tonight. For the top figure, gray arrows from left to right point to foot stress on five, phrase stress on the first syllable of the compound word highlights, and utterance (nuclear) stress on sky. For the bottom figure, nuclear stress is instead on high(lights), phrasal stress is on sky, and foot stress is on five, as indicated by the white arrows.

The jaw displacement patterns for the top jaw tracing in Figure 2.2 (seen for the four English speakers reported in Huang and Erickson, Reference Huang and Erickson2019), with nuclear stress on sky, can be depicted in terms of a metrical arrangement of syllable stress patterns along the lines of, for example, Liberman and Prince (Reference Liberman and Prince1977), Selkirk (Reference Selkirk1982), and Hayes (Reference Hayes1995), as shown in the top part of the metrical grid displayed in Table 2.1. The word sky has utterance nuclear stress (level 5 stress), high has phrasal stress (level 4 stress), and five has foot stress (level 3 stress). Notice that for this speaker, nuclear stress occurs in the default position, that is, the last content word of the utterance. (Note that tonight is not a content word but rather an adverb.) The stress pattern in this utterance would be 3-2-4-1/5-2 (where a slash indicates a syntactic phrasal break).

Table 2.1Metrical grids for two types of nuclear stress

Metrical grids for two types of nuclear stress productions for the utterance (Yes, I saw) five bright highlights in the sky tonight (along the lines of, for example, Hayes, Reference Hayes1995). Both productions show foot stress on five, but the top grid shows default type nuclear stress on sky and phrasal stress on highlights; the bottom grid shows intent nuclear stress on highlights, and phrasal stress on sky.

Utterance					x
Phrase			x		x
Foot	x		x		x
Word	x	x	x		x	x
Syllable	x	x	x	x	x	x
Stress level	3	2	4	1	5	2
(Yes, I saw)	five	bright	high	lights	sky	night
Utterance			x
Phrase			x		x
Foot	x		x		x
Word	x	x	x		x	x
Syllable	x	x	x	x	x	x
Stress level	3	2	5	1	4	2
(Yes, I saw)	five	bright	high	lights	sky	night

When AE speakers are asked to read a sentence with no prior context, they have choices of where to put nuclear stress, as indicated by the bottom jaw tracing of Figure 2.2. In the bottom tracing, the pattern of jaw displacement shows the largest stress in the utterance on high(lights), perhaps because the word highlights seemed to carry a certain amount of salience for these speakers. Thus, the stress pattern in this utterance would be 3-2-5-1/4-2 (where a slash indicates a syntactic phrasal break).

The metrical grid for this utterance, with the largest jaw displacement on highlights, typical for three of the four AE speakers reported in Erickson et al. (Reference Erickson, Suemitsu, Shibuya and Tiede2012), might be as represented in the bottom metrical grid portrayed in Table 2.1. The largest jaw displacement is hypothesized to be the utterance nuclear stress high(lights), while the phrasal stress is sky and the foot stress is five, with a stress pattern of 3-2-5-1/4-2. Another speaker in the Erickson et al. (Reference Erickson, Suemitsu, Shibuya and Tiede2012) study, however, put the most jaw displacement/prominence on five, and for this speaker, there would be yet another different pattern of utterance stress: 5-2-3-1/4-2 (Figure 2.1; Erickson et al., Reference Erickson, Suemitsu, Shibuya and Tiede2012). Our findings suggest that nuclear stress in AE may be of two types: default type, which occurs on the last content word of the utterance, and intent type, which occurs on the content word that seems to be most salient to the speaker.

Erickson et al. (Reference Erickson, Suemitsu, Shibuya and Tiede2012) reported that increased jaw displacement showed a significant relation with F1 for three of the four speakers; moreover, both jaw displacement and F1 showed a significant relation with the metrical grid as shown in Table 2.1 (bottom part). These findings lend support to the jaw as being the metrical prosodic organizer of English utterances, with increased F1 being one of the acoustic changes. The Huang and Erickson (Reference Huang and Erickson2019) study investigating utterances with mid front vowel /ɛ/ also addresses possible effects on jaw displacement due to interactions among the tongue body together with formant changes and F0 (see also Chen et al., Reference Chen, Whalen and Tiede2019, as well as Erickson, Reference Erickson2002).

A question is to what extent jaw displacement patterns affect/influence listeners’ perceptions of utterance prominence. Pilot studies suggest that the word with the largest jaw displacement is also perceived as having the most prominence (e.g., Erickson et al., Reference Erickson, Kim and Kawahara2015, Reference Erickson, Huang and Menezes2020a). What in the acoustic signal is related to increased jaw displacement and prominence perception? These questions are explored in Section 2.5, reporting on new articulatory, acoustic, and perceptual data. In addition to jaw data, we report on vowel acoustic measurements (F1, F2, F0, intensity, and duration) along with listeners’ assessment of prominence and phrase boundaries. However, first we wish to report on some recent findings concerning the relationship between segmental articulation and syllabic articulation, and to present ideas on how this relationship might be affected by prominence.

2.4 Relation between Segment Articulation and Syllable Articulation

The pattern of jaw displacement reflects the abstract metrical prosodic hierarchy of a spoken utterance (see, for example, Erickson et al., Reference Erickson, Suemitsu, Shibuya and Tiede2012). These rhythmic closed and opened periods of the jaw underlying each syllable must synchronize with the alternating degrees of constriction of consonants and vowels for the syllable. In this section we examine segmental articulation and how it might time with syllabic articulation. Here, we review recent work by Svensson Lundmark (Reference Svensson Lundmark2023), who describes segmental articulation in terms of de/acceleration peaks of the articulators (here the slash refers to both deceleration and acceleration) where the acceleration and deceleration peaks of the segmental articulators align with acoustic segment boundaries. This approach is centered around the concept of acceleration which is a movement characteristic of a mass (= an articulator). Both acceleration (movement initiation) and deceleration (movement “braking”, approaching a target) involve positive muscle forces—generated by agonist or antagonist muscles, respectively. The amount and type of force applied determine articulatory speed. The moments of maximal acceleration or deceleration (de/acceleration peaks) divide the articulatory movement into postures/steady states, which are delimited by very fast intervals. It is specifically the postures of the crucial segmental articulators of the consonant production, that is, the lips for a bilabial segment, that align with the acoustic consonant segment. For more details concerning this approach, we refer to Svensson Lundmark (Reference Svensson Lundmark2023) and Svensson Lundmark and Erickson (Reference Svensson Lundmark and Erickson2024).

As for vowel segment duration, the relationship to the de/acceleration peaks is different: as the consonantal production is of an instantaneous nature, layered on top of vowels (Öhman, Reference Öhman1966), the vowel segment duration is the result of when the consonant segments are produced. More specifically, the acceleration peak at the end of the syllable onset consonant forms the vowel segment onset, and the deceleration peak of the coda consonant forms the vowel segment offset, as indicated in the schematized kinematic trajectories of a CVC (consonant-vowel-cononant) sequence in Figure 2.3. Note that the de/acceleration peaks of the tongue body for vowel targets do not align with the segment boundaries, but instead shape a much shorter tongue body posture (Svensson Lundmark, Reference Svensson Lundmark2023). In other words, the duration of the acoustic vowel segment is a reflection of when in time the first crucial segmental articulator (for the syllable onset) leaves its position, be it the tongue tip or the lips, and when the second crucial segmental articulator (for the syllable coda) arrives at its target destination.

Figure 2.3

Schematized steady stades and fast transit intervals.

A schematized figure on steady states (thick solid lines) and fast transit intervals (solid lines with arrows) of a syllable, as divided by de/acceleration peaks. The areas marked CVC are the duration of the acoustic segments. The steady states of the crucial segmental articulators (here, vertical positions of the tongue tip or lower lip) form the CVC segments. The syllable articulator (the jaw) displays shorter steady states than both C and V segments. Note that the steady state of the tongue body (a low vowel) does not align with the V segment. Still unresolved questions are marked with dotted lines. The schematized figure is based on findings of Svensson Lundmark (Reference Svensson Lundmark2023) and Svensson Lundmark and Erickson (Reference Svensson Lundmark and Erickson2024).

Reproduced with permission from M. Svensson Lundmark and D. Erickson, JSLHR, https://doi.org/10.1044/2024_JSLHR-23-00092, accepted; licensed under a Creative Commons Attribution (CC BY) license.

The duration of the vowel segment is connected to jaw displacement by the nature of the jaw cycle; the vowel appers during its open state, while the closed steady states of the jaw (when the jaw stays up) are when the onset and coda appear, as shown in Figure 2.3 (Svensson Lundmark and Erickson, Reference Svensson Lundmark and Erickson2024). Furthermore, the steady states of the jaw are shorter than the steady states of the crucial segmental articulators, as also indicated in Figure 2.3 (Svensson Lundmark and Erickson, Reference Svensson Lundmark and Erickson2024). We suggest that these timing differences between the steady states/postures on the jaw cycle and the ones on the crucial segmental articulators could be because the syllabic jaw movements and the segmental articulators have inherently different rhythmic patterns. We intend to follow up this approach in future research.

Although the nature of the timing of the de/acceleration peaks is still unclear, it’s related to how fast the articulators are moving. We know that the speed of the segmental articulators is affected by the distance to the target and what type of target is made (the task difficulty) (Bootsma et al., Reference Bootsma, Fernandez and Mottet2004). While task difficulty is related to how the constriction is manifested (and therefore possibly related to deceleration), the distance to the target is manifested by the size of the oral cavity, that is, how long/far the articulator has to travel. If the tongue is producing a low vowel, the tongue tip might need to travel further to make the following coda constriction at the palate, whereas for a high vowel, the tongue tip is presumably already near the target constriction. We presume these conditions affect the timing of the de/acceleration peaks, hence directly having an impact on the posture and the resulting acoustic segment duration. Furthermore, as the size of the oral cavity changes according to jaw displacement and longer distances for the articulators to travel for a more prominent syllable, this would possibly affect tongue body positions, leading to even more variants of travel distances for the tongue tip between, for example, a prominent /a/ and a prominent /i/. However, this relationship still needs to be investigated.

The vowel segment differences are also dependent on what type of syllable onset or coda constriction, that is, manner, place, and voicing, the segmental articulator is making, as the de/acceleration peaks of the segmental articulators determine the acoustic vowel segment boundaries. Thus, where a segmental articulator travels from and where it is going next affects the distance to the target and its velocity, which ultimately affects the acceleration and the deceleration phase, and the timing of those de/acceleration peaks. In other words, any correlation between vowel segment duration and jaw displacement is dependent on the timing of the segmental articulators (= the consonants) and the context in which they are produced.

2.5 Articulatory, Acoustic, and Perceptual Study of AE Utterance Prominence Patterns: New Data

In Section 2.3, we established that the greatest jaw displacement corresponds to the perception of contrastive emphasis in utterances containing the diphthong /aɪ/. This section continues to investigate utterance prominence in terms of articulation, perception, and acoustic cues.

Here we look at the utterance Pam had a chance to chat and nap as spoken by five AE speakers (three male, two female). This utterance was selected because (1) the content words, Pam, chance, chat, and nap, all contain the phonological AE /æ/ vowel, and (2) it has two phrases, Pam had a chance and to chat and nap. Looking at jaw displacement across the same vowels allows us to see prominence effects; looking at an utterance with two phrases allows us to see possible differences in articulation of phrasal stress and nuclear stress.

2.5.1 Articulatory and Acoustic Study

The acoustic and articulatory recordings were done with EMA at Professor Jianwu Dang’s laboratory at the Japan Advanced Institute of Science and Technology, Nomi, Japan (see Section 2.2 for an explanation of sensor placement). The target sentence was read in five different randomizations for four of the speakers, and for S4 there were eight randomizations of the target sentence. MView (Tiede, Reference Tiede2010) was used to measure the amount of maximum jaw displacement from the occlusal plane during each vowel. The average F1, F2, F0, intensity, and duration for each vowel was estimated using Praat (Boersma and Weenink, Reference Boersma and Weenink2025) to mark TextGrids, and a customized Praat script. Perception tests to assess listeners’ perceptions of prominence and phrase break/boundary were conducted using an online interface.

Figure 2.4 shows the mean values of jaw displacement for each of the five speakers. In the bar graphs, the height of each bar represents the average amount of jaw displacement for each of the one-syllable content words, Pam, chance, chat, and nap. As discussed in the previous sections, bar graphs convey relative syllable size, that is, big (more prominent) syllables will have larger y-values (mm) than smaller ones. Thus, the y-axis values reflect the relative values of jaw displacement (syllable stress) within the utterance.

Figure 2.4

Jaw displacement patterns for five speakers.

Bar graph displays of average amount of jaw displacement (mm) shown on y-axis with content word on the x-axis (Pam, chance, chat, nap), for each of the five speakers (S1, S2, S3, S4, and S5). Jaw displacement ranges from 15 to 30 mm for S1, 10 to 30 mm for S2, 21 to 31 mm for S3, 5 to 20 mm for S4, and 45 to 53 mm for S5.

A set of 5 bar graphs with error bars plots jaw displacements for five speakers labeled S 1, S 2, S 3, S 4, and S 5. See long description.

Figure 2.4 Long description

The highest to lowest mean jaw displacement values for five speakers are as follows. S 1: 26, 25.5, 24, and 17. S 2: 24, 22, 17, and 15. S 3: 28.5, 27.5, 27.5, and 23. S 4: 16, 14, 13, and 10. S 5: 50.5, 48.5, and 47.5. The values are estimated.

As discussed in the above paragraph, the bar graphs suggest that in terms of an abstract metrical prosodic hierarchy, the word with the largest jaw displacement in the utterance carries the nuclear stress of the utterance; when there is more than one phrase, the word in each phrase that has the largest jaw displacement is referred to as having phrasal stress, and often this is the word with the second largest jaw displacement. These data suggest that for this sentence, we can have at least two types of nuclear stress intent type – that is, either chat or Pam can have nuclear stress.

Table 2.2 shows acoustic measurements related with the average jaw displacements for each of the content words. The word with the largest jaw displacement also has the largest F1 value; peak F0 always occurs on the initial word, with the final word having the lowest F0; intensity is greatest on the initial word of the utterance for three of the speakers, but on the initial word of the second phrase for the other two speakers; acoustic duration is longest on the initial word of the utterance for two of the speakers, while the other three speakers show the longest duration on chance, chat, or nap. In Section 2.5.3, we discuss some of the acoustic cues listeners may be listening to in order to hear increased prominence on a syllable in an utterance.

Table 2.2Average jaw displacement and acoustic values for content words

Average jaw displacement and average acoustic values of the content words for the five speakers. Bold numbers indicate the largest value for each measurement across the content words for each of the speakers.

Sub	N	word	Jaw		F0Av		F1		F2		IntAv		vowel dur (s)
S1	5	Pam	23.7	1.1	244.8	5.3	726.8	10.4	1229.8	78.8	52.8	0.3	0.17	0.01
	5	chance	18.1	0.9	191.2	3.3	661.8	7.8	1401.5	167.0	49.0	0.9	0.19	0.02
	5	chat	25.4	0.6	189.8	5.4	890.4	40.9	1429.9	140.0	49.5	1.2	0.19	0.02
	5	nap	26.3	0.5	152.4	16.0	914.8	61.8	1547.8	93.8	42.3	1.6	0.16	0.01
S2	5	Pam	21.9	1.1	153.3	6.6	666.9	20.9	1367.7	189.4	33.9	2.2	0.13	0.01
	5	chance	15.5	0.5	116.0	2.6	612.6	24.2	1443.8	153.0	33.9	1.0	0.14	0.01
	5	chat	18.3	0.5	110.7	1.9	695.5	12.1	1357.3	38.1	39.9	1.5	0.14	0.02
	5	nap	23.3	1.4	98.2	10.3	769.7	30.6	1416.0	103.5	28.2	1.9	0.13	0.01
S3	5	Pam	27.7	1.8	137.6	8.7	560.6	16.8	1746.5	32.2	59.9	1.4	0.18	0.01
	5	chance	23.2	0.9	110.7	7.9	475.4	34.0	1708.1	48.0	56.7	0.9	0.15	0.01
	5	chat	27.6	0.7	105.4	9.0	583.8	30.1	1600.9	31.4	52.8	1.4	0.16	0.02
	5	nap	28.3	2.2	81.5	1.4	600.6	45.0	1718.1	52.3	48.8	1.4	0.16	0.02
S4	8	Pam	13.3	0.9	119.1	4.4	728.6	15.6	1632.2	73.9	54.6	0.5	0.23	0.02
	8	chance	10.5	0.6	106.7	4.0	659.6	19.5	1775.0	30.0	51.3	0.6	0.17	0.01
	8	chat	15.7	0.8	102.6	2.1	690.2	11.2	1470.9	22.5	55.3	1.2	0.16	0.01
	8	nap	13.9	0.9	91.8	2.4	752.6	16.9	1510.0	13.6	49.0	1.4	0.17	0.02
S5	5	Pam	48.5	1.6	247.9	9.6	682.6	34.8	1593.9	224.3	46.0	1.1	0.16	0.03
	5	chance	47.6	1.9	196.8	8.4	699.2	20.2	1653.6	124.1	39.9	1.8	0.19	0.01
	5	chat	50.5	1.6	184.5	6.1	804.4	5.1	1190.7	39.9	44.3	1.5	0.16	0.01
	5	nap	49.2	1.1	161.6	8.2	621.9	34.6	1040.6	88.9	34.7	0.9	0.19	0.0

Linear correlation results with jaw displacement and acoustic measurements indicate that four of the five speakers showed a significant relation between the amount of jaw displacement and F1 (p < 0.001), indicating that as jaw displacement increases, F1 raises; three other speakers showed significant relations between jaw displacement and F2 minus F1, indicating that as the distance between F2 and F1 decreases, the low vowel became even more extreme and compact, as discussed in Erickson (Reference Erickson2002).

No significant relations for the five speakers were found between amount of jaw displacement and the acoustic measures of F0 and duration. Only one speaker (S4) showed a significant positive relation between jaw displacement and intensity (p < 0.001); another speaker (S1) showed a significant negative relation between jaw and vowel duration. When we look at each speaker separately (Figure 2.5), four of the speakers (S1, S2, S4, and S5) show negative linear correlation between vowel duration and jaw displacement for each of the content words, while only S3 shows a positive correlation. These findings are surprising given that increased F0 and duration are what has been reported for increased prominence/stress (e.g., Turk and Sawusch, Reference Turk and Sawusch1996).

Figure 2.5

Correlation of vowel duration and jaw displacement.

Correlation plot on acoustic vowel duration (s) and jaw displacement (mm) for each speaker. Jaw displacement is mean-centered for reasons of clarity. The four content words are represented by different symbols.

A set of 5 scatter plots, each labeled S 1 through S 5. Each graph plots the relationship between jaw opening in millimeters and vowel duration in seconds. See long description.

Figure 2.5 Long description

The horizontal axis represents the jaw opening which ranges from minus 5 through 5, and the vertical axis represents the vowel duration which ranges from 0.14 through 0.22, 0.1 through 0.16, 0.12 through 0.2, 0.15 through 0.25, 0.15 through 0.25, and 0.14 through 0.2, respectively. The data sets representing pam, chance, chat, and nap are plotted in a random trend around the best-fit line. The best fit line originates at (minus 5, 0.19) and terminates at (4, 0.17) in S 1, originates at (minus 5, 0.14) and terminates at (5, 0.13) in S 2, originates at (minus 5, 0.158) and terminates at (4.5, 0.16) in S 3, originates at (minus 5, 0.158) and terminates at (4.5, 0.16) in S 3, originates at (minus 5, 0.4) and terminates at (4, 0.4) in S 4 and originates at (minus 5, 0.16) and terminates at (4, 0.16) in S 5. The values are estimated.

In the next section, we examine how listeners hear prominence/stress in English utterances – specifically, the probability of stress perception as a function of both prosodic articulation (i.e., jaw displacement) and prosodic acoustic measurements (F0, F1, F2, intensity, and vowel duration). Since, in addition, prosodic boundaries also affect utterance prominence (e.g., Fujimura, Reference Fujimura2000; Erickson et al., Reference Erickson, Kawahara, Shibuya, Suemitsu and Tiede2014; Cole et al., Reference Cole, Hualde and Smith2019), we also include boundary perception in the next section.

2.5.2 Pilot Study of Perception of Prominence and Boundaries in Utterances

Here we introduce some preliminary results about the extent to which perception of nuclear (utterance) and phrasal stress and phrase boundaries relate to jaw displacement. Twenty-eight AE listeners participated in our two experiments designed respectively for perceiving nuclear stress and phrasal boundaries in a neutral context. For each experiment, we recruited 14 participants (aged 18–78), who voluntarily participated or attended for course credit at the University of Arizona.

As described in Section 2.5.1, five native speakers of AE produced the utterance Pam had a chance to chat and nap. We selected one token from each speaker (S1 token 55; S2 token 115; S3 token 16; S4 token 64; S5 token 001) as the stimuli investigated in our perception tests. The reason for doing this was because we felt having listeners try to hear differences in prominence and boundaries within a single speaker as well as across speakers might be too difficult, given that most listeners are usually not familiar with listening to stress patterns. Also, based on the intra-speaker variations in jaw displacement, it seemed that speakers may have been inconsistent in their stress patterns. The present study employed two small-scaled RPT experiments. Sixteen stimuli (= (5 target sentences + 3 filler sentences) × 2 repetitions) were presented randomly online to each participant. An inter-stimulus interval (ISI) of 200 ms was used. Because perceiving utterance stress or boundaries is not something that most listeners are consciously aware of, before starting each of the tests, in order to help the subject “tune their ears” for prominent words, or for boundaries, subjects were asked to listen to a brief one-minute training video prepared by the first author. In the video for prominence testing, the listeners were asked to hear the difference between I like red apples, where apples had more stress, and I like RED apples, where red had more stress. After the training session, the participants were instructed to do the online RPT over their headphones in a quiet room. In the nuclear stress experiment, the listeners (2 males, 12 females) were asked to choose which word bore nuclear stress after hearing the given utterance. For the phrasal boundary experiment, the listeners (3 males, 11 females) were told to determine after what words they considered a boundary occurred.

2.5.2.1 Perceived Prominence and Boundaries

Not surprisingly, our results revealed that stress is perceptible on content words (i.e., Pam, chance, chat, nap), but not on function words (i.e., a, to, and). Listeners reported nuclear stress on Pam for three of the speakers, and chat and chance for the other two speakers. However, it seems that the relation between articulation of prominence (i.e., amount of jaw displacement) and perception of prominence (listener ratings) is not necessarily straightforward. For example, Figure 2.6 shows the relation of jaw displacement with vowel duration, and the probability of listeners hearing nuclear stress and boundaries. Figure 2.6b shows S2 had a preference for nuclear stress on Pam (35.7%), even though the jaw displacement was only the second largest (22.63 mm), and the phrasal stress was on chat (28.6%), even though nap had the largest jaw displacement. In the case of S2, thus, listeners’ perceptions of stress and speakers’ articulations of stress (i.e., jaw displacement) do not align.

Figure 2.6

Probability of perceived prominence and boundaries.

(Top) jaw displacement for the utterance; (bottom) probability of nuclear prominence (dash lines with circle symbols) and phrase boundaries for each word produced by (a) S1, (b) S2, (c) S3, (d) S4, and (e) S5.

Five sets of data point graphs plot the jaw movement in millimeter, and probability and duration in percentage. See long description.

Figure 2.6 Long description

Each set plots a wavy line for jaw movement and three zig zag lines for probability and duration. The wavy line originates at minus 18, minus 19, minus 22, minus 8, and minus 45, and terminates at minus 8, minus 19, minus 20, minus 9, and minus 45 in sets labeled S 1 through S 5. The zig-zag lines originate at (0.6, 1), (0.8, 1), and (0.3, 1) and terminate at (2.8, 7), (2.8, 5) and (2.8, 3) in S 1. The lines originate at (0.5, 0.5), (0.5, 0.3), and (0.5, 0.1) and terminate at (2.7, 6), (2.7, 6) and (2.7, 3.5) in S 2. The lines originate at (0.5, 0.5), (0.5, 0.3), and (0.5, 0.1) and terminate at (2.7, 6), (2.7, 6) and (2.7, 3.5) in S 2. The lines originate at (0.5, 0.7), (0.5, 0.5), and (0.5, 0.2) and terminate at (2.25, 8), (2.25, 6) and (2.25, 2.25) in S 3. The lines originate at (0.5, 0.7), (0.5, 0.5), and (0.5, 0.2) and terminate at (2.25, 8), (2.25, 6) and (2.25, 2.25) in S 3. The values are estimated.

For S4 (Figure 2.6d), the words Pam and chat are perceived as equally prominent (35.7%), but note that jaw displacement is largest for chat (16.15 mm), followed by Pam (11.27 mm). The suggestion here is that the nuclear utterance stress is on chat, with the phrasal stress on Pam, but in terms of perception it is somewhat ambivalent. Nevertheless, we see an alternating strong–weak pattern of jaw displacement within each of the phrases, as well as an alternating pattern of perceived stress. For S5 (Figure 2.6e) the likelihood of perceiving stress on chance was highest (67.9%), but jaw displacement was smallest (49.50 mm). We discuss this pattern later.

As for boundary perception, the amount of jaw displacement can account for both syllable prominence as well as phrasal boundaries (see also, for example, Fujimura, Reference Fujimura2000; Erickson et al., Reference Erickson, Kim and Kawahara2015).

In order to further investigate the relationship between the amount of jaw displacement with the probability of perceived prominence and that of phrase boundaries for each syllable across the five speakers, a linear correlation analysis was applied. By-speaker z-score normalization of articulatory measurements were done before entering correlation analyses to resolve inter-speaker physiological differences. We found a significant difference between jaw displacement and phrase boundaries (p < .05). For instance, for S3 (Figure 2.6c), the largest amount of jaw displacement corresponds with the greatest number of perceived breaks. These results suggest that the more open the jaw, the larger the number of perceived boundaries; in contrast, there was only a marginally significant difference observed for prominence in relation to jaw displacement (r=−0.281).

2.5.3 Acoustic Cues for Perception

In this section, we explore some acoustic cues that might affect listeners’ perception of prominence and boundaries. We measured duration, mean F0, mean intensity, and the first two formants (F1, F2) for the vowel /æ/ of the content words (Pam, chance, chat, and nap). A linear correlation analysis was taken to assess the relation of each acoustic measure with perceived prominence and boundaries. Acoustic measurements underwent by-speaker z-score normalization before entering correlation analyses.

Table 2.3 presents the correlation results for the acoustic cues related to perceived prominence. Prominent syllables are significantly longer, have significantly higher maximum F0, have marginally significantly higher mean F0, and are significantly louder. That syllables that are longer, louder, and higher in F0 were rated as more prominent is not surprising, given the literature about acoustic cues of stress/prominence, for example, Lehiste (Reference Lehiste1970) and Erickson and Niebuhr (Reference Erickson and Niebuhr2023); also see Section 2.2.

Table 2.3Correlation analysis for perceived prominence and acoustic measures

The correlation analysis for perceived prominence and vowel acoustic measurements – mean duration, mean F0, mean intensity, F1, and F2. *** indicates significance < 0.001, ** indicates significance < 0.01, * indicates significance < 0.05, and † indicates significance < 0.1 (marginally significant).

Table 2.3Correlation analysis for perceived prominence and acoustic measures
The correlation analysis for perceived prominence and vowel acoustic measurements – mean duration, mean F0, mean intensity, F1, and F2. *** indicates significance < 0.001, ** indicates significance < 0.01, * indicates significance < 0.05, and † indicates significance < 0.1 (marginally significant).
Variables	R	p-value
Prominence and duration	0.740	(< .001) ***
Prominence and mean F0	0.302	(0.065) †
Prominence and max F0	0.442	(0.005) **
Prominence and intensity	0.381	(0.015) *
Prominence and F1	0.053	(0.743)
Prominence and F2	0.353	(0.026) *

As for formant values, perceived prominent syllables have significantly higher F2 values (p < .05), implying the vowel /æ/ tends to be more fronted when it is perceived as having nuclear stress. Pam produced by three of the speakers has higher F2 (see Table 2.5) and is perceived as having nuclear (utterance) stress for four of the speakers. (Note that for S2, Pam is perceived has having nuclear stress, but nap has the highest F2.) Interestingly, however, F1 values in this current study are not remarkably raised for the perceived prominent syllable, even though we see significant correlation with increased F1 and articulatory prominence, that is, jaw displacement, both in this study as well as previous studies.

For perceived boundaries, as shown in Table 2.4, vowel duration is also significantly correlated with phrase boundaries. Long syllables tend to be followed by a phrase boundary. This is confirmed in our study. In addition, perceived phrase boundaries have marginally significantly higher mean F0, and significantly higher vowel mid-point F0, indicating listeners’ sensitivities to changes in F0 when a boundary is perceived. Intensity is nonsignificant. Also, the F1 value tends to be increased for syllables with a perceived boundary; specifically, it is relatively high in chance or nap, which are perceived as the final syllables in the phrases. To summarize, based on this small data set, increased duration, heightened F0, and raised F1 tend to cue phrase boundaries. Similar findings about duration and F0 as cues to phrase boundaries have been reported by, for example, Yang et al., Reference Yang, Shen, Li and Yang2014, but as far as we know, no one has reported raised F1 as a cue.

Table 2.4Correlation analysis for perceived boundaries and acoustic measures

The correlation analysis for perceived boundary and mean duration, mean F0, mean intensity, F1, and F2. *** indicates significance < 0.001, * indicates significance < 0.05, and † indicates significance < 0.1 (marginally significant).

Table 2.4Correlation analysis for perceived boundaries and acoustic measures
The correlation analysis for perceived boundary and mean duration, mean F0, mean intensity, F1, and F2. *** indicates significance < 0.001, * indicates significance < 0.05, and † indicates significance < 0.1 (marginally significant).
Variables	R	p-value
Boundary and duration	0.740	(< .001) ***
Boundary and mean F0	−0.316	(0.053) †
Boundary and mid F0	−0.371	(0.022) *
Boundary and intensity	−0.193	(0.233)
Boundary and F1	0.310	(0.052) †
Boundary and F2	0.240	(0.136)

Table 2.5Average F2 values for four syllables of five speakers

Average F2 values (in Hz) for the four syllables for each of the five speakers’ utterances. F2 values in bold indicate the highest value among the four syllables in the utterance. Asterisks mark perceived nuclear stress.

Table 2.5Average F2 values for four syllables of five speakers
Average F2 values (in Hz) for the four syllables for each of the five speakers’ utterances. F2 values in bold indicate the highest value among the four syllables in the utterance. Asterisks mark perceived nuclear stress.
	Pam	chance	chat	nap
S1	*2284.8	1985.7	2002.2	2086.1
S2	1634.7	1592.6	1646.1	*1671.1
S3	*1889.6	1612.4	1515.5	1875.0
S4	*1851.4	1843.7	1463.8	1514.7
S5	2043.8	*2109.8	1995.9	2023.3

Our data suggest that when one’s jaw is open more, the acoustic cues including duration, F0, and intensity seem to increase, which in turn affects our judgment of prominence and boundaries. However, the interaction between perception of prominence, phrase boundaries, and articulation of prominence is complex and needs to be explored further.

2.6 Jaw and Phrasal Stress in Other Languages and Applications for Language Teaching

This section is a brief review of jaw studies (e.g., Vatikiotis-Bateson and Kelso, Reference Vatikiotis-Bateson and Kelso1993; Erickson et al., Reference Erickson, Kawahara, Shibuya, Suemitsu and Tiede2014, Reference Erickson, Iwata and Suemitsu2016; Kawahara et al., Reference Kawahara, Erickson, Moore, Suemitsu and Shibuya2014a; Erickson and Kawahara, Reference Erickson and Kawahara2016; Smith et al., Reference Smith, Erickson and Savariaux2019; Erickson, Reference Erickson2021) showing how patterns of jaw displacement are language-specific, reflecting the prominence patterns (metrical organization) of the language; being aware of these language-specific differences has applications for teaching second language prosody. The above sections discussed jaw displacement patterns of English utterances. Here we look at jaw displacement patterns of languages that are purported to be edge-strengthening languages (e.g., Jun Reference Jun and Jun2005, Reference Jun and Jun2014), such as French, Japanese, and Mandarin.

As shown in Figure 2.7, jaw displacement increases at the end of each phrase, with the largest amount of jaw displacement at the end of the utterance – which is in accord with these languages being called edge-strengthening. For the Japanese and Mandarin utterances, we also see increased jaw displacement at the beginning of the utterance, as if the speaker uses increased jaw displacement to indicate “I’m starting to speak” and “I’ve finished speaking.”

Figure 2.7

Jaw displacement patterns for French, Japanese, and Mandarin.

Jaw displacement patterns from top to bottom for French (Natacha didn’t attach her cat, Pasha, who escaped), Japanese (That’s why Mana’s hair is smooth), and Mandarin (Mother curses the horse). The term AP refers to accent phrase, and the final AP is referred to as an IP (intonational phrase).

Three graphical representations depict the jaw displacement patterns for words in different languages. See long description.

Figure 2.7 Long description

Each graph shows the acoustic properties of the speech sounds. The horizontal axis represents the following range of values. 1000 to 3500, 1000 to 3000 and 1000 to 2200. Arrows pointed at A P 1, A P 2, A P 3, A P 4, I P, m a 1, m a, m a 4 and m a 3 mark specific acoustic segments of the words. The sets are labeled with words of foreign languages like French, Japanese and Mandarin.

In learning a second language, beginners tend to carry over their first language jaw displacement patterns to their second language (e.g., Wilson et al., Reference Wilson, Erickson, Vance and Moore2020; Erickson and Niebuhr, Reference Erickson and Niebuhr2023) – see Figure 2.8.

Figure 2.8

Jaw displacement for L2 Japanese and French speakers, and an L1 English speaker.

Jaw displacement patterns for L1 Japanese (top row), L1 French (middle row) and L1 English (bottom row) speakers for the utterance I saw five bright highlights in the sky tonight.

A set of three graphs shows wavy lines which represents acoustic data. Underneath the wavy lines, there are words in English. Arrows point troughs in each graph. The text at the bottom reads, I saw five bright highlights in the sky tonight.

In Figure 2.8, the L1 English speaker (bottom row) shows clear patterns of strong–weak jaw displacement, with the largest jaw displacement on nuclear utterance stress sky, the next largest on phrasal stress highlights, and then on foot stress five. The speakers of Japanese (top) and French (middle), both L1 speakers of edge-strengthening languages, show different patterns of jaw displacement when speaking English. The L1 Japanese speaker shows more jaw displacement on highlights and on five, but no clear reduction on bright and lights, as seen for the L1 English speaker. This makes sense since Japanese does not reduce vowels. For the L1 French speaker, the largest jaw displacement is at the end of the phrase, highlights, produced with one sustained jaw lowering; for the foot, five bright, more jaw displacement is on the final member of this foot, bright. These patterns suggest that the French speaker may be carrying over her L1 edge-strengthening pattern of increased jaw displacement for phrase-final positions. As for the final phrase, sky tonight, both L1 Japanese and French speakers carry over their phrase-final jaw patterns to produce the utterance-final word night with more jaw displacement than on sky – a jaw pattern not seen with L1 AE speakers. All AE speakers put reduced jaw movement on night, resulting in a more shwa-like vowel for this word.

Recent work with teaching English jaw displacement patterns to Japanese learners of English (Wilson et al., Reference Wilson, Erickson, Kawahara and Monou2019, Reference Wilson, Erickson, Vance and Moore2020) suggests that when a Japanese speaker acquires an English pattern of jaw displacement, not only does the stressed word have larger jaw displacement but the following word has less jaw displacement. The acoustic results were increased F1 and decreased F2 on the stressed word (with the low vowel /a/) and decreased F1/increased F2 on the following word (also a low vowel /a/), thus making the second vowel reduced, and resulting in a more English-type strong–weak rhythm. These results encourage teaching second language prosody using a kinematic approach.

The results also encourage thinking about jaw mechanics; perhaps the jaw cannot make two consecutive large jaw displacements without pause or resetting. To produce two or three consecutive syllables, one jaw opening will be stronger than the adjacent ones. In an oft-times called stress-timed language such as English, where increased prominence/jaw displacement occurs within each prosodic structure (e.g., word, foot, phrase, utterance), the other syllables in the prosodic structure will be produced with less displacement. Studying displacement patterns of various languages and how to kinematically teach these to second language learners can lead to an increased understanding of the biophysical constraints of our articulators, and also the segmental articulators, for example, the lips, tongue, and velum, since syllable prominence affects both the vowel nucleus as well as the onset and coda segmental articulators (e.g., Heldner, Reference Heldner, van Dommelen and Fretheim2001; Giavazzi, Reference Giavazzi2010; Fletcher, Reference Fletcher, Hardcastle, Laver and Gibbon2013; Svensson Lundmark et al., Reference Svensson Lundmark, Ambrazaitis and Ewald2017, for discussion of how prominence can affect segmental phonology; Kawahara et al., Reference Kawahara, Masuda and Erickson2014b, for discussion of the interaction between jaw and segment acoustics in Japanese; de Jong et al., Reference de Jong, Beckman and Edwards1993, reporting that stressed syllables show less coarticulatory overlap than non-stressed syllables).

This chapter has focused on the jaw as a prosodic syllable articulator; work is underway for further exploration into the interactions between crucial segmental articulators and the syllabic articulator, as they work together to orchestrate the rhythm of a language.

Future work into the relation between utterance prominence, jaw displacement, and language rhythm requires, first and foremost, more articulatory data. Since articulatory data collection is still a relatively expensive undertaking both in terms of time and money – for example, EMA-type facilities – currently, some new approaches are being developed for recording articulatory data. One of them is being developed by Oliver Niebuhr at the University of Southern Denmark – the MARRYS hat – which registers jaw displacement via two bending sensors in the cheek straps on both sides, timed with the audible voice signal; it is cheaper and also requires reduced preparation and processing times, allowing the collection of jaw data from a large number of speakers – even in the field, given its mobility (Erickson et al., Reference Erickson, Niebuhr, Gu, Huang and Geng2020b; Niebuhr and Gutnyk, Reference Niebuhr and Gutnyk2021; Svensson Lundmark et al., Reference Svensson Lundmark2023; Svensson Lundmark and Niebuhr Reference Svensson Lundmark and Niebuhr2025; Weston et al., Reference Weston, Svensson Lundmark, Erickson and Niebuhr2023).

To summarize, we suggest that speech rhythm is based on an abstract metrical hierarchical organization of syllable stress patterns. These stress patterns are implemented by jaw articulation patterns, that is, how much the jaw lowers for each syllable within the utterance. For English, within each prosodic unit (word, foot, phrase, utterance), one syllable has incrementally the largest jaw displacement, such that the word/syllable with the most prominence in the utterance also has the largest jaw displacement. Languages such as French, Japanese, and Mandarin have different underlying metrical structures, and hence different patterns of jaw articulation from English. Previous studies of jaw displacement patterns for English utterances show how the patterns relate to metrical patterns of utterance stress, and how increased jaw displacement is always accompanied by formant changes – specifically, increased F1 (or decreased F2 minus F1) for low vowels. Since jaw patterns vary depending on the language, language learners often apply their first language jaw displacement patterns to produce the second language.

Box 2.1Chapter Overview