1. Introduction
How does a language user plan and produce an intended utterance? How does an interlocutor recognize and understand a perceived utterance? How does a learner acquire the grammar and processing mechanisms that enable these skills? We contend that these questions can be adequately addressed only if multimodal prosody is integrated into our models of language processing and development.
The concept of multimodal prosody captures the fact that the prosodic dimension of language is implemented across both spoken and gestural modalities during perception, comprehension, planning, production and acquisition. Adopting this perspective necessitates bridging three traditionally distinct research strands: spoken prosody, co-speech gestures and the prosodic elements of sign languages. This special issue brings these research areas together, highlighting the advantages of their integration and facilitating a fundamental reconceptualization of prosody by allowing these fields to inform one another.
Traditional accounts have characterized prosody as a linguistic phenomenon manifested at the level of the speech signal. For instance, in the seminal Oxford Handbook of Language Prosody, Gussenhoven and Chen (2020) define ‘language prosody’ as encompassing the pitch, intensity, duration and spectral patterns of the speech signal. These acoustic features convey phonological forms – including segments, syllables, feet and larger phonological constituents – while expressing communicative meanings such as speech acts, epistemic stance, emotions and affect. The editors also point out that languages employ tone, stress, prosodic constituents and intonation to achieve these functions. Numerous pioneering studies have underscored the phonological and pragmatic significance of these acoustic elements across languages (see Gussenhoven & Chen, 2020 for a review and see also Barnes & Shattuck-Hufnagel, 2022 for further discussion), leading to the recognition of prosody as an integral component of language grammar and of human communication.
In recent decades, however, this conventional view of prosody, as a linguistic dimension defined with respect to the spoken modality only, has been increasingly challenged. Research findings supporting this shift can be grouped into three principal arguments: (a) there are robust temporal and magnitude associations between spoken prosody and co-speech gestures; (b) spoken prosody and gesture convey comparable pragmatic meanings in an integrative manner and (c) prosody is manifested through non-acoustic means in sign language and tactile sign language (see Prieto et al., 2024 for a review).
The first main argument challenging the view that prosody is restricted to the spoken modality is the evidence of temporal and magnitude associations between spoken prosody and co-speech gestures. Scholars examining the temporal link between gesture and speech often conceptualize prosody as a sequence of acoustic events that facilitates the temporal anchoring of body gestures to the speech stream. From this perspective, the connection between prosody and gestures is traditionally rooted in their temporal coordination. In his seminal work, for example, McNeill (1992) proposed the ‘phonological synchrony rule’, which states that ‘…the stroke of a (manual) gesture precedes or ends at, but does not follow, the phonological peak syllable of speech’. These insights have inspired numerous studies investigating which acoustic events are temporally coordinated with specific visual signals, as well as what constraints – ranging from sensorimotor and biomechanical to linguistic typology, pragmatic context or developmental factors – influence this temporal coordination (e.g., Kendon, 1980; Krivokapić et al., 2017; Leonard & Cummins, 2010; Loehr, 2012; Pouw et al., 2021, and many others; see reviews in Wagner et al., 2014 and Prieto et al., 2024). While the crucial role of prosody in regulating gesture-speech temporal coordination is now widely acknowledged, significant questions remain regarding (a) how this relationship manifests across varied linguistic typologies (spoken and sign), in language acquisition, and in populations with language and communication difficulties (such as in autism) and (b) how this relationship influences the expression and comprehension of prominence, phrasing and pragmatic intent. This special issue seeks to address these critical gaps.
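To make this notion of temporal coordination concrete, the minimal sketch below shows one possible operationalization of the synchrony rule on time-stamped annotations of the kind typically exported from tools such as ELAN and Praat. The data, tolerance value and function name are hypothetical illustrations, not taken from any paper in this issue; they simply make explicit what studies in this area quantify: onsets, offsets and lags between gestural and acoustic landmarks.

```python
# Hedged illustration: checking a reading of McNeill's (1992) phonological
# synchrony rule against hypothetical gesture and syllable annotations.
# All intervals are (onset_s, offset_s) in seconds; values are invented.

strokes = [(0.42, 0.61), (1.80, 2.05), (3.10, 3.42)]
peak_syllables = [(0.55, 0.70), (1.95, 2.12), (3.35, 3.50)]

def satisfies_synchrony_rule(stroke, peak, tolerance=0.05):
    """One possible reading of the rule: the stroke may precede or end at
    the peak syllable, but must not begin after the syllable has ended
    (allowing a small annotation tolerance)."""
    return stroke[0] <= peak[1] + tolerance

for stroke, peak in zip(strokes, peak_syllables):
    lag = peak[0] - stroke[0]  # positive lag: the stroke leads the accented syllable
    ok = satisfies_synchrony_rule(stroke, peak)
    print(f"stroke {stroke} vs peak syllable {peak}: lag {lag:+.2f}s, rule holds: {ok}")
```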
A second core argument challenging the view that prosody is restricted to the spoken modality concerns the observation that prosody and gesture convey parallel pragmatic meanings in an integrative manner. As research increasingly examines spoken prosody and co-speech gestures in tandem, it is becoming evident that multimodal prosody does more than provide a temporal anchor; it mediates and drives communicative intent and pragmatic meaning through a dynamic interplay of vocal and bodily cues. Substantial evidence suggests that speakers and listeners utilize not only spoken prosody but also an array of visual signals – including facial expressions, manual or head gestures and body posture – to navigate the communication of pragmatic meaning. The papers in this special issue, as well as the references included therein, show that these bodily movements, or co-speech gestures, possess phonological and pragmatic significance akin to the acoustic prosodic signal. This special issue underscores the multimodal nature of prosody by emphasizing the functions and meanings conveyed by co-speech gestures, particularly those that have a non-referential relation with speech. Non-referential gestures are defined as non-imagistic, that is, lacking direct visual imagery related to the content of a spoken utterance. While they have traditionally been termed ‘beat gestures’ and dismissed as mere rhythmic accompaniments devoid of meaning, the studies in this special issue demonstrate that they are deeply meaningful. Their significance, however, is not lexical or semantic in nature, but rather prosodic and pragmatic, serving to highlight information, structure discourse and signal speaker intent and stance.
A third compelling line of evidence demonstrating that prosody transcends the speech signal is found in sign languages. Like spoken languages, sign languages encompass phonology, morphology, semantics, syntax and pragmatics, and they also have a prosodic dimension. The prosodic features of sign languages are, by necessity, manifested through the visual signal. Research has demonstrated that non-manual movements, such as facial expressions and head position, function in tandem with manual signs to fulfil prosodic roles. This special issue includes empirical studies that investigate the visual manifestation of prosody across diverse sign language typologies, offering a cross-modal perspective on how these expressions compare to and inform our understanding of prosody.
In our view, examining the interplay between the vocal and visual dimensions of prosody not only refines theoretical models of both spoken and signed languages but also offers significant practical applications for fields such as language acquisition, speech-language pathology and human–computer interaction. Accordingly, the objective of this special issue is to present innovative research demonstrating that prosody needs to be reconceptualized as an inherently multimodal phenomenon, manifested in the spoken and/or the visual domain. The following sections summarize the contributions to this special issue, which are organized into three core themes: the temporal alignment of spoken and visual aspects of prosody, the coordination of spoken and visual aspects of prosody in conveying pragmatic meaning, and the use of visual correlates of prosody to express pragmatics in sign languages.
2. Theme 1: The temporal alignment of spoken and visual aspects of prosody
This special issue includes seven papers examining the temporal alignment of spoken and visual aspects of prosody, exploring how this temporal alignment is shaped by linguistic (Farinella et al., 2025; Zellers et al., 2025), contextual (Paggio et al., 2025; Wagner, 2025), and speaker-specific factors (Eigsti & Pouw, 2024), as well as by language learning patterns (Baills & Baumann, 2025; Hirata et al., 2024). The influence of linguistic factors is addressed in Farinella et al. (2025), who examine gesture alignment with spoken prominence across dialects of Indonesian, a language whose spoken prosodic structure is not yet well understood. By analysing publicly available videos of four religious leaders in both preaching and interview contexts, the authors found distinct alignment tendencies across different dialects. Their findings demonstrate that the study of co-speech gestures is methodologically crucial for uncovering dialect-specific prosodic structures. Similarly, Zellers et al. (2025) explore linguistic influences on temporal alignment, focusing on gestural behaviour during turn-taking transitions. The study investigates whether gesture-speech alignment reflects a speaker’s intention to hold the turn after completing a sentence or to yield the floor. Analysing data from over 20 German and Swedish speakers in face-to-face conversational corpora, the authors observed significant differences in gesture phase alignment; specifically, locations where speakers maintained the floor exhibited increased gestural activity, with more strokes aligned to sentence completion.
Two studies in this special issue explore the influence of contextual factors on temporal alignment by manipulating the same parameter (i.e. interlocutor’s visual availability), albeit with distinct scientific goals. Paggio et al. (2025) study how the visual availability of an interlocutor’s co-speech gestures influences the perception of multimodal prominence in Maltese. They asked 95 listeners to rate the prominence of target words that had been produced with or without an accompanying manual gesture, while either viewing the gesture (audiovisual condition) or not (audio-only condition). Results showed that words that had been produced with an accompanying gesture were rated as more prominent, even when the listener did not see the gesture, and acoustic analysis showed that the co-occurrence of a gesture influenced the pitch of the target vowel. In contrast, Wagner (2025) examines whether speakers adjust the tightness of temporal alignment based on mutual visibility and information-structural needs. This study tested 20 pairs of speakers of Northern German in a dyadic cardboard game and found that the coordination between gestural and prosodic landmarks remained remarkably stable. This stability persisted regardless of whether the interlocutors could see one another or the specific informational relevance of the verbal message.
The impact of speaker-specific factors on the temporal alignment of spoken and visual aspects of prosody is addressed in Eigsti and Pouw (2024). The authors shed light on whether potential asynchronies in autistic populations originate from motor impairments or, alternatively, from difficulties in language production processes. A group of nine young autistic English-speaking individuals was compared to a group of neurotypical adolescents when producing nursery rhymes accompanied by deliberate, beat-like motor movements. The authors found no significant differences between the groups regarding synchrony, tempo or rhythm of the multimodal events, which led them to conclude that the difficulties of autistic individuals in synchronizing gestural and prosodic events may be rooted in higher-level linguistic processes rather than fundamental motor deficits.
Finally, two papers in this special issue deal with how spoken and visual aspects of prosody temporally interact in the context of language learning. Baills and Baumann (2025) are interested in the temporal alignment of the multimodal marking of focus in second language (L2) learning, as compared to native speakers. Using a spontaneous narrative task, they elicited productions from 25 Catalan learners of French and 19 native French speakers, which were analysed in terms of prosodic prominence (i.e. presence and type of pitch accent), gestural prominence (i.e. presence and type of head gesture) and information status (i.e. whether information was new or given, referential or non-referential). Results showed that Catalan learners of French transfer L1 pitch accent types into the L2, leading to prosodic prominence marking that is not native-like. In terms of gestural marking of focus, the authors did not find differences between populations and instead concluded that co-speech gestures seem to be a secondary strategy for both native and non-native speakers of French, who mainly rely on spoken prosody for this purpose. Hirata et al. (2024) also explore the importance of multimodality for second language learners, but from a different point of view. They ask whether exposure to different types of multimodal input can enhance the learning of prosodic contrasts in an L2. They tested English learners of Japanese, a language in which the pitch contour on a word changes its meaning. The authors found that the most effective training for learning and generalization combined the acoustic signal with a visual–spatial written display of how low and high pitch targets are arranged in the word (compared to training in which the pitch targets were presented in a flat written display or in which a manual gesture mimicked the visual–spatial display). Interestingly, they also found that the type of multimodal input did not influence the learners’ cognitive load, but that observing gestures did strongly impact the learners’ motivation, engagement and enjoyment.
3. Theme 2: The pragmatic contribution of spoken and visual aspects of prosody
The six papers included in this theme explore the coordination of spoken and visual aspects of prosody in conveying various pragmatic meanings, such as epistemic stance (Lehmann & Pentrel, 2024), speech acts (da Silva Miranda et al., 2024; Mahdinazhad Sardhaei et al., 2024), negation (Harrison, 2024), emotion (Petrone et al., 2024) and back-channelling (Blomsma et al., 2024). The following paragraphs detail their contributions, organizing them as a function of whether they address this issue from the production side (Blomsma et al., 2024; Harrison, 2024; Lehmann & Pentrel, 2024; Mahdinazhad Sardhaei et al., 2024) or from a comprehension point of view (da Silva Miranda et al., 2024; Petrone et al., 2024). Together, these studies illustrate that the processes of pragmatic production and comprehension inherently incorporate gestural information, which interacts significantly with spoken prosody.
Four papers focus on the production of the spoken and visual aspects of prosody, showing the flexible association between the two in the expression of pragmatic meaning. First, in light of recent advancements in Construction Grammar, Lehmann and Pentrel (2024) investigate the interaction between verbal, prosodic and gestural constructions of the particle ‘ish’ in English, analysing a total of 406 instances from audiovisual television discourses. In English, the derivational suffix ‘-ish’ has undergone a transformation, now functioning independently as a free particle that conveys vagueness or attenuates the proposition. Their results revealed that the free particle is distinguished from the suffix by a longer duration and greater pitch variability, while also being prosodically detached. Furthermore, the various ‘ish’ constructions were typically associated with distinct sets of gestural features, which support these differing functions, resulting in a cross-modal pragmatic construct. Second, Harrison’s (2024) study assesses how emphasized negation is multimodally expressed in English through the analysis of a set of 50 video examples from televised dialogues. The findings revealed that this multimodal emphasis converges on the syllable-onset consonant, which is produced with significant lengthening, increased pitch and intensity (e.g., ‘NNNNNNEVER’), and is accompanied by a negative palm gesture and facial distortion. These results underscore the strong temporal and magnitude associations between visual and vocal elements in the expression of negation. Third, Blomsma et al. (2024) examined the variability in multimodal back-channel behaviour by analysing different forms of audiovisual feedback in a corpus of 14 addressees who interacted with the same confederate speaker. They did so by means of an experimental paradigm designed to elicit feedback behaviour in a controlled manner (e.g., choosing tangram figures without asking questions). The findings indicated that back-channelling behaviour varied among listeners, with certain individuals demonstrating greater multimodal activity, including more head nods and vocalizations, than others. Furthermore, the occurrence of back-channelling behaviour was contingent upon the presence of back-channelling opportunity points, with some points eliciting more frequent responses than others. The fourth paper, by Mahdinazhad Sardhaei et al. (2024), examines the compensatory trade-off between visual and prosodic elements of interrogative marking in the presence of degraded visual or acoustic information. The authors sought to determine whether the facial and prosodic indicators of questionhood in Farsi – a language that lacks morpho-syntactic markers for polar questions – are affected under conditions of degraded information, such as whispering and the use of face masks. Ten Farsi speakers produced ten pairs of statements and questions in both normal and whispered speech modes, with and without face masks. The findings revealed that (a) in whispered speech, speakers produced longer durations of sentence-final words and enhanced orofacial expressions and (b) the use of masks covering the mouth during whispered speech led speakers to further intensify their orofacial expressions (i.e. higher raising of the eyebrows). These results support the presence of a multimodal ‘trade-off’ mechanism between visual and acoustic manifestations of questionhood in adverse conditions.
Turning to pragmatic comprehension, two papers focus on the role of visual and spoken aspects of prosody in the processing of speech acts and emotions, using similar experimental paradigms. First, da Silva Miranda et al. (2024) set out to investigate the contribution of intonation and visual information to the comprehension of wh-questions and wh-exclamations in Brazilian Portuguese. A total of 36 listeners participated in an identification experiment using auditory–visual congruent and incongruent stimuli, in order to investigate the relative importance of each modality in signalling pragmatic meaning. Results showed that while the congruent stimuli increased the identification of wh-questions and wh-exclamations, the incongruent stimuli tended to decrease listeners’ identification. Although variation in the accuracy of detecting both speech acts was found across speakers, this study corroborates that the visual channel impacts the perceptual identification of sentence mode, a core pragmatic function of intonation. Second, Petrone et al. (2024) conducted two behavioural studies to examine whether the emotional information present in facial gestures biases the interpretation of an upcoming acoustic signal (e.g., angry or neutral prosody). In the first study, the facial information was congruent with the auditory stimuli, while in the second study, it was incongruent. Two hundred native French speakers participated in the two tasks. The findings indicated that (a) facial cues to anger influenced the evaluation of angry and neutral spoken prosody, suggesting that multimodal emotion perception applies even when one sensory input precedes the other in time and (b) while all participants used visual information when paired with congruent emotional spoken prosody, only some relied on facial gestures to interpret neutral spoken prosody. These results show that cross-modal affective priming may play a facilitatory role in everyday communication, for example, when making emotional predictions in conversation.
4. Theme 3: Visual prosody in the pragmatics of sign languages
Two papers in this special issue deal with the pragmatics of sign languages and how pragmatic meanings are expressed through visual prosody. First, Cruz and Frota (2025) compared differences and commonalities in how spoken and sign languages use visual correlates of prosodic meaning in the context of question marking. They tested five deaf users of Portuguese Sign Language (LGP) and five speakers of European Portuguese (EP), analysing their use of head lowering and eyebrow movements. The results showed that the kinematic features of head movements differ between LGP and EP. That is, LGP participants utilized larger head-falling movements – and distinct movement shapes – to distinguish questions from statements. Conversely, in EP, these kinematic parameters were less distinct between speech act types, revealing that visual cues are secondary in the spoken modality. This dynamic interaction was also observed across movement types: when participants produced eyebrow movements, there was a decrease in the amplitude of the head-lowering movement. Second, Karabüklü and Gürer (2024) also explored visual correlates of prosody in sign languages but studied a different type of meaning: discourse focus. They tested 20 deaf users of Turkish Sign Language in an elicitation paradigm in which participants had to respond to target questions that were specifically created to elicit responses with different focus types and different syntactic positions of the focalized element. The authors found that participants used lengthening of the manual signs to mark contrastive focus (and, to a lesser extent, presentational focus). Contrary to previous studies of other sign languages, they found no evidence that Turkish Sign Language uses non-manual markers as a prosodic strategy to mark focus. These results are interesting because they reveal cross-language variation in the prosodic strategies that sign languages use to highlight discourse elements and suggest that these strategies need not involve non-manual signs.
5. Conclusions, future directions and challenges
In summary, the evidence presented in the papers of this special issue supports a reconceptualization of prosody as a phenomenon that is realized through diverse sensory modalities (auditory, visual and even potentially tactile – as shown in Franken et al., 2022) in spoken and sign languages. A compelling indicator of the evolving nature of this emerging field is the rich variety of terms that have been used to discuss these phenomena, both in this special issue and in the field in general. The reader will notice that these terms are sometimes synonymous and sometimes only partially overlapping. The term ‘Multimodal Prosody’ is used by many papers of this special issue, while some researchers have preferred the term ‘Audiovisual Prosody’ (a term initially introduced by Swerts & Krahmer, 2005). Both terms refer to the same phenomena, but the latter emphasizes the different sensory channels through which prosody is perceived. Other researchers use ‘cross-modal’ or ‘multi-channel’ to refer to gesture-speech coordination or the way spoken prosodic and gestural events synchronize during speech. Recently, Prieto et al. (2024) have advanced this discussion by introducing the concept of ‘Modality-Neutral Prosody’, which allows for the consideration of the prosodic component of language as non-specific (or ‘neutral’) to any sensory channel (auditory, visual, tactile) or language modality (spoken or sign language). Irrespective of the terminology adopted, it is evident that prosody can no longer be conceptualized solely as an aspect of the acoustic speech signal; rather, it is an integrated, multi-sensory pillar of the human linguistic system.
The evidence that prosody is not solely a characteristic of the speech signal shows that it is also embodied: our bodily movements can manifest prosodic structure. This integrated perspective serves a dual purpose: first, it challenges the traditional boundaries of spoken language prosody, necessitating a comprehensive re-evaluation of our understanding of how speakers, listeners and learners engage with the prosodic dimension of language. Second, this perspective contests the conventional treatment of non-referential gestures within gesture research, which has typically regarded non-referential co-speech gestures as body movements that do not contribute to meaning and are primarily characterized by their beat-like form and rhythmic timing. Current research, including that reported in this special issue, indicates that non-referential co-speech movements are prosodically structured and contribute to pragmatic meaning in a manner analogous to spoken prosody.
The reader of this special issue will discover, through the empirical evidence presented herein, that prosodic structures are fundamental to addressing the questions posed at the outset: how a language user plans and produces an intended utterance; how an interlocutor recognizes and understands a perceived utterance and how a learner acquires the grammar and processing mechanisms that enable these skills. Crucially, these prosodic structures are not exclusive to the spoken domain; rather, visual cues are an essential aspect, and one ignores them at one’s peril when constructing models of the human language capacity. As the editors of this special issue, we are pleased to extend an invitation to researchers in the fields of speech prosody, gesture and sign language to engage in a comprehensive dialogue to re-evaluate and integrate these disciplines. This initiative presents the dual challenge outlined above: reconsidering traditional approaches to language prosody, with all that this implies for how speakers, listeners and learners utilize the multisensory prosodic aspects of language, and rethinking the status of non-referential co-speech gestures within gesture research. This new look at prosody and gesture encourages and even necessitates interdisciplinary collaboration to develop integrative models that capture the dynamic interplay between prosody in the spoken and visual (and potentially tactile) domains. Ultimately, this approach seeks to enrich our understanding of language as a multisensory, embodied phenomenon that transcends language modalities.