15.1 Introduction
To state that language use is fundamentally multimodal is uncontroversial for usage-based linguists. It has long been recognized that the primary setting of language, its ur-context (Cienki 2016: 605), is face-to-face interaction, in which we simultaneously draw on speech, gesture, posture, facial expressions, and other non-verbal cues to convey meaning. Yet it is only a fairly recent development that cognitive linguists have started to fully embrace the multimodal nature of language use by working with authentic, video-recorded discursive data and by developing theories to account for how semiotic modes work together in conceptualization.
A serious boost for multimodality research from a cognitive-linguistic perspective came from pioneering studies on multimodal metaphor and metonymy as expressed in co-speech gesture (Mittelberg 2006, 2019; Cienki 2008; Müller 2008; Cienki & Müller 2008) as well as in pictures and video (Forceville 2008; for an overview see Sanaz 2013 and Feyaerts et al. 2017). Over the past decade, other cognitive-linguistic paradigms, most notably Construction Grammar, have followed that path and widened their focus to the kinesic modalities. A growing number of construction grammarians have raised the issue of whether, in light of the inherent multimodality of human language use, the status of constructions as pairings of verbal forms and verbally encoded meanings needs to be reconsidered (Andrén 2010; Cienki 2015, 2016; Zima 2017a, 2017b; Zima & Bergs 2017; Feyaerts et al. 2017; Schoonjans 2018). At the same time, interactional linguists and gesture researchers have turned to Construction Grammar in search of a model of linguistic knowledge and cognitive representation that can account for the tight coupling of verbal and kinesic structures observed in language use (Lanwer 2017; Stukenbrock 2020; Debras 2021).
This convergent development, which one may hope will usher in a fully-fledged multimodal turn in Cognitive Linguistics (Zima & Brône 2015), originates in the very core of the usage-based model and its premise that all knowledge of language is abstracted from language use. The implications of fully embracing the multimodality of language use, though, are far-reaching for Cognitive Linguistics. The issue opens up the question of “what counts as language?” (Cienki 2016: 606) and thus what the research objects of Cognitive Linguistics should be. Furthermore, many theoretical debates that are ongoing within the field come to the fore with even greater saliency once we take a broader, multimodal perspective (Cienki 2017: 1). This also holds for the nascent field of multimodal Construction Grammar, which struggles with a number of theoretical and empirical issues and is occasionally met with skepticism within Construction Grammar and gesture studies alike (Ningelgen & Auer 2017; Lanwer 2017; Ziem 2017; Debras 2021). Therefore, my aim for this chapter is to present the current state of the ongoing debate on whether “we really need a multimodal Construction Grammar” (Ziem 2017: 1). I will start by giving a basic introduction to what gestures are, how they convey meaning, and why the discussion on the constructional status of gestural information concerns only co-speech gestures. There is no controversy that emblematic gestures (also called ‘emblems’) are constructions in their own right, just as signs of sign languages are (Hoffmann 2017). To illustrate co-speech gestures’ close integration with speech, I will show how they contribute to an utterance’s meaning at all levels and also touch upon issues of the temporal alignment between gestures and speech.
Both are crucial aspects to be borne in mind when exploring the possible existence of multimodal constructions and the nature of the constructicon.[1]
15.2 What Are Gestures and How Do They Convey Meaning?
Lay people often use the word ‘gesture’ as an umbrella term covering all sorts of hand movements, ranging from pointing gestures, iconics, and depictions to unspecific hand movements such as scratching one’s head or fiddling with one’s wedding ring. In gesture studies, the concept is on the one hand employed in a broader sense, encompassing all sorts of bodily articulators, such as the hands, the head, shoulders, arms, feet, and also facial expressions. On the other hand, the analytical focus is restricted to what Adam Kendon has termed ‘gesticulation’: “visible bodily action used as an utterance or as part of an utterance” (Kendon 2004: 7), or for short “utterance visible action” (Kendon 2014: 7). Gestures are thus produced with the intent to be semantically and pragmatically meaningful and are thereby an integral part of utterance construction. In Kendon’s words, they are “employed to accomplish expressions that have semantic and pragmatic import similar to, or overlapping with, the semantic and pragmatic import of spoken utterances” (Kendon 2014: 7). In a similar vein, Calbris (2011: 6) defines gestures as “visible movement[s] of any body part consciously and unconsciously made with the intention of communicating while speech is being produced” (my emphasis).[2] Both definitions emphasize gestures’ communicative meaning or deliberate expressiveness and hence exclude bodily movements that are not produced with the intent of encoding semantic-pragmatic meaning but rather reveal aspects of the speaker’s emotional or psychological state. The boundary, however, is not clear-cut, and the analysis of authentic discourse always reveals a number of ambiguous cases.
Nonetheless, there is consensus on what constitutes the core domain of co-speech gesture or ‘visible bodily action’: Gestures are kinesic movements that point towards a referent (present or imagined), depict a concrete or abstract referent, or serve to structure discourse.
David McNeill, one of the leading researchers in the field of psycholinguistic modality research, has therefore proposed a gesture typology comprising four types: deictics, iconics, metaphorics, and beats (McNeill 1992).[3] Deictic, iconic, and metaphoric gestures are referential in nature; that is, they relate to a referent either by pointing to it or by depicting it. This referent may be a concrete entity (iconics) or an abstract one (metaphorics).[4] Beat gestures (also called ‘batons’; Efron 1941; Ekman & Friesen 1969) are coordinated with the rhythm of the speech they accompany. Their relationship to speech is not semantic but discursive-pragmatic, as they are often used to stress or emphasize a particular aspect. With respect to form, they usually consist of a back-and-forth, up-down, or left-right movement.
Another gesture category, not mentioned in McNeill (1992), which may nonetheless also play a role in multimodal Construction Grammar, is that of recurrent gestures (Ladewig 2014), such as the palm-up open hand (Kendon 2004; Müller 2004), the throwing-away gesture (Bressem & Müller 2014), and cyclic gestures (Ladewig 2011). Their main characteristic is that they “show a stable form–meaning relationship” (Ladewig 2014: 1158) and are thus more conventionalized than the spontaneous gestures that fall within the four other categories. However, they have not (yet) developed into emblems such as, for example, the thumbs-up gesture. They are thus not fully conventional signs or constructions in a Construction Grammar sense with a speech-independent semantics, as is the case with emblematic gestures (for an overview of emblematic gestures, see Teßendorf 2014). Rather, the meaning of recurrent gestures is schematic. Most notably, Bressem and Müller (2017) propose that one such recurrent gesture, the throwing-away gesture, constitutes the gestural component of a verbo-gestural pattern expressing negative assessment, which may qualify as a multimodal construction in a Construction Grammar sense (explained in more detail in Section 15.4.1).
Other important facts about gestures pertain to how they are produced in time and in relation to speech, and to how they encode meaning differently from verbal language. Pioneering work by Adam Kendon (1980) identified different phases in the execution of a gesture. The central phase is the so-called stroke phase; it is this phase from which we extract the gesture’s meaning. The stroke phase is preceded by a preparation phase, in which the hands move from the rest position into position to perform the stroke (usually close to the center of the speaker’s gesture space; McNeill 1992). The stroke phase may be followed by a retraction phase, in which the hands move back into the rest position. Hold phases between these phases (e.g., a post-stroke hold) or within the stroke phase are also possible and often correlate with verbal disfluencies.
These gesture phases combine to form higher-level units: gesture phrases and gesture units. Gesture phrases comprise preparation phases and strokes, while gesture units involve the full movement cycle from preparation to retraction. Gesture phrases, gesture units, and speech are temporally aligned with each other in a particular way: The preparation phase usually precedes the articulation of the lexical affiliate, that is, the lexical element that is semantically co-expressive with the gesture. Concerning the gesture stroke, there is more controversy. Some studies argue that the stroke onset may start and even end before the affiliate is articulated (Ferré 2010; ter Bekke et al. 2020), while others report that the stroke coincides with the affiliate (Chui 2005; McNeill 2005). Focusing on the relationship between gesture and intonation, Loehr (2004) reports that the stroke most typically shortly precedes or coincides with the utterance’s focus accent. This is corroborated by follow-up studies (Jannedy & Mendoza-Denton 2005; Shattuck-Hufnagel et al. 2007). Although the details of the temporal alignment between speech and gesture are thus partly subject to debate, it is uncontroversial that the two are closely aligned, and this temporal alignment is mirrored in their semantic alignment, as co-speech gestures are generally considered to be co-expressive.
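The phase structure and alignment pattern described above can be made concrete with a small sketch of how time-aligned gesture annotations might be represented computationally. This is a hypothetical illustration only; the class and function names are my own and do not correspond to any established annotation tool or to the coding schemes of the cited studies.

```python
# Hypothetical representation of Kendon-style gesture phases and their
# temporal alignment with speech. Names and data are illustrative only.
from dataclasses import dataclass

@dataclass
class Phase:
    kind: str        # 'preparation', 'stroke', 'hold', or 'retraction'
    start: float     # onset in seconds
    end: float       # offset in seconds

@dataclass
class Word:
    form: str
    start: float
    end: float

def gesture_phrase(phases):
    """A gesture phrase spans preparation through stroke."""
    kinds = [p.kind for p in phases]
    i, j = kinds.index('preparation'), kinds.index('stroke')
    return phases[i].start, phases[j].end

def stroke_onset_before_affiliate(phases, affiliate):
    """Check one reported alignment pattern: stroke onset at or
    before the onset of the lexical affiliate."""
    stroke = next(p for p in phases if p.kind == 'stroke')
    return stroke.start <= affiliate.start

# Toy example: a gesture depicting circular motion, affiliate 'circles'
phases = [Phase('preparation', 0.0, 0.3),
          Phase('stroke', 0.3, 0.9),
          Phase('retraction', 0.9, 1.2)]
affiliate = Word('circles', 0.4, 0.8)
print(gesture_phrase(phases))                         # (0.0, 0.9)
print(stroke_onset_before_affiliate(phases, affiliate))  # True
```

In this toy case the gesture phrase runs from preparation onset to stroke offset, and the stroke onset precedes the affiliate, matching the pattern reported by, for example, ter Bekke et al. (2020); other alignments discussed above would simply yield different timestamps.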
However, verbal language and gestures co-express meaning in different ways. Speech is segmented on various levels, that is, into phonemes, lexemes, phrases, constructions, etc. Although gestures can be segmented, too, their meaning is not compositional. Rather, they are considered to constitute one meaningful whole. Furthermore, co-expressiveness should not be confused with semantic redundancy. Although gestures can of course co-express meaning that is also encoded verbally, it is common for them to express meaning aspects that are not specified in speech (Kendon 1980; McNeill 1992). For instance, if a speaker recounting a soccer match says “the defender tackled me and I lost the ball” while moving their right arm to depict an elbow check, we infer from this gesture that the act of tackling involved an elbow check and take it to be the reason the speaker lost the ball. But the gesture does more than that. It also conveys specific information about how the elbow check was performed, including how quickly and with how much physical force, and whether the elbow was moved along a horizontal trajectory or was lifted, possibly to aim at the opponent’s upper body or face. All this information is packed into, and inferred from, the gesture, and it is obviously far more economical to depict all of it in one gesture than to put it into words. Hence, there is a division of labor between the verbal and the gestural modality or, put differently, both work together to convey one thought. Kendon (1980) has termed this one thought the underlying ‘idea unit’. Speakers, however, do not only use referential gestures to express content that is easier to depict than to recount; they also use gestures to highlight meaning aspects (Alibali & Kita 2010; Schoonjans 2018).
This point has been made most notably by Müller (2008), who argues that when a metaphor is co-expressed in gesture and speech, the metaphoricity of the construal is activated to a higher degree than when the metaphor is present in only one modality. This is reminiscent of Givón’s (1985) principle of quantitative iconicity: “More form is more meaning.”
Besides these communicative and semantic-pragmatic functions, gestures also serve a number of functions linked to speech production and interaction management. For instance, gestures are frequent during persistent word searches, and it has been argued that gesturing helps to overcome word-retrieval problems because the motor activity stimulates cognitive activity (Krauss 1998) and reduces cognitive load (Goldin-Meadow et al. 2001). Accordingly, it is claimed that gestures, including representational gestures, have self-oriented cognitive functions (Kita et al. 2017). At the same time, gestures also play a role in the turn-taking process, as they are used to allocate turns as well as to signal a wish to take the turn (Mondada 2007; Schmitt 2014; Zima 2018). Gestures are hence multi-functional, and this multi-functionality has a number of implications for how co-occurrences of gestures and verbal constructions may be modeled within Construction Grammar.
After this compact overview of some of the main characteristics of the gestural modality, the next section addresses this chapter’s main concern: Are constructions multimodal and how do we know? We start with the theoretical seeds of the idea.
15.3 Multimodal Constructions? The Discussion’s Theoretical Foundations
This contribution focuses on the place of co-speech gestures within Construction Grammar, most notably Cognitive Construction Grammar (Goldberg 1995, 2006), which clearly subscribes to the usage-based thesis (Barlow & Kemmer 2000). Accordingly, it holds that all linguistic knowledge is abstracted from language use, drawing on general cognitive mechanisms such as pattern recognition, abstraction, schematization, and categorization. This usage, that is, the input, is inherently multimodal: We do not only speak with words; we also gesture, direct our gaze, display emotions and attentional states through our postures and facial expressions, speak up sometimes and whisper at other times, and so on. Language is thus learned in a multimodal environment (Enfield 2009). Most notably, children gesture before they are able to speak, and the language acquisition process is heavily dependent on co-speech gesture use (e.g., to establish the link between a given concept and its name). The dependence of language use on co-speech gesture diminishes in the course of language acquisition (Cienki 2015), but communication in face-to-face interaction nonetheless remains inherently multimodal throughout the lifespan. Therefore, one theoretical argument put forward in favor of a multimodal reconceptualization of language is grounded in the fact that we obviously have extensive, systematic, and structured knowledge of how to communicate in multimodal environments. This knowledge must be stored, that is, entrenched, in one way or another. The crucial question is: Is it part of linguistic knowledge, of grammar?
Usage-based linguistics models grammar as “the cognitive organization of one’s experience with language” (Bybee 2006: 2916), and construction grammarians have posited that this cognitive organization consists of constructions only. This idea is often referred to by citing Goldberg’s iconic statement (2003: 226), “It’s constructions all the way down.” Hilpert (2014: 2) has rephrased the same idea: “Knowledge of language consists of a large network of constructions, and nothing else in addition.” However, there is a crucial difference between Bybee’s and Hilpert’s quotes: Bybee refers to grammar, whereas Hilpert speaks of knowledge of language. This is not a trivial difference, as the way we model knowledge of gesture use in the Construction Grammar framework crucially depends on how we conceptualize the relationship between grammar and language.
From Hilpert’s encompassing view of constructions and the constructicon, one may infer that constructions must include information on how to instantiate them multimodally: if all knowledge of language is stored as part of constructions, and the way we use constructions is surely entrenched knowledge, then that knowledge must reside in the constructions themselves. This resonates with the line of argument put forward in Zima (2014b): If we accept that the constructicon comprises all of our knowledge of language but conceptualize language and constructions as purely monomodal, we are left with the unresolved problem of explaining why the usage-based thesis should hold only for recurrences at the verbal level, and where our rich knowledge of how to communicate multimodally, that is, to employ constructions multimodally, is stored.
Another take on the issue, however, is to view grammar and knowledge of language as non-equivalent. On this view, knowledge of language includes grammatical knowledge as well as other forms of knowledge abstracted from language usage, potentially including knowledge of how to combine constructions with gestures. Quite a few authors have advocated this position (Ningelgen & Auer 2017; Ziem 2017; Verhagen 2021), proposing it as a way out of the current impasse in the field. This discussion is not settled and cannot be resolved in this chapter, but one consequence seems evident: If grammar and knowledge of language only partly overlap, the Construction Grammar claim that knowledge of language is “constructions and nothing else in addition” (Hilpert 2014: 2) may not be tenable and may need to be revised.
In this context, it is important to note that the discussion on where to locate gestures in the constructicon did not really originate within Construction Grammar but was instigated by Ronald Langacker, who explicitly acknowledged that gestures may be part of a linguistic unit:
In Cognitive Grammar …, the form in a form–meaning pairing is specifically phonological structure. I would of course generalize this to include other symbolizing media, notably gesture and writing. … Cognitive Grammar takes the straightforward position that any aspect of a usage event, or even a sequence of usage events in a discourse, is capable of emerging as a linguistic unit, should it be a recurrent commonality.
In 2008, he became even more specific, giving the example of a co-speech gesture performed in baseball:
When a baseball umpire yells Safe! and simultaneously gives the standard gestural signal to this effect (raising both arms together to shoulder level and then sweeping the hands outward, palms down), why should only the former be analyzed as part of the linguistic symbol? Why should a pointing gesture not be considered an optional component of a demonstrative’s linguistic form?
The theoretical statement and the example, however, differ in one important respect. In the case of the umpire signal, the gesture is a mandatory component of the sign; that is, the signal is not adequately performed if one only yells Safe! without gesturing. Therefore, from a Construction Grammar perspective the status of this form–meaning pairing as consisting of a verbal and a gestural component is rather uncontroversial, and some authors have indeed argued in a similar vein, proposing that constructions are multimodal if and only if a gestural component is mandatory and cannot be omitted without the construction being incomplete (Ningelgen & Auer 2017; Ziem 2017). In the case of the baseball signal, completeness is determined by sports convention: performing the gesture without yelling Safe! is not uninterpretable, but it is treated as pragmatically unacceptable. This is because at some point people agreed upon the convention that, for the umpire signal to be effective and consequential, the verbal and gestural parts have to be performed together. In other cases, especially some deictic constructions that are often discussed as candidate multimodal constructions (Stukenbrock 2010, 2015, 2020; Ningelgen & Auer 2017; Balantani 2021), completeness is a semantic-pragmatic category. This concerns, for instance, deictic constructions like [like that/this] or [this ADJ] (also German so ‘like this’; Stukenbrock 2015; Ningelgen & Auer 2017).
These are uninterpretable without a gesture that specifies the deictic slot, for example by depicting how a certain action has to be performed (‘you need to hold your hand like this’) or by specifying the shape of an object or some spatial dimension (‘the hole was this big’). For constructions that involve an obligatory gestural component, multimodal unit status is likewise uncontested. Rather, the debate centers on the questions of whether obligatoriness of a gestural component is a prerequisite for multimodal constructions and whether gestures can fill optional slots of multimodal constructions. The latter hypothesis is grounded, among other things, in Goldberg’s definition of constructions as frequency dependent:
Any linguistic pattern is recognized as a construction as long as some aspect of its form or function is not strictly predictable from its component parts or from other constructions recognized to exist. In addition, patterns are stored as constructions even if they are fully predictable as long as they occur with sufficient frequency. (Goldberg 2006: 5)
Numerous studies have since examined the effects of frequency on unit formation and entrenchment (e.g., Bybee 2006; Schmid 2007, 2014; Blumenthal-Dramé 2012; Divjak & Caldwell-Harris 2015; Divjak 2019; see also Section 15.4.1), providing arguments and counterarguments for the unit status of highly frequent instantiations alongside more abstract and/or unpredictable constructional patterns, while at the same time agreeing that ‘sufficient frequency’ is too vague a term to serve as an operational criterion (Traugott & Trousdale 2013: 11; for discussion, see also Hartmann & Ungerer 2023). At the same time, the exemplar view advocated by Bybee (2010) holds that even constructions one comes across only once or a couple of times in one’s life may be stored in long-term memory if some salient aspect makes them stick in the mind. The exact role of frequency in Construction Grammar is hence still disputed (Hoffmann 2013), and this has implications for multimodal Construction Grammar. Obviously, it is impossible to define a frequency threshold for gesture recurrence on which any claim about the constructional status of a (verbal) construction–gesture co-occurrence can safely be based. This has been the most critical issue in multimodal Construction Grammar so far. It touches upon the recognizable gap between the general acceptance of the claim that language is multimodal and the difficulties of proving that a particular construction is multimodal in nature. The next section sketches the state of the art in the field.
15.4 State of the Art in Multimodal Construction Grammar
The current debate in the field comprises two main strands. The first includes construction-based case studies that in one way or another rely on the frequency of gesture co-occurrence as an argument for or against the multimodal status of constructions. The second strand takes a more gesture- and meaning-centered approach. The remainder of this section is structured as follows: Section 15.4.1 presents the state of the art in the field by focusing on the case studies conducted so far. These studies lay the groundwork for approaches that draw on them to propose novel ways of thinking about the issues under debate, most notably Cienki’s (2017) proposal of an ‘utterance construction grammar’. These proposals are discussed in Section 15.4.2.
15.4.1 Case Studies
As outlined above, one of the main arguments brought forward in favor of a multimodal reconceptualization of the constructicon and constructions is grounded in the claim that “any recurrent aspect of a construction’s usage can become entrenched” (Langacker 2001). Over the past decade, several studies have shown that gestures recurrently and systematically co-occur with given verbal constructions, but co-occurrence frequencies vary strongly. They range from up to 85 percent for English motion and distance constructions ([all the way from X PREP Y]; see Zima 2014b, 2017a, 2017b, and also Pagán Cánovas & Valenzuela 2017) to approximately 70 percent for different types of English time expressions (Pagán Cánovas et al. 2020), 58 percent for English aspectual verbs (Hinell 2018), and 37 percent (and less) for German modal particles (Schoonjans 2018). To date, except for Ningelgen & Auer (2017) on deictic so ‘like this’ in German (see the discussion in Section 15.3 on mandatory gestures with particular deictic expressions), no study has reported co-occurrence rates of 100 percent, and it seems safe to say that only very few verbal constructions would qualify as multimodal if a 100 percent co-occurrence rate were taken as the sole criterion. Ziem (2017) takes this to be a strong counterargument against the multimodal conception of constructions and the constructicon. Similarly to Ningelgen and Auer’s line of argumentation, he proposes deletion tests, arguing that the gesture’s contribution to the meaning of the construction must be so crucial that without the gesture the construction collapses and becomes uninterpretable.
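The co-occurrence rates reported above are, in essence, simple proportions over annotated construction tokens. The following sketch shows how such a rate might be computed and why any fixed threshold for ‘sufficient frequency’ remains arbitrary. The data and the threshold values are invented for illustration; they do not come from the cited studies, and the 85 percent figure is merely chosen to echo the rate reported for spatial uses of [all the way from X PREP Y].

```python
# Hypothetical annotated corpus tokens of one verbal construction:
# each token records whether a co-speech gesture accompanied it.
# Data and thresholds are illustrative only.

def cooccurrence_rate(tokens):
    """Share of construction tokens accompanied by a gesture."""
    with_gesture = sum(1 for t in tokens if t['gesture'])
    return with_gesture / len(tokens)

# 20 toy tokens, 17 of them produced with a gesture
tokens = [{'id': i, 'gesture': i < 17} for i in range(20)]
rate = cooccurrence_rate(tokens)
print(f"{rate:.0%}")   # 85%

# The methodological problem: any cut-off is arbitrary. The same
# 85 percent rate passes or fails depending on the chosen threshold.
for threshold in (0.5, 0.85, 1.0):
    print(threshold, rate >= threshold)
```

The last loop makes the point of the frequency debate concrete: an 85 percent rate counts as evidence for a multimodal construction under a 50 or 85 percent criterion but fails Ziem’s implicit 100 percent (obligatoriness) criterion, and the literature offers no principled way to fix the cut-off.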
A different path is followed by Lanwer (2017), Schoonjans (2017), and most recently Debras (2021), who argue that mere frequency is rather uninformative and that the analytical focus needs to shift to how gestures contribute to utterance meaning. Debras (2021) links this to a general complaint that Construction Grammar focuses too much on form. If we take the verbal construction and its form as the point of departure, we tend to consider the co-occurring gesture as secondary and optional, that is, something we add while we speak but could equally well leave out. Our notion of constructions and the constructicon, however, may look fundamentally different if we start from the meaning side (cf. Lasch 2020 and his meaning-centered approach to the German constructicon) and shift the focus to how gesture and speech collaborate to express an idea, that is, the ‘idea unit’ in Kendon’s (2004) sense. This is the line of argumentation followed by, for example, Hoffmann (2017), Mittelberg (2017), Bressem & Müller (2017), Schoonjans (2018), and (partly) Zima (2014b, 2017a, 2017b).
Starting from an emergent grammar perspective, which takes grammar to be “the name for certain categories of observed repetitions in discourse” (Hopper 1998: 156), Mittelberg (2017) presents a case study on the German existential construction [es gibt X] ‘there is an X’. She argues that this particular construction involves a slot for a gestural enactment that depicts an act of “giving or holding something” (Mittelberg 2017: 1). This gestural re-enactment is grounded in the basic pattern of experience that Goldberg has argued to motivate (di)transitive constructions: “The initial meaning is an experiential gestalt. This basic pattern of experience is encoded in a basic pattern of language” (Goldberg 1998: 208). Accordingly, Mittelberg (2017: 2) argues that “the basic manual actions of giving and holding … motivate multimodal instantiations of existential constructions in German discourse.” Drawing on semi-experimental data of spoken German discourse, she illustrates that es gibt constructions co-occur with unimanual variants of the palm-up open-hand gesture as well as bimanual palm-vertical open-hand gestures.[5] Her analysis shows that there is formal recurrence in the gestures, while their semantic-pragmatic meaning is clearly situated and dependent on the discursive context.
The semantic recurrence holds only for the very schematic meaning of “holding some kind of imaginary entity.” As Mittelberg acknowledges, these analyses are preliminary, but her work on existential constructions points towards candidate constructions for future research in multimodal Construction Grammar by suggesting that “linguistic constructions that recruit basic embodied manual actions and interactions with the physical and social world are particularly likely to be instantiated multimodally and thus also engender emergent multimodal patterns, or clusters, of experience” (Mittelberg 2017: 5).
This conclusion seems to be backed up by my own studies on English motion and distance constructions such as [Vmotion in circles], [zigzag], and [all the way from X PREP Y] (Zima 2014b, 2017a, 2017b). In American English data from various TV formats (UCLA Library NewsScape; Steen et al. 2018), I found gesture co-occurrence frequencies ranging between 37 percent and 85 percent. Although these frequencies are considerable, gesturing with these constructions is obviously not mandatory, at least not under every circumstance. If one were to perform a deletion test, as proposed by Ziem (2017), the conclusion would have to be that none of these constructions is multimodal in nature, as the constructs remain interpretable without the gesture. Yet the gestures are not redundant; they add to the meaning of the utterances. In particular, the iconic gestures make a certain aspect of the conceptualization particularly salient, in line with the quantitative iconicity principle of “more form is more meaning.” The following examples illustrate this.
In example (1), the speaker is telling a story and enacting a scene from a hockey game. The gesture, which consists of consecutive rapid movements of the right hand, emphasizes both the marked path of motion (in circles) and the velocity (faster and faster). It thus serves to highlight and draw attention to the semantic aspects of path and manner of motion.
(1) KNBC Tonight Show with Jay Leno, July 16, 2010

This highlighting function also holds for gestural instantiations of temporal and spatial uses of [all the way from X PREP Y] (Zima & Bergs 2017). An example of a spatial instantiation that is accompanied by a co-speech gesture is given in (2). The bimanual gesture performed by the speaker depicts and thereby emphasizes the long distance between location X (Long Beach) and location Y (Lancaster), communicating that the task of delivering food to all clients in this area on a single day is difficult.
(2) KNBC 4 News at Noon, December 25, 2012

Frame grab (1) shows the first stroke of the gesture, which is co-produced with the articulation of the first geographical reference point (Long Beach) instantiating the X-slot of the constructional template. Frame grab (2) depicts the second stroke, which is aligned with Lancaster. The right and the left hand mark the beginning and the endpoint of a spatial path, respectively. The space between the two extended hands maps onto the distance between the two places.
Based on both the analysis of the gestures’ semantic-pragmatic meaning and their frequency (63 percent for [V(motion) in circles]; 85 percent for spatial uses of [all the way from X PREP Y]), it is argued that we should not treat these seemingly redundant, co-expressive gestures as totally optional. Rather, our focus should be more data-centered, acknowledging and trying to explain the fact that speakers recurrently do gesture. Following Kendon (2004) and Calbris (2011), these gestures are produced with the intention to convey meaning and hence cannot be dismissed as ‘just optional’.
An equally meaning-centered approach is taken by Bressem and Müller (2017). Starting from a recurrent gesture, the so-called throwing-away gesture, they illustrate that this gesture can be combined with a number of different verbal constructions spanning a wide range of grammatical categories such as particles, nouns, verbs, and adverbs. The throwing-away gesture is “characterized by a particular kinesic core: a lax flat hand oriented vertically with the palm facing away from the speaker’s body flapping downwards from the wrist” (Bressem & Müller 2017: 3). Just as Mittelberg argues for palm-up open-hand gestures that co-occur with German existential constructions, Bressem and Müller argue for an experiential basis of the gesture, which they situate in the embodied experience of throwing concrete entities away. This is extended to metaphorical uses when referring to abstract objects in speech. They thus identify a constructional pattern, which they term the “negative assessment construction,” with the multimodal form [throwing-away gesture] + [particles/negation/N/V/ADV]. From a theoretical perspective, they suggest the compelling idea that whether constructions are multimodal in nature is probably not a polar question requiring a yes-or-no answer. Rather, verbal constructions may constitute a multimodal network, with some of them being more, and others less, bound to particular gestures.
A further pioneering study is Schoonjans’ (2018) monograph on German modal particles and the role of manual and head gestures in co-expressing down-toning meanings. His study is among the very first not only to raise theoretical questions but also to perform a large-scale corpus analysis that inquires in detail into the interdependence of verbal constructions and non-verbal co-occurrence patterns. The frequencies reported for multimodal instantiations of the modal particles under scrutiny are rather low (37 percent and less), but this should not lead one to dismiss Schoonjans’ results and his approach. Indeed, he raises and discusses a number of issues that are critical for future endeavors in multimodal Construction Grammar. These include the problem that recurrence (e.g., Langacker 2001) involves the assumption that there is a stable formal and semantic core that is common to all instantiations and results from subtracting all in situ variation. However, as Bressem (2013) illustrates, the form of manual gestures may vary along a great number of dimensions including hand shape, orientation, movement, and position in gesture space; therefore, “no two tokens of gesture are ever identical” (Harrison 2009: 82). Put differently, the issue of whether two gesture tokens are instantiations of the same gesture type is far from trivial.
Another methodological problem with far-reaching implications that Schoonjans draws attention to is the fact that there is not always perfect temporal alignment between the verbal construction and a co-expressive gesture. For instance, the performance of gesture phrases and units may take more time than the articulation of the lexical affiliate and, more importantly, the lexical affiliate may not be just one verbal construction but a larger semantic unit within an utterance. To date, all these issues are unresolved. As Schoonjans (2017) argues, however, many of them are not restricted to attempts to develop a multimodal Construction Grammar but also concern monomodal Construction Grammars. This concerns most notably the still debated link between frequency and entrenchment (Hoffmann 2013, 2017) but also the question of the level of granularity at which one assumes a construction to be situated.
15.4.2 Theoretical Proposals: Monomodal, Multimodal Construction Grammar, or Something Else?
Monomodal Construction Grammars posit that constructions exist at every level of granularity or schematicity, ranging from highly abstract patterns to lexically and syntactically fully fixed ones. They further allow for constructions to have optional slots. Therefore, one may consider it arbitrary to posit that verbal elements can be optional but gestural ones need to be obligatory. At the same time, one may equally wonder whether non-obligatory elements in verbally defined constructions are cognitively real or whether they rather point towards the existence of different constructions at different levels of granularity. This issue is raised by Lanwer (2017), who suggests that the difference between mono- and multimodal constructions may be one of degree of schematicity. On this view, multimodal constructions comprising a given [verbal form + gesture] pairing may be stored alongside more schematic monomodal ones that do not involve a slot for a co-speech gesture. This argument is grounded in a very basic claim of Construction Grammar, namely that constructions may be stored redundantly at different levels of granularity. He further argues that, in order to account for the varying frequencies of constructions’ co-occurrence with gestures and the varying degree of constructions’ dependence on gesture, we should consider thinking of a multimodal network of interrelated constructions as prototypically structured and as involving fuzzy boundaries.
This idea is worked out in more detail in Cienki (2017). He introduces the idea of an Utterance Construction Grammar, with utterance being defined as “a level of description above that of speech and gesture for characterizing audio-visual communicative constructions” (Cienki 2017: 1). The suggestion of yet another model of linguistic knowledge is grounded in the conviction that it may be futile to try to coerce gestures into a verbally based constructional framework. In taking the utterance as his point of departure, Cienki aligns with Kendon’s approach to gesture as “utterance dedicated visible bodily action” and speech as “utterance dedicated audible bodily action” (Kendon 2015: 44, cited in Cienki 2017: 3) as well as with Langacker’s concept of the ‘usage event’, defined as including “the full phonetic detail of an utterance, as well as any other kinds of signals, such as gestures and body language” (Langacker 2008: 457, cited in Cienki 2017: 3). His proposal that constructions have a deep as well as a surface structure is reminiscent of two concepts traditionally associated with Generative Grammar, but Cienki stresses that the terms are borrowed without adhering to the nativist assumptions that underlie the Universal Grammar approach. The deep structure is conceptualized as “a set of tools that can be drawn upon to express the construction,” whereas the surface structure is “a metonymic representation of some (if not all) elements of the construction” (Cienki 2017: 3). Accordingly, information about which gestures go with a construction is stored in the construction’s deep structure. Constructions thus exhibit an inherent potential for multimodal realization, and some aspects of this potential may get activated and become visible in a construction’s surface representation, that is, in a construct.
Crucially, potential component elements as part of the deep structure may differ in being more or less prototypically associated with the construction. This way of thinking about constructions, Cienki (2017: 5) argues, “is a more flexible alternative than positing that the model has the binary choice between required and optional elements” and is more compatible with the idea of various degrees of entrenchment.
Cienki thus proposes a new way of thinking about many issues that have turned out to be challenging for multimodal Construction Grammar. However, one may wonder how these ideas could be put to the test. In that vein, Hoffmann (2017) emphasizes the need for larger-scale data studies and the application of quantitative and statistical methods that go beyond absolute and relative frequencies (as, for example, in Zima 2014b, 2017a, 2017b; Schoonjans 2018).
An example of such a quantitative approach is a recent study by Debras (2021) on French je (ne) sais pas ‘I don’t know’. Her approach is not explicitly situated within multimodal Construction Grammar. However, her paper involves an interesting discussion of why the constructional approach does not do full justice to the semantic-pragmatic import of co-speech gestures, arguing that the original Construction Grammar focus on verbal constructions entails that gestures are regarded as “secondary and dependent on speech” (Debras 2021: 42). At the same time, she concludes that the association of the various uses of je (ne) sais pas as a pragmatic marker with recurrent gestures is too loose to allow for a straightforward categorization as a multimodal construction. In that respect, the methodology applied in her study is especially interesting and points to a potentially fruitful direction: based on a qualitative, multimodal analysis of eighty-four occurrences, she identifies three multimodal profiles of je (ne) sais pas. A multiple correspondence analysis is then performed to identify the strength of association between all annotated parameters, which include phonetic realization, prosodic detail, functions, type of co-speech gesture, and several others. It turns out that the variable ‘type of co-speech gesture’ accounts for a large portion of the variation in the dataset and is thus only loosely associated with particular phonetic realizations and functions.
Mirroring the ongoing discussion on obligatoriness, frequency, and prototype structure in the field of multimodal Construction Grammar, these results may thus be interpreted in two ways: either as evidence that je (ne) sais pas is clearly not a multimodal construction, or as an argument for the need for a more nuanced model along the lines proposed by Zima (2017a, 2017b), Lanwer (2017), Cienki (2017), and Schoonjans (2018).
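For readers unfamiliar with the technique, the core of a multiple correspondence analysis of the kind Debras applies can be sketched in a few lines. The sketch below is a minimal illustration in Python only: the annotation variables, their values, and the toy tokens are invented for the example and do not reproduce Debras’ actual annotation scheme or data.

```python
# Minimal multiple correspondence analysis (MCA) sketch using numpy/pandas.
# Variable names and toy annotations are invented for illustration only.
import numpy as np
import pandas as pd

def mca(df: pd.DataFrame, n_components: int = 2):
    """Basic MCA on a frame of categorical annotation variables."""
    # One-hot encode the categorical variables into an indicator matrix.
    Z = pd.get_dummies(df).to_numpy(dtype=float)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (one per token)
    c = P.sum(axis=0)                    # column masses (one per category)
    # Standardized residuals: (P - r c^T) / sqrt(r c^T); centering removes
    # the trivial dimension.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates: one point per token in the reduced space.
    coords = (U * s) / np.sqrt(r[:, None])
    return coords[:, :n_components], s ** 2  # coordinates, inertias

# Toy data: each row is one occurrence of a marker with its annotations.
tokens = pd.DataFrame({
    "phonetic": ["full", "reduced", "reduced", "full", "reduced", "full"],
    "function": ["epistemic", "discourse", "discourse",
                 "epistemic", "discourse", "epistemic"],
    "gesture":  ["shrug", "palm_up", "none", "shrug", "palm_up", "none"],
})
coords, inertias = mca(tokens)
```

Inspecting how far a variable’s categories spread across the resulting components indicates how tightly that variable is coupled to the others; on this logic, a gesture variable whose categories scatter widely is only loosely associated with the remaining annotated parameters.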
All these studies hence suggest that there are many ways to conduct research with a multimodal constructional focus. However, in some way they all struggle with similar issues, most notably the difficulty of answering the pending question of where multimodal information is stored in the mind. This question clearly calls for an interdisciplinary approach that brings together experts in multimodal communication and gesture studies as well as cognitive linguists, psycholinguists, and cognitive scientists. However, it seems that one step to take before that is to broaden the empirical basis by conducting more case studies on sufficiently large multimodal datasets. Little is known about how systematic the relationship between given verbal constructions and gestures really is. So, where do we go from here?
15.5 The Road Ahead
As I hope to have shown in this chapter, the inquiry into the potential multimodality of constructions and the constructicon is still in its infancy and faces a number of theoretical and methodological challenges. These relate to the debated role of frequency of co-occurrence, the status of open slots in constructions, and the issue of whether or not grammar is restricted to verbal symbols. Some of these issues are intrinsic to the Construction Grammar framework but come to the fore with greater salience when we extend the focus towards multimodal communication. This may leave readers with the impression that the endeavor is futile altogether. I would like to close this chapter with a different conclusion. Much of the current discussion in the field of multimodal Construction Grammar suffers from a top-down approach; instead, we should adopt a more bottom-up perspective. Many arguments, including those presented in Zima (2014a, 2014b, 2017a, 2017b), Zima and Bergs (2017), and in this chapter, start from the basic tenets of Cognitive Linguistics, the usage-based model, and especially Cognitive Construction Grammar. It is argued that there is a discrepancy between the acknowledgment that language use is multimodal and the way we theorize about language and language use in Construction Grammar. While this observation is valid, the discussion about the place of gesture (and other non-verbal modalities) within communication and grammar remains a purely theoretical one unless we ground it in a much broader empirical basis. Too little is known about how consistent co-occurrences and mappings between the verbal and the gestural modalities are on a constructional level.
Therefore, we need many more case studies, and this includes studies that start out from verbal constructions and their multimodal instantiations as well as more gesture- and meaning-centered ones. This entails the need for sufficiently large, annotated, multimodal corpora. The NewsScape Library (Steen et al. 2018) is an exceptionally good starting point for any study on multimodal instantiations of constructions, as it is fully searchable (for verbal constructions) and contains enormous amounts of audio-visual data, not only in English but also in Spanish, Russian, German, and many more languages. Of course, this is not to say that smaller multimodal corpora cannot be used. They are especially relevant for constructions that occur frequently enough to yield a sufficiently large dataset. Not least, such smaller corpora are very valuable resources because the NewsScape Library only contains televised interactions and thus no private, face-to-face conversations or other interactional settings.
Finally, we need to broaden our methodological toolkit. To move forward on the issues under scrutiny, we need both qualitative research, which pays close attention to how meaning is expressed in situ in all modalities, and quantitative studies that make use of the full array of statistical methods that have been applied so successfully in Construction Grammar and other Cognitive Linguistics disciplines over the past decade (cf. Janda 2013). Most notably, the issue at hand is a fundamentally interdisciplinary one that calls for an interdisciplinary approach and may not be resolvable by construction grammarians alone.

