1 Introduction
Proposals that the semiotic system of gesture played a pivotal role in the evolution of language have been, and continue to be, influential (Żywiczyński, 2018).* This statement, however, illustrates not so much a specific theory as an axis of debate in the field of language origins, along which “gesture-first” proposals traditionally compete with “speech-first” theories (Fitch, 2010). Below this general characterization, there are many differences between gestural origin theories, including the understanding of basic notions such as the concept of gesture itself. Thus, we begin this chapter by discussing definitional issues concerning gesture, with differences both within and across fields. In the context of language origins, such differences in defining “gesture” have profound consequences for formulating and evaluating theories. Since our aim is to survey a considerable number of contemporary gestural theories of the origin of language, we do so by using a new typology, developed with the help of cognitive semiotics (Sonesson, 2007; Zlatev, 2015a), and in particular the notions of semiotic system and polysemiotic communication (Stampoulidis, Bolognesi, & Zlatev, 2019; Zlatev, 2019; Zlatev et al., 2023). In very brief terms, a semiotic system is a combination of signs or signals of a particular type, defined by characteristic properties, and the interrelations between these signs/signals. Universal human sign systems are language, gesture and depiction (the latter understood as forming marks on a two-dimensional surface that resemble three-dimensional or imaginary referents).
Signal systems, for example spontaneous facial expressions and non-linguistic vocalizations, are under less voluntary control than sign systems (Zlatev, Żywiczyński, & Wacewicz, 2020). Combinations of different sign and/or signal systems form the basis for polysemiotic communication or polysemiosis.
Applying this conceptual apparatus to gestural theories of language origins, the basic distinction relates to the question of whether the semiotic system of gesture played an exclusive role in early stages of language evolution or whether other semiotic systems were involved as well. A positive answer to the first question implies monosemiotic theories, which we review in Section 3; a positive answer to the second implies polysemiotic theories, to which we turn in Sections 4 and 5. The latter are more commonly known as “multimodal theories,” but we avoid this term due to its excessive ambiguity. Importantly, we distinguish between two kinds of polysemiotic theories of language evolution: (a) equipollent, where language and gesture are considered equally prominent from the onset, and (b) pantomimic, where gesture played the main but not exclusive role in breaking from predominantly signal-based to sign-based communication.
After reviewing the evidence for each of the three kinds of gestural origin theories, we conclude that the last kind, namely pantomimic theories, appears to offer the most viable account of language origins. Further, it has the benefit of accounting for the evolution of polysemiotic communication as a whole.
2 Different Ways of Understanding Gesture
The work of Adam Kendon is hugely influential within gesture studies, and so is his characterization of gesture as bodily “movements that partake of […] features of manifest deliberate expressiveness to an obvious degree” (Kendon, 2004, p. 14). “Expressiveness” implies that the communicator intends something by a gesture; “deliberate,” that this is done with a communicative intent; and “manifest,” that these features are to be discernible by an audience. As the notion of deliberateness is controversial, and is denied, for example, by McNeill (2012), gestures may be more generally defined as “expressive movements performed by the hands, the head, or any other part of the body, and perceived [predominantly] visually” (Zlatev, 2015b, p. 458, emphasis in original). How are gestures to be distinguished from other visually perceived communicative movements, such as adaptors and facial expressions (e.g. Żywiczyński, Wacewicz, & Orzechowski, 2017)? An intuitive proposal for delineating this “lower boundary” of gesture is presented by Andrén (2010), who argued that two dimensions of gestural meaning need to be distinguished: communicative explicitness (CE) and representational complexity (RC). Within each, there are three different levels, and at least one of the two dimensions should be on the third level for a bodily act to count as a gesture. For example, the bye-bye wave does not represent a goodbye but performs it, so it lacks the highest level of RC. Yet, it is typically produced with the communicative intent to be understood as bidding farewell, and thus has the highest level of CE. On the other hand, an act of symbolic play performed in solitude would also qualify as a gesture, since by definition it is on the highest level of RC (if it is to be “symbolic”), even though it completely lacks a communicative intent. Most iconic (i.e. resemblance-based) gestures used in face-to-face communication would have the highest levels of both CE and RC.
Other researchers of human gestural communication adopt a narrower definition of gestures. Kendon’s approach encompasses so-called orofacial gestures: communicative movements of the facial muscles and the tongue other than the articulatory movements of speech. These would not be regarded as gestural by many (e.g. Orzechowski, Wacewicz, & Żywiczyński, 2014), though see Chovil (this volume). McNeill (1992, 2012) would further limit (prototypical) gestures to spontaneous and idiosyncratic hand and arm movements that are functionally integrated with speech: a highly restrictive definition, grounded in his theory of the nature of gesture (Section 4.2).
On the other hand, within primatology, gestures are typically understood even more broadly than they are by Kendon, as concerning any communicative behaviors that involve body posture, facial expressions, and manual movements, and are mainly perceived visually (e.g. Pika, 2008a; Tomasello, 2008). Voluntary control (Byrne et al., 2017) and the related properties of flexibility (Bard, Maguire-Herring, Tomonaga, & Matsuzawa, 2019) and plasticity (Pollick & de Waal, 2007) are commonly used to differentiate gestures from other behaviors. For example, to initiate grooming, infant chimpanzees indicate the place where they want to be groomed, first by looking at this spot and then by touching it (Bard et al., 2019). This example also illustrates the receiver-directedness of ape gestures, further underlined by persistence: “[I]f the recipient’s response is not satisfactory, […] the signaller […] repeat[s] the produced signal” (Fröhlich, Sievers, Townsend, Gruber, & van Schaik, 2019; cf. Leavens, Hopkins, & Thomas, 2004; Tomasello, George, Kruger, Jeffrey, & Evans, 1985), and by elaboration: If the recipient fails to respond in a satisfactory way, a gesture different from the original one may be used.
These findings testify to significant cognitive and semiotic complexity in ape gestures, possibly differentiating them from ape vocalizations. At the same time, this evidence does not amount to full-fledged communicative intent, which implies not just “intention” in the sense of volition, but a Gricean (second-order) intention that the addressee recognize the primary intention (Zlatev et al., 2013), and it is not clear if apes can either produce or recognize gestures with such intentions. Further, unlike human gestures with the highest form of representational complexity (see above), ape gestures hardly refer to an object that is distinct from the communicator or the addressee; that is, they are “dyadic” (me–you) rather than “triadic” (me–you–referent) (Hurford, 2007), which prevents them from being full-fledged signs (Zlatev et al., 2020).
As pointed out in the introduction to this chapter, these definitional differences imply that any comparison and discussion of gestural origins theories needs to proceed with care, and preferably with the help of a uniform conceptual apparatus, such as the one we propose below. In each of the following three sections, we review and evaluate one of the three basic types of gestural theories: monosemiotic, polysemiotic-equipollent, and polysemiotic-pantomimic.
3 Monosemiotic Gestural Theories
3.1 General Features
Monosemiotic gestural theories claim that early stages of language evolution depended exclusively on gesture. They share the postulate of monosemiosis with vocal, or “speech-first,” theories, and face complementary difficulties (Section 3.7). The rise of modern language-evolution research in the latter part of the twentieth century was marked by the formulation of gestural hypotheses of language origin, starting with the work of Gordon Hewes and colleagues (Hewes et al., 1973). This both contributed to the methodology of language-evolution research, such as the use of converging evidence (Żywiczyński & Wacewicz, 2019, pp. 122–124), and defined dominant positions in the field. In the following subsections we review a number of such monosemiotic gestural theories, before proceeding with evaluation.
3.2 Hewes’ Gestural Primacy Hypothesis
Hewes put forward a theory of a gestural protolanguage, termed the Gestural Primacy Hypothesis (Hewes et al., 1973), and suggested how this protolanguage transitioned into speech (Hewes, 1977a). Hewes’ conception of protolanguage can be described as synthetic (as opposed to holistic) (Żywiczyński & Wacewicz, 2019, p. 187), as it takes protolanguage to have consisted of gestures as quasi-lexical units standing for objects and actions that could be combined into sequences while, overall, lacking syntactic and morphological structure. Some of Hewes’ arguments focus on the observation that in naturally occurring conversation, gestures usually accompany speech; but the bulk of the evidence he marshaled in support of his theory can be summed up as follows:
Anthropological data. Hewes analyzed logs of European travelers, who were apparently able to communicate with indigenous peoples by means of gestures, even about highly complex topics such as topography, dangers that awaited newcomers, politics, or religion.
Comparative data. After multiple failures to teach non-human apes elements of spoken language (Furness, 1916; Hayes & Hayes, 1952; Kellogg & Kellogg, 1933), there followed several successful attempts to teach them visually perceived forms of communication, including elements of American Sign Language (ASL) (Gardner & Gardner, 1969, 1971; Premack, 1970; Premack & Premack, 1974). Based on these findings, Hewes argued that there is continuity between ape and human gestural behaviors, and discontinuity in vocal behaviors (Hewes, 1977a, 1977b; Hewes et al., 1973).
Neurocognitive data. Hewes appealed to evidence from neuropathology that indicated the relative immunity of gestural communication in language-related disorders (e.g. Hewes, 1977a, pp. 132–133), and to research on handedness and lateralization, where he drew attention to the fact that right-hand dominance for manual actions coincides with left-hemisphere dominance for language processing and production (cf. Knecht et al., 2000).
Signed languages. Hewes claimed that sign(ed) languages are universal in the sense that they can appear from scratch thanks to a high degree of iconicity (Hewes, 1977a), which has since been confirmed by studies of emerging signed languages (Meir, Sandler, Padden, & Aronoff, 2010a; Senghas & Coppola, 2001; Senghas, Kita, & Özyürek, 2004).
The combined force of this evidence led Hewes to the conclusion that gestural protolanguage had constituted the first form of hominin communication on the path to modern language. In fact, he was the first to use the term “protolanguage” in the technical sense of a transitional system between, on the one hand, the signal-based communication of apes and, on the other, human language. The visionary character of Hewes’ project is also testified to by the fact that it relied on areas of research that have served as sources of evidence in language-evolution debates ever since.
When evaluating the Gestural Primacy Hypothesis, it should be noted that although Hewes did not expressly commit himself to a narrow understanding of gesture as the communicative action of the hands and arms (see Section 2), some of his key arguments indicate that he envisaged protolanguage as relying primarily on manual gesture. Such is the import of his discussion of the relation between handedness and language, and the way he used signed languages and comparative studies likewise favors a narrow definition of gesture. For example, the manual character of protolinguistic communication directly motivates Hewes’ explanation of the so-called volar depigmentation of the inner part of the hand in non-Caucasian populations: that it may be an adaptation for gestural communication, as it increases the visibility of the hands in the dark (Hewes, 1996). Apparently, he ignored the fact that volar depigmentation also affects the sole of the foot.
3.3 Stokoe and Research in Signed Languages
The second part of the twentieth century saw the beginning of modern research on signed languages, founded on the postulate that they are not qualitatively different from spoken languages (Emmorey, 2002; Stokoe, 1960; Stokoe, Casterline, & Croneberg, 1965). The pioneers of signed-language linguistics had a keen interest in language origins. For example, early works emphasize that gesture has a greater iconic potential than vocalization, which makes gesture a better candidate for a communicative system based on signs rather than signals (Zlatev et al., 2020) and hence a likely starting point for the evolution of language (Stokoe, 1960). Using insights from the emergence of signed languages, Stokoe argued that the spatial character of gestures could have facilitated the emergence of rudimentary syntax, as gestures are able to represent not only an action but also the agent who performs it and the patient affected by it (Armstrong, Stokoe, & Wilcox, 1995; Stokoe, 1991). It was proposed that nouns were derived from the shape and position of the hands and arms, verbs from their actions, and that, collectively, they gave rise to prototypical sentences (Armstrong & Wilcox, 2007).
Hence, Stokoe and his collaborators proposed theories consisting of the following evolutionary stages: (a) gestural protolanguage with holistic and iconically motivated signs, (b) gestural language with discrete but iconically motivated signs and combinatorial syntax, (c) the transition into speech, which promoted conventionalization and growth of syntactic complexity.
3.4 Corballis’ Manual Protolanguage
Corballis’ position on the gestural origin of language is somewhat ambivalent. On the one hand, he understands gesture broadly, as comprising a heterogeneous variety of forms of bodily action: from spontaneous hand movements accompanying speech, to glances, postures, and even orofacial gestures of the mouth area (Corballis, 2002, 2003). On the other hand, most of his arguments for the gestural origin of language focus on manual gesture. In presenting them, Corballis organizes and upgrades many ideas put forward by Hewes and Stokoe. For example, he reviews the evidence from attempts to teach non-human apes manual-visual and vocal forms of communication, but also points to the fact that primates in general, and apes in particular, acquire manual skills with relative ease. These include not only communicative but also praxic skills (mainly related to tool use), which contrasts with the difficulty with which they acquire vocal skills (Corballis, 2012). He also brings up the problem of the neural infrastructure of language, focusing on the nature of the primate mirror neuron system and its homology with language circuits in the human brain (Corballis, 2003). In his view, this can explain both the relative success in teaching apes to communicate gesturally rather than vocally and the correlation between handedness and cerebral asymmetry for language (Corballis, 2012). On this basis, he argues that “manual gesture [is] a natural communication medium” (Corballis, 2013, p. 203) for primates and, hence, that the evolution of language must have begun with some form of manual protolanguage.
In relation to modern human communicative capacity, he uses the standard lines of argumentation for the gestural origin of language, referring to (a) the use of the hands as the most natural way to represent events in space and time in the absence of a shared code, and (b) the ready invention of sophisticated signed languages by the deaf (Corballis, 2013). Taking stock of these two points as well as the postulate about the continuity of ape and human gestural behaviors, Corballis posits a scenario according to which language began with an internal capacity to engage in so-called mental time travel – “the mental reconstruction of personal events from the past (episodic memory) and the mental construction of possible events in the future” (Suddendorf & Corballis, 1997, p. 133). For him, the adaptive pressure for the evolution of this ability came with the uncertain and dangerous ecology of the Pleistocene era, which required long-term planning and a suite of other skills (Corballis, 2019). The type of gesture hominins inherited from the Last Common Ancestor with apes was naturally adept at communicating sequences of past and future events. From this, Corballis envisages the gradual evolution of a communicative system, in analogy to the historical emergence of signed languages: It first relied on pantomime – understood as holistic iconic gesture – and later developed conventional signs as well as syntax (Corballis, 2019). In this regard, his scenario resembles the trajectory of language evolution drawn by Stokoe and colleagues.
3.5 Arbib’s Mirror Neuron Hypothesis
The Mirror Neuron Hypothesis of Arbib (2005, 2012, 2016) is one of the most elaborate and empirically best-documented current theories of language evolution. As its name implies, Arbib envisages the mirror neuron system, and in particular the subsystem involved in grasping, as the basis for his evolutionary scenario. From this basis evolved capacities for complex action recognition and complex imitation, which allowed for the imitation of aspects of observed movements even if they are not part of the imitator’s current stock of actions, thus introducing new variants of actions into the “praxicon,” an individual’s repertoire of actions. When coupled with communicative intentions (according to Arbib, 2016, already present in non-human apes), this developed into the communicative system of pantomime, which Arbib understands as holistic, impromptu gesture that “allows the transfer of a wide range of action behaviors to communication about action and much more – whereby, for example, an absent object is indicated by outlining its shape or miming its use” (Arbib, 2012, p. 177). The impromptu character of pantomime resulted in its low replicability, as each pantomimic sign had to be invented and interpreted anew. The pressure for communicative effectiveness brought about the conventionalization and segmentation of pantomime, ultimately leading to the emergence of gestural “protosign.” In this way, holistic pantomime was transformed into a synthetic gestural protolanguage, as in the theories reviewed above.
Unlike conceptions of pantomime derived from mimesis theory (see Section 5), Arbib’s theory is consistently monosemiotic: The early stages of language evolution are limited to the semiotic system of gesture, first as holistic pantomime and then as conventionalized gestural protosign. When describing gestural protolanguage, Arbib is not as emphatic as Hewes, Stokoe, or Corballis about the importance of manual gesture at early stages of language emergence. However, the selection of the starting point of language evolution – the Mirror Neuron System for grasping and manual praxic actions attributed to our Last Common Ancestor with monkeys (LCA-m) – and his account of the transition of protolanguage into speech both imply a key role for manual gesture.
3.6 Tomasello’s Pointing and Pantomime
Tomasello (2008, 2009) refrains from articulating a detailed scenario of the evolution of language. The significance of his work for language origins lies in rich empirical evidence of a developmental and comparative nature, which focuses on the emergence of pro-sociality and shared intentionality as prerequisites for language. In terms of communication, uniquely human forms of social cognition are realized, according to Tomasello, in two kinds of gesture: pointing and pantomime. The first is manifest in informative-declarative pointing: pointing performed with the intention of providing the recipient with new information. “Pantomiming,” the term Tomasello prefers to “pantomime,” comprises iconic manual or whole-body gestures, which are used “(i) to indicate that this is the action I want you to perform, or that I intend to perform myself, or that I want to tell you about; and (ii) to request or otherwise indicate an object that ‘does this’ or an object that ‘one does this with’” (Tomasello, 2008, p. 67). Pantomiming is thus capable of expressing an open-ended range of meanings, which are action-orientated and displaced from the here-and-now. Such gestures do not have internal morphological structures analyzable into discrete component parts. Similarly, pantomimes themselves are not replacements for words but correspond to larger units that are at least proposition-size. Words can only complete communicative acts by being combined with other words, but pantomiming, in Tomasello’s account, can serve as a complete communicative act with its own illocutionary force.
3.7 Evaluation
Monosemiotic theories tend to focus on three lines of argumentation and corresponding evidence: (a) the gestural and vocal communication of non-human apes, (b) the expressive potential of gesture in contemporary interpersonal communication, and (c) neural links, such as those between hand and mouth. Let us consider each of these in turn.
As pointed out in Section 2, ape gestures are generally understood to be flexible, learned, and volitionally produced, unlike ape vocalizations, which are often taken to be instinctive, species-specific, and to involve little or no learning. Hence, supporters of gestural theories conclude that there is continuity between ape gesture and human communicative behaviors, and discontinuity between ape vocalizations and human speech. This argument has been backed up by ethological research documenting both the flexibility of ape gestures and the largely inflexible character of ape vocalizations, such as chimpanzee food cries (e.g. Corballis, 2002; Deacon, 1997; Hewes et al., 1973; Scherer, Johnstone, & Klasmeyer, 2003; but see Fröhlich et al., 2019). However, more recent ethological data have complicated this picture, as many primate calls demonstrate “audience effects,” whereby the intensity and rate of calling is regulated by situational context. For example, the manner of producing alarm calls depends on whether somebody is present or not, including the presence of specific recipients (Crockford, Wittig, Mundry, & Zuberbühler, 2012; Crockford, Wittig, & Zuberbühler, 2017). Food calls differ when the quantity of food is large or small (Brosnan & de Waal, 2002), and are “associated with audience checking, gaze alternation and goal persistence” (Fröhlich et al., 2019, p. 7). Primate vocalizations have also been shown to involve “functional reference,” productivity (though devoid of compositionality), and tactical deception (Slocombe, 2011).
Finally, there is growing evidence that naturally living apes combine gestures with vocalizations, touch, and other haptic behaviors (Fröhlich et al., 2019).
However, most researchers still accept that there is a qualitative difference between ape gestural and vocal behaviors, although it may not be as categorical as once assumed (Bard, Bakeman, Boysen, & Leavens, 2014; Hobaiter & Byrne, 2014; Pika, 2008b). The new research does not disqualify monosemiotic gestural theories, but it lends stronger support to polysemiotic approaches, which posit that vocalization may have played a non-negligible role in the evolutionary emergence of language.
Arguments about the expressive potential of gesture concentrate on differences between human gestures and the gestural behaviors of non-human apes, the most important being the ability of human gestures to denote objects, actions, and relations between them. This bears directly on the triadic nature of human gestures, in contrast to the dyadic design of ape gestures (see Section 2). There are two main categories of gestures that demonstrate human-specificity in this regard: pointing and iconic gestures. A number of proponents of gestural theories have considered the emergence of pointing an important stepping-stone toward language. Tomasello’s extensive work on pointing demonstrates that it represents an important watershed in the evolution of human cognition and communication: Non-human primates do not point to distal entities in their environments (Tomasello, 2000, p. 358), while human infants as early as the twelfth month of life perform spontaneous, informative pointing aimed at sharing attention with another person (Liszkowski, Carpenter, Henning, Striano, & Tomasello, 2004). Tomasello holds that this difference results from the lack of cooperative motivations in non-human apes, but of equal importance is the semiotic quality of pointing: Like other human gestures, pointing is triadic in the sense that the communicator intends to bring the attention of the addressee to a relevant object, and intends for the addressee to recognize this, rather than just to look in a given direction (Zlatev, 2008).
The other watershed in the evolution of language highlighted in gestural theories is that of iconic gestures. While there is very little evidence that iconic gesture is present in the gestural repertoires of non-human apes (Tanner & Perlman, 2017), it appears in different forms of human gestural communication, such as co-speech gestures (McNeill, 1992), the pantomiming of events (Zlatev, Wacewicz, Żywiczyński, & van de Weijer, 2017), and even signed languages (Klima & Bellugi, 1979). Arguments in gestural theories that appeal to the iconic-expressive potential of gesture have been corroborated by experimental-semiotic research, which investigates how people develop novel communication systems in the absence of a shared code (Galantucci, 2009). Studies comparing spontaneous gestures and vocalizations conclusively show that gesture is a much better basis for developing a sign-based communication system than vocalization (e.g. Fay, Arbib, & Garrod, 2013; Fay, Lister, Ellison, & Goldin-Meadow, 2014). Interestingly, such studies show that the combined use of spontaneous gesture and non-linguistic vocalization is not significantly better than the use of gesture alone (Zlatev et al., 2017). Although there are many studies showing that vocalization does have iconic potential (e.g. Ahlner & Zlatev, 2010; Blasi, Wichmann, Hammarström, Stadler, & Christiansen, 2016; Imai & Kita, 2014; Lockwood & Dingemanse, 2015; Perlman, 2017; Perlman & Cain, 2014), the conclusion that iconic gesture is superior to vocalization for communication based on signs, in the semiotic sense of the term, is now well established in the literature (cf. Żywiczyński, 2020). This conclusion has gained further support from research on emerging signed languages (Lepic, Börstell, Belsitzman, & Sandler, 2016; Meir et al., 2010a; Meir, Aronoff, Sandler, & Padden, 2010b; Sandler, Meir, Padden, & Aronoff, 2005; Stamp & Sandler, 2016). Such findings lend support to gestural origin theories in general, and are also consistent with the polysemiotic accounts discussed in Sections 4 and 5.
A third kind of evidence used in monosemiotic gestural theories derives from neuroscience. The most elaborate account of this kind is Arbib’s Mirror Neuron Hypothesis. To account for the transition from “protosign” to “protospeech,” Arbib proposes a mechanism of collateralization, whereby the activity of the area responsible for manual production gradually spilled over into the neighboring areas responsible for vocalization. In this way, the brain came to support protospeech through the invasion of the vocal apparatus by collaterals from the protosign system (Rizzolatti & Arbib, 1998; cf. Arbib, 2005, 2006). This collateralization hypothesis is supposed to explain not only the substantial degree of coupling between gestural and orofacial behaviors, but also the segregation between them, manifested in dissociations between limb apraxia, speech apraxia, and aphasia (Arbib, 2006; Vingerhoets et al., 2013). Arbib’s account is interesting, but it only suggests a neural mechanism that could have brought about the transition, while failing to identify any evolutionary pressure responsible for it (Fitch, 2010).
The issue of hand–mouth neural links is one of the standard arguments used in gestural hypotheses. It rests on the assumption, partly corroborated by neurocognitive research, that hand and orofacial movements are governed by the same, phylogenetically old brain circuits (Reference Żywiczyński and WacewiczŻywiczyński & Wacewicz, 2019). Such links appear to be rooted in mouth feeding behaviors (Reference Gentilucci and CorballisGentilucci & Corballis, 2006), and can explain motoric phenomena like differences in mouth opening depending on whether we hold a small or large object when speaking (Reference Gentilucci and CorballisGentilucci & Corballis, 2006) or the activity of articulators resulting from specific manual movements (Reference Higginbotham, Isaak and DomingueHigginbotham, Isaak, & Domingue, 2008). All this evidence can be seen as supporting gestural theories, but it is by no means limited to the monosemiotic type, and arguably provides even stronger support for the polysemiotic theories discussed in the following sections.
An inevitable feature of monosemiotic gestural theories is “the transition problem” (cf. Reference Żywiczyński and WacewiczŻywiczyński & Wacewicz, 2019): If language emerged as a gestural phenomenon, why is it that all modern languages are predominantly vocal, with the exception of signed languages? This problem is identified as the key challenge to monosemiotic gestural theories both by its proponents (Reference ArbibArbib, 2005; Reference CorballisCorballis, 2002; Reference Hewes, Andrew, Carini, Hackeny, Gardner, Kortland and WescottHewes et al., 1973) and critics (Reference BurlingBurling, 2005; Reference FitchFitch, 2010; Reference MacNeilageMacNeilage, 2008). The only viable strategy to tackle this problem is to point to potential selection pressures facilitating the development of vocal communication despite its original gestural basis. There are a variety of candidates for such selection pressures, the best-known of which are the following:Footnote 6
Speech enables communication in poor visibility or in the dark (Reference RousseauRousseau, 1755/2008).
The voice attracts attention more effectively (Reference RousseauRousseau, 1755/2008).
Speech does not engage the hands, thereby allowing their use in practical tasks, such as work or carrying objects during communication (e.g. Reference Carstairs-McCarthyCarstairs-McCarthy, 1996).
Speech allows one to teach manual activities, such as tool making (Reference Armstrong and WilcoxArmstrong & Wilcox, 2007).
The acquisition of speech begins in foetal life, which grants it a developmental advantage (Reference Hewes, Lock and PetersHewes, 1996).
Speech is more economical, as articulatory movements require less time and energy than hand, arm, and body movements (e.g. Reference Knight, Knight, Studdert-Kennedy and Hurford (Eds.)Knight, 2000).
Vocal communication facilitates continuous monitoring of the location of a child, which might have been important in hominins due to their hunter-gatherer lifestyle and the lack of constant physical contact between mother and child, in contrast to the great apes (Reference FalkFalk, 2009).
The voice is directed at everyone rather than only at a specific individual (Reference TomaselloTomasello, 2008).
While these proposals are suggestive and deserve more research, it is generally accepted that they can neither individually nor jointly resolve the transition problem for monosemiotic gestural theories. In general, they point to one or another deficit of the visual channel, and could thus equally be used as arguments for “speech-first” theories. Reference FitchFitch (2010) criticizes the majority of the arguments listed above, as it is easy to find a counter-argument in each case. Gestures may not be visible in the dark, but they are visible by firelight, and can be used in the tactile modality, as done by visually impaired signers. Further, the visual channel gains the advantage in long-distance or noisy communication, and successfully attracts attention in these situations. Although speech frees the hands and arms, gestures free the mouth, which was very significant in the Palaeolithic period, given that fossil data show that hominins intensively used their teeth to chew hard foods and perform various mechanical operations. Importantly, the argument concerning the energetic effectiveness of speech is convincing only to the extent that speech and gesture are independent of one another, as in the pantomimic theories discussed in Section 5, but not according to the equipollent theories (Section 4), since if speech is (almost) necessarily accompanied by gesticulation, this way of communicating would be at least equally, if not more, costly than gestures alone.
Further issues can be raised against theories that trace the beginnings of language exclusively to the sign system of gesture. Regarding Reference Armstrong and WilcoxArmstrong and Wilcox’s point (2007, see above), in teaching manual activities verbal instructions are less effective than demonstration or physical guidance of the learner’s hands. Problematic for the suggestion of Reference Hewes, Lock and PetersHewes (1996) are developmental data showing that spoken and signed languages are acquired at equal paces. Reference FalkFalk’s (2009) idea of vocal contact between mother and child does not require speech but merely the emission of sound. Reference TomaselloTomasello’s (2008) point is compelling as far as open information sharing is concerned, but gesture allows a more accurate choice of addressee, which is important in less cooperative contexts and, further, is less at risk of being discovered by enemies and predators (Reference Wacewicz, Żywiczyński, Smith, Smith and Ferrer-i-CanchoWacewicz & Żywiczyński, 2008).
There have been attempts to address the transition problem not by indicating adaptive pressures that could have affected the changeover from gesture to speech, but by highlighting the interaction between the two semiotic systems. Reference Goldin-Meadow, Smith, Smith and Ferrer-i-CanchoGoldin-Meadow (2008) points out that in modern face-to-face communication, (a) combinatorial-segmented information is usually transferred by speech, while (b) holistic-imagistic information is usually transferred by gesture. Further, while gestures can become lexicalized and grammaticalized to communicate (a), as in signed languages, speech is less predisposed to perform (b). This proposal can also be formulated along the following lines (Reference BrownBrown, 2012): Speech is less capable of iconic representation, for which there is solid evidence, as pointed out above, and this may have been the reason for the transfer from the hypothetical gestural protolanguage toward speech when the need for a larger vocabulary and more combinatorial structure arose. This argument is, in our view, the most promising for explaining the transition, but it should be noted that it presumes that at least some vocalization was already present for the shift from gesture to speech to commence. Hence, it is more properly seen as an argument in favor of pantomimic theories (see Section 5).
To sum up, there seems to be no convincing solution to “the transition problem” for purely monosemiotic gestural theories, given that the proposed solutions face counter-arguments or are underpowered in terms of evolutionary logic. These difficulties have contributed to a growing popularity of polysemiotic theories, which we review in the following sections. Two different kinds of these can be distinguished, which derive from different research backgrounds. Equipollent theories are primarily affiliated with modern gesture studies (e.g. Reference KendonKendon, 2004; Reference McNeillMcNeill, 1992, Reference McNeill2005), while pantomimic theories derive from mimesis theory (Reference DonaldDonald, 1991, Reference Donald2001; Reference ZlatevZlatev, 2008). The two positions share the assumption that gesture was not the only semiotic system at the origin of human language, but there are important differences between them regarding the role of vocalization, the evolutionary trajectory to modern-day human communication, and the end point of language evolution. Hence, we present and discuss them in two separate sections.
4 Equipollent Polysemiotic Theories
4.1 General Features
The defining property of equipollent origin theories is the postulate of an early integration of gesture and vocalization, with the foundational assumption that gesture and speech form two sides of a single cognitive-communicative system (Reference McNeillMcNeill, 2005) or process (Reference Kendon, Dor, Knight and LewisKendon, 2014). It should be noted that, while representative, the views of these scholars are not coextensive with those of gesture studies as a field. Even early research often viewed gesture as a functionally versatile category, including Reference EfronEfron’s (1941) study of “illustrators,” “batons,” and “ideographs,” and Reference Ekman and FriesenEkman and Friesen’s (1969) study of “regulators.” Further, there was much interest in pantomime (e.g. Reference KendonKendon, 1992; Reference Laudanna and VolterraLaudanna & Volterra, 1991), emblematic gestures and their cultural variability (e.g. Reference KendonKendon 1995, Reference Kendon2004; Reference Poggi, Zomparelli and PoggiPoggi & Zomparelli, 1987), and adaptive movements reflecting bodily needs, psychological stress, and arousal (e.g. Reference Dittman, Siegman and PopeDittman, 1972; Reference Ekman and FriesenEkman & Friesen, 1969; Reference Freedman, Seigman and PopeFreedman, 1972; Reference WaxerWaxer, 1977). It is only more recently that attention seems to have shifted specifically to co-speech gestures: gestures that are temporally and semantically integrated with speech (Reference KendonKendon, 2004; Reference McNeillMcNeill, 1992). It is from this more recent approach within gesture studies that the most prominent equipollent language-origin theories derive.
4.2 McNeill’s Growth Point
Probably the best known of this class of theories is the scenario proposed by McNeill, which zooms in on the key notion in his account of gesture, the Growth Point (Reference McNeillMcNeill, 1992, Reference McNeill2005, this volume). In McNeill’s model, speech and gesture are coexpressive, but at the same time semiotically distinct, and responsible for the transmission of different aspects of the message: speech for propositional content and gestures for imagistic content. According to McNeill, the semantically most prominent element of the utterance comes at the stroke (i.e. the most pronounced phase) of a gesture. In this way, the Growth Point, the basic unit of thinking, becomes externalized.
This idea is also central to McNeill’s theory of language evolution, the critical moment of which is the integration of gestural and vocal communication, both at the level of cognition and expression (Reference McNeillMcNeill, 2012). The claim is that language originated from the coming together of vocalization and gesture to form a propositional-imagistic dialectic. The critical element of this process was what he calls “twisting” of mirror neurons, whereby they began “to respond to one’s own gestures, as if they were from someone else” (Reference McNeillMcNeill, 2012, p. 65). To support this idea, McNeill paraphrases Reference Mead and MorrisMead (1934/1974): “[…] a gesture is a meaningful symbol to the extent that it arouses in the one making it the same response it arouses in someone witnessing it” (Reference McNeillMcNeill, 2012, p. 180). As this gestural system was coorchestrated with vocalization, the Growth Point emerged.
It should be noted that McNeill does not provide any evolutionarily realistic pressures that could have been responsible for any of these changes. In fact, he suggests two conflicting accounts of how speech started, deriving it either (a) from movements “originally for ingestion, [which] could be orchestrated in new ways, by gesture imagery” (Reference McNeill2012, p. 65), or (b) from the type of polysemiotic communication that is found in extant non-human apes, such as “chimp gestures with vocalization” (Reference McNeill2012, p. 195). Although McNeill refers to the “twisting” of mirror neurons and the voice-gesture integration as adaptations, he actually describes them as saltational leaps (Reference GoldschmidtGoldschmidt, 1982),Footnote 7 not unlike Chomsky’s idea of a lucky mutation giving rise to the operation of Merge, which first endowed our ancestors with a language of thought and then with the communicative use of it (Reference Berwick and ChomskyBerwick & Chomsky, 2015).
4.3 Kendon’s Languaging
Kendon also proposes that the emergence of language crucially depended on the integration of speech and gesture but opts for a more gradual and evolutionarily realistic scenario. First, his notion of a “speech-kinesic ensemble” is less categorical about both the definition of gesture (see Section 2) and the functional interplay between speech and gesture (Reference Kendon, Dor, Knight and Lewis2014). His proposal is formulated in terms of a “dynamic orchestration of communicative action,” which extends far beyond the category of co-speech gestures and embraces any deliberately communicative bodily movements (hence, the use of the term “kinesic”), including postural shifts, eye contact, or facial expressions (Reference KendonKendon, 2004, Reference Kendon2011). Likewise, “speech” is not confined to a purely linguistic capacity responsible for the transmission of propositional information, but extends to vocal means of expressing emotional-imagistic content, as in the case of paralinguistic features (e.g. emotional prosody) or iconic vocal phenomena, as in ideophones, phonesthemes, reduplication or word lengthening (Reference KendonKendon, 2008). Languaging, in line with his notion of utterance, is a dynamic process (Reference KendonKendon, 2004) that involves a tight integration between speech and gestures to the effect that “words and gestures labor together to produce virtual objects that serve as conceptual expressions” (Reference Kendon, Dor, Knight and LewisKendon, 2014, p. 168; see also Kendon, this volume). Unlike McNeill, who postulates a strict functional division between these two semiotic systems, Kendon submits that the interaction between them is dynamic and flexible, with one or the other being dominant depending on the social or environmental context, including factors such as the level of noise (Reference Kendon, Tannen and Saville TroikeKendon, 1985).
Applying this framework to language origins, Kendon posits that the beginning of the human-specific communicative system of languaging was marked by the coming together of speech and gesture (Reference KendonKendon, 2004). This is a gradualistic scenario that identifies multifarious factors both for the emergence of speech and gesture, and for their conjunction. On the one hand, Kendon highlights the importance of “communicative action” for the emergence of language, whereby vocal behaviors and gestures acquired representational functions. As the hominin ecology favored close-distance face-to-face interaction, they came to be used jointly and in the course of time merged into one process of meaning-making (Reference Kendon, Dor, Knight and LewisKendon, 2014). On the other hand, he points to “the original praxic nature of language” (Reference KendonKendon, 2017, p. 165), when he speculates that many gestures derive from object-handling actions.
However, Kendon’s evolutionary theory does not spell out a clear solution to the problem of why speech is the dominant system in the ensemble of languaging, the analog to the transition problem for monosemiotic theories. Rather, Kendon appeals to various arguments, starting from more physiologically oriented ones, such as hypotheses concerning orofacial movements as an evolutionary “bridge” between manual gesture and speech, or the neural links between hand and mouth, appealing to Arbib’s Mirror Neuron Hypothesis (see Section 3.5), to more semiotic ones, such as the role of various forms of sound symbolism in bootstrapping vocal signs.
4.4 Evaluation
Equipollent theories stress the polysemiotic, or more specifically bisemiotic, nature of modern language as well as its evolution. They argue for tight integration between the semiotic systems of speech and gesture, to the effect that they form one overarching system, such as McNeill’s “imagery-language dialectic” or Kendon’s version of the notion of “languaging.” The postulate of gesture-speech equipollence in modern human communication is accompanied by what McNeill refers to as the equiprimordiality of gesture and speech, whereby language is thought to have begun with the integration of vocalization and gesture, which jointly assumed representational and communicative functions. In this regard, McNeill posits a saltationist scenario of the sudden emergence of Growth Point, while Kendon argues for a long-drawn and multicausal process of voice and gesture integration. This latter approach is similar to views articulated by Reference Goldin-Meadow, Smith, Smith and Ferrer-i-CanchoGoldin-Meadow (2008) and Reference SandlerSandler (2013), whose theories may also be regarded as equipollently polysemiotic.
Such theories effectively obviate “the transition problem” that burdens monosemiotic gestural theories (cf. Reference KendonKendon, 2011), which is an advantage. However, they struggle to explain the adaptive pressures that would have brought about a strong form of gesture-speech integration, as well as the dominant role of speech in hearing populations. As they take language to derive from semiotic systems that became integrated in the course of hominin evolution, both McNeill and Kendon reject the possibility that language and gesture may have had independent evolutionary trajectories. Their evolutionary theories can be seen as growing out of the strong uniformitarianFootnote 8 conviction that, as Kendon puts it, the way language is today “it must have been in its beginnings” (Reference Kendon2011, p. 103). But it is hardly uncontroversial to claim that language (or languaging) necessarily involves the integrated use of the vocal and gestural channels (e.g. Reference Vigliocco, Perniss and VinsonVigliocco, Perniss, & Vinson, 2014).
There are in fact stronger and weaker positions on gesture-speech integration among equipollent theories. McNeill puts forward a more extreme version, whereby speech is impossible without gesture: “the core is gesture and speech together. […] They are united as a matter of thought itself. Even if for some reason a gesture is not externalized (social inappropriateness, physical difficulty, etc.) the imagery it embodies can still be present, hidden but part of the speech process” (Reference McNeillMcNeill, 2012, p. 19). A weaker version of the thesis, more consonant with Kendon’s position, has been elaborated by Kita, who argues that speech and gesture are distinct psychological processes that interact (e.g. Reference Kita and ÖzyürekKita & Özyürek, 2003, p. 30). According to such an account, both speech and gesture are governed by separate but tightly interrelated production mechanisms: an “action generator” responsible for the production of gesture and a “message generator” responsible for speech (Reference Kita and McNeillKita, 2000).
There are other issues with this type of theory as well. On the one hand, proponents of equipollent theories argue for the division of labor between the two parts of the integrated system; on the other hand, they obliterate the distinction between them, downplaying, for instance, the fact that speech performs the dominant role in the transfer of referential information in the “speech-kinesic ensemble.”Footnote 9 Equipollent theories also disregard the fact that the compositional nature of language can manifest itself not just in speech but also in various other subsystems such as writing involving the visual modality, the tactile modality (e.g. Braille), or haptic modality (e.g. Tadoma, the tactile lipreading of the deaf-blind). The growing amount of signed-language literature, particularly on emerging signed languages, has sometimes been used to support the equipollent position (e.g. Reference SandlerSandler, 2013), but, as Kendon rightly points out, it actually goes against it, by challenging the idea of a watertight integration between speech and gesture (Reference KendonKendon, 2009).
A final problem with equipollent gesture-speech theories is an exaggerated focus on co-speech gesture (see Section 2). McNeill’s “gesture continuum” (Reference McNeillMcNeill, 1992, Reference McNeill2005), for example, includes a variety of gestures that are produced in the absence of speech, such as “language slotted gestures,” emblems (produced at least sometimes without speech), pantomime, and the signs of signed languages; however, it zooms in on co-speech gestures (or “gesticulation,” in Kendon’s terms) and describes them as the prototypical form of gesture. Accordingly, much of contemporary gesture studies centers on co-speech gestures (e.g. many of the chapters in the two-volume set Body - Language - Communication, edited by Reference Müller, Cienki, Fricke, Ladewig, McNeill, Teßendorf and BressemMüller et al., 2013–Reference Müller, Müller, Cienki, Fricke, Ladewig, McNeill and Bressem2014) and in this way, explicitly or implicitly, embraces the view that co-speech gesture is gesture par excellence and the ancillary view that speech is language par excellence. Such an attitude is, as we have seen, not unproblematic when it comes to language evolution. Arguably, it also overemphasizes the links between speech and gesture on the developmental level (Reference LevyLevy, 2011) and the neurocognitive level (Reference Demir‐Lira, Asaridou, Raja Beharelle, Holt, Goldin‐Meadow and SmallDemir-Lira et al., 2018) as well as ignores evidence of the dissociation between them, such as the developmental evidence for partial independence between language and gesture (e.g. Reference AndrénAndrén, 2010), or limited impairment of gestures in many forms of aphasia (e.g. Reference Whiteside, Dyson, Cowell and VarleyWhiteside, Dyson, Cowell, & Varley, 2015). Issues such as these have motivated the third general type of theories in our discussion, to which we turn in Section 5.
5 Pantomimic Polysemiotic Theories
5.1 General Features
A number of polysemiotic pantomimic theories claim that the unique features of human communication derive from a general cognitive capacity, whose appearance eventually led to the emergence of language and gesture, but also to other semiotic systems, such as music and dance, ritual, and depiction (forming marks on a two-dimensional surface that resemble three-dimensional referents, as in drawing and painting). The most common candidate for this is mimesis (Reference DonaldDonald, 1991, Reference Donald, Hurford, Studdert-Kennedy and Knight1998, Reference Donald, Tallerman and Gibson2012, Reference Donald, Hatfield and Pittman2013; Reference Zlatev and PetersZlatev, 2019). According to Donald, the original function of mimesis was to facilitate tool production, as it evolved as an adaptation in late australopithecines or early Homo ca. 2 million years ago. Gradually, it was exapted for communication, allowing the use of the body as a representational device, whereby bodily movements could stand for something other than themselves (Reference Zlatev, Persson and GärdenforsZlatev, Persson, & Gärdenfors, 2005). As shown below, the breadth of the notion of mimesis gives rise to a number of related theories of language origins, which share the feature of viewing the evolution of language as an integral part of the evolution of human thought and culture in general, and at the same time focus on (polysemiotic) pantomime as an essential step in the evolutionary process.
5.2 Gärdenfors’ Pedagogy
Gärdenfors links the emergence of human-specific communication to pedagogy (Reference GärdenforsGärdenfors, 2017; Reference Gärdenfors and HögbergGärdenfors & Högberg, 2017). In his scenario, the original form of communication that emerged from mimesis was demonstration, in which a teacher performs an action for the benefit of a student, and which is defined by Reference GärdenforsGärdenfors (2017) as follows:
(D1) The demonstrator actually performs the actions involved in the task.
(D2) The demonstrator makes sure that the learner attends to the series of actions.
(D3) The demonstrator intends that the learner perceives the right actions in the correct sequence.
(D4) The demonstrator exaggerates and slows down some of the actions in order to facilitate for the learner to perceive important features.
Thus, demonstration both resembles praxis and differs from it, since its main goal is not practical per se, but for the student to understand how to perform the actions in question. Hence, demonstration is a true representational sign, and according to the definition of Reference AndrénAndrén (2010) also a gesture (see Section 2). According to Gärdenfors, the minimal “symbolic distance” between the expression and what it represents makes demonstration the likely candidate for being the first form of mimetic communication. The only difference between this and pantomime is feature (D1): Whereas in demonstration the teacher actually performs the actions involved in the task (e.g. striking a stone to produce a stone tool), in pantomime the teacher pretends to perform the actions, making pantomime a form of pretense. This semiotic breakthrough allowed for the eventual emergence of language (Reference GärdenforsGärdenfors, 2017) and ritual (Reference GärdenforsGärdenfors, 2018).
This description indicates that gesture performed the dominant semiotic role in pantomime. However, it does not entail that pantomime was monosemiotic, and, indeed, this is not the case in a popular definition of pantomime in the field of language evolution as: “a non-verbal, mimetic and non-conventionalised means of communication, which is executed primarily in the visual channel by coordinated movements of the whole body, but which may incorporate other semiotic resources, most importantly non-linguistic vocalisations” (Reference Żywiczyński, Wacewicz and SibierskaŻywiczyński, Wacewicz, & Sibierska, 2018, p. 315). In other words, pantomime should be understood as a hybrid, polysemiotic system, combining both signs and signals, and a number of different sensory modalities (Reference Zlatev, Żywiczyński and WacewiczZlatev et al., 2020).
This is also apparently how Reference GärdenforsGärdenfors (2017) understands pantomime, given his frequent mention of physical objects and “vocal gestures.” Further, given that pantomime tends to represent events in an undifferentiated way, Gärdenfors outlines the transition to language in terms of its differentiation, accompanied by lexicalization and grammaticalization. He also assumes that the main referential role shifted to the vocal aspects of pantomime over time, without, however, explaining why.
5.3 Zlatev’s Mimesis Hierarchy
Building on the work of Reference DonaldDonald (e.g. 1991), Reference ZlatevZlatev (e.g. 2008, Reference Zlatev2014, Reference Zlatev, Etzelmüller and Tewes2016) attempted to make the concept of mimesis more explicit, as well as to develop a theory of its transition to language, and more recently to polysemiotic communication as such (Reference Zlatev and PetersZlatev, 2019). In the process, Zlatev has proposed a number of related definitions of the concept of bodily mimesis, one of which is the following:
[…] an act of cognition or communication is an act of bodily mimesis if: (1) it involves a cross-modal mapping between exteroception (e.g. vision) and proprioception (e.g. kinesthesia); (2) it is under conscious control and is perceived by the subject to be similar to some other action, object or event, (3) the subject intends the act to stand for some action, object or event for an addressee, and for the addressee to recognize this intention; (4) it is not fully conventional and normative, and (5) it does not divide (semi)compositionally into meaningful sub-acts that systematically relate to other similar acts, as in grammar.
On this basis, Zlatev proposes an evolutionary (and in part developmental) model known as the Mimesis Hierarchy (Reference ZlatevZlatev, 2008). The rudimentary form of protomimesis, based on requirement (1), is found in activities like emotional and attentional contagion (e.g. contagious laughter), and is common for all primates. The more advanced form of dyadic mimesis (based on 1 and 2) involves volition and imitation, but not true representation or sign-function; it is common for all great apes. Only at the next level (based on 1, 2, and 3), referred to as triadic mimesis, do mimetic acts gain a clear sign-function, as well as Gricean communicative intentions (i.e. that the addressee should understand that a communicative act is being performed for their benefit). This level, also in agreement with Gärdenfors, is claimed to be uniquely human (Reference Zlatev, Persson and GärdenforsZlatev et al., 2005). Further, point (4) distinguishes mimesis from a conventionalized protolanguage and point (5) distinguishes it from language proper.
This provides a convenient conceptual apparatus, but it does not address key questions such as what drove the evolutionary process, or more specific aspects of how the transition from triadic mimesis (i.e. pantomime) to protolanguage and language took place, including the shift from gesture to vocalization in terms of dominance. Reference Zlatev, Etzelmüller and TewesZlatev (2016) addresses these gaps, but in a somewhat schematic manner. With respect to evolutionary pressures, Zlatev appeals to an increase of prosociality in hominins, in the manner of Tomasello (see Section 3.6). As for the ecological pressures behind this, alloparenting (Reference HrdyHrdy, 2009), a reproductive strategy unique to humans among the great apes, is evoked. Concerning the gradual transition to vocalization, this is sought in the nature of pantomime itself: a hybrid system that is polysemiotic (i.e. combines various sign and signal systems) and multimodal (i.e. involves different sensory channels). The dominant semiotic system of pantomime is claimed to have been highly iconic gesture, understood in terms of the following properties (Reference Zlatev, Żywiczyński and WacewiczZlatev et al., 2020):
a) use of primary iconicity (Reference Sonesson, Rauch, Carr and GeraldSonesson, 1997), where the similarity between the gesture and what it represents is largely sufficient for establishing the referent, as opposed to secondary iconicity, where appreciation of this similarity comes only later;
b) use of the whole body, rather than the hands only (cf. Reference Żywiczyński, Wacewicz and SibierskaŻywiczyński et al., 2018);
c) use of a first-person perspective, where the gesturer adopts the perspective of the agent who performs the represented actions (cf. Reference Zlatev, Andrén, Zlatev, Andrén, Johansson-Falck and LundmarkZlatev & Andrén, 2009);
d) use of the enacting “mode of representation,” with the body of the gesturer mapping onto the (human) body of the referent (Reference Müller, Müller, Cienki, Fricke, Ladewig, McNeill and BressemMüller, 2014);
e) use of the peripersonal space, where gestures stand for actions in the space immediately surrounding one’s own body (Reference Brown, Mittermaier, Kher and ArnoldBrown, Mittermaier, Kher, & Arnold, 2019).
The transition towards language consisted in a transition from (a) primary to secondary iconicity, where resemblance is less important than convention, (b) whole-body gestures to manual gestures, (c) first-person to third-person perspective, and related to this (d) “tracing” and “embodying” modes of representation, where the hands represent specific features of the referents in different ways, and can consequently (e) denote objects that are more displaced in space and time. Such a gestural system is clearly much less iconic than that of pantomimic gesture, but nevertheless retains considerable iconicity, exceeding that of vocalization. Thus, Reference Zlatev, Etzelmüller and TewesZlatev (2016) makes use of the arguments presented by Reference BrownBrown (2012, see Section 3.7), to motivate the gradual transition from gesture to vocalization when the need for less iconicity and more “arbitrariness” arose.Footnote 10
But while language (realized as speech, writing, or signing) may be the dominant system in modern human communication when it comes to expressing propositions and narratives, it is very seldom used alone, but rather alongside other semiotic systems such as gesture and depiction (e.g. Green, 2014): polysemiotic communication. An advantage of the mimesis/pantomime approach is that it can help explain this, as pantomime consisted of gesture, vocalizations, and “protodrawing,” when gestures left marks on surfaces such as sand (Zlatev, 2019; Zlatev et al., 2020; Zlatev et al., 2023).
5.4 Similar Accounts
Perhaps less elaborated but similar polysemiotic theories have been suggested by others as well. Levinson (2006), for example, has proposed the Interaction Engine hypothesis, according to which what evolved in our ancestors was a sociocognitive adaptation allowing “joint attention, common ground, collaboration and the reasoning about communicative intent” (Levinson & Holler, 2014, p. 369), which transformed face-to-face interaction. Levinson argues that such communication was polysemiotic in the sense that it incorporated facial expressions, body movements, and affective vocalizations (cf. Fröhlich et al., 2019) but still lacked representational signs. In the next stage, iconic gesture accompanied by simple referential vocalizations emerged, the latter of which gradually assumed the dominant role in the transfer of meaning (Levinson & Holler, 2014).
A similar scenario is proposed by Collins (2014), according to whom the communication system of early Homo consisted of a majority of relatively involuntary, non-representational signals (“indices”) and a smaller inventory of voluntary, representational signs. The latter were initially more or less evenly distributed between the bodily and vocal channels, but gesture had a leading role. With Homo erectus and Acheulean culture,Footnote 11 there was a sharp increase in the proportion of bodily-gestural signs, but for some unspecified reason their importance began to decrease from ca. 1 million years ago, and with the evolution of Homo sapiens, the relative importance of gesture and speech was reversed compared to what it had been at the onset of language evolution.
5.5 Evaluation
Like the equipollent type, pantomimic theories of language origins alleviate the transition problem of the monosemiotic gesture theories by drawing attention to the polysemiotic nature of modern human communication and arguing that the evolutionary beginnings of language must have been similarly polysemiotic. However, the two types disagree about the nature of polysemiotic communication, both in modern communication and in evolution.
First, pantomimic theories adopt a more complex view of human-specific communication, comprising not just speech and gesture but also other semiotic systems like depiction and music. Some of these theories propose that all of these systems developed from the fountainhead of (bodily) mimesis. The most important shared characteristic of these semiotic systems is that they consist of (representational) signs, which distinguishes them from most forms of animal communication, which are based on signals (Zlatev et al., 2020). Modern human communication indeed involves combinations of such semiotic systems and is clearly polysemiotic (Zlatev, 2019; Zlatev et al., 2023). For example, the Bayaka nomads of the Western Congo Basin incorporate whole-body, silent pantomime as well as vocalizations of the hunted game into their hunting narratives (Lewis, 2014), while inhabitants of central Australia speak and gesture as they draw in the sand while narrating (Green, 2014). Considering these traditional societies with their traditional technologies, how much more polysemiotic is modern communication, mediated by all the current (electronic) media that we have at our disposal? The interactions between these systems are flexible and context-specific, not unlike the way Kendon describes the polysemiotic process of languaging (Section 4.3).
Second, proponents of pantomimic theories gain some support from “uniformitarianism” when they argue that the original human-specific system was likewise polysemiotic, though less differentiated than present polysemiosis, with “bodily mimesis,” or an “interaction engine” serving as the springboard. A clear difference from the equipollent polysemiotic theories is the claim that the division of labor between semiotic systems was different at the onset of the evolutionary process, with gesture serving the leading role, and speech playing the dominant role at its end. In this respect, pantomimic theories can use all the arguments from monosemiotic theories (Section 3) without encountering the transition problem to the same extent. However, they do face a version of it, as alluded to repeatedly above: How exactly can one explain the reconfiguration in the division of labor between semiotic systems from “more gesture” to “more speech”?
While we do not believe that a conclusive response to this question has been given, it appears that pantomimic polysemiotic theories, by emphasizing the flexible and context-dependent relation between semiotic systems, are in the best position to answer it compared to the other types. Further, speech and gesture have different intrinsic potentials for iconic representation; contexts in which iconicity was less effective as a communication strategy, and in which a less iconic, more conventional, and systematically structured form of sign use would have been more effective, would have constituted an evolutionary pressure towards speech. One such context, suggested originally by Donald (1991), could have been a culture dependent on relatively complex narratives, or myths, where events are represented not just sequentially but also in counter-iconic orders, reflecting causal and logical relations that require more systematic and conventionalized means of representation.
6 Conclusions
In this chapter we have provided a survey of a number of gestural theories of the origin of language, using a new typology that divides theories first in terms of whether they view “gesture alone” as the starting point of the emergence of language (monosemiotic theories) or gesture in combination with other semiotic systems (polysemiotic theories). The latter were then divided based on whether gesture and speech are considered to play a (more or less) equal role from the start to the present (equipollent) or whether gesture first dominated, but the vocalizations that were there from the start eventually gained dominance (pantomimic).
To sum up, while there is much of value in monosemiotic gestural theories, in particular with regard to the role of iconicity in “bootstrapping” a shared system of signs, they were all shown to have difficulty in explaining the transition from gesture to speech. In tackling this problem, many proponents of such theories tend to overemphasize the role of speech in modern human communication, while also overemphasizing the role of gesture at the beginning of the evolutionary process. Another problematic move is the stress they put on the continuity between ape and human gestures, whereby they find themselves hard-pressed to fully account for the differences between these two phylogenetically distinct forms of gesture.
As for equipollent polysemiotic theories, they are at least to some degree able to account for the role of co-speech gesture in modern human communication but tend to disregard other forms of gestural communication (and, even more so, other semiotic systems). They find it difficult to provide an evolutionarily satisfactory explanation of why gesture and speech differ semiotically (rather than just postulating that they are two sides of a “dialectic”), or indeed of why they are so closely connected.
We concluded with pantomimic polysemiotic theories, which in a way capitalize on the strengths of the others: Like monosemiotic theories, they claim an advantage of gesture at an initial evolutionary stage, and like the equipollent theories (perhaps most of all, that of Adam Kendon) they proclaim the advantages of flexible polysemiosis. As such, they appear to be best able to explain modern communication: not only speech, but language in general; not only co-speech gesture, but a wide variety of gestures; not only language and gesture, but also other semiotic systems, like depiction and music. In several cases, they are capable of proposing plausible evolutionary scenarios, informed by research in ethnography and non-human communication. At the same time, they are able to explain the differences between human-specific and animal communication, including ape gestures. Finally, their notion of polysemiosis alleviates the transition problem of monosemiotic theories. However, it does not eliminate it completely, and pantomimic polysemiotic theories still have to propose a mechanism responsible for reconfiguring the original system of pantomime into modern language and modern polysemiotic communication. While there are some attempts in this direction, this is still very much work in progress; we have indicated some promising ideas, including the role of narrative.
1 Introduction
The purpose of this chapter is to capture the role and use of gesture in first language development and its integration in the child’s multimodal communicative system. It presents an overview of theories and methods that have triggered and facilitated the study of gestures in language development. The chapter is primarily focused on production and on gestures used by neurotypical hearing children who acquire spoken languages.Footnote 1 The main issues are illustrated with detailed analyses of examples extracted from longitudinal data in English and French (Morgenstern & Parisse, 2012).Footnote 2
The human communication system develops in a space of shared meanings in which adults socialize children into language in situated activities; consequently, this overview highlights the crucial role of caregivers in adult–child interactions. We thus first focus on the role of gestures in adults’ communicative input and then follow children’s development into the use of the adult multimodal communicative system. We use the word “multimodal” to refer to a variety of semiotic resources used within the audio-vocal and visual-gestural modalities (such as speech, gesture, gaze, facial expressions). In our perspective, humans have created language by interacting via different types of meaning-making resources which have collectively been sedimented through experience and use into what is called “language.” As speech is the primary mode of communication children are progressively mastering, their use of gesture will be presented before they begin speaking, then as they produce their first words, and finally once speech is mastered. At the end of the developmental process, speech becomes clearly predominant but is both complemented and supplemented by other semiotic resources according to variables such as linguistic context, situation, interlocutor, activity, or discourse genre (Cienki, 2012).
2 Historical Background and Evolution of Theoretical Approaches and Methods
Child language research is one of the first fields in which spontaneous interaction data were systematically collected, initially through diary studies (Ingram, 1989; Morgenstern, 2009) and later through audio and video recordings shared worldwide thanks to the CHILDES project (MacWhinney, 2000). This data-centered method has allowed many researchers to confirm that, in the course of their development, children make their way through successive transitory multimodal systems, each with its own internal coherence (Cohen, 1924). This phenomenon can be observed at all levels of linguistic analysis. The starting point of language-acquisition scholars’ interest in gesture or visible bodily action could be summarized in de Laguna’s famous assertion that “in order to understand what the baby is saying you must see what the baby is doing” (De Laguna, 1927, p. 91). Children’s productions are like evanescent sketches of adult language and can only be analyzed in their interactional context, taking into account the interlocutors’ interpretations and reactions, shared knowledge, actions, gestures, facial expressions, postures, and head movements, as well as the words produced by the children (Morgenstern & Parisse, 2007; Parisse & Morgenstern, 2010).
Children’s language development has long been described by starting with the first vocalizations and the maturation of phonic abilities, without paying much attention to motor skills, actions, gaze, and gestures. However, the first diaries of observers of child language already contained dazzling insights about the multisensory qualities (through hearing, sight, touch, and sometimes taste and smell) and forms of expression (speech, gesture, gaze, facial expressions) that make language acquisition a multimodal process. During the second half of the nineteenth century, child language development was studied through researchers’ observations of their own children. The detailed follow-ups on children’s language production, anchored in their daily lives, were a source of fascinating links between motor and psychological development, cognition, affectivity, and language. The “founding fathers” of the study of child development and language had great intuitions about gestures and their relation to language. In his notes on his son’s development, Darwin (1877) highlighted the importance of observing the transition from uncontrolled body movements to intentional gestures. Darwin (1872) was mainly interested in the expressiveness of the baby, and his diary entries therefore focused on the expression of emotions. He first emphasized the different functions of intonation. In addition, according to him, certain habitual movements become automatic and are associated with communicative functions. This can be illustrated by the first bodily manifestations of negation (gestures of avoidance or rejection, which consist of turning the head or the body away, or pushing things away with the hand). Darwin’s insights have been confirmed by more contemporary research.
Mimetic patterns of imitable actions, shared representations of objects that can be manipulated, anchor the acquisition of the child’s first gestures (Zlatev, Persson, & Gärdenfors, 2005). There is now evidence from brain imaging studies that the use of language involves motor representations concerning more than just the movement of the vocal apparatus (Arbib, 2012). It is through subtle shaping of daily actions and practices with objects in the environment that manual-gestural communication in social interactions leads the child to adopt conventional forms of symbolic gestural and verbal language. Romanes (1889) also provided interesting ideas on gesture in his own diary study of his child. He compared human and animal gestures and mentioned the gestural language of the deaf as a sign of the universality of symbolic gestures.
Despite those early observations, gesture was not in the foreground of studies on child language during the first half of the twentieth century. However, thanks in particular to the work of Bruner (1975, 1983), actions and gestures were gradually considered by researchers who analyzed early development. They saw gesture as a system of communication that precedes the verbal system and then becomes complementary to it. According to Werner and Kaplan (1963, p. 66): “Linguistic representation emerges from, and is rooted in, non-linguistic forms of representation.” For Bates (1976), children’s first gestures have properties that, in her time, linguists specifically attributed to speech. Thirteen-month-old children (Bates, Bretherton, & Snyder, 1988) produce manual gestures that are considered the equivalent of nouns (e.g. brushing their hair when seeing a brush is equivalent to saying “brush”).
Research in language acquisition has now developed tools, methods, and theoretical approaches to analyze children’s situated multimodal productions, as they provide evidence for links between motor and psychological development, cognition, affectivity, and language. Those links can only be established by conducting both quantitative and qualitative analyses, in both natural settings and experiments.
As of the end of the twentieth century, thanks to video data linked to transcripts with specialized software (CLAN, ELAN, PHON),Footnote 3 detailed coding and analyses of multimodality have been possible and have opened whole new fields of research. It is especially the case for researchers who study language with a usage-based perspective, in its natural habitat, that is daily discourse: “the prototypical kind of language usage, the form in which we are all first exposed to language – the matrix for language acquisition” (Levinson, 1983, p. 284). We are now able to document in detail how the visual and vocal modalities come together in daily interaction and progressively shape children’s language. Video-recording tools have notably advanced the detailed analysis of the organization of human action and interaction (Mondada, 2019). These tools have shaped new avenues of research on language in interaction, as it is deployed in multiple ecologies, both in time (the moment-to-moment unfurling of an interaction) and over time (multiple recordings over several years of the same children in their family environment). Sacks, as he was grounding what became the Conversation Analysis framework, encouraged the use of video-recordings (Sacks, 1984, 1992) so as to capture, analyze, and share sequences that unveil the structure of everyday practices.
Building on Conversation Analysis, an integrative, multimodal approach to language was developed thanks to contributions from Goodwin (1986, 2013) and Levinson (2006). The recognition of sign languages also played an important role in considering the gestural dimension. An abundance of work on gesture and on the complementary role of semiotic forms is emerging and enriches the research already carried out on the development of children’s language. Specialists in linguistic anthropology (Haviland, 1998), confronted with a multiplicity of cultures, many of them with purely oral traditions, have helped linguists become aware that the formal apparatus constructed to describe languages is largely based on written texts (Linell, 2005), whereas the most common uses of language are in face-to-face interactions (Goffman, 1963). Semioticians (Kress, 2010) also insist on the importance of taking into account the different simultaneous channels (auditory, visual, tactile) with which we conceptualize the world around us and express ourselves.
Research on spontaneous data (Morgenstern, 2014, 2019) is characterized by the researchers’ attempts to capture how children become competent interlocutors as they learn to deploy the various semiotic resources at their disposal in relevant ways, for example, by coadapting to their conversational partners in various situations and environments. Children’s use of speech and gesture differs from that of adults and dynamically changes over time. Children’s multimodal communicative expression is thus analyzed in longitudinal data with an ethnographic approach in line with Kendon’s call for studies of use in context inspired by David Efron (1941/1972) and Wilhelm Wundt (1921/1973). Children’s communicative profiles are shaped by their local environment (family home) and their microcultural norms. Through children’s everyday interactions in their ecological circumstances, we can understand both how language is “experienced” (Ochs, 2012) and how experience is “languaged” in situated activities. Multimodal approaches include some combination of verbal content and the accompanying prosody, facial expressions, posture, gesture, as well as gaze, according to a dynamic deployment of the “scope of relevant behaviors” (Cienki, 2012, 2017), ideally taking into account multiple factors such as age, context, and affordances of the situation or interlocutors. When children’s gestures are analyzed in interaction, we try to capture what the modality of gesture affords its child users as a means of communication among the other semiotic resources they have at their disposal and as their motor, cognitive, and social skills evolve in time.
Video-recording has also had an invaluable impact on experimental methods focused on gesture development (Congdon, Novack, & Goldin-Meadow, 2018). Gesture had often been overlooked in standard psychological observation and research until experimenters started using cameras and making detailed annotations of children’s comprehension and production of gestures during the tasks they were being asked to do. Thanks to a wealth of studies, “gesture research has fundamentally changed the way psychologists think about language, learning, and reasoning” (Congdon et al., 2018, p. 497). With video-recorded data, time is suspended. Researchers can not only relive/replay the experimental sessions as many times as they wish, but also change the annotation schemes, enrich them, and revisit them; and when the data are shared, the studies can be replicated more faithfully.
In both naturalistic and experimental data, gesture types, patterns, variations, and formal components can be coded, quantified, or analyzed in fine detail. Investigative experiments are often informed by naturalistic field methods and enable researchers to test what occurs in ecological settings. Researchers make an indispensable contribution to determining which hypotheses should be tested. Since young children’s productions in experimental settings might be influenced by a variety of performance-related factors (Airenti, 2015), and since observers have to overcome great challenges when video-recording data in naturalistic or home environments, experimental results and spontaneous speech data must both be collected in order to capture children’s multimodal communication system and its development. New avenues include the use of motion capture (Dodane, Boutet, Didirkova, Ouni, & Morgenstern, 2019) and also involve computational modeling (Kaplan, Oudeyer, & Bergen, 2008). Those possibilities might further extend our knowledge about gesture–speech relations (Abramov et al., 2018) but have not yet been fully deployed for studying the development of multimodal communication.
3 Gesture and Scaffolding in the Adult Input
Children’s language gradually develops into rich multimodal production using the variety of semiotic resources at their disposal, through constant exposure to the adult input. It is thus crucial to account for the multimodal input surrounding children in order to understand how they co-construct their communicative skills thanks to adult scaffolding. Given that parents provide their children with other forms of expression along with speech, we will refer to “child-directed communication” rather than the usual expression “child-directed speech” to underline the crucial function of the child’s multimodal input.
Scholars have wondered for centuries about how children construct meaning. In his reconstruction of his developmental process, Saint Augustine, back in the fifth century, stressed the link between surrounding events, adults’ actions, gestures, and words in his own acquisition of language (Augustine, 1996, I.8). Word learning is facilitated by multimodal cues produced by adults (gazing, pointing, touching, manipulating) (Booth, McGregor, & Rohlfing, 2008). Adults provide redundant sensory information and positively affect infants’ attention. Symbolic gestures, especially pointing (Iverson, Capirci, Longobardi, & Caselli, 1999), have been shown to facilitate children’s comprehension by providing additional cues. The role of pointing and eye gaze in the construction of joint attention has been a key topic of research in child language development (Baldwin, 1995; Tomasello, 2003). As shown in example (1), video (1) (Table 14.1) from our longitudinal data (Morgenstern et al., 2018),Footnote 4 the caretaker’s gestural information reinforces the vocal information. In the examples, “nb” indicates the utterance number and the column “Part.” indicates the relevant participant in the interaction.
Table 14.1 Example (1). Video (1). Ellie, 10 months.Footnote 5 https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%201-Ellie-0–10-finished%20-%20gesture.mp4 (Mother is taking care of her child (CHI) Ellie; Grandmother is filming.)
| nb | Part. | Actions and gestures | Vocal and verbal productions |
|---|---|---|---|
| 1 | MOTHER | Palm down lateral movement with both hands in front of her from center to exterior. Gaze at CHI. | Ellie, you finished? |
| 2 | ELLIE | Moves right arm up and down, left hand is grasping the tablet of her highchair. Gaze at camera. | |
| 3 | MOTHER | Hands in rest position, intent gaze at CHI. | That’s hello, you’re waving |
| 4 | MOTHER | Palm down lateral movement with both hands in front of her from center to exterior. Gaze at CHI. | Are you finished? |
| 5 | ELLIE | Same movement as previous but with a wider range and excitedly. Gesture and prosody of vocal production are synchronized. Big smile at the camera. | Ha ha ha … ah ha ah ! ah! |
| 6 | MOTHER | Palm down lateral movement with both hands in front of her from center to exterior. Gaze at CHI. | Are you finished Ellie? |
| 7 | ELLIE | Very briefly produces the same gesture as her mother, with two arms, smaller range and in the reverse direction, from exterior where her arms were positioned, to center, eyes gazing down. | |
Ellie has finished her meal, and her mother provides both a spoken and a gestural expression of her state (Figure 14.1, turn 1).
Bahrick and Pickens (1994) explain that intersensory redundancy grounds how we detect stimuli that belong together and constitute a unitary event. They show that redundancy across the senses foregrounds core information for infants. Ellie’s mother presumably expects the gesture to help her daughter understand the meaning of her utterance. By linking gesture and speech, the mother grounds the word “finished” and facilitates the relation between sign and referent as she relies on both auditory and visual cues. She actually connects her own production of what Bressem and Müller (2014) call a “recurrent gesture” (palm down lateral, arms sweeping toward the exterior), used to express completion (Ladewig, 2014), to Ellie’s behavior. The child’s movement seems to be addressed to her grandmother, who is filming (the child waves her right arm quite excitedly, vocalizes, and gazes at the camera, turn 2), and it is interpreted as another type of gesture (an emblem) by her mother (“that’s hello,” turn 3). The child’s voice and arm movements are synchronous in turn 5: the gestural strokes and prosodic nuclei are perfectly aligned. Not only is Ellie’s body in harmony with her own voice (turn 5), despite a lack of semantic content, but after two repetitions by the mother of the same multimodal production (turns 4 and 6), Ellie seems to echo her mother’s gesture in a more sketchy performance, with a reverse movement from periphery to center, probably because of the positioning of her arms at the beginning of the gesture excursion (Figure 14.2, turn 7). In this way, Ellie’s body is in resonance with her mother’s.
Figure 14.2 Ellie’s resonance with her mother’s recurrent gesture (turn 7)
Caregivers, in a variety of cultures, synchronize words, actions, and gestures as they show objects, embody actions, or illustrate properties (Zukow-Goldring, 1997). These practices scaffold the referential process by linking words and gestures to objects and events (Rader & Zukow-Goldring, 2010). Studies show that the nature and the frequency of maternal gesture influence the development of children’s communicative repertoires (Rowe & Goldin-Meadow, 2009). Caretakers also use gestures in playful scripts or songs and nursery rhymes, such as “bye bye” (waving hands), “peek-a-boo” (playfully hiding face with hands), “bravo” (clapping hands), and the French “ainsi font font font les petites marionnettes” (a song that is accompanied by hand gestures representing puppets), in which they “teach” their children specific conventional gestures along with words.Footnote 6
Parents spontaneously use gestures in everyday communication. All those gestures derive from the culture in which the children are being raised and have very strong social and symbolic values. Caretakers embody their communicational intent more, and rely on redundant multimodal combinations, when the children are younger. Gestures are used to attract the child’s attention to particular events, objects, or words, to highlight, reinforce, and disambiguate the spoken content, and function as a support system secondary to speech (Kelly, 2011). This could be considered as a particular feature of language addressed to infants and young children. “Gestural motherese” is quite specific in terms of the type of gestures used: It involves more deictic and recurrent (semi-conventional) gestures and fewer representational gestures (Özçalışkan & Goldin-Meadow, 2005).
However, not only do we need to integrate all semiotic resources to capture adult–child communicative behavior, but all actions and manipulations of artifacts could also be included. In real life and real time, humans communicate while they are involved in other activities such as eating, cooking, cleaning, drawing, or digging. It is thus crucial to analyze interactions in the context of what could be called “multiactivity” (Haddington, Keisanen, Mondada, & Nevile, 2014). As we adopt a multimodal, dynamic, and situated approach to language use in interaction, the analysis of both actions and symbolic gestures in their multimodal and interactive context is particularly relevant to apprehend how adults and children distribute meaning. Actions and gestures produced by young children are often subtly integrated into adult–child interaction, interpreted, and reformulated into spoken language by their parents. Indeed, body actions such as children lifting their arms to be picked up, sometimes repeated several times a day, are interpreted and reformulated into spoken language by adults and thus become ritualized requests, just as reduplications such as papapa are interpreted as referring to the child’s father in French and reformulated into the conventional word “papa.” As children progressively gain control over their bodies, their behaviors are more and more often interpreted by those around them as meaningful and are linked to specific affordances and contexts: Bringing the palms of the hands together is interpreted as clapping and takes on a praising function or expresses glee (Aronsson & Morgenstern, 2021); bringing the hand to the mouth or extending the arm toward a piece of fruit is interpreted as a request. Children’s body movements thus take on the status of conventional gestures and become intentional communicative signals (Tomasello, 2008).
Reference Goldin-Meadow, Mylander and FrankGoldin-Meadow, Mylander, and Frank (2007) have shown that when a mother reformulates the meaning of her infant’s gestures into words, those words for referents are more likely to enter the child’s spoken repertoire than words for non-reformulated referents are.
Past studies, our own research (Reference Morgenstern, Inbal, Estigarribia, Tice and KurumadaMorgenstern, 2014, Reference Morgenstern, Mazur-Palandre and Colon2019), and our examples illustrate that as children’s language skills develop, adults tend to give primacy to speech in their own productions but subtly use the visual-gestural modality to supplement and complement the verbal production according to their communicative needs. The use of multimodal child-addressed communication to reinforce verbal productions, along with actions, interpretations of children’s bodily movements into words, and adjustments to the child’s age as well as cognitive and linguistic skills, are crucial features of caretakers’ communication with young children.
4 Children’s Gestures before They Produce Spoken Language
As shown in Section 3, gestures provide a means for infants to enter communication prior to their own production of spoken language. Adults’ use of multimodal utterances seems to enhance children’s early comprehension of what adults are trying to communicate (Reference Pfandler, Lakatos and MiklósiPfandler, Lakatos, & Miklósi, 2013; Reference Wu and Gros-LouisWu & Gros-Louis, 2014). More specifically, young children assign meanings to nouns (Reference Clark and EstigarribiaClark & Estigarribia, 2011), to verbs (Reference Goodrich and Hudson KamGoodrich & Hudson Kam, 2009; Reference Ozçalışkan, Gentner and Goldin-MeadowÖzçaliskan, Gentner, & Goldin-Meadow, 2014), and to prepositions (Reference McGregor, Rohlfing, Bean and MarschnerMcGregor, Rohlfing, Bean, & Marschner, 2009) through the mediation of gesture.
Along with vocal productions, eye-gaze is the first semiotic resource used by children to enter communication. From three months of age, infants increasingly and reliably look in the direction of another person’s attention as signaled by the direction of that person’s gaze (Reference BrunerScaife & Bruner, 1975). From nine months, they follow adults’ gaze along with pointing gestures (Reference Brooks and MeltzoffBrooks & Meltzoff, 2005). Young children then progressively learn to use their gaze to monitor and guide others’ attention.
In early childhood vocal-motor coordination develops with neural maturation, shown in the rhythmic qualities shared by early hand-banging movements and canonical babbling (Reference MasatakaMasataka, 2003). Reference Iverson and FaganIverson and Fagan (2004, p. 1063) suggest “a link between more speech-like vocalizations and manual activity that may be a precursor to coordinated manual movement of the sort involved in adult gestures” (see example [1], video [1] for an illustration).
Meaningful manual actions precede and pave the way for the development of language and share a semantic link with gestures and words (Reference Capirci, Contaldo, Caselli and VolterraCapirci, Contaldo, Caselli, & Volterra, 2005), as we have shown in example 1. Between nine and ten months old, children use ritualized requests like open-close grasping motions or pulling an open hand to obtain something (Reference Bates, Benigni, Bretherton, Camaioni, Volterra and BatesBates, Benigni, Bretherton, Camaioni, & Volterra, 1979). While infants’ first vocalizations imitate the prosodic patterns that are most salient and frequent in their environment and that have pragmatic and affective functions (Reference Esteve-Gibert and PrietoEsteve-Gibert & Prieto, 2013), gesturing, thanks to children’s finer hand-motoric skills, allows them to express more specific semantic functions and thus communicate meanings that they might not be able to express with vocal means (or at least that are not captured in their vocal productions by adults). According to Reference Goldin-MeadowGoldin-Meadow (1999, p. 423), gesture may thus serve as a “way-station on the road to language.” It has also been shown that “late-bloomers” (children who seem to have a delay in their first word production, but catch up by the age of three) can be differentiated from late-talkers thanks to their gestural performance (Reference Capone and McGregorCapone & McGregor, 2004); gesture can compensate for verbal expressive deficits.
Conventional gestures enter the child’s repertoire at around 10–11 months old. The most frequent and usually the first gesture used is pointing accompanied by gaze. It has a variety of functions and uses (Reference Bates, Benigni, Bretherton, Camaioni, Volterra and BatesBates et al., 1979; Reference Liszkowski, Carpenter, Henning, Striano and TomaselloLiszkowski, Carpenter, Henning, Striano, & Tomasello, 2004) and a variety of forms, the most common being index-finger pointing (Reference AndrénAndrén, 2010; Reference BatesBates, 1976; Reference Franco and ButterworthFranco & Butterworth, 1996; Reference Morgenstern, Morgenstern and Goldin-MeadowMorgenstern, 2022). Deictic gesture development indicates how children gradually distance themselves from objects physically, without using touch and manipulation, and enter symbolic communication (Reference Capone and McGregorCapone & McGregor, 2004).
Children’s pointing used in interaction is integrated in the dialogue by adults as (proto)speech acts, and can be treated as requests or declaratives according to context (Reference BatesBates, 1976; Reference MarcosMarcos, 1998). When children can point not only to request an object but also to refer to it, that reveals their ability not only to enter joint attention but also to create a disconnection from the immediate context. Pointing at a location is progressively used to refer to an absent entity (Reference Le GuenLe Guen, 2011) and predicts to some degree the ability to use language.
The headshake is another gesture produced in early childhood in a wide range of cultures. It is taken up by children at first to mark refusal and rejection (Reference Beaupoil-Hourdel, Morgenstern, Boutet, Larrivée and LeeBeaupoil-Hourdel, Morgenstern, & Boutet, 2015; see also Reference GuidettiGuidetti, 2005). In example (2), video (2) (Table 14.2), Ellie is one year and two months old. She shakes her head several times, and the situation illustrates how the conventional gesture can be seen on a continuum with the action of avoidance.
Table 14.2 Example (2). Video (2). Ellie one year and two months. Between action and gesture: headshakes. https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%202-Ellie-1–02%20anticipation-grounded%20in%20routines.mp4
| nb | Part. | Actions and gestures | Vocal productions |
|---|---|---|---|
| 1 | MOTHER | Is preparing to wipe CHI’s mouth with a tissue. | That’s where they’re supposed to go. |
| 2 | ELLIE | Attentively watches her mother (MOT) and starts to shake her head as MOT’s hand approaches her face, but not very vigorously. | |
| 3 | MOTHER | MOT manages to wipe what she wanted from CHI’s face. | |
| 4 | MOTHER | MOT prepares a spoonful of food and lifts it up. | |
| 5 | ELLIE | CHI turns head and gaze away toward her right, lifts both her arms. | Da |
| 6 | MOTHER | Holds the spoon in place. | Are you done? |
| 7 | ELLIE | Gazes at MOT, then gazes down. | |
| 8 | MOTHER | | Are you sure? |
| 9 | GDMOT (grandmother) | Filming | Are you finished Ellie? |
| 10 | ELLIE | Looks at GDMOT, shakes her head. | |
| 11 | GDMOT | Filming | Oh, ooh! All done? Oh dear! |
| 12a | ELLIE | Continues to shake her head with a severe gaze at GDMOT. | |
| 12b | GDMOT | Laughs | |
| 13 | MOTHER | Brings up the spoonful to CHI’s mouth. | |
| 14 | ELLIE | Turns head away. | |
| 15 | MOTHER | Puts back spoon in bowl. Takes bowl and spoon away. | No, OK. |
| 16 | ELLIE | Very quick palm up open hand with left hand then opens arms. | |
| 17 | MOTHER | Wipes highchair tablet with tissue. | Do you want to go out then? |
| (…) | | | |
| 28 | MOTHER | | Are you actually still hungry? |
| 29 | ELLIE | Smacks her lips and gazes smilingly at MOT. | |
| 30 | MOTHER | Takes bowl and tries giving her a spoonful. | Oh you’re already done with breakfast aren’t you? |
| 31 | ELLIE | Turns head away very decisively. | |
From her past experiences, Ellie knows that when her mother prepares a wet tissue and moves it toward her face, she is about to wipe her face.
In turn 2, Ellie is already preparing to evade the wiping, but not too vigorously; maybe past experience has shown her that it is an unavoidable ritual and that, as unpleasant as it is, it is still bearable. Her avoidance is therefore not a complete rejection. However, in turn 5, she is more adamant in her refusal of more food. Her movement allows her to avoid the food physically but also to communicate her negation through a headshake and a vocal production “da,” which could be associated with the word “done” as reformulated in turn 6 in her mother’s recast “are you done?”.
Both her grandmother with verbal questions (turns 9 and 11) and her mother with the attempt at giving her another spoonful (turn 13) check that the meal is really over. This is made clear not only through Ellie’s headshake, which avoids and refuses the food once again (turn 14), but also through her bringing both arms up in what is interpreted by her mother as a request to get out of the highchair (turn 17 “do you want to go out then?”).
Figure 14.3 Ellie’s preparation for avoidance
In the same video, when her mother says the word “hungry” (turn 28), Ellie represents the idea with a playful smacking of her lips, which is an action sometimes performed by family members when saying “hungry” (Figure 14.6, turn 29), even though she has demonstrated previously and confirms at the end of this sequence that she wants breakfast to be over, as formulated by her mother (turns 30 and 31).
Figure 14.5 Request to get out of the chair
Figure 14.6 a, b Smacking lips
What have been called “representational gestures” in the literature are used around the age of 12 months, before the 25-word milestone (Reference Capone and McGregorCapone & McGregor, 2004). A child can manually represent the action of holding a glass and drinking, or use her hand to comb her hair. Reference Goodwyn and AcredoloGoodwyn and Acredolo (1993) argue that a gesture or a word is symbolic if it refers to multiple exemplars (including pictures or in the absence of the referent), if it is produced spontaneously (without following the model of an adult), and if it is not part of a well-rehearsed routine. The status of those gestures, often performed in pretend play, is however very different from that of the coverbal representational gestures used with speech later on.
Within a few months, children’s gestural repertoire is enriched, especially in cultures or families where the input is gesturally varied thanks to both the general multimodal features of face-to-face interaction and to specific use of gestures in infant-directed communication (as shown in Section 2).
Reference Iverson, Capirci and CaselliIverson, Capirci, and Caselli (1994) demonstrate that 16-month-old children show a preference for either words or gestures, but by 20 months, types and tokens of spoken words increase significantly. Reference Butcher, Goldin-Meadow and McNeillButcher and Goldin-Meadow (2000) longitudinally observed three boys and three girls during the transition from one-word to two-word speech. During the first session, most of the subjects (five out of six) produced the majority of their gestures without speech. During the following sessions, there was a decline in the proportion of gestures produced without speech. By the end of the observation period, the children mainly used gesture-speech combinations. This is when gesture-speech integration begins, with speech becoming progressively predominant in hearing populations.
5 Children’s Gestures as They Enter Verbal Language
Gestures and speech develop in hearing children as they manipulate more and more words. The child’s multimodal communication skills emerge in their first cross-modal combinations and multimodal constructions.
Speech and gesture together form an integrated system (Reference McNeillMcNeill, 1992). However, multimodal productions are quite rare at first, or are mostly gestalt communicative acts in which body movements and vocal productions are not fully controlled. Children first use the audio-vocal and visual-gestural modalities together to communicate about the same element. In example (3), video (3) (Table 14.3), when she is one-and-a-half years old, Ellie waves her right arm, shakes her head, and says “no” at the same time, but the headshake is pursued vigorously for a while. The gestural modality is amplified and maybe not as finely controlled as it will be when Ellie gets a little older and uses her semiotic resources with more expertise.
Table 14.3 Example 3. Video 3. Ellie, one year and six months. Refusal. https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%203-Ellie-1–02-Tangerine%20or%20ball.mp4 Ellie is in her highchair. Her bowl of food with chicken and fish is almost finished.
| Nb | Part. | Actions and gestures | Vocal and verbal productions |
|---|---|---|---|
| 1 | AUNT | Placing a piece of chicken on the fork in order to feed it to Ellie | Would you like a big piece Ellie? |
| 2 | ELLIE | Shakes her head and moves right arm with palm lateral toward her center and back. | No, no. |
| 3 | GDMOT | Filming | She’s always liked her fish though. |
| 4 | AUNT | Bent over the plate | Hu |
| 5 | GDMOT | Filming | XXXFootnote 7 Ellie, fish. |
| 6 | ELLIE | Still slowly shaking her head from side to side without interruption. | |
First gestures are tightly connected to first words. Reference Iverson and Goldin-MeadowIverson and Goldin-Meadow (2005) found that pointing precedes and predicts lexical acquisition during the early stages of language learning. The authors have shown that there is an increase in pointing gestures during the period when children’s vocabulary expansion is the largest. At least in Western cultures, parents often respond to children’s pointing by labeling, which in turn helps children integrate those words into their verbal repertoire (Reference Bruner, Sinclair, Jarvelle and LeveltBruner, 1978; Marcos, 2003). Productive use of gesture and speech is linked: Children who produced more meanings in gesture at 14 months showed faster growth in productive vocabulary use (word types) between 14 and 46 months (Reference Rowe, Raudenbush and Goldin-MeadowRowe, Raudenbush, & Goldin-Meadow, 2012).
At that age, children still often use gestures rather than words to express themselves, as Ellie does emphatically in example (4), video 4 (Table 14.4).
Table 14.4 Example 4. Video 4. Ellie, one year and 11 months. Echoing speech with gesture. https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%204-Ellie-1–11-stir.mp4
| nb | Part. | Actions and gestures | Vocal and verbal productions |
|---|---|---|---|
| 1 | GDMOT | Filming | Is that enough sugar Ellie? Is there enough sugar there? |
| 2 | ELLIE | Ellie gazes at GDMOT, then picks up big bowl full of sugar. | |
| 3 | MOTHER | Or do we need to put more in? Ellie, I think we need to put more in here. | |
| 4 | ELLIE | Ellie continues to pick up heavy bowl of sugar and starts pouring. | |
| 5 | MOTHER | Puts her hand on Ellie’s as the child is pouring the sugar and finishes for her. | Stop, stop! |
| 6 | GDMOT | Filming. Laughs loud. | Too late. |
| 7 | ELLIE | Emphatic shoulder shrug. | |
| 8a | MOTHER | Prepares flour and smiles. | She just like gave up. |
| 8b | ELLIE | Laughing. | |
When her grandmother says “too late” (turn 6) after Ellie has poured too much sugar in the bowl to make her cake, Ellie deploys a beautiful shrug, which could be categorized as a recurrent or pragmatic gesture (Reference DebrasDebras, 2017), and which is part of the child’s cultural repertoire of gestures: She lifts her shoulders as high as she can, slightly opening her two arms with a radiant smile on her face (Figure 14.7). Her mother reformulates the same meaning into speech by saying “she just like gave up.”
Figure 14.7 a, b Smiling, lifting shoulders, and opening arms
After that period, use of two simultaneous modalities for two different elements precedes the onset of two-word speech. This might be linked to children’s cognitive ability to produce simultaneous information before they can express the elements sequentially. Therefore, what Reference Levelt, Bellugi and Studdert-KennedyLevelt (1980) calls the “linearization problem” – in speech, we can only order the information successively in a linear format and we need to organize the elements and select what comes first – does not affect children who, just like adults, have multimodal resources to express themselves. By combining gesture with speech synchronously, they can form a predicative structure. In example (5), video (5) (Table 14.5), Ellie is pointing at various elements in the room and saying a color each time. The combination of gesture (pointing at Bob’s trousers) and word (black) in answer to the question “what color are Bob’s trousers?” could be glossed as “the trousers are black,” a multimodal utterance in which the subject, trousers, is designated through the gesture and the predication, “black,” is expressed with a word. Ellie expertly creates a multimodal construction with pointing + word, even though she does not yet appear to have mastered the association between color adjectives and the actual color of objects she points at, as the extract illustrates (she says black but the trousers are actually blue).
Table 14.5 Example 5. Video 5. Ellie, one year and nine months. Pointing + adjective. https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%205-Ellie-1–09-pointing%20and%20word.mp4
| nb | Part. | Actions and gestures | Vocal and verbal productions |
|---|---|---|---|
| 1 | GDMOT | Filming | What color are Bob’s trousers? |
| 2 | BOB | Gazing at Ellie | What color are my trousers? |
| 3 | ELLIE | Walks up to Bob. Points at the trousers with index on Bob’s leg. | Black |
| 4 | BOB | Black or are they blue? | |
| 5 | ELLIE | Moving her arms up. | Blue. |
In another session, at one year and 11 months, Ellie and her grandmother are playing with dolls called Maddie and Susie. The grandmother is taking care of Maddie while Ellie is taking care of Susie. The grandmother takes a bottle and pretends to feed Maddie. But Ellie shouts “Susie bottle!” and simultaneously produces a headshake. The headshake is often called a co-verbal gesture, but at least in this case, the spoken words could just as well be called “co-gestural.” The two modalities, verbal and gestural, are integrated to construct a multimodal utterance which could be interpreted as “no Grandma, the bottle is for Susie and not Maddie.” The predominance of the verbal channel characterizing Ellie’s communicative productions at the end of her second year leads us to subordinate gesture to speech from this age on in our terminology. However, in Ellie’s productions, each modality is used with a specific function, and they are assembled to form a negative assertion. The two modalities do not seem to be organized into a hierarchy; rather, her linguistic community leads the child to favor the verbal channel for practical reasons in her daily communicative practices, in a world in which we speak as we cook, eat, draw, clean, or drive, and can skillfully combine bodily actions and spoken communication.
Thus, around the age of two, speech comes to be used more than gestures to refer to objects (Reference Capirci, Contaldo, Caselli and VolterraCapirci et al., 2005; Reference Iverson, Capirci and CaselliIverson et al., 1994). During their second year, children also start making combinations. They produce many more [gesture+word] and [word+word] combinations than [gesture+gesture] combinations (Reference Capirci, Contaldo, Caselli and VolterraCapirci et al., 2005; Reference Capirci, Iverson, Pizzuto and VolterraCapirci, Iverson, Pizzuto, & Volterra, 1996; Reference Goldin-Meadow, Morford, Volterra and ErtingGoldin-Meadow & Morford, 1990; Reference Stefanini, Bello, Caselli, Iverson and VolterraStefanini, Bello, Caselli, Iverson, & Volterra, 2009). In most cases when gestures are combined, one or both of the gestures are deictic rather than representational or pragmatic (Reference Capirci, Iverson, Pizzuto and VolterraCapirci et al., 1996; Reference Morford and Goldin-MeadowMorford & Goldin-Meadow, 1992; Reference VolterraVolterra, 1981), as in example (4).
Studies have also shown that [gesture+speech] combinations reliably predict the onset of [word+word] combinations (Reference Butcher, Goldin-Meadow and McNeillButcher & Goldin-Meadow, 2000; Reference Capirci, Iverson, Pizzuto and VolterraCapirci et al., 1996; Reference Iverson, Capirci, Volterra and Goldin-MeadowIverson, Capirci, Volterra, & Goldin-Meadow, 2008). Interestingly enough, a majority of all types of gestures (deictic or representational) are coordinated with speech rather than used in isolation during this transition period (Reference AndrénAndrén, 2010). Even though speech and gesture have coevolved (Reference Levinson and HollerLevinson & Holler, 2014) and codevelop throughout infancy, hearing children who are bathed in multimodal input and capture language with all their senses are progressively directed toward the predominance of vocal communication, mediated by adult scaffolding. When children are between 18 and 30 months old, there still seems to be a symbiotic relation between speech and gesture. According to Andrén’s analyses of five Swedish children (Reference Andrén2010), a majority of the gestures observed were coordinated with speech. Gesture seems to play an important role in the productive use of multiword speech, as there is an increase in all types of gestures in association with speech between 24 and 30 months old. But speech becomes more dominant, confirming that it is the “typifying medium par excellence” in hearing dyads (Schutz, 1953, p. 10), as there is a shift to a more productive or more generalized mode of communicating in which multiword utterances become more common than one-word utterances associated with gestures.
6 Children’s Gestures When Verbal Language Is Mastered and Dominant
Once the multimodal communication system has been mastered, gesture and speech work more subtly together in later childhood. As speech develops, gestures become more and more diverse and elaborate, especially in their relation to speech. Representational gestures tend to appear more and more with verbs and adjectives, rather than with nouns (Reference Capone and McGregorCapone & McGregor, 2004) as shown in example (6), video 6, Table 14.6.
Table 14.6 Example 6. Video 6. Ellie, four years and two months. Big. https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%206-Ellie-4_02-this%20big.mp4 Ellie is co-narrating for her grandmother, with her mother’s help, her visit to a zoo.
Intriguingly, the visual modality is also used to represent what cannot be expressed in words: the exact dimension (or supposed dimension) of a baby gorilla and her mother. Figures 14.8 and 14.9 illustrate how Ellie gestures to demonstrate sizes. As she is co-narrating the story with her mother for her grandmother, she is constantly visually checking with her mother (who is filming at the time) that her gesture is “correct.”

Figure 14.8 “I think it was this big”
Figure 14.9 a, b Size readjustment: “this big?”
After her mother and grandmother’s comments (turns 5b and 6), Ellie adjusts her gesture (Figure 14.9, turn 7).
Then Ellie spreads her arms much wider; the amplitude of the space between her hands is now accompanied by her arms extending in a V shape and her hands bending wide (Figure 14.10, turn 9) to indicate the “huge” (turns 10a and 10b) size of the mother gorilla.
Figure 14.10 “the Mum is this big”
Communication thus remains multimodal in face-to-face interaction throughout the rest of our life span. As representational gestures increase and diversify even more, beats come into use (Reference CollettaColletta, 2004), as well as more metaphorical gestures and abstract deictics.
However, in his data, Reference AndrénAndrén (2010) found that as children’s language skills developed, conventional (recurrent) gestures were more likely to be used with speech than representational gestures. This is in line with Kendon’s observations (2008) on speakers with a rich repertoire of conventionalized gestures which are “fully integrated into the flow of everyday discourse” (p. 360). Andrén also found in his spontaneously recorded family interactions that the complexity in speech was symmetrically related to the complexity in gesture.
Studies have shown that, from the age of five, children’s motoric skills become more controlled, and gesture-speech integration thus becomes more subtle and complex (Reference Alibali, Kita and YoungAlibali, Kita, & Young, 2000). Children’s gestures and speech become more closely aligned and complementary in terms of semantics, syntax, and pragmatics as they produce more complex multimodal utterances (Reference Morgenstern, Inbal, Estigarribia, Tice and KurumadaMorgenstern, 2014, Reference Morgenstern, Mazur-Palandre and Colon2019). Reference CollettaColletta (2004) also showed that multimodal story-telling skills (linguistic, prosodic, and gestural skills in narration) develop together and simultaneously. Once they have mastered the complexity of speech, children can still resort to multiple semiotic resources, and to conventionalized arbitrary as well as non-arbitrary mappings grounded in their sensory and affective experiences, to fully express the various facets of their inner thoughts, desires, and feelings, and to produce comments, explanations, or narratives. Reference Colletta, Pellenq, Nippold and ScottColletta and Pellenq (2009) conducted multimodal analyses of explanations produced by French children aged three to 11 years. The authors found an increase in all observed measures: duration, number of syllables, number of clauses, and use of connectives, as well as use of co-speech gestures. Reference CollettaColletta (2009) showed that children aged nine years and over relied more on gesture and gaze resources and delivered truly embedded narratives, acting as narrators rather than only recounting facts and events they witnessed. But multimodal language production is found to be closely tied to context and genre. Reference Alamillo, Colletta and GuidettiAlamillo, Colletta, and Guidetti (2013) compared explanations and narratives produced by the same group of children.
The task had effects on the use of both language and gesture: Cohesion markers were more often used in narratives, while gestures and subordinate markers were more frequent in explanations. Reference ÖzyürekÖzyürek (2014) also showed how the different multimodal layers of integrated processing depend on context, pragmatic knowledge, and the communicative intent of the speaker.
Though a rich range of emotional facial expressions is discriminated in early childhood, their production, especially of those expressing stance in interaction, develops at different rates and continues all the way to adolescence (Reference Odom and LemondOdom & Lemond, 1972). More recent work conducted in different regions of France indicates that the use of facial expressions is a complex developmental process influenced by several factors, including age, gender, regional differences, and type of expression (Reference Grossard, Chaby, Hun, Pellerin, Bourgeois, Dapogny and CohenGrossard et al., 2018).
As children learn to modulate their expression with a rich palette of multimodal tools, they can progressively use the various degrees of iconicity (Reference EmmoreyEmmorey, 2014) that our semiotic resources can offer. They can use the most abstract, indirect relationships that have been generalized and conventionalized in language, as well as transparent imitation and embodied direct relationships that support their ability to refer to entities in their absence (displacement) and enable their interlocutors to capture their meaning. But gesture is not solely guided by our visual modality and by imagistic relationships between form and meaning. As we have illustrated in our examples, gestures are very much linked to actions. There is a continuum between actional/manipulative and communicative/symbolic gestures, which also explains why blind children use gestures (Reference Iverson and Goldin-MeadowIverson & Goldin-Meadow, 1998). Proprioception and sensorimotor skills are an integral part of children’s entrance into and mastery of multimodal communication.
In example (7), video (7) (Table 14.7), we illustrate Madeleine’s multimodal skills at the end of our longitudinal data.Footnote 8 As she is about to be seven years old, the little girl is able to recount both the content of the events and the discourse she has witnessed. In her productions related to the act of portraying (Reference StreeckStreeck, 2008), she has acquired the skills to show the situation reported in both the vocal modality and the gestural modality. She depicts her mother finding out on her phone’s agenda that she has a business meeting around the time of Madeleine’s birthday party.
Table 14.7 Example 7. Video 7. Madeleine, six years and 11 months. https://repository.ortolang.fr/api/content/cup-morgenstern/head/video%207-Mad-6_11-mince.mp4
The first instance of reported speech attributed to Madeleine’s mother (turn 1) is not introduced by a quotative verb (turn 2): She uses non-segmental markers to indicate that the viewpoint has changed. This involves a change in voice and accentuated gesturing with specific facial expressions. The observer understands her perfectly (turn 3). Madeleine’s use of gaze to change perspective is particularly interesting: She stops gazing at the observer as she takes on the role of her mother, in a personal transfer reminiscent of what Reference CuxacCuxac (2000) describes in sign language narratives (turn 6). The alternation of gaze is very consistent throughout the sequence. Her gaze at her interlocutor indicates that she is in the discourse space. When her gaze leaves the observer and is either on her hands “holding” the phone (Figure 14.11) as she embodies her mother, or up in the air as she makes exaggerated facial expressions, she expresses that she is entering narration space. But gaze management becomes even more complex. As she plays the role of her mother addressing herself, Madeleine, the little narrator, looks into the observer’s eyes (turns 6, 9, Figure 14.12) and thus makes the observer transfer into her own role when the event took place: The observer becomes Madeleine, while Madeleine acts as her own mother.
Figure 14.11 Narrative space

Figure 14.12 Discourse space
Madeleine’s voice has become that of her mother, expressed by the subtle pitch variations in her prosody. Madeleine’s body embodies that of her mother with her gestures and facial expressions.
We also notice in this passage how some multimodal constructions are used automatically by Madeleine, such as the rather sophisticated recurrent gesture involving a hand configuration performed with both hands (index extended), a particular localization and a cyclical movement that accompany the verb “régler” in French (to settle the problem, Figure 14.13).
Figure 14.13 a, b, c [régler + recurrent gesture]
At seven years old, Madeleine has become an expert multimodal communicator who masters the different functions of each modality and handles multimodal constructions to express both her own perspective and the perspectives she is able to attribute to others.
More studies of children across languages and cultures are needed to fully capture in detail the development of the various categories of gesture and find out more about similarities and variations between individual children’s pathways to master the complexity of multimodal communication.
7 Conclusion
From an early age, children have cognitive skills that allow them to analyze the language input that surrounds them and thus structure their own practices. Without mastering the complex uses of each word, each gesture, each intonation pattern, and each multimodal construction, they can still construct meaning.
Children use all the resources provided by their bodies to express themselves when they are brought up in an environment that is favorable to multimodal communication and language development. They construct a shared repertoire of gestures and words with their interlocutors. But they constantly use the multimodal resources at their disposal and progressively enrich the complexity of their production, as the examples from Ellie and Madeleine’s longitudinal data have illustrated.
More specifically, throughout infancy, children learn the meaning of gestures in multimodal utterances produced around them. Child-directed communication is very often more emphatically multimodal than is typical in adult-addressed language. Children are also socialized to the use of gestures through specific routines, songs, and rituals in which gestures are integrated or even focused on. Coordination of gaze, gesture, facial expressions, posture, and speech for communication can already be observed in children’s early productions, in gestalt-like multimodal communicative acts. However, this orchestration of semiotic resources used for interaction develops steadily throughout childhood and into adulthood. Children learn to dissociate the uses and specific functions of each modality and to master the dynamic multimodal communicative system used around them and with them in their daily interactions.
1 Introduction
Most people in the world speak more than one language – many learning new languages throughout life, inside and outside of classrooms, for reasons of study, work, migration, religion, or pleasure. Studies of second language (L2) acquisition (SLA) or foreign language acquisition (FLA) examine how this comes about, querying how a new language emerges and develops in the mind in the presence of one or several existing ones. SLA studies track developmental trajectories and “outcomes,” and explore the factors assumed to influence acquisition, such as the nature of the languages that come into contact (e.g. the difference between being a Russian, Dutch, or Japanese learner of English), learners’ age (child vs. adult), skill levels or proficiency, individual cognitive capacities such as working memory and language-learning aptitude, the learning situation (classroom vs. naturalistic settings), the type of instruction (form- vs. meaning-based), the amount of input/exposure, and patterns of output/use in conversation and interaction (see Gass & Mackey, 2012 for overviews). SLA is thus a vast field of study with linguistic, psychological and neurocognitive, social, anthropological, and pedagogical subfields. Until quite recently, gestures were not seen as scientifically relevant in any of these subfields. However, as embodied views of cognition and language gain ground (e.g.
Barsalou, 2008; Glenberg & Kaschak, 2002), and evidence grows that gestures are an integral part of language production and comprehension (Clark, 1996; Goldin-Meadow, 2003; Kendon, 1972, 2004; McNeill, 1992, 2014; Volterra, Beronesi, & Massoni, 1990), gestures are becoming important to SLA concerns as part of the cross-linguistic, psycholinguistic, and sociolinguistic variation to consider.
Over the past two decades, gesture and SLA has become a thriving field of study in its own right (see Gullberg, 2006b, 2008; Gullberg & McCafferty, 2008; Stam, 2012; Stam & Buescher, 2018 for overviews). The work largely concentrates on two broad domains: gestures as a window onto language acquisition issues, and gestures as a medium of acquisition, where the effect of gesture on acquisition is examined. In principle, a third domain could be the study of the acquisition of gestural repertoires in and of themselves (cf. Gullberg, 2014), but very little such work exists.
The current overview will focus on traditional SLA, that is, on speakers whose language skills are still emerging (traditionally “L2 users” or “L2 learners”) and whose learning is at stake, leaving bilingualism aside (but for a review of gestures and bilingualism, see Gullberg, 2012a). Moreover, the review will focus on manual gestures rather than on the full repertoire of behaviors that are part of multimodal communication (gaze, body orientation, head movements, facial expressions, etc.). This chapter summarizes current research on what gestures reveal about SLA and L2 development, and on how gestures affect SLA in interaction and in instruction. It closes with a possible research agenda outlining further issues to explore. Methodological issues for gestures and SLA will not be discussed (but see Gullberg, 2010, 2012b for overviews). A final terminological point is that we will follow Kendon (2004) and refer to gesture functions (representational/referential, pragmatic) rather than gesture “types” wherever necessary.
2 Gestures in Acquisition
2.1 The Influence of Other Languages in SLA, Cross-Linguistic Influence
In contrast to child language acquisition, adults already have languages in place when they learn new ones. A key issue in SLA studies is how those established languages influence the emergence of new ones in speakers’ minds, or more generally, how languages in contact in one mind influence each other. Foreign accent is the textbook example of how a first language (L1) influences a second (L2). The study of cross-linguistic influence (CLI) (Jarvis & Pavlenko, 2008) is a huge research area in SLA, often seen as a main reason why L2 learners do not achieve “nativelikeness.” Learners are assumed to continue to rely on categories and structures from the L1 rather than restructure towards those of the L2. Traditionally, similarities across languages have been assumed to facilitate learning (“positive transfer”), and differences to cause difficulties (“negative transfer,” “interference”), a view no longer adhered to in these simplistic terms (Jarvis & Pavlenko, 2008; Odlin, 2003).
Cross-linguistic comparisons between structures in the L1 and the L2 are key to this line of study. The growing body of work showing that native speakers of different languages also gesture systematically differently as a function of how their languages encode and express meaning (see Kita, 2009 for an overview) has opened an opportunity to use gesture analysis as a tool to reveal more about the nature of learners’ representations. Critically, the argument for SLA studies is that gestures may reveal whether L2 speakers continue to use conceptual representations and categories from the L1 (producing L1-like gesture patterns) rather than show restructuring towards L2 representations (producing L2-like gesture patterns). Crucial to this argument is the assumption that gestures reflect conceptual-semantic elements (e.g. path and manner of motion) as well as their morphosyntactic organization (e.g. word order, number of clauses).
A few linguistic domains with well-documented cross-linguistic differences have been studied for bimodal CLI effects. The vast majority of studies target voluntary and caused motion, drawing on Talmy’s (1991, 2000) distinction between verb-framed languages, which encode path of motion in verb roots (e.g. French traverser “cross”), and satellite-framed languages, which instead encode path of motion in satellites (e.g. prepositions, like English across), leaving the verb to express manner of motion (e.g. crawl, run, sashay). Many studies have shown that L2 learners often do not gesture about motion like native speakers of the target language but continue to gesture in L1-typical ways even as they speak the L2. L1 traces can be found in gestural timing: Learners may temporally align their gestures with different spoken elements than native speakers do (e.g. with verbs vs. locative phrases; e.g. Choi & Lantolf, 2008; Stam, 2006). Traces can also be found in gestural forms, with learners targeting different semantic content in gestures than native speakers do (e.g. path, manner, objects; e.g. Gullberg, 2009; Özyürek, 2002). Findings are often discussed in terms of Slobin’s notion of “thinking for speaking” (Slobin, 1996), that is, the idea that linguistic categories influence what information speakers select for expression when speaking. L1-like gesture patterns suggest that L2 learners continue to think for speaking in the L1 rather than the L2.
Similar results are found with expressions of time and verbal aspect. Mandarin Chinese makes use of vertical time metaphors (e.g. shàng “above,” xià “below” for earlier [past] and later [future]), which are often accompanied by vertical gestures. English speakers instead express time in other metaphors (e.g. before, after) and often gesture on a lateral spatial axis (Gu, Mol, Hoetjes, & Swerts, 2017). Mandarin learners of L2 English occasionally also produce vertical time gestures in L2 English, revealing a lingering L1 influence. The SLA of verbal aspect is well studied in speech (Bardovi-Harlig, 2000), but less is known about the expression of aspect in gesture (e.g. Duncan, 2002). A recent analysis contrasts so-called bounded gestures, involving a “pulse of effort,” with unbounded gestures, involving “smooth movement” (Cienki & Iriskhanova, 2018, p. 3). In this study, native speakers of French aligned perfective aspect (passé composé) significantly more often with bounded gestures, and imperfective aspect (imparfait) with unbounded gestures, whereas L1 Russian speakers aligned bounded gestures with both aspects. Russian L2 learners of French continued to produce bounded gestures with both aspects, but also increased their use of unbounded gestures, suggesting the start of a shift towards a French pattern (Denisova, Cienki, & Iriskhanova, 2018).
A recent line of work examines CLI between gestures and sign languages, specifically whether L1 co-speech gestures influence L2 sign acquisition. It has been argued that gestures affect the acquisition of sign hand shapes and iconic signs adversely, since hearing learners draw on their more varied and unrestricted co-speech gestures and therefore pay less attention to the (phonological and) articulatory details of signs (e.g. Janke & Marshall, 2017; Ortega & Morgan, 2015a, 2015b). In contrast, other studies have suggested that gestures may help L2 sign learners acquire properties of sign prosody (Brentari, Nadolske, & Wolford, 2012). This is clearly a domain open to exploration.
The CLI of gesture on gesture is also examined. It is often assumed that members of some cultures gesture more than those of others (e.g. Scheflen, 1972). Despite a noteworthy lack of data comparing gesture rates cross-culturally under comparable conditions, preconceived ideas are rife (cf. Sekine, Stam, Yoshioka, Tellier, & Capirci, 2015). Native speakers’ gesture rates are occasionally reported, but typically in specific tasks and limited domains (e.g. Gullberg, 1998; Iverson, Capirci, Volterra, & Goldin-Meadow, 2008; Pettenati, Sekine, Congestrì, & Volterra, 2012; Yoshioka, 2005). Nevertheless, a few studies have examined whether gesture rates transfer. One study explored transfer from what the authors assumed to be gesture-frequent languages, Spanish and French, into a language with a lower gesture frequency, English (Pika, Nicoladis, & Marentette, 2006). All bilinguals gestured more than the monolingual English speakers, but it remains unclear why, since there was no L1 baseline data for the Romance languages. In contrast, So (2010) established that monolingual Mandarin Chinese speakers gestured less than American English speakers. Chinese-English bilinguals gestured as frequently in English as English monolinguals, and more often than Chinese monolinguals in Chinese, specifically producing more representational gestures. It remains unclear why the influence only affected representational gestures and not non-representational ones. This line of study clearly needs solid baseline data, but also clearer theorizing about why gesture rates should transfer, and why certain gesture functions transfer more than others.
In sum, influence from the L1 on the L2 is typically found both in speech and in gesture. Importantly, most studies suggest that gestures are more conservative than speech, meaning that they can reveal enduring CLI even when speech may have shifted to target L2 structures. However, whether learners show persistent CLI in speech and gesture (Özçalışkan, 2016), or early shifts of gesture patterns and evidence of further learning (Gullberg, 2009; Lewis, 2012; Stam, 2015), depends on factors such as learners’ proficiency in, exposure to, and usage of the L2 over time. Control over these factors is vital in SLA research in order to meaningfully assess patterns and possible development.
Traditionally, CLI studies have only examined influences from the L1 on the L2. However, prompted by psycholinguistic insights that all the languages one knows are typically active at any time (Van Hell & Dijkstra, 2002), studies have now begun to examine influences from the L2 on the L1 in speech and gesture (cf. papers in Cook, 2003). For example, Japanese speakers with intermediate knowledge of L2 English talk and gesture significantly differently about manner of motion in their L1 Japanese than monolingual Japanese speakers do (Brown & Gullberg, 2008). The distribution of manner across speech and gestures shows traces of English speech-gesture patterns, although the speakers’ L2 skills are very modest. Even more strikingly, within speakers, the speech-gesture patterns are indistinguishable even when they are speaking the two different languages. The results thus reveal an influence of the L2 on the L1, but also an influence of the L1 on the L2. Similar bidirectional shifts have been found in Mandarin and Japanese intermediate learners of L2 English (Brown, 2015; Iwasaki & Yoshioka, 2020), and in the acquisition of an L2 sign language, where L1 co-speech gestures are affected by sign, as reflected in increased overall gesture rates, an increased number of iconic gestures, and a greater number of hand shape types (e.g. Casey, Emmorey, & Larrabee, 2012; Gu, Zheng, & Swerts, 2019). Not surprisingly, the longer the experience with the L2, the more likely the influence on the L1, both for speech-gesture and for sign (Stam, 2015; Weisberg, Casey, Sevcikova Sehyr, & Emmorey, 2020).
Obviously, demonstrable influences of an L2 on an L1 even at modest levels of skill in the L2 raise important theoretical and practical challenges for views of the monolingual native speaker norm (cf. Davies, 2003).
Although CLI is a flourishing area of study, many things remain unknown. We know remarkably little about CLI effects beyond the linguistic domain of motion. We must explore other linguistic domains if gesture analysis is really to contribute to SLA studies of CLI. It is an obvious challenge to first establish L1 baselines in new domains, especially for gesture, but that is a challenge we must meet. We also need more longitudinal studies to further our understanding of how the speech-gesture ensemble develops with increasing proficiency and time, and why gestures are more “conservative” (cf. Stam, 2015). Furthermore, CLI has so far only been studied in bimodal language production. We know nothing about whether native interlocutors perceive and care about learners’ “manual accents” and non-target-like L2 gestures the way they do about foreign accent in speech. Although some studies show that learners’ production of gestures has a positive effect on assessments of their skills (Gullberg, 1998; Jenkins & Parra, 2003), studies have not directly examined native perception of “foreign gesture” or its potential interactional consequences (but see Hooijschuur, Hilton, & Loerts, 2017).
2.2 General Learner Phenomena
SLA studies are not only interested in CLI. Research also examines learners’ language use as a variety in its own right, as well as properties determined by general learning mechanisms, referred to as interlanguage or learner varieties (Perdue, 2000; Selinker, 1972). In this perspective, gesture analyses shed light on details of general developmental patterns at a given level of skill.
One line of work considers the relationship between fluency, complexity, and accuracy (e.g. Housen, Kuiken, & Vedder, 2012), a field of SLA research that has hitherto focused exclusively on speech. In gesture studies, however, the relationship between fluency, proficiency, and gesture rates has generated much work. The general expectation is that the lower the proficiency, the higher the gesture rate – either because gestures can function as communication tools and problem solvers in L2 production (cf. Section 3.1), because (representational) gestures facilitate lexical access (Rauscher, Krauss, & Chen, 1996), or because greater cognitive load is associated with increased gesture production regardless of linguistic challenges (e.g. Alibali, Yeo, Hostetter, & Kita, 2017; Melinger & Kita, 2007). And indeed, many studies show that L2 learners and bilinguals typically produce more gestures overall than native speakers and monolinguals do (see Gullberg, 2012a; Nicoladis, 2007, for overviews). However, the link to fluency and proficiency is not straightforward. For example, gestures are overwhelmingly produced with fluent rather than with disfluent speech, both in L1 and L2 production (e.g. Graziano & Gullberg, 2018). Furthermore, gesture rates may be modulated by task demands (e.g. Aziz & Nicoladis, 2019; Lin, 2020), the languages involved (e.g. So, 2010), and individual communicative style (e.g. Gullberg, 1998; Nagpal, Nicoladis, & Marentette, 2011). Different gestures may also be differentially affected.
Some studies suggest that low-proficiency learners mainly produce representational gestures (e.g. Kida, 2005), whereas others find more representational gestures with increasing proficiency and fluency (Gregersen, Olivares-Cuhat, & Storm, 2009; Gullberg, 1998), or even U-shaped developmental patterns (Zvaigzne, Oshima-Takane, & Hirakawa, 2019). Some suggest that pragmatic and deictic gestures are more frequent in early L2 production (Gullberg, 1998; Isaeva & Fernández-Villanueva, 2016). To understand the interplay between language skills/proficiency and bimodal behavior, we clearly need more detailed studies that track gesture use, task demands, linguistic complexity, fluency, cognitive capacities such as working memory (cf. Cook & Fenn, 2017), and (critically) independently established proficiency or skill levels. More detailed analyses of the temporal relationship between gestures and elements in speech (including disfluency) would also help elucidate the relationship. At this stage, it would be ill-advised to consider gesture rate a reliable diagnostic of L2 proficiency.
Another concern in SLA is how L2 learners at early stages of proficiency organize information about who does what to whom (“reference tracking”) to create coherent and intelligible discourse (Perdue, 2000), especially when they do not yet master pronouns or word order patterns in the L2. When introducing and referring back to entities, native speakers typically create referential chains of lexical noun phrases for new entities followed by pronouns for known ones (a girl – she). New entities are often accompanied by anchoring or localizing gestures, whereas known ones are not (Azar, Backus, & Özyürek, 2019; Debreslioska & Gullberg, 2020; Foraker, 2011; Levy & McNeill, 1992). In contrast to this pattern, early L2 learners with different L1s often use overly explicit chains of lexical noun phrases to refer to the same entity whether new or known (girl – girl) (Hendriks, 2003; Williams, 1988). They also anchor entities with gestures, and typically gesture about them at every mention regardless of whether they are new or known, creating over-explicit bimodal reference tracking (Gullberg, 2003, 2006a; So, Kita, & Goldin-Meadow, 2013; So, Lim, & Tan, 2014; Yoshioka, 2008; Yoshioka & Kellerman, 2006). These patterns are similar regardless of learners’ L1 and L2, suggesting that this is a general learner phenomenon. Moreover, since the patterns persist even when addressees cannot see learners’ localizing gestures, they do not seem to be a strategy for disambiguation (Gullberg, 2006a). That said, the nature of the languages in contact may affect the timing of the anchoring gestures.
For example, gestural reference tracking with lexical noun phrases in languages like Swedish and French may be different from that in languages like Japanese, Mandarin Chinese, or Turkish, which typically drop arguments (nouns, pronouns) in native production. This makes it challenging to assess whether anchoring gestures in clauses with dropped arguments serve other functions than they do in languages with overt arguments (cf. So et al., 2013; Yoshioka, 2008). Moreover, over-explicit reference tracking is not necessarily found in L2 sign language (Frederiksen & Mayberry, 2019), again suggesting language-specific issues to explore. Much remains to be clarified regarding bimodal reference tracking in different L1–L2 pairings and at different proficiency levels.
3 Gestures in L2 Interaction and Instruction
3.1 Teachers’ and Learners’ Gestural Practices in and outside of Language Classrooms
An important domain in SLA studies is how learners and their native and non-native interlocutors combine speech and gestures in interactive practices to promote communication, comprehension, and learning. All forms of didactic talk – by adults to children (child-directed speech) and by native speakers to L2 users (“foreigner/teacher talk,” Ferguson, 1971) – appear to display an increased use of gestures (e.g. Adams, 1998; Allen, 2000; Iverson, Capirci, Longobardi, & Caselli, 1999; Lazaraton, 2004). Teachers, instructors, and parents clearly think that seeing gestures facilitates learners’ comprehension (cf. Kellerman, 1992; Sueyoshi & Hardison, 2005) and possibly also learning. Studies of gesture in communicative practices are often – but not always – qualitative in nature, focusing on interaction. The focus is typically not on measurable effects on learning.
A range of studies investigates how learners themselves use speech and gesture as resources to enable communication in the L2 when linguistic skills are limited. Studies have focused on how speech and gesture are jointly deployed to avoid communicative breakdowns, resolve misunderstandings, and handle repairs. In SLA studies, the field of communication strategies examines how learners solve lexical, grammatical, and pragmatic problems (Kasper & Kellerman, 1997). Studies have focused both on interactive and cognitive aspects of this process and identified strategies such as circumlocutions, word coinage, avoidance, and gestures. For example, French and Swedish learners produce representational gestures to resolve lexical issues in joint solutions with their interlocutors; representational and deictic gestures to handle grammatical difficulties; and addressee-directed and pragmatic gestures (“thinking” or “cyclic” gestures; Gullberg, 2011; Ladewig, 2014) to manage difficulties arising from non-fluent speech, turn-taking, and repairs (Gullberg, 1998, 2011). Conversation analysis shows that gestural word searches and repair sequences in particular are deeply interactive (Fornel, 1991; Harrison, Adolphs, Gillon Dowens, Du, & Littlemore, 2018; Olsher, 2008). Learners also talk and gesture to themselves to solve problems and to rehearse and internalize new knowledge, a practice known as private speech and private gesture (Lee, 2008; McCafferty, 1998; McCafferty & Rosborough, 2014).
Such talk is often accompanied by beat-like gestures, both during the problematic sequence and at the resolution of the problem (Hauser, 2014, on solution strokes). Interestingly, when teachers themselves are non-native speakers, they too become more fluent the more they gesture (Sato, 2020).
Another line of work focuses specifically on instructed SLA activities in classrooms. Studies in the traditions of Conversation Analysis or Sociocultural Theory exemplify how both learners and teachers use gestures in classroom interaction to support the teaching and learning of vocabulary, grammar, pronunciation, pragmatics, and even writing (Eskildsen & Wagner, 2013; Kim & Cho, 2017; Kimura & Kazik, 2017; Lazaraton, 2004; Matsumoto & Dobs, 2017; Smotrova, 2017). Studies also examine how gestures are used to regulate classroom interaction (Cekaite, 2009; Tabensky, 2008), raise awareness of elements to be learned (Hilliard, 2020; van Compernolle & Williams, 2011), and enable joint coproduction of the L2 (Mori & Hayashi, 2006; Olsher, 2004). Students’ reactions to teachers’ gestures vary (Sime, 2006), but learners clearly often pick up and reuse teachers’ and other students’ gestures, suggesting that these gestures play a role in the internalization and entrenchment of new lexical and grammatical elements (Belhiah, 2013; Clark & Trofimovich, 2016; Eskildsen & Wagner, 2015; Smotrova & Lantolf, 2013). Finally, gestures play a role in learners’ explorations of their L2 identities, especially after lengthy use of the new language (Nardotto Peltier & McCafferty, 2010; Tian & McCafferty, 2020).
Teachers’ gestures are the focus in studies of pedagogical practice. Tellier has identified three broad functions of teachers’ gestures, namely to inform, to animate (meaning to enliven and guide), and to assess (Tellier, 2006, 2014; cf. Quinlisk, 2008). Interestingly, when using gestures to inform and explain vocabulary, teachers modulate their behavior depending on the skill level of the student. They produce more representational gestures of longer duration and greater spatial expanse when explaining vocabulary to non-native than to native students (Tellier & Stam, 2012), and often produce gestures both before and after speech pauses to highlight a new word (Stam & Tellier, 2017). This is a form of gestural “teacher talk” (Sinclair & Brazil, 1982).
The gestural assessment function is often studied under the SLA heading of (corrective) feedback and recasts, that is, cases where a teacher provides feedback on learners’ production, for example through recasts that provide a correct reformulation (see Nakatsukasa & Loewen, 2017 for an overview). Many studies in this domain are descriptive microanalytical studies (e.g. van Compernolle & Smotrova, 2014), but there are also intervention studies using pre-test/post-test designs, with results showing both positive (Nakatsukasa, 2016) and more modest effects of gestural recasts on learning (Nakatsukasa, 2021), possibly depending on the linguistic domain being taught.
Overall, instructed SLA settings remain underexplored with regard to multimodality and gesture, in particular concerning the effects of different kinds of instruction, such as a focus on formal aspects of language (“focus on form”) versus a focus on communication (“focus on meaning”) (Doughty, 2003). Moreover, the acquisition of vocabulary continues to dominate these studies, whereas other linguistic domains should be examined. There is a budding literature on gestures in the teaching of grammar and pronunciation, but these are big domains where many subareas remain to be explored.
3.2 The Effects of Seeing and Producing Gestures on SLA
The studies reviewed above have all assumed that gestures benefit learners’ L2 development and use, but, with the exception of the pre-test/post-test studies, they have not typically measured actual effects. Outside of language studies, in contrast, a considerable body of research now deals with the measurable effects of seeing and producing gestures on memory and learning (see Reference Cook, Fenn, Breckinridge Church, Alibali and KellyCook & Fenn, 2017, for an overview). Results typically show that gestures promote learning. For example, learners who see and produce gestures learn more about math and science than those who do not, both in the short term and in the long term. Moreover, those with lower working memory capacity typically benefit more from gestures than those with higher capacity.
Studies examining possible effects for language learning and specifically SLA have been scarcer. However, following the pioneering studies by Reference AllenAllen (1995), Reference TellierTellier (2008), and Reference Kelly, McDevitt and EschKelly, McDevitt and Esch (2009), an explosion of behavioral and neurocognitive investigations now examines how gestures during training affect language learning in child and adult L2 learners.
Overall, studies find beneficial effects of gesture perception and, even more so, of gesture production in different linguistic domains. Vocabulary in real and artificial languages is better retained when presented with gestures, especially if learners produce gestures themselves during explicit training (e.g. Reference Andrä, Mathias, Schwager, Macedonia and von KriegsteinAndrä, Mathias, Schwager, Macedonia, & von Kriegstein, 2020; Reference García-Gámez and MacizoGarcía-Gámez & Macizo, 2019; Reference Kelly, McDevitt and EschKelly et al., 2009; Reference Krönke, Mueller, Friederici and ObrigKrönke, Mueller, Friederici, & Obrig, 2013; Reference Macedonia, Müller and FriedericiMacedonia, Müller, & Friederici, 2011; Reference MorettMorett, 2014, Reference Morett2018). Even without explicit training, seeing gestures implicitly helps learners to link meaning to word forms in an unknown language (Reference Gullberg, Roberts and DimrothGullberg, Roberts, & Dimroth, 2012). Gesture training improves L2 pronunciation of vowel and syllable durations, word stress, and intonation (e.g. Reference Ghaemi and RafiGhaemi & Rafi, 2018; Reference Gluhareva and PrietoGluhareva & Prieto, 2017; Reference Iizuka, Nakatsukasa and BraverIizuka, Nakatsukasa, & Braver, 2020; Reference Kushch, Igualada and PrietoKushch, Igualada, & Prieto, 2018; Reference Li, Baills and PrietoLi, Baills, & Prieto, 2020; Reference Yuan, González-Fuente, Baills and PrietoYuan, González-Fuente, Baills, & Prieto, 2019; Reference Zhang, Baills and PrietoZhang, Baills, & Prieto, 2018), and the learning of phonological distinctions such as lexical tones in Mandarin Chinese (Reference Baills, Suárez-González, González-Fuente and PrietoBaills, Suárez-González, González-Fuente, & Prieto, 2019; Reference Morett and ChangMorett & Chang, 2015). Even L2 grammar may benefit from gesture training, such as in learning about prepositions (e.g. Reference NakatsukasaNakatsukasa, 2016). 
The behavioral findings are bolstered by neurocognitive evidence highlighting that gestures are integrated with language in processing (see Reference Kelly, Breckinridge Church, Alibali and KellyKelly, 2017, for an overview), also offering possible explanations for the learning effects. It is suggested that gestures add a depth of encoding since sensorimotor brain networks are activated and grow larger the more sensory modalities are connected to a word (Reference Macedonia, Repetto, Ischebeck and MuellerMacedonia, Repetto, Ischebeck, & Mueller, 2019).
Importantly, studies also examine effects of different kinds of gestures. Most studies have investigated the effects on vocabulary learning of representational gestures depicting concrete content (e.g. size, shape, movement) (e.g. Reference Kelly, McDevitt and EschKelly et al., 2009; Reference Macedonia and KnöscheMacedonia & Knösche, 2011; Reference MorettMorett, 2014; Reference PorterPorter, 2016; Reference So, Sim Chen-Hui and Low Wei-ShanSo, Sim Chen-Hui, & Low Wei-Shan, 2012; Reference TellierTellier, 2008). Studies have also examined the effect of gestures depicting abstract content (e.g. a lateral sweeping movement to depict duration), but here results are more mixed (Reference Baills, Suárez-González, González-Fuente and PrietoBaills et al., 2019; Reference Hirata and KellyHirata & Kelly, 2010; Reference Hirata, Kelly, Huang and ManansalaHirata, Kelly, Huang, & Manansala, 2014; Reference Li, Baills and PrietoLi et al., 2020). Studies have also examined the effect of non-representational gestures, such as prominence-marking manual beats (Reference Kushch, Igualada and PrietoKushch et al., 2018), hand-clapping (Reference Iizuka, Nakatsukasa and BraverIizuka et al., 2020; Reference Zhang, Baills and PrietoZhang et al., 2018), and head nods (Reference Zheng, Hirata and KellyZheng, Hirata, & Kelly, 2018).
Although gestures generally seem to boost SLA, various caveats apply. Importantly, the semantic content of gestures should match that of vocabulary items (Reference García-Gámez and MacizoGarcía-Gámez & Macizo, 2019; Reference Huang, Kim and ChristiansonHuang, Kim, & Christianson, 2019; Reference Kelly, McDevitt and EschKelly et al., 2009; Reference MacedoniaMacedonia, 2019). Non-matching gestures may hinder acquisition more than the total absence of gestures. Another important issue is task demands (Reference Kelly and LeeKelly & Lee, 2012; Reference Morett and ChangMorett & Chang, 2015) and learners’ developmental levels. For example, representational gestures help both children and adults to retain new vocabulary, but beats may only help adults (Reference So, Sim Chen-Hui and Low Wei-ShanSo et al., 2012).
The importance of these details is highlighted by some (seemingly) contradictory evidence for effects in the domains of phonology and phonetics. Hirata and colleagues (Reference Hirata and KellyHirata & Kelly, 2010; Reference Hirata, Kelly, Huang and ManansalaHirata et al., 2014; Reference Kelly, Hirata, Manansala and HuangKelly, Hirata, Manansala, & Huang, 2014) have found that while lip information improves English learners’ ability to perceive long and short vowels in Japanese, manual sweeping gestures indicating duration have limited effects both on the perception of vowels and on the retention of vocabulary containing the distinction. However, the same sweeping gesture has recently been shown to improve the pronunciation of the vowel distinction, but not performance on a perceptual task (Reference Li, Baills and PrietoLi et al., 2020). In both cases, then, effects are limited in the perceptual domain, but there is gestural boosting in L2 production. Reference Kelly, Breckinridge Church, Alibali and KellyKelly (2017) has suggested that some linguistic components are more deeply connected to gestures (e.g. concrete semantics, pragmatics) and may therefore be more susceptible to gestural effects on learning than others (e.g. syntax, phonology). Although an interesting suggestion, it may be premature to discount the effect of gestures on the acquisition of whole linguistic domains until more careful distinctions have been made between perception/production benefits and different task difficulties. The effect of cognitive task complexity also remains largely unexamined in the gesture-learning literature (but see Reference Nicoladis, Pika, Yin and MarentetteNicoladis, Pika, Yin, & Marentette, 2007). Given the great variation of tasks used in these studies (e.g. perceptual tasks: multiple choice, word-meaning associations, word recognition, discrimination; production tasks: free and cued word recall, foreign to native/native to foreign translation, imitation), there is much to explore and systematize here.
Aspects that have not yet been properly addressed include how much training is needed for gestures to have an effect (Reference MacedoniaMacedonia, 2019), and the corresponding longevity of the effects. Most studies find effects still present one to two weeks after training, but proper longitudinal studies are rare (but see Reference Macedonia and KlimeschMacedonia & Klimesch, 2014). Further, most studies are intervention studies involving explicit classroom-like instruction. Few studies have investigated the potential of gestures for incidental or implicit L2 learning, that is, learning in the absence of overt instruction. The difference between explicit and implicit language learning is an important research domain in SLA studies (see Reference HulstijnHulstijn, 2005 for an overview), where the role of attention and differences between procedural (implicit) and declarative (explicit) learning, memory, and knowledge are vital areas of study (e.g. Reference ParadisParadis, 2009; Reference Schmidt and RobinsonSchmidt, 2001; Reference UllmanUllman, 2001). No SLA studies of implicit/explicit learning include gestures. Since implicit L2 learning is arguably more common in the world than explicit classroom instruction, the role of gestures in implicit L2 learning clearly merits study. A further issue concerns attention to the linguistic details, including the parts of speech to be learned, and the role of similarity. It is well known in SLA that truly new distinctions may be easier to learn than distinctions that are similar to those in the L1 (e.g. Reference Flege, Burmeister, Piske and RohdeFlege, 2002; Reference IjazIjaz, 1986). Very little attention is paid to these details, and potential (unintended) cross-linguistic effects therefore remain unexplored. Finally, the nature of the gestural enhancement receives surprisingly little detailed attention in these studies.
With the exception of Reference MorettMorett (2018), who has shown that learners’ own spontaneous gesture production has greater effects on word recall than viewing someone else’s non-spontaneous gestures, very little information is provided about how training gestures are selected, their articulatory and temporal features relative to speech, and so on. All these aspects are likely to play a role but remain underdescribed and underexplored.
4 A Research Agenda for Gestures and SLA?
The study of SLA and gesture has made considerable progress in the past 20 years. However, much remains unexplored. Across all domains reviewed above, there is an obvious need to move beyond the lexicon (and the domain of motion) and consider the acquisition of morphosyntax, (sustained) discourse, phonology, phonetics, figurative language, idiomatic expressions, and pragmatics, both in production and in comprehension, both offline and online. The SLA field is rife with claims built on speech alone, ranging from the effects of the languages in contact and individual differences in cognitive makeup to language use, type of instruction, and so on. These claims could and should be tested taking gestures into account (and preferably on a wider language sample than has been examined so far).
Fundamental questions about the study of SLA and gesture are when, why, and how speech and gestures change in L2; why L2 speech seems to change more readily than L2 gestures; under what conditions gestures do change, whether through imitation, through changes in language use, or both. We clearly need more longitudinal work to improve our understanding of these vital issues. Moreover, while we know very little about whether, when, and how L2 learners produce language-specific co-speech patterns, we know even less about the SLA of conventional, quotable gestures, or emblems. If communicative fluency and cultural appropriateness are seen as important to SLA, then the acquisition of gestural repertoires of linguistic and cultural communities should be included. Although a few studies investigate L2 users’ comprehension of culture-specific quotable gestures (e.g. Reference JungheimJungheim, 2006; Reference Molinsky, Krabbenhoft, Ambady and ChoiMolinsky, Krabbenhoft, Ambady, & Choi, 2005; Reference Wolfgang and WolofskyWolfgang & Wolofsky, 1991), it remains largely unknown whether L2 learners themselves produce them. For example, do L2 speakers learn to respect handedness taboos (e.g. Reference Kita and EssegbeyKita & Essegbey, 2001), or to produce appropriate gestural backchannelling, shifting from head toss to headshake (Reference Morris, Collett, Marsh and O’ShaughnessyMorris, Collett, Marsh, & O’Shaughnessy, 1979)? Since emblems function like idiomatic expressions, they may be subject to the same acquisition difficulties as spoken idiomatic expressions (e.g. Reference IrujoIrujo, 1993). However, since they are often assumed to be inherently “salient” in the absence of co-occurring speech, they may also be easier to acquire than spoken idiomatic expressions. A contrastive study of idiom acquisition in speech vs. gesture could illuminate whether the visual modality enjoys an advantage in learning.
Further, the majority of work on SLA and gesture probes L2 production. We know much less about L2 gesture perception – both about learners’ perception of the gestures around them in classrooms, study abroad, and immersion contexts, and about the actual effect of gestures on L2 comprehension (but see e.g. Reference Sueyoshi and HardisonSueyoshi & Hardison, 2005, and Reference Drijvers and ÖzyürekDrijvers & Özyürek, 2020, for somewhat conflicting findings) – and about native speakers’ perceptions of learners’ (foreign) gestures. Both domains need elucidating.
We are beginning to get a handle on the measurable effect of explicit gesture training on SLA and memory, but we still know virtually nothing about implicit effects of gesture processing in SLA, that is, the learning effect of gestures that are not part of explicit training, but just normal language use. We know little about the difference between instructed and uninstructed multimodal SLA, about different kinds of classroom instruction (focus on form vs. on meaning), and the role gestures may play in classrooms with literate vs. illiterate L2 learners (e.g. Reference Tarone and BigelowTarone & Bigelow, 2005).
A final methodological point is worth making. We should all aim to provide much more detail on gestures themselves in studies of gesture and SLA, and on their relationship to co-occurring speech. A surprising number of published papers provide no or only the most minimal information about the articulatory, spatial, and temporal properties of the gestures under study, and their temporal relationship to speech. Moreover, it is often assumed that gestural function (“gesture type”), semantic content, and coexpressivity between speech and gesture are easily established when they are not; or that gestures are monofunctional when in fact they are deeply multifunctional. We need to be more attentive to all these details. Data sharing and the creation of open multimodal learner corpora are obviously challenging in gesture studies, but that is precisely why it behooves us all to be more explicit than we typically are in the interest of replicability.
5 Conclusions
The study of gestures and SLA can now be said to be a field in its own right. This field considers gestures both as a tool to study acquisition, and as a phenomenon to be studied per se. The double nature of gestures as interactive, addressee-directed phenomena on the one hand, and as internal, speaker-directed ones on the other, makes them deeply relevant to issues of L2 acquisition. Theoretically, this new field still needs to integrate concerns from SLA and gesture studies, from language studies and cognitive (neuro)science. Currently, cognitive learning science devotes more attention to gestures than SLA studies do, but this state of affairs will change as the body of work grows, as terminologies and methods become more unified, and as the multimodal view of language gains further ground. The challenge for us all is to shift theories and models of language acquisition away from monomodal monolingual perspectives toward multimodal multilingual ones. It is high time.
1 Signed Languages
Signed languages are natural human languages used by communities of deaf people as their native or primary language. “Sign language” is a broad category requiring clarification in two ways. First, many linguists prefer the term “signed language” rather than “sign language.” The term “sign language” implies that “sign” itself is a language, leading to confusion between the way a language is produced, its modality or medium of expression, and the name of a language. The term “signed” is best used to describe the modality in which a language is produced, parallel to spoken and written. The terms speaking, writing, and signing describe the ways in which a language can be expressed: We speak a language (such as English), we write a language (such as Chinese), and we sign a language (such as Argentine Sign Language). To put it another way: “Speech” is not the name of a language (students do not take classes in “speech language,” they take classes in Spanish or Japanese). Likewise, “sign” is not the name of a language, although we persist in the pernicious tradition of using the term “sign language” in this way: We say that students are taking “sign language” classes and assume, incorrectly, that this names the language. American Sign Language (ASL) is the name of a signed language. In fact, ASL and other signed languages can be written; linguists and deaf community members have developed orthographies for this purpose (Reference Stokoe, Casterline and CronebergStokoe, Casterline, & Croneberg, 1965). There are many signed languages: British Sign Language, Italian Sign Language, Japanese Sign Language, Iranian Sign Language, Taiwanese Sign Language, and so forth. Although their names contain the word “sign,” they are the names of particular languages in the general category of signed languages. Some deaf communities have adopted names for their languages that do not include the word “sign,” such as Auslan (Australian Sign Language), Libras (Língua Brasileira de Sinais), and others.
In fact, the word “sign” and its translation is not even used in some spoken languages to name the local signed language, but rather the equivalent of “gesture” – which makes understanding the relation between sign and gesture even more challenging in those languages, for example, Dutch Nederlandse Gebarentaal, German Deutsche Gebärdensprache, and Russian Russkij zhestovyj jazyk. In summary, many linguists believe it is more accurate to use the term “signed” for the set of all the world’s languages produced in the signed modality, just as we use the term “spoken” for the set of the world’s spoken languages and “written” for those languages with orthographies.
The second way that the term “sign language” – even if replaced with the term “signed language” – must be clarified is that the term often includes natural signed languages used by large communities, such as American Sign Language, British Sign Language, and Chinese Sign Language; village sign languages which arise when a number of deaf children are born into an insular indigenous community, such as San Juan Quiahije Chatino Sign Language (Reference Mesh and HouMesh & Hou, 2018); newly emerging languages such as Nicaraguan Sign Language (Reference Senghas and CoppolaSenghas & Coppola, 2001) and Al-Sayyid Bedouin Sign Language (Reference Meir, Sandler, Padden, Aronoff, Marschark and SpencerMeir, Sandler, Padden, & Aronoff, 2010); and International Sign (IS), an emerging pidgin that has arisen among signers from different language communities primarily in Europe and is often used as a lingua franca at international conferences (Reference WhynotWhynot, 2016). As one might expect, these different categories of signed languages have different histories, sociolinguistic characteristics, and potentially quite different stages of development, conventionalization, and recognition in education. Signed languages in developed countries with large communities of signers have often become accepted in educational settings (although, as we will see, historically this has not always been the case). In these situations the signed language may be learned by deaf children as a second language in school. Because the school systems in urban communities can draw from a larger population, deaf children of hearing parents often come into contact with deaf children of deaf parents who have learned the local signed language as their first language. However, even in otherwise developed countries such as Japan or China, lesser-studied signed languages often are not adopted by educational systems.
Village signed languages often occur in naturally isolated settings and smaller communities. As a result, they are sometimes shared with hearing people living in the same community who come into contact with deaf people. Because of the setting, the probability of acquiring more deaf users is relatively low for these signed languages. If deaf children leave these communities to attend school, they often abandon their village signed language and acquire the urban signed language.
A number of myths and misunderstandings have pervaded our understanding of the first class of signed languages, the natural signed languages of large communities of deaf people. One pervasive misunderstanding, held throughout much of history, is that signed languages are merely depictive gestures and not linguistically structured. Signed languages are not simply holistic gestures. There is, however, a complex relationship between signed languages and gesture that scholars are only now beginning to understand (Reference WilcoxWilcox, 2004, Reference Wilcox, Pizzuto, Pietrandrea and Simone2007, Reference Wilcox2009). This relationship will be discussed in Section 3.
Another common misunderstanding is that signed languages are merely representations of spoken languages – that ASL, for example, is a signed representation of spoken English. Signed languages are independent languages with their own lexicons and grammars. Related to this misconception, many people believe that signed languages are invented languages. They are not. Signed languages, like spoken languages, are naturally developing human languages. There are, however, sign systems created to represent spoken/written language, such as Seeing Essential English and Signing Exact English.
Following from the belief that signed languages are invented is the assumption that they are languages with a shallow historical depth. The full story of the age of signed languages is quite complex and depends on the language, the question of emergent village signed languages, and the region of the world in which the language is used. Since signed languages were not regarded as true languages for most of recorded history, it is quite difficult to ascertain their age. We know that signed languages are mentioned in Talmudic law, and in the writings of Aristotle, Quintilian, and many others. In his lessons, for example, Quintilian taught that not only a movement of the hand, but even a nod, may express our meaning, and he noted that such gestures are used by deaf people instead of speech. Deaf people and signed language are also mentioned in ancient Egyptian writings from the 19th Dynasty, ca. 1350–1200 BCE (Reference ErmanErman, 1971).
We do have a few historical accounts of signed language communities, such as that provided by Pierre Desloges in 1779, as reported in Reference Lane, Lane and GrosjeanLane (1980, pp. 123–124). Desloges became deaf at the age of seven from smallpox. As an adult, he wrote a treatise describing how he learned to sign:
Like a Frenchman who sees his language attacked by a German who knows only a few words of French, I felt obliged to defend my own language against the false imputations [that it is not a language]. […] For a long time I was unaware of sign language. I only used scattered signs, isolated, without an orderly sequence and without linkages. I was quite unacquainted with the skill of combining them to sketch clearly defined scenes whereby we can represent our various ideas, communicate them to our deaf companions, and converse with them in an orderly and extended discussion. The first person who taught me this very useful skill was a deaf-mute from birth, of Italian nationality, who knew neither how to read nor write; he was a servant in the home of one of the actors in the “Comédie Italienne.” […] There are deaf-mutes from birth, workers in Paris, who know neither reading nor writing, and who never went to the lessons of the Abbé de l’Epée, but who were so well instructed in religion, solely through the medium of sign, that they were judged worthy of the sacraments of the church. There is no event in Paris, in France, and in the four corners of the world that is not a topic of our conversations. We express ourselves on all topics with as much orderliness, precision, and speed as if we enjoyed the faculties of speech and hearing.
Although very little can be definitively claimed about the history of signed language over the course of centuries, we can certainly say that as long as communities of deaf people have existed, they have used signed languages to communicate with each other.
2 Relation of Sign and Gesture: Historical Background
The relationship between signed languages and gesture has been the subject of debate among scholars of language and philosophers for centuries. One way in which this question has been manifested is in the centuries-long debate about language origins. (See more on this topic in the chapter by Żywiczyński and Zlatev, this volume.) The philosopher Étienne Bonnot de Condillac, for example, suggested that language began as a gesture language or langage d’action. The term “language of action” was later even used to describe signed languages. Jean-Jacques Rousseau, Johann Herder, Wilhelm von Humboldt, and other philosophers of language caricatured and ridiculed this position, arguing instead that language could not have arisen from such natural, animalistic beginnings. Herder, for example, proposed that the fundamental linguistic act was naming; critically, this naming was based on an unemotional sense of curiosity, a desire for pure knowledge. Furthermore, Herder argued that the naming had to have been audible, unaccompanied by any visible movement. The reasoning behind this argument was that “when a man is under the influence of an emotion (such as fear of an enemy) and yet suppresses, from rational grounds, any movement which might reveal it, he is acting from reason, not from passion” (Reference WellsWells, 1987, p. 40). Language thus is essential to reason – “without language man has no reason, and without reason no language” (ibid.). From this perspective, the best evidence for reason is that we ignore our body, our senses, our emotions, and our passions.
Thus, the accepted wisdom was that true language is spoken. Signs were regarded as nothing more than natural gestures evoked by emotion rather than reason. This view came sharply into focus during the Milan Conference of 1880. At this time a great debate was taking place between educators who supported the use of signing in the education of deaf children and those who supported speech, the so-called oral method. Supporters of speech, such as Marius Magnat, the director of an oral school in Geneva, maintained that signed languages lacked any features of language and thus were not suited for educating deaf children (Reference LaneLane, 1984, pp. 387–388):
The advantages of articulation training [i.e., speech] […] are that it restores the deaf to society, allows moral and intellectual development, and proves useful in employment. Moreover, it permits communication with the illiterate, facilitates the acquisition and use of ideas, is better for the lungs, has more precision than signs, makes the pupil the equal of his hearing counterpart, allows spontaneous, rapid, sure, and complete expression of thought, and humanizes the user. Manually taught children are defiant and corruptible. This arises from the disadvantages of sign language. It is doubtful that sign can engender thought. It is concrete. It is not truly connected with feeling and thought. […] It lacks precision. […] Sign cannot convey number, gender, person, time, nouns, verbs, adverbs, adjectives, he claims. […] It does not allow [the teacher] to raise the deaf-mute above his sensations. […] Since signs strike the senses materially they cannot elicit reasoning, reflection, generalization, and above all abstraction as powerfully as can speech.
Statements made by Giulio Tarra, the president of the Milan conference, reveal even more starkly the confusion that equated speech with language, and sign with gesture.
Gesture is not the true language of man which suits the dignity of his nature. Gesture, instead of addressing the mind, addresses the imagination and the senses. Moreover, it is not and never will be the language of society […] Thus, for us it is an absolute necessity to prohibit that language and to replace it with living speech, the only instrument of human thought. […] Oral speech is the sole power that can rekindle the light God breathed into man when, giving him a soul in a corporeal body, he gave him also a means of understanding, of conceiving, and of expressing himself. […] While, on the one hand, mimic signs are not sufficient to express the fullness of thought, on the other they enhance and glorify fantasy and all the faculties of the sense of imagination. […] The fantastic language of signs exalts the senses and foments the passions, whereas speech elevates the mind much more naturally, with calm and truth and avoids the danger of exaggerating the sentiment expressed and provoking harmful mental impressions.
The debate between those who supported signed language in the education of the deaf, and those who argued that only speech should be used, provides us with a clear view of how they considered the relationship between sign and gesture. It is no surprise that the relationship was framed in terms of mind–body dualism as espoused by one of the leading philosophers of the time, René Descartes. Summarizing the descriptions of sign and speech from Magnat and Tarra, we see that language is of the mind; it is associated with speech, with the acquisition of ideas and the expression of thought; it elicits reasoning, reflection, abstraction, generalization, and rationality. Speech has precision (perhaps meaning it has grammar); its users exhibit calm, prudence, and truth, such that it humanizes its users. Speech is of the soul, the spirit, because it originated from the breath, the aspiration, of God. Sign, on the other hand, is of the body; it is merely gesture. Signs corrupt deaf children, making them less human, more animalistic. Signs are concrete, and thus they cannot engender thought. Signs lack parts of speech and grammar. Because signs are gesture, they are associated with the corporeal body and with the senses. Signs strike the senses materially. They foment the passions – they are, in both senses of the term, sensual. Signs glorify fantasy and imagination. It is no accident that the root of imagination is image, and the iconic nature of signs, the antithesis of abstraction, is considered to be one of their most damning features. Whereas speech originated as the breath by which God gave humans a soul, sign is of the corruptible body, the flesh, and the material world.
The historical significance for signed language linguistics was tremendous, because it set the worldview for nearly 100 years. Scholars were left with the following assumptions firmly entrenched in our understanding of language, speech, gesture, and sign:
Speech is of the mind; signs are of the body.
Language is equivalent to speech; gesture is not language.
Signs are gesture and therefore are not language.
In one form or another, these assumptions persist today. Much of the early work of sign linguists was motivated by a perceived need to distinguish sign from gesture. Linguists sought to demonstrate, for example, that signed languages have the same design features that spoken languages have (Hockett, 1982). The pioneering work of William C. Stokoe (1960) was directed at demonstrating that ASL exhibits duality of patterning – that is, that it has a level equivalent to the phonology of spoken languages, with meaningless units that he called cheremes, by analogy with phonemes, which are combined to form meaningful units, or morphemes. Stokoe initially identified three classes of cheremes: hand shapes, movements, and locations. He later simplified the analysis to two classes – that which acts and its action (Stokoe, 1980).
Other linguists worked to document the complex grammar of signed languages (Klima & Bellugi, 1979; Siple, 1978) and the nature of iconicity (Engberg-Pedersen, 1996; Frishberg, 1975; Mandel, 1977). Sociolinguists documented the historical relation of ASL to French Sign Language (Woodward, 1976b, 1978) and sociolinguistic characteristics of the deaf community (Lucas, 1989; Woodward, 1974, 1976a).
3 Sign and Gesture in Acquisition
A substantial body of research has investigated the relation between sign and gesture in first language acquisition. One area of research has examined deaf children who have no exposure to a signed language either at school or in the home. These children often develop a system of idiosyncratic gestures, known as homesign, to communicate with parents or siblings (Morford, 2003). These gestures exhibit many of the same properties seen in signed or spoken languages. Homesigners develop systematic ways of indicating negation and questions. Specific hand shapes and movements become associated with specific meanings. Homesigners use these gestures consistently across settings, rather than creating new gestures for each new setting. They also use gestures to refer to generic entities and events, not just specific instances. However, homesign also displays important differences from more conventional spoken or signed languages. Homesigners do not appear ever to use complex syntactic structure. Homesign also does not seem to develop phonological structure that is independent of morphological structure (Morford & Hänel-Faulhaber, 2011).
Research on homesign appears to contradict strong claims that language cannot be acquired after a critical period. While this may be true for spoken languages, deaf children who use homesign do acquire signed language once they are exposed to signing deaf adults (Morford, 2003). However, as adults, these signers display deficits when compared to other members of the signed language community. Morford and Hänel-Faulhaber (2011) propose two possible explanations. One explanation focuses on the differences between homesign and conventional languages, noting that one area of deficit is the acquisition of more native-like phonological structure and complex syntax. The second points out that homesign is not acquired in a shared language community, so homesigners have reduced receptive language exposure. As a result, late learners also exhibit receptive processing deficits such as slower sign recognition (Morford & Hänel-Faulhaber, 2011).
A second area compares the acquisition of gesture in hearing children and sign in deaf children. A broad summary of the results suggests links between early actions, gestures, and words, and points to the importance of multimodal communication and the interplay between gestures and spoken words (Volterra, Capirci, Rinaldi, & Sparaci, 2018). Early work examined the role of gestural performatives and representational gestures (Bates, Camaioni, & Volterra, 1975). Performatives include ritualized requests, showing off, showing, giving, and pointing. Performatives were typically classified as deictic gestures, the content of which can only be interpreted by reference to the extralinguistic context. The second type, representational gestures, are used by children to refer to objects, persons, locations, or events through hand, body, or facial movements (Capirci, Iverson, Montanari, & Volterra, 2002). Representational gestures differ from deictic gestures by iconically representing attributes or actions of specific referents, and their meaning does not change across contexts.
Early development of these two types of gesture, and of signs, was often attributed to different underlying cognitive systems, one gestural and the other linguistic. A more recent conclusion suggests “not a clear-cut separation, but a continuity between co-speech gestures produced by hearing children and early signs produced by children exposed to a sign language” (Volterra, Capirci, Rinaldi, & Sparaci, 2018, p. 217). These researchers conclude that the traditional dichotomy – gestures as gradient, variable, and iconic; signs as categorical, invariable, and arbitrary – should be replaced with a multimodal approach to the study of both spoken and signed languages. This conclusion mirrors recent research questioning any clear-cut distinction between gestural and linguistic systems (Talmy, 2018; Wilcox & Martínez, 2020; Wilcox & Occhino, 2016a).
4 The Relation of Sign and Gesture: Current Views
As is so often the case in the history of ideas, the pendulum of science swings between two poles. Space and time were seen as distinct by Newton; Einstein merged them into space-time. Physicists now debate whether time exists at all. Such is also the case with sign and gesture. As we have seen, for centuries signed languages were considered to be gesture and not language. The early work of sign linguists was directed at demonstrating that signed languages are languages and not gesture. In recent years, there has emerged a position which holds that signed languages are integrations of linguistic and non-linguistic or gestural systems. Finally, there are even those who are beginning to question the usefulness of the term “gesture” itself.
One contribution to the paradigm shift that permitted sign and gesture to be reexamined was the publication of Gesture and the Nature of Language (Armstrong, Stokoe, & Wilcox, 1995), which made the case that scholars could examine gesture and language as related phenomena. The authors did not simply claim that gesture and signed language might be related. Rather, they put forward the hypothesis that all language is gestural, and that the origins of human language can be traced to visible gestures. The ideas in this work were based in part on an understanding of gesture as a functional unit, an equivalence class of coordinated movements that achieve some end (Studdert-Kennedy, 1987). Gesture was seen more broadly as articulatory movements of any part of the body. This concept was derived from the work of the cognitive psychologist Ulric Neisser (1967, pp. 156, 161):
To speak is to make finely controlled movements in certain parts of your body, with the result that information about these movements is broadcast to the environment. For this reason the movements of speech are sometimes called articulatory gestures. A person who perceives speech, then, is picking up information about a certain class of real, physical, tangible (as we shall see) events that are occurring in someone’s mouth. […] Since articulatory events are motions of certain parts of the body, speech perception has something in common with perceiving other bodily motions, like those of dancers and athletes. In particular, the perception of facial expressions, nonverbal cues, “body language,” and the like must be continuous with it. There is every reason to believe that speech perception begins just as one aspect of the general perception of other people’s movements.
The ideas presented in Gesture and the Nature of Language were also informed by cognitive linguistics, and in fact the work was instrumental in bringing together a nexus of gesture–language–sign research informed by cognitive linguistics. Cognitive linguistics dramatically changes how we view language, and it does so in a way that allows linguists to learn from gesture researchers, research on animal communication, and the study of general perceptual and cognitive abilities – options that were precluded under prior theories.
Although linguists have demonstrated that signed languages are not simply unanalyzable depictive gestures, recent research has begun to explore the relation between signed language and gesture. This work can be classified into three major themes: (1) the historical development by which gestures used in a hearing community are incorporated into a signed language, thus becoming part of the sign linguistic system; (2) the claim that some signs and sign constructions are fusions of linguistic and gestural material, so-called “sign-gesture fusions”; and (3) an analysis of signs and gestures based on cognitive linguistic theory, and in particular, the theory of cognitive grammar (see below).
The first approach is based on grammaticalization theory (Bybee, 2006; Hopper & Traugott, 2003). Grammaticalization is the process by which lexical material becomes grammatical material. Working within this framework, linguists have demonstrated that some of the gestures used in the surrounding language community are incorporated into a signed language as lexical signs. These lexical signs then grammaticalize, forming grammatical signs. Grammaticalization of gestures and signs has been described for a number of signed languages (Pfau & Steinbach, 2011; Shaffer, Jarque, & Wilcox, 2011; Wilcox, Rossini, & Antinoro Pizzuto, 2010; Xavier & Wilcox, 2014).
An example occurs with the gesture meaning “to leave or depart” (Figure 16.1), common in the Mediterranean region (Morris, Collett, Marsh, & O’Shaughnessy, 1979). This gesture appears to have been incorporated into old French Sign Language (LSF) as the sign PARTIR “to depart.” ASL is genetically related to LSF, which has been a historical source for ASL since the early 1800s (Wilcox & Occhino, 2016b; Woodward, 1978). An old ASL sign derived from the old LSF sign PARTIR appears, with both lexical and grammatical meanings, in films of older signers from 1913. The lexical sign DEPART is used in utterances that can be translated from ASL to English as, “At that time, the president of Gallaudet, Edward Miner Gallaudet, departed. A few days prior, he departed/left (to go) to Philadelphia.” We also find the sign, with slightly reduced movement, used to mark future (Janzen & Shaffer, 2002). In the same series of 1913 films, a deaf person giving a lay sermon uses the sign when saying (again translating from ASL to English), “When you understand the words of our Father, you will do that no more.”
Figure 16.1 Depart gesture
Wilcox and Wilcox (1995) demonstrated that gestural forms often serve as the source for lexical signs, which then grammaticalize to modals. The ASL modal sign CAN, for example, is derived from the ASL lexical form STRONG. Long (1918) pointed out that the sign for “strong” is very similar to CAN, noting that the difference lies in the way the hands are moved. For “strength” they are moved somewhat sidewise with a slight circular motion. The source for the lexical sign STRONG is a gesture indicating strength, commonly expressed by moving the fists in an outward or downward motion so as to indicate upper body strength.
An example demonstrating a longer diachronic and cross-linguistic chain occurs with the modal meaning “necessity” (Figure 16.2). The notion of necessity is expressed by the ASL modal form glossed as MUST. The history of this form is more complex. It appears in modern ASL as a downward-moving bent index finger. In LSF the form IL FAUT “it is necessary” is similar, except that the index finger is straight and the palm orientation is to the side rather than down. In nineteenth-century LSF, the straight index finger is used but the entire hand points down. Ultimately, the form appears to have had as its gestural source a downward-pointing finger. This gesture was used in classical antiquity to indicate “in this place” and “insistence.” Dodwell (2000, p. 36) discusses this gesture from ancient Roman times, which he calls an imperative and which “consists of directing the extended index finger towards the ground.” The gesture was ascribed a modal sense by Quintilian, who noted that “when directed towards the ground, this finger insists” (ibid.). Insistence is semantically related to the modal notion of necessity.
Figure 16.2 LSF sign IL FAUT and ASL modal sign MUST
Wilcox (2004, 2009) proposed two routes by which gesture is incorporated into a signed language. The first route, described above, begins when a manual gesture enters a signed language as a lexical sign and develops through grammaticalization into a grammatical morpheme. The second route proceeds along a distinctly different path. The source is not the manual gesture itself; rather, it is the way that a manual gesture is produced – the sign’s manner of movement – as well as various facial, mouth, and eye gestures that may accompany a manual gesture or sign. Here, too, we see gesture as a source. For example, Quintilian observed that when the hand is thrown out gently it promises and declares assent; when it is moved more quickly, it is a gesture of exhortation or sometimes of praise.
Upon entering the linguistic system, these manner-of-movement and facial gestures follow a developmental path from paralinguistic (prosody or intonation) to grammatical marker. As an example of the grammaticalization of manner of movement, in many signed languages the manner in which a modal sign is produced indicates the strength of the modal. Jarque (2006) notes that manner of movement is used to indicate differences in modal strength and to mark deontic versus epistemic function in Catalan Sign Language. Modal strength is also marked in Italian Sign Language by weak or strong articulation of the base movement.
Manner of movement also appears as a marker of verb aspect. Pizzuto (1987) observed that temporal aspect can be expressed in Italian Sign Language via systematic alterations of the verb’s movement pattern, specifying, for instance, the “suddenness” of an action by means of a tense, fast, short movement (e.g. the distinction between to meet and to suddenly/unexpectedly meet someone). Conversely, a verb produced using an elongated, elliptical, large, and slow movement specifies that an action is “repeated over and over in time” or “takes place repeatedly in time” (e.g. to constantly telephone or to always be on the telephone). Similarly, in ASL, manner of movement marks verb aspect (Klima & Bellugi, 1979).
The second route also characterizes the grammaticalization of facial displays. Facial displays play a significant role cross-linguistically in signed languages. In addition to expressing emotion, facial gestures mark a variety of grammatical functions such as interrogatives, topics, adverbials, conditionals, imperatives, and more. These facial displays often begin as gestural expressions. For example, brow furrow is well documented as a display of physical or mental exertion. Darwin (1872, p. 221) noted that brow furrow marks “the perception of something difficult or disagreeable, either in thought or action.” Brow furrow marks a number of grammatical meanings across several signed languages, including wh-questions, imperatives, and root or deontic modality.
Recently, some sign linguists have proposed that certain signs are not purely linguistic; rather, they claim that these signs and sign constructions are better described as “language-gesture fusions” (Fenlon, Cooperrider, Keane, Brentari, & Goldin-Meadow, 2019; Hodge & Johnston, 2014). One such claim concerns pointing signs. Pointing signs are quite common across signed languages, functioning as deictic and anaphoric pronouns, possessive and reflexive pronouns, demonstratives, locatives, body part signs, and indicating verb agreement. The gesture-language fusion claim is most clearly made in the case of personal pronouns. Meier and Lillo-Martin (2013) observe that the first-person pronoun in ASL is fully specified phonologically: a point to the center of the signer’s chest. However, they claim that the locations to which non-first person signs point cannot be enumerated in a listing of sublexical phonological units: Signers point to an open-ended set of locations. Since location is a phonological prime, having an open-ended phonological inventory of locations is not possible. Thus, they conclude (Meier & Lillo-Martin, 2013, p. 163) that “first-person points, but not non-first points, can be specified entirely in terms of the phonological units that form lexical signs. […] In contrast the locations in space of non-first person points appear to be gestural inasmuch as the direction of pointing is – when the signer is referring to an individual who is present at the conversation – determined by the referent’s physical location in the environment.” This same argument is extended to the marking of verb agreement, in which agreement is marked by location in space.
The gesture-fusion claim requires that linguists identify criteria by which gesture can be reliably and objectively distinguished from language. One proposal (Sandler, 2009) is that gestures are holistic and synthetic; they lack hierarchical and combinatoric properties; they are idiosyncratic – different speakers, or even the same speaker, may use different gestures to represent the same image; and they are context-sensitive, their interpretation dependent on the linguistic context. Linguistic structure, on the other hand, is componential, combinatoric, and hierarchically organized. Linguistic signs such as words are highly conventionalized in form, meaning, and distribution. These criteria, however, pose problems. Language is both conventional and unconventional – innovation by definition establishes new, unconventional expressions. So is gesture: Emblems or recurrent gestures are conventional, while idiosyncratic or innovative gestures are, by definition, not conventional. Language is also gradient. As Bybee (2010, p. 2) observes, “All types of units proposed by linguists show gradience, in the sense that there is a lot of variation within the domain of the unit (different types of words, morphemes, syllables) and difficulty in setting the boundaries of the unit.”
Another option has been to propose a modality-free definition of gesture (Okrent, 2002), based on degree of conventionalization (How conventionalized must something be in order to be considered linguistic?), site of conventionalization (What kinds of conventions are linguistic conventions?), and restrictions on combination (What kinds of conditions on the combination of semiotic codes are linguistic conditions?). Answers to these questions are, however, not offered, nor is it clear how they could be objectively answered. For these and other reasons, not all sign linguists accept the language-gesture fusion claim (Quer, 2011; Wilbur, 2013). Dotter (2018) presented a strong rejection of the claim that gestural components are combined with language elements in essential areas of signed language grammar.
As we have seen, the relation between gesture and sign has a complex history, informed by different perspectives on sign itself and on gesture. Müller (2018) offers a comprehensive overview of the history of gesture studies and sign linguistics, with special attention to the relation between gesture and sign. First, Müller reconstructs the history of gesture studies, focusing on the seminal work of Kendon, McNeill, and Goldin-Meadow. She traces Kendon’s view that gesture and sign are one gestural medium of expression, or “utterance visible actions,” detailing the functional similarities between gesture and sign. Müller points out that McNeill, on the other hand, claims that gesture and sign exhibit sharp discontinuities. Müller attributes this position to McNeill’s decision to restrict the concept of gesture to spontaneously used gestures. A third position described by Müller is that presented by Goldin-Meadow and Brentari (2017), which strengthens the discontinuity view, framing the distinction such that sign and speech are categorical, while gesture is imagistic. Müller rejects the assumptions that gestural equals imagistic and that there is a clear-cut boundary between the categorical and the gestural. She concludes by suggesting the need to view gesture and sign dynamically, both in historical relation and in multimodal interaction. Although Müller essentially aligns with Kendon’s continuity position, she rejects Kendon’s proposal of the term “utterance visible actions,” preferring instead to retain the term “gesture,” understood as “deliberate expressive movements.” Ultimately, she suggests that “the question of how gesture and sign relate critically depends on the notion of ‘gesture’ employed” (Müller, 2018, p. 16).
Concerning the historical relation between sign and gesture, Müller concludes that research such as that discussed above suggests a “dynamic, continuous and ongoing process of historical change, where no cataclysmic break is involved, and no sudden rupture transforms gesture into sign from one moment to another” (Müller, 2018).
Recently, Wilcox and his colleagues (Wilcox & Occhino, 2016a) have proposed an account of pointing that does not require invoking gesture. Their approach uses a dynamic usage-based model, specifically cognitive grammar (Langacker, 1987, 2000, 2008). Cognitive grammar claims that grammar and lexicon form a continuum of symbolic assemblies composed of phonological structures, semantic structures, and the symbolic links between the two. Phonological, semantic, and symbolic structures are abstracted from usage events: “instances of language use in all their complexity and specificity” (Langacker, 2008, p. 547). Symbolic assemblies vary along two dimensions: schematicity and complexity. Schematicity pertains to level of detail or precision; schematic elements are elaborated or instantiated by more specific elements. Symbolic structures combine to form higher-level symbolic structures, or symbolic assemblies. Through repeated combination, symbolic assemblies of high complexity may be formed.
In this analysis, pointing is a complex symbolic assembly, a construction consisting of two component structures: a pointing device and a Place (Figure 16.3). Both of these are symbolic structures: They consist of a semantic pole and a phonological pole. The pointing device serves to direct attention; this is its schematic meaning. The schematic phonological pole of a pointing device is any articulator capable of directing attention. This may be a pointing finger, but eye gaze and even body torso orientation may serve as the phonological pole of a pointing device. The Place (the term appears with an initial capital letter to indicate that it names the entire symbolic structure) is the entity at which attention is directed. The schematic semantic pole of a Place structure, which must be specified in an actually occurring usage event, is the referent. The schematic phonological pole of a Place structure is a location in the spatial surroundings. Places appear in a number of constructions, including both deictic and anaphoric expressions (Wilcox & Occhino, 2016a), as well as reported dialogue and so-called “agreement” constructions (Wilcox, Martínez, & Morales, 2022). Place structures thus unify two distinct types of phonological locations: those internal to language, and those external locations which in other analyses are regarded as gestural. Places also subsume the traditional distinction between deixis and anaphora. Talmy (2018, p. 1) writes that, “Broadly, an anaphoric referent is an element of the current discourse, whereas a deictic referent is outside the discourse in the spatiotemporal surroundings.
This is a distinction made between the lexical and the physical, one that has traditionally led to distinct theoretical treatments of the corresponding referents.” Talmy offers an account in which language engages the same cognitive system for both speech-internal and speech-external referents. Places provide the comparable symbolic resource for languages produced in visual space. For signed languages, a referent at a location in the spatiotemporal surroundings is not outside of the discourse; rather, both deictic and discourse referents occupy phonological locations in the spatiotemporal surroundings.
Ruth-Hirrel and Wilcox (2018) apply the pointing device and Place analysis to speech-gesture constructions. They focus primarily on complex symbolic assemblies consisting of pointing constructions and beats that accompany speech. Beat gestures have been formally characterized as “biphasic movements of the hands” (Biau & Soto-Faraco, 2015). These movements typically involve a “simple flick of the hand or finger up and down, or back and forth” (McNeill, 1992, p. 15); however, beats may also be performed using other body parts, such as the head or eyebrows (Krahmer & Swerts, 2007). Researchers often claim that beats have no semantic content (Alibali, Heath, & Myers, 2001; Biau & Soto-Faraco, 2013; Özçalışkan & Goldin-Meadow, 2009). Researchers acknowledge that beats serve emphatic functions and are closely tied to information structure, comparing beats to an “all-purpose highlighter” superimposed on more objective content (McNeill, Levy, & Duncan, 2015).
In the cognitive grammar analysis proposed by Ruth-Hirrel and Wilcox, beats are symbolic structures; thus, they have both phonological and semantic import. Langacker (2001) expands on the notion of symbolic structures, noting that such structures incorporate multiple channels. The semantic pole consists of several conceptualization channels, including speech management, information structure, and objective content. Speech management includes such functions as holding the floor and turn taking. Information structure includes emphasis, discourse topic, and given versus new information. Objective content is the conceptualization of the situation being described by a linguistic expression. The phonological pole consists of several vocalization channels. The core vocalization channel for speech is segmental content. Other channels include intonation and gesture.
Ruth-Hirrel and Wilcox claim that beat gestures are symbolic structures, but significantly they are phonologically and conceptually dependent structures, requiring autonomous structures for their expression. Semantically, beats are dependent structures in the information structure channel, making reference to some more objective autonomous content, the information that is emphasized or highlighted. Phonologically, beats are expressed as manner of movement, requiring an autonomous gesture carrier for their articulatory expression. This is canonically specified by the movement of a hand, since manner of movement is dependent on movement (there is a further level of dependency, because movement requires some entity such as a hand). It is also possible for the movement of any more substantive and autonomous structure, such as a head, to serve as the phonological carrier for a beat.
Ruth-Hirrel and Wilcox show that simple beat gestures, as well as beat gestures coexpressed with pointing gestures, are used to direct attention to meanings in speech that are associated with salient components of stancetaking acts. Their account reveals both that beats have meaning and that there is a symbolic motivation for the apparent “superimposing” of beats onto pointing gestures and for their integration with speech.
5 Sign and Gesture Revisited
One result of the contemporary linguistic analyses of signed languages and gesture has been a blurring, or indeed the loss, of any categorical distinction between sign and gesture. Noting these difficult issues, Kendon (2017, p. 30) concluded that “‘gesture’ is so muddied with ambiguity, and theoretical and ideological baggage, that its use in scientific discourse impedes our ability to think clearly about how kinesic resources are used in utterance production and interferes with clarity when comparing signers and speakers.” Kendon has gone so far as to propose abandoning the categories gesture and sign altogether, instead focusing on a comparative semiotics of what he terms visible bodily action as it is used in utterances by speakers and by signers.
The dynamic usage-based approach offers a new way to reframe our understanding of sign and gesture (Occhino & Wilcox, 2017; Wilcox & Occhino, 2016a). The world comes to us unlabeled. We perceive not “sign” or “gesture” but perceptible usage events – Kendon’s visible bodily actions or Neisser’s articulatory events. These actions, as perceptual events, are categorized by language learners. In the present case of sign and gesture, we can restrict our focus to deaf language learners. Stokoe (1960, pp. 6–7) adopted this user-centered deaf viewpoint from the very start:
To take a hypothetical example, a shoulder shrug, which for most speakers accompanied a certain vocal utterance, might be a movement so slight as to be outside the awareness of most speakers; but to the deaf person, the shrug is unaccompanied by anything perceptible except a predictable set of circumstances and responses; in short, it has a definite “meaning”.
Having both a perceptible form and a meaning (in cognitive grammar parlance, we would say having a phonological and a semantic pole), the shoulder shrug poses a categorization problem to be solved by the deaf observer: How does this symbolic structure fit into his emerging and dynamic understanding of communicative performance? The units that compose an individual’s linguistic knowledge (i.e. grammar) are related to actual expressions that are perceived in usage events by the process of categorization. In the theory of cognitive grammar, the process works in this way (Langacker, 2000). A particular target of categorization activates a variety of established units, the activation set. For any given usage event, taking into consideration all the linguistic and contextual factors, the language user must search the activation set for the member that will categorize the target. Members of the activation set must in a sense compete; the winner becomes the most active member of the set and the active structure which categorizes the target. A number of factors determine which member will become the active structure. One factor is degree of entrenchment of the member, which influences its inherent likelihood of activation and thus of being selected. A second is contextual priming, both phonological and semantic. A third is amount of (phonological or semantic) overlap between the target and a potential categorizing structure. These three factors are primarily linguistic, but we must also include many other factors in order to fully understand the categorization process (Wilcox & Occhino, 2016a).
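The competition among the three primarily linguistic factors can be illustrated with a toy computational sketch. Everything here is an illustrative assumption rather than part of cognitive grammar itself: Langacker’s account does not specify a numerical model, so the additive scoring, the Jaccard measure of overlap, and all names and weights (`Unit`, `categorize`, the feature sets) are hypothetical devices for making the competition concrete.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    """An established symbolic unit in the observer's grammar (toy model)."""
    entrenchment: float   # inherent likelihood of activation (assumed 0..1)
    priming: float        # contextual priming, phonological and semantic (assumed 0..1)
    features: frozenset   # toy stand-in for the unit's form/meaning content

def overlap(a: frozenset, b: frozenset) -> float:
    """Toy measure of phonological/semantic overlap (Jaccard similarity)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def categorize(target: frozenset, activation_set: dict) -> str:
    """Return the 'active structure': the member of the activation set that
    wins the competition on an (assumed) additive combination of the three
    factors named in the text."""
    scores = {
        name: unit.entrenchment + unit.priming + overlap(target, unit.features)
        for name, unit in activation_set.items()
    }
    return max(scores, key=scores.get)

# Hypothetical example: a deaf observer categorizing a perceived shoulder shrug.
grammar = {
    "SHRUG (symbolic unit)": Unit(0.8, 0.5, frozenset({"shoulder-raise", "uncertainty"})),
    "incidental movement":   Unit(0.2, 0.1, frozenset({"shoulder-raise"})),
}
winner = categorize(frozenset({"shoulder-raise", "uncertainty"}), grammar)
# The well-entrenched, fully overlapping unit wins the competition.
```

The design point the sketch captures is that categorization is selection under competition, not table lookup: the same perceptible event could be categorized differently by an observer whose units have different entrenchment or priming values.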
In the current case we must include such factors as individual variability (age of exposure to signed language, level of hearing loss); social variability (whether the observer comes from a deaf or hearing family, type of education, accessibility of signed language in the general environment); and cultural variability (society’s attitude toward sign and gesture in general, for example, Navajo culture and Neapolitan culture exhibit very different attitudes toward gesturing; the categories provided by a culture for naming such perceptual events). As van Hoek (1997) notes, these categorizing judgements determine if the construction is an instantiation of a particular schema or an extension from that schema. With a small amount of conflict, the construction may be judged to be an acceptable innovation, but a significant conflict will cause signers to judge the construction to be anomalous, that is, not a part of their grammar. Thus, the process of categorizing a target structure, be it a shoulder shrug, a facial display, or any other visible bodily action, is none other than the linguistic process of judging whether a perceived structure is well formed with respect to others, that is, whether it is a part of the observer’s dynamically changing grammar.
In all cases, the key to answering the question “Is it a sign or a gesture?” in the context of visibly perceptible usage events lies with the observer, not the observed. It requires that we stop assuming that these events form natural categories – that is, categories that exist in nature, independent of language users. Instead we must reframe the question and adopt an approach which acknowledges that deaf observers categorize perceptual events.
Clifford Geertz (1973, p. 6) offered an example of how a visible perceptual usage event is categorized:
Consider two boys rapidly contracting the eyelids of their right eyes. In one, this is an involuntary twitch; in the other, a conspiratorial signal to a friend. The two movements are, as movements, identical; from an I-am-a-camera “phenomenalistic” observation of them alone, one could not tell which was twitch and which was wink, or indeed whether both or either was twitch or wink. Yet the difference, however unphotographable, between a twitch and a wink is vast; as anyone unfortunate enough to have had the first taken for the second knows. As [Gilbert] Ryle points out, the winker has not done two things, contracted his eyelids and winked, while the twitcher has done only one, contracted his eyelids. Contracting your eyelids on purpose when there exists a public code in which so doing counts as a conspiratorial signal is winking. That’s all there is to it: a speck of behavior, a fleck of culture, and – voilà! – a gesture.
In the context of understanding sign and gesture, we might rephrase Geertz and say: the same speck of behavior with a fleck of cultural, contextual, and background knowledge and the act of categorization by the deaf observer, and – voilà! – sign (in this case, a grammatical facial display). Like Stokoe’s shrug and Geertz’s wink, the visible bodily actions of usage events are the very stuff from which language is made. The labeling of visible bodily actions as sign or gesture is, as Geertz would say, a matter of determining what counts as what. Language, gesture, and sign are historical-cultural constructs, folk classifications that may or may not be relevant to deaf language users. The relevant question to be examined is the dynamically emerging knowledge – or as a linguist would call it, the grammar – of deaf language users. The key is to not forget the observer. Geertz is helpful even here: We must see things, he tells us (1974), “from the native’s point of view” – in this case, from the point of view of the individual deaf language user observing and categorizing visible bodily actions. The categorization of these usage events is an individual user’s cognitive activity. The linguist’s task is to discover the user’s categories.
Adopting a user-based cognitive linguistic perspective may produce a paradigm shift in the study of signed language and gesture. As Geertz pointed out, “small facts speak to large issues, winks to epistemology” (Geertz, 1973, p. 23). For linguists, the small facts are intricately complex usage events; the larger epistemological issue is the construction of a grammar. If linguists are to understand how language is constructed by users from usage events, we must begin by compiling thick descriptions of actual discourse usage events in all of their expressive and conceptual complexity. This demands that linguists incorporate cultural, social, and historical data into our linguistic theories.
Ultimately, the question of what is sign and what is gesture may not be a scientific but an ethnoscientific one. The answer lies not in finding observable, I-am-a-camera photographable differences between “sign” and “gesture” independent of who is doing the observing and classifying; rather, it is a matter of what counts as sign or gesture from the deaf person’s point of view. This, in turn, depends on received folk classifications that are handed down and change over time, that vary across cultures and across the individuals doing the classification. Whether deaf and hearing people share the same folk classifications of what counts as sign and what counts as gesture – if indeed they even have such categories – is an open question, although research suggests that the answer is likely to be quite complex (Kusters & Sahasrabudhe, 2018). From winks and shrugs and facial displays to depictive shapes and movements of the hands; from whether a hand is moved gently or with a sudden, quick movement; from highly innovative signed constructions expressing an actor’s action in chopping down a tree and the tree’s personified emotional reaction to being chopped to highly conventional lexical signs – these are the photographable raw data. As linguists and gesture scholars, our task is to meticulously describe this input to the categorization process and explore how the process plays out, how grammars are dynamically constructed, grow, and change. Echoing Kendon, it appears to this sign linguist that the label gesture, with its historically laden ideological baggage, contributes little to the task at hand.
1 Introduction
When, how, and why do people gesture? The Cambridge Handbook of Gesture Studies offers many answers to this question from diverse theoretical and methodological perspectives on the relation between language and gesture. The topic introduced in this chapter pulls together several strands of research that have highlighted gesture’s relation to notions and processes that are traditionally seen as “grammatical.” In particular, rich observations on gesture’s link with negation have featured in the work of several key thinkers and texts, and can therefore be said to have played a role in shaping contemporary gesture studies. Rather than emphasizing the spontaneity and idiosyncrasy of co-speech gestures, for instance, studies of gesture’s association with negation have shed light on regularities in gesture form, function, and linguistic organization, and in turn, offered evidence for the multimodality of grammar, the embodiment of cognition, and our bodies’ “potential for language” (Müller, 2013, p. 202).
As any linguist knows, negation involves lexical and grammatical patterns that determine word order, operate on the semantics of the utterance, and play a role in the conceptualization and pragmatics of speech acts, such as rejections, disagreements, denials, and objections (Horn, 1989). What is also well known in gesture studies is that central to these patterns is an array of gestures that speakers perform in association with their linguistic and pragmatic expression of negation (Calbris, 1990, 2011; Harrison, 2018; Kendon, 2004; Lapaire, 2006a). The headshake might immediately come to mind (with its notorious cultural variations; Harrison, 2014a), as well as various gestures of the hands striking outwards from the body or towards the addressee in a holding motion.
Over the last ten to fifteen years, increasing attention has been paid to these gestures associated with negation. Their forms and functions characteristically recur among members of a given linguistic and cultural community in clearly identifiable discourse contexts with relatively stable meanings. However, these gestures are somewhat distinct from “emblems” or “quotable gestures” (as discussed in Payrató, this volume). They are often considered to be exemplars of the recurrent gesture category (Harrison & Ladewig, 2021; Ladewig, 2014, this volume). Studies of gestures associated with negation have arguably helped in paving the way for a reconceptualization of the nature of gesture and its relation to linguistic structures, as well as of our understanding of common ground between spoken and signed languages, the multimodality of language, and the embodiment of cognition.
With the wider issue of grammar–gesture relations in the background, the first task of this chapter is to chart the territory of gestures associated with negation: what is known about their forms, organizational properties, and functions related to linguistic negation (explicit and covert), discourse-pragmatics, and interaction, as discernible from their study in a wide variety of linguistic communities (Section 2). Then, the range of discourse domains, interactive contexts, and methodological perspectives in which studies of gestures associated with negation have been conducted will be considered (Section 3). The empirical and theoretical contributions that such studies have made can then be evaluated by reporting the uptake of relevant research findings across different areas of the linguistic, social, and cognitive sciences (Section 4). Throughout these sections, the aim is to not only show what has been discovered, but also to disclose areas that seem ripe for further development and which raise open empirical questions.
2 Gestures Associated with Negation
In their entry for the Stanford Encyclopedia of Philosophy, Horn and Wansing (2020) write that “Negation is a sine qua non of every human language.” As a linguistic universal, all languages have grammatical forms and structures that express negation, the subtleties of which have been widely documented and debated in decades of linguistic, pragmatic, and psycholinguistic research (e.g. for an accessible introduction to English negation, see Huddleston & Pullum, 2005, Ch. 8; for Mandarin Chinese negation, see Li & Thompson, 1989, Ch. 12). What had not been as closely studied until relatively recently were the forms, meanings, and structures of gestures that can be observed when people express negation in face-to-face spoken communication, and more specifically, the intricate ways in which linguistic and gestural forms and structures relate during the expression of negation. The current section aims to convey what is known about recurrent gestures associated with negation by reporting where in the world they have been observed so far, the typologies that have been established in the literature, and our understanding of the forms, functions, and organizational properties of these gestures.
2.1 Geographical Coverage
Focusing on spoken languages, the following coverage of gestures associated with negation compiles and abstracts from different kinds of research – typologies, descriptive studies of individual gestures, experiments, etc. – conducted along different themes and subjects in relation to particular languages (Table 17.1). This coverage is also visualized with a Google map, with pins representing the location of studies or of the linguistic communities under study (Figure 17.1). The main criterion for inclusion in this overview was that researchers observed and discussed the relation between gesture and negation in the locations where the studies were conducted; distinctions concerning different forms and form variants of gestures/gesturing were set aside for the present purposes. Distinctions concerning regional uses of language have been made where research permits, such as in different varieties of Spanish. Several papers listed in this review are discussed in more detail in later sections of the chapter.
Table 17.1 Widespread observations of gestures associated with negation, classified by language family
Figure 17.1 Geographical coverage of the attested relation between gesture and negation in spoken languages
Figure 17.1 is no doubt incomplete, being restricted to the studies published in the languages that I can read (or could confidently cross-reference) and those that I could locate. Several gestures described in A world guide to gestures by Morris (1994) are associated with negation and either located geographically or described as having “widespread” locality. They include the Head Shake (Widespread; p. 144), Palm Thrust (Greece; p. 191), Palms Front (Worldwide; p. 195), Palms Wipe (Widespread; p. 198), and others. Similarly, Darwin’s (1872) The expression of the emotions in man and animals is a rich resource for culturally specific observations of various head and hand gestures associated with negation that Darwin noted on his voyages or in correspondence with peers (see Cooperrider, 2019). However, the observations by Morris and Darwin could not be easily pinned to the map above.
Disclaimers aside, Table 17.1 and Figure 17.1 combined reveal a number of research hotspots as well as data deserts, which may help identify gaps for future studies to fill. Observations of gestures associated with negation in locations not covered here will add incrementally to established findings and, moreover, hopefully extend or challenge the picture of these gestures that is emerging. It is this picture to which we can now turn.
2.2 Forms, Organizational Properties, and Functions of Gestures Associated with Negation
This section’s first port of call is the landmark work on gestures with pragmatic functions in Italian and English by Kendon and his studies on the Open Hand Prone gesture family (Kendon, 1995, 2002, 2004, 2017), while research into the semiotics of French gesture by Calbris then introduces an alternative perspective on similar gesture forms (Calbris, 1990, 2003, 2005, 2011, 2013; Calbris & Copple, this volume). These landmark observations provide the departure point for various lines of research that have subsequently added to our understanding of gestures associated with negation.
2.2.1 Context-of-Use, Kinesic and Semantic Core, Underlying Action
In a highly influential study by Kendon (2004), associations between gesture and the expression of negation were observed as part of characterizing the use of recurrent gestural forms in the Open Hand Prone family. In this family of forms, “the forearm is always in a prone position so that the palm of the hand faces either toward the ground [‘ZP’ gestures] or away from the speaker [‘VP’ gestures], depending upon how the elbow is bent” (p. 248). Analyzing examples of these gestures in a video corpus of Italian speakers (supplemented with several examples of English speakers), Kendon (2004) observed that gestures in the Open Hand Prone family occurred in discursive contexts “where something is being denied, negated, interrupted or stopped, whether explicitly or by implication” (p. 248), as well as “in contexts where a speaker gives an extreme positive evaluation of something” (p. 249). For each example, Kendon scrutinized the specific form of the gestures and their timing in relation to the verbal utterance, and analyzed their potential semantic and pragmatic contributions to the discursive context-of-use (on this methodology, see Kendon, 2004, p. 226; Müller, 2004).
Kendon’s (2004) findings led him to argue that the formational core of gestures in the Open Hand Prone family expresses a “core semantic theme” of “halting, interrupting, or indicating the interruption of a line of action” (p. 281). This theme, he proposed, was derived from the manipulatory action of the hands that motivates the core form of the gesture (see work by Calbris, Streeck, and Müller below). Gestures with the palm oriented downwards and “swept” laterally (ZP gestures) “perhaps derive from the action of cutting something through, knocking something away or sweeping away irregularities on a surface, as in rubbing out any marks or traces of something,” whereas for gestures with the palm raised vertically and oriented toward the addressee (VP gestures), “the actor engages in a schematic act of stopping something or holding something back” (Kendon, 2004, p. 263). In contexts that seemed to be expressing meanings that are positive, Kendon argued that the performance of these gestures with their core semantic themes may be making explicit the underlying expression of an implied negative. Observations of headshakes in these contexts were also included (Kendon, 2002). This research was foundational in flagging up gestures with the Open Hand Prone formation as candidates for studying relations between gestures and negation.
2.2.2 Analogical Links between Gestures and Negation
In a study that has been similarly influential on our understanding of gestures associated with negation, the French semiotician Calbris arrived at the association between gesture and negation from a different perspective to that of Kendon. In the context of characterizing the symbolic import of certain physical components of gestures in French, Calbris discovered the salience of a number of such components to the expression of concepts related to negation (Calbris, 1990, 2003, 2005, 2011, 2013).
During Calbris’ career-long study of symbolic relations between gestures and notions, negation played a central role in developing and illustrating several key constructs, namely “gesture variants,” “kinesic ensembles,” “polysemous gestures,” and “polysigns” (Calbris, 2011, p. 24). While “gesture variants” refer to the finding that one notion may be expressed by several different gestures, which may be performed simultaneously resulting in a “kinesic ensemble” (or “cumulative variant”), the construct of “polysemous gestures” refers to gesture forms that may convey a range of different meanings, sometimes expressed simultaneously resulting in a “polysign.”
Defining negation “as an act of the mind that consists in refusing a relation, a proposition, an existence, and as the process of refusal” (Calbris, 2005, p. 2; my translation), Calbris has examined the different gestures that French speakers use to express negation and developed an explicit typology of them (Calbris, 1990, 2005, 2011, 2013). This typology is presented succinctly in Calbris (2005), which contains nine “variantes gestuelles de la négation” (gestural variants of negation), namely: three head gestures (backwards toss, lateral sweep, and shake), three gestures with the vertical palm shaped flat (either raised, swept laterally, or oscillated), two with the extended index finger (raised, oscillated), and one involving the level hand moved horizontally.
Focusing on Calbris’ treatment of the variants involving the vertical palm and the level hand will illustrate further constructs while distinguishing her approach from Kendon’s (on this distinction, see Calbris, 2011, pp. 282–284). For Calbris, the Palms Forward (Kendon’s “VP”) and Level Hand (Kendon’s “ZP”) are not only gesture variants of negation, but also examples of “polysemous gestures,” negation being only one of their meanings, because the gestures can express a variety of notions on different occasions of use. Thus, what Calbris calls the Palms Forward gesture illustrates a semantic derivation from a singular analogical link to a physical action of the outwards-turned hand: self-protection, which depending on the context, “may express the notions of ‘opposition,’ ‘prudence,’ ‘refusal of responsibility,’ ‘stopping,’ ‘requesting someone to wait,’ ‘agreement,’ ‘refusal-negation,’ ‘objection,’ ‘restriction,’ or ‘perfection’” (Calbris, 2013, p. 666). While this is a similar idea to the notion of a core semantic theme being derived from the underlying action motivating the gesture form (Kendon, 2004), the gesture that Calbris calls the Level Hand illustrates a different relation, not of a core semantic theme, but one of plural motivation. Which component of the Level Hand is salient or profiled (its movement trajectory or its shape configuration) determines the analogical link established with a physical correlate, which in turn is the basis for the gesture’s meaning and array of subsequent semantic derivations, including: superlative, perfection, determinism, certainty, negation, refusal, cutting, and equality (Figure 17.2) (see also Calbris & Copple, this volume, Section 2.2.2).
Figure 17.2 Plural motivation of the “Level Hand” gesture
Spelling out this network of plural motivations for the Level Hand leads Calbris to a different analysis than positing a core semantic theme derived from a single schematic action as per Kendon (2004). This different position can be exemplified for the case of the gesture’s occurrence in contexts where positive assertions are being made. Whereas Kendon posits the expression of an implied negative in such contexts, which is made explicit through the performance of the gesture (enacting a “cutting,” “knocking,” “sweeping,” “rubbing”), Calbris posits a different analogical link based on singling out a physical component of the gesture’s form. As per Figure 17.2, the Level Hand’s expression of a superlative or perfection derives from the meanings of “quantity” and “totality,” which are motivated by the horizontal movement of the gesture. This is not an enactment or representation of a manual action but conveys the concept of “everywhere.” Similarly, the expression of negation by the Level Hand gesture is not related to the gesture’s origin in an action of knocking or sweeping away (as per Kendon), but to the meaning of “stop-refusal” represented as an obstacle, conveyed by the resistance of the downward-facing palm. “In short, the interpretation of a co-speech gesture supposes not only an appreciation of the contextual situation but also a physical understanding of the gesture and of the underlying symbolic system” (Calbris, 2011, p. 284, emphasis in original).
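The structure of plural motivation can be made concrete with a small data-structure sketch: each salient form component establishes its own analogical link to a physical correlate, which grounds a distinct set of meaning derivations. The encoding below is an illustrative assumption, not Calbris’s formal notation, and it includes only links stated in the surrounding discussion; the dictionary keys and the helper `meanings_for` are hypothetical names.

```python
# Toy encoding of "plural motivation" for the Level Hand gesture:
# form component -> analogical link (physical correlate) -> derived meanings.
LEVEL_HAND = {
    "horizontal movement": {
        "physical_correlate": "covering an expanse ('everywhere')",
        "derived_meanings": ["quantity", "totality", "superlative", "perfection"],
    },
    "downward-facing palm (resistance)": {
        "physical_correlate": "obstacle / stop-refusal",
        "derived_meanings": ["negation", "refusal"],
    },
    "hand edge": {
        "physical_correlate": "edge that cuts",
        "derived_meanings": ["cutting"],
    },
}

def meanings_for(component: str) -> list:
    """Meanings that become available when a given form component is profiled."""
    return LEVEL_HAND[component]["derived_meanings"]
```

The contrast with a core-semantic-theme account falls out of the shape of the data: here the meanings fan out from several independent component-to-correlate links, rather than all deriving from one underlying action.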
It seems that some gestures associated with negation may have a single, identifiable origin in a manual action, such as the Vertical Palm’s connection to “schematic stopping.” This may be the case for a number of other gestures that have been shown to express negation. The “brushing aside” gesture observed in Spain (Teßendorf, 2014, 2016) and the “dusting off palms” gesture observed in Nigeria (Will, 2018) seem to be candidates. Variations in the performance of other gestures associated with negation, such as the Level Hand (ZP), may not be variations on an underlying action motif, but variations on analogical links that are based on fundamentally different signifiers (i.e. movement through the visual field, finger tips that draw lines, palms that resist/cover, and the edge that cuts).
Several aspects of the work discussed so far have been the subject of further studies of gestures associated with negation, including the relation between gestures and physical actions, the connection to negation in the linguistic utterance, and the occurrences of these gestures with the expression of explicit and implicit negation.
2.2.3 Relations to Aspects of Physical Action
In their typology of recurrent gestures identified in a corpus of German speakers, Bressem and Müller (2014a, 2014b) distinguished a family of recurrent gestures clearly associated with the linguistics and pragmatics of negation. These gestures were crucially shaped by a shared formational characteristic – movement of the hand away from the body – which, according to Bressem and Müller (2014b), revealed a common origin in an underlying “action scheme” motivating meanings and functions associated with negation. Specifying this scheme further, Bressem and Müller (2014b) stated that “Gestures may reproduce perceptually salient aspects of instrumental actions and extract distinctive elements of the action by comparing, selecting, and recombining physically pertinent elements” (p. 1600).
The action scheme underpinning all gestures included in the Away family, Bressem and Müller (2014b) have argued, is “the effect that actions involving the clearing of the body space have in common: Something that was present has been moved away – or something wanting to intrude has been or is being kept away from intrusion” (p. 1596). Following the convention of referring to a recurrent gestural form based on the assumed underlying action, the recurrent gestures in the Away family were named the “brushing away gesture,” the “throwing away gesture,” the “holding away gesture,” and the “sweeping away gesture.”
Bressem and Müller articulated this relation between a physical action scheme and gestures based on analyses of spoken language use among adults and have subsequently argued that it could be the basis for a multimodal construction in German (Bressem & Müller, 2017; see Section 4.2 below). Building on this line of work, Gawne (2021) has observed a gesture among some speakers of Syuba in Nepal with a rotated “away trajectory” (thus similar to ‘brushing away’) that is “used only with grammatically negative forms” (p. 4) and “used to indicate the absence of someone or something” (p. 9). This relation between aspects of actions and gestures has been the focus of other research into gestures associated with negation, such as investigations of children’s early language development and of the connections between grammar and gesture.
2.2.4 Developmental Pathways into Multimodal Negation
While the research discussed so far has been based on video recordings of adult speakers, the multimodality of negation has caught the attention of French language acquisition researchers for its potential as a rich case study for investigating multimodal language development. For this, researchers have adopted a usage-based, video corpus methodology to explore children’s acquisition of negation with longitudinal data from a range of linguistic and interactive contexts (e.g. Beaupoil-Hourdel, 2015; Beaupoil-Hourdel et al., 2016; Morgenstern et al., 2018).
Central to this body of research is the Paris Corpus (Morgenstern & Parisse, 2012). Hour-long video recordings were made at monthly intervals of five monolingual French children, from just a few months of age up to several years, filmed with high-quality video and audio equipment in naturally occurring settings for child–parent interaction, then transcribed and annotated by research teams in the CHAT format (Morgenstern & Parisse, 2007). Articulating the notion of a “multimodal pathway,” five stages of development in these children’s expression of negation have been identified, with careful consideration of the caregiver input that was also transcribed in the data (Beaupoil-Hourdel, Morgenstern, & Boutet, 2016). With qualitative and quantitative analyses, the researchers have characterized this pathway as a gradual transition between “non-symbolic actions” (of rejection, avoidance), “symbolic actions” (gestures), and multimodal linguistic constructions similar in complexity to those of adult speakers (Morgenstern et al., 2018). By making these distinctions, the findings also demonstrate a gradual specification of the functional roles of the gestural and linguistic components being combined in the children’s expression (Beaupoil-Hourdel et al., 2016). The researchers have proposed to view the acquisition of multimodal negation as evidence for the children’s development of linguistic and cognitive skills (Beaupoil-Hourdel, 2015).
Though focusing more on patterns of multimodal negation than on individual gestures, these studies have highlighted the salience of “PalmUp-shrug” and “indexWave” gestures in the acquisition of negation (Beaupoil-Hourdel & Debras, 2017; Beaupoil-Hourdel & Morgenstern, 2021; Blondel et al., 2017; Morgenstern, Beaupoil-Hourdel, Blondel, & Boutet, 2016). Building on previous descriptions of similar forms and their association with the expression of notions related to negation (e.g. Calbris, 1990; Kendon, 2004; Streeck, 2009), studies in this line of research have advanced our understanding by proposing kinesiological analyses of these gestures, central to which is an intrinsic frame of reference. Describing a gesture relative not to an observer’s frame of reference but to its own articulatory physiology, these analyses view gestures as muscular impulses constrained by the biomechanics of the human skeleton (cf. Boutet, 2008, 2010, 2015, 2018).
3.2.5 Kinesiological Perspectives on Negation
In his “formal analysis of gestural negation,” Boutet (2015) observes that previous descriptions of gestures associated with negation have adopted an egocentric frame of reference. They have restricted the description of such gestures to three-dimensional space (x, y, z) and overlooked an essential characteristic of the gestural negation system related to the physiology of human bodies. The bodily articulation of a given gesture from finger to shoulder can involve seventeen reference points (“degrees of freedom”). Taking the example of gestures described by Kendon and others as “Vertical Palm” and “Horizontal Palm,” and adopting the perspective of the gestural articulator, the negative meanings would derive not from the action motif said to motivate the form of the gesture (i.e. the hand sweeping aside for the HP or schematically stopping for the VP), but from the physiological configuration that remains constant or “invariant” across all manifestations of this gesture: the pronation of the palm (Figure 17.3).
Figure 17.3 Invariant feature in different orientations of the palm: a. pronation/palm down, b. pronation/palm forward, c. pronation/palm sideways
Explaining this invariance (Figure 17.3), Boutet points out that from an egocentric reference point, “the forearm and the arm are in a different position for the three gestures,” a distinction that has been important in previous work adopting that frame of reference; from an allocentric perspective, however, “the hand does not change position relative to the forearm, being in a position of pronation across the three cases” (Boutet, 2015, p. 118). In this approach, the invariance of the position of the hand (relative to the articulatory segments of the forearm and arm, not to the observer) is what gives the different gestures their shared semantic theme, allowing the analyst to relate embodiment to negation without the intermediary of an underlying action. Boutet’s analysis here is similar to Streeck’s, who also posits the invariant in these motions as their conceptual core, attributing form variations such as the orientation of the gesture and the position of the hand (supine-prone) to the speaker’s embodied adaptations to the interactional context in which the gestures are made (Streeck, 2017, Ch. 5). From Boutet’s (kinesiological, intrinsic) perspective, the role of the palm’s orientation, either away from the speaker’s body or facing down, in distinguishing “different gestures” may have been exaggerated in the previous literature (see further discussion in Beaupoil-Hourdel & Morgenstern, 2021, and in Boutet & Cienki, this volume).
Additional explanations for the form and organization of gestures associated with negation have been proposed based on studies that attend to the grammatical constructs of negation.
3.2.6 Kinesic Organization in the Grammar–Gesture Nexus
Building on sensory-kinesic approaches to negation (Lapaire, 2006a; see Section 4.2), with a video corpus of spoken English and form-based methods of gesture analysis (Müller, Bressem, & Ladewig, 2013), in Harrison (2018) I reported on the collection and analysis of a corpus of English spoken utterances that all exhibited what Horn (1989) describes as “the traditional criteria for negativity – the presence of a negative particle, its appearance in a specified syntactic location, and so forth” (p. 34). A second criterion for inclusion in the corpus was that the utterance involved the performance of a gesture from the Open Hand Prone gesture family (which I assumed to be potentially related to the expression of negation; Calbris, 1990; Kendon, 2004). Based on grammatical and gestural microanalysis of over eighty examples, I proposed a number of distinctions in the form–function relations of such gestures, as well as a principle governing their temporal organization with speech.
First, I distinguished between three form variations of Palm Down gestures associated with different kinds of negation or negative speech acts and, by analyzing several examples, argued that their performance with these linguistic utterances supported the view that each gesture reproduced a variation on a manual action. As illustrated in Figure 17.4, the associations observed were between clausal negation and “sweeping away” for the Palm Down Across, exclusions and “clearing aside” for the 2-Palms Down Mid, and rejections and “cutting through” for the 2-Palms Down Across.
Figure 17.4 Three Horizontal Palm gestures based on different underlying actions: PDAcross (“sweeping away”), 2PDmid (“clearing aside”), and 2PDAcross (“cutting through”)
Close study of these gestures at the level of the utterance revealed how speakers may organize the different phases of the gestural action (i.e. preparation, stroke, and holds) in relation to the grammatical structures in the accompanying speech, which can be viewed as creating “sync points” for gestures associated with negation (Harrison, 2018, Ch. 3). Specifically, this research has shown how the manual gestures were organized in relation to the node and scope of negation (Harrison, 2010) and to negative polarity items and negative focus (Harrison, 2013), as well as how manual, head, and linguistic elements were orchestrated in kinesic ensembles (Harrison, 2014b). Extending notions of lexical and conceptual affiliation between speech and gestures (McNeill, 2006), these examples of negation, expressed and organized multimodally, were offered as supporting evidence for a grammatical affiliation between speech and gesture (Lapaire, 2006a). The organization of gestures in the Open Hand Prone family in relation to the node and scope of negation has offered a basis for cross-linguistic comparison. One study found a similar pattern in French, whose syntax for negation at the sentence level overlaps to some extent with that of English (Harrison & Larrivée, 2016). In a study of negative utterances in the Syuba language of Nepal (Gawne, 2021), however, the holding of gesture over the scope of negation was not found to occur. Syuba is “a verb final language,” which may explain the difference in gesture organization because there is “less content within the scope of the negation that follows the verb” (Gawne, 2021, p. 18).
Gesture studies of other key grammatical phenomena relating to negation will be discussed in the sections below, including implicit/covert negation, single versus double negation, and quantification, negation, and scopal ambiguity. These studies develop further explanations for aspects of gestures associated with negation.
4 Explanations for the Occurrence of Gestures Associated with Negation
What is driving the relations between gestures and negation? What is the conceptual and functional role of gestures associated with negation in spoken language production and perception? A growing body of published studies converges on answers to these questions. Linguistic, cognitive-semantic, functional, embodied, cultural, and psycholinguistic explanations can be identified.
4.1 Gestures as Semantic Operators
Several researchers have observed that gestures associated with negation may also be performed in contexts of use where the spoken utterance exhibits no visible or “surface” manifestation of linguistic negation. These observations have led some researchers to propose explanations for why such gestures occur.
Kendon’s (2002) analyses of the headshake in a corpus of examples from speakers of Italian and English revealed several such contexts. His examples included utterances structured with certain adverbs (e.g. “only”), evidentiality markers (e.g. “obviously”), statements without exceptions, superlatives, and intensified expressions (e.g. declaring that something was “marvelous” or “wonderful”; pp. 172–173). Considered alongside examples of the headshake in contexts of explicit verbal negation, Kendon (2002) concluded that “it seems that we can interpret [the headshake] as an operator that does semantic work similar in many ways to the work done by the various verbal particles of negation” (p. 180). How the headshake was placed in relation to speech further revealed it to be used as “an expression in its own right,” operating somewhat freely in relation to different parts of the verbal utterance “according to the rhetorical needs of the moment” (p. 180). Taking a more cognitive perspective on the expression of negation, other researchers have related these operations to the manifestation of a cognitive domain.
4.2 Manifestation of a Cognitive Domain
Based on a typology of gestures originally identified as co-occurring with grammatical negation in a twenty-hour corpus of TV interviews conducted in Israeli Hebrew (Inbar & Shor, 2017), Inbar and Shor (2019) present a subsequent study in which they report a “typology of verbal utterances that do not contain markers of grammatical negation and that may be accompanied by gestures associated with grammatical negation in spoken Israeli Hebrew” (p. 87). They identify six “patterns,” which, building on previous work, they call the “headshake,” “sweeping away,” “holding away,” “hands up gesture,” “finger wagging,” and “shoulder shrug.” Finding that “all the contexts revealed are connected to negation on some cognitive level” (Inbar & Shor, 2019, p. 93), they argue that when these gestures occur with utterances that show no surface form of negation, the gestures are a manifestation of an underlying cognitive domain of negativity. According to Inbar and Shor (2019), such utterances include those with “words with [a] negative meaning component” (p. 88); they are found in “contexts of intensification” (p. 88), with “indefinite modifiers and hedging expressions” (p. 90), “discourse particles that imply negation or restriction” (p. 90), and “conversational implicatures arising from the verbal utterance” (p. 92).
By analyzing the semantic import of the gestures in these contexts, Inbar and Shor (2019) argue that the “gestures indicate a higher abstract notion, namely ‘negativity’ rather than negation” (p. 94). They account for the occurrence of these gestures with utterances containing no surface expression of negation as the gestural manifestation of the speaker’s underlying cognitive domain. Gestures thus seem to offer a diagnostic of the cognitive domains at play, which may underpin, or have been “bleached” from, the explicit linguistic expressions.
4.3 Functional Load Sharing and Encoding Strategies
The prevalence of Open Hand Prone gestures with utterances showing no verbal negation has been the basis for alternative perspectives, such as the “functional load sharing” proposed by Wegener and Bressem (2019). Their data were observations of the “sweeping away” and “holding away” gestures in a six-hour corpus of narratives, procedural texts, and interviews among fifteen speakers (mainly male) of Savosavo, a non-Austronesian language spoken on Savo Island in the Solomon Islands. In this study, the search domain for the sweeping away and holding away gestures was identified as utterances containing lexemes of explicit and implicit negation. Wegener and Bressem (2019) found “rather few instances of gestures associated with negation co-occurring with explicit verbal negation,” and instead, “the ‘sweeping away’ gesture, is used mostly with implicit (lexical or pragmatic) negation/negativity and only rarely accompanies explicit verbal negation” (para. 1). The researchers propose a functional explanation for these findings, which can be called the “sharing the load” view: “When explicit verbal negation is used, it bears the main functional load. When explicit verbal negation is absent, negation/negativity is emphasized and made visible through gestures” (Figure 17.5).
Figure 17.5 Functional explanation: “sharing the load”
Wegener and Bressem’s diagram illustrates “the interplay between verbal and gestural negation,” and they argue that the link between gesture forms attested to occur with negation (such as Vertical and Horizontal manifestations of the Open Hand Prone family) may not be as tight or as frequent as suggested in the literature; on that basis they advise “don’t look for negation gestures where verbal negation is used” (ibid., Results). As with Inbar and Shor (2019), the assumption that the “sweeping away” and “holding away” gestures are expressions or manifestations of negation is central to this explanation, though Wegener and Bressem propose to view the relation between the gesture form and negation as a more explicit one of encoding. While this view does not entertain Calbris’ (2011) notion of polysemous gestures, the functional perspective finds some support in a rich chain of psycholinguistic experiments.
4.4 Psycholinguistic Perspectives
Functional explanations such as load sharing and encoding strategies are consistent with a line of psycholinguistic research that has examined the interaction between negation and gesture using experimental methodologies. Explicitly building on naturalistic observations that negation receives multimodal expression in spoken language usage, researchers have designed experiments to investigate the role of gesture and prosody in the perception, interpretation, and comprehension of negative utterances (Brown & Kamiya, 2019; Ferré & Mettouchi, 2020; Li et al., 2016; Prieto et al., 2013; Prieto & Espinal, 2020; Tubau et al., 2015). The results of these studies offer evidence that gestures associated with negation (coupled with certain prosodic patterns) may function as “cues” that guide the addressee’s interpretation of negative meaning, which might otherwise be ambiguous if only the verbal sentence were taken into account.
One chain of these studies has been conducted by Prieto, Borràs-Comes, Tubau, and Espinal. In their 2013 article, they focus on the negative particles ningú (Catalan) and nadie (Spanish), which in response to a question that includes a sentential negative (such as “Who did not eat dessert?”) can be interpreted as meaning either “nobody” (single negation) or “everybody” (double negation). In a first step of the study, utterances reflecting both of these meanings were elicited on camera from native speakers. Representative patterns of prosody and gesture that co-occurred with either single negation or double negation (identical in Catalan and Spanish) were taken as the stimuli for a subsequent perception study. These gestural patterns included the headshake and two-handed horizontal palm for single negation (“nobody”) and a shrug with headshake or nod for double negation (“everybody”). Software was then used to create versions of each type of negation integrated with either congruent or incongruent gestural/prosodic patterns, which were presented to naïve participants in auditory-only (AO), visual-only (VO), and audiovisual (AV) conditions. The results established that “prosodic and non-verbal cues (i.e., gestural patterns) crucially affect the interpretation of isolated n-words” (p. 147).
Subsequent studies have built on this finding and, applying similar experimental paradigms, have supported the crucial role of prosodic and gestural patterns in the interpretation of answers to negative yes/no-questions in Catalan (Tubau et al., 2015) and of rejections of negative assertions/questions in Mandarin Chinese (Li et al., 2016). Brown and Kamiya (2019) focused on sentences in English that are ambiguous in scope. Scopal ambiguities are known to arise in sentences that include both a quantifier (e.g. many, most, all) and a negative particle (e.g. not, -n’t), which can be “semantically ambiguous sentences with multiple interpretations” (ibid., p. 4). The researchers found that gestures play a facilitative role in the interpretation of the scopal ambiguities notoriously associated with negation. Brown and Kamiya (2019) specify that “speakers may manipulate the features of gestural form, placement, and length potentially to help listeners resolve the ambiguities arising from scopal interactions between quantification and negation” (p. 27).
Finally, a line of quasi-experimental research has developed in studies focusing on the production and perception of refusal gestures in Japan, aiming to understand the implications for learners of Japanese as a second or foreign language (Jungheim, 2004, 2008, 2013). Jungheim’s starting point was a specific speech act – refusal – and its nonverbal and culturally specific dimensions, which include what he calls the Hand Fan gesture (following Morris, 1994). Unlike the Vertical Palm gestures described above as being oriented towards the addressee or object of negation, the Hand Fan “is performed high in the central gesture space near the face with the palm facing to the left or right depending on which hand is used,” creating a fan-like motion in front of the face (Jungheim, 2004, p. 135). Native speakers of Japanese perform Hand Fans with refusals, but Jungheim (2004) showed that learners of Japanese as a second language mainly used Vertical Palm gestures instead, and also bowed more than Japanese native speakers. Although learners struggled with the complexity of these “refusal-routines,” they could still accurately interpret refusal gestures performed by Japanese speakers (Jungheim, 2008). Having reviewed various perspectives on gestures associated with negation, we can now turn to work that has considered the theoretical implications and practical applications of this research area.
5 Theoretical and Applied Contributions
Studies of gestures associated with negation are helping to conceptualize the relationships between gesture and language. This can be seen in the uptake of findings concerning gestures associated with negation in discussions of the multimodal nature of grammar (Section 5.1), the embodiment of cognition (Section 5.2), and the relation between gesture and sign (Section 5.3).
5.1 Multimodality of Grammar
Cognitive grammarian Lapaire has long pondered the question: “What is the relationship between gesture and grammar?” (Lapaire, 2011, p. 88). When gesture studies focus on the spontaneity of gestural expression, without attending to recurrent gestures, McNeill’s (2005) conclusion that gestures are “certainly not part of ‘grammar’” may seem warranted (p. 21). However, this position has become problematic for researchers investigating aspects of gesture that appear closely connected to grammar. Such studies are found in areas of applied cognitive grammar (Lapaire, 2002, 2005, 2006a, 2006b, 2016), cognitive linguistic gesture studies (Cienki, 2012, 2015, 2017), multimodal grammar (Fricke, 2012, 2013, 2014), and multimodal construction grammar (Schoonjans, 2017, 2018; Steen & Turner, 2013; Zima & Bergs, 2017). Gestures associated with negation have been integrated with these different developments to greater and lesser degrees.
Lapaire originally posed his question when considering the challenges that a perspective combining Merleau-Ponty’s phenomenology and cognitive linguistics was presenting to traditional grammatical analysis. Salient among these challenges was the overwhelming evidence found by cognitive linguists for the pervasive role of bodies in all aspects of language structure, including the conceptual organization and expression of grammar (Johnson, 1987; Lakoff, 1987; Lakoff & Johnson, 1980; Sweetser, 1990). Treating grammatical notions, processes, and structures not as “mental phenomena” but as “revealers” of embodied meaning, and drawing on his own gesture studies, Lapaire has proposed cognitive-etymological, body motion-based, and manual-haptic models of core grammatical phenomena, including epistemic modality (Lapaire, 2006b, 2013), temporal experience/relations (Lapaire, 2016), and – most relevant to the current chapter – negation (Lapaire, 2006a). Lapaire’s analyses highlight how grammar and gesture share in schematicity, imagery, meaning, and conventionality, leading him to view at least some gestures as “co-grammatical” (Lapaire, 2013), that is, as an embodied dimension of grammar.
Another theorization of the relation between grammar and gesture through the lens of multimodality has led Fricke to propose a “multimodal grammar” (Fricke, 2012, 2014). One of Fricke’s main claims is that multimodality is not only a feature of individual utterances or constructions but also a property of linguistic systems, and thus of grammar in general. The multimodality of grammar means, for example, that structures and processes typically identified as “grammatical” may be more general organizational principles that also determine the form and function of the other modes participating in multimodal expressions. In addition to grammar’s multimodality, studies of gestures associated with negation shed light on the embodiment of cognition.
5.2 Embodiment of Cognition
Gestures are rooted cognitively not only in conceptual structures inside the brain, but also in “human actions and movement experiences more generally (that may culturally vary) in connection with aspects of their practical uses” (Müller, 2017, p. 291, original emphasis). The sensory-kinesthetic experience of a particular movement pattern that characterizes a number of recurrent gestures associated with negation offers a case in point. Several researchers have observed that gestures related to negation may exhibit a movement away from the body (Bressem & Müller, 2014a, 2017; Calbris, 2011; Harrison, 2014b, 2018; Lapaire, 2006a). In addition to “encoding” a conceptual schema or metaphor of “negation as distance” (Chilton, 2014), the movement of recurrent gestures away from the body, co-timed with the verbal expression of negation and repeated across contexts, constitutes for the speaker a real-time, embodied experience of negation as distance (Bressem & Müller, 2017; Harrison, 2014b, 2018; Müller, 2017). The movement of the hand away from the body results in a proprioceptive experience of something no longer being present, and it has been argued that such dynamic, real-time experience has become a basis for the linguistic meanings and functions of gestures related to negation.
This feature of gestures provides evidence that cognition is not only centrally processed in the brain, but is also shaped by certain interactive behaviors and sensory-motor actions. As Müller and Ladewig (2013) specified, such findings advocate a view of cognition in which “gestures in and of themselves are embodied and dynamic conceptualizations” (p. 298; see also Streeck, 2009, 2017).
5.3 Relations between Gesture and Sign
Concerning the forms involved in the configuration of signs for negation, one landmark typology of the signs that express negation is Zeshan (2004). Similarities of form can be found between examples in this typology and in typologies of gestures associated with negation in spoken languages (Boutet, 2015; Harrison, 2018; Lapaire, 2006a; Mesh & Hou, 2018). For instance, linguistic signs related to the expression of negation in various sign languages involve a Vertical Palm with a lateral oscillating movement (Zeshan, 2004). Searching for “no” on the website www.spreadthesign.com provides video clips of signers from China, Ukraine, India, and Estonia performing a sign very similar to the action of wiping away and the Vertical Palm oscillate gesture. Several examples of a sign for negation are based on the Vertical Palm oscillate form in Chinese Sign Language (Yang & Fischer, 2002) and Indonesian Sign Language (Palfreyman, 2019). The “away” movement mentioned earlier is also prominent. In American Sign Language, Bembridge (2016) explains how “The predicates KNOW, WANT, LIKE, HAVE, and GOOD […] are customarily negated through a reverse in the orientation of hand or hands […] a twisting outward or downward movement” (p. 4; see also Liskova, 2012).
Against a background of research bringing gesture and sign into comparative perspective (Harrison, 2018, Ch. 7; Kendon, 2004, 2008; Müller, 2018), several studies have explicitly compared features of negation across signed and spoken languages by considering gestures. For example, Schoonjans (2017) found similarities in the form and organization of “downtoning” stance markers in German multimodal speech and German Sign Language (DGS). Mesh and Hou (2018) identified five “negative conventional gestures” used with clausal, emphatic, and imperative negation (termed WAG, TWIST, PALM-UP, PALM-DOWN, DEAD) by both speakers and signers in a municipality in Oaxaca, Mexico – something which facilitates communication between hearing and deaf people within the community. In Harrison (2018), I studied uses of the Vertical Palm form by a teacher of French Sign Language (LSF) and identified interactive functions shared with speakers, as well as grammatical functions not observed in the spoken language data. In language acquisition research, a developmental perspective on gestures and their relations to signed language has been proposed (Blondel et al., 2017; Morgenstern et al., 2016; Morgenstern et al., 2018).
While gesture and sign are often positioned at opposite extremities of a gesture–sign continuum (McNeill, 1992, 2005), comparative studies of spoken language and signed language negation bring them closer together. Whether we are dealing with sign-like gestures, gesture-like signs, both, or something else will require more research and discussion in the future.
6 Conclusion
This chapter has offered an overview of gestures associated with negation from empirical and theoretical perspectives, developing and challenging a number of themes that have become widely acknowledged in research on gestures associated with negation specifically, and on recurrent gestures and grammar–gesture relations more widely. Given the unquestionably linguistic nature of negation and its centrality to all languages, the area of research discussed in this chapter further attests that gesture is an essential part of human communication, fundamentally intertwined with language at every level. Theories of language and grammar must ultimately be able to account for these relations between gesture and linguistic concepts.

