1 Introduction
Proposals that the semiotic system of gesture played a pivotal role in the evolution of language have been, and continue to be, influential (Żywiczyński, 2018).* This statement, however, illustrates not so much a specific theory as an axis of debate in the field of language origins, along which “gesture-first” proposals traditionally compete with “speech-first” theories (Fitch, 2010). Below this general characterization, there are many differences between gestural origin theories, including the understanding of basic notions such as the concept of gesture itself. Thus, we begin this chapter by discussing definitional issues concerning gesture, with differences both within and across fields. In the context of language origins, such differences in defining “gesture” have profound consequences for formulating and evaluating theories. Since our aim is to survey a considerable number of contemporary gestural theories of the origin of language, we do so by using a new typology, developed with the help of cognitive semiotics (Sonesson, 2007; Zlatev, 2015a), and in particular the notions of semiotic system and polysemiotic communication (Stampoulidis, Bolognesi, & Zlatev, 2019; Zlatev, 2019; Zlatev et al., 2023). In very brief terms, a semiotic system is a combination of signs or signals of a particular type, defined by characteristic properties, and the interrelations between these signs/signals. Universal human sign systems are language, gesture and depiction (the latter understood as forming marks on a two-dimensional surface that resemble three-dimensional or imaginary referents). Signal systems, for example spontaneous facial expressions and non-linguistic vocalizations, are under less voluntary control than sign systems (Zlatev, Żywiczyński, & Wacewicz, 2020). Combinations of different sign and/or signal systems form the basis for polysemiotic communication or polysemiosis.
Applying this conceptual apparatus to gestural theories of language origins, the basic distinction relates to the question of whether the semiotic system of gesture played an exclusive role in early stages of language evolution or whether other semiotic systems were involved as well. A positive answer to the first question implies monosemiotic theories, which we review in Section 3; a positive answer to the second implies polysemiotic theories, to which we turn in Sections 4 and 5. The latter are more commonly known as “multimodal theories,” but we avoid this term due to its excessive ambiguity.[Footnote 1] Importantly, we distinguish between two kinds of polysemiotic theories of language evolution: (a) equipollent, where language and gesture are considered equally prominent from the outset, and (b) pantomimic, where gesture played the main but not exclusive role in breaking from predominantly signal-based to sign-based communication.
After reviewing the evidence for each of the three kinds of gestural origin theories, we conclude that the last kind, namely pantomimic theories, appears to offer the most viable account of language origins. Further, it has the benefit of accounting for the evolution of polysemiotic communication as a whole.
2 Different Ways of Understanding Gesture
The work of Adam Kendon is hugely influential within gesture studies, and so is his characterization of gesture as bodily “movements that partake of […] features of manifest deliberate expressiveness to an obvious degree” (2004, p. 14). “Expressiveness” implies that the communicator intends something by a gesture; “deliberate,” that this is done with a communicative intent; and “manifest,” that these features are to be discernible by an audience. As the notion of deliberateness is controversial, and is denied, for example, by McNeill (2012), gestures may be more generally defined as “expressive movements performed by the hands, the head, or any other part of the body, and perceived [predominantly] visually” (Zlatev, 2015b, p. 458, emphasis in original). How are gestures to be distinguished from other visually perceived communicative movements, such as adaptors and facial expressions (e.g. Żywiczyński, Wacewicz, & Orzechowski, 2017)? An intuitive proposal of how to delineate this “lower boundary” of gesture is presented by Andrén (2010), who argued that two dimensions of gestural meaning need to be distinguished: communicative explicitness (CE) and representational complexity (RC). Within each, there are three different levels and at least one of the two dimensions should be on the third level for the bodily act to count as a gesture. For example, the wave-bye gesture does not represent a goodbye but performs it, so it lacks the highest level of RC. Yet, it is typically produced with the communicative intent to be understood as taking farewell, and thus has the highest level of CE. On the other hand, an act of symbolic play performed in solitude would also qualify as a gesture, since by definition it is on the highest level of RC (if it is to be “symbolic”), even though it lacks a communicative intent completely. Most iconic (i.e. resemblance-based) gestures used in face-to-face communication would have the highest levels of both CE and RC.
Other researchers of human gestural communication adopt a narrower definition of gestures. Kendon’s approach encompasses so-called orofacial gestures: communicative movements of facial muscles and the tongue other than articulatory movements of speech. These would not be regarded as gestural by many (e.g. Orzechowski, Wacewicz, & Żywiczyński, 2014), though see Chovil (this volume). McNeill (1992, 2012) would further limit (prototypical) gestures to spontaneous and idiosyncratic hand and arm movements that are functionally integrated with speech, which is a very limited definition, based on his theory of the nature of gesture (Section 4.2).
On the other hand, within primatology, gestures are typically understood even more broadly than they are by Kendon, as concerning any communicative behaviors that involve body posture, facial expressions, and manual movements, and are mainly perceived visually (e.g. Pika, 2008a; Tomasello, 2008).[Footnote 2] Voluntary control (Byrne et al., 2017) and the related properties of flexibility (Bard, Maguire-Herring, Tomonaga, & Matsuzawa, 2019) and plasticity (Pollick & de Waal, 2007) are commonly used to differentiate gestures from other behaviors. For example, to initiate grooming, infant chimpanzees indicate the place they want to be groomed at, first by looking at this spot and then touching it (Bard et al., 2019). This example also illustrates the receiver-directedness of ape gestures, further underlined by persistence: “[I]f the recipient’s response is not satisfactory, […] the signaller […] repeat[s] the produced signal” (Fröhlich, Sievers, Townsend, Gruber, & van Schaik, 2019, cf. Leavens, Hopkins, & Thomas, 2004; Tomasello, George, Kruger, Jeffrey, & Evans, 1985) and elaboration: If the recipient fails to respond in a satisfactory way, a gesture different from the original one may be used.
These findings testify to significant cognitive and semiotic complexity in ape gestures, possibly differentiating them from ape vocalizations. At the same time, this evidence does not amount to full-fledged communicative intent, which implies not just “intention” in the sense of volition, but a Gricean (second-order) intention that the addressee recognize the primary intention (Zlatev et al., 2013), and it is not clear if apes can either produce or recognize such gestures. Further, unlike human gestures with the highest form of representational complexity (see above), ape gestures hardly refer to an object that is distinct from the communicator or the addressee, that is, they are “dyadic” (me–you) rather than “triadic” (me–you–referent) (Hurford, 2007), which prevents them from being full-fledged signs (Zlatev et al., 2020).
As pointed out in the introduction to this chapter, these definitional differences imply that any comparison and discussion of gestural origins theories needs to proceed with care, and preferably with the help of a uniform conceptual apparatus, such as the one we propose below. In each of the following three sections, we review and evaluate one of the three basic types of gestural theories: monosemiotic, polysemiotic-equipollent, and polysemiotic-pantomimic.
3 Monosemiotic Gestural Theories
3.1 General Features
Monosemiotic gestural theories claim that early stages of language evolution exclusively depended on gesture. They share the postulate of monosemiosis with vocal, or “speech-first,” theories, and have complementary difficulties (Section 3.7). The rise of modern language-evolution research in the latter part of the twentieth century was marked by the formulation of gestural hypotheses of language origin, starting with the work of Gordon Hewes and colleagues (1973). This both contributed to the methodology of language-evolution research, such as the use of converging evidence (Żywiczyński & Wacewicz, 2019, pp. 122–124), and defined dominant positions in the field. In the following subsections we review a number of such monosemiotic gestural theories, before proceeding with evaluation.
3.2 Hewes’ Gestural Primacy Hypothesis
Hewes put forward a theory of a gestural protolanguage, termed the Gestural Primacy Hypothesis (Hewes et al., 1973), and suggested how this protolanguage transitioned into speech (Hewes, 1977a). Hewes’ conception of protolanguage can be described as synthetic (as opposed to holistic) (Żywiczyński & Wacewicz, 2019, p. 187) as it takes protolanguage to have consisted of gestures as quasi-lexical units standing for objects and actions that could be combined into sequences, but, overall, lacking syntactic and morphological structure. Some of Hewes’ arguments focus on the observation that in naturally occurring conversation, gestures usually accompany speech; but the bulk of the evidence summoned by him in support of his theory can be summed up as follows:
Anthropological data. Hewes analyzed logs of European travelers, who were apparently able to communicate with indigenous peoples by means of gestures about even highly complex topics, such as topography, dangers that awaited newcomers, politics, or religion.
Comparative data. After multiple failures to teach non-human apes elements of spoken language (Furness, 1916; Hayes & Hayes, 1952; Kellogg & Kellogg, 1933), there followed several successful attempts to teach them visually perceived forms of communication, including elements of American Sign Language (ASL) (Gardner & Gardner, 1969, 1971; Premack, 1970; Premack & Premack, 1974). Based on these findings, Hewes argued that there is continuity between ape and human gestural behaviors, and discontinuity in vocal behaviors (Hewes, 1977a, 1977b; Hewes et al., 1973).
Neurocognitive data. Hewes appealed to evidence from neuropathology that indicated relative immunity of gestural communication in language-related disorders (e.g. Hewes, 1977a, pp. 132–133), and to research on handedness and lateralization, where he drew attention to the fact that right-hand dominance for manual actions coincides with the left-hemisphere dominance for language processing and production (cf. Knecht et al., 2000).
Signed languages. Hewes claimed that sign(ed) languages are universal in the sense that they can appear from scratch thanks to a high degree of iconicity (Hewes, 1977a), which has since been confirmed by studies into emerging signed languages (Meir, Sandler, Padden, & Aronoff, 2010a; Senghas & Coppola, 2001; Senghas, Kita, & Özyürek, 2004).
The combined force of this evidence led Hewes to the conclusion that gestural protolanguage had constituted the first form of hominin[Footnote 3] communication on the path to modern language. In fact, he was the first to use the term “protolanguage” in the technical sense as a transitional system between, on the one hand, the signal-based communication of apes, and on the other, human language. The visionary character of Hewes’ project is also testified to by the fact that it relied on areas of research that have been used as sources of evidence in language-evolution debates until now.
When evaluating the Gestural Primacy Hypothesis, it should be noted that although Hewes did not expressly commit himself to a narrow understanding of gesture as the communicative action of hands and arms (see Section 2), some of his key arguments indicate that he envisaged protolanguage as primarily relying on manual gesture. For example, such is the import of his discussion of the relation between handedness and language, and the way he used signed languages and comparative studies likewise favors a narrow definition of gesture. Moreover, the manual character of protolinguistic communication directly motivates Hewes’ explanation of the so-called volar depigmentation of the inner part of the hand in non-Caucasian populations: that it may be an adaptation for gestural communication, as it increases the visibility of the hands in the dark (Hewes, 1996). Apparently, he ignored the fact that volar depigmentation also affects the sole of the foot.
3.3 Stokoe and Research in Signed Languages
The second part of the twentieth century saw the beginning of modern research on signed[Footnote 4] languages, founded on the postulate that they are not qualitatively different from spoken languages (Emmorey, 2002; Stokoe, 1960; Stokoe, Casterline, & Croneberg, 1965). The pioneers of signed-language linguistics had a keen interest in language origins. For example, early works emphasize that gesture has a greater iconic potential than vocalization, which makes gesture a better candidate for a communicative system based on signs rather than signals (Zlatev et al., 2020) and hence a likely starting point for the evolution of language (Stokoe, 1960). Using insights from the emergence of signed languages, Stokoe argued that the spatial character of gestures could have facilitated the emergence of rudimentary syntax, as gestures are able to represent not only an action but also the agent who performs it and the patient affected by it (Armstrong, Stokoe, & Wilcox, 1995; Stokoe, 1991). It was proposed that nouns were derived from the shape and position of hands and arms, verbs from their actions, and that, collectively, they gave rise to prototypical sentences (Armstrong & Wilcox, 2007).
Hence, Stokoe and his collaborators proposed theories consisting of the following evolutionary stages: (a) gestural protolanguage with holistic and iconically motivated signs, (b) gestural language with discrete but iconically motivated signs and combinatorial syntax, (c) the transition into speech, which promoted conventionalization and growth of syntactic complexity.
3.4 Corballis’ Manual Protolanguage
Corballis’ position on the gestural origin of language is somewhat ambivalent. On the one hand, he understands gesture broadly as comprising a heterogeneous variety of forms of bodily action: from spontaneous hand movement accompanying speech, to glances, postures, and even orofacial gestures of the mouth area (Corballis, 2002, 2003). On the other hand, most of his arguments for the gestural origin of language focus on manual gesture. In presenting them, Corballis organizes and upgrades many ideas put forward by Hewes and Stokoe. For example, he reviews the evidence coming from attempts to teach non-human apes manual-visual and vocal forms of communication, but also points to the fact that primates in general, and apes in particular, acquire manual skills with relative ease. These include not only communicative but also praxic skills (mainly, related to tool use), which contrasts with the difficulty with which they acquire vocal skills (Corballis, 2012). He also brings up the problem of the neural infrastructure of language, focusing on the nature of the primate mirror neuron system and its homology with language circuits in the human brain (Corballis, 2003). In his view, this fact is able to explain both the relative success in teaching apes to communicate gesturally rather than vocally and the correlation between handedness and cerebral asymmetry for language (Corballis, 2012). On this basis, he argues that “manual gesture [is] a natural communication medium” (Corballis, 2013, p. 203) for primates and, hence, the evolution of language must have begun with some form of manual protolanguage.
In relation to modern human communicative capacity, he uses the standard lines of argumentation for the gestural origin of language, referring to (a) the use of the hands as the most natural way to represent events in space and time in the absence of a shared code, and (b) the ready invention of sophisticated signed languages by the deaf (Corballis, 2013). Taking stock of these two points as well as the postulate about the continuity of ape and human gestural behaviors, Corballis posits a scenario according to which language began with an internal capacity to engage in so-called mental time travel – “the mental reconstruction of personal events from the past (episodic memory) and the mental construction of possible events in the future” (Suddendorf & Corballis, 1997, p. 133). For him, the adaptive pressure for the evolution of this ability came with the uncertain and dangerous ecology of the Pleistocene era, which required long-term planning and a suite of other skills (Corballis, 2019). The type of gesture hominins inherited from the Last Common Ancestor with apes was naturally adept at communicating sequences of past and future events. From this, Corballis envisages the gradual evolution of a communicative system, in analogy to the historical emergence of signed languages: It first relied on pantomime – understood as holistic iconic gesture – and later developed conventional signs as well as syntax (Corballis, 2019). In this regard, his scenario resembles the trajectory of language evolution drawn by Stokoe and colleagues.
3.5 Arbib’s Mirror Neuron Hypothesis
The Mirror Neuron Hypothesis of Arbib (2005, 2012, 2016) is one of the most elaborate and empirically best-documented current theories of language evolution. As its name implies, Arbib envisages the mirror neuron system, and in particular the subsystem involved in grasping, as the basis for his evolutionary scenario. These evolved capacities for complex action recognition and complex imitation allowed for the imitation of aspects of observed movements even if they are not part of the imitator’s current stock of actions, thus introducing new variants of actions into the “praxicon,” an individual’s repertoire of actions. When coupled with communicative intentions (according to Arbib, 2016, already present in non-human apes), this developed into the communicative system of pantomime, which Arbib understands as holistic, impromptu gesture, which “allows the transfer of a wide range of action behaviors to communication about action and much more – whereby, for example, an absent object is indicated by outlining its shape or miming its use” (Arbib, 2012, p. 177). The impromptu character of pantomime resulted in its low replicability, as each pantomimic sign had to be invented and interpreted anew. The pressure for communicative effectiveness brought about the conventionalization and segmentation of pantomime, ultimately leading to the emergence of gestural “protosign.” In this way, holistic pantomime was transformed into a synthetic gestural protolanguage, as in the theories reviewed above.
Unlike conceptions of pantomime derived from mimesis theory (see Section 5), Arbib’s theory is consistently monosemiotic: the early stages of language evolution are limited to the semiotic system of gesture: first, holistic pantomime, and then conventionalized gestural protosign. When describing gestural protolanguage, Arbib is not as emphatic as Hewes, Stokoe, or Corballis about the importance of manual gesture at early stages of language emergence. However, the selection of the starting point of language evolution – the Mirror Neuron System for grasping and manual praxic actions attributed to our Last Common Ancestor with monkeys (LCA-m) – and his account of the transition of protolanguage into speech both imply a key role of manual gesture.
3.6 Tomasello’s Pointing and Pantomime
Tomasello (2008, 2009) refrains from articulating a detailed scenario of the evolution of language. The significance of his work for language origins lies in rich empirical evidence of a developmental and comparative nature, which focuses on the emergence of pro-sociality and shared intentionality as prerequisites for language. In terms of communication, uniquely human forms of social cognition are realized, according to Tomasello, in two kinds of gesture: pointing and pantomime. The first is manifest in informative-declarative pointing: Pointing performed with the intention of providing the recipient with new information. “Pantomiming,” the term Tomasello uses in preference to “pantomime,” comprises iconic manual or whole-body gestures, which are used “(i) to indicate that this is the action I want you to perform, or that I intend to perform myself, or that I want to tell you about; and (ii) to request or otherwise indicate an object that ‘does this’ or an object that ‘one does this with’” (Tomasello, 2008, p. 67). Pantomiming is then capable of expressing an open-ended range of meanings, which are action-orientated and displaced from the here-and-now. Thus, such gestures do not have internal morphological structures analyzable into discrete component parts. Similarly, pantomimes themselves are not replacements for words but correspond to larger units that are at least proposition-size. Words can only complete communicative acts by being combined with other words, but pantomiming, in Tomasello’s account, can serve as a complete communicative act with its own illocutionary force.
3.7 Evaluation
Monosemiotic theories tend to focus on three lines of argumentation and corresponding evidence: (a) the gestural and vocal communication of non-human apes, (b) the expressive potential of gesture in contemporary interpersonal communication, and (c) neural links, such as those between hand and mouth. Let us consider each of these in turn.
As pointed out in Section 2, ape gestures are generally understood to be flexible, learned and volitionally produced, unlike ape vocalizations, which are often taken to be instinctive, species-specific, and involving little or no learning. Hence, supporters of gestural theories conclude that there is continuity between ape gesture and human communicative behaviors, and discontinuity between ape vocalizations and human speech. This argument has been backed up by ethological research documenting both the flexibility of ape gestures and the largely inflexible character of ape vocalizations, such as chimpanzee food cries (e.g. Corballis, 2002; Deacon, 1997; Hewes et al., 1973; Scherer, Johnstone, & Klasmeyer, 2003; but see Fröhlich et al., 2019). However, more recent ethological data have complicated this picture, as many primate calls demonstrate “audience effects,” whereby the intensity and rate of calling are regulated by situational context. For example, the manner of producing alarm calls depends on whether somebody is present or not, including the presence of specific recipients (Crockford, Wittig, Mundry, & Zuberbühler, 2012; Crockford, Wittig, & Zuberbühler, 2017). Food calls differ when the quantity of food is large or small (Brosnan & de Waal, 2002), and are “associated with audience checking, gaze alternation and goal persistence” (Fröhlich et al., 2019, p. 7). Primate vocalizations have also been shown to involve “functional reference,” productivity (devoid of compositionality) or tactical deception (Slocombe, 2011). Finally, there is growing evidence that naturally living apes combine gestures with vocalizations, touch, and haptic behaviors (Fröhlich et al., 2019).
However, most researchers still accept that there is a qualitative difference between ape gestural and vocal behaviors, although it may not be as categorical as once assumed (Bard, Bakeman, Boysen, & Leavens, 2014; Hobaiter & Byrne, 2014; Pika, 2008b). The new research does not disqualify monosemiotic gestural theories, but lends stronger support for polysemiotic approaches, which posit that vocalization may have played a non-negligible role in the evolutionary emergence of language.
Arguments about the expressive potential of gesture concentrate on differences between human gestures and the gestural behaviors of non-human apes, the most important being the ability of human gestures to denote objects, actions, and relations between them, which directly bears on the triadic nature of human gestures, in contrast to the dyadic design of ape gestures (see Section 2). There are two main categories of gestures that demonstrate human-specificity in this regard: pointing and iconic gestures. A number of proponents of gestural theories have considered the emergence of pointing an important stepping-stone toward language. Tomasello’s extensive work on pointing demonstrates that it represents an important watershed in the evolution of human cognition and communication: Non-human primates do not point to distal entities in their environments (Tomasello, 2000, p. 358), while human infants as early as in the twelfth month of life perform spontaneous, informative pointing aimed at sharing attention with another person (Liszkowski, Carpenter, Henning, Striano, & Tomasello, 2004). Tomasello holds that this difference results from the lack of cooperative motivations in non-human apes, but of equal importance is the semiotic quality of pointing: Like other human gestures, pointing is triadic in the sense that the communicator intends to bring the attention of the addressee to a relevant object, and intends for the addressee to recognize this, rather than just to look in a given direction (Zlatev, 2008).
The other watershed in the evolution of language highlighted in gestural theories is that of iconic gestures.[Footnote 5] While there is very little evidence that iconic gesture is present in gestural repertoires of non-human apes (Tanner & Perlman, 2017), it appears in different forms of human gestural communication, such as co-speech gestures (McNeill, 1992), pantomiming events (Zlatev, Wacewicz, Żywiczyński, & van de Weijer, 2017) and even in signed languages (Klima & Bellugi, 1979). Arguments used in gestural theories that appeal to the iconic-expressive potential of gesture have been corroborated by experimental-semiotic research, which investigates how people develop novel communication systems in the absence of a shared code (Galantucci, 2009). Studies comparing spontaneous gestures and vocalizations conclusively show that gesture is a much better basis for developing a sign-based communication system than vocalization (e.g. Fay, Arbib, & Garrod, 2013; Fay, Lister, Ellison, & Goldin-Meadow, 2014). Interestingly, such studies show that the combined use of spontaneous gesture and non-linguistic vocalization is not significantly better than the use of gesture alone (Zlatev et al., 2017). Although there are many studies showing that vocalization does have iconic potential (e.g. Ahlner & Zlatev, 2010; Blasi, Wichmann, Hammarström, Stadler, & Christiansen, 2016; Imai & Kita, 2014; Lockwood & Dingemanse, 2015; Perlman, 2017; Perlman & Cain, 2014), the conclusion that iconic gesture is superior to vocalization for communication based on signs, in the semiotic sense of the term, is now well established in the literature (cf. Żywiczyński, 2020). This conclusion has gained further support from research on emerging signed languages (Lepic, Börstell, Belsitzman, & Sandler, 2016; Meir et al., 2010a; Meir, Aronoff, Sandler, & Padden, 2010b; Sandler, Meir, Padden, & Aronoff, 2005; Stamp & Sandler, 2016). Such findings lend support to gestural origins theories in general, and are also consistent with polysemiotic accounts discussed in Sections 4 and 5.
A third kind of evidence used in monosemiotic gestural theories derives from neuroscience. The most elaborate account of this kind is Arbib’s Mirror Neuron Hypothesis. To account for the problem of the transition from “protosign” to “protospeech,” Arbib proposes a mechanism of collateralization, whereby the activity of the area responsible for manual production gradually spilt over into the neighboring areas responsible for vocalization. In this way, the brain came to support protospeech through the invasion of the vocal apparatus by collaterals from the protosign system (Rizzolatti & Arbib, 1998; cf. Arbib, 2005, 2006). This collateralization hypothesis is supposed to explain not only a substantial degree of coupling of gestural and orofacial behaviors, but also segregation between them, manifested in dissociations between limb apraxia, speech apraxia, and aphasia (Arbib, 2006; Vingerhoets et al., 2013). Arbib’s account is interesting but is only able to suggest a neural mechanism that could have brought about the transition, while it fails to identify any evolutionary pressure responsible for it (Fitch, 2010).
The issue of hand–mouth neural links is central to one of the standard arguments used in gestural hypotheses. It rests on the assumption, partly corroborated by neurocognitive research, that hand and orofacial movements are governed by the same, phylogenetically old brain circuits (Żywiczyński & Wacewicz, 2019). Such links appear to be rooted in mouth feeding behaviors (Gentilucci & Corballis, 2006), and are able to explain motoric phenomena like differences in mouth opening depending on whether we hold a small or large object when speaking (Gentilucci & Corballis, 2006) or the activity of articulators resultant from specific manual movements (Higginbotham, Isaak, & Domingue, 2008). All this evidence can be seen as supporting gestural theories, but is by no means limited to the monosemiotic type, and is arguably even stronger support for the polysemiotic theories discussed in the following sections.
An inevitable feature of monosemiotic gestural theories is “the transition problem” (cf. Żywiczyński & Wacewicz, 2019): If language emerged as a gestural phenomenon, why is it that all modern languages are predominantly vocal, with the exception of signed languages? This problem is identified as the key challenge to monosemiotic gestural theories both by their proponents (Arbib, 2005; Corballis, 2002; Hewes et al., 1973) and by their critics (Burling, 2005; Fitch, 2010; MacNeilage, 2008). The only viable strategy to tackle this problem is to point to potential selection pressures facilitating the development of vocal communication despite its original gestural basis. There are a variety of candidates for such selection pressures, the best-known of which are the following:Footnote 6
Speech enables communication with poor visibility or in the dark (Rousseau, 1755/2008).
The voice attracts attention more effectively (Rousseau, 1755/2008).
Speech does not engage hands, thereby allowing their use in practical tasks – work or carrying objects during communication (e.g. Carstairs-McCarthy, 1996).
Speech allows one to teach manual activities, such as tool making (Armstrong & Wilcox, 2007).
The acquisition of speech begins in human foetal life, which grants it a developmental advantage (Hewes, 1996).
Speech is more economical, as articulatory movements require less time and energy than hand, arm, and body movements (e.g. Knight, 2000).
Vocal communication facilitates continuous monitoring of the location of a child, which might have been important in hominins due to their hunter-gatherer lifestyle, and with lack of constant physical contact between mother and child, as is the case in the great apes (Falk, 2009).
Voice is directed at everyone and not only to a specific individual (Tomasello, 2008).
While these proposals are suggestive and deserve more research, it is generally accepted that they can neither individually nor jointly resolve the transition problem for monosemiotic gesture theories. In general, they point to one or another deficit of the visual channel, and could thus be used as arguments for “speech first” theories. Fitch (2010) criticizes the majority of the arguments listed above, as it is easy to find a counter-argument in each case. Gestures may not be visible in the dark, but they are visible by firelight, and can be used in the tactile modality, as done by visually impaired signers. Further, the visual channel gains an advantage in long-distance or noisy communication, and successfully attracts attention in these situations. Although speech frees the hands and arms, gestures free the mouth, which was very significant in the Palaeolithic period, given that fossil data show that hominins intensively used teeth to chew hard foods and perform various mechanical operations. Importantly, the argument concerning the energetic effectiveness of speech is convincing only to the extent that speech and gesture are independent of one another, as in the pantomimic theories discussed in Section 5, but not according to the equipollent theories (Section 4), since if speech is (almost) necessarily accompanied by gesticulation, this way of communicating would be at least equally, if not more, costly than gestures alone.
Further objections could be raised against theories that trace the beginnings of language exclusively to the sign system of gesture. Regarding Armstrong and Wilcox’s point (2007, see above), in teaching manual activities, verbal instructions are less effective than demonstration or physical guidance of the learner’s hands. The suggestion of Hewes (1996) is undermined by developmental data showing equal paces of spoken and signed language acquisition. Falk’s (2009) idea of the vocal contact between mother and child does not require speech but just the emission of sound. Tomasello’s (2008) point is compelling as far as open information sharing is concerned, but gesture allows more accurate choice of addressee, which is important in less cooperative contexts and, further, is less at risk of being discovered by enemies and predators (Wacewicz & Żywiczyński, 2008).
There have been attempts to address the transition problem not by indicating adaptive pressures that could have affected the changeover from gesture to speech, but by highlighting the interaction between the two semiotic systems. Goldin-Meadow (2008) points out that in modern face-to-face communication, (a) combinatorial-segmented information is usually transferred by speech, while (b) holistic-imagistic information is usually transferred by gesture. Further, while gestures can become lexicalized and grammaticalized to communicate (a), as in signed languages, speech is less predisposed to perform (b). This proposal can be formulated also along the following lines (Brown, 2012): Speech is less capable of iconic representation, for which there is solid evidence, as pointed out above, and this may have been the reason for the transfer from the hypothetical gestural protolanguage toward speech when the need for a larger vocabulary and more combinatorial structure arose. This argument is, in our view, the most promising for explaining the transition, but it should be noted that it presumes that at least some vocalization was present already for the shift from gesture to speech to commence. Hence, it is more properly seen as an argument in favor of pantomimic theories (see Section 5).
To sum up, there seems to be no convincing solution to “the transition problem” for purely monosemiotic gestural theories, given that the proposed solutions face counter-arguments or are underpowered in terms of evolutionary logic. These difficulties have contributed to a growing popularity of polysemiotic theories, which we review in the following sections. Two different kinds of these can be distinguished, which derive from different research backgrounds. Equipollent theories are primarily affiliated with modern gesture studies (e.g. Kendon, 2004; McNeill, 1992, 2005), while pantomimic theories derive from mimesis theory (Donald, 1991, 2001; Zlatev, 2008). The two positions share the assumption that gesture was not the only semiotic system at the origin of human language, but there are important differences between them regarding the role of vocalization, the evolutionary trajectory to modern-day human communication, and the end point of language evolution. Hence, we present and discuss them in two separate sections.
4 Equipollent Polysemiotic Theories
4.1 General Features
The defining property of equipollent origin theories is the postulate of an early integration of gesture and vocalization, with the foundational assumption that gesture and speech form two sides of a single cognitive-communicative system (McNeill, 2005) or process (Kendon, 2014). It should be noted that, while representative, the views of these scholars are not coextensive with those of gesture studies as a whole. Even early research often viewed gesture as a functionally versatile category, including Efron’s (1941) study of “illustrators,” “batons,” and “ideographs,” and Ekman and Friesen’s (1969) study of “regulators.” Further, there was much interest in pantomime (e.g. Kendon, 1992; Laudanna & Volterra, 1991), emblematic gestures and their cultural variability (e.g. Kendon, 1995, 2004; Poggi & Zomparelli, 1987), and adaptive movements reflecting bodily needs, psychological stress, and arousal (e.g. Dittman, 1972; Ekman & Friesen, 1969; Freedman, 1972; Waxer, 1977). It is only more recently that attention seems to have shifted specifically to co-speech gestures: gestures that are temporally and semantically integrated with speech (Kendon, 2004; McNeill, 1992). It is from this more recent approach within gesture studies that the most prominent equipollent language-origin theories derive.
4.2 McNeill’s Growth Point
Probably the best known of this class of theories is the scenario proposed by McNeill, which zooms in on the key notion in his account of gesture, the Growth Point (McNeill, 1992, 2005, this volume). In McNeill’s model, speech and gesture are coexpressive, but at the same time semiotically distinct, and responsible for the transmission of different aspects of the message: speech for propositional content and gestures for imagistic content. According to McNeill, the semantically most prominent element of the utterance comes at the stroke (i.e. the most pronounced phase) of a gesture. In this way, the Growth Point, the basic unit of thinking, becomes externalized.
This idea is also central to McNeill’s theory of language evolution, the critical moment of which is the integration of gestural and vocal communication, both at the level of cognition and expression (McNeill, 2012). The claim is that language originated from the coming together of vocalization and gesture to form a propositional-imagistic dialectic. The critical element of this process was what he calls “twisting” of mirror neurons, whereby they began “to respond to one’s own gestures, as if they were from someone else” (McNeill, 2012, p. 65). To support this idea, McNeill paraphrases Mead (1934/1974): “[…] a gesture is a meaningful symbol to the extent that it arouses in the one making it the same response it arouses in someone witnessing it” (McNeill, 2012, p. 180). As this gestural system was coorchestrated with vocalization, the Growth Point emerged.
It should be noted that McNeill does not provide any evolutionarily realistic pressures that could have been responsible for any of these changes. In fact, he suggests two conflicting accounts of how speech started, deriving it either (a) from movements “originally for ingestion, [which] could be orchestrated in new ways, by gesture imagery” (2012, p. 65), or (b) from the type of polysemiotic communication that is found in extant non-human apes, such as “chimp gestures with vocalization” (2012, p. 195). Although McNeill refers to the “twisting” of mirror neurons and the voice-gesture integration as adaptations, he actually describes them as saltational leaps (Goldschmidt, 1982),Footnote 7 not unlike Chomsky’s idea of a lucky mutation giving rise to the operation of Merge, which first endowed our ancestors with a language of thought and then with the communicative use of it (Berwick & Chomsky, 2015).
4.3 Kendon’s Languaging
Kendon also proposes that the emergence of language crucially depended on the integration of speech and gesture but opts for a more gradual and evolutionarily realistic scenario. First, his notion of a “speech-kinesic ensemble” is less categorical about both the definition of gesture (see Section 2) and the functional interplay between speech and gesture (2014). His proposal is formulated in terms of a “dynamic orchestration of communicative action,” which extends far beyond the category of co-speech gestures and embraces any deliberately communicative bodily movements (hence, the use of the term “kinesic”), including postural shifts, eye contact, or facial expressions (Kendon, 2004, 2011). Likewise, “speech” is not confined to a purely linguistic capacity responsible for the transmission of propositional information, but extends to vocal means of expressing emotional-imagistic content, as in the case of paralinguistic features (e.g. emotional prosody) or iconic vocal phenomena, as in ideophones, phonesthemes, reduplication or word lengthening (Kendon, 2008). Languaging, in line with his notion of utterance, is a dynamic process (Kendon, 2004) that involves a tight integration between speech and gestures to the effect that “words and gestures labor together to produce virtual objects that serve as conceptual expressions” (Kendon, 2014, p. 168; see also Kendon, this volume). Unlike McNeill, who postulates a strict functional division between these two semiotic systems, Kendon submits that the interaction between them is dynamic and flexible, with one or the other being dominant depending on the social or environmental context, including factors such as the level of noise (Kendon, 1985).
Applying this framework to language origins, Kendon posits that the beginning of the human-specific communicative system of languaging was marked by the coming together of speech and gesture (Kendon, 2004). This is a gradualistic scenario that identifies multifarious factors both for the emergence of speech and gesture, and for their conjunction. On the one hand, Kendon highlights the importance of “communicative action” for the emergence of language, whereby vocal behaviors and gestures acquired representational functions. As the hominin ecology favored close-distance face-to-face interaction, they came to be used jointly and in the course of time merged into one process of meaning-making (Kendon, 2014). On the other hand, he points to “the original praxic nature of language” (Kendon, 2017, p. 165), when he speculates that many gestures derive from object-handling actions.
However, Kendon’s evolutionary theory does not spell out a clear solution to the problem of why speech is the dominant system in the ensemble of languaging, the analog to the transition problem for monosemiotic theories. Rather, Kendon appeals to various arguments, starting from more physiologically oriented ones, such as hypotheses concerning orofacial movements as an evolutionary “bridge” between manual gesture and speech, or the neural links between hand and mouth, appealing to Arbib’s Mirror Neuron Hypothesis (see Section 3.5), to more semiotic ones, such as the role of various forms of sound symbolism in bootstrapping vocal signs.
4.4 Evaluation
Equipollent theories stress the polysemiotic, or more specifically bisemiotic, nature of modern language as well as its evolution. They argue for tight integration between the semiotic systems of speech and gesture, to the effect that they form one overarching system, such as McNeill’s “imagery-language dialectic” or Kendon’s version of the notion of “languaging.” The postulate of gesture-speech equipollence in modern human communication is accompanied by what McNeill refers to as the equiprimordiality of gesture and speech, whereby language is thought to have begun with the integration of vocalization and gesture, which jointly assumed representational and communicative functions. In this regard, McNeill posits a saltationist scenario of the sudden emergence of the Growth Point, while Kendon argues for a long-drawn-out and multicausal process of voice and gesture integration. This latter approach is similar to views articulated by Goldin-Meadow (2008) and Sandler (2013), whose theories may also be regarded as equipollently polysemiotic.
Such theories effectively obviate “the transition problem” that burdens monosemiotic gestural theories (cf. Kendon, 2011), which is an advantage. However, they struggle to explain the adaptive pressures that would have brought about a strong form of gesture-speech integration, as well as the dominant role of speech in hearing populations. As they take language to derive from semiotic systems that became integrated in the course of hominin evolution, both McNeill and Kendon reject the possibility that language and gesture may have had independent evolutionary trajectories. Their evolutionary theories can be seen as growing out of the strong uniformitarianFootnote 8 conviction that, as Kendon puts it, the way language is today “it must have been in its beginnings” (2011, p. 103). But it is hardly uncontroversial to claim that language (or languaging) necessarily involves the integrated use of the vocal and gestural channels (e.g. Vigliocco, Perniss, & Vinson, 2014).
There are in fact stronger and weaker positions on gesture-speech integration among equipollent theories. McNeill puts forward a more extreme version, whereby speech is impossible without gesture: “the core is gesture and speech together. […] They are united as a matter of thought itself. Even if for some reason a gesture is not externalized (social inappropriateness, physical difficulty, etc.) the imagery it embodies can still be present, hidden but part of the speech process” (McNeill, 2012, p. 19). A weaker version of the thesis, more consonant with Kendon’s position, has been elaborated by Kita, who argues that speech and gesture are distinct psychological processes that interact (e.g. Kita & Özyürek, 2003, p. 30). According to such an account, speech and gesture are governed by separate but tightly interrelated production mechanisms: an “action generator” responsible for the production of gesture and a “message generator” responsible for speech (Kita, 2000).
There are other issues with this type of theory as well. On the one hand, proponents of equipollent theories argue for the division of labor between the two parts of the integrated system; on the other hand, they obliterate the distinction between them, downplaying, for instance, the fact that speech performs the dominant role in the transfer of referential information in the “speech-kinesic ensemble.”Footnote 9 Equipollent theories also disregard the fact that the compositional nature of language can manifest itself not just in speech but also in various other subsystems, such as those involving the visual modality (writing), the tactile modality (e.g. Braille), or the haptic modality (e.g. Tadoma, the tactile lipreading of the deaf-blind). The growing amount of signed-language literature, particularly on emerging signed languages, has sometimes been used to support the equipollent position (e.g. Sandler, 2013), but, as Kendon rightly points out, it actually goes against it, by challenging the idea of a watertight integration between speech and gesture (Kendon, 2009).
A final problem with equipollent gesture-speech theories is an exaggerated focus on co-speech gesture (see Section 2). McNeill’s “gesture continuum” (McNeill, 1992, 2005), for example, includes a variety of gestures that are produced in the absence of speech, such as “language slotted gestures,” emblems (produced at least sometimes without speech), pantomime, and the signs of signed languages; however, it zooms in on co-speech gestures (or “gesticulation,” in Kendon’s terms) and describes them as the prototypical form of gesture. Accordingly, much of contemporary gesture studies centers on co-speech gestures (e.g. many of the chapters in the two-volume set Body - Language - Communication, edited by Müller et al., 2013–2014) and in this way, explicitly or implicitly, embraces the view that co-speech gesture is gesture par excellence and the ancillary view that speech is language par excellence. Such an attitude is, as we have seen, not unproblematic when it comes to language evolution. Arguably, it also overemphasizes the links between speech and gesture on the developmental level (Levy, 2011) and the neurocognitive level (Demir-Lira et al., 2018), and ignores evidence of the dissociation between them, such as the developmental evidence for partial independence between language and gesture (e.g. Andrén, 2010), or limited impairment of gestures in many forms of aphasia (e.g. Whiteside, Dyson, Cowell, & Varley, 2015). Issues such as these have motivated the third general type of theories in our discussion, to which we turn in Section 5.
5 Pantomimic Polysemiotic Theories
5.1 General Features
A number of polysemiotic pantomimic theories claim that the unique features of human communication derive from a general cognitive capacity, whose appearance eventually led to the emergence of language and gesture, but also to other semiotic systems, such as music and dance, ritual, and depiction (forming marks on a two-dimensional surface that resemble three-dimensional referents, as in drawing and painting). The most common candidate for this is mimesis (Donald, 1991, 1998, 2012, 2013; Zlatev, 2019). According to Donald, the original function of mimesis was to facilitate tool production, as it evolved as an adaptation in late australopithecines or early Homo ca. 2 million years ago. Gradually, it was exapted for communication, allowing the use of the body as a representational device, whereby bodily movements could stand for something other than themselves (Zlatev, Persson, & Gärdenfors, 2005). As shown below, the breadth of the notion of mimesis gives rise to a number of related theories of language origins, which share the feature of viewing the evolution of language as an integral part of the evolution of human thought and culture in general, and at the same time focus on (polysemiotic) pantomime as an essential step in the evolutionary process.
5.2 Gärdenfors’ Pedagogy
Gärdenfors links the emergence of human-specific communication to pedagogy (Gärdenfors, 2017; Gärdenfors & Högberg, 2017). In his scenario, the original form of communication that emerged from mimesis was demonstration, in which a teacher performs an action for the benefit of a student, and which is defined by Gärdenfors (2017) as follows:
(D1) The demonstrator actually performs the actions involved in the task.
(D2) The demonstrator makes sure that the learner attends to the series of actions.
(D3) The demonstrator intends that the learner perceives the right actions in the correct sequence.
(D4) The demonstrator exaggerates and slows down some of the actions in order to facilitate for the learner to perceive important features.
Thus, demonstration both resembles praxis and differs from it, since its main goal is not practical per se, but for the student to understand how to perform the actions in question. Hence, demonstration is a true representational sign, and according to the definition of Andrén (2010) also a gesture (see Section 2). According to Gärdenfors, the minimal “symbolic distance” between the expression and what it represents makes demonstration the likely candidate for being the first form of mimetic communication. The only difference between this and pantomime is feature (D1): Whereas in demonstration the teacher actually performs the actions involved in the task (e.g. one striking a stone to produce a stone tool), in pantomime the teacher pretends to perform the actions, making pantomime a form of pretense. This semiotic breakthrough allowed for the eventual emergence of language (Gärdenfors, 2017) and ritual (Gärdenfors, 2018).
This description indicates that gesture performed the dominant semiotic role in pantomime. However, it does not entail that pantomime was monosemiotic, and, indeed, this is not the case in a popular definition of pantomime in the field of language evolution as: “a non-verbal, mimetic and non-conventionalised means of communication, which is executed primarily in the visual channel by coordinated movements of the whole body, but which may incorporate other semiotic resources, most importantly non-linguistic vocalisations” (Żywiczyński, Wacewicz, & Sibierska, 2018, p. 315). In other words, pantomime should be understood as a hybrid, polysemiotic system, combining both signs and signals, and a number of different sensory modalities (Zlatev et al., 2020).
This is also apparently how Gärdenfors (2017) understands pantomime, given his frequent mention of physical objects and “vocal gestures.” Further, given that pantomime tends to represent events in an undifferentiated way, Gärdenfors outlines the transition to language in terms of its differentiation, accompanied by lexicalization and grammaticalization. He also assumes that the main referential role shifted to the vocal aspects of pantomime over time, without, however, explaining why.
5.3 Zlatev’s Mimesis Hierarchy
Building on the work of Donald (e.g. 1991), Zlatev (e.g. 2008, 2014, 2016) attempted to make the concept of mimesis more explicit, as well as to develop a theory of its transition to language, and more recently to polysemiotic communication as such (Zlatev, 2019). In the process, Zlatev has proposed a number of related definitions of the concept of bodily mimesis, one of which is the following:
[…] an act of cognition or communication is an act of bodily mimesis if: (1) it involves a cross-modal mapping between exteroception (e.g. vision) and proprioception (e.g. kinesthesia); (2) it is under conscious control and is perceived by the subject to be similar to some other action, object or event, (3) the subject intends the act to stand for some action, object or event for an addressee, and for the addressee to recognize this intention; (4) it is not fully conventional and normative, and (5) it does not divide (semi)compositionally into meaningful sub-acts that systematically relate to other similar acts, as in grammar.
On this basis, Zlatev proposes an evolutionary (and in part developmental) model known as the Mimesis Hierarchy (Zlatev, 2008). The rudimentary form of protomimesis, based on requirement (1), is found in activities like emotional and attentional contagion (e.g. contagious laughter), and is common for all primates. The more advanced form of dyadic mimesis (based on 1 and 2) involves volition and imitation, but not true representation or sign-function; it is common for all great apes. Only at the next level (based on 1, 2, and 3), referred to as triadic mimesis, do mimetic acts gain a clear sign-function, as well as Gricean communicative intentions (i.e. that the addressee should understand that a communicative act is being performed for their benefit). This level, also in agreement with Gärdenfors, is claimed to be uniquely human (Zlatev et al., 2005). Further, point (4) distinguishes mimesis from a conventionalized protolanguage and point (5) distinguishes it from language proper.
This provides a convenient conceptual apparatus, but does not address key questions such as what drove the evolutionary process, as well as more specific aspects of how the transition from triadic mimesis (i.e. pantomime) to protolanguage and language took place, including the shift from gesture to vocalization in terms of dominance. Zlatev (2016) addresses these gaps, but in a somewhat schematic manner. With respect to evolutionary pressures, Zlatev appeals to an increase of prosociality in hominins, in the manner of Tomasello (see Section 3.6). As for the ecological pressures behind this, alloparenting, a reproductive strategy that is unique to humans among the great apes (Hrdy, 2009), is invoked. Concerning the gradual transition to vocalization, this is sought in the nature of pantomime itself: a hybrid system that is polysemiotic (i.e. combines various sign and signal systems) and multimodal (i.e. involves different sensory channels). The dominant semiotic system of pantomime is claimed to have been highly iconic gesture, understood in terms of the following properties (Zlatev et al., 2020):
a) use of primary iconicity (Sonesson, 1997), where the similarity between the gesture and what it represents is largely sufficient for establishing the referent, as opposed to secondary iconicity, where appreciation of this similarity comes only later;
b) use of the whole body, rather than the hands only (cf. Żywiczyński et al., 2018);
c) use of a first-person perspective, where the gesturer adopts the perspective of the agent who performs the represented actions (cf. Zlatev & Andrén, 2009);
d) use of the enacting “mode of representation,” with the body of the gesturer mapping onto the (human) body of the referent (Müller, 2014);
e) use of the peripersonal space, where gestures stand for actions in the space immediately surrounding one’s own body (Brown, Mittermaier, Kher, & Arnold, 2019).
The transition towards language consisted in a transition from (a) primary to secondary iconicity, where resemblance is less important than convention, (b) whole-body gestures to manual gestures, (c) first-person to third-person perspective, and related to this (d) the enacting mode to the “tracing” and “embodying” modes of representation, where the hands represent specific features of the referents in different ways, and can consequently (e) denote objects that are more displaced in space and time. Such a gestural system is clearly much less iconic than that of pantomimic gesture, but nevertheless retains considerable iconicity, exceeding that of vocalization. Thus, Zlatev (2016) makes use of the arguments presented by Brown (2012, see Section 3.7), to motivate the gradual transition from gesture to vocalization when the need for less iconicity and more “arbitrariness” arose.Footnote 10
While language (realized as speech, writing, or signing) may be the dominant system in modern human communication when it comes to expressing propositions and narratives, it is very seldom used alone. Rather, it is used alongside other semiotic systems such as gesture and depiction (e.g. Green, 2014), that is, in polysemiotic communication. An advantage of the mimesis/pantomime approach is that it can help explain this, as pantomime consisted of gesture, vocalizations, and “protodrawing,” when gestures left marks on surfaces such as sand (Zlatev, 2019; Zlatev et al., 2020; Zlatev et al., 2023).
5.4 Similar Accounts
Perhaps less elaborated but similar polysemiotic theories have been suggested by others as well. Levinson (2006), for example, has proposed the Interaction Engine hypothesis, according to which what evolved in our ancestors was a sociocognitive adaptation allowing “joint attention, common ground, collaboration and the reasoning about communicative intent” (Levinson & Holler, 2014, p. 369) which transformed face-to-face interaction. Levinson argues that such communication was polysemiotic in the sense that it incorporated facial expressions, body movements, and affective vocalizations (cf. Fröhlich et al., 2019) but was still lacking representational signs. In a next stage, iconic gesture accompanied by simple referential vocalizations emerged, the latter of which gradually assumed the dominant role in the transfer of meaning (Levinson & Holler, 2014).
A similar scenario is proposed by Collins (2014), according to whom the communication system of early Homo consisted of a majority of relatively involuntary, non-representational signals (“indices”), and a smaller inventory of voluntary, representational signs. The latter were initially more or less evenly distributed between the bodily and vocal channels, but gesture had a leading role. With Homo erectus and Acheulean culture,Footnote 11 there was a sharp increase in the proportion of bodily-gestural signs, but for some unspecified reason the importance of the latter began to decrease from ca. 1 million years ago, and with the evolution of Homo sapiens, the relative importance of gesture and speech was reversed compared to what it was at the onset of language evolution.
5.5 Evaluation
Like the equipollent type, pantomimic theories of language origins alleviate the transition problem of the monosemiotic gesture theories by drawing attention to the polysemiotic nature of modern human communication and arguing that the evolutionary beginnings of language must have been similarly polysemiotic. However, the two types disagree about the nature of polysemiotic communication, both in modern communication and as it concerns evolution.
First, pantomimic theories adopt a more complex view of human-specific communication, comprising not just speech and gesture, but also other semiotic systems like depiction and music. Some of these theories propose that all of these have developed from the fountainhead of (bodily) mimesis. The most important shared characteristic of these semiotic systems is that they consist of (representational) signs, which distinguishes them from most forms of animal communication, which are based on signals (Zlatev et al., 2020). Modern human communication indeed involves combinations of such semiotic systems and is clearly polysemiotic (Zlatev, 2019; Zlatev et al., 2023). For example, the Bayaka nomads of the Western Congo Basin incorporate whole-body, silent pantomime as well as vocalizations of the hunted game into their hunting narratives (Lewis, 2014), while inhabitants of central Australia speak and gesture as they draw in the sand while narrating (Green, 2014). Considering these traditional societies with their traditional technologies, how much more polysemiotic is modern communication, mediated by all the current (electronic) media that we have at our disposal? The interactions between these systems are flexible and context-specific, not unlike the way Kendon describes the polysemiotic process of languaging (Section 4.3).
Second, proponents of pantomimic theories gain some support from “uniformitarianism” when they argue that the original human-specific system was likewise polysemiotic, though less differentiated than present polysemiosis, with “bodily mimesis” or an “interaction engine” serving as the springboard. A clear difference from the equipollent polysemiotic theories is the claim that the division of labor between semiotic systems was different at the onset of the evolutionary process, with gesture playing the leading role at its onset and speech playing the dominant role at its end. In this respect, pantomimic theories can use all the arguments from monosemiotic theories (Section 3) without encountering the transition problem to the same extent. However, they do face a version of it, as alluded to repeatedly above: How exactly can one explain the reconfiguration in the division of labor between semiotic systems from “more gesture” to “more speech”?
While we do not believe that a conclusive response to this question has been given, it appears that pantomimic polysemiotic theories are in the best position to answer it when compared to the other types, by emphasizing the flexible and context-dependent relation between semiotic systems. Further, speech and gesture have different intrinsic potentials for iconic representation. Contexts in which iconicity is less effective as a communication strategy, and in which a less iconic, more conventional, and systematically structured form of sign use would have been more effective, would have constituted an evolutionary pressure towards speech. One such context, suggested originally by Donald (1991), could have been a culture dependent on relatively complex narratives, or myths, where events are not represented just sequentially, but also in counter-iconic orders, reflecting causal and logical relations that require more systematic and conventionalized means of representation.
6 Conclusions
In this chapter we have provided a survey of a number of gestural theories of the origin of language, using a new typology that divides theories first in terms of whether they view “gesture alone” as the starting point for the emergence of language (monosemiotic theories) or gesture in combination with other semiotic systems (polysemiotic theories). The latter were then divided based on whether gesture and speech are considered to have played a (more or less) equal role from the start to the present (equipollent) or whether gesture first dominated, but the vocalizations that were there from the start eventually gained dominance (pantomimic).
To sum up, while there is a lot of value in monosemiotic gestural theories, in particular in regard to the role of iconicity for “bootstrapping” a shared system of signs, they were all shown to have difficulty in explaining the transition from gesture to speech. In tackling this problem, many proponents of such theories tend to overemphasize the role of speech in modern human communication, at the same time as they tend to overemphasize the role of gesture at the beginning of the evolutionary process. Another problematic move is the stress that they put on the continuity between ape and human gestures, whereby they find themselves hard-pressed to fully account for the differences between these two phylogenetically distinct forms of gesture.
As for equipollent polysemiotic theories, they are at least to some degree able to account for the role of co-speech gesture in modern human communication but tend to disregard other forms of gestural communication (and, even more so, other semiotic systems). They find it difficult to provide an evolutionarily satisfactory explanation of why gesture and speech differ semiotically (rather than just postulating that they are two sides of a “dialectic”), or indeed why they are so closely connected.
We concluded with pantomimic polysemiotic theories, which in a way capitalize on the strengths of the others: Like monosemiotic theories, they claim an advantage of gesture at an initial evolutionary stage, and like the equipollent theories (perhaps most of all, that of Adam Kendon) they proclaim the advantages of flexible polysemiosis. As such, they appear to be best able to explain modern communication: not only speech, but language in general; not only co-speech gesture, but a wide variety of gestures; not only language and gesture, but also other semiotic systems, like depiction and music. In several cases, they are capable of proposing plausible evolutionary scenarios, informed by research in ethnography and non-human communication. At the same time, they are able to explain the differences between human-specific and animal communication, including ape gestures. Finally, their notion of polysemiosis alleviates the transition problem of monosemiotic theories. However, it does not eliminate it completely, and pantomimic polysemiotic theories still have to propose a mechanism responsible for reconfiguring the original system of pantomime into modern language and modern polysemiotic communication. While there are some attempts in this direction, this is still very much a work in progress. We have indicated some promising ideas, including the role of narrative.