Both face-to-face dialogue and the co-speech gestures that occur within it are social as well as cognitive. “Social” has a broad and vague meaning; it can refer to cultural and societal factors that influence gesturing or to the mere presence of another person. However, co-speech gestures occur on a scale of seconds or fractions of seconds; they are microsocial (Bavelas, 2007). Moment by moment, speakers use their gestures to provide their addressees with information that advances the topic of their dialogue or informs their addressee about the state of their dialogue at that moment. As Kendon (1994, p. 194) observed, “speakers and listeners collaborate in intimate ways.” This chapter summarizes evidence that a speaker’s obligation to his or her addressee at a particular moment in their dialogue often determines the form and even occurrence of gesturing.
Clearly, the above position presupposes a communicative function of co-speech gestures, which remains an enduring issue in gesture circles (e.g. Hostetter, 2011; Kendon, 1994; Krauss, Morrel-Samuels, & Colasante, 1991). Therefore, the chapter must start with a closer examination of how this issue has been approached so far.
1 The Great Communication Debate
It is striking that questioning the communicative status of co-speech gestures seems to be an issue for experimental researchers, but not for those who study gesturing in contexts outside a lab. For example, researchers such as Goodwin, Mondada, Streeck, Enfield, and many others (e.g. Streeck, Goodwin, & LeBaron, 2011) rarely mention the issue; instead, they study how gestures contribute to social interactions. I propose that a core methodological decision is responsible for this difference.
Perhaps the most important distinction between these two approaches (Bavelas, 2022, Ch. 4) is the researcher’s primary goal and consequent means of achieving it. One method prioritizes internal validity, aiming to make clear, preferably causal, inferences from the data, so hypothesis testing and experimental control are priorities. The other method prioritizes external validity, aiming to generalize to the widest variety of social interactions, so diversity and minimal interference are priorities.
In an ideal world, these two methods would complement each other, achieving together what neither can do alone. However, for historical reasons, the choice of internal validity has led to a different unit of analysis: Most experimental research on co-speech gestures has followed the model of psychology, which is to study individuals, not social interactions. Danziger (1990) traced experimental psychology from studies of a researcher and participant in interaction, to increasing isolation of individuals as subjects of research, to their eventual aggregation into an anonymous N. As Danziger pointed out, it soon became the norm that results in experimental psychology “were not taken as conveying information about an individual-in-a-situation but about an individual in isolation whose characteristics existed independently of any social involvement” (p. 186).
Even social psychology did not require social interaction. For example, Allport’s (1954, p. 3) authoritative definition was that social psychology is “an attempt to understand and explain how the thought, feeling, and behavior of individuals are influenced by the actual, imagined, or implied presence of other human beings.” Ultimately, even the actual presence of another human being who was free to interact became a methodological problem (e.g. Aronson & Carlsmith, 1968), and experimental control came to mean avoiding spontaneous social interaction entirely. A true interlocutor was replaced by a minimally responsive confederate, the experimenter, or a highly constrained addressee. Another alternative has been to remove interaction entirely by having speakers record something that others would hear or watch later. Indeed, the communication debate has often come to be phrased in terms of individuals: Are gestures for the speaker or for the listener, that is, which individual are they for?
The alternative, as Bavelas (2005), Bavelas and Healing (2013), Kuhlen and Brennan (2013), and many others have argued, is that the unit of study for a social process must – and can – be a social interaction, that is, a dyad rather than an individual. Researchers focused on external validity already know this, and virtually every such study provides abundant evidence that speakers are gesturing for their addressees. Therefore, the review in this chapter focuses on experimental evidence, showing that researchers with a primary interest in internal validity can also take social interaction as their unit of study, with controlled experiments in which a real speaker and a real addressee interact freely within their assigned task. The experiments reviewed here show that co-speech gestures are communicative by revealing the many ways in which the immediate social context affects whether and how speakers gesture for their addressees.
1.1 Criteria for Experimental Evidence that Speakers Design Their Gestures for Their Addressees
To be included in this review, first, dyads had to interact, which excluded studies in which a speaker was recorded for later rating by others. Second, these dyads were real interlocutors, that is, neither of them was a confederate, the experimenter, a participant who was instructed how to interact, or an imaginary addressee. So in at least one experimental condition, two people were interacting freely, face to face, within an experimental task. Third, the focus was on what Duncan (1969) called internal variables, that is, variations related to the dialogue itself, rather than external variables, such as the age or personality of the interlocutors. Fourth, the independent variable was one that would influence the speaker, the addressee, and their interaction. Fifth, some aggregated dependent variables (e.g. rate) may not be relevant, moment by moment, to the addressee. So, as well as overall gesture rates or frequencies, it was often desirable and informative to include qualitative features of individual gestures (e.g. size, precision, relationship to words) that would affect the addressee at a microsocial level. These measures demonstrated good interanalyst agreement and could be summarized quantitatively for statistical analysis. Finally, it was highly desirable that each study should be replicated with different tasks and procedures, providing some external validity.
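The fifth criterion mentions interanalyst agreement without committing to a particular statistic. Purely as an illustration, one common way to quantify such agreement is Cohen’s kappa; the choice of statistic and all of the data below are my own assumptions, not drawn from the studies reviewed:

```python
# Minimal sketch: chance-corrected agreement (Cohen's kappa) between two
# analysts coding the same gestures. All data are hypothetical.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Proportion of gestures on which the two analysts agreed.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, given each coder's own base rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codings of ten gestures as 'topical' or 'interactive'
a = ["topical", "topical", "interactive", "topical", "interactive",
     "topical", "topical", "interactive", "topical", "topical"]
b = ["topical", "topical", "interactive", "topical", "topical",
     "topical", "topical", "interactive", "topical", "topical"]
print(round(cohens_kappa(a, b), 2))  # 0.74 here; conventions vary, but
# values around .60-.70 and above are usually reported as good agreement
```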
In the shadow of prominent studies disputing whether gestures communicate at all, a less well-known body of experimental evidence has been accumulating on how speakers shape their gesturing for their addressee. By meeting the above criteria with an encouraging variety of tasks, these studies have provided evidence that speakers are gesturing for the benefit of their addressees in at least six different ways:
They gesture more in dialogue than monologue.
They adapt their gestures to their shared space with the addressee.
Mutual visibility affects the form (but not the rate) of their gestures.
Their gestures reflect prior common ground with their addressee.
Their gestures mark incremental common ground within the dialogue.
Gestures can also be about having a dialogue, rather than about the topic of a dialogue.
A warning: Some of these studies challenge received wisdom about past research.
2 Speakers Gesture More in Dialogues than in Monologues
The most straightforward test of whether addressees matter to their speakers is to remove the addressee and count gestures. Bavelas, Chovil, Lawrie, and Wade (1992, Study 1) compared dyads in dialogues to individuals who were alone, speaking in monologue. All were doing the same two tasks: relating a short episode from an animated cartoon and giving complete instructions on how to get a book from a library. In both tasks, speakers gestured at significantly higher rates per minute in the dialogues than in the monologues.
Bavelas, Gerwing, Sutton, and Prevost (2008) compared gesturing of speakers in a face-to-face dialogue, a telephone dialogue, and a monologue to a handheld microphone. All were describing the same stimulus (an unusual dress; see Figure 22.1, from Bavelas et al., 2008, p. 501). Their rate of gesturing, whether measured per minute or per 100 words, was significantly higher in each of the dialogues than in the monologue; the face-to-face and telephone dialogues did not differ from each other in this respect. (Notice that holding a telephone in one hand did not reduce the rate of gesturing compared to the face-to-face dialogues.) Bavelas et al. (2014) replicated this effect with the same design. This time, the picture was a drawing of a series of geometric shapes (see Figure 22.3, from Bavelas et al., 2014, p. 637). Again, the rate of gesturing, whether per minute or per 100 words, was significantly higher in each of the dialogues than in the monologue, and the two dialogues did not differ from each other.
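As a side note on measurement, the two rate measures in these studies – gestures per minute and gestures per 100 words – are simple functions of annotation counts. A minimal sketch, with invented counts rather than data from either study:

```python
# Minimal sketch of the two gesture-rate measures used in these studies,
# computed from hypothetical per-speaker annotation counts.

def gesture_rates(n_gestures, n_words, duration_seconds):
    """Return (gestures per minute, gestures per 100 words)."""
    per_minute = n_gestures / (duration_seconds / 60)
    per_100_words = 100 * n_gestures / n_words
    return per_minute, per_100_words

# A hypothetical speaker who described the stimulus for 90 seconds
per_min, per_100 = gesture_rates(n_gestures=18, n_words=240, duration_seconds=90)
print(f"{per_min:.1f} gestures/min, {per_100:.1f} gestures/100 words")  # 12.0, 7.5
```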
Holler, Turner, and Varcianna (2013) created dialogues that took place face-to-face or through a screen, as well as a monologue to a tape recorder. The participants saw a series of dictionary definitions and had to think of the word being defined. In this study, there were no assigned speakers or addressees; either person could answer. The same number of individuals took part in each condition, but some were in dialogues and others in monologues. These individuals made significantly more gestures in face-to-face dialogues than in monologues, and there was no difference between the face-to-face and screen dialogues. Thus, four studies with different procedures and tasks have shown that dialogues elicit more gesturing than monologues. Gesturing is not, as often claimed, part of speaking; it is part of speaking in dialogue.
3 Speakers Adapt to Shared Space with Their Addressee
When a dialogue is face to face, the interlocutors share a common space between them, which speakers adapt to and use. In a classic experiment, Özyurek (2000, 2002) created experimental conditions in which the spatial relationship between speaker and addressee affected the form of the speakers’ gestures. These speakers were relating an episode from an animated cartoon in which a character went, for example, into or out of something. The seating of the interlocutors determined their shared space. When the addressee was sitting directly across from the speaker, their shared space was in the center, between them. When the addressee was sitting off to the side, their shared space was shifted to that side. Rather than gesturing solely from their own perspective, the speakers strongly tended to make their directional gestures in the shared space. For example, the same action would be gestured straight ahead or angled off to the side, depending on the position of the addressee.
Bavelas, Gerwing, Allison, and Sutton (2011) showed that interlocutors coordinated their use of shared space quite precisely. These dyads were working together to create a floor plan for a two-bedroom student apartment. They had no writing implements, only a bare table between them, so they had to “draw” their plans by tracing on and pointing at the table, leaving no visible record. Both contributed; there was no assigned speaker and addressee. For the vast majority of their gestures, they named a room or object and simultaneously located it with a gesture. So their gestures were providing essential information that was not in their words.
In order to assess whether these dyads were understanding each other’s gestures, we adapted Clark and Schaefer’s (1987, p. 22) three-step model of grounding, that is, of establishing mutual understanding:
i. One of them presented a non-redundant gesture with his or her words (e.g., “the kitchen could be here”). This person was the speaker for this gesture.
ii. Their addressee showed evidence of understanding (e.g., a back channel, naming the location, or building on their speaker’s gesture in the same location).
iii. The speaker responded in a way that confirmed their addressee’s evidence (e.g., a back channel, repeating their addressee’s gesture, or building on it).
Our microanalysis revealed that the interlocutors completed this sequence for 95.5 percent of their 552 nonredundant gestures, which showed they were using their shared space in ways they mutually understood and responding appropriately to each other’s gestures.
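For illustration only, the tallying behind such a result can be sketched as a scan over a coded event stream for the three steps above. The event labels and the stream are hypothetical; the actual microanalysis was done on video records, not on a list like this:

```python
# Minimal sketch of scoring Clark and Schaefer's three-step grounding
# sequence (presentation -> evidence of understanding -> confirmation)
# over a hypothetical coded event stream.

def completed_sequences(events):
    """Count gesture presentations that were followed, in order, by
    addressee evidence of understanding and speaker confirmation."""
    presented = completed = 0
    for i, event in enumerate(events):
        if event == "presentation":
            presented += 1
            if events[i + 1 : i + 3] == ["evidence", "confirmation"]:
                completed += 1
    return completed, presented

# Hypothetical coded stream for one dyad
stream = ["presentation", "evidence", "confirmation",
          "presentation", "evidence", "confirmation",
          "presentation", "evidence"]          # last sequence left incomplete
done, total = completed_sequences(stream)
print(f"{done}/{total} sequences completed ({100 * done / total:.1f}%)")  # 2/3 (66.7%)
```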
It is worth noting that an experimental variable made no difference. Half of the dyads were working across the width of a rectangular table, so it was easy for them to gesture in the same space. The other half were working across the table placed lengthwise, so they leaned or reached forward to share a space. There was no difference in completed sequences, that is, in the evidence of mutual understanding. They created and used a shared space regardless of effort.
In both of these experiments, shared space between interlocutors was not just a constraint on the speakers but a medium they could use for their purposes.
4 Disentangling the Effects of Dialogue and Mutual Visibility
Since Cohen and Harrison’s (1973) experiment, their focus on visibility seems to have become the “gold standard” for determining whether gestures are communicative. They proposed that if speakers intend their gestures to communicate to an addressee, they should be less likely to gesture when their addressee would not see them, and this is what Cohen and Harrison apparently found: The mean number of gestures was significantly lower when speakers and addressees could not see each other than when they were face-to-face. However, this conclusion turned out to be premature, and further research ultimately pointed to a different interpretation.
Our review (Bavelas & Healing, 2013, Tables 1 and 2) found fourteen similar studies, which compared dyads who were face-to-face with dyads who were communicating through a partition, on an intercom, or on the phone. Despite this unusual number of replications, the results were ultimately uninformative. Seven of the studies found that, when their addressee would not see their gestures, speakers gestured at a significantly lowered frequency or rate: Cohen and Harrison (1973), Cohen (1977), Krauss, Dushay, Chen, and Rauscher (1995), Alibali, Heath, and Myers (2001), Emmorey and Casey (2001), and Mol, Krahmer, Maes, and Swerts (2009a, 2009b).
Table 22.1 Two levels of gesture functions, corresponding to Clark’s (1996) Track 1 and Track 2
The other seven studies, which tended to be more recent, found the opposite: speakers gestured at the same frequency or rate whether or not their addressees could see them: Rimé (1982), Bavelas et al. (1992, Exp. 2), Bavelas et al. (2008), Pine, Burney, and Fletcher (2010), Holler, Tutton, and Wilkin (2011), De Ruiter, Bangerter, and Dings (2012), and Bavelas et al. (2014). Since our review, Holler et al. (2013) also found no effect of visibility.
We looked for differences that might explain such contradictory results and found one characteristic that divided the two groups exactly: In the experiments that found a difference, the addressees were a confederate, the experimenter, or a participant with instructions to interact minimally. We called those quasi-dialogues. In the experiments that found no difference due to visibility, both speaker and addressee alike were participants; we called these free dialogues. The question, then, is: which one to accept?
Any textbook on experimental methods would recommend accepting the quasi-dialogue results on the assumption of better experimental control. That is, a quasi-dialogue should have provided better experimental control, reducing error variance due to uncontrolled actions of an addressee, thereby making this design more likely to detect a difference. On this basis, the findings of a difference due to visibility should take precedence over the findings of the other studies.
Unfortunately, as pointed out in detail in Bavelas and Healing (2013), these quasi-dialogues had a surprising lack of experimental control. (a) Instructions to the quasi-addressees were different in each of the seven experiments (see Bavelas & Healing, Table 3). (b) Only some of these experiments recorded the quasi-addressee. (c) None conducted a manipulation check as required for experimental designs; that is, none verified that their quasi-addressee carried out his or her instructions as planned. Nor did any of these studies mention excluding or replacing any dyads because of a quasi-addressee’s actions, which also implies there was no verification. (d) There was therefore no assurance that quasi-addressees acted the same way in the visible and not-visible conditions, that is, that they did not inadvertently bias the results. In free dialogues, control was at the level of the dyad; both participants had the same instructions and performed the same (joint) task together. Four of those studies reported dropping and replacing dyads, which suggests greater procedural scrutiny than in the quasi-dialogue studies.
There are also two alternative explanations. First, Bavelas and Healing (2013) pointed out that adding a relatively unresponsive addressee to a lack of visibility may have created something closer to a monologue than a dialogue, which would have lowered the rate of gesturing. Second, even assuming that confederates carried out their instructions, their task is likely to have distracted them from listening to what their speaker was saying. In Bavelas, Coates, and Johnson (2000), distracted addressees made significantly fewer back-channel responses, which affected their speaker’s narrative negatively; it might also have reduced gesturing.
Given that the seven quasi-dialogues were not methodologically superior, their results have no priority, which implies accepting the null hypothesis of no demonstrated effect of visibility on gesture rates, especially when the goal of research is to generalize to free dialogues. However, even more interesting effects have emerged.
4.1 Effects of Dialogue versus Visibility on Gesture Rate
Logically, if free dialogues have consistently shown no difference in gesture rates between visible and not-visible conditions, then these two conditions might have had something in common. The best candidate is that both were dialogues. Recall that Bavelas et al. (2008, 2014) had an experimental design that tested the effect of dialogue versus monologue: Speakers described the same material in a face-to-face dialogue, a telephone dialogue, or a monologue to a microphone. They found significant effects of dialogue and not of visibility. Gesture rates were lowest in the monologues. Speakers in telephone dialogues gestured at a significantly higher rate than speakers in monologues, and there was no significant difference between the two dialogues. Their gesture rates depended on dialogue, with no independent effect of visibility.
Holler et al. (2013) had similar results with a different task. The number of gestures in the face-to-face dialogues was not significantly different from the dialogues through a screen, and the face-to-face dialogues elicited significantly more gestures than the monologues. (In this study, the difference between the screen and monologue conditions did not reach significance.)
4.2 Speakers Change Their Gestures When Their Addressee Will or Will Not See Them
Visibility does not affect speakers’ rate of gesturing, but several experiments that went beyond rate have shown that visibility affects how they gesture. For example, some kinds of gestures occur less often when an addressee would not see them. Using Clark and Wilkes-Gibbs’s (1986) Tangram task, Holler and Wilkin (2011) found that interlocutors who were face-to-face often mimicked the form of each other’s gestures as a way of showing they were understanding each other. When they had a screen between them, they made fewer gestures of the same shape, which shows that similar gestures for the same picture were not happening simply by chance. De Ruiter et al. (2012) asked interlocutors to identify specific Tangram figures on a poster that both could see. When they could see each other, they often pointed at a figure when referring to it. In the other condition, with a screen between them, speakers used no pointing gestures. Bavelas et al. (1992) found that speakers were much less likely to make gestures with interactive (versus topical) functions when their addressee could not see them. (See more about these gesture functions below.)
Speakers also adapt the form and usefulness of their visible gestures. Clark and Krych (2004) asked “directors” to teach “builders” how to assemble a series of Lego models. They found that, when their workspace and gestures were mutually visible, builders often gestured what they were doing or planning to do: “Builders communicated with directors by exhibiting, poising, pointing at, placing, and orienting blocks, […] all timed with precision. Directors often responded by altering their utterances midcourse, also timed with precision” (p. 62). Interlocutors in Holler et al. (2011) were doing the Tangram task. Speakers who were face-to-face with their addressees made their gestures in the typical gesture space between them (in front of the upper body, above the hipline) where they would be clearly visible to the addressee. When divided by a partition, speakers made their gestures lower, below the usual gesture space.
Healing and Gerwing (2012) found that speakers added useful information to gestures their addressees could see. They were describing a drawing of a sequence of geometric figures connected by a continuous line (Bavelas et al., 2014, Figure 22.3). When talking face-to-face, speakers gesturally reproduced the drawing in the air, facing their addressee. They drew a line, then a figure on it, then resumed the line from the same place and drew it to the next figure. In other words, speakers linked the figures gesturally along the line, just as they appeared in the drawing. When talking to their addressee on the phone, speakers tended not to link the figures. Instead, they drew them in the same place – on top of each other, so to speak. Indeed, the proportions of linked gestures in face-to-face versus telephone conditions were so different that the distributions of these proportions did not overlap.
Speakers also changed the relationship of visible gestures to their words. When gesturing to an addressee in face-to-face dialogue, speakers often drew attention to their gestures with their words by using deictic references such as “down here,” “this big,” or “like this.” They timed these deictic references precisely with a gesture that demonstrated “down where,” “how big,” or “what it was like.” Clark and Krych (2004) found that interlocutors had significantly fewer turns with deictic expressions when their shared workspace (and therefore gestures) was not visible. Bangerter and Chevally (2007) found a lower rate of deictic references in their not-visible condition. Bavelas et al. (2008) also found lower rates of gestures accompanied by a deictic expression in their not-visible condition.
Visibility also changes the distribution of information in words versus gestures. In two studies, Bavelas et al. (1992, 2008) found that, when their addressee could see them, speakers used significantly more gestures that were not redundant with their words. De Ruiter et al. (2012) called these obligatory gestures because some information was only in the gesture; they found that their speakers made almost no obligatory gestures when their addressee would not see them. Gerwing and Allison (2011) quantitatively assessed the distribution of semantic information in gestures versus words; they found that speakers shifted more semantic information into gestures rather than words when their addressee could see their gestures.
4.3 Summary: Dialogue versus Visibility, Rate versus Form
The accumulating and replicated evidence strongly supports three conclusions: First, dialogue elicits gesturing. Speakers who have an addressee gesture significantly more than speakers who are in monologue, without an addressee. Second, there is no reliable effect of mutual visibility on gesture rates. An addressee does not have to be visible to their speaker; it appears that their dialogue is enough to elicit gesturing. Third, a visible addressee does affect the form of speakers’ gesturing. In face-to-face dialogues, speakers make gestures that are more useful and even more necessary to their addressee.
5 Common Ground Affects Speakers’ Gesturing
The common ground of two individuals is “the sum of their mutual, common, or joint knowledge, beliefs, and suppositions” (Clark, 1996, p. 93), and it affects how speakers present information to a particular addressee. According to Grice’s (1975) cooperative principle, a speaker should provide information clearly but also avoid providing more information than is necessary. Therefore, explanations to an addressee who shares common ground should be different from explanations of the same material to an addressee who does not already have the information. As shown below, there are two different sources of common ground: The information may be known to both interlocutors before the dialogue begins (prior common ground), or shared information may accumulate during the dialogue (incremental common ground). It turns out that both of these affect speakers’ gesturing.
5.1 Speakers Gesture Differently When They Share Prior Common Ground with Their Addressee
The first experiment on prior common ground was by Gerwing (2003; published as Gerwing & Bavelas, 2004). The (adult) participants were initially alone, manipulating several toys (e.g. a finger cuff, a whirligig). Some had the same toys (and therefore had common ground), and others did not (no common ground). Then, by random assignment, one person became the speaker and met separately with one addressee who had had the same toy and with another who had not. Thus, each speaker was describing the same toy to an addressee who was familiar with it and also to an addressee who was not familiar with it. Two raters independently compared video excerpts of these two descriptions. For each matched pair of videos, the raters judged which gestures, overall, conveyed “more information, were more complex, or were more precise” (Gerwing & Bavelas, 2004, p. 168). For nineteen of the twenty speakers, the raters chose the gestures made for the addressee who had not seen the toy before. (One was a tie.) So speakers did not provide as much gestural information to addressees who did not need it.
Holler and Stevens (2007) used “Where’s Wally?” picture puzzles to create common ground (or not) before the participants interacted. As in the children’s game, the addressee’s task was to find the image of the “Wally” character in several much larger, background pictures that had busy scenes filled with many characters, objects, and activities. In the no-common-ground condition, the speaker had a picture with Wally in it and the addressee had the same picture without Wally in it; the speaker told the addressee how to find Wally in his or her picture. In the common-ground condition, both speaker and addressee had first looked at the various background pictures (without Wally in them), so both were familiar with the details of each picture before the addressee had to find Wally somewhere in it. When their addressee had common ground about the picture, speakers mainly used words to specify Wally’s location. When they did not have common ground, speakers were much more likely to include gestures. Specifically, speakers’ gestures to addressees without prior common ground were larger and represented size more accurately. These speakers used their gestures to depict information that they knew their addressee did not have.
Hilliard and Cook (2015) used the Tower of Hanoi task, in which the “towers” were pegs with disks stacked on them. An addressee’s goal was to move disks from one tower to another as the speaker directed. One independent variable was common ground. In the common-ground condition, a speaker first did the task several times while their addressee was watching, so the addressee could hear what the speaker was saying and watch how he or she was doing the task. In the other condition, the addressee did not have this experience. In both conditions, the addressee would have to do the task correctly. As in the two previous experiments, the focus was on how prior common ground (or the lack of it) affected speakers’ gestures to their addressee. For example, gestures that depicted lifting the disks were essential to the task, and these were the ones that speakers varied as a function of common ground:
Speakers lifted their gestures up higher relative to their body when their listener had no experience with lifting the disks compared to when their listener had seen disk lifting before […] Thus, the spatial and motoric information necessary for task completion was expressed on the hands in all conditions, but this information was produced higher in space when the information was more relevant for the listener. Speakers clearly adjust their gestural form with respect to shared information.
Galati and Brennan (2013) compared speakers retelling an animated cartoon to the same or a different addressee. They gestured more per narrative element to an addressee who did not share common ground, especially gestures that directly depicted events in the cartoon. Their gestures were also larger and more precise.
These four experiments show that, when information is new to their addressee, speakers are more likely to make gestures that are larger, more precise, and more relevant, compared to the gestures speakers make for information in their prior common ground.
5.2 Speakers’ Gestures Mark Incremental Common Ground
In addition to the prior common ground just described, interlocutors also accumulate shared information as their dialogue proceeds. This is a microprocess Clark called incremental common ground (e.g. 1996, pp. 38–39, 221–251). Moment by moment, as the dialogue proceeds, information that was new becomes given (Chafe, 1994), and speakers mark this difference; for example, a repeated word is shorter (Fowler, 1988). Gerwing (2003) and Gerwing and Bavelas (2004, Analysis 2) tested this principle with gestures, microanalyzing each dialogue of speakers describing their toys to an addressee who had not seen those items before. Over the course of their dialogue, the speaker’s gestures for the same toy changed (as they had with prior common ground), becoming smaller and less precise. Arguably, these “shorthand” versions of the original gesture marked them as referring to information the addressee now had.
Holler, Bavelas, Woods, Geiger, and Simons (2022) tested the effects of incremental common ground in two experiments, using Fowler’s (1988) measure: duration. Speakers were describing the Tangram figures or various dance steps as they reoccurred several times. Both experiments found significantly shorter durations for later gestures. Shortening marked these gestures as referring to common ground rather than new information, just as Fowler had found for words.
Three other experiments elicited repeated gestures with apparently negative results. Holler et al. (2011), De Ruiter et al. (2012), and Galati and Brennan (2013) all reported no significant decrease in a gesture/word ratio over repeated references. However, as Holler and Bavelas (2017, pp. 218–222) pointed out, this measure can be misleading. When the same variables affect gestures and words in the same way, then the denominator of their ratio is not constant. In these three studies, word frequency was also decreasing (see De Ruiter et al., 2012, p. 329; Galati & Brennan, 2013, p. 10; and Holler et al., 2011, Section 3.2 online). Thus, a relatively constant gesture/word ratio meant that gestures were also decreasing. In Holler et al. (ibid.), it even produced anomalous results: The decrease was more pronounced for words than for gestures, so the gestures-per-word rate increased although the frequency of gestures was dropping. Instead of “controlling” for amount of speech, a gesture/word ratio treats gestures as dependent on words. That is, if gestures are decreasing in parallel with words, a gesture/word ratio represents it as no decrease. Rate per minute is a more neutral measure.
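A small worked example may make the arithmetic concrete. The numbers below are invented, not taken from any of the studies cited; they simply show how a gesture/word ratio can stay flat or even rise while both gesture frequency and the per-minute rate fall:

```python
# Invented numbers illustrating why a gesture/word ratio can mask a parallel
# decline: across repeated references, gestures and words both drop, but the
# ratio rises because the denominator (words) shrinks faster.

references = [  # hypothetical first vs. third mention of the same figure
    {"mention": 1, "gestures": 12, "words": 120, "minutes": 2.0},
    {"mention": 3, "gestures": 6,  "words": 40,  "minutes": 1.2},
]

for r in references:
    per_word = r["gestures"] / r["words"]      # the gesture/word ratio
    per_min = r["gestures"] / r["minutes"]     # the more neutral rate
    print(f"mention {r['mention']}: {r['gestures']} gestures, "
          f"{per_word:.3f} per word, {per_min:.1f} per minute")

# mention 1: 12 gestures, 0.100 per word, 6.0 per minute
# mention 3: 6 gestures, 0.150 per word, 5.0 per minute
# The per-word ratio *rises* even though gesture frequency halves and the
# per-minute rate falls -- the anomaly described above.
```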
6 Speakers Gesture to Their Addressee about the Moment-by-Moment State of Their Dialogue
Clark (1996, pp. 241–249) pointed out that dialogues proceed on two levels or tracks: track 1 is about the topic of a dialogue – what the interlocutors are talking about; track 2 is collateral – about the process of having a dialogue. Track 2 approximates one of Bateson’s (1951) meanings of “meta-communication” but has the advantage of embedding these acts in a moment-by-moment dialogue. Track 2 words or gestures do not contribute directly to the topic; indeed, one test is that they can often accompany any particular topic. Repairs and qualifiers such as “What I meant was … ” and “As I already said … ” are examples of track 2 in words. Prosodic emphasis and intonation can also serve these functions: for example, intonation that marks a declarative sentence as a question. A surprising number of gesture researchers have also noticed this distinction; see Table 22.1. In the following, it is important to point out that track 1 and track 2 are functions, not categories. Gestures serving track 2 functions can and often do have other functions or characteristics as well.
6.1 Track 1 versus Track 2 Gestures
At least since Efron (1941/1972), close observers have distinguished between gestures serving track 1 versus track 2 functions. Efron’s objective gestures (pointing, depicting, or symbolically representing a referent) conveyed specific content and would therefore be track 1. In contrast, a logical or discursive gesture could apply to any content. These baton-like movements were a sort of “timing out with the hand the successive stages of the referential activity” or an abstract sketch of the “paths” and “directions” of a thought-pattern (1972, p. 96). Later, Ekman and Friesen (1969, p. 68) adopted Efron’s term, batons, defining them more narrowly as “movements which time out, accent or emphasize a particular word or phrase.”
McNeill and Levy (1982) also noted the rhythmic function of some gestures, which they called beats: “small rapidly made gestures with indefinite form that do not depict any aspect of the verbally described situation” (p. 285). McNeill (1985) called beats off-propositional, in contrast to other gestures serving propositional functions. In McNeill’s (2005, pp. 40–41) analogy, these simple beats were “the equivalent to using a yellow highlighter on a written text.” Like a highlighter, a beat is nonspecific; it could highlight anything the speaker might choose.
Kendon (1995) identified pragmatic versus substantive gestures in his Italian corpus. The former were conventional (stereotypic) forms that, for example, indicated that a speaker was asking a question or commenting on the topic (vs. contributing to the topic). Similarly, Seyfeddinipur (2004) described the “pistol hand” in Iran as a meta-discursive (vs. propositional) gesture that marks, for example, a comment on the topic rather than depicting the topic itself. In 2004 and 2017, Kendon elaborated on pragmatic gestures, identifying several specific functions that referred to the discourse rather than the referent (e.g. negating it or treating it as a joke; 2017, pp. 167–168).
Streeck (2009) extensively illustrated speech-handling gestures in dialogue, which often treat some aspect of the dialogue (e.g. a speaking turn) as an object to be offered, handed over, or received. Kok, Bergmann, Cienki, and Kopp (2016) found both semantic and meta-communicative gestures in a quantitative survey of their corpus. For example, the latter marked some verbal elements as approximations (a gestural equivalent of “sort of”), stressed some elements as important, or marked a pause while searching for a word.
6.2 Discursive versus Interactive Track 2 Gestures
All of the above track 2 gestures are for an addressee because they provide subtle and often essential information about the ongoing discourse. In the past few decades, researchers have identified gestures with a new kind of track 2 function that is more explicitly social. (Table 22.1 distinguishes between these discursive and social functions.) These social gestures are about the immediate interaction between this speaker and this addressee, that is, about the process of having their dialogue. De Fornel (1992) noticed that addressees sometimes repeated a gesture that their speaker had just made. He called these return gestures and showed that addressees were using them to demonstrate to their speaker that they had understood his or her gesture. Similarly, Holler and Wilkin (2011) showed that addressees often mimicked their speaker’s gesture to demonstrate their understanding of what their speaker was telling them.
In the late 1980s, our research group was intrigued by the implicitly negative definitions of gestures such as batons and beats (i.e. they were not conveying referential content). So we took an inductive approach. Our data were unstructured dialogues with no assigned topic; we asked pairs of unacquainted students to have a ten-minute conversation such as they would have if they had met at the university cafeteria. Analysis started by identifying all gestures that were clearly depicting something about a topic in their conversation. Then, instead of studying these, we put them aside and focused on what remained.
It became clear that these gestures had a common, though brief and simple, form: At some point, the speaker’s hand, finger(s), or palm were directly oriented to their addressee. These gestures were not about the topic; they had interactive rather than topical functions. Our interpretation of each of these gestures, given its specific moment in the context of the dialogue, was directly related to the addressee. For example:
very briefly presenting one or both open palms to the addressee (a conduit metaphor), which we glossed as “Here’s something new to you”;
quickly rotating one hand and “flipping” an index finger at the addressee, while repeating or drawing on a contribution of the addressee, which we glossed as crediting the addressee, “As you said earlier”;
circling one hand from the addressee to oneself one or more times as part of a word search, as if asking the addressee to provide the word.
Speakers were using these gestures to inform their addressee about the state of their dialogue at that moment. Interactive gestures were one way a speaker could stay in close contact with their addressee without interrupting the flow of topical content.
Several experiments supported this theory (Bavelas, Chovil, Coates, & Roe, 1995; Bavelas et al., 1992, 2008). First, independent analysts were consistently able to agree on distinguishing interactive from topical gestures. Second, unlike topical gestures, interactive gestures rarely have their meaning expressed in the accompanying speech; they are usually nonredundant. Third, speakers made significantly more interactive gestures in dialogue than in monologue. Fourth, even in dialogues, they made significantly fewer interactive gestures when their addressee was not visible (whereas topical gestures were not affected by mutual visibility, as described earlier). Fifth, even in face-to-face dialogues, speakers made significantly more interactive gestures when working together in a reciprocal dialogue than when speaking in alternating monologues.
Another gestural track 2 function was implied earlier. Gerwing (2003), Gerwing and Bavelas (2004), and Holler et al. (2022) showed that gestures referring to something the speaker has mentioned earlier are smaller, shorter, and less well formed. Fowler (1988) had pointed out that shortening a word may be for the addressee’s benefit, that is, marking given information. Notice that shortening a topical gesture (or word) serves both track 1 and track 2 functions; it depicts a specific referent and also shows its status as given rather than new information.
Two other gestures mentioned above also combine both track 1 and track 2 functions: The addressees’ return gestures that De Fornel (1992) described conveyed the same topical information as the speakers’ original gestures did; it was their immediate repetition that added a track 2 function. Similarly, addressees who mimicked their speaker’s gestures in Holler and Wilkin (2011) were depicting the same referent, but the repetition conveyed “I understood your gesture.”
In short, speakers and addressees have numerous ways of shaping their gestures to indicate how their interlocutor should take the information they are offering, from qualifying what it means to confirming mutual understanding. The advantage of Clark’s track 1/track 2 framework is that, rather than simply labeling these instances, it places them directly in a moment of interaction between speaker and addressee.
7 Summary
This chapter proposes that the microsocial functions of gestures are well documented in experimental research and are as important as cognitive or individual functions. Gestures are neither “for the speaker” nor “for the addressee”; they are for both and for the interaction between them. Moreover, attending to both could provide better explanations of each. For example, the speed and subtlety of the social functions of gestures may require new cognitive explanations: How do speakers make a specific interactive gesture in a second or less while also conveying complex topical content? More generally, Holler and Levinson (2019) have shown that multimodality is not an add-on but an asset in language processing.
There is also a methodological lesson here, which is not limited to experimental methods or psychology. All research methods constrain as well as reveal. It is important to examine periodically whether the history or tradition of a preferred method is unnecessarily limiting one’s research. The solution, as here, is always to question our taken-for-granted assumptions and the biases they may lead to. Some limitations may be resolved – and new phenomena revealed – by complementary methods.
1 Introduction
Intersubjectivity can mean many things and is deployed broadly across disciplines (Brinck, 2008). How does this idea relate to empirical and embodied interests in interaction and intercorporeality? Is there a single “I” word that should rule them all?
The everyday ubiquitous practice of gesturing (here I will focus on spontaneous conversational or cospeech gesturing with the hands) manifests a rich range of intersubjective phenomena, from a Habermasian sense of interpersonal world-sharing through Schegloff-esque turn organization to phenomenologically profound sensory intertwining à la Merleau-Ponty. Studying gestures holds the potential to forge a “bodies-centered” conception of intersubjectivity that realizes its fullest significance.
In the following selective discussion of gesture scholarship, I want my support of this claim to serve as a resource for all gesture scholars interested in the inherently rational and normative nature of their irreducibly social and bodily subject of study. But my argument here, as elsewhere (Cuffari, 2011, 2012), is that gesturing is important – in fact, game-changing – for anyone interested in how language really works. Philosophers have long been inspired by gestural phenomena (e.g. Merleau-Ponty, Mead, Wundt, and philosophers of antiquity), but interest on the part of contemporary figures like Shaun Gallagher (2006, 2013) and Joel Krueger (2012) has focused primarily on broad demonstrations of embodied or extended cognition. Philosophers of language, particularly those concerned with pragmatics, have yet to reckon adequately with the richly intersubjective possibilities of conversational gesturing. If it is the case that gestures, via their intersubjective characteristics, meet criteria for rational, normative, referential, and recursive communicative action (and I think it is), this does not imply that we ought simply to assimilate this modality into the standard story about linguistic meaning. Given the ways this story skews to biases of propositionality, written language, solipsism, and disembodiment, such a move is unlikely to work anyway. It is not that hand gesturing wins an honorary title of “linguistic” or being “part of language.” It is, rather, that appreciating all the things that gestures do, and that we do together in gesturing, rezones the linguistic. Pace philosopher Robert Brandom (1994), the downtown of language is not propositional score-keeping; it is rich, nuanced intersubjectivity as showcased in gesture. To be slightly more ecumenical about it, there is no empirical reason to prioritize propositional score-keeping over bodily intersubjectivity to the extent that the philosophical tradition has done.
The following sections describe a traditional philosophical picture of intersubjectivity (Section 2), noting studies of hand gesturing that borrow from it and challenge its commitments to verbal propositionality; discuss the interactional and ecological side of gesture studies that interpret intersubjectivity in terms of practical, material situations in which people are copresent (Section 3); and discuss phenomenologically oriented understandings of intersubjectivity as intercorporeality (Section 4). Closely on the heels of this, I present the enactive account of linguistic bodies, largely inspired by gesture studies and philosophical reflections on gesture, as offering a framework for investigating gesture as fully linguistic and therefore as manifesting intersubjectivity in its rational, interactional, and intercorporeal dimensions.
2 Common-Ground Approaches to Intersubjectivity in Gesture
Intersubjectivity, in the post-Kantian and post-Wittgensteinian, broadly pragmatist philosophical tradition of Habermas, Grice, and Brandom, is a background condition for communicating. This understanding of intersubjectivity features cooperation, agreement, and mutual understanding, founded in human openness to the perspectives of others. Historically speaking, the key philosophical works discussing intersubjectivity as a defining feature of human life and communication consistently fail to present it as an embodied phenomenon; the body remains a secret or unstated condition of communication, while intersubjectivity is explored as either the mystical or quotidian meeting of rational minds.
Gesturing, even as a varied phenomenon, will always point to these broad conditions of possibility of collaborative meaning construction. We do well to note, following Streeck (2017, pp. 216–217): “Because gestures have the features of living, not mechanical, motion, interaction partners will always perceive them as inherently meaningful, even if this perception is as autonomic and unrecognized as the making of gesture is on the part of the speaker […] Our perception of gesture, finally, is informed by our own ‘tacit knowing’ of the world we inhabit.” This world is material, physical, and fundamentally intersubjective in a Habermasian sense.
For the philosopher Habermas (1984), the immanent purpose or “original mode” of language is reaching understanding (p. 288). Communicative action is the rational attempt to reach understanding, establish interpersonal relations, and coordinate action (Habermas, 1984, p. 86) and to do all this noncoercively. Rational attempts are those that can fail, that require intersubjective recognition, and that allow defense against criticism. Symbolic expressions and actions are rational when they are done in a self-consciously defeasible way, with a sensitivity to conditions of validity – in other words, with the built-in expectation that the validity of the action is determined by others. Research on gesture repair (corrections made when gestures are infelicitously received or not understood, or uses of gesture to signal a need for or provide a repair for an unsuccessful utterance), while not often deploying the language of intersubjectivity, demonstrates the rationality of gesturing in rich ways (e.g. Goodwin, 2006; Streeck, 1994). When engaging in communicative action, actors take a reflexive stance on their actions while acting, understanding that the actions may be questioned and can be further explained, usually by making reference to shared forms of life (Habermas, 1984, p. 9).
Such reflexivity is a condition for and a result of intersubjectivity. In other words, one can treat intersubjectivity both as a background condition for shared meaning-making and as the outcome of such coactivity. In any case, it is not only the gesturer’s self-awareness (of the possibility of being misunderstood or of having to say and do more to be understood or to be correct, etc.) that is necessary, but also the involvement of the gesture recipient. Enfield (2009, p. 14) tells us that both verbal and gestural elements in a composite utterance put burdens of recognition and interpretation on the conversation participant. Meeting those burdens, together, is then tantamount to establishing intersubjectivity. Recognition and interpretation are as much necessary processes for successful gesture meaning-making as they are for verbal language. This means that a gesture is not self-interpreting; its meaning requires effort, attention, context, sensitivities, and more to be achieved. Moreover, the meaning of a gesture may in fact be the effects it has in context.
Turning to empirical gesture scholarship, one finds that work explicitly mentioning intersubjectivity often operationalizes it in terms of “common ground.” In “How gesture use enables intersubjectivity in the classroom,” Nathan and Alibali (2011) note that establishing common ground requires delineating common referents and connecting familiar representations with unfamiliar ones. Helpfully, they use these criteria to show that not all gesturing guarantees intersubjectivity automatically. Indeed, if gesturing could not fail to achieve shared understanding, it would not be rational. (Instances of spoken language also routinely fail to achieve shared understanding.) They identify and analyze pedagogical linking gestures that help the achievement of common ground in mathematics classroom learning (Alibali, Nathan, & Breckinridge-Church, 2017).
Using conversation analysis methodology combined with McNeill’s gestural taxonomy to study second language learner situations, Belhiah (2013) equates intersubjectivity with “mutual understanding” and notes that participants’ gestures secure mutual understanding by unpacking meaning and by displaying alignment in both replicating and coproducing gestures. Majlesi (2015) observes that teachers in a second language classroom in Sweden repeat their students’ gestures – they use “matching gestures” – as a means to “maintain and sustain intersubjectivity” in instructional interactions. This understanding of intersubjectivity is dynamic, sensitive to the need to keep attention joined and shared over time. Matching gestures (“those gestures that are similar, if not identical, to those in the prior turns-at-talk,” p. 30) do this by linking referents in time (they “tie two subsequent gestures”) and focusing attention on certain points as deserving of attention (matching gestures “highlight learnables”). For a review of current research on what gestures reveal about second language acquisition (SLA), and how gestures affect SLA in interaction and in instruction, see Gullberg, this volume.
Instruction is a rich ground for exploring gesture’s role in intersubjectivity, perhaps because of the epistemological asymmetry that characterizes most instructional interactions. As Jerome Bruner (1983) noticed in the context of his psychological research on mother–child interactions and their role in language acquisition, a joint referent can come up in asymmetric conversations, and for one party this referent may be initially indistinct, but the referent “can be developed both for its truth value and its definiteness” (p. 68, emphasis added). Much like mothers, then, teachers may not know what their students have in mind, “[…] nor are they sure their own speech has been understood […] but they are prepared to negotiate on the shared belief that something comprehensible can be established” (Bruner, 1983, p. 86, original emphasis). As studies like those above show, gesture assists in the collaborative work toward this achievement. A review of common-ground influences on gesture finds widespread support “of recipient design as well as more specific social functions such as grounding, the given-new contract, and Grice’s maxims” (Holler & Bavelas, 2017, p. 213).
Casting intersubjectivity in gesture in terms of common-ground effects introduces a potentially interesting twist on Grice (and see Enfield, 2009; Wharton, 2009; Cooperrider, 2011, 2017). Recall that for Grice, interlocutors are expected, and expect each other, to track the meaning that emerges across multiple levels of speech acts, not all of them verbal. But the reasoning process that hearers go through, in Grice’s view – that is, the reasoning process that renders nonconventional implicatures rational, thereby securing their communicative force on the basis of presumed communicative intent – is traditionally a thoroughly propositional affair. Given a (presumed male) speaker whose speech act is not conventional, the following is the verbal process of his interlocutor:
He has said that p; there is no reason to suppose that he is not observing the maxims, or at least the Cooperative Principle; he could not be doing this unless he thought that q; he knows (and knows that I know that he knows) that I can see that the supposition that he thinks that q is required; he has done nothing to stop me thinking that q; he intends me to think, or is at least willing to allow me to think, that q; and so he has implicated that q.
Work on gesture and communicative intent carries the potential to at least broaden the modalities involved in, if not upend, the whole explanation of this “working out.”Footnote 4
Yet, gesture research on communicative intent and common-ground effects also runs a risk of inheriting the baggage of other-minds theorizing. For example, Reference Hostetter, Alibali, Schrager, Stam and IshinoHostetter, Alibali, and Schrager (2011) suggest that when speakers intend to communicate clearly with their interlocutor (as in cooperative rather than competitive conditions), this intention or motivation yields a higher proportion of relatively large gestures, tying gesture production not just to speakers’ intentions but to their beliefs about listeners’ broad social intentions. This view is theoretically oriented to the mental attitudes and states of participants, treating the gesturing activities of the hands not as themselves communicative or cognitive, but as indications of mindedness or communicative intent, a “window onto the mind” (e.g. McNeill in 2013, though for a time this phrasing was ubiquitous) that paradoxically positions the body in an “unwitting” role (Reference McNeillMcNeill, 1992).
Embodied and enactive approaches to social cognition have long been at pains to keep a distance from the common premise of simulation or theory of mind explanations of other minds (see Reference RatcliffeRatcliffe, 2006, for discussion). That is, embodied and enactive approaches resist the idea that others’ minds are self-enclosed “behind” or “inside” their bodies. Extensions of the enactive social cognition argument call into question the very idea of communicative intentions that are temporally prior to interactions (or to a participant’s contribution to an interaction) (Reference Di PaoloDi Paolo, 2015). While it may seem that gesture as a “window onto the mind” would help solve the other minds problem, and do so in an embodied way, this solution comes at a cost to the complexity of gesture interpretation that I have been tracking here. Hand gestures are not necessarily any more readily understood than verbal gestures; all gestures put burdens of recognition and interpretation on their recipients (Reference EnfieldEnfield, 2009), and this is vital to their status as normative, rational, and intersubjective (not just subjective!) phenomena. As Reference PutnamPutnam (1974) famously put it, meaning just ain’t in the head; it should be easier than it appears to be for gesture researchers to rejoin, “That’s right, it’s in the hands” – or, as we see below, in interactions and interacting bodies (if it makes sense to put meaning “in” anything at all).
From a philosophical perspective, then, talking about common ground and the achievement of intersubjectivity quickly shifts into talking about conventions and other minds. As Adam Kendon points out, “even when speakers are not making use of forms with ‘quotable’ meanings, the forms of action they employ still convey meaning in various ways and are governed, to varying degrees, by social conventions” (Reference Kendon, Müller, Cienki, Fricke, Ladewig, McNeill and Teßendorf2013, p. 12). Wonderful work has been done on recurrent gestures and gesture grammar (see Ladewig, this volume; Reference Ladewig and BressemLadewig and Bressem, 2013) that usefully brings community and history into the explanation of how gesturing is understood, which is a part of explaining how it secures common ground (or fails to). To return to Kendon, the abilities of observers and analysts of gesture to regard visible action as “intended or meant as part of the speaker’s expression” and as “related to the semantic or pragmatic content that has been apprehended from the speech” are “based upon our ability to grasp how these actions are intelligible. The basis for this understanding remains obscure, however” (Reference Kendon, Müller, Cienki, Fricke, Ladewig, McNeill and TeßendorfKendon, 2013, p. 16, emphasis added).
Without attending in more detail to bodies, their environments, and the habitual movements and perceptions that emerge from histories of transactions between the two, we cannot advance in our discussion of intersubjectivity. Gesture meaning emerges as a collaborative and contingent element of a social interaction, in moving, sensing, agentive, and perspectival bodily intelligences as well as in normative and abstract horizons of shared lifeworlds.
3 Ecological and Interactional Approaches to Intersubjectivity in Gesture
Meaning, as an achievement of recognition and interpretation, is in each case a coconstruction with the environment, with horizons of bodily affordance and cultural knowledge, with what is present and what is absent, and with others. Gestures are the intelligent and intelligible actions of persons embedded as well as “embodied.” Gesturing is a practice of interacting not only with others but with, in, and through some kind of shared world. Examples of this kind of account in gesture studies can be found in Streeck’s gestural ecologies (Reference Streeck2009, Reference Streeck and Streeck2010). In these analyses we see that gesturing is a local activity embedded in and referring to a world, or perhaps to concentrically nested, or unevenly overlapping, worlds. Gestures do not merely manifest something internal, nor do they simply repeat or mirror what is external; they also shape, create, and coordinate that which coinhabitants of a semiotic space experience.
By identifying multiple and distinct ecologies, Streeck at the same time identifies different styles of gestural sense-making, and hence different senses made. Attending not only to Habermas’ three formal worlds (objective, social, and subjective), but to a range of particular ecologies (interactional domains structured by distinct affordances and built by types of gestures), Streeck gives detailed examples of how gestures fit and form moments of meaning-making. For example, Streeck varies the size of ecologies relative to gestural intervention in them. He contrasts “the world at hand” with “the world within sight” (Reference Streeck2009, pp. 8–9). Depending on where one draws the borders of a microecology in a moment of interaction, gestures may couple with objects and instrumental actions, or they may mark up a scene out of reach of direct contact but in reach of shared view, highlighting certain points and features, lining up causal relationships, indicating movement and action potentials. In this way, gestures themselves contribute to the sizing up of the semiotic space from moment to moment. In some cases, gesturing operates in the dimension drawn by speech, as in depicting.
Gestures also gather, into the sensory modalities of proprioception, sight, and touch, aspects of invisible meaning-worlds, or horizons of normativity and reference, that is, dimensions that emerge out of and persist in repeated human interactions. This is the case in “ceiving” and “displaying communicative action”; in these less “concrete” ecologies, gesturing enacts “manual concepts” through selective schematic demonstration (e.g. a mechanic rotating his index finger in a circle by his ear while making a “listening face” to demonstrate hearing something crank (Reference Streeck and StreeckStreeck, 2010, p. 233)) or refers to speaker performances. The invisible is made observable also when hand gestures render accessible and salient sensory qualities such as texture, or dynamic spatial information such as fitting-together, as Streeck discusses at length in his analyses of a mechanic in his shop (Reference StreeckStreeck, 2002, Reference Streeck2009, Reference Streeck2017).
This is the logic of appropriative disclosure (Reference StreeckCuffari & Streeck, 2017): My touching something in order for you to know something about its touchable features works not only because I perform this act in such a way that it is both exploratory and communicative. My performance is part of the process, but so is the possibility of your comprehension, a possibility that undergirds communicative intentionality as such: “the beholder, the recipient of conversational gestures, also draws upon this undisclosed background of haptic understandings; otherwise, he or she would not be able to recognize the action-patterns that the gestures instantiate nor the equipment and objects that go with them” (Reference StreeckStreeck 2009, p. 150).
Common ground in gesture is grounded in common bodily experience, but this commonality cannot be taken for granted. Gesturing, as Streeck’s work shows, selects, elaborates, and presents certain features of what is there for us to notice our bodies noticing. In attending to shared place and time, gesturing changes what unfolds there, and interlocutors manage this changing together.
Intersubjectivity is also pursued in gesture studies as “a temporally-bound achievement” of shared understanding “accomplished through (and embedded in) turns at [multimodal] talk” (Reference Sikveland and OgdenSikveland & Ogden, 2012, p. 167). Here the embodied collaboration with another participant is analyzed as unfolding over time, with less close attention paid to space and environmental coupling. “Shared understanding is also embedded in their shared use of features such as reference and deixis, and through the tying together of turns at talk into sequence” (ibid.). Particular gestural phenomena that enact intersubjectivity in this way include gestures held across turns (Reference Sikveland and OgdenSikveland & Ogden, 2012) and overlapping gestures (Reference Mondada, Oloff, Stam and IshinoMondada & Oloff, 2011), analysis of which reveals participants actively managing understanding together.
Reference Mondada, Oloff, Stam and IshinoMondada and Oloff (2011) examine cases such as when an interrupted speaker yields the floor but “the speaker’s gesture is frozen throughout overlap and is remobilized when he resumes the turn,” which the authors take to demonstrate “the embodied conception participants have of speakership as it is locally defined, achieved and sustained” (p. 325). In other words, what the interrupted speaker does with her hands (as well as body orientation, gaze, and speech) is a key part of how the interruption is perceived and responded to. The fact that gestures can be perturbed, frozen, or abandoned completely as a result of an other’s overlapping activity indicates multiple sensitivities to others as well as the perpetually fragile nature of social interaction dynamics and shared understanding regarding dialogical roles (Mondada 2007; Reference Mondada, Oloff, Stam and IshinoMondada & Oloff, 2011, p. 336). Gesturing as a practice inherently moves toward intersubjectivity even as it is vulnerable to failure.
In several detailed analyses, Reference Sikveland and OgdenSikveland and Ogden (2012) show that “gestures can be held across turns at talk as a resource for speakers to display that there remains an outstanding issue of shared understanding” (p. 190). For example, a speaker makes a metaphorical gesture expressing her struggle in a second language (via a GOOD IS UP schema) while asking her (male) interlocutor “do you understand?” and she holds this gesture through his first and second verbal responses, only finally changing her gesture to match his when he uses speech and gesture to express the difficulty of achieving an equal level of understanding in two languages (still employing a vertical spatial metaphor for good skill). She holds this mirroring gesture while confirming verbally. Over a sequence of turn-overlapping gestural holds, both participants come to “use the same verbal and gestural metaphor, and by mirroring one another’s use of this metaphor, they have demonstrated to one another that they have understood the other to share this metaphor as well. Thus their understanding is not just subjective, but intersubjective; and visibly so” (Reference Sikveland and OgdenSikveland & Ogden, 2012, p. 190).
There are still other ways that gesturing works to secure intersubjectivity as an interactional achievement. Addressee effects, or what might more commonly be referred to as audience or recipient design, constitute a robust subfield in gesture scholarship (Reference Bavelas, Chovil, Lawrie and WadeBavelas et al., 1992; Reference GoodwinGoodwin, 1981; Reference ÖzyürekÖzyürek, 2002; Reference Schubotz, Özyürek and HollerSchubotz, Holler, & Özyürek, 2019; Reference StreeckStreeck, 1993, Reference Streeck1994). As noted above in the discussion of Streeck’s gestural ecologies, pointing may work to secure shared recognition of referents or of roles. Focusing on a Japanese dyad’s interaction, Reference Ishino, Duncan, Cassell and LevyIshino (2007) identifies “synchronic intersubjectivity” at play in how gestural deixis “encodes” “the speaker’s conceptualization of perspective” and in how, specifically, the speaker uses the addressee’s body to stand in for other characters. A study on pointing gestures in a design activity finds that participants “employ pointing not merely to create a place for shared attention, i.e. to index an object, but also, and more importantly, as a device for solving potential troubles of misunderstanding and disagreement in relation to the ongoing design” (Reference Donovan, Heinemann, Matthews and BuurDonovan, Heinemann, Matthews, & Buur, 2011, p. 7).
Finally, an interactional approach can be taken to revisit the question of communicative intent brought up in Section 2. Consider that Reference Vajrabhaya and PedersonVajrabhaya and Pederson (2018) find reduction in representational gestures (reduction in rate and size) in situations where listeners provide minimal, unvarying feedback. While studies have found that gesture may reduce, or certain features of gesture may decrease, in cases where the speaker knows the listener already knows or has already received what is being communicated (e.g. Reference Gerwing and BavelasGerwing & Bavelas, 2004; Reference Holler and StevensHoller & Stevens, 2007; Reference Jacobs and GarnhamJacobs & Garnham, 2007), Vajrabhaya and Pederson offer an interactional explanation of this effect as one of listener sensitivity in the encounter. That is, the reduction effects in the speakers’ gesture production emerge “from an on-going and dynamic interaction between interlocutors,” rather than from an a priori understanding of common ground, as more abstract explanations would have it. In their study, the reduction effect was found regardless of “the novelty of information to the listener” (Reference Vajrabhaya and PedersonVajrabhaya & Pederson, 2018, p. 66).
In a vast range of ways, hand gesturing is a practice of intersubjectivity. This means, crucially, that gesturing does not merely reflect shared understanding (or a shared world) but builds it. For this reason, it makes sense to count gesturing as a kind of collaborative thinking. It does not express thought, as philosopher Maurice Merleau-Ponty famously wrote, but accomplishes it; gesturing is thinking.Footnote 5 The enactive approach to cognition understands thinking together as “participatory sense-making” (Reference De Jaegher and Di PaoloDe Jaegher & Di Paolo, 2007). Partly inspired by the myriad ways that gesturing evidences complex cobodily meaning-making, Reference De Jaegher, Newen, De Bruin and GallagherDi Paolo, Cuffari, and De Jaegher (2018) derive an account of languaging that is continuous with the principles of participatory sense-making. What this framework may mean for gesture scholarship is discussed in Section 4.
4 Intersubjectivity as Intercorporeality: Gesturing as Something Linguistic Bodies Do
Enactive and phenomenological treatments often understand embodied intersubjectivity in terms of intercorporeality (Reference CsordasCsordas, 2008; Reference De Jaegher, Newen, De Bruin and GallagherDe Jaegher, 2018; Reference Fuchs and De JaegherFuchs & De Jaegher, 2009; Reference Loenhoff, Meyer, Streeck and JordanLoenhoff, 2017; Reference Weiss, Weiss and HaberWeiss, 1999). Following Merleau-Ponty, the philosopher Maclaren defines it: “Intercorporeality is this bodily perception of another body – a perception which consists not in an intellectual grasp of something that is other to us, but in a bodily mirroring, or a bodily resuming (reprendre), of an intentionality that we inhabit over there” (Reference MaclarenMaclaren, 2002, p. 190, emphasis in the original). We experience others in this direct, intercorporeal way by “touching and being touched by them, as in a handshake, and by in general moving and living through the coordination and miscoordination patterns between our bodies as they interact […] as internally related parts in a unity or synergy” (Reference De Jaegher, Newen, De Bruin and GallagherDi Paolo, Cuffari, & De Jaegher, 2018, p. 63). Attend for a moment to the last face-to-face conversation you had with somebody. Recall the sense that you were participating not only with somebody but in something. Intercorporeality describes the ways that our bodies can blur, in sensations, actions, and intentions (Reference Hougaard, Rasmussen, Müller, Cienki, Fricke, Ladewig, McNeill and TeßendorfHougaard & Rasmussen, 2013). Conversational or interactional synergy is possibilized by this openness, this bodily-given “readiness to interact” (Reference Di Paolo and De JaegherDi Paolo & De Jaegher, 2012). In turn, intercorporeal participation produces that which we are in, together, when we are “in conversation”: an autonomous interaction dynamic (Reference De Jaegher and Di PaoloDe Jaegher & Di Paolo, 2007; Reference De Jaegher, Newen, De Bruin and GallagherDi Paolo, Cuffari, & De Jaegher, 2018, pp. 64–73).
On an enactive view, languaging is a special social agency that rises to the unique challenge of coregulating intersubjectivity, understood as a bodily disposition toward getting entangled with other bodily beings-in-the-world. We have seen already here that gesturing is a bit like a Swiss Army Knife of possibilities for realizing and managing intersubjectivity; it is languaging par excellence. As I suggest in Section 1, language use is not just proposition-mongering, but is better thought of as a normatively constrained practice of meaning-achievement in a dialogic, cooperative context. Languaging as cobodily social agency provides a particular response to recurrent tensions between individual and interactive levels of sense-making and between codified and spontaneous styles of sense-making (Reference Cuffari, Di Paolo and De JaegherCuffari, Di Paolo, & De Jaegher, 2015). Put another way, speaking and gesturing humans are linguistic bodies insofar as we organize the multiple dimensions of our embodied caring and openness to the world through coordinated comanagement of utterances (Reference De Jaegher, Newen, De Bruin and GallagherDi Paolo, Cuffari, & De Jaegher, 2018). To be a human person is an ongoing, socially scaffolded, bodily achievement of navigating and participating in an open sea of dialogic acts.
Inspired by the rich intersubjective possibilities witnessed in gesturing, the linguistic-bodies view in turn suggests that hand gestures be studied as enactive symbols (or to put it less statically, that hand gesturing be viewed as a practice of enactive symbolizing). Hand gestures “emerge from operations between bodies, interactions, and sedimented community practices, and they establish for participants microcontexts of virtual flows and ideal images, with attendant constraints and affordances” (Reference De Jaegher, Newen, De Bruin and GallagherDi Paolo, Cuffari, & De Jaegher, 2018, p. 301). For example, recurrent gestures are noticeably more sedimented than other gesture symbols, without being as fixed as emblems. This in-between status “actually makes recurrent gestures a more ready example of enactive symbolizing than word use […] neither fully open nor fully closed […] [they] are likely to bring about this set of projective and regulatory effects on an interaction rather than another” (ibid.). On the enactive view of linguistic bodies, this “admitting of degrees” that places on all language users continual twin burdens of local, intersubjective recognition and interpretation “is not true only of gesture, or of semiosis, but is the beating heart of all that we call linguistic” (ibid.).
In “The lived experience of a recurrent gesture in China: Embedding the Vertical Palm within a gift-giving episode (also known as the ‘seesaw battle’),” gesture researcher Harrison’s microanalysis of a multimodal meaning-making event deploys conceptual resources from the linguistic-bodies account to show how participants generate and manage interactional tensions together, in a cyclical and spontaneous fashion, through bodily sense-making including “postures, facial expressions, gestures, and direct body-to-body manipulations” (Reference HarrisonHarrison, 2021). Two friends enact formalized roles demanded by the gift-giving ritual, but at the same time their interaction locally and interpersonally realizes the significance this exchange has for their relationship and for the encounter. Describing the “see-saw” phenomenon that emerges spontaneously yet nearly conventionally throughout China in Lunar New Year gift-giving battles, a version of which the friends in his case study come to enact, Harrison notes, “the result is a surging back-and-forth movement coauthored by the participants, whose languaging bodies become physically and intersubjectively entangled” (Reference HarrisonHarrison, 2021). Carefully tracking how certain communicative gestures become strongly normative in this exchange, even sometimes seeking to physically force cooperation against resistance, Harrison uses dialectical stages from the linguistic bodies account to find structure in a rapid, dynamic event brimming with strong feelings and movements.
In a recent book-length treatment of a single subject’s gestural enactments unfolding on a single day, Reference StreeckStreeck (2017) highlights the enactive concept of autopoiesis or “self-making” and demonstrates it to be an interactional, intercorporeal project achieved in vital part through gesturing. In a highly nuanced refinement of his work on conceptual gestures (discussed above), Streeck indicates a new path in gesture research, that of self-interaction, or an intersubjective relation one enacts with oneself in gesturing. Streeck writes:
Conceptual gestures, like all gestures, are doubly interactive: they are elements of the interaction between speaker and listener (and sometimes other participants), but they are also products and elements of the speaking body’s interaction with itself. Gestures have something to say to the self. Acting out a “cept,” completing a physical gesture, even subconsciously, registers a feeling of action in the speaker, the more clearly the more energetically it is done. […] Even where it concerns virtual and abstract domains, the conceptual and social activity of speaking remains bodily action.
Self-dialogue enacted in gesture is possible because gesturing, as the body’s self-organizing response to a meaning-rich environment (Reference StreeckStreeck, 2017, pp. 284, 293–295), is at once deeply personal and broadly impersonal and anonymous.Footnote 6 We cannot be satisfied with either of these options; they are in constant dialectical negotiation, as we live as bodies that tend to act like “bodies like ours,” yet are irreducibly idiosyncratic and singular, and live in a social world that always already anticipates and arranges everything from perception to protest, yet remains open to an undetermined future. The enactive philosophy of linguistic bodies articulates a closely related view of the self as a perpetual process of navigating the tension between personally embodying (“incorporating”) an utterance, that is, making it one’s own, and impersonally embodying (“incarnating”) an utterance, that is, being made by it (Reference De Jaegher, Newen, De Bruin and GallagherDi Paolo, Cuffari, & De Jaegher, 2018). Sometimes this comes down to the difference between a gesture of straight-out mimicry and one of tribute, or between an unwitting repetition of a parent’s tic and a family resemblance. The difference is slippery and subtle, but these are the components that make any one linguistic body its own.
New methodological possibilities follow from an autopoietic or self-organizing approach to gesturing. Attending to a single person over years (or even over a day) opens up a new dimension of analysis: “gesture habits – or habitualized gestures – represent a living body’s communicative self-organization on a biographical scale” (Reference StreeckStreeck, 2017, p. 296). An individual’s repertoire may be explored in potentially analogous relationship to recurrent gestures identified in a community. In this chapter, concerned with making a case for a special kind of linguistic intersubjectivity achieved in gesturing, I did not inquire into the roles gesturing may play in intersubjectivity achieved in adult–infant or adult–child interactions (Reference Brinck, Zlatev, Racine and SinhaBrinck, 2008; Reference TrevarthenTrevarthen, 2011), human–animal interactions (Reference Gómez, Armstrong and BotzlerGómez, 2003, Reference Gómez2010; Reference Pika, Liebal, Call and TomaselloPika et al., 2005), human interactions mediated by machines, and human–computer interactions (e.g. Reference Cassell and TartaroCassell and Tartaro, 2007), but these are likely significant boundary cases for studying everyday linguistic bodies in action.
5 Gesture and Intersubjectivity
Considering gesture research through the lens of intersubjectivity brings into view a broad array of phenomena, from interpersonal common-ground approaches through coachieved interaction effects to intercorporeality and complexly entangled participatory enactments. All of these are part of the story of making sense together. Gesturing and intersubjectivity are multifaceted yet reciprocally informing phenomena that presuppose each other. A diverse range of gesture scholarship invokes and demonstrates intersubjectivity. This diversity spans both sides of what is sometimes identified as a social/cognitive debate in the literature on why people gesture and how gestures are meaningful (see, e.g. Reference CooperriderCooperrider, 2017; Reference Vajrabhaya and PedersonVajrabhaya & Pederson, 2018).
Reflecting on the inherent intersubjective potential and ground of gesturing perhaps helps one to appreciate the logical limits of either totally internalist or totally externalist stances on gesture meaning. If one gives a more cognitive-process-oriented account, the world in common is still implicitly relied on or assumed to secure mutual understanding, and this becomes even more pressing in the absence of convention at the level of form. If, on the other hand, one says that a gesture is an immediately visible mental state, this may only extend the borders of the brain to the fingertips; in other words, this body externalism continues to make the individual mind the arbiter of meaning, and an individual mind acting alone cannot guarantee gestures as rationally, intersubjectively meaningful acts. Meeting the dual burdens of recognition and interpretation has to be done as a team, and it is not usually helpful to locate a team’s activity as happening “inside” or “outside” individual team members. If each episode is its own and “only the microscopic analysis of real-time, real-life, real-culture interaction” explains “how the understanding of the situated meaning of a gesture is achieved” (Reference StreeckStreeck, 2017, pp. 296–297), then the question of whether gestures in general admit more of listener-side sensitivity or speaker-side use value is likely unresolvable.
Meaning itself is neither an utterly unobservable internal, nor a perfectly observable external, phenomenon. Meaning fails to be captured in fully natural or fully conventional treatments. Meaning is the ultimate intersubjective, a border phenomenon relating without erasing individual and social, mind and body, nature and culture, self and other. Gesturing helps us to know this.
1 Introduction
The field of gesture studies has long recognized that gestures vary across linguistic/cultural groups. However, gesture has also been viewed as a universal, natural, and even primitive form of expression. In antiquity, rhetoricians considered gesture a universal language of the hands, but also noted that it could be modified for effective oratory (see Reference QuintilianQuintilian, 1922, in Reference KendonKendon, 2004, pp. 17–19). Philosophers in the seventeenth century regarded gesture as the natural universal language of humankind and a window on the soul, while emphasizing its role in rhetoric as an art form that could be refined (Reference BonifacioBonifacio, 1616; Reference Bulwer and ClearyBulwer, 1644/1974 in Reference KendonKendon, 2004, pp. 22–28). This focus on gesture as part of rhetoric and bodily conduct that could be modified for pragmatic purposes continued into the eighteenth century. Bodily comportment and gestures were recognized as indicating one’s status and considered amenable to management and alteration for social and political ends (Reference KendonKendon, 2004, pp. 32–34).
The view that gestures were a window on the soul and inner thoughts led to theories about gesture as the most natural form of expression and therefore as the precursor of speech in the evolution of spoken languages (see Reference KendonKendon, 2004, pp. 50–60 for discussion of Reference Condillac and WeyantCondillac, 1746/1971; Reference Diderot and JourdianDiderot, 1751/1916; Reference TylorTylor, 1865/1964; Reference WundtWundt, 1973). Anthropological founders, like Edward Tylor, recognized that gestures varied across the world, and early anthropological theory hypothesized that differences in gesture use between so-called “primitive” and other societies might be due to cultural evolution. (See Żywiczyński and Zlatev, this volume, for other views on the role of gesture in language evolution.) However, once anthropology accepted the idea of cultural relativism in the early twentieth century, it rejected an evolutionary explanation for gesture variation across cultural groups. Instead, anthropologists demonstrated how bodily behavior was culturally infused and varied accordingly (Reference MaussMauss, 1935/1973; Reference Mead and BatesonMead & Bateson, 1942).
The first systematic comparative study of gesture variation was motivated, in part, to challenge evolutionary theories of racial superiority that suggested differences in gesture behaviors were an innate characteristic of race. David Efron, a student of Franz Boas, conducted a comparative study of the gestural behavior of Eastern European Jewish and Southern Italian immigrants to New York City in the 1930s (Reference EfronEfron, 1941, Reference Efron1972). He found that while Jewish and Italian gestures and gestural behavior were different, these differences became less distinct with succeeding assimilated generations. These findings demonstrated that variation in gesture behavior was not an innate characteristic of race or ethnicity, but a product of sociocultural environments.
What is universal about gesture, and what varies and why, remain central questions in the field of gesture studies. After all, mind and body are common components for making meaning in a shared physical world. As Reference CooperriderCooperrider (2019, p. 209) points out, “Gesture is unmistakably similar around the world while also being broadly diverse.” Although some features of gesture appear universal, speakers of different languages vary in their use of gesture. Similar propositional content may be expressed using different physical forms, movements, and parts of the body. Kinesic aspects, such as the physical articulation of gestures and the use of space, also differ. Patterns of use also vary. For example, head nods are more frequent in interactions in some cultures than others (Reference Kita and IdeKita & Ide, 2007). Some cultures appear to favor iconic depictions of content and the frequent use of established form-meaning pairs such as emblems (Reference EfronEfron, 1972; Reference KendonKendon, 2004). Speakers of different languages also appear to vary in the prominence of their gesture use. For example, a popular stereotype of Italians is that they use gesture more frequently and more conspicuously than other cultural groups. (This stereotype is reflected in popular publications with titles such as Speak Italian: The fine art of the gesture, featuring Bruno Reference MunariMunari’s 1963 dictionary of Italian gestures.)
The field of gesture studies has offered several explanations as to why gesture varies. Variation in gestures and their use may be due to differences in the semantic and structural nature of spoken languages. Conceptual/cognitive differences in how linguistic-cultural groups perceive and represent the world may also be a reason for variation. Cultural norms governing interaction and bodily conduct can also account for differences in gesture practices. Lastly, patterns of gesture use, or what could collectively be called gesture profiles, may vary due to the different roles gesture plays in the communicative and sociocultural ecologies of communities.
Most research on gesture variation takes language, ethnic, cultural, or national boundaries as the unit of analysis. Often these terms are used interchangeably. Gesture studies uses national boundaries as a proxy for linguistic boundaries and linguistic boundaries as a proxy for cultural commonality. Although there are linguistic and cultural differences in gesture use, are these parameters the most significant? Is gesture variation primarily a function of ethnic/linguistic cultural differences? What about social differences, such as class, region, and gender among speakers of the same language? Does gesture vary along these lines, and how do we account for this kind of variation?
In this chapter, I review the study of variation in gesture and its theoretical underpinnings in the field of gesture studies. My purpose is to question the framing of gesture variation in terms of fixed notions of culture, language, or nationality, as we do when we talk about German, French, or Italian gestures, and so on. To theorize adequately about gesture variation, we need to question whether linguistic and cultural “boundaries” should be the default unit of analysis. In this respect, theoretical developments in accounting for gesture variation in gesture studies lag behind subfields of linguistics such as sociolinguistics, sociocultural linguistics, and linguistic anthropology. What can these fields, which are concerned with how social factors and divisions influence spoken variation, bring to the question of variation in gesture studies? My argument is that our default use of cultural-linguistic categories may prevent us from developing more coherent theoretical explanations for variation that can help identify what is universal and variable in gesture.
I begin with a review of what gesture studies has identified as variable in gesture and the main factors that the field has focused on to explain variation. I then provide an overview of theoretical developments in the study of variation in sociolinguistics and discuss their implications for the field of gesture studies. Using recent studies on gesture from sociocultural linguistic and anthropological perspectives, I argue for the possibility that social factors and divisions other than linguistic/cultural boundaries may provide a more robust and comprehensive theoretical account for variation in gesture.
2 How Gestures Vary
2.1 Efron’s Findings
Reference EfronEfron’s (1972) comparative description of gesturing among Eastern European Jews and Southern Italian immigrants in New York in the 1930s, mentioned above, was the first systematic study of variation in gestural behavior between two cultural/ethnic groups. Efron looked at gestural behavior from linguistic, kinesic, and interlocutional perspectives. Looking linguistically at how gestures relate to spoken content, he observed that “traditional” Italians used gesture to illustrate the content of their speech and had a large vocabulary of what he called “symbolic” gestures. Among “traditional” Jews, he found very few of these types of gestures. He noted that their gestures were predominantly discursive, being used to point to and link propositions or to mark the tempo of utterances.
When examining the kinetic aspects of gestures (movement of body parts, radii, planes, and tempo of gesturing), he observed that Italians tended to use the whole arm from the shoulder; their arm and hand movements worked as a whole and were more “rounded,” and they used a wider space to the side. They also tended to use both arms in a symmetrical manner and had few head movements. Jewish speakers, on the other hand, mainly moved their forearms from the elbow, the movements of their fingers and arms were more differentiated, intricate, and angular, and they mainly utilized the frontal plane. Jews also tended to use one arm, and if both were used, they were usually used asymmetrically, or one after the other. Among Jews, the head was often involved when gesturing.
It is unclear from Efron’s work whether the kinds of interactions he filmed were limited to specific situations or people. For example, political or religious discussions of the Talmud might have elicited more abstract gestures and discursive gestures related to the structure of arguments. Participants might have been of a certain class, and filming in public spaces might have resulted in predominantly male participants. Nevertheless, his study is important because: (1) it suggests that speakers from different backgrounds may rely on gesture for different communicative purposes based on their communicative needs, culture, and environmental contexts, and (2) it identifies features of gestures that are variable, allowing us to distinguish what might be variable and universal elements of gesture in a systematic way (see Reference CooperriderCooperrider, 2019).
2.2 Emblems
One of the most obvious areas of gesture variation has been what Reference EfronEfron (1972) termed “symbolic” gestures, that is, established form-meaning associations commonly referred to as emblems (Reference Ekman and FriesenEkman & Friesen, 1969) or quotable gestures (Reference KendonKendon, 1992).Footnote 1 A number of publications have documented emblem or quotable gesture repertoires of different languages and/or national groups. (See Reference PayratóPayrató, 1993, this volume, for further information.) Although these repertoires appear to differ from one language group to another, there have been only three systematic comparisons.Footnote 2 Reference CreiderCreider (1977) compared the forms and meanings of quotable gestures in four East African ethnic groups (Luo, Kipsigis, Gusii, and Samburu) speaking Nilotic and Bantu languages located adjacent to one another. He compared the quotable gestures of each group and the gestures of these groups with those listed by Reference Saitz and CervenkaSaitz and Cervenka (1972) for North America and Colombia. Sixty-eight percent of the gestures were found to be common across three or more of the four African groups’ repertoires. Twenty-four percent of the African groups’ gestures were also found in the North American repertoire and thirty-one percent in the repertoire of Colombians.Footnote 3 Another comparative study is that of Reference Morris, Collett, Marsh and O’ShaughnessyMorris, Collett, Marsh, and O’Shaughnessy (1979), who tested the knowledge of twenty gesture forms across forty areas in Europe. They found that a third were widely known, while two-thirds of the list were localized. Gesture differences did not necessarily coincide with language boundaries, but there appeared to be regional similarities. For example, Britain and Scandinavia shared some common gestures, as did Italy and Spain, but British and Scandinavian gestures were distinctly different from Italian and Spanish ones.
Instead of comparing vocabulary lists of the forms and meanings of gestures, Reference KendonKendon (1981) looked at the functional domains of quotable gestures from six published repertoires. He found that quotable gestures covered the same three domains: interpersonal regulation (commands and insults), expressions of current states of affairs such as “I’m amazed” or “I am hungry,” and comments about others such as “He’s crazy.” While the forms and meanings of quotable gestures might vary, his study suggests that these types of gestures are used for similar purposes when speech is impeded, inappropriate, or inadequate for the person’s expressive needs. However, Reference KendonKendon (2004) points out that these functional similarities still do not explain why some meanings and not others become expressed in gesture to the point where they become part of an established repertoire. He adds that comprehensive “context-of-use” studies detailing communicative and social environments are necessary to explain this kind of variation (Reference KendonKendon, 2004, p. 344).
2.3 Cospeech Gestures
Gesture studies has maintained a categorical distinction between emblems and cospeech gestures, arguing that the latter are less conventional. Describing the generally accepted notion of a continuum of conventionality from gesticulation (not conventionalized), through emblem (partly conventionalized), to sign language (fully conventionalized), Reference McNeillMcNeill (2005, p. 10) writes that “At the gesticulation end, in contrast [to emblems], a lack of convention is a sine qua non.”
However, systematic studies of cospeech gestures reveal that speakers regularly employ similar forms and movements with consistent physical modifications to express meanings and functions that are semantically and pragmatically related. These types of gestures have been characterized as “recurrent” because their core form and semantic or pragmatic function remain stable across different situations (Reference Ladewig, Müller, Cienki, Fricke, Ladewig, McNeill and BressemLadewig, 2014a, this volume). As the core form and meaning realize several different but related iterations, these related gestures have been grouped into gesture families (Reference KendonKendon, 2004). Reference Bressem, Müller, Müller, Cienki, Fricke, Ladewig, McNeill and BressemBressem and Müller (2014a) point out that each language has a repertoire of recurrent gestures and provide such a repertoire for German speakers. Some commonly studied examples include the Open Hand Supine (OHS) form (see Reference Cooperrider, Abner and Goldin-MeadowCooperrider, Abner, & Goldin-Meadow, 2018; Reference KendonKendon, 2004; Reference Müller, Müller and PosnerMüller, 2004) and the Open Hand Prone, that is, palm down and moved from side to side, accompanying a variety of expressions of negation (Reference HarrisonHarrison, 2018, this volume). Other examples include the Cyclic gesture in which an open hand is loosely rotated expressing the semantic core of cyclic continuity (Reference Ladewig, Müller, Cienki, Fricke, Ladewig, McNeill and BressemLadewig 2014b), and Away gestures of different types where the hand is moved out and away from the body (Reference Bressem, Müller, Müller, Cienki, Fricke, Ladewig, McNeill and BressemBressem & Müller, 2014b; Reference Bressem, Stein and WegenerBressem, Stein, & Wegener, 2017).
Comparative studies suggest that these gestures are similar across many cultures and can be considered natural conventions (see Reference CooperriderCooperrider, 2019). These similarities have led scholars to speculate that common physical actions may give rise to schematized enactments, raising the question of universal origins for at least some gestures (see Reference Harrison and LadewigHarrison & Ladewig, 2021, for a fuller discussion). There is, however, still variation. Studies by Reference Harrison, Larrivée, Larrivée and LeeHarrison and Larrivée (2016) and Reference GawneGawne (2021) suggest that differences in linguistic structure may result in variation in the organization of recurrent gestures of negation with speech. Other studies suggest cultural reasons, such as differences in traditional storytelling, to explain preferences for certain forms, degrees of recurrency, and differences in function (Reference Ferré and MettouchiFerré & Mettouchi, 2020).
The gestural expression of time, direction, and sequence is also both conventional and yet variable cross-linguistically. Some languages, such as English, consistently express future time toward the front of the person and the past behind both in speech and gesture, while others such as Aymara, Vietnamese, and Mandarin do the opposite (Reference Gu, Zheng and SwertsGu, Zheng, & Swerts, 2019; Reference Núñez and SweetserNúñez & Sweetser, 2006; Reference Sullivan and BuiSullivan & Bui, 2016). Among Guugu Yimithirr speakers, direction and spatial relations are gestured according to an absolute frame of reference using cardinal directions (Reference HavilandHaviland, 1998). English speakers, on the other hand, use a relative frame of reference expressing directions using left and right in relation to the speaker. Speakers of some languages, such as Yucatec, use the lateral axis to depict sequences and contrasts, while speakers of languages such as Mopan use the sagittal axis and do not recognize the lateral axis as having representational significance (Reference Kita, Danziger, Stolz and GattisKita, Danziger, & Stolz, 2001).
3 Why Gestures Vary
The field of gesture studies has attributed variation in gesture to four main factors: (1) linguistic differences in spoken languages, (2) conceptual/cognitive differences in how people perceive the world around them, (3) differences in cultural norms of interaction, and (4) diverse sociocultural environments.
3.1 Linguistic Differences in Spoken Languages
Considerable attention has been paid to how the structure of spoken languages influences gesture. The expression of motion events, based on Reference Talmy and ShopenTalmy’s (1985/2007) distinction between verb- and satellite-framed languages, has garnered the most interest. Some languages such as English tend to encode manner and path in one clause using a verb of manner and a “satellite” (such as a preposition) for path: for example, the ball rolled down the hill. Other languages that are verb-framed, such as Japanese and Turkish, use two clauses, one to describe manner and the other to describe path: for example, it descended as it rolled (Reference KitaKita, 2009). Comparative experimental studies of English, Japanese, and Turkish show that one gesture expressing both manner and path is often used in English, while two gestures may be used in Japanese and Turkish encoding manner and path separately (Reference Brown and GullbergBrown & Gullberg, 2011; Reference Kita and ÖzyürekKita & Özyürek, 2003; Reference Özyürek, Kita, Allen, Furman and BrownÖzyürek, Kita, Allen, Furman, & Brown, 2005; Reference Özyürek, Kita, Allen, Brown, Furman and IshizukaÖzyürek et al., 2008; Reference StamStam, 2015). Comparisons of English and Spanish show that among English speakers, when manner is encoded in the verb and path in the satellite (e.g. He rolled across the street), a gesture accompanying this clause may encode manner or sometimes only path.
What is semantically encoded in speech can also influence what is expressed in gesture. Reference Kita and ÖzyürekKita and Özyürek (2003) compared English, Japanese, and Turkish speakers describing the movement of swinging on a rope from one building to another. English has the verb swing to describe this movement. However, in Japanese and Turkish, there is no verb for swinging, and more generic words are used such as “go” and “jump.” English speakers produced arc-like gesture movements, while Japanese and Turkish speakers produced direct (straight) movements, suggesting that when the arc-shaped trajectory is not encoded in speech, it may also be omitted in gesture.
Other aspects of speech can influence the timing of gestures. Reference CreiderCreider (1978) analyzed the relationship between intonation, tone, and gesture in Luo and then compared Luo with Kipsigis and Gusii, three languages spoken in Kenya (Reference CreiderCreider, 1986). He showed that intonational structure shapes the character of body movements (Reference CreiderCreider, 1986, p. 148). He found cross-linguistic differences in the alignment of gestures with speech depending on how stress is used to mark aspects of discourse structure in different languages (Reference Brookes, Nyst, Müller, Cienki, Fricke, Ladewig, McNeill and BressemBrookes & Nyst, 2014). Reference McNeill, Duncan and McNeillMcNeill and Duncan (2000) also found that language structure can influence the timing of gestures with speech. They showed that Mandarin speakers time a gesture within a phrase describing motion events differently from how this is done by English speakers. Mandarin speakers tend to place a gesture depicting action at the beginning of the clause with the topic announcement and not on the verb, as English speakers do.
3.2 Conceptual Differences
Gesture research shows that variation in the depiction and expression of location and time is due to conceptual differences between cultures. As described above, some cultural groups express location and direction in absolute terms according to cardinal points (north, south, etc.), while in other cultures the speaker’s body is the locus from which directions are relatively expressed (e.g. to the right/left of the speaker) (Reference Levinson, Bloom, Peterson, Nadel and GarrettLevinson, 1996, Reference Levinson2003). Even when direction and location are not encoded in speech, speakers will still encode location or movement in gesture according to their conceptual frames of reference. If a speaker with a relative frame of reference sees an object move to the right, they will make a gesture to their right (Reference KitaKita, 2009; Reference Kita and ÖzyürekKita & Özyürek, 2003). A speaker with an absolute frame of reference will depict the direction of movement according to cardinal directions, even if they change position (Reference HavilandHaviland, 1993; Reference LevinsonLevinson, 2003).
Conceptual differences also explain variation in the use of lateral and sagittal axes in two closely related Mayan cultures in Central America: Mopan (Belize) and Yucatec (Mexico) (see Reference KitaKita, 2009). Mopan has no words for the expression of relative location such as left and right, unlike Yucatec. In experimental tasks that did not involve speech, lateral mirror images were seen to be the same by speakers of Mopan, but not among Yucatec speakers. Mopan speakers did not recognize differences along the lateral axis (left to right or vice versa) as having representational significance. They placed contrastive elements along the sagittal axis while Yucatec speakers used the lateral axis (Reference Kita, Danziger, Stolz and GattisKita, Danziger, & Stolz, 2001).
Conceptual differences also shape how time is metaphorically expressed spatially in different cultures. As described above, English speakers talk about the future as in front and the past as behind, while Aymara speakers conceive of the past as in front, as it is known/seen, and the future as behind because it is unknown/unseen. Among Aymara speakers, even when speech does not explicitly use spatial metaphors, talk about the past is represented gesturally toward the front of the speaker and the future toward the back (Reference Núñez and SweetserNúñez & Sweetser, 2006). Similar construals of time with the future behind and past in front have been found with other languages, including speakers of Vietnamese (Reference Sullivan and BuiSullivan & Bui, 2016) and Mandarin (Reference Gu, Zheng and SwertsGu et al., 2019).
Reference KitaKita (2009) argues that these conceptual differences are not only a result of how speakers describe concepts in speech, but also reflect how people in different cultures process spatial information as they apply these conceptions of space even with nonlinguistic tasks. Spoken language appears to play a role in shaping gestures, but the link between spoken language and gestural expression may not necessarily be a direct one. Reference Núñez and SweetserNúñez and Sweetser (2006) consider how the concept of time might also be shaped by bodily experience and culture (see also Reference EnfieldEnfield, 2005; Reference Le GuenLe Guen, 2011; Reference Le Guen and BalamLe Guen & Balam, 2012). Given the Aymara case, Reference Núñez and SweetserNúñez and Sweetser (2006) question the universality of translating bodily grounded experience directly into gesture and suggest that culture may mediate the emergence of different patterns. Studying the use of gesture to express location among indigenous Australian Guugu Yimithirr speakers, Zinacantec Tzotzil Mayan speakers, and signers of a new sign language, Zinacantec Family Homesign (Z), Reference HavilandHaviland (2019) demonstrates that there is not necessarily a one-to-one relation between speech, cultural perception, and the expression of spatial concepts in gesture. Instead, he shows how language, gesture, bodily practices, and changing frames of environmental reference within immediate interactions play a role in shaping conceptual expression in gesture.
3.3 Norms of Interaction
Variation in gesture can also be the result of differences in cultural norms related to bodily behavior (see Reference KitaKita, 2009). One example of how norms of politeness impact gesture is the taboo on using the left hand observed in several African contexts (Reference Kita and EssegbeyKita & Essegbey, 2001; Reference Sanders and AgwueleSanders, 2015). In Ghana, this taboo inhibits the use of the left hand to gesture. Ghanaians consider it rude to point with the left hand and will use a two-handed point to indicate direction or extend the right hand to gesture on the left-hand side of the body. According to Reference Kita and EssegbeyKita & Essegbey (2001), this taboo makes Ghanaian gestural behavior distinctive.
Another area of cross-cultural pragmatic difference is how gestures regulate and manage interactions. Comparative studies between Japanese and English speakers show that head nods are more frequent in Japanese conversations and have both similar and different functions in the two cultures (Reference Kita and IdeKita & Ide, 2007). In both cultures, proposition-final head nods may signal agreement and/or that the speaker should continue. However, Japanese also nod in the middle of propositions as part of establishing a social bond between speakers (Reference Kita and IdeKita & Ide, 2007). This pattern may be a result of cultural norms and values relating to cooperation, consideration, and sensitivity toward the thoughts and feelings of others (Reference KitaKita, 2009).
There also appear to be culturally based differences in speakers’ gestural space and where they gesture in relation to the body. Reference EfronEfron’s (1941, Reference Efron1972) comparison of Southern Italian and Eastern European Jewish immigrants found that Italians used a wider space, moved the arm predominantly from the shoulder, and used the lateral plane. Jewish immigrants used a small space, gestured mainly from the elbow and wrist, and tended to use vertical and sagittal planes. In another study, Reference MüllerMüller (1998) has shown that Spaniards gesture more above shoulder height than Germans do.
3.4 Sociocultural Environments
3.4.1 Single Gesture Context-of-Use Studies
A few studies have examined how different sociocultural environments might shape the nature and use of gestures. Reference SherzerSherzer (1972, Reference Sherzer1991) used “context-of-use” studies to look at “lip-pointing” among Kuna Indians and the “thumbs-up” gesture in Brazil to explain their meanings, functions, and prominence. In his analysis of spontaneous occurrences of the “thumbs-up” gesture in urban Brazil, Reference SherzerSherzer (1991) showed that this gesture has a core underlying paradigmatic meaning of “good” or “positive.” It is used at specific moments in interactional sequences to thank, approve, get permission to take action, agree, praise, show one has understood, acknowledge, and greet. It occurs mostly during passing interactions to show that one is fulfilling one’s obligation to maintain friendly and positive relations in public exchanges. He suggests that its prominence within Brazilian society as an expression of positivity and social goodwill in public exchanges is due to the volatile economic, social, and political climate, where public displays that maintain friendly relations are essential.
Reference BrookesBrookes (2001) undertook a similar study in South Africa, looking at the use of a quotable gesture known as the cleva gesture. The cleva gesture involves the index and pinkie fingers extended, usually toward the speaker’s eyes, and moved diagonally down across the face and back up. Its core meaning is “seeing,” in both literal and metaphorical senses. It is used in various situations: to greet in the sense of “I see you,” to tell someone you want to see them, to warn someone to look out or that they are being watched, and to tell someone to be alert. Metaphorically it signifies streetwiseness: the ability to see what is going on around one, to be alert, with-it, and forward-thinking, and to have the qualities necessary to survive in a rough, volatile, urban environment. The gesture serves practical communication, but speakers can also compliment someone as cleva “streetwise” by increasing the number and amplitude of the movements up and down, or label them a tsotsi “crook” with minimal movement of the fingers across the eyes. It has come to symbolize the modern urban black South African who has left their rural tribal past behind, and its main function is to show who is included in that category and who is not. It symbolizes a key ideological and social divide between those who can “see” in the metaphorical sense of being forward-looking and progressive, associated with streetwise urban living, and those who lack streetwiseness, are considered tribal and backward, and are associated with rural life. Its semantic and pragmatic development, in that it can express a wide range of different meanings and functions, is due both to practical reasons relating to living in a volatile and often dangerous urban environment and to ideological concerns relating to a key social divide between urban cosmopolitan and rural tribal ways.
These studies demonstrate how communicative and social needs give rise to certain established form-meaning pairings that play a prominent role and become what Reference CooperriderCooperrider (2019, p. 218) terms “privileged” within their communities. However, these studies still do not explain variation in gesture patterning or the profiles of entire communities. Explaining these would require systematic documentation of the range of gestures in a society, their contexts-of-use, and patterns of gestural behavior in relation to environment.
3.4.2 Gesture Profiles
Few studies, other than Efron’s, have described and compared the nature and patterning of different kinds of gestures and their articulation or kinesic execution in extended stretches of discourse in different contexts between speakers of different languages. Reference KendonKendon’s (2004) comparative analysis of the nature, action, and patterning of gestures in extended stretches of discourse of an Italian Neapolitan speaker and an English speaker from Northamptonshire, UK shows distinct gesture profiles. He notes differences in the way gesture and speech phrases were coordinated, the amplitude of movements, the variety of hand shapes, the detail of the content represented in gesture, and how gesture was used to mark discourse structure. The Italian employed fourteen distinct hand shapes while the Englishman employed only one. The Italian also used a larger amplitude of movement and a greater variety of locations around his body. His gesturing provided greater substantive detail, marked discourse structure more frequently, and indicated the kind of speech act being performed more directly and explicitly than did the English speaker’s.
Reference KendonKendon (2004) notes that his description of this Neapolitan’s gesturing is similar to Reference EfronEfron’s (1941, Reference Efron1972) descriptions of the gestural profile of Southern Italian immigrants in New York who came from the Neapolitan area. He asks why this gesture profile occurs among Neapolitans and suggests how we might explain these differences in terms of the communicative economy and social ecology of Naples (Reference KendonKendon, 2004, p. 350). Kendon draws on Reference HymesHymes’ (1974, p. 4) concept of communicative economy that conceives of patterns of communication within the boundaries of a community. To understand how cultural practices shape gestural practices, Reference KendonKendon (2004) argues that we must determine the role of gesture as one component of communication shaped by the communicative requirements of a culture. He suggests that we consider how different modalities of communication are used in relation to each other according to different communicative situations and that this patterning can differ from one culture to another.
3.4.3 Ecological Analyses
Two studies have applied an ecological analysis to the gestural profiles of communities. Reference KendonKendon (2004) has looked at the gestural profile of Neapolitan speakers in relation to the communicative economy and social ecology of Neapolitan culture, its history, and the material environment. Reference Brookes, Seyfeddinipur and GullbergBrookes (2014a) has carried out a similar analysis of a comparable gesture profile in a black urban Johannesburg township community in South Africa. In both communities, gesturing is a prominent component of everyday communication. Both communities have a large repertoire of quotable gestures that are used autonomously in different situations, and both frequently use gestures to depict the content of what is said. Residents in both communities use gestures in a highly performative manner: to enhance visibility and compel attention, to engage in multiple interactions simultaneously, and for secret communication.
Reference Brookes, Seyfeddinipur and GullbergBrookes (2014a) notes similarities between these two communities in terms of their social, cultural, and environmental characteristics that may have encouraged similar kinds of gestures and gestural practices to develop. In both communities, people live close together in crowded conditions and are generally visible to one another. Many different social activities take place within the same space, creating a noisy environment and requiring participation in, and monitoring of, multiple exchanges simultaneously. Both communities value the aesthetic role of communication as performance and have social environments that require one to visibly display who one is and with whom one is aligned. Under such conditions, gesture plays a more prominent role in everyday exchanges, is more likely to convey semantic content, and involves a large repertoire of established form-meaning pairs or quotable gestures. These make precise communication possible whenever speech is impeded, and allow secret communication in public spaces where many activities and actions are easily observable. However, more holistic comparative studies are needed to establish the kinds of social factors that impact gesture profiles (Reference CooperriderCooperrider, 2019).
4 Rethinking Variation
Most studies take language, ethnic, cultural, or national boundaries as the unit of analysis when comparing differences in gesture. As pointed out in the introduction to this chapter, these terms are often used interchangeably. But do broad cultural, language, and ethnic differences adequately explain variation in gesture? What about regional differences and social categories such as class, gender, and age? What about communicative situations involving different tasks, participants, and circumstances?
Few studies have looked at regional differences in gestures. In his ethnography of gesture use in Naples, Reference De JorioDe Jorio (2000) noted that gesturing in Naples was distinct from gesturing in nearby regions. However, he did not indicate what differed nor systematically compare regions. Reference CreiderCreider’s (1977) and Reference Morris, Collett, Marsh and O’ShaughnessyMorris et al.’s (1979) comparative studies of emblems noted that variation did not always coincide with linguistic boundaries. Reference Morris, Collett, Marsh and O’ShaughnessyMorris et al. (1979) suggested that the presence of common gestures across different linguistic groups in nearby regions was due to diffusion. They also hypothesized that some gestures may not spread because they become indexical of a particular nationality or because another gesture already exists for a similar concept. Studies of emblems such as the V for victory emblem suggest that some gestures spread beyond linguistic and national borders because they appear to have become iconic of social movements (Reference SchulerSchuler, 1944). When comparing form-meaning pairings across different linguistic groups, the key question is whether similarities are due to similar cultural contexts, universal origins in common physical properties and actions, or gestural diffusion through contact.
Similarly, there are only a handful of studies that describe differences in gestural practices within a linguistic community. Reference Driessen, Bremmer and RoodenburgDriessen’s (1992) study of gesturing among working-class males in rural Andalusia, Spain, suggests that both class and gender play important roles in gestural behavior within the same linguistic and geographic community. He describes distinct gestures and patterns of gesture use among working-class men when they gather in local drinking establishments. In these situations, specific gestures are frequently used in interactions to express a particular kind of masculinity that dominates and humiliates others to maintain power and status within male social hierarchies. Both bodily behavior and gesture play an important role in male sociality.
Reference BrookesBrookes (2014b) documents a similar use of gesture among young men in black urban South African townships when they gather in friendship groups on the street corners. She notes how certain gestures, such as ritual gestural greetings and some quotable gestures, are used mainly by men (Reference BrookesBrookes, 2004). Young men make prominent use of gestures in distinctive ways in interactions with their peers to negotiate and maintain status within male peer networks. Their gestural behavior differs not only from young women’s but varies across male social networks to indicate different social levels among young men. Reference KuneneKunene (2010) also notes the role of gender and age in shaping gestural behavior among young adult isiZulu-speaking males in South Africa. She shows how they make use of a larger gesture space than females of the same age when narrating. These studies suggest that gender identity in relation to bodily behavior may impact certain aspects of gestural behavior.
Although little research on the impact of social categories on gesture exists, the idea that gesture might vary according to social categories such as gender and class is not new. For example, books on etiquette from the sixteenth to eighteenth centuries, such as Castiglione’s Il libro del cortegiano (“The Book of the Courtier”) (1527), considered how gesture behavior should be adapted to an ideal of courtly behavior (see Reference KendonKendon, 2004, p. 21), suggesting the notion of style. Writing on art, such as Gerard de Lairesse’s Groot Schilderboek (Great Painting Book), published in 1707, describes how gestures and other body movements reveal not only inner emotions and personality, but also class, education, identity, and background (see Reference KendonKendon, 2004, p. 30). However, there are no studies that systematically compare variation in gesture in relation to gender, class, or other social factors. How should gesture studies approach the challenge of exploring the role of social categories in variation? What can we learn from sociolinguistics?
4.1 Sociolinguistic Perspectives
Although sociolinguistics has been mainly concerned with how spoken language is shaped by social factors, theoretical developments in this field are relevant to how we approach variation in gesture. Reference EckertEckert (2012) identified three main approaches or waves in the study of spoken variation. The first sociolinguistic studies looked at the relation between spoken linguistic variables and macrosociological categories, in particular class, but also region, ethnicity, gender, and age (see Reference EckertEckert, 2012). Early studies involving social surveys, such as Reference LabovLabov’s (2006) The social stratification of English in New York City, showed there were regular patterns of linguistic forms associated with different socioeconomic classes. Gender was also found to play a role in variation. For example, females appeared to be more conservatively oriented than males toward preserving and using prestige/standard forms (Reference TrudgillTrudgill, 1972, Reference Trudgill1974).
However, macrosocial categories did not explain some aspects of variation. This prompted sociolinguists to undertake more detailed ethnographic work that showed how local social categories also impacted variation. For example, Reference MilroyMilroy (1980) looked at how local social networks accounted for phonological variation in Belfast, Northern Ireland. Reference EckertEckert’s (1989, Reference Eckert2000) work on female adolescents in Detroit high schools showed that linguistic variables patterned with local social divisions between adolescent groups in the context of wider class and geographic divisions across the city. This focus on the relationship between macrolevel and local social categories constituted the second wave of sociolinguistic research (Reference EckertEckert, 2012).
However, Reference EckertEckert’s (2012) work also pointed to the need to consider the agency of the speaker. She showed that linguistic variables were not simply a product of an “a priori” identity; rather, adolescents chose linguistic variables and bodily styles depending on the variables’ indexical associations with opposing social groups. Other studies also showed how speakers made use of linguistic variables based on the social meanings/characteristics they convey, even if these variables were not typically associated with their ethnicity, gender, or class. Work on language and sexual identity, for example, showed that men use variables associated with female speech and women use features associated with males as part of expressing different sexual identities, relations, and orientations (Reference Cameron and KulickCameron & Kulick, 2003; Reference Kulick, Holmes and MeyerhoffKulick, 2003; Reference Maribe and BrookesMaribe & Brookes, 2014). Similarly, interactional analyses of the use of linguistic features typically associated with one ethnic/race group by other individuals or groups showed that speakers draw on multiple linguistic resources for social and identity purposes in relation to wider social dynamics (Reference BucholtzBucholtz, 1999; Reference CutlerCutler, 1999; Reference RamptonRampton, 1995; Reference Rampton, Charalambous, Martin-Jones, Blackledge and CreeseRampton & Charalambous, 2010).
These studies prompted sociolinguistics to challenge essentialist notions of identity, in which linguistic features are seen as a product of a priori social categories (Reference Bucholtz and HallBucholtz & Hall, 2005; Reference Kulick, Holmes and MeyerhoffKulick, 2003). Rather than viewing variation as a result or product of speaker identity, the third and most recent wave of theoretical work in sociolinguistics sees linguistic variables as producing social meanings that index social characteristics. Reference SilversteinSilverstein’s (2003) theory of indexicality – in which a linguistic feature comes to have a specific social meaning that speakers draw on to express a social stance, position, and identity – was an important theoretical step in trying to account for patterns of variation. When a social group regularly and repeatedly combines a set of linguistic features, these features can become enregistered as styles associated with the identities of that group (Reference AghaAgha, 2007). By using such a style, or a salient linguistic variable associated with it, a speaker can express the persona or social character associated with the group without necessarily belonging to it. Building on indexicality and the concept of style, subsequent work has tracked how speakers draw on and combine linguistic features and resources that have particular social meanings to express stance, attitude, persona, and social identities in the service of social distinctions (Reference Bucholtz and HallBucholtz & Hall, 2005; Reference EckertEckert, 2012). Rather than understanding identity as the source of linguistic behavior, this work treats identity as the product of linguistic behavior and other semiotic practices (Reference Bucholtz and HallBucholtz & Hall, 2005).
These theoretical developments should prompt us to question whether variation in gesture is simply a product of ethnic differences. Sociolinguistic studies also suggest that we avoid explaining variation in gesture purely as a product of class or gender differences. Instead, we might consider how choice of gestures and gestural practices are motivated by the social meanings they express. While we might focus on immediate communicative goals to explain the types of gesture and patterns of gesture, there is always another level of meaning related to stance and identity that shapes gesture behavior.
4.2 Sociocultural Linguistic Approaches to Gesture
Anthropological studies of gesture have begun to draw on third-wave theoretical insights. Using Reference SilversteinSilverstein’s (2003) theory of orders of indexicality, Reference LempertLempert (2011) analyzes Barack Obama’s use of the precision-grip gesture (sometimes referred to as the ring gesture), in which the tips of the thumb and forefinger are brought together. He examines how Obama uses this gesture to mark information foci (first order of indexicality) and pragmatically to indicate the speaker’s stance, “I am making an important point” (second order of indexicality), during political debates. Reference LempertLempert (2011) shows how this gesture functions as a metapragmatic icon for making a sharp point. With repeated use, the gesture takes on a third level of meaning, conveying a “sharp” authoritative persona. The precision-grip is not only a communicative resource for marking discourse, but also a social semiotic resource conveying authority and indexing qualities of the speaker. Obama uses it in combination with other semiotic resources to cultivate an authoritative persona in line with his political position.Footnote 4
If speakers use the precision-grip/ring gesture to make a statement in an authoritative manner, this gesture may occur more frequently in situations where a person needs to be seen as an authority. For example, university professors may make more use of this gesture to mark important points in their lectures. If the precision-grip/ring gesture occurs more frequently in academic and political discourse, can its occurrence be explained in terms of academic or political registers? If not, then we must account for its use in terms of what speakers wish to express, both about the content of what is communicated and about themselves.
Gestural/multimodal registers do appear to develop in some circumstances. For example, Reference Driessen, Bremmer and RoodenburgDriessen’s (1992) description of male gesturing in drinking establishments in rural Andalusia, described above, suggests that men have developed a multimodal register that includes specific gestures and ways of using gesture and the body in a particular social activity and setting. As he did not look at male interactions in other contexts, we do not know whether this pattern of talk and gesturing extends to other activities, which would indicate a distinct male working-class style across different domains.
A multimodal register is also evident in Reference BrookesBrookes’ (2014b) study of male youths in black urban townships, who use gesture as part of a performative register in interactions with peers. Within this performative register, there are different styles of gesturing, distinguished by the use of different gestures, by how gestures are used in relation to spoken content, and by kinesic differences in movement, use of space, and tempo. These styles have become enregistered in the sense that they are recognized as representing different social levels among male youths in black urban neighborhoods. Various elements of these styles, such as the ubiquitous use of gesture, described as “gesturing too much,” and the lax and flexible kinesic action of the fingers and hands, have become associated in the minds of the community with the style of disrespectable and delinquent male youth. These features of gesture behavior have come to carry social meanings of disrespectability that index both persona and social identities.
Reference Driessen, Bremmer and RoodenburgDriessen’s (1992) and Reference BrookesBrookes’ (2014b) descriptions also show how the use of gesture in these multimodal registers plays a key role in maintaining discursive and social power in male peer networks. Reference Covington-WardCovington-Ward (2016, Reference Covington-Ward2019) explores this role of gesture as a form of visual capital to gain and maintain power in her ethnographic work on gestures and bodily behavioral practices among the Bisikongo (inhabitants of the precolonial Kongo kingdom in the Democratic Republic of Congo). Drawing on historical texts, she describes how political and religious figures used embodied practices, including gestures, to reproduce or challenge social hierarchies, values, and political and religious authority with the invasion of Christian missionaries and the beginnings of European conquest. She shows how gesture rituals and the flouting of these rituals served as visual capital for subverting or maintaining power. In this context, gestures played a central role in public interactions to mobilize social and political action. Also evident is that the importance, meanings, and usefulness of different gestures and gesture rituals changed over time with shifting sociopolitical conditions.
Gesture as a key tool in performance and power is also explored by Reference Hall, Goldstein and IngramHall, Goldstein, and Ingram (2016) in their analysis of Donald Trump’s gestures. Gestures are a strategic tool in Trump’s rhetoric. He uses gestures, especially mimicry, to mock his opponents, to subvert established political rhetoric and thereby to express an antiestablishment stance. His gestures are a key part of entertaining or shocking his audiences to create political spectacle. Gestures provide visual capital to garner and maintain attention from supporters, but also opponents. Reference Hall, Goldstein and IngramHall et al. (2016) argue that his gestures create the excess needed for comedy that subverts the social order. His gesturing expresses a stance that contributes to constructing his antiestablishment political persona, providing an iconic relationship between his use of gestures and his personality.
These studies show that while gestures are shaped by immediate representational, pragmatic, and discursive needs, the interpersonal and social purposes of communicative behavior also play a role in shaping the nature of gestures, their pragmatic functions, their social meanings and functions, and their patterning. Gesture is an important part of what Reference GoffmanGoffman (1963) terms “giving off” information in that it is key to expressing speaker stance. The interaction between communicative and social purposes can explain the nature and choice or patterning of gestures. Accounting for variation therefore requires consideration of gestures at three levels: discursive, interactive, and social. A sociolinguistic theory of style “as a set of practices for displaying social stances and personae in local sociocultural contexts” (Reference Bucholtz, Hall and CouplandBucholtz & Hall, 2016, p. 182) can explain some aspects of variation in gesture.
Reference Bucholtz, Hall and CouplandBucholtz and Hall (2016, p. 173) point out that “bodies and embodiment are central to the production, perception, and social interpretation of language.” They argue that semiotic practices such as adornment, gaze, gesture, and other bodily movements give linguistic variables meaning. Their view is that the body is not secondary to language, and that placing embodiment at the center of sociolinguistic studies will increase understanding of the relationship between body, language, and linguistic variation. Their question is: How do we theorize the relationship between language and embodiment?
Gesture studies can address this question by asking what role gesture plays in the relationship between body and symbolic expression/language. Gesture has both linguistic/symbolic and bodily/action properties. Gesture links thought, emotional expression, and material action through symbolic expression to spoken language, bringing conceptual and social meaning together. Since it is externalized through movement, it is visual, and its creation involves felt material experience (see Boutet & Cienki, this volume). In interaction, it is part of negotiating both ideational and social meaning. The importance of gesture in expressing stance means it plays a key role in attributing social meaning to spoken linguistic variables that enact social distinctions. The body and gesture are a key part of attributing meaning to spoken language and voice, linking symbolic meanings to the social environment. Gesture is part of creating semantic and discursive coherence at the level of discourse, and visual intertextual bodily coherence at the level of interaction and social membership.
5 Conclusion
What is universal, what varies, and why are central questions in the field of gesture studies. Although humans share similar kinds of gestures in terms of communicative functions, such as representational gestures that convey content and discursive gestures that mark discourse structure, there is some evidence to suggest that these functional gesture types are distributed differently across linguistic/cultural groups. Representational strategies, such as the use of iconic gestures to depict content, appear to be more common among some linguistic groups than others.Footnote 5 There are also kinesic differences related to the articulation of gestures. There seems to be variation in how arms, forearms, hands, and fingers are moved, in the use of space, as well as in how gestures are used in interactions. The prominence of gestures as a mode of communication also appears to be different across language groups.
Variation in gesture has been attributed to differences in spoken language structure in terms of what gestures encode and where they occur. Variation has also been attributed to differences in conceptual thinking. The focus in gesture studies has been on how gestures represent or contribute to propositional spoken content and what gestures reveal about cognitive processes, and less on how gestures and gestural behavior are also shaped by cultural norms of conduct. As Reference KendonKendon (2017, p. 157) points out, “speaking is not only an expression of content. It is also a form of social action.” The pragmatics of gesture are important. This function appeared to be one of the most important aspects of gesture to early scholars who originally considered gesture as primarily a pragmatic tool in rhetoric and part of bodily social conduct. Increasingly, it appears that answers regarding variation lie in considering gesture as a form of social action shaped by material and social environments that require different communicative and social roles for gestures, resulting in different gestural patterns and practices.
Most studies of gesture are not based on extensive observations across all communicative situations. It is possible that certain communicative tasks elicit specific types of gestures and patterns of gesturing, and that our conclusions relating to ethnolinguistic groups are based on a limited number of communicative settings. Importantly, ethnic, linguistic, and cultural boundaries do not account for all types of variation. Gesture, like speech, may vary according to social category and be employed in the service of social distinction. In this respect, we can take note of theoretical developments in sociolinguistics and its approaches to spoken linguistic variation. While macrolevel social categories such as region, class, age, and gender may account for some aspects of gesture variation, the social meanings that gestures and gestural patterns of behavior acquire to express attitudes, stances, persona, and identities may have greater explanatory value. Certain features of gesture may be connected to communicative purpose, but may also be shaped by social categories, identities, and ideologies of bodily expression.
Reference CooperriderCooperrider (2019) has pointed out the need for an explicit conceptual framework if we are to systematically examine universality and diversity in gesture. However, as he points out, a more comprehensive documentation of gestures and gestural behaviors within different communities is needed to understand what varies and why. We need more holistic studies to determine the function and role of gestures in a community. These data will enable us to compare different communities systematically to see what can vary, why, and what is universal in gesture. For example, we might consider the universal physical limits of the body in expressing concepts and actions, while at the same time considering how different ideologies of the body, related to social values and identities, might lead to different gestures and gestural practices. Gestures, for example, mainly occur in the frontal space and occasionally at people’s sides. However, the use of space may differ based on what is considered appropriate behavior for males and females, what is acceptable in formal versus informal situations, and the communicative aims of the speakers.
A comparative undertaking requires consistent methodological and conceptual approaches. We need more systematic context-of-use studies that include social aspects. As the visual aspects of communication have become more accessible with technology, we have an increasing ability to gather and analyze large amounts of spontaneous data across the world. With better access to a wider variety of different groups of people internationally, we can begin to make systematic comparisons along different lines. At the same time, we can see how contact might impact gestures, how gestures undergo attrition, how they are acquired and change, and what aspects are resistant to change.
Understanding what makes gestures vary and how gesture systems are shaped might tell us how symbolic systems develop, become shared, and diversify. What is clear is that all factors involved in symbolic expression are interrelated and fundamentally grounded in the social. Gesture is not only a product of the mind, but also a product of, and a means of reproducing, the social. If we are to understand the relation between symbolic systems, cognition, social interaction, and environment, it is necessary to look at a gestural system as a whole and in relation to other communicative modes as part of the communicative and social ecologies of a community.
1 Introduction
1.1 Definitions and Overview of Communicative Gesturing
Gesturing plays a fundamental role in human–human interaction (Reference Bavelas and ChovilBavelas & Chovil, 2000; Reference KendonKendon, 2004; Reference McNeillMcNeill, 2005). Besides gestures used to grasp and manipulate objects, communicative gestures produced as part of linguistic interaction are also important. Communicative functions of gesturing and body posture range from conveying semantic and pragmatic meaning (Reference KendonKendon, 1995, Reference Kendon2004) to coordination of interaction and feedback giving: Iconic gestures (Reference Lis and NavarrettaLis & Navarretta, 2014), metaphoric gestures (Reference Cienki and KoenigCienki, 1998), and pointing gestures (Reference Jokinen, Esposito, Campbell, Vogel, Hussain and NijholtJokinen, 2010) can be effectively used for meaning cocreation, indicating disfluencies and emotional states, as well as supporting synchrony, intersubjectivity, and engagement in interaction (Reference Allwood, Cerrato, Jokinen, Navarretta and PaggioAllwood, Cerrato, Jokinen, Navarretta, & Paggio, 2007; Reference Bavelas, Chovil and RoeBavelas, Chovil, & Roe, 1995; Reference Cassell, Cassell, Sullivan, Prevost and ChurchillCassell, 2000; Reference Gullberg, Ellis and RobinsonGullberg, 2008; Reference KendonKendon, 1986; Reference McNeill, Duncan, Guendouzi, Loncke and WilliamsMcNeill & Duncan, 2011; Reference StreeckStreeck, 2009). Gesture studies extend from (neuro)cognitive processes (Reference Chu and KitaChu & Kita, 2016; Reference Cienki and MullerCienki & Muller, 2008; Reference Kita, Alibali and ChuKita, Alibali, & Chu, 2017) to intercultural comparison (Reference Endrass, André, Rehm and NakanoEndrass, André, Rehm, & Nakano, 2013; Reference Graziano and GullbergGraziano & Gullberg, 2018; Reference Navarretta, Ahlsén, Allwood, Jokinen and PaggioNavarretta, Ahlsén, Allwood, Jokinen, & Paggio, 2012), and from gesture generation (Reference Bergmann, Kopp, Ruttkay, Kipp, Nijholt and VilhjálmssonBergmann & Kopp, 2009) to natural interaction for human–robot interaction (HRI) (Reference Beck, Canamero and BardBeck, Canamero, & Bard, 2010; Reference Jokinen, Wilcock, Mariani, Rosset, Garnier-Rizet and DevillersJokinen & Wilcock, 2014; Reference Kanda, Ishiguro, Ono, Imai and NakatsuKanda et al., 2002). In short, gesturing creates mutual understanding and social bonding without needing to convey one’s intention in explicit verbal utterances that might disrupt the flow of conversation. Communicative gesturing is a way to support visual interaction management, that is, to provide feedback to the partner in an efficient and unobtrusive way. Since gesturing is often simultaneous with speaking, it offers another channel to convey information to one’s conversation partners, and also to control multiple interactions simultaneously (e.g. indicating to a third partner that one is occupied).
Gestures are defined here as hand, head, and body movements, and they may or may not have communicative meaning. At one end, sign languages consist of highly conventionalized gestures intended to convey meanings, while at the other end, specific hand movements such as grasping are used for object manipulation. Between these two ends of the gesture scale, we follow Reference KendonKendon (2004), who defines gesticulation as intentional communicative action with immediately recognizable features. Gesticulation may have different forms and functions, but it is usually described as having an internal structure of three phases: preparation, stroke, and retraction (Reference KendonKendon, 2004). Gesture classifications like Reference KendonKendon (2004) and Reference McNeillMcNeill (2005) are based on gesture functions (iconic, metaphoric, rhythmic, cohesive, and deictic gestures), while the MUltiModal INteraction (MUMIN) annotation scheme (Reference Allwood, Cerrato, Jokinen, Navarretta and PaggioAllwood et al., 2007) provides annotations for both gesture form and function. The MUMIN scheme is aimed at studying multimodal feedback and turn-taking behavior, and it thus also includes head gesturing and body posture besides hand gestures. Commonly used gesture classifications include functions such as deictic pointing gestures, which complement the speech and single out a certain referent (that box), and iconic gestures, which illustrate the speech and describe an element or an event; for example, a big box can be drawn in space with both hands, the palms held far apart, to depict the outline of the box and signify its large size. Beat or baton gestures support the rhythm of the speech and emphasize important concepts of the spoken utterance, while metaphoric gestures carry an abstract meaning, resembling the metaphoric use of linguistic words.
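To make the form/function distinction concrete, the following is a minimal sketch of how a single annotated gesture event might be represented in code. The field names and example values are illustrative assumptions for this chapter, not the actual MUMIN format.

```python
from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    """One annotated gesture event (illustrative; loosely inspired by MUMIN)."""
    start_ms: int      # onset of the gesture in the recording
    end_ms: int        # offset of the gesture
    articulator: str   # e.g. "hand", "head", "body posture"
    form: str          # how the gesture is articulated
    function: str      # e.g. "iconic", "metaphoric", "deictic", "beat"
    phase: str         # "preparation", "stroke", or "retraction"
    gloss: str         # the analyst's reading of the gesture

example = GestureAnnotation(
    start_ms=1200, end_ms=2050,
    articulator="hand",
    form="both hands trace an outline, palms held far apart",
    function="iconic",
    phase="stroke",
    gloss="depicts the large size of the box",
)
```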
There is no one-to-one mapping between the gesture forms and functions. For instance, typical meanings assigned to the pointing gesture concern referential pointing, direction giving, telling off someone, and commanding someone to pay attention, but pointing gestures also seem to serve on a conversational metalevel to enable communication management (cf. Reference KendonKendon, 2004). We will not go into details of gesture semiotics or cognitive models but refer to Reference Cienki and MullerCienki and Muller (2008), Reference KendonKendon (2004), Reference McNeillMcNeill (2005), and Reference Melinger and LeveltMelinger and Levelt (2004).
1.2 Robotics Technology
Views of the future society, such as those sketched in the Japanese Society 5.0,Footnote 1 regard artificial intelligence (AI) solutions as services which enable the digital economy and the development of a human-centered society, with sustainable management of energy and innovations built on the foundation of Cyber-Physical Systems. AI agents are envisaged to be seamlessly integrated in the environment and to have a symbiotic relation with human users. An important aspect is to bring a human-centered dimension to design and decision making in the digital society. Interaction between humans and embodied agents (robots) is assumed to take place in a smooth manner, supporting natural communication through multimodal capabilities.
Social Robotics is a subarea of AI robotics which focuses on research and development of applications for social tasks, where conversational interaction with the human user is central. Typical social robot applications concern such domains as caregiving, health, and education, where the social robot is to provide useful information, recommendations, and instructions to the human users, and simultaneously to take human emotional needs into account so as to create trust and appear as a friendly, helpful companion. This presupposes that the robot is able to conduct spoken interaction with the user and support affective natural interaction using appropriate multimodal behavior, like gesturing. As the tasks for the robot agent become more complex, the agent’s communicative competence must also increase. The ambitious goals include the robot’s capability to analyze human behavior on the basis of the social signals and movements detected, and to produce appropriate gestures and movements for supporting its presentation and for eliciting natural human behavior. If the emphasis is on the cognitive aspects and reasoning mechanisms behind the robot’s actions, the field of study is usually called Cognitive Robotics.
Social Robotics often includes Assistive Robotics or Service Robotics, where the robot helps the human with daily tasks, or the robot and the human try to accomplish a task together in collaboration. Assistive robots can work as monitoring systems and support safe independent living by providing reminders, raising alarms in case someone falls down, or even helping people to get back up (Reference Neßelrath, Lu, Schulz, Frey, Alexandersson, Wichert and EberhardtNeßelrath, Lu, Schulz, Frey, & Alexandersson, 2011; see also the teddy bear-faced robot ROBEARFootnote 2 and its earlier version RIBA). In the case of wheelchair users, speech and gesture interaction can be used to navigate in the environment (Reference Anastasiou, Efthimiou and KouroupetroglouAnastasiou, 2011). Another area where robots may provide novel opportunities is interactive therapy and coaching. For instance, Reference Billard, Robins, Nadel and DautenhahnBillard, Robins, Nadel, and Dautenhahn (2006) and Reference Robins, Dickerson, Hyams and DautenhahnRobins, Dickerson, Hyams, and Dautenhahn (2004) have shown that autistic children respond weakly or not at all to social cues from a human partner but respond well to mechanical devices such as robots. Robots can thus have a therapeutic role in helping children to improve social interactions. In these applications, gesture input recognition is necessary to adapt the robot’s assistance to the human’s needs.
Much interaction research is done with the humanoid NAO robotFootnote 3 and the Pepper robotFootnote 4 from Softbank Robotics (originally Aldebaran). They are interactive and customizable robots which have been used as a research platform in many interactive robot applications for education, tourism, shop assistance, healthcare, and companionship in retirement homes and hospitals. These robots use the NAOqi operating system and can be operated by the Choregraphe software, which allows for controlling the robots and creating applications. Other social robot platforms include the iCubFootnote 5 humanoid robot, built in a European Union project and used as a testbed for human cognition and artificial intelligence, as well as the PAL RoboticsFootnote 6 service robots Tiago and the human-sized ARI, which can be used for various applications requiring multilingual interaction and user engagement. One of the first socially oriented robots was the robot head Kismet (Reference BreazealBreazeal, 2002), which could simulate emotions through facial expressions and head movements. Continuing along these lines, the popular social robot FurhatFootnote 7 provides an expressive head and face and supports various social-engagement studies with head gesturing, attentive listening, and rich facial expressions. Geminoid robots (Reference Kanda and IshiguroKanda & Ishiguro, 2013) are a special type of android robot which have a human-like appearance and thus challenge the limits of human-ness (cf. the Uncanny Valley, Reference MoriMori, 1970/2012). They have been widely used to study HRI from long-term relations (Reference Kanda, Sato, Saiwaki and IshiguroKanda, Sato, Saiwaki, & Ishiguro, 2007) to eye-gaze (Reference Mutlu, Shiwa, Kanda, Ishiguro and HagitaMutlu, Shiwa, Kanda, Ishiguro, & Hagita, 2009) and gesturing (Reference Ishi, Machiyashiki, Mikata and IshiguroIshi, Machiyashiki, Mikata, & Ishiguro, 2018), and, in particular, human acceptance of and attitudes toward robots (Reference Nishio, Ogawa, Kanakogi, Itakura and IshiguroNishio, Ogawa, Kanakogi, Itakura, & Ishiguro, 2012).
Active work is under way on architectures, models, and representations for robots’ natural language communication. For instance, standards are being developed for software interoperability in the Object Management GroupFootnote 8 and in various International Organization for Standardization (ISO) standards concerning safety management for service robotics,Footnote 9 personal care robots,Footnote 10 and human–robot collaboration.Footnote 11 The Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) communityFootnote 12 focuses on standards, interoperability, and sharing multimodal data and applications for practical interests in human–machine communication. The Robot Operating System (ROS) is the de facto standard for robotics, and OpenCV is the most important standard for Computer Vision. Some humanoid social robots, such as the PAL Robotics robots, use ROS, whereas others, like Nao and Pepper, resort to their own operating system.
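As a concrete illustration of how such middleware is typically used, here is a minimal ROS 1 (rospy) node that publishes recognized gesture labels on a topic. The topic name and string payload are assumptions made for the example; a real system would define its own message types.

```python
#!/usr/bin/env python
# Minimal ROS 1 node that publishes gesture-event labels (illustrative sketch).
import rospy
from std_msgs.msg import String

def publish_gesture_events():
    rospy.init_node("gesture_event_publisher")
    # Hypothetical topic name for downstream dialogue components to subscribe to.
    pub = rospy.Publisher("/gesture_events", String, queue_size=10)
    rate = rospy.Rate(1)  # 1 Hz, for demonstration only
    while not rospy.is_shutdown():
        # A real node would publish labels produced by a gesture recognizer.
        pub.publish(String(data="pointing"))
        rate.sleep()

if __name__ == "__main__":
    try:
        publish_gesture_events()
    except rospy.ROSInterruptException:
        pass
```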
As for speech, social robots usually use proprietary systems such as Google Cloud services for automatic speech recognition (ASR) and text-to-speech (TTS), Amazon Polly for TTS (as in Furhat), or Nuance (as with Nao and Pepper). In this respect, Reference Fujii and JokinenFujii and Jokinen (2022) describe how ROS, the ESPNetFootnote 13 speech recognizer, and the Rasa Open Source dialogue framework can be combined as a platform to support natural interaction with a social robot.
1.3 Communicative Gestures for Social Robots
Requirements for natural HRI in general are discussed in Reference Jokinen, Ge, Cabibihan, Salichs, Broadbent, He, Wagner and Castro-GonzálezJokinen (2018). Gestures are an integral part of human–human interaction and the goal of gesture modelling in Social Robotics is to improve HRIs by enhancing the robot’s multimodal conversational capabilities, its presentation skills, expressivity, intelligibility, and social acceptability. For instance, Reference Lim, Ogata and OkunoLim, Ogata, and Okuno (2012) presented a framework for emotional expressions in musical robots and showed that a robot’s likeability and general usability increase with emotional gesturing, while such experimental systems as WikiTalk (Reference Jokinen, Wilcock, Mariani, Rosset, Garnier-Rizet and DevillersJokinen & Wilcock, 2014; Reference WilcockWilcock, 2012) and ERICA (Reference Inoue, Milhorat, Lala, Zhao and KawaharaInoue, Milhorat, Lala, Zhao, & Kawahara, 2016) focus on natural language interaction by taking a repertoire of multimodal expressive communication into account. An overview of dialogue research with social robots is provided in Reference Jokinen and WilcockJokinen and Wilcock (2017).
In HRI, multimodal issues are important since speaking robots are starting to appear in homes, public spaces, and workplaces. User evaluations usually point to the robot’s inflexible feedback strategies with humans (Reference Jokinen, Ge, Cabibihan, Salichs, Broadbent, He, Wagner and Castro-GonzálezJokinen, 2018; Reference Sidner, Rich, Shayganfar, Bickmore, Ring and ZhangSidner et al., 2015). For instance, the user’s gesture patterns can indicate unexpected and misunderstood interactive situations. This information should be used in designing the robot’s feedback strategy to support friendly interaction: Detecting the user’s understanding and misunderstanding is crucial in order to provide expressive behavior that supports emotionally satisfying and pleasant interaction (Reference Beck, Canamero and BardBeck et al., 2010).
Visual information in the form of hand and head gesturing, body posture, and movement is essential in creating expressive and reliable presentations that support social acceptance and trust in the robot’s capability to fulfill the tasks it is designed to do. Pertinent aspects deal with the embodied nature of robots and integration of contextual information into the robot’s interaction model, as well as the robot’s engagement in long-term relationships with humans (Reference Heylen, Krenn, Payr and TrapplHeylen, Krenn, & Payr, 2010; Reference Jokinen and WilcockJokinen & Wilcock, 2021; Reference Leite, Martinho and PaivaLeite, Martinho, & Paiva, 2013). The role of the social robot also needs to be determined and its functionality designed according to its expected capabilities: If the robot is to act as a companion, an instructor, a coach, or a coworker, it needs to provide true and useful information as well as to support natural, friendly, emotional, and expressive interaction. It needs to address issues of trust and empathy (Reference Gebhard, Aylett, Higashinaka, Jokinen, Tanaka, Yoshino, Miehle, Minker, Andre and YoshinoGebhard et al., 2021). For this purpose, other multimodal aspects such as emotion recognition and affective gesturing (Reference Glowinski, Dael, Camurri, Volpe, Mortillaro and SchererGlowinski et al., 2011; Reference Noroozi, Kaminska, Corneanu, Sapinski, Escalera and AnbarjafariNoroozi et al., 2018; Reference Spezialetti, Placidi and RossiSpezialetti, Placidi, & Rossi, 2020), as well as how the robot’s behavior affects human empathy (Reference Kwak, Kim, Kim, Shin and ChoKwak, Kim, Kim, Shin, & Cho, 2013), are important. Moreover, studies of the robot’s nonverbal behavior have focused on kinesics, proxemics, and haptics, and how such behaviors can be detected and interpreted in communicative contexts, as well as how they affect the robot’s decision making and HRI (Reference Saunderson and NejatSaunderson & Nejat, 2019).
2 Gestures in the Analysis and Generation of Robot Behavior
Robots’ behavior can be studied for both interpretation and generation purposes, that is, from the point of view of the robot perceiving, processing, and understanding human gestures (perception) and from the point of view of how the robot’s own gestures should be designed (action). Figure 25.1 summarizes the two viewpoints for studying gestures in interactive situations and depicts how the robot’s processes of interpreting and generating gestures correspond to similar counterparts on the human side. The robot’s interpretation is based on its input devices (cameras, sensors), which enable the robot to perceive human behavior and analyze this in terms of a meaningful action in the given context. On the generation side, the robot produces gesturing as part of its reaction in the context and realizes gestures as appropriate hand and head movements. Gestures can be produced as independent dialogue acts which convey one intent (e.g. pointing or stopping gestures), or as co-speech gestures that accompany speech and support the utterance content or give rhythm to the speech (beat gestures).
Figure 25.1 The robot’s perception and reaction as counterparts of human perception and reaction in interactive situations
The enablement conditions for communication (or enablements, Reference JokinenJokinen, 2009) are listed in the middle of Figure 25.1 as supporting factors for context-aware interaction management. Being in contact and observing the partner’s behavior enables the agent to build understanding of the partner’s goals and produce a reaction that is considered appropriate in the given situation. Such reciprocal actions form a cycle of interaction which manifests the participants’ cooperation with each other and creates a mutual context (Reference Clark and SchaeferClark & Schaefer, 1989) in which the task goals underlying the interaction can be accomplished. The robot agent thus needs a dialogue model which can take care of the perception and interpretation of human communicative signals (like gesturing) and link them to behavior generation which includes reasoning about cooperative actions and contextually appropriate behavior.
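As a rough sketch, the cycle described above can be expressed as a loop over perception, interpretation, and generation. The component names below (vision, dialogue model, motion) are hypothetical stand-ins for full subsystems, not an actual robot API.

```python
class InteractionLoop:
    """Illustrative perception-interpretation-generation cycle for a robot agent."""

    def __init__(self, vision, dialogue_model, motion):
        # Hypothetical subsystems: camera/sensor input, dialogue management,
        # and motor control for speech plus hand/head movements.
        self.vision = vision
        self.dialogue_model = dialogue_model
        self.motion = motion

    def step(self):
        # Perception: observe the partner's behavior via cameras and sensors.
        observation = self.vision.capture()
        # Interpretation: analyze the observation as a meaningful act in context.
        act = self.dialogue_model.interpret(observation)
        # Generation: reason about a cooperative, contextually appropriate reaction.
        reaction = self.dialogue_model.plan_response(act)
        # Action: realize the reaction as speech and gesturing.
        self.motion.execute(reaction)
```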
An important aspect in constructing mutual context is grounding, that is, establishing meaningful links between symbolic labels and objects in the world (Reference HarnadHarnad, 1990). Grounding has been studied extensively in computer vision and also in dialogue modelling, where it refers to the disambiguation of referents in the communicative context. Recent models try to integrate neurocognitive models for language grounding with the robot’s sensorimotor modalities (Reference Heinrich, Yao, Hinz, Liu, Hummel, Kerzel, Weber and WermterHeinrich et al., 2020). In philosophical contexts, the question of how the symbols (words) get their meaning in human thinking and computational systems has been widely discussed (Reference Cangelosi and HarnadCangelosi & Harnad, 2001).
2.1 Gesture Analysis with Robot Agents
In HRIs, two methodological approaches are available for gesture analysis: a top-down, theory-driven approach and a bottom-up, data-driven approach. The top-down approach is typically based on manual annotation and analysis of the data to determine relevant gesturing events. It relies on human interpretation, and intercoder agreement is calculated to assess the reliability and validity of annotations. The number and types of classes depend on the goals of the research.
The bottom-up approach relies on big data and advanced AI techniques to detect possible classes and classify the elements. Reliability and validity of the models are based on model evaluation, where algorithmic measures of accuracy, precision, f-score, Receiver Operating Characteristic (ROC) curve, and so on are used with respect to ground truth annotations. Automatic recognition of speech and visual signal analysis give an objective basis for studies, and the co-speech gestures can be modelled by building estimates of the synchrony and appropriate gesture times. Such work is supported by different tools; for example, ELANFootnote 14 and ANVILFootnote 15 are widely used annotation tools. Automatic analysis platforms have also been developed to study such correlations between multimodal signals (Reference Heimerl, Baur, Lingenfelser, Wagner and AndréHeimerl, Baur, Lingenfelser, Wagner, & André, 2019), and machine learning is used to study interpersonal dynamics (Reference Baltrušaitis, Ahuja and MorencyBaltrušaitis, Ahuja, & Morency, 2019). Reference Kipp, Neff and AlbrechtKipp, Neff, and Albrecht (2007) present a gesture annotation scheme for gesture generation and test it for re-creating animations. It aims to be a compromise between expressivity and economy. See also Reference Jokinen, Pelachaud, Rojc and CampbellJokinen and Pelachaud (2013) for multimodal modelling in general.
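For instance, both intercoder agreement for the top-down approach and classifier performance for the bottom-up approach can be computed with standard tools. The sketch below uses scikit-learn on small made-up label sequences purely for illustration.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             f1_score, precision_score)

# Top-down: agreement between two human coders on gesture classes (made-up data).
coder_a = ["iconic", "beat", "deictic", "beat", "iconic"]
coder_b = ["iconic", "beat", "deictic", "iconic", "iconic"]
print("Cohen's kappa:", cohen_kappa_score(coder_a, coder_b))

# Bottom-up: model predictions evaluated against ground-truth annotations,
# here framed as binary detection of a gesture stroke per time window.
truth = [1, 0, 1, 1, 0, 1]
pred  = [1, 0, 1, 0, 0, 1]
print("accuracy: ", accuracy_score(truth, pred))
print("precision:", precision_score(truth, pred))
print("f-score:  ", f1_score(truth, pred))
```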
Technology for face and head movement detection is already accurate, with standard algorithms found in the OpenCV library.Footnote 16 Traditional algorithms use bounding boxes to detect the areas of interest in the video, and median values of the front and back coordinates of the body, with noise removed, can be used to retrieve possible hand movements by applying a simple peak detection algorithm to the coordinates (see Reference Mitra and AcharyaMitra and Acharya, 2007, for an early overview). Deep-learning algorithms can be used to detect hand gestures directly from video data; for example, Reference Romeo, Hernandez Garcia, Han, Cangelosi and JokinenRomeo, Hernandez Garcia, Han, Cangelosi, and Jokinen (2021) conducted extensive studies on various architectures for gesture detection in order to study their relation to perceived personality traits.
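As a concrete example, the sketch below detects faces with a stock OpenCV Haar cascade and then applies a simple peak detector to a series of movement magnitudes. The input file name, the made-up magnitude series, and the thresholds are all assumptions for illustration.

```python
import cv2
import numpy as np
from scipy.signal import find_peaks

# Face detection with a Haar cascade shipped with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
frame = cv2.imread("frame.png")  # hypothetical extracted video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print("face bounding boxes:", faces)

# Simple peak detection over a per-frame movement-magnitude series (made-up
# data): large peaks are candidate hand-gesture strokes.
magnitudes = np.array([0, 1, 2, 9, 3, 1, 0, 2, 8, 2, 0], dtype=float)
peaks, _ = find_peaks(magnitudes, height=5.0, distance=3)
print("candidate gesture frames:", peaks)
```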
Movement detectors can track the user’s movements around the home and tell which room the user is in and, in particular, whether the user is in the same room as the robot. The robot can use Simultaneous Localization and Mapping (SLAM) algorithms for mapping rooms and navigating around, and the robot’s local sensors (infrared and touch sensors, heat camera) may be able to detect the user’s distance from the robot. This can be used in dialogue initiation, by telling whether the user is close enough to start spoken interaction.
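A minimal sketch of such proximity-based dialogue initiation, assuming a hypothetical read_distance() sensor function that returns the user's distance in meters (or None when no user is detected):

```python
GREETING_DISTANCE_M = 1.5  # assumed comfortable conversation distance

def maybe_initiate_dialogue(read_distance, say):
    """Start spoken interaction only when the user is close enough (sketch)."""
    distance = read_distance()  # hypothetical range-sensor reading in meters
    if distance is not None and distance <= GREETING_DISTANCE_M:
        say("Hello! Can I help you with anything?")
        return True
    return False
```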
2.2 Gesture Generation in HRI
In WikiTalk (Reference Jokinen, Wilcock, Mariani, Rosset, Garnier-Rizet and DevillersJokinen & Wilcock, 2014; Reference WilcockWilcock, 2012), gestures are manually designed to enhance the robot’s presentation capability. Particular interest was paid to the concept of semantic theme (Reference KendonKendon, 2004), which aims to group gestures of a similar shape together into gesture families with a particular meaning. For instance, Open Hand Supine (“palm up”) and Open Hand Prone (“palm down”) express in general “offering and giving” versus “stopping and halting.” Some examples of WikiTalk gestures are given in Figure 25.2 with assigned discourse functions and explanation with respect to the robot’s behavior (Reference Csapo, Gilmartin, Grizou, Han, Meena, Anastasiou, Jokinen and WilcockCsapo et al., 2012).
Figure 25.2 A robot gesture library from WikiTalk (© 2012 IEEE)
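The gesture-family idea can be approximated in code as a lookup from family to broad discourse function, as in the illustrative sketch below. The two entries paraphrase the Open Hand examples above; this is not the actual WikiTalk gesture library.

```python
# Illustrative mapping from gesture family to a broad discourse function,
# paraphrasing Kendon's (2004) gesture-family idea (not the WikiTalk library).
GESTURE_FAMILIES = {
    "open_hand_supine": "offering and giving",   # "palm up"
    "open_hand_prone": "stopping and halting",   # "palm down"
}

def discourse_function(gesture_family: str) -> str:
    return GESTURE_FAMILIES.get(gesture_family, "unknown")

print(discourse_function("open_hand_supine"))  # -> offering and giving
```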
Recently, gesture studies have focused on end-to-end gesture generation with neural networks (Reference Alexanderson, Henter, Kucherenko and BeskowAlexanderson, Henter, Kucherenko, & Beskow, 2020; Reference Kucherenko, Jonell, van Waveren, Henter, Alexanderson and KjellströmKucherenko et al., 2020). Previous gesture generation systems, as described above for WikiTalk, were rule-based and required that the gestures to be generated were specified in advance, while data-driven approaches aim at flexible gesture generation by segmenting the input speech text appropriately and then generating a suitable sequence of co-speech gestures; for example, Reference Chiu, Morency, Marsella, Brinkman, Broekens and HeylenChiu, Morency, and Marsella (2015) discuss various models for gesture generation. Reference Yoon, Ko, Jang, Lee, Kim and LeeYoon et al. (2019) present a model which can produce iconic, metaphoric, deictic, and beat gestures on a NAO robot, and Reference Kucherenko, Jonell, van Waveren, Henter, Alexanderson and KjellströmKucherenko et al. (2020) discuss semantic segmentation, which is important in order to link the semantics and acoustics of speech with the recorded gestures in the new field of study called continuous gesture generation.
2.3 Combining Speech and Gesturing
Reference McNeill, Duncan, Guendouzi, Loncke and WilliamsMcNeill and Duncan (2011) emphasized that the interaction of speech and gestures is necessary to express the speaker’s meaning, while Reference Goldin-MeadowGoldin-Meadow (2003) pointed out that co-speech gestures convey information that usually complements the verbal information and is thus meaningful. Co-speech gesturing (Reference Cassell, Stone, Douville, Prevost, Achorn, Steedman, Badler, Pelachaud, Ram and EiseltCassell et al., 1994) is characterized by the stroke of a gesture which is aligned with the pitch accent of a verbal expression. However, it is important that gestures are contextually appropriate and accurate timewise: As pointed out by Reference Bergmann, Kopp, Eyssel, Allbeck, Badler, Bickmore, Pelachaud and SafonovaBergmann, Kopp, and Eyssel (2010), random gestures confuse the participants. The information about the duration of the intonational contour is used to align the gesture stroke with the pitch accent of a verbal expression. Reference Kopp, Bergmann and WachsmuthKopp, Bergmann, and Wachsmuth (2008) discuss an integrated model of speech and gesture production, while an overview of speech and gestures in interaction can be found in Reference Wagner, Malisz and KoppWagner, Malisz, and Kopp (2014). They focus on form and function of head and hand gestures in speech communication and discuss temporal speech–gesture synchrony with prosody having a special role in speech–gesture alignment.
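The timing constraint can be made concrete with a small calculation: given the timestamp of the pitch accent and the durations of a gesture's phases, the preparation must begin early enough for the stroke's peak to land on the accent. The function below is a sketch under that simplifying assumption.

```python
def gesture_start_time(pitch_accent_s: float,
                       preparation_s: float,
                       stroke_peak_offset_s: float) -> float:
    """When to launch a gesture so that its stroke peak aligns with the
    pitch accent of the co-occurring word (illustrative sketch).

    pitch_accent_s: time of the pitch accent within the utterance audio
    preparation_s: duration of the gesture's preparation phase
    stroke_peak_offset_s: time from stroke onset to the stroke's peak
    """
    return pitch_accent_s - (preparation_s + stroke_peak_offset_s)

# Example: accent at 2.4 s, 0.5 s preparation, peak 0.2 s into the stroke
# -> launch the gesture 1.7 s into the utterance.
print(gesture_start_time(2.4, 0.5, 0.2))
```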
Co-speech gesturing has been studied in particular in the spatial domain. For instance, Sowa and Wachsmuth (2001) consider gestures an inherently space-related modality, while Kopp et al. (2008) point out that gestures have sufficient specificity to be communicative of spatial information. Allen et al. (2007) studied how people express motion events with speech and gestures in intercultural contexts and elicited spontaneous gestures by having adult participants narrate animated cartoons. The study showed that speakers of Turkish, Japanese, and English expressed the same motion events using different gestures, and that the gestural differences mirrored lexical and syntactic differences between the languages. The study concludes that language and gesture can be regarded as two separate systems which interact when a speaker expresses meanings; that is, speech-accompanying gestures are generated from an interface representation between spatial cognition and speaking.
Co-speech gesturing is challenging for robot systems due to the various types of gestures, their different functions, and the need for accurate timing with respect to the spoken utterance so that the semantics of the modalities match. Mori, Jokinen, and Den (2020) explored human gesturing in HRIs in English and Japanese dialogues and noticed several differences; for example, when talking to a human partner, Japanese speakers used fewer and less varied gestures than the English-speaking participants did, but when talking to a robot, both groups used fewer gestures. It seems likely that the interactive situation with the robot is not regarded as being as natural as that with human partners, and that the robot’s ability to detect and react to human communicative gesturing is not natural enough to elicit responsive human gestures. From the HRI point of view, the amount of information to be presented at one time is also an important question related to cognitive processing: An appropriate chunk size varies depending on the content and the partner.
Co-speech gesturing can be visualized using video analysis tools (see Section 2.1). Some examples of movement graphs for video clips of two participants are shown in Figures 25.3 and 25.4 (from Vels and Jokinen, 2015). In the figures, blue lines correspond to the detected movement coordinates of the person on the left, and green lines to those of the person on the right, facing each other. The solid blue and green bars represent speech and nonarticulated vocalizations of the left and right person, respectively.
Figure 25.3 Visualization of co-speech gesturing (Vels & Jokinen, 2015). From left to right: the left person’s torso back, head, and torso front, then the right person’s torso front, head, and torso back. Time is shown on the y-axis running from top to bottom, and the x-axis depicts the magnitude of the person’s horizontal movement in pixels in the video frame.
In Figure 25.3, the green line representing movements of the right person’s torso shows a large horizontal spike (a sharp increase in magnitude) in the back coordinate. This occurs without activity in the front coordinate, suggesting that a gesture occurs behind the speaker’s back, as seen in the screenshot. The gesture occurs simultaneously with the right person’s speech, with the start of the co-speech gesture accurately timed at the turn transition. The peak of the gesture is clearly visible, and it is followed by a smaller gesture before the person stops speaking. The co-occurring gestures seem to form a gesture group which closely accompanies the speaker’s speech: The gesture phrases can be linked to the intonation phrases and to the structure of the spoken utterance, thus supporting the view that gestures and speech are produced from the same underlying semantic unit and grouped together to express a coherent unit of the speaker’s intention (Graziano & Gullberg, 2018; Kendon, 1980).
Figure 25.4 shows a similar gesture for the speaker on the left. Here the gesture contains two spikes in the coordinates related to the person’s front torso and one spike at the back torso, which can be associated with the speaker waving her hands in front of and behind her body. The gesture behind the left speaker occurs between the two hand gestures in front of the body and coincides with the beginning of the person’s speech. Being a single occurrence, it does not seem to be related to the utterance structure like the gestures in Figure 25.3, but rather functions as a pointing or beat gesture marking the start of the utterance.
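Movement curves of the kind shown in Figures 25.3 and 25.4 can be approximated with simple frame differencing inside fixed regions of interest (one per body part). The sketch below is a generic approximation, not the actual tool of Vels and Jokinen (2015); the region coordinates and file name are hypothetical.

```python
# A rough sketch of deriving movement graphs from video: frame differencing
# inside fixed regions of interest (e.g. each participant's head) yields one
# movement-magnitude curve per region over time. A generic approximation,
# not the actual tool of Vels and Jokinen (2015).
import cv2
import matplotlib.pyplot as plt

REGIONS = {"left_head": (50, 20, 120, 120),    # x, y, width, height (assumed)
           "right_head": (450, 20, 120, 120)}

def movement_curves(video_path: str) -> dict:
    cap = cv2.VideoCapture(video_path)
    prev, curves = None, {name: [] for name in REGIONS}
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diff = cv2.absdiff(gray, prev)     # pixelwise change between frames
            for name, (x, y, w, h) in REGIONS.items():
                curves[name].append(float(diff[y:y+h, x:x+w].mean()))
        prev = gray
    cap.release()
    return curves

curves = movement_curves("dialogue_clip.mp4")  # hypothetical file
for name, values in curves.items():
    plt.plot(values, label=name)
plt.xlabel("frame"); plt.ylabel("movement magnitude"); plt.legend()
plt.show()
```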
2.4 Intercultural Aspects in HRI and Gesturing
While gestures offer an efficient way to deal with tasks and to animate communication, problems arise from differences in how the function of gestural behavior is interpreted. The same gesture can have a different meaning in different contexts, although there is flexibility in the actual form of the gesture, so it is possible to talk about the meaning of gestures on a general level (cf. Kendon’s semantic themes). In different cultural contexts, speakers can use emblems (Kendon, 1995): gestures which are specific to a particular culture, have a standardized form, and can function as a complete utterance in themselves. Speakers in close relationships may also use specific gestures which convey meaning known only to them and which thus strengthen their mutual bonding.
Speakers may also have preferences for particular gesturing behavior which can affect their interpretation of gestures, and also their experience of the whole interaction. In particular, speakers may have different awareness of communicative gestures and different thresholds for perceiving them. Such signals can be confusing if modelled differently from what the participant is used to, and thus have a negative effect on the participant’s view of the partner. For instance, single nods given as feedback may go unnoticed by speakers from cultures where repetitive nodding is the common feedback signal, while they may suffice as feedback in cultures where frequent gesturing is not typical (Navarretta et al., 2012). In another study, native Finnish and native German annotators analyzed Finnish-language interaction with respect to gesturing behavior. There were small differences in distinguishing head and hand signals as such (i.e. in segmenting the interaction with respect to possibly meaningful gestures), but the biggest differences were found in identifying gestures as turn-taking signals. Since turn-taking is usually marked by signals in spoken language (topic, intonation), it is understandable that recognizing gestures as such conversational controlling and sequencing signals is difficult. This is corroborated by the observation that non-native speakers tend to focus on the face and notice more facial display signals than native speakers do (Kabashima, Nishida, Jokinen, & Yamamoto, 2012), and that in language learning, gestures are commonly used as communicative strategies to overcome fluency-related problems (Graziano & Gullberg, 2018; Gullberg, 2008).
Different speakers produce different gestures, and this can be used to characterize a person in communicative settings. This is assumed to be relevant for HRI in that users seem to engage more with agents whose behaviors reflect their own personality or cultural background, especially concerning verbal and nonverbal aspects (see, e.g. Aylett et al., 2009). To study whether gesturing can signal the personality of the speaker, Çeliktutan, Skordos, and Gunes (2017) collected a large amount of video data of two human speakers. In an extensive study of different deep-learning architectures for detecting gestures in these data, Romeo et al. (2021) examined whether the speakers’ gesturing patterns can be used to recognize their perceived personality, that is, whether an individual’s way of using gestures is linked to such personality traits as being extroverted, introverted, conscientious, open, and so on. While the study focused on automatic gesture recognition and on the effectiveness of different machine-learning techniques, the conclusion is that there is a link between gesturing and personality, although more data and systematic studies are needed to establish which types of gesturing map to which personality traits.
There are a few studies of differences in the interpretation of the function of gesture behavior across contexts for embodied and virtual agents. For instance, Rehm et al. (2009) built virtual agents specifically on the basis of data collected from German and Japanese interactions, and Endrass et al. (2013) showed that virtual agents can help users learn about culture. In HRI, it has been observed that users’ attitudes toward a robot depend in part on their cultural background (Bartneck et al., 2007), and that users’ willingness to accept a robot’s recommendations seems to depend on the robot’s interaction style and the participants’ cultural background (Rau, Li, & Li, 2009). This calls for adaptive robot design which takes the user’s cultural background and individual preferences into account.
Anastasiou, Jokinen, and Wilcock (2013) used the WikiTalk gesture library, which contained three different gesture forms (small, medium, and large), and asked users to evaluate the gesture types with respect to their appropriateness in context. The results show that users preferred the medium-size gestures, considering the small gestures too small to be noticed as communicative gesturing and the big gestures possibly somewhat frightening, which in HRI relates to perceived aggressiveness and the fear of being harmed by a robot. Although comparisons are often conducted between Western and Eastern cultural backgrounds, nonverbal behavior can vary significantly even within closely related cultures, and in HRIs, inappropriate or badly timed nonverbal signals can lead to misunderstandings, or to the robot’s whole behavior being regarded as odd or even irritating due to its unnatural patterns.
3 Conclusions
Social robots offer new opportunities to use AI technology to make interactions more natural and expressive. Accordingly, their behavior models need to account for verbal and nonverbal behavior in order for the robot to create a mutual context with the human partner and to coordinate the interaction. This requires both detecting and analyzing gesturing (head, hand, body) and generating appropriate co-speech gesturing. Such models can be developed by integrating rich analysis of conversational behavior with advances in AI and robot technology.
Imagine you come home from work, and as you walk through the front door, your home realizes you’re there and starts playing the podcast you were listening to on your commute. You settle into your favorite chair and move your hands, just a little, to adjust the volume, then move your hands again to skip ahead to the next episode in the series. You get wrapped up in the story until your home gently reminds you of a date with your best friend, so you head for the car. As you close the front door, your home turns off the podcast. You ask your car to take you to your friend’s house, and when you’re nearly there, you tell your car to “park over there” while pointing to the shady spot beneath the beech tree. It does so, seamlessly, and opens the door so you can walk over to your friend.
Once the purview of science fiction, this futuristic scenario, with seamless transitions between “smart” systems that recognize individual users and accept commands using natural interaction like speech and gesture, is now on the horizon (Crum, 2020). Researchers in the fields of user experience (UX), human–computer interaction (HCI), and interaction design (IxD) are working on creating these “smart” scenarios, which try to ensure that interaction with these technologies will be as natural, intuitive, and comfortable as possible for their users. Alongside the engineering challenges, this scenario presents plenty of behavioral challenges and opportunities where experts in gesture studies and other behavioral sciences can use what is known about ordinary human interaction to design better interaction with computing systems.
This chapter takes this kind of scenario as a starting point. Our goal is to empower gesture researchers to conduct meaningful research in the fields of HCI and UX research and design. We therefore give special focus to the similarities and differences between HCI research, UX research, and gesture studies when it comes to theoretical framework, research questions, empirical methods, and use cases, that is, the contexts in which gesture control can be used. As part of this, we touch on the role of various gesture-detecting technologies in conducting this kind of research; technical details can be found in Trujillo, this volume. The chapter ends with our suggestions for the opportunities gesture researchers have to extend this body of knowledge and add value to the implementation and instantiation of systems with gesture control.
1 Theoretical Framework
1.1 UX Research (and HCI)
UX design is a multidisciplinary field, drawing on ideas and methods from HCI, ethnography, user-centered design (UCD), and psychology, among others. UX as a discipline was introduced in the late 1990s as a reaction to the limitations of usability engineering, which focuses on objective attributes such as efficiency and effectiveness at the expense of, for example, the user’s expectations and emotional state, context-of-use, or the so-called hedonic aspects such as enjoyment or pleasure. The International Organization for Standardization (ISO) defines UX as “a person’s perceptions and responses that result from the use or anticipated use of a product, system or service” in the ergonomics of human–system interaction (ISO, 2020). At the “Dagstuhl Seminar on Demarcating User Experience” in 2010, thirty UX researchers and professionals reviewed more than twenty-five definitions of UX put forward from 1996 to 2010 and published their discussions in both a white paper (Hoonhout, Law, Roto, & Vermeeren, 2011) and an accompanying website (All About UX, 2020). All definitions agree on the user as the focal point: how they experience the interaction with some product or system, and the impact of that system on them. As UX is considered over the lifetime of a product, UX can be seen as momentary, episodic, or accumulated (over time) and must be evaluated accordingly. From ISO’s definition, it follows that UX can exist before the user has even started using the system or product, that is, solely on the basis of the user’s expectations. Thus, two users who have an identical interaction with a system, but different prior expectations, will experience different UX.
1.1.1 Assessing UX
In order to work with the UX of a system, it is paramount to be able to assess and evaluate its impact on the user in the situations described above. A plethora of methods exists for doing this, depending on the purpose, environment, social context, user characteristics, and so on. In the early phases of product design, explorative methods such as in situ observations, interviews, and role-playing are well suited. As an example, one popular method is contextual inquiry (Beyer & Holtzblatt, 1997), which consists of an in-depth interview with the user along with unstructured observations of their use of the computing system in their usual environment.
UX can be assessed repeatedly during an iterative design process, often with the help of prototypes. Early prototypes can be very basic and made from paper or cardboard (see Snyder, 2003). This allows for frequent and cheap updates of the design. Later prototypes will often be closer to the final product and thus require more effort to build. UX assessment can also involve Wizard of Oz (WOZ) experiments, with or without prototypes. WOZ is a simulation technique in which a “Wizard” (someone behind the scenes) enacts some or all of the system functionalities (Kelley, 1984). This approach is particularly useful for gesture-based interaction, where it would otherwise be necessary to build extensive and cost-intensive gesture-recognition sensor systems to investigate a particular interface. Instead, a human “Wizard” can identify the particular gesture expressed by a test user and activate the corresponding function in the system, which then performs the desired action. This approach is similar to that used for other natural user interface (NUI) systems, such as voice-assistant systems, especially in their early days. An example of a WOZ simulation for gesture interaction with a music system is given in Nielsen, Nellemann, Larsen, and Stec (2020). In this case, the Wizard observed the test participants from a separate room via a video link. When a participant performed a gesture aimed toward the music system, the Wizard would initiate the corresponding action, such as increasing the volume or skipping to the next track in the playlist. (See further explanation of WOZ experiments in Section 5.2 below.)
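The Wizard’s side of such a setup can be surprisingly lightweight, as the minimal sketch below illustrates: the Wizard watches the participant over the video link and presses a key for whichever gesture they see, and the key triggers the corresponding system action. The gesture labels and commands are illustrative, not those of Nielsen et al. (2020).

```python
# A minimal sketch of the Wizard's side of a WOZ setup for a gesture-
# controlled music system. The Wizard watches the participant on a video
# link and presses a key for the gesture they see; the key triggers the
# corresponding system action. Labels and commands are illustrative.

ACTIONS = {
    "u": ("palm-up raise", "volume_up"),
    "d": ("palm-down lower", "volume_down"),
    "n": ("swipe right", "next_track"),
}

def send_to_music_system(command: str) -> None:
    print(f"[system] executing: {command}")   # stand-in for a real API call

def wizard_loop() -> None:
    print("Keys:", {k: g for k, (g, _) in ACTIONS.items()}, "(q quits)")
    while True:
        key = input("gesture seen> ").strip().lower()
        if key == "q":
            break
        if key in ACTIONS:
            gesture, command = ACTIONS[key]
            print(f"[wizard] saw '{gesture}'")
            send_to_music_system(command)

wizard_loop()
```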
1.2 Research for IxD
The preceding section briefly mentioned NUIs. The term “natural” refers to the removal of system artifacts, such as a keyboard or mouse, in favor of direct, or natural, interaction with a computer system. Haptics (for touch screens like the iPhone), speech, and gesture interfaces are all examples of NUIs (Mann, 1998). However, although NUIs supposedly favor human senses and abilities instead of forcing the user to learn and employ some artifact, the design and use of NUIs is not exactly straightforward. Like graphic user interfaces (GUIs), such as the applications used on laptops or the apps used on smartphones and tablets, they must obey basic rules about affordances and feedback in order for humans to understand and interact with them. Indeed, Don Norman, one of the founders of UX as a field, pointed this out: “Gestural systems are no different from any other form of interaction. They need to follow the basic rules of IxD, which means well-defined modes of expression, a clear conceptual model of the way they interact with the system, their consequences, and means of navigating unintended consequences” (Norman, 2010, p. 9).
Affordances and feed-forward (i.e. letting the system present opportunities for the user’s next move) pose particular challenges to speech- and gesture-based NUIs compared to the “what-you-see-is-what-you-get” paradigm of GUIs. Furthermore, even though gestures are central and thus natural in human–human communication and interaction, this is not necessarily so for human–machine communication. Gaver (1991) discusses the affordances of technology and identifies the problem of hidden affordances, that is, the fact that a system may offer a functionality to the user, but it is not perceivable and must therefore be communicated and learned in some other way.
Both gestural and speech-based interfaces are prone to this problem. In contrast, the “interaction frogger” (Wensveen, Djajadiningrat, & Overbeeke, 2004) is one example of a framework intended to let designers systematically couple action (of the user) and function (of the system) by creating information through functional, inherent, and augmented feed-forward and feed-backward subsystems. An example of this can be found in a study by Freeman, Brewster, and Lantz (2016). They created a generic “do that, there” paradigm for gestural interaction with mobile phones and small household devices, using pulsing light (LEDs surrounding the edges of their devices) to provide feed-forward and augmented feedback.
2 Gestures in HCI and UX
In this section, we consider how approaches to gesture interaction in HCI and UX and the study of gestural interaction in gesture studies complement each other.
2.1 Approaches to Gesture Interaction in UX and HCI Research
In HCI and UX research, the term interaction means anything a user does to or with a computing system: Users have goals, for example, to control a product (adjust the volume, change a setting) or achieve something (find a parking space, watch a movie). By interacting with a system via one of its interaction modalities, users provide input to the system to get the job done. Typical interaction modalities include remote controls, apps, websites, and tangible interaction via button presses, knobs, mice, trackpads, and so on. These exist alongside emerging modalities for natural interaction, like voice, gesture, and other body actions (facial expression, emotion, or affect, etc.), which are so named for taking behaviors and skills we already do “naturally” with other humans and applying them to interaction with machines. In this corner of science, multimodal interaction means providing a system with multiple entry points for users so they can do what suits them best at any given moment. Examples include a smart car which accepts voice commands, gestures, and tangible interaction via surface gestures on screens as well as typical car interfaces like radio controls with buttons and knobs; or a smart speaker which accepts input from voice, apps, buttons on the product, or a remote control.
In this context, gesture refers to any direct action made by the user to control the product. Surface gestures made on smartphones and tablets are a prototypical example, as users perform distinct actions on the surface of the product they wish to control, for example, tapping, swiping in the usual directions (up/down, left/right, diagonally), or three-dimensional surface gestures (long tap with haptic feedback). As a class, gestures contrast with input from, for example, a remote control, mouse, or keyboard, that is, devices which translate user intention into input the system understands. Gesture may also refer to movements made while holding a remote control or other device. Examples include Nintendo Switch (2020), which users hold in the hand and move in different patterns through the air to control action in games; and Myo (2020), an armband which translated arm motion into drone motion but is no longer in production.
To refer to manual gestures of the kind studied by gesture researchers (hands moving freely in space), the terms touchless gestures, 3D gestures, air gestures, and freehand gestures are used interchangeably. These gestures are direct actions made by users in space to control systems such as smart home appliances, smart cars, and laptops or other handheld devices. They require an external sensor which can see the gestures by using, for example, microwave radio (radar), light detection and ranging (lidar; similar to radar but with light instead of radio waves), WiFi signals, infrared (IR), or red-green-blue (RGB) input, and a system which can interpret the sensory data to “see” that a particular gesture was produced. One important factor is the distance to the sensor: This subclass of gestures typically assumes far-field sensors (50 cm to 20 m from the user), for which the term far-field gestures is used. This contrasts with near-field gestures (within 15 cm), detected with miniature radar chips, such as Google’s Project Soli (within 5 cm) (Soli, 2020), or capacitive sensors which detect minute changes in the electrical field, such as the one used by Bixi (within 10 cm) (Bixi, 2020). Broadcom’s APDS-9960 chip (Broadcom, 2020) is an example of a low-cost (~1 USD) IR-based sensor capable of near-field gesture detection. These near-field gestures rely on small trajectories of the fingers or hands (e.g. left/right vs. up/down) and are highly constrained by the sensing technology; they will not be considered further here. In addition, although gesture interaction with other body parts (e.g. facial gestures or body posture) constitutes a type of natural interaction, such movements will also not be considered here as they are typically not covered by the term “gesture” in HCI and UX research.
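For illustration, reading near-field gestures from a chip like the APDS-9960 takes only a few lines, assuming Adafruit’s CircuitPython driver for the sensor; the sketch also shows how constrained the resulting gesture vocabulary is (four swipe directions).

```python
# A minimal sketch of near-field gesture detection with an APDS-9960,
# assuming Adafruit's CircuitPython driver for the chip.
import board
from adafruit_apds9960.apds9960 import APDS9960

i2c = board.I2C()
sensor = APDS9960(i2c)
sensor.enable_proximity = True   # the gesture engine needs proximity enabled
sensor.enable_gesture = True

while True:
    gesture = sensor.gesture()   # 0 = none, 1 = up, 2 = down, 3 = left, 4 = right
    if gesture:
        print(("up", "down", "left", "right")[gesture - 1])
```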
Research in this domain typically falls into one of three veins: an engineering approach, which focuses on fast, accurate detection and recognition of human-produced gestures; a UX approach, which focuses on the identification of appropriate gesture forms, the situations in which those gestures should be used, and the acceptability of either the gestures or the computing system in those contexts-of-use; and an HCI approach, which focuses on how the previous two approaches, plus constraints of the computing system (e.g. any graphic user interface or physical constraints of the system), affect its usability. Usability is defined as the extent to which a user is able to successfully and comfortably navigate through a series of tasks. Whereas usability studies tend to focus on where and how HCI breaks down so that it can be improved, UX studies tend to focus on whether that interaction is meaningful for users at all: If users do not see any value in the system, there is no reason to make it good.
Typical research questions are: What is the gesture? What do you do with the gesture? And how socially acceptable is the gesture? These will be considered in depth in Section 5. For now, let us say that researchers are typically interested in creating a gesture set which is robust for the sensing technology they are using. Generally, researchers in this field assume that, once identified and “verified” by a group of naïve users, the gesture set is appropriate cross-culturally, for users of all genders and ages, and that there will be no variation in production when users make the gestures. The extent to which this assumption is flawed remains to be demonstrated.
Gestures are primarily of interest because they can be mapped to appropriate system-level commands (e.g. to access favorite channels or playlists, control volume or change channel, etc.) and can be used without requiring the use of a control device such as a remote control, app, and so on. Sometimes there are questions about which functions are even appropriate for control with gestures.
The rapidly emerging field of virtual reality (VR), mostly for games, may prove to be a turning point for the traditional viewpoint. In a virtual world, body movements and gestures comprise the baseline for interaction, and devices such as “magic wand” pointers and keyboards are considered very poor approximations of real-world interactions. At the time of writing (mid-2020), most VR systems use wireless handworn controllers, which are tracked in three dimensions. From here, the leap to simply tracking the hand and body is very small and will likely happen for mainstream gaming over the next few years. For example, the Leap Motion controller (Leap, 2020) can now be affixed to commercial VR headsets such as the HTC Vive (Vive, 2020) for three-dimensional hand-gesture control, while the Oculus Quest (2020a) standalone VR headset supports hand-gesture recognition (see Trujillo, this volume). A comprehensive overview of gesture interaction for VR can be found in Li, Huang, Tian, Wang, and Dai (2019).
Recently, researchers have also begun to look at multimodal interaction (voice commands combined with gestures; see also Hayo (2020), a multimodal smart home controller) and to ask about the social acceptability of the interaction, both of the gestures in general and of the setup required to enable it. For example, a study of augmented-reality interaction by Lee, Billinghurst, Baek, Green, and Woo (2013) showed significant user preferences for multimodal interaction (speech + gestures) over unimodal gesture interaction. This interest is linked closely to privacy concerns due to sensors and data processing, and so on, which will be discussed further below. Researchers and practitioners have also started to use an inclusive approach, which means that the system being designed can and should be used by any user regardless of any (in)visible impairments, cultural differences, or anything else; see Holmes (2018) for more on the importance of inclusive design for UX research and design.
2.2 Relationship to Gesture Studies
The approach of HCI and UX research stands in contrast with that of gesture studies, where gestures are always understood to complement ongoing interaction, which is by definition situational: whatever people are doing by themselves, as well as to or with other people, animals, objects, and so on. In the cognitive sciences, interaction is largely linguistic and primarily communicative. Manual gestures, hands moving freely in space, are the most frequently studied, though other articulators and modalities are garnering increasing attention. Typical research questions include the relationship between gesture, action, thought, and language; how context affects gesture production; which gesture forms occur naturally and what they express; and how these behaviors vary both across and within a population.
This approach is primarily meant to understand and describe human capacities for communication: what is natural, feels good, and “just” happens. While this focus on description is at odds with the focus of UX and HCI on creating experiences, many of the central questions and approaches to understanding what makes a “good” gesture and what a “good” gesture “does” can be used to inform HCI and UX research and make interaction with systems better. In our view, when it comes to using gesture studies to improve gesture interaction with products, focusing on the strengths of gesture studies, such as the identification of gesture forms and their expressive capacities, as well as demonstrations of how context-of-use affects both the form of a gesture and variations in its production, will give the biggest payoffs to the UX and HCI fields.
The central question of gesture interaction in UX and HCI is the creation of the gesture set, which comprises the gesture forms that will be used to control the system. Essentially, this is a question about the creation of emblems: deliberately produced gesture forms with precise definitions and interpretations (Kendon, 2004). Some emblems, such as the index-finger point, the wave hello/goodbye, and the thumbs up or pinch to mean good/ok, can be found across a number of cultures, while many others are specific to a given culture or even subculture (see Payrató, this volume). In UX and HCI studies, users are often asked to create gesture forms on the spot, or to validate/accept gestures which others have created. Here, it is important to note that “users are not designers” and “designers are not users” (Nielsen, 1993). This means research needs to go in both directions in order to result in something that is understandable and usable. From the perspective of gesture studies, these user-created or designer-created gestures are often puzzling, with a low likelihood of widespread acceptance or use. However, they have value in UX and HCI, as user-chosen interactions generally enjoy a higher likelihood of acceptance in the long term. Taking emblems as a starting point or building on international sign language signs could lead to more meaningful gesture sets, as well as the identification of meaningful opportunities for using gestures to control products.
Once identified, the next question is whether the gestures will be “universally acceptable.” To someone in gesture studies, this question is absurd: Basic expressive principles are the foundation of the language–gesture continuum (Langacker, 2008), but mostly we see variation between speakers, between groups of speakers (by age, gender, language, nationality, even group affiliation), and even within the same speaker in different contexts. However, for a company trying to produce a product which is likely to be used globally, there must be assurances that users in different markets will not only find the gestures acceptable, but that the gestures do not inadvertently require users to do something that they would find awkward, rude, or obscene. This means taking what we know from documentation efforts (e.g. about multimodal metaphors and emblems) and applying it to creation efforts where acceptance can be measured directly (“What do you think about this gesture?”) or, more interestingly, indirectly; for example, by asking people to interact with devices using gestures, and investigating the effects of that interaction (human–computer) on ongoing interaction (human–context).
A related question is that of social acceptability. Again, to a gesture researcher, this is an absurd question as co-speech gestures are “always” acceptable: They are a necessary part of both communication and self-expression. However, to a UX or HCI researcher, the question is very real because natural interaction might make people so self-conscious that they would be unwilling to use the interaction modality when others are present. As an example, consider voice assistants like Siri or Alexa: today (mid-2020), they are quite common, and people accept both the interruptions to ongoing interaction and the occasional blunder. However, when voice assistants were first introduced, users were quite self-conscious and disappointed by the results, so only those who were “forced” to use them (e.g. users with disabilities or early adopters who struggled with traditional GUI interfaces) did so. A similar story holds for gesture interaction: We know that gestures in spontaneous interaction are largely produced unconsciously, but in the context of HCI, they become not only conscious, but purposeful. This creates the opportunity for a number of questions about use case (the situation in which the gesture should be used), gesture set (the gestures that will be used), and the flow of the interaction as a whole so that the result has a chance of living up to the promise of natural interaction.
We end this section with a reflection on classification schemes: In the cognitive sciences, gesture classification schemes focus on the form of the gesture (Bressem, Ladewig, & Müller, 2013; Hassemer, 2016; McNeill, 1992) or on what the gesture represents or expresses (e.g. McNeill, 1992; Müller, 2017; and others), with a focus on giving researchers a shared language for describing naturally occurring instances of co-speech gesture. In UX and HCI, classification typically focuses on what the gesture from a created gesture set is used for, that is, the use case (Groenewald et al., 2016): menu operations, system operations, or media control, to name a few. Some schemes focus on what the gesture does (e.g. Karam & schraefel, 2005; Mehler, vor der Brück, & Lücking, 2014). Some simply indicate the number of fingers or hands which are involved (Rekik, Vatavu, & Grisoni, 2014), as this is, to an engineer, an indirect measure of complexity, which should be kept as low as possible. At least in UX and HCI, these classification schemes are equal-opportunity schemes, in the sense that they treat gestures as being equally well received or likely to be used. We suspect that input from gesture studies can help identify gestures (or classes of gestures) which have a higher likelihood of being adopted, similar to the work of linguists and conversation analysts in identifying constructions which enjoy a higher likelihood of adoption for voice assistants.
3 Approaches to Recognizing Gestures
Empirical investigations into naturalistic, intuitive gesture interaction only make sense when coupled with investigations into the technology (sensors, algorithms, etc.) that makes detecting and interpreting those gestures possible, that is, machine learning and computer vision. The prevalent sensor devices for recognizing human gestures are cameras. In cases where a handheld or handworn device is used, such as smart watches, mobile phones, or the Nintendo Wii (Wii, 2020) and Switch controllers (Switch, 2020), accelerometers are also successfully used. Near-field gesture recognition can be done very cheaply by chip-based IR sensors such as the APDS-9960 (Broadcom, 2020) mentioned above. More recently, radar and lidar have been proposed for gesture detection. Common to all these technologies is that they rely heavily on sophisticated and often computationally demanding signal processing. In recent years, the advent of the machine-learning (ML) paradigm has revolutionized many fields relying on signal processing and pattern recognition, and gesture recognition is no exception. Note that although motion-capture technology is familiar to many gesture researchers, it is not relevant for user-centered research as it is currently practiced. For more about gesture-recognition technologies, we refer interested readers to Trujillo (this volume).
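A condensed sketch of the camera-based pipeline looks roughly as follows: a computer-vision model (here Google’s MediaPipe Hands) extracts hand keypoints per frame, and a downstream recognizer maps keypoint trajectories to gestures. The swipe heuristic below is an illustrative stand-in for a trained ML classifier, and the threshold values are assumptions.

```python
# A condensed sketch of camera-based gesture recognition: MediaPipe Hands
# extracts hand keypoints per frame; a downstream recognizer maps keypoint
# trajectories to gestures. The swipe heuristic is an illustrative stand-in
# for a trained classifier, and the thresholds are assumed values.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)                      # default webcam
xs = []                                        # wrist x-positions over time

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        wrist = results.multi_hand_landmarks[0].landmark[0]
        xs.append(wrist.x)                     # normalized to [0, 1]
        # Toy recognizer: fast rightward wrist displacement = "swipe right"
        if len(xs) > 10 and xs[-1] - xs[-10] > 0.3:
            print("swipe right detected")
            xs.clear()
cap.release()
```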
4 Companies Working with Gestures
When empirical investigations into natural interaction with gestures are coupled with the technology needed to recognize them, business opportunities come knocking. Although gesture interaction has long been the holy grail of research in IxD and computer vision, commercial products capable of real-time gesture detection which live up to the promise of natural interaction (Norman, 2010) have only recently become available. This is a rapidly developing field, so we will mention just a few examples here.
Gaming companies were the first to introduce gesture interaction on a larger scale. In gaming contexts, gestures are purposefully designed to be both easy for the system to recognize and playful, which effectively means game developers are excused from the design guideline that interaction should be as natural as possible. The Xbox Kinect (2020) and Nintendo Switch (2020) provide excellent examples of natural interaction which is not natural. In contrast, concepts for future living such as Panasonic’s kitchen of the future (Panasonic, 2020) show how natural interaction with gestures can be integrated with other interaction modalities, such as voice or touch, in the home.
Mixed reality (MR) is a field closely related to VR, in that the user is required to wear a head-mounted display. Microsoft’s HoloLens 2 (2020) and MetaVision’s Meta 2 (2020) are examples of this: both systems are capable of detecting a restricted set of gestures, though their approaches to gesture interaction are quite different. In a comparative study by Frutos-Pascual, Creed, and Williams (2019), the HoloLens is characterized as employing a metaphoric interaction paradigm, similar to the mouse point-and-click known from the WIMP (windows, icons, menus, pointer) PC interface. In contrast, the Meta 2 employs an isomorphic interaction paradigm, where the real world is mimicked by, for example, grasping virtual objects much as one grasps physical objects in the physical world. The study showed a clear overall user preference for the isomorphic interaction paradigm over the metaphoric one, as well as higher perceived naturalness and a lower task-load index, both of which are important indices of user-friendly UX and usability.
In 2015, Samsung launched a smart TV which could be controlled by one- or two-handed gestures to change channel, volume, settings, and so on. Some videos of early demos of the TV are still available on YouTube; these show Samsung’s best attempts at culturally universal gestures which are easy for 2015 sensors to recognize (Samsung, 2020). In fact, access to low-cost hardware, such as webcams and laptop PCs capable of gesture detection, has led to a number of companies offering gesture control of TVs, such as Visimote (2020).
BMW i-series cars support gesture interaction, but only allow users to control the most frequent media-player functions: adjust volume, change radio station or track, and answer or reject a call. Like Samsung, BMW was interested in so-called culturally universal gestures (Loehmann, Knobel, Lamara, & Butz, 2013), which meant conducting field work to ensure their system would provide a luxurious experience for their customers worldwide. As of mid-2019, several car companies have followed suit, with VW, Ford, and others announcing gesture control for in-car infotainment.
These are all examples of individual products or product lines. When it comes to gesture interaction, the ability to recognize a gesture can itself be the product. All of the big tech companies have patents to this effect: ubiquitous computing systems integrated into the home to allow seamless interpretation of system-oriented gestures alongside voice commands, surface gestures, and other input opportunities (e.g. Apple, 2018). This is also the approach of Piccolo Labs (2020), TwentyBN (2020), and Gestoos (2020): three startups which combine machine learning, sensor technologies, and IxD to provide gesture recognition for domestic use cases. Typically, the approach of these companies is to provide a gesture library or demo set which contains some number of predefined gestures along with the possibility to extend the gesture set to custom-defined gestures for clients. A new open-access research tool called Gelicit (Magrofuoco & Vanderdonckt, 2019) does the same. Here, UX research and gesture experts could work to define relevant gestures for a particular use case and a particular set of users, and then test that those gestures are recognized and acted on appropriately by the system.
This is actually the biggest opportunity for gesture researchers: the identification of appropriate gestures combined with thorough usability testing to ensure that interaction with them is as seamless (error-free, successful, rewarding) as when gesturing to another human.
5 Empirical Methods for Investigating Gestures in UX and HCI
Since the early 2000s, more than 200 user studies (Magrofuoco & Vanderdonckt, 2019) have investigated gesture interaction with computing systems, largely using qualitative methods which emulate typical usability studies (Bargas-Avila & Hornbæk, 2011). This section introduces typical methodologies and identifies important considerations when conducting this kind of research. The most important concept is that of the user, who will eventually use the system. In any design-oriented field, high importance is placed on user-generated or user-verified approaches; see, for example, Rose (2018) for a short popular-science article written by one of the world’s most influential design-research firms.
Because systems will be used by people, it is important to include people in the process of making those systems. That is the only way to ensure that the end result is not only easy, understandable, and delightful (e.g. Zaiţi, Pentiuc, & Vatavu, 2015), but also natural and intuitive in the sense that users will not have to overthink anything but can just do the gesture and get the expected results. Including people in the process of creating gestures increases the likelihood of a gesture being learned (Vatavu & Wobbrock, 2016), remembered (Nacenta, Kamber, Qiang, & Kristensson, 2013), and used (Havlucu et al., 2017), all of which are important when creating an interactive system with gestural interaction. This is true even when recognizing that it is understandably difficult for people to create gestures, and that when they do, they are often limited by what they already know from interactions with other devices. This limitation is also called the legacy bias (Beşevli, Buruk, Erkaya, & Özcan, 2018; Morris et al., 2014), which designers seek to avoid when introducing new interaction modalities.
Empirical approaches typically ask users to complete tasks and reflect on their experiences while doing them (“think aloud”) or immediately after the fact (“post task”). In the case of UX design, these approaches are often explicit, though new technologies such as MR are creating opportunities for implicit measures.
5.1 What Is the Gesture?
The foundation of research on gesture interaction is the identification of a gesture set, which matches gesture forms with system functions. By far the most common approach is eliciting those gestures from users by means of a gesture-elicitation protocol and identifying the winning set by means of a gesture-preference protocol. This is due to three considerations: (1) it is not obvious which gestures would be the best for any given scenario; (2) designer-made gestures often result in puzzling responses from users; and (3) users often know what makes sense to them, even if they cannot articulate why. Once elicited, it is the job of the designers to refine the set and ensure it meets standards, such as between-user acceptance.
In a gesture-elicitation protocol, users are situated in a relevant environment (e.g. a car or a living room) and asked to create gestures for a number of researcher-defined scenarios. Researchers typically ask questions like “How would you gesture to do ___?” for anything from using the hand as a computer mouse, to quickly navigating menus or controlling media, such as music or movies. As the user does the gesture, the researcher ensures that the system responds appropriately (using a WOZ paradigm, see below) so that the user can judge the fit of the gesture to the task at hand. While creating gestures, users may be asked to use a think-aloud protocol, which requires them to verbalize their thoughts and observations as they do the task. Alternatively, they may be asked to describe the reasoning behind their gesture after presenting it to the researcher.
Realizing how difficult a task this is for most users, researchers introduced kinesthetic priming (Hoff, Hornecker, & Bertel, 2016) and increased production, whereby users are asked to playfully create several variations of each gesture. Two examples of this approach are Zaiţi et al. (2015), in which users were asked to create gestures for nearly twenty smart TV functions, and Henze, Löcken, Boll, Hesselmann, and Pielot (2010), which asked users to create gestures for smart music systems; Groenewald et al. (2016) is a meta-analysis of the use cases investigated and gesture sets proposed, and McAweeney, Zhang, and Nebeling (2018) argue for standards for visualizing gestures in user guides, as these instructional visuals affect how a user’s gestures are eventually produced.
In a gesture-preference study, users are presented with multiple options for a gesture (i.e. different actions their hands could perform for a particular function) and asked to indicate their preference along with the reason why. They may do this with either a ranking of, for example, their top three forms (Nielsen et al., 2020) or their overall preference (e.g. Grandhi, Joue, & Mittelberg, 2011), and are often allowed to alter a gesture form so that it becomes a better fit for them, in which case these changes are also noted.
Estimating user agreement (and its extension, user acceptance) on a particular gesture form is its own subfield; Wobbrock, Aung, Rothrock, and Myers (2005) and Vatavu and Wobbrock (2016) are excellent introductions to this topic, which relies on sophisticated statistics. As their work demonstrates, it is of utmost importance to take a between-user approach which considers gender, age, culture, and other characteristics (such as acceptance of new technologies) as relevant factors for ensuring both user acceptance and agreement on particular gesture forms.
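To give a flavor of the statistics involved: a widely used measure in this line of work, in the spirit of Vatavu and Wobbrock, is the agreement rate for a referent, which partitions the gestures elicited for that referent into groups of identical forms and computes the proportion of agreeing pairs of proposals. A minimal sketch, with illustrative proposal labels:

```python
# A sketch of an agreement rate AR for one referent, in the spirit of
# Vatavu and Wobbrock: partition the P elicited proposals into groups of
# identical gesture forms P_i and compute
#   AR = sum_i |P_i| * (|P_i| - 1) / (|P| * (|P| - 1)),
# i.e. the proportion of agreeing pairs among all pairs of proposals.
from collections import Counter

def agreement_rate(proposals: list[str]) -> float:
    n = len(proposals)
    if n < 2:
        return 0.0
    groups = Counter(proposals).values()
    return sum(k * (k - 1) for k in groups) / (n * (n - 1))

# Twenty users proposed gestures for "volume up" (labels are illustrative):
proposals = ["palm-up raise"] * 12 + ["point up"] * 5 + ["circle cw"] * 3
print(round(agreement_rate(proposals), 3))   # 0.416; higher = more consensus
```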
However, what these approaches miss is the fact that, even for preferred gesture forms, user characteristics contribute only part of the story: Physical and environmental context can also affect user preference for particular gesture forms. Stec and Larsen (2018) demonstrated this with a smart TV study, where they showed that depending on where users were situated in a room (near vs. far; left vs. center vs. right of the TV), different gestures were preferred. This variation was unconscious but necessary for intuitive interaction, and it demonstrates the importance of including contextual variation even in foundational work.
5.2 What Do You Do with the Gesture?
For a UX designer, the question of what you do with a gesture is the fun one because it means you can start to investigate the relationship between users, interaction modalities, systems, and context. Although the methods for this are quite different from those typically used in gesture studies, the overarching goals and sensitivity to context are comparable.
This step follows the defining of the gesture set and its verification by a group of users. Crucially, it means investigating the experience of using the gesture(s) in real-world scenarios. To do this, WOZ scenarios are typically used. In a WOZ design, experimenters fake the experience they are trying to design, and provide participants with the best-case scenario, the argument being that if the best-case scenario fails to provide a meaningful experience, the experience itself might not be worth developing. Alternatively, a WOZ design can provide crucial breaks in the experience, such as error messages or failures of various types, in order to investigate meaningful recovery options which can save the experience.
In the case of gesture interaction, WOZ experiments typically entail creating an environment where users think they are gesturing to control a system and get appropriate responses from the system when they do the gestures, but in fact everything is controlled by a hidden “Wizard” who can see the gestures as they are produced and manually control the system to ensure a good response. In this scenario, the system is as good as the human Wizard. This kind of study typically makes use of two cameras in the experiment room: a small webcam which is said to detect gestures, and a normal camera which records the interaction while simultaneously providing a live-feed of the participants to the Wizard, located in another room. The Wizard controls the experience by watching the participants. This scenario is depicted in Figure 26.1.
Figure 26.1 Schematic of a typical Wizard of Oz design for investigating gestural interaction in the home
Examples of this approach include Nielsen et al. (2020) and Taniberg, Botin, and Stec (2018), in which participants were asked to control a music system using gestures while they did something else with a friend. Another example is Tscharn, Löffler, Latoschik, and Hurtienne (2017), which investigated multimodal control in concept cars, where the WOZ scenario enabled researching the situations under which multimodal commands felt the most intuitive to users. One of the benefits of gesture interaction is the freedom it affords to control devices around you without having to interrupt your other activities, for example, by searching for a remote control. Investigations into the relationship between gesture interaction and peripheral interaction (interaction which can be done in the periphery of the user’s attention) also make use of this task design; see, for example, Bakker, Hausen, and Selker (2016).
In these studies, users typically do three tasks: (1) a training or rehearsal task, where users learn the gestures and what they do, which may or may not be assessed by researchers; (2) the WOZ task, where users use the gestures to do something; and (3) a post-task interview or questionnaire, where users reflect on the experience they just had. Post-task questionnaires typically ask users to reflect on the strengths and weaknesses of that experience and invite them to suggest changes to improve it. The WOZ task is often recorded and reviewed by researchers as part of data analysis, though not to the extent that is typical in gesture studies.
Alternatively, gesture sets can be used to evaluate prototypes at different stages. This could mean evaluation of recognition technologies, which would entail careful investigation into where the recognition breaks down so that it can be reinforced; these studies are more akin to usability studies, where user context and user goals are of utmost importance. Prototype evaluation could also mean looking at the relationship between gesture commands and physical properties of the system and room, as in the Stec and Larsen (2018) study mentioned above, or the relationship between the gestures and the system they are connected to, as in Chattopadhyay and Bolchini (2014), which looked at how gestures might enable circular displays instead of the typical rectangular ones.
5.3 How Socially Acceptable Is the Gesture?
Gesture interaction only makes sense if users accept it in social contexts, that is, if they are comfortable using it regardless of whether or not they are alone. This kind of investigation also ensures that the gesture does not accidentally require users to do something obscene or embarrassing, which researchers might not have been aware of.
Research in this area was pioneered by Rico and Brewster (2009, 2010a, 2010b), who investigated the adoption of surface gestures amongst different populations varying by age, gender, and culture. Similar research was conducted for voice assistants (e.g. Moorthy & Vu, 2015), and is growing for gesture interaction (Ahlström, Hasan, & Irani, 2014; Nielsen et al., 2020; Taniberg et al., 2018).
Importantly, any gesture set that is created with input from one set of users needs to be validated by users with other demographics to ensure widespread acceptability. This may seem obvious to gesture researchers, but it certainly is not that obvious to researchers in UX or HCI. One example of a study which tries to account for cross-cultural variation in gesture forms is Meier, Goto, and Wörmann (2014), which asked users in eighteen countries to create gesture forms for any of a number of common interaction tasks. The result is a puzzling list of “user-verified” gestures which are scarcely referenced in the literature.
Instead, the results from one carefully conducted gesture set task should be extended in a kind of replication study with users from another demographic until gestures are agreed on. We expect that this will be easier for gestures that control actual motion (e.g. Stec and Larsen’s (2018) smart TV prototype, or VR/MR gestures such as those in Frutos-Pascual et al., 2019) than for gestures that behave like emblems (e.g. PLAY the movie, NEXT channel, FAVORITE selection): mappings to direct motion are arguably easier to agree on than those requiring iconicity. Even so, emblem-like commands are much more common in UX and HCI research, largely because of interest from interaction designers and engineers. This may change as VR, the related field of Augmented Reality (AR), and the umbrella category of XR become mainstream; in these contexts, so-called direct, isomorphic action (what gesture studies might call character viewpoint gestures) is generally preferred by users (see Frutos-Pascual et al., 2019).
Once the gesture set is validated, the next question concerns the interaction as a whole: Does it matter who is in the room with the user? Does device feedback (or feedforward) matter? What about the physical relationship between the user and the system? One example of a study that tackles these questions is the BMW study mentioned above, in which the same task was carried out by European and Asian participants. This kind of approach, combined with tasks designed to match local scenarios that may differ between regions, is needed to ensure natural interaction that is robust across contexts-of-use.
An alternative approach is to compare users of various ages, on the premise that willingness to adopt, or be exposed to, new technologies will affect the social acceptability of gesture interaction: Cabreira and Hwang (2016) found an age-based difference, while Nielsen et al. (2020) did not. More work is needed to clarify this; in fact, all three of these approaches should be carried out to ensure that the interaction is both “natural” and “intuitive.”
6 Use Cases Relevant for Gesture Control
This section briefly summarizes examples and applications of interactive gesture-controlled systems and points toward potential future areas of application.
6.1 Gestures for Multimedia
Media control (e.g. for TVs, music, and other media devices) has been the subject of numerous studies (see, e.g. Nielsen et al., 2020) and patent applications (Movea, 2018), but few of these have so far been realized as products. One notable example is Samsung’s (2020) TV, described in Section 4. However, it was on the market only for a short time, and the feature was discontinued in subsequent product versions, probably because the gesture set was counterintuitive at best: the interaction style was mainly menu navigation, something gestures are not well suited to. Several TV remotes marketed as gesture-controlled have been on the market at one time or another. The best known is LG’s “Magic Remote” (LG, 2020), which supports pointing gestures and speech recognition; the Harry Potter-style “Kymera Magic Wand” is one of the more curious examples (Kymera, 2020). These all depend on accelerometer technology for gesture detection, as described in Section 3. The Sevenhugs “Smart Remote X” universal remote control is another example (Sevenhugs, 2020), in which a pointing gesture lets the user select the device they want to interact with.
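As a rough illustration of how such accelerometer-based detection works, the following hypothetical Python sketch flags a “flick” whenever the gravity-compensated acceleration magnitude crosses a threshold on a rising edge. The threshold value is an assumption; commercial remotes add low-pass filtering, debouncing, axis-specific logic, and often trained models on top of this core idea.

```python
import math

GRAVITY = 9.81          # m/s^2
FLICK_THRESHOLD = 15.0  # hypothetical value; would be tuned per device

def detect_flicks(samples):
    """Return the timestamps at which a 'flick' gesture is detected.

    Each sample is a tuple (t, ax, ay, az) with acceleration in m/s^2.
    A flick is declared on the rising edge of the gravity-compensated
    acceleration magnitude crossing the threshold, so one physical
    movement registers as one gesture rather than many."""
    flicks = []
    above = False
    for t, ax, ay, az in samples:
        magnitude = abs(math.sqrt(ax * ax + ay * ay + az * az) - GRAVITY)
        if magnitude > FLICK_THRESHOLD and not above:
            flicks.append(t)  # rising edge: register one gesture
            above = True
        elif magnitude <= FLICK_THRESHOLD:
            above = False
    return flicks

# e.g. detect_flicks([(0.00, 0.0, 0.0, 9.81),
#                     (0.02, 3.0, 25.0, 9.81),
#                     (0.04, 0.0, 0.1, 9.81)]) -> [0.02]
```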
As discussed in the previous sections, gesture interaction is a major trend in gaming, and the gaming world has been driving down the price of gesture-tracking technology. The Nintendo Wii (2020) and Xbox Kinect (2020) constituted a first generation of gesture-controlled games; VR headset-based games constitute the second. These can rely on controllers, as do the HTC Vive (Vive, 2020) and Oculus Rift (Oculus, 2020b), or on devices such as the Leap Motion (Leap, 2020) for direct detection of hand and finger gestures. These systems typically come with preinstalled gestures that are defined by the manufacturer and can be quite limited (Cabreira & Hwang, 2015).
The most promising of these is the Oculus Quest (Oculus, 2020a), as it allows free movement in a room (no tethering to a device) and can detect a number of free-hand object-manipulation gestures using computer vision, without any sensors on the hands. The associated SDK allows computer scientists, game developers, and researchers to define and extend hand-tracking capabilities, that is, to create new gestures. We look forward to better games, and better UX research, with this device.
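To illustrate what “creating a new gesture” amounts to in such an SDK, here is a generic, hypothetical Python sketch of a pinch detector defined over per-joint hand positions. It does not use the actual Oculus API; the type names and the threshold are assumptions, and the point is only that a custom gesture reduces to a predicate over tracked joint positions.

```python
from dataclasses import dataclass

@dataclass
class Point3D:
    x: float
    y: float
    z: float  # meters, in the tracker's coordinate frame

    def distance_to(self, other: "Point3D") -> float:
        return ((self.x - other.x) ** 2
                + (self.y - other.y) ** 2
                + (self.z - other.z) ** 2) ** 0.5

PINCH_THRESHOLD_M = 0.02  # hypothetical: fingertips within 2 cm

def is_pinching(thumb_tip: Point3D, index_tip: Point3D) -> bool:
    """Declare a 'pinch' when the thumb and index fingertips nearly
    touch. A hand tracker exposes per-joint positions each frame; a
    custom gesture is then a predicate over those positions."""
    return thumb_tip.distance_to(index_tip) < PINCH_THRESHOLD_M
```

In practice, a per-frame predicate like this would be smoothed over several frames so that tracking jitter does not make the gesture flicker on and off.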
6.2 Multimodal Control
In 1980, Richard Bolt demonstrated the first real-time multimodal gesture-speech interactive system, “Put-That-There,” at MIT (Bolt, 1980). More recently, researchers have begun to look at the IxD of autonomous vehicles: situations in which users can not only talk to their cars but also use near-field gestures to support the interaction. For example, Tscharn et al. (2017) examined how pointing gestures can support ongoing interaction in a system that has full video access to the interior and exterior of the vehicle. As technology permeates our lives more and more, use cases like these will become increasingly relevant. For example, imagine what a visit to one of Amazon’s Go shops (Amazon, 2020) could be like if, instead of retrieving items yourself, you had the help of a bot that could retrieve them for you.
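The core of such multimodal systems is fusion: aligning a deictic utterance with the gesture that disambiguates it. The following minimal, hypothetical Python sketch, in the spirit of Bolt’s system, resolves “open that” against the pointing event closest in time; the event types, fusion window, and command format are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PointEvent:
    t: float     # seconds since session start
    target: str  # e.g. "left_window", resolved by the vision system

@dataclass
class SpeechEvent:
    t: float
    utterance: str  # e.g. "open that"

FUSION_WINDOW_S = 1.5  # hypothetical: max gap between the modalities

def fuse(speech: SpeechEvent, points: List[PointEvent]) -> Optional[str]:
    """Resolve a deictic utterance ("open that") against the pointing
    gesture closest to it in time; return a concrete command, or None
    if no gesture is close enough to disambiguate."""
    if "that" not in speech.utterance:
        return None  # nothing deictic to resolve in this sketch
    candidates = [p for p in points if abs(p.t - speech.t) <= FUSION_WINDOW_S]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda p: abs(p.t - speech.t))
    verb = speech.utterance.split()[0]
    return f"{verb.upper()} {nearest.target}"

# e.g. fuse(SpeechEvent(10.2, "open that"),
#           [PointEvent(9.9, "left_window")]) -> "OPEN left_window"
```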
7 Conclusions with Opportunities for Future Work
The previous sections have reviewed how gestures are slowly entering the domain of HCI as an important element of NUIs. We have presented research on UX and IxD, along with examples of applications and technology and their relation to gesture research. However, compared with other technologies, such as smartphones and speech-driven smart assistants, gesture interaction is spreading at a relatively modest rate.
There are a number of good reasons for this. Most notably, the technological breakthrough driven by machine learning is gaining traction but, as of mid-2020, has not yet matured enough for widespread gesture interaction. Compared with speech recognition, in which vast investment and resources have been allocated since the 1990s, gesture recognition has so far been a niche field within computer-vision research. The only commercially viable products to date have been in gaming, but as ways of working transform and the 5G revolution takes place, we expect this to change. Gesture detection, recognition, and interpretation present very difficult scientific problems, not least because manual articulation varies widely between users and over time, but also because gestures are inherently context-dependent.
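The context-dependence point can be made concrete with a tiny, hypothetical Python sketch: the same recognized gesture form must be mapped to different commands depending on what the user is currently engaged with, so recognition alone is not interpretation. All names here are illustrative.

```python
# The same recognized gesture form maps to different commands depending
# on context (here, the device the user is currently engaged with).
COMMAND_TABLE = {
    ("swipe_right", "tv"): "NEXT_CHANNEL",
    ("swipe_right", "music_player"): "NEXT_TRACK",
    ("swipe_right", "ebook_reader"): "NEXT_PAGE",
}

def interpret(gesture, active_device):
    """Look up the command for a gesture in the current context; an
    unknown pairing is treated as a no-op rather than guessed."""
    return COMMAND_TABLE.get((gesture, active_device), "IGNORE")
```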
Despite this, there are numerous opportunities for well-crafted research studies that (1) demonstrate the benefit of using gestures (optionally combined with speech) in particular contexts-of-use, (2) identify the gesture (or gesture-speech) units to use, and (3) bring gesture researchers together with engineers to work out how best to detect gestures, so that gestures can become as natural in HCI as they already are in human–human interaction.