1 Introduction: Aims and Challenges in Gesture Coding and Annotation
Any analysis of verbo-gestural utterances requires the processing of audiovisual material in at least two steps: transcribing spoken utterances [Footnote 1] and coding and annotating gestures. Coding and annotation are often treated as analogous and are not clearly separated. Typically, however, two aspects are distinguished. The encoding scheme covers the linguistic phenomena and determines the categories and category members necessary for description and analysis. The annotation structure scheme defines structural aspects, such as the number of tiers and their temporal and/or hierarchical relations (see e.g. Brugman, Wittenburg, Levinson, & Kita, 2002). Annotation is thus understood as the association of descriptive or analytic notation with data and can include many different kinds, such as syntactic labels, part-of-speech tagging, and semantic role labels, as well as different layers and levels (Ide, 2017, p. 2).
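The distinction between the two schemes can be made concrete with a small data model: the encoding scheme fixes the permissible categories (a controlled vocabulary), while the annotation structure scheme fixes the tiers, their dependencies, and their time alignment. The following Python sketch is purely illustrative; all names and values are invented and not tied to any particular annotation tool.

```python
from dataclasses import dataclass, field
from typing import Optional

# Encoding scheme: the controlled vocabulary a tier may draw on.
GESTURE_PHASES = {"preparation", "stroke", "hold", "retraction"}

@dataclass
class Annotation:
    start_ms: int      # temporal alignment with the media file
    end_ms: int
    value: str         # must come from the tier's vocabulary, if one is set

@dataclass
class Tier:
    name: str
    vocabulary: set = field(default_factory=set)   # empty set = free text
    parent: Optional[str] = None                   # hierarchical relation
    annotations: list = field(default_factory=list)

    def add(self, start_ms, end_ms, value):
        if self.vocabulary and value not in self.vocabulary:
            raise ValueError(f"{value!r} not in encoding scheme of {self.name}")
        self.annotations.append(Annotation(start_ms, end_ms, value))

# Annotation structure scheme: a speech tier plus a dependent gesture tier.
speech = Tier("speech-transcription")
phases = Tier("gesture-phase", vocabulary=GESTURE_PHASES,
              parent="speech-transcription")
speech.add(0, 1200, "she opened the door")
phases.add(200, 450, "stroke")
```

The separation mirrors the practice of annotation tools: categories can be revised without touching the tier layout, and vice versa.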
In this process, coding and annotation systems for verbo-gestural data are faced with particular challenges. They must reproduce both sensory modalities in their specifics and independently of each other. For gestures, categories must be found that capture the nature of the gestural sign as precisely as possible in written form. Categories have to be chosen that enrich the raw material sufficiently for the particular research question to be addressed without tampering with the raw material too much. Furthermore, the dynamics of the relation between speech and gestures, along with their simultaneous and successive nature, needs to be preserved. Technical possibilities, such as motion tracking, open up even more questions: How, and to what extent, can and should manual annotation be combined with such data? How reliably can the annotation of motion, for instance, be performed by human coders relying solely on the visual input, or is it necessary to add quantitative tracking to allow for the extraction of particular kinematic features of gestures (Mittelberg, 2018; Pouw, Trujillo, & Dixon, 2019)? On another level, requirements such as searchability, browsability, and the automatic extraction of information play a role in the coding and annotation process (Bird & Liberman, 2001, p. 54). A tagged and searchable verbo-gestural database opens up the possibility of new research questions that rely on larger corpora and a quantitative, corpus-linguistic perspective (see e.g. Steen et al., 2018). The questions "How much does one abstract?" (Kipp, Neff, & Albrecht, 2007, p. 327) and "How many and what kinds of classes and categories are needed?" thus remain key ones in coding and annotating verbo-gestural data.
This chapter gives a selective overview of the current state of the art in gesture coding and annotation systems. It touches upon the aspects mentioned above at different levels of detail and from various perspectives by reflecting on the interrelation between subject, research question, and coding and annotation systems. Section 2 opens up the discussion by emphasizing that coding and annotation systems are always influenced by the particular theoretical framework in which they are situated. Accordingly, similar to the situation in the analysis of language, a theory-neutral analysis of gestures is not possible. This will be illustrated by consideration of some representative fields of research in gesture studies: language use, language development, cognition, interaction, and human–machine interaction. Section 3 discusses different coding and annotation schemes addressing research questions in these fields. Rather than giving an extensive discussion of the individual systems, the section focuses on their general logic for answering a research question from a particular field. Here, differences between systems addressing the same research topic (e.g. language use) as well as differences across research topics (e.g. language use vs. interaction) will be explored. Section 4 closes with some considerations on the current state of automatic gesture recognition and recording practices and possible future developments in coding and annotating verbo-gestural data.
2 Framework, Subject, and Analysis: On the Interrelation between Theory and Method in Gesture Studies
Looking at the particular chapters included in this Handbook, the ways in which gestures are considered to be part of communication, and how the role of the body in language (use) is looked upon, differ greatly: this is true not only of the theoretical perspectives but also of the methodological approaches taken. Hence, there is no single method or approach to coding and annotating gesture. Rather, there are many different ways, and as such, both the theoretical and the methodological framework provide important orientation points for a study's scope of explanation. Analyses of linguistic constructions with gestures, for instance, call for a detailed analysis of the verbal construction, its temporal relation with gestures, and differences in form, meaning, and type of the gesture, as well as a quantitative, corpus-linguistic analysis (Zima & Bergs, 2017). Research that focuses on gestures' relation with the material world needs to put particular emphasis on the environmental setting and the dynamics of the interaction (Streeck, 2017) and follow a more qualitative perspective on the relation between speaking and gesturing. These two examples show that the particular research question determines which aspects of gesture–speech relations need to be coded, and in what level of detail. Each interest thus leads to a specific view on verbo-gestural data that has theoretical and practical implications for coding and annotation. Consequently, no single coding or annotation schema exists with which all possible research questions could be addressed. Rather, an adequate description of specific phenomena always calls for a particular focus and, thus, a specific coding and annotation procedure.
Leaving aside for now the particularities of the individual systems and their relation to specific research interests (see Section 3 for a detailed discussion of the systems), Sections 2.1 to 2.4 briefly discuss four general differences in current systems: (1) the relation between the verbal and gestural modalities, (2) facets and specifics of coding and annotation, (3) qualitative versus quantitative perspectives, and (4) procedures. All these aspects are influenced by theoretical assumptions about the nature of gesture–speech relations, lead to particular methodological and practical implications in the coding and annotation process, and thus influence the potential outcome of verbo-gestural analyses.
2.1 Relation between Verbal and Gestural Modalities
Speech and gesture are tightly connected on the temporal, semantic, and pragmatic levels (McNeill, 1992). For analyzing gesture–speech relations and designing coding and annotation systems, this close connection of the two modalities has quite practical consequences, because systems have to reproduce this link on different levels. Existing systems, depending in part on their technical implementation, solve this problem in different ways. The majority of systems use the transcription of speech as the center for the analysis of gesture–speech relations. As a result, gestures are placed in a position that is secondary to speech, and the first steps in coding and annotation concentrate on the analysis of speech (see e.g. McNeill, 1992). Gestures are added either by annotating them into the speech transcription or by making them hierarchically dependent on it. Gestures are thus viewed in relation to speech from the beginning of the analytical process. Other systems allow for a first review and analysis of the data without an immediate inclusion of speech, and even allow for coding and annotating gestures without sound and, initially, independently of speech. These systems concentrate on the form of the gestures before bringing speech and gestures together in the analytical process (Bressem, Ladewig, & Müller, 2013; Lausberg & Sloetjes, 2009); they thus separate speech and gestures into different parts of the analysis to allow for a coding and annotation of gestures that is as unbiased as possible.
2.2 Facets and Specifics of Coding and Annotation
Tied to the different possible relations between, and orders of, the two modalities in the coding and annotation process are differences in the type and specificity of the aspects to be coded and annotated. As already mentioned, some systems put particular emphasis on detailed descriptions of gestural forms, because this allows one, for instance, to discover regularities and structures on the level of form and meaning, along with gestures' potential for combinatorics and hierarchical structures [Footnote 2], or to achieve the best possible reproduction of gestures in technical systems, such as avatars (Kipp et al., 2007; Pouw et al., 2019) or robots (Jokinen, this volume; Kopp, 2017). Often connected to this approach is the development of particular categories and a separate description of gestural form parameters to make the kinesics and semiotic processes in gestures visible (Boutet, 2015; Bressem, 2013a). In contrast, other systems carry out a selective description of gestural forms that is also dependent on the meaning expressed in speech. The focus of these systems is on a functional analysis of gestures in relation to speech, such that "only the bodily movements maintaining a relation to speech – co-speech gesture – as well as the function of the gesture are of interest to us" (Colletta, Kunene, Venouil, Kaufmann, & Simon, 2008, p. 59). Categories for the description of gestural forms are taken, for instance, from sign language studies, and gestures are matched to the sign language alphabet (see e.g. McNeill, 1992).
The differences in describing and categorizing gestural forms discussed above are only meant to exemplify how the specificity of individual systems may vary; they also underline the kinds of consequences that may follow from different conceptions of, and views on hierarchies in, the relations between speech and gestures. (Further differences in the specificity and details of the particular systems, along with the underlying research foci that are decisive for the structure of the systems, will be discussed in Section 3.) In general, regardless of the particular approach and focus taken, the basics of current coding and annotation systems are gesture phases, gestural forms, gesture type, a transcription of speech, and the relation between gestures and speech.
2.3 Qualitative Versus Quantitative Perspectives
A further essential difference among existing systems lies in the research perspective. A qualitative perspective is characterized by more openness and flexibility, and often by more naturalistic data from more ecologically valid settings, through which the dynamics of gesture–speech relations can be captured and hypotheses can be developed from, and based on, the material. By contrast, quantitative research aims at the validation of given hypotheses and relies on replicable and larger amounts of data that are statistically evaluable. These two positions influence the conception and design of coding and annotation systems. Systems with a stronger qualitative emphasis may show greater variability and flexibility (see e.g. Müller, 2004) than is usually found in systems aiming for a quantitative approach. Connected to the quantitative perspective is also the use of larger corpora as data, as well as reliability or intercoder agreement measures (see e.g. Lausberg & Sloetjes, 2009).
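Intercoder agreement of the kind mentioned above is commonly quantified with chance-corrected coefficients such as Cohen's kappa, computed per tier over the labels two coders assigned to the same items. A minimal Python sketch (the phase labels and values are invented for illustration):

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two coders' category labels over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(coder1) == len(coder2)
    n = len(coder1)
    # Observed agreement: proportion of items labeled identically.
    po = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Chance agreement from each coder's marginal label distribution.
    c1, c2 = Counter(coder1), Counter(coder2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (po - pe) / (1 - pe)

a = ["stroke", "stroke", "hold", "retraction", "stroke", "hold"]
b = ["stroke", "hold",   "hold", "retraction", "stroke", "hold"]
print(round(cohens_kappa(a, b), 3))  # -> 0.739
```

In practice, studies often use weighted or multi-coder variants (e.g. Fleiss' kappa) and report agreement separately for segmentation and for category assignment.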
2.4 Procedures for Coding and Annotation
Annotation programs have become the standard for analyzing gesture–speech relations in recent years, for both qualitative and quantitative studies. Through the use of such programs, coding and annotation have become faster and more reliable, in terms of both technical practicability and plausibility. While the majority of studies use ELAN, a program developed by the Max Planck Institute for Psycholinguistics in Nijmegen (Wittenburg, Brugman, Russel, Klassmann, & Sloetjes, 2006), researchers also have the option of choosing other programs, such as EXMARaLDA (Schmidt, 2004) or ANVIL (Kipp, 2001). Although the basic setup of these programs is similar, differences in their functionality may lead to differences in the coding and annotation systems. Whereas all three programs offer the possibility of including three-dimensional motion data, ANVIL additionally allows "spatial annotation," whereby two points can be marked directly on the video screen so that the distance between two hands can be annotated more accurately (Kipp et al., 2007, p. 331). The three annotation programs thus offer fine-grained possibilities for combining purely visual information with automatic analyses of movement patterns (see e.g. Ripperda, Drijvers, & Holler, 2020) and consequently allow for an analysis of gestural forms based on more material, on larger corpora, and at a more abstract level (Mittelberg, 2018; Trujillo, Vaitonyte, Simanova, & Özyürek, 2019). (For an overview and discussion of the different methods in multimodal motion tracking, see Pouw et al., 2019, and Trujillo, this volume.)
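One practical consequence of this tool landscape is that annotations are stored in open, machine-readable formats: ELAN, for instance, saves its tiers as XML (.eaf files) that downstream scripts can query directly. The sketch below parses a heavily abbreviated stand-in for such a file with Python's standard library, resolving the time-slot references that anchor each annotation; real .eaf files carry much more metadata, and for full support dedicated libraries such as pympi exist.

```python
import xml.etree.ElementTree as ET

# Abbreviated, invented stand-in for an ELAN .eaf file.
EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="200"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="450"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="450"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="900"/>
  </TIME_ORDER>
  <TIER TIER_ID="gesture-phase">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>stroke</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>retraction</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

def read_tier(eaf_xml, tier_id):
    """Return (start_ms, end_ms, value) triples for one tier."""
    root = ET.fromstring(eaf_xml)
    # Time slots map symbolic IDs to milliseconds.
    slots = {s.get("TIME_SLOT_ID"): int(s.get("TIME_VALUE"))
             for s in root.iter("TIME_SLOT")}
    out = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            out.append((slots[ann.get("TIME_SLOT_REF1")],
                        slots[ann.get("TIME_SLOT_REF2")],
                        ann.findtext("ANNOTATION_VALUE")))
    return out

print(read_tier(EAF, "gesture-phase"))
```

Because the format is plain XML, searchability and corpus-scale extraction of the kind discussed in Section 1 reduce to scripted queries over such files.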
3 Systems of Gesture Coding and Annotation
This section first considers some general aspects of practices of coding and annotating gestures, highlighting the fact that theoretical assumptions influence subjects, aspects, and levels of analysis and, as such, also make themselves visible in annotation systems. We then illustrate this in more detail by discussing existing coding and annotation systems from a thematic point of view. Using several research domains in gesture studies as examples, namely language (use), language development, cognition, interaction, and human–machine interaction, the section focuses on the general logic of existing systems for answering the particular research questions at hand. Because only a few explicit coding and annotation systems for gestures exist, the following sections address systems as well as procedures and methods followed in studies analyzing gesture–speech relations. We focus on a selection of systems and coding and annotation practices that illustrate the link between theory and method for particular thematic areas and/or research questions.
3.1 Words and Gestures: Coding and Annotation for Exploring Language (Use)
Most studies exploring the relation of words and gestures in language (use) assume that usage events are dynamic and multimodal in nature, yet that the degree to which gestures are part of language is variable (Cienki, 2015). The relation between speech and gesture is understood to be "reciprocal," such that the "gestural component and the spoken component interact with one another to create a precise and vivid understanding" (Kendon, 2004, p. 174, emphasis in original). Spoken words and/or phrases have a close relation with gestures on different linguistic levels (e.g. phonology, semantics, syntax, pragmatics), but, depending, for instance, on the type of gesture, the communicative context or genre, or the language community, this connection may vary. Starting from this assumption, studies have aimed to uncover the tight relation between speech and gestures on these different levels. Depending on the object of investigation, different aspects of coding and annotation become relevant, regarding both the temporal coordination and the functional relation between the two modalities.
Early on, research showed that body motion in general exhibits a "precise correlation between changes of body motion and the articulated patterns of the speech stream" (Condon & Ogston, 1967, p. 227). Accordingly, it is assumed that "the pattern of movement that co-occurs with the speech has a hierarchic organization which appears to match that of the speech units" (Kendon, 1972, p. 190). Research addressing this relation thus requires a coding and annotation system that captures the link of both modalities with regard to intonation phrases, stress, and syllables, for instance. In addition, particular movement characteristics, such as the velocity profile, may be of interest (Karpiński, Jarmołowicz-Nowikow, & Malisz, 2008). Likewise, data from programs such as Praat, a software package for phonetic speech analysis (Boersma & van Heuven, 2001), should be integrable. If, for instance, the connection of gestures with the narrative structure of speech is to be investigated, gestures' relation with discursive elements has to be included as well. In a study on the Palm Up Open Hand (PUOH) and the types of prosodic and discourse units marked by this type of gesture, Ferré (2011) transcribes, annotates, and segments speech using a range of programs, such as Praat and EasyAlign, and annotates speech for speech acts, intonation phrases, words, syllables, and stress. For the gesture annotations, the study uses the annotation programs ELAN and ANVIL and, drawing on the typology of gestures proposed by McNeill (1992), codes beat gestures, their phases, and their semantic and discursive functions in discourse.
With this, Ferré (2011) shows, for instance, that beat gestures accompany emphatic stress in the verbal mode and that the PUOH, in particular the hand flick, fulfills different pragmatic functions in discourse and can acquire "a judgmental or epistemic value" (Ferré, 2011, p. 16).
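Analyses of this kind ultimately come down to aligning time-stamped events from two tools: stress intervals exported from a Praat annotation on one side, gesture timestamps from ELAN or ANVIL on the other. A hypothetical sketch of the core test, checking which beat apexes fall inside a stressed syllable (all values are invented, and the function name is illustrative, not taken from any of the studies cited):

```python
def apexes_on_stress(stressed, apexes):
    """stressed: list of (start_s, end_s) intervals from a stress tier;
    apexes: list of beat-apex times in seconds.
    Returns the apexes that fall within a stressed-syllable interval."""
    return [t for t in apexes
            if any(start <= t <= end for start, end in stressed)]

# Invented example values (seconds):
stressed = [(0.42, 0.61), (1.10, 1.33), (2.05, 2.24)]
apexes = [0.50, 1.70, 2.10]
hits = apexes_on_stress(stressed, apexes)
print(len(hits) / len(apexes))  # proportion of beats coinciding with stress
```

A real pipeline would read both tiers from the tools' export files and would also have to handle coder-dependent tolerances around interval boundaries.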
If it is assumed that gestural movement parameters vary in their characteristics depending on their correlation with prosodic characteristics of speech (see e.g. Ruth-Hirrel & Wilcox, 2018), a fine-grained coding of the gestural kinesics is necessary (Boutet, 2015). This perspective is taken by Shattuck-Hufnagel and Ren (2018). Addressing the temporal relationship between speech and non-referential gestures, the authors applied a coding and annotation system that starts out with an analysis of the gestures without sound, "to avoid any possibility of the labeler's judgment about events in one channel being influenced by events in the other" (p. 4), and coded gestures for their phases, their forms (hand shape, movement trajectory, location, handedness), referentiality, and their use in sequences. Subsequently, speech was transcribed orthographically, labeled for its intonational structure using Praat and ToBI, segmented into syllables, and annotated for higher-level prosodic constituents. Using this approach, the authors show that the trajectory shapes of the gestures investigated "are consistent across a higher-level prosodic grouping" and that the category of beats includes "gestures with multiple phases and various types of rhythmicity" (p. 1).
As illustrated by the two examples given above, studies can overlap in the categories and classifications applied, in particular with respect to the coding of gestural forms and functions, yet, depending on the research focus, they concentrate on different facets. However, annotation systems that specifically address such a prosodic perspective with the aim of providing a general basic structure that may be adjusted to the needs of particular research questions are rare. One example is the system proposed by Jarmołowicz, Karpiński, Malisz, and Szczyszek (2007).
Similarly, only a few systems provide a basic outline for analyzing gesture–speech relations on the level of semantics and/or syntax. One system that puts forward a general framework is the Linguistic Annotation System for Gestures, or LASG (Bressem et al., 2013). It provides ways to describe and analyze the motivation of gestural forms (modes of representation, image schemas, motor patterns, and actions), addresses gestures in relation to speech on a range of levels of linguistic description (prosody, syntax, semantics, and pragmatics), and, in doing so, offers obligatory as well as optional categories for each of the different aspects of linguistic description. Yet, due to its claim to cover all of these levels of description, the system is of course not detailed enough for the investigation of particular semantic or syntactic research questions; it thus needs to be adjusted to address different depths and aspects of the relation of speech and gestures in language (use) (see e.g. Seeling, Fricke, Lynn, Schöller, & Bullinger, 2016, for an adaptation of the system).
The majority of research rather follows the LASG procedure outlined above and, based on generally accepted categories, provides coding and annotation systems tailored to a specific research question. This will be exemplified below by studies focusing on verb semantics and its relation to syntactic constructions. Cross-linguistic studies on motion events not only emphasize a tight link between gestures and the semantics of speech, but also show a close connection to syntactic patterns in the spoken utterance (Kita & Özyürek, 2003). For example, differences in the grammatical aspect used in speech are reflected in the gestural forms, the timing of gestures relative to the verbal utterance, and the information distributed across the modalities (Cienki & Iriskhanova, 2018; Duncan, 2005; Gullberg, 2011). As a result, coding and annotation schemes need to be able to capture how events, for instance, are semantically and syntactically construed in speech, along with gestural form and meaning on different levels. Cienki and Iriskhanova (2018), for example, in investigating aspectuality in German, Russian, and French, use a coding and annotation scheme that includes a set of verb tense coding categories, "time meaning" (past, present, future, and ø for infinitives and imperatives), gesture phases, and a kinesiological account to determine the category type of a given gestural movement (Boutet, 2015; Boutet & Cienki, this volume) – that is, bounded versus unbounded movements – "without consideration of the type of co-occurring verbal expression" (Cienki & Iriskhanova, 2018, p. 68). For each of these features, hierarchical relations between the tiers and controlled vocabularies were defined in ELAN.
"Each of the verbs or its constituents (auxiliary, participle) was annotated according to the duration of its vocal production," and only gestures "that overlapped [with] the production of a verb" (Cienki & Iriskhanova, 2018, p. 75) were annotated. In another study, investigating aspectuality and, in particular, Aktionsart in Mandarin Chinese and English, Duncan (2003) included a transcription of speech, along with speech dysfluencies, in which "utterances expressive of the target aspect and Aktionsart distinctions were noted" (Duncan, 2003, p. 192), along with the relevant features of grammatical structure. The gestures were coded for phases, form, type, semantic content, and function in relation to the speech; "the timing of the gesture production relative to the speech was exactly coded" (Duncan, 1996, p. 21), and movements of referential gestures were coded according to Talmy's (1985) classificatory scheme (e.g. MOTION, PATH, and MANNER). For a study on the English motion construction [V(motion) in circles], Zima (2014) coded gestures for motion components (Manner vs. Path), shape and orientation of palm and fingers, handedness, depiction of the number of circles, and type of motion. Speech was coded for the semantic meaning of the construct (e.g. literal motion, metaphorical use, ambiguous) and discourse genre.
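The selection criterion just quoted – keep only gestures whose production overlaps a verb's vocal production – is, operationally, a plain interval-intersection test over two tiers. The sketch below illustrates it with invented tier contents; it is not the authors' actual implementation.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share any stretch of time."""
    return a[0] < b[1] and b[0] < a[1]

def gestures_overlapping_verbs(gestures, verbs):
    """gestures, verbs: lists of (start_ms, end_ms, label) annotations.
    Keep each gesture that overlaps at least one verb interval."""
    return [g for g in gestures
            if any(overlaps(g[:2], v[:2]) for v in verbs)]

# Invented tier contents (milliseconds):
verbs = [(300, 650, "opened"), (1400, 1750, "ran")]
gestures = [(500, 900, "G1"), (1000, 1300, "G2"), (1600, 2000, "G3")]
print(gestures_overlapping_verbs(gestures, verbs))  # keeps G1 and G3
```

The same test underlies many of the timing analyses discussed in this section; what differs between studies is which tiers are intersected and how strictly "overlap" is defined (e.g. full stroke vs. apex only).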
All of the examples given above underline that, even with similar research topics, coding and annotation procedures differ immensely between studies, not only for speech but, more importantly, also for gestures. Although similarities between the practices are visible, the biggest differences lie in the amount and degree of detail of the coding of gestural characteristics, on the levels of form, meaning, and function. As mentioned in Section 2, one reason for these differences can be found in the (hierarchical) relation assumed between the two modalities and thus the "direction" from which the data is approached (speech first vs. gesture first). A further reason lies in the underlying research question. Compare, for instance, Cienki and Iriskhanova (2018) and Zima (2014). Whereas Cienki and Iriskhanova (2018) are interested in whether speakers of French, German, or Russian gesture "similarly or differently (with regard to movement quality) when talking about events in the perfect(ive) versus in the imperfect(ive)" (p. 5), Zima (2014) addresses the question of whether gestures can be a recurrent feature across instantiations of English motion constructions. Because Zima is, first and foremost, interested in whether gestures frequently occur with these constructions, a less detailed form description is given in favor of counts of gesture occurrences. Cienki and Iriskhanova (2018), however, based on their assumption that aspectuality may be visible in gestural movement characteristics, need to take a more fine-grained perspective on these parameters. The two research questions thus take a different look at gestural form characteristics, and, as a result, the coding and annotation procedures put different emphases on them.
Similar variations can also be found in studies addressing the relation between speech and gestures on the level of pragmatics.
The functions gestures have as they contribute to or constitute the acts or moves accomplished by utterances, are referred to as pragmatic functions. In the terminology proposed, gestures which show what sort of a move or speech act a speaker is engaging in are said to have performative functions. Gestures are said to have modal functions, if they seem to operate on a given unit of verbal discourse and show how it is to be interpreted. Gestures may serve parsing functions when they contribute to the marking of various aspects of the structure of spoken discourse.
Studies on pragmatic gestures assume that these different functions go hand in hand with differences in their contexts of use as well as in the gestural forms (Kendon, 1995; Ladewig, 2010; Müller, 2004). In order to account for these different aspects, coding and annotation procedures call for a detailed description of gestural forms, "micro-analyses of single sequences […] to corroborate the semiotic and context-of-use analysis" (Müller, 2004, p. 12), and the exact timing of the gestures in relation to speech (see also Ladewig, 2010, for an adaptation of this method in ELAN). Accordingly, in their study on the discursive function of holding-away gestures, Bressem, Stein, and Wegener (2017) take a form-based linguistic approach with a four-step procedure moving from a context-independent toward a context-sensitive analysis. They take the distribution of the gestures (as proposed in the LASG system) as their starting point, combine it with a functional analysis of discourse markers (Fraser, 1999), and add additional layers of analysis to account for the particular focus on discursive functions of gestures. On this basis, the authors illustrate specific forms and a functional diversity of the gestures that are connected to different types of speech units and show that changes in the discourse type and interactional setting lead to specific forms and uses of the holding-away gesture. In a study investigating pragmatic functions of pointing gestures, Enfield, Kita, and de Ruiter (2007) "suggest that form/function differences between […] two types of pointing gesture reflect distinct types of constraints which interactants have to satisfy in confronting the online problem-solving task of designing utterances in face-to-face interaction" (p. 1724).
For this, the authors coded and annotated all utterances with spatially anchored pointing gestures for the “manner of articulation of the gesture, distinguishing formally between B-points (gestures in which the whole arm is used as articulator, outstretched, with elbow fully raised), and S-points (gestures in which the hand is the main articulator, the arm is not fully straightened, typically with faster and more casual articulation)” (p. 1725). Additionally, orientation of the head was coded. Gestures with forms other than those with a single protruding digit and movements that seemed to represent motion were excluded. As the authors were interested in the relation between form and function in these gestures, their focus was on particular and quite restricted form characteristics of the gestures.
All of the examples given above illustrate how a particular focus on exploring gestures in language (use) on different levels of linguistic description may lead to very specific coding and annotation schemes and practices.
3.2 Learning to Speak and Gesture: Coding and Annotation for Examining Language Development
Gesture-related research on language acquisition focuses on the role that gestures play in mediating the acquisition of spoken language, in communication, and in cognition (see Morgenstern, this volume; see also Wilcox, this volume, on the place of gesture in sign language acquisition). Studies investigate how gestures develop and change in parallel to spoken language development and what role they play in the prelinguistic period (see Gullberg, De Bot, & Volterra, 2008, for an overview). Depending on the age group, different topics emerge. In earliest development, the complex relationship between lexical and syntactic development in comprehension and production is of interest (e.g. gesture types, the difference between actions and gestures, relations with spoken syntagmas, speech acts) (see e.g. Andrén, 2010; Beaupoil-Hourdel, Morgenstern, & Boutet, 2016). In older children, their use of iconic gestures (Allen et al., 2007), narrative skills (Colletta, 2009), or marking of focus (Koutalidis et al., 2019) are investigated. For each of these topics, slightly different coding and annotation schemes are necessary to address the relation between speech and gestures at different levels of detail. In the rest of Section 3.2, four different coding and annotation practices of studies on first language development will be summarized to illustrate the diversity and variability of the approaches, depending on the research question involved. [Footnote 3]
In a study on the use of gestures in children between 18 and 30 months, Andrén (2010) focused on the children’s gestural repertoire, gestural development over time, and the organization of gesture in coordination with other semiotic resources, such as speech. As a result, the coding and annotation scheme includes a transcription of utterances consisting of words and/or gestures in CHATFootnote 4 and the annotation of every instance of a gesture in terms of various deictic, iconic, and conventional features. “These categories were not predetermined before the annotation procedure [was] begun, but emerged in an iterative and ‘dialectic’ fashion during the annotation process itself. This means that all categories are essentially motivated by what was found in the data” (p. 97). In a further step, these basic annotation categories were then specified for a qualitative analysis that focused on “how deictic, iconic, and conventionalized aspects occur together in various sorts of gestures; how the more specific meaning of children’s gestures emerges from an interplay between gestural form, coordination with speech, previous utterances – and other factors” (pp. 332–333). For a study on the use of gestures in joint book reading situations with children at the age of 14 months, Grimminger and Rohlfing (2019) annotated all utterances and deictic gestures, specifying whether they were pointing, showing, or giving an object, and annotated their semantic relation with the speech (as reinforcing or supplementary). With the coding, the authors wanted to investigate whether the use of pointing gestures in combination with verbal utterances of children and mothers positively affected the lexicon of the children.
Based on these annotations, the authors discovered that 98% of the deictic gestures were pointing gestures and that these pointing gestures indeed correlated significantly with the lexicon of the children at 18 months (p. 96). In an exploratory study on the status and evolution of actions, gestures, and words expressing negation between ages 1 and 4, Beaupoil-Hourdel et al. (2016) developed a coding system combining Excel and Clan, “a program … designed specifically to analyze data transcribed in the CHAT format” (MacWhinney, 2000, p. 8). With the system, the authors coded all negative communicative acts, the channels of their expression and perception, and their semiotic status, and also distinguished between actions, that is, “movement produced by the child as a reaction to the environment rather than being intentional and conventionalized” (Beaupoil-Hourdel et al., 2016, p. 101), and gestures. In doing so, they were able to conduct macro- and micro-analyses of the children’s negative occurrences. Based on the system, Beaupoil-Hourdel et al. (2016) showed that “gestures used prelinguistically are qualitatively different from gestures used once speech is already elaborate” and that there are “different uses of multimodality and therefore, the status of multimodality changes throughout the corpus” (p. 27). Investigating children’s oral and multimodal discourse, and in particular, “linguistic and gesture production of narratives performed by children and adults of different languages, with emphasis [on] the relationship between speech and gesture and how it develops” (p. 
59), Colletta, Kunene, Venouil, Kaufmann and Simon (2008) used an annotation system that analyzes speech and gestures in terms of four different blocks: (1) speech transcription and syntactic analysis, (2) narrative analysis, (3) annotation of gestures, and (4) evaluation. For the gestures, they concentrated on gesture phases, the type of gesture (deictic, representational, performative, framing, discursive, interactive, word searching), the gestures’ relation with speech (reinforcing, complementing, supplementing, integrating, contradicting, substituting), the temporal placement of the gestures in relation to speech, and the gestures’ form (hand shape, movement trajectory, and manner). With the annotation system applied in ELAN, the authors point out that it is possible to “identify, count and describe concrete versus abstract representational gestures, marking of connectives, syntactic subordination, the anaphoric recoveries, hesitation phenomena, etc. as well as to study narrative behaviour from a multimodal perspective” (Colletta et al., 2008, p. 66).
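The temporal logic shared by such multi-tier schemes can be sketched as a small data structure. The tier names, labels, and time values below are hypothetical illustrations, loosely modeled on the kinds of blocks described above; actual projects would work with a tool such as ELAN and its file format rather than ad-hoc code.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    tier: str      # e.g. "speech", "gesture_type", "gesture_phase"
    start: float   # onset in seconds
    end: float     # offset in seconds
    value: str     # label from the tier's controlled vocabulary

# Hypothetical fragment of a time-aligned, multi-tier annotation
annotations = [
    Annotation("speech", 0.0, 1.8, "and then the cat climbed up"),
    Annotation("gesture_phase", 0.4, 0.7, "preparation"),
    Annotation("gesture_phase", 0.7, 1.3, "stroke"),
    Annotation("gesture_type", 0.4, 1.3, "representational"),
    Annotation("speech_relation", 0.4, 1.3, "reinforcing"),
]

def overlapping(tier, start, end, anns):
    """Return all annotations on a tier that temporally overlap [start, end]."""
    return [a for a in anns if a.tier == tier and a.start < end and a.end > start]

# Which gesture phases co-occur with the spoken utterance?
phases = overlapping("gesture_phase", 0.0, 1.8, annotations)
print([p.value for p in phases])  # ['preparation', 'stroke']
```

Queries of this kind, applied across a whole corpus, are what makes the temporal placement of gestures relative to speech countable and searchable.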
Comparing these four examples, it becomes clear that in order to investigate the relation of speech and gestures in language development at different ages and stages, no single approach to the coding and annotation can be taken, as the particular research question always calls for a specific focus that concentrates on certain aspects while neglecting others. As such, the perspective may vary, for instance, in the depth of the analysis both for gestures and speech (micro vs. macro) as may the research perspective (exploratory vs. guided by hypotheses). Both of these aspects highly influence the analysis and thus the results. An open coding procedure, such as the one applied by Andrén (2010), for instance, leads to the adaptation of categories during the coding and annotation process, allowing for the incorporation of initially unexpected phenomena, thus making for an exploratory investigation. A rather narrow focus on a particular type of gestural phenomenon (deictic gestures), such as the one adopted by Grimminger and Rohlfing (2019), does not and is not meant to allow for the exploration of unincorporated phenomena. Accordingly, insights from both analytical and methodological procedures vary by providing different kinds of access to verbo-gestural phenomena of language acquisition.
3.3 Gestures as Windows onto Thinking: Coding and Annotation for Insights into Cognition
Numerous studies show that when people speak, they have what Kita and Essegbey (2001) call a “cognitive urge” to gesture, as gestures help us talk and think. They boost language production and comprehension processes (De Ruiter, 2000; Krauss, Chen, & Gottesman, 2000), reduce cognitive load, are a tool for exploring reasoning and thinking, and especially reveal aspects of children’s cognitive development (see e.g. Goldin-Meadow & Wagner Cook, 2012) (see also the chapters by Alibali & Hostetter and Novak & Goldin-Meadow in this volume). In order to explore these aspects, studies addressing the relation between speech, gestures, and cognition are confronted with challenges similar to those already discussed in Section 3.2 on language development, however with slightly different foci. For studying gestures as windows onto the cognitive development of children, for instance, it is crucial to investigate the temporal and semantic relation between speech and gestures and to examine whether children are producing ‘mismatches’ between them (Alibali & Goldin-Meadow, 1993). As a result, such studies and their coding and annotation practices focus on marking the temporal movement phases of gestures, their semantic function (complementary or contradictory relation), the type of gestures, and aspects of their form (see e.g. Alibali, Spencer, Knox, & Kita, 2011). In order to describe the form in more detail, Hilliard and Cook (2017), for instance, developed a coding system that, in combination with other existing systems, allows for a description of gestural forms that is suited to addressing gesture’s role in cognitive development (Congdon, Novack, & Goldin-Meadow, 2018).
Similar foci are also visible in studies addressing questions of cognitive load (Wagner, Nusbaum, & Goldin-Meadow, 2004), yet, as mentioned before, approaches to coding and annotation may vary. Chu and Kita (2008), for instance, in their study on mental rotation problems, pursue a different path that focuses not so much on the semantic role and forms of the gestures as on their functional role. Assuming that “when solving novel problems concerning the physical world, adults may start with bodily exploration of the physical world” (Chu & Kita, 2008, p. 708), thus using gestures significantly, the authors coded gestures according to their movement segments and their location in gesture space, and they used a functional gesture classification suited to the tasks at hand (see e.g. hand–object interaction gestures, object-movement gestures, tracing gestures, rotation direction gestures). This functional classification of the gestures was brought together with an analysis of the speech focusing on agentivity, for instance. Based on this, the authors “concluded that the motor strategy becomes less dependent on agentive action on the object, and also becomes internalized over the course of the experiment, and that gesture facilitates the former process” (Chu & Kita, 2008, p. 706).
Again, other foci are visible in studies investigating the relation between speech and gestures in light of production. In a study on speech disfluencies, Seyfeddinipur (2006), for example, pursues a strongly form-based perspective in addressing the question of “whether speech-accompanying gestures are sensitive to speech disfluency and whether gesture can provide evidence for the self-monitoring process in speech production” (p. 81). In order to account for this, Seyfeddinipur develops a particular method for describing gestural movement phases that allows for a detailed coding and annotation of different movement segments and gesture phases, and thus for close examination of speech and gesture in disfluencies.
This short discussion reveals that practices in coding and annotation, also with respect to gesture and cognition, vary greatly due to the vast range of possible research questions and the assumptions connected with them. A system that hopes to provide a perspective that bridges different interests on gesture and cognition is NEUROGES (Lausberg & Sloetjes, 2009, 2015). It is a “research tool for the analysis of hand movement behavior, including gesture, self-touch, shifts, and actions” and assumes that “main kinetic and functional gesture categories are differentially associated with specific cognitive (spatial cognition, language, praxis), emotional, and interactive functions” (Lausberg & Sloetjes, 2009, p. 1). The scheme implies that different gesture categories may be generated in different brain areas. NEUROGES is composed of three modules: (1) kinetic gesture coding, (2) bimanual relation coding, and (3) functional gesture coding. Module 1 refers to the kinetic features of a hand movement, that is, execution of movement versus no movement, trajectory and dynamics of movements, location of acting as well as contact with the body or not. Module 2 allows for the coding of bimanual relation (e.g. in touch vs. separate, symmetrical vs. complementary, independent vs. dominance). Module 3 brings in the functional aspects and determines the meaning of gestures based on a specific combination of kinetic features (hand shape, orientation, path of movement, effort, and others), which define the various gesture types. NEUROGES offers the option of analyzing gestures independently of speech, and “since it enables the investigation of processes that are not verbalized, its specific potential lies in the exploration of implicit cognitive, emotional, and interactive processes that may be conducted beyond awareness” (Lausberg & Sloetjes, 2015, p. 2).
NEUROGES also offers an ELAN template including controlled vocabularies and template files for statistical analysis, such as in SPSS.
The options for statistical analyses included in NEUROGES, and in particular its support for intercoder reliability, point to a feature common to all systems addressing cognition, regardless of their particular research question. The reason for including these features lies in the fact that studies on cognition usually follow a quantitative research perspective in an experimental setting. Categories for coding and annotation are thus normally designed to be less variable and flexible and aim at testing given hypotheses using replicable methods and larger amounts of data. With these characteristics, the systems differ greatly from the majority discussed in Section 3.1 and, even more so, from most of those designed for exploring interaction, like the ones discussed in Section 3.4.
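Intercoder reliability for categorical gesture codes is commonly reported as chance-corrected agreement, for instance Cohen’s kappa. A minimal sketch of the computation, with invented gesture-type labels from two hypothetical coders (in practice one would use an established statistics package rather than hand-rolled code):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' categorical labels."""
    n = len(coder_a)
    # Proportion of items on which the coders actually agree
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented labels for ten gesture units, coded independently by two coders
a = ["deictic", "iconic", "iconic", "beat", "deictic",
     "iconic", "beat", "deictic", "iconic", "beat"]
b = ["deictic", "iconic", "beat", "beat", "deictic",
     "iconic", "beat", "iconic", "iconic", "beat"]
print(round(cohens_kappa(a, b), 2))  # → 0.7
```

Here the coders agree on 8 of 10 items, but because chance agreement is substantial with only three labels, the chance-corrected value is lower than the raw 80%.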
3.4 Gestures as Part of a Bodily Ensemble: Coding and Annotation for Exploring Interaction
Gestures, among other bodily movements, are one of the multimodal resources that speakers use in constructing and negotiating interaction. For an analysis of spoken language, approaches from an interactional and/or conversational point of view argue that the holistic and situated role of these resources in building human action has to be explained (Mondada, 2016) and that they need to be an integral part of the description of interactional practices (Deppermann, 2018). Thus, “the embodied way in which people communicate and gather together, as well as the ecology of the activities they engage in, and their material and spatial environment” (Mondada, 2016, p. 337) have to be made part of analyses. For this, actions have to be documented in their temporal and sequential order, and the coparticipants’ orientation to these multimodal actions needs to be captured. This “embodied turn” (Nevile, 2015) in the study of human and social interaction results in theoretical and methodological challenges as to how to deal with the role of video material within such analyses. In general, two methodological lines of dealing with this problem may be distinguished.
On the one hand, some approaches follow a rather “classic,” qualitative conversational approach to the transcription and analysis and include video stills in the verbal transcript (Stukenbrock, 2009). Transcripts contain the trajectories, the temporal relations between speaking and other forms of visual communication, and the qualities of the multiple resources from the perspective of the doers, meaning “all possibly relevant embodied actions, such as gesture, gaze, body posture, movements, etc. that happen simultaneously to talk or during moments of absence of talk” (Mondada, 2014, p. 1). Embodied actions are described briefly without focusing too much on their physical appearance, varying in the level of detail depending on the particular research question. An example illustrating this approach is found in a study on the arrival of guests at a dinner party. Here, Oloff (2010) demonstrates how interactants orient themselves toward the sequentiality of the activity of waiting, and by doing so, jointly construct the beginning of the dinner. For the analysis, Oloff (2010) concentrates on a precise transcription and coding of the sequentiality and temporality, a short description of the bodily actions overlapping with speech (see e.g. “holds up hands,” “opens up bag,” “looks at watch”), and the spatial layout and arrangement of the room along with movements of interactants in space. Using this approach to annotation and coding, Oloff (2010, p. 220) uncovers a particular temporal sequence of the individual phases of the activity of waiting and the bodily actions contained therein. In a similar yet slightly different vein, Schmitt (2005) approaches the turn-taking mechanism from a multimodal perspective.
The transcription follows the general principle of including bodily behavior in verbal transcripts by inserting still images. However, short descriptions of the gestures, for instance, are not included in the transcript. Rather, the still image is complemented by a short functional description in the analysis (see e.g. “She starts eye-contact with him, simultaneously lifts her left arm, stretches it completely and realizes a clear signal gesture with her arm and hand” [Schmitt, 2005, p. 34], translation JB). Following this perspective, Schmitt illustrates that not only the turn and its construction are of interest to conversational analyses but so too is the status of the current speaker as an accomplishment of the other interactants.
On the other hand, studies may also follow a more “technical” way of exploring the body’s role in interaction. Often pursuing a combination of qualitative and quantitative designs, numerous studies include annotation programs for the coding and analysis of gestures. Video material thus assumes a specific role and is exploited differently in the transcription and annotation process than it was in the studies described above. Rather than concentrating on the whole temporal continuity of actions, these studies focus on particular facets of interaction and their relation with gestures and other bodily actions. Using data from head-mounted eye-trackers in a corpus of face-to-face conversations including various conditions, Oben and Brône (2015), for instance, explore whether gaze “fixations by speakers and fixations by interlocutors have [an effect] on subsequent gesture production by those interlocutors” (p. 546). Focusing on depictive gestures, the authors did not concentrate on a detailed description of the gestures, but rather measured how similar two gestures were to each other. In combination with the coding of regions of interest and fixation duration of the eye-gaze, the authors demonstrate that there is a “significant effect of interlocutor gaze, but not of speaker gaze on the amount of gestural alignment” (Oben & Brône, 2015, p. 546). Another example of this second kind of approach, combining a corpus-based, bottom-up method rooted in gesture studies and interactional linguistics, is a study by Debras and Cienki (2012). In order to account for “the possible functions of two types of gesture during stancetaking in the course of human-human interaction” (p. 932) (lateral head tilts and shoulder shrugs), the authors used a combination of coding and annotation in ELAN and Excel to retrieve the functions of the gestures during conversations.
For this, the gestures were first described and annotated for their forms without listening to speech, and in a second step, coded in Excel in relation to verbal and other types of bodily behavior. Using this combination, which allows for a synergy of qualitative and quantitative perspectives, the authors demonstrate that “participants tend to use the two gestures in a similar way when positioning themselves with respect to a prior stance: either to affiliate with their conversation partner’s stance, or to disaffiliate with a third party’s positioning” (Debras & Cienki, 2012, p. 937).
Whereas the “classical” approach to interaction analysis gives a more comprehensive understanding of the action of the interactants and of the interaction itself, the last two examples zoom in on particular gestural phenomena in relation to regulating and constructing interactional processes. Both examples therefore illustrate again the point repeatedly made throughout this chapter: Even though research interests may be similar, the methodological approaches and the perspectives taken may differ immensely and thus yield very diverse insights into, for example, interaction.
3.5 Making Gestures Reproducible: Coding and Annotation for Human–Machine Interaction
Unlike in the examples given above, in which differences in the approach to coding and annotation are deeply rooted in theoretical assumptions, schemes and practices for coding and annotating human–machine interaction are guided first and foremost by the technical requirement of making gestures reproducible. That is, these systems aim at reproducing the kinematics of human gestures along with the gestures’ relation to speech by “imitating” a human speaker’s gesture behavior (Kim, Ha, Bien, & Park, 2012; Kipp et al., 2007; Theofilis, Nehaniv, & Dautenhahn, 2014; see also the chapter by Jokinen, this volume). To achieve this, systems have to answer three main questions: (1) When do robots need to generate human-like behavior? (2) What human-like behavior needs to be generated? (3) How is it possible to generate human-like behavior (Kim et al., 2012)? As a result, these systems face obstacles similar to the ones discussed in the sections above: Movement phases as well as form characteristics of the gestures need to be identified and coded, the types of gesture have to be accounted for, and of course speech and gestures need to be set in relation to each other. In a last step, these systems face a challenge that is not addressed in the ones discussed so far, namely the evaluation of the coding and annotation by recreating the human gestures; that is, the question of how well the annotation reflects the original gesture. How systems tackle those questions and challenges varies. Human gestures are either manually coded or captured by using motion capture data (for an overview see Kim et al., 2012; Trujillo, this volume). Systems thereby concentrate on the hand shape of the gestures, their movement patterns, and their spatial arrangement, along with their relation to speech (see e.g. 
Kim et al., 2012; Kipp et al., 2007; Kopp & Wachsmuth, 2004; Martell, 2005). However, this approach is very time-consuming, and a lot of the complexity, especially of the original movement, is lost. As a result, newer approaches to coding and annotation try to overcome this shortfall by making use of automatic gesture detection (see e.g. Madeo, Lima, & Peres, 2017; Spiro, Taylor, Williams, & Bregler, 2010; Turchyn et al., 2018).Footnote 5 With these efforts, the systems not only aim to make the generation of gestures more accurate for human–machine interaction but also to aid gesture research in general by providing support for the coding and annotation of gestures.
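The kinematic side of such automatic approaches can be illustrated with a toy example: given tracked wrist positions, candidate movement segments can be found by thresholding speed, a common first step in kinematic gesture detection. The sampling rate, threshold, and trajectory below are arbitrary illustration values and do not come from any of the cited systems.

```python
def movement_segments(positions, fps=25.0, threshold=0.15):
    """Return (start_frame, end_frame) spans where wrist speed exceeds threshold.

    positions: list of (x, y) wrist coordinates, one per video frame.
    threshold: minimum speed (coordinate units per second) counted as movement.
    """
    # Frame-to-frame speed: Euclidean displacement scaled by the frame rate
    speeds = [
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 * fps
        for (x1, y1), (x2, y2) in zip(positions, positions[1:])
    ]
    segments, start = [], None
    for i, s in enumerate(speeds):
        if s > threshold and start is None:
            start = i          # movement begins
        elif s <= threshold and start is not None:
            segments.append((start, i))  # movement ends
            start = None
    if start is not None:
        segments.append((start, len(speeds)))
    return segments

# Toy trajectory: rest, a quick rightward movement, then a hold
rest = [(0.0, 0.0)] * 5
move = [(0.02 * i, 0.0) for i in range(1, 6)]
hold = [(0.1, 0.0)] * 5
print(movement_segments(rest + move + hold))  # → [(4, 9)]
```

Real systems add smoothing, per-joint thresholds, and classifiers on top of such segments, but the principle of reducing video to quantified kinematic features is the same.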
4 Summary and Conclusion
The previous sections have shown that there are many different ways of approaching the coding and annotation of gestures, and that a theory-neutral analysis of gestures is not possible. Rather, theoretical assumptions influence the topics, aspects, and levels of analysis, and as such, make themselves visible in coding and annotation systems. Starting out with a discussion of the difference between coding and annotation, the chapter considered the various methodological and theoretical aims and challenges in gesture coding and annotation. In doing so, it reviewed existing systems and practices and reflected on the interrelation between subject, research question, and coding and annotation system. Focusing on exemplary research areas (language use, language development, cognition, interaction, and human–machine interaction), the chapter illustrated the different systems and practices through studies and systems that demonstrate the link between theory and method. It was pointed out that four main theoretical assumptions about the nature of gesture–speech relations lead to particular methodological and practical implications in the coding and annotation process, and thus influence the potential outcome of verbo-gestural analyses: (1) the relation between the verbal and gestural modalities, (2) the facets and specifics of coding and annotation, (3) the qualitative versus quantitative perspective, and (4) the procedure. As a result, differences between systems addressing the same research topic (e.g. language) as well as differences across research topics (e.g. language vs. interaction) became visible, highlighting the close interrelation between theory and method.
The diversity of the systems and practices could be considered a hindrance to investigating gesture–speech relations, because no uniform standard exists. At the same time, however, the spectrum of approaches and perspectives allows for a variety of descriptions that would otherwise be lost; a single uniform standard might even prevent insights into particular aspects of how speech and gestures work together in expressing meaning multimodally. Researchers thus have to be aware of the advantages and challenges that come with not having a unified theory of coding and annotating gestures, while at the same time welcoming the opportunities that the absence of a single, uniform system brings for investigating their particular research question.
Currently, as was briefly discussed in Section 3.5, a trend toward automated methods as an assistance to optimize effort and accuracy of manual coding can be noted. It remains to be seen what influence these procedures will have in the long run, both on the description of gestures with regard to the various research topics discussed in this chapter, and on the uniformity and standards of gesture coding and annotation systems.