One of the largest and most important parts of the original AMI project was the collection of a multimodal corpus that could be used to underpin the project research. The AMI Meeting Corpus contains 100 hours of synchronized recordings collected using specially instrumented meeting rooms. As well as the base recordings, the corpus has been transcribed orthographically, and large portions of it have been annotated for everything from named entities, dialogue acts, and summaries to simple gaze and head movement behaviors. The AMIDA Corpus adds around 10 hours of recordings in which one person uses desktop videoconferencing to participate from a separate, “remote” location.
Many researchers think of these corpora simply as providing the training and test material for speech recognition or for one of the many language, video, or multimodal behaviors that they have been used to model. However, providing material for machine learning was only one of our concerns. In designing the corpus, we wished to ensure that the data was coherent, realistic, useful for some actual end applications of commercial importance, and equipped with high-quality annotations. That is, we set out to provide a data resource that might bias the research towards the basic technologies that would result in useful software components. In addition, we set out to create a resource that would be used not just by computationally oriented researchers, but by other disciplines as well. For instance, corpus linguists need naturalistic data for studying many different aspects of human communication.
Automatic summarization has traditionally concerned itself with textual documents. Research on that topic began in the late 1950s with the automatic generation of summaries for technical papers and magazine articles (Luhn, 1958). About 30 years later, the field expanded into speech-based summarization, working on dialogues (Kameyama et al., 1996; Reithinger et al., 2000; Alexandersson, 2003) and multi-party interactions (Zechner, 2001b; Murray et al., 2005a; Kleinbauer et al., 2007).
Spärck Jones (1993, 1999) argues that the summarizing process can be described as consisting of three steps (Figure 10.1): interpretation (I), transformation (T), generation (G). The interpretation step analyzes the source, i.e., the input that is to be summarized, and derives from it a representation on which the next step, transformation, operates. In the transformation step the source content is condensed to the most relevant points. The final generation step verbalizes the transformation result into a summary document. This model is a high-level view of the summarization process that abstracts away from the details a concrete implementation has to face.
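To make the three-step model concrete, the following minimal Python skeleton composes interpretation, transformation, and generation; the function bodies are placeholders for illustration only, not an implementation from this chapter.

```python
def summarize(source_text):
    """Spärck Jones's I-T-G view of summarization, sketched as a pipeline."""
    representation = interpret(source_text)   # I: analyze the source
    condensed = transform(representation)     # T: condense to the most relevant points
    return generate(condensed)                # G: verbalize the result as a summary

def interpret(text):
    # Placeholder: split the source into sentences.
    return [s.strip() for s in text.split(".") if s.strip()]

def transform(sentences):
    # Placeholder: keep only the first two sentences as the "most relevant" content.
    return sentences[:2]

def generate(selected):
    # Placeholder: concatenate the selected material into a summary document.
    return ". ".join(selected) + "."
```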
We distinguish between two general methods for generating an automatic summary: extractive and abstractive. The extractive approach generates a summary by identifying the most salient parts of the source and concatenating these parts to form the actual summary. For generic summarization, these salient parts are sentences that together convey the gist of the document's content. In that sense, extractive summarization becomes a binary decision process for every sentence from the source: should it be part of the summary or not? The selected sentences together then constitute the extractive summary, with some optional post-processing, such as sentence compression, as the final step.
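As an illustration of this binary selection view, the sketch below scores each sentence by simple word frequency and keeps the top-scoring ones in source order; the scoring scheme and function names are generic illustrations, not the method developed in the AMI work.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=3):
    """Select the highest-scoring sentences and return them in source order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Binary decision per sentence: in the summary or not.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```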
The analysis of conversational dynamics in small groups, like the one illustrated in Figure 9.1, is a fundamental area in social psychology and nonverbal communication (Goodwin, 1981; Clark and Carlson, 1982). Conversational patterns exist at multiple time scales, ranging from knowing how and when to address or interrupt somebody, to gaining or holding the floor of a conversation, to making transitions in discussions. Most of these mechanisms are multimodal, involving multiple verbal and nonverbal cues for their display and interpretation (Knapp and Hall, 2006), and have an important effect on how people are socially perceived, e.g., whether they are dominant, competent, or extraverted (Knapp and Hall, 2006; Pentland, 2008).
This chapter introduces some of the basic problems related to the automatic understanding of conversational group dynamics. Using low-level cues produced by audio, visual, and audio-visual perceptual processing components like the ones discussed in previous chapters, we present techniques that aim at answering questions like: Who are the people being addressed or looked at? Are the people involved attentive? What conversational state is a group conversation currently in? Is a particular person likely to be perceived as dominant based on how they interact? As shown later in the book, answers to these questions are very useful for inferring, through further analysis, even higher-level aspects of a group conversation and its participants.
The chapter is organized as follows. Section 9.2 provides the basic definitions of three conversational phenomena discussed in this chapter: attention, turn-taking, and addressing.
Money has been spent: about 20 million euros over six years through the European AMI and AMIDA projects, complemented by a number of satellite projects and national initiatives, including the large IM2 Swiss NSF National Center of Competence in Research. This book has provided a unique opportunity to review this research, and we conclude by attempting to make a fair assessment of what has been achieved compared to the initial vision and goals.
Our vision was to develop multimodal signal processing technologies to capture, analyze, understand, and enhance human interactions. Although we had the overall goal of modeling communicative interactions in general, we focused our efforts on enhancing the value of multimodal meeting recordings and on the development of real-time tools to enhance human interaction in meetings. We pursued these goals through the development of smart meeting rooms and new tools for computer-supported cooperative work and communication, and through the design of new ways to search and browse meetings.
The dominant multimodal research paradigm in the late 1990s was centered on the design of multimodal human-computer interfaces. In the AMI and AMIDA projects we switched the focus to multimodal interactions between people, partly as a way to develop more natural communicative interfaces for human-computer interaction. As discussed in Chapter 1, and similar to what was done in some other projects around the same time, our main idea was to put the computer within the human interaction loop (as explicitly referred to by the EU CHIL project), where computers are primarily used as a mediator to enhance human communication and collaborative potential.
This approach raised a number of major research challenges, while also offering application opportunities. Human communication is one of the most complex processes we know, characterized by fast, highly sensitive multimodal processing in which information is received and analyzed from multiple simultaneous inputs in real time, with little apparent effort.
Customarily, hypothesis testing is about the problem of finding out whether some effect of interest, like the curing power of a drug, is supported by observed data, or whether it is a random “fluke” rather than a real effect of the drug. To have a formal way to test this, a certain probability hypothesis is set up to represent the effect. This can be done by selecting a function of the data, s(x^n), called a test statistic, the distribution of which can be calculated under the hypothesis. Frequently, parametric probability distributions have a special parameter value to represent randomness, like 1/2 in Bernoulli models for binary data and 0 for the mean of normal distributions, and the no-effect case can be represented by a single model such as f(s(x^n); θ_0), called the null hypothesis. The test statistic is, or should be, in effect an estimator of the parameter, and if the data cause the test statistic to fall in the tail, into a so-called critical region of small probability, say 0.05, the null hypothesis is rejected, indicating by the double negation that the real effect is not ruled out. Fisher called the amount of evidence for the rejection of the null hypothesis the statistical significance at the selected level 0.05, which is frequently misused to mean that there is 95 percent statistical significance to accept the real effect.
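As a toy illustration of the procedure (not taken from the text), the sketch below tests the null hypothesis θ_0 = 1/2 for binary data, using the count of ones as the test statistic and rejecting when the exact two-sided p-value falls below 0.05; the sample size and observed count are made up.

```python
from math import comb

def binomial_two_sided_p(k, n, theta0=0.5):
    """Probability, under the null, of an outcome at least as extreme as k."""
    pmf = [comb(n, i) * theta0**i * (1 - theta0)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    return sum(p for p in pmf if p <= observed + 1e-12)

# Hypothetical example: 38 ones observed in 50 binary trials.
k, n = 38, 50
p_value = binomial_two_sided_p(k, n)
reject = p_value < 0.05   # test statistic falls in the critical region
print(f"p-value = {p_value:.5f}, reject null hypothesis: {reject}")
```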
The word “information” has several meanings, the simplest of which has been formalized for communication by Hartley [17] as the logarithm of the number of elements in a finite set. Hence the information in the set A = {a, b, c, d} is log 4 = 2 (bits) in the base 2 logarithm. Only two types of logarithm are used in this book: the base 2 logarithm, which we write simply as log, and the natural logarithm, written as ln. The amount of information in the set B = {a, b, c, d, e, f, g, h} is three bits, and so on. Hence such a formalization has nothing to do with any other meaning of the information communicated, such as its utility or quality. If we were asked to describe one element, say c in the second set, we could do it by saying that it is the third element, which could be done with the binary number 11 in both sets, but the element f in the second set would require three bits, namely 110. So we see that if the number of elements in a set A is not a power of 2, we need either the maximum of ⌈log |A|⌉ bits or one less, as will be explained in the next section. Hence, we start seeing that “information,” relative to a set, could be formalized and measured by the shortest code length with which any element in the set could be described.
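A small numerical sketch of this idea follows: the Hartley information of a set and the number of whole bits needed to index any of its elements (the 5-element case is added here to show a set whose size is not a power of 2).

```python
from math import ceil, log2

def hartley_bits(num_elements):
    """Hartley information of a finite set and the fixed-length index cost in bits."""
    return log2(num_elements), ceil(log2(num_elements))

for size in (4, 8, 5):
    exact, fixed = hartley_bits(size)
    print(f"|A| = {size}: log|A| = {exact:.2f} bits, fixed-length index uses {fixed} bits")
```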
A primary consideration when designing a room or system for meeting data capture is of course how best to capture audio of the conversation. Technology systems requiring voice input have traditionally relied on close-talking microphones for signal acquisition, as they naturally provide a higher signal-to-noise ratio (SNR) than single distant microphones. This mode of acquisition may be acceptable for applications such as dictation and single-user telephony; however, as technology heads towards more pervasive applications, less constraining solutions are required to capture natural spoken interactions.
In the context of group interactions in meeting rooms, microphone arrays (or more generally, multiple distant microphones) present an important alternative to close-talking microphones. By enabling spatial filtering of the sound field, arrays allow for location-based speech enhancement, as well as automatic localization and tracking of speakers. The primary benefit of this is to enable non-intrusive hands-free operation: that is, users are not constrained to wear headset or lapel microphones, nor do they need to speak directly into a particular fixed microphone.
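To illustrate the spatial-filtering idea, here is a minimal delay-and-sum beamformer sketch. It assumes synchronized channels, a known source position, free-field propagation, and integer-sample delays; the function name and arguments are illustrative, and this is not the processing pipeline used in the project.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_position, fs, c=343.0):
    """Align and average array channels toward an assumed source position.

    signals:         (num_mics, num_samples) synchronized recordings
    mic_positions:   (num_mics, 3) microphone coordinates in meters
    source_position: (3,) assumed source location in meters
    fs:              sampling rate in Hz; c: speed of sound in m/s
    """
    dists = np.linalg.norm(mic_positions - np.asarray(source_position), axis=1)
    # Arrival delays relative to the closest microphone, rounded to whole samples.
    delays = np.round((dists - dists.min()) / c * fs).astype(int)
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        shifted = np.roll(signals[m], -delays[m])   # advance the later-arriving channels
        if delays[m] > 0:
            shifted[-delays[m]:] = 0.0              # discard samples wrapped by the roll
        out += shifted
    return out / num_mics
```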
Beyond serving as a front-end voice enhancement method for automatic speech recognition, the audio localization capability of arrays offers an important cue that can be exploited in systems for speaker diarization, joint audio-visual person tracking, and analysis of conversational dynamics, as well as in user interface elements for browsing meeting recordings.
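As an example of how such a localization cue can be derived, the sketch below estimates the time difference of arrival between two channels using the widely used GCC-PHAT weighting; it is an illustrative implementation under simple assumptions, not code from the AMI systems.

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, fs, max_tau=None):
    """Estimate the time difference of arrival between two microphone channels
    with GCC-PHAT; positive values mean sig_a lags sig_b."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12                # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift     # lag in samples
    return shift / fs                             # lag in seconds
```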
For all these reasons, microphone arrays have become an important enabling technology in academic and commercial research projects studying multimodal signal processing of human interactions over the past decade (three examples of products are shown in Figure 3.1).
Although the material in this appendix is not used directly in the rest of the book, it has a profound importance in understanding the very essence of randomness and complexity, which are fundamental to probabilities and statistics. Algorithmic information or complexity theory is founded on the theory of recursive functions, the roots of which go back to the logicians Gödel, Kleene, Church, and above all to Turing, who described a universal computer, the Turing machine, whose computing capability is no less than that of the latest supercomputers. The field of recursiveness has grown into an extensive branch of mathematics, of which we give just the very basics, which are relevant to statistics. For a comprehensive treatment we refer the reader to [26].
The field is somewhat peculiar in that the derivations and proofs of the basic results can be performed in two very different manners. First, since the partial recursive functions can be axiomatized, as is common in other branches of mathematics, the proofs are similar. But since the same set is also defined in terms of the computer, the Turing machine, a proof has to be a program. Now, the programs as binary strings of that machine are very primitive, like the machine language programs in modern computers, and they tend to be long and both hard to read and hard to understand. To shorten them and make them more comprehensible, an appeal is often made to intuition, and the details are left to the reader to fill in.
Now that we know how to estimate real-valued parameters optimally, the question arises of how confident we can be in the estimated result, which requires infinite-precision numbers. It is clear that if we repeat the estimation on a new set of data, generated by the same physical machinery, the result will not be the same. It seems that the model class is too rich for the amount of data we have. After all, if we fit Bernoulli models to a binary string of length n, there cannot be more than 2^n properties in the data that we can learn, even if no two strings have a common property. And yet the model class has a continuum of parameter values, each representing a property.
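A quick numerical illustration of this point (not from the text): repeating the maximum likelihood estimation of a Bernoulli parameter on fresh data from the same source gives a different answer each time; the parameter value and sample size below are made up.

```python
import random

random.seed(1)
theta_true, n = 0.3, 100   # hypothetical data-generating machinery

def ml_estimate(sample):
    """Maximum likelihood estimate of the Bernoulli parameter: the sample mean."""
    return sum(sample) / len(sample)

for trial in range(5):
    data = [1 if random.random() < theta_true else 0 for _ in range(n)]
    print(f"trial {trial}: estimate = {ml_estimate(data):.3f}")
```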
One way to balance the learnable information in the data and the model class is to restrict the precision of the estimated parameters. However, we should not take any fixed precision, for example each parameter quantized to two decimals, because that does not take into account the fact that the sensitivity of models with respect to changes in the parameters depends on the parameters themselves. The problem is related to statistical robustness, which, while perfectly meaningful, is based on practical considerations such as a model's sensitivity to outliers rather than on any reasonably comprehensive theory. If we stick to the model classes of interest in this book, the parameter precision amounts to interval estimation.
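For the Bernoulli case, a standard normal-approximation interval gives a feel for how the meaningful precision of the estimate shrinks as the amount of data grows; this is a generic statistical illustration, not the interval estimation treatment developed in this book.

```python
from math import sqrt

def bernoulli_interval(k, n, z=1.96):
    """Normal-approximation interval for the Bernoulli parameter (95% by default)."""
    theta_hat = k / n
    half_width = z * sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half_width, theta_hat + half_width

# Hypothetical samples with the same estimate 0.30 but increasing length n.
for n, k in [(20, 6), (200, 60), (2000, 600)]:
    low, high = bernoulli_interval(k, n)
    print(f"n = {n:4d}: estimate 0.300, interval ({low:.3f}, {high:.3f})")
```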
This chapter describes approaches used for video processing, in particular, for face and gesture detection and recognition. The role of video processing, as described in this chapter, is to extract all the information necessary for higher-level algorithms from the raw video data. The target high-level algorithms include tasks such as video indexing, knowledge extraction, and human activity detection. The main focus of video processing in the context of meetings is to extract information about the presence, location, motion, and activities of humans, along with gaze and facial expressions, to enable higher-level processing to understand the semantics of the meetings.
Object and face detection
The object and face detection methods used in this chapter include pre-processing through skin color detection, object detection through visual similarity using machine learning and classification, gaze detection, and facial expression detection.
Skin color detection
For skin color detection, color segmentation is usually used to detect pixels with a color similar to the color of skin (Hradiš and Juranek, 2006). The segmentation is done in several steps. First, an image is converted from color into gray scale using a skin color model, so that each pixel value corresponds to a skin color likelihood. The gray scale image is binarized by thresholding. The binary image is then filtered by a sequence of morphological operations to remove noise. Finally, the components of the binary image can be labeled and processed in order to recognize the type of each object.
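The sketch below follows these steps with OpenCV; a fixed Cr/Cb range stands in for a learned skin color likelihood model, and the function name and threshold values are illustrative assumptions rather than the settings of Hradiš and Juranek (2006).

```python
import cv2
import numpy as np

def skin_color_segmentation(bgr_image):
    """Rough skin segmentation: likelihood map, thresholding, morphology, labeling."""
    # 1. Map each pixel to a skin "likelihood"; here a simple YCrCb range check.
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)    # illustrative bounds
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)           # binarized skin map

    # 2. Morphological opening and closing to suppress noise in the binary image.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # 3. Label connected components so downstream code can classify each blob.
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    return mask, labels, stats[1:], centroids[1:]     # drop the background component
```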
Analyzing the behaviors of people in smart environments using multimodal sensors requires answering a set of typical questions: Who are the people? Where are they? What activities are they doing? When? With whom are they interacting? And how are they interacting? In this view, locating people or their faces and characterizing them (e.g., extracting their body or head orientation) allows us to address the first two questions (who and where), and is usually one of the first steps before applying higher-level multimodal scene analysis algorithms that address the other questions. In the last ten years, tracking algorithms have experienced considerable progress, particularly in indoor environments or for specific applications, where they have reached a maturity allowing their deployment in real systems and applications. Nevertheless, there are still several issues that can make tracking difficult: background clutter and potentially small object size; complex shape, appearance, and motion, and their changes over time or across camera views; inaccurate or rough scene calibration or inconsistent camera calibration between views for 3D tracking; and real-time processing requirements. In what follows, we discuss some important aspects of tracking algorithms and introduce the remaining chapter content.
Scenarios and Set-ups. Scenarios and application needs strongly influence the considered physical environment, and therefore the set-up (where, how many, and what type of sensors are used) and the choice of tracking method. A first set of scenarios commonly involves the tracking of people in so-called smart spaces (Singh et al., 2006).
This informal introduction provides a fresh perspective on isomorphism theory, which is the branch of ergodic theory that explores the conditions under which two measure-preserving systems are essentially equivalent. It contains a primer in basic measure theory, proofs of fundamental ergodic theorems, and material on entropy, martingales, Bernoulli processes, and various varieties of mixing. Original proofs of classic theorems - including the Shannon–McMillan–Breiman theorem, the Krieger finite generator theorem, and the Ornstein isomorphism theorem - are presented by degrees, together with helpful hints that encourage the reader to develop the proofs on their own. Hundreds of exercises and open problems are also included, making this an ideal text for graduate courses. Professionals needing a quick review, or seeking a different perspective on the subject, will also value this book.
This book describes the essential tools and techniques of statistical signal processing. At every stage theoretical ideas are linked to specific applications in communications and signal processing using a range of carefully chosen examples. The book begins with a development of basic probability, random objects, expectation, and second order moment theory followed by a wide variety of examples of the most popular random process models and their basic uses and properties. Specific applications to the analysis of random signals and systems for communicating, estimating, detecting, modulating, and other processing of signals are interspersed throughout the book. Hundreds of homework problems are included and the book is ideal for graduate students of electrical engineering and applied mathematics. It is also a useful reference for researchers in signal processing and communications.
Have you ever wanted to know how modern digital communications systems work? Find out with this step-by-step guide to building a complete digital radio that includes every element of a typical, real-world communication system. Chapter by chapter, you will create a MATLAB realization of the various pieces of the system, exploring the key ideas along the way, as well as analyzing and assessing the performance of each component. Then, in the final chapters, you will discover how all the parts fit together and interact as you build the complete receiver. In addition to coverage of crucial issues, such as timing, carrier recovery and equalization, the text contains over 400 practical exercises, providing invaluable preparation for industry, where wireless communications and software radio are becoming increasingly important. A variety of extra resources are also provided online, including lecture slides and a solutions manual for instructors.