While the meeting setting creates many challenges simply in recognizing the words and who is speaking them, once we have the words there is still much to be done if the goal is to understand the conversation. To do this, we need to understand both the language being used and its structure.
The structure of language is multilayered. At a fine-grained, detailed level, we can look at the structure of the spoken utterances themselves. Dialogue acts, which segment and label the utterances into units with a single core intention, are one type of structure at this level. Another way of understanding language at this level is to focus on the subjective language used to express internal mental states, such as opinions, (dis-)agreement, sentiments, and uncertainty.
At a coarser level, language can be structured by the topic of conversation. Finally, within a given topic, there is a structure to the language used to make decisions. For specific phenomena like decisions, language understanding based on elaborate domain models is sufficiently advanced to capture the content of the conversation, which allows meetings to be indexed and summarized at a very high degree of understanding.
Finally, the language of spoken conversation differs significantly from written language. Frequent types of speech disfluencies can be detected and removed with techniques similar to those used for understanding language structure as described above.
Segmenting multi-party conversations into homogeneous speaker regions is a fundamental step towards automatic understanding of meetings. This information serves multiple purposes: as adaptation data for speaker and speech recognition, as a meta-data extraction tool for navigating meetings, and as input for automatic interaction analysis.
This task is referred to as speaker diarization and aims to infer “who spoke when” in an audio stream, which involves two simultaneous goals: (1) estimating the number of speakers in the stream and (2) associating each speech segment with a speaker.
Diarization algorithms have been developed extensively for broadcast data, characterized by regular speaker turns, prompted speech, and high-quality audio; processing meeting recordings presents different needs and additional challenges. On the one hand, the conversational nature of the speech involves very short turns and large amounts of overlapping speech; on the other hand, the audio is acquired non-intrusively using far-field microphones and is thus corrupted by ambient noise and reverberation. Furthermore, real-time and online processing are often required to enable applications that run while the meeting is actually going on. The next section briefly reviews the state of the art in the field.
State of the art in speaker diarization
Conventional speaker diarization systems are composed of the following steps: a feature extraction module that extracts acoustic features such as mel-frequency cepstral coefficients (MFCCs) from the audio stream; a speech/non-speech detection module that retains only the speech regions, discarding silence; an optional speaker change detection module that divides the input stream into small homogeneous segments, each uttered by a single speaker; and an agglomerative hierarchical clustering step that groups the speech segments of each speaker into a single cluster.
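To make these steps concrete, the following is a minimal sketch of such a pipeline in Python, assuming the librosa and scikit-learn libraries are available; the file name meeting.wav, the energy-based speech detector, the one-second segment length, and the fixed number of speakers are all illustrative assumptions, not part of any system described in this chapter.

    import numpy as np
    import librosa
    from sklearn.cluster import AgglomerativeClustering

    audio, sr = librosa.load("meeting.wav", sr=16000)  # hypothetical recording

    # Speech/non-speech detection: an energy-based stand-in for a real
    # speech activity detector; returns (start, end) sample intervals.
    speech_regions = librosa.effects.split(audio, top_db=30)

    # Feature extraction over uniform segments: each one-second segment is
    # summarized by its mean MFCC vector (a crude stand-in for an explicit
    # speaker change detection module).
    seg = sr  # one-second segments (assumption)
    times, feats = [], []
    for start, end in speech_regions:
        for s in range(int(start), int(end) - seg + 1, seg):
            mfcc = librosa.feature.mfcc(y=audio[s:s + seg], sr=sr, n_mfcc=19)
            times.append((s / sr, (s + seg) / sr))
            feats.append(mfcc.mean(axis=1))

    # Agglomerative hierarchical clustering groups the segments by speaker.
    # The speaker count is fixed here; real systems estimate it, e.g. with
    # a BIC-based stopping criterion.
    labels = AgglomerativeClustering(n_clusters=4).fit_predict(np.array(feats))
    for (t0, t1), spk in zip(times, labels):
        print(f"{t0:7.1f}-{t1:6.1f} s  speaker {spk}")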
The basic modeling problem begins with a set of observed data y^n = {y_t : t = 1, 2, …, n}, generated by some physical machinery, where the elements y_t may be of any kind. Since, no matter what they are, they can be encoded as numbers, we take them as such, i.e. natural numbers, with or without the order, if the data come from finite or countable sets, and real numbers otherwise. Often each number y_t is observed together with others x_{1,t}, x_{2,t}, …, called explanatory data, written collectively as a K × n matrix X = {x_{i,j}}, and the data then are written as y^n | X. It is convenient to use the terminology “variables” for the source of these data. Hence, we say that the data {y_t} come from the variable Y, and the explanatory data are generated by variables X_1, X_2, and so on.
In physics the explanatory data often determine the data y^n of interest, and such a determination is called a “law,” but not so in statistical problems. By taking sufficiently many explanatory data we may also fit a function to the given set of observed data, but this is not a “law,” since if the same machinery were to generate additional data y_{n+1}, x_{1,n+1}, x_{2,n+1}, …, the function would not give y_{n+1}. This is the reason the objective is to learn the statistical properties of the data y^n, possibly in the context of the explanatory data.
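As an illustrative sketch of this point (not from the text): with as many adjustable coefficients as observations we can fit y^n exactly, yet the fitted function says nothing about y_{n+1}. The synthetic data below are an assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    x = np.arange(n + 1, dtype=float)
    y = rng.normal(size=n + 1)    # "machinery" with no law: pure noise

    # A degree n-1 polynomial has n coefficients, so it fits the first n
    # observations exactly (up to rounding).
    coeffs = np.polyfit(x[:n], y[:n], deg=n - 1)
    fit_err = np.max(np.abs(np.polyval(coeffs, x[:n]) - y[:n]))
    print("max error on y^n:", fit_err)

    # But the fitted "function" fails on the additional datum y_{n+1}.
    print("predicted y_{n+1}:", np.polyval(coeffs, x[n]), "actual:", y[n])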
Many kinds of information technology can be used to make meetings more productive, some of which are related to what happens before and after meetings, while others are intended to be used during a meeting. Document repositories, presentation software, and even intelligent lighting can all play their part. However, the following discussion of user requirements will be restricted to systems that draw on the multimodal signal processing techniques described in the earlier chapters of this book to capture and analyze meetings. Such systems might help people understand something about a past meeting that has been stored in an archive, or they might aid meeting participants in some way during the meeting itself. For instance, they might help users understand what has been said at a meeting, or even convey an idea of who was present, who spoke, and what the interaction was like. We will refer to all such systems, regardless of their purpose or when they are used, as “meeting support technology.”
This chapter reviews the main methods and studies that elicited and analyzed user needs for meeting support technology in the past decade. The chapter starts by arguing that what is required is an iterative software process that, through interaction between developers and potential users, gradually narrows and refines the sets of requirements for individual applications. Then, it both illustrates the approach and lays out specific user requirements by discussing the major user studies that have been conducted for meeting support technology.
Meetings are a rich resource of information that, in practice, is mostly untouched by any form of information processing. Even now it is rare for meetings to be recorded, and fewer still are annotated for access purposes. Examples of the latter are largely confined to meetings held in parliaments, courts, hospitals, banks, etc., where a record is required for decision tracking or legal obligations. In these cases a labor-intensive manual transcription of the spoken words is produced. Giving much wider access to this rich content is the main aim of the AMI consortium projects, and there are now many examples of interest in such access, as shown by the release of commercial hardware and software services. Especially with the advent of high-quality telephone and videoconferencing systems, the opportunity to record, process, recognize, and categorize the interactions in meetings is recognized even by skeptics of speech and language processing technology.
Of course meetings are by nature an audio-visual experience, and humans make extensive use of visual and other sensory information. Illustrating this rich landscape of information is the purpose of this book, and many applications can be implemented without even looking at the spoken word. However, it is still verbal communication that forms the backbone of most meetings and accounts for the bulk of the information transferred between participants. Hence automatic speech recognition (ASR) is key to accessing the information exchanged and is the most important component required for most higher-level processing.
Meeting support technology evaluation can broadly be divided into three categories, which will be discussed in sequence in this chapter in terms of goals, methods, and outcomes, following a brief introduction on methodology and on undertakings prior to the AMI Consortium (Section 13.1). Evaluation efforts can be technology-centric, focused on determining how specific systems or interfaces performed in the tasks for which they were designed (Section 13.2). Evaluations can also adopt a task-centric view, defining common reference tasks such as fact finding or verification, which directly support cross-comparisons of different systems and interfaces (Section 13.3). Finally, the user-centric approach evaluates meeting support technology in its real context of use, measuring the increase in efficiency and user satisfaction that it brings (Section 13.4).
These aspects of evaluation differ from the component evaluation that accompanies each of the underlying technologies described in Chapters 3 to 10, which is often a black-box evaluation based on reference data and distance metrics (although task-centric approaches have been adopted for summarization evaluation, as shown in Chapter 10). Rather, the evaluation of meeting support technology is a stage in a complex software development process, for which the helix model was proposed in Chapter 11. We look back on this process in the light of these evaluation undertakings, especially for meeting browsers, at the end of this chapter (Section 13.5).
Approaches to evaluation: methods, experiments, campaigns
The evaluation of meeting browsers, as pieces of software, should be related, at least in theory, to a precise view of the specifications they are intended to meet.
This book has two parts: the first summarizes the facts of coding and information theory which are needed to understand the essence of estimation and statistics, and the second describes a new theory of estimation, which also covers a good part of statistics. After all, both estimation and statistics are about extracting information from often chaotic-looking data in order to learn what it is that makes the data behave the way they do. The first part, together with an outline of algorithmic information in Appendix A, is meant for the statistician who wants to understand his or her discipline rather than just learn a bag of tricks, with programs to apply them to various data – tricks that are not based on any theory and do not stand up to critical examination, although some of them can be quite useful and provide solutions for important statistical problems.
The word information has many meanings, two of which have been formalized by Shannon. The first is fundamental in communication: it is simply the number of messages, strings of symbols, either to be stored or to be sent over some communication channel, the practical question being the size of the storage device needed or the time it takes to send them. For instance, there are k^n messages of length n over an alphabet of k symbols, so about n log_2 k bits suffice to identify any one of them. The second meaning is the measure of the strength of a statistical property a string has, which is fundamental in statistics and very different from the meaning in communication.
I have a long-standing interest in estimation, which started with attempts to control industrial processes. It did not take long to realize that the control part is easy if you know the behavior of the process you want to control, which means that the real problem is estimation. When I was asked by the Information Theory Society to give the 2009 Shannon Lecture, I thought of giving a coherent survey of estimation theory. However, during the year given to prepare the talk I found that this was not possible, because there was no coherent theory of estimation. There was a collection of facts and results, but they were isolated, with little to connect them. To my surprise this applied even to the works of some of the greatest names in statistics, such as Fisher, Cramér, and Rao, which I had been familiar with for decades but had never questioned until now, when I was more or less forced to do so. As an example, the famous maximum likelihood estimator due to Fisher [12] has virtually no formal justification. The celebrated Cramér–Rao inequality gives it a non-asymptotic justification only for special models, and for more general parametric models only an asymptotic one. Clearly, no workable theory should be founded on asymptotic behavior. About the value of asymptotics, we quote Keynes' famous quip that “asymptotically we all shall be dead.”
This book is an introduction to multimodal signal processing. In it, we use the goal of building applications that can understand meetings as a way to focus and motivate the processing we describe. Multimodal signal processing takes the outputs of capture devices running at the same time – primarily cameras and microphones, but also electronic whiteboards and pens – and automatically analyzes them to make sense of what is happening in the space being recorded. For instance, these analyses might indicate who spoke, what was said, whether there was an active discussion, and who was dominant in it. These analyses require the capture of multimodal data using a range of signals, followed by a low-level automatic annotation of them, gradually layering up annotation until information that relates to user requirements is extracted.
Multimodal signal processing can be done in real time, that is, fast enough to build applications that influence the group while they are together, or offline – not always but often at higher quality – for later review of what went on. It can also be done for groups that are all together in one space, typically an instrumented meeting room, or for groups that are in different spaces but use technology such as videoconferencing to communicate. The book thus introduces automatic approaches to capturing, processing, and ultimately understanding human interaction in meetings, and describes the state of the art for all technologies involved.
We begin with the description of the most basic and primitive codes. Let A = {a_1, …, a_k} be a finite set: the set is called an alphabet, and its elements are called symbols. The main interest is in the set of finite sequences s^n = a_i a_j … of the symbols, of some length n, called messages or, for us, just data. The problem is to send or store them in a manner that costs the sending device little time and storage space. For practical reasons these devices use the binary symbols 0 and 1, and the original symbols are represented as sequences of binary symbols, such as the eight-bit-long “bytes.” For each symbol a in A, let C : a ↦ C(a) be a one-to-one map, called a code, from the alphabet into the set of binary strings. It is extended to the messages by concatenation: C : a_i a_j … a_n ↦ C(a_i) C(a_j) … C(a_n). Both the binary strings C(a) and C(a_i) C(a_j) … C(a_n) are called codewords.
This will give us an upside-down binary tree, with the root on top, whose nodes are the codewords, first of the symbols for n = 1, then of the length-2 messages, and so on. The left-hand tree in Figure 2.1 illustrates the code for the symbols of the alphabet A = {a, b, c}.
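As an illustration, here is a minimal sketch in Python of such a code and its extension by concatenation for the alphabet A = {a, b, c}; the particular codeword assignment is an assumption chosen for illustration and need not match the one shown in Figure 2.1.

    # Hypothetical one-to-one map C from symbols to binary strings; the
    # assignment below is illustrative, not necessarily that of Figure 2.1.
    C = {"a": "0", "b": "10", "c": "11"}

    def encode(message: str) -> str:
        # Extension of C to messages by concatenation of codewords.
        return "".join(C[sym] for sym in message)

    def decode(codeword: str) -> str:
        # Decoding works here because no codeword is a prefix of another,
        # so scanning left to right recovers the symbols unambiguously.
        inverse = {v: k for k, v in C.items()}
        out, buf = [], ""
        for bit in codeword:
            buf += bit
            if buf in inverse:
                out.append(inverse[buf])
                buf = ""
        return "".join(out)

    print(encode("abca"))     # -> 010110
    print(decode("010110"))   # -> abca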