As we have seen in the previous chapters, a variety of sensors can measure and record data in various modalities. But how does a stream of sensory data become multimedia? We have already discussed the multi part; in this chapter, we take a first look at a very important concept that combines sensory data into a palatable, presentable format, thereby making it a medium: the concept of a document.
The concept of a document has been used for centuries as a device or mechanism to communicate information. Because the technology to store and distribute information keeps evolving, the nature of the document has evolved along with it to take advantage of new media technology. At one time, a document was a physical artifact, such as a book, that mostly contained text as the source of information. This popular notion of the book as a document has changed over the past few decades: the document has become a (virtual) container of information and experiences built from sensory data in multiple forms. In this modern incarnation, a document is not limited to one medium but can use different media as needed to communicate information most effectively to a recipient. This means, however, that textual and sensory output must be combined in various ways to communicate different aspects and perspectives of information related to an event or an object. This chapter deals with the properties of applications and systems that do exactly that.
The saying that prediction is very difficult, especially about the future, is often attributed to Albert Einstein, and often also to Niels Bohr. Regardless of who said it, we believe it is true. It is also good news, though, because it tells us two things: first, prediction is hard, not impossible; and second, we predict many things in our lives, not only what is popularly identified as “the future.” Here is an example: when somebody throws a ball at you, you move your hand toward catching it, and you often succeed. Martial artists train their brains to identify opponents’ subtle body cues so they can dodge a punch; without predicting the punch, dodging would be impossible. Reproducing this anticipation capability of the brain, based on sensory input, remains one of the unsolved problems in multimedia computing.
Of course, at the end of our book we want to give a broader outlook than a very short-term prediction over the next couple of milliseconds. So that this is not just a wild guess, we introduce a framework proposed in the literature that tries to extrapolate from the past, based on the assumption that history repeats in cycles and that the cycle frequency is determined by reputation gains and losses surrounding a technology.
A multimedia system, including a human, understands the domain of discussion by correlating and combining partial information from multiple media to obtain complete information. In most situations, without correlating the information from multiple data streams, one cannot extract information about the real world. Even in systems built for direct human consumption, all correlated information must be presented, or rendered, for humans to extract the information they need from the multimedia data. It is well known that humans combine information from multiple senses and apply extensive knowledge from various sources to form models of the real world in which they live. Humans also use different media when communicating their experiences and knowledge of the world, taking into account the familiarity and knowledge of the recipient. Most human experiences are multimedia and are easily captured, stored, and communicated using multimedia.
We have already discussed the nature of sensory data and the emerging nature of documents, which are increasingly inherently multimedia. Here we discuss the nature of individual sensor streams with the goal of converting them into a multimedia stream. We also begin discussing how a computing system should view multimedia in order to develop systems that are not just a simple combination of media, but are inherently multimedia, such that they are much more than the sum of their component media elements.
One of the biggest research questions in multimedia is how to make the computer (partly) understand the content of an image, audio, or video file. The subfield of multimedia computing that explores this is called multimedia content analysis. Multimedia content analysis tries to infer information from multimedia data, that is, from images, audio, or video, with the goal of enabling interesting applications.
In contrast to fields like computer vision or natural language processing, which adopt a restricted view focusing on only one data type, multimedia content analysis takes a top-down viewpoint, studying how computer systems should be designed to process the different types of information in a document in an integrated fashion, including metadata (for example, geo-tags) and the context in which the content is presented (for example, the social graph of an author). Many examples have been discussed in various chapters of this book. The top-down approach of multimedia computing typically tackles problems with data whose scale and diversity challenge the current theory and practice of computer science. Therefore, progress is measured using empirical methods, often in global competitions and evaluations that use large volumes of data.
Entropy-based compression, as presented in the previous chapter, is an important foundation of many data formats for multimedia. However, as already pointed out, it often does not achieve the compression ratios required for the transmission or storage of multimedia data in many applications. Because compression beyond the entropy limit is not possible without losing information, that is exactly what we have to do: lose information.
Fortunately, unlike texts or computer programs, where a single lost bit can render the rest of the data useless, a flipped pixel or a missing sample in an audio file is hardly noticeable. Lossy compression leverages the fact that the quality of multimedia data can be gracefully degraded by losing progressively more information. This results in a very useful quality/cost trade-off: one might lose no perceivable information while the cost (transmission time, memory space, etc.) remains high; with a little information loss, the cost decreases, and this can be continued to a point where almost no information is left and the perceptual quality is very bad. Lossless compression can usually compress multimedia by ratios of about 1.3:1 to 2:1. Lossy compression can reach ratios of several hundred to one (in the case of video compression). This is exploited on every DVD or Blu-ray Disc, in digital TV, and on Web sites that host consumer-produced videos. Without lossy compression, media consumption as we observe it today would not exist.
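To make the quality/cost trade-off concrete, here is a minimal sketch of a toy lossy coder, not any standard codec: it quantizes 16-bit audio samples down to a chosen bit depth and then applies a lossless back end (zlib stands in for a proper entropy coder; the test signal and the bit depths are illustrative assumptions).

```python
import zlib

import numpy as np

def lossy_compress(samples: np.ndarray, bits: int) -> bytes:
    """Toy lossy coder: quantize 16-bit PCM samples to `bits` bits,
    then compress the result losslessly (zlib as a stand-in)."""
    shift = 16 - bits
    quantized = (samples.astype(np.int32) >> shift).astype(np.int16)
    return zlib.compress(quantized.tobytes())

def decompress(data: bytes, bits: int) -> np.ndarray:
    """Undo the lossless stage and rescale. The discarded low-order
    bits are gone for good -- that is the 'lossy' part."""
    shift = 16 - bits
    quantized = np.frombuffer(zlib.decompress(data), dtype=np.int16)
    return (quantized.astype(np.int32) << shift).astype(np.int16)

# Fewer bits per sample -> smaller output but lower fidelity.
rng = np.random.default_rng(0)
tone = 8000 * np.sin(np.arange(48000) / 20)
signal = (tone + rng.normal(0, 50, 48000)).astype(np.int16)
for bits in (16, 12, 8, 4):
    print(f"{bits:2d} bits/sample -> {len(lossy_compress(signal, bits))} bytes")
```

Dropping low-order bits makes the data more repetitive, so the lossless stage shrinks it further; because the dropped bits are irrecoverable, quality degrades gracefully as the byte count falls.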
In the previous chapters, we looked at low-level methods for multimedia content analysis and audio/visual processing. This chapter explores some of the bigger building blocks and applications of content analysis. Many of these single-media content analysis systems can be used alone, be modified for multimedia content analysis, or be integrated into a bigger multimedia content analysis system. This chapter is designed as a short overview of how typical visual, speech, and music analysis systems work, describing at a high level which signal processing and machine learning techniques are typically used. Note that all the systems presented here only approximate a solution; to achieve high accuracy, they still require a significant amount of engineering. To go beyond a certain accuracy, more research is needed, which might redefine how typical systems work in the future (thus potentially making our descriptions obsolete). However, as outlined in the previous chapter, it is possible to take error into account at a higher level and integrate several less accurate systems into one more accurate system, especially when multiple media are taken into account; a small sketch of this idea follows below.
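As a minimal illustration of that last point (our own sketch, not a system from this book), several independent, moderately accurate classifiers can be fused by a simple majority vote into a noticeably more accurate ensemble:

```python
import numpy as np

def majority_vote(predictions: np.ndarray) -> np.ndarray:
    """Fuse binary decisions from several classifiers (one per row)."""
    return (predictions.mean(axis=0) > 0.5).astype(int)

# Simulate five independent classifiers, each correct only 70% of the time.
rng = np.random.default_rng(1)
truth = rng.integers(0, 2, size=10_000)
classifiers = np.array([
    np.where(rng.random(truth.size) < 0.7, truth, 1 - truth)
    for _ in range(5)
])

fused = majority_vote(classifiers)
print("single classifier accuracy:", (classifiers[0] == truth).mean())
print("fused accuracy:", (fused == truth).mean())  # ~0.84
```

Five independent 70 percent classifiers vote correctly whenever at least three agree with the truth, which works out to about 84 percent accuracy; real systems rarely err independently, so practical gains are smaller, but the principle carries over to fusing analyzers of different media.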
Consumer-produced videos, images, and text posts are the fastest-growing type of content on the Internet. At the time of the writing of this book, YouTube.com alone claims that seventy-two hours of video are uploaded to its Web site every minute. At this rate, the amount of consumer-produced video will grow by about 37 million hours in one year. Of course, YouTube.com is merely one site among many where videos may be uploaded. Social media posts provide a wealth of information about the world: they include entertainment, instructions, personal records, and documentation of various aspects of life in general. As a collection, social media posts represent a compendium of information that goes beyond what any individual recording captures. They provide information on trends, evidence of phenomena or events, social context, and societal dynamics. As a result, they are useful for qualitative and quantitative empirical research on a larger scale than has ever been possible. To make this data accessible, we need to analyze the content of the posts automatically and make them findable. Therefore, Multimedia Information Retrieval (MIR) has rapidly emerged as the most important technology needed to answer many questions people face in different aspects of their regular activities. The next chapters focus on multimedia organization and analysis, mostly from the retrieval perspective. In this chapter, we present basic concepts and techniques related to accessing multimedia data, along with the challenges and algorithms that dominate the field. We start with structured data in databases, discuss information retrieval as a means of accessing information in text, and then present techniques developed and being explored in MIR.
Light is one of the most basic phenomena in the universe. The first words in the Bible are, “Let there be light!” A large part of the human brain is dedicated to translating the light reflected off objects into our eyes into an image of our surroundings. As discussed in Chapter 2, many human innovations have evolved around capturing and storing that image, mostly for communication purposes: first came the Stone Age cave painters; then the painters and sculptors of the Middle Ages and the Renaissance; then photography, film, and digital storage of movies and photographs. Most recently, a computer science discipline has evolved around computer-based interpretation of images, called computer vision. Recent years have brought rapid progress in the use of photography and movies through the popularity of digital cameras in cell phones. Many people now carry a device for capturing and sharing visual information and use it on a daily basis.
In this chapter, we introduce the basic properties of light and discuss how it is stored and reproduced. We examine basic image processing and introductory computer vision techniques in later chapters.
A major difference between multimedia data and most other data is its size. Images and audio files take much more space than text, for example. Video data is currently the single largest network bandwidth and hard disk space consumer. Compression was, therefore, among the first issues researchers in the emerging multimedia field sought to address. In fact, multimedia’s history is closely connected to different compression algorithms because they served as enabling technologies for many applications. Even today, multimedia signal processing would not be possible without compression methods. A Blu-ray disc can currently store 50 Gbytes, but a ninety-minute movie in 1,080p HDTV format takes about 800 Gbytes (without audio). So how does it fit on the disc? The answer to many such problems is compression.
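The 800 Gbyte figure is easy to verify with back-of-the-envelope arithmetic; the sketch below assumes 24 frames per second and 3 bytes per uncompressed pixel, which are our assumptions rather than numbers from the text:

```python
# Raw size of a ninety-minute movie in 1,080p (no audio).
width, height = 1920, 1080      # 1,080p frame dimensions
bytes_per_pixel = 3             # 8-bit R, G, and B; no chroma subsampling
fps = 24                        # typical film frame rate (assumption)
seconds = 90 * 60               # ninety minutes

frame_bytes = width * height * bytes_per_pixel   # ~6.2 MB per frame
total_bytes = frame_bytes * fps * seconds
print(f"{total_bytes / 1e9:.0f} GB uncompressed")                 # ~806 GB
print(f"needed compression ratio: {total_bytes / 50e9:.0f}:1")    # ~16:1
```

A roughly 16:1 reduction is thus needed before audio is even considered, well beyond what lossless techniques alone can deliver.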
This chapter discusses the underlying mathematical principles of compression algorithms, from the basics to advanced techniques. All the techniques outlined in this chapter belong to the family of lossless compression techniques; that is, the original data can be reconstructed bit by bit. Lossless compression techniques are applicable to all kinds of data, including non-multimedia data, but they are not equally effective on every type of data. Therefore, subsequent chapters will introduce lossy compression techniques that are usually tailored to a specific type of data, for example, image or sound files.
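As a taste of the underlying mathematics (a sketch of our own, not an algorithm from the chapter), the Shannon entropy of a byte stream is a lower bound on the average number of bits per byte that any lossless coder can achieve on data with that symbol distribution:

```python
from collections import Counter
from math import log2

def shannon_entropy(data: bytes) -> float:
    """Zeroth-order entropy in bits per byte: ignoring inter-symbol
    dependencies, no lossless coder can beat this bound on average."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * log2(c / total) for c in counts.values())

text = b"abracadabra abracadabra abracadabra"
print(f"{shannon_entropy(text):.2f} bits/byte vs. 8 bits/byte raw")  # ~2.24
```

The gap between the entropy and the raw 8 bits per byte is exactly the headroom that coders such as Huffman or arithmetic coding exploit.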
As discussed in the previous chapter, multimedia is closely related to how humans experience the world. In this chapter, we first introduce the role of different sensory signals in human perception, both for understanding and functioning in various environments and for communicating and sharing experiences. A very important lesson for multimedia technologists is that each sense provides only partial information about the world. One sense alone, even the very powerful sense of vision, is not enough to understand the world. Data from different sensors must be combined with each other and with prior knowledge to understand the world – and even then we obtain only a partial model of it. Multimedia computing and communication is therefore fundamentally about combining information from multiple sources in the context of the problem being solved. This is what distinguishes multimedia from several other disciplines, including computer vision and audio processing, where the focus is on analyzing one medium to extract as much information as possible from it.
In multimedia systems, different types of data streams exist simultaneously, and the system must process them not as separate streams, but as one correlated set of streams representing the information and knowledge of interest for solving a problem. The challenge for a multimedia system is to discover the correlations that exist in this set of multimedia data and to combine partial information from disparate sources into holistic information in a given context; the sketch below illustrates the idea in its simplest form.
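As a toy illustration (our own sketch, not a method from the chapter), consider two independent noisy estimates of the same quantity, say a speaker's position seen by a camera and heard by a microphone array; weighting each estimate inversely to its variance yields a fused estimate more reliable than either source alone:

```python
import numpy as np

def fuse(estimates, variances):
    """Inverse-variance weighted fusion of independent noisy estimates."""
    weights = 1.0 / np.asarray(variances, dtype=float)
    fused = float((weights * np.asarray(estimates)).sum() / weights.sum())
    return fused, float(1.0 / weights.sum())  # fused value and its variance

# Hypothetical readings: a camera and a microphone array both estimate
# the position (in meters) of a speaker in a smart meeting room.
camera_est, camera_var = 2.3, 0.5   # accurate but not perfect
mic_est, mic_var = 3.1, 2.0         # coarse acoustic localization
position, variance = fuse([camera_est, mic_est], [camera_var, mic_var])
print(f"fused position {position:.2f} m, variance {variance:.2f}")
```

The fused variance (0.4) is lower than that of the better sensor alone (0.5): combining partial information yields a result that is more than the sum of its parts, which is the essence of multimedia processing.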
Clearly everybody knows the word “multimedia,” yet when people think of it, they usually think of different things. For some people, multimedia equals entertainment. For other people, multimedia equals Web design. For many computer scientists, multimedia often means video in a computing environment. All these are narrow perspectives on multimedia. For example, visual information definitely dominates human activities because of the powerful visual machinery that we are equipped with. In the end, however, humans use all five senses effectively, opportunistically, and judiciously. Therefore, multimedia computing should utilize signals from multifarious sensors and present to users only the relevant information in the appropriate sensory modality.
This book takes an integrative systems approach to multimedia. Integrated multimedia systems receive input from different sensory and symbolic sources in different forms and representations. Users ideally access this information in experiential environments. Early work dealt with individual media more effectively than with integrated media, focusing on efficient techniques for each separate medium, for example, MPEG video compression. During the past few years, issues that span multiple media have received more central attention. Many researchers now recognize that most of the difficult semantic issues become easier to solve when considering integrated multimedia rather than separate individual media.
So far, we have mostly described ideal and typical environments. In this chapter, we discuss some issues that designers need to consider when building multimedia systems. We call this chapter “The Human Factor” because it deals with effects that can be observed when multimedia systems are exposed to human beings. Of course, ultimately, all computer systems are made to be used by us human beings.
Principles of User Interface Design
Most of today’s applications, especially ones that support multimedia in any way, use a graphical user interface (GUI), that is, an interface that is controlled through clicks, touches, and/or gestures and that allows for the display of arbitrary image and video data. Knowing how to design GUI-based applications in a user-friendly manner is therefore an important skill for everybody working in multimedia computing. Unfortunately, given the many factors that go into the behavior of a program and the perceptual requirements of the user, there is no unique path or definitive set of guidelines to follow. Here is an example: is it better to have the menu bar inside the window of an application, or to have one menu bar that always stays in the same place and changes with the active application? As the reader probably knows, this is one fundamental difference between Apple’s and Microsoft’s operating systems – and it is hard to say that one or the other is right or wrong. However, some standards have evolved over many years, based on research results and feedback from many users. These standards can be seen in many places today: in desktop environments, smartphones, DVD players, and other devices.
As discussed in the previous chapter, hearing and vision are the two most important sensory inputs that humans have. Many parallels exist between visual and acoustic signal processing, but sound has unique properties, often complementary to those of visual signals. In fact, this is why nature gave animals both visual and acoustic sensors: to gather complementary and correlated information about the happenings in an environment. Many species use sound to detect danger, navigate, locate prey, and communicate. Virtually all physical phenomena – fire, rain, wind, surf, earthquakes, and so on – produce unique sounds. Species of all kinds living on land, in the air, or in the sea have developed organs to produce specific sounds. In humans, these have evolved to produce singing and speech.
In this chapter, we introduce the basic properties of sound, sound production, and sound perception. More details of audio and audio processing are covered later in this book.
A multimedia computing system is designed to facilitate natural communication among people, that is, communication on the basis of perceptually encoded data. Such a system may be used synchronously or asynchronously for remote communication, using remote presence, or for facilitating better communication in the same environment. These interactions could also allow users to communicate with people across different time periods or to share knowledge gleaned over a long period. Video conferencing, video on demand, augmented reality, immersive video, or immersive telepresence systems represent different stages of technology enhancing natural communication environments. The basic goal of a multimedia system is to communicate information and experiences to other humans. Because humans sense the world using their five senses and communicate their experiences using these senses and their lingual abstractions of the world, a multimedia system should use the same senses and abstractions in communications.
Multimedia systems combine communication and computing. In such systems, the notions of computing and communication become so intertwined that any effort to distinguish them is a difficult and largely meaningless exercise. In this chapter, we discuss the basic components of a multimedia system. Where appropriate, differences from a traditional computing system are pointed out explicitly, along with the associated challenges.
While the compression techniques presented so far have assumed generic acoustic or visual content, this chapter presents lossy compression techniques designed especially for a particular type of acoustic data: human speech. Almost every human being on earth talks virtually every day – needless to say, there is a lot of captured digital speech content. Every movie or TV show contains an audio track, which usually consists largely of spoken language. The most important use of captured speech, however, is for communication, such as in cell phones, voice-over-IP applications, or as part of video conferencing and meeting recordings. Most of the compression concepts discussed so far also work on speech. The algorithms presented in this chapter were developed to achieve higher compression ratios while preserving higher perceptual quality by exploiting speech-specific properties of the audio signal. We discussed human speech in Chapter 5; this chapter digs directly into the algorithms, building on that knowledge.
Properties of a Speech Coder
As explained in Chapter 5, the properties of every sound are defined by the properties of the object that creates the sound, by the environment that the sound waves travel through, and by the characteristics of the receiver and/or capturing device. The object that creates human speech is the vocal tract. Vocal tracts also exist in animals, such as birds or cats, but as we all know, the sounds they produce differ substantially from human speech, so a compression algorithm for birdsong or a cat's meow would likewise look substantially different. The algorithms that follow all try to exploit the characteristics of speech and have very limited applicability to music or other nonspeech audio. Nevertheless, all of them matter to multimedia computing because millions of people use them in everyday life; the linear-prediction sketch below gives a flavor of how they work.
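To give a flavor of how such coders exploit the structure of the vocal tract, here is a minimal sketch of linear prediction, the core idea behind many classic speech codecs; the synthetic AR(2) test frame and all parameters are our own illustrative assumptions, not part of any standard:

```python
import numpy as np

def lpc_coefficients(frame: np.ndarray, order: int) -> np.ndarray:
    """Estimate linear-prediction coefficients for one frame by solving
    the autocorrelation (Yule-Walker) normal equations."""
    n = len(frame)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

# Synthetic 'voiced' frame: an AR(2) process standing in for real speech.
rng = np.random.default_rng(2)
excitation = rng.normal(size=160)      # one 20 ms frame at 8 kHz
frame = np.zeros(160)
for i in range(2, 160):
    frame[i] = 1.5 * frame[i - 1] - 0.7 * frame[i - 2] + 0.1 * excitation[i]

a = lpc_coefficients(frame, order=2)
predicted = a[0] * frame[1:-1] + a[1] * frame[:-2]   # predict x[n] from its past
residual = frame[2:] - predicted
print("recovered coefficients:", a.round(2))         # ~[1.5, -0.7]
print("residual / frame energy:", residual.var() / frame.var())
```

Instead of transmitting 160 raw samples per 20 ms frame, a speech coder transmits a handful of prediction coefficients plus a compact description of the low-energy residual, which is where its gain over generic audio coders comes from.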