In this chapter, we revisit the shortest path and minimum-cost matching problems. Both were first introduced in Chapter 1, where we discussed practical example applications. We further showed that these problems can be expressed as IPs. The focus in this chapter will be on solving instances of the shortest path and matching problems. Our starting point will be to use the IP formulation we introduced in Section 1.5. We will show that studying the two problems through the lens of linear programming duality will allow us to design efficient algorithms. We develop this theory further in Chapter 4.
The shortest path problem
Recall the shortest path problem from Section 1.4.1. We are given a graph G = (V, E), nonnegative lengths ce for all edges e ∈ E, and two distinct vertices s, t ∈ V. The length c(P) of a path P is the sum of the lengths of its edges, i.e. Σ(ce : e ∈ P). We wish to find, among all possible st-paths, one of minimum length.
Example 7 In the following figure, we show an instance of this problem. Each of the edges in the graph is labeled by its length. The thick black edges in the graph form an st-path P = sa, ac, cb, bt of total length 3 + 1 + 2 + 1 = 7. This st-path is of minimum length, hence it is a solution to our problem.
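Instances with nonnegative edge lengths, such as the one in Example 7, can be solved by Dijkstra's classical algorithm. The sketch below is a standard textbook version, not the LP-duality-based algorithm developed in this chapter; because the figure is not reproduced here, the edges sb and ct (and their lengths 7 and 5) are hypothetical additions of ours used to complete the instance around the path of Example 7.

```python
import heapq

def dijkstra(graph, s, t):
    """Shortest st-path for nonnegative edge lengths.

    graph: dict mapping each vertex to a list of (neighbor, length) pairs.
    Returns (length, path) or (float('inf'), None) if t is unreachable.
    """
    dist = {s: 0}      # best known distance from s
    prev = {}          # predecessor on a shortest path
    pq = [(0, s)]      # min-heap of (distance, vertex) pairs
    done = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in done:
            continue
        done.add(u)
        if u == t:     # t's distance is now final: reconstruct the path
            path = [t]
            while path[-1] != s:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, []):
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(pq, (dist[v], v))
    return float('inf'), None

# Undirected instance containing the path of Example 7 (sa, ac, cb, bt
# with lengths 3, 1, 2, 1); edges sb and ct are hypothetical.
edges = [('s', 'a', 3), ('a', 'c', 1), ('c', 'b', 2), ('b', 't', 1),
         ('s', 'b', 7), ('c', 't', 5)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))

length, path = dijkstra(graph, 's', 't')   # length 7 via s, a, c, b, t
```

On this instance the algorithm returns the path s, a, c, b, t of length 7, matching Example 7; the competing routes s, b, t (length 8) and s, a, c, t (length 9) are correctly rejected.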
An algorithm is a formal procedure that describes how to solve a problem. For instance, the simplex algorithm in Chapter 2 takes as input a linear program in standard equality form and either returns an optimal solution or detects that the linear program is infeasible or unbounded. Another example is the shortest path algorithm in Section 3.1. It takes as input a graph with distinct vertices s, t and nonnegative integer edge lengths, and returns an st-path of shortest length (if one exists).
The two basic properties we require for an algorithm are: correctness and termination. By correctness, we mean that the algorithm is always accurate when it claims that we have a particular outcome. One way to ensure this is to require that the algorithm provides a certificate, i.e. a proof, to justify its answers. By termination, we mean that the algorithm will stop after a finite number of steps.
In Section A.1, we will define the running time of an algorithm; we will formalize the notions of slow and fast algorithms. Section A.2 reviews the algorithms presented in this book and discusses which ones are fast and which ones are slow. In Sections A.3 and A.4 we discuss the inherent complexity of various classes of optimization problems and discuss the possible existence of classes of problems for which it is unlikely that any fast algorithm exists. We explain how an understanding of computational complexity can guide us in the design of algorithms.
Broadly speaking, optimization is the problem of minimizing or maximizing a function subject to a number of constraints. Optimization problems are ubiquitous. Every chief executive officer (CEO) is faced with the problem of maximizing profit given limited resources. In general, this is too general a problem to be solved exactly; however, many aspects of decision making can be successfully tackled using optimization techniques. This includes, for instance, production, inventory, and machine-scheduling problems. Indeed, the overwhelming majority of Fortune 500 companies make use of optimization techniques. However, optimization problems are not limited to the corporate world. Every time you use your GPS, it solves an optimization problem, namely how to minimize the travel time between two different locations. Your hometown may wish to minimize the number of trucks it requires to pick up garbage by finding the most efficient route for each truck. City planners may need to decide where to build new fire stations in order to efficiently serve their citizens. Other examples include: how to construct a portfolio that maximizes its expected return while limiting volatility; how to build a resilient telecommunication network as cheaply as possible; how to schedule flights in a cost-effective way while meeting the demand for passengers; or how to schedule final exams using as few classrooms as possible.
Suppose that you are a consultant hired by the CEO of the WaterTech company to solve an optimization problem.
The lossy compression techniques presented so far have tried to exploit the fundamental mathematical properties of information (lossless coding), to model and approximate the properties of the signal directly (differential coding), and to model the creation of the signal (source coding, such as in speech compression). We also presented simple perceptual methods, such as the µ-law encoder.
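As a pointer back to the µ-law encoder mentioned above, the following sketch shows the standard µ-law companding curve with µ = 255 (the value used in North American telephony); the function names are our own, not taken from any particular library.

```python
import math

MU = 255  # standard value in North American telephony (ITU-T G.711)

def mu_law_compress(x, mu=MU):
    """Map a sample x in [-1, 1] onto [-1, 1], expanding small amplitudes."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse mapping: recover the original amplitude."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# Quiet samples get a large share of the output range: a signal at 1% of
# full scale is mapped to roughly 23% of full scale before quantization.
y = mu_law_compress(0.01)
```

Because quiet passages occupy more of the output range, a subsequent uniform quantizer spends its levels where the ear is most sensitive, which is what makes this a perceptual method.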
The methods presented in this chapter use transformations modeled after how human sensory perception works, at a much greater level of sophistication. These perceptual coders are so effective that they are used in virtually every device today that handles images or sound, from photo cameras to mobile phones to DVD players to mobile digital music players.
Before we introduce them, we recapitulate two fundamental signal transformations that are an important prerequisite for all the algorithms presented in this chapter, as well as for many of the analysis algorithms presented later: the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT), which are described in the following sections. Other transforms, such as the Discrete Wavelet Transform (DWT), a generalization of the transforms mentioned, are also used in multimedia signal processing, and the references cited at the end of this chapter are well worth looking up.
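As a preview of the following sections, both transforms can be written down directly from their defining sums. The direct O(N²) evaluation below is for illustration only (practical codecs use fast algorithms such as the FFT), and the unscaled DCT-II convention shown is one of several in use.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier Transform: X[k] = sum_n x[n] * e^(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def dct2(x):
    """DCT-II (unscaled): X[k] = sum_n x[n] * cos(pi * (n + 0.5) * k / N)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

# A constant signal has all of its energy in the k = 0 coefficient.
X = dft([1.0, 1.0, 1.0, 1.0])
C = dct2([1.0, 1.0, 1.0, 1.0])
```

The concentration of energy into few coefficients is exactly what perceptual coders exploit: coefficients the ear or eye barely notices can be quantized coarsely or dropped.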
The project to write a textbook on multimedia computing started a few years ago, when the coauthors independently realized the need for a book that addresses the basic concepts related to the increasing volume of multimedia data in different aspects of communications in the computing age.
Digital computing started with processing numbers, but very soon after its start it began dealing with symbols of other kinds and developed computational approaches for dealing with alphanumeric data. Efforts to use computers to deal with audio, visual, and other perceptual information did not become successful enough for practical applications until about the 1980s. Only slowly did computer graphics, audio processing, and visual analysis become feasible. First came the ability to store large volumes of audiovisual data, then to display or render it, then to distribute it, and later to process and analyze it. For that reason, it took until about the 1990s for the term multimedia to grow popular in computing.
While different fields have emerged around acoustic, visual, and natural text content, each specializing in one of these data types, multimedia computing deals with documents holistically, taking into account all media available. Driven by the availability of electronic sensors, multimedia communication currently focuses on visual and audio data, followed by metadata (such as GPS) and touch. Multimedia computing deals with multiple media holistically because the purpose of documents that contain multiple media is to communicate information. Information about almost all real-world events and objects must be captured using multiple sensors, as each sensor captures only one aspect of the information of interest. The challenge for multimedia computing systems is to integrate the different information streams into a coherent view. Humans do this every moment of their lives from birth and are therefore often used as a baseline when building multimedia systems. It is therefore not surprising that, slowly but surely, all computing is becoming multimedia. The use of multimedia data in computing has grown even more rapidly than imagined just a few years ago, with the installation of cameras in cell phones combined with the ability to easily share multimedia documents in social networks.
In the previous chapters, we described many signal processing and content analysis techniques. However, the content of an image or audio file does not alone determine its meaning and impression on the user. In this chapter, we will therefore describe another factor that is very important to consider in multimedia computing: the set of surrounding circumstances in which the content is presented, otherwise known as context. Context is often neglected in academic work, even though it can be leveraged in many ways in multimedia systems and is often so effective that the content analysis approach becomes secondary. So let us first find out what context really is.
Almost two centuries ago, George Berkeley asked: If a tree falls in a forest and no one is around to hear it, does it make a sound? Sound is often defined as the sensation excited in the ear when the air or other medium is set in motion. Thus, if there is no receiving ear, then there is no sound. In other words, perception is not only data – it is a close interaction between the data, transmission medium, and the interpreter. This is shown in Figure 19.1.
The data acquired for an environment
The medium used to transmit physical attributes to the perceiver
The perceiver
Characteristics of each of these must be considered when designing and developing a multimedia system. It is well established, and has been rigorously articulated, that we understand the world based on the sensory data we receive through our sensors and on the knowledge about the world that we have accumulated since birth. Both the data and the knowledge are integral components of understanding.
Only a few inventions in the history of civilization have had the same impact on society in so many ways and at so many levels as computers. Where once we used computers for computing with simple alphanumeric data, we now use them primarily to exchange information, to communicate, and to share experiences. Computers are rapidly evolving as a means for gaining insights and sharing ideas across distance and time.
Multimedia computing started gaining serious attention from researchers and practitioners during the 1990s. Before 1991, people talked about multimedia, but the computing power, storage, bandwidth, and processing algorithms were not advanced enough to deal with audio and video. With the increasing availability and popularity of CDs, people became excited about creating documents that could include not only text, but also images, audio, and even video. That decade saw explosive growth in all aspects of hardware and software technology related to multimedia computing and communication. In the early 1990s, PC manufacturers labeled their high-end units containing advanced graphics "multimedia PCs." That trend disappeared a few years later because every new computer became a multimedia computer.
Information about the environment is always obtained through sensors. To understand the handling of perceptual information, we must first start with an understanding of the types and properties of sensors and the nature of data they produce.
Types of Sensors
In general, a sensor is a device that measures a physical quantity and converts it into a signal that a human or a machine can use. Whether the sensor is human-made or found in nature does not matter. Sensors for sound and light have been the most important for multimedia computing during the past decades because audio and video are best for communicating information for the tasks humans typically perform with or without a computer. That is, most people prefer to communicate through sound, while light serves illustrative purposes, supplementing the need for language-based description of a state of the world. New or different tasks might use different sensors, however. For example, in real-world dating (as opposed to current implementations of online dating), communication occurs on many other levels, such as scent, touch, and taste (e.g., when kissing). New artificial sensors are therefore being invented even as you read this chapter. Let's start with a rough taxonomy of current sensors interesting to multimedia computing.
In Chapters 5 and 6, we described sound and light and their physical properties. In this chapter, we will discuss basic signal processing operations that are common initial steps of many algorithms for audio and video enhancement and content analysis.
Sampling and Quantization
As we explained in Chapter 4, a continuous function must be converted to a discrete form for representation and processing using a digital computer. The interface between the optical system that projects a scene onto the image plane and the computer must sample the image at a finite number of points and represent each sample within the finite word size of the computer. Likewise, the sound card samples the microphone output into a stream of numbers. In other words, in both cases, the matter to work with when doing computational audio and video processing is a stream of numbers that are representative of the signal at certain spatial or temporal points.
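A minimal sketch of these two steps follows, under conventions of our own: samples are normalized to [-1, 1], the quantizer is uniform, and the 5 Hz sine, the 100 Hz sampling rate, and the 8-bit depth are illustrative choices rather than values from the text.

```python
import math

def sample(f, duration, rate):
    """Evaluate a continuous signal f(t) at `rate` evenly spaced points/second."""
    return [f(i / rate) for i in range(int(duration * rate))]

def quantize(samples, bits):
    """Uniform quantizer: map each sample in [-1, 1] to one of 2**bits codes."""
    levels = 2 ** bits
    step = 2.0 / levels
    return [min(int((s + 1.0) / step), levels - 1) for s in samples]

def dequantize(codes, bits):
    """Reconstruct each code as the midpoint of its quantization cell."""
    step = 2.0 / 2 ** bits
    return [c * step - 1.0 + step / 2 for c in codes]

# A 5 Hz sine sampled at 100 Hz for one second, stored with 8 bits/sample.
signal = sample(lambda t: math.sin(2 * math.pi * 5 * t), 1.0, 100)
restored = dequantize(quantize(signal, 8), 8)
# Midpoint reconstruction keeps the error within half a quantization step.
```

The resulting stream of 100 integer codes is exactly the "stream of numbers" the text describes: each code pins the signal's value at one temporal point to within half a quantization step.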
As we have seen in the previous chapters, a variety of sensors can measure and record data in various modalities. However, how does a stream of sensory data become multimedia? We have discussed the multi part before. In this chapter, we will take a first look at a very important concept that combines sensory data into a palatable and presentable format, making it a medium. We are talking about the concept of a document.
The concept of a document has been used over centuries as a device or mechanism to communicate information. Because the technology to store and distribute information has been evolving and changing, the nature and concept of a document has been evolving to take advantage of new media technology. At one time, a document took the form of a physical embodiment, such as a book, that mostly contained text as the source of information. This popular notion of a book as a document has gone through changes over the past few decades and has now transformed the notion of a document into a (virtual) container of information and experiences using sensory data in multiple forms. In this modern reincarnation, the document is not limited to one medium, but can use different media as needed to communicate the information most effectively to a recipient. This means, however, that textual and sensory output needs to be combined in various ways to communicate different aspects and perspectives of information related to an event or an object. This chapter deals with the properties of applications and systems that do exactly that.
This quote is often attributed to Albert Einstein, and often also to Niels Bohr. Regardless of who said it, we believe it's true. But it also contains good news. It tells us two things: first, prediction is hard, not impossible; and second, we predict many things in our lives, not only what is popularly identified as "the future." Here is an example: when somebody throws a ball at you, you move your hand toward catching it, and you often succeed. Martial artists train their brains to identify opponents' subtle body cues to dodge a punch; without predicting the punch, dodging would be impossible. This anticipation capability of the brain, based on sensory input, remains one of the unsolved topics in multimedia computing.
Of course, at the end of our book we want to give a broader overview than a prediction over just the next couple of milliseconds. For that to be more than a wild guess, we introduce a framework proposed in the literature that tries to extrapolate from the past, based on the assumption that history repeats in cycles and that the cycle frequency is determined by reputation gains and losses around a technology.
A multimedia system, including a human, understands the domain of discussion by correlating and combining partial information from multiple media to get complete information. In most situations, without correlating the information from multiple data streams, one cannot extract information about the real world. Even in those systems where multimedia is for humans’ direct consumption, all correlated information must be presented, or rendered, for humans to extract information that they need from the multimedia data. It is well known that humans combine information from multiple sensors and apply extensive knowledge from various sources to form the models of the real world in which they live. Humans also use different media, assuming the familiarity and knowledge of a recipient, in communicating their experiences and knowledge of the world. Most human experiences are multimedia and are easily captured, stored, and communicated using multimedia.
We already discussed the nature of sensory data and the emerging nature of documents, which are increasingly inherently multimedia. Here we discuss the nature of individual sensor streams with the goal of converting them into a multimedia stream. We also begin the discussion of how a computing system should view multimedia in order to develop systems that are not just a simple combination of media, but are inherently multimedia, such that they are much more than the sum of their component media elements.
One of the biggest research questions in multimedia is how to make the computer (partly) understand the content of an image, audio, or video file. The field inside multimedia computing exploring this is called multimedia content analysis. Multimedia content analysis tries to infer information from multimedia data, that is, from images, audio, or video, with the goal of enabling interesting applications.
In contrast to fields like computer vision or natural language processing that adopt a restricted view focusing on only one data type, multimedia content analysis adopts a top-down viewpoint studying how computer systems should be designed to process the different types of information in a document in an integrated fashion, including metadata (for example geo-tags) and the context in which the content is presented (for example the social graph of an author). Many examples have been discussed in various chapters of this book. The top-down approach of multimedia computing mostly succeeds in solving problems with data whose scale and diversity challenge the current theory and practice of computer science. Therefore, progress is measured using empirical methods, often in global competitions and evaluations that use large volumes of data.
Entropy-based compression as presented in the previous chapter is an important foundation of many data formats for multimedia. However, as already pointed out, it often does not achieve the compression rates required for the transmission or storage of multimedia data in many applications. Because compression beyond entropy is not possible without losing information, that is exactly what we have to do: lose information.
Fortunately, unlike texts or computer programs, where a single lost bit can render the rest of the data useless, a flipped pixel or a missing sample in an audio file is hardly noticeable. Lossy compression leverages the fact that multimedia data can be gracefully degraded in quality by progressively discarding more information. This results in a very useful quality/cost trade-off: one might not lose any perceivable information while the cost (transmission time, memory space, etc.) remains high; with a little information loss, the cost decreases, and this can be continued to a point where almost no information is left and the perceptual quality is very bad. Lossless compression usually compresses multimedia by a factor of about 1.3:1 to 2:1. Lossy compression can reach ratios of several hundred to one (in the case of video compression). This is leveraged on any DVD or Blu-ray, in digital TV, and on Web sites that host consumer-produced videos. Without lossy compression, media consumption as we observe it today would not exist.
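The quality/cost trade-off can be made concrete by quantizing the same signal at two bit depths and measuring the entropy (a lower bound on lossless coding cost, in bits per symbol) of the resulting symbol streams. The sine test signal and the bit depths below are illustrative choices of ours.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy of the stream in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def quantize(samples, bits):
    """Uniform quantizer for samples in [-1, 1] (2**bits levels)."""
    levels = 2 ** bits
    step = 2.0 / levels
    return [min(int((s + 1.0) / step), levels - 1) for s in samples]

# The same 1000-sample sine, stored coarsely and finely.
signal = [math.sin(2 * math.pi * 5 * i / 1000) for i in range(1000)]
coarse = quantize(signal, 3)    # heavy loss: at most 3 bits/sample to code
fine = quantize(signal, 10)     # nearly transparent, but costlier to code
```

Coding the coarse stream can never take more than 3 bits per sample, while the fine stream requires substantially more; the discarded precision is exactly the price paid for the lower rate.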
In the previous chapters, we looked at low-level methods for multimedia content analysis and audio/visual processing. This chapter explores some of the bigger building blocks and applications of content analysis. Many of these single-media content analysis systems can be used alone, be modified for multimedia content analysis, or integrated within a bigger multimedia content analysis system. This chapter is designed as a short overview of how typical visual, speech, and music analysis systems work by describing on a high level which signal processing and machine learning techniques are typically used. Note that all of the systems presented here will only approximate a solution and to achieve high accuracies, they will still require a significant amount of engineering. To go beyond a certain accuracy, more research is needed, which might redefine how typical systems work in the future (thus potentially making our descriptions obsolete). However, as outlined in the previous chapter, it is possible to take error into account on a higher level and integrate several less accurate systems into one more accurate system, especially when multiple media are taken into account.
Consumer-produced videos, images, and text posts are the fastest-growing type of content on the Internet. At the time of the writing of this book, YouTube.com alone claims that seventy-two hours of video are uploaded to its Web site every minute. At this rate, the amount of consumer-produced videos will grow by 37 million hours in one year. Of course, YouTube.com is merely one site among many where videos may be uploaded. Social media posts provide a wealth of information about the world. They consist of entertainment, instructions, personal records, and various aspects of life in general. As a collection, social media posts represent a compendium of information that goes beyond what any individual recording captures. They provide information on trends, evidence of phenomena or events, social context, and societal dynamics. As a result, they are useful for qualitative and quantitative empirical research on a larger scale than has ever been possible. To make this data accessible, we need to automatically analyze the content of the posts and make them findable. Therefore, Multimedia Information Retrieval (MIR) has rapidly emerged as the most important technology needed to answer many questions people face in different aspects of their regular activities. The next chapters will focus on multimedia organization and analysis, mostly from retrieval aspects. We begin by defining multimedia retrieval and the set of challenges and algorithms that dominate the field. In this chapter, we will present basic concepts and techniques related to accessing multimedia data. We will start with the structured data in databases, discuss information retrieval to deal with accessing information in text, and then present techniques developed and being explored in MIR.