In this chapter we focus on the last two stages of the data mining process and examine techniques for modeling and evaluation. In many ways, these steps form the core of a mining task, where automatic or semi-automatic analysis of streaming data is used to extract insights and actionable models. This process employs algorithms designed for different purposes, such as identifying similar groups of data, spotting unusual data, or uncovering previously unknown associations among data items.
This chapter starts with a description of a methodology for offline modeling, where the model for a dataset is initially learned, and online evaluation, where this model is used to analyze the new data being processed by an application (Section 11.2). Although this methodology relies on models learned offline from previously stored training data, it is frequently used in SPAs because it can leverage many of the existing data mining algorithms devised for analyzing datasets stored in databases and data warehouses.
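To make this methodology concrete, the following sketch learns a model offline and then scores tuples as they stream in. The classifier choice, the `stream` iterable, and all names are illustrative assumptions, not details taken from the book.

```python
# Minimal sketch: learn a model offline, then score streaming tuples online.
# Assumes scikit-learn is available; the classifier choice is illustrative.
from sklearn.tree import DecisionTreeClassifier

def offline_modeling(training_features, training_labels):
    """Learn a model from previously stored training data."""
    model = DecisionTreeClassifier()
    model.fit(training_features, training_labels)
    return model

def online_evaluation(model, stream):
    """Apply the offline-learned model to each incoming tuple."""
    for features in stream:          # 'stream' is any iterable of feature vectors
        yield model.predict([features])[0]

# Usage: train once, then evaluate continuously as tuples arrive.
# model = offline_modeling(X_train, y_train)
# for label in online_evaluation(model, incoming_tuples):
#     act_on(label)
```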
Offline modeling is often sufficient for the analytical goals of many SPAs. Nevertheless, the use of online modeling and evaluation techniques allows a SPA to function autonomically and to evolve as a result of changes in the workload and in the data. Needless to say, this is the goal envisioned by proponents of stream processing. Thus, in the rest of the chapter, we examine in detail online techniques for modeling and mining data streams.
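By contrast, an online technique folds every arriving tuple into the model itself, so the model evolves with the stream. The exponentially weighted anomaly detector below is one hedged illustration of this idea; it is not an algorithm singled out by this chapter, and all names are invented.

```python
# Minimal sketch of online modeling: the model (a running average and
# deviation estimate) is updated with every tuple, so it adapts as the
# data distribution drifts. Stdlib only; names are illustrative.

class OnlineAnomalyModel:
    """Flags values far from an exponentially weighted running average."""
    def __init__(self, alpha=0.05, k=3.0, warmup=10):
        self.alpha = alpha    # learning rate of the exponential update
        self.k = k            # deviation multiplier for flagging outliers
        self.warmup = warmup  # tuples to observe before flagging anything
        self.count = 0
        self.mean = 0.0
        self.dev = 0.0

    def update_and_score(self, x):
        """Fold one observation into the model; return True if anomalous."""
        self.count += 1
        if self.count == 1:
            self.mean = x
            return False
        flagged = (self.count > self.warmup and
                   abs(x - self.mean) > self.k * self.dev)
        # Online update: constant memory, no stored history, so the model
        # keeps adapting as the workload and the data change.
        self.dev = (1 - self.alpha) * self.dev + self.alpha * abs(x - self.mean)
        self.mean = (1 - self.alpha) * self.mean + self.alpha * x
        return flagged
```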
In this chapter we study data flow programming, including flow composition and flow manipulation. Flow composition focuses on techniques used for creating the topology associated with the flow graph for an application, while flow manipulation covers the use of operators to perform transformations on these flows.
We start by introducing different forms of flow composition in Section 4.2. In Section 4.3, we discuss flow manipulation and the properties of stream processing operators, including their internal state, selectivity, and arity, as well as parameterization, output assignments and functions, punctuations, and windowing configuration used to perform such manipulation.
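Several of these operator properties can be seen together in a small example. The sliding-window average below is a hypothetical operator sketched for illustration, not an operator from a specific SPS; it shows internal state (the window buffer), an arity of one input and one output, a window-size parameter, and outputs that begin only once the window has filled.

```python
# Minimal sketch of a stream operator with internal state (the window),
# one input and one output, and a size parameter. Stdlib only.
from collections import deque

def sliding_average(stream, window_size=4):
    """Consume an input stream; emit the average over a sliding window."""
    window = deque(maxlen=window_size)   # the operator's internal state
    for value in stream:
        window.append(value)
        if len(window) == window_size:   # emit only once the window is full
            yield sum(window) / window_size

# Usage:
# for avg in sliding_average(iter([1, 2, 3, 4, 5, 6])):
#     print(avg)
```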
Flow composition
Flow composition patterns fall into three main categories: static, dynamic, and nested composition.
Static composition is used to create the parts of the application topology that are known at development time. For instance, consider an application that has a source operator consuming data from a specific Internet source, for example, a Twitter [1] feed. Let's assume that the stream generated by the source is to be processed by a specific operator that analyzes this data, for example, a sentiment analysis operator [2] that probes the messages for positive or negative tone. In this case, the connection between the source operator and the analysis operator is known at development time and thus can be explicitly created by connecting the output port of the source operator to the input port of the analysis operator.
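In code, static composition amounts to wiring the ports together directly in the program text. The sketch below mimics this with Python generators; the Twitter source and the sentiment analysis operator are simplified stand-ins for real operators, and the word lists are invented for illustration.

```python
# Minimal sketch of static composition: the connection from the source
# operator to the analysis operator is fixed in the program text.
# Both operators are illustrative stand-ins; stdlib only.

def twitter_source():
    """Source operator: stands in for a live Twitter feed."""
    for msg in ["love this product", "terrible service", "great support"]:
        yield msg

def sentiment_operator(stream):
    """Analysis operator: probes each message for positive or negative tone."""
    positive, negative = {"love", "great"}, {"terrible", "awful"}
    for msg in stream:
        words = set(msg.split())
        tone = ("positive" if words & positive
                else "negative" if words & negative
                else "neutral")
        yield msg, tone

# Static composition: the output port of the source feeds the input port
# of the analysis operator, known and fixed at development time.
for message, tone in sentiment_operator(twitter_source()):
    print(tone, "->", message)
```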
After discussing how SPAs are developed, visualized, and debugged in Chapters 3 to 5, this chapter will focus primarily on describing the architectural underpinnings of a conceptual SPS application runtime environment.
Shifting the discussion to the middleware that supports stream processing provides an opportunity to discuss numerous aspects that affect how an application runs. These aspects include the support for resource management, distributed computing, security, and fault tolerance, as well as system management services and system-provided application services for logging and monitoring, built-in visualization, debugging, and state introspection.
This chapter is organized as follows. Section 7.2 presents the conceptual building blocks associated with a stream processing runtime: the computational environment and the entities that use it. The second half of this chapter, Sections 7.3 and 7.4, focuses on the multiple services that make up a SPS middleware and describes how they are integrated to provide a seamless execution environment to SPAs.
Architectural building blocks
Middleware is the term used for a software layer that provides services to applications beyond what is commonly made available by an Operating System (OS), such as user and resource management, scheduling, and I/O services. In general, middleware provides an improved environment in which applications execute. This environment, referred to as the application runtime environment, further isolates the application from the underlying computational resources.
Therefore, the fundamental role of any middleware is to supply additional infrastructure and services.
Continuous operation and data analysis are two of the distinguishing features of the stream processing paradigm. Arguably, the way in which SPAs employ analytics is what makes them invaluable to many businesses and scientific organizations.
In earlier chapters, we discussed how to architect and build a SPA to perform its analytical task efficiently. Yet we have not addressed the algorithmic issues surrounding the implementation of the analytical task itself.
Now that the stage is set and we have the knowledge and tools for engineering a high-performance SPA, we switch the focus to studying how existing stream processing and mining algorithms work, and how new ones can be designed. In the next two chapters, we will examine techniques drawn from data mining, machine learning, statistics, and other fields and show how they have been adapted and evolved to perform in the context of stream processing.
This chapter is organized as follows. The next two sections provide a conceptual introduction to the mining process in terms of its five broad steps: data acquisition, pre-processing, transformation, modeling, and evaluation (Section 10.2), followed by a description of the mathematical notation used when discussing specific algorithms (Section 10.3).
Since many issues associated with data acquisition were discussed in Section 9.2.1, in this chapter, we focus on the second and third steps, specifically on the techniques for data pre-processing and transformation.
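As one concrete instance of streaming pre-processing, the sketch below standardizes each arriving value using running statistics maintained with Welford's online algorithm, so no history has to be stored. Standardization is an illustrative choice here, not a step prescribed by the chapter.

```python
# Minimal sketch of streaming pre-processing: standardize each value
# using a running mean and variance (Welford's online algorithm).
# Stdlib only; names are illustrative.
import math

def standardize(stream):
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count            # update the running mean
        m2 += delta * (x - mean)         # update the sum of squared deviations
        std = math.sqrt(m2 / count) if count > 1 else 0.0
        yield (x - mean) / std if std > 0 else 0.0

# Usage:
# for z in standardize(iter([10.0, 12.0, 9.0, 11.0])):
#     print(z)
```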
In this chapter, we examine visualization and debugging, as well as the relationship between these services and the infrastructure provided by a SPS. Visualization and debugging tools help developers and analysts inspect and understand the current state of an application and the data flow between its components, thus mitigating the cognitive and software engineering challenges associated with developing, optimizing, deploying, and managing SPAs, particularly large-scale distributed ones.
On the one hand, visualization techniques are important at development time, where the ability to picture the application layout and its live data flows can aid in refining its design.
On the other hand, debugging techniques and tools, which are sometimes integrated with visualization tools, are important because the continuous and critical nature of some SPAs requires the ability to effectively diagnose and address problems before and after they reach a production stage, where disruptions can have serious consequences.
This chapter starts with a discussion of software visualization techniques for SPAs (Section 6.2), including the mechanisms to produce effective visual representations of an application's data flow graph topology, its performance metrics, and its live status.
Debugging is intimately related to visualization. Hence, the second half of this chapter focuses on the different types of debugging tasks used in stream processing (Section 6.3).
Visualization
Comprehensive visualization infrastructure is a fundamental tool to support the development, understanding, debugging, and optimization of SPAs.
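One simple way to picture an application's flow graph topology is to emit it in Graphviz DOT form, which generic tools such as `dot` can render. The sketch below assumes nothing about a particular SPS's tooling; the example topology is hypothetical.

```python
# Minimal sketch: render a data flow graph topology as Graphviz DOT text.
# Stdlib only; the example topology is invented for illustration.

def flow_graph_to_dot(edges):
    """edges: iterable of (upstream_operator, downstream_operator) pairs."""
    lines = ["digraph flow {", "  rankdir=LR;"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# Usage:
# print(flow_graph_to_dot([("TwitterSource", "Sentiment"),
#                          ("Sentiment", "Sink")]))
```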
The world has become information-driven, with many facets of business and government being fully automated, their systems instrumented and interconnected. On the one hand, private and public organizations have been investing heavily in deploying sensors and the infrastructure to collect readings from them on a continuous basis. On the other hand, the need to monitor and act on information from sensors in the field, to drive rapid decisions, to adjust production processes, to refine logistics choices, and, ultimately, to better monitor and manage physical systems, is now fundamental to many organizations.
The emergence of stream processing was driven by increasingly stringent data management, processing, and analysis needs from business and scientific applications, coupled with the confluence of two major technological and scientific shifts: first, the advances in software and hardware technologies for database, data management, and distributed systems, and, second, the advances in supporting techniques in signal processing, statistics, data mining, and in optimization theory.
In Section 1.2, we will look more deeply into the data processing requirements that led to the design of stream processing systems and applications. In Section 1.3, we will trace the roots of the theoretical and engineering underpinnings that enabled these applications, as well as the middleware supporting them. While providing this historical perspective, we will illustrate how stream processing uses and extends these fundamental building blocks.
Stream processing has emerged from the confluence of advances in data management, parallel and distributed computing, signal processing, statistics, data mining, and optimization theory.
Stream processing is an intuitive computing paradigm where data is consumed as it is generated, computation is performed at wire speed, and results are immediately produced, all within a continuous cycle. The rise of this computing paradigm was the result of the need to support a new class of applications. These analytic-centric applications are focused on extracting intelligence from large quantities of continuously generated data, to provide faster, online, and real-time results. These applications span multiple domains, including environment and infrastructure monitoring, manufacturing, finance, healthcare, telecommunications, physical and cyber security, and, finally, large-scale scientific and experimental research.
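The continuous consume-compute-produce cycle can be stated in a few lines. The generator pipeline below is a toy illustration of the paradigm under invented names; it is not code from the book.

```python
# Toy illustration of the continuous cycle: each reading is consumed as
# it is generated, processed immediately, and the result emitted at once.
import itertools
import random

def sensor():                                # data is generated continuously
    while True:
        yield random.gauss(20.0, 2.0)

def over_threshold(readings, limit=23.0):    # computation happens in-flight
    for r in readings:
        if r > limit:
            yield r                          # results are produced immediately

# Consume a few alerts from the (conceptually unbounded) stream.
for alert in itertools.islice(over_threshold(sensor()), 3):
    print(f"alert: {alert:.1f}")
```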
In this book, we have discussed the emergence of stream processing and the three pillars that sustain it: the programming paradigm, the software infrastructure, and the analytics, which together enable the development of large-scale high-performance SPAs.
In this chapter, we start with a quick recap of the book (Section 13.1), then look at the existing challenges and open problems in stream processing (Section 13.2), and end with a discussion on how this technology may evolve in the coming years (Section 13.3).
Book summary
In the two introductory chapters of the book (Chapters 1 and 2), we traced the origins of stream processing, provided an overview of its technical fundamentals, and described the technological landscape in the area of continuous data processing.