This paper reports on an investigation into the role of shuffling and concatenation in the theory of graph drawing. A simple syntactic description of these and related operations is proved to be complete in the context of finite partial orders, and as general as possible. An explanation based on this result is given for a previously investigated collapse of the permutohedron into the associahedron, and for collapses into other less familiar polyhedra, including the cyclohedron. Such polyhedra have been considered recently in connection with the notion of tubing, which is closely related to tree-like finite partial orders, which are defined simply here and investigated in detail. Like the associahedron, some of these other polyhedra are involved in categorial coherence questions, which will be treated elsewhere.
We describe a framework to support the implementation of web-based systems intended to manipulate data stored in relational databases. Since the conceptual model of a relational database is often specified as an entity-relationship (ER) model, we propose to use the ER model to generate a complete implementation in the declarative programming language Curry. This implementation contains operations to create and manipulate entities of the data model, and it supports authentication, authorization, session handling, and the composition of individual operations into user processes. Furthermore, the implementation ensures the consistency of the database w.r.t. the data dependencies specified in the ER model, i.e., updates initiated by the user cannot lead to an inconsistent state of the database. In order to generate a high-level declarative implementation that can be easily adapted to individual customer requirements, the framework exploits previous work on declarative database programming and web user interface construction in Curry.
We introduce homotopical methods based on rewriting on higher-dimensional categories to prove coherence results in categories with an algebraic structure. We express the coherence problem for (symmetric) monoidal categories as an asphericity problem for a track category and use rewriting methods on polygraphs to solve it. The setting is extended to more general coherence problems, viewed as 3-dimensional word problems in a track category, including the case of braided monoidal categories.
With sensors becoming ubiquitous, there is an increasing interest in mining the data from these sensors as the data are being collected. This analysis of streaming data, or data streams, presents new challenges to analysis algorithms. The size of the data can be massive, especially when the sensors number in the thousands and the data are sampled at a high frequency. The data can be non-stationary, with statistics that vary over time. Real-time analysis is often required, either to avoid untoward incidents or to understand an interesting phenomenon better. These factors make the analysis of streaming data, whether from sensors or other sources, very data- and compute-intensive. One possible approach to making this analysis tractable is to identify the important data streams and focus the analysis on them. This chapter describes the different ways in which this can be done, given that what makes a stream important varies from problem to problem and can often change with time within a single problem. The following sections illustrate these techniques by applying them to data from a real problem and discuss the challenges faced in this emerging field of streaming data analysis.
This chapter is organized as follows: first, I define what is meant by streaming data and use examples from practical problems to discuss the challenges in the analysis of these data. Next, I describe the two main approaches used to handle the streaming nature of the data – the sliding window approach and the forgetting factor approach.
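To make the contrast concrete, here is a minimal Python sketch of the two approaches applied to a running mean. It is not taken from the chapter; the window length and decay rate are illustrative assumptions.

```python
from collections import deque

def sliding_window_mean(stream, window=100):
    """Sliding window: only the last `window` observations count."""
    buf = deque(maxlen=window)            # old items fall off automatically
    for x in stream:
        buf.append(x)
        yield sum(buf) / len(buf)         # O(window) per item; fine for a sketch

def forgetting_factor_mean(stream, lam=0.99):
    """Forgetting factor: exponentially down-weight old observations (0 < lam < 1)."""
    mean = None
    for x in stream:
        mean = x if mean is None else lam * mean + (1 - lam) * x
        yield mean
```

The window approach forgets abruptly and needs memory proportional to the window; the forgetting factor needs constant memory and forgets gradually, which is why both appear as standard tools for non-stationary streams.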
Protecting communications networks against attacks where the aim is to steal information, disrupt order, or harm critical infrastructure can require the collection and analysis of staggering amounts of data. The ability to detect and respond to threats quickly is a paramount concern across sectors, and especially for critical government, utility, and financial networks. Yet detecting emerging or incipient threats in immense volumes of network traffic requires new computational and analytic approaches. Network security increasingly requires cooperation between human analysts able to spot suspicious events through means such as data visualization and automated systems that process streaming network data in near real-time to triage events so that human analysts are best able to focus their work.
This chapter presents a pair of network traffic analysis tools coupled to a computational architecture that enables the high-throughput, real-time visual analysis of network activity. The streaming data pipeline to which these tools are connected is designed to be easily extensible, allowing new tools to subscribe to data and add their own in-stream analytics. The visual analysis tools themselves – Correlation Layers for Information Query and Exploration (CLIQUE) and Traffic Circle – provide complementary views of network activity designed to support the timely discovery of potential threats in volumes of network data that exceed what is traditionally visualized. CLIQUE uses a behavioral modeling approach that learns the expected activity of actors (such as IP addresses or users) and collections of actors on a network, and compares current activity to this learned model to detect behavior-based anomalies.
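CLIQUE's actual models are not reproduced here, but the flavor of behavior-based anomaly detection can be sketched in Python: learn a per-actor baseline of some traffic metric and flag departures from it. The metric, class names, and threshold below are invented for illustration.

```python
import math
from collections import defaultdict

class ActorBaseline:
    """Running mean/variance of one per-actor metric (Welford's algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def zscore(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else (x - self.mean) / std

baselines = defaultdict(ActorBaseline)

def check_event(actor, bytes_sent, threshold=3.0):
    """Flag an event that deviates strongly from the actor's learned behavior."""
    z = baselines[actor].zscore(bytes_sent)
    baselines[actor].update(bytes_sent)
    return abs(z) > threshold     # True -> surface to a human analyst
```

The point of the design is the division of labor described above: the cheap in-stream model triages millions of events, and only the flagged residue reaches the visual tools.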
Support vector machines (SVMs) are currently one of the most popular and accurate methods for binary data classification and prediction. They have been applied to a variety of data and situations such as cyber-security, bioinformatics, web searches, medical risk assessment, financial analysis, and other areas [1]. This type of machine learning has been shown to be accurate and able to generalize predictions based upon previously learned patterns. However, current implementations are limited in that they can only be trained accurately on example sets numbering in the tens of thousands, and they usually run only on serial computers. There are exceptions. A prime example is the annual machine learning and classification competitions, such as those held at the International Conference on Artificial Neural Networks (ICANN), which present problems with more than 100,000 elements to be classified. However, in order to treat such large test cases, the formalism of support vector machines must be modified.
SVMs were first developed by Vapnik and collaborators [2] as an extension to neural networks. Assume that we can convert the data values associated with an entity into numerical values that form a vector in the mathematical sense. These vectors form a space. Also, assume that this space of vectors can be separated by a hyperplane into the vectors that belong to one class and those that form the opposing class.
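As a concrete illustration of the separating-hyperplane idea, here is a minimal sketch assuming scikit-learn is available; the toy vectors and labels are invented.

```python
from sklearn import svm

# Toy feature vectors (entities converted to numerical vectors) and labels.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="linear")    # fit a separating hyperplane in input space
clf.fit(X, y)

print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))  # -> [0 1]
print(clf.coef_, clf.intercept_)              # the hyperplane w.x + b = 0
```

Training solves a quadratic optimization whose cost grows quickly with the number of examples, which is the scaling limitation the paragraph above refers to.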
In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. In a speech given just a few weeks before he was lost at sea off the California coast in January 2007, Jim Gray, a database software pioneer and a Microsoft researcher, called the shift a “fourth paradigm” [32]. The first three paradigms were experimental, theoretical and, more recently, computational science. Gray argued that the only way to cope with this new paradigm is to develop a new generation of computing tools to manage, visualize, and analyze the data flood. In general, current computer architectures are increasingly imbalanced: the latency gap between multicore CPUs and mechanical hard disks grows every year, making the challenges of data-intensive computing harder to overcome [6]. Therefore, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future. In response, Gray argued that the new trend should focus on supporting cheaper clusters of computers to manage and process all this data instead of focusing on having the biggest and fastest single computer. Figure 5.1 illustrates an example of the explosion in scientific data, which creates major challenges for cutting-edge scientific projects. For example, modern high-energy physics experiments, such as DZero, typically generate more than one terabyte of data per day.
As the previous chapter describes, data-intensive applications arise from the interplay of ever-increasing data volumes, complexity, and distribution. Add the needs of applications to process this complex data mélange in ever more interesting and faster ways, and you have an expansive landscape of specific application requirements to address.
Not surprisingly, this breadth of specific requirements leads to many alternative approaches to developing solutions. Different application domains also leverage different technologies, adding further variety to the landscape of data-intensive computing. Despite this inherent diversity, several model solutions for contemporary data-intensive problems have emerged in the last few years. The following briefly describes each one:
Data processing pipelines: Emerging from scientific domains, many large data problems are addressed using processing pipelines. Raw data that originates from a scientific instrument or a simulation is captured and stored. The first stage of processing typically applies techniques to reduce the data in size by removing noise and then processes the data (such as index, summarize, or markup) so that it can be more efficiently manipulated by downstream analytics. Once capture and initial processing have taken place, complex algorithms search and process the data. These algorithms create information and/or knowledge that can be digested by humans or further computational processes. Often, these analytics require large-scale distribution or specialized high-performance computing platforms to execute, making the execution environment of most pipelines both distributed and heterogeneous. Finally, the analysis results are presented to users so that they can be digested and acted upon (a skeletal pipeline of this shape is sketched below).
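A minimal Python sketch of that capture → reduce → index → analyze shape, with each stage written as a generator so records stream through without being materialized; the record format, noise threshold, and final analytic are invented for illustration.

```python
def capture(raw_records):
    """Stage 1: ingest raw instrument or simulation output."""
    for rec in raw_records:
        yield rec

def reduce_noise(records, threshold=0.1):
    """Stage 2: shrink the data by dropping low-signal records."""
    return (r for r in records if abs(r["signal"]) >= threshold)

def index(records):
    """Stage 3: attach metadata so downstream analytics can seek efficiently."""
    return ({"id": i, **r} for i, r in enumerate(records))

def analyze(records):
    """Stage 4: run the (here trivial) analytic over the reduced stream."""
    return max(records, key=lambda r: r["signal"])

raw = [{"signal": 0.05}, {"signal": 0.8}, {"signal": -0.3}]
print(analyze(index(reduce_noise(capture(raw)))))   # {'id': 0, 'signal': 0.8}
```

In a real pipeline each stage would typically run on different hardware, which is what makes the execution environment distributed and heterogeneous.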
Data-intensive applications have special characteristics that in many cases prevent them from executing well on traditional cache-based processors. They can have highly irregular access patterns with very little locality that do not match the expectations of automatically controlled caches. In other cases, such as when they process data in a streaming fashion, they have no temporal locality at all and only limited spatial locality, thereby reducing the effectiveness of caches.
We present an application-driven study of several architectures that are suitable for data-intensive algorithms. Our chosen application is high-speed string matching, which exhibits two key properties of data-intensive codes: highly irregular access patterns and high-speed streaming data. Irregular access patterns appear in string matching when traversing graph-based representations of the pattern dictionaries being used. String matching is typically used in cybersecurity applications to scan incoming network traffic or files for the presence of signatures (such as specific sequences of symbols), which may relate to attack patterns, viruses, or other malware.
String Matching
String matching algorithms check and detect the presence of one or more known symbol sequences inside the analyzed data sets. Besides their well-known application to databases and text processing, they are the basis of several other critical, real-world applications. String matching algorithms are key components of DNA and protein sequencing, data mining, machine learning problems, and security systems such as anti-virus software and Intrusion Detection Systems (IDS), whether for networks (NIDS), applications (APIDS), protocols (PIDS), or hosts (HIDS).
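The classic dictionary-matching approach, and the source of the irregular access patterns noted above, is an automaton whose states and failure links are chased pointer-by-pointer as input symbols arrive. A compact Python sketch of the Aho-Corasick construction follows; the example patterns are invented.

```python
from collections import deque

def build_automaton(patterns):
    """Trie plus failure links; traversing it is the graph-based,
    cache-unfriendly access pattern discussed in the text."""
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]["next"]:
                trie[node]["next"][ch] = len(trie)
                trie.append({"next": {}, "fail": 0, "out": []})
            node = trie[node]["next"][ch]
        trie[node]["out"].append(pat)
    queue = deque(trie[0]["next"].values())
    while queue:                          # breadth-first failure-link setup
        u = queue.popleft()
        for ch, v in trie[u]["next"].items():
            queue.append(v)
            f = trie[u]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[v]["fail"] = trie[f]["next"].get(ch, 0)
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
    return trie

def scan(text, trie):
    """Report (position, pattern) for every dictionary hit in one pass."""
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        hits += [(i - len(p) + 1, p) for p in trie[node]["out"]]
    return hits

trie = build_automaton(["virus", "rus", "use"])
print(scan("antivirused", trie))  # [(4, 'virus'), (6, 'rus'), (7, 'use')]
```

Each input symbol triggers a handful of dependent dictionary lookups into a structure far larger than a cache line, which is exactly the behavior that motivates the architecture study in this chapter.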
In our world of rapid technological change, occasionally it is instructive to contemplate how much has altered in the last few years. Remembering life without the ability to view the World Wide Web (WWW) through browser windows will be difficult, if not impossible, for less “mature” readers. Is it only seven years since YouTube first appeared, a Web site that is now ingrained in many facets of modern life? How did we survive without Facebook all those (actually, about five) years ago?
In 2010, various estimates put the amount of data stored by consumers and businesses around the world in the vicinity of 13 exabytes, with a growth rate of 20 to 25 percent per annum. That is a lot of data. No wonder IBM is pursuing building a 120-petabyte storage array. Obviously there is going to be a market for such devices in the future. As data volumes of all types – from video and photos to text documents and binary files for science – continue to grow in number and resolution, it is clear that we have genuinely entered the realm of data-intensive computing, or as it is often now referred to, big data.
Interestingly, the term “data-intensive computing” was actually coined by the scientific community. Traditionally, scientific codes have been starved of sufficient compute cycles, a paucity that has driven the creation of ever larger and faster high-performance computing machines, typically known as supercomputers. The Top 500 Web site shows the latest benchmark results that characterize the fastest supercomputers on the planet.
The MapReduce programming model has had a transformative impact on data-intensive computing, enabling a single programmer to harness hundreds or thousands of computers for a single task and get up and running in a matter of hours. When processing with thousands of computers, a different set of design considerations dominates: I/O scalability, fault tolerance, and flexibility rather than absolute performance. MapReduce, and its open-source implementation Hadoop, are optimized for these considerations and have become very successful as a result.
It is difficult to quantify the popularity of the MapReduce framework directly, but one indication of its uptake is the frequency of related search terms. Figure 8.1 illustrates the search popularity of the terms “mapreduce” and “hadoop” over the period 2006 to 2012. We see a spike in popularity for the term “mapreduce” in late 2007, but more or less constant popularity since. For the term “hadoop,” however, we see a steady increase to about twelve times the popularity of “mapreduce.”
These data, together with the number of downloads, successful startups [12, 19, 47], projects [41, 53, 57], and attention from the research community [15, 18, 62, 63, 72, 78], suggest a significant and growing interest in both MapReduce and Hadoop.
The MapReduce framework provides a simple programming model for expressing loosely coupled parallel programs by providing two serial functions, Map and Reduce. The Map function processes a block of input producing a sequence of (key, value) pairs, while the Reduce function processes a set of values associated with a single key.
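The canonical illustration is word counting: Map emits a (word, 1) pair for each word in its input block, and Reduce sums the values for each word. A serial Python sketch follows; the sort stands in for the shuffle that a real framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map: one block of input -> a sequence of (key, value) pairs."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: one key plus all of its values -> an aggregated result."""
    return word, sum(counts)

def mapreduce(blocks):
    """Serial stand-in for the framework: map, shuffle (sort), reduce."""
    pairs = sorted(kv for block in blocks for kv in map_fn(block))
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["the quick fox", "the lazy dog", "the fox"]))
# [('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Because Map runs independently on each block and Reduce independently on each key, the framework can scatter both phases across thousands of machines and rerun any failed piece, which is where the I/O scalability and fault tolerance come from.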
Data management is the organization of information to support efficient access and analysis. For data-intensive computing applications, the speed at which relevant data can be accessed is a limiting factor in terms of the size and complexity of computation that can be performed. Data access speed is affected by the size of the relevant subset of the data, the complexity of the query used to define it, and the layout of the data relative to the query. As the underlying data sets become increasingly complex, the questions asked of them become more involved as well. For example, geospatial data associated with a city is no longer limited to the map data representing its streets, but now also includes layers identifying utility lines, key points, locations and types of businesses within the city limits, tax information for each land parcel, satellite imagery, and possibly even street-level views. As a result, queries have gone from simple questions, such as, “How long is Main Street?,” to much more complex questions such as, “Taking all other factors into consideration, are the property values of houses near parks higher than those under power lines, and if so, by what percentage?” Answering these questions requires a coherent infrastructure, integrating the relevant data into a format optimized for the questions being asked.
Data management is critical to supporting analysis because, for large data sets, reading the entire collection is simply not feasible. Instead, the relevant subset of the data must be efficiently described, identified, and retrieved. As a result, the data management approach taken effectively defines the analysis that can be efficiently performed over the data.
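As a toy illustration of the point, once the relevant layers are integrated into one indexed store, the parks-versus-power-lines question above reduces to a single declarative query over the relevant subset rather than a scan of everything. This minimal sketch uses Python's built-in sqlite3; the schema, rows, and index are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE parcels (
    id INTEGER PRIMARY KEY, value REAL, near_park INTEGER, under_lines INTEGER)""")
con.executemany(
    "INSERT INTO parcels(value, near_park, under_lines) VALUES (?, ?, ?)",
    [(300_000, 1, 0), (250_000, 0, 1), (320_000, 1, 0), (240_000, 0, 1)])

# The index lets the engine touch only the relevant subset, not the whole table.
con.execute("CREATE INDEX idx_park ON parcels(near_park)")

row = con.execute("""SELECT AVG(CASE WHEN near_park THEN value END) /
                            AVG(CASE WHEN under_lines THEN value END) - 1
                     FROM parcels""").fetchone()
print(f"parcels near parks are {row[0]:.0%} more valuable in this toy data")
```

The design decision, choosing a layout and indexes that match the expected queries, is exactly what "defines the analysis that can be efficiently performed."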
Discovering Biological Mechanisms through Exploration
The availability of massive amounts of data in biological sciences is forcing us to rethink the role of hypothesis-driven investigation in modern research. Soon thousands, if not millions, of whole-genome DNA and protein sequence data sets will be available thanks to continued improvements in high-throughput sequencing and analysis technologies. At the same time, high-throughput experimental platforms for gene expression, protein and protein fragment measurements, and others are driving experimental data sets to extreme scales. As a result, biological sciences are undergoing a paradigm shift from hypothesis-driven to data-driven scientific exploration. In hypothesis-driven research, one begins with observations, formulates a hypothesis, then tests that hypothesis in controlled experiments. In a data-rich environment, however, one often begins with only a cursory hypothesis (such as that some class of molecular components is related to a cellular process) that may require evaluating hundreds or thousands of specific hypotheses rapidly. Such a large number of experiments is generally intractable to perform physically. However, data can often be brought to bear to rapidly evaluate and refine these candidate hypotheses into a small number of testable ones. Moreover, the amount of data required to discover and refine a hypothesis in this way often overwhelms conventional analysis software and hardware. Ideally, advanced hardware can help the situation, but conventional batch-mode access models for high-performance computing are not amenable to real-time analysis in larger workflows. We present a model for a real-time, data-intensive hypothesis discovery process that unites parallel software applications, high-performance hardware, and visual representation of the output.
The constant conjunction theory of causation was long the dominant theory in philosophy and has recently regained attention. This paper gives a logical framework for causation based on that theory. The basic idea is that causal statements are empirical, derived from our past experience by observing constant conjunction between objects. The logic is defined on linear time structures. A causal statement is evaluated at time points, so that its value depends on what has happened in the past. We first give a semantics containing basic conditions that, we think, must hold for any concept of causation, and on it we define the minimal causal logic. We then discuss possible extensions of the logic for various concepts of causation. Complete deductive systems are given.