The benchmark measurement error model is the bivariate linear errors-in-variables (EIV) regression model with additive measurement errors in both the dependent variable and the regressor variable. The measurement errors are assumed to be classical, meaning that they are uncorrelated with the true value and have mean zero. Then the OLS estimator is inconsistent, with a bias toward zero. The measurement error is often large enough for this bias to be substantial; see, for example, Bound, Brown, and Mathiowetz (2001).
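To make the attenuation result concrete, the following simulation sketch generates data from the bivariate EIV model with classical measurement error in the regressor; the parameter values and variable names are illustrative choices, not taken from the text.

```python
# Simulation sketch of OLS attenuation bias under classical measurement
# error: plim(beta_hat) = beta * s2_x / (s2_x + s2_u), a shrinkage toward
# zero. All names and parameter values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0
s2_x, s2_u = 1.0, 0.5                            # var of true regressor, var of its error

x_star = rng.normal(0.0, np.sqrt(s2_x), n)       # true (unobserved) regressor
y = beta * x_star + rng.normal(0.0, 1.0, n)      # outcome with equation error
x = x_star + rng.normal(0.0, np.sqrt(s2_u), n)   # observed, error-ridden regressor

beta_ols = np.cov(x, y)[0, 1] / np.var(x)        # OLS slope of y on observed x
print(beta_ols)   # approx 2.0 * (1.0 / 1.5) = 1.33, biased toward zero
```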
For nonlinear models the attenuation result does not necessarily hold, but measurement error still leads to inconsistency because the identified parameter is not the parameter of interest in the model free of measurement error. The essential problem lies in the correlation between the observed regressor variable and the measurement error. This leads to loss of identification and distorted inferences about the role of the covariate. A key objective of analysis is to establish an identification strategy for the parameter of interest.
There are important differences between nonlinear and linear measurement error models. It is more difficult in nonlinear models to correct for classical measurement error in the regressors. Furthermore, in nonlinear models it may be more natural to allow measurement errors to be nonclassical and nonadditive. And although in linear models classical measurement error in the dependent variable is innocuous because it just contributes to the equation error as additive noise, in nonlinear models the presence of even classical measurement error in the dependent variable leads to loss of identification of model parameters.
This book describes regression methods for count data, where the response variable is a non-negative integer. The methods are relevant for analysis of counts that arise in both social and natural sciences.
Despite their relatively recent origin, count data regression methods build on an impressive body of statistical research on univariate discrete distributions. Many of these methods have now found their way into major statistical packages, which has encouraged their application in a variety of contexts. Such widespread use has itself thrown up numerous interesting research issues and themes, which we explore in this book.
The objective of the book is threefold. First, we wish to provide a synthesis and integrative survey of the literature on count data regressions, covering both the statistical and econometric strands. The former has emphasized the framework of generalized linear models, exponential families of distributions, and generalized estimating equations, while the latter has emphasized nonlinear regression and generalized method of moments frameworks. Yet between them there are numerous points of contact which can be fruitfully exploited. Our second objective is to make sophisticated methods of data analysis more accessible to practitioners with different interests and backgrounds. To this end we consider models and methods suitable for cross-section, time series, and longitudinal data. Detailed analyses of several data sets as well as shorter illustrations, implemented from a variety of viewpoints, are scattered throughout the book to put empirical flesh on theoretical or methodological discussion. We draw on examples from, and give references to, works in many applied areas.
The most commonly used models for count regression, Poisson and negative binomial, were presented in Chapter 3. In this chapter we introduce richer models for count regression using cross-section data. For some of these models the conditional mean retains the exponential functional form. Then the Poisson QMLE and NB2 ML estimators remain consistent, although they may be inefficient and, while adequate for predicting the conditional mean, may not be suitable for predicting probabilities. For many of these models, however, the Poisson and NB2 estimators are inconsistent. Then alternative methods are used, ones that generally rely heavily on parametric assumptions.
One reason for the failure of Poisson regression is unobserved heterogeneity, which contributes randomness beyond that of the Poisson process. This leads to mixture models, the negative binomial being only one example. A second reason is failure of the Poisson process assumption itself, and its replacement by a more general stochastic process.
Some common departures from the standard Poisson regression are as follows.
Failure of the mean-equals-variance restriction: Frequently the conditional variance of data exceeds the conditional mean, which is usually referred to as extra-Poisson variation or overdispersion relative to the Poisson model. Overdispersion may result from neglected or unobserved heterogeneity that is inadequately captured by the covariates in the conditional mean function. It is common to allow for random variation in the Poisson conditional mean by introducing a multiplicative error term. This leads to families of mixed Poisson models.
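A minimal simulation sketch of this multiplicative mixing, using a gamma heterogeneity term so that the resulting counts are negative binomial; the parameter values and names are illustrative.

```python
# Sketch of a mixed Poisson model: a multiplicative gamma heterogeneity
# term in the Poisson mean yields overdispersed counts (here, NB2).
# Names and parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, mu, alpha = 100_000, 3.0, 0.5          # conditional mean and heterogeneity variance

nu = rng.gamma(1 / alpha, alpha, size=n)  # gamma with E[nu] = 1, Var[nu] = alpha
y = rng.poisson(mu * nu)                  # counts with random mean mu * nu

print(y.mean(), y.var())   # mean ~ 3.0, variance ~ mu + alpha*mu^2 = 7.5 > 3.0
```

The variance exceeds the mean by alpha * mu^2, the quadratic (NB2) variance function, so the simulated counts display exactly the overdispersion described above.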
In this chapter we provide a detailed discussion of empirical models for three examples based on four cross-sectional data sets. The first example analyzes the demand for medical care by the elderly in the United States and shares many features of health utilization studies based on cross-section data. The second example is an analysis of recreational trips. The third is an analysis of completed fertility – the total number of children born to a woman with a complete history of births.
Figure 6.1 presents histograms for the four count variables studied; the first two histograms exclude the highest percentile for readability. Physician visits appear roughly negative binomial, with a mild excess of zeros. Recreational trips have a very large excess of zeros. Completed fertility in both data sets is bimodal, with modes at 0 and 2. Different count data models will most likely be needed for these different data sets.
The applications presented in this chapter emphasize fully parametric models for counts, an issue discussed in section 6.2. Sections 6.3 to 6.5 deal, in turn, with each of the three empirical applications. The health care example in section 6.3 is the most extensive and provides a lengthy treatment of model fitting, selection, and interpretation, with focus on a finite mixture model. The recreational trips example in section 6.4 pays particular attention to the special treatment of zero counts versus positive counts. The completed fertility illustration in section 6.5 is a nonregression example that emphasizes fitting a distribution that is bimodal. Section 6.6 pursues a methodological question concerning the distribution of the LR test under nonstandard conditions, previously raised in section 4.8.5.
Count regressions with endogenous regressors occur frequently. Ignoring the feedback from the response variable to the endogenous regressor, and simply conditioning the outcome on variables with which it is jointly determined, leads in general to inconsistent parameter estimates. The estimation procedure should instead allow for stochastic dependence between the response variable and endogenous regressors. In considering this issue the existing literature on simultaneous equation estimation in nonlinear models is of direct relevance (T. Amemiya, 1985).
The empirical example of Chapter 3 models doctor visits as depending in part on the individual's type of health insurance. In Chapter 3 the health insurance indicator variables were treated as exogenous, but health insurance is frequently a choice variable rather than exogenously assigned. A richer model is a simultaneous model with a count outcome depending on endogenous variable(s) that may be binary (two insurance plans), multinomial (more than two insurance plans), or simply continuous.
This chapter deals with several classes of models with endogenous regressors, tailored to the outcome of interest being a count. It discusses estimation and inference for both fully parametric full-information methods and less parametric limited-information methods. These approaches are based on a multiple equation model in which that for the count outcome is of central interest, but there is also an auxiliary model for the endogenous regressor, sometimes called the first-stage or reduced-form equation. Estimation methods differ according to the detail in which the reduced form is specified and exploited in estimation.
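One widely used limited-information method of this kind is the control-function, or two-stage residual inclusion, estimator. The sketch below is a minimal version assuming a single endogenous regressor d, an instrument z, and exogenous controls; all names are illustrative placeholders rather than the notation of this chapter.

```python
# Control-function (two-stage residual inclusion) sketch for a count
# outcome with one endogenous regressor; assumes a valid instrument z.
# Variable names (y, d, z, controls) are illustrative placeholders.
import numpy as np
import statsmodels.api as sm

def poisson_2sri(y, d, z, controls):
    """Stage 1: OLS of the endogenous regressor d on instrument z and controls.
    Stage 2: Poisson regression of y on d, controls, and the stage-1 residual,
    which proxies for the unobservable that drives the endogeneity."""
    X1 = sm.add_constant(np.column_stack([z, controls]))
    resid = sm.OLS(d, X1).fit().resid
    X2 = sm.add_constant(np.column_stack([d, controls, resid]))
    # Sandwich standard errors; note they ignore that the residual is itself
    # estimated, so in practice bootstrapping the two stages is often preferred.
    return sm.GLM(y, X2, family=sm.families.Poisson()).fit(cov_type="HC0")
```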
Longitudinal data or panel data are observations on a cross-section of individual units such as persons, households, firms, and regions that are observed over several time periods. The data structure is similar to that of multivariate data considered in Chapter 8. Analysis is simpler than for multivariate data because for each individual unit the same outcome variable is observed, rather than several different outcome variables. Yet analysis is also more complex because this same outcome variable is observed at different points in time, introducing time series data considerations presented in Chapter 7.
In this chapter we consider longitudinal data analysis if the dependent variable is a count variable. Remarkably, many count regression applications are to longitudinal data rather than simpler cross-section data. Econometrics examples include the number of patents awarded to each of many individual firms over several years, the number of accidents in each of several regions, and the number of days of absence for each of many persons over several years. A political science example is the number of protests in each of several different countries over many years. A biological and health science example is the number of occurrences of a specific health event, such as a seizure, for each of many patients in each of several time periods.
This chapter is intended to provide a self-contained treatment of basic cross-section count data regression analysis. It is analogous to a chapter in a standard statistics text that covers both homoskedastic and heteroskedastic linear regression models.
The most commonly used count models are Poisson and negative binomial. For readers interested only in these models, it is sufficient to read sections 3.1 to 3.5, along with preparatory material in sections 1.2 and 2.2.
As indicated in Chapter 2, the properties of an estimator vary with the assumptions made on the dgp. By correct specification of the conditional mean or variance or density, we mean that the functional form and explanatory variables in the specified conditional mean or variance or density are those of the dgp.
The simplest regression model for count data is the Poisson regression model. For the Poisson MLE, the following can be shown:
Consistency requires correct specification of the conditional mean. It does not require that the dependent variable y be Poisson distributed.
Valid statistical inference using default computed maximum likelihood standard errors and t statistics requires correct specification of both the conditional mean and variance. This requires equidispersion, that is, equality of the conditional variance and mean, but does not require that y be Poisson distributed.
Valid statistical inference using appropriately computed standard errors is still possible if data are not equidispersed, provided the conditional mean is correctly specified.
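In practice this means pairing the Poisson point estimates with robust (sandwich) standard errors. A minimal sketch using synthetic overdispersed data, with all names and parameter values illustrative:

```python
# Poisson regression with both default ML and robust (sandwich) standard
# errors; the latter remain valid under overdispersion provided the
# conditional mean is correctly specified. Data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)                # exponential conditional mean
nu = rng.gamma(2.0, 0.5, size=n)          # heterogeneity => overdispersion
y = rng.poisson(mu * nu)

X = sm.add_constant(x)
fit_ml = sm.GLM(y, X, family=sm.families.Poisson()).fit()                 # default ML SEs
fit_rb = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")   # sandwich SEs
print(fit_ml.bse, fit_rb.bse)   # robust SEs are larger under overdispersion
```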
Bayesian methods provide a quite different way to view statistical inference and model selection and to incorporate prior information on model parameters. These methods have become increasingly popular over the past 20 years due to methodological advances, notably Markov chain Monte Carlo methods, and increased computational power.
Some applied studies in econometrics are fully Bayesian. Others merely use Bayesian methods as a tool to enable statistical inference in the classical frequentist maximum likelihood framework for likelihood-based models that are difficult to estimate using other methods such as simulated maximum likelihood.
Section 12.2 presents the basics of Bayesian analysis. Section 12.3 presents some results for Poisson models. Section 12.4 covers Markov chain Monte Carlo methods that are now the common way to implement Bayesian analysis when analytically tractable results cannot be obtained, and it provides an illustrative example. Section 12.5 summarizes Bayesian models for various types of count data. Section 12.6 concludes with a more complicated illustrative example, a count version of the Roy model that allows for endogenous selection.
BAYESIAN APPROACH
The Bayesian approach treats the parameters θ as unknown random variables, with inference on θ to be based both on the data y and on prior beliefs about θ. The data and prior beliefs are combined to form the posterior density of θ given y, and Bayesian inference is based on this posterior. This section presents a brief summary, with further details provided in subsequent sections.
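For intuition in the count setting, the simplest case is conjugate: an i.i.d. Poisson(theta) sample combined with a Gamma(a, b) prior yields a Gamma(a + sum(y), b + n) posterior. A minimal sketch, with illustrative data and prior values:

```python
# Conjugate Bayesian updating for a Poisson rate: a Gamma(a, b) prior plus
# an i.i.d. Poisson(theta) sample of size n with sum s gives a
# Gamma(a + s, b + n) posterior. Data and prior values are illustrative.
import numpy as np
from scipy import stats

y = np.array([2, 0, 3, 1, 4, 2, 2])     # observed counts (illustrative)
a, b = 1.0, 1.0                         # Gamma prior: shape a, rate b

a_post, b_post = a + y.sum(), b + len(y)
posterior = stats.gamma(a_post, scale=1.0 / b_post)   # scipy uses scale = 1/rate
print(posterior.mean(), posterior.interval(0.95))     # posterior mean, 95% interval
```

Markov chain Monte Carlo methods, discussed in section 12.4, take over precisely when such closed-form updates are unavailable, as in Poisson regression with covariates.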
This chapter presents the general modeling approaches most often used in count data analysis – likelihood-based, generalized linear models, and moment-based. Statistical inference for these nonlinear regression models is based on asymptotic theory, which is also summarized.
The models and results vary according to the strength of the distributional assumptions made. Likelihood-based models and the associated maximum likelihood estimator require complete specification of the distribution. Statistical inference is usually performed under the assumption that the distribution is correctly specified.
A less parametric analysis assumes that some aspects of the distribution of the dependent variable are correctly specified, whereas others are not specified or, if they are specified, are potentially misspecified. For count data models, considerable emphasis has been placed on analysis based on the assumption of correct specification of the conditional mean, or of correct specification of both the conditional mean and the conditional variance. This is a nonlinear generalization of the linear regression model, in which consistency requires correct specification of the mean and efficient estimation requires correct specification of both the mean and the variance. It is a special case of the class of generalized linear models that is widely used in the statistics literature. Estimators for generalized linear models coincide with maximum likelihood estimators if the specified density is in the linear exponential family. But even then the asymptotic distribution of the same estimator can differ across the two approaches if different second-moment assumptions are made.
The truly world-wide reach of the Web has brought with it a new realisation of the enormous importance of usability and user interface design. In the last ten years, much has become understood about what works in search interfaces from a usability perspective, and what does not. Researchers and practitioners have developed a wide range of innovative interface ideas, but only the most broadly acceptable make their way into major web search engines. This book summarizes these developments, presenting the state of the art of search interface design, both in academic research and in deployment in commercial systems. Many books describe the algorithms behind search engines and information retrieval systems, but the unique focus of this book is specifically on the user interface. It will be welcomed by industry professionals who design systems that use search interfaces as well as graduate students and academic researchers who investigate information systems.
The world is awash with digital data from social networks, blogs, business, science and engineering. Data-intensive computing facilitates understanding of complex problems that must process massive amounts of data. Through the development of new classes of software, algorithms and hardware, data-intensive applications can provide timely and meaningful analytical results in response to exponentially growing data complexity and associated analysis requirements. This emerging area brings many challenges that are different from traditional high-performance computing. This reference for computing professionals and researchers describes the dimensions of the field, the key challenges, the state of the art and the characteristics of likely approaches that future data-intensive problems will require. Chapters cover general principles and methods for designing such systems and for managing and analyzing the big data sets of today that live in the cloud and describe example applications in bioinformatics and cybersecurity that illustrate these principles in practice.
With sensors becoming ubiquitous, there is an increasing interest in mining the data from these sensors as the data are being collected. This analysis of streaming data, or data streams, presents new challenges to analysis algorithms. The size of the data can be massive, especially when the sensors number in the thousands and the data are sampled at a high frequency. The data can be non-stationary, with statistics that vary over time. Real-time analysis is often required, either to avoid untoward incidents or to better understand an interesting phenomenon. These factors make the analysis of streaming data, whether from sensors or other sources, very data- and compute-intensive. One possible approach to making this analysis tractable is to identify the important data streams and focus on them. This chapter describes the different ways in which this can be done, given that what makes a stream important varies from problem to problem and can often change with time within a single problem. It then illustrates these techniques by applying them to data from a real problem and discusses the challenges faced in this emerging field of streaming data analysis.
This chapter is organized as follows: first, I define what is meant by streaming data and use examples from practical problems to discuss the challenges in the analysis of these data. Next, I describe the two main approaches used to handle the streaming nature of the data – the sliding window approach and the forgetting factor approach.
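A minimal sketch of both approaches applied to a running mean; the choice of statistic and all names are illustrative, not taken from the chapter.

```python
# Two standard ways to track a statistic over a stream: a fixed-length
# sliding window and an exponential forgetting factor. Illustrative sketch.
from collections import deque

class SlidingWindowMean:
    def __init__(self, size):
        self.buf = deque(maxlen=size)      # old values drop out automatically
    def update(self, x):
        self.buf.append(x)
        return sum(self.buf) / len(self.buf)

class ForgettingFactorMean:
    def __init__(self, lam=0.95):          # lam close to 1 => long memory
        self.lam, self.mean = lam, None
    def update(self, x):
        self.mean = x if self.mean is None else self.lam * self.mean + (1 - self.lam) * x
        return self.mean
```

The window gives equal weight to the most recent observations and none to older ones, while the forgetting factor downweights the past geometrically, which adapts smoothly to non-stationarity at the cost of never fully discarding old data.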
Protecting communications networks against attacks where the aim is to steal information, disrupt order, or harm critical infrastructure can require the collection and analysis of staggering amounts of data. The ability to detect and respond to threats quickly is a paramount concern across sectors, and especially for critical government, utility, and financial networks. Yet detecting emerging or incipient threats in immense volumes of network traffic requires new computational and analytic approaches. Network security increasingly requires cooperation between human analysts able to spot suspicious events through means such as data visualization and automated systems that process streaming network data in near real-time to triage events so that human analysts are best able to focus their work.
This chapter presents a pair of network traffic analysis tools coupled to a computational architecture that enables the high-throughput, real-time visual analysis of network activity. The streaming data pipeline to which these tools are connected is designed to be easily extensible, allowing new tools to subscribe to data and add their own in-stream analytics. The visual analysis tools themselves – Correlation Layers for Information Query and Exploration (CLIQUE) and Traffic Circle – provide complementary views of network activity designed to support the timely discovery of potential threats in volumes of network data that exceed what is traditionally visualized. CLIQUE uses a behavioral modeling approach that learns the expected activity of actors (such as IP addresses or users) and collections of actors on a network, and compares current activity to this learned model to detect behavior-based anomalies.
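As a generic illustration of this behavioral-modeling idea (not the actual CLIQUE algorithm, whose details are not given here), the sketch below maintains per-actor running statistics and flags activity that deviates strongly from an actor's learned baseline; the class name and threshold are illustrative.

```python
# Generic behavior-based anomaly detection sketch: learn a per-actor
# baseline of activity and flag large deviations. This illustrates the
# idea only; it is not the CLIQUE algorithm.
class ActorBaseline:
    def __init__(self, threshold=3.0):
        self.stats = {}                    # actor -> (count, mean, M2), Welford updates
        self.threshold = threshold
    def update_and_score(self, actor, value):
        n, mean, m2 = self.stats.get(actor, (0, 0.0, 0.0))
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        self.stats[actor] = (n, mean, m2)
        std = (m2 / n) ** 0.5 if n > 1 else 0.0
        z = abs(value - mean) / std if std > 0 else 0.0
        return z > self.threshold          # True flags a behavioral anomaly
```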
Support vector machines (SVM) are currently one of the most popular and accurate methods for binary data classification and prediction. They have been applied to a variety of data and situations such as cyber-security, bioinformatics, web searches, medical risk assessment, financial analysis, and other areas [1]. This type of machine learning has been shown to be accurate and able to generalize predictions based upon previously learned patterns. However, current implementations are limited in that they can be trained accurately only on examples numbering in the tens of thousands, and they usually run only on serial computers. There are exceptions. A prime example is the annual machine learning and classification competitions, such as those at the International Conference on Artificial Neural Networks (ICANN), which present problems with more than 100,000 elements to be classified. However, to treat such large test cases the formalism of support vector machines must be modified.
SVMs were first developed by Vapnik and collaborators [2] as an extension to neural networks. Assume that we can convert the data values associated with an entity into numerical values that form a vector in the mathematical sense. These vectors form a space. Also, assume that this space of vectors can be separated by a hyperplane into the vectors that belong to one class and those that form the opposing class.
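A minimal sketch of this separating-hyperplane idea, using scikit-learn's linear-kernel SVC on synthetic two-dimensional data; the data and names are illustrative.

```python
# Sketch of the separating-hyperplane idea: fit a linear-kernel SVM on two
# synthetic clusters and inspect the learned hyperplane w.x + b = 0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)),    # class -1 cluster of vectors
               rng.normal(+2, 1, (50, 2))])   # class +1 cluster of vectors
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)
print(clf.coef_, clf.intercept_)      # w and b of the hyperplane w.x + b = 0
print(clf.predict([[3.0, 3.0]]))      # classify a new vector by its side of the plane
```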