We have seen several classifiers in the preceding chapters, such as decision trees, the full and naive Bayes classifiers, the nearest neighbors classifier, support vector machines, and so on. In general, we may think of a classifier as a model or function M that predicts the class label ŷ for a given input example x:
ŷ = M(x)
where x = (x1, x2, …, xd)T ∈ Rd is a point in d-dimensional space and ŷ ∈ {c1, c2, …, ck} is its predicted class.
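The view of a classifier as a function M from Rd to the set of class labels can be made concrete with a toy sketch. The rule below is purely hypothetical (a fixed linear decision rule over d = 2 dimensions, not any model from the book); it only illustrates the input/output contract ŷ = M(x).

```python
# Illustrative sketch: a classifier is just a function M mapping a point
# x in R^d to a predicted label y_hat. Here M is a hypothetical fixed
# linear rule over d = 2 dimensions (not a trained model).
def M(x):
    # x = (x1, x2); predict 'c1' on one side of the line x1 + x2 = 1, else 'c2'
    return 'c1' if x[0] + x[1] <= 1.0 else 'c2'

print(M((0.2, 0.3)))  # → c1
print(M((1.5, 0.7)))  # → c2
```

In practice M is not fixed by hand but learned from training data, as discussed next.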
To build the classification model M we need a training set of points together with their known classes. Different classifiers are obtained depending on the assumptions used to construct M. For instance, support vector machines build M from the maximum margin hyperplane, whereas the Bayes classifier directly computes the posterior probability P(cj|x) for each class cj, and predicts the class of x as the one with the maximum posterior probability, ŷ = argmax_cj {P(cj|x)}. Once the model M has been trained, we assess its performance on a separate testing set of points whose true classes are known. Finally, the model can be deployed to predict the class of future points, whose class we typically do not know.
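The train/assess cycle above can be sketched end to end. The following is a minimal illustration, not the book's code: it fits a Bayes-style classifier with one-dimensional Gaussian class-conditional densities (a simplifying assumption made here for brevity), predicts via ŷ = argmax_cj {P(cj|x)}, and then measures accuracy on a held-out testing set with known true classes. The toy data values are invented.

```python
# Sketch (with assumed 1-d Gaussian likelihoods): train a Bayes-style
# classifier on labeled points, then assess it on a separate test set.
import math

def gaussian_pdf(x, mu, var):
    # Gaussian density N(x; mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(train):
    # Estimate the prior P(cj) and Gaussian parameters (mean, variance)
    # for each class cj from the training set.
    params, n = {}, len(train)
    for c in set(y for _, y in train):
        xs = [x for x, y in train if y == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) or 1e-9
        params[c] = (len(xs) / n, mu, var)
    return params

def predict(params, x):
    # y_hat = argmax over classes of P(cj) * P(x|cj), proportional to P(cj|x)
    return max(params, key=lambda c: params[c][0] * gaussian_pdf(x, *params[c][1:]))

# Invented toy data: training set with known classes, separate testing set.
train = [(1.0, 'a'), (1.2, 'a'), (0.9, 'a'), (3.0, 'b'), (3.2, 'b'), (2.9, 'b')]
test = [(1.1, 'a'), (3.1, 'b'), (0.8, 'a')]

M = fit(train)
acc = sum(predict(M, x) == y for x, y in test) / len(test)
print(acc)  # fraction of test points whose predicted class matches the true class
```

The same pattern generalizes to d-dimensional points; the assessment step, comparing ŷ = M(x) against the true class over the test set, is the subject of what follows.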