Mining of Massive Datasets

Jure Leskovec; Anand Rajaraman; Jeffrey David Ullman

doi:10.1017/9781108684163

Chapter 4: Mining Data Streams

pp. 138-168

Jure Leskovec, Stanford University, California

Anand Rajaraman, Rocketship VC

Jeffrey David Ullman, Stanford University, California

Summary

Rather than assuming that we are mining a database, we shall assume that data arrives in a stream or streams, and that if it is not processed immediately or stored, it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database) and then interact with it at the time of our choosing. The algorithms for processing streams each involve summarization of the stream in some way. We shall start by considering how to make a useful sample of a stream and how to filter a stream to eliminate most of the “undesirable” elements. We then show how to estimate the number of different elements in a stream using much less storage than would be required if we listed all the elements we have seen. Another approach to summarizing a stream is to look at only a fixed-length “window” consisting of the last n elements for some (typically large) n. We then query the window as if it were a relation in a database. If there are many streams or n is large, we may not be able to store the entire window for every stream, so we need to summarize even the windows. We address the fundamental problem of maintaining an approximate count of the number of 1s in the window of a bit stream, while using much less space than would be needed to store the entire window itself.

Keywords

data stream
approximate counting
Bloom filter
counting distinct elements
stream sampling
stream moments

About the book

Chapter DOI: https://doi.org/10.1017/9781108684163.005
Book DOI: https://doi.org/10.1017/9781108684163
Subjects: Computer Science; Data Science; Databases, Data Mining, and Information Retrieval; Machine Learning and Pattern Recognition
Format: Hardback
- Publication date: 13 February 2020
- ISBN: 9781108476348
Format: Digital
- Publication date: 16 April 2020
- ISBN: 9781108684163
