Mining of Massive Datasets

Jure Leskovec; Anand Rajaraman; Jeffrey David Ullman

doi:10.1017/9781108684163

Chapter 3: Finding Similar Items

pp. 78-137

Jure Leskovec, Stanford University, California

Anand Rajaraman, Rocketship VC

Jeffrey David Ullman, Stanford University, California

Summary

We begin our discussion of locality-sensitive hashing (LSH) with an examination of the problem of finding similar documents – those that share a large amount of text. We first show how to convert documents into sets in a way that lets us view textual similarity of documents as sets having a large overlap. A second key technique is minhashing, which converts large sets into much smaller representations, called signatures, that still let us closely estimate the Jaccard similarity of the represented sets. Finally, we see how to apply the bucketing idea inherent in LSH to the signatures. In Section 3.5 we begin our study of how to apply LSH to items other than sets. We consider the general notion of a distance measure, which tells us to what degree items are similar. Then we consider the general idea of locality-sensitive hashing, and we see how to do LSH for some data types other than sets. We examine in detail several applications of the LSH idea. Finally, we consider some techniques for finding similar sets that can be more efficient than LSH when the degree of similarity we want is very high.
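The pipeline described in the summary (documents to sets, sets to signatures, signatures to buckets) can be illustrated with a minimal Python sketch. It is not the chapter's own code: the function names, the shingle length k=5, the use of seeded BLAKE2b digests as a stand-in for random hash functions, and the choice of 128 hashes split into 32 bands are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-character shingles of text (k=5 is an assumed default)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    """One member of a family of hash functions, selected by seed (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature of a non-empty set: the minimum hash value under each hash function."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def lsh_buckets(signature, bands=32):
    """Split a signature into bands; each (band index, band values) pair is a bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]
```

In this sketch, two documents would be treated as a candidate pair when their `lsh_buckets` outputs share at least one key; only candidate pairs then need their signatures, or their full shingle sets, compared.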

Keywords

locality-sensitive hashing
shingling
minhashing
Jaccard similarity
signature

About the book

Chapter DOI: https://doi.org/10.1017/9781108684163.004
Book DOI: https://doi.org/10.1017/9781108684163
Subjects: Computer Science, Data Science, Databases, Data Mining, and Information Retrieval, Machine Learning and Pattern Recognition
Format: Hardback
- Publication date: 13 February 2020
- ISBN: 9781108476348
Format: Digital
- Publication date: 16 April 2020
- ISBN: 9781108684163