Mining of Massive Datasets
Jure Leskovec
Stanford Univ.
Anand Rajaraman
Milliway Labs
Jeffrey D. Ullman
Stanford Univ.
Copyright © 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman
Preface
This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course
CS345A, titled “Web Mining,” was designed as an advanced graduate course,
although it has become accessible and interesting to advanced undergraduates.
When Jure Leskovec joined the Stanford faculty, we reorganized the material
considerably. He introduced a new course CS224W on network analysis and
added material to CS345A, which was renumbered CS246. The three authors
also introduced a large-scale data-mining project course, CS341. The book now
contains material taught in all three courses.
What the Book Is About
At the highest level of description, this book is about data mining. However,
it focuses on data mining of very large amounts of data, that is, data so large
it does not fit in main memory. Because of the emphasis on size, many of our
examples are about the Web or data derived from the Web. Further, the book
takes an algorithmic point of view: data mining is about applying algorithms
to data, rather than using data to “train” a machine-learning engine of some
sort. The principal topics covered are:
1. Distributed file systems and map-reduce as a tool for creating parallel
algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-
sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data
that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam
detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm, and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.
8. Algorithms for analyzing and mining the structure of very large graphs,
especially social-network graphs.
9. Techniques for obtaining the important properties of a large dataset by
dimensionality reduction, including singular-value decomposition and latent semantic indexing.
10. Machine-learning algorithms that can be applied to very large data, such
as perceptrons, support-vector machines, and gradient descent.
Prerequisites
To appreciate fully the material in this book, we recommend the following
prerequisites:
1. An introduction to database systems, covering SQL and related programming systems.
2. A sophomore-level course in data structures, algorithms, and discrete
math.
3. A sophomore-level course in software systems, software engineering, and
programming languages.
Exercises
The book contains extensive exercises, with some for almost every section. We
indicate harder exercises or parts of exercises with an exclamation point. The
hardest exercises have a double exclamation point.
Support on the Web
Go to http://www.mmds.org for slides, homework assignments, project requirements, and exams from courses related to this book.
Gradiance Automated Homework
There are automated exercises based on this book, using the Gradiance root-
question technology, available at www.gradiance.com/services. Students may
enter a public class by creating an account at that site and entering the class
with code 1EDD8A1D. Instructors may use the site by making an account there
and then emailing support at gradiance dot com with their login name, the
name of their school, and a request to use the MMDS materials.
Acknowledgements
Cover art is by Scott Ullman.
We would like to thank Foto Afrati, Arun Marathe, and Rok Sosic for critical
readings of a draft of this manuscript.
Errors were also reported by Rajiv Abraham, Ruslan Aduk, Apoorv Agarwal, Aris Anagnostopoulos, Yokila Arora, Stefanie Anna Baby, Atilla Soner
Balkir, Arnaud Belletoile, Robin Bennett, Susan Biancani, Amitabh Chaudhary, Leland Chen, Hua Feng, Marcus Gemeinder, Anastasios Gounaris, Clark
Grubb, Shrey Gupta, Waleed Hameid, Saman Haratizadeh, Julien Hoachuck,
Przemyslaw Horban, Hsiu-Hsuan Huang, Jeff Hwang, Rafi Kamal, Lachlan
Kang, Ed Knorr, Haewoon Kwak, Ellis Lau, Greg Lee, David Z. Liu, Ethan
Lozano, Yunan Luo, Michael Mahoney, Justin Meyer, Bryant Moscon, Brad
Penoff, John Phillips, Philips Kokoh Prasetyo, Qi Ge, Harizo Rajaona, Timon Ruban, Rich Seiter, Hitesh Shetty, Angad Singh, Sandeep Sripada, Dennis
Sidharta, Krzysztof Stencel, Mark Storus, Roshan Sumbaly, Zack Taylor, Tim
Triche Jr., Wang Bin, Weng Zhen-Bin, Robert West, Steven Euijong Whang,
Oscar Wu, Xie Ke, Christopher T.-R. Yeh, Nicolas Zhao, and Zhou Jingbo. The
remaining errors are ours, of course.
J. L.
A. R.
J. D. U.
Palo Alto, CA
March, 2014
Contents
1 Data Mining 1
1.1 What is Data Mining? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Statistical Modeling . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Computational Approaches to Modeling . . . . . . . . . . 2
1.1.4 Summarization . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.5 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Statistical Limits on Data Mining . . . . . . . . . . . . . . . . . . 4
1.2.1 Total Information Awareness . . . . . . . . . . . . . . . . 5
1.2.2 Bonferroni’s Principle . . . . . . . . . . . . . . . . . . . . 5
1.2.3 An Example of Bonferroni’s Principle . . . . . . . . . . . 6
1.2.4 Exercises for Section 1.2 . . . . . . . . . . . . . . . . . . . 7
1.3 Things Useful to Know . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Importance of Words in Documents . . . . . . . . . . . . 8
1.3.2 Hash Functions . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.4 Secondary Storage . . . . . . . . . . . . . . . . . . . . . . 11
1.3.5 The Base of Natural Logarithms . . . . . . . . . . . . . . 12
1.3.6 Power Laws . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.7 Exercises for Section 1.3 . . . . . . . . . . . . . . . . . . . 15
1.4 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Summary of Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 References for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . 18
2 MapReduce and the New Software Stack 21
2.1 Distributed File Systems . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.1 Physical Organization of Compute Nodes . . . . . . . . . 22
2.1.2 Large-Scale File-System Organization . . . . . . . . . . . 23
2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 The Map Tasks . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Grouping by Key . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 The Reduce Tasks . . . . . . . . . . . . . . . . . . . . . . 27
2.2.4 Combiners . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.5 Details of MapReduce Execution . . . . . . . . . . . . . . 28
2.2.6 Coping With Node Failures . . . . . . . . . . . . . . . . . 29
2.2.7 Exercises for Section 2.2 . . . . . . . . . . . . . . . . . . . 30
2.3 Algorithms Using MapReduce . . . . . . . . . . . . . . . . . . . . 30
2.3.1 Matrix-Vector Multiplication by MapReduce . . . . . . . 31
2.3.2 If the Vector v Cannot Fit in Main Memory . . . . . . . . 31
2.3.3 Relational-Algebra Operations . . . . . . . . . . . . . . . 32
2.3.4 Computing Selections by MapReduce . . . . . . . . . . . 35
2.3.5 Computing Projections by MapReduce . . . . . . . . . . . 36
2.3.6 Union, Intersection, and Difference by MapReduce . . . . 36
2.3.7 Computing Natural Join by MapReduce . . . . . . . . . . 37
2.3.8 Grouping and Aggregation by MapReduce . . . . . . . . . 37
2.3.9 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . 38
2.3.10 Matrix Multiplication with One MapReduce Step . . . . . 39
2.3.11 Exercises for Section 2.3 . . . . . . . . . . . . . . . . . . . 40
2.4 Extensions to MapReduce . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 Workflow Systems . . . . . . . . . . . . . . . . . . . . . . 41
2.4.2 Recursive Extensions to MapReduce . . . . . . . . . . . . 42
2.4.3 Pregel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.4 Exercises for Section 2.4 . . . . . . . . . . . . . . . . . . . 46
2.5 The Communication Cost Model . . . . . . . . . . . . . . . . . . 46
2.5.1 Communication Cost for Task Networks . . . . . . . . . . . . 47
2.5.2 Wall-Clock Time . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.3 Multiway Joins . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.4 Exercises for Section 2.5 . . . . . . . . . . . . . . . . . . . 52
2.6 Complexity Theory for MapReduce . . . . . . . . . . . . . . . . . 54
2.6.1 Reducer Size and Replication Rate . . . . . . . . . . . . . 54
2.6.2 An Example: Similarity Joins . . . . . . . . . . . . . . . . 55
2.6.3 A Graph Model for MapReduce Problems . . . . . . . . . 57
2.6.4 Mapping Schemas . . . . . . . . . . . . . . . . . . . . . . 58
2.6.5 When Not All Inputs Are Present . . . . . . . . . . . . . 60
2.6.6 Lower Bounds on Replication Rate . . . . . . . . . . . . . 61
2.6.7 Case Study: Matrix Multiplication . . . . . . . . . . . . . 62
2.6.8 Exercises for Section 2.6 . . . . . . . . . . . . . . . . . . . 66
2.7 Summary of Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . 67
2.8 References for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . 69
3 Finding Similar Items 73
3.1 Applications of Near-Neighbor Search . . . . . . . . . . . . . . . 73
3.1.1 Jaccard Similarity of Sets . . . . . . . . . . . . . . . . . . 74
3.1.2 Similarity of Documents . . . . . . . . . . . . . . . . . . . 74
3.1.3 Collaborative Filtering as a Similar-Sets Problem . . . . . 75
3.1.4 Exercises for Section 3.1 . . . . . . . . . . . . . . . . . . . 77
3.2 Shingling of Documents . . . . . . . . . . . . . . . . . . . . . . . 77
3.2.1 k-Shingles . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2.2 Choosing the Shingle Size . . . . . . . . . . . . . . . . . . 78
3.2.3 Hashing Shingles . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.4 Shingles Built from Words . . . . . . . . . . . . . . . . . . 79
3.2.5 Exercises for Section 3.2 . . . . . . . . . . . . . . . . . . . 80
3.3 Similarity-Preserving Summaries of Sets . . . . . . . . . . . . . . 80
3.3.1 Matrix Representation of Sets . . . . . . . . . . . . . . . . 81
3.3.2 Minhashing . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.3.3 Minhashing and Jaccard Similarity . . . . . . . . . . . . . 82
3.3.4 Minhash Signatures . . . . . . . . . . . . . . . . . . . . . 83
3.3.5 Computing Minhash Signatures . . . . . . . . . . . . . . . 83
3.3.6 Exercises for Section 3.3 . . . . . . . . . . . . . . . . . . . 86
3.4 Locality-Sensitive Hashing for Documents . . . . . . . . . . . . . 87
3.4.1 LSH for Minhash Signatures . . . . . . . . . . . . . . . . 88
3.4.2 Analysis of the Banding Technique . . . . . . . . . . . . . 89
3.4.3 Combining the Techniques . . . . . . . . . . . . . . . . . . 91
3.4.4 Exercises for Section 3.4 . . . . . . . . . . . . . . . . . . . 91
3.5 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.5.1 Definition of a Distance Measure . . . . . . . . . . . . . . 92
3.5.2 Euclidean Distances . . . . . . . . . . . . . . . . . . . . . 93
3.5.3 Jaccard Distance . . . . . . . . . . . . . . . . . . . . . . . 94
3.5.4 Cosine Distance . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5.5 Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5.6 Hamming Distance . . . . . . . . . . . . . . . . . . . . . . 96
3.5.7 Exercises for Section 3.5 . . . . . . . . . . . . . . . . . . . 97
3.6 The Theory of Locality-Sensitive Functions . . . . . . . . . . . . 99
3.6.1 Locality-Sensitive Functions . . . . . . . . . . . . . . . . . 99
3.6.2 Locality-Sensitive Families for Jaccard Distance . . . . . . 100
3.6.3 Amplifying a Locality-Sensitive Family . . . . . . . . . . . 101
3.6.4 Exercises for Section 3.6 . . . . . . . . . . . . . . . . . . . 103
3.7 LSH Families for Other Distance Measures . . . . . . . . . . . . . 104
3.7.1 LSH Families for Hamming Distance . . . . . . . . . . . . 104
3.7.2 Random Hyperplanes and the Cosine Distance . . . . . . 105
3.7.3 Sketches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.7.4 LSH Families for Euclidean Distance . . . . . . . . . . . . 107
3.7.5 More LSH Families for Euclidean Spaces . . . . . . . . . . 108
3.7.6 Exercises for Section 3.7 . . . . . . . . . . . . . . . . . . . 109
3.8 Applications of Locality-Sensitive Hashing . . . . . . . . . . . . . 110
3.8.1 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . 110
3.8.2 An Entity-Resolution Example . . . . . . . . . . . . . . . 111
3.8.3 Validating Record Matches . . . . . . . . . . . . . . . . . 112
3.8.4 Matching Fingerprints . . . . . . . . . . . . . . . . . . . . 113
3.8.5 An LSH Family for Fingerprint Matching . . . . . . . . . . 114
3.8.6 Similar News Articles . . . . . . . . . . . . . . . . . . . . 115
3.8.7 Exercises for Section 3.8 . . . . . . . . . . . . . . . . . . . 117
3.9 Methods for High Degrees of Similarity . . . . . . . . . . . . . . 118
3.9.1 Finding Identical Items . . . . . . . . . . . . . . . . . . . 118
3.9.2 Representing Sets as Strings . . . . . . . . . . . . . . . . . 118
3.9.3 Length-Based Filtering . . . . . . . . . . . . . . . . . . . . 119
3.9.4 Prefix Indexing . . . . . . . . . . . . . . . . . . . . . . . . 119
3.9.5 Using Position Information . . . . . . . . . . . . . . . . . 121
3.9.6 Using Position and Length in Indexes . . . . . . . . . . . 122
3.9.7 Exercises for Section 3.9 . . . . . . . . . . . . . . . . . . . 125
3.10 Summary of Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 126
3.11 References for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . 128
4 Mining Data Streams 131
4.1 The Stream Data Model . . . . . . . . . . . . . . . . . . . . . . . 131
4.1.1 A Data-Stream-Management System . . . . . . . . . . . . 132
4.1.2 Examples of Stream Sources . . . . . . . . . . . . . . . . . 133
4.1.3 Stream Queries . . . . . . . . . . . . . . . . . . . . . . . . 134
4.1.4 Issues in Stream Processing . . . . . . . . . . . . . . . . . 135
4.2 Sampling Data in a Stream . . . . . . . . . . . . . . . . . . . . . 136
4.2.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . 136
4.2.2 Obtaining a Representative Sample . . . . . . . . . . . . . 137
4.2.3 The General Sampling Problem . . . . . . . . . . . . . . . 137
4.2.4 Varying the Sample Size . . . . . . . . . . . . . . . . . . . 138
4.2.5 Exercises for Section 4.2 . . . . . . . . . . . . . . . . . . . 138
4.3 Filtering Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.3.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . 139
4.3.2 The Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . 140
4.3.3 Analysis of Bloom Filtering . . . . . . . . . . . . . . . . . 140
4.3.4 Exercises for Section 4.3 . . . . . . . . . . . . . . . . . . . 141
4.4 Counting Distinct Elements in a Stream . . . . . . . . . . . . . . 142
4.4.1 The Count-Distinct Problem . . . . . . . . . . . . . . . . 142
4.4.2 The Flajolet-Martin Algorithm . . . . . . . . . . . . . . . 143
4.4.3 Combining Estimates . . . . . . . . . . . . . . . . . . . . 144
4.4.4 Space Requirements . . . . . . . . . . . . . . . . . . . . . 144
4.4.5 Exercises for Section 4.4 . . . . . . . . . . . . . . . . . . . 145
4.5 Estimating Moments . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.5.1 Definition of Moments . . . . . . . . . . . . . . . . . . . . 145
4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments . . . 146
4.5.3 Why the Alon-Matias-Szegedy Algorithm Works . . . . . 147
4.5.4 Higher-Order Moments . . . . . . . . . . . . . . . . . . . 148
4.5.5 Dealing With Infinite Streams . . . . . . . . . . . . . . . . 148
4.5.6 Exercises for Section 4.5 . . . . . . . . . . . . . . . . . . . 149
4.6 Counting Ones in a Window . . . . . . . . . . . . . . . . . . . . . 150
4.6.1 The Cost of Exact Counts . . . . . . . . . . . . . . . . . . 151
4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm . . . . . . . 151
4.6.3 Storage Requirements for the DGIM Algorithm . . . . . . 153
4.6.4 Query Answering in the DGIM Algorithm . . . . . . . . . 153
4.6.5 Maintaining the DGIM Conditions . . . . . . . . . . . . . 154
4.6.6 Reducing the Error . . . . . . . . . . . . . . . . . . . . . . 155
4.6.7 Extensions to the Counting of Ones . . . . . . . . . . . . 156
4.6.8 Exercises for Section 4.6 . . . . . . . . . . . . . . . . . . . 157
4.7 Decaying Windows . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.7.1 The Problem of Most-Common Elements . . . . . . . . . 157
4.7.2 Definition of the Decaying Window . . . . . . . . . . . . . 158
4.7.3 Finding the Most Popular Elements . . . . . . . . . . . . 159
4.8 Summary of Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . 160
4.9 References for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . 161
5 Link Analysis 163
5.1 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.1.1 Early Search Engines and Term Spam . . . . . . . . . . . 164
5.1.2 Definition of PageRank . . . . . . . . . . . . . . . . . . . 165
5.1.3 Structure of the Web . . . . . . . . . . . . . . . . . . . . . 169
5.1.4 Avoiding Dead Ends . . . . . . . . . . . . . . . . . . . . . 170
5.1.5 Spider Traps and Taxation . . . . . . . . . . . . . . . . . 173
5.1.6 Using PageRank in a Search Engine . . . . . . . . . . . . 175
5.1.7 Exercises for Section 5.1 . . . . . . . . . . . . . . . . . . . 175
5.2 Efficient Computation of PageRank . . . . . . . . . . . . . . . . . 177
5.2.1 Representing Transition Matrices . . . . . . . . . . . . . . 178
5.2.2 PageRank Iteration Using MapReduce . . . . . . . . . . . 179
5.2.3 Use of Combiners to Consolidate the Result Vector . . . . 179
5.2.4 Representing Blocks of the Transition Matrix . . . . . . . 180
5.2.5 Other Efficient Approaches to PageRank Iteration . . . . 181
5.2.6 Exercises for Section 5.2 . . . . . . . . . . . . . . . . . . . 183
5.3 Topic-Sensitive PageRank . . . . . . . . . . . . . . . . . . . . . . 183
5.3.1 Motivation for Topic-Sensitive PageRank . . . . . . . . . . 183
5.3.2 Biased Random Walks . . . . . . . . . . . . . . . . . . . . 184
5.3.3 Using Topic-Sensitive PageRank . . . . . . . . . . . . . . 185
5.3.4 Inferring Topics from Words . . . . . . . . . . . . . . . . . 186
5.3.5 Exercises for Section 5.3 . . . . . . . . . . . . . . . . . . . 187
5.4 Link Spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.4.1 Architecture of a Spam Farm . . . . . . . . . . . . . . . . 187
5.4.2 Analysis of a Spam Farm . . . . . . . . . . . . . . . . . . 189
5.4.3 Combating Link Spam . . . . . . . . . . . . . . . . . . . . 190
5.4.4 TrustRank . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.4.5 Spam Mass . . . . . . . . . . . . . . . . . . . . . . . . . . 191
5.4.6 Exercises for Section 5.4 . . . . . . . . . . . . . . . . . . . 191
5.5 Hubs and Authorities . . . . . . . . . . . . . . . . . . . . . . . . 192
5.5.1 The Intuition Behind HITS . . . . . . . . . . . . . . . . . 192
5.5.2 Formalizing Hubbiness and Authority . . . . . . . . . . . 193
5.5.3 Exercises for Section 5.5 . . . . . . . . . . . . . . . . . . . 196
5.6 Summary of Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . 196
5.7 References for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . 200
6 Frequent Itemsets 201
6.1 The Market-Basket Model . . . . . . . . . . . . . . . . . . . . . . 202
6.1.1 Definition of Frequent Itemsets . . . . . . . . . . . . . . . 202
6.1.2 Applications of Frequent Itemsets . . . . . . . . . . . . . 204
6.1.3 Association Rules . . . . . . . . . . . . . . . . . . . . . . . 205
6.1.4 Finding Association Rules with High Confidence . . . . . 207
6.1.5 Exercises for Section 6.1 . . . . . . . . . . . . . . . . . . . 207
6.2 Market Baskets and the A-Priori Algorithm . . . . . . . . . . . . 209
6.2.1 Representation of Market-Basket Data . . . . . . . . . . . 209
6.2.2 Use of Main Memory for Itemset Counting . . . . . . . . 210
6.2.3 Monotonicity of Itemsets . . . . . . . . . . . . . . . . . . 212
6.2.4 Tyranny of Counting Pairs . . . . . . . . . . . . . . . . . 213
6.2.5 The A-Priori Algorithm . . . . . . . . . . . . . . . . . . . 213
6.2.6 A-Priori for All Frequent Itemsets . . . . . . . . . . . . . 214
6.2.7 Exercises for Section 6.2 . . . . . . . . . . . . . . . . . . . 217
6.3 Handling Larger Datasets in Main Memory . . . . . . . . . . . . 218
6.3.1 The Algorithm of Park, Chen, and Yu . . . . . . . . . . . 218
6.3.2 The Multistage Algorithm . . . . . . . . . . . . . . . . . . 220
6.3.3 The Multihash Algorithm . . . . . . . . . . . . . . . . . . 222
6.3.4 Exercises for Section 6.3 . . . . . . . . . . . . . . . . . . . 224
6.4 Limited-Pass Algorithms . . . . . . . . . . . . . . . . . . . . . . . 226
6.4.1 The Simple, Randomized Algorithm . . . . . . . . . . . . 226
6.4.2 Avoiding Errors in Sampling Algorithms . . . . . . . . . . 227
6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe . . . 228
6.4.4 The SON Algorithm and MapReduce . . . . . . . . . . . 229
6.4.5 Toivonen’s Algorithm . . . . . . . . . . . . . . . . . . . . 230
6.4.6 Why Toivonen’s Algorithm Works . . . . . . . . . . . . . 231
6.4.7 Exercises for Section 6.4 . . . . . . . . . . . . . . . . . . . 232
6.5 Counting Frequent Items in a Stream . . . . . . . . . . . . . . . . 232
6.5.1 Sampling Methods for Streams . . . . . . . . . . . . . . . 233
6.5.2 Frequent Itemsets in Decaying Windows . . . . . . . . . . 234
6.5.3 Hybrid Methods . . . . . . . . . . . . . . . . . . . . . . . 235
6.5.4 Exercises for Section 6.5 . . . . . . . . . . . . . . . . . . . 235
6.6 Summary of Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . 236
6.7 References for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . 238
7 Clustering 241
7.1 Introduction to Clustering Techniques . . . . . . . . . . . . . . . 241
7.1.1 Points, Spaces, and Distances . . . . . . . . . . . . . . . . 241
7.1.2 Clustering Strategies . . . . . . . . . . . . . . . . . . . . . 243
7.1.3 The Curse of Dimensionality . . . . . . . . . . . . . . . . 244
7.1.4 Exercises for Section 7.1 . . . . . . . . . . . . . . . . . . . 245
7.2 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . 245
7.2.1 Hierarchical Clustering in a Euclidean Space . . . . . . . 246
7.2.2 Efficiency of Hierarchical Clustering . . . . . . . . . . . . 248
7.2.3 Alternative Rules for Controlling Hierarchical Clustering . . . 249
7.2.4 Hierarchical Clustering in Non-Euclidean Spaces . . . . . 252
7.2.5 Exercises for Section 7.2 . . . . . . . . . . . . . . . . . . . 253
7.3 K-means Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 254
7.3.1 K-Means Basics . . . . . . . . . . . . . . . . . . . . . . . . 255
7.3.2 Initializing Clusters for K-Means . . . . . . . . . . . . . . 255
7.3.3 Picking the Right Value of k . . . . . . . . . . . . . . . . 256
7.3.4 The Algorithm of Bradley, Fayyad, and Reina . . . . . . . 257
7.3.5 Processing Data in the BFR Algorithm . . . . . . . . . . 259
7.3.6 Exercises for Section 7.3 . . . . . . . . . . . . . . . . . . . 262
7.4 The CURE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 262
7.4.1 Initialization in CURE . . . . . . . . . . . . . . . . . . . . 263
7.4.2 Completion of the CURE Algorithm . . . . . . . . . . . . 264
7.4.3 Exercises for Section 7.4 . . . . . . . . . . . . . . . . . . . 265
7.5 Clustering in Non-Euclidean Spaces . . . . . . . . . . . . . . . . 266
7.5.1 Representing Clusters in the GRGPF Algorithm . . . . . 266
7.5.2 Initializing the Cluster Tree . . . . . . . . . . . . . . . . . 267
7.5.3 Adding Points in the GRGPF Algorithm . . . . . . . . . 268
7.5.4 Splitting and Merging Clusters . . . . . . . . . . . . . . . 269
7.5.5 Exercises for Section 7.5 . . . . . . . . . . . . . . . . . . . 270
7.6 Clustering for Streams and Parallelism . . . . . . . . . . . . . . . 270
7.6.1 The Stream-Computing Model . . . . . . . . . . . . . . . 271
7.6.2 A Stream-Clustering Algorithm . . . . . . . . . . . . . . . 271
7.6.3 Initializing Buckets . . . . . . . . . . . . . . . . . . . . . . 272
7.6.4 Merging Buckets . . . . . . . . . . . . . . . . . . . . . . . 272
7.6.5 Answering Queries . . . . . . . . . . . . . . . . . . . . . . 275
7.6.6 Clustering in a Parallel Environment . . . . . . . . . . . . 275
7.6.7 Exercises for Section 7.6 . . . . . . . . . . . . . . . . . . . 276
7.7 Summary of Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . 276
7.8 References for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . 280
8 Advertising on the Web 281
8.1 Issues in On-Line Advertising . . . . . . . . . . . . . . . . . . . . 281
8.1.1 Advertising Opportunities . . . . . . . . . . . . . . . . . . 281
8.1.2 Direct Placement of Ads . . . . . . . . . . . . . . . . . . . 282
8.1.3 Issues for Display Ads . . . . . . . . . . . . . . . . . . . . 283
8.2 On-Line Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 284
8.2.1 On-Line and Off-Line Algorithms . . . . . . . . . . . . . . 284
8.2.2 Greedy Algorithms . . . . . . . . . . . . . . . . . . . . . . 285
8.2.3 The Competitive Ratio . . . . . . . . . . . . . . . . . . . 286
8.2.4 Exercises for Section 8.2 . . . . . . . . . . . . . . . . . . . 286
8.3 The Matching Problem . . . . . . . . . . . . . . . . . . . . . . . . 287
8.3.1 Matches and Perfect Matches . . . . . . . . . . . . . . . . 287
8.3.2 The Greedy Algorithm for Maximal Matching . . . . . . . 288
8.3.3 Competitive Ratio for Greedy Matching . . . . . . . . . . 289
8.3.4 Exercises for Section 8.3 . . . . . . . . . . . . . . . . . . . 290
8.4 The Adwords Problem . . . . . . . . . . . . . . . . . . . . . . . . 290
8.4.1 History of Search Advertising . . . . . . . . . . . . . . . . 291
8.4.2 Definition of the Adwords Problem . . . . . . . . . . . . . 291
8.4.3 The Greedy Approach to the Adwords Problem . . . . . . 292
8.4.4 The Balance Algorithm . . . . . . . . . . . . . . . . . . . 293
8.4.5 A Lower Bound on Competitive Ratio for Balance . . . . 294
8.4.6 The Balance Algorithm with Many Bidders . . . . . . . . 296
8.4.7 The Generalized Balance Algorithm . . . . . . . . . . . . 297
8.4.8 Final Observations About the Adwords Problem . . . . . 298
8.4.9 Exercises for Section 8.4 . . . . . . . . . . . . . . . . . . . 299
8.5 Adwords Implementation . . . . . . . . . . . . . . . . . . . . . . 299
8.5.1 Matching Bids and Search Queries . . . . . . . . . . . . . 300
8.5.2 More Complex Matching Problems . . . . . . . . . . . . . 300
8.5.3 A Matching Algorithm for Documents and Bids . . . . . . 301
8.6 Summary of Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . 303
8.7 References for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . 305
9 Recommendation Systems 307
9.1 A Model for Recommendation Systems . . . . . . . . . . . . . . . 307
9.1.1 The Utility Matrix . . . . . . . . . . . . . . . . . . . . . . 308
9.1.2 The Long Tail . . . . . . . . . . . . . . . . . . . . . . . . 309
9.1.3 Applications of Recommendation Systems . . . . . . . . . 309
9.1.4 Populating the Utility Matrix . . . . . . . . . . . . . . . . 311
9.2 Content-Based Recommendations . . . . . . . . . . . . . . . . . . 312
9.2.1 Item Profiles . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.2.2 Discovering Features of Documents . . . . . . . . . . . . . 313
9.2.3 Obtaining Item Features From Tags . . . . . . . . . . . . 314
9.2.4 Representing Item Profiles . . . . . . . . . . . . . . . . . . 315
9.2.5 User Profiles . . . . . . . . . . . . . . . . . . . . . . . . . 316
9.2.6 Recommending Items to Users Based on Content . . . . . 317
9.2.7 Classification Algorithms . . . . . . . . . . . . . . . . . . 318
9.2.8 Exercises for Section 9.2 . . . . . . . . . . . . . . . . . . . 320
9.3 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . 321
9.3.1 Measuring Similarity . . . . . . . . . . . . . . . . . . . . . 322
9.3.2 The Duality of Similarity . . . . . . . . . . . . . . . . . . 324
9.3.3 Clustering Users and Items . . . . . . . . . . . . . . . . . 325
9.3.4 Exercises for Section 9.3 . . . . . . . . . . . . . . . . . . . 327
9.4 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . 328
9.4.1 UV-Decomposition . . . . . . . . . . . . . . . . . . . . . . 328
9.4.2 Root-Mean-Square Error . . . . . . . . . . . . . . . . . . 329
9.4.3 Incremental Computation of a UV-Decomposition . . . . 330
9.4.4 Optimizing an Arbitrary Element . . . . . . . . . . . . . . 332
9.4.5 Building a Complete UV-Decomposition Algorithm . . . . 334
9.4.6 Exercises for Section 9.4 . . . . . . . . . . . . . . . . . . . 336
9.5 The Netflix Challenge . . . . . . . . . . . . . . . . . . . . . . . . 337
9.6 Summary of Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . 338
9.7 References for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . 340
10 Mining Social-Network Graphs 343
10.1 Social Networks as Graphs . . . . . . . . . . . . . . . . . . . . . . 343
10.1.1 What is a Social Network? . . . . . . . . . . . . . . . . . 344
10.1.2 Social Networks as Graphs . . . . . . . . . . . . . . . . . 344
10.1.3 Varieties of Social Networks . . . . . . . . . . . . . . . . . 346
10.1.4 Graphs With Several Node Types . . . . . . . . . . . . . 347
10.1.5 Exercises for Section 10.1 . . . . . . . . . . . . . . . . . . 348
10.2 Clustering of Social-Network Graphs . . . . . . . . . . . . . . . . 349
10.2.1 Distance Measures for Social-Network Graphs . . . . . . . 349
10.2.2 Applying Standard Clustering Methods . . . . . . . . . . 349
10.2.3 Betweenness . . . . . . . . . . . . . . . . . . . . . . . . . . 351
10.2.4 The Girvan-Newman Algorithm . . . . . . . . . . . . . . . 351
10.2.5 Using Betweenness to Find Communities . . . . . . . . . 354
10.2.6 Exercises for Section 10.2 . . . . . . . . . . . . . . . . . . 356
10.3 Direct Discovery of Communities . . . . . . . . . . . . . . . . . . 357
10.3.1 Finding Cliques . . . . . . . . . . . . . . . . . . . . . . . . 357
10.3.2 Complete Bipartite Graphs . . . . . . . . . . . . . . . . . 357
10.3.3 Finding Complete Bipartite Subgraphs . . . . . . . . . . . 358
10.3.4 Why Complete Bipartite Graphs Must Exist . . . . . . . 359
10.3.5 Exercises for Section 10.3 . . . . . . . . . . . . . . . . . . 361
10.4 Partitioning of Graphs . . . . . . . . . . . . . . . . . . . . . . . . 361
10.4.1 What Makes a Good Partition? . . . . . . . . . . . . . . . 362
10.4.2 Normalized Cuts . . . . . . . . . . . . . . . . . . . . . . . 362
10.4.3 Some Matrices That Describe Graphs . . . . . . . . . . . 363
10.4.4 Eigenvalues of the Laplacian Matrix . . . . . . . . . . . . 364
10.4.5 Alternative Partitioning Methods . . . . . . . . . . . . . . 367
10.4.6 Exercises for Section 10.4 . . . . . . . . . . . . . . . . . . 368
10.5 Finding Overlapping Communities . . . . . . . . . . . . . . . . . 369
10.5.1 The Nature of Communities . . . . . . . . . . . . . . . . . 369
10.5.2 Maximum-Likelihood Estimation . . . . . . . . . . . . . . 369
10.5.3 The Affiliation-Graph Model . . . . . . . . . . . . . . . . 371
10.5.4 Avoiding the Use of Discrete Membership Changes . . . . 374
10.5.5 Exercises for Section 10.5 . . . . . . . . . . . . . . . . . . 375
10.6 Simrank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
10.6.1 Random Walkers on a Social Graph . . . . . . . . . . . . 376
10.6.2 Random Walks with Restart . . . . . . . . . . . . . . . . 377
10.6.3 Exercises for Section 10.6 . . . . . . . . . . . . . . . . . . 380
10.7 Counting Triangles . . . . . . . . . . . . . . . . . . . . . . . . . . 380
10.7.1 Why Count Triangles? . . . . . . . . . . . . . . . . . . . . 380
10.7.2 An Algorithm for Finding Triangles . . . . . . . . . . . . 381
10.7.3 Optimality of the Triangle-Finding Algorithm . . . . . . . 382
10.7.4 Finding Triangles Using MapReduce . . . . . . . . . . . . 383
10.7.5 Using Fewer Reduce Tasks . . . . . . . . . . . . . . . . . . 384
10.7.6 Exercises for Section 10.7 . . . . . . . . . . . . . . . . . . 385
10.8 Neighborhood Properties of Graphs . . . . . . . . . . . . . . . . . 386
10.8.1 Directed Graphs and Neighborhoods . . . . . . . . . . . . 386
10.8.2 The Diameter of a Graph . . . . . . . . . . . . . . . . . . 388
10.8.3 Transitive Closure and Reachability . . . . . . . . . . . . 389
10.8.4 Transitive Closure Via MapReduce . . . . . . . . . . . . . 390
10.8.5 Smart Transitive Closure . . . . . . . . . . . . . . . . . . 392
10.8.6 Transitive Closure by Graph Reduction . . . . . . . . . . 393
10.8.7 Approximating the Sizes of Neighborhoods . . . . . . . . 395
10.8.8 Exercises for Section 10.8 . . . . . . . . . . . . . . . . . . 397
10.9 Summary of Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . 398
10.10 References for Chapter 10 . . . . . . . . . . . . . . . . . . . . 402
11 Dimensionality Reduction 405
11.1 Eigenvalues and Eigenvectors of Symmetric Matrices . . . . . . . 406
11.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 406
11.1.2 Computing Eigenvalues and Eigenvectors . . . . . . . . . 407
11.1.3 Finding Eigenpairs by Power Iteration . . . . . . . . . . . 408
11.1.4 The Matrix of Eigenvectors . . . . . . . . . . . . . . . . . 411
11.1.5 Exercises for Section 11.1 . . . . . . . . . . . . . . . . . . 411
11.2 Principal-Component Analysis . . . . . . . . . . . . . . . . . . . 412
11.2.1 An Illustrative Example . . . . . . . . . . . . . . . . . . . 413
11.2.2 Using Eigenvectors for Dimensionality Reduction . . . . . 416
11.2.3 The Matrix of Distances . . . . . . . . . . . . . . . . . . . 417
11.2.4 Exercises for Section 11.2 . . . . . . . . . . . . . . . . . . 418
11.3 Singular-Value Decomposition . . . . . . . . . . . . . . . . . . . . 418
11.3.1 Definition of SVD . . . . . . . . . . . . . . . . . . . . . . 418
11.3.2 Interpretation of SVD . . . . . . . . . . . . . . . . . . . . 420
11.3.3 Dimensionality Reduction Using SVD . . . . . . . . . . . 422
11.3.4 Why Zeroing Low Singular Values Works . . . . . . . . . 423
11.3.5 Querying Using Concepts . . . . . . . . . . . . . . . . . . 425
11.3.6 Computing the SVD of a Matrix . . . . . . . . . . . . . . 426
11.3.7 Exercises for Section 11.3 . . . . . . . . . . . . . . . . . . 427
11.4 CUR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 428
11.4.1 Definition of CUR . . . . . . . . . . . . . . . . . . . . . . 429
11.4.2 Choosing Rows and Columns Properly . . . . . . . . . . . 430
11.4.3 Constructing the Middle Matrix . . . . . . . . . . . . . . 431
11.4.4 The Complete CUR Decomposition . . . . . . . . . . . . 432
11.4.5 Eliminating Duplicate Rows and Columns . . . . . . . . . 433
11.4.6 Exercises for Section 11.4 . . . . . . . . . . . . . . . . . . 434
11.5 Summary of Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . 434
11.6 References for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . 436
12 Large-Scale Machine Learning 439
12.1 The Machine-Learning Model . . . . . . . . . . . . . . . . . . . . 440
12.1.1 Training Sets . . . . . . . . . . . . . . . . . . . . . . . . . 440
12.1.2 Some Illustrative Examples . . . . . . . . . . . . . . . . . 440
12.1.3 Approaches to Machine Learning . . . . . . . . . . . . . . 443
12.1.4 Machine-Learning Architecture . . . . . . . . . . . . . . . 444
12.1.5 Exercises for Section 12.1 . . . . . . . . . . . . . . . . . . 447
12.2 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
12.2.1 Training a Perceptron with Zero Threshold . . . . . . . . 447
12.2.2 Convergence of Perceptrons . . . . . . . . . . . . . . . . . 451
12.2.3 The Winnow Algorithm . . . . . . . . . . . . . . . . . . . 451
12.2.4 Allowing the Threshold to Vary . . . . . . . . . . . . . . . 453
12.2.5 Multiclass Perceptrons . . . . . . . . . . . . . . . . . . . . 455
12.2.6 Transforming the Training Set . . . . . . . . . . . . . . . 456
12.2.7 Problems With Perceptrons . . . . . . . . . . . . . . . . . 457
12.2.8 Parallel Implementation of Perceptrons . . . . . . . . . . 458
12.2.9 Exercises for Section 12.2 . . . . . . . . . . . . . . . . . . 459
12.3 Support-Vector Machines . . . . . . . . . . . . . . . . . . . . . . 461
12.3.1 The Mechanics of an SVM . . . . . . . . . . . . . . . . . . 461
12.3.2 Normalizing the Hyperplane . . . . . . . . . . . . . . . . . 462
12.3.3 Finding Optimal Approximate Separators . . . . . . . . . 464
12.3.4 SVM Solutions by Gradient Descent . . . . . . . . . . . . 467
12.3.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 470
12.3.6 Parallel Implementation of SVM . . . . . . . . . . . . . . 471
12.3.7 Exercises for Section 12.3 . . . . . . . . . . . . . . . . . . 472
12.4 Learning from Nearest Neighbors . . . . . . . . . . . . . . . . . . 472
12.4.1 The Framework for Nearest-Neighbor Calculations . . . . 473
12.4.2 Learning with One Nearest Neighbor . . . . . . . . . . . . 473
12.4.3 Learning One-Dimensional Functions . . . . . . . . . . . . 474
12.4.4 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . 477
12.4.5 Dealing with High-Dimensional Euclidean Data . . . . . . 477
12.4.6 Dealing with Non-Euclidean Distances . . . . . . . . . . . 479
12.4.7 Exercises for Section 12.4 . . . . . . . . . . . . . . . . . . 479
12.5 Comparison of Learning Methods . . . . . . . . . . . . . . . . . . 480
12.6 Summary of Chapter 12 . . . . . . . . . . . . . . . . . . . . . . . 481
12.7 References for Chapter 12 . . . . . . . . . . . . . . . . . . . . . . 483