Introduction to
Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2021
Contents
¡ Introduction to Machine Learning & Data Mining
¡ Unsupervised learning
¡ Supervised learning
¡ Probabilistic modeling
¡ Data mining
¨ Association rule
¨ Apriori
¡ Practical advice
Association rule discovery
¡ Supermarket shelf management: Market-basket model
¨ Goal: identify items that are bought together by sufficiently
many customers.
¨ Approach: process the sales data collected with barcode
scanners to find dependencies among items.
¡ A classical rule:
¨ If someone buys diaper and milk,
then he/she is likely to buy beer.
¨ Don't be surprised if you find beer packs
next to diapers!
(Adapted from a lecture by Jure Leskovec)
The Market-Basket Model
¡ A large set of items
¨ E.g., things sold in a supermarket
¡ A large set of baskets
¡ Each basket is a small subset of items
¨ e.g., the things one customer buys on one day
¡ Association: a general many-to-many mapping between two kinds of things
¨ But we ask about connections among “items”, not “baskets”

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Association rules: approach
¡ Given a set of baskets
¡ Want to discover association rules
¨ People who bought {x, y, z} tend to buy {v, w}
¡ 2-step approach:
¨ Find frequent itemsets
¨ Generate association rules

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Rules discovered:
{Milk} ⟹ {Coke}
{Diaper, Milk} ⟹ {Beer}
Applications
¡ Items = products; baskets = sets of products someone
bought in one trip to the store
¡ Real market baskets: stores might keep TBs of data about
what customers buy together
¨ Tells how typical customers navigate stores, lets them position
tempting items
¨ Suggests tie-in “tricks”, e.g., run sale on diapers and raise the
price of beer
¨ Need the rule to occur frequently
Applications: Amazon
People who bought X also bought Y.
Frequent itemsets
¡ Question: find sets of items that appear together
“frequently” in baskets
¡ Support_count for itemset I: number of baskets containing
all items in I.
¡ Support for itemset I: fraction of baskets containing I.
¨ Support(I) = support_count(I)/Total number of transactions
¡ Given a support threshold s, the sets of items that have support at least s are called frequent itemsets.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Support of {Beer, Bread} = 2/5
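These definitions can be sketched in a few lines of Python. The basket list is the five-transaction table above; the function names (`support_count`, `support`) are illustrative, not from any particular library.

```python
# A minimal sketch of the support definitions above; the basket list
# is the 5-transaction table from this slide.

baskets = [
    {"Bread", "Coke", "Milk"},            # TID 1
    {"Beer", "Bread"},                    # TID 2
    {"Beer", "Coke", "Diaper", "Milk"},   # TID 3
    {"Beer", "Bread", "Diaper", "Milk"},  # TID 4
    {"Coke", "Diaper", "Milk"},           # TID 5
]

def support_count(itemset, baskets):
    """Number of baskets containing all items in `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def support(itemset, baskets):
    """Fraction of baskets containing `itemset`."""
    return support_count(itemset, baskets) / len(baskets)

print(support({"Beer", "Bread"}, baskets))  # 2/5 = 0.4
```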
Frequent itemsets: example
¡ Items = {Milk, Coke, Pepsi, Beer, Juice}
¡ Support threshold = 3/8
¨ B1 = {m, c, b} B2 = {m, p, j}
¨ B3 = {m, b} B4 = {c, j}
¨ B5 = {m, p, b} B6 = {m, c, b, j}
¨ B7 = {c, b, j} B8 = {b, c}
¡ Frequent itemsets: {m}, {c}, {b}, {j},
{m, b}, {b, c}, {c, j}
Association rules
¡ Association rules: If-then rules about the content of baskets
¡ {i1, i2, …, ik} ⟹ j: “if a basket contains all of {i1, i2, …, ik}, then
it is likely to contain j”
¡ In practice there are many rules, we want to find significant
or interesting rules.
¡ Confidence of an association rule I ⟹ j is the probability of j given I = {i1, i2, …, ik}:

confidence(I ⟹ j) = support(I ∪ {j}) / support(I) = support_count(I ∪ {j}) / support_count(I)
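The confidence formula follows directly from support counts; a minimal sketch on the basket table from the earlier slide (function names are illustrative):

```python
def support_count(itemset, baskets):
    """Number of baskets containing all items in `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """confidence(I => j) = support_count(I | {j}) / support_count(I)."""
    return support_count(I | {j}, baskets) / support_count(I, baskets)

baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# {Diaper, Milk} appears in 3 baskets, {Diaper, Milk, Beer} in 2,
# so confidence({Diaper, Milk} => Beer) = 2/3.
print(confidence({"Diaper", "Milk"}, "Beer", baskets))
```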
Interesting association rules
¡ Not all high-confidence rules are interesting
¨ The rule X ⟹ milk may have high confidence for many itemsets
X, because milk is purchased very often (independent of X)
and the confidence will be high
¡ Interest of an association rule I ⟹ j:
difference between its confidence and the fraction of
baskets that contain j.
Interest(I ⟹ j) = confidence(I ⟹ j) – Pr(j)
¨ Interesting rules are those whose interest is large in magnitude, positive or negative (typically above 0.5 in absolute value)
¨ Why?
Confidence and Interest: example
¨ B1 = {m, c, b} B2 = {m, p, j}
¨ B3 = {m, b} B4 = {c, j}
¨ B5 = {m, p, b} B6 = {m, c, b, j}
¨ B7 = {c, b, j} B8 = {b, c}
¡ Association rule: {m,b} ⟹ c
¨ Confidence = 2/4 = 0.5
¨ Interest = 0.5 − 5/8 = −1/8
(item c appears in 5/8 of the baskets)
¨ The interest is small in magnitude, so the rule is not very interesting
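The computation on this slide can be checked directly; the baskets are encoded as sets of single-character items, and `interest` is an illustrative helper:

```python
def interest(I, j, baskets):
    """interest(I => j) = confidence(I => j) - Pr(j)."""
    count = lambda s: sum(1 for b in baskets if s <= b)
    conf = count(I | {j}) / count(I)
    pr_j = count({j}) / len(baskets)
    return conf - pr_j

# B1..B8 from the slide, items as single characters
baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]

print(interest({"m", "b"}, "c", baskets))  # 0.5 - 5/8 = -0.125
```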
Finding association rules
¡ Problem: find all association rules with support ≥ s and
confidence ≥ c.
¨ Note: support of an association rule is the support of the set of
items on the left side.
¡ Hard part: finding the frequent itemsets
¨ If {i1, i2, …, ik} ⟹ j has high support and confidence, then both
{i1, i2, …, ik} and {i1, i2, …, ik, j} will be frequent.
confidence(I ⟹ j) = support(I ∪ {j}) / support(I)
Mining association rules
¡ Step 1: find all frequent itemsets I
¡ Step 2: rule generation
¨ For every subset A of I, generate rule A ⟹ I \ A
u Since I is frequent, A is also frequent
u Variant 1: Single pass to compute the rule confidence
confidence(A,B ⟹ C,D) = support(A,B,C,D) / support(A,B)
u Variant 2:
u Observation: If A,B,C ⟹ D is below confidence, so is A,B ⟹ C,D
u Can generate “bigger” rules from smaller ones!
¨ Output the rules above the confidence threshold
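Step 2 above can be sketched as follows: for a frequent itemset I, try every nonempty proper subset A as a left-hand side and keep the rules A ⟹ I \ A that clear the confidence threshold. Function names are illustrative; the basket data matches the example on the next slide (where B5 = {m, p, b, c}).

```python
from itertools import combinations

def support_count(itemset, baskets):
    # number of baskets containing all items in `itemset`
    return sum(1 for b in baskets if itemset <= b)

def rules_from_itemset(I, baskets, min_conf):
    """For frequent itemset I, emit every rule A => I \\ A whose
    confidence support_count(I) / support_count(A) clears min_conf."""
    rules = []
    for r in range(1, len(I)):
        for lhs in combinations(sorted(I), r):
            A = frozenset(lhs)
            conf = support_count(I, baskets) / support_count(A, baskets)
            if conf >= min_conf:
                rules.append((A, I - A, conf))
    return rules

baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpbc"), set("mcbj"), set("cbj"), set("bc")]

# From the frequent itemset {m, c, b} with confidence threshold 0.75:
# {b, m} => {c} (conf 3/4) and {c, m} => {b} (conf 3/3) survive.
for A, B, conf in rules_from_itemset(frozenset("mcb"), baskets, 0.75):
    print(sorted(A), "=>", sorted(B), conf)
```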
Example
¨ B1 = {m, c, b} B2 = {m, p, j}
¨ B3 = {m, b} B4 = {c, j}
¨ B5 = {m, p, b, c} B6 = {m, c, b, j}
¨ B7 = {c, b, j} B8 = {b, c}
¡ Support_count threshold s = 3, confidence c = 0.75
¨ Frequent itemsets: {b,m}, {b,c}, {c,m}, {c,j}, {m,c,b}
¨ Generate rules:
¨ b ⟹ m: c=4/6 m ⟹ b: c=4/5
¨ b ⟹ c: c=5/6 ...
¨ b,c ⟹ m: c=3/5 b,m ⟹ c: c=3/4
¨ b ⟹ c,m: c=3/6
Finding Frequent Itemsets
Itemsets: computation model
¡ Typically, data is kept in flat files rather than in a database system:
¨ Stored on disk
¨ Stored basket-by-basket
¨ Baskets are small, but we have many baskets and many items
u Expand baskets into pairs, triples, etc. as you read baskets
u Use k nested loops to generate all sets of size k
¡ On disk, items are represented as positive integers, and boundaries between baskets are −1.
¡ Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.
Computational model
¡ The true cost of mining disk-resident data is usually the
number of disk I/Os
¡ In practice, association-rule algorithms read the data in
passes – all baskets read in turn
¡ We measure the cost by the number of passes an
algorithm makes over the data
Main-memory bottleneck
¡ For many frequent-itemset algorithms, main-memory is the
critical resource
¨ As we read baskets, we need to count something, e.g.,
occurrences of pairs of items
¨ The number of different things we can count is limited by main
memory
¨ Swapping counts in/out is a disaster (why?)
Finding Frequent Pairs
¡ The hardest problem often turns out to be finding the
frequent pairs of items {i1, i2}
¨ Why?
¨ Frequent pairs are common, frequent triples are rare.
¨ Probability of being frequent drops exponentially with size,
number of sets grows more slowly with size.
¡ Let’s first concentrate on pairs, then extend to larger sets
¡ The approach:
¨ We always need to generate all the itemsets.
¨ But we would only like to count (keep track of) those itemsets
that in the end turn out to be frequent.
Naïve algorithm
¡ Naïve approach to finding frequent pairs
¡ Read file once, counting in main memory the occurrences
of each pair:
¨ If a basket has n items, then we need to generate n(n-1)/2
pairs.
¡ Fails if (#items)² exceeds main memory
¡ In practice, #items can be
¨ 100K (Wal-Mart) or 10B (Web pages)
¨ Suppose 10⁵ items and 4-byte integer counts
¨ Number of pairs of items: 10⁵(10⁵ − 1)/2 ≈ 5×10⁹
¨ Therefore, about 2×10¹⁰ bytes (20 GB) of memory needed
Counting pairs in memory
¡ Approach 1: count all pairs using a matrix
¡ Approach 2: keep a table of triples
{i, j, c} = “the count of the pair of items {i, j} is c”
¨ If integers and item ids are 4 bytes, we need approximately 12
bytes for pairs with count > 0.
¨ Plus some additional overhead for the hash table.
¡ Note:
¨ Approach 1 only requires 4 bytes per pair
¨ Approach 2 uses 12 bytes per pair
(but only for pairs with count > 0)
Comparing the two approaches
[Figure: the triangular matrix uses 4 bytes per pair; the triples method uses 12 bytes per occurring pair]
Comparing the two approaches
¡ Approach 1: triangular matrix
¨ n = total number of items
¨ Count pair of items {i,j} only if i < j
¨ Keep pair counts in lexicographic order:
{1,2}, {1,3},..., {1,n}, {2,3}, {2,4},...,{2,n}, {3,4},...
¨ Pair {i, j} is at position (i − 1)(n − i/2) + j − i
¨ Total number of pairs n(n − 1)/2; total bytes ≈ 2n²
¨ The triangular matrix requires 4 bytes per pair
¡ Approach 2 uses 12 bytes per occurring pair
(but only for pairs with count > 0)
¨ Beats Approach 1 if less than 1/3 of possible pairs actually
occur.
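The position formula can be checked with a few lines of Python (1-based item ids and positions, as on the slide; `pair_index` is an illustrative name):

```python
def pair_index(i, j, n):
    """1-based position of pair {i, j} (with 1 <= i < j <= n) in the
    lexicographic order {1,2}, {1,3}, ..., {1,n}, {2,3}, ..."""
    assert 1 <= i < j <= n
    # (i-1)*(n - i/2) is always a whole number: either i is even
    # (so i/2 is an integer) or i-1 is even.
    return int((i - 1) * (n - i / 2)) + j - i

n = 5
positions = [pair_index(i, j, n) for i in range(1, n)
             for j in range(i + 1, n + 1)]
print(positions)  # [1, 2, ..., 10]: each of the n(n-1)/2 pairs gets a slot
```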
Comparing the two approaches
¡ Approach 1 (triangular matrix) keeps pair counts in lexicographic order, as on the previous slide, but:
¨ Problem: the pair counts do not fit into main memory if we have too many items.
¡ Can we do better?
A-Priori algorithm (1)
¡ A two-pass approach called A-Priori (by Agrawal and Srikant, 1994) limits the need for main memory
¡ Key idea: monotonicity
¨ If a set I of items appears at least s times, so does every subset J of I.
¡ Contrapositive (for pairs):
¨ If item i has support less than s, then no pair including i can have support ≥ s
¡ So, how does A-Priori find frequent pairs?
A-Priori algorithm (2)
¡ Pass 1: read baskets and count in main memory the
occurrences of each individual item.
¨ Requires only memory proportional to #items
¡ Items with support at least s are the frequent items
¡ Pass 2: Read baskets again and count in main memory
only those pairs where both elements are frequent (from
Pass 1)
¨ Requires memory proportional to the square of the number of frequent items (for counts)
¨ Plus a list of the frequent items
(so you know what must be counted)
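The two passes can be sketched as follows; this is a minimal in-memory version (real implementations stream baskets from disk), and the function names are illustrative:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs (support_count >= s)."""
    # Pass 1: count individual items; keep the frequent ones
    item_counts = Counter(i for b in baskets for i in b)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs whose members are both frequent
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(frequent & b), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}

# Baskets from the earlier frequent-itemsets example, threshold s = 3;
# the frequent pairs are {b,c}, {b,m}, and {c,j}.
baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]
print(apriori_pairs(baskets, 3))
```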
Main-Memory: picture of A-Priori
Pass 1 (in main memory): counts of individual items.
Pass 2 (in main memory): the frequent items, plus counts of pairs of frequent items (the candidate pairs).
(Figure adapted from Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu)
Details of A-Priori
¡ You can use the triangular matrix method with n = number of frequent items
¨ May save space compared with storing triples
¡ Trick: re-number the frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
(In Pass 2, main memory holds the old item numbers of the frequent items and the counts of pairs of frequent items.)
Frequent triples, etc.
¡ For each k, we construct two sets of k-tuples (sets of size k):
¨ Ck = candidate k-tuples = those that might be frequent sets (support ≥ s), based on information from the pass for k − 1
¨ Lk = the set of truly frequent k-tuples
¡ The passes alternate filtering and construction:
C1 (all items) → count the items → L1 → construct (all pairs of items from L1) → C2 → count the pairs → L2 → construct (to be explained) → C3 → …
A-Priori: example
¡ Hypothetical steps of the A-Priori algorithm
¨ Generate C1 ={{b} {c} {j} {m} {n} {p}}
¨ Count the support of itemsets in C1
¨ Prune non-frequent to get L1 = { b, c, j, m }
¨ Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
¨ Count the support of itemsets in C2
¨ Prune non-frequent to get L2 = { {b,m} {b,c} {c,m} {c,j} }
¨ Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} }
¨ Count the support of itemsets in C3
¨ Prune non-frequent to get L3 = { {b,c,m} }
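The full level-wise loop sketched on this slide might look like this: a minimal in-memory sketch, where candidate generation joins pairs of frequent (k−1)-sets and applies the monotonicity prune before counting.

```python
from itertools import combinations

def apriori(baskets, s):
    """Level-wise A-Priori: alternate constructing candidates C_k and
    filtering them down to the truly frequent L_k."""
    def freq(candidates):
        return {I for I in candidates
                if sum(1 for b in baskets if I <= b) >= s}

    # C1 = every individual item; L1 = the frequent ones
    L = freq({frozenset([i]) for b in baskets for i in b})
    result = set(L)
    k = 2
    while L:
        # C_k: unions of two frequent (k-1)-sets that give a k-set ...
        Ck = {a | c for a in L for c in L if len(a | c) == k}
        # ... pruned by monotonicity: all (k-1)-subsets must be frequent
        Ck = {I for I in Ck
              if all(frozenset(sub) in L for sub in combinations(I, k - 1))}
        L = freq(Ck)
        result |= L
        k += 1
    return result

baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]
# With support_count threshold 3: four frequent singletons
# ({b}, {c}, {j}, {m}) and three frequent pairs; no frequent triples.
print(sorted(sorted(I) for I in apriori(baskets, 3)))
```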
A-Priori for All frequent itemsets
¡ One pass for each k (itemset size)
¡ Needs memory to count each candidate k–tuple
¡ For typical market-basket data and reasonable support
(e.g., 1%), k = 2 requires the most memory
¡ Many possible extensions:
¨ Association rules with intervals:
u For example: Men over 65 have 2 cars
¨ Association rules when items are in a taxonomy
u Bread, Butter ⟹ FruitJam
u BakedGoods, MilkProduct ⟹ PreservedGoods
¨ Lower the support threshold s as itemsets get bigger