2/16/2017
Bloom Filters
References
A. Broder and M. Mitzenmacher, “Network
Network applications of Bloom
filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509,
2004.
Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache:
A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM
Transactions on Networking, Vol. 8, No. 3, June 2000.
o Origin
O i in of
f counting
ntin Bl
Bloom
m filt
filterss
2/16/2017 Bloom Filters (Simon S. Lam) 1
1
2/16/2017
Origin and applications
Randomized data structure introduced by
Burton Bloom [CACM 1970]
o It represents a set for membership queries, with
false positives
o Probability of false positive can be controlled by
design parameters
o When space efficiency is important, a Bloom filter
ma be used if the effect of
may f false p
positives
sitives can be
mitigated.
First applications in dictionaries and
databases
2/16/2017 Bloom Filters (Simon S. Lam) 2
2
2/16/2017
First application in networking:
distributed cache (2000)
Proxy 2
Cache 2
Proxy 1 Summary 1
Cache 1 Summary y3
Summary 2
Summary 3
Proxy 3
Cache 3
Summary 1
Summary 2
N
Numerous applications
li ti iin networking
t ki since
i 2000
2/16/2017 Bloom Filters (Simon S. Lam) 3
3
2/16/2017
Standard Bloom Filter
A Bloom filter is an array of m bits representing
a set S = { x1, x2, … , xn} of n elements
o Array set to 0 initially
k independent hash functions h1, … , hk with range
{1 2
{1, 2, …, m}}
o Assume that each hash function maps each item in the
universe to a random number uniformly over the range
{1, 2, …, m}
For each element x in S, the bit hi(x) in the array
i sett tto 11, f
is for 1 ≤ i ≤ k,
k
o A bit in the array may be set to 1 multiple times for
ff
different elements
m
2/16/2017 Bloom Filters (Simon S. Lam) 4
4
2/16/2017
A Bloom filter example
((three hash functions))
Insert X1 and X2
Check Y1 and Y2
2/16/2017 Bloom Filters (Simon S. Lam) 5
5
2/16/2017
Standard Bloom Filter (cont.)
To check membership of y in S, check
whether hi(y), 1≤i≤k,
≤ ≤k, are all set to 1
o If not, y is definitely not in S
o Else, we conclude that y is in S, but sometimes this
conclusion is wrong (false positive)
For many applications, false positives are
acceptable
t bl as long
l as the
th probability
b bilit of
fa
false positive is small enough
We will assume that kn < m
2/16/2017 Bloom Filters (Simon S. Lam) 6
6
2/16/2017
False positive probability
After all members of S have been hashed to a Bloom
filter, the probability that a specific bit is still 0 is
1 kn − kn / m
p ' = (1 − ) e = p
m
For a non member, it may be found to be a member
of S (all of its k bits are nonzero) with false positive
probability
(1 − p ') k
(1 − p ) k
2/16/2017 Bloom Filters (Simon S. Lam) 7
7
2/16/2017
False positive probability (cont.)
Define
1 kn k
f ' = (1 − p ') = (1 − (1 − ) )
k
m
f = (1 − p ) k = (1 − e − kn / m ) k
Two competing forces as k increases
o Larger k ->> (1 − p ')) k is smaller for a fixed p’
p
o Larger k -> p’= (1 − 1 / m ) kn is smaller -> 1-p’ larger
2/16/2017 Bloom Filters (Simon S. Lam) 8
8
2/16/2017
False positive rate vs. k
m
Number of bits per member =8
n
Number of
2/16/2017 Bloom Filters (Simon S. Lam) 9
9
2/16/2017
Optimal number k from derivative
Rewrite f as
f = exp(ln(1 − e − kn / m ) k ) = exp( k ln(1 − e − kn / m ))
Let g = k ln(1 − e − kn / m )
Minimizingg g will minimize f = exp(
p( g )
− kn / m
∂g − kn / m k ∂ (1 − e )
= ln(1 − e )+
∂k 1 − e − kn / m ∂k
− kn / m k n − kn / m
= ln(1 − e )+ − kn / m
e = − ln(2) + ln(2) = 0
1− e m
if we plug k = ( m / n ) ln 2 which is optimal
( is
(It i in
i fact
f a global
l b l optimum)
i )
2/16/2017 Bloom Filters (Simon S. Lam) 10
10
2/16/2017
Optimal k from symmetry
− kn / m
Alternatively, from p=e we get
m
k = − ln( p )
n
From previous slide, we have
− kn / m m
g = k ln(1 − e ) = − ln( p ) ln(1 − p )
n
From above, symmetry indicates that the
minimum value for g occurs when p=1/2.
p
Thus m m
kopt = − ln(1 / 2) = ln(2)
n n
2/16/2017 Bloom Filters (Simon S. Lam) 11
11
2/16/2017
Optimal k from symmetry
using the precise probability of false positive
f ' = ((1 − p ')) k = exp( ( − p '))
p( k ln(1 ))
From p ' = ((1 − 1 / m ) kn , solvingg for k
1
k= ln( p ')
l (1 − 1 / m )
n ln(1
Let g ' = k ln(1
( − p ')) ((in equation
q for f ' above))
1
= ln( p ') ln(1 − p ')
n ln(1 − 1 / m )
2/16/2017 Bloom Filters (Simon S. Lam) 12
12
2/16/2017
Using the precise probability of false
positive to get
p g optimal
p k (cont.)
( )
From previous slide
1
g' = ln( p ') ln(1 − p ')
n ln(1 − 1 / m )
By symmetry, g’ (also f’) minimized at p’=1/2
Optimal k is
1 1
k 'opt = ln( p ') = ln(1 / 2)
n ln(1 − 1 / m ) n ln(1 − 1 / m )
2/16/2017 Bloom Filters (Simon S. Lam) 13
13
2/16/2017
Optimal number of hash functions
Using k = m ln(2) the false positive rate is
opt
n
m m
ln(2)
( ) ln(2)
( )
(1 − p ) n
= (0.5) n
(0.6185) m / n , where ln(2) = 0.6931
In practice, k should be an integer. May choose an integer
value
l smaller
ll than
h kopt to reduce
d h
hashing
hi overhead
h d
m/n denotes
False positive rate
bits per entry
2/16/2017 Bloom Filters (Simon S. Lam) 14
14
2/16/2017
False positive rate vs. bits per entry
False
positive
rate 4 hash functions
Using optimal number
of hash functions
m/n
2/16/2017 Bloom Filters (Simon S. Lam) 15
15
2/16/2017
Standard Bloom Filter tricks
Two Bloom filters representing sets S1 and
S2 with the same number of bits and using
g
the same hash functions.
o A Bloom filter that represents the union of S1 and
S2 can be obtained by taking the OR of the bit
vectors
A Bloom filter can be halved in size. Suppose
the
h size
i is
i a power off22.
o Just OR the first and second halves of the bit
vector
o When hashing to do a lookup, the highest order bit
is masked
Notation: OR denotes bitwise or
2/16/2017 Bloom Filters (Simon S. Lam) 16
16
2/16/2017
Counting Bloom filters
Proposed by Fan et al. [2000] for distributed
cach ng
caching
Every entry in a counting Bloom filter is a small
counter ((rather than a single
g bit).
)
o When an item is inserted into the set, the
corresponding counters are each incremented by 1
o When
h an item is deleted
d l d from
f the
h set, the
h
corresponding counters are each decremented by 1
To avoid counter overflow,
overflow its size must be
sufficiently large. It was found that 4 bits per
counter
u are enough.
ug .
2/16/2017 Bloom Filters (Simon S. Lam) 17
17
2/16/2017
Counter overflow probability
Consider a set of n elements, k hash
functions, and m counters
o C(i) is the count for the ith counter
nk − j
1
j
nk 1
P[c(i ) = j ] = 1 −
j m m
nk 1
P[c(i ) ≥ j ] ≤ j
j m
j
enk
≤ (a very loose upper bound)
jm
2/16/2017 Bloom Filters (Simon S. Lam) 18
18
2/16/2017
Counter overflow probability (cont.)
Choose k such that k ≤ m/n (ln 2)
Then j j
enk e ln 2
P[c(i ) ≥ j ] ≤ ≤
jm j
j
e ln 2
P[max c(i ) ≥ j ] ≤ m for some i
1≤i ≤ m
j
Using 4 bits, each counter counts from 0 to
15
−15
P[max c(i ) ≥ 16] ≤ m × 1.37 × 10
1≤i ≤ m
2/16/2017 Bloom Filters (Simon S. Lam) 19
19
2/16/2017
Counter overflow consequences
When a counter does overflow, it may be left
at its
ts maximum
max mum value.
This can later cause a false negative only if
eventuallyy the counter goes
g down to 0 when it
should have remain at nonzero.
The expected
p time to this event is very
y large
g
but is something we need to keep in mind for
any application that does not allow false
negatives
ti
2/16/2017 Bloom Filters (Simon S. Lam) 20
20
2/16/2017
Conclusions
Wherever a list or set is used, and space is at
a premium,
prem um, a Bloom filter
f lter may be used if
f the
effect of false positives can be mitigated
o No false negative
With a counting Bloom filter, false negatives
are possible, albeit highly unlikely
2/16/2017 Bloom Filters (Simon S. Lam) 21
21
2/16/2017
The End
2/16/2017 Bloom Filters (Simon S. Lam) 22
22