0% found this document useful (0 votes)
127 views22 pages

Bloom Filters: References

The document summarizes Bloom filters, a space-efficient probabilistic data structure for representing a set in order to support membership queries. It describes how Bloom filters work using hash functions to set bits in an array, and how false positives can occur but their probability can be controlled. The document discusses the original application of Bloom filters in dictionaries and databases, the first networking application in 2000 for distributed caching, and how Bloom filters have found numerous applications in networking since. It also analyzes the optimal number of hash functions to use to minimize the false positive rate.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
127 views22 pages

Bloom Filters: References

The document summarizes Bloom filters, a space-efficient probabilistic data structure for representing a set in order to support membership queries. It describes how Bloom filters work using hash functions to set bits in an array, and how false positives can occur but their probability can be controlled. The document discusses the original application of Bloom filters in dictionaries and databases, the first networking application in 2000 for distributed caching, and how Bloom filters have found numerous applications in networking since. It also analyzes the optimal number of hash functions to use to minimize the false positive rate.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

2/16/2017

Bloom Filters

References
A. Broder and M. Mitzenmacher, “Network
Network applications of Bloom
filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509,
2004.

Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache:


A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM
Transactions on Networking, Vol. 8, No. 3, June 2000.
o Origin
O i in of
f counting
ntin Bl
Bloom
m filt
filterss

2/16/2017 Bloom Filters (Simon S. Lam) 1

1
2/16/2017

Origin and applications


Randomized data structure introduced by
Burton Bloom [CACM 1970]
o It represents a set for membership queries, with
false positives
o Probability of false positive can be controlled by
design parameters
o When space efficiency is important, a Bloom filter
ma be used if the effect of
may f false p
positives
sitives can be
mitigated.

First applications in dictionaries and


databases

2/16/2017 Bloom Filters (Simon S. Lam) 2

2
2/16/2017

First application in networking:


distributed cache (2000)
Proxy 2
Cache 2
Proxy 1 Summary 1
Cache 1 Summary y3
Summary 2
Summary 3
Proxy 3
Cache 3
Summary 1
Summary 2

 N
Numerous applications
li ti iin networking
t ki since
i 2000
2/16/2017 Bloom Filters (Simon S. Lam) 3

3
2/16/2017

Standard Bloom Filter


 A Bloom filter is an array of m bits representing
a set S = { x1, x2, … , xn} of n elements
o Array set to 0 initially
 k independent hash functions h1, … , hk with range
{1 2
{1, 2, …, m}}
o Assume that each hash function maps each item in the
universe to a random number uniformly over the range
{1, 2, …, m}
 For each element x in S, the bit hi(x) in the array
i sett tto 11, f
is for 1 ≤ i ≤ k,
k
o A bit in the array may be set to 1 multiple times for
ff
different elements
m

2/16/2017 Bloom Filters (Simon S. Lam) 4

4
2/16/2017

A Bloom filter example


((three hash functions))

Insert X1 and X2

Check Y1 and Y2

2/16/2017 Bloom Filters (Simon S. Lam) 5

5
2/16/2017

Standard Bloom Filter (cont.)


To check membership of y in S, check
whether hi(y), 1≤i≤k,
≤ ≤k, are all set to 1
o If not, y is definitely not in S
o Else, we conclude that y is in S, but sometimes this
conclusion is wrong (false positive)
For many applications, false positives are
acceptable
t bl as long
l as the
th probability
b bilit of
fa
false positive is small enough

We will assume that kn < m

2/16/2017 Bloom Filters (Simon S. Lam) 6

6
2/16/2017

False positive probability


 After all members of S have been hashed to a Bloom
filter, the probability that a specific bit is still 0 is
1 kn − kn / m
p ' = (1 − )  e = p
m
 For a non member, it may be found to be a member
of S (all of its k bits are nonzero) with false positive
probability
(1 − p ') k
 (1 − p ) k

2/16/2017 Bloom Filters (Simon S. Lam) 7

7
2/16/2017

False positive probability (cont.)


Define

1 kn k
f ' = (1 − p ') = (1 − (1 − ) )
k

m
f = (1 − p ) k = (1 − e − kn / m ) k
 Two competing forces as k increases
o Larger k ->> (1 − p ')) k is smaller for a fixed p’
p

o Larger k -> p’= (1 − 1 / m ) kn is smaller -> 1-p’ larger

2/16/2017 Bloom Filters (Simon S. Lam) 8

8
2/16/2017

False positive rate vs. k


m
Number of bits per member =8
n

Number of

2/16/2017 Bloom Filters (Simon S. Lam) 9

9
2/16/2017

Optimal number k from derivative


Rewrite f as
f = exp(ln(1 − e − kn / m ) k ) = exp( k ln(1 − e − kn / m ))
Let g = k ln(1 − e − kn / m )
Minimizingg g will minimize f = exp(
p( g )
− kn / m
∂g − kn / m k ∂ (1 − e )
= ln(1 − e )+
∂k 1 − e − kn / m ∂k
− kn / m k n − kn / m
= ln(1 − e )+ − kn / m
e = − ln(2) + ln(2) = 0
1− e m
if we plug k = ( m / n ) ln 2 which is optimal
( is
(It i in
i fact
f a global
l b l optimum)
i )
2/16/2017 Bloom Filters (Simon S. Lam) 10

10
2/16/2017

Optimal k from symmetry


− kn / m
 Alternatively, from p=e we get
m
k = − ln( p )
n
From previous slide, we have
− kn / m m
g = k ln(1 − e ) = − ln( p ) ln(1 − p )
n
 From above, symmetry indicates that the
minimum value for g occurs when p=1/2.
p
Thus m m
kopt = − ln(1 / 2) = ln(2)
n n
2/16/2017 Bloom Filters (Simon S. Lam) 11

11
2/16/2017

Optimal k from symmetry


using the precise probability of false positive

f ' = ((1 − p ')) k = exp( ( − p '))


p( k ln(1 ))

From p ' = ((1 − 1 / m ) kn , solvingg for k


1
k= ln( p ')
l (1 − 1 / m )
n ln(1
Let g ' = k ln(1
( − p ')) ((in equation
q for f ' above))
1
= ln( p ') ln(1 − p ')
n ln(1 − 1 / m )
2/16/2017 Bloom Filters (Simon S. Lam) 12

12
2/16/2017

Using the precise probability of false


positive to get
p g optimal
p k (cont.)
( )
From previous slide
1
g' = ln( p ') ln(1 − p ')
n ln(1 − 1 / m )

By symmetry, g’ (also f’) minimized at p’=1/2

Optimal k is
1 1
k 'opt = ln( p ') = ln(1 / 2)
n ln(1 − 1 / m ) n ln(1 − 1 / m )

2/16/2017 Bloom Filters (Simon S. Lam) 13

13
2/16/2017

Optimal number of hash functions


 Using k = m ln(2) the false positive rate is
opt
n
m m
ln(2)
( ) ln(2)
( )
(1 − p ) n
= (0.5) n
 (0.6185) m / n , where ln(2) = 0.6931
 In practice, k should be an integer. May choose an integer
value
l smaller
ll than
h kopt to reduce
d h
hashing
hi overhead
h d
m/n denotes
False positive rate
bits per entry

2/16/2017 Bloom Filters (Simon S. Lam) 14

14
2/16/2017

False positive rate vs. bits per entry


False
positive
rate 4 hash functions

Using optimal number


of hash functions

m/n
2/16/2017 Bloom Filters (Simon S. Lam) 15

15
2/16/2017

Standard Bloom Filter tricks


Two Bloom filters representing sets S1 and
S2 with the same number of bits and using
g
the same hash functions.
o A Bloom filter that represents the union of S1 and
S2 can be obtained by taking the OR of the bit
vectors
A Bloom filter can be halved in size. Suppose
the
h size
i is
i a power off22.
o Just OR the first and second halves of the bit
vector
o When hashing to do a lookup, the highest order bit
is masked
Notation: OR denotes bitwise or
2/16/2017 Bloom Filters (Simon S. Lam) 16

16
2/16/2017

Counting Bloom filters


Proposed by Fan et al. [2000] for distributed
cach ng
caching
Every entry in a counting Bloom filter is a small
counter ((rather than a single
g bit).
)
o When an item is inserted into the set, the
corresponding counters are each incremented by 1
o When
h an item is deleted
d l d from
f the
h set, the
h
corresponding counters are each decremented by 1
To avoid counter overflow,
overflow its size must be
sufficiently large. It was found that 4 bits per
counter
u are enough.
ug .

2/16/2017 Bloom Filters (Simon S. Lam) 17

17
2/16/2017

Counter overflow probability


Consider a set of n elements, k hash
functions, and m counters
o C(i) is the count for the ith counter
nk − j
  1  
j
nk 1
P[c(i ) = j ] =      1 − 
 j  m   m

 nk  1
P[c(i ) ≥ j ] ≤   j
 j m
j
 enk 
≤  (a very loose upper bound)
 jm 

2/16/2017 Bloom Filters (Simon S. Lam) 18

18
2/16/2017

Counter overflow probability (cont.)


Choose k such that k ≤ m/n (ln 2)
Then j j
 enk   e ln 2 
P[c(i ) ≥ j ] ≤   ≤ 
 jm   j 
j
 e ln 2 
P[max c(i ) ≥ j ] ≤ m   for some i
1≤i ≤ m
 j 
Using 4 bits, each counter counts from 0 to
15
−15
P[max c(i ) ≥ 16] ≤ m × 1.37 × 10
1≤i ≤ m

2/16/2017 Bloom Filters (Simon S. Lam) 19

19
2/16/2017

Counter overflow consequences


When a counter does overflow, it may be left
at its
ts maximum
max mum value.
This can later cause a false negative only if
eventuallyy the counter goes
g down to 0 when it
should have remain at nonzero.
The expected
p time to this event is very
y large
g
but is something we need to keep in mind for
any application that does not allow false
negatives
ti

2/16/2017 Bloom Filters (Simon S. Lam) 20

20
2/16/2017

Conclusions
Wherever a list or set is used, and space is at
a premium,
prem um, a Bloom filter
f lter may be used if
f the
effect of false positives can be mitigated
o No false negative
With a counting Bloom filter, false negatives
are possible, albeit highly unlikely

2/16/2017 Bloom Filters (Simon S. Lam) 21

21
2/16/2017

The End

2/16/2017 Bloom Filters (Simon S. Lam) 22

22

You might also like