Sliding Windows
Sliding window is a useful model of stream processing in which the
queries are about a window of length N – the N most recent
elements received.
In certain cases N is so large that the data cannot be stored in
memory, or even on disk. Sliding window is also known as
Consider a sliding window of length N=6 on a single
stream as shown in figure 1. As the stream content
varies over time the sliding window highlights new
stream elements.
Figure 1. Sliding window on stream
Example1: Consider Amazon online transactions. For
every product X we keep 0/1 stream of whether that
product was sold in the n-th transaction. A query like,
“how many times have we sold X in the last k sales?”
and an answer for it can be derived using sliding
window concept.
Now let us suppose we have a window of
length N (say N=24) on a binary system, We
want at all times to be able to answer a query
of the form “ How many 1’s are there in the
last K bits?” for K<=N.
Here comes the DGIM Algorithm into
picture:
COUNTING THE NUMBER OF 1’s IN THE
DATA STREAM
DGIM algorithm (Datar-Gionis-Indyk-
Motwani Algorithm)
Designed to find the number 1’s in a data
set. This algorithm uses O(log²N) bits to
represent a window of N bit, allows to
estimate the number of 1’s in the window
with an error of no more than 50%.
So this algorithm gives a 50% precise answer.
In DGIM algorithm, each bit that arrives has a
timestamp, for the position at which it arrives.
if the first bit has a timestamp 1, the second bit
has a timestamp 2 and so on.. the positions are
recognized with the window size N (the window
sizes are usually taken as a multiple of
2).The windows are divided into buckets
consisting of 1’s and 0's.
RULES FOR FORMING THE
BUCKETS:
1. The right side of the bucket should
always start with 1. (if it starts with
a 0,it is to be neglected) E.g. ·
1001011 → a bucket of size
4 ,having four 1’s and starting with
1 on it’s right end.
2. Every bucket should have at least
one 1, else no bucket can be
formed.
3. All buckets should be in powers of
2.
4. The buckets cannot decrease in
size as we move to the left. (move in
increasing order towards left)
Let us take an example to understand the
algorithm.
Estimating the number of 1’s and
counting the buckets in the given data
stream.
This picture shows how we can form the
buckets based on the number of ones by
following the rules.
In the given data stream let us assume
the new bit arrives from the right. When
the new bit = 0
After the new bit ( 0 ) arrives with a time
stamp 101, there is no change in the
buckets.
But what if the new bit that arrives is 1,
then we need to make changes..
· Create a new bucket with the current
timestamp and size 1.
· If there was only one bucket of size 1,
then nothing more needs to be done.
However, if there are now three buckets
of size 1( buckets with timestamp
100,102, 103 in the second step in the
picture) We fix the problem by combining
the leftmost(earliest) two buckets of size
1.
To combine any two adjacent buckets of
the same size, replace them by one
bucket of twice the size. The timestamp
of the new bucket is the timestamp of the
rightmost of the two buckets.
Now, sometimes combining two buckets
of size 1 may create a third bucket of size
2. If so, we combine the leftmost two
buckets of size 2 into a bucket of size 4.
This process may ripple through the
bucket sizes.
How long can you continue doing this…
You can continue if current timestamp-
leftmost bucket timestamp of window <
N (=24 here) E.g. 103–87=16 < 24 so I
continue, if it greater or equal to then I
stop.
Finally the answer to the query.
How many 1’s are there in the last 20
bits?
Counting the sizes of the buckets in the
last 20 bits, we say, there are 11 ones.
In a decaying window, you assign a score or
weight to every element of the incoming data
stream.
Decaying Window Algorithm
This algorithm allows you to identify the most
popular/trending elements in an incoming data
stream.
The decaying window algorithm not only tracks
the most recurring elements in an incoming data
stream, but also discounts any random spikes or
spam requests that might have boosted an
element’s frequency.
In a decaying window, you assign a score or
weight to every element of the incoming data
stream.
Further, you need to calculate the aggregate sum
for each distinct element by adding all the
weights assigned to that element. The element
with the highest total score is listed as trending
or the most popular.
weights
timet
1. Assign each element with a weight/score.
2. Calculate aggregate sum for each distinct
element by adding all the weights assigned to
that element.
In a decaying window algorithm, we assign
more weight to newer elements. For a new
element, you first reduce the weight of all the
existing elements by a constant factor k and then
assign the new element with a specific weight.
The aggregate sum of the decaying exponential
weights can be calculated using the following
formula:
∑t-1 at−i(1−c)i
i=0
Here, c is usually a small constant. Whenever a
new element, say at+1, arrives in the data stream
you perform the following steps to achieve an
updated sum:
1. Multiply the current sum/score by the
value (1−c).
2. Add the weight corresponding to the new
element.
Weight decays exponentially over time
In a data stream consisting of various elements,
you maintain a separate sum for each distinct
element. For every incoming element, you
multiply the sum of all the existing elements by
a value of (1−c). Further, you add the weight of
the incoming element to its corresponding
aggregate sum.
A threshold can be kept to, ignore elements of
weight lesser than that.
Finally, the element with the highest aggregate
score is listed as the most popular element.
Example
For example, consider a sequence of twitter tags
below:
data stream [fifa, ipl, fifa, ipl, ipl, ipl, fifa]
Also, let's say each element in sequence has
weight of 1.
Let's c be 0.1
The aggregate sum of each tag in the end of
above stream will be calculated as below:
Fifa SCORE
fifa - 1 * (1-0.1) = 0.9
ipl - 0.9 * (1-0.1) + 0 = 0.81 (adding 0 because
current tag is different than fifa)
fifa - 0.81 * (1-0.1) + 1 = 1.729 (adding 1
because current tag is fifa only)
ipl - 1.729 * (1-0.1) + 0 = 1.5561
ipl - 1.5561 * (1-0.1) + 0 = 1.4005
ipl - 1.4005 * (1-0.1) + 0 = 1.2605
fifa - 1.2605 * (1-0.1) + 1 = 2.135
ipl SCORE
fifa - 0 * (1-0.1) = 0
ipl - 0 * (1-0.1) + 1 = 1
fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because
current tag is different than ipl)
ipl - 0.9 * (1-0.01) + 1 = 1.81
ipl - 1.81 * (1-0.01) + 1 = 2.7919
ipl - 2.7919 * (1-0.01) + 1 = 3.764
fifa - 3.764 * (1-0.01) + 0 = 3.7264
In the end of the sequence, we can see the score
of fifa is 2.135 but ipl is 3.7264
So, ipl is more trending than fifa
Even though both of them occurred same
number of times in input there score is still
different.
Advantages of Decaying Window
Algorithm:
1. Sudden spikes or spam data is taken care.
2. New element is given more weight by this
mechanism, to achieve right trending output.
Reference:
https://nitinbhojwani-tech-talk.blogspot.com/
2018/12/decaying-window-algorithm.html