0% found this document useful (0 votes)

47 views79 pages

TRabl StreamProcessing

The document discusses stream processing and provides an overview of key concepts like streams, stream models, event time, time-based windows, and basic stream operators like windowed aggregation and joins. It also gives examples of stream processing with Flink, covering windows, triggers, and aggregations.

Uploaded by

Carina Lifschitz

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views79 pages

TRabl StreamProcessing

Uploaded by

Carina Lifschitz

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 79

Big Data Stream Processing

Tilmann Rabl
Berlin Big Data Center
www.dima.tu-berlin.de | bbdc.berlin | rabl@tu-berlin.de

1 © 2013 Berlin Big Data Center • All Rights Reserved © DIMA 2017
Agenda
Introduction to Streams
• Use cases
• Stream Processing 101

Stream Processing Systems

• Ingredients of a stream processing system
• Some examples
• More details on Storm, Spark, Flink
• Maybe a demo (!)

Stream Processing Optimizations (if we have time)

• How to optimize
With slides from Data Artisans, Volker Markl, Asterios Katsifodimos, Jonas Traub
2
2 © DIMA 2017
Big Fast Data
• Data is growing and can be evaluated
– Tweets, social networks (statuses, check-
ins, shared content), blogs, click streams,
various logs, …
– Facebook: > 845M active users, > 8B
messages/day
– Twitter: > 140M active users, > 340M
tweets/day
• Everyone is interested!

Image: Michael Carey

3
3 © DIMA 2017
But there is so much more…
• Autonomous Driving
– Requires rich navigation info
– Rich data sensor readings
– 1GB data per minute per car (all sensors)1

• Traffic Monitoring
– High event rates: millions events / sec
– High query rates: thousands queries / sec
– Queries: filtering, notifications, analytical
Source: http://theroadtochangeindia.wordpress.com/2011/01/13/better-roads/

• Pre-processing of sensor data

– CERN experiments generate ~1PB of measurements per second.
– Unfeasible to store or process directly, fast preprocessing is a must.

1Cobb: http://www.hybridcars.com/tech-experts-put-the-brakes-on-autonomous-cars/
4
4 © DIMA 2017
Why is this hard?

Image: Peter Pietzuch

Tension between performance and algorithmic expressiveness

6
6 © DIMA 2017
Stream Processing 101

With some Flink Examples

Based on the Data Flow Model

7 © DIMA 2017
What is a Stream?
• Unbounded data
– Conceptually infinite, ever growing set of data items / events
– Practically continuous stream of data, which needs to be processed / analyzed

• Push model
– Data production and procession is controlled by the source
– Publish / subscribe model

• Concept of time
– Often need to reason about when data is produced and when processed data should be
output
– Time agnostic, processing time, ingestion time, event time

This part is largely based on Tyler Akidau‘s great blog on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
8
8 © DIMA 2017
Stream Models
S = si, si+1, … si = <data item, timestamp>
• Turnstile
– Elements can come and go
– Underlying model is a vector of elements (domain)
– si is an update (increment or decrement) to a vector element
– Traditional database model
– Flexible model for algorithms

• Cash register
– Similar to turnstile, but elements cannot leave

• Time series
– si is is a new vector entry
– Vector is increasing
– This is what all big stream processing engines use

9
9 © DIMA 2017
Event Time

• Event time
– Data item production time
• Ingestion time
– System time when data item is received
• Processing time
– System time when data item is processed

• Typically, these do not match!

• In practice, streams are unordered!
Image: Tyler Akidau
10
10 © DIMA 2017
Time Agnostic Processing

Image: Tyler Akidau

• Filtering
– Stateless
– Can be done per data item
– Implementations: hash table or bloom filter

11
11 © DIMA 2017
Time Agnostic Processing II

Image: Tyler Akidau

• Inner join
– Only current elements
– Stateful
– E.g., hash join
• What about other joins (e.g., outer join)?

12
12 © DIMA 2017
Approximate Processing

Image: Tyler Akidau

• Streaming k-means, sketches
– Low overhead
– Notion of time
• Not covered in this talk

13
13 © DIMA 2017
Windows
• Fixed
– Also tumbling
• Sliding
– Also hopping
• Session
– Based on activity

Image: Tyler Akidau

• Triggered by
– Event time, processing time, count, watermark
• Eviction policy
– Window width / size

14
14 © DIMA 2017
Processing Time Windows

Image: Tyler Akidau

• System waits for x time units
– System decides on stream partitioning
– Simple, easy to implement
– Ignores any time information in the stream -> any aggregation can be arbitrary
• Similar: Counting Windows

15
15 © DIMA 2017
Event Time Windows

Images: Tyler Akidau

• Windows based on the time information in stream
– Adheres to stream semantic
– Correct calculations
– Buffering required, potentially unordered (more on this later)

16
16 © DIMA 2017
Basic Stream Operators
• Windowed Aggregation
– E.g., average speed
– Sum of URL accesses
– Daily highscore
Aggregate
• Windowed Join
– Correlated observations in timeframe
– E.g., temperature in time

9
12
10
17
17 © DIMA 2017
Flink’s Windowing
• Windows can be any combination of (multiple) triggers & evictions
– Arbitrary tumbling, sliding, session, etc. windows can be constructed.

• Common triggers/evictions part of the API

– Time (processing vs. event time), Count

• Even more flexibility: define your own UDF trigger/eviction

• Examples:
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)));
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)));

18
18 © DIMA 2017
Example Analysis: Windowed Aggregation
(2) StockPrice(HDP, 23.8)

StockPrice(SPX, 2113.9) (1)

StockPrice(FTSE, 6931.7) StockPrice(SPX, 2113.9)
StockPrice(HDP, 23.8) (3) StockPrice(FTSE, 6931.7)
StockPrice(HDP, 26.6) StockPrice(HDP, 26.6)

StockPrice(SPX, 2113.9)
(4) StockPrice(FTSE, 6931.7)
StockPrice(HDP, 25.2)

(1) val windowedStream = stockStream.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

(2) val lowest = windowedStream.minBy("price")
(3) val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
(4) val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
19 © DIMA 2017
Complex Event Processing
• Detecting patterns in a stream
• Complex event = sequence of events
• Defined using logical and temporal conditions
– Logical: data values and combinations
– Temporal: within a given period of time

Slide by Kai-Uwe Sattler

20
20 © DIMA 2017
Complex Event Processing Contd.
• Composite events constructed e.g. by
– SEQ, AND, OR, NEG, ...
– SEQ(e1, e2) ➝ (e1, t1) ∧ (e2, t2)∧t1 ≤ t2 ∧ e1,e2 ε 𝕎𝕎
• Implemented by constructing a NFA
– Example: SEQ(A, B, C)

Slide by Kai-Uwe Sattler

21
21 © DIMA 2017
Stream Processing Systems

What makes a system a stream processing system?

22 © DIMA 2017
8 Requirements of Big Streaming
• Keep the data moving • Integrate stored and streaming data
– Streaming architecture – Hybrid stream and batch

• Declarative access • Data safety and availability

– E.g. StreamSQL, CQL – Fault tolerance, durable state

• Handle imperfections • Automatic partitioning and scaling

– Late, missing, unordered items – Distributed processing

• Predictable outcomes • Instantaneous processing and

– Consistency, event time response
The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005
23
23 © DIMA 2017
8 Requirements of Big Streaming
• Keep the data moving • Integrate stored and streaming data
– Streaming architecture – Hybrid stream and batch

• Declarative access • Data safety and availability

– E.g. StreamSQL, CQL – Fault tolerance, durable state

• Handle imperfections • Automatic partitioning and scaling

– Late, missing, unordered items – Distributed processing

• Predictable outcomes • Instantaneous processing and

– Consistency, event time response
The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005
24
24 © DIMA 2017
Big Data Processing
• Databases can process very large data since forever (see VLDB)
– Why not use those?

• Big data is not (fully) structured

– No good for database 

• We want to learn more from data than just

– Select, project, join

• First solution: MapReduce

25
25 © DIMA 2017
Map Reduce
• Framework / programming model by Google
– Presented 2004 at OSDI'04
• Inspired by map and reduce functions in functional languages / MPI
– Second order functions
• Simple parallelization model for shared nothing architectures (“commodity hardware”)
• Apache Hadoop
– Open-source implementation
– Initiated at Yahoo
Map: Computation Reduce: Aggregation
For each input create list of output values Combine all intermediate values for one key
Example: Example:
For each word in a sentence emit a k/v pair Sum up all values for the same key
indicating one occurrence of the word (“Hello”,(“1”, “1”, “1”, “1”)) -> (“Hello”,(“4”))
(key, “hello world”) -> (“hello”,”1”), (“world”,”1”) Signature
Signature reduce (key, list(value)) -> list(value’)
map (key, value) -> list(key’, value’)

26
26 © DIMA 2017
MR Data Flow
k a a 1

k b MAP b 1
a 1 1 1 REDUCE a 3

MR Framework Shuffle & Sort

k b b 1

k a a 1
b 1 1 b 2
k c MAP c 1 REDUCE
c 1 c 1
k e e 1

k a a 1 d 1 e 2
k d MAP REDUCE
d 1 e 1 1 d 1
k e e 1

27
27 © DIMA 2017
MR / Batch Processing

28
28 © DIMA 2017
MR / Batch Processing

29
29 © DIMA 2017
MR / Batch Window Processing

30
30 © DIMA 2017
MR Discussion

Images: Tyler Akidau

• Great for large amounts of static data

• For streams: only for large windows
• Data is not moving!
• High latency, low efficiency

31
31 © DIMA 2017
How to keep data moving?
Discretized Streams (mini-batch)

Stream
discretizer
while (true) { Job Job Job Job
// get next few records
// issue batch computation
}

Native streaming

Long-standing
while (true) { operators
// process next record
}
32
32 © DIMA 2017
Discussion of Mini-Batch
• Easy to implement
• Easy consistency and fault-tolerance
• Hard to do event time and sessions

Image: Tyler Akidau

33
33 © DIMA 2017
True Streaming Architecture

• Program = DAG* of operators and • Stream transformations

intermediate streams • Basic transformations: Map, Reduce, Filter,
Aggregations…
• Operator = computation + state • Binary stream transformations: CoMap, CoReduce…
• Intermediate streams = logical stream of • Windowing semantics: Policy based flexible windowing
(Time, Count, Delta…)
records • Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
34
34 © DIMA 2017
Handle Imperfections – Watermarks
• Data items arrive early, on-time, or late
• Solution: Watermarks
– Perfect or heuristic measure on when window is complete

Image: Tyler Akidau

35
35 © DIMA 2017
Handle Imperfections – Watermarks
• Data items arrive early, on-time, or late
• Solution: Watermarks
– Perfect or heuristic measure on when window is complete

Image: Tyler Akidau

36
36 © DIMA 2017
Data Safety and Availability
• Ensure that operators see all events
– “At least once”
– Solved by replaying a stream from a checkpoint
– No good for correct results

• Ensure that operators do not perform duplicate updates to

their state
– “Exactly once”
– Several solutions

• Ensure the job can survive failure

3737
37 © DIMA 2017
Lessons Learned from Batch

batch-2 batch-1

• If a batch computation fails, simply repeat computation as a transaction

• Transaction rate is constant
• Can we apply these principles to a true streaming execution?

execution snapshots
Initial approach (e.g., Naiad)
• Pause execution on t1,t2,..
• Collect state
• Restore execution

Propagating markers/barriers

snap - t1 snap - t2

[Carbone et. al. 2015] “Lightweight Asynchronous Snapshots for Distributed Dataflows”, Tech. Report. http://arxiv.org/abs/1506.08603
40
40 © DIMA 2017
Automatic partitioning and scaling
• 3 Types of Parallelization

• Big streaming systems should support all three

41
41 © DIMA 2017
Big Data Streaming Systems

42 © DIMA 2017
Streaming Systems Overview
Closed Source Open Source

Cloud DataFlow

(BigTable)

Naiad
StreamInsights Streaming
Systems

InfoSphere
Stream
Processing Esper
Language (SPL) Academia Aurora
AWS
NiagaraCQ
Kinesis
CQL
43
43 © DIMA 2017
Closed Source/Commercial Systems
Cloud DataFlow: • Unified primitives for batch and stream processing
• Runs in Google‘s cloud only
• Open Source SDK (programs can run on other systems)
• Check out the Apache Beam Project! (http://beam.apache.org/)

BigTable: • Not a real streaming solution

• Allows to feed streams as source into a google DB
• Data can be immediately queried

Naiad: • Goals of Naiad:

• High throughput (typical for batch processors)
• Low latency (known from single system stream processors)
• Is able to process iterative data flows
• Can discretize windows only based on time

StreamInsights: • Available through Microsoft's cloud

• Windows based on count-, time- and punctuation/snapshot
• Optimized for .NET framework applications
InfoSphere:
• Well specified in several publications
Stream • Can be deployed in customer clusters
Processing • Own SQL-like query language enables many optimization means
Language (SPL) • window discretization based on trigger- and eviction policies

44
44 © DIMA 2017
Open Source Systems by Apache (1/2)
• Reliable handling of huge numbers of concurrent reads and writes
• Can be used as data-source / data-sink for Storm, Samza, Flink, Spark and many more systems
• Fault tolerant: Messages are persisted on disk and replicated within the cluster. Messages (reads and
writes) can be repeated

• True streaming over distributed dataflow

• Low level API: Programmers have to specify the logic of each vertex in the flow graph
• Full understanding and hard coding of all used operators is required
• Enables very high throughput (single purpose programs with small overhead)

• True streaming built on top of Apache Kafka and Hadoop YARN

• State is first class citizen
• Low level API

45
45 © DIMA 2017
Open Source Systems by Apache (2/2)
Spark implements a batch execution engine
• The execution of a job graph is done in stages
• Operator outputs are materialized in memory (or disk) until the consuming operator is
ready to consume the materialized data
Spark uses Discretized Streams (D-Streams)
• Streams are interpreted as a series of deterministic batch-processing jobs
• Micro batches have a fixed granularity
• All windows defined in queries must be multiples of this granularity

Flinks runtime is a native streaming engine

• Based on Nephele/PACTs
• Queries are compiled to a program in the form of an operator DAG
• Operator DAGs are compiled to job graphs
• Job graphs are generic streaming programs
Flink implements “true streaming”
• The whole job graph is deployed concurrently in the cluster
• Operators are long-running: Continuously consume input and produce output
• Output tuples are immediately forwarded to succeeding operators and are available for
further processing (enables pipeline parallelism)
46
46 © DIMA 2017
Further open source systems
Esper
• Open source Complex Event Processing (CEP) engine
• Tightly coupled to Java: Executable on J2EE application servers
• Describing events in Plain Old Java Objects (POJOs)
• Time-based or count-based windows

Aurora
• First design and implementation that parallelizes stream computation including rich operation and windowing semantics
• Windows are always specified as ranges on some measure
• Was continued in Borealis Project

NiagaraCQ
• Focuses more on scalability than on the flexibility
• Provides various optimizations techniques to share common computation within and across queries
• Only time-based windows are possible

CQL
• Continuous query language
• Implemented by the STREAM DSMS at Stanford
• Captures a wide range of streaming application in an SQL-like query language

47
47 © DIMA 2017
Cloud-Based Streaming Systems (example)

48
48 © DIMA 2017
Storm, Spark Streaming, and Flink

49 © DIMA 2017
Big Data Analytics Ecosystem
Hive Cascading Giraph
Applications &
Languages Mahout Pig Crunch

MapReduce Flink
Data processing
engines Spark Storm Tez

App and resource Yarn Mesos

management
HDFS HBase Kafka …
Storage, streams 50
50 © 2013 Berlin Big Data Center • All Rights Reserved
50 © DIMA 2017
Apache Storm
Scalable Stream Processing Platform
by Twitter
• Tuple wise computation
• Programs are represented in a
topology graph
• vertices are computations / data
transformations
• edges represent data streams between
the computation nodes
• streams consist of an unbounded
sequence of data-items/tuples
• Low-level stream processing engine
51
51 © DIMA 2017
Storm‘s Fault Tolerance
• At least once guarantee via acknowledgments

• Acker logs progress of each tuple emitted by a spout

52
52 © DIMA 2017
Storm Bolt Example

Source: https://storm.apache.org/documentation/Tutorial.html
53
53 © DIMA 2017
Building a Storm Topology
1) Use the TopologyBuilder class to connect spouts and bolts:
builder.setSpout(“name”,new MySpout());
builder.setBolt(“name”,new MyBolt());

2) Additionally, specify groupings to allow parallelization (shuffle, all, global, field)

builder.shuffleGrouping(“BoltName”);

3) Create topology using the factory method

StormTopology st=builder.createTopology();

4) Use LocalCluster class to test the topology

LocalCluster cluster=new LocalCluster();
cluster.submitTopology(“name”,new Config(),st);

54
54 Source: Allen et al., Storm Applied: Strategies for Real-Time Event Processing © DIMA 2017
Storm – Trident
• High-level abstraction built on top of Storm core:
– operators like filter, join, groupBy, ...
• Stream-oriented API + UDFs
• Stateful, incremental processing
• Micro-Batch oriented (ordered & partitionable)
• Exactly-once semantics
• Trident topology compiled into spouts and bolt

55
55 © DIMA 2017
Storm – Heron
• New real-time streaming system based on Storm
• Introduced June 2015 by Twitter (SIGMOD)
• Fully compatible with Storm API
• Container-based implementation
• Back pressure mechanism
• Easy debugging of heron topologies through UI
• better performance than Storm (latency + throughput)

• No exactly once guarantee

56
56 © DIMA 2017
Apache Spark
• In memory abstraction for big data processing
– Resilient Distributed Data Sets
– Fault-tolerance through lineage
– Richt APIs for all kind of processing

Client
Loop outside the system, in
Step Step Step Step Step driver program

Client Iterative program looks like

many independent jobs
Step Step Step Step Step

Image: Tyler Akidau

• Similar to MR, but much faster

Image: Tyler Akidau

• Key abstraction: discretized streams (DStream)

– micro-batch = series of RDDs
– Stream computation = series of deterministic batch computation at a given time interval
• API very similar to Spark core (Java, Scala, Python)
– (stateless) transformations on DStreams: map, filter, reduce, repartition, cogrop, ...
– Stateful operators: time-based window operations, incremental aggregation, time-skewed joins
• Exactly-once semantics using checkpoints (asynchronous replication of state RDDs)
• No event time windows

• The core of Flink is a distributed streaming

dataflow engine.
• Executing dataflows in parallel on
clusters
• Providing a reliable foundation for
various workloads
• DataSet and DataStream programming
abstractions are the foundation for user
programs and higher layers

• Pipelined/Streaming engine
– Complete DAG deployed
Worker 1 Worker 2

Job Manager

Worker 3 Worker 4
61
61 © DIMA 2017
Built-in vs. driver-based looping
Client
Loop outside the system,
in driver program
Step Step Step Step Step
Iterative program looks
Client like many independent
jobs
Step Step Step Step Step

Dataflows with feedback

red. edges
join
map System is iteration-
Flink join aware, can optimize the
job

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.groupBy("word").sum("frequency")
.print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.keyBy("word")
.window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
.sum("frequency”)
.print()

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and

Flink 0.10.1 show sub- second latencies
at relatively high throughputs[..]. Spark
streaming 1.5.1 supports high
throughputs, but at a relatively higher
latency.

http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
https://data-artisans.com/extending-the-yahoo-streaming-benchmark/
64 © DIMA 2017
Processing Time vs Event Time

DEMO - STREAMING

Based on Hirzel et. al: A Catalog of Stream Processing

Optimizations, ACM Comp. Surveys. 46(4), 2014.

In Apache Flink (and many other applications) we call this

chaining

Goal: Reduce communication costs

Method: Shared memory among operators instead of network communication

Directly maps to data

parallelism:

Assigning Operators to slots

+ Co-locating Data and Computations

Distributed File Systems

A single storage layer for the whole cluster

Chaining again…
Share memory among several operators instead of copying the data

“Under the hood” batch wise network traffic (buffering)

D-Streams*: All the stream processing is done in micro-batches

* Zaharia, Matei, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the
4th USENIX conference on Hot Topics in Cloud Ccomputing. USENIX Association, 2012. S. 10-10.
https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf
75
75 © DIMA 2017
Algorithm Selection & Load Shedding

The optimizer selects the (hopefully)

optimal join implementation

76
76 © DIMA 2017
Cost Model
• Traditional cost-based query optimization is based on cardinality estimation
➟ inadequate for unbounded streams
• Possible solution: rate-based cost estimation
– (Viglas et al.: Rate-based query optimization for streaming information sources, SIGMOD 2002)

• Challenges:
– Fluctuating streams
– Data-parallel processing

Slide by Kai-Uwe Sattler

Stream Processing Systems

• Ingredients of a stream processing system
• Storm, Spark, Flink
• Continuously evolving

Stream Processing Optimizations

• How to optimize

79 © 2013 Berlin Big Data Center • All Rights Reserved © DIMA 2017
Further Reading
Historical papers on STREAM, Aurora, TelegraphCQ, Borealis, CQL, ...
• Papers and blogs on Storm, Heron, Flink, Spark Streaming, ...
• Alexandrov, Alexander, et al. The Stratosphere platform for big data analytics. The VLDB Journal-The International Journal on Very Large Data Bases, 2014, 23. Jg., Nr.
6, S. 939-964.
• Zaharia, Matei, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the 4th USENIX conference on
Hot Topics in Cloud Ccomputing. USENIX Association, 2012. S. 10-10.
• Murray, Derek G., et al. Naiad: a timely dataflow system. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013. S. 439-
455.
Windows & Semantics
• Ghanem et al.: Incremental Evaluation of Sliding-Window Queries over Data Streams, TKDE 19(1), 2007
• Tucker et al.: Exploiting Punctuation Semantics in Continuous Data Streams, TKDE 15(3), 2003
• Krämer et al.: Sematics and Implementation of Continuous Sliding Window Queries over Data Streams, TODS 34(1), 2009
• Botan et al.: SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems, VLDB 2010
CEP:
• Wu et al.: High-Performance Complex Event Processing over Streams, SIGMOD 2006
• Schultz-Moeller et al.: Distributed Complex Event Processing with Query Rewriting, DEBS 2009
Fault Tolerance:
• Hwang et al.: High-availability algorithms for distributed stream processing, ICDE 2005
• Zaharia et al.: Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. HotCloud, 2012.
• Fernandez et al.: Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management, SIGMOD 2013
Partitioning & Optimization:
• Hirzel et al.: A Catalog of Stream Processing Optimizations, ACM Comp. Surveys 46(4), 2014.
• Gedik et al.: Elastic Scaling for Data Streams, TPDS 25(6), 2014.
• Viglas et al.: Rate-Based Query Optimization for Streaming Information Sources, SIGMOD 2002

6 - Streaming Part 1
No ratings yet
6 - Streaming Part 1
44 pages
Big Data Analytics - Unit 2 Notes
No ratings yet
Big Data Analytics - Unit 2 Notes
44 pages
Mining Data Streams
No ratings yet
Mining Data Streams
37 pages
Streaming Data Ingestion v1 181001151203
No ratings yet
Streaming Data Ingestion v1 181001151203
59 pages
Module-2-MINING DATA STREAMS
100% (3)
Module-2-MINING DATA STREAMS
17 pages
Module3A MiningBigDataStreams
No ratings yet
Module3A MiningBigDataStreams
145 pages
Stream Processing for IT/CSE Students
No ratings yet
Stream Processing for IT/CSE Students
57 pages
SPA Notes
No ratings yet
SPA Notes
4 pages
Unit 4 Notes PDF
100% (2)
Unit 4 Notes PDF
27 pages
Swe2011 Bda - III
No ratings yet
Swe2011 Bda - III
53 pages
Bigdata Unit-Ii
No ratings yet
Bigdata Unit-Ii
33 pages
Big Data IV Nit
No ratings yet
Big Data IV Nit
15 pages
Lec 19
No ratings yet
Lec 19
23 pages
Lec 19
No ratings yet
Lec 19
24 pages
Unit 3-6
No ratings yet
Unit 3-6
14 pages
SA Unit 1 PPT 2
No ratings yet
SA Unit 1 PPT 2
27 pages
Bda Mid Ans
No ratings yet
Bda Mid Ans
18 pages
Bigdata-Mining Data Streams
No ratings yet
Bigdata-Mining Data Streams
19 pages
Chapter-5 Stream Processing Part1
No ratings yet
Chapter-5 Stream Processing Part1
32 pages
Big Data 3rd Unit
No ratings yet
Big Data 3rd Unit
16 pages
Mining Data Streams in Data Analytics Refers To The Process of Extracting Useful Patterns
No ratings yet
Mining Data Streams in Data Analytics Refers To The Process of Extracting Useful Patterns
30 pages
Big Data PDF
No ratings yet
Big Data PDF
10 pages
Big Data Notes
No ratings yet
Big Data Notes
37 pages
Lec 01
No ratings yet
Lec 01
17 pages
Big Data Stream Processing Guide
No ratings yet
Big Data Stream Processing Guide
22 pages
Stream Processing in Big Data
No ratings yet
Stream Processing in Big Data
39 pages
Stream Processing
No ratings yet
Stream Processing
33 pages
Big Data Analytics Module 4 Mumbai University
No ratings yet
Big Data Analytics Module 4 Mumbai University
24 pages
Big Data Analytics Unit-2
100% (1)
Big Data Analytics Unit-2
11 pages
Unit 1 Windowing
No ratings yet
Unit 1 Windowing
23 pages
Mining Data Streams
No ratings yet
Mining Data Streams
17 pages
Data Stream Mining Essentials
No ratings yet
Data Stream Mining Essentials
33 pages
BDA Lec10
No ratings yet
BDA Lec10
33 pages
Stream Data Processing
No ratings yet
Stream Data Processing
32 pages
FALLSEM2024-25 SWE2011 ETH VL2024250103282 2024-08-19 Reference-Material-I
No ratings yet
FALLSEM2024-25 SWE2011 ETH VL2024250103282 2024-08-19 Reference-Material-I
53 pages
7 - Streaming 2 - Calcite
No ratings yet
7 - Streaming 2 - Calcite
45 pages
BDA Unit 3
No ratings yet
BDA Unit 3
42 pages
Bigdata Unit II
No ratings yet
Bigdata Unit II
19 pages
Unit 3
No ratings yet
Unit 3
30 pages
Chapter 1-1
No ratings yet
Chapter 1-1
34 pages
BDA Unit-4
No ratings yet
BDA Unit-4
12 pages
Ade Mod 1 Incremental Processing With Spark Structured Streaming
No ratings yet
Ade Mod 1 Incremental Processing With Spark Structured Streaming
73 pages
Unit-Ii 30-1-24
No ratings yet
Unit-Ii 30-1-24
162 pages
4 Building Blocks of A Streaming Data Architecture
No ratings yet
4 Building Blocks of A Streaming Data Architecture
11 pages
Stream Processing With: Tamás István Ujj
No ratings yet
Stream Processing With: Tamás István Ujj
27 pages
Data Stream in Data Analytics
No ratings yet
Data Stream in Data Analytics
4 pages
DWDM - Unit - VII
No ratings yet
DWDM - Unit - VII
42 pages
Bda 2
No ratings yet
Bda 2
16 pages
b0m33bdt 7p Spark Databricks Streaming - 2023 - en
No ratings yet
b0m33bdt 7p Spark Databricks Streaming - 2023 - en
50 pages
BDA Mod 3
No ratings yet
BDA Mod 3
57 pages
BigData Mod2
No ratings yet
BigData Mod2
12 pages
Unit II (Big Data)
No ratings yet
Unit II (Big Data)
19 pages
A Deep Dive Into Data Stream Processing
No ratings yet
A Deep Dive Into Data Stream Processing
10 pages
Stream Computing
No ratings yet
Stream Computing
18 pages
Stream Processing and Analytics Handout
No ratings yet
Stream Processing and Analytics Handout
8 pages
Unit-2 BDA
No ratings yet
Unit-2 BDA
30 pages
Car Insurance Claim Prediction - First Seminar
No ratings yet
Car Insurance Claim Prediction - First Seminar
26 pages
Data Science Resume Template 1751390932
No ratings yet
Data Science Resume Template 1751390932
1 page
Big Data Tools 2 - Apache Spark With PySpark
No ratings yet
Big Data Tools 2 - Apache Spark With PySpark
33 pages
Name: Sadikshya Khanal Section: C3G2: Workshop - 9 - Hadoop Part 2
No ratings yet
Name: Sadikshya Khanal Section: C3G2: Workshop - 9 - Hadoop Part 2
51 pages
Unit V
No ratings yet
Unit V
35 pages
Data Scientist: How To Become A
100% (3)
Data Scientist: How To Become A
45 pages
Ankit Data Engineer Resume
No ratings yet
Ankit Data Engineer Resume
8 pages
Databricks Performance Optimization
No ratings yet
Databricks Performance Optimization
94 pages
Spark-Scala Code
No ratings yet
Spark-Scala Code
3 pages
Big Data Processing: Jiaul Paik
No ratings yet
Big Data Processing: Jiaul Paik
47 pages
Cloud Data Engineering V1.0
No ratings yet
Cloud Data Engineering V1.0
5 pages
Data Science & ML Syllabus
No ratings yet
Data Science & ML Syllabus
12 pages
Kafka
No ratings yet
Kafka
78 pages
Spark NLP Training-Public-April 2020
No ratings yet
Spark NLP Training-Public-April 2020
39 pages
BDA Unit-1
No ratings yet
BDA Unit-1
56 pages
DP 900 Master Cheat Sheet
No ratings yet
DP 900 Master Cheat Sheet
32 pages
Survey On Extract-Load-Transform Vs Extract-Transform-Load: A Modern Shift in Data Engineering
No ratings yet
Survey On Extract-Load-Transform Vs Extract-Transform-Load: A Modern Shift in Data Engineering
6 pages
Essential Python Libraries and Functions For Data Science 1706295212
No ratings yet
Essential Python Libraries and Functions For Data Science 1706295212
12 pages
Meghna Reddy Kunta
No ratings yet
Meghna Reddy Kunta
7 pages
Data Engineer's Career Portfolio
No ratings yet
Data Engineer's Career Portfolio
6 pages
Preparing For Your Professional Data Engineer Journey - T-GCPPDE-A-m0-l6-file-en-7
100% (1)
Preparing For Your Professional Data Engineer Journey - T-GCPPDE-A-m0-l6-file-en-7
80 pages
CV - Alekh Ved
No ratings yet
CV - Alekh Ved
5 pages
Anaconda: Open Source Python Analytics
No ratings yet
Anaconda: Open Source Python Analytics
8 pages
About Me Professional Experience
No ratings yet
About Me Professional Experience
1 page
Pyspark Dumps
No ratings yet
Pyspark Dumps
10 pages
Data Contracts Early Release 042024
No ratings yet
Data Contracts Early Release 042024
52 pages
Unit-6 - Data Visualization and Graph Analytics
No ratings yet
Unit-6 - Data Visualization and Graph Analytics
27 pages
Data Engineer Resume: Skills & Experience
No ratings yet
Data Engineer Resume: Skills & Experience
2 pages
MSC Datascience Unit1
No ratings yet
MSC Datascience Unit1
20 pages
Kibana, Grafana and Zeppelin On Monitoring Data
100% (1)
Kibana, Grafana and Zeppelin On Monitoring Data
21 pages

TRabl StreamProcessing

Uploaded by

TRabl StreamProcessing

Uploaded by

Big Data Stream Processing

Stream Processing Systems

Stream Processing Optimizations (if we have time)

Image: Michael Carey

• Pre-processing of sensor data

Image: Peter Pietzuch

Tension between performance and algorithmic expressiveness

With some Flink Examples

• Typically, these do not match!

Image: Tyler Akidau

Image: Tyler Akidau

Image: Tyler Akidau

Image: Tyler Akidau

Image: Tyler Akidau

Images: Tyler Akidau

• Common triggers/evictions part of the API

• Even more flexibility: define your own UDF trigger/eviction

StockPrice(SPX, 2113.9) (1)

(1) val windowedStream = stockStream.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

Slide by Kai-Uwe Sattler

Slide by Kai-Uwe Sattler

What makes a system a stream processing system?

• Declarative access • Data safety and availability

• Handle imperfections • Automatic partitioning and scaling

• Predictable outcomes • Instantaneous processing and

• Declarative access • Data safety and availability

• Handle imperfections • Automatic partitioning and scaling

• Predictable outcomes • Instantaneous processing and

• Big data is not (fully) structured

• We want to learn more from data than just

• First solution: MapReduce

MR Framework Shuffle & Sort

Images: Tyler Akidau

• Great for large amounts of static data

Image: Tyler Akidau

• Program = DAG* of operators and • Stream transformations

Image: Tyler Akidau

Image: Tyler Akidau

Image: Tyler Akidau

• Ensure that operators do not perform duplicate updates to

• Ensure the job can survive failure

• If a batch computation fails, simply repeat computation as a transaction

• Big streaming systems should support all three

BigTable: • Not a real streaming solution

Naiad: • Goals of Naiad:

StreamInsights: • Available through Microsoft's cloud

• True streaming over distributed dataflow

• True streaming built on top of Apache Kafka and Hadoop YARN

Flinks runtime is a native streaming engine

App and resource Yarn Mesos

• Acker logs progress of each tuple emitted by a spout

2) Additionally, specify groupings to allow parallelization (shuffle, all, global, field)

3) Create topology using the factory method

4) Use LocalCluster class to test the topology

• No exactly once guarantee

Client Iterative program looks like

Image: Tyler Akidau

Image: Tyler Akidau

• Key abstraction: discretized streams (DStream)

• The core of Flink is a distributed streaming

Dataflows with feedback

62 © 2013 Berlin Big Data Center • All Rights Reserved

DataSet API (batch):

DataStream API (streaming):

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and

Based on Hirzel et. al: A Catalog of Stream Processing

In Apache Flink (and many other applications) we call this

Goal: Reduce communication costs

Directly maps to data

Assigning Operators to slots

+ Co-locating Data and Computations

Distributed File Systems

“Under the hood” batch wise network traffic (buffering)

D-Streams*: All the stream processing is done in micro-batches

The optimizer selects the (hopefully)

Slide by Kai-Uwe Sattler