0% found this document useful (0 votes)
47 views79 pages

TRabl StreamProcessing

The document discusses stream processing and provides an overview of key concepts like streams, stream models, event time, time-based windows, and basic stream operators like windowed aggregation and joins. It also gives examples of stream processing with Flink, covering windows, triggers, and aggregations.

Uploaded by

Carina Lifschitz
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views79 pages

TRabl StreamProcessing

The document discusses stream processing and provides an overview of key concepts like streams, stream models, event time, time-based windows, and basic stream operators like windowed aggregation and joins. It also gives examples of stream processing with Flink, covering windows, triggers, and aggregations.

Uploaded by

Carina Lifschitz
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 79

Big Data Stream Processing

Tilmann Rabl
Berlin Big Data Center
www.dima.tu-berlin.de | bbdc.berlin | rabl@tu-berlin.de

1 © 2013 Berlin Big Data Center • All Rights Reserved © DIMA 2017
Agenda
Introduction to Streams
• Use cases
• Stream Processing 101

Stream Processing Systems


• Ingredients of a stream processing system
• Some examples
• More details on Storm, Spark, Flink
• Maybe a demo (!)

Stream Processing Optimizations (if we have time)


• How to optimize
With slides from Data Artisans, Volker Markl, Asterios Katsifodimos, Jonas Traub
2
2 © DIMA 2017
Big Fast Data
• Data is growing and can be evaluated
– Tweets, social networks (statuses, check-
ins, shared content), blogs, click streams,
various logs, …
– Facebook: > 845M active users, > 8B
messages/day
– Twitter: > 140M active users, > 340M
tweets/day
• Everyone is interested!

Image: Michael Carey


3
3 © DIMA 2017
But there is so much more…
• Autonomous Driving
– Requires rich navigation info
– Rich data sensor readings
– 1GB data per minute per car (all sensors)1

• Traffic Monitoring
– High event rates: millions events / sec
– High query rates: thousands queries / sec
– Queries: filtering, notifications, analytical
Source: http://theroadtochangeindia.wordpress.com/2011/01/13/better-roads/

• Pre-processing of sensor data


– CERN experiments generate ~1PB of measurements per second.
– Unfeasible to store or process directly, fast preprocessing is a must.

1Cobb: http://www.hybridcars.com/tech-experts-put-the-brakes-on-autonomous-cars/
4
4 © DIMA 2017
Why is this hard?

Image: Peter Pietzuch

Tension between performance and algorithmic expressiveness


6
6 © DIMA 2017
Stream Processing 101

With some Flink Examples


Based on the Data Flow Model

7 © DIMA 2017
What is a Stream?
• Unbounded data
– Conceptually infinite, ever growing set of data items / events
– Practically continuous stream of data, which needs to be processed / analyzed

• Push model
– Data production and procession is controlled by the source
– Publish / subscribe model

• Concept of time
– Often need to reason about when data is produced and when processed data should be
output
– Time agnostic, processing time, ingestion time, event time

This part is largely based on Tyler Akidau‘s great blog on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
8
8 © DIMA 2017
Stream Models
S = si, si+1, … si = <data item, timestamp>
• Turnstile
– Elements can come and go
– Underlying model is a vector of elements (domain)
– si is an update (increment or decrement) to a vector element
– Traditional database model
– Flexible model for algorithms

• Cash register
– Similar to turnstile, but elements cannot leave

• Time series
– si is is a new vector entry
– Vector is increasing
– This is what all big stream processing engines use

9
9 © DIMA 2017
Event Time

• Event time
– Data item production time
• Ingestion time
– System time when data item is received
• Processing time
– System time when data item is processed

• Typically, these do not match!


• In practice, streams are unordered!
Image: Tyler Akidau
10
10 © DIMA 2017
Time Agnostic Processing

Image: Tyler Akidau


• Filtering
– Stateless
– Can be done per data item
– Implementations: hash table or bloom filter

11
11 © DIMA 2017
Time Agnostic Processing II

Image: Tyler Akidau


• Inner join
– Only current elements
– Stateful
– E.g., hash join
• What about other joins (e.g., outer join)?

12
12 © DIMA 2017
Approximate Processing

Image: Tyler Akidau


• Streaming k-means, sketches
– Low overhead
– Notion of time
• Not covered in this talk

13
13 © DIMA 2017
Windows
• Fixed
– Also tumbling
• Sliding
– Also hopping
• Session
– Based on activity

Image: Tyler Akidau


• Triggered by
– Event time, processing time, count, watermark
• Eviction policy
– Window width / size

14
14 © DIMA 2017
Processing Time Windows

Image: Tyler Akidau


• System waits for x time units
– System decides on stream partitioning
– Simple, easy to implement
– Ignores any time information in the stream -> any aggregation can be arbitrary
• Similar: Counting Windows

15
15 © DIMA 2017
Event Time Windows

Images: Tyler Akidau


• Windows based on the time information in stream
– Adheres to stream semantic
– Correct calculations
– Buffering required, potentially unordered (more on this later)

16
16 © DIMA 2017
Basic Stream Operators
• Windowed Aggregation
– E.g., average speed
– Sum of URL accesses
– Daily highscore
Aggregate
• Windowed Join
– Correlated observations in timeframe
– E.g., temperature in time

9
12
10
17
17 © DIMA 2017
Flink’s Windowing
• Windows can be any combination of (multiple) triggers & evictions
– Arbitrary tumbling, sliding, session, etc. windows can be constructed.

• Common triggers/evictions part of the API


– Time (processing vs. event time), Count

• Even more flexibility: define your own UDF trigger/eviction

• Examples:
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)));
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)));

18
18 © DIMA 2017
Example Analysis: Windowed Aggregation
(2) StockPrice(HDP, 23.8)

StockPrice(SPX, 2113.9) (1)


StockPrice(FTSE, 6931.7) StockPrice(SPX, 2113.9)
StockPrice(HDP, 23.8) (3) StockPrice(FTSE, 6931.7)
StockPrice(HDP, 26.6) StockPrice(HDP, 26.6)

StockPrice(SPX, 2113.9)
(4) StockPrice(FTSE, 6931.7)
StockPrice(HDP, 25.2)

(1) val windowedStream = stockStream.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))


(2) val lowest = windowedStream.minBy("price")
(3) val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
(4) val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
19 © DIMA 2017
Complex Event Processing
• Detecting patterns in a stream
• Complex event = sequence of events
• Defined using logical and temporal conditions
– Logical: data values and combinations
– Temporal: within a given period of time

Slide by Kai-Uwe Sattler


20
20 © DIMA 2017
Complex Event Processing Contd.
• Composite events constructed e.g. by
– SEQ, AND, OR, NEG, ...
– SEQ(e1, e2) ➝ (e1, t1) ∧ (e2, t2)∧t1 ≤ t2 ∧ e1,e2 ε 𝕎𝕎
• Implemented by constructing a NFA
– Example: SEQ(A, B, C)

Slide by Kai-Uwe Sattler


21
21 © DIMA 2017
Stream Processing Systems

What makes a system a stream processing system?

22 © DIMA 2017
8 Requirements of Big Streaming
• Keep the data moving • Integrate stored and streaming data
– Streaming architecture – Hybrid stream and batch

• Declarative access • Data safety and availability


– E.g. StreamSQL, CQL – Fault tolerance, durable state

• Handle imperfections • Automatic partitioning and scaling


– Late, missing, unordered items – Distributed processing

• Predictable outcomes • Instantaneous processing and


– Consistency, event time response
The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005
23
23 © DIMA 2017
8 Requirements of Big Streaming
• Keep the data moving • Integrate stored and streaming data
– Streaming architecture – Hybrid stream and batch

• Declarative access • Data safety and availability


– E.g. StreamSQL, CQL – Fault tolerance, durable state

• Handle imperfections • Automatic partitioning and scaling


– Late, missing, unordered items – Distributed processing

• Predictable outcomes • Instantaneous processing and


– Consistency, event time response
The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005
24
24 © DIMA 2017
Big Data Processing
• Databases can process very large data since forever (see VLDB)
– Why not use those?

• Big data is not (fully) structured


– No good for database 

• We want to learn more from data than just


– Select, project, join

• First solution: MapReduce

25
25 © DIMA 2017
Map Reduce
• Framework / programming model by Google
– Presented 2004 at OSDI'04
• Inspired by map and reduce functions in functional languages / MPI
– Second order functions
• Simple parallelization model for shared nothing architectures (“commodity hardware”)
• Apache Hadoop
– Open-source implementation
– Initiated at Yahoo
Map: Computation Reduce: Aggregation
For each input create list of output values Combine all intermediate values for one key
Example: Example:
For each word in a sentence emit a k/v pair Sum up all values for the same key
indicating one occurrence of the word (“Hello”,(“1”, “1”, “1”, “1”)) -> (“Hello”,(“4”))
(key, “hello world”) -> (“hello”,”1”), (“world”,”1”) Signature
Signature reduce (key, list(value)) -> list(value’)
map (key, value) -> list(key’, value’)

26
26 © DIMA 2017
MR Data Flow
k a a 1

k b MAP b 1
a 1 1 1 REDUCE a 3

MR Framework Shuffle & Sort


k b b 1

k a a 1
b 1 1 b 2
k c MAP c 1 REDUCE
c 1 c 1
k e e 1

k a a 1 d 1 e 2
k d MAP REDUCE
d 1 e 1 1 d 1
k e e 1

27
27 © DIMA 2017
MR / Batch Processing

28
28 © DIMA 2017
MR / Batch Processing

29
29 © DIMA 2017
MR / Batch Window Processing

30
30 © DIMA 2017
MR Discussion

Images: Tyler Akidau

• Great for large amounts of static data


• For streams: only for large windows
• Data is not moving!
• High latency, low efficiency

31
31 © DIMA 2017
How to keep data moving?
Discretized Streams (mini-batch)

Stream
discretizer
while (true) { Job Job Job Job
// get next few records
// issue batch computation
}

Native streaming

Long-standing
while (true) { operators
// process next record
}
32
32 © DIMA 2017
Discussion of Mini-Batch
• Easy to implement
• Easy consistency and fault-tolerance
• Hard to do event time and sessions

Image: Tyler Akidau


33
33 © DIMA 2017
True Streaming Architecture

• Program = DAG* of operators and • Stream transformations


intermediate streams • Basic transformations: Map, Reduce, Filter,
Aggregations…
• Operator = computation + state • Binary stream transformations: CoMap, CoReduce…
• Intermediate streams = logical stream of • Windowing semantics: Policy based flexible windowing
(Time, Count, Delta…)
records • Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
34
34 © DIMA 2017
Handle Imperfections – Watermarks
• Data items arrive early, on-time, or late
• Solution: Watermarks
– Perfect or heuristic measure on when window is complete

Image: Tyler Akidau

35
35 © DIMA 2017
Handle Imperfections – Watermarks
• Data items arrive early, on-time, or late
• Solution: Watermarks
– Perfect or heuristic measure on when window is complete

Image: Tyler Akidau

Image: Tyler Akidau


36
36 © DIMA 2017
Data Safety and Availability
• Ensure that operators see all events
– “At least once”
– Solved by replaying a stream from a checkpoint
– No good for correct results

• Ensure that operators do not perform duplicate updates to


their state
– “Exactly once”
– Several solutions

• Ensure the job can survive failure

3737
37 © DIMA 2017
Lessons Learned from Batch

batch-2 batch-1

• If a batch computation fails, simply repeat computation as a transaction


• Transaction rate is constant
• Can we apply these principles to a true streaming execution?

© DIMA 201738
38
38
Taking Snapshots – the naïve way
t1 t2

execution snapshots
Initial approach (e.g., Naiad)
• Pause execution on t1,t2,..
• Collect state
• Restore execution

© DIMA 201739
39
39
Asynchronous Snapshots in Flink
snapshotting snapshotting
t1 t2

Propagating markers/barriers

snap - t1 snap - t2

[Carbone et. al. 2015] “Lightweight Asynchronous Snapshots for Distributed Dataflows”, Tech. Report. http://arxiv.org/abs/1506.08603
40
40 © DIMA 2017
Automatic partitioning and scaling
• 3 Types of Parallelization

• Big streaming systems should support all three

41
41 © DIMA 2017
Big Data Streaming Systems

42 © DIMA 2017
Streaming Systems Overview
Closed Source Open Source

Cloud DataFlow

(BigTable)

Naiad
StreamInsights Streaming
Systems

InfoSphere
Stream
Processing Esper
Language (SPL) Academia Aurora
AWS
NiagaraCQ
Kinesis
CQL
43
43 © DIMA 2017
Closed Source/Commercial Systems
Cloud DataFlow: • Unified primitives for batch and stream processing
• Runs in Google‘s cloud only
• Open Source SDK (programs can run on other systems)
• Check out the Apache Beam Project! (http://beam.apache.org/)

BigTable: • Not a real streaming solution


• Allows to feed streams as source into a google DB
• Data can be immediately queried

Naiad: • Goals of Naiad:


• High throughput (typical for batch processors)
• Low latency (known from single system stream processors)
• Is able to process iterative data flows
• Can discretize windows only based on time

StreamInsights: • Available through Microsoft's cloud


• Windows based on count-, time- and punctuation/snapshot
• Optimized for .NET framework applications
InfoSphere:
• Well specified in several publications
Stream • Can be deployed in customer clusters
Processing • Own SQL-like query language enables many optimization means
Language (SPL) • window discretization based on trigger- and eviction policies

44
44 © DIMA 2017
Open Source Systems by Apache (1/2)
• Reliable handling of huge numbers of concurrent reads and writes
• Can be used as data-source / data-sink for Storm, Samza, Flink, Spark and many more systems
• Fault tolerant: Messages are persisted on disk and replicated within the cluster. Messages (reads and
writes) can be repeated

• True streaming over distributed dataflow


• Low level API: Programmers have to specify the logic of each vertex in the flow graph
• Full understanding and hard coding of all used operators is required
• Enables very high throughput (single purpose programs with small overhead)

• True streaming built on top of Apache Kafka and Hadoop YARN


• State is first class citizen
• Low level API

45
45 © DIMA 2017
Open Source Systems by Apache (2/2)
Spark implements a batch execution engine
• The execution of a job graph is done in stages
• Operator outputs are materialized in memory (or disk) until the consuming operator is
ready to consume the materialized data
Spark uses Discretized Streams (D-Streams)
• Streams are interpreted as a series of deterministic batch-processing jobs
• Micro batches have a fixed granularity
• All windows defined in queries must be multiples of this granularity

Flinks runtime is a native streaming engine


• Based on Nephele/PACTs
• Queries are compiled to a program in the form of an operator DAG
• Operator DAGs are compiled to job graphs
• Job graphs are generic streaming programs
Flink implements “true streaming”
• The whole job graph is deployed concurrently in the cluster
• Operators are long-running: Continuously consume input and produce output
• Output tuples are immediately forwarded to succeeding operators and are available for
further processing (enables pipeline parallelism)
46
46 © DIMA 2017
Further open source systems
Esper
• Open source Complex Event Processing (CEP) engine
• Tightly coupled to Java: Executable on J2EE application servers
• Describing events in Plain Old Java Objects (POJOs)
• Time-based or count-based windows

Aurora
• First design and implementation that parallelizes stream computation including rich operation and windowing semantics
• Windows are always specified as ranges on some measure
• Was continued in Borealis Project

NiagaraCQ
• Focuses more on scalability than on the flexibility
• Provides various optimizations techniques to share common computation within and across queries
• Only time-based windows are possible

CQL
• Continuous query language
• Implemented by the STREAM DSMS at Stanford
• Captures a wide range of streaming application in an SQL-like query language

47
47 © DIMA 2017
Cloud-Based Streaming Systems (example)

48
48 © DIMA 2017
Storm, Spark Streaming, and Flink

49 © DIMA 2017
Big Data Analytics Ecosystem
Hive Cascading Giraph
Applications &
Languages Mahout Pig Crunch

MapReduce Flink
Data processing
engines Spark Storm Tez

App and resource Yarn Mesos


management
HDFS HBase Kafka …
Storage, streams 50
50 © 2013 Berlin Big Data Center • All Rights Reserved
50 © DIMA 2017
Apache Storm
Scalable Stream Processing Platform
by Twitter
• Tuple wise computation
• Programs are represented in a
topology graph
• vertices are computations / data
transformations
• edges represent data streams between
the computation nodes
• streams consist of an unbounded
sequence of data-items/tuples
• Low-level stream processing engine
51
51 © DIMA 2017
Storm‘s Fault Tolerance
• At least once guarantee via acknowledgments

• Acker logs progress of each tuple emitted by a spout

52
52 © DIMA 2017
Storm Bolt Example

Source: https://storm.apache.org/documentation/Tutorial.html
53
53 © DIMA 2017
Building a Storm Topology
1) Use the TopologyBuilder class to connect spouts and bolts:
builder.setSpout(“name”,new MySpout());
builder.setBolt(“name”,new MyBolt());

2) Additionally, specify groupings to allow parallelization (shuffle, all, global, field)


builder.shuffleGrouping(“BoltName”);

3) Create topology using the factory method


StormTopology st=builder.createTopology();

4) Use LocalCluster class to test the topology


LocalCluster cluster=new LocalCluster();
cluster.submitTopology(“name”,new Config(),st);

54
54 Source: Allen et al., Storm Applied: Strategies for Real-Time Event Processing © DIMA 2017
Storm – Trident
• High-level abstraction built on top of Storm core:
– operators like filter, join, groupBy, ...
• Stream-oriented API + UDFs
• Stateful, incremental processing
• Micro-Batch oriented (ordered & partitionable)
• Exactly-once semantics
• Trident topology compiled into spouts and bolt

55
55 © DIMA 2017
Storm – Heron
• New real-time streaming system based on Storm
• Introduced June 2015 by Twitter (SIGMOD)
• Fully compatible with Storm API
• Container-based implementation
• Back pressure mechanism
• Easy debugging of heron topologies through UI
• better performance than Storm (latency + throughput)

• No exactly once guarantee

56
56 © DIMA 2017
Apache Spark
• In memory abstraction for big data processing
– Resilient Distributed Data Sets
– Fault-tolerance through lineage
– Richt APIs for all kind of processing

Client
Loop outside the system, in
Step Step Step Step Step driver program

Client Iterative program looks like


many independent jobs
Step Step Step Step Step

57
57 © DIMA 2017
Spark Job

Image: Tyler Akidau


• Similar to MR, but much faster

58
58 © DIMA 2017
Spark Streaming

Image: Tyler Akidau

• Key abstraction: discretized streams (DStream)


– micro-batch = series of RDDs
– Stream computation = series of deterministic batch computation at a given time interval
• API very similar to Spark core (Java, Scala, Python)
– (stateless) transformations on DStreams: map, filter, reduce, repartition, cogrop, ...
– Stateful operators: time-based window operations, incremental aggregation, time-skewed joins
• Exactly-once semantics using checkpoints (asynchronous replication of state RDDs)
• No event time windows

59
59 © DIMA 2017
Apache Flink
Apache Flink is an open source platform for scalable batch and stream data processing.

• The core of Flink is a distributed streaming


dataflow engine.
• Executing dataflows in parallel on
clusters
• Providing a reliable foundation for
various workloads
• DataSet and DataStream programming
abstractions are the foundation for user
programs and higher layers

http://flink.apache.org
60
60 © DIMA 2017
Architecture
• Hybrid MapReduce and MPP database runtime

• Pipelined/Streaming engine
– Complete DAG deployed
Worker 1 Worker 2

Job Manager

Worker 3 Worker 4
61
61 © DIMA 2017
Built-in vs. driver-based looping
Client
Loop outside the system,
in driver program
Step Step Step Step Step
Iterative program looks
Client like many independent
jobs
Step Step Step Step Step

Dataflows with feedback


red. edges
join
map System is iteration-
Flink join aware, can optimize the
job

62 © 2013 Berlin Big Data Center • All Rights Reserved


62 © DIMA 2017
Sneak peak: Two of Flink’s APIs
case class Word (word: String, frequency: Int)

DataSet API (batch):


val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.groupBy("word").sum("frequency")
.print()

DataStream API (streaming):


val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.keyBy("word")
.window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
.sum("frequency”)
.print()

63 63
63 © DIMA 2017
Some Benchmark Results
Initially performed by Yahoo!
Engineering, Dec 16, 2015,

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and


Flink 0.10.1 show sub- second latencies
at relatively high throughputs[..]. Spark
streaming 1.5.1 supports high
throughputs, but at a relatively higher
latency.

http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
https://data-artisans.com/extending-the-yahoo-streaming-benchmark/
64 © DIMA 2017
Processing Time vs Event Time

DEMO - STREAMING

„Inspired“ by
https://github.com/dataArtisans/oscon
65 © 2013 Berlin Big Data Center • All Rights Reserved
65 © DIMA 2017
Stream Optimizations

Based on Hirzel et. al: A Catalog of Stream Processing


Optimizations, ACM Comp. Surveys. 46(4), 2014.

66 © DIMA 2017
Overview
• 11 Optimizations (numbered from 2 to 12 )

67
67 © DIMA 2017
Reordering and Elimination

68
68 © DIMA 2017
Operator Separation

69
69 © DIMA 2017
Fusion

In Apache Flink (and many other applications) we call this


chaining

Goal: Reduce communication costs


Method: Shared memory among operators instead of network communication

70
70 © DIMA 2017
Fission

Directly maps to data


parallelism:

71
71 © DIMA 2017
Placement

Assigning Operators to slots

+ Co-locating Data and Computations

72
72 © DIMA 2017
Load Balancing

73
73 © DIMA 2017
State Sharing

Distributed File Systems


A single storage layer for the whole cluster

Chaining again…
Share memory among several operators instead of copying the data

74
74 © DIMA 2017
Batching

“Under the hood” batch wise network traffic (buffering)

D-Streams*: All the stream processing is done in micro-batches

* Zaharia, Matei, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the
4th USENIX conference on Hot Topics in Cloud Ccomputing. USENIX Association, 2012. S. 10-10.
https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf
75
75 © DIMA 2017
Algorithm Selection & Load Shedding

The optimizer selects the (hopefully)


optimal join implementation

76
76 © DIMA 2017
Cost Model
• Traditional cost-based query optimization is based on cardinality estimation
➟ inadequate for unbounded streams
• Possible solution: rate-based cost estimation
– (Viglas et al.: Rate-based query optimization for streaming information sources, SIGMOD 2002)

• Challenges:
– Fluctuating streams
– Data-parallel processing

Slide by Kai-Uwe Sattler


77
77 © DIMA 2017
Conclusion
Introduction to Streams
• Stream Processing 101
• How to do real streaming

Stream Processing Systems


• Ingredients of a stream processing system
• Storm, Spark, Flink
• Continuously evolving

Stream Processing Optimizations


• How to optimize

78
78 © DIMA 2017
Thank You
Contact:
Tilmann Rabl
rabl@tu-berlin.de We are hiring!

79 © 2013 Berlin Big Data Center • All Rights Reserved © DIMA 2017
Further Reading
Historical papers on STREAM, Aurora, TelegraphCQ, Borealis, CQL, ...
• Papers and blogs on Storm, Heron, Flink, Spark Streaming, ...
• Alexandrov, Alexander, et al. The Stratosphere platform for big data analytics. The VLDB Journal-The International Journal on Very Large Data Bases, 2014, 23. Jg., Nr.
6, S. 939-964.
• Zaharia, Matei, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the 4th USENIX conference on
Hot Topics in Cloud Ccomputing. USENIX Association, 2012. S. 10-10.
• Murray, Derek G., et al. Naiad: a timely dataflow system. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013. S. 439-
455.
Windows & Semantics
• Ghanem et al.: Incremental Evaluation of Sliding-Window Queries over Data Streams, TKDE 19(1), 2007
• Tucker et al.: Exploiting Punctuation Semantics in Continuous Data Streams, TKDE 15(3), 2003
• Krämer et al.: Sematics and Implementation of Continuous Sliding Window Queries over Data Streams, TODS 34(1), 2009
• Botan et al.: SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems, VLDB 2010
CEP:
• Wu et al.: High-Performance Complex Event Processing over Streams, SIGMOD 2006
• Schultz-Moeller et al.: Distributed Complex Event Processing with Query Rewriting, DEBS 2009
Fault Tolerance:
• Hwang et al.: High-availability algorithms for distributed stream processing, ICDE 2005
• Zaharia et al.: Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. HotCloud, 2012.
• Fernandez et al.: Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management, SIGMOD 2013
Partitioning & Optimization:
• Hirzel et al.: A Catalog of Stream Processing Optimizations, ACM Comp. Surveys 46(4), 2014.
• Gedik et al.: Elastic Scaling for Data Streams, TPDS 25(6), 2014.
• Viglas et al.: Rate-Based Query Optimization for Streaming Information Sources, SIGMOD 2002

80
80 © DIMA 2017

You might also like