Spark Streaming
After this video, you will be able to:
• Summarize how Spark reads streaming data
• List several sources of streaming data supported by Spark
• Describe Spark's sliding windows
[Diagram: the Spark stack. Spark SQL, Spark Streaming, MLlib, and GraphX run on top of Spark Core.]
Spark Streaming
• Scalable processing for real-time analytics
• Data streams converted to discrete RDDs
• Has APIs for Scala, Java, and Python
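The "data streams converted to discrete RDDs" idea can be sketched in plain Python (no Spark required; the sample events and `batch_length` are illustrative assumptions, not Spark API):

```python
# Minimal sketch of Spark Streaming's micro-batch model. Each
# (timestamp, value) event is assigned to a discrete batch, mimicking
# how a DStream discretizes a continuous stream into a sequence of RDDs.

def discretize(events, batch_length):
    """Group (timestamp, value) events into batches of batch_length seconds."""
    batches = {}
    for ts, value in events:
        batch_id = ts // batch_length        # which micro-batch this event falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

# Ten events arriving one per second, discretized with a 2-second batch length.
events = [(t, t + 1) for t in range(10)]     # values 1..10 at times 0..9
print(discretize(events, 2))                 # → [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
```

Each inner list plays the role of one RDD in the DStream; Spark then runs the same RDD operations on each batch as it arrives.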
Spark Streaming Sources
• Kafka
• Flume
• HDFS
• S3
• Twitter
• Socket
• …etc.
Creating and Processing DStreams
[Diagram: a streaming source emits records 10, 9, ..., 1. A Discretize step turns the stream into a DStream of RDDs; a Transformation produces a new DStream of RDDs; an Action on each RDD produces the Results.]
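The pipeline in the diagram can be sketched in plain Python (no Spark; the batches and operations are illustrative stand-ins for DStream transformations and actions):

```python
# Sketch of the DStream pipeline: a transformation is applied
# independently to every batch ("RDD"), and an action collapses each
# transformed batch into one result per batch interval.

batches = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]   # discretized stream

# Transformation: like DStream.map, applied to each RDD in the DStream.
doubled = [[x * 2 for x in batch] for batch in batches]

# Action: like DStream.reduce, producing one result per batch interval.
results = [sum(batch) for batch in doubled]
print(results)   # → [6, 14, 22, 30, 38]
```

The key point the diagram makes is that nothing new is needed for streaming: the same per-RDD operations used in batch Spark run on each micro-batch in turn.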
Creating and Processing DStreams
[Diagram: the same pipeline, now with a batch length of 2 seconds. The records 1 through 10 are discretized into a DStream whose RDDs each hold 2 seconds of data; the Transformation and Action then run on each RDD as before.]
Creating and Processing DStreams
[Diagram: the pipeline extended with windowing. With a batch length of 2 seconds, a window size of 4 and a sliding interval of 2 group consecutive RDDs into overlapping windows before the Transformation and Action are applied.]
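The sliding window from the diagram can be sketched in plain Python (no Spark). This assumes the window size and sliding interval are given in seconds, so with a 2-second batch length a 4-second window spans 2 batches and a 2-second slide advances by 1 batch:

```python
# Sketch of a sliding window over a DStream's micro-batches.

batches = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]   # 2 seconds of data each

window_batches = 4 // 2   # window size / batch length  = 2 batches per window
slide_batches = 2 // 2    # sliding interval / batch length = advance 1 batch

windows = [
    sum(batches[i:i + window_batches], [])   # concatenate the batches in a window
    for i in range(0, len(batches) - window_batches + 1, slide_batches)
]
print(windows[0])                # → [1, 2, 3, 4]   first 4-second window
print([sum(w) for w in windows]) # → [10, 18, 26, 34]   windowed sums
```

Consecutive windows overlap by one batch, which is what lets a windowed calculation (here, a sum) smoothly track the last 4 seconds of the stream every 2 seconds.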
Main Take-Aways
• Spark uses DStreams to make discrete RDDs from streaming data.
• The same transformations and actions applied to batch RDDs can be applied to DStreams.
• DStreams can use a sliding window to perform calculations over a window of time.