Lec 02

Batch processing involves processing data in groups rather than individually. It is commonly used for tasks like reading files, sending email, and printing. Batch processing can be scaled up by using more powerful hardware for vertical scaling or distributing the workload across multiple machines for horizontal scaling. Stream processing involves continuously processing incoming data in real-time as it arrives.


3/5/2024

Batch Processing

What is batch processing?

• Processing data in groups


• Runs from the start of the process to the finish
• No data is added in between
• Typically run as a result of
  • an interval
  • a starting event
• Processed in groups of a certain size (the batch size)
• An instance of a batch process is often referred to as a job
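The bullets above can be sketched as a minimal batch job: the input is fixed up front, processed in groups of `batch_size`, and the job runs from start to finish with no data added mid-run. The names (`batches`, `run_job`) are illustrative, not from any real framework.

```python
def batches(records, batch_size):
    """Split a fixed list of records into groups of batch_size."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def run_job(records, batch_size=3):
    """One job instance: processes every batch, then finishes."""
    results = []
    for batch in batches(records, batch_size):
        # process one group at a time; no new data arrives mid-run
        results.append(sum(batch))
    return results

print(run_job([1, 2, 3, 4, 5, 6, 7]))  # [6, 15, 7]
```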


Common batch processing scenarios

• Reading files or parts of files (text, MP3, etc.)


• Sending/receiving email
• Printing

Why batch?

• Simple
• Generally consistent
• Multiple ways to improve performance


What is scaling?

• Improving performance
• Processing more quickly
• Less time to process the same amount of data
• Processing more data
• More data processed in the same amount of time

Vertical scaling
• Better computing
• Faster CPU
• Faster IO
• More memory
• Typically, the easiest kind of scaling
• Least complexity
• Rarely requires changing the underlying programs/algorithms


Horizontal scaling

• Splitting a task into multiple parts


• More computers
• Could also be more CPUs
• Best done on tasks that are "embarrassingly parallel"
• Tasks that can be easily divided among workers
• Can be very cost-effective
• Can have near-linear performance improvements for certain types of processes
• Complexity
• Requires a processing framework (like Apache Spark or Dask)
• Requires more extensive networking
• Ongoing management
• Can be expensive depending on requirements
• Provides little benefit for "non-parallel" tasks
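An "embarrassingly parallel" task can be sketched as follows: the data is partitioned into independent chunks, each chunk is processed by a worker with no coordination, and partial results are combined. The workers here run sequentially as a stand-in; in a real deployment each chunk would go to a separate machine via a framework such as Spark or Dask, as noted above.

```python
def split(data, n_workers):
    """Partition data into n_workers roughly equal chunks."""
    k, r = divmod(len(data), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def worker(chunk):
    # each worker handles its chunk independently -- no coordination needed
    return sum(x * x for x in chunk)

def run(data, n_workers=3):
    partials = [worker(c) for c in split(data, n_workers)]
    return sum(partials)  # combine the partial results

print(run(list(range(10))))  # 285
```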

Batch issues

• Delays
• Time until data is ready to process
• Is all data available?
• Time until the process begins
• When does the next interval start?
• Time to process data
• How long until completion?
• Time until processed data is available for use
• How long until users can use the data?


Example #1

• Waiting on the source data


• Machines sending log files at times of low utilization
• Works OK during normal utilization
• High utilization would limit ability to send logs, potentially hiding issues.

Example #2

• Waiting on the process


• 100GB log files per day
• Currently takes 23 hrs to process
• Approximately 4.4GB/hr
• Grows at 5% per month
• Next month would be 105GB and take ~24 hrs
• Following month would be ~110GB and take ~25 hrs
• Takes longer than a day to process one day's worth of data
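The arithmetic in Example #2 can be checked directly (values are from the slide; the 5% growth compounds monthly, so processing time soon exceeds 24 hours):

```python
# 100 GB currently takes 23 hrs, so the processing rate is ~4.35 GB/hr.
size_gb, hours = 100.0, 23.0
rate = size_gb / hours                      # ~4.35 GB/hr, assumed constant
for month in (1, 2):
    size_gb *= 1.05                         # 5% monthly growth
    print(f"month {month}: {size_gb:.1f} GB, {size_gb / rate:.1f} hrs")
```

By the second month, ~110 GB at the same rate needs more than 24 hours, so one day's data can no longer be processed within a day.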


Example #3

• Waiting for the data to be available


• How long until analytics are available?
• Sales report must wait for all information to be generated
• The sum of delays is the minimum time to generate a new report
• amount of time to collect/prepare data: 1 day
• Time required to process data: 7 hrs
• Time to update systems: 5 hrs
• Time to generate a report: 2 min
• Total time for each report: 1.5 days
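Summing the delays listed above confirms the minimum report latency:

```python
collect = 24        # hrs to collect/prepare data (1 day)
process = 7         # hrs to process the data
update = 5          # hrs to update systems
generate = 2 / 60   # 2 minutes to generate the report
total_hrs = collect + process + update + generate
print(round(total_hrs / 24, 1))  # 1.5 (days)
```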


Streaming Process


Stream processing - Basics

• Streaming data lifecycle


• Data is generated (upstream)
• Distribution and reorganization of data (by Message Processor)
• Data processing (by Stream Processing)
• Storing results, alerting, sending messages downstream


Major components in stream processing

• Application (generating stream of data)


• Message processor
• Stream processor
• Data storage (stores processed data, state etc.)


Stream processing - Basics

• A stream can be abstracted as an endless sequence of messages
• A stream can be represented by
  • a file
  • a TCP connection
  • a database table
• Streams can be partitioned
• Enables parallelization
• Streams can be
• Read
• Written into
• Joined
• Filtered
• Transformed
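The abstraction above maps naturally onto Python iterators: a generator stands in for an endless message source, and filtering and transforming compose lazily over it. This is only a sketch (the message shape and names are illustrative); real sources would be files, sockets, or a message broker.

```python
import itertools

def source():
    """Stand-in for an endless message source (file, TCP socket, ...)."""
    n = 0
    while True:
        yield {"id": n, "value": n * 10}
        n += 1

stream = source()
filtered = (m for m in stream if m["id"] % 2 == 0)   # filter the stream
transformed = (m["value"] + 1 for m in filtered)     # transform each message

# read only the first few messages of the (endless) result
print(list(itertools.islice(transformed, 3)))  # [1, 21, 41]
```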


How can we characterize stream processing?

• Realtime streaming vs Micro batches


• Usage of time windows
• Length of a window, shift of a time window
• Event vs Processing Time
• Stateless or Stateful
• In many use cases, we need to keep stream processing state
• Aggregations, message counts, etc.
• Can be handled by stream processor or externally (database)
• Out-of-order messages
• Message can be received with delay (issues in network, backlog of messages)


Realtime streaming vs Micro batches

• Realtime Streaming (True Realtime, Continuous Processing)
• Message is processed immediately after delivery
• Messages are processed one by one
• Low latencies (usually also lower throughput)
• Output should be available in tens to hundreds of milliseconds
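True realtime processing can be sketched in a few lines: each message is handled the moment it is delivered, one by one, with no buffering. The loop and `handle` function are illustrative stand-ins for a delivery callback.

```python
def handle(message, sink):
    """Process a single message immediately after delivery."""
    sink.append(message * 2)   # some per-message transformation

sink = []
for msg in [1, 2, 3]:          # stand-in for messages arriving over time
    handle(msg, sink)          # processed one by one, no batching
print(sink)  # [2, 4, 6]
```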


Realtime streaming vs Micro batches

• Micro batches (Near Realtime)
• Message is not processed immediately after delivery
• Messages are processed together in small batches
• Latency is at least the length of the batch interval (usually leads to higher throughput)
• Output is available within seconds or tens of seconds
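A minimal sketch of micro-batching: messages are buffered and processed together once the batch boundary is reached. Here a count threshold stands in for the time-based batch interval a real system would use; the function name is illustrative.

```python
def micro_batches(messages, batch_size):
    """Group an incoming message stream into small batches."""
    buffer = []
    for msg in messages:
        buffer.append(msg)          # latency: a message waits here until
        if len(buffer) == batch_size:  # the batch boundary is reached
            yield list(buffer)      # process the whole batch at once
            buffer.clear()
    if buffer:                      # flush the final partial batch
        yield list(buffer)

print(list(micro_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```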


Time Windows

• Length of window
• Slide interval
• Windows can overlap
• Example
• Length of window: 3 seconds
• Slide interval: 2 seconds
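With the example parameters above (window length 3 s, slide interval 2 s), consecutive windows overlap by one second, which a short computation makes concrete:

```python
def window_bounds(n_windows, length=3, slide=2):
    """(start, end) pairs for the first n sliding windows, in seconds."""
    return [(i * slide, i * slide + length) for i in range(n_windows)]

# each window starts 2 s after the previous one but covers 3 s,
# so adjacent windows share 1 s of data
print(window_bounds(3))  # [(0, 3), (2, 5), (4, 7)]
```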


Time Windows - Event vs Processing Time

• Event Time
• When the message was generated
• Processing Time
• When the message was processed
• Some tools cannot window messages by Event Time
• It is necessary to understand the semantics of the timestamps we are working with
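The distinction matters because the same message can land in different windows depending on which timestamp is used. A small sketch (timestamps are illustrative seconds; the tumbling-window helper is a stand-in):

```python
msgs = [
    {"event_t": 1, "proc_t": 1},
    {"event_t": 2, "proc_t": 5},  # generated at t=2, processed late at t=5
]

def window(t, size=3):
    """Index of the tumbling window (size seconds) containing time t."""
    return t // size

by_event = [window(m["event_t"]) for m in msgs]
by_proc = [window(m["proc_t"]) for m in msgs]
print(by_event, by_proc)  # [0, 0] [0, 1]
```

By event time both messages fall in the same window; by processing time the delayed message slips into the next one.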


Challenges: Stateful Stream Processing

• Adds another layer of complexity


• Size of data (does it fit in RAM?)
• If the state is too large, it slows down stream processing
• The state can be stored outside the stream processor
• Databases (Redis, HBase, Cassandra, …)
• Watermarking can be used to drop old data (more on that later)
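A minimal sketch of stateful stream processing: a running count per key, kept in an in-memory dict. The message shape is illustrative; if this state outgrew RAM, it could live in an external store (Redis, Cassandra, ...) instead, as noted above.

```python
from collections import defaultdict

state = defaultdict(int)   # the processor's state: running count per key

def process(message):
    key = message["user"]
    state[key] += 1        # update the state for this key
    return state[key]      # running count so far

for m in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    process(m)
print(dict(state))  # {'a': 2, 'b': 1}
```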


Challenges: Out-of-order messages

• How to handle messages that are received out of order?


• The solution depends on the use-case
• We can ignore them
• We can reprocess the data
• Or a custom action is executed (alert, include in a separate pipeline)
• Some tools use watermarking
• Threshold specifying how long the stream processor waits for delayed messages
• If a message arrives before the configured watermark, it is processed
• Otherwise, it is dropped
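The watermark rule above can be sketched as follows: the processor tracks the highest event time seen so far and drops any message whose event time falls behind it by more than the configured delay. This is a simplified model (real engines tie the watermark to windows and triggers); names and the message shape are illustrative.

```python
def process_with_watermark(messages, watermark_delay):
    """Accept messages within the watermark; drop ones that are too late."""
    max_event_t = float("-inf")
    accepted, dropped = [], []
    for msg in messages:
        max_event_t = max(max_event_t, msg["event_t"])
        if msg["event_t"] >= max_event_t - watermark_delay:
            accepted.append(msg["event_t"])   # within the watermark
        else:
            dropped.append(msg["event_t"])    # too late -- dropped
    return accepted, dropped

# event times arrive out of order; the processor tolerates 2 time units of lag
print(process_with_watermark(
    [{"event_t": t} for t in [1, 2, 5, 3, 4, 1]], 2))
# ([1, 2, 5, 3, 4], [1])
```

The late-but-tolerated messages (3 and 4 after 5) are still processed; the second message with event time 1 arrives after the watermark has advanced to 3 and is dropped.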


Common tools

• Kafka (Streams API)


• Flink (true realtime, record-at-a-time processing)
• NiFi (Data flow management system, not true streaming)
• Spark (Streaming API, Structured Streaming API)

