3/5/2024
Batch Processing
    What is batch processing?
    • Processing data in groups
    • Runs from the start of the process to the finish
        • No data is added in between
    • Typically triggered by
        • a time interval
        • a starting event
    • Data is processed in groups of a certain size (the batch size; see the sketch below)
    • An instance of a batch process is often referred to as a job
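      A minimal sketch of a batch job in Python (the file name logs.txt and the per-batch work are assumed for illustration): records are read in fixed-size batches, each batch is processed as a group, and no new data is picked up once the run has started.

      # Minimal batch-job sketch; file name and processing step are placeholders.
      BATCH_SIZE = 1000

      def process_batch(lines):
          # Stand-in for the real work done on one batch of records.
          return sum(len(line) for line in lines)

      def run_job(path="logs.txt"):          # assumed input file
          batch, total = [], 0
          with open(path) as f:              # the data set is fixed once the job starts
              for line in f:
                  batch.append(line)
                  if len(batch) == BATCH_SIZE:
                      total += process_batch(batch)
                      batch = []
              if batch:                      # final, possibly smaller batch
                  total += process_batch(batch)
          return total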
    Common batch processing scenarios
    • Reading files or parts of files (text, mp3, etc.)
    • Sending/receiving email
    • Printing
    Why batch?
    • Simple
    • Generally consistent
    • Multiple ways to improve performance
     What is scaling?
     • Improving performance
         • Processing more quickly
             • Less time to process the same amount of data
         • Processing more data
             • More data processed in the same amount of time
    Vertical scaling
    • More powerful hardware
        • Faster CPU
        • Faster I/O
        • More memory
    • Typically the easiest kind of scaling
        • Least complexity
        • Rarely requires changing the underlying programs/algorithms
          Horizontal scaling
    • Splitting a task into multiple parts
        • More computers
        • Could also be more CPUs
    • Best suited to tasks that are "embarrassingly parallel"
        • Tasks that can be easily divided among workers (see the sketch below)
    • Advantages
        • Can be very cost-effective
        • Can give near-linear performance improvements for certain types of processes
    • Drawbacks
        • Complexity
            • Requires a processing framework (like Apache Spark or Dask)
            • Requires more extensive networking
        • Ongoing management
        • Can be expensive depending on requirements
        • "Non-parallel" tasks benefit little
          Batch issues
          • Delays
             • Time until data is ready to process
                  • Is all data available?
             • Time until the process begins
                  • When does the next interval start?
             • Time to process data
                  • How long until completion?
             • Time until processed data is available for use
                  • How long until users can use the data?
     Example #1
     • Waiting on the source data
     • Machines send log files at times of low utilization
     • Works fine during periods of normal utilization
     • High utilization would limit the ability to send logs, potentially hiding issues
     Example #2
     • Waiting on the process
     • 100 GB of log files per day
     • Currently takes 23 hrs to process
     • Approximately 4.35 GB/hr of throughput
     • Volume grows at 5% per month
     • Next month: ~105 GB, taking ~24 hrs
     • The following month: ~110 GB, taking ~25 hrs
     • Soon it takes longer than a day to process one day's worth of data (see the calculation below)
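      A small sketch of the arithmetic behind this example, assuming throughput stays fixed at roughly 100 GB / 23 hrs while the daily volume grows 5% per month:

      # Projected processing time at a fixed throughput of 100 GB / 23 hrs (~4.35 GB/hr).
      throughput = 100 / 23                  # GB per hour
      for month in range(3):
          volume = 100 * 1.05 ** month       # daily volume after `month` months of 5% growth
          hours = volume / throughput
          print(f"month {month}: ~{volume:.0f} GB -> ~{hours:.1f} hrs")
      # Months 0-2 give roughly 23, 24 and 25 hours: by next month the daily batch
      # already takes longer than a day to process.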
     Example #3
     • Waiting for the data to be available
     • How long until analytics are available?
     • Sales report must wait for all information to be generated
         • The sum of delays is the minimum time to generate a new report
         • Time to collect/prepare data: 1 day
         • Time required to process data: 7 hrs
         • Time to update systems: 5 hrs
         • Time to generate a report: 2 min
     • Total time for each report: 1 day + 7 hrs + 5 hrs + 2 min ≈ 1.5 days (see the check below)
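      A quick check of the total, summing the delays listed above (a minimal sketch; the numbers come straight from this slide):

      from datetime import timedelta

      # The sum of the delays is the minimum time to a fresh report.
      delays = [
          timedelta(days=1),            # collect/prepare data
          timedelta(hours=7),           # process data
          timedelta(hours=5),           # update systems
          timedelta(minutes=2),         # generate the report
      ]
      total = sum(delays, timedelta())
      print(total, "->", round(total.total_seconds() / 86400, 2), "days")
      # 1 day, 12:02:00 -> 1.5 days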
                    Stream Processing
     Stream processing - Basics
     • Streaming data lifecycle
        • Data is generated (upstream)
        • Distribution and reorganization of data (by the message processor)
        • Data processing (by the stream processor)
        • Storing results, alerting, sending messages downstream
     Major components in stream processing (toy sketch below)
     • Application (generates the stream of data)
     • Message processor
     • Stream processor
     • Data storage (stores processed data, state, etc.)
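      A toy, in-process sketch of these components: a queue stands in for the message processor (e.g. a Kafka topic), a consumer loop stands in for the stream processor, and a dict stands in for data storage. All names are illustrative, not any particular tool's API.

      # Toy end-to-end stream: application -> message processor -> stream processor -> storage.
      import queue

      broker = queue.Queue()            # stands in for the message processor
      storage = {}                      # stands in for the data store

      def application():
          # Upstream application generating a stream of messages.
          for i in range(10):
              broker.put({"sensor": "a", "value": i})
          broker.put(None)              # end-of-stream marker, only for this toy example

      def stream_processor():
          # Reads from the broker, aggregates, and stores the result.
          while (msg := broker.get()) is not None:
              storage[msg["sensor"]] = storage.get(msg["sensor"], 0) + msg["value"]

      application()
      stream_processor()
      print(storage)                    # {'a': 45}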
      Stream processing - Basics
      • A stream can be abstracted as an endless sequence of messages (see the generator sketch below)
      • A stream can be represented by
           • A file
           • A TCP connection
           • A database table
      • Streams can be partitioned
           • Enables parallelization
      • Streams can be
           • Read
           • Written to
           • Joined
           • Filtered
           • Transformed
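      A minimal sketch using plain Python generators to stand in for a stream: an endless sequence of messages that is read, filtered, and transformed lazily. Real stream processors expose similar operations over partitioned, distributed streams; the message fields here are made up.

      # Streams as lazy, potentially endless sequences of messages.
      import itertools, random

      def sensor_stream():
          # Endless source of messages (the "stream").
          for i in itertools.count():
              yield {"id": i, "value": random.random()}

      def high_values(stream):
          # Filter: keep only messages above a threshold.
          return (m for m in stream if m["value"] > 0.9)

      def as_percent(stream):
          # Transform: derive a new field on each message.
          return ({**m, "percent": round(m["value"] * 100)} for m in stream)

      # Read only the first 5 matching messages from the endless stream.
      for msg in itertools.islice(as_percent(high_values(sensor_stream())), 5):
          print(msg)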
         How can we characterize stream processing?
          • Realtime streaming vs. micro-batches
              • Usage of time windows
              • Window length, shift (slide) of a time window
              • Event time vs. processing time
          • Stateless or stateful
              • In many use cases we need to keep stream-processing state
              • For example, aggregations over messages
              • Can be handled by the stream processor or externally (in a database)
          • Out-of-order messages
             • Messages can be received with a delay (network issues, backlog of messages)
     Realtime streaming vs. micro-batches
     • Realtime streaming (true realtime, continuous processing)
         • A message is processed immediately after delivery
         • Messages are processed one by one
         • Low latencies (usually also lower throughput)
         • Output should be available within tens to hundreds of milliseconds
     Realtime streaming vs. micro-batches
     • Micro-batches (near realtime); see the sketch below
         • A message is not processed immediately after delivery
         • Messages are processed together in small batches
         • Latency is at least the length of the batch interval (usually leads to higher throughput)
         • Output is available within seconds or tens of seconds
            Time Windows
            • Length of window
            • Slide interval
            • Windows can overlap
             • Example (see the sketch below)
                 • Length of window: 3 seconds
                 • Slide interval: 2 seconds
                 • Consecutive windows overlap by 1 second
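      A minimal sketch of assigning an event to sliding windows using the parameters from the example (3-second windows sliding every 2 seconds); because the windows overlap, one event can fall into more than one window.

      # Which sliding windows contain a given event timestamp?
      WINDOW = 3.0   # window length in seconds
      SLIDE = 2.0    # slide interval in seconds

      def windows_for(ts):
          """Return (start, end) of every window containing timestamp `ts` (ts >= 0)."""
          wins = []
          start = (ts // SLIDE) * SLIDE       # latest possible window start
          while start >= 0 and start + WINDOW > ts:
              wins.append((start, start + WINDOW))
              start -= SLIDE
          return sorted(wins)

      print(windows_for(4.5))   # [(2.0, 5.0), (4.0, 7.0)] -> the event is in two windows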
     Time Windows - Event vs Processing Time
     • Event Time
        • When the message was generated
     • Processing Time
        • When the message was processed
     • Some tools cannot window messages by event time (see the sketch below)
        • It is necessary to understand the semantics of the timestamps we are working with
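      A small sketch contrasting the two timestamps, assuming each message carries an event_time set by its producer while processing time is taken when the message is handled; a delayed message can end up in different windows depending on which timestamp is used.

      # Event time vs. processing time for the same (delayed) message.
      import time

      def window_start(ts, length=3.0):
          # Start of the tumbling window that contains `ts`.
          return (ts // length) * length

      msg = {"value": 42, "event_time": time.time() - 10}    # generated 10 s ago
      processing_time = time.time()                          # handled just now

      print("event-time window start:     ", window_start(msg["event_time"]))
      print("processing-time window start:", window_start(processing_time))
      # Windowed by event time the message belongs to an older window;
      # windowed by processing time it lands in the current one.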
     Challenges: Stateful Stream Processing
     • Adds another layer of complexity
        • Size of data (does it fit in RAM?)
         • If the state is too large, it slows down stream processing
      • The state can be stored outside the stream processor (sketch below)
          • Databases (Redis, HBase, Cassandra, …)
      • Watermarking can be used to drop old data (more on that later)
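      A minimal sketch of stateful processing, a running count per key, with the state kept in a plain dict standing in for an external store such as Redis; the message fields are illustrative.

      # Stateful stream processing sketch: per-key running counts.
      state = {}                                 # stands in for an external store (e.g. Redis)

      def process(message):
          key = message["user"]
          state[key] = state.get(key, 0) + 1     # read-modify-write of the kept state
          return {"user": key, "count": state[key]}

      for msg in [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]:
          print(process(msg))
      # {'user': 'alice', 'count': 1}
      # {'user': 'bob', 'count': 1}
      # {'user': 'alice', 'count': 2}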
     Challenges: Out-of-order messages
     • How to handle messages that are received out of order?
         • The solution depends on the use case
         • We can ignore them
         • We can reprocess the data
         • Or a custom action can be executed (alert, include in a separate pipeline)
      • Some tools use watermarking (sketch below)
         • A threshold specifying how long the stream processor waits for delayed messages
         • If a message arrives before the configured watermark, it is processed
         • Otherwise it is dropped
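      A minimal watermarking sketch, assuming a 5-second allowed delay: the watermark trails the largest event time seen so far, and messages whose event time falls behind it are dropped.

      # Watermarking sketch: drop messages that arrive too far behind in event time.
      ALLOWED_DELAY = 5.0                 # how long we wait for delayed messages (seconds)
      max_event_time = float("-inf")

      def handle(message):
          global max_event_time
          max_event_time = max(max_event_time, message["event_time"])
          watermark = max_event_time - ALLOWED_DELAY
          if message["event_time"] >= watermark:
              print("processed:", message)
          else:
              print("dropped (arrived after the watermark):", message)

      for m in [{"id": 1, "event_time": 100.0},
                {"id": 2, "event_time": 104.0},
                {"id": 3, "event_time": 97.0},    # 7 s behind the latest event -> dropped
                {"id": 4, "event_time": 101.0}]:  # only 3 s behind -> still processed
          handle(m)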
     Common tools
     • Kafka (Streams API)
      • Flink (true streaming)
     • NiFi (Data flow management system, not true streaming)
     • Spark (Streaming API, Structured Streaming API)