3/5/2024
Batch Processing
    What is batch processing?
    • Processing data in groups
    • Runs from the start of the process to the finish
        • No data is added in between
    • Typically triggered by
        • a time interval
        • a starting event
    • Data is processed in groups of a certain size (the batch size; see the sketch below)
    • An instance of a batch process is often referred to as a job
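      A minimal sketch of a batch job in Python (the file name logs.txt and the per-batch work are assumed for illustration): records are read in fixed-size batches, each batch is processed as a group, and no new data is picked up once the run has started.

      # Minimal batch-job sketch; file name and processing step are placeholders.
      BATCH_SIZE = 1000

      def process_batch(lines):
          # Stand-in for the real work done on one batch of records.
          return sum(len(line) for line in lines)

      def run_job(path="logs.txt"):          # assumed input file
          batch, total = [], 0
          with open(path) as f:              # the data set is fixed once the job starts
              for line in f:
                  batch.append(line)
                  if len(batch) == BATCH_SIZE:
                      total += process_batch(batch)
                      batch = []
              if batch:                      # final, possibly smaller batch
                  total += process_batch(batch)
          return total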
    Common batch processing scenarios
    • Reading files or parts of files (text, mp3, etc.)
    • Sending/receiving email
    • Printing
    Why batch?
    • Simple
    • Generally consistent
    • Multiple ways to improve performance
     What is scaling?
     • Improving performance
         • Processing more quickly
             • Less time to process the same amount of data
         • Processing more data
             • More data processed in the same amount of time
    Vertical scaling
    • More powerful hardware
        • Faster CPU
        • Faster I/O
        • More memory
    • Typically the easiest kind of scaling
        • Least complexity
        • Rarely requires changing the underlying programs/algorithms
          Horizontal scaling
    • Splitting a task into multiple parts
        • More computers
        • Could also be more CPUs
    • Best suited to tasks that are "embarrassingly parallel"
        • Tasks that can be easily divided among workers (see the sketch below)
    • Advantages
        • Can be very cost-effective
        • Can give near-linear performance improvements for certain types of processes
    • Drawbacks
        • Complexity
            • Requires a processing framework (like Apache Spark or Dask)
            • Requires more extensive networking
        • Ongoing management
        • Can be expensive depending on requirements
        • "Non-parallel" tasks benefit little
          Batch issues
          • Delays
             • Time until data is ready to process
                  • Is all data available?
             • Time until the process begins
                  • When does the next interval start?
             • Time to process data
                  • How long until completion?
             • Time until processed data is available for use
                  • How long until users can use the data?
     Example #1
     • Waiting on the source data
     • Machines send log files at times of low utilization
     • Works fine during periods of normal utilization
     • High utilization would limit the ability to send logs, potentially hiding issues
     Example #2
     • Waiting on the process
     • 100 GB of log files per day
     • Currently takes 23 hrs to process
     • Approximately 4.35 GB/hr of throughput
     • Volume grows at 5% per month
     • Next month: ~105 GB, taking ~24 hrs
     • The following month: ~110 GB, taking ~25 hrs
     • Soon it takes longer than a day to process one day's worth of data (see the calculation below)
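      A small sketch of the arithmetic behind this example, assuming throughput stays fixed at roughly 100 GB / 23 hrs while the daily volume grows 5% per month:

      # Projected processing time at a fixed throughput of 100 GB / 23 hrs (~4.35 GB/hr).
      throughput = 100 / 23                  # GB per hour
      for month in range(3):
          volume = 100 * 1.05 ** month       # daily volume after `month` months of 5% growth
          hours = volume / throughput
          print(f"month {month}: ~{volume:.0f} GB -> ~{hours:.1f} hrs")
      # Months 0-2 give roughly 23, 24 and 25 hours: by next month the daily batch
      # already takes longer than a day to process.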
     Example #3
     • Waiting for the data to be available
     • How long until analytics are available?
     • Sales report must wait for all information to be generated
         • The sum of delays is the minimum time to generate a new report
         • Time to collect/prepare data: 1 day
         • Time required to process data: 7 hrs
         • Time to update systems: 5 hrs
         • Time to generate a report: 2 min
     • Total time for each report: 1 day + 7 hrs + 5 hrs + 2 min ≈ 1.5 days (see the check below)
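      A quick check of the total, summing the delays listed above (a minimal sketch; the numbers come straight from this slide):

      from datetime import timedelta

      # The sum of the delays is the minimum time to a fresh report.
      delays = [
          timedelta(days=1),            # collect/prepare data
          timedelta(hours=7),           # process data
          timedelta(hours=5),           # update systems
          timedelta(minutes=2),         # generate the report
      ]
      total = sum(delays, timedelta())
      print(total, "->", round(total.total_seconds() / 86400, 2), "days")
      # 1 day, 12:02:00 -> 1.5 days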
                    Stream Processing
     Stream processing - Basics
     • Streaming data lifecycle
        • Data is generated (upstream)
        • Distribution and reorganization of data (by the message processor)
        • Data processing (by the stream processor)
        • Storing results, alerting, sending messages downstream
     Major components in stream processing (toy sketch below)
     • Application (generates the stream of data)
     • Message processor
     • Stream processor
     • Data storage (stores processed data, state, etc.)
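      A toy, in-process sketch of these components: a queue stands in for the message processor (e.g. a Kafka topic), a consumer loop stands in for the stream processor, and a dict stands in for data storage. All names are illustrative, not any particular tool's API.

      # Toy end-to-end stream: application -> message processor -> stream processor -> storage.
      import queue

      broker = queue.Queue()            # stands in for the message processor
      storage = {}                      # stands in for the data store

      def application():
          # Upstream application generating a stream of messages.
          for i in range(10):
              broker.put({"sensor": "a", "value": i})
          broker.put(None)              # end-of-stream marker, only for this toy example

      def stream_processor():
          # Reads from the broker, aggregates, and stores the result.
          while (msg := broker.get()) is not None:
              storage[msg["sensor"]] = storage.get(msg["sensor"], 0) + msg["value"]

      application()
      stream_processor()
      print(storage)                    # {'a': 45}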
      Stream processing - Basics
      • A stream can be abstracted as an endless sequence of messages (see the generator sketch below)
      • A stream can be represented by
           • A file
           • A TCP connection
           • A database table
      • Streams can be partitioned
           • Enables parallelization
      • Streams can be
           • Read
           • Written to
           • Joined
           • Filtered
           • Transformed
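      A minimal sketch using plain Python generators to stand in for a stream: an endless sequence of messages that is read, filtered, and transformed lazily. Real stream processors expose similar operations over partitioned, distributed streams; the message fields here are made up.

      # Streams as lazy, potentially endless sequences of messages.
      import itertools, random

      def sensor_stream():
          # Endless source of messages (the "stream").
          for i in itertools.count():
              yield {"id": i, "value": random.random()}

      def high_values(stream):
          # Filter: keep only messages above a threshold.
          return (m for m in stream if m["value"] > 0.9)

      def as_percent(stream):
          # Transform: derive a new field on each message.
          return ({**m, "percent": round(m["value"] * 100)} for m in stream)

      # Read only the first 5 matching messages from the endless stream.
      for msg in itertools.islice(as_percent(high_values(sensor_stream())), 5):
          print(msg)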
         How can we characterize stream processing?
          • Realtime streaming vs. micro-batches
              • Usage of time windows
              • Window length, shift (slide) of a time window
              • Event time vs. processing time
          • Stateless or stateful
              • In many use cases we need to keep stream-processing state
              • For example, aggregations over messages
              • Can be handled by the stream processor or externally (in a database)
          • Out-of-order messages
             • Messages can be received with a delay (network issues, backlog of messages)
     Realtime streaming vs. micro-batches
     • Realtime streaming (true realtime, continuous processing)
         • A message is processed immediately after delivery
         • Messages are processed one by one
         • Low latencies (usually also lower throughput)
         • Output should be available within tens to hundreds of milliseconds
     Realtime streaming vs. micro-batches
     • Micro-batches (near realtime); see the sketch below
         • A message is not processed immediately after delivery
         • Messages are processed together in small batches
         • Latency is at least the length of the batch interval (usually leads to higher throughput)
         • Output is available within seconds or tens of seconds
            Time Windows
            • Length of window
            • Slide interval
            • Windows can overlap
             • Example (see the sketch below)
                 • Length of window: 3 seconds
                 • Slide interval: 2 seconds
                 • Consecutive windows overlap by 1 second
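      A minimal sketch of assigning an event to sliding windows using the parameters from the example (3-second windows sliding every 2 seconds); because the windows overlap, one event can fall into more than one window.

      # Which sliding windows contain a given event timestamp?
      WINDOW = 3.0   # window length in seconds
      SLIDE = 2.0    # slide interval in seconds

      def windows_for(ts):
          """Return (start, end) of every window containing timestamp `ts` (ts >= 0)."""
          wins = []
          start = (ts // SLIDE) * SLIDE       # latest possible window start
          while start >= 0 and start + WINDOW > ts:
              wins.append((start, start + WINDOW))
              start -= SLIDE
          return sorted(wins)

      print(windows_for(4.5))   # [(2.0, 5.0), (4.0, 7.0)] -> the event is in two windows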
     Time Windows - Event vs Processing Time
     • Event Time
        • When the message was generated
     • Processing Time
        • When the message was processed
     • Some tools cannot window messages by event time (see the sketch below)
        • It is necessary to understand the semantics of the timestamps we are working with
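      A small sketch contrasting the two timestamps, assuming each message carries an event_time set by its producer while processing time is taken when the message is handled; a delayed message can end up in different windows depending on which timestamp is used.

      # Event time vs. processing time for the same (delayed) message.
      import time

      def window_start(ts, length=3.0):
          # Start of the tumbling window that contains `ts`.
          return (ts // length) * length

      msg = {"value": 42, "event_time": time.time() - 10}    # generated 10 s ago
      processing_time = time.time()                          # handled just now

      print("event-time window start:     ", window_start(msg["event_time"]))
      print("processing-time window start:", window_start(processing_time))
      # Windowed by event time the message belongs to an older window;
      # windowed by processing time it lands in the current one.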
     Challenges: Stateful Stream Processing
     • Adds another layer of complexity
        • Size of data (does it fit in RAM?)
         • If the state is too large, it slows down stream processing
      • The state can be stored outside the stream processor (sketch below)
          • Databases (Redis, HBase, Cassandra, …)
      • Watermarking can be used to drop old data (more on that later)
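      A minimal sketch of stateful processing, a running count per key, with the state kept in a plain dict standing in for an external store such as Redis; the message fields are illustrative.

      # Stateful stream processing sketch: per-key running counts.
      state = {}                                 # stands in for an external store (e.g. Redis)

      def process(message):
          key = message["user"]
          state[key] = state.get(key, 0) + 1     # read-modify-write of the kept state
          return {"user": key, "count": state[key]}

      for msg in [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]:
          print(process(msg))
      # {'user': 'alice', 'count': 1}
      # {'user': 'bob', 'count': 1}
      # {'user': 'alice', 'count': 2}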
     Challenges: Out-of-order messages
     • How to handle messages that are received out of order?
         • The solution depends on the use case
         • We can ignore them
         • We can reprocess the data
         • Or a custom action can be executed (alert, include in a separate pipeline)
      • Some tools use watermarking (sketch below)
         • A threshold specifying how long the stream processor waits for delayed messages
         • If a message arrives before the configured watermark, it is processed
         • Otherwise it is dropped
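      A minimal watermarking sketch, assuming a 5-second allowed delay: the watermark trails the largest event time seen so far, and messages whose event time falls behind it are dropped.

      # Watermarking sketch: drop messages that arrive too far behind in event time.
      ALLOWED_DELAY = 5.0                 # how long we wait for delayed messages (seconds)
      max_event_time = float("-inf")

      def handle(message):
          global max_event_time
          max_event_time = max(max_event_time, message["event_time"])
          watermark = max_event_time - ALLOWED_DELAY
          if message["event_time"] >= watermark:
              print("processed:", message)
          else:
              print("dropped (arrived after the watermark):", message)

      for m in [{"id": 1, "event_time": 100.0},
                {"id": 2, "event_time": 104.0},
                {"id": 3, "event_time": 97.0},    # 7 s behind the latest event -> dropped
                {"id": 4, "event_time": 101.0}]:  # only 3 s behind -> still processed
          handle(m)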
     Common tools
     • Kafka (Streams API)
      • Flink (true streaming)
     • NiFi (Data flow management system, not true streaming)
     • Spark (Streaming API, Structured Streaming API)