Unit 5
Big Data Platforms Overview
Sqoop
Definition: Tool for transferring data between Hadoop and relational databases.
Working
Connects to RDBMS via JDBC.
Uses the import command to pull data into Hadoop (see the sketch after this list).
Data is imported into HDFS, Hive, or HBase.
Uses MapReduce internally for parallel processing.
Supports export of data back to RDBMS.
Automates repetitive data transfer tasks.
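Since Sqoop runs from the command line, repetitive imports are usually scripted. Below is a minimal sketch wrapping a typical sqoop import invocation in Python; the JDBC URL, credentials file, table name, and target directory are placeholders, not values from these notes.

import subprocess

# Hypothetical connection details; Sqoop reads the password from a file on HDFS.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # source RDBMS via JDBC
    "--username", "etl_user",
    "--password-file", "/user/etl/.dbpass",
    "--table", "orders",                             # table to import
    "--target-dir", "/data/raw/orders",              # destination in HDFS
    "--num-mappers", "4",                            # parallel MapReduce mappers
]
subprocess.run(cmd, check=True)  # raises if the import fails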
Features
Bi-directional data transfer (Import & Export).
Supports many databases via JDBC.
Incremental load support.
Parallel import/export using MapReduce.
Direct mode for faster transfer.
Integration with Hive and HBase.
Benefits
Easy RDBMS-Hadoop data exchange.
Saves time with automation.
Scales with large data volumes.
Reduces manual coding.
Efficient and fast.
Open-source and widely supported.
Applications
Migrating legacy data to Hadoop.
ETL workflows.
Data backup from Hadoop to RDBMS.
Loading data into Hive for analysis.
Data warehouse population.
Synchronizing databases and big data systems.
Cassandra
Definition: A distributed NoSQL database for managing large amounts of structured data.
Working
Stores data in tables using rows and columns.
Uses peer-to-peer architecture (no master node).
Writes go first to a commit log, then to an in-memory memtable.
Memtables are periodically flushed to disk as SSTables.
Supports replication across nodes.
Uses CQL (Cassandra Query Language).
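A minimal sketch of this write/read path using the Python cassandra-driver; the contact point, keyspace, and table names are illustrative assumptions.

from cassandra.cluster import Cluster

# Connect to one node; the driver discovers the rest of the ring (peer-to-peer, no master).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS iot "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.set_keyspace("iot")
session.execute(
    "CREATE TABLE IF NOT EXISTS readings ("
    "sensor_id text, ts timestamp, temp float, PRIMARY KEY (sensor_id, ts))"
)

# CQL statements go through the commit log / memtable path described above.
session.execute(
    "INSERT INTO readings (sensor_id, ts, temp) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor_1", 22.5),
)
for row in session.execute("SELECT * FROM readings WHERE sensor_id = %s", ("sensor_1",)):
    print(row.sensor_id, row.ts, row.temp)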
Features
Highly scalable and distributed.
No single point of failure.
Supports high write throughput.
Flexible, wide-column data model.
Tunable consistency levels.
Multi-datacenter replication.
Benefits
Handles large volumes of data.
High availability and fault tolerance.
Real-time performance.
Great for write-heavy apps.
Can grow horizontally.
Open-source and community-driven.
Applications
Real-time analytics.
IoT platforms.
Time-series data storage.
Recommendation engines.
Messaging apps.
E-commerce user activity tracking.
MongoDB
Definition: NoSQL database that stores data in JSON-like documents (BSON format).
Working
Stores data as key-value pairs in documents.
Collections group related documents.
Uses indexes for faster queries.
Supports CRUD operations via queries.
Can distribute data across shards.
Uses replica sets for fault tolerance.
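A minimal sketch of these operations with pymongo, assuming a local MongoDB instance; the database, collection, and field names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Insert a document; the 'users' collection is created on first write.
db.users.insert_one({"name": "Asha", "email": "asha@example.com", "city": "Pune"})

# Index a field to speed up queries on it.
db.users.create_index("email")

# Basic CRUD via queries.
user = db.users.find_one({"email": "asha@example.com"})
db.users.update_one({"email": "asha@example.com"}, {"$set": {"city": "Mumbai"}})
db.users.delete_one({"email": "asha@example.com"})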
Features
Document-oriented storage.
Schema-less/flexible data model.
High availability via replication.
Sharding for scalability.
Rich querying and aggregation.
Built-in security features.
Benefits
Easy to develop and maintain.
Scales horizontally.
Ideal for dynamic or changing data.
Good for real-time analytics.
Supports rich data types.
Active community and support.
Applications
Content management systems.
Product catalogs.
Mobile app backends.
Real-time analytics.
IoT data handling.
User profiles and session data.
Hive
Definition: Data warehouse system built on Hadoop that uses SQL-like queries (HiveQL).
Working
Converts SQL-like queries to MapReduce jobs.
Stores data in HDFS.
Users define tables and schemas.
HiveQL is parsed and optimized.
Executes in batch mode.
Shows output in CLI or writes to HDFS.
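A minimal sketch of submitting HiveQL from Python through the third-party PyHive client, assuming a HiveServer2 instance on the default port; the host, user, and table details are illustrative (question 7 below shows a fuller HiveQL model).

from pyhive import hive  # third-party client for HiveServer2

conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cur = conn.cursor()

# Each statement is parsed, optimized, and run as a batch job on the cluster.
cur.execute("CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, ts STRING)")
cur.execute("SELECT ip, COUNT(*) AS hits FROM web_logs GROUP BY ip")
for ip, hits in cur.fetchall():
    print(ip, hits)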
Features
SQL-like query language (HiveQL).
Supports large datasets.
Integration with HDFS.
Partitioning and bucketing for speed.
UDFs for custom functions.
Extensible and scalable.
Benefits
Easy for SQL users to learn.
Handles petabytes of data.
Seamless integration with Hadoop.
Cost-effective batch processing.
Supports various data formats.
Works with BI tools.
Applications
Data summarization and reporting.
Log analysis.
Batch data processing.
Data ETL and transformation.
Business intelligence.
Data mining on Hadoop.
Pig
Definition: Platform for processing large datasets using Pig Latin scripting.
Working
Write scripts in Pig Latin.
Scripts define a data flow (like ETL).
Compiler translates scripts into MapReduce.
Jobs are run on Hadoop.
Intermediate results are stored in HDFS.
Final results can be dumped or stored.
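A minimal sketch of such a data flow: a small Pig Latin script (hypothetical file names and fields) written out and run in local mode from Python; drop "-x local" to run the same script on the Hadoop cluster.

import subprocess

# Load, group, aggregate, store: the typical ETL-style flow described above.
script = """
logs    = LOAD 'web_logs.csv' USING PigStorage(',') AS (user:chararray, bytes:int);
by_user = GROUP logs BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
STORE totals INTO 'output/user_totals';
"""

with open("user_totals.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-x", "local", "user_totals.pig"], check=True)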
Features
Simple scripting language (Pig Latin).
Handles structured and semi-structured data.
Automatically handles parallel execution.
Supports custom functions (UDFs).
Interoperates with HDFS and Hive.
Debugging and optimization tools.
Benefits
Easy for developers to use.
Faster development than writing Java MapReduce.
Good for complex data transformations.
Scalable with large datasets.
Error handling and debugging support.
Integrates with existing Hadoop tools.
Applications
ETL (Extract, Transform, Load) processes.
Data cleaning and preprocessing.
Web log analysis.
Text mining.
Batch processing tasks.
Ad-hoc data analysis.
Apache Storm
Definition: Real-time computation system for processing streaming data.
Working
Uses spouts (data sources) and bolts (processors).
Data flows through a topology (a directed graph).
Topologies run indefinitely (unlike MapReduce jobs).
Spouts emit streams of data tuples.
Bolts process data, update DBs, or emit new tuples.
Storm distributes work across clusters for parallel processing.
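A minimal bolt sketch using the third-party streamparse Python bindings (Storm itself is JVM-based; streamparse talks to it over the multi-lang protocol). The spout and the topology wiring that connects spouts to bolts are omitted here.

from collections import Counter
from streamparse import Bolt

class WordCountBolt(Bolt):
    """Receives one word per tuple from an upstream spout and emits running counts."""

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])  # downstream bolts or sinks receive (word, count)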
Features
Real-time data processing.
Fault-tolerant and highly scalable.
Horizontal scalability.
Supports multiple languages (Java, Python).
Guaranteed data processing (at least once).
Pluggable architecture.
Benefits
Low latency processing.
Scales to massive data streams.
Reliable and robust.
Easily integrates with other tools.
Flexible for complex workflows.
Open-source and community-supported.
Applications
Real-time analytics and monitoring.
Online machine learning.
ETL for real-time data.
Fraud detection systems.
Sensor data processing.
Live user tracking and engagement.
Apache Flink
Definition: Stream processing framework for real-time and batch data processing.
Working
Processes data as streams, not batches.
Uses operators to transform data streams.
Supports event time and windowing.
Provides exactly-once state consistency.
Distributes tasks across a cluster.
Handles fault tolerance via checkpoints.
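A minimal PyFlink sketch of a keyed, stateful transformation: readings are partitioned per sensor and a reduce keeps the running maximum as managed state, which checkpointing makes fault-tolerant. The sensor values are made-up sample data; a fuller filtering example appears in question 6 below.

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common.typeinfo import Types

env = StreamExecutionEnvironment.get_execution_environment()

# (sensor id, temperature) readings; in practice these would arrive from Kafka or a socket.
readings = env.from_collection(
    [("sensor_1", 22.5), ("sensor_2", 25.7), ("sensor_1", 23.1)],
    type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
)

# key_by partitions the stream per sensor; reduce keeps per-key state (the running maximum).
running_max = readings.key_by(lambda r: r[0]).reduce(lambda a, b: (a[0], max(a[1], b[1])))
running_max.print()

env.execute("per-sensor running max")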
Features
True streaming (vs micro-batching).
Exactly-once processing semantics.
Unified batch and stream processing.
Low latency and high throughput.
Flexible windowing support.
Built-in state management.
Benefits
Real-time and historical data handling.
Highly accurate results with minimal delay.
Scalable and fault-tolerant.
Optimized for complex event processing.
Good for both stateless and stateful apps.
Strong ecosystem and API support.
Applications
Real-time recommendation systems.
Clickstream data analysis.
Streaming ETL pipelines.
IoT sensor analytics.
Fraud detection.
Stock market monitoring.
Apache
Definition: Apache Software Foundation (ASF) is an open-source community developing
and maintaining many big data tools.
Working
Supports a wide variety of projects (Hadoop, Spark, etc).
Projects are community-driven and open-source.
Contributors collaborate globally.
Focuses on scalable, distributed software.
Hosts projects with clear governance models.
Releases stable versions and documentation.
Features
Home of 350+ open-source projects.
Emphasizes community-driven development.
Supports scalable big data technologies.
Backed by robust infrastructure.
Offers diverse tech stacks (data, web, AI).
Free and open-source.
Benefits
Access to cutting-edge technologies.
No licensing costs.
Strong community and support.
Highly scalable tools.
Wide adoption in industry.
Constant innovation.
Applications
Big data processing (Hadoop, Spark).
Web server management (HTTP Server).
Data storage and retrieval (Cassandra, HBase).
Stream processing (Storm, Flink).
Data warehousing (Hive, Drill).
AI/ML pipelines (Mahout, MXNet).
Storm
Definition: Apache Storm is a real-time stream processing system for processing large-scale
data streams.
Working
Uses spouts to ingest data and bolts to process it.
Topologies define the flow of data through spouts and bolts.
Data is processed in real-time in distributed clusters.
Processes streams as continuous events.
Uses workers and supervisors for scalability.
Provides fault-tolerant processing with retries.
Features
Real-time stream processing.
Fault-tolerant and reliable.
Horizontal scalability.
Supports parallel processing.
Pluggable components (e.g., spouts, bolts).
Flexible for custom processing logic.
Benefits
Real-time analytics.
Highly scalable and fault-tolerant.
Efficient for high-volume data streams.
Supports custom stream processing logic.
Distributed architecture for high availability.
Integrates easily with other big data tools.
Applications
Real-time analytics and monitoring.
Fraud detection in financial transactions.
Real-time recommendation systems.
Social media analysis.
Sensor data processing.
Log aggregation and processing.
Flink
Definition: Apache Flink is a stream processing framework that performs real-time data
processing with high throughput and low latency.
Working
Processes data in both batch and stream modes.
Uses operators to transform and aggregate streams.
Can be run on YARN, Kubernetes, or standalone.
Provides stateful processing for stream processing.
Fault tolerance via distributed snapshots.
Supports event-time processing and windowing.
Features
Real-time and batch processing.
High throughput with low latency.
Event-time and processing-time windows.
Stateful stream processing.
Fault tolerance and exactly-once semantics.
Unified API for batch and stream processing.
Benefits
Suitable for low-latency applications.
Flexible for real-time and batch processing.
Supports high scalability.
High fault tolerance and reliability.
Easy to integrate with other big data systems.
Powerful event-time and windowing capabilities.
Applications
Real-time data analytics.
Event-driven applications.
IoT stream processing.
Real-time financial data processing.
Log processing.
Continuous machine learning pipelines.
Apache
Definition: The Apache Software Foundation hosts a collection of open-source projects for
distributed computing, including Hadoop, Kafka, and Spark.
Working
Apache projects use distributed clusters to scale processing.
Supports batch and stream processing frameworks.
Use of MapReduce, stream processing, and distributed computation.
Supports integration with other big data systems.
Can handle massive data sets efficiently.
Built-in fault tolerance and scalability.
Features
Scalable distributed processing.
Wide variety of tools and frameworks (Hadoop, Spark, Kafka).
Support for both batch and real-time data.
Fault tolerance and reliability.
Data storage and processing integration.
Open-source and community-supported.
Benefits
Massive scalability for big data workloads.
High fault tolerance.
Open-source and cost-effective.
Broad ecosystem of related tools.
Active community and continuous improvements.
Integrates well with other enterprise systems.
Applications
Big data analytics.
Real-time stream processing.
Data warehousing and ETL.
Machine learning pipelines.
Data lakes and distributed storage.
Log processing and monitoring.
MID
1. Elucidate the document-based storage model in MongoDB.
2. Evaluate the use of MongoDB for handling unstructured data in a big data application.
3. Describe how Apache Storm processes real-time data (easy, 7 points).
4. Difference between Storm and Spark Streaming (tabular format, easy, 10 points, only 2 columns).
5. Benefits of using Flink for stream processing (easy, 10 points).
6. Develop a Flink app to process sensor data in real time.
7. Implement a data warehouse model using Hive for an e-commerce company.
1. Elucidate the document-based storage model in MongoDB
MongoDB stores data in JSON-like documents (called BSON).
Each document is a collection of key-value pairs, and fields can vary from
document to document.
Documents are grouped in collections (similar to tables in RDBMS).
The model is schema-less, meaning documents in the same collection can have
different structures.
It supports nested documents and arrays for complex data.
Ideal for applications that deal with flexible, hierarchical, or evolving data
structures.
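For illustration, a sample order document (hypothetical field names) showing how nested documents and arrays keep related data in one record, so no joins are needed to read an order:

order = {
    "order_id": 1001,
    "customer": {"name": "Asha", "city": "Pune"},        # nested document
    "items": [                                           # array of sub-documents
        {"sku": "A12", "qty": 2, "price": 499.0},
        {"sku": "B07", "qty": 1, "price": 1299.0},
    ],
    "status": "shipped",
}
# Another document in the same collection may omit 'status' or add new fields.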
2. Evaluate MongoDB for handling unstructured data in big data applications
Schema flexibility: Can store unstructured and semi-structured data (e.g., logs,
social media).
Scalability: Supports horizontal scaling through sharding.
Query power: Rich query language to extract insights from unstructured data.
Indexing: Can index any field to speed up queries.
Integration: Works well with Hadoop and Spark for big data ecosystems.
Real-time processing: Supports real-time analytics using change streams and
aggregation pipelines.
Limitation: not ideal for complex joins; multi-document ACID transactions are supported but heavier than in an RDBMS.
3. Describe how Apache Storm processes real-time data (Easy – 7 Points)
1. Stream-based processing framework.
2. Spouts receive data from sources (e.g., Kafka).
3. Bolts process, transform, or store data.
4. Topologies define data flow from spouts to bolts.
5. Fault-tolerant – failed tuples are replayed.
6. Real-time – processes data as it arrives (low latency).
7. Scales horizontally to handle large volumes.
4. Difference between Storm and Spark Streaming (Tabular Format – Easy 10 Points)
Apache Storm | Spark Streaming
True real-time processing | Micro-batch processing
Processes one record at a time | Processes data in small time batches
Lower latency | Slightly higher latency
Complex to manage and debug | Easier to manage and monitor
Better for real-time use cases | Better for near-real-time and batch use cases
Native fault tolerance | Fault tolerance via RDD lineage
No built-in high-level APIs | Rich APIs in Java, Scala, Python
Supports tuple-level reliability | No per-record acknowledgment
Stateful processing via Trident | Built-in stateful operations
Lightweight and fast | More resource-consuming
5. Benefits of using Flink for stream processing (Easy – 10 Points)
1. True streaming (not micro-batching).
2. Low latency and high throughput.
3. Exactly-once state consistency.
4. Handles event time and late events.
5. Fault-tolerant with checkpointing.
6. Horizontal scalability.
7. Supports both batch and stream processing.
8. Flexible windowing (time, count, session).
9. Backpressure handling to control data flow.
10. Integrates well with Kafka, HDFS, Elasticsearch.
6. Develop a Flink App to Process Sensor Data in Real Time (Simple Example in Python)
Example in Python (PyFlink):
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common.typeinfo import Types

env = StreamExecutionEnvironment.get_execution_environment()

# Sample sensor data stream: (sensor id, temperature)
data = env.from_collection(
    [
        ("sensor_1", 22.5),
        ("sensor_2", 25.7),
        ("sensor_1", 23.1),
    ],
    type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
)

# Keep only readings with temperature above 23
high_temp = data.filter(lambda x: x[1] > 23)
high_temp.print()

env.execute("Sensor Data Stream")
7. Implement a Data Warehouse Model Using Hive for an E-Commerce Company
Steps:
1. Create database:
CREATE DATABASE ecommerce;
USE ecommerce;
2. Create dimension tables:
CREATE TABLE customers (
customer_id INT,
name STRING,
email STRING,
city STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE products (
product_id INT,
name STRING,
category STRING,
price FLOAT
);
3. Create fact table:
CREATE TABLE sales (
sale_id INT,
customer_id INT,
product_id INT,
quantity INT,
sale_date STRING
);
4. Partitioning (optional; bucketing can be added with CLUSTERED BY):
CREATE TABLE sales_partitioned (
sale_id INT,
customer_id INT,
product_id INT,
quantity INT
)
PARTITIONED BY (sale_date STRING);
5. Load data and run queries:
LOAD DATA INPATH '/path/sales.csv' INTO TABLE sales;
SELECT c.name AS customer, p.name AS product, s.quantity
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;