Unit 5
Big Data Platforms Overview
Sqoop
Definition: Tool for transferring data between Hadoop and relational databases.
Working
Connects to RDBMS via JDBC.
Uses the import command to pull data into Hadoop (see the sketch after this list).
Data is imported into HDFS, Hive, or HBase.
Uses MapReduce internally for parallel processing.
Supports export of data back to RDBMS.
Automates repetitive data transfer tasks.
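Since Sqoop runs from the command line, repetitive imports are usually scripted. Below is a minimal sketch wrapping a typical sqoop import invocation in Python; the JDBC URL, credentials file, table name, and target directory are placeholders, not values from these notes.

import subprocess

# Hypothetical connection details; Sqoop reads the password from a file on HDFS.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # source RDBMS via JDBC
    "--username", "etl_user",
    "--password-file", "/user/etl/.dbpass",
    "--table", "orders",                             # table to import
    "--target-dir", "/data/raw/orders",              # destination in HDFS
    "--num-mappers", "4",                            # parallel MapReduce mappers
]
subprocess.run(cmd, check=True)  # raises if the import fails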
Features
Bi-directional data transfer (Import & Export).
Supports many databases via JDBC.
Incremental load support.
Parallel import/export using MapReduce.
Direct mode for faster transfer.
Integration with Hive and HBase.
Benefits
Easy RDBMS-Hadoop data exchange.
Saves time with automation.
Scales with large data volumes.
Reduces manual coding.
Efficient and fast.
Open-source and widely supported.
Applications
Migrating legacy data to Hadoop.
ETL workflows.
Data backup from Hadoop to RDBMS.
Loading data into Hive for analysis.
Data warehouse population.
Synchronizing databases and big data systems.
Cassandra
Definition: A distributed NoSQL database for managing large amounts of structured data.
Working
Stores data in tables using rows and columns.
Uses peer-to-peer architecture (no master node).
Writes go first to a commit log, then to an in-memory memtable.
Memtables are periodically flushed to disk as SSTables.
Supports replication across nodes.
Uses CQL (Cassandra Query Language).
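A minimal sketch of this write/read path using the Python cassandra-driver; the contact point, keyspace, and table names are illustrative assumptions.

from cassandra.cluster import Cluster

# Connect to one node; the driver discovers the rest of the ring (peer-to-peer, no master).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS iot "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.set_keyspace("iot")
session.execute(
    "CREATE TABLE IF NOT EXISTS readings ("
    "sensor_id text, ts timestamp, temp float, PRIMARY KEY (sensor_id, ts))"
)

# CQL statements go through the commit log / memtable path described above.
session.execute(
    "INSERT INTO readings (sensor_id, ts, temp) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor_1", 22.5),
)
for row in session.execute("SELECT * FROM readings WHERE sensor_id = %s", ("sensor_1",)):
    print(row.sensor_id, row.ts, row.temp)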
Features
Highly scalable and distributed.
No single point of failure.
Supports high write throughput.
Flexible, wide-column data model.
Tunable consistency levels.
Multi-datacenter replication.
Benefits
Handles large volumes of data.
High availability and fault tolerance.
Real-time performance.
Great for write-heavy apps.
Can grow horizontally.
Open-source and community-driven.
Applications
Real-time analytics.
IoT platforms.
Time-series data storage.
Recommendation engines.
Messaging apps.
E-commerce user activity tracking.
MongoDB
Definition: NoSQL database that stores data in JSON-like documents (BSON format).
Working
Stores data as key-value pairs in documents.
Collections group related documents.
Uses indexes for faster queries.
Supports CRUD operations via queries.
Can distribute data across shards.
Uses replica sets for fault tolerance.
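A minimal sketch of these operations with pymongo, assuming a local MongoDB instance; the database, collection, and field names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Insert a document; the 'users' collection is created on first write.
db.users.insert_one({"name": "Asha", "email": "asha@example.com", "city": "Pune"})

# Index a field to speed up queries on it.
db.users.create_index("email")

# Basic CRUD via queries.
user = db.users.find_one({"email": "asha@example.com"})
db.users.update_one({"email": "asha@example.com"}, {"$set": {"city": "Mumbai"}})
db.users.delete_one({"email": "asha@example.com"})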
Features
Document-oriented storage.
Schema-less/flexible data model.
High availability via replication.
Sharding for scalability.
Rich querying and aggregation.
Built-in security features.
Benefits
Easy to develop and maintain.
Scales horizontally.
Ideal for dynamic or changing data.
Good for real-time analytics.
Supports rich data types.
Active community and support.
Applications
Content management systems.
Product catalogs.
Mobile app backends.
Real-time analytics.
IoT data handling.
User profiles and session data.
Hive
Definition: Data warehouse system built on Hadoop that uses SQL-like queries (HiveQL).
Working
Converts SQL-like queries to MapReduce jobs.
Stores data in HDFS.
Users define tables and schemas.
HiveQL is parsed and optimized.
Executes in batch mode.
Shows output in CLI or writes to HDFS.
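A minimal sketch of submitting HiveQL from Python through the third-party PyHive client, assuming a HiveServer2 instance on the default port; the host, user, and table details are illustrative (question 7 below shows a fuller HiveQL model).

from pyhive import hive  # third-party client for HiveServer2

conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cur = conn.cursor()

# Each statement is parsed, optimized, and run as a batch job on the cluster.
cur.execute("CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, ts STRING)")
cur.execute("SELECT ip, COUNT(*) AS hits FROM web_logs GROUP BY ip")
for ip, hits in cur.fetchall():
    print(ip, hits)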
Features
SQL-like query language (HiveQL).
Supports large datasets.
Integration with HDFS.
Partitioning and bucketing for speed.
UDFs for custom functions.
Extensible and scalable.
Benefits
Easy for SQL users to learn.
Handles petabytes of data.
Seamless integration with Hadoop.
Cost-effective batch processing.
Supports various data formats.
Works with BI tools.
Applications
Data summarization and reporting.
Log analysis.
Batch data processing.
Data ETL and transformation.
Business intelligence.
Data mining on Hadoop.
Pig
Definition: Platform for processing large datasets using Pig Latin scripting.
Working
Write scripts in Pig Latin.
Scripts define a data flow (like ETL).
Compiler translates scripts into MapReduce.
Jobs are run on Hadoop.
Intermediate results are stored in HDFS.
Final results can be dumped or stored.
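A minimal sketch of such a data flow: a small Pig Latin script (hypothetical file names and fields) written out and run in local mode from Python; drop "-x local" to run the same script on the Hadoop cluster.

import subprocess

# Load, group, aggregate, store: the typical ETL-style flow described above.
script = """
logs    = LOAD 'web_logs.csv' USING PigStorage(',') AS (user:chararray, bytes:int);
by_user = GROUP logs BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
STORE totals INTO 'output/user_totals';
"""

with open("user_totals.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-x", "local", "user_totals.pig"], check=True)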
Features
Simple scripting language (Pig Latin).
Handles structured and semi-structured data.
Automatically handles parallel execution.
Supports custom functions (UDFs).
Interoperates with HDFS and Hive.
Debugging and optimization tools.
Benefits
Easy for developers to use.
Faster development than writing Java MapReduce.
Good for complex data transformations.
Scalable with large datasets.
Error handling and debugging support.
Integrates with existing Hadoop tools.
Applications
ETL (Extract, Transform, Load) processes.
Data cleaning and preprocessing.
Web log analysis.
Text mining.
Batch processing tasks.
Ad-hoc data analysis.
Apache Storm
Definition: Real-time computation system for processing streaming data.
Working
Uses spouts (data sources) and bolts (processors).
Data flows through a topology (a directed graph).
Topologies run indefinitely (unlike MapReduce jobs).
Spouts emit streams of data tuples.
Bolts process data, update DBs, or emit new tuples.
Storm distributes work across clusters for parallel processing.
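A minimal bolt sketch using the third-party streamparse Python bindings (Storm itself is JVM-based; streamparse talks to it over the multi-lang protocol). The spout and the topology wiring that connects spouts to bolts are omitted here.

from collections import Counter
from streamparse import Bolt

class WordCountBolt(Bolt):
    """Receives one word per tuple from an upstream spout and emits running counts."""

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])  # downstream bolts or sinks receive (word, count)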
Features
Real-time data processing.
Fault-tolerant and highly scalable.
Horizontal scalability.
Supports multiple languages (Java, Python).
Guaranteed data processing (at least once).
Pluggable architecture.
Benefits
Low latency processing.
Scales to massive data streams.
Reliable and robust.
Easily integrates with other tools.
Flexible for complex workflows.
Open-source and community-supported.
Applications
Real-time analytics and monitoring.
Online machine learning.
ETL for real-time data.
Fraud detection systems.
Sensor data processing.
Live user tracking and engagement.
Apache Flink
Definition: Stream processing framework for real-time and batch data processing.
Working
Processes data as streams, not batches.
Uses operators to transform data streams.
Supports event time and windowing.
Provides exactly-once state consistency.
Distributes tasks across a cluster.
Handles fault tolerance via checkpoints.
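A minimal PyFlink sketch of a keyed, stateful transformation: readings are partitioned per sensor and a reduce keeps the running maximum as managed state, which checkpointing makes fault-tolerant. The sensor values are made-up sample data; a fuller filtering example appears in question 6 below.

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common.typeinfo import Types

env = StreamExecutionEnvironment.get_execution_environment()

# (sensor id, temperature) readings; in practice these would arrive from Kafka or a socket.
readings = env.from_collection(
    [("sensor_1", 22.5), ("sensor_2", 25.7), ("sensor_1", 23.1)],
    type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
)

# key_by partitions the stream per sensor; reduce keeps per-key state (the running maximum).
running_max = readings.key_by(lambda r: r[0]).reduce(lambda a, b: (a[0], max(a[1], b[1])))
running_max.print()

env.execute("per-sensor running max")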
Features
True streaming (vs micro-batching).
Exactly-once processing semantics.
Unified batch and stream processing.
Low latency and high throughput.
Flexible windowing support.
Built-in state management.
Benefits
Real-time and historical data handling.
Highly accurate results with minimal delay.
Scalable and fault-tolerant.
Optimized for complex event processing.
Good for both stateless and stateful apps.
Strong ecosystem and API support.
Applications
Real-time recommendation systems.
Clickstream data analysis.
Streaming ETL pipelines.
IoT sensor analytics.
Fraud detection.
Stock market monitoring.
Apache
Definition: Apache Software Foundation (ASF) is an open-source community developing
and maintaining many big data tools.
Working
Supports a wide variety of projects (Hadoop, Spark, etc).
Projects are community-driven and open-source.
Contributors collaborate globally.
Focuses on scalable, distributed software.
Hosts projects with clear governance models.
Releases stable versions and documentation.
Features
Home of 350+ open-source projects.
Emphasizes community-driven development.
Supports scalable big data technologies.
Backed by robust infrastructure.
Offers diverse tech stacks (data, web, AI).
Free and open-source.
Benefits
Access to cutting-edge technologies.
No licensing costs.
Strong community and support.
Highly scalable tools.
Wide adoption in industry.
Constant innovation.
Applications
Big data processing (Hadoop, Spark).
Web server management (HTTP Server).
Data storage and retrieval (Cassandra, HBase).
Stream processing (Storm, Flink).
Data warehousing (Hive, Drill).
AI/ML pipelines (Mahout, MXNet).
Storm
Definition: Apache Storm is a real-time stream processing system for processing large-scale
data streams.
Working
Uses spouts to ingest data and bolts to process it.
Topologies define the flow of data through spouts and bolts.
Data is processed in real-time in distributed clusters.
Processes streams as continuous events.
Uses workers and supervisors for scalability.
Provides fault-tolerant processing with retries.
Features
Real-time stream processing.
Fault-tolerant and reliable.
Horizontal scalability.
Supports parallel processing.
Pluggable components (e.g., spouts, bolts).
Flexible for custom processing logic.
Benefits
Real-time analytics.
Highly scalable and fault-tolerant.
Efficient for high-volume data streams.
Supports custom stream processing logic.
Distributed architecture for high availability.
Integrates easily with other big data tools.
Applications
Real-time analytics and monitoring.
Fraud detection in financial transactions.
Real-time recommendation systems.
Social media analysis.
Sensor data processing.
Log aggregation and processing.
Flink
Definition: Apache Flink is a stream processing framework that performs real-time data
processing with high throughput and low latency.
Working
Processes data in both batch and stream modes.
Uses operators to transform and aggregate streams.
Can be run on YARN, Kubernetes, or standalone.
Provides stateful processing for stream processing.
Fault tolerance via distributed snapshots.
Supports event-time processing and windowing.
Features
Real-time and batch processing.
High throughput with low latency.
Event-time and processing-time windows.
Stateful stream processing.
Fault tolerance and exactly-once semantics.
Unified API for batch and stream processing.
Benefits
Suitable for low-latency applications.
Flexible for real-time and batch processing.
Supports high scalability.
High fault tolerance and reliability.
Easy to integrate with other big data systems.
Powerful event-time and windowing capabilities.
Applications
Real-time data analytics.
Event-driven applications.
IoT stream processing.
Real-time financial data processing.
Log processing.
Continuous machine learning pipelines.
Apache
Definition: The Apache Software Foundation hosts a collection of open-source projects for
distributed computing, including Hadoop, Kafka, and Spark.
Working
Apache projects use distributed clusters to scale processing.
Supports batch and stream processing frameworks.
Use of MapReduce, stream processing, and distributed computation.
Supports integration with other big data systems.
Can handle massive data sets efficiently.
Built-in fault tolerance and scalability.
Features
Scalable distributed processing.
Wide variety of tools and frameworks (Hadoop, Spark, Kafka).
Support for both batch and real-time data.
Fault tolerance and reliability.
Data storage and processing integration.
Open-source and community-supported.
Benefits
Massive scalability for big data workloads.
High fault tolerance.
Open-source and cost-effective.
Broad ecosystem of related tools.
Active community and continuous improvements.
Integrates well with other enterprise systems.
Applications
Big data analytics.
Real-time stream processing.
Data warehousing and ETL.
Machine learning pipelines.
Data lakes and distributed storage.
Log processing and monitoring.
MID
1. Elucidate the document-based storage model in MongoDB.
2. Evaluate the use of MongoDB for handling unstructured data in a big data application.
3. Describe how Apache Storm processes real-time data (easy, 7 points).
4. Difference between Storm and Spark Streaming (tabular format, easy, 10 points, only 2 columns).
5. Benefits of using Flink for stream processing (easy, 10 points).
6. Develop a Flink app to process sensor data in real time.
7. Implement a data warehouse model using Hive for an e-commerce company.
1. Elucidate the document-based storage model in MongoDB
MongoDB stores data in JSON-like documents (called BSON).
Each document is a collection of key-value pairs, and fields can vary from
document to document.
Documents are grouped in collections (similar to tables in RDBMS).
The model is schema-less, meaning documents in the same collection can have
different structures.
It supports nested documents and arrays for complex data.
Ideal for applications that deal with flexible, hierarchical, or evolving data
structures.
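For illustration, a sample order document (hypothetical field names) showing how nested documents and arrays keep related data in one record, so no joins are needed to read an order:

order = {
    "order_id": 1001,
    "customer": {"name": "Asha", "city": "Pune"},        # nested document
    "items": [                                           # array of sub-documents
        {"sku": "A12", "qty": 2, "price": 499.0},
        {"sku": "B07", "qty": 1, "price": 1299.0},
    ],
    "status": "shipped",
}
# Another document in the same collection may omit 'status' or add new fields.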
2. Evaluate MongoDB for handling unstructured data in big data applications
Schema flexibility: Can store unstructured and semi-structured data (e.g., logs,
social media).
Scalability: Supports horizontal scaling through sharding.
Query power: Rich query language to extract insights from unstructured data.
Indexing: Can index any field to speed up queries.
Integration: Works well with Hadoop and Spark for big data ecosystems.
Real-time processing: Supports real-time analytics using change streams and
aggregation pipelines.
Limitation: not ideal for complex joins; multi-document ACID transactions are supported but heavier than in an RDBMS.
3. Describe how Apache Storm processes real-time data (Easy – 7 Points)
1. Stream-based processing framework.
2. Spouts receive data from sources (e.g., Kafka).
3. Bolts process, transform, or store data.
4. Topologies define data flow from spouts to bolts.
5. Fault-tolerant – failed tuples are replayed.
6. Real-time – processes data as it arrives (low latency).
7. Scales horizontally to handle large volumes.
4. Difference between Storm and Spark Streaming (Tabular Format – Easy 10 Points)
Apache Storm | Spark Streaming
True real-time processing | Micro-batch processing
Processes one record at a time | Processes data in small time batches
Lower latency | Slightly higher latency
Complex to manage and debug | Easier to manage and monitor
Better for real-time use cases | Better for near-real-time and batch use cases
Native fault tolerance | Fault tolerance via RDD lineage
No built-in high-level APIs | Rich APIs in Java, Scala, Python
Supports tuple-level reliability | No per-record acknowledgment
Stateful processing via Trident | Built-in stateful operations
Lightweight and fast | More resource-consuming
5. Benefits of using Flink for stream processing (Easy – 10 Points)
1. True streaming (not micro-batching).
2. Low latency and high throughput.
3. Exactly-once state consistency.
4. Handles event time and late events.
5. Fault-tolerant with checkpointing.
6. Horizontal scalability.
7. Supports both batch and stream processing.
8. Flexible windowing (time, count, session).
9. Backpressure handling to control data flow.
10. Integrates well with Kafka, HDFS, Elasticsearch.
6. Develop a Flink App to Process Sensor Data in Real Time (Simple Example in Python)
Example in Python (PyFlink):
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common.typeinfo import Types

env = StreamExecutionEnvironment.get_execution_environment()

# Sample sensor data stream: (sensor id, temperature)
data = env.from_collection(
    [
        ("sensor_1", 22.5),
        ("sensor_2", 25.7),
        ("sensor_1", 23.1),
    ],
    type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
)

# Keep only readings with temperature above 23
high_temp = data.filter(lambda x: x[1] > 23)
high_temp.print()

env.execute("Sensor Data Stream")
7. Implement a Data Warehouse Model Using Hive for an E-Commerce Company
Steps:
1. Create database:
CREATE DATABASE ecommerce;
USE ecommerce;
2. Create dimension tables:
CREATE TABLE customers (
customer_id INT,
name STRING,
email STRING,
city STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE products (
product_id INT,
name STRING,
category STRING,
price FLOAT
);
3. Create fact table:
CREATE TABLE sales (
sale_id INT,
customer_id INT,
product_id INT,
quantity INT,
sale_date STRING
);
4. Partitioning (optional; bucketing can be added with CLUSTERED BY):
CREATE TABLE sales_partitioned (
sale_id INT,
customer_id INT,
product_id INT,
quantity INT
)
PARTITIONED BY (sale_date STRING);
5. Load data and run queries:
LOAD DATA INPATH '/path/sales.csv' INTO TABLE sales;
SELECT c.name AS customer, p.name AS product, s.quantity
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;