18 Module 2 Notes

3a) What is Hadoop? Explain Hadoop Ecosystem with a neat diagram.

Hadoop:
Hadoop is an open-source framework for storing and processing Big
Data using a distributed computing model.
It provides a scalable, fault-tolerant, and reliable system for processing
large datasets using clusters of commodity hardware.
Key Characteristics:
Scalable: Can scale out by adding more nodes to the cluster.
Fault-tolerant: Data replication ensures continuity even during failures.
Self-healing: Automatically detects and resolves faults.
Distributed Computing: Tasks are split and processed across clusters.
Hadoop uses the MapReduce programming model for processing data.
Hadoop Ecosystem Components:
Hadoop has a rich ecosystem to support storage, processing, access,
analysis, and management of Big Data.

Core Components:

Hadoop Common: Provides libraries and utilities required by other Hadoop modules.
HDFS (Hadoop Distributed File System): Distributed storage system.
YARN: Manages and schedules resources for computational tasks.
MapReduce: Parallel processing of large datasets.
Ecosystem Tools:

Hive: SQL-like data querying.
Pig: High-level scripting for data transformation.
HBase: NoSQL database for Big Data.
Sqoop: Transfers data between Hadoop and relational databases.
Flume: Handles streaming data ingestion.
Zookeeper: Manages coordination across distributed systems.
Oozie: Workflow scheduler for Hadoop jobs.
Diagram: Hadoop Ecosystem
(Refer to Figure in PDF for the detailed diagram illustrating Hadoop
components and tools).
3b). Explain with neat diagram HDFS Components.
HDFS Overview:
HDFS (Hadoop Distributed File System): A core component of Hadoop
designed for Big Data processing.
Stores and manages data across clusters of nodes with fault tolerance.
Files are divided into blocks and distributed across multiple DataNodes.
Components:
NameNode (Master):

Manages metadata and the file system namespace.
Tracks file locations and ensures data replication.
Functions:
Metadata storage.
File operations (open, close, delete).
Mapping blocks to DataNodes.
DataNode (Slave):

Stores actual data in the form of blocks.
Handles client read/write requests.
Periodically sends heartbeat signals to NameNode for health status.
Secondary NameNode:
Performs checkpoints of the NameNode metadata.
Combines edits and fsimage files to update metadata periodically.
HDFS Workflow:

File write involves block creation, replication, and DataNode acknowledgment.
For reads, the NameNode supplies block locations so clients can fetch the data directly from the appropriate DataNodes.
Diagram: HDFS Components
(Refer to Figure in PDF showing NameNode, DataNode, and Secondary
NameNode roles).
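
To make the write/read path above concrete, here is a minimal Java sketch using the Hadoop FileSystem API; the NameNode address hdfs://namenode:9000 and the file path are assumptions, not values from these notes.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (hypothetical address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write: the NameNode chooses target DataNodes and the client streams
        // the block to them; replication happens in a DataNode pipeline.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations and the client reads
        // the bytes directly from a DataNode holding a replica.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```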

3c.) Write a short note on Apache Hive.


Apache Hive Overview:
Hive is an open-source data warehouse system built on top of Hadoop
for processing structured data.
It provides a SQL-like interface (HiveQL) for querying large datasets.
Key Features:
SQL Compatibility: Supports HiveQL for data querying and analysis.
Batch Processing: Ideal for processing large datasets but does not
support real-time queries.
Flexible Storage: Supports various file types (e.g., text, RCFiles, ORCFiles,
HBase).
Integration: Can work with MapReduce and Tez for query execution.
Advantages:
Simplifies data querying for users familiar with SQL.
Enables data summarization and ad hoc querying.
Provides scalability for large datasets.
(Refer to the PDF for specific usage scenarios and detailed descriptions).
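
As a hedged illustration of HiveQL access from a program, the sketch below uses the HiveServer2 JDBC driver; the host, port, credentials, and the word_log table are assumptions. The same HiveQL could be run interactively through the Hive CLI or Beeline.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "demo", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; under the hood it is compiled into
            // MapReduce or Tez jobs that run on the cluster (batch, not real time).
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS freq FROM word_log GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + " -> " + rs.getLong("freq"));
            }
        }
    }
}
```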

4a.) Explain Apache Sqoop Import and Export Methods.


Sqoop Overview:
Apache Sqoop is a tool to transfer data between Hadoop and relational
databases.
Works with JDBC-compliant databases like MySQL, Oracle, and
PostgreSQL.
Import Method:
Step 1: Metadata Collection
Sqoop examines the database to gather metadata.
Step 2: Data Transfer
A map-only job transfers data to HDFS.
Data is divided into splits and processed in parallel.
Data Format:
Default: Comma-delimited text.
Customizable formats available.
Export Method:
Step 1: Metadata Collection
Examines the database to identify schema and connection details.
Step 2: Data Export
A map-only job writes HDFS data to the database.
Input data is divided into splits, and individual map tasks push the data.
Diagram: Sqoop Import and Export Workflow
(Refer to Figures showing two-step import and export processes).
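
The import and export steps above are normally driven from the sqoop command line. Purely as an illustrative sketch, the Java snippet below shells out to hypothetical import and export invocations; the database URL, credentials, table names, and HDFS paths are all assumptions.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SqoopTransferDemo {
    // Run a command and stream its output to the console.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Import: Sqoop gathers table metadata over JDBC, then a map-only job
        // copies rows into HDFS as (by default) comma-delimited text files.
        run(Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/salesdb",   // hypothetical database
            "--username", "demo", "--password", "demo",
            "--table", "orders",                          // hypothetical table
            "--target-dir", "/user/demo/orders",          // HDFS destination
            "--num-mappers", "4"));                       // 4 parallel splits

        // Export: the reverse direction, pushing HDFS files back into a table.
        run(Arrays.asList(
            "sqoop", "export",
            "--connect", "jdbc:mysql://dbhost/salesdb",
            "--username", "demo", "--password", "demo",
            "--table", "orders_copy",
            "--export-dir", "/user/demo/orders"));
    }
}
```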
4b.) Explain Apache Oozie with a neat diagram.
Oozie Overview:
Oozie is a workflow orchestration system for managing multiple Hadoop
jobs.
It schedules and coordinates jobs represented as Directed Acyclic Graphs
(DAGs).
Key Features:
Workflow Jobs: Sequential tasks with dependencies.
Coordinator Jobs: Triggered by time or data availability.
Bundle Jobs: Groups of workflows and coordinators.
Oozie Nodes:
Control Flow Nodes: Define workflow start, end, or failure points.
Action Nodes: Execute specific tasks (e.g., HDFS commands, MapReduce
jobs).
Fork/Join Nodes: Support parallel task execution.
Diagram: Oozie DAG Workflow
(Refer to Oozie diagrams in the PDF illustrating control flow and action nodes).
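
For illustration, a workflow defined in a workflow.xml DAG on HDFS can be submitted through Oozie's Java client API. In the sketch below the Oozie server URL, the HDFS application path, and the queueName property are assumptions.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (hypothetical host/port).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow.xml (the DAG definition) lives on HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode:9000/user/demo/workflows/wordcount");
        conf.setProperty("queueName", "default");   // hypothetical workflow parameter

        // Submit and start the workflow job, then poll its status.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);                   // wait 10 s between checks
        }
        System.out.println("Workflow finished: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```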

4c.) Explain YARN Application Framework.


YARN Overview:
YARN (Yet Another Resource Negotiator) manages resources and
scheduling for Hadoop tasks.
Decouples resource management from application execution, allowing
multiple processing frameworks to share the same cluster.
Components:
Resource Manager (RM):
Allocates resources for tasks.
Manages cluster-level information.
Node Manager (NM):
Manages node-specific resources.
Sends heartbeat signals to RM.
Application Master (AM):
Coordinates application execution.
Sends resource requests to RM.
Containers:
Hold resources (CPU, memory) for task execution.
Workflow:
A client submits a job to RM.
RM assigns resources and launches AM.
AM allocates containers for sub-tasks.
Tasks run in parallel across nodes using the allocated containers.
(Refer to the YARN execution model diagram in the PDF for visual
clarity).
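
The submission workflow can be sketched with the YARN client API. The snippet below is only a skeleton of a client talking to the ResourceManager; a real ApplicationMaster also needs jars, environment settings, and its own container requests, and the application name, launch command, and resource sizes here are assumptions.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitDemo {
    public static void main(String[] args) throws Exception {
        // The client talks to the ResourceManager through YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM for a new application id and submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("yarn-demo");

        // Describe the container that will run the ApplicationMaster.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "echo ApplicationMaster placeholder"));  // a real AM would launch a JVM here
        appContext.setAMContainerSpec(amContainer);

        // Resources requested for the AM container (memory in MB, vcores).
        // setMemorySize is Hadoop 2.8+; older releases use setMemory(int).
        Resource amResource = Records.newRecord(Resource.class);
        amResource.setMemorySize(512);
        amResource.setVirtualCores(1);
        appContext.setResource(amResource);

        // Submit: the RM schedules the AM, which then requests worker containers.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```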
Q2: MapReduce Framework and Its Functions
Ans: What is MapReduce?
MapReduce is a programming model used in Hadoop for processing
large datasets in a parallel and distributed manner.
It simplifies complex data processing by dividing tasks into two phases:
Map and Reduce.
Key Features of MapReduce:
Parallel Processing: Splits tasks across multiple nodes for faster
execution.
Fault Tolerance: Automatically reassigns failed tasks to other nodes.
Scalable: Handles growing data by adding nodes to the cluster.
Data Locality: Moves computation to the nodes storing the data,
reducing network traffic.
Components of MapReduce Framework:
Mapper:
Processes input data and outputs it as key-value pairs.
Example: Counting words in a document where key = word, value = 1.
Reducer:

Aggregates the key-value pairs produced by the Mapper to generate final results.
Example: Summing all values for each word to calculate word frequency.
JobTracker:

Manages the overall execution of the job.
Tracks task progress and reassigns tasks if failures occur.
TaskTracker:

Executes the tasks assigned by the JobTracker and reports progress.


How MapReduce Works:
Input Split:
Large data is split into smaller chunks for processing.
Mapping Phase:
Mapper reads the data chunk and generates key-value pairs.
Shuffling and Sorting:
Intermediate data is shuffled and grouped by keys.
Reducing Phase:
Reducer processes grouped data and generates the final output.
Output Storage:
Results are stored back in HDFS or other storage systems.
Functions of MapReduce Framework:
Automatic Parallelization: Divides computation across nodes
automatically.
Data Distribution: Ensures data is processed close to where it is stored.
Fault Recovery: Retries failed tasks on healthy nodes.
Simplified Programming: Developers only need to write Map and Reduce
functions.
Example: Word Count Problem
Input:
File containing: "apple apple orange banana apple"
Mapper Output:
apple → 1, apple → 1, orange → 1, banana → 1, apple → 1
Reducer Output:
apple → 3, orange → 1, banana → 1
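
A minimal Java sketch of this word-count job using the org.apache.hadoop.mapreduce API is shown below; the input and output HDFS paths are taken from the command line. Using the reducer as a combiner is an optional optimization that pre-aggregates counts on the map side.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums all the 1s for each word to get its frequency.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```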
Q3: Apache Flume - Hadoop Tool
Ans: What is Flume?
Flume is a tool used to collect, move, and store streaming data into
HDFS or other storage systems.
It's useful for data like logs, network traffic, or social media feeds.
Key Features:
Data Collection: Handles large amounts of streaming data.
Reliable: Ensures data is not lost even if there’s a failure.
Flexible: Can work with different data sources and storage systems.
Scalable: Handles more data by adding more Flume agents.
Main Components:
Source: Collects data (e.g., logs from a server).
Channel: Temporarily stores data before it is processed.
Sink: Sends data to the final destination (e.g., HDFS).
Agent: Combines the source, channel, and sink into a pipeline.
Why Use Flume?
To move streaming data efficiently into storage like HDFS.
Ensures data reliability and works well with Hadoop.
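
For illustration, an application can hand events to a Flume agent's Avro source through the Flume RPC client API; in the sketch below the agent host, port, and event body are assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSendDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent's Avro source (hypothetical host/port).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // Each append() hands one event to the agent's source; the agent's
            // channel buffers it until the sink writes it to HDFS.
            Event event = EventBuilder.withBody(
                    "user logged in".getBytes(StandardCharsets.UTF_8));
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```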
