
Data Science Unit – 3

Q: Explain the problem of fault tolerance and its solution with reference to HDFS.

Fault tolerance is the ability of a system to continue operating and providing its intended services
even when one or more components within it fail or experience unexpected problems. In simpler
terms, it's like the resilience of a well-designed safety net that prevents the entire system from
breaking down due to individual failures.

Imagine you have a toy robot that you've programmed to perform tasks, and you've given it a set of
instructions. Now, what if your robot accidentally drops a piece or malfunctions? In the world of
computers and data storage, these unexpected events are like faults or failures.

In the context of Hadoop Distributed File System (HDFS), the problem of fault tolerance arises when
one of the computers (nodes) storing pieces of your data fails or encounters a problem. Just like your
robot's unexpected hiccup, these failures can happen, and we need a way to handle them without
losing any important information.

Solution with Reference to HDFS:

Now, let's think of a solution inspired by how we handle mistakes in everyday life.

1. Make Copies (Replication):

• Explanation: Imagine you have a valuable document. Instead of keeping only one
copy, you make several copies and store them in different places. If one copy is lost
or damaged, you can still rely on the others.

• Real Life Example: Like having backups of important photos on different devices or
in the cloud.

2. Check and Correct (Checksums and Data Integrity):

• Explanation: Just as you might double-check your work, HDFS stores a checksum with each block and verifies it whenever the data is read. If a replica turns out to be corrupted or damaged, HDFS discards it and serves the data from a healthy copy, re-replicating it so the cluster returns to full strength.

• Real Life Example: When you download a file from the internet, the system may
verify its integrity using a checksum to ensure you get the exact file without errors.

3. Distribute the Load (Parallel Processing):

• Explanation: Instead of relying on one person to do all the work, distribute the tasks
among many. Similarly, HDFS distributes your data across multiple nodes. If one
node fails, others can still handle their share of the work.

• Real Life Example: In a group project, each member has a specific role. If one person
can't complete their task, others can continue working.

4. Have a Plan B (Data Replication and Redundancy):

• Explanation: Just like having a backup plan, HDFS stores multiple copies of your
data. If a node fails, the system can use the copies to ensure no data is lost.
• Real Life Example: Businesses often have backup power generators in case the main
power source fails.

So, in the world of HDFS, fault tolerance is like having safety nets in place – making copies, checking
and correcting errors, distributing tasks, and having backup plans – to ensure your data remains
intact even when unexpected failures occur.
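These ideas can be sketched as a toy model in a few lines of Python. The dictionaries, node names, and MD5 checksum below are illustrative assumptions only (HDFS itself uses block-level CRC checksums and real DataNode daemons), but the sketch shows how replication and checksum verification work together: a corrupted replica is skipped and a healthy copy is returned instead.

```python
import hashlib

# Toy model: block_id -> {node_name: stored bytes}; checksums kept separately.
replicas = {}
checksums = {}

def write_block(block_id, data, nodes, replication=3):
    """Store the same block on `replication` nodes and remember its checksum."""
    checksums[block_id] = hashlib.md5(data).hexdigest()
    replicas[block_id] = {node: data for node in nodes[:replication]}

def read_block(block_id):
    """Return the first replica whose checksum still matches; skip corrupted copies."""
    for node, data in replicas[block_id].items():
        if hashlib.md5(data).hexdigest() == checksums[block_id]:
            return node, data
    raise IOError(f"All replicas of {block_id} are corrupted or missing")

# Demo: write a block, corrupt one replica, and still read the data back.
write_block("block-1", b"customer reviews, part 1", ["node-A", "node-B", "node-C"])
replicas["block-1"]["node-A"] = b"garbage"   # simulate corruption on one node
print(read_block("block-1"))                 # falls back to node-B's healthy copy
```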

Q: How is data processed on Hadoop?

Data Processing on Hadoop:

Imagine you have a massive jigsaw puzzle, and you want to solve it. Each piece of the puzzle
represents a small part of your data. Hadoop is like a giant table where you can spread out all the
pieces and have many people (computers) working together to put them in the right places.

1. Breaking Down the Puzzle (Data Splitting):

• Explanation: You decide to break your puzzle into smaller pieces so that each person
can focus on a manageable chunk.

• Hadoop Equivalent: Hadoop takes your large dataset and breaks it into smaller parts
called "blocks." Each block is like a piece of the puzzle.

2. Assigning Pieces to Solvers (Map Phase):

• Explanation: You give different pieces of the puzzle to different people, each
working on their assigned section.

• Hadoop Equivalent: Hadoop assigns each block of data to different computers, and
they process it independently. This phase is called the "Map Phase."

3. Solving the Pieces (Processing Data):

• Explanation: Each person focuses on their section, figuring out where each piece
goes in their part of the puzzle.

• Hadoop Equivalent: Each computer processes its assigned data, applying the
required operations or calculations. This is where the actual data processing
happens.

4. Bringing Results Together (Shuffling and Sorting):

• Explanation: Once everyone finishes their part, you gather all the solved sections
and arrange them in the right order.

• Hadoop Equivalent: Hadoop shuffles and sorts the results from the Map Phase,
making sure everything is organized and ready for the next step.

5. Final Assembly (Reduce Phase):

• Explanation: You have a few experts who take the partially solved sections and
assemble the final puzzle.
• Hadoop Equivalent: The "Reduce Phase" in Hadoop takes the processed data from
different computers, combines and organizes it, providing a final result.

Example: Word Count on Hadoop:

Let's say you have a giant book, and you want to know how many times each word appears. Hadoop
would break the book into manageable chunks, assign each chunk to a different computer, count the
words in parallel, and then bring all the results together to provide you with the total word count.
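To make this concrete, below is a sketch of how those two steps could be written for Hadoop Streaming, which lets the map and reduce logic be plain Python scripts that read from stdin and write to stdout. The script name, jar path, and input/output paths in the comment are illustrative assumptions and depend on your installation; the only Hadoop-specific contract here is the tab-separated key/value lines.

```python
#!/usr/bin/env python3
# wordcount.py -- usable as both the map and reduce step with Hadoop Streaming.
# Illustrative invocation (paths and jar name depend on your cluster setup):
#   hadoop jar hadoop-streaming.jar \
#       -input /books/big_book.txt -output /out/wordcount \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce" \
#       -files wordcount.py
import sys

def map_phase():
    # Emit "word<TAB>1" for every word on stdin; Hadoop shuffles and sorts by word.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reduce_phase():
    # Input arrives sorted by word, so all counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```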

In essence, Hadoop allows you to tackle big data problems by breaking them into smaller,
manageable pieces, processing those pieces in parallel, and then assembling the results to derive
meaningful insights or outcomes.

HDFS: Design, Components, and Architecture

Design:

Block-based Storage:
HDFS divides large files into fixed-size blocks (128 MB by default; commonly configured to 256 MB).
Each block is stored independently across multiple DataNodes in the cluster.

Data Replication:
HDFS replicates each block to multiple DataNodes (the default replication factor is 3).
Replication provides fault tolerance and ensures data availability in case of node failures.

Write Once, Read Many (WORM):
HDFS follows a WORM model: files are typically written once and read many times.
New versions or copies are created instead of modifying existing data.

Scalability:
HDFS is designed for horizontal scalability, allowing more nodes to be added to the cluster.
This enables the storage and processing of large datasets by distributing data across many nodes.

Fault Tolerance:
HDFS ensures fault tolerance through data replication and regular monitoring.
If a DataNode or block becomes unavailable, the system can retrieve the data from other replicas.
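To make the design numbers concrete, here is a small back-of-the-envelope calculation in Python. The 1 GB file size is just an assumed example; 128 MB blocks and a replication factor of 3 are the usual defaults mentioned above.

```python
import math

file_size_mb = 1024          # assumed example: a 1 GB file
block_size_mb = 128          # typical HDFS default block size
replication_factor = 3       # typical HDFS default replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)     # 8 blocks
block_copies = num_blocks * replication_factor           # 24 block replicas in total
raw_storage_mb = file_size_mb * replication_factor       # ~3072 MB of raw cluster storage
# (The last block only holds the remaining data, so raw storage is exact here
# because 1024 MB is a multiple of the block size.)

print(f"{num_blocks} blocks, {block_copies} replicas, ~{raw_storage_mb} MB raw storage")
```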

Architecture:

Master-Slave Architecture:
HDFS follows a master-slave architecture with two main components: the NameNode (master) and DataNodes (slaves).
The NameNode manages metadata, and DataNodes store the actual data and follow the instructions of the NameNode.
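One way to see the NameNode acting as the metadata master is through WebHDFS, HDFS's REST interface: listing a directory is answered entirely from the NameNode's metadata, without touching any DataNode. The host, port (9870 is the usual NameNode web port in Hadoop 3.x), and path below are assumptions for illustration.

```python
import requests

# Assumed NameNode address and an example path; adjust for your cluster.
NAMENODE = "http://localhost:9870"
PATH = "/user/hadoop"

# LISTSTATUS is answered from the NameNode's in-memory metadata:
# no DataNode needs to be contacted just to list a directory.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "LISTSTATUS"})
resp.raise_for_status()

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"], entry.get("length", 0))
```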

Components:

The main HDFS components (daemons) are explained in detail below.

1. Explain HDFS daemons with a diagram

Hadoop Distributed File System (HDFS) is like a super-efficient warehouse for data, but instead of physical products, it stores massive amounts of digital information, such as product listings, customer reviews, and purchase history.

HDFS Daemons and Related Concepts:
• Blocks (the unit of storage, not a daemon)
• NameNode
• Secondary NameNode
• DataNode

1. Blocks: Think of your data as a big e-commerce catalog. HDFS takes this catalog and chops it into smaller pieces, like dividing a big book into many small chapters. Each of these pieces is called a "block."

2. DataNodes: Imagine these blocks are stored in separate locations, like storage units in a giant storage facility. These storage units are like "DataNodes" in HDFS. Each DataNode contains several blocks from the catalog.

3. NameNode: At the front desk of this storage facility, there's a "NameNode." This friendly receptionist knows where each block is located and how they're organized. It's like a map that helps you find your items in the storage units.

4. Secondary NameNode:

Despite its name, the Secondary NameNode is not a backup or standby NameNode; its primary purpose is to create periodic checkpoints of the filesystem namespace.

It reduces the time required to restart the NameNode after a failure by periodically merging the edits log with the current filesystem image (fsimage).

The Secondary NameNode does not provide high availability or automatic failover. In modern high-availability deployments, its checkpointing role is taken over by a Standby NameNode, so a separate Secondary NameNode is not run.

Features:

Replication: To keep your catalog safe, HDFS makes copies of each block and stores them in different storage units. It's similar to having backup copies in case something goes wrong.

Real-Life Example

Imagine you run a large e-commerce website. You have a vast catalog of products, reviews, and customer data. You decide to use HDFS to manage all this information efficiently.

1. Catalog Splitting: You take your entire e-commerce catalog, including product listings, customer reviews, and sales records, and break it into smaller blocks. Each block might contain information about a group of products or reviews.

2. Storage Facility: You rent a massive storage facility with lots of storage units. These storage units represent the DataNodes in HDFS. You put these blocks into different storage units.

3. Catalog Map: You create a giant map or directory called the NameNode. This map tells you where each block is located in the storage units, making it easy to find data quickly.

4. Safety Copies: To ensure your data is safe, you make multiple copies of each block and store them in different storage units within your facility. So, even if one storage unit has a problem, you can always find the same data in another unit.
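The "catalog map" idea can be sketched as a tiny Python dictionary; the file name, block IDs, and DataNode names below are made up for illustration. The point is that this NameNode-style map holds only metadata (which blocks make up a file and where each replica lives), while the block contents themselves sit on the DataNodes.

```python
# Toy "NameNode" metadata: file name -> ordered list of (block_id, [DataNodes holding it]).
namenode_metadata = {
    "catalog/reviews.csv": [
        ("blk_001", ["datanode-1", "datanode-2", "datanode-3"]),
        ("blk_002", ["datanode-2", "datanode-4", "datanode-5"]),
    ],
}

def locate_file(filename):
    """Ask the toy 'NameNode' where each block of a file lives (metadata only, no data)."""
    for block_id, nodes in namenode_metadata[filename]:
        print(f"{filename}: {block_id} stored on {', '.join(nodes)}")

locate_file("catalog/reviews.csv")
```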

2. What is MapReduce? Demonstrate its working with a word count example.
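MapReduce is the programming model Hadoop uses to process data in parallel: a map step turns each input record into key/value pairs, a shuffle/sort step groups those pairs by key, and a reduce step combines each group into a final result (the same five steps described in the jigsaw-puzzle answer above). The sketch below simulates these phases locally in plain Python for a tiny word count, using made-up sample lines; it is a teaching sketch of the data flow, not how a real Hadoop job is submitted (see the Hadoop Streaming sketch earlier for that).

```python
from collections import defaultdict

text = [
    "big data needs big storage",
    "hadoop breaks big data into blocks",
]

# Map phase: each line is processed independently into (word, 1) pairs.
mapped = [(word, 1) for line in text for word in line.split()]

# Shuffle and sort: group all values that share the same key (word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group into a single (word, total) result.
word_counts = {word: sum(counts) for word, counts in sorted(groups.items())}

print(word_counts)   # e.g. {'big': 3, 'blocks': 1, 'breaks': 1, 'data': 2, ...}
```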


3. Explain the working and importance of the Resource Manager in YARN.

The Resource Manager in YARN, in easy language:

What Is the Resource Manager?

Job Allocator and Manager:
Imagine you have many tasks to do and need a fair way to decide who gets to use computer resources like CPU and memory. The Resource Manager is like the boss that manages and shares these resources among different tasks or jobs.

Traffic Cop for Applications:
Think of the Resource Manager as a traffic cop at an intersection. It helps applications (programs or tasks) move smoothly, deciding who goes first and making sure everyone gets their turn.

What Does the Resource Manager Do?

Dividing Resources:
The Resource Manager decides how to divide the cluster's resources (like slices of a pizza) among different applications. It ensures that everyone gets a fair share to do their work.

Keeping Track of Tasks:
Just like a teacher keeps track of students in a classroom, the Resource Manager keeps track of all the applications running on the cluster, making sure they have what they need to finish their work.

Making Plans:
The Resource Manager plans which task should use the cluster next. It's like deciding which player gets the ball in a game. It keeps everything organized and moving.

How Does It Work?

Applications Ask for Resources:
Applications ask the Resource Manager, "Can I have some resources, please?" They say what they need, such as CPU and memory, and the Resource Manager figures out where to give it to them.

Sharing Resources:
The Resource Manager shares the resources by allocating small containers to each application. It's like sharing toys with friends – each friend gets their own space to play.

Checking in With Workers:
The Resource Manager talks to the worker machines (called NodeManagers) to see how much space they have and how things are going. It's like checking in with helpers to make sure everything is running smoothly.

Wrapping Up Tasks:
Once a task is done, the Resource Manager is told and frees up its resources. It's like finishing a job and letting the boss know, so someone else can use the computer.

Why Is It Important?

Fairness and Efficiency:
The Resource Manager makes sure everyone gets a fair turn to use the cluster. It's like making sure everyone in a group project gets a chance to work and that nobody hogs all the resources.

Avoiding Traffic Jams:
Just as a traffic cop helps avoid traffic jams, the Resource Manager ensures that computer resources are used efficiently, preventing slowdowns or problems.

Keeping Things Organized:
Think of the Resource Manager as the organizer of a big event. It keeps track of who needs what, plans how to share resources, and ensures everything runs smoothly.

In simple terms, the Resource Manager in YARN is like the boss of the cluster's resources, making sure every application gets a fair chance to use them and keeping things organized for different tasks or jobs.
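You can actually watch the Resource Manager doing this bookkeeping through its REST API. The sketch below asks an assumed ResourceManager at localhost:8088 (the usual default web port) for cluster-wide metrics and for the applications it is currently running; adjust the address for a real cluster.

```python
import requests

# Assumed ResourceManager address; adjust for your cluster.
RM = "http://localhost:8088"

# Cluster-wide resource bookkeeping kept by the ResourceManager.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("Apps running:", metrics.get("appsRunning"))
print("Memory available (MB):", metrics.get("availableMB"))

# Applications the ResourceManager is currently tracking.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```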
