
Data Science Unit – 3

Q: Explain the problem of fault tolerance and its solution with reference to HDFS.

Fault tolerance is the ability of a system to continue operating and providing its intended services
even when one or more components within it fail or experience unexpected problems. In simpler
terms, it's like the resilience of a well-designed safety net that prevents the entire system from
breaking down due to individual failures.

Imagine you have a toy robot that you've programmed to perform tasks, and you've given it a set of
instructions. Now, what if your robot accidentally drops a piece or malfunctions? In the world of
computers and data storage, these unexpected events are like faults or failures.

In the context of Hadoop Distributed File System (HDFS), the problem of fault tolerance arises when
one of the computers (nodes) storing pieces of your data fails or encounters a problem. Just like your
robot's unexpected hiccup, these failures can happen, and we need a way to handle them without
losing any important information.

Solution with Reference to HDFS:

Now, let's think of a solution inspired by how we handle mistakes in everyday life.

1. Make Copies (Replication):

• Explanation: Imagine you have a valuable document. Instead of keeping only one
copy, you make several copies and store them in different places. If one copy is lost
or damaged, you can still rely on the others.

• Real Life Example: Like having backups of important photos on different devices or
in the cloud.

2. Check and Correct (Checksums and Data Integrity):

• Explanation: Just as you might double-check your work, HDFS stores a checksum with each block and verifies it whenever the data is read. If a replica turns out to be corrupted or damaged, HDFS discards it and serves the data from a healthy copy, re-replicating it so the cluster returns to full strength.

• Real Life Example: When you download a file from the internet, the system may
verify its integrity using a checksum to ensure you get the exact file without errors.

3. Distribute the Load (Parallel Processing):

• Explanation: Instead of relying on one person to do all the work, distribute the tasks
among many. Similarly, HDFS distributes your data across multiple nodes. If one
node fails, others can still handle their share of the work.

• Real Life Example: In a group project, each member has a specific role. If one person
can't complete their task, others can continue working.

4. Have a Plan B (Data Replication and Redundancy):

• Explanation: Just like having a backup plan, HDFS stores multiple copies of your
data. If a node fails, the system can use the copies to ensure no data is lost.
• Real Life Example: Businesses often have backup power generators in case the main
power source fails.

So, in the world of HDFS, fault tolerance is like having safety nets in place – making copies, checking
and correcting errors, distributing tasks, and having backup plans – to ensure your data remains
intact even when unexpected failures occur.
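These ideas can be sketched as a toy model in a few lines of Python. The dictionaries, node names, and MD5 checksum below are illustrative assumptions only (HDFS itself uses block-level CRC checksums and real DataNode daemons), but the sketch shows how replication and checksum verification work together: a corrupted replica is skipped and a healthy copy is returned instead.

```python
import hashlib

# Toy model: block_id -> {node_name: stored bytes}; checksums kept separately.
replicas = {}
checksums = {}

def write_block(block_id, data, nodes, replication=3):
    """Store the same block on `replication` nodes and remember its checksum."""
    checksums[block_id] = hashlib.md5(data).hexdigest()
    replicas[block_id] = {node: data for node in nodes[:replication]}

def read_block(block_id):
    """Return the first replica whose checksum still matches; skip corrupted copies."""
    for node, data in replicas[block_id].items():
        if hashlib.md5(data).hexdigest() == checksums[block_id]:
            return node, data
    raise IOError(f"All replicas of {block_id} are corrupted or missing")

# Demo: write a block, corrupt one replica, and still read the data back.
write_block("block-1", b"customer reviews, part 1", ["node-A", "node-B", "node-C"])
replicas["block-1"]["node-A"] = b"garbage"   # simulate corruption on one node
print(read_block("block-1"))                 # falls back to node-B's healthy copy
```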

Q: How is data processed on Hadoop?

Data Processing on Hadoop:

Imagine you have a massive jigsaw puzzle, and you want to solve it. Each piece of the puzzle
represents a small part of your data. Hadoop is like a giant table where you can spread out all the
pieces and have many people (computers) working together to put them in the right places.

1. Breaking Down the Puzzle (Data Splitting):

• Explanation: You decide to break your puzzle into smaller pieces so that each person
can focus on a manageable chunk.

• Hadoop Equivalent: Hadoop takes your large dataset and breaks it into smaller parts
called "blocks." Each block is like a piece of the puzzle.

2. Assigning Pieces to Solvers (Map Phase):

• Explanation: You give different pieces of the puzzle to different people, each
working on their assigned section.

• Hadoop Equivalent: Hadoop assigns each block of data to different computers, and
they process it independently. This phase is called the "Map Phase."

3. Solving the Pieces (Processing Data):

• Explanation: Each person focuses on their section, figuring out where each piece
goes in their part of the puzzle.

• Hadoop Equivalent: Each computer processes its assigned data, applying the
required operations or calculations. This is where the actual data processing
happens.

4. Bringing Results Together (Shuffling and Sorting):

• Explanation: Once everyone finishes their part, you gather all the solved sections
and arrange them in the right order.

• Hadoop Equivalent: Hadoop shuffles and sorts the results from the Map Phase,
making sure everything is organized and ready for the next step.

5. Final Assembly (Reduce Phase):

• Explanation: You have a few experts who take the partially solved sections and
assemble the final puzzle.
• Hadoop Equivalent: The "Reduce Phase" in Hadoop takes the processed data from
different computers, combines and organizes it, providing a final result.

Example: Word Count on Hadoop:

Let's say you have a giant book, and you want to know how many times each word appears. Hadoop
would break the book into manageable chunks, assign each chunk to a different computer, count the
words in parallel, and then bring all the results together to provide you with the total word count.
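To make this concrete, below is a sketch of how those two steps could be written for Hadoop Streaming, which lets the map and reduce logic be plain Python scripts that read from stdin and write to stdout. The script name, jar path, and input/output paths in the comment are illustrative assumptions and depend on your installation; the only Hadoop-specific contract here is the tab-separated key/value lines.

```python
#!/usr/bin/env python3
# wordcount.py -- usable as both the map and reduce step with Hadoop Streaming.
# Illustrative invocation (paths and jar name depend on your cluster setup):
#   hadoop jar hadoop-streaming.jar \
#       -input /books/big_book.txt -output /out/wordcount \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce" \
#       -files wordcount.py
import sys

def map_phase():
    # Emit "word<TAB>1" for every word on stdin; Hadoop shuffles and sorts by word.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reduce_phase():
    # Input arrives sorted by word, so all counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```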

In essence, Hadoop allows you to tackle big data problems by breaking them into smaller,
manageable pieces, processing those pieces in parallel, and then assembling the results to derive
meaningful insights or outcomes.

HDFS: Design, Components, and Architecture

Design:

Block-based Storage:
HDFS divides large files into fixed-size blocks (128 MB by default; commonly configured to 256 MB).
Each block is stored independently across multiple DataNodes in the cluster.

Data Replication:
HDFS replicates each block to multiple DataNodes (the default replication factor is 3).
Replication provides fault tolerance and ensures data availability in case of node failures.

Write Once, Read Many (WORM):
HDFS follows a WORM model: files are typically written once and read many times.
New versions or copies are created instead of modifying existing data.

Scalability:
HDFS is designed for horizontal scalability, allowing more nodes to be added to the cluster.
This enables the storage and processing of large datasets by distributing data across many nodes.

Fault Tolerance:
HDFS ensures fault tolerance through data replication and regular monitoring.
If a DataNode or block becomes unavailable, the system can retrieve the data from other replicas.
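To make the design numbers concrete, here is a small back-of-the-envelope calculation in Python. The 1 GB file size is just an assumed example; 128 MB blocks and a replication factor of 3 are the usual defaults mentioned above.

```python
import math

file_size_mb = 1024          # assumed example: a 1 GB file
block_size_mb = 128          # typical HDFS default block size
replication_factor = 3       # typical HDFS default replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)     # 8 blocks
block_copies = num_blocks * replication_factor           # 24 block replicas in total
raw_storage_mb = file_size_mb * replication_factor       # ~3072 MB of raw cluster storage
# (The last block only holds the remaining data, so raw storage is exact here
# because 1024 MB is a multiple of the block size.)

print(f"{num_blocks} blocks, {block_copies} replicas, ~{raw_storage_mb} MB raw storage")
```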

Architecture:

Master-Slave Architecture:
HDFS follows a master-slave architecture with two main components: the NameNode (master) and DataNodes (slaves).
The NameNode manages metadata, and DataNodes store the actual data and follow the instructions of the NameNode.
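One way to see the NameNode acting as the metadata master is through WebHDFS, HDFS's REST interface: listing a directory is answered entirely from the NameNode's metadata, without touching any DataNode. The host, port (9870 is the usual NameNode web port in Hadoop 3.x), and path below are assumptions for illustration.

```python
import requests

# Assumed NameNode address and an example path; adjust for your cluster.
NAMENODE = "http://localhost:9870"
PATH = "/user/hadoop"

# LISTSTATUS is answered from the NameNode's in-memory metadata:
# no DataNode needs to be contacted just to list a directory.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "LISTSTATUS"})
resp.raise_for_status()

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"], entry.get("length", 0))
```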

Components:

The main HDFS components (daemons) are explained in detail below.

1. Explain HDFS daemons with a diagram

Hadoop Distributed File System (HDFS) is like a super-efficient warehouse for data, but instead of physical products, it stores massive amounts of digital information, such as product listings, customer reviews, and purchase history.

HDFS Daemons and Related Concepts:
• Blocks (the unit of storage, not a daemon)
• NameNode
• Secondary NameNode
• DataNode

1. Blocks: Think of your data as a big e-commerce catalog. HDFS takes this catalog and chops it into smaller pieces, like dividing a big book into many small chapters. Each of these pieces is called a "block."

2. DataNodes: Imagine these blocks are stored in separate locations, like storage units in a giant storage facility. These storage units are like "DataNodes" in HDFS. Each DataNode contains several blocks from the catalog.

3. NameNode: At the front desk of this storage facility, there's a "NameNode." This friendly receptionist knows where each block is located and how they're organized. It's like a map that helps you find your items in the storage units.

4. Secondary NameNode:

Despite its name, the Secondary NameNode is not a backup or standby NameNode; its primary purpose is to create periodic checkpoints of the filesystem namespace.

It reduces the time required to restart the NameNode after a failure by periodically merging the edits log with the current filesystem image (fsimage).

The Secondary NameNode does not provide high availability or automatic failover. In modern high-availability deployments, its checkpointing role is taken over by a Standby NameNode, so a separate Secondary NameNode is not run.

Features:

Replication: To keep your catalog safe, HDFS makes copies of each block and stores them in different storage units. It's similar to having backup copies in case something goes wrong.

Real-Life Example

Imagine you run a large e-commerce website. You have a vast catalog of products, reviews, and customer data. You decide to use HDFS to manage all this information efficiently.

1. Catalog Splitting: You take your entire e-commerce catalog, including product listings, customer reviews, and sales records, and break it into smaller blocks. Each block might contain information about a group of products or reviews.

2. Storage Facility: You rent a massive storage facility with lots of storage units. These storage units represent the DataNodes in HDFS. You put these blocks into different storage units.

3. Catalog Map: You create a giant map or directory called the NameNode. This map tells you where each block is located in the storage units, making it easy to find data quickly.

4. Safety Copies: To ensure your data is safe, you make multiple copies of each block and store them in different storage units within your facility. So, even if one storage unit has a problem, you can always find the same data in another unit.
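The "catalog map" idea can be sketched as a tiny Python dictionary; the file name, block IDs, and DataNode names below are made up for illustration. The point is that this NameNode-style map holds only metadata (which blocks make up a file and where each replica lives), while the block contents themselves sit on the DataNodes.

```python
# Toy "NameNode" metadata: file name -> ordered list of (block_id, [DataNodes holding it]).
namenode_metadata = {
    "catalog/reviews.csv": [
        ("blk_001", ["datanode-1", "datanode-2", "datanode-3"]),
        ("blk_002", ["datanode-2", "datanode-4", "datanode-5"]),
    ],
}

def locate_file(filename):
    """Ask the toy 'NameNode' where each block of a file lives (metadata only, no data)."""
    for block_id, nodes in namenode_metadata[filename]:
        print(f"{filename}: {block_id} stored on {', '.join(nodes)}")

locate_file("catalog/reviews.csv")
```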

2. What is MapReduce? Demonstrate its working with a word count example.
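MapReduce is the programming model Hadoop uses to process data in parallel: a map step turns each input record into key/value pairs, a shuffle/sort step groups those pairs by key, and a reduce step combines each group into a final result (the same five steps described in the jigsaw-puzzle answer above). The sketch below simulates these phases locally in plain Python for a tiny word count, using made-up sample lines; it is a teaching sketch of the data flow, not how a real Hadoop job is submitted (see the Hadoop Streaming sketch earlier for that).

```python
from collections import defaultdict

text = [
    "big data needs big storage",
    "hadoop breaks big data into blocks",
]

# Map phase: each line is processed independently into (word, 1) pairs.
mapped = [(word, 1) for line in text for word in line.split()]

# Shuffle and sort: group all values that share the same key (word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group into a single (word, total) result.
word_counts = {word: sum(counts) for word, counts in sorted(groups.items())}

print(word_counts)   # e.g. {'big': 3, 'blocks': 1, 'breaks': 1, 'data': 2, ...}
```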


3. Explain the working and importance of the Resource Manager in YARN.

The Resource Manager in YARN, in easy language:

What Is the Resource Manager?

Job Allocator and Manager:
Imagine you have many tasks to do and need a fair way to decide who gets to use computer resources like CPU and memory. The Resource Manager is like the boss that manages and shares these resources among different tasks or jobs.

Traffic Cop for Applications:
Think of the Resource Manager as a traffic cop at an intersection. It helps applications (programs or tasks) move smoothly, deciding who goes first and making sure everyone gets their turn.

What Does the Resource Manager Do?

Dividing Resources:
The Resource Manager decides how to divide the cluster's resources (like slices of a pizza) among different applications. It ensures that everyone gets a fair share to do their work.

Keeping Track of Tasks:
Just like a teacher keeps track of students in a classroom, the Resource Manager keeps track of all the applications running on the cluster, making sure they have what they need to finish their work.

Making Plans:
The Resource Manager plans which task should use the cluster next. It's like deciding which player gets the ball in a game. It keeps everything organized and moving.

How Does It Work?

Applications Ask for Resources:
Applications ask the Resource Manager, "Can I have some resources, please?" They say what they need, such as CPU and memory, and the Resource Manager figures out where to give it to them.

Sharing Resources:
The Resource Manager shares the resources by allocating small containers to each application. It's like sharing toys with friends – each friend gets their own space to play.

Checking in With Workers:
The Resource Manager talks to the worker machines (called NodeManagers) to see how much space they have and how things are going. It's like checking in with helpers to make sure everything is running smoothly.

Wrapping Up Tasks:
Once a task is done, the Resource Manager is told and frees up its resources. It's like finishing a job and letting the boss know, so someone else can use the computer.

Why Is It Important?

Fairness and Efficiency:
The Resource Manager makes sure everyone gets a fair turn to use the cluster. It's like making sure everyone in a group project gets a chance to work and that nobody hogs all the resources.

Avoiding Traffic Jams:
Just as a traffic cop helps avoid traffic jams, the Resource Manager ensures that computer resources are used efficiently, preventing slowdowns or problems.

Keeping Things Organized:
Think of the Resource Manager as the organizer of a big event. It keeps track of who needs what, plans how to share resources, and ensures everything runs smoothly.

In simple terms, the Resource Manager in YARN is like the boss of the cluster's resources, making sure every application gets a fair chance to use them and keeping things organized for different tasks or jobs.
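You can actually watch the Resource Manager doing this bookkeeping through its REST API. The sketch below asks an assumed ResourceManager at localhost:8088 (the usual default web port) for cluster-wide metrics and for the applications it is currently running; adjust the address for a real cluster.

```python
import requests

# Assumed ResourceManager address; adjust for your cluster.
RM = "http://localhost:8088"

# Cluster-wide resource bookkeeping kept by the ResourceManager.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("Apps running:", metrics.get("appsRunning"))
print("Memory available (MB):", metrics.get("availableMB"))

# Applications the ResourceManager is currently tracking.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```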
