BDH Unit 1

Big Data refers to large datasets characterized by volume, variety, velocity, and veracity, differing from traditional data which is smaller and structured. Handling Big Data presents challenges such as overwhelming volume, diverse formats, rapid data generation, and varying data quality. Hadoop is a framework for processing Big Data, consisting of components like MapReduce for processing and HDFS for storage, with a focus on scalability and fault tolerance.


Q1. What do you mean by Big Data? How is it different from traditional data?

Big data refers to extremely large datasets that are difficult to process using traditional data processing
applications. It's characterized by its volume, variety, velocity, and veracity.

• Volume: The sheer amount of data is massive, often measured in petabytes or exabytes.

• Variety: Big data comes in many different formats, including structured (like databases), semi-
structured (like XML or JSON), and unstructured (like text, images, and videos).

• Velocity: Data is generated at a high speed, often in real-time.

• Veracity: The quality and accuracy of the data can vary widely.

Traditional data is typically smaller, structured, and easier to manage. It's often stored in relational
databases and analyzed using traditional data analysis tools.

Key differences:

Feature      Traditional Data                    Big Data
Volume       Relatively small                    Extremely large
Variety      Structured                          Structured, semi-structured, and unstructured
Velocity     Generated slowly                    Generated rapidly
Veracity     Relatively high                     Can vary widely
Processing   Traditional data processing tools   Specialized big data tools

In essence, big data presents unique challenges due to its scale, complexity, and speed. It requires
specialized tools and techniques to extract valuable insights.

Q2. What are the challenges of handling Big Data?

Challenges of Handling Big Data:

1. Volume: The sheer amount of data can overwhelm traditional storage and processing systems.

2. Variety: Big data comes in many different formats, making it difficult to integrate and analyze.
3. Velocity: Data is generated at a high speed, requiring real-time processing capabilities.

4. Veracity: The quality and accuracy of the data can vary widely, making it difficult to trust.

5. Complexity: Big data is often complex and unstructured, making it difficult to understand and
analyze.

6. Cost: Storing and processing large amounts of data can be expensive.

7. Talent: Finding skilled professionals with expertise in big data technologies can be challenging.

8. Privacy and Security: Protecting sensitive data in large-scale systems is a major concern.

Addressing these challenges requires specialized tools, techniques, and infrastructure.

Q3. Describe the architecture of Hadoop.

Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and process very large datasets. Hadoop is built around the MapReduce programming model, which was introduced by Google. Many large companies, such as Facebook, Yahoo, Netflix, and eBay, use Hadoop to handle big data. The Hadoop architecture mainly consists of four components:

• MapReduce

• HDFS(Hadoop Distributed File System)

• YARN(Yet Another Resource Negotiator)

• Common Utilities or Hadoop Common


Let’s understand the role of each of these components in detail.

1. MapReduce

MapReduce is a programming model that runs on top of the YARN framework. Its main feature is distributed, parallel processing across a Hadoop cluster, which is what makes Hadoop so fast; when dealing with Big Data, serial processing is no longer practical. MapReduce has two main tasks, divided phase-wise:

In the first phase the Map function is applied, and in the second phase the Reduce function is applied.


Here, the input is provided to the Map() function, its output is used as input to the Reduce() function, and after that we receive the final output. Let’s understand what Map() and Reduce() do.

The input provided to Map() is a set of data blocks. The Map() function breaks these blocks into tuples, which are simply key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples by key, forms a new set of tuples, and performs operations such as sorting or summation, which are then written to the final output.

The exact processing done in the Reducer depends on the business requirement of the application. This is how Map() and then Reduce() are applied one after the other.
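
To make Map() and Reduce() concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. The class names (WordCountMapper, WordCountReducer) are illustrative; the driver that submits the job is sketched later under Q4.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): splits each input line into words and emits (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // key-value pair sent towards the Reducer
        }
    }
}

// Reduce(): receives (word, [1, 1, ...]) grouped by key and sums the counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final (word, count) output
    }
}
```

The Mapper emits many small key-value pairs and the Reducer aggregates them per key, which is exactly the Map-then-Reduce flow described above.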

Let’s understand the Map Task and Reduce Task in detail.

Map Task:

• RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function; the key is the record's positional (offset) information and the value is the data associated with it.

• Map: A map is a user-defined function that processes the tuples obtained from the RecordReader. For each input record, the Map() function may generate zero, one, or many key-value pairs.

• Combiner: The Combiner is used for grouping the data in the Map workflow. It works like a local reducer: the intermediate key-value pairs generated by the Map are combined with its help. Using a Combiner is optional.

• Partitioner: The Partitioner is responsible for distributing the key-value pairs generated in the Map phase among the Reducers. It creates one partition (shard) per Reducer by taking the hash code of each key modulo the number of reducers (key.hashCode() % numberOfReducers), as illustrated in the sketch after this list.
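
The hash-and-modulus behaviour described above is essentially what Hadoop's default HashPartitioner does. A minimal custom partitioner with the same logic might look like this; the key and value types follow the word-count sketch above and are illustrative.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (word, count) pair to a reducer based on the key's hash code.
class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```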

Reduce Task

• Shuffle and Sort: The Reducer's work starts with this step. The process in which the intermediate key-value pairs generated by the Mappers are transferred to the Reducer tasks is known as shuffling; during shuffling the system also sorts the data by key.

Shuffling begins as soon as some of the map tasks have finished, rather than waiting for all Mappers to complete, which speeds up the overall job.

• Reduce: The main task of the Reduce function is to gather the tuples generated by the Map phase and perform aggregation (for example, summation or counting) on the key-value pairs grouped by key.

• OutputFormat: Once all operations are performed, the key-value pairs are written to the output file by the RecordWriter, with each record on a new line and the key and value separated by a space.
2. HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is designed to run on commodity hardware (inexpensive devices) and follows a distributed file system design. HDFS is built around the idea of storing data in a few large blocks rather than many small blocks.

HDFS provides fault tolerance and high availability to the storage layer and to the other nodes in the Hadoop cluster. The data storage nodes in HDFS are:

• NameNode(Master)

• DataNode(Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and directs the DataNodes (slaves).

The NameNode mainly stores metadata, i.e. data about the data. Metadata includes the transaction logs that keep track of activity in the Hadoop cluster, as well as file names, sizes, and the location information (block numbers, block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.

DataNode: DataNodes work as slaves. They are mainly used for storing data in a Hadoop cluster; the number of DataNodes can range from one to 500 or more. The more DataNodes a cluster has, the more data it can store, so DataNodes should have high storage capacity to hold a large number of file blocks.

High Level Architecture Of Hadoop

File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB by default, and this block size can be changed manually.
Let's understand this splitting of a file into blocks with an example. Suppose you upload a 400 MB file to HDFS. The file is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, i.e. four blocks are created, each of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so the final block may simply hold a partial amount of data. In a typical Linux file system the block size is about 4 KB, far smaller than the default block size in HDFS. Hadoop is configured to store very large datasets, in the petabyte range, and this is what makes the Hadoop file system different from other file systems: it can scale. Nowadays block sizes of 128 MB to 256 MB are commonly used in Hadoop.
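
As a quick check of the arithmetic above, this small Java sketch (with illustrative hard-coded sizes, not values read from a real cluster) computes how a 400 MB file splits into 128 MB blocks:

```java
public class BlockSplitDemo {
    public static void main(String[] args) {
        long fileSize = 400L * 1024 * 1024;   // 400 MB file, as in the example above
        long blockSize = 128L * 1024 * 1024;  // 128 MB default HDFS block size

        long fullBlocks = fileSize / blockSize;           // 3 full blocks of 128 MB
        long lastBlockBytes = fileSize % blockSize;       // 16 MB remainder
        long totalBlocks = fullBlocks + (lastBlockBytes > 0 ? 1 : 0);

        System.out.println("Total blocks: " + totalBlocks);                           // 4
        System.out.println("Last block (MB): " + lastBlockBytes / (1024 * 1024));     // 16
    }
}
```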

Replication In HDFS: Replication ensures the availability of the data. Replication means keeping copies of something, and the number of copies kept of a particular block is its replication factor. Just as HDFS stores a file as a set of blocks, Hadoop is also configured to keep copies of those file blocks.

By default the replication factor in Hadoop is 3, and it can be changed as per your requirement. In the example above we created 4 file blocks, so with 3 copies of each block a total of 4 × 3 = 12 blocks are stored for backup purposes.

This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can fail at any time; we are not using supercomputers for a Hadoop setup. That is why HDFS needs a feature that keeps copies of file blocks, and this is what provides fault tolerance.

One thing to note is that keeping so many replicas of the file blocks consumes extra storage, but for large organizations the data is far more valuable than the storage, so this overhead is accepted. You can configure the replication factor in the hdfs-site.xml file.
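
Besides editing hdfs-site.xml, the replication factor can also be set from a client program through the HDFS API. A minimal sketch, assuming a reachable cluster at the hypothetical address hdfs://namenode:9000 and an existing file /data/sample.txt:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster address is an assumption for illustration; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Default replication factor for files created by this client.
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of an existing (hypothetical) file to 2.
            fs.setReplication(new Path("/data/sample.txt"), (short) 2);
        }
    }
}
```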
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (maybe 30 to 40 nodes). A large Hadoop cluster consists of many racks. With this rack information the NameNode chooses the closest DataNode for a read or write, which maximizes performance and reduces network traffic.

HDFS Architecture

3. YARN(Yet Another Resource Negotiator)

YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into smaller jobs so that each can be assigned to different slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which jobs are important, which have higher priority, the dependencies between jobs, and other information such as job timing. The resource manager manages all the resources that are made available for running the Hadoop cluster.

Features of YARN
• Multi-Tenancy

• Scalability

• Cluster-Utilization

• Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is the set of Java libraries and files required by all the other components in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce to run the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically in software by the Hadoop framework.

Q4. Write notes on MapReduce and HDFS in the context of Hadoop.

Hadoop: A Distributed Computing Framework

Hadoop is a distributed computing framework designed to process massive amounts of data efficiently. It
consists of two core components: MapReduce and HDFS.

MapReduce

• A programming model: It provides a simple way to write applications that process large datasets
in parallel.

• Two primary phases:

o Map phase: The input data is divided into smaller chunks, and a map function is applied
to each chunk to generate key-value pairs.

o Reduce phase: The key-value pairs are grouped by key, and a reduce function is applied
to each group to combine the values (a driver sketch wiring the two phases together follows this list).

• Key advantages:

o Scalability: It can handle massive datasets by distributing the workload across multiple
nodes.

o Fault tolerance: It can recover from node failures without losing data.

o Flexibility: It can be used for a wide variety of data processing tasks.
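
The two phases above are tied together by a driver program that configures and submits the job. Here is a minimal sketch; the WordCountMapper and WordCountReducer class names are assumptions carried over from the earlier word-count sketch, and the input/output paths are hypothetical command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // map phase
        job.setCombinerClass(WordCountReducer.class);     // optional local reduce
        job.setReducerClass(WordCountReducer.class);      // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```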

HDFS (Hadoop Distributed File System)


• A distributed file system: It is designed to store and manage large datasets across a cluster of
commodity hardware.

• Key features:

o Replication: Data is replicated across multiple nodes to ensure fault tolerance.

o Data locality: Data is stored on the same nodes as the tasks that process it to minimize
network traffic.

o Streaming access: Data can be read and written in a streaming fashion, allowing for
efficient processing of large datasets.

• Components:

o NameNode: The master node that manages the file system namespace and metadata.

o DataNode: The slave nodes that store data blocks.

Together, MapReduce and HDFS provide a powerful platform for processing large-scale datasets in a
distributed environment.

Q5. Define challenges associated with managing Big Data in an enterprise setting.

Challenges of Managing Big Data in an Enterprise Setting

Managing big data in an enterprise setting presents several unique challenges, including:

Technical Challenges

• Volume: The sheer amount of data can overwhelm traditional storage and processing systems.

• Variety: Big data comes in various formats, making it difficult to integrate and analyze.

• Velocity: Data is generated at a high speed, requiring real-time processing capabilities.

• Veracity: The quality and accuracy of the data can vary widely, making it difficult to trust.

• Complexity: Big data is often complex and unstructured, making it difficult to understand and
analyze.

• Scalability: The system must be able to handle increasing amounts of data without
compromising performance.

Organizational Challenges

• Talent: Finding skilled professionals with expertise in big data technologies can be challenging.
• Governance: Establishing policies and procedures for data management, security, and privacy is
essential.

• Integration: Integrating big data with existing enterprise systems can be complex.

• Cost: Storing and processing large amounts of data can be expensive.

• Return on Investment (ROI): Demonstrating the value of big data initiatives can be difficult.

• Cultural Change: Adopting a data-driven culture may require significant changes in
organizational behavior.

Security and Privacy Challenges

• Data breaches: Protecting sensitive data from unauthorized access is a major concern.

• Compliance: Adhering to data privacy regulations (e.g., GDPR, CCPA) can be complex.

• Data sovereignty: Ensuring that data is stored and processed in accordance with local laws and
regulations.

Addressing these challenges requires a combination of technical solutions, organizational strategies, and
a strong commitment to data governance and security.

Q6. Explain Name-Node and Data-Nodes in HDFS.

NameNode and DataNodes in HDFS

HDFS (Hadoop Distributed File System) is a distributed file system designed to store and manage large
datasets across a cluster of commodity hardware. It consists of two primary components: the NameNode
and DataNodes.

NameNode

• The master node: It is responsible for managing the file system namespace, metadata, and block
allocation.

• Key functions:

o Namespace management: It maintains a hierarchical namespace for files and directories.

o Metadata management: It stores information about files, such as their location, size,
and replication factor.

o Block allocation: It assigns blocks of data to DataNodes for storage.


o Client requests: It handles client requests for file operations, such as reading, writing,
and deleting files.

DataNode

• The slave nodes: They are responsible for storing and retrieving data blocks.

• Key functions:

o Data storage: Each DataNode stores a subset of the file system's data blocks.

o Block replication: DataNodes replicate blocks to ensure fault tolerance and data
availability.

o Block reporting: DataNodes report the blocks they are storing to the NameNode.

o Client requests: They handle client requests for reading and writing data blocks.

Together, the NameNode and DataNodes form a distributed file system that can efficiently store and
manage large datasets across a cluster of commodity hardware. The NameNode acts as the central
authority, while the DataNodes perform the actual data storage and retrieval. This architecture provides
scalability, fault tolerance, and high availability.
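
To see this division of labour in practice, the sketch below uses the HDFS Java client API to write and read back a small file. The cluster address hdfs://namenode:9000 and the path /demo/hello.txt are assumptions for illustration: the client asks the NameNode for metadata and block locations, while the actual bytes are streamed to and from DataNodes.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");      // hypothetical path

            // Write: the NameNode allocates blocks, the bytes go to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations, data is streamed from DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```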

Q7.How does HDFS ensure data reliability and fault tolerance?

HDFS ensures data reliability and fault tolerance through several mechanisms:

1. Replication:

o Multiple copies: Each block of data is replicated across multiple DataNodes.

o Redundancy: This ensures that even if one or more DataNodes fail, the data remains
accessible.

o Replication factor: The number of copies can be configured to balance redundancy with
storage efficiency.

2. Data checksums:

o Integrity verification: Each block of data is accompanied by a checksum, which is used to
verify the data's integrity during read operations.

o Corruption detection: If a checksum mismatch is detected, the corrupted block can be
replaced with a valid copy from another DataNode.

3. Pipeline mechanism:

o Write pipeline: When a client writes a block, HDFS sets up a pipeline of DataNodes through
which the block and its replicas are streamed from one DataNode to the next.
o Fault tolerance: If a DataNode in the pipeline fails during a write, the pipeline is rebuilt with
the remaining healthy DataNodes; on a read, a client that hits a failed DataNode simply fetches the block from another replica.

4. NameNode redundancy:

o Secondary NameNode: A Secondary NameNode periodically creates checkpoints of the
NameNode's namespace and metadata by merging the edit log into the file system image.

o Failover: The Secondary NameNode is not a hot standby; in a high-availability deployment a separate Standby NameNode takes over if the active NameNode fails.

5. DataNode heartbeat:

o Health monitoring: DataNodes periodically send heartbeat messages to the NameNode
to report their status.

o Failure detection: If a DataNode fails to send heartbeats, the NameNode can mark it as
dead and re-replicate its blocks.

By combining these mechanisms, HDFS provides a robust and fault-tolerant platform for storing and
managing large datasets.
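
The checksum idea in point 2 can be illustrated with plain Java. This is a conceptual sketch using CRC32; HDFS internally keeps its own CRC-based checksum data per block, so this is not the actual HDFS code path, only the verification idea.

```java
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC32 checksum over a block of bytes.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();

        long storedChecksum = checksum(block);   // computed when the block is written

        block[0] ^= 0x01;                        // simulate a corrupted bit on disk
        long readChecksum = checksum(block);     // recomputed when the block is read

        if (readChecksum != storedChecksum) {
            // In HDFS, a mismatch like this triggers reading the block from another replica.
            System.out.println("Checksum mismatch: block is corrupted, use another replica");
        }
    }
}
```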
