BDA Exp 1

The document outlines an experiment focused on Hadoop HDFS, covering its basics, installation, and file operations. It discusses the evolution of Hadoop, its components such as HDFS, MapReduce, and YARN, and the characteristics of Big Data defined by the '5Vs'. The conclusion emphasizes the practical experience gained in using Hadoop for large-scale data processing tasks.


Experiment 1

Aim: Hadoop HDFS practical: HDFS basics and Hadoop ecosystem tools overview. Installing Hadoop. Copying a file to Hadoop. Copying from the Hadoop file system and deleting a file. Moving and displaying files in HDFS. Programming exercises on Hadoop.

Theory:
Hadoop is an open-source framework of tools distributed under the Apache License. It is used to store, manage, and process data for various big data applications running on clustered systems. In earlier years, Big Data was defined by the “3Vs”, but it is now described by the “5Vs”, which are also termed the characteristics of Big Data.

1. Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are the data produced by social networking sites, sensors, scanners, airlines and other organizations.

2. Velocity: A huge amount of data is generated every second. It was estimated that by the end of 2020 every individual would produce about 3 MB of data per second. This large volume of data is being generated with great velocity.

3. Variety: The data produced by different means is of three types: structured, unstructured and semi-structured.

4. Veracity: The term veracity refers to inconsistent or incomplete data, which results in doubtful or uncertain information. Data inconsistency often arises because of the volume of data: data in bulk can create confusion, whereas too little data may convey only half or incomplete information.

5. Value: After considering the four Vs above, one more V stands for value. Bulk data has no worth to an organization unless it is turned into something useful; raw data must be converted into something valuable before information can be extracted from it. Hence, value can be considered the most important of the 5Vs.

Evolution of Hadoop:
Hadoop was designed by Doug Cutting and Michael Cafarella in 2005,
inspired by Google's technology. It stores data using the Hadoop Distributed
File System (HDFS) and processes it with MapReduce. Hadoop's HDFS and MapReduce are modelled on Google's own technologies: the Google File System (GFS) and Google's MapReduce. Google's success in the early 2000s was largely due to these innovations, which were
detailed in papers released by Google in 2003 and 2004. Cutting and
Cafarella studied these papers and created Hadoop, naming it after Cutting's
son's toy elephant. Thus, while HDFS and MapReduce were developed by
Cutting and Cafarella, their designs were originally inspired by Google.

Components of Hadoop
 HDFS: The Hadoop Distributed File System is a dedicated file system for storing big data on a cluster of commodity (inexpensive) hardware with a streaming access pattern. It stores data on multiple nodes in the cluster, which ensures data security and fault tolerance.

 Map Reduce: Data stored in HDFS also needs to be processed. When a query is sent to process a data set in HDFS, Hadoop first identifies where that data is stored; this is called mapping. The query is then broken into multiple parts, the results of these parts are combined, and the overall result is sent back to the user; this is called the reduction process. Thus, while HDFS is used to store the data, Map Reduce is used to process it (a minimal word-count sketch is given after this list).

 YARN: YARN stands for Yet Another Resource Negotiator. It acts as a dedicated operating system for Hadoop, managing the cluster's resources and serving as the job scheduling framework. The available scheduler types include First Come First Serve (FIFO), the Fair Scheduler and the Capacity Scheduler; recent Apache Hadoop releases use the Capacity Scheduler by default.
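
To make the map and reduce steps described above concrete, the following is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names and the input/output paths passed on the command line are illustrative examples, not part of this experiment's recorded implementation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each mapper reads the HDFS block stored near it and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: the partial results from all mappers are combined per word
    // and the overall count is written back for the user.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is typically packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input> <output>.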
HDFS Basics
DFS stands for Distributed File System, a concept in which a file is stored across multiple nodes. A DFS provides the abstraction of a single large system whose storage capacity equals the combined storage of the nodes in the cluster.

HDFS (Hadoop Distributed File System) is the storage layer of a Hadoop cluster. It is mainly designed to run on commodity hardware (inexpensive devices) using a distributed file system design. HDFS is designed to store data in large blocks rather than in many small blocks. It provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster.

HDFS can handle large data sets of high volume, velocity and variety, making Hadoop more efficient and reliable, with easy access to all its components. HDFS stores data in blocks; the default block size is 128 MB, and it is configurable, meaning you can change it according to your requirements via the dfs.blocksize property in the hdfs-site.xml file in your Hadoop configuration directory.
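
As a small illustration (assumed code, not from the original report), a client program can check the block size it will get from the cluster through the Hadoop Java API; the dfs.blocksize property read here is the same one that hdfs-site.xml overrides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Default block size used for new files (128 MB unless overridden in hdfs-site.xml).
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + (blockSize / (1024 * 1024)) + " MB");

        // The same value can be read (or overridden per client/job) as dfs.blocksize.
        System.out.println("dfs.blocksize = " + conf.getLong("dfs.blocksize", 134217728L));
    }
}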
Hadoop Distributed File System (HDFS) is designed to store vast amounts of
data across many machines while providing high throughput and fault
tolerance. Key characteristics of HDFS include:

 Scalability: HDFS can handle large volumes of data by distributing storage across multiple nodes.
 Fault Tolerance: Data is replicated across multiple nodes, ensuring
availability even if a node fails.
 High Throughput: HDFS is optimized for large datasets and provides
high data transfer rates.

HDFS Architecture:

 Namenode: Manages the metadata and directory structure of HDFS.
 Datanode: Stores the actual data blocks; the client-side sketch after this list shows which Datanodes hold each block of a file.
 Secondary Namenode: Periodically merges the namespace image (fsimage) with the edit logs so that the edit log does not grow without bound and Namenode restarts stay fast.
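
This division of labour can be observed from a client program: file metadata and block locations are served by the Namenode, while the listed hosts are the Datanodes holding each block. The sketch below is an assumed example; the file path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/input.txt"); // assumed example path

        FileStatus status = fs.getFileStatus(file);      // metadata served by the Namenode
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is stored (and replicated) on one or more Datanodes.
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}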
Hadoop Ecosystem Tools Overview
The Hadoop ecosystem includes various tools and components that enhance
its functionality:

 MapReduce: A programming model for processing large datasets in parallel.
 YARN (Yet Another Resource Negotiator): Manages and schedules
resources in a Hadoop cluster.
 Hive: A data warehousing solution that provides SQL-like queries.
 Pig: A high-level platform for creating MapReduce programs using a
scripting language.
 HBase: A distributed, scalable, NoSQL database built on HDFS.
 Sqoop: A tool for transferring data between Hadoop and relational
databases.
 Flume: A service for efficiently collecting and moving large amounts of
log data.

IMPLEMENTATION
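
The file operations listed in the Aim can be carried out either with the hdfs dfs shell commands or programmatically. The sketch below uses the Hadoop Java FileSystem API; all paths are assumed examples and not taken from the original experiment.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path local = new Path("file:///home/hadoop/sample.txt"); // assumed local file
        Path dir = new Path("/user/hadoop/exp1");                // assumed HDFS directory
        fs.mkdirs(dir);

        // Copy a file from the local file system into HDFS.
        fs.copyFromLocalFile(local, new Path(dir, "sample.txt"));

        // Copy the file back from HDFS to the local file system.
        fs.copyToLocalFile(new Path(dir, "sample.txt"), new Path("file:///tmp/sample_copy.txt"));

        // Move (rename) the file within HDFS.
        fs.rename(new Path(dir, "sample.txt"), new Path(dir, "sample_moved.txt"));

        // Display the file's contents on the console.
        try (InputStream in = fs.open(new Path(dir, "sample_moved.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // List the directory, then delete the file from HDFS.
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath());
        }
        fs.delete(new Path(dir, "sample_moved.txt"), false);
    }
}

The equivalent shell commands are hdfs dfs -put, -get, -mv, -cat, -ls and -rm.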
Conclusion
This experiment provided a detailed overview of HDFS and its role
within the Hadoop ecosystem. By installing Hadoop, performing
various file operations in HDFS, and implementing a MapReduce
program, we gained practical experience and a deeper
understanding of Hadoop's capabilities. This foundational
knowledge is crucial for working with big data and leveraging
Hadoop for large-scale data processing tasks.
