BDA Exp 1

The document outlines an experiment focused on Hadoop HDFS, covering its basics, installation, and file operations. It discusses the evolution of Hadoop, its components such as HDFS, MapReduce, and YARN, and the characteristics of Big Data defined by the '5Vs'. The conclusion emphasizes the practical experience gained in using Hadoop for large-scale data processing tasks.


Experiment 1

Aim: Hadoop HDFS practical: HDFS basics and Hadoop ecosystem tools overview. Installing Hadoop. Copying a file to Hadoop. Copying from the Hadoop file system and deleting a file. Moving and displaying files in HDFS. Programming exercises on Hadoop.

Theory:
Hadoop is an open-source framework of tools distributed under the Apache License. It is used to store, manage, and process data for various big data applications running on clustered systems. In earlier years, Big Data was defined by the “3Vs”, but it is now described by the “5Vs”, which are also termed the characteristics of Big Data.

1. Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are the data produced by social networking sites, sensors, scanners, airlines and other organizations.

2. Velocity: A huge amount of data is generated every second. It was estimated that by the end of 2020 every individual would produce about 3 MB of data per second. This large volume of data is being generated with great velocity.

3. Variety: The data produced by different means is of three types: structured, unstructured and semi-structured.

4. Veracity: The term veracity refers to inconsistent or incomplete data, which results in doubtful or uncertain information. Data inconsistency often arises because of the volume of data: data in bulk can create confusion, whereas too little data may convey only half or incomplete information.

5. Value: After considering the four Vs above, one more V stands for value. Bulk data has no worth to an organization unless it is turned into something useful; raw data must be converted into something valuable before information can be extracted from it. Hence, value can be considered the most important of the 5Vs.

Evolution of Hadoop:
Hadoop was designed by Doug Cutting and Michael Cafarella in 2005,
inspired by Google's technology. It stores data using the Hadoop Distributed
File System (HDFS) and processes it with MapReduce. Hadoop's HDFS and MapReduce are modelled on Google's own technologies: the Google File System (GFS) and Google's MapReduce. Google's success in the early 2000s was largely due to these innovations, which were
detailed in papers released by Google in 2003 and 2004. Cutting and
Cafarella studied these papers and created Hadoop, naming it after Cutting's
son's toy elephant. Thus, while HDFS and MapReduce were developed by
Cutting and Cafarella, their designs were originally inspired by Google.

Components of Hadoop
 HDFS: The Hadoop Distributed File System is a dedicated file system for storing big data on a cluster of commodity (inexpensive) hardware with a streaming access pattern. It stores data on multiple nodes in the cluster, which ensures data security and fault tolerance.

 Map Reduce: Data stored in HDFS also needs to be processed. When a query is sent to process a data set in HDFS, Hadoop first identifies where that data is stored; this is called mapping. The query is then broken into multiple parts, the results of these parts are combined, and the overall result is sent back to the user; this is called the reduction process. Thus, while HDFS is used to store the data, Map Reduce is used to process it (a minimal word-count sketch is given after this list).

 YARN: YARN stands for Yet Another Resource Negotiator. It acts as a dedicated operating system for Hadoop, managing the cluster's resources and serving as the job scheduling framework. The available scheduler types include First Come First Serve (FIFO), the Fair Scheduler and the Capacity Scheduler; recent Apache Hadoop releases use the Capacity Scheduler by default.
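
To make the map and reduce steps described above concrete, the following is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names and the input/output paths passed on the command line are illustrative examples, not part of this experiment's recorded implementation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each mapper reads the HDFS block stored near it and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: the partial results from all mappers are combined per word
    // and the overall count is written back for the user.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is typically packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input> <output>.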
HDFS Basics
DFS stands for Distributed File System, a concept in which a file is stored across multiple nodes. A DFS provides the abstraction of a single large system whose storage capacity equals the combined storage of the nodes in the cluster.

HDFS (Hadoop Distributed File System) is the storage layer of a Hadoop cluster. It is mainly designed to run on commodity hardware (inexpensive devices) using a distributed file system design. HDFS is designed to store data in large blocks rather than in many small blocks. It provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster.

HDFS can handle large data sets of high volume, velocity and variety, making Hadoop more efficient and reliable, with easy access to all its components. HDFS stores data in blocks; the default block size is 128 MB, and it is configurable, meaning you can change it according to your requirements via the dfs.blocksize property in the hdfs-site.xml file in your Hadoop configuration directory.
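
As a small illustration (assumed code, not from the original report), a client program can check the block size it will get from the cluster through the Hadoop Java API; the dfs.blocksize property read here is the same one that hdfs-site.xml overrides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Default block size used for new files (128 MB unless overridden in hdfs-site.xml).
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + (blockSize / (1024 * 1024)) + " MB");

        // The same value can be read (or overridden per client/job) as dfs.blocksize.
        System.out.println("dfs.blocksize = " + conf.getLong("dfs.blocksize", 134217728L));
    }
}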
Hadoop Distributed File System (HDFS) is designed to store vast amounts of
data across many machines while providing high throughput and fault
tolerance. Key characteristics of HDFS include:

 Scalability: HDFS can handle large volumes of data by distributing storage across multiple nodes.
 Fault Tolerance: Data is replicated across multiple nodes, ensuring
availability even if a node fails.
 High Throughput: HDFS is optimized for large datasets and provides
high data transfer rates.

HDFS Architecture:

 Namenode: Manages the metadata and directory structure of HDFS.
 Datanode: Stores the actual data blocks; the client-side sketch after this list shows which Datanodes hold each block of a file.
 Secondary Namenode: Periodically merges the namespace image (fsimage) with the edit logs so that the edit log does not grow without bound and Namenode restarts stay fast.
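
This division of labour can be observed from a client program: file metadata and block locations are served by the Namenode, while the listed hosts are the Datanodes holding each block. The sketch below is an assumed example; the file path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/input.txt"); // assumed example path

        FileStatus status = fs.getFileStatus(file);      // metadata served by the Namenode
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is stored (and replicated) on one or more Datanodes.
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}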
Hadoop Ecosystem Tools Overview
The Hadoop ecosystem includes various tools and components that enhance
its functionality:

 MapReduce: A programming model for processing large datasets in parallel.
 YARN (Yet Another Resource Negotiator): Manages and schedules
resources in a Hadoop cluster.
 Hive: A data warehousing solution that provides SQL-like queries.
 Pig: A high-level platform for creating MapReduce programs using a
scripting language.
 HBase: A distributed, scalable, NoSQL database built on HDFS.
 Sqoop: A tool for transferring data between Hadoop and relational
databases.
 Flume: A service for efficiently collecting and moving large amounts of
log data.

IMPLEMENTATION
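
The file operations listed in the Aim can be carried out either with the hdfs dfs shell commands or programmatically. The sketch below uses the Hadoop Java FileSystem API; all paths are assumed examples and not taken from the original experiment.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path local = new Path("file:///home/hadoop/sample.txt"); // assumed local file
        Path dir = new Path("/user/hadoop/exp1");                // assumed HDFS directory
        fs.mkdirs(dir);

        // Copy a file from the local file system into HDFS.
        fs.copyFromLocalFile(local, new Path(dir, "sample.txt"));

        // Copy the file back from HDFS to the local file system.
        fs.copyToLocalFile(new Path(dir, "sample.txt"), new Path("file:///tmp/sample_copy.txt"));

        // Move (rename) the file within HDFS.
        fs.rename(new Path(dir, "sample.txt"), new Path(dir, "sample_moved.txt"));

        // Display the file's contents on the console.
        try (InputStream in = fs.open(new Path(dir, "sample_moved.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // List the directory, then delete the file from HDFS.
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath());
        }
        fs.delete(new Path(dir, "sample_moved.txt"), false);
    }
}

The equivalent shell commands are hdfs dfs -put, -get, -mv, -cat, -ls and -rm.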
Conclusion
This experiment provided a detailed overview of HDFS and its role
within the Hadoop ecosystem. By installing Hadoop, performing
various file operations in HDFS, and implementing a MapReduce
program, we gained practical experience and a deeper
understanding of Hadoop's capabilities. This foundational
knowledge is crucial for working with big data and leveraging
Hadoop for large-scale data processing tasks.
