
Big Data Analytics

UNIT - I
INTRODUCTION TO BIG DATA: Big Data Analytics, Characteristics of Big
Data – The Four Vs, importance of Big Data, Different Use cases, Data-Structured,
Semi-Structured, Un-Structured.
INTRODUCTION TO HADOOP: Hadoop and its use in solving big data
problems, Comparison of Hadoop with RDBMS, Brief history of Hadoop, Apache
Hadoop EcoSystem, Components of Hadoop, The Hadoop Distributed File System
(HDFS), Architecture and design of HDFS in detail, Working with HDFS
(Commands)
INTRODUCTION

Data
 Data is nothing but facts and statistics stored or free-flowing over a network; generally it is raw and unprocessed.
 When data is processed, organized, structured, or presented in a given context so as to make it useful, it is called Information.

What is Big Data?

Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.

In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
Data of size in MB (Word documents, Excel sheets) or at most GB (movies, code) falls in the normal data range, but data of petabyte size (10^15 bytes) and beyond is termed Big Data.

Other definition: Big data is a term used to describe data that is high volume, high velocity, and/or high variety; that requires new technologies and techniques to capture, store, and analyze it; and that is used to enhance decision making.

Data analytics converts raw data into actionable insights. It includes a range of tools, technologies, and processes used to find trends and solve problems by using data.

Data analytics can shape business processes, improve decision-making, and drive business growth.

Example of Big Data:

Following are some examples of Big Data:

The New York Stock Exchange generates about one terabyte of new trade data per day.
Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, posting comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
DATA RANGE

Applications:
• Healthcare: With the help of predictive analytics, medical professionals are now able to provide
personalized healthcare services to individual patients.
• Academia: Big Data is also helping enhance education today. There are numerous online educational
courses to learn from, and academic institutions are investing in digital courses powered by Big Data
technologies to aid the all-round development of budding learners.
• Banking: The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently detect
fraudulent acts in real-time such as misuse of credit/debit cards.
• IT : One of the largest users of Big Data, IT companies around the world are using Big Data to optimize
their functioning, enhance employee productivity, and minimize risks in business operations. By
combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to
find solutions even for the most complex of problems.
Big Data Challenges:
The major challenges associated with big data are as follows −
• Capturing data
• Storage
• Security
• Analysis
To address these challenges, organizations normally take the help of enterprise servers.

BIG DATA ANALYTICS


• The process of analyzing large volumes of diverse data sets using advanced analytical techniques is referred to as Big Data Analytics.

There are mainly 4 types of analytics:


Descriptive Analytics: Focused solely on historical data, describing without judgment. It tells what happened in the past using statistical functions and answers the question "What happened?"
Ex: Google Analytics and Netflix
Predictive Analytics: Estimates what might happen in the future, to reduce losses for users ("What might happen in the future?"). Predictive analytics is a statistical method that utilizes algorithms and machine learning to identify trends in data and predict future behaviors.
Prescriptive Analytics: Recommends a rule or method, i.e., what kind of action should be performed ("What should we do next?"). Eg: self-driving cars, which analyze the environment and move on the road.
Diagnostic Analytics: Finds the root cause of something that has already happened, e.g., in medical diagnosis ("Why did this happen?").
Characteristics of Big Data – The Four V’s

The Four V's of Big Data:
• Volume (the amount of data)
• Velocity (the speed of data)
• Variety (the different types of data)
• Veracity (the quality of data)
Volume:
Volume refers to the unimaginable amounts of information generated every second from social media, cell phones, cars, credit cards, M2M sensors, images, videos, and more. The amount of data we deal with is of very large size, in petabytes. According to some sources, 2.3 trillion gigabytes of new data are created each day.
 Velocity:
The term 'velocity' refers to the speed of generation of data in real time. How fast the data is generated and processed to meet demands determines the real potential in the data.

 Variety:

Big Data is generated in multiple varieties. Variety of Big Data refers to structured, unstructured, and semi-structured data gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, and much more.
 Veracity:

Veracity refers to the trustworthiness of data in terms of quality and accuracy.

IMPORTANCE OF BIG DATA:


The importance of Big Data doesn't revolve around the amount of data a company has; it lies in how the company utilizes the gathered data.
Every company uses its collected data in its own way: the more effectively a company uses its data, the more rapidly it grows.
Companies in the present market need to collect and analyze data because:
1.Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when they have to
store large amounts of data. These tools help organizations in identifying more effective ways of doing
business.
2.Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources. Tools like Hadoop help
them to analyze data immediately thus helping in making quick decisions based on the learnings.
3.Understand the market conditions
Big Data analysis helps businesses to get a better understanding of market situations.
For example, analysis of customer purchasing behavior helps companies identify the products that sell the most and produce them accordingly. This helps companies get ahead of their competitors.
4.Social Media Listening/Control online reputation:
Companies can perform sentiment analysis using Big Data tools. These enable them to get feedback about
their company, that is, who is saying what about the company.
5.Big Data Analytics in Product Development:
Another huge advantage of big data is the ability to help companies innovate and redevelop their products.
SOURCES OF BIG DATA
DIFFERENT USE CASES:
Big Data Use-Cases
Let’s discuss various use cases of Big data. Below are some of the Big data use cases from different domains:
 Netflix Uses Big Data to Improve Customer Experience
 Promotion and campaign analysis by Sears Holding
 Sentiment analysis
 Customer Churn analysis
 Predictive analysis
 Real-time ad matching and serving

BIG DATA TYPES:

STRUCTURED DATA: Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It is represented in a tabular format.
Eg: An 'Employee' table in a database, spreadsheet data, etc.
SEMI-STRUCTURED DATA:
 Semi-structured data can contain elements of both structured and unstructured data.
 The data is not in the relational format and is not neatly organized into rows and columns like that in a spreadsheet.
 Since semi-structured data doesn't need a structured query language, it is commonly called NoSQL data. Semi-structured content is often used to store metadata about a business process, but it can also include files containing machine instructions for computer programs.
 This type of information typically comes from external sources such as social media platforms or other web-based data feeds.
 Eg: Zip files
 e-mails
 HTML
 XML files (eXtensible Markup Language, a text-based markup language designed to store and transport data), etc.

UN-STRUCTURED DATA:
 Unstructured data refers to the data that lacks any specific form or structure whatsoever. This makes it very difficult and time-
consuming to process and analyze unstructured data.
 Additionally, unstructured data is also known as “dark data” because it cannot be analyzed without the proper software tools.
 Eg: Email,
 Audio,
 simple text files,
 images,
 videos ,
 sensor data, Websites, logs etc.
Use case
An e-commerce site XYZ (having 100 million users) wants to offer a $100 gift voucher to its top 10 customers who have spent the most in the previous year. Moreover, it wants to find the buying trend of these customers so that the company can suggest more items related to them.
Issues
Huge amount of unstructured data which needs to be stored, processed and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the required output (a brief sketch follows below).
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
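As referenced above, here is a rough sketch of how the "total spend per customer" processing step might look as Hadoop MapReduce code. The class names, and the assumption that each input record is a line of the form customerId,amount, are illustrative only; the job driver is omitted, and selecting the top 10 customers would be a follow-up step (for example, by sorting the totals).

Example (Java):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomerSpend {

  // Mapper: each input line is assumed to look like "customerId,amount"
  public static class SpendMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length >= 2) {
        context.write(new Text(fields[0].trim()),
                      new DoubleWritable(Double.parseDouble(fields[1].trim())));
      }
    }
  }

  // Reducer: totals the spend per customer; the top 10 can then be picked from these totals
  public static class SpendReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text customerId, Iterable<DoubleWritable> amounts, Context context)
        throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable amount : amounts) {
        total += amount.get();
      }
      context.write(customerId, new DoubleWritable(total));
    }
  }
}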
Introduction to Hadoop
 Hadoop is an open-source software framework that is used
for storing and processing large amounts of data in a
distributed computing environment.
 It is designed to handle big data and is based on the
MapReduce programming model, which allows for the
parallel processing of large datasets.
How Hadoop Solves the Big Data Problem
Hadoop is built to run on a cluster of machines
Let's start with an example. Say that we need to store lots of photos. We will start with
a single disk. When we exceed a single disk, we may use a few disks stacked on a machine.
When we max out all the disks on a single machine, we need to get a bunch of machines,
each with a bunch of disks.
This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the get go.
Hadoop clusters scale horizontally
More storage and compute power can be achieved by adding more nodes to a Hadoop
cluster. This eliminates the need to buy more and more powerful and expensive hardware.

Hadoop can handle unstructured/semi-structured data


Hadoop doesn't enforce a schema on the data it stores. It can handle arbitrary text and binary
data. So Hadoop can digest any unstructured data easily.

Hadoop clusters provide storage and computing


We saw how having separate storage and processing clusters is not the best fit for big data.
Hadoop clusters, however, provide storage and distributed computing all in one.
Comparison of Hadoop with RDBMS
RDBMS (Relational Database Management System):

 RDBMS is an information management system, which is based on a data


model.
 In RDBMS tables are used for information storage.
 Each row of the table represents a record and column represents an attribute
of data.
 Organization of data and their manipulation processes are different in
RDBMS from other databases.
 RDBMS ensures ACID (atomicity, consistency, isolation, durability)
properties required for designing a database.
 The purpose of RDBMS is to store, manage, and retrieve data as quickly
and reliably as possible.
Hadoop:

 It is an open-source software framework used for storing data and


running applications on a group of commodity hardware.
 It has large storage capacity and high processing power.
 It can manage multiple concurrent processes at the same time.
 It is used in predictive analysis, data mining and machine learning.
 It can handle both structured and unstructured form of data.
 It is more flexible in storing, processing, and managing data than
traditional RDBMS.
 Unlike traditional systems, Hadoop enables multiple analytical
processes on the same data at the same time.
 It supports scalability very flexibly.
Comparison / Differences between RDBMS and Hadoop:

1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: Mostly structured data is processed. | Hadoop: Both structured and unstructured data are processed.
3. RDBMS: Best suited for the OLTP environment. | Hadoop: Best suited for BIG data.
4. RDBMS: Less scalable than Hadoop. | Hadoop: Highly scalable.
5. RDBMS: Data normalization is required. | Hadoop: Data normalization is not required.
6. RDBMS: Stores transformed and aggregated data. | Hadoop: Stores huge volumes of data.
7. RDBMS: Has no latency in response. | Hadoop: Has some latency in response.
8. RDBMS: The data schema is of static type. | Hadoop: The data schema is of dynamic type.
9. RDBMS: High data integrity available. | Hadoop: Lower data integrity than RDBMS.
10. RDBMS: Cost is applicable for licensed software. | Hadoop: Free of cost, as it is open-source software.
Brief History of HADOOP

Apache Hadoop Ecosystem
Introduction:
 Hadoop Ecosystem is a platform or a suite which provides various
services to solve the big data problems.
 It includes Apache projects and various commercial tools and
solutions.

 There are four major elements of Hadoop i.e. HDFS, MapReduce,


YARN, and Hadoop Common.
 Most of the tools or solutions are used to supplement or support these
major elements.
 All these tools work collectively to provide services such as
absorption, analysis, storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS : Hadoop Distributed File System


 YARN : Yet Another Resource Negotiator
 MapReduce : Programming based Data Processing
 Spark : In-Memory data processing
 PIG, HIVE : Query based processing of data services
 Hbase : NoSQL Database
 Mahout, Spark MLLib : Machine Learning algorithm libraries
 Solr, Lucene : Searching and Indexing
 Zookeeper : Managing cluster
 Oozie : Job Scheduling
Note:

Apart from the above-mentioned components, there are


many other components too that are part of the Hadoop
ecosystem.

 All these toolkits or components revolve around one thing, i.e., data. That is the beauty of Hadoop: everything revolves around data, which makes its synthesis easier.
HDFS:

 HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.

 HDFS consists of two core components i.e.


1. Name node
2. Data Node

 Name Node is the prime node that contains metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.

 HDFS maintains all the coordination between the clusters and hardware, thus working
at the heart of the system.
 HDFS consists of two core components i.e.
1. Name node
 NameNode works as a Master in a Hadoop cluster that Guides the
Datanode(Slaves).
 Namenode is mainly used for storing the Metadata i.e. nothing but the data about
the data.
 Meta Data can be the transaction logs that keep track of the user’s activity in a
Hadoop cluster.
 Meta Data can also be the name of the file, size, and the information about the
location(Block number, Block ids) of Datanode that Namenode stores to find the
closest DataNode for Faster Communication.
 Namenode instructs the DataNodes with the operation like delete, create, Replicate,
etc.
 As our NameNode is working as a Master it should have a high RAM or Processing
power in order to Maintain or Guide all the slaves in a Hadoop cluster.
 Namenode receives heartbeat signals and block reports from all the slaves i.e.
DataNodes.
 HDFS consists of two core components i.e.
1. Data node

 DataNodes work as slaves. DataNodes are mainly utilized for storing data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more, and the more DataNodes a Hadoop cluster has, the more data can be stored.

 So it is advised that a DataNode should have high storage capacity to store a large number of file blocks.

 Datanode performs operations like creation, deletion, etc. according to the


instruction provided by the NameNode.
YARN:

 Yet Another Resource Negotiator: as the name implies, YARN helps manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.

 Consists of three major components i.e.


1. Resource Manager
2. Node Manager
3. Application Manager

 The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Managers and performs negotiations as per the requirements of the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps write applications that transform big data sets into manageable ones.

 MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:

1. Map() performs sorting and filtering of data, thereby organizing it into groups. Map() generates key-value pair results, which are later processed by the Reduce() method.

2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples, as illustrated in the sketch below.
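To make the Map()/Reduce() roles concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. The class names, input/output paths, and job wiring are illustrative assumptions, not part of the original material.

Example (Java):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer logic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job is typically packaged as a jar and submitted with something like hadoop jar wordcount.jar WordCount /input /output, where both paths are HDFS paths.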
PIG:

Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.

 It is a platform for structuring the data flow, processing and analyzing huge data sets.

 Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.

 Pig Latin language is specially designed for this framework which runs on Pig Runtime.
Just the way Java runs on the JVM.

 Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:

 With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).

 It is highly scalable as it allows both real-time and batch processing. Also, all the SQL datatypes are supported by Hive, making query processing easier.

 Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.

 JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE Command Line helps in the processing of queries, as sketched below.
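As a sketch of how a client program might query Hive through its JDBC driver: the HiveServer2 host/port, database, user, and the 'employee' table below are hypothetical, and the hive-jdbc driver jar is assumed to be on the classpath.

Example (Java):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Assumed HiveServer2 endpoint; adjust host, port, database and credentials as needed
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hiveuser", "");
         Statement stmt = con.createStatement()) {

      // HQL looks like SQL; 'employee' is a hypothetical table
      try (ResultSet rs = stmt.executeQuery(
               "SELECT dept, COUNT(*) FROM employee GROUP BY dept")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}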
Mahout:

 Mahout, allows Machine Learnability to a system or


application. Machine Learning, as the name suggests helps the system
to develop itself based on some patterns, user/environmental
interaction or on the basis of algorithms.

 It provides various libraries or functionalities such as collaborative


filtering, clustering, and classification which are nothing but concepts
of Machine learning. It allows invoking algorithms as per our need
with the help of its own libraries.
Apache Spark:

 It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.

 It consumes in-memory resources, hence being faster than the prior frameworks in terms of optimization.

 Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data or batch processing; hence both are used interchangeably in most companies. A small Spark example follows.
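For contrast with the MapReduce example above, here is a minimal in-memory word count using Spark's Java API; the application name and HDFS paths are assumptions for illustration.

Example (Java):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // The data set is held in memory across the chained transformations below
      JavaRDD<String> lines = sc.textFile("hdfs:///geeks/AI.txt");      // assumed input path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile("hdfs:///geeks/wordcount_out");             // assumed output path
    }
  }
}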
Apache HBase:

 It's a NoSQL database that supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable and is therefore able to work on Big Data sets effectively.

 At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing and quickly retrieving limited amounts of data; a small client sketch follows.
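A minimal sketch of the HBase Java client API, assuming an HBase installation whose configuration (hbase-site.xml) is on the classpath and a hypothetical table 'customer' with a column family 'info' that already exists.

Example (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("customer"))) {

      // Write one cell: row key "cust1001", column info:city
      Put put = new Put(Bytes.toBytes("cust1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Hyderabad"));
      table.put(put);

      // Fast lookup of a single row by key -- the access pattern HBase is designed for
      Result result = table.get(new Get(Bytes.toBytes("cust1001")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
    }
  }
}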
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:

 Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is based on Java and also allows a spell-check mechanism; Solr is built on top of Lucene.

 Zookeeper: There was a huge issue of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.

 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
HDFS Architecture
 The Hadoop Distributed File System follows the master-slave architecture.
 Each cluster comprises a single master node and multiple slave
nodes. Internally the files get divided into one or more blocks,
and each block is stored on different slave machines depending
on the replication factor
 The master node stores and manages the file system namespace,
that is information about blocks of files like block locations,
permissions, etc. The slave nodes store data blocks of files.
 The Master node is the NameNode and DataNodes are the slave
nodes.
 Let’s discuss each of the nodes in the Hadoop HDFS Architecture
in detail.
What is HDFS NameNode?

NameNode is the centerpiece of the Hadoop Distributed File System.


It maintains and manages the file system namespace and provides
the right access permission to the clients.

The NameNode stores information about blocks locations,


permissions, etc. on the local disk in the form of two files:
•Fsimage: Fsimage stands for File System image. It contains the
complete namespace of the Hadoop file system since the NameNode
creation.
•Edit log: It contains all the recent changes performed to the file system namespace since the most recent Fsimage.
Functions of HDFS NameNode
1. It executes the file system namespace operations like opening,

renaming, and closing files and directories.


2. NameNode manages and maintains the DataNodes.
3. It determines the mapping of blocks of a file to DataNodes.
4. NameNode records each change made to the file system
namespace.
5. It keeps the locations of each block of a file.
6. NameNode takes care of the replication factor of all the blocks.
7. NameNode receives heartbeat and block reports from all
DataNodes that ensure DataNode is alive.
8. If a DataNode fails, the NameNode chooses new DataNodes for new replicas of the blocks that were stored on it.
What is HDFS DataNode?

DataNodes are the slave nodes in Hadoop HDFS. DataNodes


are inexpensive commodity hardware. They store blocks of a file.

Functions of DataNode

1. DataNode is responsible for serving the client read/write requests.


2. Based on the instructions from the NameNode, DataNodes perform block creation, replication, and deletion.
3. DataNodes send a heartbeat to NameNode to report the health of
HDFS.
4. DataNodes also send block reports to the NameNode to report the list of blocks they store.
What is Secondary NameNode?
 Apart from DataNode and NameNode, there is another daemon called
the secondary NameNode.

 Secondary NameNode works as a helper node to primary NameNode but doesn’t


replace primary NameNode.

 When the NameNode starts, the NameNode merges the Fsimage and edit logs file
to restore the current file system namespace.

 Since the NameNode runs continuously for a long time without any restart, the
size of edit logs becomes too large.
 This will result in a long restart time for NameNode.

 Secondary NameNode solves this issue.

 Secondary NameNode downloads the Fsimage file and edit logs file from
NameNode.
It periodically applies edit logs to Fsimage and refreshes the edit
logs. The updated Fsimage is then sent to the NameNode so that
NameNode doesn’t have to re-apply the edit log records during its
restart. This keeps the edit log size small and reduces the
NameNode restart time.

If the NameNode fails, the last saved Fsimage on the secondary


NameNode can be used to recover file system metadata. The
secondary NameNode performs regular checkpoints in HDFS.
Working with HDFS Commands
HDFS is the primary or major component of the Hadoop ecosystem which is
responsible for storing large data sets of structured or unstructured data across various
nodes and thereby maintaining the metadata in the form of log files. To use the HDFS
commands, first you need to start the Hadoop services using the following command:
sbin/start-all.sh
To check whether the Hadoop services are up and running, use the following command:
jps
Commands:
1. ls: This command is used to list all the files. Use lsr for recursive
approach. It is useful when we want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. bin directory contains
executables so, bin/hdfs means we want the executables of hdfs
particularly dfs(Distributed File System) commands.
2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let's first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

creating home directory:

bin/hdfs dfs -mkdir /user

bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.
3. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
4. copyFromLocal (or) put: To copy files/folders from local file system to
hdfs store. This is the most important command. Local filesystem means
the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want
to copy to folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

(OR)

bin/hdfs dfs -put ../Desktop/AI.txt /geeks


5. cat: To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt
6. copyToLocal (or) get: To copy files/folders from hdfs store to local file
system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

(OR)

bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero


myfile.txt from geeks folder will be copied to folder hero present
on Desktop.

Note: Observe that we don’t write bin/hdfs while checking the things
present on local filesystem.
7. moveFromLocal: This command will move a file from local to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
8. cp: This command is used to copy files within hdfs. Let's copy folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
9. mv: This command is used to move files within hdfs. Let's cut-paste a file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the
directory then the directory itself.
11. du: It will give the size of each file in a directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
12. dus: This command will give the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
13. stat: It will give the last modified time of a directory or path. In short, it will give stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
14. setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in
HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for a directory geeksInput stored in HDFS.
bin/hdfs dfs -setrep -R 4 /geeksInput
Note: The -w means wait till the replication is completed. And -R means
recursively, we use it for directories as they may also contain many files
and folders inside them.
Note: There are more commands in HDFS but we discussed the commands which
are commonly used when working with Hadoop. You can check out the list
of dfs commands using the following command:
bin/hdfs dfs
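Besides the shell, the same file operations can be performed programmatically through the HDFS Java API (FileSystem). Below is a minimal sketch; it assumes a running cluster whose core-site.xml/hdfs-site.xml are on the classpath, and the paths used are examples only.

Example (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {

      // equivalent of: bin/hdfs dfs -mkdir /geeks
      fs.mkdirs(new Path("/geeks"));

      // roughly equivalent of: bin/hdfs dfs -touchz /geeks/myfile.txt (here we also write a line)
      try (FSDataOutputStream out = fs.create(new Path("/geeks/myfile.txt"), true)) {
        out.writeBytes("hello hdfs\n");
      }

      // equivalent of: bin/hdfs dfs -copyFromLocal <local file> /geeks (local path is an assumption)
      fs.copyFromLocalFile(new Path("/home/user/Desktop/AI.txt"), new Path("/geeks"));

      // equivalent of: bin/hdfs dfs -ls /geeks
      for (FileStatus status : fs.listStatus(new Path("/geeks"))) {
        System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
      }
    }
  }
}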
THE END
