UNIT - I
INTRODUCTION TO BIG DATA: Big Data Analytics, Characteristics of Big
Data – The Four Vs, importance of Big Data, Different Use cases, Data-Structured,
Semi-Structured, Un-Structured.
INTRODUCTION TO HADOOP: Hadoop and its use in solving big data
problems, Comparison of Hadoop with RDBMS, Brief history of Hadoop, Apache
Hadoop EcoSystem, Components of Hadoop, The Hadoop Distributed File System
(HDFS), Architecture and design of HDFS in detail, Working with HDFS
(Commands)
INTRODUCTION
Data
Data is nothing but facts and statistics stored in, or flowing freely over, a network or system.
Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
In short, such data is so large and complex that none of the traditional data
management tools can store or process it efficiently.
Data of size in MB (Word documents, Excel sheets) or at most a few GB (movies, code) falls in the normal data
range, but data of petabyte scale (10^15 bytes) and beyond is termed Big Data.
Other definition: Big data is a term used to describe data that is high volume, high velocity,
and/or high variety; requires new technologies and techniques to capture, store, and analyze it; and is
used to enhance decision making.
Data analytics converts raw data into actionable insights. It includes a range of tools,
technologies, and processes used to find trends and solve problems by using data.
Data analytics can shape business processes, improve decision-making, and drive business growth.
Applications:
• Healthcare: With the help of predictive analytics, medical professionals are now able to provide
personalized healthcare services to individual patients.
• Academia: Big Data is also helping enhance education today. There are numerous online educational
courses to learn from, and academic institutions are investing in digital courses powered by Big Data
technologies to aid the all-round development of budding learners.
• Banking: The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently detect
fraudulent acts in real-time such as misuse of credit/debit cards.
• IT: One of the largest users of Big Data, IT companies around the world are using Big Data to optimize
their functioning, enhance employee productivity, and minimize risks in business operations. By
combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to
find solutions even for the most complex of problems.
Big Data Challenges:
The major challenges associated with big data are as follows:
• Capturing data
• Storage
• Security
• Analysis
• To address these challenges, organizations normally take the help of enterprise servers.
The Four V's of Big Data: Volume, Velocity, Variety (the different types of data), and Veracity (the quality of data).
Volume:
Volume refers to the unimaginable amounts of information generated every second from social media, cell phones, cars, credit cards,
M2M sensors, images, videos, and more. The amount of data we deal with is of very large size, on the order of petabytes. According to some
sources, around 2.3 trillion gigabytes of new data is created each day.
Velocity:
The term 'velocity' refers to the speed of generation of data in real time. How fast the data is generated and processed to meet the
demands determines the real potential in the data.
Variety:
Big Data is generated in multiple varieties. Variety of Big Data refers to structured, unstructured, and semi-structured data that
is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in
an array of forms such as emails, PDFs, photos, videos, audio, and much more.
Veracity:
Veracity refers to the quality and trustworthiness of the data. Data gathered from many sources can be inconsistent, incomplete, or inaccurate, so its reliability must be assessed before it is analyzed.
STRUCTURED DATA: Structured data is one of the types of big data. By structured data, we mean data that can be
processed, stored, and retrieved in a fixed format; it is typically represented in a tabular format.
Eg: An 'Employee' table in a database is an example of structured data; spreadsheet data is another.
SEMI-STRUCTURED DATA:
Semi-structured data can contain both structured and unstructured forms of data.
The data is not in the relational format and is not neatly organized into rows and columns like that in a spreadsheet.
Since semi-structured data does not require a structured query language, it
is commonly called NoSQL data. Semi-structured content is often used to store metadata about a business process, but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as
social media platforms or other web-based data feeds (a small illustration follows the examples below).
Eg: Zip files,
e-mails,
HTML,
XML files (eXtensible Markup Language, a text-based markup language designed to store and transport data), etc.
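To make the contrast concrete, here is a small, purely illustrative Python sketch (the record and field names are made up) showing the same information as a fixed structured row and as a self-describing semi-structured JSON document:

import json

# Structured: a fixed set of columns, as in an RDBMS table or spreadsheet row.
structured_row = ("E101", "Asha", "Hyderabad", 54000)   # id, name, city, salary

# Semi-structured: a self-describing JSON document with nested and optional fields.
semi_structured = json.loads("""
{
  "id": "E101",
  "name": "Asha",
  "contact": {"email": "asha@example.com"},
  "skills": ["Hadoop", "Hive"]
}
""")

print(structured_row[3])             # access by fixed position (column)
print(semi_structured["skills"])     # access by key; the schema travels with the data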
UN-STRUCTURED DATA:
Unstructured data refers to data that lacks any specific form or structure. This makes it very difficult and time-consuming to process and analyze unstructured data.
Additionally, unstructured data is also known as "dark data" because it cannot be analyzed without the proper software tools.
Eg: Email,
Audio,
simple text files,
images,
videos ,
sensor data, Websites, logs etc.
Use case
An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of 100$ to
its top 10 customers who have spent the most in the previous year. Moreover, they want to
find the buying trend of these customers so that the company can suggest more items related to
them.
Issues
Huge amount of unstructured data which needs to be stored, processed and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
which uses commodity hardware to form clusters and store data in a distributed fashion. It
works on the write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to find the
required output.
Analyze: Pig and Hive can be used to analyze the data (a small sketch of the processing logic follows below).
Cost: Hadoop is open source, so cost is no longer an issue.
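The following is a minimal, local Python sketch of the processing logic only (the order records and field names are invented for illustration); in a real deployment, the map, shuffle, and reduce steps would be distributed by Hadoop across the cluster:

from collections import defaultdict

# Invented sample order records: (customer_id, amount_spent)
orders = [("c1", 120.0), ("c2", 75.5), ("c1", 300.0),
          ("c3", 40.0), ("c2", 500.0), ("c3", 10.0)]

# "Map" phase: emit (customer_id, amount) pairs.
mapped = [(cust, amount) for cust, amount in orders]

# "Shuffle + Reduce" phase: sum the spend per customer.
totals = defaultdict(float)
for cust, amount in mapped:
    totals[cust] += amount

# Pick the top spenders (top 2 here; top 10 in the actual use case).
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)   # [('c2', 575.5), ('c1', 420.0)]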
Introduction to Hadoop
Hadoop is an open-source software framework that is used
for storing and processing large amounts of data in a
distributed computing environment.
It is designed to handle big data and is based on the
MapReduce programming model, which allows for the
parallel processing of large datasets.
How Hadoop Solves the Big Data Problem
Hadoop is built to run on a cluster of machines
Let's start with an example. Say we need to store lots of photos. We start with
a single disk. When we exceed a single disk, we may use a few disks stacked on a machine.
When we max out all the disks on a single machine, we need to get a bunch of machines,
each with a bunch of disks.
This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the get go.
Hadoop clusters scale horizontally
More storage and compute power can be achieved by adding more nodes to a Hadoop
cluster. This eliminates the need to buy more and more powerful and expensive hardware.
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes,
thereby maintaining the metadata in the form of log files.
The NameNode is the prime node and contains the metadata (data about data), requiring
comparatively fewer resources than the DataNodes that store the actual data. These
DataNodes are commodity hardware in the distributed environment, which undoubtedly
makes Hadoop cost-effective.
HDFS maintains all the coordination between the cluster and the hardware, thus working
at the heart of the system.
HDFS consists of two core components, i.e.
1. NameNode
The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves).
The NameNode is mainly used for storing the metadata, i.e. the data about
the data.
Metadata can be the transaction logs that keep track of the user's activity in a
Hadoop cluster.
Metadata can also be the name of the file, its size, and the location information
(block numbers, block IDs) of the DataNodes, which the NameNode stores to find the
closest DataNode for faster communication.
The NameNode instructs the DataNodes to perform operations such as delete, create, replicate,
etc.
Since the NameNode works as the master, it should have high RAM and processing
power in order to maintain and guide all the slaves in the Hadoop cluster.
The NameNode receives heartbeat signals and block reports from all the slaves, i.e. the
DataNodes (a toy illustration of the heartbeat idea follows below).
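A toy Python sketch (illustrative only; the real heartbeat protocol, intervals, and timeouts are internal to Hadoop and configurable) of how a master could use heartbeat timestamps to decide which slaves are alive:

import time

# Last time each DataNode was heard from (hypothetical node names and timestamps).
last_heartbeat = {"dn1": time.time(), "dn2": time.time() - 45, "dn3": time.time() - 700}

HEARTBEAT_TIMEOUT = 600   # seconds after which a node is treated as dead (illustrative value)

def live_datanodes(heartbeats, now=None):
    now = now if now is not None else time.time()
    return [node for node, ts in heartbeats.items() if now - ts <= HEARTBEAT_TIMEOUT]

print(live_datanodes(last_heartbeat))   # dn3 has not reported recently, so it is excluded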
2. DataNode
DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a
Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more, and the
more DataNodes a Hadoop cluster has, the more data it can store.
It is therefore advised that a DataNode should have a high storage capacity to store a large
number of file blocks.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one that helps
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.
The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas the Node Managers work on the allocation of resources such as CPU,
memory, and bandwidth per machine and later acknowledge the Resource Manager.
The Application Manager works as an interface between the Resource Manager and the Node
Managers and performs negotiations as per the requirement of the two (a toy allocation sketch follows below).
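A toy Python sketch of the resource-allocation idea only (node names and capacities are made up; the real YARN scheduler is far more sophisticated, handling queues, data locality, and fairness):

# Free memory (in MB) reported by each NodeManager to the ResourceManager (invented values).
node_free_mem = {"node1": 4096, "node2": 1024, "node3": 8192}

def allocate_container(nodes, requested_mb):
    """Pick the node with the most free memory that can satisfy the request."""
    candidates = [(mem, name) for name, mem in nodes.items() if mem >= requested_mb]
    if not candidates:
        return None                      # request cannot be satisfied right now
    mem, chosen = max(candidates)
    nodes[chosen] -= requested_mb        # that NodeManager's free capacity shrinks
    return chosen

print(allocate_container(node_free_mem, 2048))   # node3
print(allocate_container(node_free_mem, 2048))   # node3 again (still has 6144 MB free)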
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing
logic to the data and helps write applications that transform big
data sets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it in the
form of groups. Map generates key-value pairs as its result, which are later
processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples (a minimal word-count sketch follows below).
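A minimal word-count sketch in the Map/Reduce style, written as plain Python so it can be run locally (with Hadoop Streaming, the same mapper/reducer logic would read from stdin, and the framework would do the sorting and shuffling between the two phases):

from itertools import groupby

def mapper(lines):
    # Map(): filter and organize - emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce(): aggregate - sum the counts for each word (pairs must be grouped by key,
    # which sorting achieves here and the shuffle phase achieves in Hadoop).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["big data needs big storage", "hadoop stores big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)   # e.g. big 3, data 2, hadoop 1, ...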
PIG:
Pig was basically developed by Yahoo, and it works on Pig Latin, a
query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime.
Just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing
of large data sets. Its query language is called HQL (Hive Query
Language).
It is highly scalable, as it allows both interactive (near real-time) and batch processing. Also,
all the SQL data types are supported by Hive, making query processing easier.
Similar to other query processing frameworks, HIVE comes with two
components: JDBC Drivers and the HIVE Command Line.
The JDBC and ODBC drivers establish data storage permissions and
connections, whereas the HIVE command line helps with the processing of queries (a hedged connection sketch follows below).
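As a rough sketch only: assuming a running HiveServer2 instance and the third-party PyHive package (neither is part of Hadoop itself), and using placeholder connection details and a hypothetical 'sales' table, a query could be submitted from Python roughly like this:

from pyhive import hive   # third-party package; assumed to be installed

# Connection details below are placeholders for illustration.
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# 'sales' is a hypothetical Hive table; HQL looks very much like SQL.
cursor.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id")
for customer_id, total in cursor.fetchall():
    print(customer_id, total)

cursor.close()
conn.close()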
Mahout:
Mahout provides an environment for creating machine learning applications that are scalable.
Spark:
It is a platform that handles compute-intensive tasks such as batch
processing, interactive or iterative real-time processing, graph
conversions, and visualization.
It uses in-memory resources, hence being faster than the prior (MapReduce) in
terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for
structured data or batch processing; hence both are used in most
companies interchangeably.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is thus
capable of handling anything within a Hadoop database. It provides
capabilities similar to Google's BigTable, and is therefore able to work on Big Data sets
effectively.
Solr, Lucene: These are two services that perform the task of searching and indexing
with the help of Java libraries. Lucene in particular is based on Java and also provides a spell-check
mechanism. Solr is built on top of (driven by) Lucene.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie
coordinator jobs. Oozie workflow jobs need to be executed in a sequentially
ordered manner, whereas Oozie coordinator jobs are triggered when some
data or an external stimulus is given to them.
HDFS Architecture
The Hadoop Distributed File System follows the master-slave
architecture.
Each cluster comprises a single master node and multiple slave
nodes. Internally, files get divided into one or more blocks,
and each block is stored on different slave machines depending
on the replication factor (see the sketch below).
The master node stores and manages the file system namespace,
that is, information about the blocks of files such as block locations,
permissions, etc. The slave nodes store the data blocks of the files.
The Master node is the NameNode and DataNodes are the slave
nodes.
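A small arithmetic sketch of block splitting and replication, assuming the common defaults of a 128 MB block size and a replication factor of 3 (both are configurable per cluster):

import math

BLOCK_SIZE_MB = 128     # assumed default HDFS block size
REPLICATION = 3         # assumed default replication factor

def hdfs_footprint(file_size_mb):
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)      # how many blocks the file splits into
    raw_storage_mb = file_size_mb * REPLICATION           # every byte is stored REPLICATION times
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(500)
print(blocks, "blocks,", raw, "MB of raw cluster storage")   # 4 blocks, 1500 MB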
Let’s discuss each of the nodes in the Hadoop HDFS Architecture
in detail.
The Secondary NameNode and Checkpointing
When the NameNode starts, it merges the Fsimage and edit-log files
to restore the current file system namespace.
Since the NameNode runs continuously for a long time without any restart, the
size of edit logs becomes too large.
This will result in a long restart time for NameNode.
Secondary NameNode downloads the Fsimage file and edit logs file from
NameNode.
It periodically applies edit logs to Fsimage and refreshes the edit
logs. The updated Fsimage is then sent to the NameNode so that
NameNode doesn’t have to re-apply the edit log records during its
restart. This keeps the edit log size small and reduces the
NameNode restart time.
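A toy Python sketch of the checkpoint idea (purely illustrative; the real fsimage and edit-log formats are binary and far richer): the namespace is rebuilt by replaying edit-log records on top of the last saved fsimage, and checkpointing saves that merged result so the log stays small:

# Last saved snapshot of the namespace (paths and types are invented).
fsimage = {"/geeks": "dir"}

# Operations recorded in the edit log since that snapshot.
edit_log = [("create", "/geeks/a.txt"),
            ("create", "/geeks/b.txt"),
            ("delete", "/geeks/a.txt")]

def apply_edits(image, edits):
    image = dict(image)                 # work on a copy of the snapshot
    for op, path in edits:
        if op == "create":
            image[path] = "file"
        elif op == "delete":
            image.pop(path, None)
    return image

# Checkpointing = merging the edits into a new fsimage, so the NameNode does not
# have to replay a huge edit log the next time it starts.
new_fsimage = apply_edits(fsimage, edit_log)
print(new_fsimage)   # {'/geeks': 'dir', '/geeks/b.txt': 'file'}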
Working with HDFS (Commands)
Note: The bin/hdfs prefix is only needed for HDFS commands; plain shell commands
(without bin/hdfs) are used when checking things on the local filesystem.
1. moveFromLocal: This command moves a file from the local filesystem to HDFS.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
2. cp: This command is used to copy files within HDFS. Let's copy the
folder /geeks to /geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
3. mv: This command is used to move files within HDFS. Let's cut and paste the
file myfile.txt from the /geeks folder to /geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
4. rmr: This command deletes a file or directory from HDFS recursively. It is a very
useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the
directory and then the directory itself.
5. du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
6. dus: This command will give the total size of the directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
7. stat: It will give the last modified time of the directory or path. In short, it
gives the stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
8. setrep: This command is used to change the replication factor of a
file/directory in HDFS. By default it is 3 for anything stored in
HDFS (as set by the dfs.replication property in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in
HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for a
directory geeksInput stored in HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: -w means wait until the replication is completed, and -R means
recursive; we use it for directories as they may contain many files
and folders inside them.
Note: There are more commands in HDFS but we discussed the commands which
are commonly used when working with Hadoop. You can check out the list
of dfs commands using the following command:
bin/hdfs dfs
THE END