
UNIT 2

HADOOP
History of Hadoop
Early Foundations (2003–2004)
• In 2003, Google published a paper on the Google File System
(GFS), which described a scalable distributed file system to
store and manage large datasets across multiple machines.
• In 2004, Google introduced MapReduce, a programming model
designed for processing large-scale datasets efficiently.
• Hadoop was originally created by Doug Cutting and Mike Cafarella in 2005, when
they were working on the Nutch Search Engine project. Nutch is a highly extensible
and scalable web crawler, used to search and index web pages. In order to search
and index the billions of pages that exist on the internet, vast computing
resources are required. Instead of relying on expensive, centralized hardware to
deliver high availability, Cutting and Cafarella developed software (now known as
Hadoop) that was able to distribute the workload across clusters of computers.
Birth of Hadoop (2005–2006)
• Inspired by Google’s work, Doug Cutting and Mike
Cafarella were developing an open-source web search
engine called Nutch.
• They realized the need for a robust storage and
processing system like GFS and MapReduce.
• In 2005, they started implementing a distributed
computing system as part of the Apache Nutch
project.
• In 2006, Hadoop became an independent project under
the Apache Software Foundation (ASF), with Doug
Cutting naming it after his son’s toy elephant.
Growth and Adoption (2007–2011)

• Yahoo! became the primary contributor to Hadoop and built the world’s largest Hadoop cluster at the time.
• In 2008, Hadoop became a top-level project of the Apache Software Foundation.
• Hadoop's ecosystem expanded with HBase (2008), Hive (2009), Pig (2008), and Zookeeper (2010), adding more
functionalities.
• By 2011, Hadoop became widely adopted across industries for handling big data.
Modern Hadoop and Industry
Adoption (2012–Present)
• Hadoop 2.0 (2013) introduced YARN (Yet Another
Resource Negotiator), which improved cluster
resource management
• Companies like Facebook, Twitter, Amazon, and
Microsoft integrated Hadoop into their data
architectures.
• Hadoop 3.0 (2017) introduced support for erasure
coding, improved scalability, and
containerization.
• Although cloud-based solutions like Apache Spark
and Google BigQuery are gaining popularity, Hadoop
remains a key player in big data processing.
Apache Hadoop
• Apache Hadoop is an open-source software framework developed by
Douglas Cutting, then at Yahoo, that provides the highly reliable distributed
processing of large data sets using simple programming models.
• Hadoop is an open-source software framework that is used for
storing and processing large amounts of data in a distributed
computing environment. It is designed to handle big data and is
based on the MapReduce programming model, which allows for the
parallel processing of large datasets.

• Its framework is based on Java programming, with some native code in C and shell scripts.
• Example (the Map-Reduce idea): the goal is to evaluate word
distribution in one million German books, which is too large for a
single computer to handle.
• The Map-Reduce algorithm breaks the large task into
smaller, manageable subprocesses.
• Each book is processed individually to calculate word
occurrence and distribution, with results stored in
separate tables for each book.
• A master node manages and coordinates the entire
process but doesn’t perform any calculations.
• Slave nodes receive tasks from the master node, read books,
calculate word frequencies, and store the results.
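A minimal single-machine sketch of this idea in Python (the three short strings stand in for whole books; a real job would distribute the per-book work across slave nodes):

from collections import Counter
from functools import reduce

books = ["der hund und die katze", "der baum", "die katze und der hund"]  # toy stand-ins for books

# Map: process each book individually to get its own word-occurrence table
per_book_counts = [Counter(book.split()) for book in books]

# Reduce: merge the per-book tables into one global word distribution
total_counts = reduce(lambda acc, c: acc + c, per_book_counts, Counter())
print(total_counts.most_common(3))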
Hadoop Distributed File System

• Hadoop has a distributed file system known as
HDFS. HDFS splits files into blocks and
distributes them across the nodes of large
clusters.
• In case of a node failure, the system keeps
operating, and the data transfer between the
nodes is facilitated by HDFS.
HDFS Architecture
1. Nodes: Master and slave nodes typically form the HDFS cluster.

2. NameNode(MasterNode):

1. Manages all the slave nodes and assign work to them.

2. It executes filesystem namespace operations like opening, closing, renaming files and directories.

3. It should be deployed on reliable, high-end hardware, not on commodity hardware.

3. DataNode(SlaveNode):

1. Actual worker nodes, who do the actual work like reading, writing, processing etc.

2. They also perform creation, deletion, and replication upon instruction from the master.

3. They can be deployed on commodity hardware.


HDFS Architecture
• HDFS daemons: Daemons are the processes running in background.
• Namenodes:

• Run on the master node.


• Store metadata (data about data) like file paths, the number of blocks, block IDs, etc.
• Require a high amount of RAM.
• Store metadata in RAM for fast retrieval, i.e., to reduce seek time, though a persistent
copy of it is kept on disk.
• DataNodes:

• Run on slave nodes.


• Require large storage capacity, as the actual data is stored here.
Basic HDFS Commands
• List files in a directory:
hdfs dfs -ls /path/to/directory
• Create a directory:
hdfs dfs -mkdir /new_directory
• Upload a file to HDFS:
hdfs dfs -put local_file.txt /hdfs_directory/
• Download a file from HDFS:
hdfs dfs -get /hdfs_directory/file.txt /local_directory/
• Remove a file or directory:
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory
Components of Hadoop
Hadoop can be broken down into four main components. These four components
(which are in fact independent software modules) together compose a Hadoop
cluster. In other words, when people talk about ‘Hadoop,’ they actually mean that
they are using (at least) these 4 components:
1.Hadoop Common – Contains libraries and utilities need by all the other Hadoop
Modules;
2.Hadoop Distributed File System – A distributed file-system that stores data on
commodity machines, providing high aggregate bandwidth across a cluster;
3.Hadoop MapReduce – A programming model for large scale data processing;
4.Hadoop YARN – A resource-management platform responsible for managing
compute resources in clusters and using them for scheduling of users’
applications.
Components of Hadoop
Hadoop Common
• Hadoop is a Java-based solution. Hadoop Common provides the tools (in Java) for a
user’s computer so that it can read the data that is stored in a Hadoop file system. No
matter what type of operating system a user has (Windows, Linux, etc.), Hadoop
Common ensures that data can be correctly interpreted by the machine.

Hadoop Distributed File System (HDFS)


• The Hadoop Distributed File System (HDFS) is a filesystem designed for storing very
large files with streaming data access patterns, running on clusters of commodity
hardware.[1] This means that Hadoop stores files that are typically many terabytes up to
petabytes of data. The streaming nature of HDFS means that HDFS stores data under the
assumption that it will need to be read multiple times and that the speed with which the
data can be read is most important. Lastly, HDFS is designed to run on commodity
hardware, which is inexpensive hardware that can be sourced from different vendors.
Components of Hadoop
• Hadoop MapReduce
MapReduce is a processing technique and program model that enables
distributed processing of large quantities of data, in parallel, on large
clusters of commodity hardware. Similar in the way that HDFS stores blocks
in a distributed manner, MapReduce processes data in a distributed manner.
In other words, MapReduce uses processing power in local nodes within the
cluster, instead of centralized processing.
• Hadoop YARN
Hadoop YARN (Yet Another Resource Negotiator) is responsible for
allocating system resources to the various applications running in a Hadoop
cluster and scheduling tasks to be executed on different cluster nodes. It
was developed because in very large clusters (with more than 4000 nodes),
the MapReduce system begins to hit scalability bottlenecks.
Data Format in Hadoop
• In Hadoop, data can be stored and processed in various formats depending on the
requirements for efficiency, compatibility, and performance. Here are the common
data formats used in Hadoop:
• Hadoop data formats are broadly classified into:
1.Text-Based Formats
2.Binary Row-Based Formats
3.Columnar Formats
4.Key-Value Formats
1. Text-Based Formats
Text-based formats store data in a human-readable form but are less efficient in terms of
storage and performance.
a) CSV (Comma-Separated Values)
• Stores tabular data with values separated by commas.
• Simple and widely used but lacks data types and schema enforcement.
• Consumes more space compared to binary formats.
b) JSON (JavaScript Object Notation)
• Stores structured or semi-structured data in key-value pairs.
• Supports nested data, making it useful for NoSQL and unstructured data.
• More readable but larger in size compared to binary formats.
c) XML (Extensible Markup Language)
• Uses tags to define data, making it highly structured.
• Used in older applications but is verbose and slow to process.
2. Binary Row-Based Formats
Binary formats store data in compressed binary form, making them faster and
more space-efficient than text-based formats.
a) Avro
• A row-based format developed by Apache Avro.
• Stores schema separately in JSON format while data is stored in binary.
• Supports schema evolution, making it useful for streaming applications like
Kafka.
b) SequenceFile
• A Hadoop-native format that stores key-value pairs in binary form.
• Optimized for Hadoop MapReduce and supports compression.
• Efficient for storing intermediate data in Hadoop workflows.
3. Columnar Formats
Columnar formats store data column-wise instead of row-wise,
improving query performance for analytical workloads.
a) Parquet
• A highly compressed columnar storage format.
• Optimized for big data analytics in tools like Apache Spark, Hive,
and Presto.
• Supports schema evolution and efficient queries on large datasets.
b) ORC (Optimized Row Columnar)
• Designed specifically for Apache Hive.
• Provides higher compression and better performance than Parquet in
some cases.
• Used in data warehousing and OLAP (Online Analytical Processing).
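To make the columnar idea concrete, here is a minimal Python sketch of writing and reading a Parquet file, assuming the pyarrow library is installed (the file and column names are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to Parquet, then read back only one column.
# Because the layout is columnar, the read can skip the other column entirely.
table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
pq.write_table(table, "scores.parquet")
scores_only = pq.read_table("scores.parquet", columns=["score"])
print(scores_only.to_pydict())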
4. Key-Value Formats
These formats store data in key-value pairs, making them
ideal for NoSQL databases and real-time applications.
a) HBase Table Format
• Used in HBase, a NoSQL database built on Hadoop.
• Supports random reads and writes, making it suitable for
real-time processing.
b) SequenceFile
• Used for storing key-value data in Hadoop’s MapReduce
framework.
• Supports compression and splitting, making it scalable for
large datasets.
Analyzing Data With Hadoop
Analyzing data with Hadoop involves using various components and tools within the Hadoop ecosystem to
process, transform, and gain insights from large datasets. Here are the steps and considerations for
analyzing data with Hadoop:
1. Data Ingestion:

• Start by ingesting data into the Hadoop cluster. You can use tools like Apache Flume, Apache Kafka, or
Hadoop’s HDFS for batch data ingestion.
• Ensure that your data is stored in a structured format in HDFS or another suitable storage system.
2. Data Preparation:

• Preprocess and clean the data as needed. This may involve tasks such as data deduplication, data
normalization, and handling missing values.
• Transform the data into a format suitable for analysis, which could include data enrichment and feature
engineering.
3. Choose a Processing Framework:

Select the appropriate data processing framework based on your requirements. Common
choices include:
• MapReduce: Ideal for batch processing and simple transformations.
• Apache Spark: Suitable for batch, real-time, and iterative data processing. It offers a
wide range of libraries for machine learning, graph processing, and more.
• Apache Hive: If you prefer SQL-like querying, you can use Hive for data analysis.
• Apache Pig: A high-level data flow language for ETL and data analysis tasks.
• Custom Code: You can write custom Java, Scala, or Python code using Hadoop APIs if
necessary.
4. Data Analysis:

• Write the code or queries needed to perform the desired analysis. Depending on your choice
of framework, this may involve writing MapReduce jobs, Spark applications, HiveQL
queries, or Pig scripts.
• Implement data aggregation, filtering, grouping, and any other required transformations.
5. Scaling:

• Hadoop is designed for horizontal scalability. As your data and processing needs grow, you
can add more nodes to your cluster to handle larger workloads.
6. Optimization:

• Optimize your code and queries for performance. Tune the configuration parameters of your
Hadoop cluster, such as memory settings and resource allocation.
• Consider using data partitioning and bucketing techniques to improve query performance.
7. Data Visualization:

• Once you have obtained results from your analysis, you can use data visualization tools like
Apache Zeppelin, Apache Superset, or external tools like Tableau and Power BI to create
meaningful visualizations and reports.
8. Iteration:

• Data analysis is often an iterative process. You may need to refine your analysis based on
initial findings or additional questions that arise.
9. Data Security and Governance:

• Ensure that data access and processing adhere to security and governance policies. Use tools
like Apache Ranger or Apache Sentry for access control and auditing.
10. Results Interpretation:

• Interpret the results of your analysis and draw meaningful insights from the data.
• Document and share your findings with relevant stakeholders.
11. Automation:

• Consider automating your data analysis pipeline to ensure that new data is continuously
ingested, processed, and analyzed as it arrives.
12. Performance Monitoring:

• Implement monitoring and logging to keep track of the health and performance of your
Hadoop cluster and data analysis jobs.
Scaling
• Scaling is the process of expanding the existing configuration (servers/computers) to
handle a larger number of user requests or to manage the amount of load on the server.
This ability is called scalability. It can be achieved either by increasing the current system
configuration (increasing RAM, number of servers) or by adding more power to the
configuration. Scalability plays a vital role in the design of a system as it helps in
responding to a large number of user requests more effectively and quickly.

• There are two ways to do this :


1.Vertical Scaling

2.Horizontal Scaling
Vertical Scaling
• It is defined as the process of increasing the capacity of a single machine by adding more resources
such as memory, storage, etc. to increase the throughput of the system. No new machine is added;
rather, the capability of the existing machine is increased. This is called Vertical
scaling. Vertical Scaling is also called the Scale-up approach.
• Example: MySQL
• Advantages of Vertical Scaling
It is easy to implement
Reduced software costs as no new resources are added
Fewer efforts are required to maintain this single system

Disadvantages of Vertical Scaling

Single point of failure
When the system (server) fails, the downtime is high because there is only a single server
High risk of hardware failures
Example
When traffic increases, the server
degrades in performance. The first
possible solution is to increase the
power of the system. For instance, if the
server earlier used 8 GB of RAM and a
128 GB hard drive and the increasing
traffic is now affecting its performance,
a possible solution is to increase the
existing RAM or hard-drive storage,
e.g. to 16 GB of RAM and a 500 GB
hard drive. This is not an ultimate
solution, however, as after a point of
time these capacities will reach a
saturation point.
Horizontal Scaling

• It is defined as the process of adding more instances of the same type to the existing pool
of resources and not increasing the capacity of existing resources like in vertical scaling.
This kind of scaling also helps in decreasing the load on the server. This is called
Horizontal Scaling. Horizontal Scaling is also called the Scale-out approach.
• In this process, the number of servers is increased and not the individual capacity of the
server. This is done with the help of a Load Balancer which basically routes the user
requests to different servers according to the availability of the server. Thereby, increasing
the overall performance of the system. In this way, the entire process is distributed among
all servers rather than just depending on a single server.
Example: NoSQL databases such as Cassandra and MongoDB
Advantages of Horizontal Scaling

1. Fault Tolerance means that there is no single point of failure in this kind of scaling, because there are
multiple servers instead of one powerful server. So if any one of the servers fails, there will be other
servers for backup. Whereas, in Vertical Scaling there is a single point of failure, i.e., if the server fails
then the whole service is stopped.

2. Low Latency: Latency refers to how late or delayed our request is being processed.

3. Built-in backup

Disadvantages of Horizontal Scaling

1. Not easy to implement as there are a number of components in this kind of scale

2. Cost is high

3. Networking components such as routers and load balancers are required


Example
If there exists a system with a capacity of
8 GB of RAM and in future there is a
requirement of 16 GB of RAM then,
rather than increasing the capacity of the
8 GB RAM to 16 GB of RAM, similar
instances of 8 GB RAM could be used to
meet the requirements.
Hadoop Streaming
• Hadoop Streaming uses UNIX standard streams as the interface
between Hadoop and your program, so you can write MapReduce
programs in any language that can write to standard output and read
from standard input. Hadoop offers a lot of methods to help non-Java
development.
• The primary mechanisms are Hadoop Pipes which gives a native C++
interface to Hadoop and Hadoop Streaming which permits any
program that uses standard input and output to be used for map tasks
and reduce tasks.
• With this utility, one can create and run MapReduce jobs with any
executable or script as the mapper and/or the reducer.
Features of Hadoop Streaming

Some of the key features associated with Hadoop Streaming are as follows :
1. Hadoop Streaming is a part of the Hadoop Distribution System.
2. It facilitates ease of writing Map Reduce programs and codes.
3. Hadoop Streaming supports almost all types of programming languages such as
Python, C++, Ruby, Perl etc.
4. The entire Hadoop Streaming framework runs on Java. However, the codes might be
written in different languages as mentioned in the above point.
5. The Hadoop Streaming process uses Unix Streams that act as an interface between
Hadoop and Map Reduce programs.
6. Hadoop Streaming uses various Streaming Command Options and the two
mandatory ones are – -input directoryname or filename and -output directoryname
There are eight key parts in a Hadoop Streaming
architecture. They are:
• Input Reader/Format
• Key Value
• Mapper Stream
• Key-Value Pairs
• Reduce Stream
• Output Format
• Map External
• Reduce External
The six steps involved in the working of Hadoop Streaming are:

• Step 1: The input data is divided into chunks or blocks, typically 64MB to 128MB in size
automatically. Each chunk of data is processed by a separate mapper.
• Step 2: The mapper reads the input data from standard input (stdin) and generates an intermediate
key-value pair based on the logic of the mapper function which is written to standard output (stdout).
• Step 3: The intermediate key-value pairs are sorted and partitioned based on their keys ensuring
that all values with the same key are directed to the same reducer.
• Step 4: The key-value pairs are passed to the reducers for further processing where each reducer
receives a set of key-value pairs with the same key.
• Step 5: The reducer function, implemented by the developer, performs the required computations or
aggregations on the data and generates the final output which is written to the standard output
(stdout).
• Step 6: The final output generated by the reducers is stored in the specified output location in the
HDFS.
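As an illustration of steps 2 and 5, here is a minimal word-count mapper and reducer for Hadoop Streaming, written in Python (a sketch; the file names mapper.py and reducer.py are just conventions):

#!/usr/bin/env python3
# mapper.py - reads raw text from stdin, emits tab-separated (word, 1) pairs on stdout
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so the counts for one word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts are passed to the streaming jar with the -mapper and -reducer options, together with the mandatory -input and -output options mentioned above.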
Hadoop Pipes
• Hadoop Pipes is a C++ interface that allows users to write MapReduce applications using Hadoop.
It uses sockets to enable communication between tasktrackers and processes running the C++ map
or reduce functions.
Working of Hadoop Pipes:
• C++ code is split into a separate process that performs the application-specific code
• Writable serialization is used to convert types into bytes that are sent to the process via a socket
• The application programs link against a thin C++ wrapper library that handles communication with
the rest of the Hadoop system
Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems.
• The Hadoop Ecosystem is a suite of tools and technologies built around Hadoop to process and manage large-scale data.
Key components include:
1. HDFS (Hadoop Distributed File System) – Distributed storage system.
2. MapReduce – Programming model for parallel data processing.
3. YARN (Yet Another Resource Negotiator) – Resource management and job scheduling.
Supporting Tools:
• Hive – SQL-like querying.
• Pig – Data flow scripting.
• HBase – NoSQL database.
• Spark – Fast in-memory data processing.
• Sqoop – Data transfer between Hadoop and RDBMS.
• Flume – Data ingestion from logs.
• Oozie – Workflow scheduling.
• Zookeeper – Coordination service.
Classical Data Processing
(Figure: a single-node machine with CPU, memory, and disk)

1. Data fits into memory
• Load data from disk into memory and
then process it from memory
2. Data does not fit into memory
• Load part of the data from disk into
memory
• Process the data
Motivation: Simple Example
• 10 Billion web pages

• Average size of webpage: 20KB

• Total 200 TB

• Disk read bandwidth = 50MB/sec

• Time to read = 4 million seconds ≈ 46+ days

• Longer, if you want to do useful analytics with the data
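A quick back-of-the-envelope check of these numbers in Python:

# 10 billion pages * 20 KB each, read at 50 MB/sec
total_bytes = 10_000_000_000 * 20_000
bandwidth_bytes_per_sec = 50_000_000
seconds = total_bytes / bandwidth_bytes_per_sec
print(seconds, seconds / 86400)   # 4,000,000 seconds, about 46 days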


What is the Solution?
• Use multiple interconnected machines as follows:

1. Split the big data into small chunks

2. Send different chunks to different machines and process them

3. Collect the results from the different machines

This is known as Distributed Data Processing on a Cluster of Computers


How to Organize a Cluster of Computers?
Cluster Architecture: Rack Servers
(Figure: machines within a rack are connected by a rack switch, with about 1 gbps between any pair of
nodes; racks are connected by a backbone switch, typically 2-10 gbps)

• Each rack contains 16-64 commodity (low cost) computers (also called nodes)

• In 2011, Google had roughly 1 million nodes

Google Data Centre
Challenges in Cluster
Computing
Challenge # 1
• Node failures
• Single server lifetime: 1000 days
• 1000 servers in cluster => 1 failure/day
• 1M servers in clusters => 1000 failures/day

• Consequences of node failure


• Data loss
• Node failure in the middle of long and expensive
computation
• Need to restart the computation from scratch
Challenge # 2
• Network bottleneck
• Computers in a cluster exchange data through the
network

• Example
• Network bandwidth = 1 gbps
• Moving 10TB of data takes 1 day
Challenge # 3
• Distributed Programming is hard!

• Why?
1.Data distribution across machines is non-trivial
• (It is desirable that machines have roughly the same load)
2.Avoiding race conditions
• Given two tasks T1 and T2,
• Correctness of the result depends on the order of execution of
the tasks
• For example, T1 must run before T2, NOT T2 before T1
What is the Solution?
• Map-Reduce

• It is a simple programming model for processing really big data


using cluster of computers
Map Reduce
• MapReduce is a programming model in Hadoop used for processing large datasets in a distributed
and parallel manner.
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
• Map stage − The map or mapper’s job is to process the input data. Generally the input data is
in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file
is passed to the mapper function line by line. The mapper processes the data and creates
several small chunks of data.
• Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it produces
a new set of output, which will be stored in the HDFS.
Map Reduce
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
How Map-Reduce addresses the
challenges?
1. Data loss prevention
• By keeping multiple copies of data in different machines

2. Data movement minimization


• By moving computation to the data
• (send your computer program to machines containing data)

3. Simple programing model


• Mainly using two functions
1.Map
2.Reduce

Programmer’s responsibility:
Write only two functions, Map and Reduce suitable for your problem

You DO NOT need to worry about other things


Redundant Storage
Infrastructure
• Distributed File System
• Global file namespaces, redundancy

• Multiple copies of data and in different nodes

• Example: Google File System (GFS), HDFS (the file system of Hadoop, an
open-source map-reduce system)

• Typical usage pattern


• Data is rarely updated in place

• Reads and appends are common


Distributed File System:
Inside Look
• Data is kept in chunks, spread across machines
• Each chunk is replicated on different machines
• Ensures persistence
• Example:
• We have two files, A and B; A is split into chunks a1, a2, a3 and B into chunks b1, b2
• 3 computers
• 2 times replication of data
(Figure: the chunks a1, a2, a3, b1, b2 spread across the chunk servers, with each chunk stored on two different machines)

• Chunk servers also serve as compute nodes
(bring the computation to the data)
Distributed File System:
Summary
• Chunk servers
• File is split into contiguous chunks (16-64 MB)
• Each chunk is replicated (usually 2 times or 3 times)
• Try to keep replicas in different racks

• Master node
• Stores metadata about where the files are stored
• It might also be replicated
Map-Reduce
Programming Model
Example Problem: Counting
Words
• We have a huge text document and want to count the number of times each
distinct word appears in the file

• Sample application
• Analyze web server logs to find popular URLs

• How would you solve this using a single machine?


Word Count
• Case 1: File too large for memory, but all <word, count> pairs fit in
memory
• You can create a big string array OR you can create a hash
table

• Case 2: All <word, count> pairs do not fit in memory, but fit into
disk
• A possible approach (write a computer program/function for each
step; the pipeline is getWords(textFile) → sort → count):

1.Break the text document into a sequence of words

2.Sort the words
• This will bring the same words together

3.Count the frequencies in a single pass

• Case 2 captures the essence of Map-Reduce
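A minimal single-machine Python sketch of this sort-based approach (the file name big_document.txt is illustrative):

import re
from itertools import groupby

def get_words(text_file):
    # Step 1: break the document into a sequence of words
    with open(text_file) as f:
        for line in f:
            for word in re.findall(r"[a-z']+", line.lower()):
                yield word

words = sorted(get_words("big_document.txt"))   # Step 2: sort brings equal words together
counts = {w: sum(1 for _ in grp)                # Step 3: count in a single pass
          for w, grp in groupby(words)}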


Map-Reduce: In a Nutshell
• getWords(dataFile) → sort → count corresponds to Map → group by key → Reduce
• Map: extract something you care about (here, word and count)
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, etc.; save the results

Summary
1. The outline stays the same
2. Map and Reduce are defined to fit the problem
MapReduce: The Map Step
(Figure: each input key-value pair, e.g. a file name and its content, is passed to a map call,
which emits intermediate key-value pairs, e.g. words and counts)
MapReduce: The Reduce Step
(Figure: the intermediate key-value pairs are grouped by key, e.g. k1 → [v1, v3, v6], and each
key-value group is passed to a reduce call, which emits the output key-value pairs, e.g. (k1, v′))
Map-reduce: Word Count
Input: a big document, e.g. “The crew of the space shuttle Endeavor recently returned to Earth as
ambassadors, harbingers of a new era of space exploration. Crew members at ……………………..”

• MAP (provided by the programmer): read the input and produce a set of key-value pairs, e.g.
(the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (endeavor, 1), (recently, 1),
(returned, 1), (to, 1), (earth, 1), (as, 1), (ambassadors, 1), …
• Group by key: collect all pairs with the same key, e.g.
(crew, 1) (crew, 1), (space, 1), (the, 1) (the, 1) (the, 1), (shuttle, 1), (recently, 1), …
• Reduce (provided by the programmer): collect all values belonging to the same key and output, e.g.
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …
Word Count Using MapReduce: Pseudocode
map(key, value):
// key: document name; value: text of the document
for each word w in value
emit(w, 1)

reduce(key, values):
// key: a word; values: the set of counts emitted for that word
result = 0
for each count v in values:
result += v
emit(key, result)
Map-reduce System: Under the Hood
(Figure: map tasks run on the input chunks, the shuffle moves data across machines, and reduce tasks produce the output)
• All phases are distributed, with many tasks doing the work in parallel
Map-Reduce Algorithm Design
• Programmer’s responsibility is to design two functions:
1. Map
2. Reduce

• A very important issue


• Often network is the bottleneck
• Your design should minimize data communications across machines
Problems Suitable for Map-reduce
• Map-reduce is suitable for batch processing
• Updates are made after whole batch of data is processed

• The mappers do not need data from one another while they
are running

• Example
1. Word count

• In general, map-reduce is suitable:


if the problem can be broken into independent sub-problems
Problems NOT Suitable for
Map-reduce
• In general, when the machines need to
exchange data too often during
computation

• Examples
1.Applications that require very quick response
time
• In IR, indexing is okay, but query processing is not suitable
for map-reduce

2.Machine learning algorithms that require


frequent parameter update
• Stochastic gradient descent
Exercises
Warm-up Exercise
• Matrix Addition

• Can it be done in map-reduce?


• YES

• What is the map function (key and value)?


• Key = row number; value = elements of row (as an array)

• What is the reduce function?


• For each key, reducer will have two arrays
• Reduce function simply adds numbers, position-wise
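A minimal Python sketch of this matrix-addition exercise, with the shuffle simulated by a dictionary (toy 2x2 matrices for illustration):

from collections import defaultdict

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Map: key = row number, value = that row of one matrix
mapped = [(i, row) for i, row in enumerate(A)] + [(i, row) for i, row in enumerate(B)]

# Shuffle: group the rows by row number
groups = defaultdict(list)
for key, row in mapped:
    groups[key].append(row)

# Reduce: for each key there are two rows; add them position-wise
result = {key: [sum(col) for col in zip(*rows)] for key, rows in groups.items()}
print(result)   # {0: [6, 8], 1: [10, 12]}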
Advanced Exercise: Join By Map-Reduce
• Compute the natural join T1(A,B) ⋈ T2(B,C)
(combine rows from T1 and T2 such that the rows have a common value in column B)

T1(A, B):        T2(B, C):        T1 ⋈ T2 (A, B, C):
a1  b1           b2  c1           a3  b2  c1
a2  b1           b2  c2           a3  b2  c2
a3  b2           b3  c3           a4  b3  c3
a4  b3
Map-Reduce Join
• Map process
• Each row (a,b) from T1 into key-value pair
(b,(a,T1))
• Each row (b,c) from T2 into (b,(c,T2))

• Reduce process
• Each reduce process matches all the pairs
(b,(a,T1)) with all (b,(c,T2)) and outputs
(a,b,c)
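A minimal Python sketch of this reduce-side join, using the tables above and a dictionary to play the role of the shuffle:

from collections import defaultdict

T1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]   # rows (A, B)
T2 = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]                 # rows (B, C)

# Map: emit the join column B as the key, tagging each value with its source table
mapped = [(b, ("T1", a)) for a, b in T1] + [(b, ("T2", c)) for b, c in T2]

# Shuffle: group the tagged values by key b
groups = defaultdict(list)
for b, tagged in mapped:
    groups[b].append(tagged)

# Reduce: pair every T1 value with every T2 value that shares the same b
joined = [(a, b, c)
          for b, tagged in groups.items()
          for tag1, a in tagged if tag1 == "T1"
          for tag2, c in tagged if tag2 == "T2"]
print(sorted(joined))   # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]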
Advanced Exercise
• You have a dataset with thousands of features. Find the most correlated
features in that data.
Take Home Exercises

• Design Map and Reduce functions for the following:

1.PageRank

2.HITS
Developing Map Reduce
Application
• Map Program for Word Count in Python
sentences = [
"Artificial Intelligence is fascinating",
"Deep learning improves medical imaging",
"Generative AI is transforming healthcare"
]

# Using map to count words in each sentence
word_counts = list(map(lambda sentence: len(sentence.split()), sentences))

# Print results
for i, count in enumerate(word_counts):
    print(f"Sentence {i+1}: {count} words")

Reduce Program for Word Count in Python
from functools import reduce
sentences = [
"Artificial Intelligence is fascinating",
"Deep learning improves medical imaging",
"Generative AI is transforming healthcare"
]
# Using reduce to count total words
total_word_count = reduce(lambda total, sentence: total + len(sentence.split()), sentences, 0)
# Print result
print(f"Total Word Count: {total_word_count}")
Unit Test MapReduce using
MRUnit
• In order to make sure that your code is correct, you need to Unit test
your code first.
• And like you unit test your Java code using JUnit testing framework,
the same can be done using MRUnit to test MapReduce Jobs.

• MRUnit is built on top of JUnit framework. So we will use the JUnit


classes to implement unit test code for MapReduce.
• If you are familiar with JUnits then you will find unit testing for
MapReduce jobs also follows the same pattern.
Unit Test MapReduce using
MRUnit
• To Unit test MapReduce jobs:
• Create a new test class to the existing project
• Add the mrunit jar file to build path
• Declare the drivers
• Write a method for initializations & environment setup
• Write a method to test mapper
• Write a method to test reducer
• Write a method to test the whole MapReduce job
• Run the test
Unit Testing : Streaming by
Python
• We can test it on a Hadoop cluster, or test it on our local machine using
Unix streams.

• cat test_input | python3 mapper.py | sort | python3 reducer.py


Handling failures in hadoop,
mapreduce
• In the real world, user code is buggy, processes crash, and machines fail.
• One of the major benefits of using Hadoop is its ability to handle such failures
and allow your job to complete successfully.

• In MapReduce there are three failure modes to consider:


• Failure of the running task
• Failure of the tasktracker
• Failure of the jobtracker
Handling failures in hadoop,
mapreduce
• Task Failure
• The most common occurrence of this failure is when user code in the
map or reduce task throws a runtime exception.
• If this happens, the child JVM reports the error back to its parent
application master before it exits.
• The error ultimately makes it into the user logs.
• The application master marks the task attempt as failed, and frees up
the container so its resources are available for another task.
Handling failures
in hadoop,
mapreduce
• Tasktracker Failure
• If tasktracker fails by crashing or running
very slowly, it will stop sending
heartbeats to the jobtracker.
• The jobtracker will notice a tasktracker
that has stopped sending heartbeats if it
has not received one for 10 minutes
(the interval can be changed) and remove it
from its pool of tasktrackers.
• A tasktracker can also be blacklisted
by the jobtracker.
Handling failures in
hadoop, mapreduce

• Jobtracker Failure
• It is most serious failure mode.
• Hadoop has no mechanism for dealing
with jobtracker failure
• This situation is improved with YARN.
How Hadoop runs a MapReduce job using YARN
Job Scheduling in Hadoop
• In Hadoop 1, Hadoop MapReduce framework is responsible for
scheduling tasks, monitoring them, and re-executes the failed task.
• But in Hadoop 2, YARN (Yet Another Resource Negotiator) was
introduced.
• The basic idea behind the YARN introduction is to split the
functionalities of resource management and job scheduling or
monitoring into separate daemons: the ResourceManager,
ApplicationMaster, and NodeManager.
Job Scheduling in Hadoop
• The ResourceManager has two main components that are Schedulers
and ApplicationsManager.
• The Scheduler in the YARN ResourceManager is a pure scheduler, which is
responsible for allocating resources to the various running
applications.
• The FIFO Scheduler, CapacityScheduler, and FairScheduler are
pluggable policies that are responsible for allocating resources to the
applications.
FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop.
• FIFO Scheduler gives more preferences to the application coming first than those coming
later.
• It places the applications in a queue and executes them in the order of their submission
(first in, first out).
• Advantage:
• It is simple to understand and doesn’t need any configuration.
• Jobs are executed in the order of their submission.
• Disadvantage:
• It is not suitable for shared clusters. If a large application comes before a shorter
one, then the large application will use all the resources in the cluster, and the shorter application
has to wait for its turn. This leads to starvation.
Capacity Scheduler

Advantages:
• It maximizes the utilization of resources and
throughput in the Hadoop cluster.
• Provides elasticity for groups or
organizations in a cost-effective
manner.
• It also gives capacity guarantees
and safeguards to the
organizations utilizing the cluster.
Disadvantage:
• It is the most complex among these schedulers.
Capacity Scheduler
• It is designed to run Hadoop applications in a shared, multi-tenant
cluster while maximizing the throughput and the utilization of the
cluster.
• It supports hierarchical queues to reflect the structure of the
organizations or groups that utilize the cluster resources.
Fair Scheduler

• It is similar to the Capacity Scheduler.
• It assigns resources to applications in such a way
that all applications get, on average, an equal
amount of resources over time.
• When a single application is running, that
app uses the entire cluster resources.
• When other applications are submitted, the freed-up
resources are assigned to the new apps so that
every app eventually gets roughly the
same amount of resources.
• FairScheduler enables short apps to finish in a
reasonable time without starving the long-lived
apps.
Shuffle Sort
• Shuffling is the process of transferring intermediate
key-value pairs from the Mapper to the Reducer.
• It ensures that all values belonging to the same key
are grouped together before reaching the Reducer.
• This process is automatic in Hadoop and managed by the
framework.
• Sorting happens after shuffling, where intermediate
key-value pairs are sorted by key in ascending order.
• Sorting ensures that the reducer processes data in a
structured manner.
• Hadoop uses merge sort for this operation to handle
large-scale data efficiently.
Steps in Shuffle and Sort
Phase
1.Mapper Output Collection: The Mapper emits
intermediate key-value pairs.
2.Partitioning: The data is divided into
partitions based on the key and sent to
different reducers.
3.Shuffling: The intermediate key-value pairs
are transferred across nodes from Mappers to
Reducers.
4.Sorting: The keys are sorted before being sent
to the Reducer for final processing.
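A small Python sketch of these four steps on toy word-count pairs (the hash-based partitioning mirrors the common default of hashing the key modulo the number of reducers; purely illustrative):

from itertools import groupby

# 1. Mapper output collection: intermediate key-value pairs from the mappers
mapped = [("the", 1), ("crew", 1), ("the", 1), ("space", 1), ("crew", 1)]

# 2. Partitioning: assign each key to one of the reducers
num_reducers = 2
partitions = {r: [] for r in range(num_reducers)}
for key, value in mapped:
    partitions[hash(key) % num_reducers].append((key, value))

# 3-4. Shuffling and sorting: within each partition, sort by key and group,
#      then the "reducer" sums the values for each key
for r, pairs in partitions.items():
    pairs.sort(key=lambda kv: kv[0])
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        print(r, key, sum(v for _, v in group))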
Shuffle and Sort
Task Execution in Hadoop
MapReduce
• Hadoop MapReduce follows a distributed processing model
where tasks are executed in multiple phases. The key phases
include Map, Shuffle & Sort, and Reduce, with task execution
managed by the YARN (Yet Another Resource Negotiator)
framework.
1. Job Execution Overview
A MapReduce job consists of multiple tasks, which are divided
into:
• Map Tasks (Mappers) – Process input data and generate
intermediate key-value pairs.
• Reduce Tasks (Reducers) – Process the sorted key-value pairs
and generate the final output.
• Each task runs independently and is managed by Hadoop's
JobTracker (Hadoop 1.x) or ResourceManager (Hadoop 2.x with
YARN).
2. Phases of Task Execution

(a) Input Splitting


• The input dataset is divided into chunks called input splits.
• Each split is processed by an individual Mapper.
(b) Mapping Phase
• Each Mapper takes an input split and processes it in parallel.
• The output of this phase is a set of intermediate key-value pairs.
(c) Shuffle and Sort Phase
• Intermediate key-value pairs are shuffled across nodes to group similar
keys together.
• The data is sorted by key before being sent to the Reducer.
(d) Reduce Phase
• The Reducer processes grouped key-value pairs and performs aggregation,
filtering, or transformation.
• The final output is written to HDFS (Hadoop Distributed File System).
3. Task Execution Components
• JobTracker (Hadoop 1.x) or ResourceManager (Hadoop
2.x - YARN) manages job scheduling.
• TaskTracker (Hadoop 1.x) or NodeManager (Hadoop
2.x - YARN) executes individual tasks.
• Speculative Execution helps handle slow tasks by
running duplicates on different nodes.
4. Fault Tolerance in Task Execution
• If a task fails, Hadoop automatically retries the
execution on another node.
• Checkpointing and replication in HDFS ensure no
data loss.
Map Reduce types and format
• Hadoop MapReduce processes large datasets in a
distributed manner. It supports different types
and data formats to handle diverse data
processing needs.
1. MapReduce Data Types

Hadoop uses the Writable interface for data serialization and efficient processing. The key
MapReduce data
types include:
(a) Key-Value Data Types
•Text → Stores string values (Text is preferred over String for efficiency).
•IntWritable → Stores integer values.
•LongWritable → Stores long integer values.
•FloatWritable → Stores floating-point values.
•DoubleWritable → Stores double-precision values.
•BooleanWritable → Stores boolean values (true/false).
•ArrayWritable → Stores an array of writable objects.
•MapWritable → Stores key-value pairs where keys and values are writable.
2. Input Formats in Hadoop
• How the input files are split up and read in Hadoop is defined by the
InputFormat.
• An Hadoop InputFormat is the first component in Map-Reduce, it is
responsible for creating the input splits and dividing them into records.
• Initially, the data for a MapReduce task is stored in input files, and input files
typically reside in HDFS.
• Using InputFormat we define how these input files are split and read.
Input Formats
• The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework which provides the following functionality:

• The files or other objects that should be used for input are selected by the InputFormat.
• InputFormat defines the Data splits, which defines both the size of individual Map tasks and its
potential execution server.
• InputFormat defines the RecordReader, which is responsible for reading actual records from
the input files.
Types of InputFormat in
MapReduce
• FileInputFormat in Hadoop
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
FileInputFormat in Hadoop
• It is the base class for all file-based InputFormats.
• Hadoop FileInputFormat specifies input directory where data files are
located.
• When we start a Hadoop job, FileInputFormat is provided with a path
containing files to read.
• FileInputFormat will read all files and divides these files into one or
more InputSplits.
TextInputFormat
• It is the default InputFormat of MapReduce.
• TextInputFormat treats each line of each input file as a separate
record and performs no parsing.
• This is useful for unformatted data or line-based records like log files.
• Key – It is the byte offset of the beginning of the line within the file
(not whole file just one split), so it will be unique if combined with the
file name.
• Value – It is the contents of the line, excluding line terminators.
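For example (illustrative), if a split contains the two lines "hello world" and "bye", the mapper receives the pairs (0, "hello world") and (12, "bye"): the second key is 12 because the first line plus its newline terminator occupies bytes 0-11.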
KeyValueTextInputFormat
• It is similar to TextInputFormat as it also treats each line of input as a
separate record.
• While TextInputFormat treats the entire line as the value, the
KeyValueTextInputFormat breaks the line itself into key and value by a
tab character ('\t').
• Here Key is everything up to the tab character while the value is the
remaining part of the line after tab character.
SequenceFileInputFormat
• Hadoop SequenceFileInputFormat is an InputFormat which reads
sequence files.
• Sequence files are binary files that store sequences of binary key-
value pairs.
• Sequence files can be block-compressed and provide direct serialization and
deserialization of several arbitrary data types (not just text).
• Here Key & Value both are user-defined.
SequenceFileAsTextInputFormat
• Hadoop SequenceFileAsTextInputFormat is another form of
SequenceFileInputFormat which converts the sequence file key values
to Text objects.
• By calling toString(), the conversion is performed on the keys and values.
• This InputFormat makes sequence files suitable input for streaming.
NLineInputFormat
• Hadoop NLineInputFormat is another form of TextInputFormat where
the keys are byte offset of the line and values are contents of the line.
• If we want our mapper to receive a fixed number of lines of input,
then we use NLineInputFormat.
• N is the number of lines of input that each mapper receives. By
default (N=1), each mapper receives exactly one line of input. If N=2,
then each split contains two lines. One mapper will receive the first
two Key-Value pairs and another mapper will receive the second two
key-value pairs.
DBInputFormat
• Hadoop DBInputFormat is an InputFormat that reads data from a
relational database, using JDBC.
• As it doesn’t have partitioning capabilities, we need to be careful not to
swamp the database we are reading from by running too many mappers.
• So it is best for loading relatively small datasets, perhaps for joining
with large datasets from HDFS using MultipleInputs.
Hadoop Output Format
• The Hadoop Output Format checks the Output-Specification of the job.
• It determines how RecordWriter implementation is used to write output to output files.

• Hadoop RecordWriter
• As we know, Reducer takes as input a set of an intermediate key-value pair produced by
the mapper and runs a reducer function on them to generate output that is again zero or
more key-value pairs.
• RecordWriter writes these output key-value pairs from the Reducer phase to output files.
• As we saw above, Hadoop RecordWriter takes output data from Reducer and
writes this data to output files.
• The way these output key-value pairs are written in output files by RecordWriter is
determined by the Output Format.
Types of Hadoop Output Formats
• TextOutputFormat
• SequenceFileOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• DBOutputFormat
TextOutputFormat
• MapReduce default Hadoop reducer Output Format is
TextOutputFormat, which writes (key, value) pairs on individual lines
of text files and its keys and values can be of any type

• Each key-value pair is separated by a tab character, which can be
changed using the mapreduce.output.textoutputformat.separator
property.
SequenceFileOutputFormat
• It is an Output Format which writes sequence files for its output, and
it is an intermediate format used between MapReduce jobs, which rapidly
serializes arbitrary data types to the file;
• SequenceFileInputFormat will deserialize the file into the same types
and presents the data to the next mapper in the same manner as it
was emitted by the previous reducer.
MapFileOutputFormat
• It is another form of FileOutputFormat in Hadoop Output Format,
which is used to write output as map files.
• The keys in a MapFile must be added in order, so we need to ensure
that the reducer emits keys in sorted order.
MultipleOutputs
• It allows writing data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string.

DBOutputFormat
• DBOutputFormat in Hadoop is an Output Format for writing to
relational databases and HBase.
• It sends the reduce output to a SQL table.
• It accepts key-value pairs.
Features of MapReduce
• Scalability
• Flexibility
• Security and Authentication
• Cost-effective solution
• Fast
• Simple model of programming
• Parallel Programming
• Availability and resilient nature
Real World Map Reduce
MapReduce is widely used in industry for processing large datasets in a distributed
manner. Below are some real-world applications where MapReduce plays a crucial role:
1. Healthcare and Medical Imaging
• Disease detection (X-rays, MRIs, CT scans)
• Genomic analysis for genetic mutations
• Processing electronic health records (EHRs)
2. E-Commerce and Retail
• Product recommendation systems
• Customer sentiment analysis from reviews
• Fraud detection in online transactions
3. Social Media and Online Platforms
• Trending topic detection (Twitter, Facebook)
• Spam and fake news filtering
• User engagement analysis (likes, shares, comments)
4. Finance and Banking
• Stock market predictions
• Risk management and fraud detection
• Customer segmentation for targeted marketing
5. Search Engines (Google, Bing, Yahoo)
• Web indexing for search results
• Ad targeting based on user behavior
6. Transportation and Logistics
• Traffic analysis for route optimization
• Supply chain and warehouse management
7. Scientific Research and Big Data Analytics
• Climate change analysis using satellite data
• Astronomy research for celestial object detection
