UNIT 2 Full
HADOOP
History of Hadoop
Early Foundations (2003–2004)
• In 2003, Google published a paper on the Google File System
(GFS), which described a scalable distributed file system to
store and manage large datasets across multiple machines.
• In 2004, Google introduced MapReduce, a programming model
designed for processing large-scale datasets efficiently.
• Hadoop was originally created by Doug Cutting and Mike Cafarella in 2005, while
they were working on the Nutch search engine project. Nutch is a highly extensible
and scalable web crawler used to search and index web pages. Searching and indexing
the billions of pages that exist on the internet requires vast computing resources.
Instead of relying on centralized hardware to deliver high availability, Cutting and
Cafarella developed software (now known as Hadoop) that was able to distribute the
workload across clusters of computers.
Birth of Hadoop (2005–2006)
• Inspired by Google’s work, Doug Cutting and Mike
Cafarella were developing an open-source web search
engine called Nutch.
• They realized the need for a robust storage and
processing system like GFS and MapReduce.
• In 2005, they started implementing a distributed
computing system as part of the Apache Nutch
project.
• In 2006, Hadoop became an independent project under
the Apache Software Foundation (ASF), with Doug
Cutting naming it after his son’s toy elephant.
Growth and Adoption (2007–2011)
•Yahoo! became the primary contributor to Hadoop and built the world’s largest Hadoop cluster at the time.
•In 2008, Hadoop became a top-level Apache project, freely available as an open-source framework.
•Hadoop's ecosystem expanded with HBase (2008), Hive (2009), Pig (2008), and Zookeeper (2010), adding more
functionalities.
•By 2011, Hadoop became widely adopted across industries for handling big data.
Modern Hadoop and Industry Adoption (2012–Present)
• Hadoop 2.0 (2013) introduced YARN (Yet Another
Resource Negotiator), which improved cluster
resource management
• Companies like Facebook, Twitter, Amazon, and
Microsoft integrated Hadoop into their data
architectures.
• Hadoop 3.0 (2017) introduced support for erasure
coding, improved scalability, and
containerization.
• Although newer engines such as Apache Spark and cloud-based services
such as Google BigQuery are gaining popularity, Hadoop remains a key
player in big data processing.
Apache Hadoop
• Apache Hadoop is an open-source software framework, originally developed by
Doug Cutting while at Yahoo!, that provides highly reliable, distributed
processing of large data sets using simple programming models.
• Hadoop is an open-source software framework that is used for
storing and processing large amounts of data in a distributed
computing environment. It is designed to handle big data and is
based on the MapReduce programming model, which allows for the
parallel processing of large datasets.
HDFS Architecture
• HDFS follows a master-slave architecture with two types of nodes:
1. NameNode (Master Node):
  • It executes filesystem namespace operations like opening, closing, and renaming files and directories.
  • It should be deployed on reliable hardware with a high configuration, not on commodity hardware.
2. DataNode (Slave Node):
  • These are the actual worker nodes, which do the actual work like reading, writing, and processing data.
  • They also perform block creation, deletion, and replication upon instruction from the master.
Data Analysis with Hadoop (a minimal PySpark sketch follows the steps below)
1. Data Ingestion:
• Start by ingesting data into the Hadoop cluster. You can use tools like Apache Flume, Apache Kafka, or
Hadoop’s HDFS commands for batch data ingestion.
• Ensure that your data is stored in a structured format in HDFS or another suitable storage system.
2. Data Preparation:
• Preprocess and clean the data as needed. This may involve tasks such as data deduplication, data
normalization, and handling missing values.
• Transform the data into a format suitable for analysis, which could include data enrichment and feature
engineering.
3. Choose a Processing Framework:
Select the appropriate data processing framework based on your requirements. Common
choices include:
• MapReduce: Ideal for batch processing and simple transformations.
• Apache Spark: Suitable for batch, real-time, and iterative data processing. It offers a
wide range of libraries for machine learning, graph processing, and more.
• Apache Hive: If you prefer SQL-like querying, you can use Hive for data analysis.
• Apache Pig: A high-level data flow language for ETL and data analysis tasks.
• Custom Code: You can write custom Java, Scala, or Python code using Hadoop APIs if
necessary.
4. Data Analysis:
• Write the code or queries needed to perform the desired analysis. Depending on your choice
of framework, this may involve writing MapReduce jobs, Spark applications, HiveQL
queries, or Pig scripts.
• Implement data aggregation, filtering, grouping, and any other required transformations.
5. Scaling:
• Hadoop is designed for horizontal scalability. As your data and processing needs grow, you
can add more nodes to your cluster to handle larger workloads.
6. Optimization:
• Optimize your code and queries for performance. Tune the configuration parameters of your
Hadoop cluster, such as memory settings and resource allocation.
• Consider using data partitioning and bucketing techniques to improve query performance.
7. Data Visualization:
• Once you have obtained results from your analysis, you can use data visualization tools like
Apache Zeppelin, Apache Superset, or external tools like Tableau and Power BI to create
meaningful visualizations and reports.
8. Iteration:
• Data analysis is often an iterative process. You may need to refine your analysis based on
initial findings or additional questions that arise.
9. Data Security and Governance:
• Ensure that data access and processing adhere to security and governance policies. Use tools
like Apache Ranger or Apache Sentry for access control and auditing.
10. Results Interpretation:
• Interpret the results of your analysis and draw meaningful insights from the data.
• Document and share your findings with relevant stakeholders.
11. Automation:
• Consider automating your data analysis pipeline to ensure that new data is continuously
ingested, processed, and analyzed as it arrives.
12. Performance Monitoring:
• Implement monitoring and logging to keep track of the health and performance of your
Hadoop cluster and data analysis jobs.
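To make the workflow above concrete, here is a minimal sketch of steps 1–4 using PySpark (one of the processing frameworks listed in step 3). The HDFS path, file layout, and column names (region, amount) are made up for illustration, and the sketch assumes PySpark is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesAnalysisSketch").getOrCreate()

# Steps 1-2: ingest structured data from HDFS and do basic cleaning
sales = (spark.read.option("header", True).option("inferSchema", True)
         .csv("hdfs:///data/sales.csv")                 # hypothetical path
         .dropDuplicates()
         .dropna(subset=["region", "amount"]))

# Step 4: analysis - total and average sale amount per region
summary = (sales.groupBy("region")
                .agg(F.sum("amount").alias("total_amount"),
                     F.avg("amount").alias("avg_amount"))
                .orderBy(F.desc("total_amount")))

summary.show()
spark.stop()

The same aggregation could equally be expressed as a MapReduce job, a HiveQL query, or a Pig script; PySpark is used here only because it keeps the sketch short and in Python.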
Scaling
• Scaling can be defined as the process of expanding the existing configuration (servers/computers) to
handle a large number of user requests or to manage the amount of load on the server; this property is
called scalability. It can be achieved either by increasing the current system configuration (more RAM,
storage) or by adding more servers to the configuration. Scalability plays a vital role in the design of
a system, as it helps in responding to a large number of user requests more effectively and quickly.
• Scaling is of two types: 1. Vertical Scaling 2. Horizontal Scaling
Vertical Scaling
• It is defined as the process of increasing the capacity of a single machine by adding more resources
such as memory, storage, etc. to increase the throughput of the system. No new machine is added;
rather, the capability of the existing machine is increased. This is called Vertical Scaling.
Vertical Scaling is also called the Scale-up approach.
• Example: MySQL
• Advantages of Vertical Scaling
  • It is easy to implement.
  • Reduced software costs, as no new resources are added.
  • Fewer efforts are required to maintain a single system.
• Disadvantages of Vertical Scaling
  • Single point of failure: when the system (server) fails, the downtime is high because there is only a single server.
  • High risk of hardware failures.
Example
• When traffic increases, the server degrades in performance. The first possible solution is to
increase the power of the system. For instance, if the server earlier had 8 GB of RAM and a
128 GB hard drive, then with increasing traffic a possible solution is to increase the existing
RAM or hard drive storage, e.g. to 16 GB of RAM and a 500 GB hard drive. But this is not an
ultimate solution, because after a point of time these capacities will reach a saturation point.
Horizontal Scaling
• It is defined as the process of adding more instances of the same type to the existing pool
of resources, rather than increasing the capacity of existing resources as in vertical scaling.
This kind of scaling also helps in decreasing the load on the server. Horizontal Scaling is
also called the Scale-out approach.
• In this process, the number of servers is increased, not the individual capacity of each
server. This is done with the help of a Load Balancer, which routes user requests to different
servers according to their availability, thereby increasing the overall performance of the
system. In this way, the entire workload is distributed among all the servers rather than
depending on a single server.
Example: NoSQL databases such as Cassandra and MongoDB
Advantages of Horizontal Scaling
1. Fault Tolerance: there is no single point of failure in this kind of scaling because there are, say, 5
servers instead of 1 powerful server, so if any one of the servers fails there will be other servers to
fall back on. Whereas in Vertical Scaling there is a single point of failure, i.e. if the server fails,
the whole service is stopped.
2. Low Latency: latency refers to how late or delayed our request is being processed.
3. Built-in backup
Disadvantages of Horizontal Scaling
1. Not easy to implement, as there are a number of components in this kind of scaling.
2. Cost is high.
Hadoop Streaming
• Hadoop Streaming is a utility that allows users to create and run MapReduce jobs with any executable
or script as the mapper and/or the reducer. Some of the key features associated with Hadoop Streaming
are as follows:
1. Hadoop Streaming is part of the standard Hadoop distribution.
2. It facilitates ease of writing MapReduce programs and codes.
3. Hadoop Streaming supports almost all types of programming languages such as
Python, C++, Ruby, Perl etc.
4. The Hadoop Streaming framework itself runs on Java; however, the mapper and reducer code may be
written in different languages, as mentioned in the previous point.
5. The Hadoop Streaming process uses Unix Streams that act as an interface between
Hadoop and Map Reduce programs.
6. Hadoop Streaming uses various streaming command options; the two mandatory ones are
-input (directory name or file name) and -output (directory name).
There are eight key parts in a Hadoop Streaming architecture:
• Input Reader/Format
• Key Value
• Mapper Stream
• Key-Value Pairs
• Reduce Stream
• Output Format
• Map External
• Reduce External
The six steps involved in the working of Hadoop Streaming are listed below (a small Python
mapper/reducer sketch follows the steps):
• Step 1: The input data is automatically divided into chunks or blocks, typically 64 MB to 128 MB in
size. Each chunk of data is processed by a separate mapper.
• Step 2: The mapper reads the input data from standard input (stdin) and generates an intermediate
key-value pair based on the logic of the mapper function which is written to standard output (stdout).
• Step 3: The intermediate key-value pairs are sorted and partitioned based on their keys ensuring
that all values with the same key are directed to the same reducer.
• Step 4: The key-value pairs are passed to the reducers for further processing where each reducer
receives a set of key-value pairs with the same key.
• Step 5: The reducer function, implemented by the developer, performs the required computations or
aggregations on the data and generates the final output which is written to the standard output
(stdout).
• Step 6: The final output generated by the reducers is stored in the specified output location in the
HDFS.
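As a concrete illustration of steps 2–5, below is a minimal word-count mapper and reducer for Hadoop Streaming written in Python (a sketch; the file names mapper.py and reducer.py are arbitrary). The mapper reads lines from standard input and emits tab-separated (word, 1) pairs; the reducer relies on the framework having already sorted its input by key, so all counts for the same word arrive consecutively.

#!/usr/bin/env python3
# mapper.py - read lines from stdin, emit "word<TAB>1" pairs on stdout
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts for the same word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts would typically be submitted with the streaming jar using the mandatory -input and -output options plus -mapper and -reducer, e.g. hadoop jar hadoop-streaming-*.jar -input /in -output /out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path and option set depend on the Hadoop installation and version).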
Hadoop Pipes
• Hadoop Pipes is a C++ interface that allows users to write MapReduce applications using Hadoop.
It uses sockets to enable communication between tasktrackers and processes running the C++ map
or reduce functions.
Working of Hadoop Pipes:
• C++ code is split into a separate process that performs the application-specific code
• Writable serialization is used to convert types into bytes that are sent to the process via a socket
• The application programs link against a thin C++ wrapper library that handles communication with
the rest of the Hadoop system
Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems.
• The Hadoop Ecosystem is a suite of tools and technologies built around Hadoop to process and manage large-scale data.
Key components include:
1. HDFS (Hadoop Distributed File System) – Distributed storage system.
2. MapReduce – Programming model for parallel data processing.
3. YARN (Yet Another Resource Negotiator) – Resource management and job scheduling.
Supporting Tools:
• Hive – SQL-like querying.
• Pig – Data flow scripting.
• HBase – NoSQL database.
• Spark – Fast in-memory data processing.
• Sqoop – Data transfer between Hadoop and RDBMS.
• Flume – Data ingestion from logs.
• Oozie – Workflow scheduling.
• Zookeeper – Coordination service.
Classical Data Processing
• [Figure: a classical single machine (CPU, memory, disk) facing 200 TB of big data; the data is split
into small chunks and processed on a cluster of machines organised into racks (Rack 1, Rack 2, ...),
with a 1 Gbps switch between any pair of nodes]
• Each rack contains 16-64 commodity (low-cost) computers (also called nodes)
• Example
  • Network bandwidth = 1 Gbps
  • Moving 10 TB of data takes about 1 day
Challenge # 3
• Distributed Programming is hard!
• Why?
1. Data distribution across machines is non-trivial
• (It is desirable that machines have roughly the same load)
2. Avoiding race conditions
• Given two tasks T1 and T2, the correctness of the result depends on the order in which the tasks execute
• For example, T1 must run before T2, but not T2 before T1
What is the Solution?
• Map-Reduce
Programmer’s responsibility:
Write only two functions, Map and Reduce suitable for your problem
[Figure: data chunks a1, a2, a3, b1, b2 distributed and replicated across the machines of the cluster]
• Master node
• Stores metadata about where the files are stored
• It might also be replicated
Map-Reduce Programming Model
Example Problem: Counting Words
• We have a huge text document and want to count the number of times each
distinct word appears in the file
• Sample application
• Analyze web server logs to find popular URLs
• Case 2: All <word, count> pairs do not fit in memory, but fit on disk
• A possible approach: write computer programs/functions for each step, e.g. getWords(textFile)
• This outline maps naturally onto MapReduce:
  • Map: extract something you care about (here, word and count)
  • Group by key: sort and shuffle
  • Reduce: aggregate, summarize, etc., and save the results
Summary
1. Outline stays the same
2. Map and Reduce to be defined to fit the problem
MapReduce: The Map Step
[Figure: each input key-value pair (file name, file content) — (f1, c1), (f2, c2), (f3, c3), ... —
is passed through the map function, which produces intermediate key-value pairs (word, count) —
(k1, v1), (k2, v2), (k3, v3), (k4, v4), ...]
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs (k1, v1), (k2, v2), (k1, v3), (k3, v4), (k2, v5), (k1, v6), ...
are grouped by key into key-value groups k1: [v1, v3, v6], k2: [v2, v5], k3: [v4], and the reduce
function is applied to each group to produce the output key-value pairs (k1, v'), (k2, v''), (k3, v''').]
Map-reduce: Word Count
[Figure: worked word-count example on a big document beginning "The crew of the space shuttle Endeavor
recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Crew members
at ...".
MAP (provided by the programmer): reads the input and produces a set of key-value pairs such as
(the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (endeavor, 1), (recently, 1),
(returned, 1), (to, 1), (earth, 1), (as, 1), (ambassadors, 1), ...
Group by key: collects all pairs with the same key, e.g. (crew, 1), (crew, 1); (space, 1);
(the, 1), (the, 1), (the, 1); (shuttle, 1); (recently, 1); ...
REDUCE (provided by the programmer): collects all values belonging to the same key and outputs
aggregated pairs such as (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...]
Word Count Using MapReduce: Pseudocode
map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: a list of counts for the word
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
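The pseudocode above can be simulated in plain Python (without a cluster) to see how the map, group-by-key, and reduce phases fit together; the two sample documents below are made up for illustration.

from collections import defaultdict

documents = {"doc1.txt": "the crew of the space shuttle",
             "doc2.txt": "the crew returned to earth"}      # made-up input

# Map: emit (word, 1) for every word of every document
mapped = [(word, 1) for text in documents.values() for word in text.split()]

# Group by key (done by the shuffle/sort phase on a real cluster)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'the': 3, 'crew': 2, 'of': 1, ...}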
Map-reduce System: Under the Hood
• Map-reduce works well when the mappers do not need data from one another while they are running
  • Example: word count
• Map-reduce is not suitable for applications that require a very quick response time
  • In IR, indexing is okay, but query processing is not suitable for map-reduce
Example: joining table T1(A, B) with table T2(B, C) on B
T1 = { (a1, b1), (a2, b1), (a3, b2), (a4, b3) }
T2 = { (b2, c1), (b2, c2), (b3, c3) }
T1 ⋈ T2 = { (a3, b2, c1), (a3, b2, c2), (a4, b3, c3) }
Map-Reduce Join
• Map process
  • Each row (a, b) from T1 is mapped to the key-value pair (b, (a, T1))
  • Each row (b, c) from T2 is mapped to (b, (c, T2))
• Reduce process
  • Each reduce process matches all the pairs (b, (a, T1)) with all the pairs (b, (c, T2)) and outputs (a, b, c)
A Python simulation of this join is sketched below.
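Here is that simulation, a minimal sketch using the small T1 and T2 tables from the example above; on a real cluster the grouping by b would be done by the shuffle phase rather than by a dictionary.

from collections import defaultdict

T1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]   # rows (a, b)
T2 = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]                 # rows (b, c)

# Map: tag each row with its table of origin and key it by b
mapped = [(b, ("T1", a)) for a, b in T1] + [(b, ("T2", c)) for b, c in T2]

# Shuffle: group all tagged rows by the join key b
groups = defaultdict(list)
for b, tagged in mapped:
    groups[b].append(tagged)

# Reduce: for each b, combine every T1 value with every T2 value
joined = []
for b, rows in groups.items():
    a_values = [v for tag, v in rows if tag == "T1"]
    c_values = [v for tag, v in rows if tag == "T2"]
    joined.extend((a, b, c) for a in a_values for c in c_values)

print(joined)   # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]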
Advanced Exercise
• You have a dataset with thousands of features. Find the most correlated features in that data.
Take Home Exercises
1. PageRank
2. HITS
Developing a MapReduce Application
• Map Program for Word Count in Python
sentences = [
    "Artificial Intelligence is fascinating",
    "Deep learning improves medical imaging",
    "Generative AI is transforming healthcare"
]

# Map step: apply a word-counting function to every sentence
word_counts = list(map(lambda sentence: len(sentence.split()), sentences))

# Print results
for i, count in enumerate(word_counts):
    print(f"Sentence {i+1}: {count} words")
• Reduce Program for Word Count in Python
from functools import reduce

sentences = [
    "Artificial Intelligence is fascinating",
    "Deep learning improves medical imaging",
    "Generative AI is transforming healthcare"
]

# Using reduce to count the total number of words across all sentences
total_word_count = reduce(lambda total, sentence: total + len(sentence.split()), sentences, 0)

# Print result
print(f"Total Word Count: {total_word_count}")
Unit Test MapReduce using MRUnit
• In order to make sure that your code is correct, you need to Unit test
your code first.
• And just as you unit test your Java code using the JUnit testing framework, the same can be
done using MRUnit to test MapReduce jobs (an analogous Python sketch is shown below).
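MRUnit itself is a Java library used alongside JUnit. As an analogous illustration in Python (not MRUnit), the map logic of the streaming word-count mapper shown earlier can be factored into a plain function and unit-tested with the standard unittest module:

import unittest

def map_words(line):
    """Mapper logic: turn one input line into a list of (word, 1) pairs."""
    return [(word, 1) for word in line.strip().split()]

class MapWordsTest(unittest.TestCase):
    def test_emits_one_pair_per_word(self):
        self.assertEqual(map_words("the crew of the shuttle"),
                         [("the", 1), ("crew", 1), ("of", 1), ("the", 1), ("shuttle", 1)])

    def test_empty_line_emits_nothing(self):
        self.assertEqual(map_words("   "), [])

if __name__ == "__main__":
    unittest.main()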
• Jobtracker Failure
  • It is the most serious failure mode.
  • Hadoop 1 has no mechanism for dealing with jobtracker failure.
  • This situation is improved with YARN.
How Hadoop runs a MapReduce job using YARN
Job Scheduling in Hadoop
• In Hadoop 1, the Hadoop MapReduce framework is responsible for scheduling tasks,
monitoring them, and re-executing failed tasks.
• In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced.
• The basic idea behind the introduction of YARN is to split the functionalities of
resource management and job scheduling/monitoring into separate daemons: the
ResourceManager, the ApplicationMaster, and the NodeManager.
Job Scheduling in Hadoop
• The ResourceManager has two main components: the Scheduler and the ApplicationsManager.
• The Scheduler in the YARN ResourceManager is a pure scheduler, responsible for allocating
resources to the various running applications.
• The FIFO Scheduler, CapacityScheduler, and FairScheduler are
pluggable policies that are responsible for allocating resources to the
applications.
FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop.
• The FIFO Scheduler gives more preference to applications submitted earlier than to those submitted later.
• It places the applications in a queue and executes them in the order of their submission
(first in, first out).
• Advantage:
  • It is simple to understand and doesn’t need any configuration.
  • Jobs are executed in the order of their submission.
• Disadvantage:
  • It is not suitable for shared clusters. If a large application arrives before a shorter one, the
    large application will use all the resources in the cluster and the shorter application has to
    wait for its turn. This leads to starvation.
Capacity Scheduler
• It is designed to run Hadoop applications in a shared, multi-tenant cluster while
maximizing the throughput and the utilization of the cluster.
• It supports hierarchical queues to reflect the structure of the organizations or
groups that utilize the cluster resources.
Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
• It is the most complex among the schedulers.
Fair Scheduler
• The Fair Scheduler assigns resources to applications so that, over time, all applications get an
equal share of the cluster resources on average. Jobs are placed into queues (pools), and free
resources are divided fairly among them.
1. MapReduce Data Types
Hadoop uses the Writable interface for data serialization and efficient processing. The key
MapReduce data types include:
(a) Key-Value Data Types
•Text → Stores string values (Text is preferred over String for efficiency).
•IntWritable → Stores integer values.
•LongWritable → Stores long integer values.
•FloatWritable → Stores floating-point values.
•DoubleWritable → Stores double-precision values.
•BooleanWritable → Stores boolean values (true/false).
•ArrayWritable → Stores an array of writable objects.
•MapWritable → Stores key-value pairs where keys and values are writable.
2. Input Formats in Hadoop
• How the input files are split up and read in Hadoop is defined by the
InputFormat.
• A Hadoop InputFormat is the first component in MapReduce; it is responsible for
creating the input splits and dividing them into records.
• Initially, the data for a MapReduce task is stored in input files, and input files
typically reside in HDFS.
• Using InputFormat we define how these input files are split and read.
Input Formats
• The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework which provides the following functionality:
• The files or other objects that should be used for input are selected by the InputFormat.
• InputFormat defines the data splits, which determine both the size of the individual Map tasks and their
potential execution servers.
• InputFormat defines the RecordReader, which is responsible for reading actual records from
the input files.
Types of InputFormat in MapReduce
• FileInputFormat in Hadoop
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
FileInputFormat in Hadoop
• It is the base class for all file-based InputFormats.
• Hadoop FileInputFormat specifies input directory where data files are
located.
• When we start a Hadoop job, FileInputFormat is provided with a path
containing files to read.
• FileInputFormat reads all the files and divides them into one or more InputSplits.
TextInputFormat
• It is the default InputFormat of MapReduce.
• TextInputFormat treats each line of each input file as a separate
record and performs no parsing.
• This is useful for unformatted data or line-based records like log files.
• Key – It is the byte offset of the beginning of the line within the file
(not the whole file, just one split), so it will be unique if combined with the
file name.
• Value – It is the contents of the line, excluding line terminators (a small illustration follows below).
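For example, for one split with the made-up contents below, the (key, value) pairs that TextInputFormat would produce can be simulated in Python as follows:

split_data = b"first line\nsecond line\nthird line\n"   # made-up split contents
offset = 0
for line in split_data.splitlines(keepends=True):
    # key = byte offset of the line, value = line contents without the terminator
    print(offset, line.rstrip(b"\n").decode())
    offset += len(line)
# prints: 0 first line / 11 second line / 23 third line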
KeyValueTextInputFormat
• It is similar to TextInputFormat as it also treats each line of input as a
separate record.
• While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the
line into key and value at a tab character (‘\t’).
• Here the key is everything up to the tab character, while the value is the remaining part of the
line after the tab character (illustrated below).
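A minimal Python sketch of that split behaviour (the record below is made up): everything before the first tab becomes the key and the rest becomes the value.

record = "user42\tclicked the checkout button"   # made-up input line
key, _, value = record.partition("\t")
print(key)    # user42
print(value)  # clicked the checkout button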
SequenceFileInputFormat
• Hadoop SequenceFileInputFormat is an InputFormat which reads
sequence files.
• Sequence files are binary files that store sequences of binary key-value pairs.
• Sequence files can be block-compressed and provide direct serialization and
deserialization of several arbitrary data types (not just text).
• Here Key & Value both are user-defined.
SequenceFileAsTextInputFormat
• Hadoop SequenceFileAsTextInputFormat is another form of
SequenceFileInputFormat which converts the sequence file key values
to Text objects.
• The conversion is performed by calling toString() on the keys and values.
• This InputFormat makes sequence files suitable input for streaming.
NLineInputFormat
• Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the
byte offsets of the lines and the values are the contents of the lines.
• If we want our mapper to receive a fixed number of lines of input, then we use NLineInputFormat.
• N is the number of lines of input that each mapper receives. By default (N=1), each mapper
receives exactly one line of input. If N=2, then each split contains two lines: one mapper will
receive the first two key-value pairs and another mapper will receive the next two key-value
pairs (a small illustration follows below).
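A minimal Python sketch of grouping lines into splits of N lines each (the lines are made up):

lines = ["line1", "line2", "line3", "line4", "line5"]   # made-up input lines
N = 2
splits = [lines[i:i + N] for i in range(0, len(lines), N)]
print(splits)   # [['line1', 'line2'], ['line3', 'line4'], ['line5']] - one split per mapper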
DBInputFormat
• Hadoop DBInputFormat is an InputFormat that reads data from a
relational database, using JDBC.
• As it doesn’t have partitioning capabilities, we need to be careful not to swamp the
database we are reading from with too many mappers.
• So it is best for loading relatively small datasets, perhaps for joining
with large datasets from HDFS using MultipleInputs.
Hadoop Output Format
• The Hadoop Output Format checks the Output-Specification of the job.
• It determines how RecordWriter implementation is used to write output to output files.
• Hadoop RecordWriter
• As we know, the Reducer takes as input a set of intermediate key-value pairs produced by
the Mapper and runs a reducer function on them to generate output that is again zero or
more key-value pairs.
• RecordWriter writes these output key-value pairs from the Reducer phase to output files.
• As we saw above, Hadoop RecordWriter takes output data from Reducer and
writes this data to output files.
• The way these output key-value pairs are written in output files by RecordWriter is
determined by the Output Format.
Types of Hadoop Output Formats
• TextOutputFormat
• SequenceFileOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• DBOutputFormat
TextOutputFormat
• MapReduce’s default reducer Output Format is TextOutputFormat, which writes
(key, value) pairs on individual lines of text files; its keys and values can be
of any type.
• DBOutputFormat
• DBOutputFormat in Hadoop is an Output Format for writing to
relational databases and HBase.
• It sends the reduce output to a SQL table.
• It accepts key-value pairs.
Features of MapReduce
• Scalability
• Flexibility
• Security and Authentication
• Cost-effective solution
• Fast
• Simple model of programming
• Parallel Programming
• Availability and resilient nature
Real World Map Reduce
MapReduce is widely used in industry for processing large datasets in a distributed
manner. Below are some real-world applications where MapReduce plays a crucial role:
1. Healthcare and Medical Imaging
• Disease detection (X-rays, MRIs, CT scans)
• Genomic analysis for genetic mutations
• Processing electronic health records (EHRs)
2. E-Commerce and Retail
• Product recommendation systems
• Customer sentiment analysis from reviews
• Fraud detection in online transactions
3. Social Media and Online Platforms
• Trending topic detection (Twitter, Facebook)
• Spam and fake news filtering
• User engagement analysis (likes, shares, comments)
4. Finance and Banking
• Stock market predictions
• Risk management and fraud detection
• Customer segmentation for targeted marketing
5. Search Engines (Google, Bing, Yahoo)
• Web indexing for search results
• Ad targeting based on user behavior
6. Transportation and Logistics
• Traffic analysis for route optimization
• Supply chain and warehouse management
7. Scientific Research and Big Data Analytics
• Climate change analysis using satellite data
• Astronomy research for celestial object detection