
UNIT I

What is Big Data?

Big data is exactly what the name suggests: a "big" amount of data. Big Data refers to datasets that are large in volume and more complex than traditional datasets. Because of this large volume and higher complexity, traditional data processing software cannot handle Big Data. Big Data simply means datasets containing a large amount of diverse data, both structured and unstructured.

Big Data allows companies to address issues they are facing in their business, and solve these problems
effectively using Big Data Analytics. Companies try to identify patterns and draw insights from this sea of data so
that it can be acted upon to solve the problem(s) at hand.

Although companies have been collecting a huge amount of data for decades, the concept of Big Data only gained
popularity in the early-mid 2000s. Corporations realized the amount of data that was being collected on a daily
basis, and the importance of using this data effectively.

5Vs of Big Data

Volume refers to the amount of data that is being collected. The data could be structured or unstructured.

Velocity refers to the rate at which data is coming in.

Variety refers to the different kinds of data (data types, formats, etc.) that are coming in for analysis. Over the last
few years, two additional Vs of data have also emerged – value and veracity.

Value refers to the usefulness of the collected data.

Veracity refers to the quality of data that is coming in from different sources.
How Does Big Data Work?


Big data involves collecting, processing, and analyzing vast amounts of data from multiple sources to uncover
patterns, relationships, and insights that can inform decision-making. The process involves several steps:

Data Collection

Big data is collected from various sources such as social media, sensors, transactional systems, customer reviews,
and other sources.

Data Storage

The collected data then needs to be stored in a way that it can be easily accessed and analyzed later. This often
requires specialized storage technologies capable of handling large volumes of data.

Data Processing

Once the data is stored, it needs to be processed before it can be analyzed. This involves cleaning and organizing
the data to remove any errors or inconsistencies, and transforming it into a format suitable for analysis.

Data Analysis

After the data has been processed, it is time to analyze it using tools like statistical models and machine learning
algorithms to identify patterns, relationships, and trends.

Data Visualization

The insights derived from data analysis are then presented in visual formats such as graphs, charts, and
dashboards, making it easier for decision-makers to understand and act upon them.

Use Cases

Big Data helps corporations make better and faster decisions, because they have more information available
to solve problems and more data to test their hypotheses on.

Customer Experience

Customer experience is a major field that has been revolutionized with the advent of Big Data. Companies are
collecting more data about their customers and their preferences than ever. This data is being leveraged in a
positive way, by giving personalized recommendations and offers to customers, who are more than happy to allow
companies to collect this data in return for the personalized services. The recommendations you get on Netflix, or
Amazon/Flipkart are a gift of Big Data!

Machine Learning

Machine Learning is another field that has benefited greatly from the increasing popularity of Big Data. More
data means we have larger datasets to train our ML models on, and a better-trained model generally delivers
better performance. Also, with the help of Machine Learning, we are now able to automate tasks that were earlier
done manually, all thanks to Big Data.
Demand Forecasting

Demand forecasting has become more accurate as more and more data is collected about customer
purchases. This data helps companies build models that forecast future demand and scale
production accordingly. It helps companies, especially those in manufacturing, reduce the cost of
storing unsold inventory in warehouses.

Big data also has extensive use in applications such as product development and fraud detection.

How to Store and Process Big Data?

The volume and velocity of Big Data can be huge, which makes it almost impossible to store it in traditional data
warehouses. Although some sensitive information can be stored on company premises, for most of the data,
companies have to opt for cloud storage or Hadoop.

Cloud storage allows businesses to store their data on the internet with the help of a cloud service provider (like
Amazon Web Services, Microsoft Azure, or Google Cloud Platform) who takes the responsibility of managing
and storing the data. The data can be accessed easily and quickly with an API.

Hadoop also does the same thing, by giving you the ability to store and process large amounts of data at once.
Hadoop is an open-source software framework and is free. It allows users to process large datasets across clusters
of computers.

Big Data Tools

Apache Hadoop is an open-source big data tool designed to store and process large amounts of data across
multiple servers. Hadoop comprises a distributed file system (HDFS) and a MapReduce processing engine.

Apache Spark is a fast and general-purpose cluster computing system that supports in-memory processing to
speed up iterative algorithms. Spark can be used for batch processing, real-time stream processing, machine
learning, graph processing, and SQL queries.

Apache Cassandra is a distributed NoSQL database management system designed to handle large amounts of data
across commodity servers with high availability and fault tolerance.

Apache Flink is an open-source streaming data processing framework that supports batch processing, real-time
stream processing, and event-driven applications. Flink provides low-latency, high-throughput data processing
with fault tolerance and scalability.
Apache Kafka is a distributed streaming platform that enables publishing and subscribing to streams of records
in real time. Kafka is used for building real-time data pipelines and streaming applications.

Splunk is a software platform used for searching, monitoring, and analyzing machine-generated big data in real-
time. Splunk collects and indexes data from various sources and provides insights into operational and business
intelligence.

Talend is an open-source data integration platform that enables organizations to extract, transform, and load (ETL)
data from various sources into target systems. Talend supports big data technologies such as Hadoop, Spark, Hive,
Pig, and HBase.

Tableau is a data visualization and business intelligence tool that allows users to analyze and share data using
interactive dashboards, reports, and charts. Tableau supports big data platforms and databases such as Hadoop,
Amazon Redshift, and Google BigQuery.

Apache NiFi is a data flow management tool used for automating the movement of data between systems. NiFi
supports big data technologies such as Hadoop, Spark, and Kafka and provides real-time data processing and
analytics.

QlikView is a business intelligence and data visualization tool that enables users to analyze and share data using
interactive dashboards, reports, and charts. QlikView supports big data platforms such as Hadoop, and provides
real-time data processing and analytics.

Big Data Best Practices

To effectively manage and utilize big data, organizations should follow some best practices:

Define clear business objectives: Organizations should define clear business objectives while collecting and
analyzing big data. This can help avoid wasting time and resources on irrelevant data.

Collect and store relevant data only: It is important to collect and store only the relevant data that is required for
analysis. This can help reduce data storage costs and improve data processing efficiency.

Ensure data quality: It is critical to ensure data quality by removing errors, inconsistencies, and duplicates from
the data before storage and processing.

Use appropriate tools and technologies: Organizations must use appropriate tools and technologies for collecting,
storing, processing, and analyzing big data. This includes specialized software, hardware, and cloud-based
technologies.

Establish data security and privacy policies: Big data often contains sensitive information, and therefore
organizations must establish rigorous data security and privacy policies to protect this data from unauthorized
access or misuse.

Leverage machine learning and artificial intelligence: Machine learning and artificial intelligence can be used to
identify patterns and predict future trends in big data. Organizations must leverage these technologies to gain
actionable insights from their data.

Focus on data visualization: Data visualization can simplify complex data into intuitive visual formats such as
graphs or charts, making it easier for decision-makers to understand and act upon the insights derived from big
data.
Challenges

1. Data Growth

Managing datasets containing terabytes of information can be a big challenge for companies. As datasets grow in size,
storing them not only becomes a challenge but also an expensive affair.

To overcome this, companies are now starting to pay attention to data compression and de-duplication.
Data compression reduces the number of bits that the data needs, resulting in a reduction in the space being
consumed. Data de-duplication is the process of making sure duplicate and unwanted data does not reside in the
database.

2. Data Security

Data security is often prioritized quite low in the Big Data workflow, which can backfire at times. With such a
large amount of data being collected, security challenges are bound to come up sooner or later.

Mining of sensitive information, fake data generation, and lack of cryptographic protection (encryption) are some
of the challenges businesses face when trying to adopt Big Data techniques.

Companies need to understand the importance of data security and prioritize it. To help them, there are now
professional Big Data consultants who help businesses move from traditional data storage and
analysis methods to Big Data.

3. Data Integration

Data is coming in from a lot of different sources (social media applications, emails, customer verification
documents, survey forms, etc.). It often becomes a very big operational challenge for companies to combine and
reconcile all of this data.

There are several Big Data solution vendors that offer ETL (Extract, Transform, Load) and data integration
solutions to companies that are trying to overcome data integration problems. There are also several APIs that
have already been built to tackle issues related to data integration.

Advantages and Disadvantages of Big Data

Advantages of Big Data

Improved decision-making: Big data can provide insights and patterns that help organizations make more
informed decisions.

Increased efficiency: Big data analytics can help organizations identify inefficiencies in their operations and
improve processes to reduce costs.

Better customer targeting: By analyzing customer data, businesses can develop targeted marketing campaigns that
are relevant to individual customers, resulting in better customer engagement and loyalty.

New revenue streams: Big data can uncover new business opportunities, enabling organizations to create new
products and services that meet market demand.

Competitive advantage: Organizations that can effectively leverage big data have a competitive advantage over
those that cannot, as they can make faster, more informed decisions based on data-driven insights.
Disadvantages of Big Data

Privacy concerns: Collecting and storing large amounts of data can raise privacy concerns, particularly if the data
includes sensitive personal information.

Risk of data breaches: Big data increases the risk of data breaches, leading to loss of confidential data and
negative publicity for the organization.

Technical challenges: Managing and processing large volumes of data requires specialized technologies and
skilled personnel, which can be expensive and time-consuming.

Difficulty in integrating data sources: Integrating data from multiple sources can be challenging, particularly if the
data is unstructured or stored in different formats.

Complexity of analysis: Analyzing large datasets can be complex and time-consuming, requiring specialized skills
and expertise.

Implementation Across Industries

Here are the top 10 industries that use big data in their favor:

Healthcare: Analyze patient data to improve healthcare outcomes, identify trends and patterns, and develop personalized treatments.

Retail: Track and analyze customer data to personalize marketing campaigns, improve inventory management, and enhance customer experience.

Finance: Detect fraud, assess risks, and make informed investment decisions.

Manufacturing: Optimize supply chain processes, reduce costs, and improve product quality through predictive maintenance.

Transportation: Optimize routes, improve fleet management, and enhance safety by predicting accidents before they happen.

Energy: Monitor and analyze energy usage patterns, optimize production, and reduce waste through predictive analytics.

Telecommunications: Manage network traffic, improve service quality, and reduce downtime through predictive maintenance and outage prediction.

Government and public sector: Address issues such as preventing crime, improving traffic management, and predicting natural disasters.

Advertising and marketing: Understand consumer behavior, target specific audiences, and measure the effectiveness of campaigns.

Education: Personalize learning experiences, monitor student progress, and improve teaching methods through adaptive learning.

The Future of Big Data

The volume of data being produced every day is continuously increasing, with increasing digitization. More and
more businesses are starting to shift from traditional data storage and analysis methods to cloud solutions.
Companies are starting to realize the importance of data. All of this implies one thing: the future of Big Data looks
promising! It will change the way businesses operate and decisions are made.

What is Hadoop Distributed File System (HDFS)?

It is difficult to maintain huge volumes of data in a single machine. Therefore, it becomes necessary to break
down the data into smaller chunks and store it on multiple machines.

Filesystems that manage the storage across a network of machines are called distributed file systems.

Hadoop Distributed File System (HDFS) is the storage component of Hadoop. All data stored on Hadoop is stored
in a distributed manner across a cluster of machines. But it has a few properties that define its existence.

Huge volumes – Being a distributed file system, it is highly capable of storing petabytes of data without any
glitches.

Data access – It is based on the philosophy that “the most effective data processing pattern is write-once, the
read-many-times pattern”.

Cost-effective – HDFS runs on a cluster of commodity hardware. These are inexpensive machines that can be
bought from any vendor.

What are the components of the Hadoop Distributed File System (HDFS)?

Broadly speaking, HDFS has two main components – data blocks and the nodes storing those data blocks. But there
is more to it than meets the eye. So, let's look at these one by one to get a better understanding.

HDFS Blocks

HDFS breaks down a file into smaller units. Each of these units is stored on different machines in the cluster.
This, however, is transparent to the user working on HDFS. To them, it seems like storing all the data onto a
single machine.

These smaller units are the blocks in HDFS. The size of each of these blocks is 128MB by default, and you can easily
change it according to requirement. So, if you had a file of size 512MB, it would be divided into 4 blocks storing
128MB each.
If, however, you had a file of size 524MB, then, it would be divided into 5 blocks. 4 of these would store 128MB
each, amounting to 512MB. And the 5th would store the remaining 12MB. That’s right! This last block won’t take
up the complete 128MB on the disk.
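
As noted above, the default block size can be changed. Here is a minimal sketch of requesting a different block size for a single file when writing it, assuming the standard Hadoop FileSystem Java API; the file path is a hypothetical placeholder.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration.
        Path file = new Path("/user/data/sample.txt");

        // create(path, overwrite, bufferSize, replication, blockSize)
        // Here we request 256 MB blocks instead of the 128 MB default.
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 256L * 1024 * 1024);
        out.writeUTF("hello HDFS");
        out.close();
        fs.close();
    }
}

The fourth and fifth arguments of create() are the replication factor and the block size in bytes for this particular file.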

But, you must be wondering, why such a huge amount in a single block? Why not multiple blocks of 10KB each?
Well, the amount of data we generally deal with in Hadoop is usually in the order of petabytes or
higher.

Therefore, if we create blocks of small size, we would end up with a colossal number of blocks. This would mean
we would have to deal with equally large metadata regarding the location of the blocks which would just create a
lot of overhead. And we don’t really want that!

There are several perks to storing data in blocks rather than saving the complete file.

The file itself would be too large to store on any single disk alone. Therefore, it is prudent to spread it across
different machines on the cluster.

It would also enable a proper spread of the workload and prevent any single machine from becoming a bottleneck,
by taking advantage of parallelism.

Now, you must be wondering, what about the machines in the cluster? How do they store the blocks and where is
the metadata stored? Let’s find out.
Namenode in HDFS

HDFS operates in a master-worker architecture; this means that there is one master node and several worker
nodes in the cluster. The master node is the Namenode.

The Namenode is the master node and runs on a separate node in the cluster.

It manages the filesystem namespace, which is the filesystem tree or hierarchy of the files and directories.

It stores information like the owners of files, file permissions, etc. for all the files.

It is also aware of the locations of all the blocks of a file and their sizes.

All this information is maintained persistently over the local disk in the form of two files: Fsimage and Edit Log.

Fsimage stores the information about the files and directories in the filesystem. For files, it stores the replication
level, modification and access times, access permissions, blocks the file is made up of, and their sizes. For
directories, it stores the modification time and permissions.

The Edit log, on the other hand, keeps track of all the write operations that the client performs. These operations are
regularly applied to the in-memory metadata so that read requests can be served.

Whenever a client wants to write information to HDFS or read information from HDFS, it connects with
the Namenode. The Namenode returns the location of the blocks to the client and the operation is carried out.
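
To make this interaction concrete, here is a minimal sketch of a client read using the standard Hadoop FileSystem Java API (the path is a hypothetical placeholder); the Namenode lookup happens behind the scenes inside open().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // contacts the Namenode for metadata

        // Hypothetical file path, used only for illustration.
        FSDataInputStream in = fs.open(new Path("/user/data/sample.txt"));
        try {
            // The client streams block data directly from the Datanodes
            // whose locations were returned by the Namenode.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}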

Yes, that’s right, the Namenode does not store the blocks. For that, we have separate nodes.

Datanodes in HDFS

Datanodes are the worker nodes. They are inexpensive commodity hardware that can be easily added to the
cluster.

Datanodes are responsible for storing, retrieving, replicating, deletion, etc. of blocks when asked by the
Namenode.

They periodically send heartbeats to the Namenode so that it is aware of their health. With that, a DataNode also
sends a list of blocks that are stored on it so that the Namenode can maintain the mapping of blocks to Datanodes
in its memory.

But in addition to these two types of nodes in the cluster, there is also another node called the Secondary
Namenode. Let’s look at what that is.

Secondary Namenode in HDFS

Suppose we need to restart the Namenode, which can happen in case of a failure. This would mean that we have
to copy the Fsimage from disk to memory. We would also have to apply the latest copy of the Edit log to the
Fsimage to keep track of all the transactions. But if we restart the node after a long time, the Edit log could
have grown in size. This would mean that it would take a lot of time to apply the transactions from the Edit
log, and during this time, the filesystem would be offline. Therefore, to solve this problem, we bring in
the Secondary Namenode.

Secondary Namenode is another node present in the cluster whose main task is to regularly merge the Edit log
with the Fsimage and produce check‐points of the primary’s in-memory file system metadata. This is also referred
to as Checkpointing.

But the checkpointing procedure is computationally very expensive and requires a lot of memory, which is why
the Secondary namenode runs on a separate node on the cluster.

However, despite its name, the Secondary Namenode does not act as a Namenode. It is merely there for
Checkpointing and keeping a copy of the latest Fsimage.

Replication Management in HDFS

Now, one of the best features of HDFS is the replication of blocks which makes it very reliable. But how does it
replicate the blocks and where does it store them? Let’s answer those questions now.

Replication of blocks

HDFS is a reliable storage component of Hadoop. This is because every block stored in the filesystem is
replicated on different Data Nodes in the cluster. This makes HDFS fault-tolerant.

The default replication factor in HDFS is 3. This means that every block will have two more copies of it, each
stored on separate DataNodes in the cluster. However, this number is configurable.
But you must be wondering: doesn't that mean that we are taking up too much storage? For instance, if we have 5
blocks of 128MB each, that amounts to 5*128*3 = 1920 MB. True. But these nodes are commodity
hardware. We can easily scale the cluster to add more of these machines. The cost of buying machines is much
lower than the cost of losing the data!
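
The replication factor can also be adjusted per file. Here is a minimal sketch, assuming the standard Hadoop FileSystem Java API and a hypothetical file path:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Cluster-wide default for new files (normally set in hdfs-site.xml).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Ask the Namenode to keep only 2 copies of this particular file.
        fs.setReplication(new Path("/user/data/sample.txt"), (short) 2); // hypothetical path
        fs.close();
    }
}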

Now, you must be wondering, how does Namenode decide which Datanode to store the replicas on? Well, before
answering that question, we need to have a look at what is a Rack in Hadoop.

What is a Rack in Hadoop?

A Rack is a collection of machines (30-40 in Hadoop) that are stored in the same physical location. There are
multiple racks in a Hadoop cluster, all connected through switches.

Rack awareness

Replica storage is a tradeoff between reliability and read/write bandwidth. To increase reliability, we need to store
block replicas on different racks and Datanodes to increase fault tolerance, whereas write bandwidth is lowest
when replicas are stored on the same node. Therefore, Hadoop has a default strategy to deal with this conundrum,
also known as the Rack Awareness algorithm.
For example, if the replication factor for a block is 3, then the first replica is stored on the same Datanode on
which the client writes. The second replica is stored on a Datanode in a different rack, chosen
randomly. The third replica is stored on the same rack as the second but on a different Datanode, again
chosen randomly. If the replication factor were higher, the subsequent replicas would be stored on
random Datanodes in the cluster.

MapReduce is a Hadoop framework used to write applications that can process large amounts of data. It can also
be described as a programming model for processing large datasets across clusters of computers. It allows data to
be stored in a distributed form, simplifying the handling of a large amount of data across many machines. There
are two main functions in MapReduce: map and reduce, with the map work performed before the reduce work.
In the map function, we split the input data into pieces, and the map function processes these pieces accordingly.

The map output is then used as input for the reduce function. The reducers process the intermediate data from the
maps into smaller tuples, which leads to the final output of the framework. The framework takes care of scheduling
and monitoring tasks, and failed tasks are re-executed by the framework. Programmers can easily use this framework
with little experience in distributed processing. MapReduce can be used with various programming languages and
tools such as Java, Hive, Pig, Scala, and Python.

How does MapReduce work in Hadoop?

An overview of the MapReduce architecture and its components will help us understand how it works
in Hadoop.

MapReduce architecture

The following diagram shows the MapReduce structure.


MapReduce architecture consists of various components. A brief description of these sections can enhance our
understanding of how it works.

Job: This is the actual work that needs to be done or processed.

Task: This is a piece of the actual work. A MapReduce job is divided into many small
tasks that need to be processed.

Job Tracker: This tracker plays the role of scheduling jobs and tracking all jobs assigned to the task trackers.

Task Tracker: This tracker plays the role of executing tasks and reporting task status to the job tracker.

Input data: This is the data used for processing in the mapping phase.

Output data: This is the result of the mapping and reducing phases.

Client: This is a program or Application Programming Interface (API) that submits jobs to MapReduce. MapReduce can
accept jobs from multiple clients.

Hadoop MapReduce Master: This plays the role of dividing jobs into job parts.

Job Parts: These are the sub-tasks that result from dividing the primary job.

In the MapReduce architecture, clients submit jobs to the MapReduce Master. This master will then divide the job
into smaller, equal parts. The job parts will be used for the two main tasks in MapReduce: mapping
and reducing.

The developer will write logic that satisfies the organization's or company's needs. The input data will be
split and mapped.
The intermediate data will then be sorted and merged. The reducer, which produces the final output stored on HDFS,
will process this data.

The following diagram shows a simplified flow diagram of the MapReduce program.

How do Job Trackers work?

Every job consists of two essential parts: the map and reduce functions. The map function plays the role of splitting
jobs into task segments and producing intermediate mapping data, and the reduce function plays the role of shuffling
and reducing the intermediate data into smaller units.

The job tracker acts as a master. It ensures that all the work is completed. The job tracker schedules jobs submitted
by clients and assigns them to task trackers. Each task tracker performs map and reduce tasks.
Task trackers report the status of each assigned task to the job tracker. The following diagram summarizes
how job trackers and task trackers work.

Phases of MapReduce

The MapReduce program comprises three main phases: mapping, shuffling, and reducing. There is also an
optional phase known as the combiner phase.

Mapping Phase

This is the first phase of the program. There are two steps in this phase: splitting and mapping. In the splitting step,
the dataset is divided into equal units called input splits. Hadoop contains a RecordReader that
uses TextInputFormat to convert the input splits into key-value pairs.

The key-value pairs are then used as input to the map step. This is the only data format a mapper can read or
understand. The map step contains the logic of the code applied to these data blocks. In this step, the mapper
processes the key-value pairs and generates an output of the same form (key-value pairs).
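
To make the map step concrete, here is a small plain-Java sketch (an illustration of the idea only, not Hadoop code) that turns one input line into (word, 1) key-value pairs the way a word-count mapper would:

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapStepDemo {
    // Mimics a word-count mapper: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints [deer=1, bear=1, river=1]
        System.out.println(map("deer bear river"));
    }
}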
Shuffling phase

This is the second phase, which occurs after the completion of the mapping phase. It consists of two main steps:
sorting and merging. In the sorting step, the key-value pairs are sorted by key, and merging ensures that key-value
pairs with the same key are combined.

The shuffling phase facilitates the removal of duplicate keys and the grouping of values. Different values with
the same keys are combined. The output of this phase will be keys and values, as in the mapping phase.

Reducer phase

In the reducer phase, the output of the shuffling phase is used as input. The reducer processes these
inputs to reduce the intermediate values into a smaller set of values. It provides a summary of the entire dataset.
The output of this phase is stored in HDFS.

The following diagram illustrates MapReduce with its three main phases. Splitting is usually included in the
mapping phase.

Combiner phase

This is an optional phase used to improve the MapReduce process. It is used to reduce the map output at the node
level. At this stage, duplicate outputs from the map can be merged into a single output. The combiner
phase speeds up the shuffling phase by reducing the amount of intermediate data, improving the performance of jobs.

The following diagram shows how all four phases of MapReduce are used.
Benefits of Hadoop MapReduce

There are numerous benefits of MapReduce; some of them are listed below.

Speed: It can process large amounts of data in a short period.

Fault tolerance: The MapReduce framework can manage failures.

Cost-effective: Hadoop's scaling capabilities allow users to process or store data cost-effectively.

Scalability: Hadoop provides an excellent framework. MapReduce allows users to run applications on multiple
nodes.

Data availability: Replicas of the data are sent to various nodes within the network. This ensures that copies of the
data are available in case of failure.

Parallel Processing: In MapReduce, multiple parts of the same job can be processed in parallel. This
reduces the time taken to complete the task.

Applications of Hadoop MapReduce

The following are some of the most valuable applications of the MapReduce program.

E-commerce

E-commerce companies like Walmart, eBay, and Amazon use MapReduce to analyze consumer behavior.
MapReduce provides information that is used as a basis for developing product recommendations. This
information includes site logs, e-commerce catalogs, purchase history, and interaction logs.

Social media

The MapReduce programming model can evaluate information on social media platforms such as Facebook, Twitter,
and LinkedIn. It can evaluate important information, such as who liked your status and who viewed your profile.

Entertainment

Netflix uses MapReduce to analyze clicks and logs for online clients. This information helps the company
promote movies based on customer interests and behavior.

MapReduce Types

Mapping is the core technique of processing a list of data elements that come in pairs of keys and values. The map
function applies to individual elements defined as key-value pairs of a list and produces a new list. The general
idea of the map and reduce functions of Hadoop can be illustrated as follows:

map: (K1, V1) -> list (K2, V2)

reduce: (K2, list(V2)) -> list (K3, V3)

The input parameters of the key and value pair, represented by K1 and V1 respectively, are different from the
output pair type: K2 and V2. The reduce function accepts the same format output by the map, but the output type
of the reduce operation is again different: K3 and V3. The Java API for this is as follows:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output,
             Reporter reporter) throws IOException;
}

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
    void reduce(K2 key, Iterator<V2> values,
                OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}

The OutputCollector is the generalized interface of the Map-Reduce framework to facilitate collection of data
output either by the Mapper or the Reducer. These outputs are nothing but intermediate output of the job.
Therefore, they must be parameterized with their types. The Reporter facilitates the Map-Reduce application to
report progress and update counters and status information. If, however, the combine function is used, it has the
same form as the reduce function and the output is fed to the reduce function. This may be illustrated as follows:

map: (K1, V1) -> list (K2, V2)

combine: (K2, list(V2)) -> list (K2, V2)

reduce: (K2, list(V2)) -> list (K3, V3)

Note that the combine and reduce functions use the same type, except in the variable names where K3 is K2 and
V3 is V2.

The partition function operates on the intermediate key-value types. It controls the partitioning of the keys of the
intermediate map outputs. The key derives the partition using a typical hash function. The total number of
partitions is the same as the number of reduce tasks for the job. The partition is determined only by the key
ignoring the value.

public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numberOfPartition);
}

This is the key essence of MapReduce types in short.
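
For illustration, a custom partitioner that mimics the default hash-based behaviour could look like the following minimal sketch against the older org.apache.hadoop.mapred API shown above; the class name WordPartitioner is a hypothetical placeholder.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each intermediate (word, count) pair to a reduce task based only on the key.
public class WordPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // No configuration needed for this simple example.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative,
        // then take the remainder to pick one of the reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}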

Input Formats
Hadoop has to accept and process a variety of formats, from text files to databases. A chunk of input, called an input
split, is processed by a single map. Each split is further divided into logical records given to the map to process as
key-value pairs. In the context of a database, the split means reading a range of tuples from an SQL table, as done by
the DBInputFormat, producing LongWritables containing record numbers as keys and DBWritables as values.
The Java API for input splits is as follows:

public interface InputSplit extends Writable {
    long getLength() throws IOException;
    String[] getLocations() throws IOException;
}

The InputSplit represents the data to be processed by a Mapper. It returns the length in bytes and has a reference
to the input data. It presents a byte-oriented view of the input, and it is the responsibility of the RecordReader of the
job to process this and present a record-oriented view. In most cases, we do not deal with InputSplit directly
because they are created by an InputFormat. It is the responsibility of the InputFormat to create the input splits
and divide them into records.

public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}

The JobClient invokes the getSplits() method with an appropriate number of split arguments. The number given is a
hint, as the actual number of splits may differ from the given number. Once the splits are calculated, they are sent to
the jobtracker. The jobtracker schedules map tasks for the tasktrackers using the storage locations. The tasktracker then
passes the split by invoking the getRecordReader() method on the InputFormat to get a RecordReader for the split.

The FileInputFormat is the base class for the file data source. It has the responsibility to identify the files that are
to be included as the job input and the definition for generating the split.

Hadoop also includes processing of unstructured data that often comes in textual format. The TextInputFormat is
the default InputFormat for such data.

The SequenceFileInputFormat handles binary inputs and reads sequences of binary key-value pairs.

Similarly, DBInputFormat provides the capability to read data from relational database using JDBC.

Output Formats

The output format classes are similar to their corresponding input format classes and work in the reverse direction.
For example, the TextOutputFormat is the default output format that writes records as plain text files; the keys and
values may be of any type, and it transforms them into strings by invoking the toString() method. Each key and
value is separated by a tab character, although this can be customized by manipulating the separator
property of the text output format.
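
For example, here is a minimal sketch of changing the separator to a comma. The property name below is the one used by newer Hadoop releases; older releases use mapred.textoutputformat.separator, so treat the exact name as version-dependent.

import org.apache.hadoop.mapred.JobConf;

public class SeparatorConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Write key,value instead of key<TAB>value.
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        System.out.println(conf.get("mapreduce.output.textoutputformat.separator"));
    }
}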

For binary output, there is SequenceFileOutputFormat to write a sequence of binary output to a file. Binary
outputs are particularly useful if the output becomes input to a further MapReduce job.

The output formats for relational databases and for HBase are handled by DBOutputFormat and TableOutputFormat.
DBOutputFormat sends the reduced output to a SQL table, while HBase's TableOutputFormat enables the MapReduce
program to work on the data stored in an HBase table and write its output to an HBase table.

Developing a MapReduce Application: Word Count Problem (Program)
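
Below is a minimal word count sketch against the older org.apache.hadoop.mapred API used throughout this unit. The input and output paths come from the command line, and the reducer is reused as a combiner; treat it as an illustrative sketch rather than the only possible implementation.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sums the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // local aggregation on each node
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

The program would typically be packaged into a jar and launched with the hadoop jar command, passing an HDFS input directory and a not-yet-existing output directory as the two arguments.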
