Module 2
Chapter 7
Essential Hadoop Tools
In this Chapter
• The Pig scripting tool is introduced as a way to quickly examine data both locally and on a Hadoop cluster.
• The Hive SQL-like query tool is explained using two examples.
• The Sqoop RDBMS tool is used to import and export data from MySQL to/from HDFS.
• The Flume streaming data transport utility is configured to capture weblog data into HDFS.
• The Oozie workflow manager is used to run basic and complex Hadoop workflows.
• The distributed HBase database is used to store and access data on a Hadoop cluster.
Using Apache Pig
• Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
• Pig defines a set of transformations on a data set such as aggregate, join, and sort.
• Pig is often used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
• Apache Pig has several usage modes.
• The first is a local mode, in which all processing is done on the local machine.
• The non-local modes are MapReduce and Tez. These modes execute the job on the cluster using either the MapReduce engine or the optimised Tez engine.
• There are also interactive and batch modes available. They enable Pig applications to be developed locally in interactive mode, using small amounts of data, and then run at scale on the cluster in production mode.
Pig Example Walk-Through
• To begin the example, copy the passwd file to a working directory for local Pig operation:
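As a sketch, assuming the system password file /etc/passwd is used as the sample input (any colon-delimited text file would work), the copy looks like this; if the MapReduce or Tez modes are used later, the file also needs to be placed in HDFS:

$ cp /etc/passwd .
$ hdfs dfs -put passwd passwd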
• In the following example of local Pig operation, all processing is done on the local machine.
• First, the interactive command line is started:
$ pig -x local
• If Pig starts correctly, you will see a grunt> prompt.
• Enter the following commands to load the passwd file, grab the user name, and dump it to the terminal.
• Note that Pig commands must end with a semicolon (;).
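A minimal sketch of such a session, assuming the colon-delimited passwd file copied above (the relation names A and B and the field alias id are chosen only for illustration):

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;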
• The processing will start and a list of user names will be printed to the screen.
• To exit the interactive session, enter the command quit:
grunt> quit
• To use Hadoop MapReduce, start Pig as follows:
$ pig -x mapreduce
• The Tez engine can be used as follows:
$ pig -x tez
• Pig can also be run from a script.
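A minimal sketch of such a script, assuming it reproduces the interactive example above and is saved as id.pig (the file name and the output directory id.out are assumptions):

/* id.pig: extract user names from the passwd file */
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';

The script can then be run locally with:
$ pig -x local id.pig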
Using Apache Hive
• Hive is considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop and offers the following features.
• Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters.
• Hive makes it possible for programmers who are familiar with the MapReduce framework to add their custom mappers and reducers to Hive queries.
Hive Example Walk-Through
• Create a table using the following command:
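As an illustration, a minimal table definition in the Hive shell (the table name pokes and the columns foo and bar are assumptions used only for this sketch):

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;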
• To exit Hive, simply type exit:
hive> exit;
A More Advanced Hive Example
• In this example, 100,000 records will be transformed from userid, movieid, rating, unixtime to userid, movieid, rating, and weekday using Apache Hive and a Python program.
• The first step is to download and extract the data:
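A sketch of this step, assuming the data set is the GroupLens MovieLens 100k archive (the download URL is an assumption and may change):

$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
$ cd ml-100k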
• Next, we will create a short Python program called weekday_mapper.py.
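A minimal sketch of what weekday_mapper.py needs to do: read tab-separated userid, movieid, rating, unixtime records from standard input and replace the Unix timestamp with the day of the week (1 = Monday through 7 = Sunday):

import sys
import datetime

# Read userid, movieid, rating, unixtime from stdin and
# emit userid, movieid, rating, weekday.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))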
• Start Hive and create the data table with the command:
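A sketch of a table definition matching the tab-separated u.data file (userid, movieid, rating, unixtime):

hive> CREATE TABLE u_data (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;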
• Load the movie data into the table with the command:
hive> LOAD DATA LOCAL INPATH './u.data' OVERWRITE INTO TABLE u_data;
• The number of rows in the table can be reported by entering the following command:
hive> SELECT COUNT(*) FROM u_data;
• The next command adds weekday_mapper.py to the Hive resources:
hive> ADD FILE weekday_mapper.py;
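The transform itself can then be sketched as follows, assuming a new table u_data_new is created and populated by streaming each row through the Python mapper (the final query below reads from u_data_new):

hive> CREATE TABLE u_data_new (
        userid INT,
        movieid INT,
        rating INT,
        weekday INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t';

hive> INSERT OVERWRITE TABLE u_data_new
      SELECT TRANSFORM (userid, movieid, rating, unixtime)
        USING 'python weekday_mapper.py'
        AS (userid, movieid, rating, weekday)
      FROM u_data;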
• The final query will sort and group the reviews by weekday:
hive> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
Using Apache Sqoop to Acquire Relational Data
• Apache Sqoop provides import and export methods for moving data between relational databases such as MySQL and HDFS.
Sqoop Example Walk-Through
• The following steps will be performed:
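At the core of the walk-through are a Sqoop import from MySQL into HDFS and a Sqoop export back into MySQL. A sketch of both commands, assuming a MySQL database named test, tables named countries and countries_copy, and an HDFS directory /user/hdfs/countries (all of these names, the host, and the credentials are assumptions):

$ sqoop import --connect jdbc:mysql://localhost/test \
      --username sqoop --password sqoop \
      --table countries --target-dir /user/hdfs/countries -m 1

$ sqoop export --connect jdbc:mysql://localhost/test \
      --username sqoop --password sqoop \
      --table countries_copy --export-dir /user/hdfs/countries -m 1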
Using Apache Flume to Acquire Data Streams
• Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
• Often data transport involves a number of Flume agents that may traverse a series of machines and locations.
• Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.
• A Flume agent is composed of three components:
Source: The source component receives the data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source or another Flume agent.
Channel: A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input and output flow rates.
Sink: The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
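A minimal sketch of an agent definition that wires these three components together, modeled on the standard Flume netcat-to-logger test configuration (the agent name a1, the component names r1, c1, and k1, and port 44444 are assumptions):

# simple-example.conf: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source listening on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# in-memory channel acting as the buffer between source and sink
a1.channels.c1.type = memory

# logger sink writes received events to the Flume log
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1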
Flume Example Walk-Through
• Step 1: Download and Install Apache Flume
• Step 2: Simple Test
• Step 3: Weblog Example
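For the simple test (Step 2), an agent can be started against a configuration file such as the sketch above and exercised from a second terminal; the file name simple-example.conf and the agent name a1 are assumptions:

$ flume-ng agent --conf conf --conf-file simple-example.conf --name a1
$ telnet localhost 44444

Lines typed into the telnet session should then appear as events in the Flume agent's log output.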
Manage Hadoop Workflows with Apache Oozie
• Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs.
• For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as input for a successive job.
• Oozie is designed to construct and manage these workflows.
• Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions.
• Three types of Oozie jobs are permitted:
• Workflow: a specified sequence of Hadoop jobs with outcome-based decision points and control dependency. Progress from one action to another cannot happen until the first action is complete.
• Coordinator: a scheduled workflow job that can run at various time intervals or when data become available.
• Bundle: a higher-level Oozie abstraction that will batch a set of coordinator jobs.
• An Oozie workflow has several types of nodes, as illustrated in the sketch below.
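A minimal sketch of a workflow definition showing the common node kinds: a start control node, an action node that runs a Hadoop job, a kill node for the error path, and an end node (the workflow name example-wf, the action name run-job, and the properties jobTracker and nameNode are assumptions, and the MapReduce configuration details are omitted):

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <!-- control node: entry point of the DAG -->
    <start to="run-job"/>
    <!-- action node: runs a single MapReduce job (configuration omitted) -->
    <action name="run-job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- control node: abort the workflow if the action fails -->
    <kill name="fail">
        <message>The MapReduce job failed</message>
    </kill>
    <!-- control node: normal completion of the workflow -->
    <end name="end"/>
</workflow-app>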
Oozie Example Walk-Through
• Step 1: Download Oozie Examples
• Step 2: Run the Simple MapReduce Example
• Step 3: Run the Oozie Application
Oozie Job Commands
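A sketch of the commonly used oozie job sub-commands, assuming the Oozie server is reachable at http://localhost:11000/oozie and using a placeholder job ID:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
$ oozie job -oozie http://localhost:11000/oozie -info <job-id>
$ oozie job -oozie http://localhost:11000/oozie -log <job-id>
$ oozie job -oozie http://localhost:11000/oozie -suspend <job-id>
$ oozie job -oozie http://localhost:11000/oozie -resume <job-id>
$ oozie job -oozie http://localhost:11000/oozie -kill <job-id>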
Using Apache HBase
• Apache HBase is an open source, distributed, versioned, nonrelational database modeled after Google's Bigtable.
• Like Bigtable, HBase leverages the distributed data storage provided by the underlying distributed file systems spread across commodity servers.
• Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
• Some of the more important features include the following capabilities:
1. Linear and modular scalability
2. Strictly consistent reads and writes
3. Automatic and configurable sharding of tables
4. Automatic failover support between RegionServers
5. Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
6. Easy-to-use Java API for client access
HBase Data Model Overview
• A table in HBase is similar to other databases, having rows and columns.
• Columns in HBase are grouped into column families, all with the same prefix.
• For example, consider a table of daily stock prices. There may be a column family called "price" that has four members: price:open, price:close, price:low, and price:high.
• A column does not need to be a family. For instance, the stock table may have a column named "volume" indicating how many shares were traded.
• All column family members are stored together in the physical file system.
• Specific HBase cell values are identified by a row key, column (column family and column), and version (timestamp).
• It is possible to have many versions of data within an HBase cell.
• A version is specified as a timestamp and is created each time data are written to a cell.
• Almost anything can serve as a row key, from strings to binary representations of longs to serialized data structures.
• Rows are lexicographically sorted, with the lowest order appearing first in a table.
• The empty byte array denotes both the start and the end of a table's namespace.
• All table accesses are via the table row key, which is considered its primary key.
HBase Example Walk-Through
• To enter the shell, type the following as a user:
$ hbase shell
hbase(main):001:0>
• To exit the shell, type exit.
• The status command provides the system status:
hbase(main):001:0> status
4 servers, 0 dead, 1.0000 average load
• In the example that follows, we will use a small set of daily stock price data for Apple Computer:

Date       Open     High     Low      Close    Volume
6-May-15   126.56   126.75   123.36   125.01   71820387
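A sketch of how such a table could be created and queried in the HBase shell, assuming the table is named 'apple', with a column family 'price' for the prices and a column family 'vol' for the volume, and the date used as the row key (all of these names are assumptions):

hbase(main):002:0> create 'apple', 'price', 'vol'
hbase(main):003:0> put 'apple', '6-May-15', 'price:open', '126.56'
hbase(main):004:0> put 'apple', '6-May-15', 'price:high', '126.75'
hbase(main):005:0> put 'apple', '6-May-15', 'price:low', '123.36'
hbase(main):006:0> put 'apple', '6-May-15', 'price:close', '125.01'
hbase(main):007:0> put 'apple', '6-May-15', 'vol:', '71820387'
hbase(main):008:0> get 'apple', '6-May-15'
hbase(main):009:0> scan 'apple'

The table can later be removed by disabling and then dropping it:
hbase(main):010:0> disable 'apple'
hbase(main):011:0> drop 'apple'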
Apache HBase Web Interface
• HBase provides a web-based interface that can be used to monitor the status of the HBase servers and tables in the cluster.