Module 2
Chapter 7
Essential Hadoop Tools
In this Chapter
• The Pig scripting tool is introduced as a way to quickly examine data both locally and on a Hadoop cluster.
• The Hive SQL-like query tool is explained using two examples.
• The Sqoop RDBMS tool is used to import and export data from MySQL to/from HDFS.
• The Flume streaming data transport utility is configured to capture weblog data into HDFS.
• The Oozie workflow manager is used to run basic and complex Hadoop workflows.
• The distributed HBase database is used to store and access data on a Hadoop cluster.
Using Apache Pig
• Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
• Pig defines a set of transformations on a data set such as aggregate, join, and sort.
• Pig is often used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
• Apache Pig has several usage modes.
• The first is a local mode, in which all processing is done on the local machine.
• The non-local modes are MapReduce and Tez. These modes execute the job on the cluster using either the MapReduce engine or the optimised Tez engine.
• There are also interactive and batch modes available. They enable Pig applications to be developed locally in interactive mode, using small amounts of data, and then run at scale on the cluster in production mode.
Pig Example Walk-Through
• To begin the example, copy the passwd file to a working directory for local Pig operation:
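As a sketch, assuming the system password file /etc/passwd is used as the sample input (any colon-delimited text file would work), the copy looks like this; if the MapReduce or Tez modes are used later, the file also needs to be placed in HDFS:

$ cp /etc/passwd .
$ hdfs dfs -put passwd passwd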
• In the following example of local Pig operation, all processing is done on the local machine.
• First, the interactive command line is started:
$ pig -x local
• If Pig starts correctly, you will see a grunt> prompt.
• Enter the following commands to load the passwd file, grab the user name, and dump it to the terminal.
• Note that Pig commands must end with a semicolon (;).
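A minimal sketch of such a session, assuming the colon-delimited passwd file copied above (the relation names A and B and the field alias id are chosen only for illustration):

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;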
• The processing will start and a list of user names will be printed to the screen.
• To exit the interactive session, enter the command quit:
grunt> quit
• To use Hadoop MapReduce, start Pig as follows:
$ pig -x mapreduce
• The Tez engine can be used as follows:
$ pig -x tez
• Pig can also be run from a script.
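A minimal sketch of such a script, assuming it reproduces the interactive example above and is saved as id.pig (the file name and the output directory id.out are assumptions):

/* id.pig: extract user names from the passwd file */
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';

The script can then be run locally with:
$ pig -x local id.pig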
Using Apache Hive
• Hive is considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop and offers the following features.
• Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters.
• Hive makes it possible for programmers who are familiar with the MapReduce framework to add their custom mappers and reducers to Hive queries.
Hive Example Walk-Through
• Create a table using the following command:
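As an illustration, a minimal table definition in the Hive shell (the table name pokes and the columns foo and bar are assumptions used only for this sketch):

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;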
• To exit Hive, simply type exit:
hive> exit;
A More Advanced Hive Example
• In this example, 100,000 records will be transformed from userid, movieid, rating, unixtime to userid, movieid, rating, and weekday using Apache Hive and a Python program.
• The first step is to download and extract the data:
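A sketch of this step, assuming the data set is the GroupLens MovieLens 100k archive (the download URL is an assumption and may change):

$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
$ cd ml-100k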
• Next, we will create a short Python program called weekday_mapper.py.
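A minimal sketch of what weekday_mapper.py needs to do: read tab-separated userid, movieid, rating, unixtime records from standard input and replace the Unix timestamp with the day of the week (1 = Monday through 7 = Sunday):

import sys
import datetime

# Read userid, movieid, rating, unixtime from stdin and
# emit userid, movieid, rating, weekday.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))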
• Start Hive and create the data table with the command:
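A sketch of a table definition matching the tab-separated u.data file (userid, movieid, rating, unixtime):

hive> CREATE TABLE u_data (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;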
• Load the movie data into the table with the command:
hive> LOAD DATA LOCAL INPATH './u.data' OVERWRITE INTO TABLE u_data;
• The number of rows in the table can be reported by entering the following command:
hive> SELECT COUNT(*) FROM u_data;
• The next command adds weekday_mapper.py to the Hive resources:
hive> ADD FILE weekday_mapper.py;
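The transform itself can then be sketched as follows, assuming a new table u_data_new is created and populated by streaming each row through the Python mapper (the final query below reads from u_data_new):

hive> CREATE TABLE u_data_new (
        userid INT,
        movieid INT,
        rating INT,
        weekday INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t';

hive> INSERT OVERWRITE TABLE u_data_new
      SELECT TRANSFORM (userid, movieid, rating, unixtime)
        USING 'python weekday_mapper.py'
        AS (userid, movieid, rating, weekday)
      FROM u_data;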
• The final query will sort and group the reviews by weekday:
hive> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
Using Apache Sqoop to Acquire Relational Data
• Apache Sqoop provides import and export methods for moving data between relational databases such as MySQL and HDFS.
Sqoop Example Walk-Through
• The following steps will be performed:
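At the core of the walk-through are a Sqoop import from MySQL into HDFS and a Sqoop export back into MySQL. A sketch of both commands, assuming a MySQL database named test, tables named countries and countries_copy, and an HDFS directory /user/hdfs/countries (all of these names, the host, and the credentials are assumptions):

$ sqoop import --connect jdbc:mysql://localhost/test \
      --username sqoop --password sqoop \
      --table countries --target-dir /user/hdfs/countries -m 1

$ sqoop export --connect jdbc:mysql://localhost/test \
      --username sqoop --password sqoop \
      --table countries_copy --export-dir /user/hdfs/countries -m 1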
Using Apache Flume to Acquire Data Streams
• Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
• Often data transport involves a number of Flume agents that may traverse a series of machines and locations.
• Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.
• A Flume agent is composed of three components:
Source: The source component receives the data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source or another Flume agent.
Channel: A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input and output flow rates.
Sink: The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
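A minimal sketch of an agent definition that wires these three components together, modeled on the standard Flume netcat-to-logger test configuration (the agent name a1, the component names r1, c1, and k1, and port 44444 are assumptions):

# simple-example.conf: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source listening on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# in-memory channel acting as the buffer between source and sink
a1.channels.c1.type = memory

# logger sink writes received events to the Flume log
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1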
Flume Example Walk-Through
• Step 1: Download and Install Apache Flume
• Step 2: Simple Test
• Step 3: Weblog Example
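For the simple test (Step 2), an agent can be started against a configuration file such as the sketch above and exercised from a second terminal; the file name simple-example.conf and the agent name a1 are assumptions:

$ flume-ng agent --conf conf --conf-file simple-example.conf --name a1
$ telnet localhost 44444

Lines typed into the telnet session should then appear as events in the Flume agent's log output.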
Manage Hadoop Workflows with Apache Oozie
• Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs.
• For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as input for a successive job.
• Oozie is designed to construct and manage these workflows.
• Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions.
• Three types of Oozie jobs are permitted:
• Workflow: a specified sequence of Hadoop jobs with outcome-based decision points and control dependency. Progress from one action to another cannot happen until the first action is complete.
• Coordinator: a scheduled workflow job that can run at various time intervals or when data become available.
• Bundle: a higher-level Oozie abstraction that will batch a set of coordinator jobs.
• An Oozie workflow has several types of nodes, as illustrated in the sketch below.
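A minimal sketch of a workflow definition showing the common node kinds: a start control node, an action node that runs a Hadoop job, a kill node for the error path, and an end node (the workflow name example-wf, the action name run-job, and the properties jobTracker and nameNode are assumptions, and the MapReduce configuration details are omitted):

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <!-- control node: entry point of the DAG -->
    <start to="run-job"/>
    <!-- action node: runs a single MapReduce job (configuration omitted) -->
    <action name="run-job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- control node: abort the workflow if the action fails -->
    <kill name="fail">
        <message>The MapReduce job failed</message>
    </kill>
    <!-- control node: normal completion of the workflow -->
    <end name="end"/>
</workflow-app>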
Oozie Example Walk-Through
• Step 1: Download Oozie Examples
• Step 2: Run the Simple MapReduce Example
• Step 3: Run the Oozie Application
Oozie Job Commands
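A sketch of the commonly used oozie job sub-commands, assuming the Oozie server is reachable at http://localhost:11000/oozie and using a placeholder job ID:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
$ oozie job -oozie http://localhost:11000/oozie -info <job-id>
$ oozie job -oozie http://localhost:11000/oozie -log <job-id>
$ oozie job -oozie http://localhost:11000/oozie -suspend <job-id>
$ oozie job -oozie http://localhost:11000/oozie -resume <job-id>
$ oozie job -oozie http://localhost:11000/oozie -kill <job-id>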
Using Apache HBase
• Apache HBase is an open source, distributed, versioned, nonrelational database modeled after Google's Bigtable.
• Like Bigtable, HBase leverages the distributed data storage provided by the underlying distributed file systems spread across commodity servers.
• Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
• Some of the more important features include the following capabilities:
1. Linear and modular scalability
2. Strictly consistent reads and writes
3. Automatic and configurable sharding of tables
4. Automatic failover support between RegionServers
5. Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
6. Easy-to-use Java API for client access
HBase Data Model Overview
• A table in HBase is similar to other databases, having rows and columns.
• Columns in HBase are grouped into column families, all with the same prefix.
• For example, consider a table of daily stock prices. There may be a column family called "price" that has four members: price:open, price:close, price:low, and price:high.
• A column does not need to be a family. For instance, the stock table may have a column named "volume" indicating how many shares were traded.
• All column family members are stored together in the physical file system.
• Specific HBase cell values are identified by a row key, column (column family and column), and version (timestamp).
• It is possible to have many versions of data within an HBase cell.
• A version is specified as a timestamp and is created each time data are written to a cell.
• Almost anything can serve as a row key, from strings to binary representations of longs to serialized data structures.
• Rows are lexicographically sorted, with the lowest order appearing first in a table.
• The empty byte array denotes both the start and the end of a table's namespace.
• All table accesses are via the table row key, which is considered its primary key.
HBase Example Walk-Through
• To enter the shell, type the following as a user:
$ hbase shell
hbase(main):001:0>
• To exit the shell, type exit.
• The status command provides the system status:
hbase(main):001:0> status
4 servers, 0 dead, 1.0000 average load
• In the example that follows, we will use a small set of daily stock price data for Apple Computer:

Date       Open     High     Low      Close    Volume
6-May-15   126.56   126.75   123.36   125.01   71820387
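A sketch of how such a table could be created and queried in the HBase shell, assuming the table is named 'apple', with a column family 'price' for the prices and a column family 'vol' for the volume, and the date used as the row key (all of these names are assumptions):

hbase(main):002:0> create 'apple', 'price', 'vol'
hbase(main):003:0> put 'apple', '6-May-15', 'price:open', '126.56'
hbase(main):004:0> put 'apple', '6-May-15', 'price:high', '126.75'
hbase(main):005:0> put 'apple', '6-May-15', 'price:low', '123.36'
hbase(main):006:0> put 'apple', '6-May-15', 'price:close', '125.01'
hbase(main):007:0> put 'apple', '6-May-15', 'vol:', '71820387'
hbase(main):008:0> get 'apple', '6-May-15'
hbase(main):009:0> scan 'apple'

The table can later be removed by disabling and then dropping it:
hbase(main):010:0> disable 'apple'
hbase(main):011:0> drop 'apple'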
Apache HBase Web Interface
• HBase provides a web-based interface that can be used to monitor the status of the HBase servers and tables in the cluster.