PIG Basics
Pig is a scripting platform for processing and analysing large data sets.
very useful for people who do not have Java knowledge.
used for high-level data flow and for processing data available on HDFS.
Pig is named after the animal because, like a pig, it can consume and process any type of data; it also sees a lot of use in data cleansing.
Internally, whatever you write in Pig is converted to MapReduce (MR) jobs.
Pig is a client-side installation; it need not sit on the Hadoop cluster.
A Pig script executes a set of commands, which are converted to MapReduce (MR) jobs
and submitted to Hadoop running locally or remotely.
A Hadoop cluster does not care whether a job was submitted from Pig or from some other
environment.
MapReduce jobs get executed only when the DUMP or STORE command is called (more on this
later).
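A minimal sketch of this lazy behaviour; the file name and fields here are hypothetical:

-- defining relations only builds a logical plan; no MR job runs yet
logs = LOAD 'access.log' USING PigStorage('\t') AS (ip:chararray, url:chararray);
errors = FILTER logs BY url MATCHES '.*error.*';
-- only this line triggers compilation to MR jobs and execution
DUMP errors;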
Pig vs Traditional Hadoop MapReduce (MR)
a lot of effort is required to write MapReduce jobs in Hadoop, but in Pig the effort required is much less.
in Hadoop, we have to write a ToolRunner, Mapper, and Reducer, but in Pig nothing like this is mandatory, as we just
write a small script (a set of commands).
Hadoop MapReduce has more functionality than Pig.
since in Pig we only write the script, and not a separate ToolRunner, Mapper, Reducer, etc., the
development effort while using Pig is much less.
Pig is slightly slower than a hand-written MR job.
Components of PIG
pig execution environment
it is essentially the Hadoop cluster where the Pig script is submitted to run.
it can be a local or a remote Hadoop cluster.
pig latin
a new language, which is compiled to MapReduce (MR) jobs.
increases productivity, as fewer lines of code are required.
good for non-Java programmers.
provides operations like join, group, filter, and sort out of the box, whereas in Hadoop we would need to write a lot of code for a join etc. (see the sketch below).
a data flow language rather than a procedural language.
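A minimal Pig Latin sketch of these built-in operations; the file names and fields are hypothetical:

users = LOAD 'users.txt' USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (uid:int, amount:double);
big = FILTER orders BY amount > 100.0;     -- filter
joined = JOIN users BY id, big BY uid;     -- join
grouped = GROUP joined BY name;            -- group
totals = FOREACH grouped GENERATE group, SUM(joined.amount);
sorted = ORDER totals BY $1 DESC;          -- sort
DUMP sorted;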
Data flow in Pig
LOAD the data from HDFS into the Pig program.
the data is transformed into the appropriate format, maybe by GROUP, JOIN, FILTER, a
combination of two files, or any other built-in function.
DUMP the data to screen or STORE the data somewhere.
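Putting the three stages together, a minimal sketch (file name and schema hypothetical):

A = LOAD '/data/events.txt' USING PigStorage(',') AS (user:chararray, score:int);  -- 1. load
B = FILTER A BY score > 50;                                                        -- 2. transform
STORE B INTO '/data/high_scores';  -- 3. store (or DUMP B; to print to screen)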
Pig Execution Modes
local mode
pig -x local
enters the default interactive shell, named grunt, running against the local file system
map reduce mode
pig
enters the grunt shell in MapReduce mode, running against the Hadoop cluster
Pig Latin Example
A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
A = LOAD 'myserver.log' USING PigStorage(); -- the schema (AS clause) is optional
B = GROUP A BY ipaddress;
C = FOREACH B GENERATE group, COUNT(A); -- 'group' holds the ipaddress value of each group
STORE C INTO 'output.txt';
DUMP C;
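With hypothetical log data, DUMP C would print one tuple per IP address, along the lines of:

(10.0.0.1,42)
(10.0.0.2,17)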
Terminology
atom : any single value is called an atom.
tuple : an ordered collection of atoms, e.g. (123, abc, xyz)
bag : a collection of tuples, e.g. {(123,abc,xyz), (sdksjd,122,skd)}
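These types nest: GROUP, for example, produces a relation whose second field is a bag. A minimal sketch (file and schema hypothetical):

A = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
B = GROUP A BY age;
DESCRIBE B;
-- prints something like: B: {group: int,A: {(name: chararray,age: int)}}
-- i.e. each row of B holds a bag of A's tuples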
Transformations in Pig
SAMPLE
to get a random sample of records from a dataset.
x = SAMPLE c 0.01; -- approximately 1% of c goes into x
LIMIT
to limit the number of records.
x = LIMIT c 3;
gets only 3 records from c and puts them in x.
(may fetch any random 3 records, not the exact same set every time)
ORDER
to sort the records in ascending or descending order of a column.
x = ORDER c BY f1 ASC;
sorts c by column f1 in ascending order.
JOIN
to join two or more datasets into a single dataset.
x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
GROUP
used to group the dataset based on a field.
B = GROUP A BY age;
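A common follow-up is to aggregate each group; a minimal sketch, assuming relation A has an age field as above:

C = FOREACH B GENERATE group AS age, COUNT(A) AS n; -- one row per distinct age
DUMP C;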
UNION
combination of two or more data sets.
a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
b = LOAD 'file2.txt' USING PigStorage(',') AS (anotherfield1:int, anotherfield2:int, anotherfield3:int);
c = UNION a, b; -- UNION works only if both relations have the same number of fields, with the same datatype in each column.
d = DISTINCT c; -- removes duplicate tuples
f = FILTER c BY $0 > 3; -- filter by position here, since the two unioned schemas used different field names
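A quick hypothetical trace: if file1.txt contains the rows 1,2,3 and 4,5,6, and file2.txt contains 1,2,3 and 7,8,9, then c holds all four tuples, d holds three (the duplicate (1,2,3) is dropped), and f keeps (4,5,6) and (7,8,9).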
Pig Usage
processing of logs generated by servers.
data processing for search platforms.
ad hoc queries across large clusters.
What is PIG?
Pig is a high-level programming language useful for analyzing large data sets.
Pig was the result of a development effort at Yahoo!
In a MapReduce framework, programs need to be translated into a series of
Map and Reduce stages. However, this is not a programming model which
data analysts are familiar with. So, in order to bridge this gap, an abstraction
called Pig was built on top of Hadoop.
Apache Pig enables people to focus more on analyzing bulk data sets and
to spend less time writing Map-Reduce programs. Similar to pigs, which eat
anything, the Pig programming language is designed to work upon any kind of
data. That's why the name, Pig!
Pig Architecture
Pig consists of two components:
1. Pig Latin, which is a language
2. A runtime environment, for running Pig Latin programs.
A Pig Latin program consists of a series of operations or transformations
which are applied to the input data to produce output. These operations
describe a data flow which is translated into an executable representation, by
Pig execution environment. Underneath, results of these transformations are
series of MapReduce jobs which a programmer is unaware of. So, in a way,
Pig allows the programmer to focus on data rather than the nature of
execution.
Pig Latin is a relatively simple language which uses familiar keywords from
data processing, e.g., Join, Group and Filter.
Execution modes:
Pig has two execution modes:
1. Local mode: In this mode, Pig runs in a single JVM and makes use of
local file system. This mode is suitable only for analysis of small
datasets using Pig
2. MapReduce mode: In this mode, queries written in Pig Latin are
translated into MapReduce jobs and are run on a Hadoop cluster
(the cluster may be pseudo- or fully distributed). MapReduce mode with a
fully distributed cluster is useful for running Pig on large datasets.
How to Download and Install Pig
Before we start with the actual process, ensure you have Hadoop installed.
Change user to 'hduser' (the user id used during your Hadoop configuration).
Step 1) Download the latest stable release of Pig from one of the mirror
sites available at: http://pig.apache.org/releases.html
Select tar.gz (and not src.tar.gz) file to download.
Step 2) Once the download is complete, navigate to the directory containing the
downloaded tar file and move the tar to the location where you want to set up
Pig. In this case, we will move it to /usr/local
Move to a directory containing Pig Files
cd /usr/local
Extract contents of tar file as below
sudo tar -xvf pig-0.12.1.tar.gz
Step 3) Modify ~/.bashrc to add Pig-related environment variables
Open ~/.bashrc file in any text editor of your choice and do below
modifications-
export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
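For example, if Pig was extracted to /usr/local/pig-0.12.1 as in Step 2 (adjust the path to match your actual install directory), the entries would be:

export PIG_HOME=/usr/local/pig-0.12.1
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH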
Step 4) Now, source this environment configuration using below command
. ~/.bashrc
Step 5) We need to recompile PIG to support Hadoop 2.2.0
Here are the steps to do this-
Go to PIG home directory
cd $PIG_HOME
Install Ant
sudo apt-get install ant
Note: the download will start and will take time depending on your internet speed.
Recompile PIG
sudo ant clean jar-all -Dhadoopversion=23
Please note that multiple components are downloaded during this
recompilation, so the system must be connected to the internet.
Also, if the process gets stuck and you see no movement on the command
prompt for more than 20 minutes, press Ctrl + C and rerun the same command.
In our case, it took about 20 minutes.
Step 6) Test the Pig installation using the command
pig -help
Example Pig Script
We will use PIG to find the Number of Products Sold in Each Country.
Input: Our input data set is a CSV file, SalesJan2009.csv
Step 1) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 2) In MapReduce mode, Pig reads a file from HDFS and stores the
results back to HDFS.
Copy file SalesJan2009.csv (stored on local file
system, ~/input/SalesJan2009.csv) to HDFS (Hadoop Distributed File
System) Home Directory
Here the file is in the folder input. If the file is stored in some other location, give
that path instead.
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /
Verify whether the file was actually copied or not.
$HADOOP_HOME/bin/hdfs dfs -ls /
Step 3) Pig Configuration
First, navigate to $PIG_HOME/conf
cd $PIG_HOME/conf
sudo cp pig.properties pig.properties.original
Open pig.properties using a text editor of your choice, and specify the log file
path using the pig.logfile property.
sudo gedit pig.properties
The logger will use this file to log errors.
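For example, the entry might look like this (the path is an assumption; choose any writable location):

pig.logfile=/usr/local/pig-0.12.1/pig.log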
Step 4) Run the command 'pig', which starts the Pig command prompt, an
interactive shell for Pig queries.
pig
Step 5) In the Grunt command prompt for Pig, execute the below Pig commands in order.
-- A. Load the file containing data.
salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray, Name:chararray, City:chararray, State:chararray, Country:chararray, Account_Created:chararray, Last_Login:chararray, Latitude:chararray, Longitude:chararray);
Press Enter after this command.
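Optionally, verify the schema Pig has recorded for the relation (DESCRIBE is a standard Grunt command):

DESCRIBE salesTable;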
-- B. Group data by field Country
GroupByCountry = GROUP salesTable BY Country;
-- C. For each tuple in 'GroupByCountry', generate the resulting string of the
form -> Name of Country: No. of products sold. Here $0 is the group field (the
country) and $1 is the bag of grouped records, so COUNT($1) is the number of products sold.
CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
Press Enter after this command.
-- D. Store the results of Data Flow in the directory 'pig_output_sales' on
HDFS
STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
This command will take some time to execute. Once done, the output
directory will have been written to HDFS.
Step 6) Results can be seen through the command interface as,
$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000
Results can also be seen via a web interface as-
Open http://localhost:50070/ in a web browser.
Now select 'Browse the filesystem' and navigate
to /user/hduser/pig_output_sales
Open part-r-00000