Pig

Apache Pig is a high-level scripting platform designed for processing and analyzing large datasets, particularly useful for those without Java knowledge. It simplifies the development of data processing tasks by allowing users to write scripts in Pig Latin, which are then converted into MapReduce jobs for execution on Hadoop clusters. Pig supports various operations like LOAD, GROUP, JOIN, and provides two execution modes: local and MapReduce, making it versatile for different data processing needs.


PIG Basics

 Pig is a scripting platform for processing and analyzing large data sets.
 It is very useful for people who do not have Java knowledge.
 It is used for high-level data flow and for processing the data available on HDFS.
 Pig is named after the animal because, like the animal, it can consume and process any type of data; it also has lots of use in data cleansing.
 Whatever you write in Pig is internally converted to MapReduce (MR) jobs.
 Pig is a client-side installation; it need not sit on the Hadoop cluster.
 A Pig script executes a set of commands, which are converted to MapReduce (MR) jobs and submitted to Hadoop running locally or remotely.
 A Hadoop cluster does not care whether a job was submitted from Pig or from some other environment.
 MapReduce jobs get executed only when the DUMP or STORE command is called (more on this later), as the two-line illustration below shows.
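This lazy execution can be illustrated in two lines (a sketch with a hypothetical file name):

A = LOAD 'data.txt' USING PigStorage(); -- only declares the load; no MR job runs yet
DUMP A; -- this statement is what triggers the actual MapReduce job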

Pig vs. Traditional Hadoop MapReduce (MR)

 A lot of effort is required to write MapReduce jobs in Hadoop; in Pig, the effort required is far less.
 In Hadoop, we have to write a ToolRunner, Mapper, and Reducer; in Pig, none of this is mandatory, as we just write a small script (a set of commands).
 Hadoop MapReduce has more functionality than Pig.
 Since in Pig we just write the script, and not separate ToolRunner, Mapper, and Reducer classes, the development effort while using Pig is much lower.
 Pig is slightly slower than a hand-written MR job.

Components of Pig
 Pig execution environment
 It is essentially the Hadoop cluster where the Pig script is submitted to run.
 It can be a local or a remote Hadoop cluster.
 Pig Latin
 A new language, which is compiled to MapReduce (MR) jobs.
 It increases productivity, as fewer lines of code are required.
 It is good for non-Java programmers.
 It provides operations like JOIN, GROUP, FILTER, and SORT, which would require a lot of code in plain Hadoop.
 It is a data flow language rather than a procedural language.

Data flow in Pig


 LOAD the data from HDFS into the Pig program.
 The data is transformed into an appropriate format, perhaps by GROUP, JOIN, FILTER, by combining two files, or by any other built-in function.
 DUMP the data to the screen, or STORE the data somewhere.

Pig Execution Modes


 Local mode
 pig -x local
 Enters the default shell, named grunt, against the local file system.
 MapReduce mode
 pig
 Enters MapReduce mode, submitting jobs to the Hadoop cluster (see the quick example below).
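A quick way to try this out (a sketch; grunt> is the prompt Pig shows once the shell starts, and myserver.log is the hypothetical file from the example below):

pig -x local
grunt> A = LOAD 'myserver.log' USING PigStorage();
grunt> DUMP A;
grunt> quit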

Pig Latin Example


 A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
 A = LOAD 'myserver.log' USING PigStorage();
 B = GROUP A BY ipaddress;
 C = FOREACH B GENERATE group, COUNT(A); -- after a GROUP, the grouping key is referred to as group
 STORE C INTO 'output.txt';
 DUMP C;

Terminology
 atom : any single value is called an atom.
 tuple : an ordered collection of atoms, e.g. (123, abc, xyz)
 bag : a collection of tuples, e.g. {(123, abc, xyz), (sdksjd, 122, skd)} (see the DESCRIBE sketch below)
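These structures can be seen by running DESCRIBE on a grouped relation. Using B from the Pig Latin example above, the output would look roughly like this (a sketch, assuming the three-field schema from the first LOAD):

grunt> DESCRIBE B;
B: {group: chararray, A: {(ipaddress: chararray, timestamp: int, url: chararray)}}

Each row of B is a tuple whose second field is a bag holding all of A's tuples for that ipaddress.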

Transformations in Pig

 SAMPLE
 to get a random subset of the data in a dataset.
 x = SAMPLE c 0.01; -- approximately 1% of c goes into x
 LIMIT
 to limit the number of records.
 x = LIMIT c 3;
 gets only 3 records from c and puts them in x.
 (it can fetch any random 3, and not the exact same set of records every time)
 ORDER
 to sort the data by a column in ascending or descending order.
 x = ORDER c BY f1 ASC;
 sorts c by column f1 in ascending order.
 JOIN
 to join two or more datasets into a single dataset.
 x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
 GROUP
 used to group the dataset based on a field.
 B = GROUP A BY age;
 UNION
 combination of two or more data sets.
 a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
 b = LOAD 'file2.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
 c = UNION a, b;
 UNION works only when both datasets have the same number of columns, with the same datatype in each column (hence b is declared with the same field layout as a).
 d = DISTINCT c; -- removes duplicate tuples
 f = FILTER c BY field1 > 3; -- keeps only tuples where field1 > 3 (a combined sketch follows this list)
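A short sketch chaining several of these operators end to end, assuming the comma-separated file1.txt with three integer columns declared above:

a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
g = FILTER a BY field1 > 3; -- keep only rows with field1 greater than 3
o = ORDER g BY field1 DESC; -- sort the filtered rows, largest field1 first
t = LIMIT o 3;              -- take the top 3 rows
DUMP t;                     -- only now does a MapReduce job actually run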
Pig Usage
 processing of logs generated from servers.
 data processing for search platforms.
 ad hoc queries across large datasets.

What is PIG?
Pig is a high-level programming language useful for analyzing large data sets.
Pig was the result of a development effort at Yahoo!

In a MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.

Apache Pig enables people to focus more on analyzing bulk data sets and to spend less time writing MapReduce programs. Similar to pigs, which eat anything, the Pig programming language is designed to work with any kind of data. That's why the name, Pig!
Pig Architecture
Pig consists of two components:

1. Pig Latin, which is a language
2. A runtime environment, for running Pig Latin programs.

A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation by the Pig execution environment. Underneath, the results of these transformations are a series of MapReduce jobs of which the programmer is unaware. So, in a way, Pig allows the programmer to focus on the data rather than on the nature of execution.

Pig Latin is a relatively rigid language which uses familiar keywords from data processing, e.g., Join, Group and Filter.

Execution modes:
Pig has two execution modes:

1. Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for the analysis of small datasets using Pig.
2. MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (the cluster may be pseudo- or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large datasets.

How to Download and Install Pig


Before we start with the actual process, ensure you have Hadoop installed. Change the user to 'hduser' (the ID used during Hadoop configuration; you can switch to whichever user ID you used during your Hadoop config).

Step 1) Download the latest stable release of Pig from one of the mirror sites available at: http://pig.apache.org/releases.html

Select the tar.gz file (and not src.tar.gz) to download.

Step 2) Once the download is complete, navigate to the directory containing the downloaded tar file and move the tar to the location where you want to set up Pig. In this case, we will move it to /usr/local.

Move to the directory containing the Pig files:

cd /usr/local

Extract contents of tar file as below

sudo tar -xvf pig-0.12.1.tar.gz

Step 3) Modify ~/.bashrc to add Pig-related environment variables.

Open the ~/.bashrc file in any text editor of your choice and make the below modifications:

export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
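For example, if Pig was extracted under /usr/local as in Step 2, these lines would be (the exact directory name depends on the version you downloaded):

export PIG_HOME=/usr/local/pig-0.12.1
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH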
Step 4) Now, source this environment configuration using the below command

. ~/.bashrc

Step 5) We need to recompile PIG to support Hadoop 2.2.0

Here are the steps to do this-

Go to PIG home directory

cd $PIG_HOME

Install Ant
sudo apt-get install ant

Note: The download will start and will take time depending on your internet speed.

Recompile PIG

sudo ant clean jar-all -Dhadoopversion=23

Please note that in this recompilation process multiple components are downloaded, so the system should be connected to the internet.

Also, in case this process gets stuck and you don't see any movement on the command prompt for more than 20 minutes, press Ctrl + C and rerun the same command.

In our case, it took 20 minutes.

Step 6) Test the Pig installation using the command

pig -help
Example Pig Script
We will use Pig to find the number of products sold in each country.

Input: Our input data set is a CSV file, SalesJan2009.csv

Step 1) Start Hadoop


$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
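As a quick sanity check (not part of the original steps), the jps command from the JDK lists the running Java processes:

jps

The output should include processes such as NameNode, DataNode, ResourceManager and NodeManager.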

Step 2) Pig takes a file from HDFS in MapReduce mode and stores the
results back to HDFS.

Copy the file SalesJan2009.csv (stored on the local file system at ~/input/SalesJan2009.csv) to the HDFS (Hadoop Distributed File System) home directory.

Here the file is in the folder input. If the file is stored in some other location, give that path instead.

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /

Verify whether the file was actually copied or not.

$HADOOP_HOME/bin/hdfs dfs -ls /

Step 3) Pig Configuration

First, navigate to $PIG_HOME/conf

cd $PIG_HOME/conf
sudo cp pig.properties pig.properties.original
Open pig.properties using a text editor of your choice, and specify the log file path using the pig.logfile property.

sudo gedit pig.properties

The logger will make use of this file to log errors.
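The entry could look like the following (the path here is just an assumption; any location writable by the Pig user works):

pig.logfile=/home/hduser/pig.log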

Step 4) Run the command 'pig', which will start the Pig command prompt, an interactive shell for Pig queries.

pig

Step 5) In the Grunt command prompt for Pig, execute the below Pig commands in order.

A. Load the file containing data.

salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray, Name:chararray, City:chararray, State:chararray, Country:chararray, Account_Created:chararray, Last_Login:chararray, Latitude:chararray, Longitude:chararray);

Press Enter after this command.

-- B. Group data by field Country

GroupByCountry = GROUP salesTable BY Country;


-- C. For each tuple in 'GroupByCountry', generate the resulting string of the form -> Name of Country: No. of products sold

CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));

Press Enter after this command.

-- D. Store the results of the data flow in the directory 'pig_output_sales' on HDFS

STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
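To preview the result on screen instead of writing it to HDFS, DUMP can be used in place of STORE:

DUMP CountByCountry;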

This command will take some time to execute. Once done, the output directory pig_output_sales will contain the results.
Step 6) The result can be seen through the command interface as:

$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000

Results can also be seen via a web interface:

Open http://localhost:50070/ in a web browser.

Now select 'Browse the filesystem' and navigate to /user/hduser/pig_output_sales.

Open part-r-00000.
