PIG Basics
Pig is a scripting platform for processing and analysing large data sets.
very useful for people who do not have Java knowledge.
used for high-level data flow and for processing data available on HDFS.
Pig is named after the animal because, like a pig, it can consume and process any type of data; it also sees a lot of use in data cleansing.
Internally, whatever you write in Pig is converted to MapReduce (MR) jobs.
Pig is a client-side installation; it need not sit on the Hadoop cluster.
A Pig script executes a set of commands, which are converted to MapReduce (MR) jobs
and submitted to Hadoop running locally or remotely.
A Hadoop cluster does not care whether a job was submitted from Pig or from some other
environment.
MapReduce jobs get executed only when the DUMP or STORE command is called (more on this
later).
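A minimal sketch of this lazy behaviour; the file name and fields here are hypothetical:

-- defining relations only builds a logical plan; no MR job runs yet
logs = LOAD 'access.log' USING PigStorage('\t') AS (ip:chararray, url:chararray);
errors = FILTER logs BY url MATCHES '.*error.*';
-- only this line triggers compilation to MR jobs and execution
DUMP errors;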
Pig vs Traditional Hadoop MapReduce (MR)
a lot of effort is required to write MapReduce jobs in Hadoop, but in Pig the effort required is much less.
in Hadoop, we have to write a ToolRunner, Mapper, and Reducer, but in Pig nothing like this is mandatory, as we just
write a small script (a set of commands).
Hadoop MapReduce has more functionality than Pig.
since in Pig we only write the script, and not a separate ToolRunner, Mapper, Reducer, etc., the
development effort while using Pig is much less.
Pig is slightly slower than a hand-written MR job.
Components of PIG
pig execution environment
it is essentially the Hadoop cluster where the Pig script is submitted to run.
it can be a local or a remote Hadoop cluster.
pig latin
a new language, which is compiled to MapReduce (MR) jobs.
increases productivity, as fewer lines of code are required.
good for non-Java programmers.
provides operations like join, group, filter, and sort out of the box, whereas in Hadoop we would need to write a lot of code for a join etc. (see the sketch below).
a data flow language rather than a procedural language.
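A minimal Pig Latin sketch of these built-in operations; the file names and fields are hypothetical:

users = LOAD 'users.txt' USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (uid:int, amount:double);
big = FILTER orders BY amount > 100.0;     -- filter
joined = JOIN users BY id, big BY uid;     -- join
grouped = GROUP joined BY name;            -- group
totals = FOREACH grouped GENERATE group, SUM(joined.amount);
sorted = ORDER totals BY $1 DESC;          -- sort
DUMP sorted;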
Data flow in Pig
LOAD the data from HDFS into the Pig program.
the data is transformed into the appropriate format, maybe by GROUP, JOIN, FILTER, a
combination of two files, or any other built-in function.
DUMP the data to screen or STORE the data somewhere.
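Putting the three stages together, a minimal sketch (file name and schema hypothetical):

A = LOAD '/data/events.txt' USING PigStorage(',') AS (user:chararray, score:int);  -- 1. load
B = FILTER A BY score > 50;                                                        -- 2. transform
STORE B INTO '/data/high_scores';  -- 3. store (or DUMP B; to print to screen)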
Pig Execution Modes
local mode
pig -x local
enters the default interactive shell, named grunt, running against the local file system
map reduce mode
pig
enters the grunt shell in MapReduce mode, running against the Hadoop cluster
Pig Latin Example
A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
A = LOAD 'myserver.log' USING PigStorage(); -- the schema (AS clause) is optional
B = GROUP A BY ipaddress;
C = FOREACH B GENERATE group, COUNT(A); -- 'group' holds the ipaddress value of each group
STORE C INTO 'output.txt';
DUMP C;
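With hypothetical log data, DUMP C would print one tuple per IP address, along the lines of:

(10.0.0.1,42)
(10.0.0.2,17)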
Terminology
atom : any single value is called an atom.
tuple : an ordered collection of atoms, e.g. (123, abc, xyz)
bag : a collection of tuples, e.g. {(123,abc,xyz), (sdksjd,122,skd)}
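These types nest: GROUP, for example, produces a relation whose second field is a bag. A minimal sketch (file and schema hypothetical):

A = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
B = GROUP A BY age;
DESCRIBE B;
-- prints something like: B: {group: int,A: {(name: chararray,age: int)}}
-- i.e. each row of B holds a bag of A's tuples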
Transformations in Pig
SAMPLE
to get a random sample of records from a dataset.
x = SAMPLE c 0.01; -- approximately 1% of c goes into x
LIMIT
to limit the number of records.
x = LIMIT c 3;
gets only 3 records from c and puts them in x.
(may fetch any random 3 records, not the exact same set every time)
ORDER
to sort the records in ascending or descending order of a column.
x = ORDER c BY f1 ASC;
sorts c by column f1 in ascending order.
JOIN
to join two or more datasets into a single dataset.
x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
GROUP
used to group the dataset based on a field.
B = GROUP A BY age;
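A common follow-up is to aggregate each group; a minimal sketch, assuming relation A has an age field as above:

C = FOREACH B GENERATE group AS age, COUNT(A) AS n; -- one row per distinct age
DUMP C;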
UNION
combination of two or more data sets.
a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
b = LOAD 'file2.txt' USING PigStorage(',') AS (anotherfield1:int, anotherfield2:int, anotherfield3:int);
c = UNION a, b; -- UNION works only if both relations have the same number of fields, with the same datatype in each column.
d = DISTINCT c; -- removes duplicate tuples
f = FILTER c BY $0 > 3; -- filter by position here, since the two unioned schemas used different field names
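A quick hypothetical trace: if file1.txt contains the rows 1,2,3 and 4,5,6, and file2.txt contains 1,2,3 and 7,8,9, then c holds all four tuples, d holds three (the duplicate (1,2,3) is dropped), and f keeps (4,5,6) and (7,8,9).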
Pig Usage
processing of logs generated by servers.
data processing for search platforms.
ad hoc queries across large clusters.
What is PIG?
Pig is a high-level programming language useful for analyzing large data sets.
Pig was the result of a development effort at Yahoo!
In a MapReduce framework, programs need to be translated into a series of
Map and Reduce stages. However, this is not a programming model which
data analysts are familiar with. So, in order to bridge this gap, an abstraction
called Pig was built on top of Hadoop.
Apache Pig enables people to focus more on analyzing bulk data sets and
to spend less time writing Map-Reduce programs. Similar to pigs, which eat
anything, the Pig programming language is designed to work upon any kind of
data. That's why the name, Pig!
Pig Architecture
Pig consists of two components:
1. Pig Latin, which is a language
2. A runtime environment, for running Pig Latin programs.
A Pig Latin program consists of a series of operations or transformations
which are applied to the input data to produce output. These operations
describe a data flow which is translated into an executable representation, by
Pig execution environment. Underneath, results of these transformations are
series of MapReduce jobs which a programmer is unaware of. So, in a way,
Pig allows the programmer to focus on data rather than the nature of
execution.
Pig Latin is a relatively simple language which uses familiar keywords from
data processing, e.g., Join, Group and Filter.
Execution modes:
Pig has two execution modes:
1. Local mode: In this mode, Pig runs in a single JVM and makes use of
local file system. This mode is suitable only for analysis of small
datasets using Pig
2. MapReduce mode: In this mode, queries written in Pig Latin are
translated into MapReduce jobs and are run on a Hadoop cluster
(the cluster may be pseudo- or fully distributed). MapReduce mode with a
fully distributed cluster is useful for running Pig on large datasets.
How to Download and Install Pig
Before we start with the actual process, ensure you have Hadoop installed.
Change user to 'hduser' (the user id used during your Hadoop configuration).
Step 1) Download the latest stable release of Pig from one of the mirror
sites available at: http://pig.apache.org/releases.html
Select tar.gz (and not src.tar.gz) file to download.
Step 2) Once the download is complete, navigate to the directory containing the
downloaded tar file and move the tar to the location where you want to set up
Pig. In this case, we will move it to /usr/local
Move to a directory containing Pig Files
cd /usr/local
Extract contents of tar file as below
sudo tar -xvf pig-0.12.1.tar.gz
Step 3) Modify ~/.bashrc to add Pig-related environment variables
Open ~/.bashrc file in any text editor of your choice and do below
modifications-
export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
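For example, if Pig was extracted to /usr/local/pig-0.12.1 as in Step 2 (adjust the path to match your actual install directory), the entries would be:

export PIG_HOME=/usr/local/pig-0.12.1
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH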
Step 4) Now, source this environment configuration using below command
. ~/.bashrc
Step 5) We need to recompile PIG to support Hadoop 2.2.0
Here are the steps to do this-
Go to PIG home directory
cd $PIG_HOME
Install Ant
sudo apt-get install ant
Note: the download will start and will take time depending on your internet speed.
Recompile PIG
sudo ant clean jar-all -Dhadoopversion=23
Please note that multiple components are downloaded during this
recompilation, so the system must be connected to the internet.
Also, if the process gets stuck and you see no movement on the command
prompt for more than 20 minutes, press Ctrl + C and rerun the same command.
In our case, it took about 20 minutes.
Step 6) Test the Pig installation using the command
pig -help
Example Pig Script
We will use PIG to find the Number of Products Sold in Each Country.
Input: Our input data set is a CSV file, SalesJan2009.csv
Step 1) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 2) In MapReduce mode, Pig reads a file from HDFS and stores the
results back to HDFS.
Copy file SalesJan2009.csv (stored on local file
system, ~/input/SalesJan2009.csv) to HDFS (Hadoop Distributed File
System) Home Directory
Here the file is in the folder input. If the file is stored in some other location, give
that path instead.
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /
Verify whether the file was actually copied or not.
$HADOOP_HOME/bin/hdfs dfs -ls /
Step 3) Pig Configuration
First, navigate to $PIG_HOME/conf
cd $PIG_HOME/conf
sudo cp pig.properties pig.properties.original
Open pig.properties using a text editor of your choice, and specify the log file
path using the pig.logfile property.
sudo gedit pig.properties
The logger will use this file to log errors.
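For example, the entry might look like this (the path is an assumption; choose any writable location):

pig.logfile=/usr/local/pig-0.12.1/pig.log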
Step 4) Run the command 'pig', which starts the Pig command prompt, an
interactive shell for Pig queries.
pig
Step 5) In the Grunt command prompt for Pig, execute the below Pig commands in order.
-- A. Load the file containing data.
salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray, Name:chararray, City:chararray, State:chararray, Country:chararray, Account_Created:chararray, Last_Login:chararray, Latitude:chararray, Longitude:chararray);
Press Enter after this command.
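Optionally, verify the schema Pig has recorded for the relation (DESCRIBE is a standard Grunt command):

DESCRIBE salesTable;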
-- B. Group data by field Country
GroupByCountry = GROUP salesTable BY Country;
-- C. For each tuple in 'GroupByCountry', generate the resulting string of the
form -> Name of Country: No. of products sold. Here $0 is the group field (the
country) and $1 is the bag of grouped records, so COUNT($1) is the number of products sold.
CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
Press Enter after this command.
-- D. Store the results of Data Flow in the directory 'pig_output_sales' on
HDFS
STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
This command will take some time to execute. Once done, the output
directory will have been written to HDFS.
Step 6) Results can be seen through the command interface as,
$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000
Results can also be seen via a web interface as-
Open http://localhost:50070/ in a web browser.
Now select 'Browse the filesystem' and navigate
to /user/hduser/pig_output_sales
Open part-r-00000