3 Pig

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
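For a first taste of what the Pig Engine compiles, here is a minimal sketch of a Pig Latin script; the input path and schema are assumptions made for illustration only.

-- minimal sketch; /pig_data/access_log.txt and its schema are hypothetical
logs = LOAD '/pig_data/access_log.txt' USING PigStorage(',')
       AS (user:chararray, url:chararray, bytes:int);
big  = FILTER logs BY bytes > 1024;
DUMP big;   -- requesting the result is what triggers the generated MapReduce job(s)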

Apache Pig Architecture


The main reason programmers have started using Hadoop Pig is that it converts their scripts into a series of MapReduce tasks, making their job easier. The Pig Hadoop framework has four main components:
1. Parser: When a Pig Latin script is sent
to Hadoop Pig, it is first handled by the
parser. The parser is responsible for
checking the syntax of the script, along
with other miscellaneous checks. Parser
gives an output in the form of a Directed
Acyclic Graph (DAG) that contains Pig
Latin statements, together with other
logical operators represented as nodes.
2. Optimizer: After the output from the
parser is retrieved, a logical plan for
DAG is passed to a logical optimizer.
The optimizer is responsible for carrying
out the logical optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is received. The compiler compiles the optimized logical plan, which is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical
plan is converted to MapReduce jobs,
these jobs are sent to Hadoop in a
properly sorted order, and these jobs
are executed on Hadoop for yielding the
desired result.

Why Do We Need Apache Pig?


Programmers who are not proficient in Java normally struggle when working with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers.
● Using Pig Latin, programmers can
perform MapReduce tasks easily
without having to type complex codes in
Java.
● Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be done with as few as 10 LoC in Apache Pig; ultimately, Apache Pig reduces the development time by almost 16 times (see the sketch after this list).
● Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
● Apache Pig provides many built-in
operators to support data operations like
joins, filters, ordering, etc. In addition, it
also provides nested data types like
tuples, bags, and maps that are missing
from MapReduce.
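As a hedged illustration of the multi-query point above, the following short script loads, filters, groups, and aggregates a data set in a handful of statements; the file name and schema are assumptions, not taken from the text. A hand-written Java MapReduce equivalent would typically need separate mapper, reducer, and driver classes.

-- sketch only; /pig_data/sales.txt and its schema are hypothetical
sales  = LOAD '/pig_data/sales.txt' USING PigStorage(',')
         AS (year:int, product:chararray, quantity:int);
recent = FILTER sales BY year >= 2001;
byProd = GROUP recent BY product;
totals = FOREACH byProd GENERATE group AS product, SUM(recent.quantity) AS total_qty;
DUMP totals;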
Features of Pig
Apache Pig comes with the following
features −
● Rich set of operators − It provides
many operators to perform operations
like join, sort, filter, etc.
● Ease of programming − Pig Latin is
similar to SQL and it is easy to write a
Pig script if you are good at SQL.
● Optimization opportunities − The
tasks in Apache Pig optimize their
execution automatically, so the
programmers need to focus only on
semantics of the language.
● Extensibility − Using the existing
operators, users can develop their own
functions to read, process, and write
data.
● UDFs − Pig provides the facility to
create User-defined Functions in other
programming languages such as Java
and invoke or embed them in Pig
Scripts.
● Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS (a small nested-schema sketch follows this list).
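To make the nested data types mentioned above concrete, here is a hedged sketch of a load statement whose schema uses a tuple, a bag, and a map; the file and field names are hypothetical.

-- students.txt is a hypothetical file whose records carry nested fields
students = LOAD '/pig_data/students.txt'
           AS (name:chararray,
               address:tuple(city:chararray, pin:int),
               courses:bag{t:(course:chararray)},
               scores:map[int]);
-- FLATTEN un-nests the bag so each (name, course) pair becomes its own row
pairs = FOREACH students GENERATE name, FLATTEN(courses);
DUMP pairs;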

Apache Pig Vs SQL

Listed below are the major differences between Apache Pig and SQL.
● Pig Latin is a procedural language, whereas SQL is a declarative language.
● In Apache Pig, the schema is optional: we can store data without designing a schema (values are then referred to positionally as $0, $1, etc.). In SQL, the schema is mandatory.
● The data model in Apache Pig is nested relational, while the data model used in SQL is flat relational.
● Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.

Apache Pig Vs MapReduce

● Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.
● Pig Latin is a high-level language; MapReduce is low level and rigid.
● Performing a join operation in Apache Pig is pretty simple, while it is quite difficult in MapReduce to perform a join between datasets.
● Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
● Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce will require almost 20 times more lines to perform the same task.
● There is no need for compilation of Pig scripts: on execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs, by contrast, have a long compilation process.
Installing and Running Pig

Apache Pig Execution Modes


You can run Apache Pig in two modes,
namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and
run from your local host and local file
system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In
this mode, whenever we execute the Pig
Latin statements to process the data, a
MapReduce job is invoked in the back-end
to perform a particular operation on the
data that exists in the HDFS.
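For instance, in MapReduce mode a statement sequence like the hedged sketch below (the HDFS path and schema are assumptions) causes Pig to launch a MapReduce job in the back-end as soon as a result is requested.

grunt> raw = LOAD 'hdfs://localhost:9000/pig_data/sample.txt'
             USING PigStorage(',') AS (id:int, value:chararray);
grunt> DUMP raw;   -- the DUMP triggers the back-end MapReduce job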

Apache Pig Execution Mechanisms


Apache Pig scripts can be executed in
three ways, namely, interactive mode, batch
mode, and embedded mode.
● Interactive Mode (Grunt shell) − You
can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you
can enter the Pig Latin statements and
get the output (using Dump operator).
● Batch Mode (Script) − You can run
Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig
extension.
● Embedded Mode (UDF) − Apache Pig
provides the provision of defining our
own functions (User Defined Functions)
in programming languages such as
Java, and using them in our script.

Invoking the Grunt Shell


You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode command:
$ ./pig -x local

MapReduce mode command:
$ ./pig -x mapreduce

Either of these commands gives you the
Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can
execute a Pig script by directly entering the
Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt'
USING PigStorage(',');

Executing Apache Pig in Batch Mode


You can write an entire Pig Latin script in a file and execute it by passing the file to the pig command with the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
student = LOAD
'hdfs://localhost:9000/pig_data/student.txt'
USING
PigStorage(',') as
(id:int,name:chararray,city:chararray);

Dump student;
Now, you can execute the script in the
above file as shown below.

Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig
Apache Pig Execution Modes
We can start Apache Pig in two modes,
the first mode is Local and the second
mode is Mapreduce or HDFS.
Let us see each mode in detail.
1. Local Mode
In this mode of execution, a single machine is enough; all files are installed and run using your localhost and local file system. This mode is used for testing and development purposes. Local mode does not need HDFS or Hadoop.
To start Local mode, type the below command.
$pig -x local
2. MapReduce Mode
MapReduce is the default mode of the Apache Pig Grunt shell. In this mode, we need to load data into HDFS before we can perform operations on it. When we run a Pig Latin command on that data, a MapReduce job is started in the back-end to process it.
To start the MapReduce mode, type the below command.
$pig -x mapreduce
or
$pig

Apache Pig Execution Methods


A user can execute Apache Pig Latin
scripts in three ways as mentioned
below.

1. Interactive Mode (Grunt shell)


In this mode, a user can interactively run Apache Pig using the Grunt shell: commands are submitted and the results are returned in the same shell.
Let us see the example below. We run the following statements in interactive mode and get the output directly in the shell.
Command:
grunt> employee = LOAD 'hdfs://localhost:9000/pigdata/emp.txt' USING PigStorage(',') as (empid:int, empname:chararray, company:chararray);
grunt> dump employee;
Output:
2. Batch Mode (Script)
In this mode, a user can run Apache Pig
in batch mode by creating a Pig Latin
script file and running it from local or
MapReduce mode.
Let us see the example below. We have created a script file named “emp_script.pig” and placed it at an HDFS location; now we call that file using the batch-mode command.
Command:
$pig -x mapreduce
hdfs:///pigdata/emp_script.pig

Output:
Comparison with Databases
Having seen Pig in action, it might seem
that Pig Latin is similar to SQL. The
presence of such operators as GROUP BY
and DESCRIBE reinforces this impression.
However, there are several differences
between the two languages, and between
Pig and RDBMSs in general.
The most significant difference is that Pig
Latin is a data flow programming language,
whereas SQL is a declarative programming
language. In other words, a Pig Latin
program is a step-by-step set of operations
on an input relation, in which each step is a
single transformation. By contrast, SQL
statements are a set of constraints that,
taken together, define the output. In many
ways, programming in Pig Latin is like
working at the level of an RDBMS query
planner, which figures out how to turn a
declarative statement into a system of
steps.
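As a rough illustration of this difference (the relation and field names are hypothetical), the Pig Latin version spells out each transformation step, whereas the SQL version states only the desired result:

-- step-by-step data flow in Pig Latin
users  = LOAD '/pig_data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
byAge  = GROUP adults BY age;
counts = FOREACH byAge GENERATE group AS age, COUNT(adults) AS n;
DUMP counts;
-- the roughly equivalent SQL is a single declarative statement:
-- SELECT age, COUNT(*) FROM users WHERE age >= 18 GROUP BY age;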
RDBMSs store data in tables, with tightly
predefined schemas. Pig is more relaxed
about the data that it processes: you can
define a schema at runtime, but it’s
optional. Essentially, it will operate on any
source of tuples (although the source
should support being read in parallel, by
being in multiple files, for example), where
a UDF is used to read the tuples from their
raw representation. The most common
representation is a text file with
tab-separated fields, and Pig provides a
built-in load function for this format.
Unlike with a traditional database, there is
no data import process to load the data into
the RDBMS. The data is loaded from the
filesystem (usually HDFS) as the first step
in the processing.
Pig’s support for complex, nested data
structures differentiates it from SQL, which
operates on flatter data structures. Also,
Pig’s ability to use UDFs and streaming
operators that are tightly integrated with the
language and Pig’s nested data structures
makes Pig Latin more customizable than
most SQL dialects.
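For example, an external command or a user-defined function can be dropped straight into the data flow; this is only a sketch, and the file, command, jar, and class names are assumptions.

users = LOAD '/pig_data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- pipe each record through an external command with the STREAM operator
names = STREAM users THROUGH `cut -f 1` AS (name:chararray);
-- or call a user-defined function registered from a jar
REGISTER 'myudfs.jar';
upper = FOREACH users GENERATE myudfs.ToUpper(name);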
There are several features to support
online, low-latency queries that RDBMSs
have that are absent in Pig, such as
transactions and indexes. As mentioned
earlier, Pig does not support random reads
or queries in the order of tens of
milliseconds. Nor does it support random
writes to update small portions of data; all
writes are bulk, streaming writes, just like
MapReduce.
Hive (covered in Hive Chapter ) sits
between Pig and conventional RDBMSs.
Like Pig, Hive is designed to use HDFS for
storage, but otherwise there are some
significant differences. Its query language,
HiveQL, is based on SQL, and anyone who
is familiar with SQL would have little trouble
writing queries in HiveQL. Like RDBMSs,
Hive mandates that all data be stored in
tables, with a schema under its
management; however, it can associate a
schema with preexisting data in HDFS, so
the load step is optional. Hive does not
support low-latency queries, a characteristic
it shares with Pig.

Apache Pig Installation


The objective of this tutorial is to describe the step-by-step process to install Pig (version pig-0.17.0.tar.gz) on Hadoop 3.1.2. The OS we are using is Ubuntu 18.04.4 LTS (Bionic Beaver). Once the installation is completed, you can play with Pig.

Platform
● Operating System (OS). You can use Ubuntu 18.04.4 LTS or a later version; other flavors of Linux such as Red Hat, CentOS, etc. can also be used.
● Hadoop. We have already installed
Hadoop 3.1.2 version on which we will
run Pig (Please refer to the "Hadoop
Installation on Single Node” tutorial and
install Hadoop first before proceeding
for Pig installation.)
● Pig. We have used the Apache
Pig-0.17.0 version for installation.

Download Software
● Pig
https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

Steps to Install Apache Pig version 0.17.0 on Ubuntu 18.04.4 LTS

Please follow the below steps to install Apache Pig.
Step 1. Please verify if Hadoop is
installed.

Step 2. Please verify if Java is installed.

Step 3. Please download Pig 0.17.0 from the below link.
On Linux: $wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
On Windows: https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
Step 4. Now we will extract the tar file by
using the below command and rename
the folder to pig to make it meaningful.
tar -xzf pig-0.17.0.tar.gz
mv pig-0.17.0 pig

Step 5. Now edit the .bashrc file to update the environment variables for Apache Pig so that it can be accessed from any directory.
$nano .bashrc
Add the below lines (use the path where you extracted Pig, e.g. /home/cloudduggu/pig).
export PIG_HOME=/home/cloudduggu/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop

Save the changes by pressing CTRL + O and exit from the nano editor by pressing CTRL + X.

Step 6. Run the source command to apply the changes in the same terminal.
$source .bashrc

Step 7. Now run the version command ($pig -version) to make sure that Pig is installed properly.

Step 8. Run the help command ($pig -help) to see all Pig command options.

Step 9. Now start the Pig Grunt shell (the Grunt shell is used to execute Pig Latin scripts).

Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. In addition, we can invoke shell and file system commands from it using sh and fs.
sh Command
Using the sh command, we can invoke any shell command from the Grunt shell. However, commands that are part of the shell environment itself (e.g., cd) cannot be executed this way.
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters

Example
We can invoke the ls command of Linux
shell from the Grunt shell using the sh
option as shown below. In this example, it
lists out the files in the /pig/bin/ directory.
grunt> sh ls

pig
pig_1444799121955.log
pig.cmd
pig.py

fs Command
Using the fs command, we can invoke any
FsShell commands from the Grunt shell.
Syntax
Given below is the syntax of fs command.
grunt> fs File System command parameters

Example
We can invoke the ls command of HDFS
from the Grunt shell using fs command. In
the following example, it lists the files in the
HDFS root directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0
2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0
2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0
2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.
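For instance (the paths here are hypothetical):

grunt> fs -mkdir /pig_data
grunt> fs -put student.txt /pig_data/
grunt> fs -cat /pig_data/student.txt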

Utility Commands
The Grunt shell provides a set of utility
commands. These include utility commands
such as clear, help, history, quit, and set;
and commands such as exec, kill, and run
to control Pig from the Grunt shell. Given
below is the description of the utility
commands provided by the Grunt shell.
clear Command
The clear command is used to clear the
screen of the Grunt shell.
Syntax
You can clear the screen of the grunt shell
using the clear command as shown below.
grunt> clear

help Command
The help command gives you a list of Pig
commands or Pig properties.
Usage
You can get a list of Pig commands using
the help command as shown below.
grunt> help

Commands: <pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands: fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic Commands: describe <alias>[::<alias>] - Show the schema for the alias.
Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>]
[-brief] [-dot|-xml]
[-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] -
Show the execution plan to compute
the alias or for entire script.
-script - Explain the entire script.
-out - Store the output into directory
rather than print to stdout.
-brief - Don't expand nested plans
(presenting a smaller graph for overview).
-dot - Generate the output in .dot
format. Default is text format.
-xml - Generate the output in .xml
format. Default is text format.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See
parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and
writes the results to stdout.

Utility Commands: exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment including aliases.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See
parameter substitution for details.
script - Script to be executed.
run [-param
<param_name>=param_value] [-param_file
<file_name>] <script> -
Execute the script with access to grunt
environment.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See
parameter substitution for details.
script - Script to be executed.
sh <shell command> - Invoke a shell
command.
kill <job_id> - Kill the hadoop job
specified by the hadoop job id.
set <key> <value> - Provide execution
parameters to Pig. Keys and values are
case sensitive.
The following keys are supported:
default_parallel - Script-level reduce
parallelism. Basic input size heuristics used
by default.
debug - Set debug on or off. Default is
off.
job.name - Single-quoted name for
jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal.
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
history [-n] - Display the list statements in
cache.
-n Hide line numbers.
quit - Quit the grunt shell.

history Command
This command displays a list of the statements executed/used so far since the Grunt shell was invoked.
Usage
Assume we have executed three
statements since opening the Grunt shell.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

Then, using the history command will produce the following output.
grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

set Command
The set command is used to show/assign
values to keys used in Pig.
Usage
Using this command, you can set values to
the following keys.

● default_parallel − You can set the number of reducers for a map-reduce job by passing any whole number as a value to this key.
● debug − You can turn the debugging feature in Pig on or off by passing on/off to this key.
● job.name − You can set the job name for the required job by passing a string value to this key.
● job.priority − You can set the job priority of a job by passing one of the following values to this key: very_low, low, normal, high, very_high.
● stream.skippath − For streaming, you can set the path from where the data is not to be transferred, by passing the desired path as a string to this key.
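A few hedged examples of using the set command in the Grunt shell (the values are illustrative only):

grunt> set default_parallel 10
grunt> set debug on
grunt> set job.name 'my pig job'
grunt> set job.priority high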

quit Command
You can quit from the Grunt shell using this
command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit

Let us now take a look at the commands using which you can control Apache Pig from the Grunt shell.
exec Command
Using the exec command, we can execute
Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility
command exec.
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]

Example
Let us assume there is a file named
student.txt in the /pig_data/ directory of
HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the /pig_data/ directory of HDFS with the following content.
Sample_script.pig
student = LOAD
'hdfs://localhost:9000/pig_data/student.txt'
USING PigStorage(',')
as (id:int,name:chararray,city:chararray);
Dump student;
Now, let us execute the above script from
the Grunt shell using the exec command as
shown below.
grunt> exec /sample_script.pig
Output
The exec command executes the script in
the sample_script.pig. As directed in the
script, it loads the student.txt file into Pig
and gives you the result of the Dump
operator displaying the following content.
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

kill Command
You can kill a job from the Grunt shell using
this command.
Syntax
Given below is the syntax of the kill
command.
grunt> kill JobId

Example
Suppose there is a running Pig job having
id Id_0055, you can kill it from the Grunt
shell using the kill command, as shown
below.
grunt> kill Id_0055

run Command
You can run a Pig script from the Grunt shell using the run command.
Syntax
Given below is the syntax of the run
command.
grunt> run [-param param_name = param_value] [-param_file file_name] script

Example
Let us assume there is a file named
student.txt in the /pig_data/ directory of
HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the local filesystem with the following content.
Sample_script.pig
student = LOAD
'hdfs://localhost:9000/pig_data/student.txt'
USING
PigStorage(',') as
(id:int,name:chararray,city:chararray);
Now, let us run the above script from the
Grunt shell using the run command as
shown below.
grunt> run /sample_script.pig

You can see the output of the script using the Dump operator as shown below.
grunt> Dump student;

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

Filtering in Pig

How to Filter Records -


Pig allows you to remove unwanted records
based on a condition. The Filter functionality is
similar to the WHERE clause in SQL. The
FILTER operator in Pig is used to remove
unwanted records from the data file. The syntax
of FILTER operator is shown below:
<new relation> = FILTER <relation> BY
<condition>
Here relation is the data set on which the filter
is applied, condition is the filter condition and
new relation is the relation created after filtering
the rows.

Pig Filter Examples:

Let's consider the below sales data set as an example.

year,product,quantity
---------------------
2000, iphone, 1000
2001, iphone, 1500
2002, iphone, 2000
2000, nokia, 1200
2001, nokia, 1500
2002, nokia, 900

1. Select products whose quantity is greater than or equal to 1000.

grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int,product:chararray,quantity:int);
grunt> B = FILTER A BY quantity >= 1000;
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)

2. Select products whose quantity is greater than 1000 and the year is 2001.

grunt> C = FILTER A BY quantity > 1000 AND year == 2001;
grunt> DUMP C;
(2001,iphone,1500)
(2001,nokia,1500)

3. Select products whose year is not 2000.

grunt> D = FILTER A BY year != 2000;


grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)

You can use all the logical operators (NOT, AND, OR) and relational operators (<, >, ==, !=, >=, <=) in the filter conditions.
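For instance, a hedged sketch combining these operators on the same relation A, selecting every non-nokia row plus any row whose quantity is at least 2000:

grunt> E = FILTER A BY NOT (product == 'nokia') OR quantity >= 2000;
grunt> DUMP E;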

Pig Joins

Running Pig
Note: The following should be performed in
***root user*** (not hadoop user).
Hadoop user is only for starting hadoop

Download Pig 0.15.0 and extract it.
Move it to /usr/local/pig.
Navigate to the above folder in the terminal.
Set the path in the terminal before running pig:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

Then type pig in the terminal, which opens the grunt shell.
(Make sure Hadoop is running before Pig is run.)

Pig Join Example


The join operator is used to combine records from two or more relations. While performing a join operation, we declare one field (or a group of fields) from each relation as keys. When these keys match, the two tuples are matched, else the records are dropped. Joins can be of the following types:
1) Inner-join
2) Self-join
3) Outer-join : left join, right join, and full
join
Create a customers.txt file.
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
Create an orders.txt file

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Copy these files to HDFS
In the Pig Grunt Shell do the following
c = LOAD '/pigdata/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
o = LOAD '/pigdata/orders.txt' USING PigStorage(',') as (oid:int,date:chararray,cust_id:int,amount:int);

Inner join:
c_o = JOIN c BY id, o BY cust_id;
Dump c_o;

This produces the following output:

(2,Khilan,25,Delhi,1500,101,2009-11-20
00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08
00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08
00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20
00:00:00,4,2060)
Self Join:

c1 = LOAD '/pigdata/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);

c2 = LOAD '/pigdata/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
c3= JOIN c1 BY id, c2 BY id;
Dump c3;
produces the output
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,A
hmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500
)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,200
0)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mum
bai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8
500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10
000)

3) Outer Join
Unlike inner join, outer join returns all the rows
from at least one of the relations. An outer join
operation is carried out in three ways -
a) Left outer join
b) Right outer join
c) Full outer join
a) Left outer join
The left outer Join operation returns all rows
from the left table, even if there are no matches
in the right relation.
c= LOAD '/pigdata/c.txt' using PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,
salary:int);
o= LOAD '/pigdata/o.txt' using PigStorage(',') as
(oid:int,date:chararray,cust_id:int,amount:int);
outer_left = JOIN c BY id LEFT OUTER, o BY
cust_id;
Dump outer_left;
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20
00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08
00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08
00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20
00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

b) Right outer join
outer_right = JOIN c BY id RIGHT, o BY
cust_id;
Dump outer_right;
(2,Khilan,25,Delhi,1500,101,2009-11-20
00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08
00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08
00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20
00:00:00,4,2060)
c) Full outer join
outer_full = JOIN c BY id FULL OUTER, o
BY cust_id;
Dump outer_full;
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20
00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08
00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08
00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20
00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
PIG UDF

How to create a UDF in Pig

1. Set the classpath as follows

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.2.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.9.2.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.9.2.jar:~/pigudf/*:$HADOOP_HOME/lib/*:/usr/local/pig/pig-0.16.0-core-h1.jar:/usr/local/pig/pig-0.16.0-core-h2.jar"

File Name: Sample_Eval.java


package pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple EvalFunc UDF that converts its first input field to upper case.
public class Sample_Eval extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for empty input tuples.
        if (input == null || input.size() == 0)
            return null;
        // Read the first field and convert it to upper case.
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}

Compile the program and generate the jar file

Copy the jar file to hdfs

Create a file called Employ.txt

1,John,2007-01-24,250

2,Ram,2007-05-27,220

3,Jack,2007-05-06,170

3,Jack,2007-04-06,100

4,Jill,2007-04-06,220
5,Zara,2007-06-06,300

5,Zara,2007-02-06,35

In Pig grunt shell do the following:

Register the jar file from hdfs

Register 'hdfs://localhost:9000/MyUDF.jar'
employee_data = LOAD 'hdfs://localhost:9000/user/hduser/pig/employee_new.txt' USING PigStorage(',') as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

Let us now convert the names of the employees to upper case using the UDF Sample_Eval.
Upper_case = FOREACH employee_data GENERATE pig.Sample_Eval(name);

Dump Upper_case;

Retrieving user login credentials from /etc/passwd using Pig Latin
First copy the passwd file from /etc to the working directory. Assume the working directory is /usr/local/pig.
In the terminal perform the following command:
sudo cp /etc/passwd /usr/local/pig
Load Pig in local mode (you need not run Hadoop in pseudo-distributed mode).
export PIG_HOME=/usr/local/pig
sandeep@sandeep-PC:/usr/local/pig$ export
PATH=$PATH:$PIG_HOME/bin
sandeep@sandeep-PC:/usr/local/pig$ pig -x
local
In grunt shell
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> store B into 'id.out';
Check id.out file in Pig directory
