Unit-4 Big Data Analytics: What Is Apache Pig?
Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when
you are familiar with SQL.
Apache Pig provides many built-in operators to support data operations
like joins, filters, ordering, etc.
In addition, it also provides nested data types like tuples, bags, and maps
that are missing from MapReduce.
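For instance, a filter followed by a join takes only a few lines of Pig Latin. A minimal sketch, assuming two hypothetical comma-separated files customers.txt and orders.txt:
grunt> customers = LOAD 'customers.txt' USING PigStorage(',') as (id:int, name:chararray);
grunt> orders = LOAD 'orders.txt' USING PigStorage(',') as (oid:int, cust_id:int, amount:int);
grunt> big_orders = FILTER orders BY amount > 100;
grunt> joined = JOIN customers BY id, big_orders BY cust_id;
grunt> Dump joined;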
Features of Pig
Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Apache Pig: Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig.
MapReduce: Exposure to Java is a must to work with MapReduce.
Apache Pig: Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent.
MapReduce: MapReduce requires almost 20 times more lines of code to perform the same task.
Apache Pig Vs SQL
Apache Pig: In Apache Pig, schema is optional. We can store data without designing a schema (values are addressed positionally as $0, $1, etc.).
SQL: Schema is mandatory in SQL.
Apache Pig: The data model in Apache Pig is nested relational.
SQL: The data model used in SQL is flat relational.
Apache Pig: Apache Pig provides limited opportunity for query optimization.
SQL: There is more opportunity for query optimization in SQL.
Apache Pig Vs Hive
Apache Pig: Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. It can handle structured, unstructured, and semi-structured data.
Hive: Hive uses a language called HiveQL. It was originally created at Facebook. It is mostly for structured data.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used −
To process huge data sources such as web logs.
To perform data processing for search platforms.
To process time sensitive data loads.
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high-level data processing language which provides a rich set of data types
and operators to perform various operations on the data.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs.
The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
Parser
Initially, the Pig scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as
an Atom. It is stored as a string and can be used as a string or a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields
can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike
a table in RDBMS, it is not necessary that every tuple contain the same number
of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is
represented by ‘[]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
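To see how these types nest inside a relation, here is a minimal sketch of a schema combining them, assuming a hypothetical pipe-delimited file employee.txt:
grunt> employee = LOAD 'employee.txt' USING PigStorage('|') as
   (name:chararray, details:tuple(age:int, city:chararray),
    contacts:bag{t:(phone:chararray)}, props:map[]);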
This chapter explains how to download, install, and set up Apache Pig in
your system.
Downloading Apache Pig
First of all, download the latest version of Apache Pig from the following website
− https://pig.apache.org/
Step 1
Open the homepage of Apache Pig website. Under the section News, click on
the link release page as shown in the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig
Releases page. On this page, under the Download section, you will have two
links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8
and later, then you will be redirected to the page having a set of mirrors.
Step 3
Choose and click any one of these mirrors as shown below.
Step 4
These mirrors will take you to the Pig Releases page. This page contains
various versions of Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in
various distributions. Download the tar files of the source and binary files of
Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Command (Local mode) −
$ ./pig -x local
Command (MapReduce mode) −
$ ./pig -x mapreduce
Output −
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering
the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
grunt> Dump customers;
You can also save such statements in a script file and execute it later.
Note − We will discuss in detail how to run a Pig script in Batch mode and
in embedded mode in subsequent chapters.
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In
this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin
statements, data types, general and relational operators, and Pig Latin UDFs.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
These statements work with relations. They
include expressions and schemas.
Every statement ends with a semicolon (;).
We will perform various operations using operators provided by Pig Latin,
through statements.
Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
As soon as you enter a Load statement in the Grunt shell, only its semantic
checking is carried out. To see the contents of the relation, you need
to use the Dump operator. Only after performing the dump operation is the
MapReduce job for loading the data from the file system actually executed.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
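You can verify the schema of the loaded relation with the describe operator; its output looks similar to the sketch below.
grunt> describe Student_data;
Student_data: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}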
Pig Latin – Data Types
Simple Types
1 Int
Represents a signed 32-bit integer.
Example : 8
2 Long
Represents a signed 64-bit integer.
Example : 5L
3 Float
Represents a signed 32-bit floating point.
Example : 5.5F
4 Double
Represents a 64-bit floating point.
Example : 10.5
5 Chararray
Represents a character array (string) in Unicode UTF-8 format.
Example : ‘tutorials point’
6 Bytearray
Represents a Byte array (blob).
7 Boolean
Represents a Boolean value.
Example : true/ false.
8 Datetime
Represents a date-time.
Example : 1970-01-01T00:00:00.000+00:00
9 Biginteger
Represents a Java BigInteger.
Example : 60708090709
10 Bigdecimal
Represents a Java BigDecimal
Example : 185.98376256272893883
Complex Types
11 Tuple
A tuple is an ordered set of fields.
Example : (raja, 30)
12 Bag
A bag is a collection of tuples.
Example : {(raju,30),(Mohhammad,45)}
13 Map
A Map is a set of key-value pairs.
Example : [ ‘name’#’Raju’, ‘age’#30]
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in
a similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a
placeholder for optional values. These nulls can occur naturally or can be the
result of an operation.
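Nulls can be tested with the is null and is not null operators. A minimal sketch, assuming the Student_data relation loaded earlier:
grunt> no_phone = FILTER Student_data BY phone is null;
grunt> Dump no_phone;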
Arithmetic Operators
Assume a = 10 and b = 20.
+ : Addition − Adds values on either side of the operator. Example: a + b will give 30.
/ : Division − Divides the left-hand operand by the right-hand operand. Example: b / a will give 2.
% : Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a will give 0.
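These operators are typically used inside a FOREACH … GENERATE statement. A minimal sketch, assuming a hypothetical file numbers.txt with two integer columns:
grunt> numbers = LOAD 'numbers.txt' USING PigStorage(',') as (a:int, b:int);
grunt> results = FOREACH numbers GENERATE a + b, b / a, b % a;
grunt> Dump results;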
Comparison Operators
== : Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. Example: (a == b) is not true.
!= : Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true. Example: (a != b) is true.
<= : Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a <= b) is true.
matches : Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side. Example: f1 matches '.*tutorial.*'
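In practice, comparison operators appear mostly in FILTER conditions. A minimal sketch, again assuming the Student_data relation:
grunt> from_chennai = FILTER Student_data BY city == 'Chennai';
grunt> r_names = FILTER Student_data BY firstname matches 'R.*';
grunt> Dump from_chennai;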
Type Construction Operators
() : Tuple constructor operator − This operator is used to construct a tuple. Example: (Raju, 30)
{} : Bag constructor operator − This operator is used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] : Map constructor operator − This operator is used to construct a map. Example: [name#Raja, age#30]
Relational Operators
LOAD − To load the data from the file system (local/HDFS) into a relation.
Apache Pig provides various built-in functions namely eval, load, store, math,
string, bag and tuple functions.
Eval Functions
Given below is the list of eval functions provided by Apache Pig.
1 AVG()
To compute the average of the numerical values within a bag.
2 BagToString()
To concatenate the elements of a bag into a string. While concatenating, we
can place a delimiter between these values (optional).
3 CONCAT()
To concatenate two or more expressions of same type.
4 COUNT()
To get the number of elements in a bag, i.e., the number of tuples in the
bag.
5 COUNT_STAR()
It is similar to the COUNT() function. It is used to get the number of elements
in a bag, but unlike COUNT() it also includes null values.
6 DIFF()
To compare two bags (fields) in a tuple.
7 IsEmpty()
To check if a bag or map is empty.
8 MAX()
To calculate the highest value for a column (numeric values or chararrays) in a
single-column bag.
9 MIN()
To get the minimum (lowest) value (numeric or chararray) for a certain column
in a single-column bag.
10 PluckTuple()
Using the Pig Latin PluckTuple() function, we can define a string Prefix and
filter the columns in a relation that begin with the given prefix.
11 SIZE()
To compute the number of elements based on any Pig data type.
12 SUBTRACT()
To subtract two bags. It takes two bags as inputs and returns a bag which
contains the tuples of the first bag that are not in the second bag.
13 SUM()
To get the total of the numeric values of a column in a single-column bag.
14 TOKENIZE()
To split a string (which contains a group of words) in a single tuple and return
a bag which contains the output of the split operation.
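Eval functions usually operate on a bag produced by GROUP. A minimal sketch, assuming a hypothetical student_details relation that includes an age column:
grunt> group_all = GROUP student_details ALL;
grunt> stats = FOREACH group_all GENERATE COUNT(student_details), AVG(student_details.age);
grunt> Dump stats;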
Load and Store Functions
1 PigStorage()
To load and store structured files.
2 TextLoader()
To load unstructured data into Pig.
3 BinStorage()
To load and store data into Pig using machine readable format.
4 Handling Compression
In Pig Latin, we can load and store compressed data.
Bag and Tuple Functions
1 TOBAG()
To convert two or more expressions into a bag.
2 TOP()
To get the top N tuples of a relation.
3 TOTUPLE()
To convert one or more expressions into a tuple.
4 TOMAP()
To convert the key-value pairs into a Map.
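A minimal sketch of these conversion functions, assuming the Student_data relation loaded earlier:
grunt> as_tuple = FOREACH Student_data GENERATE TOTUPLE(firstname, lastname);
grunt> as_map = FOREACH Student_data GENERATE TOMAP(firstname, city);
grunt> Dump as_map;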
String Functions
We have the following String functions in Apache Pig.
1 ENDSWITH(string, testAgainst)
To verify whether a given string ends with a particular substring.
2 STARTSWITH(string, substring)
Accepts two string parameters and verifies whether the first string starts with
the second.
4 EqualsIgnoreCase(string1, string2)
To compare two strings ignoring the case.
6 LAST_INDEX_OF(expression)
Returns the index of the last occurrence of a character in a string, searching
backward from a start index.
7 LCFIRST(expression)
Converts the first character in a string to lower case.
8 UCFIRST(expression)
Returns a string with the first character converted to upper case.
9 UPPER(expression)
Returns a string converted to upper case.
10 LOWER(expression)
Converts all characters in a string to lower case.
14 TRIM(expression)
Returns a copy of a string with leading and trailing whitespaces removed.
15 LTRIM(expression)
Returns a copy of a string with leading whitespaces removed.
16 RTRIM(expression)
Returns a copy of a string with trailing whitespaces removed.
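A minimal sketch combining a few of these string functions, again assuming the Student_data relation:
grunt> cleaned = FOREACH Student_data GENERATE UPPER(firstname), TRIM(lastname);
grunt> r_starts = FILTER Student_data BY STARTSWITH(firstname, 'R');
grunt> Dump cleaned;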
Date-time Functions
Apache Pig provides the following Date and Time functions −
1 ToDate(milliseconds)
This function returns a date-time object according to the given parameters. Other
alternatives for this function are ToDate(isostring), ToDate(userstring,
format), and ToDate(userstring, format, timezone).
2 CurrentTime()
Returns the date-time object of the current time.
3 GetDay(datetime)
Returns the day of a month from the date-time object.
4 GetHour(datetime)
Returns the hour of a day from the date-time object.
5 GetMilliSecond(datetime)
Returns the millisecond of a second from the date-time object.
6 GetMinute(datetime)
Returns the minute of an hour from the date-time object.
7 GetMonth(datetime)
Returns the month of a year from the date-time object.
8 GetSecond(datetime)
Returns the second of a minute from the date-time object.
9 GetWeek(datetime)
Returns the week of a year from the date-time object.
10 GetWeekYear(datetime)
Returns the week year from the date-time object.
11 GetYear(datetime)
Returns the year from the date-time object.
12 AddDuration(datetime, duration)
Adds the given duration to a date-time object and returns the result.
13 SubtractDuration(datetime, duration)
Subtracts the Duration object from the Date-Time object and returns the result.
14 DaysBetween(datetime1, datetime2)
Returns the number of days between the two date-time objects.
15 HoursBetween(datetime1, datetime2)
Returns the number of hours between two date-time objects.
16 MilliSecondsBetween(datetime1, datetime2)
Returns the number of milliseconds between two date-time objects.
17 MinutesBetween(datetime1, datetime2)
Returns the number of minutes between two date-time objects.
18 MonthsBetween(datetime1, datetime2)
Returns the number of months between two date-time objects.
19 SecondsBetween(datetime1, datetime2)
Returns the number of seconds between two date-time objects.
20 WeeksBetween(datetime1, datetime2)
Returns the number of weeks between two date-time objects.
21 YearsBetween(datetime1, datetime2)
Returns the number of years between two date-time objects.
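A minimal sketch of parsing and extracting date parts, assuming a hypothetical file dates.txt whose second column holds dates as chararray in yyyy/MM/dd format:
grunt> raw_dates = LOAD 'dates.txt' USING PigStorage(',') as (id:int, date:chararray);
grunt> parsed = FOREACH raw_dates GENERATE id, ToDate(date, 'yyyy/MM/dd') AS dt;
grunt> years = FOREACH parsed GENERATE id, GetYear(dt);
grunt> Dump years;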
Math Functions
We have the following Math functions in Apache Pig −
1 ABS(expression)
To get the absolute value of an expression.
2 ACOS(expression)
To get the arc cosine of an expression.
3 ASIN(expression)
To get the arc sine of an expression.
4 ATAN(expression)
This function is used to get the arc tangent of an expression.
5 CBRT(expression)
This function is used to get the cube root of an expression.
6 CEIL(expression)
This function is used to get the value of an expression rounded up to the nearest
integer.
7 COS(expression)
This function is used to get the trigonometric cosine of an expression.
8 COSH(expression)
This function is used to get the hyperbolic cosine of an expression.
9 EXP(expression)
This function is used to get Euler’s number e raised to the power of x.
10 FLOOR(expression)
To get the value of an expression rounded down to the nearest integer.
11 LOG(expression)
To get the natural logarithm (base e) of an expression.
12 LOG10(expression)
To get the base 10 logarithm of an expression.
13 RANDOM( )
To get a pseudo random number (type double) greater than or equal to 0.0 and
less than 1.0.
14 ROUND(expression)
To get the value of an expression rounded to an integer (if the result type is float)
or rounded to a long (if the result type is double).
15 SIN(expression)
To get the sine of an expression.
16 SINH(expression)
To get the hyperbolic sine of an expression.
17 SQRT(expression)
To get the positive square root of an expression.
18 TAN(expression)
To get the trigonometric tangent of an angle.
19 TANH(expression)
To get the hyperbolic tangent of an expression.
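A minimal sketch applying a few of these math functions, assuming a hypothetical single-column file values.txt:
grunt> numeric_data = LOAD 'values.txt' USING PigStorage(',') as (value:double);
grunt> math_results = FOREACH numeric_data GENERATE ABS(value), CEIL(value), FLOOR(value), SQRT(value);
grunt> Dump math_results;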
For writing UDFs, complete support is provided in Java, and limited support is
provided in the remaining languages. Using Java, you can write UDFs
involving all parts of the processing, like data load/store, column transformation,
and aggregation. Since Apache Pig itself is written in Java, UDFs written
in Java work more efficiently than those written in other languages.
In Apache Pig, we also have a Java repository of UDFs named Piggybank.
Using Piggybank, we can access Java UDFs written by other users, and
contribute our own UDFs.
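A minimal sketch of using Piggybank; the jar path below is an assumption for a typical source build, and Reverse is one of its string UDFs:
grunt> REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';
grunt> reversed = FOREACH Student_data GENERATE org.apache.pig.piggybank.evaluation.string.Reverse(firstname);
grunt> Dump reversed;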
To write a UDF using Java, we have to integrate the jar file pig-0.15.0.jar. In this
section, we discuss how to write a sample UDF using Eclipse. Before proceeding
further, make sure you have installed Eclipse and Maven in your system.
Create a Maven project in Eclipse and replace the contents of pom.xml with the following.
<project xmlns="http://maven.apache.org/POM/4.0.0"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Pig_Udf</groupId>
<artifactId>Pig_Udf</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>src</sourceDirectory>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>0.15.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
</dependency>
</dependencies>
</project>
Save the file and refresh it. In the Maven Dependencies section, you can
find the downloaded jar files.
Create a new class file with name Sample_Eval and copy the following
content in it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class Sample_Eval extends EvalFunc<String> {
   // Return the first field of the input tuple converted to upper case.
   public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
         return null;
      String str = (String) input.get(0);
      return str.toUpperCase();
   }
}
Step 1: Registering the Jar file
After writing the UDF and generating its jar file, register the jar file using the REGISTER operator as shown below.
REGISTER '/$PIG_HOME/sample_udf.jar'
Note − We assume the jar file is in the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining Alias
After registering the UDF we can define an alias to it using the Define operator.
Syntax
Given below is the syntax of the Define operator.
DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
Example
Define the alias for sample_eval as shown below.
DEFINE sample_eval sample_eval();
Step 3: Using the UDF
After defining the alias, you can use the UDF the same as the built-in functions.
Suppose there is a file named emp_data in the HDFS /Pig_Data/ directory with
the following content.
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London
011,Stacy,25,Bhuwaneshwar
012,Kelly,22,Chennai
And assume we have loaded this file into Pig as shown below.
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees in to upper case using the
UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
grunt> Dump Upper_case;
(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
You can also execute a Pig script from the Grunt shell using the exec command as
shown below.
grunt> exec /sample_script.pig
Executing a Pig Script from HDFS
We can also execute a Pig script that resides in the HDFS. Suppose there is a
Pig script with the name Sample_script.pig in the HDFS directory
named /pig_data/. We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
Example
Assume we have a file student_details.txt in HDFS with the following content.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
We also have a sample script with the name sample_script.pig, in the same
HDFS directory. This file contains statements performing operations and
transformations on the student relation, as shown below.
student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
student_order = ORDER student BY age DESC;
student_limit = LIMIT student_order 4;
Dump student_limit;
The first statement of the script will load the data in the file
named student_details.txt as a relation named student.
The second statement of the script will arrange the tuples of the relation in
descending order, based on age, and store it as student_order.
The third statement of the script will store the first 4 tuples
of student_order as student_limit.
Finally the fourth statement will dump the content of the
relation student_limit.
Let us now execute the sample_script.pig as shown below.
$ ./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig
Apache Pig gets executed and gives you the output with the following content.
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
2015-10-19 10:31:27,446 [main] INFO org.apache.pig.Main - Pig script completed in 12
minutes, 32 seconds and 751 milliseconds (752751 ms)
Apache Pig Installation
In this section, we will perform the Pig installation.
Pre-requisite
o Java Installation − Check whether Java is installed or not using the
following command.
$ java -version
o Hadoop Installation − Check whether Hadoop is installed or not using the
following command.
$ hadoop version
If either of them is not installed on your system, install it before proceeding.
Installation Steps
o Download Apache Pig, extract it, and set the environment variables in ~/.bashrc:
export PIG_HOME=/home/hduser/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
o Update the environment variables:
$ source ~/.bashrc
o Verify the installation:
$ pig -h
o Start Pig:
$ pig
Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
o It executes in a single JVM and is used for development, experimentation, and
prototyping.
o Here, files are installed and run using localhost.
o The local mode works on a local file system. The input and output data are
stored in the local file system.
$ pig -x local
MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on
the cluster.
o It can be executed against a semi-distributed or fully distributed Hadoop
installation.
o Here, the input and output data are present on HDFS.
$ pig
Or,
$ pig -x mapreduce
Ways to Execute a Pig Program
o Interactive Mode − In this mode, Pig is executed in the Grunt shell. To
invoke the Grunt shell, run the pig command. Once the Grunt mode executes,
we can provide Pig Latin statements and commands interactively at the
command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode − In this mode, we can define our own functions, called
UDFs (User Defined Functions), using programming languages like Java and
Python.