
Unit-4

Big Data Analytics

What is Apache Pig?


Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
Pig provides a high-level language known as Pig Latin. This language provides
various operators using which programmers can develop their own functions for
reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce
tasks.
Apache Pig has a component known as Pig Engine that accepts the Pig Latin
scripts as input and converts those scripts into MapReduce jobs.

Why Do We Need Apache Pig?


Programmers who are not comfortable with Java often struggle when working
with Hadoop, especially while performing MapReduce tasks. Apache Pig is a
boon for all such programmers.
 Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex code in Java.
 Apache Pig uses a multi-query approach, thereby reducing the length of
the code. For example, an operation that would require you to type 200
lines of code (LoC) in Java can often be done in as few as 10 LoC in
Apache Pig (see the sketch after this list). Ultimately, Apache Pig
reduces the development time by almost 16 times.

 Pig Latin is a SQL-like language, and it is easy to learn Apache Pig
when you are familiar with SQL.
 Apache Pig provides many built-in operators to support data operations
like joins, filters, ordering, etc.

 In addition, it also provides nested data types like tuples, bags, and maps
that are missing from MapReduce.
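
To make the line-count contrast concrete, here is a minimal word-count sketch
in Pig Latin; the input file name input.txt and the output path are assumptions
for illustration.

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'wordcount_output';

These five statements replace what would be a full Mapper class, Reducer class,
and driver program in Java.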

Features of Pig
 Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
 Ease of programming − Pig Latin is similar to SQL, and it is easy to write
a Pig script if you are good at SQL.
 Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on the
semantics of the language.
 Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
 UDFs − Pig provides the facility to create User Defined Functions in
other programming languages such as Java, and to invoke or embed them in
Pig scripts.
 Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured and unstructured. It stores the results in HDFS.

When You Should Use Apache Pig

• If the business use case requires processing multiple data sources,
then Pig could be an ideal choice. For example, if a business wants to
analyse how a particular ad is performing, it has to combine data from
multiple sources like IP geo-location, click-through rates, web server
traffic, and other details to get an in-depth understanding of the
customers on specific ads.
• If the application requires handling time-sensitive data loads, then
Apache Pig could be a perfect choice, as it is built on top of Hadoop
and can scale out easily. Pig converts the scripts into MapReduce
jobs and spreads the load across multiple servers for faster
processing.
• If the business requires analysis through sampling, then Apache Pig
should be considered; it can sample large datasets with a random
distribution of data to gain meaningful analytic insights (a sketch
follows this list).
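
For the sampling use case, Pig Latin has a SAMPLE operator that selects a
random subset of tuples. A minimal sketch, assuming a hypothetical
access_log.txt in HDFS:

grunt> logs = LOAD 'access_log.txt' USING PigStorage('\t') AS (ip:chararray, url:chararray);
grunt> logs_1pct = SAMPLE logs 0.01; -- keeps roughly 1% of the tuples at random
grunt> DUMP logs_1pct;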

Apache Pig Vs MapReduce


Listed below are the major differences between Apache Pig and MapReduce.

Apache Pig: It is a data flow language.
MapReduce: It is a data processing paradigm.

Apache Pig: It is a high-level language.
MapReduce: It is low level and rigid.

Apache Pig: Performing a Join operation is pretty simple.
MapReduce: It is quite difficult to perform a Join operation between datasets.

Apache Pig: Any novice programmer with a basic knowledge of SQL can work with it conveniently.
MapReduce: Exposure to Java is a must.

Apache Pig: It uses a multi-query approach, thereby reducing the length of the code to a great extent.
MapReduce: It requires almost 20 times more lines of code to perform the same task.

Apache Pig: There is no need for compilation; on execution, every operator is converted internally into a MapReduce job.
MapReduce: Jobs have a long compilation process.

Apache Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.
Pig: Pig Latin is a procedural language.
SQL: SQL is a declarative language.

Pig: Schema is optional; we can store data without designing a schema (values are referenced positionally as $0, $1, etc.).
SQL: Schema is mandatory.

Pig: The data model is nested relational.
SQL: The data model is flat relational.

Pig: It provides limited opportunity for query optimization.
SQL: There is more opportunity for query optimization.

In addition to above differences, Apache Pig Latin −

 Allows splits in the pipeline.
 Allows developers to store data anywhere in the pipeline.
 Declares execution plans.
 Provides operators to perform ETL (Extract, Transform, and Load)
functions.

Apache Pig Vs Hive


Both Apache Pig and Hive are used to create MapReduce jobs. And in some
cases, Hive operates on HDFS in a similar way Apache Pig does. In the following
table, we have listed a few significant points that set Apache Pig apart from Hive.

Apache Pig: It uses a language called Pig Latin, originally created at Yahoo.
Hive: It uses a language called HiveQL, originally created at Facebook.

Apache Pig: Pig Latin is a data flow language.
Hive: HiveQL is a query processing language.

Apache Pig: Pig Latin is a procedural language and fits in the pipeline paradigm.
Hive: HiveQL is a declarative language.

Apache Pig: It can handle structured, unstructured, and semi-structured data.
Hive: It is mostly for structured data.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-
hoc processing and quick prototyping. Apache Pig is used −
 To process huge data sources such as web logs.
 To perform data processing for search platforms.
 To process time sensitive data loads.

Apache Pig – History

In 2006, Apache Pig was developed as a research project at Yahoo, especially to
create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was
open sourced via the Apache Incubator. In 2008, the first release of Apache Pig
came out. In 2010, Apache Pig graduated as an Apache top-level project.

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high-level data processing language which provides a rich set of data types
and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script
using the Pig Latin language, and execute it using any of the execution
mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go
through a series of transformations applied by the Pig framework to produce
the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs.
The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
The MapReduce jobs are finally submitted to Hadoop in a sorted order, where
they are executed to produce the desired results.

Pig Latin Data Model


The data model of Pig Latin is fully nested and it allows complex non-atomic
datatypes such as map and tuple. Given below is the diagrammatical
representation of Pig Latin’s data model.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as
an Atom. It is stored as a string and can be used as a string or a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields
can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike
a table in RDBMS, it is not necessary that every tuple contain the same number
of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is
represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
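
To see how these types combine, the sketch below loads a hypothetical
students.txt whose third field is a bag of phone tuples and whose fourth is a
map; the file name, delimiter, and layout are all assumptions.

grunt> students = LOAD 'students.txt' USING PigStorage('|')
         AS (name:chararray, age:int,
             phones:bag{t:(phone:chararray)},
             details:map[]);
grunt> DESCRIBE students;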

This chapter explains the how to download, install, and set up Apache Pig in
your system.
Downloading Apache Pig
First of all, download the latest version of Apache Pig from the following website
− https://pig.apache.org/
Step 1
Open the homepage of Apache Pig website. Under the section News, click on
the link release page as shown in the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig
Releases page. On this page, under the Download section, you will have two
links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8
and later, then you will be redirected to the page having a set of mirrors.
Step 3
Choose and click any one of these mirrors as shown below.
Step 4
These mirrors will take you to the Pig Releases page. This page contains
various versions of Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in
various distributions. Download the tar files of the source and binary files of
Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Installing Apache Pig


After downloading the Apache Pig software, install it in your Linux environment
by following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation
directories of Hadoop, Java, and other software were installed. (In our tutorial,
we have created the Pig directory in the user named Hadoop).
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory
created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig


After installing Apache Pig, we have to configure it. To configure, we need to edit
two files − bashrc and pig.properties.
.bashrc file
In the .bashrc file, set the following variables −
 PIG_HOME folder to the Apache Pig’s installation folder,
 PATH environment variable to the bin folder, and
 PIG_CLASSPATH environment variable to the etc (configuration) folder of
your Hadoop installations (the directory that contains the core-site.xml,
hdfs-site.xml and mapred-site.xml files).
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
pig.properties file
In the conf folder of Pig, we have a file named pig.properties, in which you can
set various parameters. To list the supported properties, run:
pig -h properties

Apache Pig Execution Modes


You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file
system. There is no need of Hadoop or HDFS. This mode is generally used for
testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that exists
in the HDFS.

Apache Pig Execution Mechanisms


Apache Pig scripts can be executed in three ways, namely, interactive mode,
batch mode, and embedded mode.
 Interactive Mode (Grunt shell) − You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).
 Batch Mode (Script) − You can run Apache Pig in Batch mode by writing
the Pig Latin script in a single file with .pig extension.
 Embedded Mode (UDF) − Apache Pig provides the provision of defining
our own functions (User Defined Functions) in programming languages
such as Java, and using them in our script.

Invoking the Grunt Shell


You can invoke the Grunt shell in a desired mode (local/MapReduce) using
the -x option as shown below.

Local mode command:
$ ./pig -x local

MapReduce mode command:
$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering
the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode


You can write an entire Pig Latin script in a file and execute it using the -x
option. Let us suppose we have a Pig script in a file
named sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump student;
Now, you can execute the script in the above file as shown below.

Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig

Note − We will discuss in detail how to run a Pig script in batch mode and
in embedded mode in subsequent chapters.

Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In
this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin
statements, data types, general and relational operators, and Pig Latin UDF’s.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
 These statements work with relations. They
include expressions and schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided by Pig Latin,
through statements.
 Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
 As soon as you enter a Load statement in the Grunt shell, its semantic
checking will be carried out. To see the contents of the relation, you need
to use the Dump operator. Only after performing the dump operation, the
MapReduce job for loading the data into the file system will be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Pig Latin – Data types


The following table describes the Pig Latin data types.

S.N. Data Type Description & Example

1 Int
Represents a signed 32-bit integer.
Example : 8

2 Long
Represents a signed 64-bit integer.
Example : 5L

3 Float
Represents a signed 32-bit floating point.
Example : 5.5F

4 Double
Represents a 64-bit floating point.
Example : 10.5

5 Chararray
Represents a character array (string) in Unicode UTF-8 format.
Example : ‘tutorials point’

6 Bytearray
Represents a Byte array (blob).

7 Boolean
Represents a Boolean value.
Example : true/ false.

8 Datetime
Represents a date-time.
Example : 1970-01-01T00:00:00.000+00:00

9 Biginteger
Represents a Java BigInteger.
Example : 60708090709

10 Bigdecimal
Represents a Java BigDecimal
Example : 185.98376256272893883

Complex Types

11 Tuple
A tuple is an ordered set of fields.
Example : (raja, 30)

12 Bag
A bag is a collection of tuples.
Example : {(raju,30),(Mohhammad,45)}

13 Map
A Map is a set of key-value pairs.
Example : [ ‘name’#’Raju’, ‘age’#30]
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in
a similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a
placeholder for optional values. These nulls can occur naturally or can be the
result of an operation.

Pig Latin – Arithmetic Operators


The following table describes the arithmetic operators of Pig Latin. Suppose a =
10 and b = 20.

+ : Addition − Adds values on either side of the operator.
Example: a + b will give 30

− : Subtraction − Subtracts the right-hand operand from the left-hand operand.
Example: a − b will give −10

* : Multiplication − Multiplies values on either side of the operator.
Example: a * b will give 200

/ : Division − Divides the left-hand operand by the right-hand operand.
Example: b / a will give 2

% : Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder.
Example: b % a will give 0

?: : Bincond − Evaluates the Boolean condition. It has three operands:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30;
if a = 1, the value of b is 20; if a != 1, the value of b is 30.

CASE WHEN THEN ELSE END : Case − The case operator is equivalent to the nested bincond operator.
Example:
CASE f2 % 2
  WHEN 0 THEN 'even'
  WHEN 1 THEN 'odd'
END
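
To see the bincond operator inside a statement, here is a minimal sketch; the
relation numbers and its field n are hypothetical.

grunt> numbers = LOAD 'numbers.txt' AS (n:int);
grunt> labelled = FOREACH numbers GENERATE n, (n % 2 == 0 ? 'even' : 'odd') AS parity;
grunt> DUMP labelled;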

Pig Latin – Comparison Operators


The following table describes the comparison operators of Pig Latin.

== : Equal − Checks if the values of two operands are equal; if yes, the condition becomes true.
Example: (a == b) is not true.

!= : Not Equal − Checks if the values of two operands are equal; if the values are not equal, the condition becomes true.
Example: (a != b) is true.

> : Greater than − Checks if the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true.
Example: (a > b) is not true.

< : Less than − Checks if the value of the left operand is less than the value of the right operand; if yes, the condition becomes true.
Example: (a < b) is true.

>= : Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true.
Example: (a >= b) is not true.

<= : Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true.
Example: (a <= b) is true.

matches : Pattern matching − Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side.
Example: f1 matches '.*tutorial.*'
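
These comparison operators typically appear in FILTER conditions. A minimal
sketch, assuming a hypothetical student relation with the schema
(id:int, name:chararray, age:int, city:chararray):

grunt> adults = FILTER student BY age >= 21;
grunt> locals = FILTER student BY city == 'Chennai';
grunt> r_names = FILTER student BY name matches 'R.*';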

Pig Latin – Type Construction Operators


The following table describes the Type construction operators of Pig Latin.

() : Tuple constructor operator − This operator is used to construct a tuple.
Example: (Raju, 30)

{} : Bag constructor operator − This operator is used to construct a bag.
Example: {(Raju, 30), (Mohammad, 45)}

[] : Map constructor operator − This operator is used to construct a map.
Example: [name#Raja, age#30]

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.

Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a relation.

STORE To save a relation to the file system (local/HDFS).


Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.

CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in a sorted order based on one or more fields
(ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.


Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.

EXPLAIN To view the logical, physical, or MapReduce execution plans to
compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.
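
The following short Grunt session strings several of these operators together;
student.txt and its layout are assumptions for illustration.

grunt> student = LOAD 'student.txt' USING PigStorage(',')
         AS (id:int, name:chararray, age:int, city:chararray);
grunt> adults = FILTER student BY age >= 21;
grunt> by_city = GROUP adults BY city;
grunt> counts = FOREACH by_city GENERATE group AS city, COUNT(adults) AS total;
grunt> ordered = ORDER counts BY total DESC;
grunt> DUMP ordered;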


Apache Pig – Built-in Functions

Apache Pig provides various built-in functions namely eval, load, store, math,
string, bag and tuple functions.

Eval Functions
Given below is the list of eval functions provided by Apache Pig.

S.N. Function & Description

1 AVG()
To compute the average of the numerical values within a bag.

2 BagToString()
To concatenate the elements of a bag into a string. While concatenating, we
can place a delimiter between these values (optional).

3 CONCAT()
To concatenate two or more expressions of same type.

4 COUNT()
To get the number of elements in a bag (it counts the number of tuples in
the bag).

5 COUNT_STAR()
It is similar to the COUNT() function. It is used to get the number of elements
in a bag.

6 DIFF()
To compare two bags (fields) in a tuple.

7 IsEmpty()
To check if a bag or map is empty.

8 MAX()
To calculate the highest value for a column (numeric values or chararrays) in a
single-column bag.

9 MIN()
To get the minimum (lowest) value (numeric or chararray) for a certain column
in a single-column bag.

10 PluckTuple()
Using the Pig Latin PluckTuple() function, we can define a string Prefix and
filter the columns in a relation that begin with the given prefix.

11 SIZE()
To compute the number of elements based on any Pig data type.

12 SUBTRACT()
To subtract two bags. It takes two bags as inputs and returns a bag which
contains the tuples of the first bag that are not in the second bag.

13 SUM()
To get the total of the numeric values of a column in a single-column bag.

14 TOKENIZE()
To split a string (which contains a group of words) in a single tuple and return
a bag which contains the output of the split operation.
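
A minimal sketch of GROUP with the COUNT() and AVG() eval functions, reusing
the hypothetical student relation from the sketch above:

grunt> grouped = GROUP student BY city;
grunt> stats = FOREACH grouped GENERATE group AS city,
         COUNT(student) AS students, AVG(student.age) AS avg_age;
grunt> DUMP stats;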

Load & Store Functions


The Load and Store functions in Apache Pig are used to determine how the data
goes into and comes out of Pig. These functions are used with the load and store
operators. Given below is the list of load and store functions available in Pig.

S.N. Function & Description

1 PigStorage()
To load and store structured files.
2 TextLoader()
To load unstructured data into Pig.

3 BinStorage()
To load and store data into Pig using machine readable format.

4 Handling Compression
In Pig Latin, we can load and store compressed data.

Apache Pig - Bag & Tuple Functions


Given below is the list of Bag and Tuple functions.

S.N. Function & Description

1 TOBAG()
To convert two or more expressions into a bag.

2 TOP()
To get the top N tuples of a relation.

3 TOTUPLE()
To convert one or more expressions into a tuple.

4 TOMAP()
To convert the key-value pairs into a Map.
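
As a sketch of TOP(), the statements below keep the two oldest students per
city. TOP() takes the number of tuples to return, the 0-based index of the
comparison column (age is column 2 in the assumed schema), and the bag; the
student relation is the same hypothetical one as before.

grunt> by_city = GROUP student BY city;
grunt> oldest2 = FOREACH by_city GENERATE group AS city,
         FLATTEN(TOP(2, 2, student));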

String Functions
We have the following String functions in Apache Pig.

S.N. Functions & Description

1 ENDSWITH(string, testAgainst)
To verify whether a given string ends with a particular substring.

2 STARTSWITH(string, substring)
Accepts two string parameters and verifies whether the first string starts with
the second.

3 SUBSTRING(string, startIndex, stopIndex)
Returns a substring from a given string.

4 EqualsIgnoreCase(string1, string2)
To compare two stings ignoring the case.

5 INDEXOF(string, ‘character’, startIndex)
Returns the first occurrence of a character in a string, searching forward from
a start index.

6 LAST_INDEX_OF(expression)
Returns the index of the last occurrence of a character in a string, searching
backward from a start index.

7 LCFIRST(expression)
Converts the first character in a string to lower case.

8 UCFIRST(expression)
Returns a string with the first character converted to upper case.

9 UPPER(expression)
UPPER(expression) Returns a string converted to upper case.

10 LOWER(expression)
Converts all characters in a string to lower case.

11 REPLACE(string, ‘oldChar’, ‘newChar’)
To replace existing characters in a string with new characters.
12 STRSPLIT(string, regex, limit)
To split a string around matches of a given regular expression.

13 STRSPLITTOBAG(string, regex, limit)
Similar to the STRSPLIT() function, it splits the string by a given delimiter and
returns the result in a bag.

14 TRIM(expression)
Returns a copy of a string with leading and trailing whitespaces removed.

15 LTRIM(expression)
Returns a copy of a string with leading whitespaces removed.

16 RTRIM(expression)
Returns a copy of a string with trailing whitespaces removed.
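
A small sketch combining a few of these string functions, again on the
hypothetical student relation:

grunt> cleaned = FOREACH student GENERATE
         UPPER(TRIM(name)) AS name,
         SUBSTRING(city, 0, 3) AS city_prefix,
         STARTSWITH(name, 'R') AS starts_with_r;
grunt> DUMP cleaned;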

Date-time Functions
Apache Pig provides the following Date and Time functions −

S.N. Functions & Description

1 ToDate(milliseconds)
This function returns a date-time object according to the given parameters. The
other alternatives for this function are ToDate(isostring), ToDate(userstring,
format), and ToDate(userstring, format, timezone).

2 CurrentTime()
Returns the date-time object of the current time.

3 GetDay(datetime)
Returns the day of a month from the date-time object.

4 GetHour(datetime)
Returns the hour of a day from the date-time object.

5 GetMilliSecond(datetime)
Returns the millisecond of a second from the date-time object.

6 GetMinute(datetime)
Returns the minute of an hour from the date-time object.

7 GetMonth(datetime)
Returns the month of a year from the date-time object.

8 GetSecond(datetime)
Returns the second of a minute from the date-time object.

9 GetWeek(datetime)
Returns the week of a year from the date-time object.

10 GetWeekYear(datetime)
Returns the week year from the date-time object.

11 GetYear(datetime)
Returns the year from the date-time object.

12 AddDuration(datetime, duration)
Adds the given duration to the date-time object and returns the result.

13 SubtractDuration(datetime, duration)
Subtracts the Duration object from the Date-Time object and returns the result.

14 DaysBetween(datetime1, datetime2)
Returns the number of days between the two date-time objects.

15 HoursBetween(datetime1, datetime2)
Returns the number of hours between two date-time objects.

16 MilliSecondsBetween(datetime1, datetime2)
Returns the number of milliseconds between two date-time objects.

17 MinutesBetween(datetime1, datetime2)
Returns the number of minutes between two date-time objects.

18 MonthsBetween(datetime1, datetime2)
Returns the number of months between two date-time objects.

19 SecondsBetween(datetime1, datetime2)
Returns the number of seconds between two date-time objects.

20 WeeksBetween(datetime1, datetime2)
Returns the number of weeks between two date-time objects.

21 YearsBetween(datetime1, datetime2)
Returns the number of years between two date-time objects.
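
A brief sketch of the date-time functions; events.txt and its columns are
assumptions.

grunt> events = LOAD 'events.txt' USING PigStorage(',') AS (id:int, ts:chararray);
grunt> dated = FOREACH events GENERATE id, ToDate(ts, 'yyyy-MM-dd') AS dt;
grunt> parts = FOREACH dated GENERATE id, GetYear(dt) AS y, GetMonth(dt) AS m;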

Math Functions
We have the following Math functions in Apache Pig −

S.N. Functions & Description

1 ABS(expression)
To get the absolute value of an expression.

2 ACOS(expression)
To get the arc cosine of an expression.

3 ASIN(expression)
To get the arc sine of an expression.

4 ATAN(expression)
This function is used to get the arc tangent of an expression.

5 CBRT(expression)
This function is used to get the cube root of an expression.

6 CEIL(expression)
This function is used to get the value of an expression rounded up to the nearest
integer.

7 COS(expression)
This function is used to get the trigonometric cosine of an expression.

8 COSH(expression)
This function is used to get the hyperbolic cosine of an expression.

9 EXP(expression)
This function is used to get the Euler’s number e raised to the power of x.

10 FLOOR(expression)
To get the value of an expression rounded down to the nearest integer.

11 LOG(expression)
To get the natural logarithm (base e) of an expression.

12 LOG10(expression)
To get the base 10 logarithm of an expression.

13 RANDOM( )
To get a pseudo random number (type double) greater than or equal to 0.0 and
less than 1.0.
14 ROUND(expression)
To get the value of an expression rounded to an integer (if the result type is float)
or rounded to a long (if the result type is double).

15 SIN(expression)
To get the sine of an expression.

16 SINH(expression)
To get the hyperbolic sine of an expression.

17 SQRT(expression)
To get the positive square root of an expression.

18 TAN(expression)
To get the trigonometric tangent of an angle.

19 TANH(expression)
To get the hyperbolic tangent of an expression.
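
A short sketch of a few math functions; the prices relation and its fields are
hypothetical.

grunt> prices = LOAD 'prices.txt' USING PigStorage(',') AS (item:chararray, price:double);
grunt> shaped = FOREACH prices GENERATE item,
         ROUND(price) AS nearest, CEIL(price) AS up, FLOOR(price) AS down;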

Apache Pig - User Defined Functions


In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDF’s). Using these UDF’s, we can define our own
functions and use them. The UDF support is provided in six programming
languages, namely, Java, Jython, Python, JavaScript, Ruby and Groovy.

For writing UDF’s, complete support is provided in Java and limited support is
provided in all the remaining languages. Using Java, you can write UDF’s
involving all parts of the processing like data load/store, column transformation,
and aggregation. Since Apache Pig has been written in Java, the UDF’s written
using Java language work efficiently compared to other languages.
In Apache Pig, we also have a Java repository for UDF’s named Piggybank.
Using Piggybank, we can access Java UDF’s written by other users, and
contribute our own UDF’s.

Types of UDF’s in Java


While writing UDF’s using Java, we can create and use the following three types
of functions −

 Filter Functions − The filter functions are used as conditions in filter
statements. These functions accept a Pig value as input and return a
Boolean value.
 Eval Functions − The Eval functions are used in FOREACH-GENERATE
statements. These functions accept a Pig value as input and return a Pig
result.
 Algebraic Functions − The Algebraic functions act on inner bags in a
FOREACH-GENERATE statement. These functions are used to perform
full MapReduce operations on an inner bag.

Writing UDF’s using Java

To write a UDF using Java, we need the jar file pig-0.15.0.jar on the classpath. In this
section, we discuss how to write a sample UDF using Eclipse. Before proceeding
further, make sure you have installed Eclipse and Maven in your system.

Follow the steps given below to write a UDF function −

 Open Eclipse and create a new project (say myproject).


 Convert the newly created project into a Maven project.
 Copy the following content in the pom.xml. This file contains the Maven
dependencies for Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0
   http://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>
<groupId>Pig_Udf</groupId>
<artifactId>Pig_Udf</artifactId>
<version>0.0.1-SNAPSHOT</version>

<build>
<sourceDirectory>src</sourceDirectory>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</build>

<dependencies>

<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>0.15.0</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
</dependency>

</dependencies>

</project>
 Save the file and refresh it. In the Maven Dependencies section, you can
find the downloaded jar files.
 Create a new class file with name Sample_Eval and copy the following
content in it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String> {

   public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
         return null;
      String str = (String) input.get(0);
      return str.toUpperCase();
   }
}
While writing UDF’s, it is mandatory to inherit the EvalFunc class and provide an
implementation for the exec() function, within which the code required for the
UDF is written. In the above example, we have written the code to convert the
contents of the given column to uppercase.
 After compiling the class without errors, right-click on the
Sample_Eval.java file. It gives you a menu. Select export as shown in the
following screenshot.
 On clicking export, you will get the following window. Click on JAR file.
 Proceed further by clicking Next> button. You will get another window
where you need to enter the path in the local file system, where you need
to store the jar file.
 Finally click the Finish button. In the specified folder, a Jar
file sample_udf.jar is created. This jar file contains the UDF written in
Java.

Using the UDF


After writing the UDF and generating the Jar file, follow the steps given below −
Step 1: Registering the Jar file
After writing UDF (in Java) we have to register the Jar file that contain the UDF
using the Register operator. By registering the Jar file, users can intimate the
location of the UDF to Apache Pig.
Syntax
Given below is the syntax of the Register operator.
REGISTER path;
Example
As an example let us register the sample_udf.jar created earlier in this chapter.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown
below.
$cd PIG_HOME/bin
$./pig -x local

REGISTER '/$PIG_HOME/sample_udf.jar'
Note − We assume the Jar file is in the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining Alias
After registering the UDF we can define an alias to it using the Define operator.
Syntax
Given below is the syntax of the Define operator.
DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
Example
Define the alias sample_eval for the Sample_Eval class as shown below.
DEFINE sample_eval Sample_Eval();
Step 3: Using the UDF
After defining the alias you can use the UDF same as the built-in functions.
Suppose there is a file named emp1.txt in the HDFS directory /pig_data/ with
the following content.
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London
011,Stacy,25,Bhuwaneshwar
012,Kelly,22,Chennai
And assume we have loaded this file into Pig as shown below.
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees into upper case using the
UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
grunt> Dump Upper_case;

(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)

Apache Pig - Running Scripts


Here in this chapter, we will see how to run Apache Pig scripts in batch
mode.

Comments in Pig Script


While writing a script in a file, we can include comments in it as shown below.
Multi-line comments
We will begin the multi-line comments with '/*', end them with '*/'.
/* These are the multi-line comments
In the pig script */
Single-line comments
We will begin the single-line comments with '--'.
--we can write single line comments like this.

Executing Pig Script in Batch mode


While executing Apache Pig statements in batch mode, follow the steps given
below.
Step 1
Write all the required Pig Latin statements in a single file. We can write all the Pig
Latin statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell
(Linux) as shown below.

Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig

You can execute it from the Grunt shell as well using the exec command as
shown below.
grunt> exec /sample_script.pig
Executing a Pig Script from HDFS
We can also execute a Pig script that resides in the HDFS. Suppose there is a
Pig script with the name Sample_script.pig in the HDFS directory
named /pig_data/. We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
Example
Assume we have a file student_details.txt in HDFS with the following content.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
We also have a sample script with the name sample_script.pig, in the same
HDFS directory. This file contains statements performing operations and
transformations on the student relation, as shown below.
student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
student_order = ORDER student BY age DESC;

student_limit = LIMIT student_order 4;

Dump student_limit;
 The first statement of the script will load the data in the file
named student_details.txt as a relation named student.
 The second statement of the script will arrange the tuples of the relation in
descending order, based on age, and store it as student_order.
 The third statement of the script will store the first 4 tuples
of student_order as student_limit.
 Finally the fourth statement will dump the content of the
relation student_limit.
Let us now execute the sample_script.pig as shown below.
$./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig
Apache Pig gets executed and gives you the output with the following content.
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
2015-10-19 10:31:27,446 [main] INFO org.apache.pig.Main - Pig script completed in 12
minutes, 32 seconds and 751 milliseconds (752751 ms)
Apache Pig Installation
In this section, we will perform the pig installation.

Pre-requisite
o Java Installation - Check whether Java is installed using the
following command.

$ java -version

o Hadoop Installation - Check whether Hadoop is installed using the
following command.

$ hadoop version

If either of them is not installed in your system, install it before
proceeding.

Steps to install Apache Pig


o Download the Apache Pig tar file.
o Unzip the downloaded tar file.

$ tar -xvf pig-0.16.0.tar.gz

o Open the bashrc file.

$ sudo nano ~/.bashrc

o Now, provide the following PIG_HOME path.

export PIG_HOME=/home/hduser/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
o Update the environment variable

$ source ~/.bashrc

o Let's test the installation. On the command prompt, type:

$ pig -h

o Let's start Pig in MapReduce mode.

$ pig
Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.

Local Mode
o It executes in a single JVM and is used for development, experimentation,
and prototyping.
o Here, files are installed and run using localhost.
o The local mode works on a local file system. The input and output data are
stored in the local file system.

The command for local mode grunt shell:

$ pig -x local

MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on
the cluster.
o It can be executed against a semi-distributed or fully distributed Hadoop
installation.
o Here, the input and output data are present on HDFS.

The command for Map reduce mode:

$ pig

Or,

$ pig -x mapreduce

Ways to execute Pig Program


These are the following ways of executing a Pig program on local and
MapReduce mode: -

o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To
invoke the Grunt shell, run the pig command. Once the Grunt shell starts,
we can provide Pig Latin statements and commands interactively at the
command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions. These
functions can be called as UDF (User Defined Functions). Here, we use
programming languages like Java and Python.
