PIG
UNIT-2
EVOLUTION OF PIG
• Apache Pig was initially developed by Yahoo researchers in the year 2006.
• The main motivation behind developing Pig was to simplify the creation and execution of MapReduce jobs on large datasets.
PIG
• Apache Pig is a tool/platform for creating and executing MapReduce programs used with Hadoop
• Apache Pig is an abstraction over MapReduce
• It is a tool/platform for analyzing large sets of data
• It provides a high-level scripting language, known as Pig Latin, which is used
to develop the data analysis code
PIG LATIN
• Pig Latin is a high-level scripting language used to develop the data analysis code
• Pig has two components
Pig Latin : Provides an environment to develop the scripts for processing
the data stored in HDFS
Pig Engine : Converts Pig Latin scripts into MapReduce tasks
• The results of Pig are always stored in HDFS
Apache Pig | MapReduce
It is a scripting language. | It is a compiled programming language.
Abstraction is at a higher level. | Abstraction is at a lower level.
It has fewer lines of code as compared to MapReduce. | Lines of code are more.
Less effort is needed for Apache Pig. | More development effort is required for MapReduce.
Code efficiency is less as compared to MapReduce. | As compared to Pig, efficiency of code is higher.
Apache Pig | MapReduce
Pig is a data flow language. | MapReduce is a data processing paradigm.
Performing a join operation in Pig is quite straightforward. | Joining datasets is a complex task.
Requires only a fundamental knowledge of SQL. | Java expertise is very much required.
FEATURES OF APACHE PIG
• User-defined Functions: Pig gives the ability to create UDFs in
other programming languages like Java and embed or invoke them in
Pig scripts.
• Handles a wide range of data: Apache Pig analyzes all kinds of
data, both unstructured as well as structured. It stores the
outcomes in the Hadoop Distributed File System.
• Rich set of operators: It provides numerous operators to perform
tasks like filter, sort, join, and so on.
FEATURES OF APACHE PIG
• Extensibility: Using the existing operators, users can develop their own functions
to read, process, and write data.
• Simplicity of programming: Pig Latin is similar to Structured Query Language, and it is
easy to write a Pig script if you are good at Structured Query Language.
• Optimization opportunities: The tasks in Apache Pig optimize their
execution automatically, so programmers need to focus only on the semantics
of the language.
DATA MODELS IN PIG
Atom: An atomic data value, stored internally as a string. The main advantage of
this model is that a value can be used both as a number and as a string.
Tuple: An ordered set of fields, which may contain a different
data type for each field.
Bag: A collection of tuples; these tuples can be a subset of the rows
or whole rows of a table.
Map: A set of key-value pairs used to represent data elements.
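As a quick illustration, here is how each of these data types looks as a Pig Latin literal (the field names and values below are illustrative assumptions):
Atom : 'john' or 27
Tuple : (john,27)
Bag : {(john,27),(mary,31)}
Map : [name#john, age#27]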
APPLICATIONS OF PIG
• For exploring large datasets.
• Provides support for ad-hoc queries across large datasets.
• In the prototyping of algorithms for processing large datasets.
• In processing time-sensitive data loads.
• For collecting large amounts of data in the form of search logs and web
crawls.
• Used where analytical insights are needed through sampling.
PIG INSTALLATION
Pig runs as a client-side application.
Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.
Prerequisite : Java 6
Download link: http://hadoop.apache.org/pig/releases.html
Unpack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
Add Pig’s binary directory to your command-line path:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
Set the JAVA_HOME environment variable to point to a suitable Java installation.
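For example (the path below is an assumption; point it at your own Java installation):
% export JAVA_HOME=/usr/lib/jvm/java-6-sun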
EXECUTION TYPES
Pig has two execution types or modes:
Local mode
In local mode, Pig runs in a single JVM and accesses the local filesystem.
This mode is suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option.
To run in local mode, set the option to local:
% pig -x local
grunt>          (Grunt, Pig's interactive shell)
EXECUTION TYPES
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
We run MapReduce mode (with a fully distributed cluster) when we want to run Pig on large
datasets.
To use MapReduce mode, we must check the compatibility between the versions of Pig and Hadoop we are using.
If a Pig release supports multiple versions of Hadoop, you can use the environment variable
PIG_HADOOP_VERSION to tell Pig the version of Hadoop it is connecting to.
export PIG_HADOOP_VERSION=18
Next, you need to point Pig at the cluster’s namenode and jobtracker.
If you already have a Hadoop site file (or files) that define fs.default.name and mapred.job.tracker,
simply add Hadoop’s configuration directory to Pig’s classpath:
% export PIG_CLASSPATH=$HADOOP_INSTALL/conf/
EXECUTION TYPES
MapReduce mode
Alternatively, you can set these two properties in the pig.properties file in Pig’s conf directory:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
Launch Pig, setting the -x option to mapreduce, or omitting it entirely, as MapReduce mode is the default. Pig reports the filesystem and jobtracker that it has connected to:
% pig
10/07/16 16:27:37 INFO pig.Main: Logging error messages to: /Users/tom/dev/pig-0.7.0/pig_1279294057867.log
2010-07-16 16:27:38,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
2010-07-16 16:27:38,741 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
RUNNING PIG PROGRAMS
There are three ways of executing Pig programs
They all work in both local and MapReduce mode:
Script
Pig can run a script file that contains Pig commands.
For example, pig script.pig runs the commands in the local file script.pig.
You can also use the -e option to run a script specified as a string on the command line.
Grunt
Grunt is an interactive shell for running Pig commands.
Grunt will be started when no file is specified for Pig to run, and the -e option is not used.
It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.
PIG LATIN STRUCTURE
A Pig Latin program consists of a collection of statements. A statement can be thought
of as an operation or a command.
Statements are usually terminated with a semicolon.
Statements or commands for interactive use in Grunt do not need the terminating semicolon.
Statements that have to be terminated with a semicolon can be split across multiple lines
for readability:
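For example, the LOAD statement used later in this unit can be written as:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);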
Comments :
Single line comments -- Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter
Ex: DUMP A; -- What's in A?
Multiline comment: /* Everything in between is ignored by the interpreter */
PIG LATIN STRUCTURE
Pig Latin has a list of keywords that have a special meaning in the language and cannot
be used as identifiers.
These include:
operators (LOAD, ILLUSTRATE, etc.)
commands (cat, ls, etc.)
expressions (matches, FLATTEN, etc.)
functions (DIFF, MAX, etc.)
Pig Latin has mixed rules on case sensitivity.
Operators and commands are not case sensitive (to make interactive use more forgiving);
aliases and function names are case sensitive.
Parser
• As a Pig Latin program is executed, each statement is parsed in turn.
• If any syntactical errors are encountered, the interpreter halts and
displays the error message. Otherwise, it builds a logical plan for each Pig Latin
statement (operator).
• The logical plan for the statement is added to the logical plan for the
program so far, and then the interpreter moves on to the next statement.
• No data processing takes place while the logical plan of the program is
being constructed.
• The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators.
Optimizer
The logical plan (DAG) is passed to the logical
optimizer, which carries out logical optimizations such as
projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a
series of MapReduce jobs.
Execution engine
• Finally, the MapReduce jobs are submitted to Hadoop in
sorted order and executed on Hadoop, producing the desired results.
PIG STATEMENTS
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
Statements allow you to transform a relation by sorting, grouping, joining, projecting, and filtering.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the data.
LOADING AND STORING OPERATORS
LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
Syntax :
LOAD '<path>' USING PigStorage('<delimiter>') AS (<schema>);
STORE:
STORE is used to save results to the file system.
Syntax: STORE <relation1> INTO '<location>';
DUMP:
Prints a relation to the console.
Syntax: DUMP <relation>
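A minimal sketch tying these three operators together (the file path, delimiter, and schema are illustrative assumptions):
grunt> records = LOAD 'input/sample.txt' USING PigStorage(',') AS (year:chararray, temperature:int, quality:int);
grunt> STORE records INTO 'output/records';
grunt> DUMP records;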
FILTERING OPERATORS
FOREACH... GENERATE:
This operator generates data transformations based on columns of data. It is used to add or remove fields from
a relation.
Syntax:
FOREACH <relation1> GENERATE (<fields>);
FILTER:
This operator selects tuples from a relation based on a condition.
Syntax: FILTER <relation1> BY <condition>
DISTINCT:
Distinct removes duplicate tuples in a relation.
Syntax: DISTINCT <relation1>
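A brief sketch of these operators, assuming the records relation loaded in the earlier example:
grunt> years = FOREACH records GENERATE year;          -- keep only the year field
grunt> good_records = FILTER records BY quality == 1;  -- keep satisfactory readings
grunt> distinct_years = DISTINCT years;                -- remove duplicate years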
FILTERING OPERATORS
MAPREDUCE:
Runs a MapReduce job using a relation as input
Syntax:
MAPREDUCE 'mr.jar' STORE <relation1> INTO 'inputLocation' LOAD 'outputLocation' AS (<schema>);
STREAM:
Transforms a relation using an external program
Syntax: STREAM alias [, alias …] THROUGH {'command' | cmd_alias } [AS schema]
;
SAMPLE:
Selects a random sample of a relation.
Syntax: SAMPLE <relation1> <sample_size>;
ASSERT:
Ensures a condition is true for all rows in a relation; otherwise, the job fails.
Syntax: ASSERT <relation1> BY <condition> [, <message>];
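A hedged sketch of STREAM, SAMPLE, and ASSERT (the external command, sample size, and message are illustrative assumptions; records is the relation from the earlier examples):
grunt> second_col = STREAM records THROUGH `cut -f 2`;  -- pipe tuples through an external program
grunt> some = SAMPLE records 0.1;                       -- keep roughly 10% of the tuples
grunt> ASSERT records BY temperature != 9999, 'temperature cannot be 9999';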
PIG GROUPING OPERATORS
GROUP
It groups the data in a single relation.
Syntax:
GROUP <relation1> BY (<fields>);
COGROUP:
Used for grouping the data from two or more relations.
Syntax: COGROUP <relation1> BY <field>, <relation2> BY <field>
CROSS:
We can create the cross (Cartesian) product of two or more relations.
Syntax : CROSS <relation1> ,<relation2>
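A short sketch of these grouping operators (relations A and B with an id field are assumptions; records is from the earlier examples):
grunt> grouped = GROUP records BY year;
grunt> cogrouped = COGROUP A BY id, B BY id;
grunt> product = CROSS A, B;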
PIG GROUPING OPERATORS
JOIN
Joins two or more relations based on common field values (keys).
Syntax:
JOIN Relation1_name BY key, Relation2_name BY key
CUBE:
Efficiently performs aggregations based on multiple dimensions.
Syntax: CUBE people BY CUBE(gender, sport);
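A minimal sketch (the customers and orders relations with their key fields are assumptions; the CUBE line reuses the people relation from the syntax above):
grunt> joined = JOIN customers BY id, orders BY customer_id;
grunt> cubed = CUBE people BY CUBE(gender, sport);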
PIG SORTING OPERATORS
RANK :
Assigns a rank to each tuple in a relation, optionally sorting by fields first.
Syntax: RANK <relation>
ORDER BY:
Order By is used to sort a relation based on one or more fields. You can do sorting in ascending or descending
order using ASC and DESC keywords.
Syntax: ORDER <relation1> BY <field> [ASC|DESC];
LIMIT:
LIMIT operator is used to limit the number of output tuples. If the specified number of output tuples is equal to or
exceeds the number of tuples in the relation, the output will include all tuples in the relation.
Syntax:
LIMIT <relation1> <number_of_tuples>;
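A brief sketch of the sorting operators, reusing the records relation from the earlier examples:
grunt> sorted = ORDER records BY temperature DESC;
grunt> top5 = LIMIT sorted 5;    -- keep only the first five tuples
grunt> ranked = RANK records;    -- prepend a rank to each tuple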
PIG COMBINE/SPLIT OPERATORS
UNION :
Combines two or more relations into one
Syntax: UNION <relation1> ,<relation2>
SPLIT:
The SPLIT operator is used to partition the contents of a relation into two or more relations based on some expression.
Syntax: SPLIT <relation1> INTO <relation2> IF <condition>, <relation3> IF <condition>;
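A minimal sketch of SPLIT and UNION (the temperature threshold is an illustrative assumption; records is from the earlier examples):
grunt> SPLIT records INTO hot IF temperature > 30, cold IF temperature <= 30;
grunt> combined = UNION hot, cold;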
EXAMPLE
Writing a program to calculate the maximum recorded temperature by year for the
sample weather dataset in Pig Latin. The input file contains:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year : chararray, temperature : int, quality : int);
Relations are given names, or aliases, so they can be referred to. This relation is given
the records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
EXAMPLE
We can also see the structure of a relation—the relation’s schema—using the
DESCRIBE operator on the relation’s alias:
grunt> DESCRIBE records;
records: {year: chararray, temperature: int, quality: int}
To remove records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality
reading
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality
== 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
EXAMPLE
We can use the GROUP operator to group the filtered_records relation by the
year field:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, filtered_records: {year:
chararray, temperature: int,quality: int}}
EXAMPLE
To compute the maximum temperature of each filtered_records bag, we use MAX. MAX is a built-in function for
calculating the maximum value of fields in a bag.
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111)
(1950,22)
DIAGNOSTIC OPERATORS
DESCRIBE
Prints a relation's schema
Syntax: DESCRIBE <relation>
EXPLAIN
Prints the logical and physical plans
Syntax: EXPLAIN <relation>
ILLUSTRATE
Shows sample execution of the logical plan using a generated subset of the input
Syntax: ILLUSTRATE <relation>
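A quick sketch of these operators applied to the relations from the worked example above:
grunt> DESCRIBE records;
grunt> EXPLAIN max_temp;
grunt> ILLUSTRATE max_temp;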
GENERATING EXAMPLES
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete and concise sample dataset.
PIG LATIN COMMANDS
EXPRESSIONS IN PIG
TYPES OF PIG LATIN
MAPS IN PIG
• Maps are always loaded from files, since there is no relational operator in Pig that produces a
map.
• It’s possible to write a UDF to generate maps, if desired.
• A relation is a top-level construct, whereas a bag has to be contained in a relation.
• It’s not possible to create a relation from a bag literal:
A = {(1,2),(3,4)}; -- Error
• You can’t treat a relation like a bag and project a field into a new relation:
B = A.$0; -- Error
Instead, you have to use a relational operator to turn the relation A into relation B:
B = FOREACH A GENERATE $0;
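A hedged sketch of loading and dereferencing a map field (the file path, schema, and key are illustrative assumptions):
grunt> A = LOAD 'input/data.txt' AS (info:map[]);
grunt> B = FOREACH A GENERATE info#'name';  -- look up the value stored under the 'name' key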
CASE OPERATOR
Case operator is equivalent to nested bincond operators.
Syntax: CASE ... WHEN ... THEN ... ELSE ... END
Usage:
CASE expression [ WHEN value THEN value ]+ [ ELSE value ]? END
CASE [ WHEN condition THEN value ]+ [ ELSE value ]? END
• The schemas for all the outputs of the when/else branches should match.
• Use expressions only (relational operators are not allowed).
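A minimal sketch of both forms (a relation A with an integer field i is an assumption):
grunt> X = FOREACH A GENERATE (CASE i % 2 WHEN 0 THEN 'even' ELSE 'odd' END);
grunt> Y = FOREACH A GENERATE (CASE WHEN i > 0 THEN 'positive' ELSE 'non-positive' END);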
UDF STATEMENTS
REGISTER:
Registers a JAR file with the Pig runtime so that its User Defined Functions can be used.
Syntax: REGISTER <path>
DEFINE:
Assigns an alias to a UDF or streaming command.
Syntax: DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] }
Since they do not process relations, commands are not added to the logical plan; instead,
they are executed immediately. They are NON-LOGICAL PLAN STATEMENTS
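A hedged sketch of both statements (the JAR path, class name, and field are illustrative assumptions):
grunt> REGISTER ./myudfs.jar;                    -- make the JAR's classes visible to Pig
grunt> DEFINE toUpper com.example.pig.UPPER();   -- give the UDF a short alias
grunt> B = FOREACH A GENERATE toUpper(name);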
FUNCTIONS
• Functions in Pig come in four types:
Eval function
A function that takes one or more expressions and returns another expression
Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value
Ex: MAX
Furthermore, many aggregate functions are algebraic, which means that the result of the function may be calculated
incrementally.
Ex: MAX
Ex: Median is non-algebraic (it cannot be calculated incrementally).
Filter function
A special type of eval function that returns a logical boolean result.
filter functions are used in the FILTER operator to remove unwanted rows.
An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
FUNCTIONS
Load function
A function that specifies how to load data into a relation from external storage.
Store function
A function that specifies how to save the contents of a relation to external storage.
Ex: PigStorage
USER-DEFINED FUNCTIONS
Plugging custom code into Pig statements is crucial for many data processing jobs.
The User Defined Functions (UDFs) of Pig are meant to achieve this.
A Filter UDF
Writing a filter function for filtering out weather records that do not have a temperature quality reading of
satisfactory
To change the line
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR
quality == 5 OR quality == 9);
to:
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
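Before isGood can be called, the UDF must be registered and aliased, along the lines of this sketch (the JAR path and class name are illustrative assumptions):
grunt> REGISTER pig-examples.jar;
grunt> DEFINE isGood com.example.pig.IsGoodQuality();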
UDF – FUNCTION RESOLUTION
Pig resolves function calls by treating the function’s name as a Java class name and attempting to load a
class of that name.
When searching for classes, Pig uses a class loader that includes the JAR files that have been registered.
When running in distributed mode, Pig will ensure that your JAR files get shipped to the cluster.
Pig has a set of built-in package names that it searches, so the function call does not have to be a fully
qualified name.
MAX is actually implemented by a class MAX in the package org.apache.pig.builtin, which is one of the
built-in packages of Pig, so the MAX function can be written as MAX rather than org.apache.pig.builtin.MAX.
PIG IN PRACTICE
There are some practical techniques that are worth knowing about when you are developing and running
Pig programs.
Parallelism
• When running in MapReduce mode, you need to tell Pig how many reducers you want for each job.
• You do this using a PARALLEL clause for operators that run in the reduce phase, which includes all
the grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and
ORDER.
• grouped_records = GROUP records BY year PARALLEL 30;
PIG IN PRACTICE
Parameter Substitution
Pig supports parameter substitution, where parameters in the script are substituted with values
supplied at runtime.
Parameters are denoted by identifiers prefixed with a $ character.
Parameters can be specified when launching Pig, using the -param option, one for each parameter:
% pig -param input=<inputpath> -param output=<outputpath> script.pig
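A minimal sketch of a parameterized script (max_temp.pig is a hypothetical file name; the body reuses the weather example, with a simplified quality check):
-- max_temp.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality == 1;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
STORE max_temp INTO '$output';
It could then be launched with:
% pig -param input=input/ncdc/micro-tab/sample.txt -param output=max_temp_out max_temp.pig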