Apache Pig
•An abstraction over MapReduce.
•A platform used to analyze large data sets.
•Pig is used with Hadoop.
•The language for Pig is Pig Latin.
•Pig scripts are internally converted to MapReduce jobs
and executed on data stored in HDFS.
•Every task that can be achieved using Pig can also be
achieved by writing MapReduce code in Java.
Why Do We Need Apache Pig?
•Using Pig Latin, programmers can perform MapReduce
tasks easily without having to type complex code in Java.
•Pig Latin is an SQL-like language.
•Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc.
•It also provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig
•Rich set of operators − join, sort, filter, etc.
•Ease of programming − Pig Latin is similar to SQL.
•Optimization opportunities − Apache Pig optimizes the execution
of tasks automatically, so programmers can focus on semantics.
•Extensibility − Using the existing operators, users can develop
their own functions to read, process, and write data.
•Handles all kinds of data − both structured as well as
unstructured.
•It stores the results in HDFS.
•UDFs − Pig provides the facility to create user-defined
functions in other programming languages as well.
Apache Pig Vs MapReduce
•Apache Pig is a data flow language.
•MapReduce is a data processing paradigm.
•Pig is a high-level language.
•MapReduce is low-level and rigid.
•Performing a Join operation in Apache Pig is pretty simple.
•It is quite difficult in MapReduce to perform a Join operation
between datasets.
Apache Pig Vs MapReduce
•Apache Pig uses multi-query approach, thereby reducing the
length of the codes to a great extent.
•MapReduce may require almost 20 times the number of lines
to perform the same task (see the word-count sketch after this list).
•There is no need for compilation. On execution, every
Apache Pig operator is converted internally into a
MapReduce job.
•MapReduce jobs have a long compilation process.
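To make the compactness claim concrete, here is a minimal word-count sketch in Pig Latin (the input path 'input.txt' is an illustrative placeholder); the equivalent hand-written Java MapReduce program is typically many times longer:
-- split each line into words, then count occurrences per word
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;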
Apache Pig Vs Hive
•Pig Latin is a data flow language.
•HiveQL is a query processing language.
•Pig Latin is a procedural language and fits the pipeline
paradigm.
•HiveQL is a declarative language.
•Apache Pig can handle structured, unstructured, and
semi-structured data.
•Hive is mostly for structured data.
Advantages of Pig
•Code reusability.
•Faster development.
•Fewer lines of code.
•Ideal for ETL operations.
•It allows specifying a detailed, step-by-step procedure by which
the data is to be transformed.
•Schema and type checking. It can handle data with
inconsistent schemas.
Pig Latin, Pig Engine, Pig script
Pig Latin:
•provides various operators with which programmers can
develop their own functions for reading, writing, and
processing data.
Pig Engine:
•Pig Engine component of Pig accepts the Pig Latin scripts as
input and converts those scripts into MapReduce jobs.
Pig scripts:
•To analyze data using Apache Pig, programmers need to
write scripts using Pig Latin language.
Pig has two execution modes
Local Mode:
Pig runs in a single JVM and uses the local file system.
This mode is suitable only for analyzing small data sets
with Pig.
This mode is generally used for testing purposes.
HDFS (MapReduce) Mode:
-In this mode, queries written in Pig Latin are translated into
MapReduce jobs and run on a Hadoop cluster.
-MapReduce mode with a fully distributed cluster is useful for
running Pig on large data sets.
Apache Pig Components
•Parser
-checks the syntax of the script, does type checking, and other
miscellaneous checks. The output of the parser is a DAG
(directed acyclic graph) of the Pig Latin statements and
logical operators.
•Optimizer
-carries out the logical optimizations
•Compiler
-compiles the optimized logical plan into a series of
MapReduce jobs.
•Execution engine
- MapReduce jobs are executed on Hadoop producing the
desired results
Apache Pig Execution Modes
• Interactive Mode (Grunt shell)
$ ./pig -x local
$ ./pig -x mapreduce
• Batch Mode (Script)
$ pig -x local Sample_script.pig
$ pig -x mapreduce Sample_script.pig
• Embedded Mode (UDF)
Why UDF?
•Do operations on more than one field
•Do more than grouping and filtering
•The programmer is more comfortable writing the logic in a
general-purpose language
•Want to reuse existing logic
Traditionally, UDFs could be written only in Java. Now other
languages such as Python are also supported.
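A minimal sketch of calling a UDF from Pig Latin, assuming a hypothetical jar myudfs.jar that contains a Java class myudfs.UPPER (both names are made up for illustration):
-- register the jar so Pig can locate the UDF class (hypothetical names)
REGISTER myudfs.jar;
DEFINE UPPER myudfs.UPPER();
names = LOAD 'data/names.txt' USING PigStorage(',') AS (name:chararray);
-- apply the UDF to every tuple
upper_names = FOREACH names GENERATE UPPER(name);
DUMP upper_names;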
Apache Pig - Architecture
•Programmers write scripts in the Pig Latin language and execute
them using any of the execution mechanisms.
•After execution, these scripts will go through a series of
transformations applied by the Pig Framework, to produce
the desired output.
•Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job
easy.
Shell Command in Pig
Syntax
grunt> sh <shell command> <parameters>
Example:
grunt> sh ls
PigStorage
•A built-in function of Pig.
•PigStorage is used to load and store data in Pig scripts.
•PigStorage can be used to parse text data with an arbitrary
delimiter or to output data in a delimited format.
Viewing Data
DUMP input;
Very useful for debugging, but not very useful for huge
datasets.
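For huge relations, a common pattern is to DUMP only a small sample; a quick sketch (large_data is an illustrative alias):
-- inspect a handful of tuples instead of dumping everything
few = LIMIT large_data 10;
DUMP few;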
Load and Store example
-- load comma-delimited input
data = LOAD 'data/data-bag.txt'
USING PigStorage(',');
-- write it back out pipe-delimited
STORE data INTO 'data/output/load-store'
USING PigStorage('|');
Loading Data into Pig
file = LOAD '/data/dropbox-policy.txt' AS
(line);
data = LOAD '/data/tweets.csv' USING
PigStorage(',');
data = LOAD '/data/tweets.csv'
USING PigStorage(',')
AS (field1, field2, field3);
Storing Data from Pig
STORE data INTO 'output_location';
STORE data INTO 'output_location'
USING PigStorage();
STORE data INTO 'output_location'
USING PigStorage(',');
•As with LOAD, many options are available
•Can store locally or in HDFS
Data Types used in Pig Latin
•Scalar Types
•Complex Types
Scalar Types
•int, long – (32, 64 bit) integer
•float, double – (32, 64 bit) floating point
•boolean (true/false)
•chararray (String in UTF-8)
•bytearray (blob) (DataByteArray in Java)
Complex Types
•tuple – ordered set of fields
•(data) bag – collection of tuples (NESTED)
•map – set of key-value pairs
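A sketch of a LOAD schema that declares all three complex types (the file path and field names are illustrative):
data = LOAD 'data/complex.txt' AS (
    t: tuple(f1:int, f2:chararray),   -- tuple: ordered set of fields
    b: bag{tt: tuple(x:int)},         -- bag: collection of tuples
    m: map[]                          -- map: key-value pairs
);
DESCRIBE data;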
Schemas in Load statement
We can specify a schema in LOAD statements:
data = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
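DESCRIBE confirms the schema that was attached:
DESCRIBE data;
-- prints something like: data: {f1: int,f2: int,f3: int}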
Pig Latin – Relational Operations
Loading and Storing
•LOAD - To Load the data from the file system (local/HDFS)
into a relation.
•STORE - To save a relation to the file system (local/HDFS).
Filtering
•FILTER - To remove unwanted rows from a relation.
•DISTINCT - To remove duplicate rows from a relation.
•FOREACH, GENERATE - To generate data transformations
based on columns of data.
Grouping and Joining
•JOIN - To join two or more relations.
•COGROUP - To group the data in two or more relations.
•GROUP - To group the data in a single relation.
•CROSS - To create the cross product of two or more
relations (see the sketch below).
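A short sketch of COGROUP and CROSS, reusing the two sample files from the JOIN example later in this section:
a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int);
-- one tuple per key, holding a bag of matching tuples from each relation
cg = COGROUP a BY f1, b BY t1;
-- every tuple of a paired with every tuple of b
cp = CROSS a, b;
DUMP cg;
DUMP cp;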
Sorting
•ORDER - To arrange a relation in sorted order based on one or
more fields (ascending or descending).
•LIMIT - To get a limited number of tuples from a relation.
Combining and Splitting
•UNION - To combine two or more relations into a single
relation.
•SPLIT - To split a single relation into two or more
relations (see the sketch below).
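A sketch of UNION followed by SPLIT (the input paths are illustrative; UNION expects both relations to share the same field layout):
a = LOAD 'data/a.txt' USING PigStorage(',') AS (f1:int, f2:int);
b = LOAD 'data/b.txt' USING PigStorage(',') AS (f1:int, f2:int);
u = UNION a, b;
-- route tuples into two relations based on a condition
SPLIT u INTO small IF f1 < 10, large IF f1 >= 10;
DUMP small;
DUMP large;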
Diagnostic Operators
•DUMP - To print the contents of a relation on the console.
•DESCRIBE - To describe the schema of a relation.
•EXPLAIN - To view the logical, physical, or MapReduce
execution plans used to compute a relation.
•ILLUSTRATE - To view the step-by-step execution of a series
of statements.
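All four diagnostic operators applied to one relation, as a quick sketch:
data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DESCRIBE data;    -- prints the schema
EXPLAIN data;     -- shows the logical, physical, and MapReduce plans
ILLUSTRATE data;  -- walks sample tuples through each step
DUMP data;        -- prints the tuples themselves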
FOREACH
Generates data transformations based on columns of data
x = FOREACH data GENERATE *;
x = FOREACH data GENERATE $0, $1;
x = FOREACH data GENERATE $0 AS first, $1
AS second;
GROUP
• Groups data in one or more relations
• Groups tuples that have the same group key
• Similar to SQL group by operator
outerbag = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP outerbag;
innerbag = GROUP outerbag BY f1;
DUMP innerbag;
FILTER
Selects tuples from a relation based on some condition
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
filtered = FILTER data BY f1 == 1;
DUMP filtered;
COUNT
Counts the number of tuples in a relation
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
grouped = GROUP data BY f2;
counted = FOREACH grouped GENERATE group,
COUNT(data);
DUMP counted;
ORDER BY
Sorts a relation based on one or more fields. Similar to SQL ORDER BY.
data = LOAD 'data/nested-sample.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
ordera = ORDER data BY f1 ASC;
DUMP ordera;
orderd = ORDER data BY f1 DESC;
DUMP orderd;
DISTINCT
Removes duplicates from a relation
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
unique = DISTINCT data;
DUMP unique;
LIMIT
Limits the number of tuples in the output.
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
limited = LIMIT data 3;
DUMP limited;
JOIN
Joins relations based on a field. Both inner and outer joins are
supported.
a = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP a;
b = LOAD 'data/simple-tuples.txt'
USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;
joined = JOIN a BY f1, b BY t1;
DUMP joined;
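The example above is an inner join; an outer join needs only an extra keyword. A sketch using the same relations:
-- LEFT OUTER keeps every tuple of a, with nulls where b has no match
outer_joined = JOIN a BY f1 LEFT OUTER, b BY t1;
DUMP outer_joined;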
Pig Commands
(Using Pig's Grunt Shell Interface.)
• grunt> movies = LOAD 'Movies.txt' USING PigStorage(',') as (id:int, name:chararray, year:int,
rating:float, duration:int);
• grunt> dump movies;
• grunt> B = GROUP movies ALL;
• grunt> C = FOREACH B GENERATE group, COUNT(movies);
• grunt> DUMP C;
• grunt> STORE C INTO '/OUTPUT_PIG' USING PigStorage(','); (the output directory should not
already exist in HDFS)
• $ hadoop fs -ls /OUTPUT_PIG
• Found 2 items
• -rw-rw-rw- 1 bedrock supergroup 0 2015-07-31 10:30 /OUTPUT_PIG/_SUCCESS
• -rw-rw-rw- 1 bedrock supergroup 7 2015-07-31 10:30 /OUTPUT_PIG/part-r-00000
• [bedrock@cdh-5-2 ~]$ hadoop fs -cat /OUTPUT_PIG/part-r-00000
• all,10
Note: The input text file (Movies.txt) should already exist on HDFS
Using Pig to get the difference between two
text files
• file1_set = LOAD '/home/bedrock/TEST_DATA/file1.txt' USING PigStorage(',') AS
(id:int, source_address:chararray, source_city:chararray, source_name:chararray,
dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• file2_set = LOAD '/home/bedrock/TEST_DATA/file2.txt' USING PigStorage(',') AS
(id:int, source_address:chararray, source_city:chararray, source_name:chararray,
dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• cogroup_set = COGROUP file1_set BY id, file2_set BY id;
• DUMP cogroup_set;
• diff_data = FOREACH cogroup_set GENERATE DIFF(file1_set, file2_set);
• DUMP diff_data;
Optimizing Pig Scripts
•Project early and often
•Filter early and often
•Drop nulls before a join
•Prefer DISTINCT over GROUP BY
•Use the right data structure
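A sketch combining several of these tips before a join, reusing relations a and b from the JOIN example above:
-- project early: keep only the column the join needs
a_small = FOREACH a GENERATE f1;
-- filter early: drop nulls before the join, not after
a_clean = FILTER a_small BY f1 IS NOT NULL;
joined = JOIN a_clean BY f1, b BY t1;
DUMP joined;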
What are the limitations of the Pig?
•As the Pig platform is designed for ETL-type use cases, it is
not a good choice for real-time scenarios.
•Apache Pig is not a good choice for pinpointing a single
record in huge data sets.
•Apache Pig is built on top of MapReduce, which is batch
processing oriented.
Is Pig script case sensitive?
•Pig Latin is partly case sensitive and partly case insensitive.
•User-defined functions, field names, and relation names are
case sensitive: M = LOAD 'data' is not the same as
M = LOAD 'Data'.
•Pig Latin keywords, however, are case insensitive; i.e., LOAD is
the same as load.