Apache Pig
•An abstraction over MapReduce.
•A platform used to analyze large data sets.
•Pig is used with Hadoop.
•The language for Pig is Pig Latin.
•Pig scripts are internally converted to MapReduce jobs
and executed on data stored in HDFS.
•Every task that can be achieved using Pig can also be
achieved by writing MapReduce code in Java.
Why Do We Need Apache Pig?
•Using Pig Latin, programmers can perform MapReduce
tasks easily without having to type complex code in Java.
•Pig Latin is an SQL-like language.
•Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc.
•It also provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig
•Rich set of operators − join, sort, filter, etc.
•Ease of programming − Pig Latin is similar to SQL.
•Optimization opportunities − Apache Pig optimizes the execution
of tasks automatically, so programmers can focus on semantics.
•Extensibility − Using the existing operators, users can develop
their own functions to read, process, and write data.
•Handles all kinds of data − both structured as well as
unstructured.
•It stores the results in HDFS.
•UDFs − Pig provides the facility to create user-defined
functions in other programming languages as well.
Apache Pig Vs MapReduce
•Apache Pig is a data flow language.
•MapReduce is a data processing paradigm.
•Pig is a high-level language.
•MapReduce is low-level and rigid.
•Performing a Join operation in Apache Pig is pretty simple.
•It is quite difficult in MapReduce to perform a Join operation
between datasets.
Apache Pig Vs MapReduce
•Apache Pig uses multi-query approach, thereby reducing the
length of the codes to a great extent.
•MapReduce may require almost 20 times the number of lines
to perform the same task (see the word-count sketch after this list).
•There is no need for compilation. On execution, every
Apache Pig operator is converted internally into a
MapReduce job.
•MapReduce jobs have a long compilation process.
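To make the compactness claim concrete, here is a minimal word-count sketch in Pig Latin (the input path 'input.txt' is an illustrative placeholder); the equivalent hand-written Java MapReduce program is typically many times longer:
-- split each line into words, then count occurrences per word
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;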
Apache Pig Vs Hive
•Pig Latin is a data flow language.
•HiveQL is a query processing language.
•Pig Latin is a procedural language and fits the pipeline
paradigm.
•HiveQL is a declarative language.
•Apache Pig can handle structured, unstructured, and
semi-structured data.
•Hive is mostly for structured data.
Advantages of Pig
•Code reusability.
•Faster development.
•Fewer lines of code.
•Ideal for ETL operations.
•It allows specifying a detailed, step-by-step procedure by which
the data is to be transformed.
•Schema and type checking. It can handle data with
inconsistent schemas.
Pig Latin, Pig Engine, Pig script
Pig Latin:
•provides various operators with which programmers can
develop their own functions for reading, writing, and
processing data.
Pig Engine:
•Pig Engine component of Pig accepts the Pig Latin scripts as
input and converts those scripts into MapReduce jobs.
Pig scripts:
•To analyze data using Apache Pig, programmers need to
write scripts using Pig Latin language.
Pig has two execution modes
Local Mode:
Pig runs in a single JVM and uses the local file system.
This mode is suitable only for analyzing small data sets
with Pig.
This mode is generally used for testing purposes.
HDFS (MapReduce) Mode:
-In this mode, queries written in Pig Latin are translated into
MapReduce jobs and run on a Hadoop cluster.
-MapReduce mode with a fully distributed cluster is useful for
running Pig on large data sets.
Apache Pig Components
•Parser
-checks the syntax of the script, does type checking, and other
miscellaneous checks. The output of the parser is a DAG
(directed acyclic graph) of the Pig Latin statements and
logical operators.
•Optimizer
-carries out the logical optimizations
•Compiler
-compiles the optimized logical plan into a series of
MapReduce jobs.
•Execution engine
- MapReduce jobs are executed on Hadoop producing the
desired results
Apache Pig Execution Modes
• Interactive Mode (Grunt shell)
$ ./pig -x local
$ ./pig -x mapreduce
• Batch Mode (Script)
$ pig -x local Sample_script.pig
$ pig -x mapreduce Sample_script.pig
• Embedded Mode (UDF)
Why UDF?
•Do operations on more than one field
•Do more than grouping and filtering
•The programmer is more comfortable writing the logic in a
general-purpose language
•Want to reuse existing logic
Traditionally, UDFs could be written only in Java. Now other
languages such as Python are also supported.
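A minimal sketch of calling a UDF from Pig Latin, assuming a hypothetical jar myudfs.jar that contains a Java class myudfs.UPPER (both names are made up for illustration):
-- register the jar so Pig can locate the UDF class (hypothetical names)
REGISTER myudfs.jar;
DEFINE UPPER myudfs.UPPER();
names = LOAD 'data/names.txt' USING PigStorage(',') AS (name:chararray);
-- apply the UDF to every tuple
upper_names = FOREACH names GENERATE UPPER(name);
DUMP upper_names;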
Apache Pig - Architecture
•Programmers write scripts in the Pig Latin language and execute
them using any of the execution mechanisms.
•After execution, these scripts will go through a series of
transformations applied by the Pig Framework, to produce
the desired output.
•Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job
easy.
Shell Command in Pig
Syntax
grunt> sh <shell command> <parameters>
Example:
grunt> sh ls
PigStorage
•A built-in function of Pig.
•PigStorage is used to load and store data in Pig scripts.
•PigStorage can be used to parse text data with an arbitrary
delimiter or to output data in a delimited format.
Viewing Data
DUMP input;
Very useful for debugging, but not very useful for huge
datasets.
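For huge relations, a common pattern is to DUMP only a small sample; a quick sketch (large_data is an illustrative alias):
-- inspect a handful of tuples instead of dumping everything
few = LIMIT large_data 10;
DUMP few;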
Load and Store example
-- load comma-delimited input
data = LOAD 'data/data-bag.txt'
USING PigStorage(',');
-- write it back out pipe-delimited
STORE data INTO 'data/output/load-store'
USING PigStorage('|');
Loading Data into Pig
file = LOAD '/data/dropbox-policy.txt' AS
(line);
data = LOAD '/data/tweets.csv' USING
PigStorage(',');
data = LOAD '/data/tweets.csv'
USING PigStorage(',')
AS (field1, field2, field3);
Storing Data from Pig
STORE data INTO 'output_location';
STORE data INTO 'output_location'
USING PigStorage();
STORE data INTO 'output_location'
USING PigStorage(',');
•As with LOAD, many options are available
•Can store locally or in HDFS
Data Types used in Pig Latin
•Scalar Types
•Complex Types
Scalar Types
•int, long – (32, 64 bit) integer
•float, double – (32, 64 bit) floating point
•boolean (true/false)
•chararray (String in UTF-8)
•bytearray (blob) (DataByteArray in Java)
Complex Types
•tuple – ordered set of fields
•(data) bag – collection of tuples (NESTED)
•map – set of key-value pairs
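A sketch of a LOAD schema that declares all three complex types (the file path and field names are illustrative):
data = LOAD 'data/complex.txt' AS (
    t: tuple(f1:int, f2:chararray),   -- tuple: ordered set of fields
    b: bag{tt: tuple(x:int)},         -- bag: collection of tuples
    m: map[]                          -- map: key-value pairs
);
DESCRIBE data;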
Schemas in Load statement
We can specify a schema in LOAD statements:
data = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
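DESCRIBE confirms the schema that was attached:
DESCRIBE data;
-- prints something like: data: {f1: int,f2: int,f3: int}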
Pig Latin – Relational Operations
Loading and Storing
•LOAD - To Load the data from the file system (local/HDFS)
into a relation.
•STORE - To save a relation to the file system (local/HDFS).
Filtering
•FILTER - To remove unwanted rows from a relation.
•DISTINCT - To remove duplicate rows from a relation.
•FOREACH, GENERATE - To generate data transformations
based on columns of data.
Grouping and Joining
•JOIN - To join two or more relations.
•COGROUP - To group the data in two or more relations.
•GROUP - To group the data in a single relation.
•CROSS - To create the cross product of two or more
relations (see the sketch below).
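A short sketch of COGROUP and CROSS, reusing the two sample files from the JOIN example later in this section:
a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int);
-- one tuple per key, holding a bag of matching tuples from each relation
cg = COGROUP a BY f1, b BY t1;
-- every tuple of a paired with every tuple of b
cp = CROSS a, b;
DUMP cg;
DUMP cp;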
Sorting
•ORDER - To arrange a relation in sorted order based on one or
more fields (ascending or descending).
•LIMIT - To get a limited number of tuples from a relation.
Combining and Splitting
•UNION - To combine two or more relations into a single
relation.
•SPLIT - To split a single relation into two or more
relations (see the sketch below).
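A sketch of UNION followed by SPLIT (the input paths are illustrative; UNION expects both relations to share the same field layout):
a = LOAD 'data/a.txt' USING PigStorage(',') AS (f1:int, f2:int);
b = LOAD 'data/b.txt' USING PigStorage(',') AS (f1:int, f2:int);
u = UNION a, b;
-- route tuples into two relations based on a condition
SPLIT u INTO small IF f1 < 10, large IF f1 >= 10;
DUMP small;
DUMP large;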
Diagnostic Operators
•DUMP - To print the contents of a relation on the console.
•DESCRIBE - To describe the schema of a relation.
•EXPLAIN - To view the logical, physical, or MapReduce
execution plans used to compute a relation.
•ILLUSTRATE - To view the step-by-step execution of a series
of statements.
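All four diagnostic operators applied to one relation, as a quick sketch:
data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DESCRIBE data;    -- prints the schema
EXPLAIN data;     -- shows the logical, physical, and MapReduce plans
ILLUSTRATE data;  -- walks sample tuples through each step
DUMP data;        -- prints the tuples themselves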
FOREACH
Generates data transformations based on columns of data
x = FOREACH data GENERATE *;
x = FOREACH data GENERATE $0, $1;
x = FOREACH data GENERATE $0 AS first, $1
AS second;
GROUP
• Groups data in one or more relations
• Groups tuples that have the same group key
• Similar to SQL group by operator
outerbag = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP outerbag;
innerbag = GROUP outerbag BY f1;
DUMP innerbag;
FILTER
Selects tuples from a relation based on some condition
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
filtered = FILTER data BY f1 == 1;
DUMP filtered;
COUNT
Counts the number of tuples in a relation
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
grouped = GROUP data BY f2;
counted = FOREACH grouped GENERATE group,
COUNT(data);
DUMP counted;
ORDER BY
Sorts a relation based on one or more fields. Similar to SQL ORDER BY.
data = LOAD 'data/nested-sample.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
ordera = ORDER data BY f1 ASC;
DUMP ordera;
orderd = ORDER data BY f1 DESC;
DUMP orderd;
DISTINCT
Removes duplicates from a relation
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
unique = DISTINCT data;
DUMP unique;
LIMIT
Limits the number of tuples in the output.
data = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP data;
limited = LIMIT data 3;
DUMP limited;
JOIN
Joins relations based on a field. Both inner and outer joins are
supported.
a = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP a;
b = LOAD 'data/simple-tuples.txt'
USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;
joined = JOIN a BY f1, b BY t1;
DUMP joined;
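The example above is an inner join; an outer join needs only an extra keyword. A sketch using the same relations:
-- LEFT OUTER keeps every tuple of a, with nulls where b has no match
outer_joined = JOIN a BY f1 LEFT OUTER, b BY t1;
DUMP outer_joined;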
Pig Commands
(Using Pig's Grunt Shell Interface.)
• grunt> movies = LOAD 'Movies.txt' USING PigStorage(',') as (id:int, name:chararray, year:int,
rating:float, duration:int);
• grunt> dump movies;
• grunt> B = GROUP movies ALL;
• grunt> C = FOREACH B GENERATE group, COUNT(movies);
• grunt> DUMP C;
• grunt> STORE C INTO '/OUTPUT_PIG' USING PigStorage(','); (the output directory should not
already exist in HDFS)
• $ hadoop fs -ls /OUTPUT_PIG
• Found 2 items
• -rw-rw-rw- 1 bedrock supergroup 0 2015-07-31 10:30 /OUTPUT_PIG/_SUCCESS
• -rw-rw-rw- 1 bedrock supergroup 7 2015-07-31 10:30 /OUTPUT_PIG/part-r-00000
• [bedrock@cdh-5-2 ~]$ hadoop fs -cat /OUTPUT_PIG/part-r-00000
• all,10
Note: The input text file (Movies.txt) should already exist on HDFS
Using Pig to get the difference between two
text files
• file1_set = LOAD '/home/bedrock/TEST_DATA/file1.txt' USING PigStorage(',') AS
(id:int, source_address:chararray, source_city:chararray, source_name:chararray,
dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• file2_set = LOAD '/home/bedrock/TEST_DATA/file2.txt' USING PigStorage(',') AS
(id:int, source_address:chararray, source_city:chararray, source_name:chararray,
dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• cogroup_set = COGROUP file1_set BY id, file2_set BY id;
• DUMP cogroup_set;
• diff_data = FOREACH cogroup_set GENERATE DIFF(file1_set, file2_set);
• DUMP diff_data;
Optimizing Pig Scripts
•Project early and often
•Filter early and often
•Drop nulls before a join
•Prefer DISTINCT over GROUP BY
•Use the right data structure
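A sketch combining several of these tips before a join, reusing relations a and b from the JOIN example above:
-- project early: keep only the column the join needs
a_small = FOREACH a GENERATE f1;
-- filter early: drop nulls before the join, not after
a_clean = FILTER a_small BY f1 IS NOT NULL;
joined = JOIN a_clean BY f1, b BY t1;
DUMP joined;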
What are the limitations of the Pig?
•As the Pig platform is designed for ETL-type use cases, it is
not a good choice for real-time scenarios.
•Apache Pig is not a good choice for pinpointing a single
record in huge data sets.
•Apache Pig is built on top of MapReduce, which is batch
processing oriented.
Is Pig script case sensitive?
•Pig Latin is partly case sensitive and partly case insensitive.
•User-defined functions, field names, and relation names are
case sensitive: M = LOAD 'data' is not the same as
M = LOAD 'Data'.
•Pig Latin keywords, however, are case insensitive; i.e., LOAD is
the same as load.