Unit 5
Pig: Hadoop Programming Made Easier — Admiring the Pig Architecture, Going with the Pig Latin Application
Flow, Working through the ABCs of Pig Latin, Evaluating Local and Distributed Modes of Running Pig Scripts,
Checking Out the Pig Script Interfaces, Scripting with Pig Latin
Java MapReduce programs and the Hadoop Distributed File System (HDFS) provide you with a
powerful distributed computing framework, but relying on them limits the use of Hadoop to
Java programmers who can think in Map and Reduce terms when writing programs.
Many more developers, data analysts, data scientists, and other people could take advantage of Hadoop
if they had a way to harness the power of Map and Reduce while being shielded from some of the Map and
Reduce complexities.
Hive and Pig hide the messy details of MapReduce so that a programmer can concentrate on the
important work.
Hive, for example, provides a limited SQL-like capability that runs over MapReduce, making
MapReduce more approachable for SQL developers.
Hive also provides a declarative query language (the SQL-like HiveQL), which allows you to focus
on which operation you need to carry out versus how it is carried out.
Although SQL is the commonly accepted language for querying structured data, some developers still prefer
writing imperative scripts (scripts that define a set of operations that change the state of the data)
and also want more data processing flexibility than what SQL or HiveQL provides.
Engineers at Yahoo! Research came up with a product meant to fulfil that need, and so Pig was
born.
Pig's claim to fame was its status as a programming tool attempting to have the best of both worlds:
o a declarative query language inspired by SQL, and
o a low-level procedural programming language that can generate MapReduce code.
This lowers the level of technical knowledge needed to exploit the power of Hadoop.
Pig was initially developed at Yahoo! in 2006 as part of a research project; in 2007, Pig
officially became an Apache project.
The Pig programming language is designed to handle any kind of data tossed its way:
structured, semi-structured, or unstructured data.
According to the Apache Pig philosophy, pigs eat anything, live anywhere, are domesticated and can
fly to boot.
Pig is a parallel data processing programming language and is not tied to any one particular parallel
framework.
Pig is smart about data processing because its optimizer does the hard work of
figuring out how to access the data efficiently.
Admiring the Pig Architecture
Pig is made up of two components:
The language itself: the programming language for Pig is known as Pig Latin, a high-level language
that allows you to write data processing and analysis programs.
The Pig Latin compiler: The Pig Latin compiler converts the Pig Latin code into executable code.
The executable code is either a series of MapReduce jobs or a process for running the Pig code
locally on a single node.
The sequence of MapReduce programs enables Pig programs to do data processing and analysis
in parallel, leveraging Hadoop MapReduce and HDFS.
Running the Pig job in the virtual Hadoop instance is a useful strategy for testing your Pig scripts.
Pig programs can run on MapReduce v1 or MapReduce v2 without any code changes,
regardless of which mode the cluster is running.
Pig scripts can also run using the Tez API instead. Apache Tez provides a more efficient execution
framework than MapReduce.
Going with the Pig Latin Application Flow
o Pig Latin is a dataflow language, where you define a data stream and a series of transformations
that are applied to the data as it flows through your application.
o This contrasts with a control flow language (like C or Java), where you write a series of
instructions and use constructs like loops and conditional logic (an if statement, for example).
o To see that working with Pig is significantly easier than writing MapReduce programs, consider the
following example.
Listing 8-1: Sample Pig Code to illustrate the data processing dataflow
A = LOAD 'data_file.txt';
...
B = GROUP ... ;
...
C = FILTER ... ;
...
DUMP B;
...
STORE C INTO 'Results';
The basic flow of a Pig program is:
o 1. Load: First, you load (LOAD) the data you want to manipulate; that data lives in HDFS or the local file
system. For a Pig program to access the data, you first tell Pig what file or files to use. For that task,
you use the LOAD 'data_file' command. Here, 'data_file' can specify either an HDFS file
or a directory. If a directory is specified, all files in that directory are loaded into the program.
o If the data is stored in a file format that isn't natively accessible to Pig, you can optionally add the
USING clause to the LOAD statement to specify a user-defined function that can read in
(and interpret) the data.
o 2. Transform: Next, you run the data through a set of transformations that are translated into a set of
Map and Reduce tasks.
o The transformation logic is where all the data manipulation happens. Here, you can FILTER
out rows that aren't of interest, JOIN two sets of data files, GROUP data to build aggregations,
and so on.
o 3. Dump or store: Finally, you dump (DUMP) the results to the screen, or
store (STORE) the results in a file somewhere, as in the sketch below.
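Filling in the placeholders, a runnable version of Listing 8-1 might look like the following sketch; the file name, field names, and filter condition are assumptions chosen for illustration:

A = LOAD 'flights.csv' USING PigStorage(',')
    AS (carrier:chararray, distance:int);  -- load comma-delimited records
B = GROUP A BY carrier;                    -- transform: group rows per carrier
C = FILTER A BY distance > 500;            -- transform: keep longer flights
DUMP B;                                    -- dump grouped data to the screen
STORE C INTO 'Results';                    -- store filtered data in a file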
Working through the ABCs of Pig Latin
Pig Latin is the language for Pig programs.
Pig translates the Pig Latin script into MapReduce jobs that can be executed within a Hadoop cluster.
The Pig Latin development team followed three key design principles:
Keep it simple.
Pig Latin is an abstraction for MapReduce that simplifies the creation of parallel programs on the
Hadoop cluster for data flows and analysis.
Complex tasks may require a series of interrelated data transformations — such series are
encoded as data flow sequences.
Writing Pig Latin scripts instead of Java MapReduce programs makes these programs easier to
write, understand, and maintain because
a) you don’t have to write the job in Java,
b) you don’t have to think in terms of MapReduce, and
c) you don't need to come up with custom code to support rich data types.
Pig Latin provides a simpler language to exploit your Hadoop cluster.
Make it smart.
The Pig Latin compiler transforms a Pig Latin program into a series of Java MapReduce jobs. The
compiler can optimize the execution of these Java MapReduce jobs automatically,
allowing the user to focus on semantics rather than on how to optimize and access the data.
For example, SQL is set up as a declarative query that you use to access structured data stored in
an RDBMS. The RDBMS engine first translates the query to a data access method and then looks
at the statistics and generates a series of data access approaches. The cost-based optimizer
chooses the most efficient approach for execution.
Don’t limit development. Make Pig extensible so that developers can add functions to address
their particular business problems.
Traditional RDBMS data warehouses make use of the ETL data processing pattern, where you
extract data from outside sources, transform it to fit your operational needs, and then load it
into the end target, whether it’s an operational data store, a data warehouse, or another variant of
database.
With big data, Pig data flows go with ELT instead: Extract the data from
your various sources, load it into HDFS, and then transform it as necessary to prepare the
data for further analysis.
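As a rough illustration of the "T" step in ELT, the following hypothetical Pig fragment assumes raw comma-delimited flight records have already been extracted and loaded into HDFS; all paths and field names are invented:

raw = LOAD '/data/raw/flights' USING PigStorage(',')
      AS (carrier:chararray, status:chararray, distance:int);
good = FILTER raw BY status == 'OK';        -- transform after loading, not before
STORE good INTO '/data/curated/flights';    -- write the cleansed data back to HDFS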
Uncovering Pig Latin structures:
For example, consider the following Pig script, which performs the task of calculating the total
miles flown across all carriers.
Listing 8-2: Pig script calculating the total miles flown
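A minimal sketch of such a script follows; the input file name and column layout are assumptions, while the aliases (records, mileage_recs, total_miles) match the walkthrough below:

records = LOAD 'FlightData.csv' USING PigStorage(',')
          AS (carrier:chararray, origin:chararray, dest:chararray, record_Distance:int);
mileage_recs = GROUP records ALL;
total_miles = FOREACH mileage_recs GENERATE SUM(records.record_Distance);
DUMP total_miles;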
This Pig script illustrates the following principles:
Most Pig scripts start with the LOAD statement to read data from HDFS.
In this case, we're loading data from a .csv file. Pig has a data model it uses, so next we need to map
the file's data model to the Pig data model. This is accomplished with the help of the USING
statement. We then specify that it is a comma-delimited file with the PigStorage(',') statement,
followed by the AS statement defining the name of each of the columns.
Aggregations are commonly used in Pig to summarize data sets.
The GROUP statement is used to aggregate the records into a single record,
mileage_recs. The ALL statement is used to aggregate all tuples into a single
group. Note that some statements, including the following SUM statement,
require a preceding GROUP ALL statement for global sums.
FOREACH . . . GENERATE statements are used here to transform column data.
In this case, we want to count the miles travelled in the record_Distance
column. The SUM statement computes the sum of the record_Distance
column into a single-column collection, total_miles.
The DUMP operator is used to execute the Pig Latin statement and display the
results on the screen.
DUMP is used in interactive mode, which means that the statements are executed
immediately and the results are not saved. Typically, you will use either the DUMP
or STORE operators at the end of your Pig script.
Looking at Pig data types and syntax:
Pig Latin has these four types in its data model:
Atom: An atom is any single value, such as a string or a number ('Diego', for example). Pig's atomic
values are scalar types that appear in most programming languages: int, long, float,
double, chararray, and bytearray, for example. See Figure 8-2 for sample atom types.
Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type ('Diego',
'Gomez', or 6, for example). Think of a tuple as a row in a table.
Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible — each tuple in the
collection can contain an arbitrary number of fields, and each field can be of any type.
Map: A map is a collection of key-value pairs. The key of a map must be a chararray and must be
unique; the value can be of any type.
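To make the four types concrete, here is a hypothetical schema declaration that uses all of them; the file and field names are invented for illustration:

people = LOAD 'people.txt' AS (
    name:chararray,                                   -- atom (scalar value)
    fullname:tuple(first:chararray, last:chararray),  -- tuple (a record of fields)
    pets:bag{t:(pet:chararray)},                      -- bag (collection of tuples)
    info:map[]                                        -- map (chararray keys, any values)
);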
In a Hadoop context, accessing data means allowing developers to load, store, and
stream data, whereas transforming data means taking advantage of Pig's ability to
group, join, combine, split, filter, and sort data. Table 8-1 gives an overview of the
operators associated with each operation.
The LOAD operator operates on the principle of lazy evaluation, also referred to as call-by-need. Now
lazy doesn't sound particularly praiseworthy, but all it means is that you delay the evaluation of an
expression until you really need it. In the context of our Pig example, that means that after the
LOAD statement is executed, no data is moved (nothing gets shunted around) until a statement to
write data is encountered. You can have a Pig script that is a page long, filled with complex
transformations, but nothing gets executed until the DUMP or STORE statement is encountered.
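A short sketch of lazy evaluation in action, with assumed file and field names:

raw = LOAD 'flights.csv' USING PigStorage(',')
      AS (carrier:chararray, distance:int);   -- plan only, no data read yet
long_hauls = FILTER raw BY distance > 1000;   -- still just a plan
by_carrier = GROUP long_hauls BY carrier;     -- still just a plan
STORE by_carrier INTO 'LongHauls';            -- execution is triggered here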
Evaluating Local and Distributed Modes of Running Pig scripts:
Pig has two modes for running scripts, as shown in Figure 8-3:
Local mode: All scripts are run on a single machine without requiring Hadoop MapReduce
and HDFS. This can be useful for developing and testing Pig logic. If you're using a small set
of data to develop or test your code, then local mode could be faster than going through the
MapReduce infrastructure.
Local mode doesn't require Hadoop. When you run in local mode, the Pig
program runs in the context of a local Java Virtual Machine, and data access is via the
local file system of a single machine. Local mode is actually a local simulation of
MapReduce, using Hadoop's LocalJobRunner class.
MapReduce mode (also known as Hadoop mode): Pig is executed on the
Hadoop cluster. In this case, the Pig script gets converted into a series of
MapReduce jobs that are then run on the Hadoop cluster.
Checking Out the Pig Script Interfaces
Pig programs can be packaged in three different ways:
Script: This method is nothing more than a file containing Pig Latin commands, identified by the
.pig suffix (FlightData.pig, for example). Ending your Pig program with the .pig extension is a
convention but not required. The commands are interpreted by the Pig Latin compiler and executed in
the order determined by the Pig optimizer.
Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt
command line and immediately see the response. This method is helpful for prototyping during initial
development and with what-if scenarios.
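A hypothetical Grunt session might look like this (the grunt> prompt is printed by Grunt; the file and field names are made up):

grunt> records = LOAD 'flights.csv' USING PigStorage(',') AS (carrier:chararray, distance:int);
grunt> long_flights = FILTER records BY distance > 1000;
grunt> DUMP long_flights;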
Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript
programs.
To specify whether a script or Grunt shell is executed locally or in Hadoop mode, just pass the
-x flag to the pig command. The following is an example of how you'd specify running your Pig script in
local mode:
pig -x local milesPerCarrier.pig
Here’s how you’d run the Pig script in Hadoop mode, which is the default if you don’t specify the flag:
pig -x mapreduce milesPerCarrier.pig
By default, when you specify the pig command without any parameters, it starts the Grunt shell in
Hadoop mode. If you want to start the Grunt shell in local mode, just add the -x local flag to the
command. Here is an example:
pig -x local
Scripting with Pig Latin
Hadoop is a rich and quickly evolving ecosystem with a growing set of new applications. Pig is
designed to be extensible via user-defined functions, also known as UDFs. UDFs can be written in a
number of programming languages, including Java, Python, and JavaScript.
Some of the Pig UDFs that are part of public UDF repositories (such as Piggybank) are LOAD/STORE
functions (XML, for example), date-time functions, and text, math, and stats functions.
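For example, a UDF from the Piggybank repository can be registered and used like this; the jar location is an assumption, though the UPPER function itself ships with Piggybank:

REGISTER piggybank.jar;
DEFINE UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
names = LOAD 'names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE UPPER(name);
DUMP upper_names;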
Pig can also be embedded in host languages such as Java, Python, and JavaScript,
which allows you to integrate Pig with your existing applications. Embedding also
helps overcome limitations in the Pig language.
One of the most commonly referenced limitations is that Pig doesn't support
control flow statements: if/else, while loops, for loops, and condition statements.
Pig natively supports data flow, but needs to be embedded within another language to
provide control flow. There are trade-offs, however, to embedding Pig in a control-flow
language. For example, if a Pig statement is embedded in a loop, every time the loop
iterates and runs the Pig statement, a separate MapReduce job runs.