PIG
UNIT-2
EVOLUTION OF PIG
• Apache Pig was initially developed by Yahoo researchers in the year 2006.
• The main motivation behind developing Pig was to simplify the creation and execution of MapReduce jobs on large datasets.
PIG
• Apache Pig is a tool/platform for creating and executing MapReduce programs used with Hadoop
• Apache Pig is an abstraction over MapReduce
• It is a tool/platform for analyzing large sets of data
• It provides a high-level scripting language, known as Pig Latin, which is used
to develop the data analysis code
PIG LATIN
• Pig Latin is a high-level scripting language used to develop the data analysis code
• Pig has two components
Pig Latin : Provides an environment to develop the scripts for processing
the data stored in HDFS
Pig Engine : Converts Pig Latin scripts into MapReduce tasks
• The results of Pig are always stored in HDFS
Apache Pig | MapReduce
It is a scripting language. | It is a compiled programming language.
Abstraction is at a higher level. | Abstraction is at a lower level.
It has fewer lines of code as compared to MapReduce. | Lines of code are more.
Less effort is needed for Apache Pig. | More development effort is required for MapReduce.
Code efficiency is less as compared to MapReduce. | As compared to Pig, efficiency of code is higher.
Apache Pig | MapReduce
Pig is a data flow language. | MapReduce is a data processing paradigm.
Performing a join operation in Pig is quite straightforward. | Joining datasets is a complex task.
Requires only a fundamental knowledge of SQL. | Java expertise is very much required.
FEATURES OF APACHE PIG
• User-defined Functions: Pig gives the ability to create UDFs in
other programming languages like Java and embed or invoke them in
Pig scripts.
• Handles a wide range of data: Apache Pig analyzes all kinds of
data, both unstructured as well as structured. It stores the
outcomes in the Hadoop Distributed File System.
• Rich set of operators: It provides numerous operators to perform
tasks like filter, sort, join, and so on.
FEATURES OF APACHE PIG
• Extensibility: Using the existing operators, users can develop their own functions
to read, process, and write data.
• Simplicity of programming: Pig Latin is similar to Structured Query Language, and it is
easy to write a Pig script if you are good at Structured Query Language.
• Optimization opportunities: The tasks in Apache Pig optimize their
execution automatically, so programmers need to focus only on the semantics
of the language.
DATA MODELS IN PIG
Atom: An atomic data value, stored internally as a string. The main advantage of
this model is that a value can be used both as a number and as a string.
Tuple: An ordered set of fields, which may contain a different
data type for each field.
Bag: A collection of tuples; these tuples can be a subset of the rows
or whole rows of a table.
Map: A set of key-value pairs used to represent data elements.
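As a quick illustration, here is how each of these data types looks as a Pig Latin literal (the field names and values below are illustrative assumptions):
Atom : 'john' or 27
Tuple : (john,27)
Bag : {(john,27),(mary,31)}
Map : [name#john, age#27]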
APPLICATIONS OF PIG
• For exploring large datasets.
• Provides support for ad-hoc queries across large datasets.
• In the prototyping of algorithms for processing large datasets.
• In processing time-sensitive data loads.
• For collecting large amounts of data in the form of search logs and web
crawls.
• Used where analytical insights are needed through sampling.
PIG INSTALLATION
Pig runs as a client-side application.
Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.
Prerequisite : Java 6
Download link: http://hadoop.apache.org/pig/releases.html
Unpack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
Add Pig’s binary directory to your command-line path:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
Set the JAVA_HOME environment variable to point to a suitable Java installation.
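For example (the path below is an assumption; point it at your own Java installation):
% export JAVA_HOME=/usr/lib/jvm/java-6-sun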
EXECUTION TYPES
Pig has two execution types or modes:
Local mode
In local mode, Pig runs in a single JVM and accesses the local filesystem.
This mode is suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option.
To run in local mode, set the option to local:
% pig -x local
grunt>          (Grunt, Pig's interactive shell)
EXECUTION TYPES
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
We run MapReduce mode (with a fully distributed cluster) when we want to run Pig on large
datasets.
To use MapReduce mode, we must check the compatibility between the versions of Pig and Hadoop we are using.
If a Pig release supports multiple versions of Hadoop, you can use the environment variable
PIG_HADOOP_VERSION to tell Pig the version of Hadoop it is connecting to.
export PIG_HADOOP_VERSION=18
Next, you need to point Pig at the cluster’s namenode and jobtracker.
If you already have a Hadoop site file (or files) that define fs.default.name and mapred.job.tracker,
simply add Hadoop’s configuration directory to Pig’s classpath:
% export PIG_CLASSPATH=$HADOOP_INSTALL/conf/
EXECUTION TYPES
MapReduce mode
Alternatively, you can set these two properties in the pig.properties file in Pig’s conf directory:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
Launch Pig, setting the -x option to mapreduce, or omitting it entirely, as MapReduce mode is the default. Pig reports the filesystem and jobtracker that it has connected to:
% pig
10/07/16 16:27:37 INFO pig.Main: Logging error messages to: /Users/tom/dev/pig-0.7.0/pig_1279294057867.log
2010-07-16 16:27:38,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
2010-07-16 16:27:38,741 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
RUNNING PIG PROGRAMS
There are three ways of executing Pig programs
They all work in both local and MapReduce mode:
Script
Pig can run a script file that contains Pig commands.
For example, pig script.pig runs the commands in the local file script.pig.
You can also use the -e option to run a script specified as a string on the command line.
Grunt
Grunt is an interactive shell for running Pig commands.
Grunt will be started when no file is specified for Pig to run, and the -e option is not used.
It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.
PIG LATIN STRUCTURE
A Pig Latin program consists of a collection of statements. A statement can be thought
of as an operation or a command.
Statements are usually terminated with a semicolon.
Statements or commands for interactive use in Grunt do not need the terminating semicolon.
Statements that have to be terminated with a semicolon can be split across multiple lines
for readability:
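For example, the LOAD statement used later in this unit can be written as:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);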
Comments :
Single line comments -- Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter
Ex: DUMP A; -- What's in A?
Multiline comment: /* Everything in between is ignored by the interpreter */
PIG LATIN STRUCTURE
Pig Latin has a list of keywords that have a special meaning in the language and cannot
be used as identifiers.
These include:
operators (LOAD, ILLUSTRATE, etc.)
commands (cat, ls, etc.)
expressions (matches, FLATTEN, etc.)
functions (DIFF, MAX, etc.)
Pig Latin has mixed rules on case sensitivity.
Operators and commands are not case sensitive (to make interactive use more forgiving);
aliases and function names are case sensitive.
Parser
• As a Pig Latin program is executed, each statement is parsed in turn.
• If any syntactical errors are encountered, the interpreter halts and
displays the error message. Otherwise, it builds a logical plan for each Pig Latin
statement (operator).
• The logical plan for the statement is added to the logical plan for the
program so far, and then the interpreter moves on to the next statement.
• No data processing takes place while the logical plan of the program is
being constructed.
• The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators.
Optimizer
The logical plan (DAG) is passed to the logical
optimizer, which carries out logical optimizations such as
projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a
series of MapReduce jobs.
Execution engine
• Finally, the MapReduce jobs are submitted to Hadoop in
sorted order and executed on Hadoop, producing the desired results.
PIG STATEMENTS
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
Statements allow you to transform a relation by sorting, grouping, joining, projecting, and filtering.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the data.
LOADING AND STORING OPERATORS
LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
Syntax :
LOAD '<path>' USING PigStorage('<delimiter>') AS (<schema>);
STORE:
STORE is used to save results to the file system.
Syntax: STORE <relation1> INTO '<location>';
DUMP:
Prints a relation to the console.
Syntax: DUMP <relation>
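A minimal sketch tying these three operators together (the file path, delimiter, and schema are illustrative assumptions):
grunt> records = LOAD 'input/sample.txt' USING PigStorage(',') AS (year:chararray, temperature:int, quality:int);
grunt> STORE records INTO 'output/records';
grunt> DUMP records;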
FILTERING OPERATORS
FOREACH... GENERATE:
This operator generates data transformations based on columns of data. It is used to add or remove fields from
a relation.
Syntax:
FOREACH <relation1> GENERATE (<fields>);
FILTER:
This operator selects tuples from a relation based on a condition.
Syntax: FILTER <relation1> BY <condition>
DISTINCT:
Distinct removes duplicate tuples in a relation.
Syntax: DISTINCT <relation1>
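A brief sketch of these operators, assuming the records relation loaded in the earlier example:
grunt> years = FOREACH records GENERATE year;          -- keep only the year field
grunt> good_records = FILTER records BY quality == 1;  -- keep satisfactory readings
grunt> distinct_years = DISTINCT years;                -- remove duplicate years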
FILTERING OPERATORS
MAPREDUCE:
Runs a MapReduce job using a relation as input
Syntax:
MAPREDUCE 'mr.jar' STORE <relation1> INTO 'inputLocation' LOAD 'outputLocation' AS (<schema>);
STREAM:
Transforms a relation using an external program
Syntax: STREAM alias [, alias …] THROUGH {'command' | cmd_alias } [AS schema]
;
SAMPLE:
Selects a random sample of a relation.
Syntax: SAMPLE <relation1> <sample_size>;
ASSERT:
Ensures a condition is true for all rows in a relation; otherwise, the job fails.
Syntax: ASSERT <relation1> BY <condition> [, <message>];
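A hedged sketch of STREAM, SAMPLE, and ASSERT (the external command, sample size, and message are illustrative assumptions; records is the relation from the earlier examples):
grunt> second_col = STREAM records THROUGH `cut -f 2`;  -- pipe tuples through an external program
grunt> some = SAMPLE records 0.1;                       -- keep roughly 10% of the tuples
grunt> ASSERT records BY temperature != 9999, 'temperature cannot be 9999';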
PIG GROUPING OPERATORS
GROUP
It groups the data in a single relation.
Syntax:
GROUP <relation1> BY (<fields>);
COGROUP:
Used for grouping the data from two or more relations.
Syntax: COGROUP <relation1> BY <field>, <relation2> BY <field>
CROSS:
We can create the cross (Cartesian) product of two or more relations.
Syntax : CROSS <relation1> ,<relation2>
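A short sketch of these grouping operators (relations A and B with an id field are assumptions; records is from the earlier examples):
grunt> grouped = GROUP records BY year;
grunt> cogrouped = COGROUP A BY id, B BY id;
grunt> product = CROSS A, B;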
PIG GROUPING OPERATORS
JOIN
Joins two or more relations based on common field values (keys).
Syntax:
JOIN Relation1_name BY key, Relation2_name BY key
CUBE:
Efficiently performs aggregations based on multiple dimensions.
Syntax: CUBE people BY CUBE(gender, sport);
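A minimal sketch (the customers and orders relations with their key fields are assumptions; the CUBE line reuses the people relation from the syntax above):
grunt> joined = JOIN customers BY id, orders BY customer_id;
grunt> cubed = CUBE people BY CUBE(gender, sport);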
PIG SORTING OPERATORS
RANK :
Assigns a rank to each tuple in a relation, optionally sorting by fields first.
Syntax: RANK <relation>
ORDER BY:
Order By is used to sort a relation based on one or more fields. You can do sorting in ascending or descending
order using ASC and DESC keywords.
Syntax: ORDER <relation1> BY <field> [ASC|DESC];
LIMIT:
LIMIT operator is used to limit the number of output tuples. If the specified number of output tuples is equal to or
exceeds the number of tuples in the relation, the output will include all tuples in the relation.
Syntax:
LIMIT <relation1> <number_of_tuples>;
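A brief sketch of the sorting operators, reusing the records relation from the earlier examples:
grunt> sorted = ORDER records BY temperature DESC;
grunt> top5 = LIMIT sorted 5;    -- keep only the first five tuples
grunt> ranked = RANK records;    -- prepend a rank to each tuple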
PIG COMBINE/SPLIT OPERATORS
UNION :
Combines two or more relations into one
Syntax: UNION <relation1> ,<relation2>
SPLIT:
The SPLIT operator is used to partition the contents of a relation into two or more relations based on some expression.
Syntax: SPLIT <relation1> INTO <relation2> IF <condition>, <relation3> IF <condition>;
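A minimal sketch of SPLIT and UNION (the temperature threshold is an illustrative assumption; records is from the earlier examples):
grunt> SPLIT records INTO hot IF temperature > 30, cold IF temperature <= 30;
grunt> combined = UNION hot, cold;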
EXAMPLE
Writing a program to calculate the maximum recorded temperature by year for the
sample weather dataset in Pig Latin. The input file contains:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year : chararray, temperature : int, quality : int);
Relations are given names, or aliases, so they can be referred to. This relation is given
the records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
EXAMPLE
We can also see the structure of a relation—the relation’s schema—using the
DESCRIBE operator on the relation’s alias:
grunt> DESCRIBE records;
records: {year: chararray, temperature: int, quality: int}
To remove records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality
reading
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality
== 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
EXAMPLE
We can use the GROUP operator to group the filtered_records relation by the
year field:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, filtered_records: {year:
chararray, temperature: int,quality: int}}
EXAMPLE
To compute the maximum temperature of each filtered_records bag, we use MAX. MAX is a built-in function for
calculating the maximum value of fields in a bag.
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111)
(1950,22)
DIAGNOSTIC OPERATORS
DESCRIBE
Prints a relation's schema
Syntax: DESCRIBE <relation>
EXPLAIN
Prints the logical and physical plans
Syntax: EXPLAIN <relation>
ILLUSTRATE
Shows sample execution of the logical plan using a generated subset of the input
Syntax: ILLUSTRATE <relation>
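A quick sketch of these operators applied to the relations from the worked example above:
grunt> DESCRIBE records;
grunt> EXPLAIN max_temp;
grunt> ILLUSTRATE max_temp;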
GENERATING EXAMPLES
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete and concise sample dataset.
PIG LATIN COMMANDS
EXPRESSIONS IN PIG
TYPES OF PIG LATIN
MAPS IN PIG
• Maps are always loaded from files, since there is no relational operator in Pig that produces a
map.
• It’s possible to write a UDF to generate maps, if desired.
• A relation is a top-level construct, whereas a bag has to be contained in a relation.
• It’s not possible to create a relation from a bag literal:
A = {(1,2),(3,4)}; -- Error
• You can’t treat a relation like a bag and project a field into a new relation:
B = A.$0; -- Error
Instead, you have to use a relational operator to turn the relation A into relation B:
B = FOREACH A GENERATE $0;
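A hedged sketch of loading and dereferencing a map field (the file path, schema, and key are illustrative assumptions):
grunt> A = LOAD 'input/data.txt' AS (info:map[]);
grunt> B = FOREACH A GENERATE info#'name';  -- look up the value stored under the 'name' key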
CASE OPERATOR
Case operator is equivalent to nested bincond operators.
Syntax: CASE ... WHEN ... THEN ... ELSE ... END
Usage:
CASE expression [ WHEN value THEN value ]+ [ ELSE value ]? END
CASE [ WHEN condition THEN value ]+ [ ELSE value ]? END
• The schemas for all the outputs of the when/else branches should match.
• Use expressions only (relational operators are not allowed).
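A minimal sketch of both forms (a relation A with an integer field i is an assumption):
grunt> X = FOREACH A GENERATE (CASE i % 2 WHEN 0 THEN 'even' ELSE 'odd' END);
grunt> Y = FOREACH A GENERATE (CASE WHEN i > 0 THEN 'positive' ELSE 'non-positive' END);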
UDF STATEMENTS
REGISTER:
Registers a JAR file with the Pig runtime so that its User Defined Functions can be used.
Syntax: REGISTER <path>
DEFINE:
Assigns an alias to a UDF or streaming command.
Syntax: DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] }
Since they do not process relations, commands are not added to the logical plan; instead,
they are executed immediately. They are NON-LOGICAL PLAN STATEMENTS
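A hedged sketch of both statements (the JAR path, class name, and field are illustrative assumptions):
grunt> REGISTER ./myudfs.jar;                    -- make the JAR's classes visible to Pig
grunt> DEFINE toUpper com.example.pig.UPPER();   -- give the UDF a short alias
grunt> B = FOREACH A GENERATE toUpper(name);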
FUNCTIONS
• Functions in Pig come in four types:
Eval function
A function that takes one or more expressions and returns another expression
Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value
Ex: MAX
Furthermore, many aggregate functions are algebraic, which means that the result of the function may be calculated
incrementally.
Ex: MAX
Ex: Median is non-algebraic (it cannot be calculated incrementally).
Filter function
A special type of eval function that returns a logical boolean result.
filter functions are used in the FILTER operator to remove unwanted rows.
An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
FUNCTIONS
Load function
A function that specifies how to load data into a relation from external storage.
Store function
A function that specifies how to save the contents of a relation to external storage.
Ex: PigStorage
USER-DEFINED FUNCTIONS
Plugging custom code into Pig statements is crucial for many data processing jobs.
The User Defined Functions (UDFs) of Pig are meant to achieve this.
A Filter UDF
Writing a filter function for filtering out weather records that do not have a temperature quality reading of
satisfactory
To change the line
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR
quality == 5 OR quality == 9);
to:
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
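Before isGood can be called, the UDF must be registered and aliased, along the lines of this sketch (the JAR path and class name are illustrative assumptions):
grunt> REGISTER pig-examples.jar;
grunt> DEFINE isGood com.example.pig.IsGoodQuality();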
UDF – FUNCTION RESOLUTION
Pig resolves function calls by treating the function’s name as a Java class name and attempting to load a
class of that name.
When searching for classes, Pig uses a class loader that includes the JAR files that have been registered.
When running in distributed mode, Pig will ensure that your JAR files get shipped to the cluster.
Pig has a set of built-in package names that it searches, so the function call does not have to be a fully
qualified name.
MAX is actually implemented by a class MAX in the package org.apache.pig.builtin, which is one of the
built-in packages of Pig, so the MAX function can be written as MAX rather than org.apache.pig.builtin.MAX.
PIG IN PRACTICE
There are some practical techniques that are worth knowing about when you are developing and running
Pig programs.
Parallelism
• When running in MapReduce mode, you need to tell Pig how many reducers you want for each job.
• You do this using a PARALLEL clause for operators that run in the reduce phase, which includes all
the grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and
ORDER.
• grouped_records = GROUP records BY year PARALLEL 30;
PIG IN PRACTICE
Parameter Substitution
Pig supports parameter substitution, where parameters in the script are substituted with values
supplied at runtime.
Parameters are denoted by identifiers prefixed with a $ character.
Parameters can be specified when launching Pig, using the -param option, one for each parameter:
% pig -param input=<inputpath> -param output=<outputpath> script.pig
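A minimal sketch of a parameterized script (max_temp.pig is a hypothetical file name; the body reuses the weather example, with a simplified quality check):
-- max_temp.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality == 1;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
STORE max_temp INTO '$output';
It could then be launched with:
% pig -param input=input/ncdc/micro-tab/sample.txt -param output=max_temp_out max_temp.pig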