
Custom Load Function:

---------------------
Apache Pig is a platform for analyzing large data sets on top of Hadoop.
To load a custom input dataset, Pig uses a loader function which loads the
data from the file system.

Pig's loader function uses a specified InputFormat, which splits the input
data into logical splits. The InputFormat in turn uses a RecordReader, which
reads each input split and emits <key, value> pairs as input to the map function.

Implementation of Custom Load Function:
---------------------------------------

A custom loader function extends LoadFunc and provides implementations for
the following abstract methods (a minimal Java sketch follows this list):

sets the path of the input data ---> abstract void setLocation(String location, Job job)
returns the InputFormat class which will be used to split the input data ---> abstract InputFormat getInputFormat()
prepares to read data; takes a record reader and an input split as arguments ---> abstract void prepareToRead(RecordReader rr, PigSplit split)
returns the next tuple to be processed ---> abstract Tuple getNext()
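
A minimal sketch of such a loader in Java, assuming a simple comma-separated
text format; the class name MyCustomLoader and the field-splitting logic are
illustrative, not part of Pig's API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyCustomLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell Hadoop where the input data lives.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Reuse the standard line-oriented InputFormat for this sketch.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Pig hands us the RecordReader created from the InputFormat above.
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                    // end of this split
            }
            Text line = (Text) reader.getCurrentValue();
            // Split each line on commas and emit the fields as one tuple.
            List<Object> fields = new ArrayList<Object>();
            for (String field : line.toString().split(",")) {
                fields.add(field);
            }
            return tupleFactory.newTuple(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}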

Execution Procedure:
--------------------

1)create a jar which contains the load function.
2)register the loader function.
3)load the input data using the custom loader function (an example follows).
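
For example, from the Grunt shell (the jar name myloader.jar, the input path,
and the MyCustomLoader class from the sketch above are illustrative):

grunt> REGISTER myloader.jar;
grunt> A = LOAD 'hdfs://localhost:9000/pig_data/input.txt' USING MyCustomLoader();
grunt> DUMP A;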

Q. How to set No. of Reducers in PIG

default_parallel: You can set the number of reducers for a MapReduce job by passing any whole number as a
value to this key.
Use the PARALLEL clause to increase the parallelism of a job:
 PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The
default value is 1 (one reduce task).
 PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input
file, one map for each HDFS block.
 If you don’t specify PARALLEL, you still get the same map parallelism but only one reduce task.
 In this example PARALLEL is used with the GROUP operator:

A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t PARALLEL 18;

 In this example all the MapReduce jobs that get launched use 20 reducers:

SET DEFAULT_PARALLEL 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();
Prefer DISTINCT over GROUP BY - GENERATE
When it comes to extracting the unique values from a column in a relation, one of two approaches can be used; DISTINCT is preferred because Pig's DISTINCT implementation is faster than the equivalent GROUP BY - GENERATE pipeline.

Example Using GROUP BY - GENERATE

A = load 'myfile' as (t, u, v);
B = foreach A generate u;
C = group B by u;
D = foreach C generate group as uniquekey;
dump D;

Example Using DISTINCT

A = load 'myfile' as (t, u, v);
B = foreach A generate u;
C = distinct B;
dump C;

Q. How to set the job priorities in PIG.


job.priority: Sets the priority of a Pig job.

Acceptable values (case insensitive): very_low, low, normal, high, very_high
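
For example, from the Grunt shell:

grunt> set job.priority high;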

Q. How to Kill the Job in PIG

A. kill: kills the job with the given job id.
kill jobid
grunt> kill job_0001

set Command
The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys (examples follow the table).

Key                Description and values

default_parallel   You can set the number of reducers for a MapReduce job by
                   passing any whole number as a value to this key.

debug              You can turn the debugging feature in Pig off or on by
                   passing on/off to this key.

job.name           You can set the job name for the required job by passing a
                   string value to this key.

job.priority       You can set the job priority for a job by passing one of the
                   following values to this key: very_low, low, normal, high,
                   very_high.

stream.skippath    For streaming, you can set the path from where the data is
                   not to be transferred, by passing the desired path in the
                   form of a string to this key.
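
For example, from the Grunt shell (the parallelism value and job name are illustrative):

grunt> set default_parallel 10;
grunt> set debug on;
grunt> set job.name 'my pig job';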

Q. Which loader function is used for unstructured data in PIG?

The Pig Latin function TextLoader() is a load function which is used to load unstructured data in
UTF-8 format.

Syntax: grunt> TextLoader()

Example
Let us assume there is a file named stu_data.txt in the HDFS
directory /pig_data/ as shown below.

001,Rajiv_Reddy,21,Hyderabad
002,siddarth_Battacharya,22,Kolkata
003,Rajesh_Khanna,22,Delhi
004,Preethi_Agarwal,21,Pune
005,Trupthi_Mohanthy,23,Bhuwaneshwar
006,Archana_Mishra,23,Chennai
007,Komal_Nayak,24,trivendram
008,Bharathi_Nambiayar,24,Chennai

grunt> details = LOAD 'hdfs://localhost:9000/pig_data/stu_data.txt' USING TextLoader();


grunt> dump details;

(001,Rajiv_Reddy,21,Hyderabad)
(002,siddarth_Battacharya,22,Kolkata)
(003,Rajesh_Khanna,22,Delhi)
(004,Preethi_Agarwal,21,Pune)
(005,Trupthi_Mohanthy,23,Bhuwaneshwar)
(006,Archana_Mishra,23,Chennai)
(007,Komal_Nayak,24,trivendram)
(008,Bharathi_Nambiayar,24,Chennai)

The Load and Store functions in Apache Pig are used to determine how data
goes into and comes out of Pig. These functions are used with the load and
store operators. Given below is the list of load and store functions available
in Pig.

S.N. Function & Description

1. PigStorage() - To load and store structured files.
2. TextLoader() - To load unstructured data into Pig.
3. BinStorage() - To load and store data into Pig using a machine-readable format.
4. Handling Compression - In Pig Latin, we can load and store compressed data.
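
For example, PigStorage takes an optional field delimiter (tab by default);
the schema below is an illustrative guess matching the stu_data.txt file shown earlier:

grunt> details = LOAD 'hdfs://localhost:9000/pig_data/stu_data.txt' USING PigStorage(',') AS (id:chararray, name:chararray, age:int, city:chararray);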

20) What are the various diagnostic operators available in Apache Pig?
1. Dump Operator - It is used to display the output of Pig Latin statements on the screen, so that
developers can debug the code.
2. Describe Operator - The describe debugging utility is helpful to developers when writing Pig
scripts, as it shows the schema of a relation in the script. Beginners who are trying to learn Apache
Pig can use the describe utility to understand how each operator alters the data. A Pig script can
have multiple describes.
3. Explain Operator - OR
Differentiate between the logical and physical plan of an Apache Pig script
Logical and physical plans are created during the execution of a Pig script. Pig scripts
are checked by the interpreter. The logical plan is produced after semantic checking and
basic parsing; no data processing takes place during the creation of a logical plan.
For each line in the Pig script, a syntax check is performed for the operators and a logical plan
is created. Whenever an error is encountered within the script, an exception is thrown
and program execution ends; otherwise, each statement in the script gets its own logical
plan.

A logical plan contains the collection of operators in the script but does not contain the
edges between the operators.
After the logical plan is generated, script execution moves to the physical plan, which
describes the physical operators Apache Pig will use to execute the Pig script. A physical
plan is more or less a series of MapReduce jobs, but the plan does not have any reference
to how it will be executed in MapReduce. During the creation of the physical plan, the
cogroup logical operator is converted into 3 physical operators,
namely Local Rearrange, Global Rearrange and Package. Load and store functions
usually get resolved in the physical plan.

4. Illustrate Operator -
Executing Pig scripts on large data sets usually takes a long time. To tackle this, developers run
Pig scripts on sample data, but there is a possibility that the sample data selected might not
exercise the Pig script properly. For instance, if the script has a join operator there should be at
least a few records in the sample data that share the same key, otherwise the join operation will
not return any results. To tackle these kinds of issues, illustrate is used. illustrate takes a sample
of the data and, whenever it comes across operators like join or filter that remove data, it
ensures that only some records pass through and some do not, by making modifications to the
records such that they meet the condition. illustrate just shows the output of each stage but
does not run any MapReduce task.
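
For example, each diagnostic operator can be invoked from the Grunt shell on a
relation (the relation name A is illustrative):

grunt> dump A;
grunt> describe A;
grunt> explain A;
grunt> illustrate A;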

WHAT IS PIGGY BANK ?

Piggy Bank is a repository of Java UDFs contributed by Pig users. These functions are
not part of the core Pig distribution; to use one, you register the piggybank jar in your
script and then invoke the function by its full class name.
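
A minimal usage sketch (the jar path and the choice of CSVExcelStorage are illustrative):

grunt> REGISTER /usr/lib/pig/piggybank.jar;
grunt> A = LOAD 'myfile' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');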

Word count logic in PIG ?

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);   -- each record is one line of text
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;  -- split every line into words, one per record
grouped = GROUP words BY word;                                   -- collect identical words together
wordcount = FOREACH grouped GENERATE group, COUNT(words);        -- count the occurrences of each word
DUMP wordcount;

