
Custom Load Function:

---------------------
Apache Pig is a platform for analyzing large data sets on top of Hadoop.
To load a custom input dataset, Pig uses a loader function which loads the
data from the file system.

Pig's loader function uses a specified InputFormat, which splits the input
data into logical splits. The InputFormat in turn uses a RecordReader, which
reads each input split and emits <key, value> pairs as input to the map function.

Implementation of Custom Load Function:
---------------------------------------

A custom loader function extends LoadFunc and provides implementations for
the following abstract methods (a minimal Java sketch follows this list):

sets the path of the input data ---> abstract void setLocation(String location, Job job)
returns the InputFormat class which will be used to split the input data ---> abstract InputFormat getInputFormat()
prepares to read data; takes a record reader and an input split as arguments ---> abstract void prepareToRead(RecordReader rr, PigSplit split)
returns the next tuple to be processed ---> abstract Tuple getNext()
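
A minimal sketch of such a loader in Java, assuming a simple comma-separated
text format; the class name MyCustomLoader and the field-splitting logic are
illustrative, not part of Pig's API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyCustomLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell Hadoop where the input data lives.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Reuse the standard line-oriented InputFormat for this sketch.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Pig hands us the RecordReader created from the InputFormat above.
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                    // end of this split
            }
            Text line = (Text) reader.getCurrentValue();
            // Split each line on commas and emit the fields as one tuple.
            List<Object> fields = new ArrayList<Object>();
            for (String field : line.toString().split(",")) {
                fields.add(field);
            }
            return tupleFactory.newTuple(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}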

Execution Procedure:
--------------------

1)create a jar which contains the load function.
2)register the loader function.
3)load the input data using the custom loader function (an example follows).
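
For example, from the Grunt shell (the jar name myloader.jar, the input path,
and the MyCustomLoader class from the sketch above are illustrative):

grunt> REGISTER myloader.jar;
grunt> A = LOAD 'hdfs://localhost:9000/pig_data/input.txt' USING MyCustomLoader();
grunt> DUMP A;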

Q. How to set No. of Reducers in PIG

default_parallel: You can set the number of reducers for a MapReduce job by passing any whole number as a
value to this key.
Use the PARALLEL clause to increase the parallelism of a job:
 PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The
default value is 1 (one reduce task).
 PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input
file, one map for each HDFS block.
 If you don’t specify PARALLEL, you still get the same map parallelism but only one reduce task.
 In this example PARALLEL is used with the GROUP operator:

A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t PARALLEL 18;

 In this example all the MapReduce jobs that get launched use 20 reducers:

SET DEFAULT_PARALLEL 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();
Prefer DISTINCT over GROUP BY - GENERATE
When it comes to extracting the unique values from a column in a relation, one of two approaches can be used; DISTINCT is preferred because Pig's DISTINCT implementation is faster than the equivalent GROUP BY - GENERATE pipeline.

Example Using GROUP BY - GENERATE

A = load 'myfile' as (t, u, v);
B = foreach A generate u;
C = group B by u;
D = foreach C generate group as uniquekey;
dump D;

Example Using DISTINCT

A = load 'myfile' as (t, u, v);
B = foreach A generate u;
C = distinct B;
dump C;

Q. How to set the job priorities in PIG.


job.priority: Sets the priority of a Pig job.

Acceptable values (case insensitive): very_low, low, normal, high, very_high
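
For example, from the Grunt shell:

grunt> set job.priority high;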

Q. How to Kill the Job in PIG

A. kill: kills the job with the given job id.
kill jobid
grunt> kill job_0001

set Command
The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys (examples follow the table).

Key                Description and values

default_parallel   You can set the number of reducers for a MapReduce job by
                   passing any whole number as a value to this key.

debug              You can turn the debugging feature in Pig off or on by
                   passing on/off to this key.

job.name           You can set the job name for the required job by passing a
                   string value to this key.

job.priority       You can set the job priority for a job by passing one of the
                   following values to this key: very_low, low, normal, high,
                   very_high.

stream.skippath    For streaming, you can set the path from where the data is
                   not to be transferred, by passing the desired path in the
                   form of a string to this key.
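
For example, from the Grunt shell (the parallelism value and job name are illustrative):

grunt> set default_parallel 10;
grunt> set debug on;
grunt> set job.name 'my pig job';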

Q. Which loader function is used for unstructured data in PIG?

The Pig Latin function TextLoader() is a load function which is used to load unstructured data in
UTF-8 format.

Syntax: grunt> TextLoader()

Example
Let us assume there is a file named stu_data.txt in the HDFS
directory /pig_data/ as shown below.

001,Rajiv_Reddy,21,Hyderabad
002,siddarth_Battacharya,22,Kolkata
003,Rajesh_Khanna,22,Delhi
004,Preethi_Agarwal,21,Pune
005,Trupthi_Mohanthy,23,Bhuwaneshwar
006,Archana_Mishra,23,Chennai
007,Komal_Nayak,24,trivendram
008,Bharathi_Nambiayar,24,Chennai

grunt> details = LOAD 'hdfs://localhost:9000/pig_data/stu_data.txt' USING TextLoader();


grunt> dump details;

(001,Rajiv_Reddy,21,Hyderabad)
(002,siddarth_Battacharya,22,Kolkata)
(003,Rajesh_Khanna,22,Delhi)
(004,Preethi_Agarwal,21,Pune)
(005,Trupthi_Mohanthy,23,Bhuwaneshwar)
(006,Archana_Mishra,23,Chennai)
(007,Komal_Nayak,24,trivendram)
(008,Bharathi_Nambiayar,24,Chennai)

The Load and Store functions in Apache Pig are used to determine how data
goes into and comes out of Pig. These functions are used with the load and
store operators. Given below is the list of load and store functions available
in Pig.

S.N. Function & Description

1. PigStorage() - To load and store structured files.
2. TextLoader() - To load unstructured data into Pig.
3. BinStorage() - To load and store data into Pig using a machine-readable format.
4. Handling Compression - In Pig Latin, we can load and store compressed data.
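
For example, PigStorage takes an optional field delimiter (tab by default);
the schema below is an illustrative guess matching the stu_data.txt file shown earlier:

grunt> details = LOAD 'hdfs://localhost:9000/pig_data/stu_data.txt' USING PigStorage(',') AS (id:chararray, name:chararray, age:int, city:chararray);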

20) What are the various diagnostic operators available in Apache Pig?
1. Dump Operator - It is used to display the output of Pig Latin statements on the screen, so that
developers can debug the code.
2. Describe Operator - The describe debugging utility is helpful to developers when writing Pig
scripts, as it shows the schema of a relation in the script. Beginners who are trying to learn Apache
Pig can use the describe utility to understand how each operator alters the data. A Pig script can
have multiple describes.
3. Explain Operator - OR
Differentiate between the logical and physical plan of an Apache Pig script
Logical and physical plans are created during the execution of a Pig script. Pig scripts
are checked by the interpreter. The logical plan is produced after semantic checking and
basic parsing; no data processing takes place during the creation of a logical plan.
For each line in the Pig script, a syntax check is performed for the operators and a logical plan
is created. Whenever an error is encountered within the script, an exception is thrown
and program execution ends; otherwise, each statement in the script gets its own logical
plan.

A logical plan contains the collection of operators in the script but does not contain the
edges between the operators.
After the logical plan is generated, script execution moves to the physical plan, which
describes the physical operators Apache Pig will use to execute the Pig script. A physical
plan is more or less a series of MapReduce jobs, but the plan does not have any reference
to how it will be executed in MapReduce. During the creation of the physical plan, the
cogroup logical operator is converted into 3 physical operators,
namely Local Rearrange, Global Rearrange and Package. Load and store functions
usually get resolved in the physical plan.

4. Illustrate Operator -
Executing Pig scripts on large data sets usually takes a long time. To tackle this, developers run
Pig scripts on sample data, but there is a possibility that the sample data selected might not
exercise the Pig script properly. For instance, if the script has a join operator there should be at
least a few records in the sample data that share the same key, otherwise the join operation will
not return any results. To tackle these kinds of issues, illustrate is used. illustrate takes a sample
of the data and, whenever it comes across operators like join or filter that remove data, it
ensures that only some records pass through and some do not, by making modifications to the
records such that they meet the condition. illustrate just shows the output of each stage but
does not run any MapReduce task.
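
For example, each diagnostic operator can be invoked from the Grunt shell on a
relation (the relation name A is illustrative):

grunt> dump A;
grunt> describe A;
grunt> explain A;
grunt> illustrate A;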

WHAT IS PIGGY BANK ?

Piggy Bank is a repository of Java UDFs contributed by Pig users. These functions are
not part of the core Pig distribution; to use one, you register the piggybank jar in your
script and then invoke the function by its full class name.
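
A minimal usage sketch (the jar path and the choice of CSVExcelStorage are illustrative):

grunt> REGISTER /usr/lib/pig/piggybank.jar;
grunt> A = LOAD 'myfile' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');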

Word count logic in PIG ?

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);   -- each record is one line of text
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;  -- split every line into words, one per record
grouped = GROUP words BY word;                                   -- collect identical words together
wordcount = FOREACH grouped GENERATE group, COUNT(words);        -- count the occurrences of each word
DUMP wordcount;

