Pig

Pig allows you to write custom user-defined functions (UDFs) and inject them into specific parts of the data processing pipeline. While Pig does not enforce an explicit data schema, debugging often centers on schema issues, since data types can change unexpectedly during processing. You can write UDFs in Python and leverage Pig for large-scale data processing by applying them at specific steps.

Since Pig is procedural, you can control the execution of every step.

If you want to write your own UDF (User Defined Function) and inject it at one
specific point in the pipeline, it is straightforward.

The data schema is not enforced explicitly but implicitly. I think this is a big
one, too.
In my experience, debugging Pig scripts is 90% of the time about the schema:
since Pig does not enforce an explicit schema, a field sometimes goes bytearray,
which is a "raw" data type, and unless you coerce the fields, even the strings,
they turn into bytearray without notice.
This may propagate to later steps of the data processing.
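
A minimal sketch of guarding against this with explicit casts (the file name
names.csv and its two fields are hypothetical):

raw = LOAD 'names.csv' USING PigStorage(',');
-- no AS clause, so every field defaults to bytearray
typed = FOREACH raw GENERATE (int)$0 AS id, (chararray)$1 AS name;
-- explicit casts pin the types down for the rest of the pipeline
DESCRIBE typed;
-- typed: {id: int,name: chararray}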

You could write your UDFs in Python.


If you have UDFs that you want to parallelize and apply to large amounts of data,
you are in luck: use Pig as the base pipeline that does the hard work, and just
apply your UDF at the step you want.
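
A minimal sketch of that pattern, assuming a hypothetical Jython UDF file
udfs.py; the to_upper function and the two-column drivers.csv layout are
illustrative only:

REGISTER 'udfs.py' USING jython AS myfuncs;
-- udfs.py would contain something like:
--   from pig_util import outputSchema
--   @outputSchema("name_upper:chararray")
--   def to_upper(s):
--       return s.upper() if s is not None else None
drivers = LOAD 'drivers.csv' USING PigStorage(',')
    AS (driverId:int, driverName:chararray);
-- the UDF runs at exactly this step; Pig parallelizes it across the data
upper_names = FOREACH drivers GENERATE driverId, myfuncs.to_upper(driverName);
DUMP upper_names;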

PigServer is a class that lets Java programs connect to Pig. Typically a program
will create a PigServer instance and issue its queries through it.

Run a script in local mode:

pig -x local myscript.pig
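
For reference, a minimal hypothetical myscript.pig that this command could run:

-- myscript.pig: load a text file and print its lines
lines = LOAD 'input.txt' AS (line:chararray);
DUMP lines;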

Start the interactive Grunt shell:

pig

Basic Grunt shell commands:

sh ls    -- run a shell command (here ls) from inside Grunt
clear    -- clear the screen
help     -- list the available commands

Execute Pig commands:

truck_events1 = LOAD '/user/centos/drivers.csv' USING PigStorage(',');
-- no schema given, so every field is bytearray
DESCRIBE truck_events1;

truck_events2 = LOAD '/user/centos/drivers.csv' USING PigStorage(',')
    AS (driverId:int, truckId:int, eventTime:chararray,
        eventType:chararray, longitude:double, latitude:double,
        eventKey:chararray, correlationId:long, driverName:chararray,
        routeId:long, routeName:chararray, eventDate:chararray);
DESCRIBE truck_events2;

-- keep only the first 10 tuples
truck_events_subset = LIMIT truck_events2 10;
DESCRIBE truck_events_subset;
DUMP truck_events_subset;

-- project three columns and write them back out
specific_columns = FOREACH truck_events_subset GENERATE driverId, eventTime, eventType;
DESCRIBE specific_columns;
STORE specific_columns INTO 'output1/specific_columns' USING PigStorage(',');

orders = LOAD '/user/centos/data1.csv' USING PigStorage(',')
    AS (cstrId:int, itmId:int, orderDate:long, deliveryDate:long);
grpd = GROUP orders BY cstrId;
items_by_customer = FOREACH grpd GENERATE group AS cstrId, COUNT(orders) AS itemCnt;
DESCRIBE items_by_customer;

orders = LOAD '/user/centos/data1.csv' USING PigStorage(',')
    AS (cstrId:int, itmId:int, orderDate:long, deliveryDate:long);
cstr_info = LOAD '/user/centos/information.csv' USING PigStorage(',')
    AS (cstrId:int, name:chararray, city:chararray);
-- inner join on the customer id
jnd = JOIN orders BY cstrId, cstr_info BY cstrId;
DESCRIBE jnd;
-- group by (item, city); joined fields are disambiguated with ::
jnd_grp = GROUP jnd BY (orders::itmId, cstr_info::city);
DESCRIBE jnd_grp;
-- FLATTEN unpacks the (itmId, city) group key into separate columns
result = FOREACH jnd_grp GENERATE FLATTEN(group), COUNT(jnd) AS cnt;
DESCRIBE result;
