0% found this document useful (0 votes)
109 views23 pages

Apache Pig

Apache Pig is a platform for analyzing large datasets that allows you to write code in a language called Pig Latin. Pig Latin scripts can define map and reduce steps to load, filter, group, join, and analyze data stored in HDFS. An example Pig Latin script is provided that loads movie rating data, calculates average ratings by movie, filters for movies with over 4 stars, joins with movie metadata, and orders the results to find the oldest highly rated movies.

Uploaded by

Bora Yüret
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
109 views23 pages

Apache Pig

Apache Pig is a platform for analyzing large datasets that allows you to write code in a language called Pig Latin. Pig Latin scripts can define map and reduce steps to load, filter, group, join, and analyze data stored in HDFS. An example Pig Latin script is provided that loads movie rating data, calculates average ratings by movie, filters for movies with over 4 stars, joins with movie metadata, and orders the results to find the oldest highly rated movies.

Uploaded by

Bora Yüret
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

APACHE PIG

Why Pig?

■ Writing mappers and reducers by hand


takes a long time.
■ Pig introduces Pig Latin, a scripting
language that lets you use SQL-like
syntax to define your map and reduce
steps.
■ Highly extensible with user-defined
functions (UDF’s)
MapReduce

YARN

HDFS
Running Pig

■ Grunt
■ Script
■ Ambari / Hue
An example

■ Find the oldest 5-star movies


ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);

This creates a relation named “ratings” with a given schema.

(660,229,2,891406212)
(421,498,4,892241344)
(495,1091,4,888637503)
(806,421,4,882388897)
(676,538,4,892685437)
(721,262,3,877137285)
Use PigStorage if you need a
different delimiter.
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING
PigStorage('|')AS (movieID:int, movieTitle:chararray,
releaseDate:chararray, videoRelease:chararray,
imdbLink:chararray);
DUMP metadata;

(1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%20(1995))


(2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(1995))
(3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995))
(4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995))
(5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995))
Creating a relation from another
relation; FOREACH / GENERATE
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
AS (movieID:int, movieTitle:chararray, releaseDate:chararray,
videoRelease:chararray, imdbLink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle,
ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

(1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%20(1995))

(1,Toy Story (1995),788918400)


Group By

ratingsByMovie = GROUP ratings BY movieID;


DUMP ratingsByMovie;

(1,{(807,1,4,892528231),(554,1,3,876231938),(49,1,2,888068651), … }
(2,{(429,2,3,882387599),(551,2,2,892784780),(774,2,1,888557383), … }
ratingsByMovie: {group: int,ratings: {(userID: int,movieID: int,rating: int,ratingTime: int)}}

avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID,


AVG(ratings.rating) AS avgRating;
DUMP avgRatings;

(1,3.8783185840707963)
(2,3.2061068702290076)
(3,3.033333333333333)
(4,3.550239234449761)
(5,3.302325581395349)

DESCRIBE ratings;
DESCRIBE ratingsByMovie;
DESCRIBE avgRatings;

ratings: {userID: int,movieID: int,rating: int,ratingTime: int}

ratingsByMovie: {group: int,ratings: {(userID: int,movieID: int,rating: int,ratingTime: int)}}

avgRatings: {movieID: int,avgRating: double}


FILTER

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

(12,4.385767790262173)
(22,4.151515151515151)
(23,4.1208791208791204)
(45,4.05)
JOIN
DESCRIBE fiveStarMovies;
DESCRIBE nameLookup;
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
DESCRIBE fiveStarsWithData;
DUMP fiveStarsWithData;

fiveStarMovies: {movieID: int,avgRating: double}


nameLookup: {movieID: int,movieTitle: chararray,releaseTime: long}

fiveStarsWithData: {fiveStarMovies::movieID: int,fiveStarMovies::avgRating: double,


nameLookup::movieID: int,nameLookup::movieTitle: chararray,nameLookup::releaseTime: long}

(12,4.385767790262173,12,Usual Suspects, The (1995),808358400)


(22,4.151515151515151,22,Braveheart (1995),824428800)
(23,4.1208791208791204,23,Taxi Driver (1976),824428800)
ORDER BY
oldestFiveStarMovies = ORDER fiveStarsWithData BY
nameLookup::releaseTime;
DUMP oldestFiveStarMovies;

(493,4.15,493,Thin Man, The (1934),-1136073600)


(604,4.012345679012346,604,It Happened One Night (1934),-1136073600)
(615,4.0508474576271185,615,39 Steps, The (1935),-1104537600)
(1203,4.0476190476190474,1203,Top Hat (1935),-1104537600)
Putting it all together
Let’s run it
Pig Latin: Diving Deeper
Things you can do to a relation
■ LOAD STORE DUMP
– STORE ratings INTO ‘outRatings’ USING PigStorage(‘:’);
■ FILTER DISTINCT FOREACH/GENERATE MAPREDUCE STREAM SAMPLE
■ JOIN COGROUP GROUP CROSS CUBE
■ ORDER RANK LIMIT
■ UNION SPLIT
Diagnostics

■ DESCRIBE
■ EXPLAIN
■ ILLUSTRATE
UDF’s

■ REGISTER
■ DEFINE
■ IMPORT
Some other functions and loaders

■ AVG CONCAT COUNT MAX MIN SIZE SUM

■ PigStorage
■ TextLoader
■ JsonLoader
■ AvroStorage
■ ParquetLoader
■ OrcStorage
■ HBaseStorage
Learning more
PIG CHALLENGE
Find the most popular bad movies
Defining the problem

■ Find all movies with an average rating less than 2.0


■ Sort them by the total number of ratings
Hint

■ We used everything you need in our earlier example of finding old movies with
ratings greater than 4.0
■ Only new thing you need is COUNT(). This lets you count up the number of
items in a bag.
– So just like you can say AVG(ratings.rating) to get the average rating
from a bag of ratings,
– You can say COUNT(ratings.rating) to get the total number of ratings for
a given group’s bag.

You might also like