Lab 7: Getting Started with Pig on Cloudera QuickStart VM
Apache Pig is a high-level platform for processing large datasets in Hadoop. It uses a scripting
language called Pig Latin, which simplifies complex data transformations and analysis. Pig
converts these scripts into MapReduce jobs, making it easier to work with structured and semi-
structured data without writing low-level Java code. It is widely used for ETL (Extract,
Transform, Load) tasks, data preprocessing, and analytics.
Objective:
In this lab, you will learn how to start Pig in local and MapReduce modes, load data, and
perform basic operations such as sorting, grouping, joining, projecting, and filtering.
Task 1: Start Apache Pig
1. Open the Cloudera QuickStart VM.
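2. Open a terminal and start the Grunt shell in local mode, which runs Pig against the local file system:
pig -x local
To run against HDFS in MapReduce mode instead, start Pig with:
pig -x mapreduce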
Task 2: Create and Load Sample Data
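1. Using a text editor, create an input file named employees.txt with the contents below. In local mode, save it in the directory from which you started Pig; in MapReduce mode, the file must also be copied into HDFS. The fields are ID, name, age, department, and salary: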
101,John,28,IT,60000
102,Alice,24,HR,55000
103,Bob,30,IT,70000
104,David,27,Finance,65000
105,Eve,29,HR,62000
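Once the file is saved, it can be checked (and, in MapReduce mode, uploaded) without leaving Grunt. The following is a minimal sketch using Grunt's built-in fs command; it assumes employees.txt sits in the directory from which Pig was launched and that the HDFS home directory of the current user is the upload target.

```pig
-- MapReduce mode only: copy the local file into your HDFS home directory.
-- (Skip this line in local mode, where the file is already in place.)
fs -put employees.txt employees.txt

-- Either mode: print the file to confirm that Pig can see it.
fs -cat employees.txt
```

If the second command prints the five employee records, the relative path 'employees.txt' used in the next task should resolve correctly.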
Task 3: Load Data into Pig
Run the following Pig script in the Grunt shell:
EMPLOYEES = LOAD 'employees.txt' USING PigStorage(',')
AS (ID:INT, NAME:CHARARRAY, AGE:INT, DEPT:CHARARRAY,
SALARY:INT);
Verify the loaded data:
DUMP EMPLOYEES;
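Alongside DUMP, it can be useful to confirm that the field names and types from the AS clause were attached to the relation. A short sketch using Pig's DESCRIBE operator on the EMPLOYEES relation defined above:

```pig
-- Print the schema of the relation without running a job.
DESCRIBE EMPLOYEES;
-- Should print something like:
-- EMPLOYEES: {ID: int,NAME: chararray,AGE: int,DEPT: chararray,SALARY: int}
```

Unlike DUMP, DESCRIBE does not read the data, so it returns immediately and is a cheap way to catch a misspelled field name or a wrong type.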
Task 4: Projection (Selecting Specific Columns)
To display only Name and Department:
EMP_PROJECTION = FOREACH EMPLOYEES GENERATE NAME, DEPT;
DUMP EMP_PROJECTION;
Task 5: Filtering (Employees with Salary > 60000)
HIGH_SALARY = FILTER EMPLOYEES BY SALARY > 60000;
DUMP HIGH_SALARY;
Task 6: Sorting Data by Age
SORTED_EMPLOYEES = ORDER EMPLOYEES BY AGE ASC;
DUMP SORTED_EMPLOYEES;
Task 7: Grouping Data by Department
GROUPED_BY_DEPT = GROUP EMPLOYEES BY DEPT;
DUMP GROUPED_BY_DEPT;
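Each tuple produced by GROUP holds the grouping key in a field named group and a bag of the matching employee tuples in a field named EMPLOYEES. The sketch below is an optional extension, not part of the original task list: it summarizes each bag with the built-in COUNT and AVG functions, then joins the summary back to EMPLOYEES to illustrate the join operation mentioned in the objective. The aliases DEPT_STATS, HEADCOUNT, AVG_SALARY, and EMP_WITH_AVG are names chosen here for illustration.

```pig
-- One output row per department: key, number of employees, mean salary.
DEPT_STATS = FOREACH GROUPED_BY_DEPT GENERATE
    group AS DEPT,
    COUNT(EMPLOYEES) AS HEADCOUNT,
    AVG(EMPLOYEES.SALARY) AS AVG_SALARY;
DUMP DEPT_STATS;
-- For the sample data this should yield (row order may vary):
-- (Finance,1,65000.0)
-- (HR,2,58500.0)
-- (IT,2,65000.0)

-- Inner join on department: each employee row gains its department's statistics.
EMP_WITH_AVG = JOIN EMPLOYEES BY DEPT, DEPT_STATS BY DEPT;
DUMP EMP_WITH_AVG;
```

After the join, both inputs contribute a DEPT field, so later statements need to disambiguate with the relation prefix, for example EMPLOYEES::DEPT.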
Practice Problems:
Data (save as sales.txt):
201,John,TV,Electronics,2,50000
202,Alice,Laptop,Electronics,1,70000
203,Bob,Phone,Electronics,3,30000
204,David,Shirt,Clothing,4,2000
205,Eve,Shoes,Clothing,2,4000
206,Frank,WashingMachine,Electronics,1,25000
207,Grace,Table,Furniture,1,15000
208,Harry,Chair,Furniture,2,5000
Load the data into Pig and verify it:
SALES = LOAD 'sales.txt' USING PigStorage(',')
AS (TID:INT, CNAME:CHARARRAY, PRODUCT:CHARARRAY,
CATEGORY:CHARARRAY, QTY:INT, PRICE:INT);
DUMP SALES;
Task 1: Projection (Selecting Specific Columns)
Display Customer Name, Product, and Price only.
Task 2: Filtering (Transactions where Quantity > 2)
Task 3: Sorting Data by Price (Descending Order)
Task 4: Grouping Transactions by Category