
Lab 7: Getting Started with Pig on Cloudera QuickStart VM

Apache Pig is a high-level platform for processing large datasets in Hadoop. It uses a scripting
language called Pig Latin, which simplifies complex data transformations and analysis. Pig
converts these scripts into MapReduce jobs, making it easier to work with structured and semi-
structured data without writing low-level Java code. It is widely used for ETL (Extract,
Transform, Load) tasks, data preprocessing, and analytics.
Objective:
In this lab, you will learn how to start Pig in local and MapReduce modes, load data, and
perform basic operations such as projection, filtering, sorting, and grouping.
Task 1: Start Apache Pig
1. Open the Cloudera QuickStart VM.
2. Open the Terminal and start the Pig Grunt shell in local mode by running:
pig -x local
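To run Pig against the Hadoop cluster instead of the local filesystem, start it in MapReduce mode (this is also the default when no -x flag is given):
pig -x mapreduce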
Task 2: Create and Load Sample Data
1. Create a text file named employees.txt containing the records below. Keep it in your local working directory for local mode, or copy it into HDFS if you are using MapReduce mode (example terminal commands follow the data):
101,John,28,IT,60000
102,Alice,24,HR,55000
103,Bob,30,IT,70000
104,David,27,Finance,65000
105,Eve,29,HR,62000
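One way to create the file from the terminal is shown below; the HDFS path assumes the QuickStart VM's default /user/cloudera home directory:
cat > employees.txt            # paste the five records above, then press Ctrl+D
hdfs dfs -put employees.txt /user/cloudera/    # only needed for MapReduce mode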
Task 3: Load Data into Pig
Run the following Pig script in the Grunt shell:
EMPLOYEES = LOAD 'employees.txt' USING PigStorage(',')
AS (ID:INT, NAME:CHARARRAY, AGE:INT, DEPT:CHARARRAY,
SALARY:INT);
Verify the loaded data:
DUMP EMPLOYEES;
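You can also confirm the schema Pig attached from the AS clause:
DESCRIBE EMPLOYEES;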
Task 4: Projection (Selecting Specific Columns)
To display only Name and Department:
EMP_PROJECTION = FOREACH EMPLOYEES GENERATE NAME, DEPT;
DUMP EMP_PROJECTION;
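The GENERATE clause can also compute derived fields rather than only pass columns through. As an illustration beyond the original task, an annual salary could be projected like this:
EMP_ANNUAL = FOREACH EMPLOYEES GENERATE NAME, SALARY * 12 AS ANNUAL_SALARY;
DUMP EMP_ANNUAL;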
Task 5: Filtering (Employees with Salary > 60000)
HIGH_SALARY = FILTER EMPLOYEES BY SALARY > 60000;
DUMP HIGH_SALARY;
Task 6: Sorting Data by Age
SORTED_EMPLOYEES = ORDER EMPLOYEES BY AGE ASC;
DUMP SORTED_EMPLOYEES;
Task 7: Grouping Data by Department
GROUPED_BY_DEPT = GROUP EMPLOYEES BY DEPT;
DUMP GROUPED_BY_DEPT;
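Grouping is usually followed by an aggregate over each group's bag of tuples. As a sketch beyond the original task, the average salary and head count per department could be computed like this:
DEPT_STATS = FOREACH GROUPED_BY_DEPT GENERATE group AS DEPT,
AVG(EMPLOYEES.SALARY) AS AVG_SALARY, COUNT(EMPLOYEES) AS NUM_EMPLOYEES;
DUMP DEPT_STATS;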

Practice Problems:
Data:
201,John,TV,Electronics,2,50000
202,Alice,Laptop,Electronics,1,70000
203,Bob,Phone,Electronics,3,30000
204,David,Shirt,Clothing,4,2000
205,Eve,Shoes,Clothing,2,4000
206,Frank,WashingMachine,Electronics,1,25000
207,Grace,Table,Furniture,1,15000
208,Harry,Chair,Furniture,2,5000

SALES = LOAD 'sales.txt' USING PigStorage(',')
AS (TID:INT, CNAME:CHARARRAY, PRODUCT:CHARARRAY,
CATEGORY:CHARARRAY, QTY:INT, PRICE:INT);
DUMP SALES;

Task 1: Projection (Selecting Specific Columns)
Display Customer Name, Product, and Price only.
Task 2: Filtering (Transactions where Quantity > 2)
Task 3: Sorting Data by Price (Descending Order)
Task 4: Grouping Transactions by Category
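Hint for Task 3: descending order uses the DESC keyword, which the main lab did not demonstrate; for example (the relation name is only illustrative):
SORTED_SALES = ORDER SALES BY PRICE DESC;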
