CSCE5300 Introduction to Big Data and Data Science
Activity2
Eclipse Project Creation Instructions, executing Activity2
Using Hadoop on Cloudera and AWS
Hadoop on Cloudera
Purpose of Activity2.
In this activity we will learn how to use the Hadoop MapReduce model, store data in HDFS,
and visualize data using Hue, as well as run the same workflow on AWS.
QUESTION:
Provide a detailed understanding of the activity at the end and
perform the tasks wherever mentioned.
Use case Description:
Map Phase:
Input Splitting: The input data is divided into smaller chunks called input splits.
Map Function Execution: Each input split is processed by a map task, and many map tasks
run in parallel on the worker nodes. The Map function processes the input data and produces
intermediate key-value pairs.
Intermediate Key-Value Pairs: The Map function generates these intermediate
pairs, which are then grouped by key.
Shuffle and Sort Phase:
Partitioning: The intermediate key-value pairs are partitioned based on the key's
hash value. Each partition is assigned to a reducer.
Sorting: Within each partition, the intermediate pairs are sorted based on their
keys. This step is crucial for efficient grouping and reducing.
Reduce Phase:
Reduce Function Execution: Each reducer processes one partition of the sorted
intermediate data. The Reduce function takes the sorted key-value pairs and
performs computations on them.
Output Generation: The Reduce function generates the final output key-value pairs,
which are typically aggregated results or summaries.
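To make these phases concrete, here is a minimal word-count sketch of a Hadoop Mapper and Reducer. The class and variable names are illustrative; the actual code used in this activity is in Ex2.java and even.java on Canvas.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: split each line into words and emit an intermediate (word, 1) pair.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, all values for the same word arrive together; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}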
Activity Overview
WORD COUNT: (VOWELS and CONSONANTS)
Task 1: Use code block 1 and count the frequency of words that start with the letter
'a'.
WORD COUNT: (EVEN and ODD)
Task 2: Use code block 2 and count the frequency of words that have an odd count.
Eclipse Project creation steps
Step by step instructions:
File > New > Java Project > Next.
"WordCount" as our project name and click "Finish":
Getting references to hadoop libraries
Right click on WordCount project and select "Properties":
Hit "Add External JARs...", then, File System > usr > lib > hadoop :
We may want to select all jars, and click OK:
We need to add more external libs. Go to "Add External JARs..." again, then grab all libs
in "client": Then, hit "OK"
CODE BLOCK 1:
WORD COUNT: (VOWELS and CONSONANTS)
Code block 1 counts the frequency of words in the given text file that start
with any two letters chosen from the vowels and any two letters chosen from
the consonants.
Example: words starting with A or E (vowels) and words starting with S or R (consonants).
Vowels: words that start with the letters 'a', 'e', 'i', 'o', 'u'.
Consonants: words that start with any other letter.
The code is available in the Ex2.java file, which is on Canvas under
Activity 2.
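As a rough sketch of how such a filter can be written (this is not the actual Ex2.java; the chosen letters a, e, s, r are only the example letters from the description above, and the imports are the same as in the word-count sketch shown earlier), the mapper emits a word only when its first letter is in the chosen set, and the standard summing reducer then counts the surviving words:

// Illustrative mapper sketch only; the authoritative code is Ex2.java on Canvas.
public static class LetterFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Assumption: two vowels (a, e) and two consonants (s, r), as in the example above.
    private static final String CHOSEN_LETTERS = "aesr";
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\s+")) {
            // Emit (word, 1) only if the word starts with one of the chosen letters.
            if (!token.isEmpty() && CHOSEN_LETTERS.indexOf(token.charAt(0)) >= 0) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

For Task 1, the same idea applies with the chosen set reduced to the single letter 'a'.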
Task 1
Use code block 1 and count the frequency of words that start with the
letter 'a'.
Please perform Task 1 by using code block 1 as a reference.
Implementation for Code Block 1
Step 1: Initially, place sample.txt in the Downloads folder and
display its contents in the command prompt using the command
below.
Command: cat /home/cloudera/Downloads/sample.txt
Here, cat is used to display the file's contents in the command prompt.
Step 2: Create a directory named pravallika and place
sample.txt in that directory using the commands below.
Commands: hadoop fs -mkdir pravallika
hadoop fs -put /home/cloudera/Downloads/sample.txt pravallika/
Here, the mkdir command is used to create a directory in HDFS, and the
put command is used to copy the file from the local file
system (/home/cloudera/Downloads/sample.txt) to HDFS (pravallika/).
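To verify that the file actually landed in HDFS, the directory can be listed before running the job:
Command: hadoop fs -ls pravallika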
Visualize sample.txt in Hue
Create a new class named Ex2 under the WordCount project.
Once the class is created, copy the code from the file Ex2.java, which is
available on Canvas under Activity 2.
Once the code is saved, right-click on the project -> Export -> select JAR file
under Java -> rename the JAR file -> Next -> Finish.
Using the command below, we can run the jar file from the command prompt and
then visualize the output in Hue.
Command: hadoop jar /home/cloudera/Ex2.jar Ex2
pravallika/sample.txt Ex2_output
Command Explanation
/home/cloudera/Ex2.jar - the jar file is located at this path
pravallika/sample.txt - the input file is present in the pravallika directory
Ex2_output - this is the name of the output directory
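For reference, the driver (main) method of such a job reads the input path and the output directory from these command-line arguments and wires the mapper and reducer together, roughly like this (a sketch only, not the exact Ex2.java; TokenMapper and SumReducer refer to the word-count sketch shown earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Ex2Driver {
    public static void main(String[] args) throws Exception {
        // args[0] is the HDFS input path (pravallika/sample.txt),
        // args[1] is the output directory (Ex2_output), which must not already exist.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(Ex2Driver.class);
        job.setMapperClass(WordCountSketch.TokenMapper.class);
        job.setReducerClass(WordCountSketch.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}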
Visualize the Output in Hue
To display the output in the command prompt, we use the command below.
Command: hadoop fs -cat
Ex2_output/part-r-00000
cat is used to display the content in
the command prompt.
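Each line of part-r-00000 produced by the default output format is a word and its count separated by a tab, for example (illustrative values only; the real words and counts depend on sample.txt):
apple	3
area	1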
CODE BLOCK 2: (EVEN AND ODD NUMBERS)
Code block 2 counts the frequency of words that have an even count.
The code is available in the even.java file, which is on Canvas under
Activity 2.
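Because a word's total count is only known after all of its intermediate values are summed, the even/odd filtering naturally happens in the reducer. A minimal sketch, assuming the standard word-count mapper and the same imports as the earlier sketch (this is not the actual even.java):

// Illustrative reducer sketch only; the authoritative code is even.java on Canvas.
public static class EvenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Emit the word only when its total count is even.
        // For Task 2 (odd counts), the condition would become sum % 2 != 0.
        if (sum % 2 == 0) {
            context.write(key, new IntWritable(sum));
        }
    }
}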
Task 2
Use code block 2 and count the frequency of words that have an odd count.
Please perform Task 2 by using code block 2 as a reference.
Implementation for Code Block 2
Create a new class named even under the WordCount project.
Once the class is created, copy the code from the file even.java, which is
available on Canvas under Activity 2.
Once the code is saved, right-click on the project -> Export -> select JAR file
under Java -> rename the JAR file -> Next -> Finish.
Using the command below, we can run the jar file from the command prompt and
then visualize the output in Hue.
Command: hadoop jar /home/cloudera/even.jar even
pravallika/sample.txt even_output
Command Explanation
/home/cloudera/even.jar - the jar file is located at this path
pravallika/sample.txt - the input file is present in the pravallika directory
even_output - this is the name of the output directory
Visualize the Output in Hue
To display the output in the command prompt, we use the command below.
Command: hadoop fs -cat
even_output/part-r-00000
cat is used to display the content in the command prompt.
List of Commands using hadoop fs and hdfs dfs
List Files and Directories:
Using hadoop fs:
hadoop fs -ls /user/myuser
Using hdfs dfs:
hdfs dfs -ls /user/myuser
Both commands will list the contents of the /user/myuser directory in
HDFS.
Create Directory:
Using hadoop fs:
hadoop fs -mkdir /user/myuser/data
Using hdfs dfs:
hdfs dfs -mkdir /user/myuser/data
Both commands will create a new directory named data within the
/user/myuser directory in HDFS.
Copy File from Local to HDFS:
Using hadoop fs:
hadoop fs -copyFromLocal localfile.txt hdfs://name-
node:8020/user/myuser/data/
Using hdfs dfs:
hdfs dfs -copyFromLocal localfile.txt
hdfs://namenode:8020/user/myuser/data/
Both commands will copy the local file localfile.txt to the data directory
in HDFS.
Move File:
Using hadoop fs:
hadoop fs -mv /user/myuser/data/file.txt /user/myuser/archive/
Using hdfs dfs:
hdfs dfs -mv /user/myuser/data/file.txt /user/myuser/archive/
Both commands will move the file file.txt from the data directory to the
archive directory in HDFS.
Delete File:
Using hadoop fs:
hadoop fs -rm /user/myuser/archive/file.txt
Using hdfs dfs:
hdfs dfs -rm /user/myuser/archive/file.txt
Both commands will delete the file file.txt from the archive directory in HDFS.
Difference between Hadoop fs and hdfs dfs
hadoop fs is a more generic command-line utility that can interact
with various file systems, while hdfs dfs is specifically designed for
HDFS operations.
The syntax of the commands is almost identical between the two
utilities for HDFS operations.
The main difference lies in their scope: hadoop fs can interact with
other file systems (such as the local file system, S3, and other Hadoop-compatible
file systems), while hdfs dfs is limited to HDFS operations.
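For example, hadoop fs can address a non-HDFS file system explicitly through a URI scheme, such as listing the local Downloads folder used earlier in this activity:
Command: hadoop fs -ls file:///home/cloudera/Downloads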
Using AWS
Step-1: Log in to the AWS account and create an S3 bucket.
Step-2: Choose a unique bucket name and finally click on the Create bucket option.
Step-3: After creating the bucket successfully, click on the bucket name (to enter the bucket).
Step 4: After entering the bucket, we have to do the following steps:
a. Upload the jar file that was created in Cloudera to our bucket.
b. Create an input folder in the AWS bucket and, inside the input folder, upload our
sample.txt, which contains our input data.
c. Note down the jar file's S3 URI, which will be useful when executing steps in the cluster.
d. Note down the sample file's S3 URI, which will be useful when executing steps in the cluster.
Step 4-a: Click on the Upload button, choose Add files, upload our jar file, and click Upload at the
bottom of the page.
After uploading the file, the dashboard of the bucket looks like below:
Step 4-b: Click on the Create folder icon and name it inputFile.
Then go inside the inputFile folder and upload your input file, named sample.txt, which is provided on Canvas.
After uploading the file, the dashboard looks like below:
Step 4-c: Navigating to the jar file's S3 URI:
Click on the jar file.
Note down its S3 URI.
Step 4-d: Navigating to the input file's S3 URI:
Click on the inputFile folder and select the input file.
Note down its S3 URI.
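Optionally, if the AWS CLI is installed and configured, the same uploads can be done from the command line (the bucket name below is the one used in this example; replace it with your own bucket name):
Command: aws s3 cp /home/cloudera/Ex2.jar s3://bigdata5300activity2/
Command: aws s3 cp /home/cloudera/Downloads/sample.txt s3://bigdata5300activity2/inputFile/
Command: aws s3 ls s3://bigdata5300activity2/ --recursive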
Now search for EMR and navigate to the EMR page.
Click on "Switch to the old console", which is more user-friendly for creating clusters.
After switching to the old console, click on Create cluster.
After clicking Create cluster, enter a unique cluster name and then choose "Go to advanced options".
On the next page, scroll down, choose the step type as Custom JAR, and then click Add step.
The Add step screen looks as follows:
In the Name column above, we have to give the name of the Java class where we wrote our code. In my case
the Name is WordCount (make sure you give your class name, otherwise the cluster will not run successfully).
For the JAR location, we take the jar's S3 URI from the S3 bucket; we already noted it down in Step 4-c.
For Arguments, we have to give both the input and output file paths, separated by a space.
Giving the input file: the input file location was already noted down in Step 4-d. In my case, my input file's S3
URI is: s3://bigdata5300activity2/inputFile/sample.txt
Giving the output file: for the output file there is no need to create anything manually; whenever we run our
cluster, the output files are created in the path specified in our arguments. So I am just giving the name of the
output folder:
s3://bigdata5300activity2/inputFile/output118
After adding all the arguments, it should look like below:
Then click Add, then Next, and Next again. In step 3 (General Cluster Settings), make sure our
cluster name is reflected; if it is not, rename it.
After confirming the name, click Next and select Create cluster.
The cluster runs internally; it takes some time to execute the steps and produce our output.
Finally, our cluster will execute all steps successfully.
For the output files, we have to go back to the S3 bucket.
If we go inside the inputFile folder, we can see the output118 folder, which contains the output files
generated by our cluster.
NOTE: For both Task 1 and Task 2 the steps are the same (1. creating the jar and adding that jar to our cluster,
2. passing the input file to the cluster), so we are not explaining Task 2 separately.