CSCE5300 Introduction to Big Data and Data Science
Activity2
Eclipse Project Creation Instructions, executing Activity2
Using Hadoop on Cloudera and AWS
Hadoop on Cloudera
Purpose of Activity2.
In this activity we will learn how to use the Hadoop MapReduce model, store data in HDFS,
and visualize data using Hue, as well as run the same workflow on AWS.
QUESTION:
Provide a detailed understanding of the activity at the end and
perform the tasks wherever mentioned.
Use case Description:
Map Phase:
Input Splitting: The input data is divided into smaller chunks called input splits.
Map Function Execution: Each input split is processed by a map task, and many map tasks
run in parallel on the worker nodes. The Map function processes the input data and produces
intermediate key-value pairs.
Intermediate Key-Value Pairs: The Map function generates these intermediate
pairs, which are then grouped by key.
Shuffle and Sort Phase:
Partitioning: The intermediate key-value pairs are partitioned based on the key's
hash value. Each partition is assigned to a reducer.
Sorting: Within each partition, the intermediate pairs are sorted based on their
keys. This step is crucial for efficient grouping and reducing.
Reduce Phase:
Reduce Function Execution: Each reducer processes one partition of the sorted
intermediate data. The Reduce function takes the sorted key-value pairs and
performs computations on them.
Output Generation: The Reduce function generates the final output key-value pairs,
which are typically aggregated results or summaries.
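To make these phases concrete, here is a minimal word-count sketch of a Hadoop Mapper and Reducer. The class and variable names are illustrative; the actual code used in this activity is in Ex2.java and even.java on Canvas.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: split each line into words and emit an intermediate (word, 1) pair.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, all values for the same word arrive together; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}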
Activity Overview
WORD COUNT: (VOWELS and CONSONANTS)
Task 1: Use code block 1 and count the frequency of words that start with the letter
'a'.
WORD COUNT: (EVEN and ODD)
Task 2: Use code block 2 and count the frequency of words that have an odd count.
Eclipse Project creation steps
Step by step instructions:
File > New > Java Project > Next.
"WordCount" as our project name and click "Finish":
Getting references to hadoop libraries
Right click on WordCount project and select "Properties":
Hit "Add External JARs...", then, File System > usr > lib > hadoop :
We may want to select all jars, and click OK:
We need to add more external libs. Go to "Add External JARs..." again, then grab all libs
in "client": Then, hit "OK"
CODE BLOCK 1:
WORD COUNT: (VOWELS and CONSONANTS)
Code block 1 counts the frequency of words in the given text file that start
with any two letters chosen from the vowels and any two letters chosen from
the consonants.
Example: words starting with A or E (vowels) and words starting with S or R (consonants).
Vowels: words that start with the letters 'a', 'e', 'i', 'o', 'u'.
Consonants: words that start with any other letter.
The code is available in the Ex2.java file, which is on Canvas under
Activity 2.
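As a rough sketch of how such a filter can be written (this is not the actual Ex2.java; the chosen letters a, e, s, r are only the example letters from the description above, and the imports are the same as in the word-count sketch shown earlier), the mapper emits a word only when its first letter is in the chosen set, and the standard summing reducer then counts the surviving words:

// Illustrative mapper sketch only; the authoritative code is Ex2.java on Canvas.
public static class LetterFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Assumption: two vowels (a, e) and two consonants (s, r), as in the example above.
    private static final String CHOSEN_LETTERS = "aesr";
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\s+")) {
            // Emit (word, 1) only if the word starts with one of the chosen letters.
            if (!token.isEmpty() && CHOSEN_LETTERS.indexOf(token.charAt(0)) >= 0) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

For Task 1, the same idea applies with the chosen set reduced to the single letter 'a'.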
Task 1
Use code block 1 and count the frequency of words that start with the
letter 'a'.
Please perform Task 1 by using code block 1 as a reference.
Implementation for Code Block 1
Step 1: Initially, place sample.txt in the Downloads folder and
display its contents in the command prompt using the command
below.
Command: cat /home/cloudera/Downloads/sample.txt
Here, cat is used to display the file's contents in the command prompt.
Step 2: Create a directory named pravallika and place
sample.txt in that directory using the commands below.
Commands: hadoop fs -mkdir pravallika
hadoop fs -put /home/cloudera/Downloads/sample.txt pravallika/
Here, the mkdir command is used to create a directory in HDFS, and the
put command is used to copy the file from the local file
system (/home/cloudera/Downloads/sample.txt) to HDFS (pravallika/).
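To verify that the file actually landed in HDFS, the directory can be listed before running the job:
Command: hadoop fs -ls pravallika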
Visualize sample.txt in Hue
Create a new class named Ex2 under the WordCount project.
Once the class is created, copy the code from the file Ex2.java, which is
available on Canvas under Activity 2.
Once the code is saved, right-click on the project -> Export -> select JAR file
under Java -> rename the JAR file -> Next -> Finish.
Using the command below, we can run the jar file from the command prompt and
then visualize the output in Hue.
Command: hadoop jar /home/cloudera/Ex2.jar Ex2
pravallika/sample.txt Ex2_output
Command Explanation
/home/cloudera/Ex2.jar - the jar file is located at this path
pravallika/sample.txt - the input file is present in the pravallika directory
Ex2_output - this is the name of the output directory
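For reference, the driver (main) method of such a job reads the input path and the output directory from these command-line arguments and wires the mapper and reducer together, roughly like this (a sketch only, not the exact Ex2.java; TokenMapper and SumReducer refer to the word-count sketch shown earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Ex2Driver {
    public static void main(String[] args) throws Exception {
        // args[0] is the HDFS input path (pravallika/sample.txt),
        // args[1] is the output directory (Ex2_output), which must not already exist.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(Ex2Driver.class);
        job.setMapperClass(WordCountSketch.TokenMapper.class);
        job.setReducerClass(WordCountSketch.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}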
Visualize the Output in Hue
To display the output in the command prompt, we use the command below.
Command: hadoop fs -cat
Ex2_output/part-r-00000
cat is used to display the content in
the command prompt.
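Each line of part-r-00000 produced by the default output format is a word and its count separated by a tab, for example (illustrative values only; the real words and counts depend on sample.txt):
apple	3
area	1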
CODE BLOCK 2: (EVEN AND ODD NUMBERS)
Code block 2 counts the frequency of words that have an even count.
The code is available in the even.java file, which is on Canvas under
Activity 2.
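Because a word's total count is only known after all of its intermediate values are summed, the even/odd filtering naturally happens in the reducer. A minimal sketch, assuming the standard word-count mapper and the same imports as the earlier sketch (this is not the actual even.java):

// Illustrative reducer sketch only; the authoritative code is even.java on Canvas.
public static class EvenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Emit the word only when its total count is even.
        // For Task 2 (odd counts), the condition would become sum % 2 != 0.
        if (sum % 2 == 0) {
            context.write(key, new IntWritable(sum));
        }
    }
}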
Task 2
Use code block 2 and count the frequency of words that have an odd count.
Please perform Task 2 by using code block 2 as a reference.
Implementation for Code Block 2
Create a new class named even under the WordCount project.
Once the class is created, copy the code from the file even.java, which is
available on Canvas under Activity 2.
Once the code is saved, right-click on the project -> Export -> select JAR file
under Java -> rename the JAR file -> Next -> Finish.
Using the command below, we can run the jar file from the command prompt and
then visualize the output in Hue.
Command: hadoop jar /home/cloudera/even.jar even
pravallika/sample.txt even_output
Command Explanation
/home/cloudera/even.jar - the jar file is located at this path
pravallika/sample.txt - the input file is present in the pravallika directory
even_output - this is the name of the output directory
Visualize the Output in Hue
To display the output in the command prompt, we use the command below.
Command: hadoop fs -cat
even_output/part-r-00000
cat is used to display the content in the command prompt.
List of Commands using hadoop fs and hdfs dfs
List Files and Directories:
Using hadoop fs:
hadoop fs -ls /user/myuser
Using hdfs dfs:
hdfs dfs -ls /user/myuser
Both commands will list the contents of the /user/myuser directory in
HDFS.
Create Directory:
Using hadoop fs:
hadoop fs -mkdir /user/myuser/data
Using hdfs dfs:
hdfs dfs -mkdir /user/myuser/data
Both commands will create a new directory named data within the
/user/myuser directory in HDFS.
Copy File from Local to HDFS:
Using hadoop fs:
hadoop fs -copyFromLocal localfile.txt hdfs://name-
node:8020/user/myuser/data/
Using hdfs dfs:
hdfs dfs -copyFromLocal localfile.txt
hdfs://namenode:8020/user/myuser/data/
Both commands will copy the local file localfile.txt to the data directory
in HDFS.
Move File:
Using hadoop fs:
hadoop fs -mv /user/myuser/data/file.txt /user/myuser/archive/
Using hdfs dfs:
hdfs dfs -mv /user/myuser/data/file.txt /user/myuser/archive/
Both commands will move the file file.txt from the data directory to the
archive directory in HDFS.
Delete File:
Using hadoop fs:
hadoop fs -rm /user/myuser/archive/file.txt
Using hdfs dfs:
hdfs dfs -rm /user/myuser/archive/file.txt
Both commands will delete the file file.txt from the archive directory in HDFS.
Difference between Hadoop fs and hdfs dfs
hadoop fs is a more generic command-line utility that can interact
with various file systems, while hdfs dfs is specifically designed for
HDFS operations.
The syntax of the commands is almost identical between the two
utilities for HDFS operations.
The main difference lies in their scope: hadoop fs can interact with
other file systems (such as the local file system, S3, and other Hadoop-compatible
file systems), while hdfs dfs is limited to HDFS operations.
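For example, hadoop fs can address a non-HDFS file system explicitly through a URI scheme, such as listing the local Downloads folder used earlier in this activity:
Command: hadoop fs -ls file:///home/cloudera/Downloads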
Using AWS
Step-1: Log in to the AWS account and create an S3 bucket.
Step-2: Choose a unique bucket name and finally click on the Create bucket option.
Step-3: After creating the bucket successfully, click on the bucket name (to enter the bucket).
Step 4: After entering the bucket, we have to do the following steps:
a. Upload the jar file that was created in Cloudera to our bucket.
b. Create an input folder in the AWS bucket and, inside the input folder, upload our
sample.txt, which contains our input data.
c. Note down the jar file's S3 URI, which will be useful when executing steps in the cluster.
d. Note down the sample file's S3 URI, which will be useful when executing steps in the cluster.
Step 4-a: Click on the Upload button, choose Add files, upload our jar file, and click Upload at the
bottom of the page.
After uploading the file, the dashboard of the bucket looks like below:
Step 4-b: Click on the Create folder icon and name it inputFile.
Then go inside the inputFile folder and upload your input file, named sample.txt, which is provided on Canvas.
After uploading the file, the dashboard looks like below:
Step 4-c: Navigating to the jar file's S3 URI:
Click on the jar file.
Note down its S3 URI.
Step 4-d: Navigating to the input file's S3 URI:
Click on the inputFile folder and select the input file.
Note down its S3 URI.
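Optionally, if the AWS CLI is installed and configured, the same uploads can be done from the command line (the bucket name below is the one used in this example; replace it with your own bucket name):
Command: aws s3 cp /home/cloudera/Ex2.jar s3://bigdata5300activity2/
Command: aws s3 cp /home/cloudera/Downloads/sample.txt s3://bigdata5300activity2/inputFile/
Command: aws s3 ls s3://bigdata5300activity2/ --recursive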
Now search for EMR and navigate to the EMR page.
Click on "Switch to the old console", which is more user-friendly for creating clusters.
After switching to the old console, click on Create cluster.
After clicking Create cluster, enter a unique cluster name and then choose "Go to advanced options".
On the next page, scroll down, choose the step type as Custom JAR, and then click Add step.
The Add step screen looks as follows:
In the Name column above, we have to give the name of the Java class where we wrote our code. In my case
the Name is WordCount (make sure you give your class name, otherwise the cluster will not run successfully).
For the JAR location, we take the jar's S3 URI from the S3 bucket; we already noted it down in Step 4-c.
For Arguments, we have to give both the input and output file paths, separated by a space.
Giving the input file: the input file location was already noted down in Step 4-d. In my case, my input file's S3
URI is: s3://bigdata5300activity2/inputFile/sample.txt
Giving the output file: for the output file there is no need to create anything manually; whenever we run our
cluster, the output files are created in the path specified in our arguments. So I am just giving the name of the
output folder:
s3://bigdata5300activity2/inputFile/output118
After adding all the arguments, it should look like below:
Then click Add, then Next, and Next again. In step 3 (General Cluster Settings), make sure our
cluster name is reflected; if it is not, rename it.
After confirming the name, click Next and select Create cluster.
The cluster runs internally; it takes some time to execute the steps and produce our output.
Finally, our cluster will execute all steps successfully.
For the output files, we have to go back to the S3 bucket.
If we go inside the inputFile folder, we can see the output118 folder, which contains the output files
generated by our cluster.
NOTE: For both Task 1 and Task 2 the steps are the same (1. creating the jar and adding that jar to our cluster,
2. passing the input file to the cluster), so we are not explaining Task 2 separately.