How to Execute WordCount Program in
MapReduce using Cloudera
Distribution Hadoop(CDH)
For Lab on Wednesday (27/9/23)
The steps which show how to write a MapReduce code for Word
Count.
Input:
Hello I am Geeks for Geeks
Hello I am an Intern
Output:
GeeksforGeeks 1
Hello 2
I 2
Intern 1
am 2
an 1
Steps:
First Open Eclipse -> then select File -> New -> Java
Project ->Name it WordCount -> then Finish.
Create Three Java Classes into the project. Name
them WCDriver(having the main
function), WCMapper, WCReducer.
You have to include two Reference Libraries for that:
Right Click on Project -> then select Build Path-> Click
on Configure Build Path
In the above figure, you can see the Add External JARs
option on the Right Hand Side. Click on it and add the below
mention files. You can find these files in /usr/lib/
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-
cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar
Mapper Code: You have to copy paste this program into the
WCMapper Java Class file.
Java
// Importing libraries
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WCMapper extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {
// Map function
public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter rep) throws IOException
{
String line = value.toString();
// Splitting the line on spaces
for (String word : line.split(" "))
{
if (word.length() > 0)
{
output.collect(new Text(word), new IntWritable(1));
}
}
}
}
Reducer Code: You have to copy paste this program into the
WCReducer Java Class file.
Java
// Importing libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WCReducer extends MapReduceBase implements Reducer<Text,
IntWritable, Text, IntWritable> {
// Reduce function
public void reduce(Text key, Iterator<IntWritable> value,
OutputCollector<Text, IntWritable> output,
Reporter rep) throws IOException
{
int count = 0;
// Counting the frequency of each words
while (value.hasNext())
{
IntWritable i = value.next();
count += i.get();
}
output.collect(key, new IntWritable(count));
}
}
Driver Code: You have to copy paste this program into the
WCDriver Java Class file.
Java
// Importing libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WCDriver extends Configured implements Tool {
public int run(String args[]) throws IOException
{
if (args.length < 2)
{
System.out.println("Please give valid inputs");
return -1;
}
JobConf conf = new JobConf(WCDriver.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WCMapper.class);
conf.setReducerClass(WCReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
// Main Method
public static void main(String args[]) throws Exception
{
int exitCode = ToolRunner.run(new WCDriver(), args);
System.out.println(exitCode);
}
}
Now you have to make a jar file. Right Click on Project-
> Click on Export-> Select export destination as Jar
File-> Name the jar File(WordCount.jar) -> Click on
next -> at last Click on Finish. Now copy this file into the
Workspace directory of Cloudera
Open the terminal on CDH and change the directory to the
workspace. You can do this by using “cd workspace/”
command. Now, Create a text file(WCFile.txt) and move it
to HDFS. For that open terminal and write this
code(remember you should be in the same directory as jar
file you have created just now).
cat >> WordCCountFinal.txt
Enter your own text here
After finishing the text
Click Ctrl+z
Then Create a directory using below command:
sudo -u hdfs hadoop dfs -mkdir /WordCCount
TO create the text file . Type the below command:
Add the WordCCount.txt in hadoop by using below
command :
sudo -u hdfs hadoop dfs -put
/WordCCount/WordCCountFinal.txt
To view the contents
sudo -u hdfs hadoop dfs -cat
/WordCCount/WordCCountFinal.txt
Run jar file now using below command :
Sudo -u hdfs hadoop jar /home/cloudera/WordCCountFinal.jar
WCDriver /WordCCount/WordCCountFinal.txt OutputWC
After executing the jar file , Run below command to
see the output
hadoop fs -ls /WordCCount
You can view output file that is OutputWC
Type
sudo -u hdfs hadoop dfs -cat
/WordCCount/OutputWC
Thanks and Regards
Kimmi Kumari