
BIG DATA ANALYTICS
(Question Bank - 1st Midterm)

Prepared by
Dr. Vibhakar Pathak
Information Technology
Arya College of Engg & IT

Q1 Write the various methods in Map interface. List the characteristics of Big Data.

Answer:
Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-value
pairs. Maps are the individual tasks that transform the input records into intermediate records. The
transformed intermediate records need not be of the same type as the input records. A given input pair
may map to zero or many output pairs.
Method: map is the most prominent method of the Mapper class. The syntax is defined below:
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
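For illustration, a minimal Mapper sketch using the new org.apache.hadoop.mapreduce API is given below; the class name TokenCountMapper and the token-count logic are illustrative only, not part of the question bank.

// Minimal sketch, assuming the new mapreduce API; names are illustrative only.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into tokens and emit one (token, 1) pair per token
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}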

Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share
a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the
JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output
the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched,
they are merged.
Reduce − In this phase the reduce (Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
Method: reduce is the most prominent method of the Reducer class.
Syntax:
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
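A matching Reducer sketch, again using the new API, is given below; the class name TokenCountReducer is illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts received for this key and emit the total
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}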

Big Data Characteristics

Big Data is characterized by the 3Vs: Volume (the scale of data generated and stored), Velocity (the speed at which data arrives and must be processed), and Variety (the different forms the data takes, such as structured, semi-structured, and unstructured).

Q2. What are the functions of a combiner?

Answer:
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from
the Map class and thereafter passing the output key-value pairs to the Reducer class.
The main function of a Combiner is to summarize the map output records with the same key. The output
(key-value collection) of the combiner will be sent over the network to the actual Reducer task as input.
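For illustration, a minimal sketch of how a combiner is wired into a job with the new mapreduce API is given below. It reuses the illustrative TokenCountMapper and TokenCountReducer classes sketched under Q1, and the class name TokenCountDriver is likewise only an example; the reducer can double as the combiner here because summing counts is associative and commutative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TokenCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "token count");
        job.setJarByClass(TokenCountDriver.class);
        job.setMapperClass(TokenCountMapper.class);
        // The combiner runs on each mapper's output before the shuffle,
        // summarizing records with the same key to cut network traffic.
        job.setCombinerClass(TokenCountReducer.class);
        job.setReducerClass(TokenCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}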

Q3. List the writable wrapper classes for Java primitives. Differentiate Apache Pig with MapReduce.

Answer:
The Writable interface defines two methods:
one for writing its state to a DataOutput binary stream, and
one for reading its state from a DataInput binary stream.
Hadoop comes with a large selection of Writable classes, which are available in the org.apache.hadoop.io package, such as BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, and DoubleWritable. When it comes to encoding integers, there is a choice between the fixed-length formats (IntWritable and LongWritable) and the variable-length formats (VIntWritable and VLongWritable). The variable-length formats use only a single byte to encode the value if it is small enough (between -112 and 127, inclusive); otherwise, they use the first byte to indicate whether the value is positive or negative, and how many bytes follow. Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum value is 2 GB.
Syntax:
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
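As an illustration, a hedged sketch of a custom Writable follows; the class name IntPairWritable is hypothetical and simply shows how the two methods are implemented.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative only: a pair of ints serialized with Hadoop's Writable mechanism.
public class IntPairWritable implements Writable {
    private int first;
    private int second;

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);   // write state to the binary output stream
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();  // read state back in the same order it was written
        second = in.readInt();
    }
}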
Q5. Define Serialization Reference.
Answer:
Serialization in Java is a mechanism of writing the state of an object into a byte stream. It is mainly used in Hibernate, RMI, JPA, EJB, JMS, and Hadoop technologies.
The reverse operation of serialization is called deserialization, where the byte stream is converted back into an object. The serialization and deserialization process is platform-independent, which means you can serialize an object on one platform and deserialize it on a different platform.
For serializing an object, we call the writeObject() method of the ObjectOutputStream class, and for deserialization we call the readObject() method of the ObjectInputStream class.
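A small, self-contained sketch of this round trip is shown below; the Employee class, its fields, and the file name employee.ser are purely illustrative.

import java.io.*;

// Illustrative class; it implements Serializable so its state can be written to a byte stream.
class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int id;
    Employee(String name, int id) { this.name = name; this.id = id; }
}

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        // Serialize: write the object's state to a file
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.ser"))) {
            out.writeObject(new Employee("Asha", 101));
        }
        // Deserialize: rebuild an equivalent object from the byte stream
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.ser"))) {
            Employee e = (Employee) in.readObject();
            System.out.println(e.name + " " + e.id);
        }
    }
}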


Q6. a) What is the function of JobTracker and TaskTracker in HDFS?
b) How can you perform combiner and partition in a MapReduce?
c) Enumerate the writable wrapper classes for Java primitives?
d) Write about the key design principles of Pig Latin.

Answer. a)
Job Tracker –
JobTracker process runs on a separate node and not usually on a DataNode.
JobTracker is an essential Daemon for MapReduce execution in MRv1.
It is replaced by ResourceManager/ApplicationMaster in MRv2.
JobTracker receives the requests for MapReduce execution from the client.
JobTracker talks to the NameNode to determine the location of the data.
JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality (proximity of the
data) and the available slots to execute a task on a given node.
JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot be started and the existing MapReduce jobs will be halted.

TaskTracker –
TaskTracker runs on DataNodes, usually on every DataNode in the cluster.
TaskTracker is replaced by Node Manager in MRv2.
Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
TaskTracker will be in constant communication with the JobTracker signalling the progress of the task in
execution.
TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will
assign the task executed by the TaskTracker to another node.

b) Given in Q2
c) Given in Q3


Q8. Explain the Linked List data structure with sample example program

Answer :

Refer to the DSA 3rd semester book.

Q9. What is data serialization? With proper examples discuss and differentiate
Answer: Refer Q No.5.
Q10. Structured, unstructured and semi-structured data. Make a note on how type of data affects data
serialization.
Answer:
Data can be classified as structured data, semi-structured data, or unstructured data. Structured data resides in predefined formats and models, unstructured data is stored in its natural format until it is extracted for analysis, and semi-structured data is basically a mix of both structured and unstructured data. Because unstructured data has no predefined relationships between its elements, it is difficult to serialize it as cleanly as structured data.
Q11. Explain in detail building blocks of Hadoop with neat sketch.
Answer:
Architecture
External Interfaces - CLI, Web UI, JDBC, ODBC programming interfaces
Thrift Server - cross-language service framework
Metastore - metadata about the Hive tables and partitions
Driver - the brain of Hive: compiler, optimizer, and execution engine


Q12 Write Map Reduce steps for counting occurrences of specific numbers in the input text file(s). Also
write the commands to compile and run the code.

Answer:
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();

while(s.hasMoreTokens()){
lasttoken=s.nextToken();
}

int avgprice = Integer.parseInt(lasttoken);


output.collect(new Text(year), new IntWritable(avgprice));
}
}

//Reducer class

public static class E_EReduce extends MapReduceBase implements


Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}

//Main function

public static void main(String args[])throws Exception


{
JobConf conf = new JobConf(ProcessUnits.class);

conf.setJobName("max_eletricityunits");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));


FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
}
}
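The commands to compile and run the code are given below. This is a typical sequence; it assumes the code above is saved as ProcessUnits.java, that the hadoop command is on the PATH, and that sample.txt is a placeholder name for the input file to be copied into HDFS.

$ mkdir units
$ javac -classpath $(hadoop classpath) -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
$ hdfs dfs -mkdir /input_dir
$ hdfs dfs -put sample.txt /input_dir
$ hadoop jar units.jar hadoop.ProcessUnits /input_dir /output_dir
$ hdfs dfs -cat /output_dir/part-00000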

Q13. Discuss on the different types and formats of Map-reduce with an example each one.

Answer:
MapReduce Types
Mapping is the core technique of processing a list of data elements that come in pairs of keys and values.
The map function applies to individual elements defined as key-value pairs of a list and produces a new
list.

Input Formats
Hadoop has to accept and process a variety of formats, from text files to databases. A chunk of input,
called input split, is processed by a single map. Each split is further divided into logical records given to
the map to process in key-value pair. In the context of database, the split means reading a range of tuples
from an SQL table, as done by the DBInputFormat and producing LongWritables containing record
numbers as keys and DBWritables as values. The InputSplit represents the data to be processed by
a Mapper. It returns the length in bytes and has a reference to the input data. It presents a byte-oriented
view on the input; it is the responsibility of the job's RecordReader to process this and present a record-oriented view.
The FileInputFormat is the base class for the file data source. It has the responsibility to identify
the files that are to be included as the job input and the definition for generating the split. Hadoop
also includes processing of unstructured data that often comes in textual format. The TextInputFormat is
the default InputFormat for such data. The SequenceFileInputFormat handles binary input stored as sequences of binary key-value pairs.
Similarly, DBInputFormat provides the capability to read data from relational database using JDBC.
Output Formats

The TextOutputFormat is the default output format. It writes records as plain text files; the keys and values may be of any type, since it transforms them into strings by invoking the toString() method. Each key-value pair is separated by a tab character, although this can be customized by manipulating the separator property of the text output format.

For binary output, there is SequenceFileOutputFormat to write a sequence of binary output to a file. Binary outputs are particularly useful if the output becomes the input to a further MapReduce job.

The output format for relational databases is DBOutputFormat, which sends the reduced output to a SQL table. Similarly, HBase's TableOutputFormat enables a MapReduce program that works on data stored in an HBase table to write its outputs back to an HBase table.
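For illustration, a hedged sketch of wiring input and output formats into a driver with the new mapreduce API is given below; the class name FormatDemo and the map-only identity job are assumptions made purely for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Illustrative driver: reads plain text and writes a binary sequence file,
// so the output can feed a further MapReduce job efficiently.
public class FormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDemo.class);
        job.setMapperClass(Mapper.class);                          // identity mapper
        job.setNumReduceTasks(0);                                  // map-only job
        job.setInputFormatClass(TextInputFormat.class);            // default textual input
        job.setOutputFormatClass(SequenceFileOutputFormat.class);  // binary key-value output
        job.setOutputKeyClass(LongWritable.class);                 // byte offset from TextInputFormat
        job.setOutputValueClass(Text.class);                       // the line itself
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}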

Q20. Write the reasons why Hadoop does not use Java serialization. Define structured, semi-structured and unstructured data with examples. Define the ByteWritable and ObjectWritable writable wrappers.
Answer: See Q No. 3, Q No. 10, and Q No. 14.

Q21.Write a PIG script for Word Count. Explain Metastore in Hive.


Answer:
Pig Script for word count
lines = LOAD '/user/hadoop/yourfilename.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
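Assuming the script above is saved as wordcount.pig, it can be run in MapReduce mode with:
$ pig -x mapreduce wordcount.pig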

The Hive Metastore is the component in which Hive stores, in a chosen database server, the schema or metadata of tables, databases, columns in a table, their data types, and their HDFS mapping.

Q22. Discuss about serialization concept in java.

Answer
See Q No. 9

Q23 Explain in brief about various map implementations in Java with suitable examples

Answer:
A Map is an association stored as key-value pairs. The Map implementations supported in Java are listed below.
A Map does not allow duplicate keys, but you can have duplicate values. HashMap and LinkedHashMap allow null keys and values, but TreeMap does not allow any null key or value.
A Map cannot be traversed directly, so you need to convert it into a Set using the keySet() or entrySet() method.
Class - Description
HashMap - The basic implementation of Map; it does not maintain any order.
LinkedHashMap - An implementation of Map that inherits HashMap and maintains insertion order.
TreeMap - An implementation of Map and SortedMap that maintains ascending key order.
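A short sketch contrasting the three implementations is given below; the class name MapDemo and the keys used are arbitrary examples.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapDemo {
    public static void main(String[] args) {
        String[] keys = {"banana", "apple", "cherry"};

        Map<String, Integer> hashMap = new HashMap<>();          // no guaranteed order
        Map<String, Integer> linkedMap = new LinkedHashMap<>();  // keeps insertion order
        Map<String, Integer> treeMap = new TreeMap<>();          // keeps ascending key order

        for (int i = 0; i < keys.length; i++) {
            hashMap.put(keys[i], i);
            linkedMap.put(keys[i], i);
            treeMap.put(keys[i], i);
        }

        System.out.println("HashMap:       " + hashMap);
        System.out.println("LinkedHashMap: " + linkedMap);  // {banana=0, apple=1, cherry=2}
        System.out.println("TreeMap:       " + treeMap);    // {apple=1, banana=0, cherry=2}
    }
}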

Q24 Explain the procedure for Installing Hadoop in Pseudo Distributed Mode.

Answer:
Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.
Step 1 − Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step 2 − Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is
required to make changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance, memory
allocated for the file system, memory limit for storing the data, and size of Read/Write buffers.
Open the core-site.xml and add the following properties in between <configuration>, </configuration>
tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, namenode path, and
datanode paths of your local file systems. It means the place where you want to store the Hadoop
infrastructure.
Let us assume the following data. (In the paths given below, hadoop is the user name, and hadoopinfra/hdfs/namenode and hadoopinfra/hdfs/datanode are directories created for the HDFS file system.)
dfs.replication (data replication value) = 1
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the <configuration> </configuration> tags in
this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value>
</property>
</configuration>
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following
properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template named mapred-site.xml.template. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.


$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration>tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
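With the configuration files in place, the NameNode is then typically formatted and the HDFS and YARN daemons started; the standard Hadoop 2.x commands are:
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
$ jps
The jps listing should normally show the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes, and in Hadoop 2.x the NameNode web UI is typically reachable at http://localhost:50070.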

Q25. Write Map Reduce steps for counting sum of numbers in the input text file(s). Also write the
commands to compile and run the code.

Answer:
See Q No. 12; only the reduce function changes, accumulating a running sum of the values for each key instead of emitting only the values greater than a threshold.

Q26. What are core methods of a reducer? What happens if you try to run a Hadoop job with an output
directory that is already present?

Answer :
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share
a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the
JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output
the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they
are merged.
Reduce − In this phase the reduce (Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
Method: reduce is the most prominent method of the Reducer class; the class also provides setup(Context) and cleanup(Context), which run once per task before and after the calls to reduce(). The syntax of reduce is defined below:
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
If you try to run a Hadoop job with an output directory that is already present, the job is not started: the framework throws a FileAlreadyExistsException to protect the existing output, so the directory must be deleted first or a different output path supplied.


Q31. What are the various functions of name node?


Answer :
The HDFS namespace is stored in the NameNode.
The NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
– For example, creating a new file.
– Changing the replication factor of a file.
– The EditLog is stored in the NameNode's local filesystem.
The entire filesystem namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, which is also stored in the NameNode's local filesystem.
The NameNode keeps an image of the entire file system namespace and the file Blockmap in memory.
A minimum of 4 GB of local RAM is needed to support the data structures that represent the huge number of files and directories.
When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog information to the FsImage, and then stores a copy of the FsImage on the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash.

Q36 Explain Map-reduce framework in detail.

Answer:

MapReduce is a processing technique and a program model for distributed computing based on java. The
MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and
converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed
after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once
we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or
even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.


The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of
file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper
function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job
is to process the data that comes from the mapper. After processing, it produces a new set of output, which
will be stored in the HDFS.
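As a small worked illustration (a hypothetical word-count job over two input lines), the key-value flow through the stages looks like this:

Input lines:             deer bear river        river deer deer
Map output:              (deer,1) (bear,1) (river,1)    (river,1) (deer,1) (deer,1)
After shuffle and sort:  (bear,[1])  (deer,[1,1,1])  (river,[1,1])
Reduce output:           (bear,1)  (deer,3)  (river,2)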

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.

Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.

After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result,
and sends it back to the Hadoop server.
