
BIG DATA ANALYTICS
(Question Bank - 1st Midterm)

Prepared by
Dr. Vibhakar Pathak
Information Technology
Arya College of Engg & IT

Q1 Write the various methods in Map interface. List the characteristics of Big Data.

Answer:
Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-value
pairs. Maps are the individual tasks that transform the input records into intermediate records. The
transformed intermediate records need not be of the same type as the input records. A given input pair
may map to zero or many output pairs.
Method: map is the most prominent method of the Mapper class. The syntax is defined below:
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
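For illustration, a minimal Mapper sketch using the new org.apache.hadoop.mapreduce API is given below; the class name TokenCountMapper and the token-count logic are illustrative only, not part of the question bank.

// Minimal sketch, assuming the new mapreduce API; names are illustrative only.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into tokens and emit one (token, 1) pair per token
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}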

Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share
a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the
JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output
the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched,
they are merged.
Reduce − In this phase the reduce (Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
Method: reduce is the most prominent method of the Reducer class.
Syntax:
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
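A matching Reducer sketch, again using the new API, is given below; the class name TokenCountReducer is illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts received for this key and emit the total
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}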

Big Data Characteristics

Big Data is characterized by the 3Vs: Volume (the scale of data generated and stored), Velocity (the speed at which data arrives and must be processed), and Variety (the different forms the data takes, such as structured, semi-structured, and unstructured).

Q2. What are the functions of a combiner?

Answer:
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from
the Map class and thereafter passing the output key-value pairs to the Reducer class.
The main function of a Combiner is to summarize the map output records with the same key. The output
(key-value collection) of the combiner will be sent over the network to the actual Reducer task as input.
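For illustration, a minimal sketch of how a combiner is wired into a job with the new mapreduce API is given below. It reuses the illustrative TokenCountMapper and TokenCountReducer classes sketched under Q1, and the class name TokenCountDriver is likewise only an example; the reducer can double as the combiner here because summing counts is associative and commutative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TokenCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "token count");
        job.setJarByClass(TokenCountDriver.class);
        job.setMapperClass(TokenCountMapper.class);
        // The combiner runs on each mapper's output before the shuffle,
        // summarizing records with the same key to cut network traffic.
        job.setCombinerClass(TokenCountReducer.class);
        job.setReducerClass(TokenCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}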

Q3. List the writable wrapper classes for Java primitives. Differentiate Apache Pig with MapReduce.

Answer:
The Writable interface defines two methods:
one for writing its state to a DataOutput binary stream, and
one for reading its state from a DataInput binary stream.
Hadoop comes with a large selection of Writable classes, which are available in the org.apache.hadoop.io package, such as BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, and DoubleWritable. When it comes to encoding integers, there is a choice between the fixed-length formats (IntWritable and LongWritable) and the variable-length formats (VIntWritable and VLongWritable). The variable-length formats use only a single byte to encode the value if it is small enough (between -112 and 127, inclusive); otherwise, they use the first byte to indicate whether the value is positive or negative, and how many bytes follow. Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum value is 2 GB.
Syntax:
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
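As an illustration, a hedged sketch of a custom Writable follows; the class name IntPairWritable is hypothetical and simply shows how the two methods are implemented.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative only: a pair of ints serialized with Hadoop's Writable mechanism.
public class IntPairWritable implements Writable {
    private int first;
    private int second;

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);   // write state to the binary output stream
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();  // read state back in the same order it was written
        second = in.readInt();
    }
}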
Q5. Define Serialization Reference.
Answer:
Serialization in Java is a mechanism of writing the state of an object into a byte stream. It is mainly used in Hibernate, RMI, JPA, EJB, JMS, and Hadoop technologies.
The reverse operation of serialization is called deserialization, where the byte stream is converted back into an object. The serialization and deserialization process is platform-independent, which means you can serialize an object on one platform and deserialize it on a different platform.
For serializing an object, we call the writeObject() method of the ObjectOutputStream class, and for deserialization we call the readObject() method of the ObjectInputStream class.
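A small, self-contained sketch of this round trip is shown below; the Employee class, its fields, and the file name employee.ser are purely illustrative.

import java.io.*;

// Illustrative class; it implements Serializable so its state can be written to a byte stream.
class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int id;
    Employee(String name, int id) { this.name = name; this.id = id; }
}

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        // Serialize: write the object's state to a file
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.ser"))) {
            out.writeObject(new Employee("Asha", 101));
        }
        // Deserialize: rebuild an equivalent object from the byte stream
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.ser"))) {
            Employee e = (Employee) in.readObject();
            System.out.println(e.name + " " + e.id);
        }
    }
}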


Q6. a) What is the function of JobTracker and TaskTracker in HDFS?
b) How can you perform combiner and partition in a MapReduce?
c) Enumerate the writable wrapper classes for Java primitives?
d) Write about the key design principles of Pig Latin.

Answer. a)
Job Tracker –
JobTracker process runs on a separate node and not usually on a DataNode.
JobTracker is an essential Daemon for MapReduce execution in MRv1.
It is replaced by ResourceManager/ApplicationMaster in MRv2.
JobTracker receives the requests for MapReduce execution from the client.
JobTracker talks to the NameNode to determine the location of the data.
JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality (proximity of the
data) and the available slots to execute a task on a given node.
JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot be started and the existing MapReduce jobs will be halted.

TaskTracker –
TaskTracker runs on DataNodes, usually on every DataNode in the cluster.
TaskTracker is replaced by Node Manager in MRv2.
Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
TaskTracker will be in constant communication with the JobTracker signalling the progress of the task in
execution.
TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will
assign the task executed by the TaskTracker to another node.

b) Given in Q2
c) Given in Q3


Q8. Explain the Linked List data structure with sample example program

Answer :

Refer to the DSA 3rd semester book.

Q9. What is data serialization? With proper examples discuss and differentiate
Answer: Refer Q No.5.
Q10. Structured, unstructured and semi-structured data. Make a note on how type of data affects data
serialization.
Answer:
Data can be classified as structured data, semi-structured data, or unstructured data. Structured data resides in predefined formats and models, unstructured data is stored in its natural format until it is extracted for analysis, and semi-structured data is basically a mix of both structured and unstructured data. Because unstructured data has no predefined relationships between its elements, it is difficult to serialize it as cleanly as structured data.
Q11. Explain in detail building blocks of Hadoop with neat sketch.
Answer:
Architecture
External Interfaces - CLI, Web UI, JDBC, ODBC programming interfaces
Thrift Server - cross-language service framework
Metastore - metadata about the Hive tables and partitions
Driver - the brain of Hive: compiler, optimizer, and execution engine


Q12 Write Map Reduce steps for counting occurrences of specific numbers in the input text file(s). Also
write the commands to compile and run the code.

Answer:
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();

while(s.hasMoreTokens()){
lasttoken=s.nextToken();
}

int avgprice = Integer.parseInt(lasttoken);


output.collect(new Text(year), new IntWritable(avgprice));
}
}

//Reducer class

public static class E_EReduce extends MapReduceBase implements


Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}

//Main function

public static void main(String args[])throws Exception


{
JobConf conf = new JobConf(ProcessUnits.class);

conf.setJobName("max_eletricityunits");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));


FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
}
}
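The commands to compile and run the code are given below. This is a typical sequence; it assumes the code above is saved as ProcessUnits.java, that the hadoop command is on the PATH, and that sample.txt is a placeholder name for the input file to be copied into HDFS.

$ mkdir units
$ javac -classpath $(hadoop classpath) -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
$ hdfs dfs -mkdir /input_dir
$ hdfs dfs -put sample.txt /input_dir
$ hadoop jar units.jar hadoop.ProcessUnits /input_dir /output_dir
$ hdfs dfs -cat /output_dir/part-00000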

Q13. Discuss on the different types and formats of Map-reduce with an example each one.

Answer:
MapReduce Types
Mapping is the core technique of processing a list of data elements that come in pairs of keys and values.
The map function applies to individual elements defined as key-value pairs of a list and produces a new
list.

Input Formats
Hadoop has to accept and process a variety of formats, from text files to databases. A chunk of input,
called input split, is processed by a single map. Each split is further divided into logical records given to
the map to process in key-value pair. In the context of database, the split means reading a range of tuples
from an SQL table, as done by the DBInputFormat and producing LongWritables containing record
numbers as keys and DBWritables as values. The InputSplit represents the data to be processed by
a Mapper. It returns the length in bytes and has a reference to the input data. It presents a byte-oriented
view on the input; it is the responsibility of the job's RecordReader to process this and present a record-oriented view.
The FileInputFormat is the base class for the file data source. It has the responsibility to identify
the files that are to be included as the job input and the definition for generating the split. Hadoop
also includes processing of unstructured data that often comes in textual format. The TextInputFormat is
the default InputFormat for such data. The SequenceFileInputFormat handles binary input stored as sequences of binary key-value pairs.
Similarly, DBInputFormat provides the capability to read data from relational database using JDBC.
Output Formats

The TextOutputFormat is the default output format. It writes records as plain text files; the keys and values may be of any type, since it transforms them into strings by invoking the toString() method. Each key-value pair is separated by a tab character, although this can be customized by manipulating the separator property of the text output format.

For binary output, there is SequenceFileOutputFormat to write a sequence of binary output to a file. Binary outputs are particularly useful if the output becomes the input to a further MapReduce job.

The output format for relational databases is DBOutputFormat, which sends the reduced output to a SQL table. Similarly, HBase's TableOutputFormat enables a MapReduce program that works on data stored in an HBase table to write its outputs back to an HBase table.
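For illustration, a hedged sketch of wiring input and output formats into a driver with the new mapreduce API is given below; the class name FormatDemo and the map-only identity job are assumptions made purely for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Illustrative driver: reads plain text and writes a binary sequence file,
// so the output can feed a further MapReduce job efficiently.
public class FormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDemo.class);
        job.setMapperClass(Mapper.class);                          // identity mapper
        job.setNumReduceTasks(0);                                  // map-only job
        job.setInputFormatClass(TextInputFormat.class);            // default textual input
        job.setOutputFormatClass(SequenceFileOutputFormat.class);  // binary key-value output
        job.setOutputKeyClass(LongWritable.class);                 // byte offset from TextInputFormat
        job.setOutputValueClass(Text.class);                       // the line itself
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}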

Q20. Write the reasons why Hadoop does not use Java serialization. Define structured, semi-structured and unstructured data with examples. Define the ByteWritable and ObjectWritable writable wrappers.
Answer: See Q No. 3, Q No. 10, and Q No. 14.

Q21.Write a PIG script for Word Count. Explain Metastore in Hive.


Answer:
Pig Script for word count
lines = LOAD '/user/hadoop/yourfilename.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
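Assuming the script above is saved as wordcount.pig, it can be run in MapReduce mode with:
$ pig -x mapreduce wordcount.pig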

The Hive Metastore is the component in which Hive stores, in a chosen database server, the schema or metadata of tables, databases, columns in a table, their data types, and their HDFS mapping.

Q22. Discuss about serialization concept in java.

Answer
See Q No. 9

Q23 Explain in brief about various map implementations in Java with suitable examples

Answer:
A Map is an association stored as key-value pairs. The Map implementations supported in Java are listed below.
A Map does not allow duplicate keys, but you can have duplicate values. HashMap and LinkedHashMap allow null keys and values, but TreeMap does not allow any null key or value.
A Map cannot be traversed directly, so you need to convert it into a Set using the keySet() or entrySet() method.
Class - Description
HashMap - The basic implementation of Map; it does not maintain any order.
LinkedHashMap - An implementation of Map that inherits HashMap and maintains insertion order.
TreeMap - An implementation of Map and SortedMap that maintains ascending key order.
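A short sketch contrasting the three implementations is given below; the class name MapDemo and the keys used are arbitrary examples.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapDemo {
    public static void main(String[] args) {
        String[] keys = {"banana", "apple", "cherry"};

        Map<String, Integer> hashMap = new HashMap<>();          // no guaranteed order
        Map<String, Integer> linkedMap = new LinkedHashMap<>();  // keeps insertion order
        Map<String, Integer> treeMap = new TreeMap<>();          // keeps ascending key order

        for (int i = 0; i < keys.length; i++) {
            hashMap.put(keys[i], i);
            linkedMap.put(keys[i], i);
            treeMap.put(keys[i], i);
        }

        System.out.println("HashMap:       " + hashMap);
        System.out.println("LinkedHashMap: " + linkedMap);  // {banana=0, apple=1, cherry=2}
        System.out.println("TreeMap:       " + treeMap);    // {apple=1, banana=0, cherry=2}
    }
}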

Q24 Explain the procedure for Installing Hadoop in Pseudo Distributed Mode.

Answer:
Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.
Step 1 − Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step 2 − Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is
required to make changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance, memory
allocated for the file system, memory limit for storing the data, and size of Read/Write buffers.
Open the core-site.xml and add the following properties in between <configuration>, </configuration>
tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, namenode path, and
datanode paths of your local file systems. It means the place where you want to store the Hadoop
infrastructure.
Let us assume the following data. (In the paths given below, hadoop is the user name, and hadoopinfra/hdfs/namenode and hadoopinfra/hdfs/datanode are directories created for the HDFS file system.)
dfs.replication (data replication value) = 1
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the <configuration> </configuration> tags in
this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value>
</property>
</configuration>
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following
properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template named mapred-site.xml.template. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.


$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration>tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
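With the configuration files in place, the NameNode is then typically formatted and the HDFS and YARN daemons started; the standard Hadoop 2.x commands are:
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
$ jps
The jps listing should normally show the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes, and in Hadoop 2.x the NameNode web UI is typically reachable at http://localhost:50070.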

Q25. Write Map Reduce steps for counting sum of numbers in the input text file(s). Also write the
commands to compile and run the code.

Answer:
See Q No. 12; only the reduce function changes, accumulating a running sum of the values for each key instead of emitting only the values greater than a threshold.

Q26. What are core methods of a reducer? What happens if you try to run a Hadoop job with an output
directory that is already present?

Answer :
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share
a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the
JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output
the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they
are merged.
Reduce − In this phase the reduce (Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
Method: reduce is the most prominent method of the Reducer class; the class also provides setup(Context) and cleanup(Context), which run once per task before and after the calls to reduce(). The syntax of reduce is defined below:
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
If you try to run a Hadoop job with an output directory that is already present, the job is not started: the framework throws a FileAlreadyExistsException to protect the existing output, so the directory must be deleted first or a different output path supplied.


Q31. What are the various functions of name node?


Answer :
The HDFS namespace is stored in the NameNode.
The NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
– For example, creating a new file.
– Changing the replication factor of a file.
– The EditLog is stored in the NameNode's local filesystem.
The entire filesystem namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, which is also stored in the NameNode's local filesystem.
The NameNode keeps an image of the entire file system namespace and the file Blockmap in memory.
A minimum of 4 GB of local RAM is needed to support the data structures that represent the huge number of files and directories.
When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog information to the FsImage, and then stores a copy of the FsImage on the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash.

Q36 Explain Map-reduce framework in detail.

Answer:

MapReduce is a processing technique and a program model for distributed computing based on java. The
MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and
converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed
after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once
we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or
even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.


The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of
file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper
function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job
is to process the data that comes from the mapper. After processing, it produces a new set of output, which
will be stored in the HDFS.
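As a small worked illustration (a hypothetical word-count job over two input lines), the key-value flow through the stages looks like this:

Input lines:             deer bear river        river deer deer
Map output:              (deer,1) (bear,1) (river,1)    (river,1) (deer,1) (deer,1)
After shuffle and sort:  (bear,[1])  (deer,[1,1,1])  (river,[1,1])
Reduce output:           (bear,1)  (deer,3)  (river,2)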

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.

Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.

After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result,
and sends it back to the Hadoop server.
