Basics of Hadoop
Unit: 3
• Course Objective
• Course Outcome
• CO and PO Mapping
• Basics of Hadoop
• Data format
• Analyzing data with Hadoop
• Scaling out
• Hadoop streaming
• Hadoop pipes
• Design of the Hadoop Distributed File System (HDFS)
• HDFS concepts
• Java interface
• Data flow
• Hadoop I/O
• Data integrity
• Compression
• Serialization
• Avro and file-based data structures
• Summary
• Internally, a file is split into one or more blocks and these blocks are
stored in a set of DataNodes.
• The NameNode executes file system namespace operations like
opening, closing, and renaming files and directories.
• The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.
• HDFS is built using the Java language; any machine that supports
Java can run the NameNode or the DataNode software.
• Each of the other machines in the cluster runs one instance of the
DataNode software.
• The NameNode is the arbitrator and repository for all HDFS
metadata.
• The system is designed in such a way that user data never flows
through the NameNode.
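• As a minimal sketch of the Java interface to HDFS (the NameNode URI and file path below are assumed examples), the following program reads a file: the client asks the NameNode for block locations, then streams the data directly from the DataNodes.

// Minimal sketch: reading an HDFS file via the Java FileSystem API.
// The URI "hdfs://namenode:9000/user/data/sample.txt" is an assumed example.
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:9000/user/data/sample.txt"; // assumed URI
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      // Metadata (block locations) comes from the NameNode; the file
      // contents are read directly from the DataNodes.
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}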
Input Format:
• How these input files are split up and read is defined by the Input
Format.
• An Input Format is a class that provides the following functionality:
– Selects the files or other objects that should be used for input
– Defines the Input Splits that break a file into tasks
Output Format:
• The Output Format governs how job results are written: the (key, value) pairs that the reduce phase provides to the Output Collector are written to output files.
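• As a sketch of where these formats are wired up, a hypothetical job driver using the newer org.apache.hadoop.mapreduce API might look as follows; the class names and paths are assumptions for illustration (the mapper and reducer are sketched later in this unit).

// Sketch of a job driver that wires up input and output formats.
// MaxTemperatureMapper/Reducer and the paths are assumed names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);

    // Input Format: selects input files and defines how they are split and read.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("input/ncdc"));   // assumed path

    // Output Format: controls how the reducer's (key, value) pairs are written.
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("output"));     // assumed path

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}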
• Hadoop can process many different types of data formats, from flat
text files to databases.
• If the input is a flat file, the data is stored using a line-oriented ASCII format, in which each line is a record.
• For example, NCDC (National Climatic Data Center) weather records use such a format, which supports a rich set of meteorological elements.
• The input to our map phase is the raw NCDC data. We choose a text
input format that gives us each line in the dataset as a text value.
• The map function is also a good place to drop bad records: here we
filter out temperatures that are missing, suspect, or erroneous.
• To visualize the way the map works, consider a few sample lines of raw NCDC input data (in the original example, some unused columns are dropped to fit the page, indicated by ellipses).
• These lines are presented to the map function as key-value pairs, where the key is the offset of the line within the file and the value is the line of text.
• The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
• (1950, 0)
• (1950, 22)
• (1950, −11)
• (1949, 111)
• (1949, 78)
• The output from the map function is processed by the Map Reduce
framework before being sent to the reduce function. So, continuing
the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
• All the reduce function has to do now is iterate through the list and
pick up the maximum reading:
(1949, 111)
(1950, 22)
• This is the final output: the maximum global temperature recorded
in each year.
• Having run through how the Map Reduce program works, the next
step is to express it in code.
• The map function is represented by an implementation of the
Mapper interface, which declares a map() method.
• The Mapper interface is a generic type, with four formal type
parameters that specify the input key, input value, output key, and
output value types of the map function.
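• Below is a sketch of the map and reduce functions, assuming the newer org.apache.hadoop.mapreduce API (where Mapper and Reducer are classes rather than interfaces); the NCDC field offsets and the quality-code check are illustrative assumptions about the record layout.

// Sketch: max-temperature map and reduce functions.
// The substring offsets and quality codes are assumed for illustration.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);           // assumed year offset
    int airTemperature;
    if (line.charAt(87) == '+') {                   // parseInt rejects a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Drop records whose temperature is missing, suspect, or erroneous.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Iterate through the list of temperatures and keep the maximum reading.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}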
• You’ve seen how Map Reduce works for small inputs; now it’s time
to take a bird’s-eye view of the system and look at the data flow for
large inputs.
• It consists of the input data, the Map Reduce program, and
configuration information.
• Hadoop runs the job by dividing it into tasks, of which there are two
types: map tasks and reduce tasks.
• There are two types of nodes that control the job execution process:
a job tracker and a number of task trackers.
• Hadoop divides the input to a Map Reduce job into fixed-size pieces
called input splits, or just splits.
• Hadoop creates one map task for each split, which runs the user
defined map function for each record in the split.
• Having many splits means the time taken to process each split is
small compared to the time to process the whole input.
• Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.
• The number of reduce tasks is not governed by the size of the input,
but is specified independently.
• When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task.
• The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well (see the sketch after this list).
• Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle, since the processing can be carried out entirely in parallel.
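• As an illustration of the default behaviour, a hash-based partitioner can be sketched as follows; this class is a hypothetical re-implementation of what Hadoop's built-in HashPartitioner already does.

// Hypothetical hash-based partitioner: each key is bucketed into one of
// the reduce partitions by its hash code.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Wiring it into a job (and, alternatively, requesting zero reducers):
//   job.setPartitionerClass(YearPartitioner.class);
//   job.setNumReduceTasks(0);   // map output is written directly, no shuffle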
We can now simulate the whole MapReduce pipeline with a Unix pipeline:
sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
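• For reference, the full local simulation and a corresponding Hadoop Streaming invocation look roughly like this; the sample input path and the streaming jar location are assumptions (they vary by installation and Hadoop version), while the script paths follow the pattern above.

cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
    sort | ch02/src/main/ruby/max_temperature_reduce.rb

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input input/ncdc/sample.txt \
    -output output \
    -mapper ch02/src/main/ruby/max_temperature_map.rb \
    -reducer ch02/src/main/ruby/max_temperature_reduce.rb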
• The map and reduce functions are defined by extending the Mapper
and Reducer classes defined in the HadoopPipes namespace and
providing implementations of the map() and reduce() methods in
each case.
• The runTask() method is passed a Factory so that it can create
instances of the Mapper or Reducer.
• Which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
• Pipes doesn’t run in standalone (local) mode, since it relies on
Hadoop’s distributed cache mechanism, which works only when
HDFS is running.
There are areas where HDFS is not a good fit:
• Lots of small files: since the NameNode holds file system metadata in memory, the limit to the number of files in a file system is governed by the amount of memory on the NameNode.
• Multiple writers, arbitrary file modifications: files in HDFS may be written to by only a single writer, and writes are always made at the end of the file.
Blocks
• A disk has a block size, which is the minimum amount of data that it
can read or write.
• HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
• Thus the time to transfer a large file made of multiple blocks
operates at the disk transfer rate.
Benefits of blocks:
• They simplify the storage subsystem: it deals with fixed-size blocks, which simplifies storage management and keeps metadata concerns separate.
• Blocks fit well with replication for providing fault tolerance and availability.
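• As a small illustration, the block size and replication factor of a stored file can be inspected through the Java FileSystem API; the file path below is an assumed example.

// Sketch: inspecting the block size and replication factor of an HDFS file.
// The path "/user/data/sample.txt" is an assumed example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/data/sample.txt"));
    System.out.println("Block size:  " + status.getBlockSize() + " bytes");
    System.out.println("Replication: " + status.getReplication());
  }
}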
Thrift:
• The Thrift API in the “thriftfs” module exposes Hadoop file systems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
C:
• Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS).
FUSE
• Filesystem in Userspace (FUSE) allows file systems that are implemented in user space to be integrated as Unix file systems.
WebDAV
• WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as file systems on most operating systems.
File patterns
• It is a common requirement to process sets of files in a single
operation.
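• Hadoop supports this through glob patterns; a minimal sketch using the FileSystem API is shown below, with an assumed example pattern.

// Sketch: expanding a file pattern (glob) to a set of matching paths.
// The glob "/logs/2021/*/part-*" is an assumed example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] matches = fs.globStatus(new Path("/logs/2021/*/part-*"));
    for (FileStatus status : matches) {
      System.out.println(status.getPath());
    }
  }
}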
• To transfer data over a network or for its persistent storage, you need
to serialize the data.
• In addition to the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.
• Avro provides libraries for various programming languages.
• HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
• Apache Avro is a language-neutral data serialization system. It was
developed by Doug Cutting, the father of Hadoop.
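• A small sketch of schema-based serialization with Avro's Java generic API follows; the schema and its field names are assumptions for illustration.

// Sketch: serializing a record with Avro's generic API.
// The schema and its fields ("year", "temperature") are assumed examples.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    String schemaJson = "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
        + "{\"name\":\"year\",\"type\":\"string\"},"
        + "{\"name\":\"temperature\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord record = new GenericData.Record(schema);
    record.put("year", "1950");
    record.put("temperature", 22);

    // Encode the record to a compact, language-neutral binary form.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(record, encoder);
    encoder.flush();
    System.out.println("Serialized " + out.size() + " bytes");
  }
}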
• https://www.javatpoint.com/hadoop-tutorial
• https://www.tutorialspoint.com/hadoop/index.htm
• https://www.sanfoundry.com/hadoop-filesystem-hdfs-questions-answers/
• Hadoop I/O: Hadoop comes with a set of ________ for data I/O.
a) methods
b) commands
c) classes
d) none of the mentioned
• The _________ stores its data as just the value field, append(value), and the key is a LongWritable that contains the record number, count + 1.
a) SetFile
b) ArrayFile
c) BloomMapFile
d) None of the mentioned
• The Hadoop Distributed File System (HDFS) will split large data
files into chunks which are managed by different nodes in the
cluster.
• HDFS has a master/slave architecture.
• Hadoop can process many different types of data formats, from flat
text files to databases.
• Map Reduce works by breaking the processing into two phases: the
map phase and the reduce phase.
• Apache Avro is a language-neutral data serialization system.
1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
4. Eric Sammer, "Hadoop Operations", O'Reilly, 2012.
5. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly, 2012.
6. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
7. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
8. Alan Gates, "Programming Pig", O'Reilly, 2011.
Thank You