
Noida Institute of Engineering and Technology, Greater Noida

Basics of Hadoop

Unit: 3

RCA E45- Big Data


Hirdesh Sharma,
Department of MCA
MCA 5th Sem

Hirdesh Sharma RCA E45 Big Data Unit: 3


1
08/11/2021
Content

• Course Objective
• Course Outcome
• CO and PO Mapping
• Basics of Hadoop
• Data format
• Analyzing data with Hadoop
• Scaling out
• Hadoop Streaming
• Hadoop Pipes
• Design of the Hadoop Distributed File System (HDFS)
• HDFS concepts

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 2


Content

• Java interface
• Data flow
• Hadoop I/O
• Data integrity
• Compression
• Serialization
• Avro file-based data structures
• Summary

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 3


Course Objective

Upon completion of this course, students will be able to:

• Explain what Big Data is and why it is used.
• Describe Hadoop and related open-source technologies.
• Demonstrate familiarity with NoSQL data management.
• Apply important concepts of Big Data and Hadoop to unstructured data.
• Synthesize the use of HBase data models and their implementation.

Hirdesh Sharma RCA E45 Big Data Unit: 3


08/11/2021 4
Course Outcome

After completing this course, students will be able to:

• CO1: Study the paradigms and approaches used to analyze unstructured data into semi-structured and structured data, along with cloud, big data, and mobile business intelligence in practice.
• CO2: Explain why the big data concept is used, and describe the basics of Hadoop: data formats, analyzing data with Hadoop, scaling out, Hadoop Streaming, Hadoop Pipes, and the design of the Hadoop Distributed File System (HDFS).
• CO3: Apply industry examples of big data to real-life problems and analyze how to implement them.
• CO4: Explain the concepts of NoSQL, aggregate data models, aggregates, key-value and document data models, relationships, partitioning and combining, and composing map-reduce calculations.
• CO5: Gather information about Hadoop-related tools: HBase, its data model and implementations, HBase clients, HBase examples (praxis), Cassandra, the Cassandra data model, and HiveQL queries.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 5


Program Outcome

• PO1: Computational Knowledge: Develop knowledge of computing fundamentals, computing specializations, mathematics, and domain knowledge for solving real-world problems.
• PO2: Problem Analysis: Identify, formulate, review research literature, and analyze complex problems, reaching substantiated conclusions using first principles of mathematics, computing sciences, and relevant domain disciplines.
• PO3: Design/Development of Solutions: Design and evaluate systems, components, or processes for complex computing problems that meet specified needs, with appropriate consideration for public health and safety and for cultural, societal, and environmental concerns.
• PO4: Conduct Investigations of Complex Computing Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of information, to provide valid conclusions.
• PO5: Modern Tool Usage: Create, select, adapt, and apply appropriate techniques, resources, and modern computing tools, including prediction and modeling, to complex computing activities, with an understanding of their limitations.
• PO6: Professional Ethics: Understand and commit to professional ethics, cyber regulations, responsibilities, and norms of professional computing practice.
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 1 6
Program Outcome
• PO7: Life-long Learning: Recognize the need for, and have the ability to engage in, independent learning for continual preparation and development as a computing professional in the broadest context of technological change.
• PO8: Project Management and Finance: Demonstrate knowledge and understanding of computing and management principles, and apply these to one’s own work, as a member and leader of a team, to manage projects in multidisciplinary environments.
• PO9: Communication Efficacy: Communicate effectively with the computing community, and with society at large, about complex computing activities by being able to comprehend and write effective reports, design documentation, make effective presentations, and give and understand clear instructions.
• PO10: Societal and Environmental Concern: Understand and assess societal, environmental, health, safety, legal, and cultural issues within local and global contexts, and the consequent responsibilities relevant to professional computing practice.
• PO11: Individual and Team Work: Function effectively as an individual and as a member or leader of diverse teams and in multidisciplinary environments.
• PO12: Innovation and Entrepreneurship: Identify a timely opportunity and use innovation to pursue it, creating value and wealth for the betterment of the individual and society at large.
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 1 7
CO-PO Mapping

Mapping of Course Outcomes (COs) and Program Outcomes (POs):

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 1 8


Unit 3 Objective

Upon completion of this unit, students will be able to:

• Describe Hadoop and related open-source technologies.
• Demonstrate familiarity with NoSQL data management.

Hirdesh Sharma RCA E45 Big Data Unit: 3


08/11/2021 9
Prerequisite and Recap

• NoSQL (also called "Not Only SQL") is an approach to data management and database design that is useful for very large sets of distributed data.
• SQL databases have a predefined schema, whereas NoSQL databases have a dynamic schema for unstructured data.
• There are four general types of NoSQL databases, each with its own specific attributes:
–Key-value storage
–Document storage
–Graph storage
–Column storage

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 10


Topic Name (CO3)

• Introduction to Hadoop and Hadoop Architecture

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 11


Topic Objective (CO3)

After completion of this topic, students will be able to understand:


• Introduction to Hadoop
• Hadoop Architecture
• Analyzing the data with Hadoop

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 12


Introduction to Hadoop (CO3)

• Performing computation on large volumes of data has been done


before, usually in a distributed setting.
• In a Hadoop cluster, data is distributed to all the nodes of the cluster
as it is being loaded in. The Hadoop Distributed File System
(HDFS) will split large data files into chunks which are managed by
different nodes in the cluster.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 13


Introduction to Hadoop (CO3)

• In addition to this each chunk is replicated across several machines,


so that a single machine failure does not result in any data being
unavailable.
• An active monitoring system then re-replicates the data in response
to system failures which can result in partial storage.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 14


Introduction to Hadoop (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 15


Map Reduce in Hadoop (CO3)

• Hadoop limits the amount of communication which can be


performed by the processes, as each individual record is processed
by a task in isolation from one another.
• Hadoop will not run just any program and distribute it across a
cluster.
• Programs must be written to conform to a particular programming
model, named "MapReduce."
• In MapReduce, records are processed in isolation by tasks called
Mappers.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 16


Map Reduce in Hadoop (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 17


Hadoop Architecture (CO3)

• HDFS has a master/slave architecture.


• An HDFS cluster consists of a single NameNode, a master server
that manages the file system namespace and regulates access to files
by clients.
• HDFS exposes a file system namespace and allows user data to be
stored in files.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 18


Hadoop Architecture (CO3)

• Internally, a file is split into one or more blocks and these blocks are
stored in a set of DataNodes.
• The NameNode executes file system namespace operations like
opening, closing, and renaming files and directories.
• The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 19


Hadoop Architecture (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 20


Hadoop Architecture (CO3)

• The NameNode and DataNode are pieces of software designed to


run on commodity machines.
• These machines typically run a GNU/Linux operating system (OS).

• HDFS is built using the Java language; any machine that supports
Java can run the NameNode or the DataNode software.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 21


Hadoop Architecture (CO3)

• Each of the other machines in the cluster runs one instance of the
DataNode software.
• The NameNode is the arbitrator and repository for all HDFS
metadata.
• The system is designed in such a way that user data never flows
through the NameNode.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 22


Data Format (CO3)

Input Format:
• How these input files are split up and read is defined by the Input
Format.
• An Input Format is a class that provides the following functionality:
– Selects the files or other objects that should be used for input
– Defines the Input Splits that break a file into tasks
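• As a sketch of how this is configured with the older org.apache.hadoop.mapred API used elsewhere in this unit (the input path below is only illustrative), the Input Format is set on the JobConf; TextInputFormat is the default and treats each line of input as a record:

JobConf conf = new JobConf(MaxTemperature.class);
// TextInputFormat is the default Input Format: keys are byte offsets, values are lines of text
conf.setInputFormat(TextInputFormat.class);
// The Input Format also selects which files to read, via the job's input paths
FileInputFormat.addInputPath(conf, new Path("input/ncdc"));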

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 23


Data Format (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 24


Data Format (CO3)

Output Format:
• The Output Format governs how the job's output is written: the (key, value) pairs that the reduce function emits to its OutputCollector are written to files in the output directory.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 25


Data Format (CO3)

• Hadoop can process many different types of data formats, from flat
text files to databases.
• If it is a flat file, the data is stored using a line-oriented ASCII format, in which each line is a record.
• For example, in National Climatic Data Center (NCDC) data, as shown below, the format supports a rich set of meteorological elements.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 26


Data Format (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 27


Analyzing the data with Hadoop (CO3)

• To take advantage of the parallel processing that Hadoop provides,


we need to express our query as a Map Reduce job.

Map and Reduce:


• Map Reduce works by breaking the processing into two phases: the
map phase and the reduce phase.
• The programmer also specifies two functions: the map function and
the reduce function.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 28


Analyzing the data with Hadoop (CO3)

• The input to our map phase is the raw NCDC data. We choose a text
input format that gives us each line in the dataset as a text value.
• The map function is also a good place to drop bad records: here we
filter out temperatures that are missing, suspect, or erroneous.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 29


Analyzing the data with Hadoop (CO3)

• To visualize the way the map works, consider the following sample
lines of input data:
• (some unused columns have been dropped to fit the page, indicated
by ellipses):

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 30


Analyzing the data with Hadoop (CO3)

• These lines are presented to the map function as the key-value pairs:

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 31


Analyzing the data with Hadoop (CO3)

• The map function merely extracts the year and the air temperature from each line and emits them as its output (the temperature values have been interpreted as integers):
• (1950, 0)
• (1950, 22)
• (1950, −11)

• (1949, 111)
• (1949, 78)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 32


Analyzing the data with Hadoop (CO3)

• The output from the map function is processed by the Map Reduce
framework before being sent to the reduce function. So, continuing
the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
• All the reduce function has to do now is iterate through the list and
pick up the maximum reading:
(1949, 111)
(1950, 22)
• This is the final output: the maximum global temperature recorded
in each year.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 33


Analyzing the data with Hadoop (CO3)

• The whole data flow is illustrated in the following figure:

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 34


Java Map Reduce (CO3)

• Having run through how the Map Reduce program works, the next
step is to express it in code.
• The map function is represented by an implementation of the
Mapper interface, which declares a map() method.
• The Mapper interface is a generic type, with four formal type
parameters that specify the input key, input value, output key, and
output value types of the map function.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 35


Java Map Reduce (CO3)

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException
  {
    // Convert the input data into a string
    String line = value.toString();
    // Obtain the year and temperature using the substring method
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Place the year and temperature into the OutputCollector,
    // skipping missing or suspect readings
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 36


Java Map Reduce (CO3)

• The reduce function is similarly defined using a Reducer, as shown below.

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException
  {
    // Find the maximum temperature for each year
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    // Put the maximum temperature and its year into the OutputCollector
    output.collect(key, new IntWritable(maxValue));
  }
}
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 37
Java Map Reduce (CO3)
• The third piece of code runs the MapReduce job.
public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
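• Assuming the three classes above are compiled and on the classpath, the job can be launched from the command line; the classpath and data paths below are illustrative:
% export HADOOP_CLASSPATH=build/classes
% hadoop MaxTemperature input/ncdc/sample.txt output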

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 38


Java Map Reduce (CO3)

• A JobConf object forms the specification of the job. It gives you


control over how the job is run.
• The setOutputKeyClass() and setOutputValueClass() methods
control the output types for the map and the reduce functions.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 39


Daily Quiz
• Hadoop comes with a set of ________ for data I/O.
a) methods
b) commands
c) classes
d) none of the mentioned
• How many formats of Sequence File are present in Hadoop I/O?
a) 2
b) 3
c) 4
d) 5
• Point out the wrong statement.
a) The data file contains all the key, value records but key N + 1
must be greater than or equal to the key N
b) Sequence file is a kind of hadoop file based data structure
c) Map file type is splittable as it contains a sync point after
several records
d) None of the mentioned
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 40
Noida Institute of Engineering and Technology, Greater Noida

Basics of Hadoop

Unit: 3

RCA E45- Big Data


Hirdesh Sharma,
Department of MCA
MCA 5th Sem

Hirdesh Sharma RCA E45 Big Data Unit: 3


41
08/11/2021
Recap

• In a Hadoop cluster, data is distributed to all the nodes of the cluster


as it is being loaded in.
• The Hadoop Distributed File System (HDFS) will split large data
files into chunks which are managed by different nodes in the
cluster.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 42


Topic Name (CO3)

• Scaling Out and Avro Based file system


• Hadoop File System

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 43


Topic Objective (CO3)

After completion of this topic, students will be able to understand:


• Scaling Out

• Hadoop File System


• Avro

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 44


Scaling Out (CO3)

• You’ve seen how Map Reduce works for small inputs; now it’s time
to take a bird’s-eye view of the system and look at the data flow for
large inputs.
• A MapReduce job consists of the input data, the MapReduce program, and configuration information.
• Hadoop runs the job by dividing it into tasks, of which there are two
types: map tasks and reduce tasks.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 45


Scaling Out (CO3)

• There are two types of nodes that control the job execution process:
a job tracker and a number of task trackers.
• Hadoop divides the input to a Map Reduce job into fixed-size pieces
called input splits, or just splits.
• Hadoop creates one map task for each split, which runs the user
defined map function for each record in the split.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 46


Scaling Out (CO3)

• Having many splits means the time taken to process each split is small compared to the time to process the whole input.
• Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 47


Scaling Out (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 48


Scaling Out (CO3)

• The number of reduce tasks is not governed by the size of the input,
but is specified independently.
• When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task.
• The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well.
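• As a small sketch in the same old API (the reducer count here is arbitrary), the number of reduce tasks and the partitioner can be set on the JobConf; HashPartitioner is the default:

conf.setNumReduceTasks(2); // two partitions, so two output files
conf.setPartitionerClass(HashPartitioner.class); // default: partition = hash(key) mod number of reducers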

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 49


Scaling Out (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 50


Scaling Out (CO3)

• Finally, it’s also possible to have zero reduce tasks. This can be
appropriate when you don’t need the shuffle since the processing
can be carried out entirely in parallel as shown in figure:

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 51


Combiner Function (CO3)

• Hadoop allows the user to specify a combiner function to be run on


the map output—the combiner function’s output forms the input to
the reduce function.
• The combiner function is an optimization, so Hadoop makes no guarantee of how many times it will call it for a particular map output record; calling the combiner function zero, one, or many times should produce the same output from the reducer.
• The contract for the combiner function constrains the type of
function that may be used.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 52


Combiner Function (CO3)

Imagine the first map produced the output:


(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 53


Combiner Function (CO3)

• The output is (1950, 25), since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce function would then be called with:
(1950, [20, 25])
• we may express the function calls on the temperature values in this
case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) =
max(20, 25) = 25
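• Note that not every function can be used as a combiner in this way. If we were calculating mean temperatures, for example, reusing the mean as a combiner would give the wrong answer:
mean(0, 20, 10, 25, 15) = 14
but
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15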

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 54


Combiner Function (CO3)

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 55
Hadoop Streaming (CO3)

• Hadoop provides an API to MapReduce that allows you to write


your map and reduce functions in languages other than Java.
• Streaming is naturally suited for text processing and when used in
text mode, it has a line-oriented view of data.
• Map input data is passed over standard input to your map function,
which processes it line by line and writes lines to standard output.
• A map output key-value pair is written as a single tab-delimited line.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 56


Hadoop Streaming (CO3)

Ruby: The map function can be expressed in Ruby as shown below


STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end

• Since the script just operates on standard input and output, it's trivial to test the script without using Hadoop, simply by using Unix pipes:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950  +0000
1950  +0022
1950  -0011
1949  +0111
1949  +0078

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 57


Hadoop Streaming (CO3)

The reduce function shown below:


last_key, max_val = nil, 0
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 58


Hadoop Streaming (CO3)

We can now simulate the whole MapReduce pipeline with a Unix pipeline:

% cat input/ncdc/sample.txt | \
    ch02/src/main/ruby/max_temperature_map.rb | \
    sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949  111
1950  22
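• To run the same Ruby scripts on a cluster, the Hadoop Streaming JAR is used; the JAR location below depends on the Hadoop version and installation, so the paths are illustrative:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input input/ncdc/sample.txt \
    -output output \
    -mapper ch02/src/main/ruby/max_temperature_map.rb \
    -reducer ch02/src/main/ruby/max_temperature_reduce.rb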

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 59


Hadoop Pipes (CO3)

• Hadoop Pipes is the name of the C++ interface to Hadoop Map


Reduce.
• Pipes uses sockets as the channel over which the task tracker
communicates with the process running the C++ map or reduce
function.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 60


Hadoop Pipes (CO3)

class MaxTemperatureMapper : public HadoopPipes::Mapper
{
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) { }
  void map(HadoopPipes::MapContext& context)
  {
    // Convert the input data into a string
    std::string line = context.getInputValue();
    // Obtain the year and temperature using substr
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);
    // Emit the (year, temperature) pair, skipping missing or suspect readings
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 61


Hadoop Pipes (CO3)

class MapTemperatureReducer : public HadoopPipes::Reducer
{
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) { }
  void reduce(HadoopPipes::ReduceContext& context)
  {
    // Find the maximum temperature of each year
    int maxValue = INT_MIN;
    while (context.nextValue()) {
      maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    // Put the max. temp and its year into the output
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[])
{
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                                                           MapTemperatureReducer>());
}
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 62
Hadoop Pipes (CO3)

• The map and reduce functions are defined by extending the Mapper
and Reducer classes defined in the HadoopPipes namespace and
providing implementations of the map() and reduce() methods in
each case.
• The runTask() method is passed a Factory so that it can create
instances of the Mapper or Reducer.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 63


Hadoop Pipes (CO3)

• Which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
• Pipes doesn’t run in standalone (local) mode, since it relies on
Hadoop’s distributed cache mechanism, which works only when
HDFS is running.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 64


Hadoop Pipes (CO3)

First, the compiled executable is copied into HDFS so that the tasks can retrieve it:
% hadoop fs -put max_temperature bin/max_temperature


The sample data also needs to be copied from the local filesystem into
HDFS:
% hadoop fs -put input/ncdc/sample.txt sample.txt
Now we can run the job. For this, we use the Hadoop pipes command,
passing the URI of the executable in HDFS using the -program argument:
% hadoop pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input sample.txt \
-output output \
-program bin/max_temperature
The result is the same as for the other versions of the same program that we ran in the previous examples.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 65


Design of HDFS (CO3)

• HDFS is a filesystem designed for storing very large files with


streaming data access patterns, running on clusters of commodity
hardware.
• “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
• Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 66


Design of HDFS (CO3)

• Commodity hardware: Hadoop doesn’t require expensive, highly


reliable hardware to run on. HDFS is designed to carry on working
without a noticeable interruption to the user in the face of such
failure.
• Low-latency data access: Applications that require low-latency access to data do not work well with HDFS, because HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 67


Design of HDFS (CO3)

• Lots of small files: Since the name node holds file system metadata
in memory, the limit to the number of files in a file system is
governed by the amount of memory on the name node.
• Multiple writers, arbitrary file modifications: Files in HDFS may be written to by only a single writer, and writes are always made at the end of the file; there is no support for multiple writers or for modifications at arbitrary offsets.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 68


HDFS Concepts (CO3)

• The following diagram illustrates the Hadoop concepts:

Blocks
• A disk has a block size, which is the minimum amount of data that it
can read or write.
• HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
• Thus the time to transfer a large file made of multiple blocks
operates at the disk transfer rate.
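• HDFS blocks are 64 MB by default in the Hadoop releases this unit describes (the size is configurable). The block layout of the filesystem can be inspected with the fsck tool, for example:
% hadoop fsck / -files -blocks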

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 69


HDFS Concepts (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 70


HDFS Concepts (CO3)

Benefits of blocks:
• It simplifies the storage subsystem. The storage subsystem deals
with blocks, simplifying storage management and eliminating
metadata concerns.
• Blocks fit well with replication for providing fault tolerance and
availability.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 71


HDFS Concepts (CO3)

Name nodes and Data nodes


• An HDFS cluster has two types of node operating in a master-
worker pattern: a name node (the master) and a number of data
nodes (workers).
• The name node manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 72


HDFS Concepts (CO3)

• A client accesses the file system on behalf of the user by


communicating with the name node and data nodes. Data nodes are
the workhorses of the file system.
• They store and retrieve blocks when they are told to, and they report
back to the name node periodically with lists of blocks that they are
storing. Without the name node, the file system cannot be used.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 73


HDFS Concepts (CO3)

Secondary Name node


• It is also possible to run a secondary name node, which, despite its name, does not act as a name node; its main role is to periodically merge the namespace image with the edit log so that the edit log does not become too large.
• The secondary name node usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the name node to perform the merge.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 74


HDFS Concepts (CO3)

The Command-Line Interface


• There are many other interfaces to HDFS, but the command line is
one of the simplest and, to many developers, the most familiar.
• It provides a command line interface called FS shell that lets a user
interact with the data in HDFS.
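• A few typical FS shell commands are shown below as a sketch; the paths are illustrative:
% hadoop fs -mkdir /user/hadoop/input
% hadoop fs -copyFromLocal input/ncdc/sample.txt /user/hadoop/input
% hadoop fs -ls /user/hadoop/input
% hadoop fs -cat /user/hadoop/input/sample.txt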

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 75


Hadoop Filesystems (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 76


Hadoop File systems (CO3)

• The Java abstract class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop, and there are several concrete implementations, which are described in the accompanying table.
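• For example, a file in HDFS can be read through this abstract class; the following is a minimal sketch (imports and error handling are omitted, and the command-line argument is assumed to be an HDFS URI):

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    // FileSystem.get() returns the concrete implementation for the URI's scheme (e.g. hdfs://)
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      // open() returns an input stream for the file
      in = fs.open(new Path(uri));
      // copy the file's contents to standard output
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}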

Thrift:
• The Thrift API in the “thriftfs” module exposes Hadoop file systems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop file system such as HDFS.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 77


Hadoop Filesystems (CO3)

C:
• Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS).

FUSE
• File system in User space (FUSE) allows file systems that are
implemented in user space to be integrated as a Unix file system.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 78


Hadoop File systems (CO3)

Web DAV
• Web DAV is a set of extensions to HTTP to support editing and
updating files. Web DAV shares can be mounted as file systems on
most operating systems.

File patterns
• It is a common requirement to process sets of files in a single
operation.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 79


Hadoop Filesystems (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 80


Hadoop Filesystems (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 81


Hadoop Filesystems (CO3)

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 82


Avro: Overview (CO3)

• To transfer data over a network or for its persistent storage, you need
to serialize the data.
• In addition to the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.
• Avro provides libraries for various programming languages.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 83


What is Avro (CO3)

• Apache Avro is a language-neutral data serialization system. It was


developed by Doug Cutting, the father of Hadoop.
• Avro is a preferred tool to serialize data in Hadoop.

• Avro has a schema-based system. A language-independent schema is


associated with its read and write operations.
• Avro uses JSON format to declare the data structures. Currently it
supports languages such as Java, C, C++, C#, Python, and Ruby.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 84


Avro: Schema (CO3)

• Avro depends heavily on its schema. It allows all data to be written with no prior knowledge of the schema.
• It serializes quickly, and the resulting serialized data is smaller in size. The schema is stored along with the Avro data in a file for any further processing.
• Avro schemas are defined with JSON that simplifies its
implementation in languages with JSON libraries.
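• As an illustration (the record and field names are made up for this example), an Avro schema for the weather records used earlier in this unit could be written as:

{
  "type": "record",
  "name": "WeatherRecord",
  "fields": [
    {"name": "year", "type": "string"},
    {"name": "temperature", "type": "int"}
  ]
}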

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 85


Features of Avro (CO3)

• Listed below are some of the prominent features of Avro:


• Avro is a language-neutral data serialization system.

• It can be processed by many languages (currently C, C++, C#, Java,


Python, and Ruby).
• Avro creates a binary structured format that is both compressible and splittable.
• Avro schemas defined in JSON facilitate implementation in the
languages that already have JSON libraries.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 86


Features of Avro (CO3)

• Avro creates a self-describing file named Avro Data File, in which it


stores data along with its schema in the metadata section.
• Avro is also used in Remote Procedure Calls (RPCs). During RPC,
client and server exchange schemas in the connection handshake.
• Avro does not need code generation. The data is always
accompanied by schemas, which permit full processing on the data.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 87


Working with Avro (CO3)

• To use Avro, you need to follow the given workflow:


• Step 1: Create schemas. Here you need to design Avro schema
according to your data.
• Step 2: Read the schemas into your program. This can be done either by generating a class corresponding to the schema or by using the parsers library.
• Step 3: Serialize the data using the serialization API provided for Avro,
which is found in the package org.apache.avro.specific.
• Step 4: Deserialize the data using deserialization API provided for
Avro, which is found in the package org.apache.avro.specific.
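• A minimal sketch of this workflow using Avro's generic API and the parsers library is given below; the file names and field names are assumptions for this example (the specific API in org.apache.avro.specific works the same way with classes generated from the schema):

// Step 2: read the schema with the parsers library
Schema schema = new Schema.Parser().parse(new File("weather.avsc"));

// Step 3: serialize a record into an Avro data file
GenericRecord rec = new GenericData.Record(schema);
rec.put("year", "1950");
rec.put("temperature", 22);
DataFileWriter<GenericRecord> writer =
    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
writer.create(schema, new File("weather.avro"));
writer.append(rec);
writer.close();

// Step 4: deserialize the records back from the data file
DataFileReader<GenericRecord> reader =
    new DataFileReader<>(new File("weather.avro"), new GenericDatumReader<GenericRecord>());
while (reader.hasNext()) {
  System.out.println(reader.next());
}
reader.close();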

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 88


Daily Quiz

• Apache Hadoop ___________ provides a persistent data structure


for binary key-value pairs.
a) GetFile
b) SequenceFile
c) Putfile
d) All of the mentioned

• Point out the correct statement.


a) The sequence file also can contain a “secondary” key-value list
that can be used as file Metadata
b) SequenceFile formats share a header that contains some
information which allows the reader to recognize its format
c) There’re Key and Value Class Name’s that allow the reader to
instantiate those classes, via reflection, for reading
d) All of the mentioned

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 89


Recap

• HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
• Apache Avro is a language-neutral data serialization system. It was
developed by Doug Cutting, the father of Hadoop.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 90


Faculty Video Links, Youtube & NPTEL Video Links and Online
Courses Details

• https://www.javatpoint.com/hadoop-tutorial
• https://www.tutorialspoint.com/hadoop/index.htm
• https://www.sanfoundry.com/hadoop-filesystem-hdfs-questio
ns-answers/

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 91


Daily Quiz

• Hadoop comes with a set of ________ for data I/O.
a) methods
b) commands
c) classes
d) none of the mentioned

• Point out the wrong statement.


a) The data file contains all the key, value records but key N + 1
must be greater than or equal to the key N
b) Sequence file is a kind of hadoop file based data structure
c) Map file type is splittable as it contains a sync point after
several records
d) None of the mentioned

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 92


Daily Quiz

• Apache Hadoop ___________ provides a persistent data structure


for binary key-value pairs.
a) GetFile
b) SequenceFile
c) Putfile
d) All of the mentioned

• Point out the correct statement.


a) The sequence file also can contain a “secondary” key-value list
that can be used as file Metadata
b) SequenceFile formats share a header that contains some
information which allows the reader to recognize its format
c) There’re Key and Value Class Name’s that allow the reader to
instantiate those classes, via reflection, for reading
d) All of the mentioned

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 93


Weekly Assignment 1

Q:1 Discuss the basic concepts of Hadoop in detail.


Q:2 How can you analyze data with Hadoop? Explain this with the help of a suitable example.
Q:3 Discuss the HDFS concepts in detail with suitable examples.
Q:4 Write a short note on:
–Hadoop Data Format
–Scaling Out
–Hadoop Pipes

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 94


Weekly Assignment 2

Q:1 Explain the concept of Avro file-based data structures.


Q:2 Explain the following terms:
–Data Integrity
–Compression
–Serialization
Q:3 Explain the design of the Hadoop distributed file system (HDFS) in detail.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 95


MCQs

• Which of the following format is more compression-aggressive?


a) Partition Compressed
b) Record Compressed
c) Block-Compressed
d) Uncompressed
• The __________ is a directory that contains two SequenceFile.
a) ReduceFile
b) MapperFile
c) MapFile
d) None of the mentioned
• The ______ file is populated with the key and a LongWritable that
contains the starting byte position of the record.
a) Array
b) Index
c) Immutable
d) All of the mentioned
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 96
MCQs

• The _________ takes just the value field append(value), and the key is a LongWritable that contains the record number, count + 1.
a) SetFile
b) ArrayFile
c) BloomMapFile
d) None of the mentioned

• The ____________ data file format is based on the Avro serialization framework, which was primarily created for Hadoop.
a) Oozie
b) Avro
c) cTakes
d) Lucene

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 97


Old Question Papers

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 98


Old Question Papers

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 99


Old Question Papers

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 100


Old Question Papers

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 101


Expected Questions for University Exam

Q:1 Discuss the basic concepts of Hadoop in detail.


Q:2 How can you analyze data with Hadoop? Explain this with the help of a suitable example.
Q:3 Discuss the HDFS concepts in detail with suitable examples.
Q:4 Explain the design of the Hadoop distributed file system (HDFS) in detail.
Q:5 Write a short note on:
–Hadoop Data Format
–Scaling Out
–Hadoop Pipes

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 102


Summary

• The Hadoop Distributed File System (HDFS) will split large data
files into chunks which are managed by different nodes in the
cluster.
• HDFS has a master/slave architecture.

• Hadoop can process many different types of data formats, from flat
text files to databases.
• Map Reduce works by breaking the processing into two phases: the
map phase and the reduce phase.
• Apache Avro is a language-neutral data serialization system.

08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 103


References

1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
4. Eric Sammer, "Hadoop Operations", O'Reilly, 2012.
5. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly, 2012.
6. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
7. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
8. Alan Gates, "Programming Pig", O'Reilly, 2011.

Thank You
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 3 104
