Basics of Hadoop
Unit: 3
• Course Objective
• Course Outcome
• CO and PO Mapping
• Basics of Hadoop
• Data format
• Analyzing data with Hadoop
• Scaling out
• Hadoop streaming
• Hadoop pipes
• Design of the Hadoop Distributed File System (HDFS)
• HDFS concepts
• Java interface
• Data flow
• Hadoop I/O
• Data integrity
• Compression
• Serialization
• Avro and file-based data structures
• Summary
• Internally, a file is split into one or more blocks and these blocks are
stored in a set of DataNodes.
• The NameNode executes file system namespace operations like
opening, closing, and renaming files and directories.
• The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.
• HDFS is built using the Java language; any machine that supports
Java can run the NameNode or the DataNode software.
• Each of the other machines in the cluster runs one instance of the
DataNode software.
• The NameNode is the arbitrator and repository for all HDFS
metadata.
• The system is designed in such a way that user data never flows
through the NameNode.
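• As a minimal sketch of the Java interface to HDFS (the NameNode URI and file path below are assumed examples), the following program reads a file: the client asks the NameNode for block locations, then streams the data directly from the DataNodes.

// Minimal sketch: reading an HDFS file via the Java FileSystem API.
// The URI "hdfs://namenode:9000/user/data/sample.txt" is an assumed example.
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:9000/user/data/sample.txt"; // assumed URI
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      // Metadata (block locations) comes from the NameNode; the file
      // contents are read directly from the DataNodes.
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}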
Input Format:
• How these input files are split up and read is defined by the Input
Format.
• An Input Format is a class that provides the following functionality:
– Selects the files or other objects that should be used for input
– Defines the Input Splits that break a file into tasks
Output Format:
• The Output Format governs how job results are written: the (key, value) pairs that the reduce phase provides to the Output Collector are written to output files.
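• As a sketch of where these formats are wired up, a hypothetical job driver using the newer org.apache.hadoop.mapreduce API might look as follows; the class names and paths are assumptions for illustration (the mapper and reducer are sketched later in this unit).

// Sketch of a job driver that wires up input and output formats.
// MaxTemperatureMapper/Reducer and the paths are assumed names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);

    // Input Format: selects input files and defines how they are split and read.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("input/ncdc"));   // assumed path

    // Output Format: controls how the reducer's (key, value) pairs are written.
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("output"));     // assumed path

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}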
• Hadoop can process many different types of data formats, from flat
text files to databases.
• If the input is a flat file, the data is stored using a line-oriented ASCII format, in which each line is a record.
• For example, NCDC (National Climatic Data Center) weather records use such a format, which supports a rich set of meteorological elements.
• The input to our map phase is the raw NCDC data. We choose a text
input format that gives us each line in the dataset as a text value.
• The map function is also a good place to drop bad records: here we
filter out temperatures that are missing, suspect, or erroneous.
• To visualize the way the map works, consider a few sample lines of raw NCDC input data (in the original example, some unused columns are dropped to fit the page, indicated by ellipses).
• These lines are presented to the map function as key-value pairs, where the key is the offset of the line within the file and the value is the line of text.
• The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
• (1950, 0)
• (1950, 22)
• (1950, −11)
• (1949, 111)
• (1949, 78)
• The output from the map function is processed by the Map Reduce
framework before being sent to the reduce function. So, continuing
the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
• All the reduce function has to do now is iterate through the list and
pick up the maximum reading:
(1949, 111)
(1950, 22)
• This is the final output: the maximum global temperature recorded
in each year.
• Having run through how the Map Reduce program works, the next
step is to express it in code.
• The map function is represented by an implementation of the
Mapper interface, which declares a map() method.
• The Mapper interface is a generic type, with four formal type
parameters that specify the input key, input value, output key, and
output value types of the map function.
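• Below is a sketch of the map and reduce functions, assuming the newer org.apache.hadoop.mapreduce API (where Mapper and Reducer are classes rather than interfaces); the NCDC field offsets and the quality-code check are illustrative assumptions about the record layout.

// Sketch: max-temperature map and reduce functions.
// The substring offsets and quality codes are assumed for illustration.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);           // assumed year offset
    int airTemperature;
    if (line.charAt(87) == '+') {                   // parseInt rejects a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Drop records whose temperature is missing, suspect, or erroneous.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Iterate through the list of temperatures and keep the maximum reading.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}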
• You’ve seen how Map Reduce works for small inputs; now it’s time
to take a bird’s-eye view of the system and look at the data flow for
large inputs.
• It consists of the input data, the Map Reduce program, and
configuration information.
• Hadoop runs the job by dividing it into tasks, of which there are two
types: map tasks and reduce tasks.
• There are two types of nodes that control the job execution process:
a job tracker and a number of task trackers.
• Hadoop divides the input to a Map Reduce job into fixed-size pieces
called input splits, or just splits.
• Hadoop creates one map task for each split, which runs the user
defined map function for each record in the split.
• Having many splits means the time taken to process each split is
small compared to the time to process the whole input.
• Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.
• The number of reduce tasks is not governed by the size of the input,
but is specified independently.
• When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task.
• The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well (see the sketch after this list).
• Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle, since the processing can be carried out entirely in parallel.
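• As an illustration of the default behaviour, a hash-based partitioner can be sketched as follows; this class is a hypothetical re-implementation of what Hadoop's built-in HashPartitioner already does.

// Hypothetical hash-based partitioner: each key is bucketed into one of
// the reduce partitions by its hash code.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Wiring it into a job (and, alternatively, requesting zero reducers):
//   job.setPartitionerClass(YearPartitioner.class);
//   job.setNumReduceTasks(0);   // map output is written directly, no shuffle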
We can now simulate the whole MapReduce pipeline with a Unix pipeline:
sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
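• For reference, the full local simulation and a corresponding Hadoop Streaming invocation look roughly like this; the sample input path and the streaming jar location are assumptions (they vary by installation and Hadoop version), while the script paths follow the pattern above.

cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
    sort | ch02/src/main/ruby/max_temperature_reduce.rb

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input input/ncdc/sample.txt \
    -output output \
    -mapper ch02/src/main/ruby/max_temperature_map.rb \
    -reducer ch02/src/main/ruby/max_temperature_reduce.rb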
• The map and reduce functions are defined by extending the Mapper
and Reducer classes defined in the HadoopPipes namespace and
providing implementations of the map() and reduce() methods in
each case.
• The runTask() method is passed a Factory so that it can create
instances of the Mapper or Reducer.
• Which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
• Pipes doesn’t run in standalone (local) mode, since it relies on
Hadoop’s distributed cache mechanism, which works only when
HDFS is running.
There are areas where HDFS is not a good fit:
• Lots of small files: since the NameNode holds file system metadata in memory, the limit to the number of files in a file system is governed by the amount of memory on the NameNode.
• Multiple writers, arbitrary file modifications: files in HDFS may be written to by only a single writer, and writes are always made at the end of the file.
Blocks
• A disk has a block size, which is the minimum amount of data that it
can read or write.
• HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
• Thus the time to transfer a large file made of multiple blocks
operates at the disk transfer rate.
Benefits of blocks:
• They simplify the storage subsystem: it deals with fixed-size blocks, which simplifies storage management and keeps metadata concerns separate.
• Blocks fit well with replication for providing fault tolerance and availability.
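• As a small illustration, the block size and replication factor of a stored file can be inspected through the Java FileSystem API; the file path below is an assumed example.

// Sketch: inspecting the block size and replication factor of an HDFS file.
// The path "/user/data/sample.txt" is an assumed example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/data/sample.txt"));
    System.out.println("Block size:  " + status.getBlockSize() + " bytes");
    System.out.println("Replication: " + status.getReplication());
  }
}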
Thrift:
• The Thrift API in the “thriftfs” module exposes Hadoop file systems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
C:
• Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS).
FUSE
• Filesystem in Userspace (FUSE) allows file systems that are implemented in user space to be integrated as Unix file systems.
WebDAV
• WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as file systems on most operating systems.
File patterns
• It is a common requirement to process sets of files in a single
operation.
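• Hadoop supports this through glob patterns; a minimal sketch using the FileSystem API is shown below, with an assumed example pattern.

// Sketch: expanding a file pattern (glob) to a set of matching paths.
// The glob "/logs/2021/*/part-*" is an assumed example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] matches = fs.globStatus(new Path("/logs/2021/*/part-*"));
    for (FileStatus status : matches) {
      System.out.println(status.getPath());
    }
  }
}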
• To transfer data over a network or for its persistent storage, you need
to serialize the data.
• In addition to the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.
• Avro provides libraries for various programming languages.
• HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
• Apache Avro is a language-neutral data serialization system. It was
developed by Doug Cutting, the father of Hadoop.
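• A small sketch of schema-based serialization with Avro's Java generic API follows; the schema and its field names are assumptions for illustration.

// Sketch: serializing a record with Avro's generic API.
// The schema and its fields ("year", "temperature") are assumed examples.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    String schemaJson = "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
        + "{\"name\":\"year\",\"type\":\"string\"},"
        + "{\"name\":\"temperature\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord record = new GenericData.Record(schema);
    record.put("year", "1950");
    record.put("temperature", 22);

    // Encode the record to a compact, language-neutral binary form.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(record, encoder);
    encoder.flush();
    System.out.println("Serialized " + out.size() + " bytes");
  }
}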
• https://www.javatpoint.com/hadoop-tutorial
• https://www.tutorialspoint.com/hadoop/index.htm
• https://www.sanfoundry.com/hadoop-filesystem-hdfs-questions-answers/
• Hadoop I/O: Hadoop comes with a set of ________ for data I/O.
a) methods
b) commands
c) classes
d) none of the mentioned
• The _________ stores its data as just the value field, append(value), and the key is a LongWritable that contains the record number, count + 1.
a) SetFile
b) ArrayFile
c) BloomMapFile
d) None of the mentioned
• The Hadoop Distributed File System (HDFS) will split large data
files into chunks which are managed by different nodes in the
cluster.
• HDFS has a master/slave architecture.
• Hadoop can process many different types of data formats, from flat
text files to databases.
• Map Reduce works by breaking the processing into two phases: the
map phase and the reduce phase.
• Apache Avro is a language-neutral data serialization system.
1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
4. Eric Sammer, "Hadoop Operations", O'Reilly, 2012.
5. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly, 2012.
6. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
7. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
8. Alan Gates, "Programming Pig", O'Reilly, 2011.
Thank You