HDFS
Hadoop Distributed File System
Thuong-Cang PHAN, PhD.
College of Information & Communication Technology - CTU
Apache Hadoop
• Open Source:
http://hadoop.apache.org/
• Wide acceptance:
- http://wiki.apache.org/hadoop/PoweredBy
- Amazon.com, Apple, AOL, eBay, IBM, Google, LinkedIn,
Last.fm, Microsoft, SAP, Twitter, …
CIT - Can Tho University
Hadoop Architecture
Storage - HDFS
• HDFS - Hadoop Distributed File System
• The primary distributed file system used by Hadoop
applications which runs on large clusters of commodity
machines.
• HDFS clusters consist of:
• A NameNode: manages the file system metadata and
monitors data nodes.
• A Secondary NameNode: periodically copies and merges the
namespace image and edit log. If the NameNode crashes, the
namespace image stored on the Secondary NameNode can be
used to restart it.
• DataNodes: store, read and write actual data blocks.
HDFS - Basics
• A file is split into blocks (e.g., 128 MB each)
• The blocks are then assigned to (different) nodes
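As a rough illustration of the block split above, the following shell sketch computes how many blocks a file would occupy. The 300 MB file size is a made-up value, and 128 MB is the example block size from the slide; the actual block size depends on the cluster configuration.

```shell
#!/usr/bin/env bash
# Sketch: how many HDFS blocks does a file need?
file_size_mb=300      # hypothetical file size (assumption)
block_size_mb=128     # example block size from the slide

# Ceiling division: a partially filled last block still counts as a block.
num_blocks=$(( (file_size_mb + block_size_mb - 1) / block_size_mb ))

# Data held by the last block (a full block if the size divides evenly).
last_block_mb=$(( file_size_mb % block_size_mb ))
if [ "$last_block_mb" -eq 0 ]; then
  last_block_mb=$block_size_mb
fi

echo "file of ${file_size_mb} MB -> ${num_blocks} blocks, last block holds ${last_block_mb} MB"
```

With these values the file needs three blocks: two full ones and a last block holding 44 MB. Each block can then be placed on a different node.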
HDFS Architecture
Resource:
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
HDFS - Replication
• The default replication factor can be configured (it defaults to three).
• Replication is pipelined
• When a block is full, the NameNode is asked for other DataNodes
(that can hold a replica)
• A DataNode is contacted and receives the data
• It forwards the data to the next replica (e.g., the third), and so on
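The pipelined hand-off described above can be pictured with a toy shell loop. This is only a simulation of the forwarding order, not how HDFS is implemented; the DataNode names `dn1` to `dn4` are hypothetical stand-ins for the list the NameNode would return.

```shell
#!/usr/bin/env bash
# Toy simulation of a replication pipeline (no real HDFS involved).
replication=3                  # default replication factor from the slide
datanodes="dn1 dn2 dn3 dn4"    # hypothetical DataNodes offered by the NameNode

sender="client"                # the writer starts the pipeline
hops=0
for dn in $datanodes; do
  # Stop once the requested number of replicas has been placed.
  if [ "$hops" -ge "$replication" ]; then
    break
  fi
  echo "$sender -> $dn"        # each node forwards the data to the next one
  sender="$dn"
  hops=$(( hops + 1 ))
done
```

Each node receives the data from its predecessor and passes it on, so the writer sends the block only once instead of once per replica.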
HDFS - Replication
• Benefits of replication
• Availability: data isn’t lost when a node fails
• Reliability: HDFS detects corrupt replicas (using checksums) and repairs them from healthy copies
• Performance: allows for data locality (computation can run on a node that already holds the data)
HDFS - Writing Data
HDFS - Reading Data
Accessing HDFS via Command Line
• Users typically access HDFS via the command:
hadoop fs -subcommand
• Each subcommand resembles the corresponding UNIX command
• Examples
• $ hadoop fs -ls /user/tomwheeler
• $ hadoop fs -cat /customers.csv
• $ hadoop fs -rm /webdata/access.log
• $ hadoop fs -mkdir /reports/marketing
Copying Local Data To/From HDFS
• HDFS is distinct from your local filesystem
• hadoop fs -put copies local files to HDFS
• hadoop fs -get fetches a local copy of a file from HDFS
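A round trip between the local file system and HDFS might look like the sketch below. All file and directory names are hypothetical, and the small `run` helper is an illustration device rather than part of Hadoop: by default it only prints each command, so the script can be read or tried without a cluster. Set `DRY_RUN=0` to execute the commands against a real installation.

```shell
#!/usr/bin/env bash
# Sketch: copy a local file to HDFS, list it, and fetch it back.
DRY_RUN="${DRY_RUN:-1}"          # 1 = print only (default), 0 = really execute

# Hypothetical helper: echo the command, and run it unless in dry-run mode.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

local_src="sales.csv"            # hypothetical local file
hdfs_dir="/user/hadoop/demo"     # hypothetical HDFS directory
local_copy="sales_copy.csv"      # hypothetical local destination

run hadoop fs -mkdir -p "$hdfs_dir"                       # create target directory
run hadoop fs -put "$local_src" "$hdfs_dir/"              # local -> HDFS
run hadoop fs -ls "$hdfs_dir"                             # confirm the upload
run hadoop fs -get "$hdfs_dir/$local_src" "$local_copy"   # HDFS -> local
```

The `-p` option of `-mkdir` (available in Hadoop 2) creates missing parent directories, which keeps the sketch from failing on a fresh home directory.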
hadoop fs -mkdir /user/hadoop/hadoopdemo
HDFS – hadoop fs Shell Commands
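• hadoop fs -ls
• $ hadoop fs -ls /
• $ hadoop fs -ls /home/hadoop/hadoop
• $ hadoop fs -lsr /home/hadoop/hadoop (recursive listing; deprecated in Hadoop 2 in favor of -ls -R)
• hadoop fs -mkdir
• $ hadoop fs -mkdir /input/test
• $ hadoop fs -mkdir /input/test1/test2 (in Hadoop 2, add -p to create missing parent directories)
• hadoop fs -rm
• $ hadoop fs -rm /input/test1
• $ hadoop fs -rmr /input/test1 (recursive delete; deprecated in Hadoop 2 in favor of -rm -r)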
HDFS – hadoop fs Shell Commands
• hadoop fs -put
• $ hadoop fs -put localFileName /input/test
• $ hadoop fs -put localfile1 localfile2 /input/test
• $ hadoop fs -put WordCount.java hdfs://localhost:9000/input/test
• hadoop fs -get
• $ hadoop fs -get /input/test/hdfsFileName localFileName
• hadoop fs -copyFromLocal <localsrc> <URI>
• Copies a file from the local file system to HDFS (like -put, but the source must be a local file)
• hadoop fs -copyToLocal <URI> <localdst>
• Copies a file from HDFS to the local file system (like -get, but the destination must be a local file)
HDFS – hadoop fs Shell Commands
• hadoop fs -getmerge <src> <localdst> [addnl]
• Concatenates the files under <src> into a single local file <localdst>.
• The optional addnl argument adds a newline character at the end of each file.
• hadoop fs -cp
• hadoop fs -cp <SrcFile> <TgtFile>
• hadoop fs -cp <SrcFile1> <SrcFile2> hdfs://namenodehost/<TgtDirectory>
• hadoop fs -du
• hadoop fs -du hdfs://namenodehost/<TgtDirectory>
• hadoop fs -dus (summary of the total size; deprecated in Hadoop 2 in favor of -du -s)
• hadoop fs -dus hdfs://namenodehost/<TgtDirectory>
• hadoop fs -expunge
• Empties the trash
HDFS - Exercises
1. How to create a directory in HDFS
2. How to copy a local file to HDFS
3. How to display the contents of a file in HDFS
4. How to remove a file from HDFS
HDFS – Web Interface