Hadoop 2®
Quick-Start Guide
Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem
Table of Contents
Cover
Half Title
Title
Copyright
Contents
Foreword
Preface
Acknowledgments
About the Author
1 Background and Concepts
Defining Apache Hadoop
A Brief History of Apache Hadoop
Defining Big Data
Hadoop as a Data Lake
Using Hadoop: Administrator, User, or Both
First There Was MapReduce
Apache Hadoop Design Principles
Apache Hadoop MapReduce Example
MapReduce Advantages
Apache Hadoop V1 MapReduce Operation
Moving Beyond MapReduce with Hadoop V2
Hadoop V2 YARN Operation Design
The Apache Hadoop Project Ecosystem
Summary and Additional Resources
2 Installation Recipes
Core Hadoop Services
Hadoop Configuration Files
Planning Your Resources
Hardware Choices
Software Choices
Installing on a Desktop or Laptop
Installing Hortonworks HDP 2.2 Sandbox
Installing Hadoop from Apache Sources
Installing Hadoop with Ambari
Performing an Ambari Installation
Undoing the Ambari Install
Installing Hadoop in the Cloud Using Apache Whirr
Step 1: Install Whirr
Step 2: Configure Whirr
Step 3: Launch the Cluster
Step 4: Take Down Your Cluster
Summary and Additional Resources
3 Hadoop Distributed File System Basics
Hadoop Distributed File System Design Features
HDFS Components
HDFS Block Replication
HDFS Safe Mode
Rack Awareness
NameNode High Availability
HDFS Namespace Federation
HDFS Checkpoints and Backups
HDFS Snapshots
HDFS NFS Gateway
HDFS User Commands
Brief HDFS Command Reference
General HDFS Commands
List Files in HDFS
Make a Directory in HDFS
Copy Files to HDFS
Copy Files from HDFS
Copy Files within HDFS
Delete a File within HDFS
Delete a Directory in HDFS
Get an HDFS Status Report
HDFS Web GUI
Using HDFS in Programs
HDFS Java Application Example
HDFS C Application Example
Summary and Additional Resources
4 Running Example Programs and Benchmarks
Running MapReduce Examples
Listing Available Examples
Running the Pi Example
Using the Web GUI to Monitor Examples
Running Basic Hadoop Benchmarks
Running the Terasort Test
Running the TestDFSIO Benchmark
Managing Hadoop MapReduce Jobs
Summary and Additional Resources
5 Hadoop MapReduce Framework
The MapReduce Model
MapReduce Parallel Data Flow
Fault Tolerance and Speculative Execution
Speculative Execution
Hadoop MapReduce Hardware
Summary and Additional Resources
6 MapReduce Programming
Compiling and Running the Hadoop WordCount Example
Using the Streaming Interface
Using the Pipes Interface
Compiling and Running the Hadoop Grep Chaining Example
Debugging MapReduce
Listing, Killing, and Job Status
Hadoop Log Management
Summary and Additional Resources
7 Essential Hadoop Tools
Using Apache Pig
Pig Example Walk-Through
Using Apache Hive
Hive Example Walk-Through
A More Advanced Hive Example
Using Apache Sqoop to Acquire Relational Data
Apache Sqoop Import and Export Methods
Apache Sqoop Version Changes
Sqoop Example Walk-Through
Using Apache Flume to Acquire Data Streams
Flume Example Walk-Through
Manage Hadoop Workflows with Apache Oozie
Oozie Example Walk-Through
Using Apache HBase
HBase Data Model Overview
HBase Example Walk-Through
Summary and Additional Resources
8 Hadoop YARN Applications
YARN Distributed-Shell
Using the YARN Distributed-Shell
A Simple Example
Using More Containers
Distributed-Shell Examples with Shell Arguments
Structure of YARN Applications
YARN Application Frameworks
Distributed-Shell
Hadoop MapReduce
Apache Tez
Apache Giraph
Hoya: HBase on YARN
Dryad on YARN
Apache Spark
Apache Storm
Apache REEF: Retainable Evaluator Execution Framework
Hamster: Hadoop and MPI on the Same Cluster
Apache Flink: Scalable Batch and Stream Data Processing
Apache Slider: Dynamic Application Management
Summary and Additional Resources
9 Managing Hadoop with Apache Ambari
Quick Tour of Apache Ambari
Dashboard View
Services View
Hosts View
Admin View
Views View
Admin Pull-Down Menu
Managing Hadoop Services
Changing Hadoop Properties
Summary and Additional Resources
10 Basic Hadoop Administration Procedures
Basic Hadoop YARN Administration
Decommissioning YARN Nodes
YARN WebProxy
Using the JobHistoryServer
Managing YARN Jobs
Setting Container Memory
Setting Container Cores
Setting MapReduce Properties
Basic HDFS Administration
The NameNode User Interface
Adding Users to HDFS
Perform an FSCK on HDFS
Balancing HDFS
HDFS Safe Mode
Decommissioning HDFS Nodes
SecondaryNameNode
HDFS Snapshots
Configuring an NFSv3 Gateway to HDFS
Capacity Scheduler Background
Hadoop Version 2 MapReduce Compatibility
Enabling ApplicationMaster Restarts
Calculating the Capacity of a Node
Running Hadoop Version 1 Applications
Summary and Additional Resources
A: Book Webpage and Code Download
B: Getting Started Flowchart and Troubleshooting Guide
Getting Started Flowchart
General Hadoop Troubleshooting Guide
Rule 1: Don't Panic
Rule 2: Install and Use Ambari
Rule 3: Check the Logs
Rule 4: Simplify the Situation
Rule 5: Ask the Internet
Other Helpful Tips
C: Summary of Apache Hadoop Resources by Topic
General Hadoop Information
Hadoop Installation Recipes
HDFS
Examples
MapReduce
MapReduce Programming
Essential Tools
YARN Application Frameworks
Ambari Administration
Basic Hadoop Administration
D: Installing the Hue Hadoop GUI
Hue Installation
Steps Performed with Ambari
Install and Configure Hue
Starting Hue
Hue User Interface
E: Installing Apache Spark
Spark Installation on a Cluster
Starting Spark across the Cluster
Installing and Starting Spark on the Pseudo-distributed Single-Node Installation
Run Spark Examples
Index