Hadoop Ecosystem Frameworks: Applications on Big Data using Pig, Hive and HBase
o Applications of Big Data using Pig, Hive and HBase
2. Pig :
Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction for processing over MapReduce.
It provides a high-level scripting language, known as Pig Latin, which is used
to develop data analysis code.
Applications :
Exploring large datasets with Pig scripts.
Running ad-hoc queries across large data sets.
Prototyping algorithms for processing large datasets.
Processing time-sensitive data loads.
Collecting large volumes of data, such as search logs and web crawls.
Deriving analytical insights through sampling.
3. Hive :
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and
analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Benefits :
Ease of use
Accelerated initial insertion of data
Superior scalability, flexibility, and cost-efficiency
Streamlined security
Low overhead
Exceptional working capacity
4. HBase :
HBase is a column-oriented non-relational database management system
that runs on top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are
common in many big data use cases.
HBase also supports access from applications through Apache Avro, REST, and
Thrift APIs, in addition to its native Java client API.
Application :
Medical
Sports
Web
Oil and petroleum
e-commerce
Pig
o Introduction to PIG
Ø What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows. Pig is
generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as
Pig Latin. This language provides various operators using which programmers
can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using
Pig Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
o Execution Modes of Pig
Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS
mode.
Local Mode
In this mode, all the files are installed and run from your local host
and local file system. There is no need for Hadoop or HDFS. This
mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in
the Hadoop File System (HDFS) using Apache Pig. In this mode,
whenever we execute the Pig Latin statements to process the data, a
MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
Script Execution Modes
Interactive Mode (Grunt shell) − You can run Apache Pig in
interactive mode using the Grunt shell. In this shell, you can enter
the Pig Latin statements and get the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by
writing the Pig Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of
defining our own functions (User Defined Functions) in programming
languages such as Java, and using them in our script.
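As a hedged illustration of embedded mode, here is a minimal Java sketch using Pig's PigServer API; the input path, field names, and filter logic are assumptions of this sketch, not part of Pig itself. ExecType.LOCAL corresponds to Local Mode; ExecType.MAPREDUCE would run the same plan against HDFS.

    // A minimal sketch of Pig's embedded mode, assuming a local Pig installation
    // and an input file /tmp/input.txt with tab-separated name/age fields.
    import java.io.IOException;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class EmbeddedPigExample {
        public static void main(String[] args) throws IOException {
            // ExecType.LOCAL runs against the local file system; use ExecType.MAPREDUCE for HDFS
            PigServer pigServer = new PigServer(ExecType.LOCAL);
            // Each registerQuery call adds one Pig Latin statement to the logical plan
            pigServer.registerQuery("records = LOAD '/tmp/input.txt' AS (name:chararray, age:int);");
            pigServer.registerQuery("adults = FILTER records BY age >= 18;");
            // store() plays the role of STORE and triggers execution of the accumulated plan
            pigServer.store("adults", "/tmp/output");
            pigServer.shutdown();
        }
    }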
o Comparison of Pig with Databases
There are several differences between Pig Latin and SQL, and between Pig
and relational database management systems (RDBMSs) in general.
The most significant difference is that Pig Latin is a data flow programming
language, whereas SQL is a declarative programming language.
Pig Latin programs are step-by-step transformations on input relations; SQL
statements declare constraints that the output must satisfy.
RDBMSs store data in tables with predefined schemas; Pig is more relaxed
and allows optional schemas defined at runtime.
Pig operates on any source of tuples (e.g., text files) without an import
process; data is loaded from the filesystem (usually HDFS) as the first step.
o Grunt
Grunt is the interactive shell for Apache Pig.
Provides line-editing facilities like GNU Readline (e.g., Ctrl-E moves to end of
line).
Remembers command history (Ctrl-P/Ctrl-N or ↑/↓) and offers tab-
completion for Pig Latin keywords and functions.
Supports utility commands: clear, help, history, quit, set, exec, kill, run.
Allows OS commands via sh and HDFS commands via fs.
o Pig Latin
Pig Latin is a high-level data flow language with SQL-like semantics.
Abstracts Java MapReduce into textual notation; simple to learn for those
familiar with SQL.
Supports primitive types (int, long, float, double, chararray, bytearray,
boolean, datetime, biginteger, bigdecimal) and complex types (tuple, bag,
map).
Statements end with a semicolon (;), take relations as input, and produce
relations as output.
o User Defined Functions
Apache Pig provides full UDF support in Java, with limited support in Jython,
Python, JavaScript, Ruby, and Groovy.
Java UDF types:
Filter Functions − Used in FILTER statements; accept a Pig value and
return a Boolean.
Eval Functions − Used in FOREACH-GENERATE; accept a Pig value
and return a Pig result.
Algebraic Functions − Act on inner bags in FOREACH-GENERATE,
performing full MapReduce on an inner bag.
Piggybank is the Java UDF repository for sharing and reusing UDFs.
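A minimal sketch of a Java filter function; the class name IsAdult, its field layout, and the age threshold are illustrative assumptions, not part of Pig.

    // A hedged sketch of a filter UDF built on Pig's FilterFunc base class.
    import java.io.IOException;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    public class IsAdult extends FilterFunc {
        @Override
        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return false;                    // treat missing values as "filter out"
            }
            int age = (Integer) input.get(0);    // first field is expected to be an int
            return age >= 18;                    // keep only rows where age >= 18
        }
    }

In a Pig Latin script, such a UDF would typically be made available with REGISTER and invoked inside a FILTER ... BY expression.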
o Data Processing Operators
Loading and Storing: LOAD, STORE
Filtering: FILTER, DISTINCT, FOREACH…GENERATE, STREAM
Grouping and Joining: GROUP, COGROUP, JOIN, CROSS
Sorting: ORDER, LIMIT
Combining and Splitting: UNION, SPLIT
Diagnostics: DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
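A hedged sketch showing how several of these operators can be chained from embedded Java via PigServer; the input paths, schemas, and aliases are assumptions of this sketch.

    // A sketch chaining LOAD, JOIN, GROUP, FOREACH...GENERATE, and ORDER from embedded Java.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class OperatorPipeline {
        public static void main(String[] args) throws IOException {
            PigServer pig = new PigServer(ExecType.LOCAL);
            pig.registerQuery("users  = LOAD '/tmp/users.txt'  AS (id:int, name:chararray);");
            pig.registerQuery("orders = LOAD '/tmp/orders.txt' AS (uid:int, amount:double);");
            pig.registerQuery("joined  = JOIN users BY id, orders BY uid;");
            pig.registerQuery("grouped = GROUP joined BY users::name;");
            pig.registerQuery("totals  = FOREACH grouped GENERATE group, SUM(joined.orders::amount);");
            pig.registerQuery("sorted  = ORDER totals BY $1 DESC;");
            // openIterator plays the role of DUMP: it executes the plan and streams the results
            Iterator<Tuple> it = pig.openIterator("sorted");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
            pig.shutdown();
        }
    }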
Apache Hive Architecture and Installation
Architecture Components
o User Interface (UI): Provides interfaces such as Hive Web UI, Hive CLI and Hive
HDInsight for submitting queries and operations
o Hive Server (Thrift): Receives client requests (JDBC/ODBC, Beeline, Hue) and
forwards them to the Driver
o Driver: Manages client sessions, forwards queries to the Compiler, and handles
result fetching
o Compiler: Parses HiveQL, performs semantic analysis, and generates an execution
plan (DAG of MapReduce/HDFS tasks) using metadata from the Metastore
o Metastore: Central repository storing table/partition schemas, column metadata,
serializers/deserializers and HDFS file locations in an RDBMS
o Execution Engine: Executes the compiled DAG, managing task dependencies and
coordinating with Hadoop/YARN
o HDFS/HBase: Underlying storage for table data, accessed via Hive’s execution engine
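A minimal sketch of the client side of this architecture: a Java program connecting through the Thrift-based HiveServer2 endpoint via JDBC, after which the Driver, Compiler, and Execution Engine handle the query as described above. The host, port, user, and table name are assumptions, and the Hive JDBC driver jar must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 listens on port 10000 by default
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hiveuser", "");
            Statement stmt = conn.createStatement();
            // The Driver/Compiler turn this HiveQL into a DAG of jobs behind the scenes
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table");
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }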
Installation
o Prerequisite: Install the same Hadoop version locally as used by your target cluster
o Download and unpack Hive:
      tar xzf apache-hive-x.y.z-bin.tar.gz
o Set environment variables:
      export HIVE_HOME=~/sw/apache-hive-x.y.z-bin
      export PATH=$PATH:$HIVE_HOME/bin
o Launch the Hive shell:
      hive
Hive Shell
Primary CLI for issuing HiveQL commands; heavily influenced by MySQL syntax
Features:
o Commands terminated by semicolon; supports ! prefix for OS commands and dfs for
HDFS operations
o Command history navigation via ↑/↓ or Ctrl-P/Ctrl-N; line editing (Ctrl-E to end) and
Tab-based auto-completion
o Case-insensitive keywords (except in string comparisons)
Modes:
o Interactive: Launch with hive; enter queries at hive> prompt
o Non-Interactive: Run scripts using hive -f script.q or single statements with hive -e
"SELECT …"
Hive Services
Hive CLI: Shell for executing HiveQL queries and commands
Hive Web UI: Web-based GUI alternative to CLI
Hive MetaStore: Stores metadata (schemas, serializers, HDFS mappings) in a relational DB
Hive Server (Thrift): Endpoint for remote client connections, forwarding requests to Driver
Hive Driver: Manages sessions, forwards queries to Compiler, returns results to clients
Hive Compiler: Parses HiveQL, performs semantic checks, generates execution plan using
Metastore data
Hive Execution Engine: Executes the DAG of MapReduce/HDFS tasks, handling dependencies
Hive Metastore
Central repository for table and partition definitions, column types, serializers/deserializers,
and file paths; stored in an RDBMS (e.g., MySQL, PostgreSQL)
Shared by Hive, Spark, Impala, and other tools to ensure consistent metadata across engines
Comparison with Traditional Databases
o Purpose
   Hive: Data warehousing, OLAP
   RDBMS: OLTP and OLAP
o Schema Enforcement
   Hive: Schema on read (schema applied at query time)
   RDBMS: Schema on write (schema enforced on data load)
o Scalability
   Hive: Highly scalable at low cost (HDFS)
   RDBMS: Scale-up (costly)
o Read/Write Patterns
   Hive: Write once, read many
   RDBMS: Read and write many times
o Transactions & Indexes
   Hive: Not supported
   RDBMS: Supported
o Data Organization
   Hive: Supports normalized and de-normalized data; automatic partitioning
   RDBMS: Normalized data; manual sharding
o Table Deletion
   Hive: DROP removes metadata only (external tables) or metadata plus data (managed tables)
   RDBMS: DROP removes data and metadata
HiveQL
SQL-like query language (HiveQL, or HQL) that compiles into MapReduce jobs; supports
SELECT, INSERT, and UPDATE-like operations
Extensions: Window functions (SQL:2003), multi-table inserts, TRANSFORM/MAP/REDUCE
clauses
Data Types:
o Primitive: numeric types, Boolean, string (CHAR, VARCHAR), TIMESTAMP
o Complex: ARRAY, MAP, STRUCT
File Formats: TEXTFILE, SEQUENCEFILE, ORC, RCFILE
Hive Tables
Managed (Internal) Tables: Hive owns both data and metadata; stored by default under
/user/hive/warehouse; DROP TABLE removes data+metadata
External Tables: Hive manages only metadata; data resides at user-specified HDFS/Storage
locations; DROP TABLE removes metadata only
Partitioning & Bucketing: (not detailed in provided docs)
Querying Data & User-Defined Functions
Hive supports three types of UDFs (written in Java):
o UDF: Operates on single row → single row (e.g., string/math functions)
o UDAF: Aggregates multiple rows → single result (e.g., COUNT, MAX)
o UDTF: Single row → multiple rows (table-generating, e.g., EXPLODE)
APIs:
o Simple UDF API (org.apache.hadoop.hive.ql.exec.UDF): Use for writable types,
implement evaluate()
o Generic UDF (org.apache.hadoop.hive.ql.udf.generic.GenericUDF): Use for complex
types, manually handle ObjectInspectors
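A minimal sketch using the simple UDF API named above; the class name Lowercase and its behavior are illustrative, not part of Hive.

    // A hedged sketch of a simple Hive UDF operating on writable types.
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class Lowercase extends UDF {
        // Hive finds this evaluate() method by reflection
        public Text evaluate(Text input) {
            if (input == null) {
                return null;                       // pass NULLs through unchanged
            }
            return new Text(input.toString().toLowerCase());
        }
    }

Such a UDF is typically packaged into a jar, added with ADD JAR, and registered with CREATE TEMPORARY FUNCTION before it can be called in a query.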
Sorting & Aggregating
SORT BY: Sorts rows within each reducer by the specified columns (per-reducer order, not a
total order); use ASC/DESC; numeric vs lexicographical ordering depends on the column type
Aggregate Functions: Take a set of values and return a single value; can be used with or
without GROUP BY (commonly with GROUP BY)
MapReduce Scripts
Hive's TRANSFORM, MAP, and REDUCE clauses let a query invoke an external script or program,
in the style of Hadoop Streaming
The script reads input rows from standard input and writes result rows to standard output, so
custom record-level logic can be plugged into a HiveQL query
Best suited when the required transformation is hard to express in plain HiveQL; the extra
process and I/O overhead make it less attractive when built-in operators or UDFs suffice
Joins & Subqueries
Joins: Combine rows from multiple tables based on equality predicates; supports INNER,
LEFT OUTER, RIGHT OUTER, FULL OUTER; only equality joins; non-commutative and left-
associative
Subqueries: Hive supports subqueries in FROM and WHERE clauses (since Hive 0.13); a
subquery returns a result set used by its outer query
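A hedged sketch issuing a HiveQL outer join together with a WHERE-clause subquery over JDBC (reusing the HiveServer2 connection pattern shown earlier); the customers/orders tables and their columns are hypothetical, and the two features are combined purely to illustrate the syntax.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJoinExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default");
                 Statement stmt = conn.createStatement();
                 // LEFT OUTER JOIN keeps every customer row; the ON clause must be an equality predicate
                 ResultSet rs = stmt.executeQuery(
                         "SELECT c.name, o.total "
                       + "FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.cust_id) "
                       + "WHERE c.id IN (SELECT cust_id FROM orders WHERE total > 100)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }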
HBase Concepts
o What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop
file system. It is an open-source project and is horizontally scalable.
HBase has a data model similar to Google's Bigtable, designed to provide
quick random access to huge amounts of structured data. It leverages the
fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data
consumer reads/accesses the data in HDFS randomly using HBase. HBase sits
on top of the Hadoop File System and provides read and write access.
Clients (HBase Client API)
o Class HBaseConfiguration
Adds HBase configuration files to a Configuration. This class belongs to the
org.apache.hadoop.hbase package.
static org.apache.hadoop.conf.Configuration create()
This method creates a Configuration with HBase resources.
o Class HTable (org.apache.hadoop.hbase.client)
Represents an HBase table, used to communicate with a single HBase table.
Constructors
HTable()
HTable(TableName tableName, ClusterConnection connection,
ExecutorService pool)
Using this constructor, you can create an object to access an
HBase table.
Key Methods
void close() – Releases all the resources of the HTable.
void delete(Delete delete) – Deletes the specified cells/row.
boolean exists(Get get) – Tests the existence of columns in the table,
as specified by Get.
Result get(Get get) – Retrieves certain cells from a given row.
Configuration getConfiguration() – Returns the Configuration object
used by this instance.
TableName getName() – Returns the table name instance of this
table.
HTableDescriptor getTableDescriptor() – Returns the table descriptor
for this table.
byte[] getTableName() – Returns the name of this table.
void put(Put put) – Inserts data into the table.
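A minimal sketch tying these classes together. Note that the current HBase client obtains a Table from a Connection (via ConnectionFactory) rather than constructing HTable directly; the table name "employee", the column family "personal", and the cell values here are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // loads hbase-site.xml resources
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("employee"))) {
                // Write: row key "row1", column family "personal", qualifier "name"
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);
                // Read the same cell back
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }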
Example Schema
o Table is a collection of rows.
o Row is a collection of column families.
o Column family is a collection of columns.
o Column is a collection of key value pairs.
o Example schema of a table in HBase (four column families, each with columns col1-col3):

      Rowid | Column Family 1 | Column Family 2 | Column Family 3 | Column Family 4
            | col1 col2 col3  | col1 col2 col3  | col1 col2 col3  | col1 col2 col3
      ------+-----------------+-----------------+-----------------+-----------------
        1   |                 |                 |                 |
        2   |                 |                 |                 |
        3   |                 |                 |                 |
HBase vs RDBMS
o SQL
RDBMS: It requires SQL (Structured Query Language).
HBase: SQL is not required in HBase.
o Schema
RDBMS: It has a fixed schema.
HBase: It does not have a fixed schema and allows for the addition of
columns on the fly.
o Database Type
RDBMS: It is a row-oriented database.
HBase: It is a column-oriented database.
o Scalability
RDBMS: Allows for scaling up by upgrading existing servers.
HBase: Scale-out is possible by adding new servers to the cluster.
o Nature
RDBMS: Static in nature.
HBase: Dynamic in nature.
o Data retrieval
RDBMS: Slower retrieval of data.
HBase: Faster retrieval of data.
o Rule
RDBMS: Follows ACID (Atomicity, Consistency, Isolation, Durability).
HBase: Follows CAP (Consistency, Availability, Partition-tolerance).
o Type of data
RDBMS: Handles structured data.
HBase: Handles structured, unstructured, and semi-structured data.
o Sparse data
RDBMS: Cannot handle sparse data.
HBase: Can handle sparse data.
o Volume of data
RDBMS: Determined by server configuration.
HBase: Depends on the number of machines deployed.
o Transaction Integrity
RDBMS: Guarantees transaction integrity.
HBase: No such guarantee.
o Referential Integrity
RDBMS: Supported.
HBase: No built-in support.
o Normalize
RDBMS: Data can be normalized.
HBase: Data is not normalized; no logical relationships across tables.
o Table size
RDBMS: Designed for small tables; difficult to scale.
HBase: Designed for large tables; scales horizontally.
Advanced Usage (Applications & History)
o Applications of HBase
It is used whenever there is a need for write-heavy applications.
HBase is used whenever we need to provide fast random access to available
data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase
internally.
o HBase History
Nov 2006 – Google released the paper on BigTable.
Feb 2007 – Initial HBase prototype was created as a Hadoop contribution.
Oct 2007 – First usable HBase (with Hadoop 0.15.0) released.
Jan 2008 – HBase became a Hadoop sub-project.
Oct 2008 – HBase 0.18.1 released.
Jan 2009 – HBase 0.19.0 released.
Sept 2009 – HBase 0.20.0 released.
May 2010 – HBase graduated to Apache top-level project.
Schema Design
o An HBase table can scale to billions of rows and a very large number of columns;
it supports terabytes of data with high read/write throughput at low latency.
o HBase Table Schema Design General Concepts
Row key: Each table is indexed on a row key; data is sorted lexicographically
by row key. No secondary indices.
Atomicity: Avoid designs that require atomicity across multiple rows; all
operations are atomic at the row level.
Even distribution: Reads/writes should be uniformly distributed across all
nodes; design row keys so related entities are in adjacent rows.
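One common technique for even distribution is to prefix ("salt") the row key with a small hash-derived bucket so writes spread across regions while rows for the same entity remain adjacent. A minimal sketch; the bucket count and key layout are assumptions of this sketch, not an HBase API.

    import java.nio.charset.StandardCharsets;

    public class SaltedRowKey {
        private static final int BUCKETS = 16;   // assumed number of salt buckets

        // Builds a key like "07|user123|20240101"; rows for the same user stay adjacent
        public static byte[] rowKey(String userId, String timestamp) {
            int bucket = Math.floorMod(userId.hashCode(), BUCKETS);   // deterministic salt
            String key = String.format("%02d|%s|%s", bucket, userId, timestamp);
            return key.getBytes(StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            System.out.println(new String(rowKey("user123", "20240101"), StandardCharsets.UTF_8));
        }
    }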
o Size Limits
Row keys: 4 KB per key
Column families: ≤ 10 per table
Column qualifiers: 16 KB per qualifier
Individual cell values: < 10 MB
All values in a single row: ≤ 10 MB
Advanced Indexing
o Advanced indexing is triggered when the selection object (obj) is:
an ndarray of type integer or Boolean
a tuple with at least one sequence object
a non-tuple sequence object
o Advanced indexing returns a copy of data rather than a view.
o Two types of advanced indexing:
Integer Indexing: Selects arbitrary items based on N-dimensional indices.
Boolean Array Indexing: Selects items based on Boolean conditions.
ZooKeeper – Monitoring a Cluster
o Key Performance Metrics to Monitor:
Resource utilization details
Thread and JVM usage
Performance statistics
Cluster and configuration details
o Resource utilization details: Automatically discover clusters, monitor heap/non-heap
memory on ZooKeeper nodes, and collect and alert on GC iterations, heap size, system usage,
and threads.
o Thread and JVM usage: Analyze JVM thread dumps; track daemon, peak, and live
thread counts.
o Performance statistics: Measure server response time to client requests, queued
requests, connections, and network usage.
o Cluster and configuration details: Track Znode count, watchers, follower count,
leader selection stats, session times, and session growth.
Building Applications with ZooKeeper
o A Configuration Service:
ZooKeeper can act as a highly available store for configuration; clients can
retrieve or update configuration files and receive notifications of changes via
watches.
The write() method hides the difference between creating/updating a znode
by checking existence before performing the appropriate operation.
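A minimal sketch of such a write() method using the plain ZooKeeper Java client; the ConfigWriter class name and the use of UTF-8 strings are assumptions of this sketch.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigWriter {
        private final ZooKeeper zk;

        public ConfigWriter(ZooKeeper zk) {
            this.zk = zk;
        }

        // Hides the create-vs-update difference: create the znode if it is missing,
        // otherwise overwrite its data (version -1 means "any version").
        public void write(String path, String value)
                throws KeeperException, InterruptedException {
            byte[] data = value.getBytes(StandardCharsets.UTF_8);
            if (zk.exists(path, false) == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1);
            }
        }
    }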
o The Resilient ZooKeeper Application:
Programs so far have assumed a reliable network; in reality failures occur,
which is why ZooKeeper operations declare InterruptedException and KeeperException.
InterruptedException: Thrown if a ZooKeeper operation is interrupted;
adheres to Java’s standard interrupt mechanism.
KeeperException: Thrown on server errors or communication failures;
subclasses like NoNodeException. Handling strategies: catch and test code or
catch specific subclasses.
KeeperException Categories:
State exceptions: Operation cannot be applied to znode tree (e.g.,
BadVersionException).
Recoverable exceptions: Connection loss; ZooKeeper reconnects to
preserve session.
Unrecoverable exceptions: Session expiration or auth failure;
ephemeral znodes lost, application must rebuild state.
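A minimal sketch of the "catch specific subclasses" handling strategy for these three categories; the read() helper, its retry-by-recursion on connection loss, and the exception it rethrows are illustrative assumptions.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class ReadConfig {
        // Reads a znode's data, illustrating the three KeeperException categories
        public static byte[] read(ZooKeeper zk, String path) throws InterruptedException {
            try {
                return zk.getData(path, false, null);
            } catch (KeeperException.NoNodeException e) {
                return null;                         // state exception: the znode does not exist
            } catch (KeeperException.ConnectionLossException e) {
                return read(zk, path);               // recoverable: ZooKeeper reconnects, so retry
            } catch (KeeperException.SessionExpiredException e) {
                // unrecoverable: ephemeral znodes are gone; the caller must rebuild its state
                throw new IllegalStateException("ZooKeeper session expired", e);
            } catch (KeeperException e) {
                throw new IllegalStateException("Unexpected ZooKeeper error", e);
            }
        }
    }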
o A Lock Service:
Distributed locks provide mutual exclusion; used for leader election.
Clients create sequential ephemeral znodes under a lock znode; the lowest
sequence acquires the lock.
The herd effect: all clients watching the lock znode are notified of any child
change, causing traffic spikes; the solution is to watch only the predecessor
znode.
Recoverable exceptions: Create failures due to connection loss; use
identifier-embedded znode names to detect successful creations.
Unrecoverable exceptions: Session expiration deletes ephemeral znodes;
application must detect loss of lock, clean up, and retry.
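A minimal sketch of acquiring such a lock with the plain ZooKeeper client, watching only the predecessor znode; the lock path and the simplified error handling (no connection-loss or session-expiry recovery) are assumptions of this sketch.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SimpleLock {
        private final ZooKeeper zk;
        private final String lockPath;   // e.g. "/locks/mylock", assumed to exist already

        public SimpleLock(ZooKeeper zk, String lockPath) {
            this.zk = zk;
            this.lockPath = lockPath;
        }

        public void acquire() throws KeeperException, InterruptedException {
            // Sequential ephemeral child, e.g. /locks/mylock/lock-0000000007
            String myNode = zk.create(lockPath + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = myNode.substring(myNode.lastIndexOf('/') + 1);

            while (true) {
                List<String> children = zk.getChildren(lockPath, false);
                Collections.sort(children);
                int myIndex = children.indexOf(myName);
                if (myIndex == 0) {
                    return;                         // lowest sequence number holds the lock
                }
                // Watch only the immediate predecessor, not the parent, to avoid the herd effect
                String predecessor = lockPath + "/" + children.get(myIndex - 1);
                CountDownLatch latch = new CountDownLatch(1);
                if (zk.exists(predecessor, event -> latch.countDown()) == null) {
                    continue;                       // predecessor already gone; re-check the children
                }
                latch.await();                      // woken when the predecessor znode goes away
            }
        }
    }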
IBM Big Data Tools Overview
1. IBM InfoSphere
What it is: A suite of data integration and governance tools.
Main Features:
o Data quality management
o Data cleansing and profiling
o ETL (Extract, Transform, Load) operations
Use Case: Used to integrate, cleanse, and prepare data from diverse sources for analytics and
reporting.
2. IBM BigInsights
What it is: IBM’s distribution of Apache Hadoop for big data analytics.
Main Features:
o Supports Hadoop ecosystem components (Hive, Pig, HBase)
o Includes advanced analytics tools (machine learning, text analytics)
o Has a web-based console for administration
Use Case: Large-scale analysis of structured and unstructured data.
3. IBM BigSheets
What it is: A web-based spreadsheet-like tool built on top of BigInsights.
Main Features:
o Lets non-programmers analyze large data sets using a familiar interface
o Uses spreadsheets to run Hadoop jobs in the background
Use Case: Business users can visually explore and analyze big data without needing to write
MapReduce code.
4. IBM Big SQL
What it is: An SQL-on-Hadoop engine by IBM.
Main Features:
o Supports ANSI SQL queries over Hadoop data (including HDFS, Hive, HBase)
o Federates queries across traditional RDBMS (DB2, Oracle) and Hadoop
o Strong security and performance optimization
Use Case: Enables analysts and DBAs to query big data using familiar SQL syntax across
hybrid data sources.