Unit 5 Short

The document discusses the Hadoop ecosystem frameworks, focusing on Pig, Hive, and HBase for big data applications. Pig is a high-level platform for processing large datasets with a scripting language called Pig Latin, while Hive serves as a data warehouse infrastructure for structured data processing, and HBase is a column-oriented database for real-time access to large datasets. Each framework has unique applications, benefits, and execution modes, facilitating efficient data analysis and management.


 Hadoop Eco System Frameworks: Applications on Big Data using Pig, Hive and HBase

o Application of Big Data using Pig, Hive and HBase

2. Pig:

 Pig is a high-level platform or tool used to process large datasets.

 It provides a high level of abstraction for processing over MapReduce.

 It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.

 Applications:

 Exploring large datasets with Pig scripting.

 Providing support for ad-hoc queries across large datasets.

 Prototyping algorithms for processing large datasets.

 Processing time-sensitive data loads.

 Collecting large volumes of data in the form of search logs and web crawls.

 Deriving analytical insights through sampling.

3. Hive:

 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.

 It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.

 It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

 Benefits:

 Ease of use

 Accelerated initial insertion of data

 Superior scalability, flexibility, and cost-efficiency

 Streamlined security

 Low overhead

 Exceptional working capacity

4. HBase:

 HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS).

 HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.

 HBase supports writing applications in Apache Avro, REST, and Thrift.

 Applications:

 Medical

 Sports

 Web

 Oil and petroleum

 e-commerce

 Pig

o Introduction to PIG

 What is Apache Pig?

 Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.

 To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.

 To analyze data using Apache Pig, programmers need to write scripts using
Pig Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
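
As a rough illustration of this flow, the sketch below embeds Pig in Java through the PigServer class (the input path, field names, and filter are made-up placeholders; it assumes the pig client library is on the classpath). Each registered Pig Latin statement is handed to the Pig Engine, which compiles the script into MapReduce jobs (or runs it locally in local mode).

    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class PigEngineSketch {
        public static void main(String[] args) throws Exception {
            // Local mode for testing; ExecType.MAPREDUCE would run against a cluster/HDFS.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Each Pig Latin statement is parsed and compiled into Map and Reduce tasks
            // behind the scenes by the Pig Engine.
            pig.registerQuery("logs = LOAD 'input/search_logs.txt' AS (user:chararray, query:chararray);");
            pig.registerQuery("long_queries = FILTER logs BY SIZE(query) > 20;");

            // openIterator triggers execution (like DUMP) and streams back the result tuples.
            Iterator<Tuple> it = pig.openIterator("long_queries");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
            pig.shutdown();
        }
    }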

o Execution Modes of Pig

 Apache Pig Execution Modes

 You can run Apache Pig in two modes, namely, local mode and MapReduce (HDFS) mode.

 Local Mode

 In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

 MapReduce Mode

 MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.

 Script Execution Modes

 Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator).

 Batch Mode (Script) − You can run Apache Pig in Batch mode by
writing the Pig Latin script in a single file with .pig extension.

 Embedded Mode (UDF) − Apache Pig lets you define your own functions (User Defined Functions) in programming languages such as Java and use them in your script.

o Comparison of Pig with Databases

 There are several differences between Pig Latin and SQL, and between Pig and relational database management systems (RDBMSs) in general.

 The most significant difference is that Pig Latin is a data flow programming
language, whereas SQL is a declarative programming language.

 Pig Latin programs are step-by-step transformations on input relations; SQL statements declare constraints that the output must satisfy.

 RDBMSs store data in tables with predefined schemas; Pig is more relaxed
and allows optional schemas defined at runtime.

 Pig operates on any source of tuples (e.g., text files) without an import
process; data is loaded from the filesystem (usually HDFS) as the first step.

o Grunt

 Grunt is the interactive shell for Apache Pig.

 Provides line-editing facilities like GNU Readline (e.g., Ctrl-E moves to end of
line).

 Remembers command history (Ctrl-P/Ctrl-N or ↑/↓) and offers tab-completion for Pig Latin keywords and functions.

 Supports utility commands: clear, help, history, quit, set, exec, kill, run.

 Allows OS commands via sh and HDFS commands via fs.

o Pig Latin

 Pig Latin is a high-level data flow language with SQL-like semantics.

 Abstracts Java MapReduce into textual notation; simple to learn for those
familiar with SQL.

 Supports primitive types (int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal) and complex types (tuple, bag, map).
 Statements end with a semicolon (;), take relations as input, and produce
relations as output.

o User Defined Functions

 Apache Pig provides full UDF support in Java, with more limited support in Jython, Python, JavaScript, Ruby, and Groovy.

 Java UDF types:

 Filter Functions − Used in FILTER statements; accept a Pig value and return a Boolean.

 Eval Functions − Used in FOREACH-GENERATE; accept a Pig value and return a Pig result.

 Algebraic Functions − Act on inner bags in FOREACH-GENERATE, performing full MapReduce operations on an inner bag.

 Piggybank is the Java UDF repository for sharing and reusing UDFs.
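
As a minimal, hedged sketch of a Java Eval UDF (the class name, jar name, and relation names below are illustrative): extend EvalFunc, implement exec(), package the class into a jar, REGISTER it, and call it from FOREACH…GENERATE.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Example Eval function: upper-cases a chararray field.
    public class ToUpper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

    // In a Pig Latin script (shown as comments):
    //   REGISTER myudfs.jar;
    //   upper_names = FOREACH users GENERATE ToUpper(name);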

o Data Processing Operators

 Loading and Storing: LOAD, STORE

 Filtering: FILTER, DISTINCT, FOREACH…GENERATE, STREAM

 Grouping and Joining: GROUP, COGROUP, JOIN, CROSS

 Sorting: ORDER, LIMIT

 Combining and Splitting: UNION, SPLIT

 Diagnostics: DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE

Apache Hive Architecture and Installation

 Architecture Components

o User Interface (UI): Provides interfaces such as Hive Web UI, Hive CLI and Hive
HDInsight for submitting queries and operations

o Hive Server (Thrift): Receives client requests (JDBC/ODBC, Beeline, Hue) and
forwards them to the Driver

o Driver: Manages client sessions, forwards queries to the Compiler, and handles
result fetching

o Compiler: Parses HiveQL, performs semantic analysis, and generates an execution plan (a DAG of MapReduce/HDFS tasks) using metadata from the Metastore

o Metastore: Central repository storing table/partition schemas, column metadata, serializers/deserializers, and HDFS file locations in an RDBMS

o Execution Engine: Executes the compiled DAG, managing task dependencies and
coordinating with Hadoop/YARN

o HDFS/HBase: Underlying storage for table data, accessed via Hive’s execution engine
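
To make the client path concrete, here is a small, hedged sketch of a remote client going through the Thrift server (HiveServer2) with the Hive JDBC driver; the host, credentials, and the web_logs table are placeholders, and the hive-jdbc jar is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // JDBC clients talk to HiveServer2 (Thrift), which hands the query to the
            // Driver -> Compiler -> Execution Engine pipeline described above.
            String url = "jdbc:hive2://hiveserver.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = conn.createStatement()) {

                // The Compiler consults the Metastore for the table's schema and HDFS location.
                ResultSet rs = stmt.executeQuery(
                        "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }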
 Installation

o Prerequisite: Install same Hadoop version locally as used by your target cluster

o Download and unpack Hive:

o tar xzf apache-hive-x.y.z-bin.tar.gz

o Set environment variables:

o export HIVE_HOME=~/sw/apache-hive-x.y.z-bin

o export PATH=$PATH:$HIVE_HOME/bin

o Launch Hive shell:

o hive

Hive Shell

 Primary CLI for issuing HiveQL commands; heavily influenced by MySQL syntax

 Features:

o Commands terminated by semicolon; supports ! prefix for OS commands and dfs for
HDFS operations

o Command history navigation via ↑/↓ or Ctrl-P/Ctrl-N; line editing (Ctrl-E to end) and
Tab-based auto-completion

o Case-insensitive keywords (except in string comparisons)

 Modes:

o Interactive: Launch with hive; enter queries at hive> prompt

o Non-Interactive: Run scripts using hive -f script.q or single statements with hive -e
"SELECT …"

Hive Services

 Hive CLI: Shell for executing HiveQL queries and commands

 Hive Web UI: Web-based GUI alternative to CLI

 Hive MetaStore: Stores metadata (schemas, serializers, HDFS mappings) in a relational DB

 Hive Server (Thrift): Endpoint for remote client connections, forwarding requests to Driver

 Hive Driver: Manages sessions, forwards queries to Compiler, returns results to clients

 Hive Compiler: Parses HiveQL, performs semantic checks, generates execution plan using
Metastore data

 Hive Execution Engine: Executes the DAG of MapReduce/HDFS tasks, handling dependencies

Hive Metastore

 Central repository for table and partition definitions, column types, serializers/deserializers,
and file paths; stored in an RDBMS (e.g., MySQL, PostgreSQL)

 Shared by Hive, Spark, Impala, and other tools to ensure consistent metadata across engines

Comparison with Traditional Databases

Aspect                 | Hive Data Warehouse                                                 | Traditional RDBMS
Purpose                | Data warehousing, OLAP                                              | OLTP and OLAP
Schema Enforcement     | Schema on READ: schema applied at query time                        | Schema on WRITE: schema enforced on data load
Scalability            | Highly scalable at low cost (HDFS)                                  | Scale-up (costly)
Read/Write Patterns    | Write once, read many                                               | Read/write many
Transactions & Indexes | Not supported                                                       | Supported
Data Organization      | Supports normalized & de-normalized data, automatic partitioning    | Normalized data, manual sharding
Table Deletion         | Drop removes metadata only (external) or metadata + data (managed)  | Drop removes data and metadata

HiveQL

 SQL-like query language (HiveQL, also called HQL) that compiles into MapReduce jobs; supports SELECT, INSERT, and UPDATE-like operations

 Extensions: window functions (SQL:2003), multi-table inserts, TRANSFORM/MAP/REDUCE clauses

 Data Types:

o Primitive: numeric types, Boolean, string (CHAR, VARCHAR), TIMESTAMP

o Complex: ARRAY, MAP, STRUCT

 File Formats: TEXTFILE, SEQUENCEFILE, ORC, RCFILE

Hive Tables

 Managed (Internal) Tables: Hive owns both data and metadata; stored by default under
/user/hive/warehouse; DROP TABLE removes data+metadata
 External Tables: Hive manages only metadata; data resides at user-specified HDFS/Storage
locations; DROP TABLE removes metadata only

 Partitioning & Bucketing: (not detailed in provided docs)

Querying Data & User-Defined Functions

 Hive supports three types of UDFs (written in Java):

o UDF: Operates on single row → single row (e.g., string/math functions)

o UDAF: Aggregates multiple rows → single result (e.g., COUNT, MAX)

o UDTF: Single row → multiple rows (table-generating, e.g., EXPLODE)

 APIs:

o Simple UDF API (org.apache.hadoop.hive.ql.exec.UDF): Use for writable types; implement evaluate()

o Generic UDF (org.apache.hadoop.hive.ql.udf.generic.GenericUDF): Use for complex types; manually handle ObjectInspectors
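
A minimal sketch of the simple UDF API (the Strip class and its behaviour are illustrative, not a built-in): subclass org.apache.hadoop.hive.ql.exec.UDF and provide one or more evaluate() methods over primitive/writable types.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Simple UDF: trims leading/trailing whitespace from a string column.
    public class Strip extends UDF {
        private final Text result = new Text();

        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            result.set(input.toString().trim());
            return result;
        }
    }

    // In HiveQL (as comments):
    //   ADD JAR hive-udfs.jar;
    //   CREATE TEMPORARY FUNCTION strip AS 'Strip';
    //   SELECT strip(name) FROM users;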

Sorting & Aggregating

 SORT BY: Sorts the data within each reducer by the specified columns; use ASC/DESC; ordering is numeric or lexicographical based on the column type

 Aggregate Functions: Take a set of values and return a single value; can be used with or
without GROUP BY (commonly with GROUP BY)

MapReduce Scripts

 Used for large-scale data processing by breaking data into smaller parts and creating parallel
jobs automatically

 Can be run manually or scheduled; auto-yielding on governance limits allows rescheduling without disruption

 Ideal when logic is chunk-able; less suitable for complex per-record processing due to I/O
overhead

Joins & Subqueries

 Joins: Combine rows from multiple tables based on equality predicates; supports INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER; only equality joins; non-commutative and left-associative

 Subqueries: Hive supports subqueries in FROM and WHERE clauses (since Hive 0.13); a
subquery returns a result set used by its outer query
 HBase Concepts

o What is HBase?

 HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.

 HBase is a data model similar to Google’s BigTable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).

 It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.

 One can store data in HDFS either directly or through HBase. Data consumers read and access the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and provides read and write access.

 Clients (HBase Client API)

o Class HBaseConfiguration

 Adds HBase configuration files to a Configuration. This class belongs to the org.apache.hadoop.hbase package.

 static org.apache.hadoop.conf.Configuration create()

 This method creates a Configuration with HBase resources.

o Class HTable (org.apache.hadoop.hbase.client)

 Represents an HBase table, used to communicate with a single HBase table.

 Constructors

 HTable()

 HTable(TableName tableName, ClusterConnection connection, ExecutorService pool)

 Using this constructor, you can create an object to access an HBase table.

 Key Methods

 void close() – Releases all the resources of the HTable.

 void delete(Delete delete) – Deletes the specified cells/row.

 boolean exists(Get get) – Tests the existence of columns in the table, as specified by Get.

 Result get(Get get) – Retrieves certain cells from a given row.

 Configuration getConfiguration() – Returns the Configuration object used by this instance.
 TableName getName() – Returns the table name instance of this
table.

 HTableDescriptor getTableDescriptor() – Returns the table descriptor for this table.

 byte[] getTableName() – Returns the name of this table.

 void put(Put put) – Inserts data into the table.
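
The sketch below strings these pieces together (the employee table, personal column family, and cell values are hypothetical). It uses HBaseConfiguration.create() as above; note that recent HBase releases deprecate the HTable constructors in favour of obtaining a Table from a Connection, which is what is shown here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            // Loads hbase-default.xml / hbase-site.xml into a Configuration.
            Configuration conf = HBaseConfiguration.create();

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("employee"))) {

                // put(Put): insert row key "1", column family "personal", column "name".
                Put put = new Put(Bytes.toBytes("1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
                table.put(put);

                // get(Get): retrieve cells from that row.
                Result result = table.get(new Get(Bytes.toBytes("1")));
                byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            } // try-with-resources calls close(), releasing table and connection resources
        }
    }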

 Example Schema

o Table is a collection of rows.

o Row is a collection of column families.

o Column family is a collection of columns.

o Column is a collection of key-value pairs.

o Example schema of a table in HBase:

    Rowid | Column Family    | Column Family    | Column Family    | Column Family
          | col1  col2  col3 | col1  col2  col3 | col1  col2  col3 | col1  col2  col3
    1     |                  |                  |                  |
    2     |                  |                  |                  |
    3     |                  |                  |                  |

 HBase vs RDBMS

o SQL

 RDBMS: It requires SQL (Structured Query Language).

 HBase: SQL is not required in HBase.

o Schema

 RDBMS: It has a fixed schema.

 HBase: It does not have a fixed schema and allows for the addition of
columns on the fly.

o Database Type

 RDBMS: It is a row-oriented database.

 HBase: It is a column-oriented database.

o Scalability

 RDBMS: Allows for scaling up by upgrading existing servers.

 HBase: Scale-out is possible by adding new servers to the cluster.

o Nature
 RDBMS: Static in nature.

 HBase: Dynamic in nature.

o Data retrieval

 RDBMS: Slower retrieval of data.

 HBase: Faster retrieval of data.

o Rule

 RDBMS: Follows ACID (Atomicity, Consistency, Isolation, Durability).

 HBase: Follows CAP (Consistency, Availability, Partition-tolerance).

o Type of data

 RDBMS: Handles structured data.

 HBase: Handles structured, unstructured, and semi-structured data.

o Sparse data

 RDBMS: Cannot handle sparse data.

 HBase: Can handle sparse data.

o Volume of data

 RDBMS: Determined by server configuration.

 HBase: Depends on the number of machines deployed.

o Transaction Integrity

 RDBMS: Guarantees transaction integrity.

 HBase: No such guarantee.

o Referential Integrity

 RDBMS: Supported.

 HBase: No built-in support.

o Normalize

 RDBMS: Data can be normalized.

 HBase: Data is not normalized; no logical relationships across tables.

o Table size

 RDBMS: Designed for small tables; difficult to scale.

 HBase: Designed for large tables; scales horizontally.

 Advanced Usage (Applications & History)

o Applications of HBase
 It is used whenever there is a need for write-heavy applications.

 HBase is used whenever we need to provide fast random access to available data.

 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

o HBase History

 Nov 2006 – Google released the paper on BigTable.

 Feb 2007 – Initial HBase prototype was created as a Hadoop contribution.

 Oct 2007 – First usable HBase (with Hadoop 0.15.0) released.

 Jan 2008 – HBase became a Hadoop sub-project.

 Oct 2008 – HBase 0.18.1 released.

 Jan 2009 – HBase 0.19.0 released.

 Sept 2009 – HBase 0.20.0 released.

 May 2010 – HBase graduated to Apache top-level project.

 Schema Design

o HBase table can scale to billions of rows and many columns based on requirements;
supports terabytes of data and high read/write throughput at low latency.

o HBase Table Schema Design General Concepts

 Row key: Each table is indexed on a row key; data is sorted lexicographically
by row key. No secondary indices.

 Atomicity: Avoid designs that require atomicity across multiple rows; all operations are atomic at the row level.

 Even distribution: Reads/writes should be uniformly distributed across all nodes; design row keys so related entities are in adjacent rows (see the row-key sketch after the size limits below).

o Size Limits

 Row keys: 4 KB per key

 Column families: ≤ 10 per table

 Column qualifiers: 16 KB per qualifier

 Individual cell values: < 10 MB

 All values in a single row: ≤ 10 MB
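
Following the even-distribution point above, one common pattern (a sketch under assumed requirements, not a prescription) is to prefix the natural key with a short hash so that sequential writes spread across regions, while keeping one entity's rows adjacent and well under the 4 KB row-key limit.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class RowKeyDesign {
        // Example row key: <2-hex-char hash prefix>_<userId>_<reversed timestamp>.
        // The hash prefix spreads writes across regions (avoids hotspotting);
        // the reversed timestamp sorts an entity's newest rows first lexicographically.
        public static String rowKey(String userId, long timestampMillis) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(userId.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x", digest[0]);      // 256 buckets
            long reversedTs = Long.MAX_VALUE - timestampMillis;    // newest first
            return prefix + "_" + userId + "_" + String.format("%019d", reversedTs);
        }

        public static void main(String[] args) throws Exception {
            System.out.println(rowKey("user42", System.currentTimeMillis()));
        }
    }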

 Advanced Indexing

o Advanced indexing is triggered when obj is:

 an ndarray of type integer or Boolean


 a tuple with at least one sequence object

 a non-tuple sequence object

o Advanced indexing returns a copy of data rather than a view.

o Two types of advanced indexing:

 Integer Indexing: Selects arbitrary items based on N-dimensional indices.

 Boolean Array Indexing: Selects items based on Boolean conditions.

 ZooKeeper – Monitoring a Cluster

o Key Performance Metrics to Monitor:

 Resource utilization details

 Thread and JVM usage

 Performance statistics

 Cluster and configuration details

o Resource utilization details: Automatically discover clusters; monitor heap/non-heap memory on znodes; collect and alert on GC iterations, heap size, system usage, and threads.

o Thread and JVM usage: Analyze JVM thread dumps; track daemon, peak, and live
thread counts.

o Performance statistics: Measure server response time to client requests, queued requests, connections, and network usage.

o Cluster and configuration details: Track Znode count, watchers, follower count,
leader selection stats, session times, and session growth.

 Building Applications with ZooKeeper

o A Configuration Service:

 ZooKeeper can act as a highly available store for configuration; clients can
retrieve or update configuration files and receive notifications of changes via
watches.

 The write() method hides the difference between creating and updating a znode by checking existence before performing the appropriate operation.
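
A hedged sketch of that write() idea (the class below is illustrative, not the book's exact helper): test whether the znode exists, then either create it or overwrite its data; readers can set a watch (e.g., via getData with a Watcher) to be notified of changes.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigStoreSketch {
        private final ZooKeeper zk;

        public ConfigStoreSketch(ZooKeeper zk) {
            this.zk = zk;
        }

        // Hides the create-vs-update difference by testing existence first.
        public void write(String path, String value) throws KeeperException, InterruptedException {
            byte[] data = value.getBytes(StandardCharsets.UTF_8);
            if (zk.exists(path, false) == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1);   // version -1 matches any version
            }
        }
    }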

o The Resilient ZooKeeper Application:

 The simple programs so far assume reliable networks; in practice, ZooKeeper operations declare InterruptedException and KeeperException, which a resilient application must handle.

 InterruptedException: Thrown if a ZooKeeper operation is interrupted; adheres to Java’s standard interrupt mechanism.
 KeeperException: Thrown on server errors or communication failures; has subclasses such as NoNodeException. Handling strategies: catch KeeperException and test its error code, or catch specific subclasses.

 KeeperException Categories:

 State exceptions: The operation cannot be applied to the znode tree (e.g., BadVersionException).

 Recoverable exceptions: Connection loss; ZooKeeper reconnects to preserve the session.

 Unrecoverable exceptions: Session expiration or authentication failure; ephemeral znodes are lost and the application must rebuild its state.

o A Lock Service:

 Distributed locks provide mutual exclusion; used for leader election.

 Clients create sequential ephemeral znodes under a lock znode; the lowest
sequence acquires the lock.

 The herd effect: All clients watching the lock znode are notified on any child change, causing traffic spikes; the solution is to watch only the predecessor znode.

 Recoverable exceptions: Create failures due to connection loss; use identifier-embedded znode names to detect successful creations.

 Unrecoverable exceptions: Session expiration deletes ephemeral znodes; the application must detect the loss of the lock, clean up, and retry.
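
A condensed, hedged sketch of that recipe (the lock path and node prefix are illustrative, and connection-loss/session-expiry handling is omitted for brevity): create an ephemeral sequential znode, then wait on the immediate predecessor rather than the parent to avoid the herd effect.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LockSketch {
        // Acquires the lock under lockPath (e.g. "/locks/resource-1"), which must already exist.
        public static String acquire(ZooKeeper zk, String lockPath) throws Exception {
            // Ephemeral + sequential: the znode disappears automatically if our session dies.
            String ourNode = zk.create(lockPath + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String ourName = ourNode.substring(ourNode.lastIndexOf('/') + 1);

            while (true) {
                List<String> children = zk.getChildren(lockPath, false);
                Collections.sort(children);
                int index = children.indexOf(ourName);
                if (index == 0) {
                    return ourNode;                       // lowest sequence number: lock held
                }
                // Watch only the immediate predecessor to avoid the herd effect.
                String predecessor = lockPath + "/" + children.get(index - 1);
                CountDownLatch latch = new CountDownLatch(1);
                if (zk.exists(predecessor, event -> latch.countDown()) != null) {
                    latch.await();                        // woken when the predecessor goes away
                }
            }
        }

        public static void release(ZooKeeper zk, String ourNode) throws Exception {
            zk.delete(ourNode, -1);
        }
    }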

🧰 IBM Big Data Tools Overview

1. IBM InfoSphere

 What it is: A suite of data integration and governance tools.

 Main Features:

o Data quality management

o Data cleansing and profiling

o ETL (Extract, Transform, Load) operations

 Use Case: Used to integrate, cleanse, and prepare data from diverse sources for analytics and
reporting.

2. IBM BigInsights

 What it is: IBM’s distribution of Apache Hadoop for big data analytics.
 Main Features:

o Supports Hadoop ecosystem components (Hive, Pig, HBase)

o Includes advanced analytics tools (machine learning, text analytics)

o Has a web-based console for administration

 Use Case: Large-scale analysis of structured and unstructured data.

3. IBM BigSheets

 What it is: A web-based spreadsheet-like tool built on top of BigInsights.

 Main Features:

o Lets non-programmers analyze large data sets using a familiar interface

o Uses spreadsheets to run Hadoop jobs in the background

 Use Case: Business users can visually explore and analyze big data without needing to write
MapReduce code.

4. IBM Big SQL

 What it is: An SQL-on-Hadoop engine by IBM.

 Main Features:

o Supports ANSI SQL queries over Hadoop data (including HDFS, Hive, HBase)

o Federates queries across traditional RDBMS (DB2, Oracle) and Hadoop

o Strong security and performance optimization

 Use Case: Enables analysts and DBAs to query big data using familiar SQL syntax across
hybrid data sources.
