Architecture and Working of Hive
The major components of Hive and their interaction with Hadoop are shown in the figure below, and each component is described in turn:
User Interface (UI) –
As the name suggests, the user interface sits between the user and Hive. It enables users to submit queries and other operations to the system. Hive supports the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
Hive Server – Also referred to as the Apache Thrift Server, it accepts requests from different clients and forwards them to the Hive Driver.
Driver –
The driver receives the user's queries from the interface. It implements the notion of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of the table and partition metadata obtained from the metastore.
Metastore –
The metastore stores all the structural information of the tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a relational database server to hold this schema or metadata: databases, tables, the attributes in each table, their data types, and the HDFS mapping.
Execution Engine –
The execution engine executes the plan created by the compiler. The plan is a DAG (directed acyclic graph) of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the appropriate system components.
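You can inspect this staged plan with EXPLAIN; a minimal sketch, assuming a hypothetical emp table with a dept column:

hive> EXPLAIN SELECT dept, COUNT(*) FROM emp GROUP BY dept;

The output lists STAGE DEPENDENCIES and STAGE PLANS sections, which correspond to the DAG of stages that the execution engine runs.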
Diagram – Architecture of Hive that is built on the top of Hadoop
The diagram above shows, alongside the architecture, the step-by-step job execution flow in Hive with Hadoop.
Step-1: Execute Query –
A Hive interface such as the command line or the web UI sends the query to the driver for execution. Here, the UI calls the driver's execute interface, for example over JDBC or ODBC.
Step-2: Get Plan –
The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan. In other words, the driver interacts with the compiler.
Step-3: Get Metadata –
The compiler sends a metadata request to the metastore for the tables referenced in the query.
Step-4: Send Metadata –
The metastore sends the requested metadata back to the compiler as a response.
Step-5: Send Plan –
The compiler sends the generated execution plan back to the driver for executing the query.
Step-6: Execute Plan –
The driver sends the execution plan to the execution engine. Internally, the execution engine submits the job to Hadoop (Execute Job), waits for its completion (Job Done), and performs the required DFS and metadata operations along the way.
Step-7: Fetch Results –
The user interface calls the fetch-results interface on the driver, and the driver in turn requests the results from the execution engine.
Step-8: Send Results –
The execution engine retrieves the results from the data nodes and returns them to the driver, which sends them back to the user interface (UI).
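From the user's side, this entire flow is triggered by a single query submitted through one of the interfaces, for example the Beeline JDBC client; a sketch, where the host, port, and the emp table are assumptions:

$ beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> SELECT dept, COUNT(*) FROM emp GROUP BY dept;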
Advantages of Hive Architecture:
Scalability: Hive is a distributed system that can easily scale to handle large
volumes of data by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the
need for complex programming skills. SQL-like language is used for queries and
HiveQL is based on SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the
Hadoop ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports
various data formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and
encryption to ensure data privacy.
Disadvantages of Hive Architecture:
High Latency: Hive’s performance is slower compared to traditional databases
because of the overhead of running queries in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it
is designed for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in
Hadoop, SQL, and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations, such as
transactions and indexes, which may limit the usefulness of the tool for certain
applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are
executed across a distributed system, and errors may occur in different nodes.
HIVE Data Types
Hive data types are categorized in numeric types, string types, misc types, and
complex types. A list of Hive data types is given below.
Integer Types
Type       Size                    Range
TINYINT    1-byte signed integer   -128 to 127
SMALLINT   2-byte signed integer   -32,768 to 32,767
INT        4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Type
Type     Size     Description
FLOAT    4-byte   Single-precision floating point number
DOUBLE   8-byte   Double-precision floating point number
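As a small sketch (the sales table and its columns are hypothetical), numeric columns are declared with these types and converted with CAST:

hive> CREATE TABLE sales (qty INT, price DOUBLE, big_id BIGINT);
hive> SELECT CAST(price AS FLOAT), CAST(qty AS BIGINT) FROM sales;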
Date/Time Types
TIMESTAMP
o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds
with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)
DATE
A DATE value specifies a particular year, month, and day in the form YYYY-MM-DD. It does not carry the time of day. The range of the DATE type lies between 0000-01-01 and 9999-12-31.
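For example (a sketch; the literal values are arbitrary and exact behaviour may vary by Hive version), DATE and TIMESTAMP values can be produced with casts and the built-in current_date and current_timestamp:

hive> SELECT CAST('2023-02-02' AS DATE), CAST('2023-02-02 10:15:30' AS TIMESTAMP);
hive> SELECT current_date, current_timestamp;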
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
VARCHAR
VARCHAR is a variable-length type whose declared length ranges between 1 and 65535; the length specifies the maximum number of characters allowed in the string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
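A minimal sketch showing the three string types together (the people table and its columns are hypothetical):

hive> CREATE TABLE people (name STRING, code CHAR(2), city VARCHAR(50));
hive> SELECT upper(name), length(city) FROM people;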
Complex Type
Type     Description                                                                     Example
Struct   Similar to a C struct or an object; fields are accessed using "dot" notation.  struct('James','Roy')
Map      A collection of key-value tuples; fields are accessed using array notation.    map('first','James','last','Roy')
Array    A collection of values of the same type, indexable using zero-based integers.  array('James','Roy')
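A hedged sketch of how these complex types appear in a table definition and in queries (the employee_contacts table and its layout are assumptions):

hive> CREATE TABLE employee_contacts (
          name STRING,
          full_name STRUCT<first:STRING, last:STRING>,
          phones ARRAY<STRING>,
          attrs MAP<STRING,STRING>
      );
hive> SELECT full_name.first, phones[0], attrs['dept'] FROM employee_contacts;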
HiveQL - Select-Where
The Hive Query Language (HiveQL) is a query language for Hive
to process and analyze structured data in a Metastore. This
chapter explains how to use the SELECT statement with WHERE
clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions form the expression that fulfils the condition.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Example
Let us take an example of the SELECT…WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. We want a query that retrieves the details of employees who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the
above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following
response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
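Building on the same employee table, the optional clauses from the syntax above can be combined; a sketch (output not shown):

hive> SELECT Dept, count(*) AS emp_count, avg(Salary) AS avg_salary
      FROM employee
      GROUP BY Dept
      HAVING avg(Salary) > 35000
      LIMIT 10;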
Apache Pig Tutorial
Pig tutorial provides basic and advanced concepts of Pig. Our Pig tutorial is designed
for beginners and professionals.
Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. It was developed by Yahoo. The language used for Pig is Pig Latin.
Our Pig tutorial includes all topics of Apache Pig with Pig usage, Pig Installation, Pig
Run Modes, Pig Latin concepts, Pig Data Types, Pig example, Pig user defined
functions etc.
What is Apache Pig
Apache Pig is a high-level data flow platform for executing MapReduce programs of
Hadoop. The language used for Pig is Pig Latin.
The Pig scripts get internally converted to Map Reduce jobs and get executed on
data stored in HDFS. Apart from that, Pig can also execute its job in Apache Tez or
Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java in MapReduce.
Features of Apache Pig
Let's see the various uses of Pig technology.
1) Ease of programming
Writing complex Java MapReduce programs is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce jobs internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions in which they put their own logic to execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators, such as sort, filter, and join.
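As a small illustration of these built-in operators (a sketch; the file paths and field names are hypothetical):

```pig
-- Load employees and departments, keep high earners, sort, and join
emps   = LOAD '/data/emps.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);
depts  = LOAD '/data/depts.csv' USING PigStorage(',') AS (dept:chararray, location:chararray);
high   = FILTER emps BY salary > 30000;        -- filter
sorted = ORDER high BY salary DESC;            -- sort
joined = JOIN sorted BY dept, depts BY dept;   -- join
DUMP joined;
```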
Differences between Apache MapReduce
and PIG
o Apache MapReduce is a low-level data processing tool, whereas Apache Pig is a high-level data flow tool.
o In MapReduce, it is required to develop complex programs using Java or Python; in Pig, it is not required to develop complex programs.
o It is difficult to perform data operations in MapReduce, while Pig provides built-in operators to perform data operations like union, sorting, and ordering.
o MapReduce doesn't allow nested data types, whereas Pig provides nested data types like tuple, bag, and map.
Advantages of Apache Pig
o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides a useful concept of nested data types like tuple, bag, and map.
Apache Pig - Architecture
The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the
Apache Pig framework. Let us take a look at the major
components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the
syntax of the script, does type checking, and other miscellaneous
checks. The output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements and logical
operators.
In the DAG, the logical operators of the script are represented as
the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.
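You can watch these plans for a given relation with the explain command in the Grunt shell; a sketch, where the relation and input file are hypothetical:

grunt> emps = LOAD '/data/emps.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:int);
grunt> high = FILTER emps BY salary > 30000;
grunt> explain high;

The output prints the logical, physical, and MapReduce plans produced by the parser, optimizer, and compiler described above.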
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex
non-atomic datatypes such as map and tuple. Given below is the
diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its datatype, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection
of tuples (non-unique) is known as a bag. Each tuple can have
any number of fields (flexible schema). A bag is represented by
‘{}’. It is similar to a table in RDBMS, but unlike a table in
RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column)
have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known
as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to
be of type chararray and should be unique. The value might be of
any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are
unordered (there is no guarantee that tuples are processed in any
particular order).
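A hedged sketch of how these types appear together in a schema (the file, delimiter, and field names are made up):

```pig
-- A relation whose schema contains an atom, a tuple, a bag, and a map
students = LOAD '/data/students.txt' USING PigStorage('|') AS (
    name:chararray,
    address:tuple(city:chararray, pin:int),
    courses:bag{course:tuple(cname:chararray, marks:int)},
    details:map[]
);
DESCRIBE students;   -- prints the nested schema
```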
Use Case for Pig :
A common use case for Apache Pig is in data processing workflows where you need
to handle and transform large volumes of data efficiently. Pig is especially useful for
ETL (Extract, Transform, Load) tasks and data pipelines within big data environments.
Here's a detailed example of a use case for Pig:
### Use Case: Analyzing Web Server Logs
**Scenario:**
A company has a large-scale website generating massive log files daily. They want to
analyze these logs to understand user behavior, popular pages, and traffic patterns.
**Objective:**
Extract meaningful insights from the web server logs using Apache Pig.
**Steps Using Pig:**
1. **Data Loading:**
- Use Pig to load log data stored in Hadoop HDFS (Hadoop Distributed File System).
```pig
-- Load log data
log_data = LOAD '/path/to/logfiles' USING TextLoader() AS (line:chararray);
```
2. **Data Cleaning and Parsing:**
- Clean and parse the log data to extract relevant fields such as IP addresses,
timestamps, URLs, response codes, etc.
```pig
-- Split log lines into fields
parsed_logs = FOREACH log_data GENERATE
    FLATTEN(STRSPLIT(line, '\\s+'))
    AS (ip:chararray, timestamp:chararray, url:chararray, status:int);
```
3. **Data Transformation:**
- Perform transformations and aggregations to derive insights. For example,
counting the number of requests per URL or per IP address.
```pig
-- Calculate requests per URL
requests_per_url = FOREACH (GROUP parsed_logs BY url)
GENERATE group AS url, COUNT(parsed_logs) AS request_count;
```
4. **Filtering and Analysis:**
- Apply filters and conditions to analyze specific subsets of data. For instance,
filtering out requests with certain response codes.
```pig
-- Keep only successful (2xx) requests
successful_requests = FILTER parsed_logs BY status >= 200 AND status < 300;
```
5. **Data Storage or Output:**
- Store the processed data or generate reports for further analysis.
```pig
-- Store or output results
STORE requests_per_url INTO '/output/url_requests';
```
**Advantages of Using Pig:**
- **Abstraction:** Pig provides a high-level scripting language (Pig Latin) that
abstracts complex MapReduce tasks, making it easier to write and maintain data
processing jobs.
- **Scalability:** Pig jobs can leverage the scalability and fault-tolerance of Hadoop,
handling large datasets efficiently.
- **Flexibility:** Pig supports a wide range of data transformations and operations,
allowing for diverse data processing tasks.
- **Optimization:** Pig optimizes the execution of tasks, automatically parallelizing
operations where possible.
ETL Process in Data Warehouse
INTRODUCTION:
1. ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a
format suitable for loading into a data warehouse, and then load it into
the warehouse. The process of ETL can be broken down into the
following three stages:
2. Extract: The first stage in the ETL process is to extract data from
various sources such as transactional systems, spreadsheets, and flat
files. This step involves reading data from the source systems and storing
it in a staging area.
3. Transform: In this stage, the extracted data is transformed into a format
that is suitable for loading into the data warehouse. This may involve
cleaning and validating the data, converting data types, combining data
from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the
data into the warehouse.
5. The ETL process is an iterative process that is repeated as new data is
added to the warehouse. The process is important because it ensures that
the data in the data warehouse is accurate, complete, and up-to-date. It
also helps to ensure that the data is in the format required for data mining
and reporting.
Additionally, there are many different ETL tools and technologies available, such
as Informatica, Talend, DataStage, and others, that can automate and simplify the
ETL process.
ETL is a process in Data Warehousing and it stands
for Extract, Transform and Load. It is a process in which an ETL tool extracts
the data from various data source systems, transforms it in the staging area, and
then finally, loads it into the Data Warehouse system.
Let us understand each step of the ETL process in-depth:
1. Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in formats such as relational databases, NoSQL stores, XML, and flat files, into the staging area. It is important to extract the data from the source systems and store it in the staging area first, rather than directly in the data warehouse, because the extracted data comes in various formats and can also be corrupted. Loading it directly into the data warehouse could damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
Filtering – loading only certain attributes into the data
warehouse.
Cleaning – filling up the NULL values with some default
values, mapping U.S.A, United States, and America into USA,
etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute
(generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the warehouse is refreshed by loading data very frequently, and sometimes loading is done at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use pipelining: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted; likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed. The block diagram of the pipelined ETL process is shown below:
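Tying this back to the tools covered earlier, here is a minimal Pig Latin sketch of the three stages (paths, field names, and cleaning rules are all hypothetical):

```pig
-- Extract: read raw records from the staging area in HDFS
raw = LOAD '/staging/customers.csv' USING PigStorage(',')
      AS (id:int, name:chararray, country:chararray);

-- Transform: clean and standardize (e.g., map country variants to one value)
clean = FOREACH raw GENERATE
        id,
        UPPER(name) AS name,
        ((country == 'U.S.A' OR country == 'America') ? 'USA' : country) AS country;

-- Load: write the standardized data into the warehouse directory
STORE clean INTO '/warehouse/customers' USING PigStorage(',');
```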
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse
builder, CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake,
Redshift, BigQuery, and Firebolt.
ADVANTAGES AND DISADVANTAGES:
Advantages of ETL process in data warehousing:
1. Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from
multiple sources and systems, making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security
by controlling access to the data warehouse and ensuring that only
authorized users can access the data.
4. Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and
simplify the ETL process, reducing the time and effort required to load
and update data in the warehouse.
Disadvantages of ETL process in data warehousing:
1. High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or
resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as
it may not be able to handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability,
as it may not be able to handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data
privacy, as large amounts of data are collected, stored, and analyzed.
Pig Data Types
Apache Pig supports many data types. A list of Apache Pig Data Types with
description and examples are given below.
Type        Description               Example
int         Signed 32-bit integer     2
long        Signed 64-bit integer     15L or 15l
float       32-bit floating point     2.5f or 2.5F
double      64-bit floating point     1.5 or 1.5e2 or 1.5E2
chararray   Character array (string)  hello javatpoint
bytearray   BLOB (byte array)
tuple       Ordered set of fields     (12,43)
bag         Collection of tuples      {(12,43),(54,28)}
map         Set of key-value pairs    [open#apache]
Apache Pig - Execution
Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local
Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists
in the Hadoop File System (HDFS) using Apache Pig. In this
mode, whenever we execute the Pig Latin statements to process
the data, a MapReduce job is invoked in the back-end to perform
a particular operation on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely,
interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in
interactive mode using the Grunt shell. In this shell, you can
enter the Pig Latin statements and get the output (using
Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode
by writing the Pig Latin script in a single file
with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of
defining our own functions (User Defined Functions) in
programming languages such as Java, and using them in
our script.
Invoking the Grunt Shell
You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.
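For example, assuming the pig executable is on your PATH:

$ pig -x local        (Local mode)
$ pig -x mapreduce    (MapReduce mode)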
Either of these commands gives you the Grunt shell prompt (grunt>).
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by
directly entering the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING
PigStorage(',');
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and execute it by passing the script file to the pig command, optionally with the -x option to choose the execution mode. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
          USING PigStorage(',') AS (id:int, name:chararray, city:chararray);

Dump student;
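The script can then be submitted from the shell; for example, in MapReduce mode (assuming the pig executable is on your PATH and the script is in the current directory):

$ pig -x mapreduce sample_script.pig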
Operators, functions, Data types of Pig:
In Apache Pig, which is a platform for analyzing large data sets, you work with various
operators, functions, and data types to process and manipulate data effectively
within the Pig Latin scripting language. Here's a breakdown of these components:
### Operators in Pig Latin:
1. **Arithmetic Operators:**
- `+` (addition)
- `-` (subtraction)
- `*` (multiplication)
- `/` (division)
- `%` (modulus)
2. **Comparison Operators:**
- `==` (equal to)
- `!=` (not equal to)
- `<` (less than)
- `<=` (less than or equal to)
- `>` (greater than)
- `>=` (greater than or equal to)
3. **Logical Operators:**
- `AND`
- `OR`
- `NOT`
4. **Unary Operators:**
- `-` (unary minus)
- `+` (unary plus)
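A brief sketch showing several of these operators in use (the relation and field names are hypothetical):

```pig
nums = LOAD '/data/nums.csv' USING PigStorage(',') AS (a:int, b:int);
calc = FOREACH nums GENERATE a + b AS total, a % b AS remainder;   -- arithmetic operators
big  = FILTER calc BY total >= 100 AND NOT (remainder == 0);       -- comparison and logical operators
DUMP big;
```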
### Functions in Pig Latin:
Pig provides a rich set of built-in functions that can be used for data transformation
and analysis. These functions can be broadly categorized as:
1. **Scalar Functions:**
- Operate on individual values and produce a single result.
- Examples: `LOWER`, `UPPER`, `TRIM`, `SUBSTRING`, `CONCAT`, `REPLACE`,
`TOUPPER`, `TOLOWER`, etc.
2. **Bag Functions:**
- Operate on collections of data (bags) and produce bags as output.
- Examples: `COUNT`, `MAX`, `MIN`, `AVG`, `SUM`, `TOKENIZE`, `FLATTEN`, etc.
3. **Map Functions:**
- Operate on maps (key-value pairs) and produce maps as output.
- Examples: `MAPSIZE`, `TOTUPLE`, `TOMAP`, `TOP`, `LIMIT`, etc.
4. **Other Functions:**
- Functions to handle data types, type conversions, date-time operations, etc.
- Examples: `IsEmpty`, `IsNull`, `Size`, `ToDate`, `ToUnixTime`, `ToString`, `ToBag`,
etc.
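For example (a sketch; the input file is hypothetical), TOKENIZE, FLATTEN, LOWER, and COUNT can be combined into a simple word count:

```pig
lines  = LOAD '/data/lines.txt' USING TextLoader() AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(LOWER(line))) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```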
### Data Types in Pig Latin:
1. **Atomic Data Types:**
- `int`: Integer
- `long`: Long integer
- `float`: Single-precision floating point
- `double`: Double-precision floating point
- `chararray`: Character array (string)
- `bytearray`: Byte array (used for binary data)
2. **Complex Data Types:**
- `tuple`: An ordered set of fields (similar to a row in a table)
- `bag`: A collection of tuples (unordered)
- `map`: A set of key-value pairs
These data types can be used to define the structure of your data within Pig, allowing
you to process and transform complex data formats efficiently.
Understanding these operators, functions, and data types is crucial for writing
effective Pig Latin scripts to process large-scale datasets in a distributed environment
like Apache Hadoop. Each component plays a vital role in defining data flow and
transformation logic within Pig.