Architecture and Working of Hive
The major components of Hive and their interaction with Hadoop are shown in the figure below, and each component is described in turn:
User Interface (UI) –
As the name suggests, the user interface sits between the user and Hive. It enables users to submit queries and other operations to the system. Hive supports the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
Hive Server – Also referred to as the Apache Thrift Server, it accepts requests from different clients and forwards them to the Hive Driver.
Driver –
The driver receives the user's queries from the interface. It implements the notion of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of the table and partition metadata obtained from the metastore.
Metastore –
The metastore stores all the structural information of the tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a relational database server to hold this schema or metadata: databases, tables, the attributes in each table, their data types, and the HDFS mapping.
Execution Engine –
The execution engine executes the plan created by the compiler. The plan is a DAG (directed acyclic graph) of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the appropriate system components.
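You can inspect this staged plan with EXPLAIN; a minimal sketch, assuming a hypothetical emp table with a dept column:

hive> EXPLAIN SELECT dept, COUNT(*) FROM emp GROUP BY dept;

The output lists STAGE DEPENDENCIES and STAGE PLANS sections, which correspond to the DAG of stages that the execution engine runs.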
Diagram – Architecture of Hive that is built on the top of Hadoop
The diagram above shows, alongside the architecture, the step-by-step job execution flow in Hive with Hadoop.
Step-1: Execute Query –
A Hive interface such as the command line or the web UI sends the query to the driver for execution. Here, the UI calls the driver's execute interface, for example over JDBC or ODBC.
Step-2: Get Plan –
The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan. In other words, the driver interacts with the compiler.
Step-3: Get Metadata –
The compiler sends a metadata request to the metastore for the tables referenced in the query.
Step-4: Send Metadata –
The metastore sends the requested metadata back to the compiler as a response.
Step-5: Send Plan –
The compiler sends the generated execution plan back to the driver for executing the query.
Step-6: Execute Plan –
The driver sends the execution plan to the execution engine. Internally, the execution engine submits the job to Hadoop (Execute Job), waits for its completion (Job Done), and performs the required DFS and metadata operations along the way.
Step-7: Fetch Results –
The user interface calls the fetch-results interface on the driver, and the driver in turn requests the results from the execution engine.
Step-8: Send Results –
The execution engine retrieves the results from the data nodes and returns them to the driver, which sends them back to the user interface (UI).
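From the user's side, this entire flow is triggered by a single query submitted through one of the interfaces, for example the Beeline JDBC client; a sketch, where the host, port, and the emp table are assumptions:

$ beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> SELECT dept, COUNT(*) FROM emp GROUP BY dept;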
Advantages of Hive Architecture:
Scalability: Hive is a distributed system that can easily scale to handle large
volumes of data by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the
need for complex programming skills. SQL-like language is used for queries and
HiveQL is based on SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the
Hadoop ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports
various data formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and
encryption to ensure data privacy.
Disadvantages of Hive Architecture:
High Latency: Hive’s performance is slower compared to traditional databases
because of the overhead of running queries in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it
is designed for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in
Hadoop, SQL, and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations, such as
transactions and indexes, which may limit the usefulness of the tool for certain
applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are
executed across a distributed system, and errors may occur in different nodes.
HIVE Data Types
Hive data types are categorized in numeric types, string types, misc types, and
complex types. A list of Hive data types is given below.
Integer Types
Type       Size                    Range
TINYINT    1-byte signed integer   -128 to 127
SMALLINT   2-byte signed integer   -32,768 to 32,767
INT        4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Type
Type     Size     Description
FLOAT    4-byte   Single-precision floating point number
DOUBLE   8-byte   Double-precision floating point number
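As a small sketch (the sales table and its columns are hypothetical), numeric columns are declared with these types and converted with CAST:

hive> CREATE TABLE sales (qty INT, price DOUBLE, big_id BIGINT);
hive> SELECT CAST(price AS FLOAT), CAST(qty AS BIGINT) FROM sales;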
Date/Time Types
TIMESTAMP
o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds
with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)
DATE
A DATE value specifies a particular year, month, and day in the form YYYY-MM-DD. It does not carry the time of day. The range of the DATE type lies between 0000-01-01 and 9999-12-31.
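For example (a sketch; the literal values are arbitrary and exact behaviour may vary by Hive version), DATE and TIMESTAMP values can be produced with casts and the built-in current_date and current_timestamp:

hive> SELECT CAST('2023-02-02' AS DATE), CAST('2023-02-02 10:15:30' AS TIMESTAMP);
hive> SELECT current_date, current_timestamp;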
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
VARCHAR
VARCHAR is a variable-length type whose declared length ranges between 1 and 65535; the length specifies the maximum number of characters allowed in the string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
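A minimal sketch showing the three string types together (the people table and its columns are hypothetical):

hive> CREATE TABLE people (name STRING, code CHAR(2), city VARCHAR(50));
hive> SELECT upper(name), length(city) FROM people;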
Complex Type
Type     Description                                                                     Example
Struct   Similar to a C struct or an object; fields are accessed using "dot" notation.  struct('James','Roy')
Map      A collection of key-value tuples; fields are accessed using array notation.    map('first','James','last','Roy')
Array    A collection of values of the same type, indexable using zero-based integers.  array('James','Roy')
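A hedged sketch of how these complex types appear in a table definition and in queries (the employee_contacts table and its layout are assumptions):

hive> CREATE TABLE employee_contacts (
          name STRING,
          full_name STRUCT<first:STRING, last:STRING>,
          phones ARRAY<STRING>,
          attrs MAP<STRING,STRING>
      );
hive> SELECT full_name.first, phones[0], attrs['dept'] FROM employee_contacts;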
HiveQL - Select-Where
The Hive Query Language (HiveQL) is a query language for Hive
to process and analyze structured data in a Metastore. This
chapter explains how to use the SELECT statement with WHERE
clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions form the expression that fulfils the condition.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Example
Let us take an example of the SELECT…WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. We want a query that retrieves the details of employees who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the
above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following
response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
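Building on the same employee table, the optional clauses from the syntax above can be combined; a sketch (output not shown):

hive> SELECT Dept, count(*) AS emp_count, avg(Salary) AS avg_salary
      FROM employee
      GROUP BY Dept
      HAVING avg(Salary) > 35000
      LIMIT 10;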
Apache Pig Tutorial
Pig tutorial provides basic and advanced concepts of Pig. Our Pig tutorial is designed
for beginners and professionals.
Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. It was developed by Yahoo. The language used for Pig is Pig Latin.
Our Pig tutorial includes all topics of Apache Pig with Pig usage, Pig Installation, Pig
Run Modes, Pig Latin concepts, Pig Data Types, Pig example, Pig user defined
functions etc.
What is Apache Pig
Apache Pig is a high-level data flow platform for executing MapReduce programs of
Hadoop. The language used for Pig is Pig Latin.
The Pig scripts get internally converted to Map Reduce jobs and get executed on
data stored in HDFS. Apart from that, Pig can also execute its job in Apache Tez or
Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java in MapReduce.
Features of Apache Pig
Let's see the various uses of Pig technology.
1) Ease of programming
Writing complex Java MapReduce programs is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce jobs internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions in which they put their own logic to execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators, such as sort, filter, and join.
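As a small illustration of these built-in operators (a sketch; the file paths and field names are hypothetical):

```pig
-- Load employees and departments, keep high earners, sort, and join
emps   = LOAD '/data/emps.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);
depts  = LOAD '/data/depts.csv' USING PigStorage(',') AS (dept:chararray, location:chararray);
high   = FILTER emps BY salary > 30000;        -- filter
sorted = ORDER high BY salary DESC;            -- sort
joined = JOIN sorted BY dept, depts BY dept;   -- join
DUMP joined;
```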
Differences between Apache MapReduce
and PIG
o Apache MapReduce is a low-level data processing tool, whereas Apache Pig is a high-level data flow tool.
o In MapReduce, it is required to develop complex programs using Java or Python; in Pig, it is not required to develop complex programs.
o It is difficult to perform data operations in MapReduce, while Pig provides built-in operators to perform data operations like union, sorting, and ordering.
o MapReduce doesn't allow nested data types, whereas Pig provides nested data types like tuple, bag, and map.
Advantages of Apache Pig
o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides a useful concept of nested data types like tuple, bag, and map.
Apache Pig - Architecture
The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the
Apache Pig framework. Let us take a look at the major
components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the
syntax of the script, does type checking, and other miscellaneous
checks. The output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements and logical
operators.
In the DAG, the logical operators of the script are represented as
the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.
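You can watch these plans for a given relation with the explain command in the Grunt shell; a sketch, where the relation and input file are hypothetical:

grunt> emps = LOAD '/data/emps.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:int);
grunt> high = FILTER emps BY salary > 30000;
grunt> explain high;

The output prints the logical, physical, and MapReduce plans produced by the parser, optimizer, and compiler described above.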
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex
non-atomic datatypes such as map and tuple. Given below is the
diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its datatype, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection
of tuples (non-unique) is known as a bag. Each tuple can have
any number of fields (flexible schema). A bag is represented by
‘{}’. It is similar to a table in RDBMS, but unlike a table in
RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column)
have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known
as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to
be of type chararray and should be unique. The value might be of
any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are
unordered (there is no guarantee that tuples are processed in any
particular order).
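A hedged sketch of how these types appear together in a schema (the file, delimiter, and field names are made up):

```pig
-- A relation whose schema contains an atom, a tuple, a bag, and a map
students = LOAD '/data/students.txt' USING PigStorage('|') AS (
    name:chararray,
    address:tuple(city:chararray, pin:int),
    courses:bag{course:tuple(cname:chararray, marks:int)},
    details:map[]
);
DESCRIBE students;   -- prints the nested schema
```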
Use Case for Pig :
A common use case for Apache Pig is in data processing workflows where you need
to handle and transform large volumes of data efficiently. Pig is especially useful for
ETL (Extract, Transform, Load) tasks and data pipelines within big data environments.
Here's a detailed example of a use case for Pig:
### Use Case: Analyzing Web Server Logs
**Scenario:**
A company has a large-scale website generating massive log files daily. They want to
analyze these logs to understand user behavior, popular pages, and traffic patterns.
**Objective:**
Extract meaningful insights from the web server logs using Apache Pig.
**Steps Using Pig:**
1. **Data Loading:**
- Use Pig to load log data stored in Hadoop HDFS (Hadoop Distributed File System).
```pig
-- Load log data
log_data = LOAD '/path/to/logfiles' USING TextLoader() AS (line:chararray);
```
2. **Data Cleaning and Parsing:**
- Clean and parse the log data to extract relevant fields such as IP addresses,
timestamps, URLs, response codes, etc.
```pig
-- Split log lines into fields
parsed_logs = FOREACH log_data GENERATE
    FLATTEN(STRSPLIT(line, '\\s+'))
    AS (ip:chararray, timestamp:chararray, url:chararray, status:int);
```
3. **Data Transformation:**
- Perform transformations and aggregations to derive insights. For example,
counting the number of requests per URL or per IP address.
```pig
-- Calculate requests per URL
requests_per_url = FOREACH (GROUP parsed_logs BY url)
GENERATE group AS url, COUNT(parsed_logs) AS request_count;
```
4. **Filtering and Analysis:**
- Apply filters and conditions to analyze specific subsets of data. For instance,
filtering out requests with certain response codes.
```pig
-- Keep only successful (2xx) requests
successful_requests = FILTER parsed_logs BY status >= 200 AND status < 300;
```
5. **Data Storage or Output:**
- Store the processed data or generate reports for further analysis.
```pig
-- Store or output results
STORE requests_per_url INTO '/output/url_requests';
```
**Advantages of Using Pig:**
- **Abstraction:** Pig provides a high-level scripting language (Pig Latin) that
abstracts complex MapReduce tasks, making it easier to write and maintain data
processing jobs.
- **Scalability:** Pig jobs can leverage the scalability and fault-tolerance of Hadoop,
handling large datasets efficiently.
- **Flexibility:** Pig supports a wide range of data transformations and operations,
allowing for diverse data processing tasks.
- **Optimization:** Pig optimizes the execution of tasks, automatically parallelizing
operations where possible.
ETL Process in Data Warehouse
INTRODUCTION:
1. ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a
format suitable for loading into a data warehouse, and then load it into
the warehouse. The process of ETL can be broken down into the
following three stages:
2. Extract: The first stage in the ETL process is to extract data from
various sources such as transactional systems, spreadsheets, and flat
files. This step involves reading data from the source systems and storing
it in a staging area.
3. Transform: In this stage, the extracted data is transformed into a format
that is suitable for loading into the data warehouse. This may involve
cleaning and validating the data, converting data types, combining data
from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the
data into the warehouse.
5. The ETL process is an iterative process that is repeated as new data is
added to the warehouse. The process is important because it ensures that
the data in the data warehouse is accurate, complete, and up-to-date. It
also helps to ensure that the data is in the format required for data mining
and reporting.
Additionally, there are many different ETL tools and technologies available, such
as Informatica, Talend, DataStage, and others, that can automate and simplify the
ETL process.
ETL is a process in Data Warehousing and it stands
for Extract, Transform and Load. It is a process in which an ETL tool extracts
the data from various data source systems, transforms it in the staging area, and
then finally, loads it into the Data Warehouse system.
Let us understand each step of the ETL process in-depth:
1. Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in formats such as relational databases, NoSQL stores, XML, and flat files, into the staging area. It is important to extract the data from the source systems and store it in the staging area first, rather than directly in the data warehouse, because the extracted data comes in various formats and can also be corrupted. Loading it directly into the data warehouse could damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
Filtering – loading only certain attributes into the data
warehouse.
Cleaning – filling up the NULL values with some default
values, mapping U.S.A, United States, and America into USA,
etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute
(generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the warehouse is refreshed by loading data very frequently, and sometimes loading is done at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use pipelining: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted; likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed. The block diagram of the pipelined ETL process is shown below:
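Tying this back to the tools covered earlier, here is a minimal Pig Latin sketch of the three stages (paths, field names, and cleaning rules are all hypothetical):

```pig
-- Extract: read raw records from the staging area in HDFS
raw = LOAD '/staging/customers.csv' USING PigStorage(',')
      AS (id:int, name:chararray, country:chararray);

-- Transform: clean and standardize (e.g., map country variants to one value)
clean = FOREACH raw GENERATE
        id,
        UPPER(name) AS name,
        ((country == 'U.S.A' OR country == 'America') ? 'USA' : country) AS country;

-- Load: write the standardized data into the warehouse directory
STORE clean INTO '/warehouse/customers' USING PigStorage(',');
```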
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse
builder, CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake,
Redshift, BigQuery, and Firebolt.
ADVANTAGES AND DISADVANTAGES:
Advantages of ETL process in data warehousing:
1. Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from
multiple sources and systems, making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security
by controlling access to the data warehouse and ensuring that only
authorized users can access the data.
4. Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and
simplify the ETL process, reducing the time and effort required to load
and update data in the warehouse.
Disadvantages of ETL process in data warehousing:
1. High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or
resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as
it may not be able to handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability,
as it may not be able to handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data
privacy, as large amounts of data are collected, stored, and analyzed.
Pig Data Types
Apache Pig supports many data types. A list of Apache Pig Data Types with
description and examples are given below.
Type        Description               Example
int         Signed 32-bit integer     2
long        Signed 64-bit integer     15L or 15l
float       32-bit floating point     2.5f or 2.5F
double      64-bit floating point     1.5 or 1.5e2 or 1.5E2
chararray   Character array (string)  hello javatpoint
bytearray   BLOB (byte array)
tuple       Ordered set of fields     (12,43)
bag         Collection of tuples      {(12,43),(54,28)}
map         Set of key-value pairs    [open#apache]
Apache Pig - Execution
Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local
Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists
in the Hadoop File System (HDFS) using Apache Pig. In this
mode, whenever we execute the Pig Latin statements to process
the data, a MapReduce job is invoked in the back-end to perform
a particular operation on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely,
interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in
interactive mode using the Grunt shell. In this shell, you can
enter the Pig Latin statements and get the output (using
Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode
by writing the Pig Latin script in a single file
with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of
defining our own functions (User Defined Functions) in
programming languages such as Java, and using them in
our script.
Invoking the Grunt Shell
You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.
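For example, assuming the pig executable is on your PATH:

$ pig -x local        (Local mode)
$ pig -x mapreduce    (MapReduce mode)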
Either of these commands gives you the Grunt shell prompt (grunt>).
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by
directly entering the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING
PigStorage(',');
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and execute it by passing the script file to the pig command, optionally with the -x option to choose the execution mode. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
          USING PigStorage(',') AS (id:int, name:chararray, city:chararray);

Dump student;
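The script can then be submitted from the shell; for example, in MapReduce mode (assuming the pig executable is on your PATH and the script is in the current directory):

$ pig -x mapreduce sample_script.pig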
Operators, functions, Data types of Pig:
In Apache Pig, which is a platform for analyzing large data sets, you work with various
operators, functions, and data types to process and manipulate data effectively
within the Pig Latin scripting language. Here's a breakdown of these components:
### Operators in Pig Latin:
1. **Arithmetic Operators:**
- `+` (addition)
- `-` (subtraction)
- `*` (multiplication)
- `/` (division)
- `%` (modulus)
2. **Comparison Operators:**
- `==` (equal to)
- `!=` (not equal to)
- `<` (less than)
- `<=` (less than or equal to)
- `>` (greater than)
- `>=` (greater than or equal to)
3. **Logical Operators:**
- `AND`
- `OR`
- `NOT`
4. **Unary Operators:**
- `-` (unary minus)
- `+` (unary plus)
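A brief sketch showing several of these operators in use (the relation and field names are hypothetical):

```pig
nums = LOAD '/data/nums.csv' USING PigStorage(',') AS (a:int, b:int);
calc = FOREACH nums GENERATE a + b AS total, a % b AS remainder;   -- arithmetic operators
big  = FILTER calc BY total >= 100 AND NOT (remainder == 0);       -- comparison and logical operators
DUMP big;
```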
### Functions in Pig Latin:
Pig provides a rich set of built-in functions that can be used for data transformation
and analysis. These functions can be broadly categorized as:
1. **Scalar Functions:**
- Operate on individual values and produce a single result.
- Examples: `LOWER`, `UPPER`, `TRIM`, `SUBSTRING`, `CONCAT`, `REPLACE`,
`TOUPPER`, `TOLOWER`, etc.
2. **Bag Functions:**
- Operate on collections of data (bags) and produce bags as output.
- Examples: `COUNT`, `MAX`, `MIN`, `AVG`, `SUM`, `TOKENIZE`, `FLATTEN`, etc.
3. **Map Functions:**
- Operate on maps (key-value pairs) and produce maps as output.
- Examples: `MAPSIZE`, `TOTUPLE`, `TOMAP`, `TOP`, `LIMIT`, etc.
4. **Other Functions:**
- Functions to handle data types, type conversions, date-time operations, etc.
- Examples: `IsEmpty`, `IsNull`, `Size`, `ToDate`, `ToUnixTime`, `ToString`, `ToBag`,
etc.
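For example (a sketch; the input file is hypothetical), TOKENIZE, FLATTEN, LOWER, and COUNT can be combined into a simple word count:

```pig
lines  = LOAD '/data/lines.txt' USING TextLoader() AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(LOWER(line))) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```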
### Data Types in Pig Latin:
1. **Atomic Data Types:**
- `int`: Integer
- `long`: Long integer
- `float`: Single-precision floating point
- `double`: Double-precision floating point
- `chararray`: Character array (string)
- `bytearray`: Byte array (used for binary data)
2. **Complex Data Types:**
- `tuple`: An ordered set of fields (similar to a row in a table)
- `bag`: A collection of tuples (unordered)
- `map`: A set of key-value pairs
These data types can be used to define the structure of your data within Pig, allowing
you to process and transform complex data formats efficiently.
Understanding these operators, functions, and data types is crucial for writing
effective Pig Latin scripts to process large-scale datasets in a distributed environment
like Apache Hadoop. Each component plays a vital role in defining data flow and
transformation logic within Pig.