HIVE
What is Hive?
Apache Hive is a data warehouse system built on top of Hadoop and is used for analyzing
structured and semi-structured data. Hive abstracts away the complexity of Hadoop
MapReduce: it provides a mechanism to project structure onto the data and to run queries
written in HQL (Hive Query Language), which are similar to SQL statements. Internally, these
HQL queries are converted into MapReduce jobs by the Hive compiler. Therefore, you
don’t need to worry about writing complex MapReduce programs to process your data
using Hadoop. Hive is targeted at users who are comfortable with SQL. Apache Hive
supports Data Definition Language (DDL), Data Manipulation Language (DML) and User
Defined Functions (UDFs), as illustrated below.
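For example, a minimal HQL session could define a table, load data into it, and query it. This is only a sketch; the table, columns, and file path below are illustrative, not from any particular dataset:

-- DDL: define a table over comma-delimited text files
CREATE TABLE employee (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- DML: load a local file into the table (path is illustrative)
LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;

-- Query: compiled into a MapReduce job behind the scenes
SELECT name, salary FROM employee WHERE salary > 50000;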
Where to use Apache Hive?
Apache Hive combines the best of both worlds, i.e. the SQL database system and the Hadoop
MapReduce framework. Therefore, it is used by a wide range of companies. It is mostly
used for data warehousing, where you can perform analytics and data mining that do not
require real-time processing. Some of the fields where you can use Apache Hive are as
follows:
Data Warehousing
Ad-hoc Analysis
As the saying goes, you can’t clap with one hand, i.e. you can’t solve every problem with a
single tool. Therefore, you can couple Hive with other tools to use it in many other domains.
For example, Tableau along with Apache Hive can be used for data visualization, and
integrating Apache Tez with Hive gives you near real-time, interactive query processing.
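For instance, assuming Tez is already deployed on the cluster, the execution engine can be switched for a session with a single property (the default is classic MapReduce):

-- assumes Apache Tez is installed on the cluster
SET hive.execution.engine=tez;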
Hive Architecture and its Components
The following image describes the Hive Architecture and the flow in which a query is
submitted to Hive and finally processed using the MapReduce framework:
Fig: Hive Tutorial – Hive Architecture
As shown in the above image, the Hive Architecture can be categorized into the following
components:
Hive Clients: Hive supports applications written in many languages like Java, C++,
Python etc. using JDBC, Thrift and ODBC drivers. Hence one can always write a Hive
client application in a language of their choice.
Hive Services: Apache Hive provides various services like CLI, Web Interface etc. to
perform queries. We will explore each one of them shortly in this Hive tutorial blog.
Processing framework and Resource Management: Internally, Hive uses the Hadoop
MapReduce framework as its de facto engine to execute queries. The Hadoop
MapReduce framework is a separate topic in itself and therefore is not discussed
here.
Distributed Storage: As Hive is installed on top of Hadoop, it uses the underlying
HDFS for the distributed storage. You can refer to the HDFS blog to learn more about
it.
Now, let us explore the first two major components in the Hive Architecture:
1. Hive Clients:
Apache Hive supports different types of client applications for performing queries on
Hive. These clients can be categorized into three types:
Thrift Clients: As the Hive server is based on Apache Thrift, it can serve requests from
all programming languages that support Thrift.
JDBC Clients: Hive allows Java applications to connect to it using the JDBC driver
which is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Clients: The Hive ODBC Driver allows applications that support the ODBC
protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to
communicate with the Hive server.)
2. Hive Services:
Hive provides many services as shown in the image above. Let us have a look at each of
them:
Hive CLI (Command Line Interface): This is the default shell provided by Hive,
where you can execute your Hive queries and commands directly.
Apache Hive Web Interface: Apart from the command line interface, Hive also
provides a web-based GUI for executing Hive queries and commands.
Hive Server: The Hive server is built on Apache Thrift and is therefore also referred
to as the Thrift Server; it allows different clients to submit requests to Hive and
retrieve the final result.
Apache Hive Driver: It is responsible for receiving the queries submitted by a client
through the CLI, the web UI, or the Thrift, ODBC or JDBC interfaces. The driver then
passes the query to the compiler, where parsing, type checking and semantic analysis
take place with the help of the schema present in the metastore. In the next step, an
optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of
map-reduce tasks and HDFS tasks. Finally, the execution engine executes these tasks
in the order of their dependencies, using Hadoop.
Metastore: You can think of the metastore as a central repository for storing all the
Hive metadata. Hive metadata includes information such as the structure of tables
and partitions, along with the columns, column types, and the serializer and
deserializer required for read/write operations on the data present in HDFS (a quick
way to inspect this metadata is shown after this list). The metastore comprises two
fundamental units:
o A service that provides metastore access to other Hive services.
o Disk storage for the metadata, which is separate from HDFS storage.
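This metadata is normally inspected through Hive itself rather than by querying the metastore database directly; for example (the table names here are illustrative):

-- both commands read from the metastore, not from the data files in HDFS
DESCRIBE FORMATTED employee;   -- columns, types, SerDe, HDFS location
SHOW PARTITIONS sales;         -- partition values registered for a table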
Buckets:
Bucketing divides the data in each partition (or in an unpartitioned table) into a fixed
number of files, based on the hash of a chosen column.
Commands:
CREATE TABLE table_name (column1 data_type, column2 data_type, …)
PARTITIONED BY (partition1 data_type, partition2 data_type, …)
CLUSTERED BY (column_name1, column_name2, …)
[SORTED BY (column_name [ASC|DESC], …)] INTO num_buckets BUCKETS;
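As a concrete (illustrative) example, the following creates a sales table partitioned by year and hashed into four buckets on the customer id:

-- illustrative schema: 4 bucket files per partition, hashed on cust_id
CREATE TABLE sales (cust_id INT, amount FLOAT)
PARTITIONED BY (sale_year INT)
CLUSTERED BY (cust_id) SORTED BY (cust_id ASC) INTO 4 BUCKETS;

-- on older Hive versions, set this before inserting so data is actually bucketed
SET hive.enforce.bucketing=true;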
Why do we need buckets?
There are two main reasons for bucketing a partition:
A map-side join requires the data belonging to a unique join key to be present in the
same partition. But what about those cases where your partition key differs from the
join key? In those cases, you can still perform a map-side join by bucketing the
table on the join key (see the sketch after this list).
Bucketing makes the sampling process more efficient and therefore allows us to
decrease the query time.
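For example, with the sales table sketched above, sampling can read a single bucket file instead of scanning the whole table, and the bucket map join optimization can be switched on with a property:

-- reads only bucket 1 of 4, i.e. roughly a quarter of the data
SELECT * FROM sales TABLESAMPLE(BUCKET 1 OUT OF 4 ON cust_id) s;

-- allows Hive to use bucketing for map-side joins on the bucketed key
SET hive.optimize.bucketmapjoin=true;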
What is HQL?
Hive defines a simple SQL-like language for querying and managing large datasets,
called HiveQL (HQL). It is easy to use if you are familiar with SQL. Hive also allows
programmers who are familiar with MapReduce to plug in their own custom mappers
and reducers to perform more sophisticated analysis, as sketched below.
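For instance, Hive’s TRANSFORM clause streams rows through an external script; the script path and output columns below are hypothetical:

-- hypothetical user-supplied Python script distributed to the cluster
ADD FILE /tmp/my_script.py;

-- streams (name, salary) rows through the script, one row per line on stdin
SELECT TRANSFORM (name, salary)
USING 'python my_script.py'
AS (name STRING, bonus FLOAT)
FROM employee;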
Uses of Hive:
1. Apache Hive facilitates querying and managing large datasets residing in distributed
storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It provides structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS)
or in other data storage systems such as Apache HBase (see the sketch after this list).
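For example, an existing HBase table can be exposed to Hive through the HBase storage handler (the table and column names here are illustrative):

-- maps HBase column family cf1 onto a Hive table named hbase_users
CREATE EXTERNAL TABLE hbase_users (key STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:name')
TBLPROPERTIES ('hbase.table.name' = 'users');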
Limitations of Hive:
• Hive is not designed for online transaction processing (OLTP); it is only used for online
analytical processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are supported only in limited contexts (for example, in the FROM
clause).
Why is Hive used in spite of Pig?
The following are the reasons why Hive is used in spite of Pig’s availability:
Hive-QL is a declarative language like SQL; Pig Latin is a data-flow language.
Pig: a data-flow language and environment for exploring very large datasets.
Hive: a distributed data warehouse.
Components of Hive:
Metastore:
Hive stores the schema of the Hive tables in the Hive Metastore. The metastore is used to
hold all the information about the tables and partitions that are in the warehouse. By
default, the metastore runs in the same process as the Hive service, and the default
metastore is the Derby database.
SerDe:
A SerDe (Serializer/Deserializer) gives instructions to Hive on how to process a record,
i.e. how to read it from and write it back to storage.
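For example, the built-in OpenCSVSerde can be used to parse CSV files (the table below is illustrative):

-- the SerDe handles CSV quoting and escaping when rows are read or written
CREATE TABLE csv_data (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;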