HIVE
What is Hive?
Apache Hive is a data warehouse system built on top of Hadoop and is used for analyzing
structured and semi-structured data. Hive abstracts away the complexity of Hadoop
MapReduce: it provides a mechanism to project structure onto the data and to run queries
written in HQL (Hive Query Language), which are similar to SQL statements. Internally, these
HQL queries are converted into MapReduce jobs by the Hive compiler. Therefore, you
don’t need to worry about writing complex MapReduce programs to process your data
using Hadoop. Hive is targeted at users who are comfortable with SQL. Apache Hive
supports Data Definition Language (DDL), Data Manipulation Language (DML) and User
Defined Functions (UDFs), as illustrated below.
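For example, a minimal HQL session could define a table, load data into it, and query it. This is only a sketch; the table, columns, and file path below are illustrative, not from any particular dataset:

-- DDL: define a table over comma-delimited text files
CREATE TABLE employee (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- DML: load a local file into the table (path is illustrative)
LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;

-- Query: compiled into a MapReduce job behind the scenes
SELECT name, salary FROM employee WHERE salary > 50000;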
Where to use Apache Hive?
Apache Hive combines the best of both worlds, i.e. the SQL database system and the Hadoop
MapReduce framework. Therefore, it is used by a wide range of companies. It is mostly
used for data warehousing, where you can perform analytics and data mining that do not
require real-time processing. Some of the fields where you can use Apache Hive are as
follows:
Data Warehousing
Ad-hoc Analysis
As the saying goes, you can’t clap with one hand, i.e. you can’t solve every problem with a
single tool. Therefore, you can couple Hive with other tools to use it in many other domains.
For example, Tableau along with Apache Hive can be used for data visualization, and
integrating Apache Tez with Hive gives you near real-time, interactive query processing.
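For instance, assuming Tez is already deployed on the cluster, the execution engine can be switched for a session with a single property (the default is classic MapReduce):

-- assumes Apache Tez is installed on the cluster
SET hive.execution.engine=tez;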
Hive Architecture and its Components
The following image describes the Hive Architecture and the flow in which a query is
submitted to Hive and finally processed using the MapReduce framework:
Fig: Hive Tutorial – Hive Architecture
As shown in the above image, the Hive Architecture can be categorized into the following
components:
Hive Clients: Hive supports applications written in many languages like Java, C++,
Python etc. using JDBC, Thrift and ODBC drivers. Hence one can always write a Hive
client application in a language of their choice.
Hive Services: Apache Hive provides various services like CLI, Web Interface etc. to
perform queries. We will explore each one of them shortly in this Hive tutorial blog.
Processing framework and Resource Management: Internally, Hive uses the Hadoop
MapReduce framework as its de facto engine to execute queries. The Hadoop
MapReduce framework is a separate topic in itself and therefore is not discussed
here.
Distributed Storage: As Hive is installed on top of Hadoop, it uses the underlying
HDFS for the distributed storage. You can refer to the HDFS blog to learn more about
it.
Now, let us explore the first two major components in the Hive Architecture:
1. Hive Clients:
Apache Hive supports different types of client applications for performing queries on
Hive. These clients can be categorized into three types:
Thrift Clients: As the Hive server is based on Apache Thrift, it can serve requests from
all programming languages that support Thrift.
JDBC Clients: Hive allows Java applications to connect to it using the JDBC driver
which is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Clients: The Hive ODBC Driver allows applications that support the ODBC
protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to
communicate with the Hive server.)
2. Hive Services:
Hive provides many services as shown in the image above. Let us have a look at each of
them:
Hive CLI (Command Line Interface): This is the default shell provided by Hive,
where you can execute your Hive queries and commands directly.
Apache Hive Web Interface: Apart from the command line interface, Hive also
provides a web-based GUI for executing Hive queries and commands.
Hive Server: The Hive server is built on Apache Thrift and is therefore also referred
to as the Thrift Server; it allows different clients to submit requests to Hive and
retrieve the final result.
Apache Hive Driver: It is responsible for receiving the queries submitted by a client
through the CLI, the web UI, or the Thrift, ODBC or JDBC interfaces. The driver then
passes the query to the compiler, where parsing, type checking and semantic analysis
take place with the help of the schema present in the metastore. In the next step, an
optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of
map-reduce tasks and HDFS tasks. Finally, the execution engine executes these tasks
in the order of their dependencies, using Hadoop.
Metastore: You can think of the metastore as a central repository for storing all the
Hive metadata. Hive metadata includes information such as the structure of tables
and partitions, along with the columns, column types, and the serializer and
deserializer required for read/write operations on the data present in HDFS (a quick
way to inspect this metadata is shown after this list). The metastore comprises two
fundamental units:
o A service that provides metastore access to other Hive services.
o Disk storage for the metadata, which is separate from HDFS storage.
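This metadata is normally inspected through Hive itself rather than by querying the metastore database directly; for example (the table names here are illustrative):

-- both commands read from the metastore, not from the data files in HDFS
DESCRIBE FORMATTED employee;   -- columns, types, SerDe, HDFS location
SHOW PARTITIONS sales;         -- partition values registered for a table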
Buckets:
Bucketing divides the data in each partition (or in an unpartitioned table) into a fixed
number of files, based on the hash of a chosen column.
Commands:
CREATE TABLE table_name (column1 data_type, column2 data_type, …)
PARTITIONED BY (partition1 data_type, partition2 data_type, …)
CLUSTERED BY (column_name1, column_name2, …)
[SORTED BY (column_name [ASC|DESC], …)] INTO num_buckets BUCKETS;
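As a concrete (illustrative) example, the following creates a sales table partitioned by year and hashed into four buckets on the customer id:

-- illustrative schema: 4 bucket files per partition, hashed on cust_id
CREATE TABLE sales (cust_id INT, amount FLOAT)
PARTITIONED BY (sale_year INT)
CLUSTERED BY (cust_id) SORTED BY (cust_id ASC) INTO 4 BUCKETS;

-- on older Hive versions, set this before inserting so data is actually bucketed
SET hive.enforce.bucketing=true;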
Why do we need buckets?
There are two main reasons for bucketing a partition:
A map-side join requires the data belonging to a unique join key to be present in the
same partition. But what about those cases where your partition key differs from the
join key? In those cases, you can still perform a map-side join by bucketing the
table on the join key (see the sketch after this list).
Bucketing makes the sampling process more efficient and therefore allows us to
decrease the query time.
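For example, with the sales table sketched above, sampling can read a single bucket file instead of scanning the whole table, and the bucket map join optimization can be switched on with a property:

-- reads only bucket 1 of 4, i.e. roughly a quarter of the data
SELECT * FROM sales TABLESAMPLE(BUCKET 1 OUT OF 4 ON cust_id) s;

-- allows Hive to use bucketing for map-side joins on the bucketed key
SET hive.optimize.bucketmapjoin=true;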
What is HQL?
Hive defines a simple SQL-like language for querying and managing large datasets,
called HiveQL (HQL). It is easy to use if you are familiar with SQL. Hive also allows
programmers who are familiar with MapReduce to plug in their own custom mappers
and reducers to perform more sophisticated analysis, as sketched below.
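For instance, Hive’s TRANSFORM clause streams rows through an external script; the script path and output columns below are hypothetical:

-- hypothetical user-supplied Python script distributed to the cluster
ADD FILE /tmp/my_script.py;

-- streams (name, salary) rows through the script, one row per line on stdin
SELECT TRANSFORM (name, salary)
USING 'python my_script.py'
AS (name STRING, bonus FLOAT)
FROM employee;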
Uses of Hive:
1. Apache Hive facilitates querying and managing large datasets residing in distributed
storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It provides structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS)
or in other data storage systems such as Apache HBase (see the sketch after this list).
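For example, an existing HBase table can be exposed to Hive through the HBase storage handler (the table and column names here are illustrative):

-- maps HBase column family cf1 onto a Hive table named hbase_users
CREATE EXTERNAL TABLE hbase_users (key STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:name')
TBLPROPERTIES ('hbase.table.name' = 'users');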
Limitations of Hive:
• Hive is not designed for online transaction processing (OLTP); it is only used for online
analytical processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are supported only in limited contexts (for example, in the FROM
clause).
Why is Hive used in spite of Pig?
The following are the reasons why Hive is used in spite of Pig’s availability:
Hive-QL is a declarative language like SQL; Pig Latin is a data-flow language.
Pig: a data-flow language and environment for exploring very large datasets.
Hive: a distributed data warehouse.
Components of Hive:
Metastore:
Hive stores the schema of the Hive tables in the Hive Metastore. The metastore is used to
hold all the information about the tables and partitions that are in the warehouse. By
default, the metastore runs in the same process as the Hive service, and the default
metastore is the Derby database.
SerDe:
A SerDe (Serializer/Deserializer) gives instructions to Hive on how to process a record,
i.e. how to read it from and write it back to storage.
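For example, the built-in OpenCSVSerde can be used to parse CSV files (the table below is illustrative):

-- the SerDe handles CSV quoting and escaping when rows are read or written
CREATE TABLE csv_data (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;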