
Unit IV Part - 1

The document discusses the architecture and components of Hive including the user interface, driver, compiler, metastore, execution engine. It also covers how to get started with Hive by installing and configuring it. The document further explains Hive services, variables, properties, queries, data types, DDL, and DML operations in Hive like creating, altering and manipulating tables.

19ECS442

BIG DATA
UNIT IV
Introducing Hive
• Hive provides an SQL-like interface called HiveQL, the Hive Query Language.
• This interface translates a given query into MapReduce code.
• Hive can be seen as a mechanism through which one can access the
data stored in HDFS.
• HiveQL enables users to perform tasks using the MapReduce
concept but without explicitly writing code in terms of the map
and reduce functions.
• The data stored in HDFS can be accessed through HiveQL, which
offers the features of SQL but runs on the MapReduce
framework.
Introducing Hive
• It should be noted that Hive is not a complete database and is
not meant to be used in Online Transactional Processing
Systems, such as online ticketing, bank transactions, etc.
• It is mostly used in data warehousing applications, where you
need to perform batch processing on huge amounts of data.
• Typical examples of this kind of data include Web logs, call
data records, weather data, etc.
The architecture of Hive consists of the following components:
• User Interface (UI)—Allows you to submit queries to the Hive
system for execution.

• Driver—Receives the submitted queries. This component
creates a session handle for each submitted query and then sends the
query to the compiler to generate an execution plan.

• Compiler—Parses the query, performs semantic analysis on the
different query blocks and query expressions, and generates an
execution plan.
• Metastore—Stores all the information related to the structure of the
various tables and partitions in the data warehouse.
• It also includes column and column type information and the serializers
and deserializers necessary to read and write data.
• It also contains information about the corresponding HDFS files where
your data is stored.
• Execution Engine—Executes the execution plan created by the
compiler. The plan takes the form of a Directed Acyclic Graph (DAG) to
be executed in various stages.
• This engine manages the dependencies between the different stages of a
plan and is also responsible for executing these stages on the appropriate
system components.
Getting Started with Hive
Before using Hive, we need to install it and set the
home variable so that Hive can be used on Hadoop.
The steps for installing Hive on Hadoop are as follows:
Installing Hive with embedded metastore:
1. Download the latest version of Hive.
2. Untar the package:
$ tar -xzvf apache-hive-0.13.1-bin.tar.gz
3. Add the following to ~/.bash_profile:
$ sudo nano ~/.bash_profile
export HIVE_HOME=/home/hduser/hive-0.13.1
export PATH=$PATH:$HIVE_HOME/bin
where hduser is the user name and hive-0.13.1 is the Hive directory
extracted from the tar file.
4. Install an SQL database such as MySQL on the same machine where you
want to run Hive.
5. Add the following entries to hive-site.xml:
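The entries below are a minimal sketch of a hive-site.xml wired to a MySQL-backed metastore; the database name (metastore_db), user name, and password are placeholders for illustration:

```xml
<configuration>
  <!-- JDBC connection to the metastore database (created if absent) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```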
6. Run Hive from the terminal by typing hive.
Hive commands to be used in the shell interactive mode:
In the preceding examples, the set command prints a list of configuration
variables, the SELECT query displays the columns beginning with my from
the mytable table, and dfs -ls executes the dfs command to list the
directory contents.
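A sketch of such an interactive session (mytable is the table named in the text; selecting columns by the backquoted regex works on Hive versions where hive.support.quoted.identifiers is set to none):

```
hive> set;                           -- prints the configuration variables
hive> SELECT `my.*` FROM mytable;    -- columns whose names begin with "my"
hive> dfs -ls /user/hive/warehouse;  -- runs an HDFS command from the shell
```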
4.2 Hive Services
Hive services are as follows:
• CLI—It is the command window that we get after the
installation of Hive.
• This is nothing but the command-line interface of Hive, and it is the
inbuilt default service of Hive.
• Hive server—Runs Hive as an integrated server exposing a Thrift
service, which provides access to a number of clients
written in different languages.
• Applications that use JDBC or ODBC connectors
need a running Hive server to communicate with
Hive.
• Hive Web Interface (HWI)—The Hive Web Interface is the GUI
of Hive on which we can execute the queries. It is an alternative to
the shell.
HWI uses the following command:
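On Hive releases that still ship HWI (it was removed in later versions), the service is started as follows; the web interface then listens on port 9999 by default:

```
$ hive --service hwi
```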

• JAR—Analogous to the hadoop jar command; it is a
convenient way to run Java applications with both the Hadoop
and Hive classes on the classpath.
It is not required to make a special JAR to do such analysis;
Hive itself does it with a simple query.
Hive Services
• Metastore—It is the service that runs with the Hive services
whenever Hive starts. This is a default process.
• Using the metastore service, it is also possible to run the metastore as a
standalone (remote) process.
• For this, you just need to set the
METASTORE_PORT environment variable so that the server
listens on the specified port.
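For example, a standalone metastore can be started as shown below (9083 is the conventional metastore port, used here for illustration):

```
$ export METASTORE_PORT=9083
$ hive --service metastore
```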

• Hive client—There are many different mechanisms by which
applications can communicate with Hive when you run it as a
server (hiveserver).
Hive Variables
• Hive allows you to set variables that can be referred in the Hive
script.
• For this purpose, you need to use the -d or --define option, as
shown in the following commands.

In the preceding commands, a table named sample_table is created in
the database sample_database. By default, the variable substitution
option is enabled.
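For example (the variable and object names echo the text; ${hivevar:...} is the namespace for variables set with -d):

```
$ hive -d dbname=sample_database -d tname=sample_table
hive> CREATE TABLE ${hivevar:dbname}.${hivevar:tname} (id INT, name STRING);
```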
Hive Properties
• The hive-site.xml file stores the configuration properties of Hive.
These properties can be overwritten by the developers.
• To overwrite the properties of the hive-site.xml file, the set
command is used.
• For example, the following command sets the scratch-directory
path to /tmp/mydir for all subsequent commands:
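The slide omits the command itself; assuming the property in question is hive.exec.scratchdir, the set command would look like this:

```
hive> set hive.exec.scratchdir=/tmp/mydir;
```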
Hive Queries
• Hive allows you to simultaneously execute one or more queries.
These queries can be stored in and executed from files.
• The extension of a query file in Hive is .hql or .q. Let's take an
example of a Hive query file, named ourquery.hql, stored in the
/home/weusers/queries/ folder.
• Now, type the following commands to execute the query stored in
the ourquery.hql file:
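The file can be executed either from the terminal with the -f option or from within the Hive shell with the source command:

```
$ hive -f /home/weusers/queries/ourquery.hql
hive> source /home/weusers/queries/ourquery.hql;
```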
4.3 Data Types in Hive
• Hive supports two kinds of data types: primitive and complex.
• Primitive data types are built-in data types, which also act as basic
structures for building more sophisticated data types.
• Primitive data types are associated with columns of a table.
The following types of complex data types are available in Hive:
• Structs—The elements within this type can be accessed by using
the dot (.) operator.
• Maps—The elements in a map are accessed by using the ['element
name'] notation.
• Arrays—The elements in an array have to be of the same data type.
Elements can be accessed by using the [n] notation, where n
represents the index of the array.
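The three notations can be sketched together; the employees table and its columns below are hypothetical, used only to illustrate the access syntax:

```sql
CREATE TABLE employees (
  name    STRING,
  address STRUCT<street:STRING, city:STRING>,
  phones  MAP<STRING, STRING>,
  skills  ARRAY<STRING>
);

SELECT address.city,    -- struct: dot (.) operator
       phones['home'],  -- map: ['element name'] notation
       skills[0]        -- array: [n] notation
FROM employees;
```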
Built-in Functions
• You must already be aware of the concept of functions and the role
they play in the development of an application.

• A function is a group of commands used to perform a particular
task in a program and return an outcome.

• Like every programming language, Hive also has its set of built-in
functions (also known as pre-defined functions).
• Hive provides a number of built-in functions, such as round(),
concat(), substr(), upper(), and count().
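A few of these functions in use (the emp_table table and its columns are assumed for illustration):

```sql
SELECT round(salary),               -- round to the nearest integer
       concat(fname, ' ', lname),   -- string concatenation
       substr(fname, 1, 3),         -- substring from position 1, length 3
       upper(fname)                 -- convert to upper case
FROM emp_table;
```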
4.4 Hive-DDL
• Data Definition Language (DDL) is used to describe data and data
structures of a database.

• Hive has its own DDL, similar to SQL DDL, which is used for
managing, creating, altering, and dropping databases, tables, and
other objects in a database.

• Similar to other SQL databases, Hive databases also contain
namespaces for tables.
• Some of the main commands used in DDL are as follows.
• In the preceding command, the name of the database is added before
the name of the table.
• Therefore, temp_table gets added in the temp_database.
• In addition, you can also create a table in the database by using the
following commands:
• In the preceding commands, the USE statement is used for
setting the current database to execute all the subsequent
HiveQL statements. In this way, you do not need to add the name
of the database before the table name.
• The table temp_table is created in the database temp_database
• Furthermore, you can specify DBPROPERTIES in the form of
key-value pairs in the following manner:
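The commands described above can be sketched as follows; the column definitions and property values are illustrative:

```sql
CREATE DATABASE temp_database
  WITH DBPROPERTIES ('creator' = 'hduser', 'date' = '2015-01-01');

-- add the database name before the table name:
CREATE TABLE temp_database.temp_table (id INT, name STRING);

-- or set the current database with USE first:
USE temp_database;
CREATE TABLE temp_table2 (id INT, name STRING);
```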
Viewing a Database
• You can view all the databases present in a particular path by using the following command:
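The command the text refers to is SHOW DATABASES, optionally filtered with a pattern:

```sql
SHOW DATABASES;
SHOW DATABASES LIKE 'temp*';
```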

• Dropping a Database
• Dropping a database means deleting it from its storage location
• The database can be deleted by using the following command:
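A sketch of the drop command; CASCADE also removes any tables the database still contains:

```sql
DROP DATABASE IF EXISTS temp_database CASCADE;
```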
Creating Tables
• You can create a table in a database by using the CREATE
command, as discussed earlier.
• Now, let’s learn how to provide the complete definition of a table
in a database by using the following commands:
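A complete table definition might look like the following; the table and column names, comments, and comma delimiter are assumptions for illustration:

```sql
CREATE TABLE IF NOT EXISTS temp_database.emp_table (
  id     INT     COMMENT 'employee id',
  name   STRING  COMMENT 'employee name',
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```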
Altering Tables
Altering a table means modifying or changing an existing table.
By altering a table, you can modify the metadata associated with the table.
The table can be modified by using the ALTER TABLE
The altering of a table allows you to:
 Rename tables
 Modify columns
 Add new columns
 Delete some columns
 Change table properties
 Alter tables for adding partition
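Each of the alterations above can be sketched with ALTER TABLE; the names are illustrative, and the last statement assumes the table was created with PARTITIONED BY (year INT):

```sql
ALTER TABLE emp_table RENAME TO employee_table;                         -- rename table
ALTER TABLE employee_table CHANGE salary monthly_salary FLOAT;          -- modify a column
ALTER TABLE employee_table ADD COLUMNS (dept STRING);                   -- add a new column
ALTER TABLE employee_table REPLACE COLUMNS (id INT, name STRING);       -- drop the other columns
ALTER TABLE employee_table SET TBLPROPERTIES ('comment' = 'emp data');  -- change properties
ALTER TABLE employee_table ADD PARTITION (year = 2015);                 -- add a partition
```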
4.5 Data Manipulation
After specifying the database schema and creating a database, the data
can be modified by using a set of procedures/mechanisms defined
by a special language known as Data Manipulation Language
(DML).
Data can be manipulated in the following ways:
• Loading files into tables
• Inserting data into Hive table from queries
• Updating existing tables
• Deleting records in tables
Let's learn about each of these.
Loading Files into Tables
• While loading data into tables, Hive does not perform any type of
transformations.
• The data load operations in Hive are, at present, pure copy/move
operations, which move data files from one location to another.
• You can upload data into Hive tables from the local file system as
well as from HDFS.
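Both cases use LOAD DATA; the file paths and table name here are placeholders. Note that a LOCAL load copies the file, while an HDFS load moves it:

```sql
-- from the local file system:
LOAD DATA LOCAL INPATH '/home/hduser/emp.csv' INTO TABLE emp_table;

-- from HDFS:
LOAD DATA INPATH '/user/hduser/emp.csv' INTO TABLE emp_table;
```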
Inserting Data into Tables
• Consider two tables, T1 and T2. We want to copy the sal column
from T2 to T1 by using the INSERT command. It can be done as
follows:
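Using the tables named in the text, and assuming T1 has a single column compatible with sal, the copy can be sketched as:

```sql
INSERT OVERWRITE TABLE T1 SELECT sal FROM T2;
```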
Static Partition Insertion
• Static partition insertion refers to the task of inserting data into a
table by specifying a partition column value.

Consider the following example:
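A sketch of a static partition insert; emp_partitioned is a hypothetical table partitioned by country, and the partition value 'IN' is fixed in the statement:

```sql
INSERT OVERWRITE TABLE emp_partitioned PARTITION (country = 'IN')
SELECT id, name FROM emp_table WHERE country = 'IN';
```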


Dynamic Partition Insertion
• In dynamic partition insertion, you need to specify a list of partition column
names in the PARTITION() clause along with the optional column values.
• A dynamic partition column always has a corresponding input column in
the SELECT statement.

• If the SELECT statement has multiple column names, the dynamic partition
columns must be specified at the end of the columns and in the same order
in which they appear in the PARTITION() clause.
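A sketch of a dynamic partition insert over the same hypothetical tables; note that country has no fixed value in PARTITION() and appears last in the SELECT list:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE emp_partitioned PARTITION (country)
SELECT id, name, country FROM emp_table;
```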
Inserting Data into Local Files
• Sometimes, you might need to save the result of a SELECT
query in flat files so that you do not have to execute the queries
again and again.
Consider the following example:
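Writing query results to a local directory can be sketched as follows (the directory path and table are placeholders):

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/emp_out'
SELECT * FROM emp_table;
```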
Creating and Inserting Data into a Table Using a Single Query
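A CREATE TABLE AS SELECT (CTAS) statement does both in one query; the names and the salary condition are illustrative:

```sql
CREATE TABLE emp_copy AS
SELECT id, name FROM emp_table WHERE salary > 30000;
```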

Delete in Hive
• The delete operation is available in Hive from version 0.14 onward. The delete operation can only be performed on tables that support the ACID properties.
• The syntax for performing the delete operation is as follows:
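A sketch of the syntax, assuming emp_table was created as a transactional (ACID) table:

```sql
DELETE FROM emp_table WHERE id = 101;
```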
4.6 Data Retrieval Queries
Hive allows you to perform data retrieval queries by using the
SELECT command along with various types of operators and
clauses. In this section, you learn about the following:
• Using the SELECT command
• Using the WHERE clause
• Using the GROUP BY clause
• Using the HAVING clause
• Using the LIMIT clause
• Executing HiveQL queries
Using the SELECT Command
The SELECT statement is the most common operation in SQL.
You can filter the required columns, rows, or both.
The syntax for using the SELECT command is as follows
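A sketch combining SELECT with the WHERE and LIMIT clauses listed above (table and columns assumed):

```sql
SELECT name, salary
FROM emp_table
WHERE salary > 25000
LIMIT 10;
```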
Using the GROUP BY Clause
• The GROUP BY clause is used to put all the related records
together. It can also be used with aggregate functions. Often, it is
required to group the result sets in complex queries.
• In such scenarios, the GROUP BY clause can be used.
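For example, grouping with aggregate functions can be sketched as follows (the dept column is assumed):

```sql
SELECT dept, count(*), avg(salary)
FROM emp_table
GROUP BY dept;
```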
Using the HAVING Clause
• The HAVING clause is used to specify a condition on the groups
produced by the GROUP BY clause. Support for the HAVING clause
was added in version 0.7.0 of Hive.
• The following query shows an example of using the HAVING
clause:
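A sketch of HAVING filtering the groups produced by GROUP BY (names assumed):

```sql
SELECT dept, count(*)
FROM emp_table
GROUP BY dept
HAVING count(*) > 5;
```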
4.7 Using JOINS in Hive
Hive supports joining of one or more tables to aggregate information. The
various joins supported by Hive are:
• Inner joins
• Outer joins
Inner Joins
• In case of inner joins, only the records satisfying the given condition get
selected. All the other records get discarded.
• Let’s take an example to describe the concept of inner joins.
Consider two tables, order and customer
Table 12.6 lists the data of the order table
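An inner join over the two tables named in the text can be sketched as follows; the column names are assumptions, and order is backquoted because it is a reserved word in HiveQL:

```sql
SELECT c.name, o.order_id, o.amount
FROM customer c JOIN `order` o
  ON (c.id = o.customer_id);
```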
Outer Joins
• Sometimes, you need to retrieve all the records from one table and
only some records from the other table.
• In such cases, you have to use the outer join.
• Figure 12.12 illustrates the concept of outer joins:
Outer joins are of three types:
• Right Outer Join
• Left Outer Join
• Full Outer Join
Right Outer Join
• In this type of join, all the records from the table on the right side of the
join are retained.
Left Outer Join
• In this type of join, all the records from the table on the left side of
the join are retained.
• Figure 12.14 illustrates the concept of left outer joins:
Full Outer Join
• In this case, all the fields from both tables are included. For the
entries that do not have any match, a NULL value would be
displayed.
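The three outer joins over the same hypothetical customer and order tables:

```sql
-- left outer: all customers, even those without orders
SELECT c.name, o.order_id
FROM customer c LEFT OUTER JOIN `order` o ON (c.id = o.customer_id);

-- right outer: all orders, even those without a matching customer
SELECT c.name, o.order_id
FROM customer c RIGHT OUTER JOIN `order` o ON (c.id = o.customer_id);

-- full outer: all rows from both tables, NULL where there is no match
SELECT c.name, o.order_id
FROM customer c FULL OUTER JOIN `order` o ON (c.id = o.customer_id);
```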
Joining Tables
• You can combine the data of two or more tables in Hive by using
HiveQL queries.
• For this, we need to create tables and load them into Hive from
HDFS.
