19ECS442
BIG DATA
 UNIT IV
                      Introducing Hive
•     Hive provides an SQL-like interface called HiveQL, or the Hive
    Query Language.
•    This interface translates a given query into MapReduce jobs.
•   Hive can be seen as a mechanism through which one can access the
    data stored in the HDFS.
•   HiveQL enables users to perform tasks using the MapReduce
    concept but without explicitly writing the code in terms of the map
    and reduce functions.
•   The data stored in HDFS can be accessed through HiveQL, which
    contains the features of SQL but runs on the MapReduce
    framework.
• It should be noted that Hive is not a complete database and is
  not meant to be used in Online Transactional Processing
  Systems, such as online ticketing, bank transactions, etc.
• It is mostly used in data warehousing kind of applications,
  where you need to perform batch processing on a huge
  amount of data.
• Typical examples of this kind of data include Web logs, call
  data records, weather data, etc.
    The architecture of Hive consists of the following components:
• User Interface (UI)—Allows you to submit queries to the Hive
  system for execution.
• Driver—Receives the submitted queries. This driver component
  creates a session handle for the submitted query and then sends the
  query to the compiler to generate an execution plan.
• Compiler—Parses the query, performs semantic analysis on
  different query blocks and query expressions, and generates an
  execution plan.
• Metastore—Stores all the information related to the structure of the
  various tables and partitions in the data warehouse.
• It also includes column and column type information and the serializers
  and deserializers necessary to read and write data.
• It also contains information about the corresponding HDFS files where
  your data is stored.
• Execution Engine—Executes the execution plan created by the
  compiler. The plan is in the form of a Directed Acyclic Graph (DAG) to
  be executed in various stages.
• This engine manages the dependencies between the different stages of a
  plan and is also responsible for executing these stages on the appropriate
  system components.
               Getting Started with Hive
Before using Hive, we need to install it and set the HIVE_HOME
    environment variable to use Hive on Hadoop.
 The steps for installing Hive on Hadoop are as follows:
Installing Hive with embedded metastore:
 1. Download the latest version of Hive.
2. Untar the package:
      $ tar -xzvf apache-hive-0.13.1-bin.tar.gz
3. Add the following to ~/.bash_profile:
      $ sudo nano ~/.bash_profile
      export HIVE_HOME=/home/hduser/hive-0.13.1
      export PATH=$PATH:$HIVE_HOME/bin
   Here, hduser is the user name and hive-0.13.1 is the Hive directory
   extracted from the tar file.
4. Install an SQL database like MySQL on the same machine where you
    want to run Hive.
5. Add the following entries to hive-site.xml:
6. Run Hive from terminal:
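The entries for step 5 and the command for step 6 are not reproduced in these notes; the following is a minimal sketch, assuming a MySQL metastore database named metastore_db and the credentials hiveuser/hivepass (all hypothetical names):

```xml
<!-- hive-site.xml: point Hive's metastore at MySQL (names are assumptions) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
```

After that, Hive is started from the terminal simply with `$ hive`.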
Hive commands to be used in the shell interactive mode:
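The examples described in the next sentence are not shown in the notes; they can be sketched as follows (mytable is a hypothetical table):

```sql
hive> set;                          -- prints the list of configuration variables
hive> SELECT my.* FROM mytable my;  -- displays the columns via the alias my.
hive> dfs -ls /user/hive;           -- runs the dfs command to list a directory
```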
In the preceding examples, the set command prints a list of configuration
variables, the select query displays the columns beginning with my. from
the mytable table, and dfs -ls executes the dfs command to list the
directory content.
                      4.2 Hive Services
Hive services are as follows:
• CLI—The command line interface of Hive; the shell window that we
  get after the installation of Hive.
• It is the inbuilt default service of Hive.
• Hive server—Runs Hive as an integrated server to expose a Thrift
  service, which enables access for clients written in different
  languages.
• Applications that use JDBC or ODBC connectors need a running
  Hive server to communicate with Hive.
•    Hive Web Interface (HWI)—The Hive Web Interface is the GUI
    of Hive on which we can execute the queries. It is an alternative to
    the shell.
HWI uses the following command:
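In the older Hive releases these notes describe (0.x), HWI is launched as a Hive service:

```shell
$ hive --service hwi
```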
• JAR—Analogous to the hadoop jar command; a convenient way to
  run Java applications that need both the Hadoop and the Hive
  classes on the classpath.
• You do not need to build a special JAR for such analysis; Hive
  handles it with a simple command.
• Metastore—The service that runs along with the Hive service
  whenever Hive starts. This is the default (embedded) mode.
• Using the metastore service, it is also possible to run the metastore
  as a standalone (remote) process.
• For this, you just need to set the METASTORE_PORT environment
  variable so that the server listens on the specified port.
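A sketch of starting a standalone metastore; 9083 is the conventional metastore port, but treat the value as an assumption for your setup:

```shell
$ export METASTORE_PORT=9083
$ hive --service metastore
```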
• Hive client—When you run Hive as a server (hiveserver), there are
  many different mechanisms through which applications can
  communicate with it, such as the Thrift client, the JDBC driver,
  and the ODBC driver.
                         Hive Variables
• Hive allows you to set variables that can be referred in the Hive
  script.
• For this purpose, you need to use the -d or --define option, as
  shown in the following commands:
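The commands referred to are not reproduced in the notes; a sketch, using the hypothetical names sample_database and sample_table:

```sql
$ hive -d db=sample_database -d tbl=sample_table

hive> CREATE DATABASE IF NOT EXISTS ${db};
hive> USE ${db};
hive> CREATE TABLE ${tbl} (id INT, name STRING);  -- ${tbl} is substituted
```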
In the preceding commands, a table named sample_table is created in
    the database sample_database. By default, the variable substitution
    option is enabled.
Hive Properties
•  The hive-site.xml file stores the configuration properties of Hive.
  These properties can be overwritten by the developers.
• To overwrite the properties of the hive-site.xml file, the set
  command is used.
• For example, the following command sets the path of the current
    directory to /tmp/mydir for all subsequent commands:
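The command itself is not shown in the notes; assuming the property meant is Hive's scratch-directory setting, a sketch:

```sql
hive> set hive.exec.scratchdir=/tmp/mydir;
```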
                        Hive Queries
• Hive allows you to simultaneously execute one or more queries.
  These queries can be stored in and executed from files.
• The extension of a query file in Hive is .hql or .q. Let’s take an
  example of a Hive query file, named ourquery.hql, stored in the
  …/home/weusers/queries/ folder.
• Now, type the following commands to execute the query stored in
  the ourquery.hql file:
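The command is not shown in the notes; using Hive's -f option, and assuming the folder above resolves to /home/weusers/queries, a sketch:

```shell
$ hive -f /home/weusers/queries/ourquery.hql
```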
                 4.3 Data Types in Hive
• Hive supports two kinds of data types: primitive types and
  complex types.
• Primitive data types are built-in data types, which also act as basic
  structures for building more sophisticated data types.
• Primitive data types are associated with columns of a table.
The following types of complex data types are available in Hive:
• Structs—The elements within this type can be accessed by using
  the Dot (.) operator.
• Maps—The elements in a map are accessed by using the ['element
  name'] notation.
• Arrays—The elements in an array have to be of the same data type.
  Elements can be accessed by using the [n] notation where n
  represents the index of the array.
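A sketch of a table that uses the three complex types and the access notations described above (table and column names are hypothetical):

```sql
CREATE TABLE complex_demo (
  addr   STRUCT<city:STRING, zip:INT>,  -- struct
  scores MAP<STRING, INT>,              -- map
  phones ARRAY<STRING>                  -- array
);

SELECT addr.city,       -- Dot (.) operator for structs
       scores['math'],  -- ['element name'] notation for maps
       phones[0]        -- [n] notation for arrays
FROM complex_demo;
```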
Built-in Functions
• You must already be aware of the concept of functions and the role
  they play in the development of an application.
•    A function is a group of commands used to perform a particular
    task in a program and return an outcome.
• Like every programming language, Hive also has its set of built-in
  functions (also known as pre-defined functions).
The following lists some of the built-in functions available in Hive:
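The original table of functions is not reproduced here; a few well-known Hive built-ins (aggregate functions such as count, sum, and avg also exist), as a sketch:

```sql
SELECT round(3.7),             -- rounds to the nearest integer
       floor(3.7),             -- largest integer not greater than the argument
       concat('big', 'data'),  -- string concatenation
       upper('hive'),          -- converts to uppercase
       substr('hadoop', 1, 3); -- substring starting at position 1, length 3
```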
                     4.4 Hive-DDL
• Data Definition Language (DDL) is used to describe data and data
  structures of a database.
• Hive has its own DDL, similar to SQL DDL, which is used for
  creating, altering, and dropping databases, tables, and other
  objects in a database.
• Similar to other SQL databases, Hive databases also act as
  namespaces for tables.
• Some of the main commands used in DDL are as follows:
• In the preceding command, the name of the database is added before
  the name of the table.
• Therefore, temp_table gets added in the temp_database.
• In addition, you can also create a table in the database by using the
   following commands:
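Those commands can be sketched as follows (the column list is an assumption):

```sql
USE temp_database;
CREATE TABLE temp_table (id INT, name STRING);
```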
• In the preceding commands, the USE statement is used for
  setting the current database to execute all the subsequent
  HiveQL statements. In this way, you do not need to add the name
  of the database before the table name.
• The table temp_table is created in the database temp_database.
• Furthermore, you can specify DBPROPERTIES in the form of
  key-value pairs in the following manner:
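A sketch of the WITH DBPROPERTIES syntax; the key-value pairs here are hypothetical:

```sql
CREATE DATABASE temp_database
WITH DBPROPERTIES ('creator' = 'hduser', 'created' = '2015-01-01');

DESCRIBE DATABASE EXTENDED temp_database;  -- displays the properties
```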
Viewing a Database
• You can view all the databases present in a particular path by using
  the following command:
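The command is SHOW DATABASES; a sketch:

```sql
SHOW DATABASES;
SHOW DATABASES LIKE 'temp*';  -- filter by a name pattern
```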
• Dropping a Database
• Dropping a database means deleting it from its storage location
• The database can be deleted by using the following command:
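A sketch of the DROP DATABASE command:

```sql
DROP DATABASE IF EXISTS temp_database;
-- CASCADE also drops any tables the database still contains:
DROP DATABASE IF EXISTS temp_database CASCADE;
```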
                           Creating Tables
•    You can create a table in a database by using the CREATE
  command, as discussed earlier.
• Now, let’s learn how to provide the complete definition of a table
  in a database by using the following commands:
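The commands are not reproduced in the notes; a fuller table definition, as a sketch (the table name, columns, comments, and delimiter are assumptions):

```sql
CREATE TABLE IF NOT EXISTS temp_database.employees (
  id   INT    COMMENT 'employee id',
  name STRING,
  sal  FLOAT
)
COMMENT 'employee details'
PARTITIONED BY (country STRING)            -- partition column
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```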
                          Altering Tables
Altering a table means modifying or changing an existing table.
By altering a table, you can modify the metadata associated with the table.
The table can be modified by using the ALTER TABLE
The altering of a table allows you to:
 Rename tables
 Modify columns
 Add new columns
 Delete some columns
 Change table properties
 Alter tables for adding partition
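Each operation in the list above maps to a form of the ALTER TABLE statement; a sketch (table and column names are hypothetical, and the partition form requires a partitioned table):

```sql
ALTER TABLE temp_table RENAME TO emp_table;                   -- rename table
ALTER TABLE emp_table CHANGE sal salary FLOAT;                -- modify a column
ALTER TABLE emp_table ADD COLUMNS (dept STRING);              -- add new columns
ALTER TABLE emp_table REPLACE COLUMNS (id INT, name STRING);  -- keep only these
ALTER TABLE emp_table SET TBLPROPERTIES ('notes' = 'demo');   -- table properties
ALTER TABLE emp_table ADD PARTITION (country = 'IN');         -- add a partition
```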
             4.5 Data Manipulation
After specifying the database schema and creating a database, the data
   can be modified by using a set of procedures/mechanisms defined
   by a special language known as Data Manipulation Language
   (DML).
 Data can be manipulated in the following ways:
• Loading files into tables
• Inserting data into Hive table from queries
• Updating existing tables
• Deleting records in tables
Let’s learn about each of these ways.
Loading Files into Tables
• While loading data into tables, Hive does not perform any type of
  transformations.
• The data load operations in Hive are, at present, pure copy/move
  operations, which move data files from one location to another.
• You can upload data into Hive tables from the local file system as
  well as from HDFS.
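Both load paths can be sketched as follows (paths and table name are hypothetical):

```sql
-- from the local file system, appending to the table
LOAD DATA LOCAL INPATH '/home/hduser/data/emp.txt' INTO TABLE emp;

-- from HDFS, replacing the existing data
LOAD DATA INPATH '/user/hduser/data/emp.txt' OVERWRITE INTO TABLE emp;
```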
Inserting Data into Tables
• Consider two tables, T1 and T2. We want to copy the sal column
  from T2 to T1 by using the INSERT command. It can be done as
  follows:
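The command is not shown in the notes; a sketch, assuming T1 has a single sal column:

```sql
INSERT OVERWRITE TABLE T1 SELECT sal FROM T2;
-- or append rather than overwrite:
INSERT INTO TABLE T1 SELECT sal FROM T2;
```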
Static Partition Insertion
• Static partition insertion refers to the task of inserting data into a
   table by specifying a partition column value.
Consider the following example:
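The example is not reproduced; a sketch, assuming emp is partitioned by country and staged_emp is a hypothetical source table:

```sql
-- the partition value 'US' is fixed (static), so country is not selected
INSERT OVERWRITE TABLE emp PARTITION (country = 'US')
SELECT id, name, sal FROM staged_emp WHERE cnty = 'US';
```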
Dynamic Partition Insertion
• In dynamic partition insertion, you need to specify a list of partition column
  names in the PARTITION() clause along with the optional column values.
• A dynamic partition column always has a corresponding input column in
  the SELECT statement.
• If the SELECT statement has multiple column names, the dynamic partition
  columns must be specified at the end of the columns and in the same order
   in which they appear in the PARTITION() clause.
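A sketch of dynamic partition insertion; the two SET properties are real Hive configuration settings, while the table names are assumptions:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- country is the dynamic partition column; it comes last in the SELECT
INSERT OVERWRITE TABLE emp PARTITION (country)
SELECT id, name, sal, cnty AS country FROM staged_emp;
```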
Inserting Data into Local Files
• Sometimes, you might require to save the result of the SELECT
   query in flat files so that you do not have to execute the queries
   again and again.
Consider the following example:
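The example is not reproduced; a sketch of writing query results to a local directory (path and predicate are hypothetical):

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/results'
SELECT name, sal FROM emp WHERE sal > 50000;
```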
Creating and Inserting Data into a Table Using a Single Query
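This is typically done with a CREATE TABLE ... AS SELECT (CTAS) statement; a sketch with hypothetical names:

```sql
CREATE TABLE high_paid AS
SELECT name, sal FROM emp WHERE sal > 50000;
```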
Delete in Hive
• The delete operation is available in Hive from version 0.14 onward.
• The delete operation can be performed only on tables that support
  the ACID property.
• The syntax for performing the delete operation is as follows:
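The syntax is DELETE FROM tablename [WHERE expression]; a sketch, including the ACID table setup such a delete requires (in Hive 0.14, transactional tables must be bucketed and stored as ORC):

```sql
CREATE TABLE emp_acid (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

DELETE FROM emp_acid WHERE id = 7;
```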
                      4.6 Data Retrieval Queries
    Hive allows you to perform data retrieval queries by using the
    SELECT command along with various types of operators and
    clauses. In this section, you learn about the following:
•   Using the SELECT command
•   Using the WHERE clause
•   Using the GROUP BY clause
•   Using the HAVING clause
•   Using the LIMIT clause
•   Executing HiveQL queries
Using the SELECT Command
The SELECT statement is the most common operation in SQL.
You can filter the required columns, rows, or both.
The syntax for using the SELECT command is as follows:
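The general syntax, as a sketch:

```sql
SELECT [ALL | DISTINCT] column1, column2, ...
FROM table_name
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[LIMIT n];
```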
Using the GROUP BY Clause
• The GROUP BY clause is used to put all the related records
  together. It can also be used with aggregate functions. Often, it is
  required to group the resultsets in complex queries.
• In such scenarios, the ‘GROUP BY’ clause can be used.
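A sketch of grouping with aggregate functions (table and column names are hypothetical):

```sql
-- number of employees and average salary per department
SELECT dept, count(*), avg(sal)
FROM emp
GROUP BY dept;
```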
Using the HAVING Clause
• The HAVING clause is used to specify a condition on the use of
  the GROUP BY clause. The use of the HAVING clause is added
  in the version 0.7.0 of Hive.
• The following query shows an example of using the HAVING
  clause:
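The query referred to is not reproduced in the notes; a sketch with hypothetical names:

```sql
SELECT dept, count(*) AS emp_count
FROM emp
GROUP BY dept
HAVING count(*) > 10;
```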
                         4.7 Using JOINS in Hive
 Hive supports joining two or more tables to aggregate information. The
   various joins supported by Hive are:
• Inner joins
• Outer joins
 Inner Joins
• In case of inner joins, only the records satisfying the given condition get
   selected. All the other records get discarded.
• Let’s take an example to describe the concept of inner joins.
   Consider two tables, order and customer
 Table 12.6 lists the data of the order table.
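An inner join over the two tables, as a sketch (the column names are assumptions):

```sql
SELECT c.name, o.order_id, o.amount
FROM customer c
JOIN `order` o                  -- order is quoted because it is a keyword
  ON (c.cust_id = o.cust_id);
```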
                              Outer Joins
• Sometimes, you need to retrieve all the records from one table and
  only some records from the other table.
• In such cases, you have to use the outer join.
• Figure 12.12 illustrates the concept of outer joins:
Outer joins are of three types:
• Right Outer Join
• Left Outer Join
• Full Outer Join
Right Outer Join
• In this type of join, all the records from the table on the right side of the
   join are retained.
Left Outer Join
• In this type of join, all the records from the table on the left side of
  the join are retained.
• Figure 12.14 illustrates the concept of left outer joins:
Full Outer Join
• In this case, all the fields from both tables are included. For the
  entries that do not have any match, a NULL value would be
  displayed.
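The three outer joins over the same customer and order tables, as a sketch (column names are assumptions):

```sql
SELECT c.name, o.amount FROM customer c
LEFT OUTER JOIN `order` o ON (c.cust_id = o.cust_id);   -- all customers kept

SELECT c.name, o.amount FROM customer c
RIGHT OUTER JOIN `order` o ON (c.cust_id = o.cust_id);  -- all orders kept

SELECT c.name, o.amount FROM customer c
FULL OUTER JOIN `order` o ON (c.cust_id = o.cust_id);   -- both sides; NULL where no match
```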
Joining Tables
• You can combine the data of two or more tables in Hive by using
   HiveQL queries.
•   For this, we need to create the tables and load their data into Hive
    from HDFS.