
Apache HIVE

Introduction

The term ‘Big Data’ is used for collections of large datasets characterized by huge volume, high
velocity, and a wide variety of data, and these collections keep growing day by day. It is
difficult to process Big Data using traditional data management systems. Therefore, the Apache
Software Foundation introduced a framework called Hadoop to solve Big Data management and
processing challenges.

Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).

 MapReduce: A parallel programming model for processing large amounts of
structured, semi-structured, and unstructured data on large clusters of commodity
hardware.

 HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to
store the datasets. It provides a fault-tolerant file system that runs on
commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive
that complement the core Hadoop modules.

 Sqoop: It is used to import and export data between HDFS and RDBMS.

 Pig: It is a procedural language platform used to develop scripts for MapReduce
operations.

 Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.

Note: There are various ways to execute MapReduce operations:

 The traditional approach using a Java MapReduce program for structured, semi-structured,
and unstructured data.

 The scripting approach for MapReduce to process structured and semi-structured
data using Pig.

 The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive, as illustrated below.
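
For example, an aggregation that would otherwise require a hand-written Java MapReduce job can be
expressed in a single HiveQL statement. A minimal sketch, assuming a hypothetical table
pages(url string, visits int) already exists:

hive> SELECT url, SUM(visits) FROM pages GROUP BY url;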

What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as an open-source project under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive
Hive provides a declarative, SQL-based language, mainly used for data analysis and creating
reports. Hive operates on the server side of a cluster. It provides schema flexibility and
evolution along with data summarization, querying, and analysis in a much easier manner.

In Hive, we can create two types of tables – partitioned and bucketed – which make it
feasible to process data stored in HDFS and improve performance as well.

Hive tables are defined directly on data in the Hadoop Distributed File System (HDFS). Hive
provides JDBC/ODBC drivers. Hive is fast, scalable, and easy to learn. Hive has a rule-based
optimizer for optimizing plans. Using Hive, we can also execute ad-hoc queries to analyze
data.

Here are the features of Hive:

 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:

User Interface: Hive is a data warehouse infrastructure software that enables
interaction between the user and HDFS. The user interfaces that Hive supports are
the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Metastore: Hive chooses a database server to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and is used for querying the schema
information in the Metastore. It is one of the replacements for the traditional
MapReduce programming approach. Instead of writing a MapReduce program in Java, we
can write a HiveQL query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is
the Hive Execution Engine. The execution engine processes the query and generates
the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage
technique used to store the data in the file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Working :

1. First of all, the user submits the query, and the CLI sends that query to the Driver.
2. The driver then takes the help of the query compiler to check the syntax.
3. The compiler requests metadata by sending a metadata request to the Metastore.
4. In response to that request, the Metastore sends the metadata to the compiler.
5. The compiler then checks the requirements and resends the plan to the driver.
6. The driver sends the plan to the execution engine.
7. The execution engine sends the job to the JobTracker, which assigns the job to a
TaskTracker. Here, the query is executed as a MapReduce job. In the meantime, the
execution engine executes metadata operations with the Metastore.
8. The execution engine then fetches the results from the Data Nodes and sends those
results to the driver.
9. At last, the driver sends the results to the Hive interface.

HIVE Metastore –

Hive Metastore is the central repository for metadata. It stores metadata for Hive
tables (like their schema and location) and partitions in a relational database. It provides
client access to this information by using the metastore service API.

Modes:
 Embedded: By default, the metastore service and the Hive service run in the same
JVM, and the metadata is stored in an embedded Derby database on the local file system.
 Local: Hive is a SQL-based framework that should support multiple sessions. In Local
mode, multiple Hive sessions are allowed; the metastore database (for example, MySQL)
runs in a separate process and is accessed through JDBC.
 Remote: In this mode, the metastore service runs in its own JVM, separate from the Hive
service. Thrift network APIs are used by the different processes to communicate with it
(a sample configuration sketch follows this list).
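
For illustration, the metastore backend and service endpoint are configured through properties in
hive-site.xml. A minimal sketch, assuming a MySQL database named metastore on localhost and the
default Thrift port (the host, port, database name, and credentials are assumptions):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<!-- remote mode only: clients reach the metastore service over Thrift -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>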

Limitations –
Apache Hive has some limitations also:
1. Read-only views are allowed but materialized views are not allowed.
2. It does not support triggers.
3. Apache Hive queries have very high latency.
4. No difference between NULL and null values.

How HIVE is different from RDBMS ?


 RDBMS supports schema on write, whereas Hive provides schema on read.
 Hive follows a write-once, read-many model, whereas in RDBMS we can read and write as many times as we want.
 Hive can handle big datasets, whereas RDBMS can’t handle beyond about 10 TB.
 Hive is highly scalable, but scalability in RDBMS costs a lot.
 Hive has a bucketing feature which is not there in RDBMS.
The following table defines how Hive interacts with the Hadoop framework:

Step No. Operation

1 Execute Query

The Hive interface, such as the command line or Web UI, sends the query to the Driver
(any database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan

The driver takes the help of the query compiler, which parses the query to check the
syntax and build the query plan, i.e., the requirements of the query.

3 Get Metadata

The compiler sends metadata request to Metastore (any database).

4 Send Metadata

Metastore sends metadata as a response to the compiler.

5 Send Plan
The compiler checks the requirements and resends the plan to the driver.
Up to here, the parsing and compiling of the query is complete.

6 Execute Plan

The driver sends the execute plan to the execution engine.

7 Execute Job

Internally, the process of executing the job is a MapReduce job. The execution
engine sends the job to the JobTracker, which resides on the Name Node, and the
JobTracker assigns the job to a TaskTracker, which resides on a Data Node. Here, the
query is executed as a MapReduce job.

7.1 Metadata Ops

Meanwhile, during execution, the execution engine can execute metadata
operations with the Metastore.

8 Fetch Result

The execution engine receives the results from Data nodes.

9 Send Results

The execution engine sends those resultant values to the driver.

10 Send Results

The driver sends the results to Hive Interfaces.


2. HIVE INSTALLATION
All Hadoop sub-projects such as Hive, Pig, and HBase support Linux operating system.
Therefore, you need to install a Linux-flavored OS. The following simple steps are
executed for Hive installation:

Step 1: Verifying JAVA Installation


Java must be installed on your system before installing Hive. Let us verify java installation
using the following command:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, then follow the steps given below to install it.

Installing Java
Step I:
Download java (JDK <latest version> - X64.tar.gz) by visiting the following link
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:
Generally, you will find the downloaded java file in the Downloads folder. Verify it and
extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/

$ ls

jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz

$ ls

jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:
To make Java available to all users, you have to move it to the location “/usr/local/”.
Open root and type the following commands.

$ su

password:

# mv jdk1.7.0_71 /usr/local/

# exit

Step IV:
To set up the PATH and JAVA_HOME variables, add the following commands to the
~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71

export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step V:
Use the following commands to configure java alternatives:

# alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2

# alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2

# alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2

# alternatives --set java /usr/local/jdk1.7.0_71/bin/java

# alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac

# alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar
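
This completes the Java prerequisite. After downloading a Hive release and extracting it, the
HIVE_HOME and PATH variables are set in ~/.bashrc in the same way. A minimal sketch, assuming Hive
was extracted to /usr/local/hive (the path is an assumption):

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

$ source ~/.bashrc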

HIVE QUERIES:

Hive Databases & Tables

Create database in Hive

We can connect to Hive in three different ways:


1) Hive CLI
2) Beeline
3) Hue ( browser application)

IP: 192.168.13.132

Browser Hue Application: http://192.168.13.132:8888/accounts/login/?next=/


Login : cloudera/cloudera

Hive databases

Display databases:

hive> show databases;

Creating new databases:

hive> create database userdb;


hive> create schema userdb1;

Pointing to specific database

hive> use userdb;

Display the current database

hive> set hive.cli.print.current.db=true;


hive (userdb)>

dropping databases

hive (userdb)> drop database userdb1;
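
By default, dropping a database that still contains tables fails. Adding the CASCADE keyword drops
the contained tables first; a sketch with a hypothetical database testdb:

hive> drop database testdb cascade;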

Hive data types

Numeric Types:
INT, TINYINT, BIGINT, FLOAT, DOUBLE, DECIMAL
String Types:
CHAR, VARCHAR, STRING
Date/Time Types:
TIMESTAMP, DATE
Complex Types:
ARRAY, MAP, STRUCT (a sample table definition follows)
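
A table definition exercising several of these types might look like the following; this is a
hypothetical sketch (the table and column names are not used elsewhere in this material):

hive> create table type_demo(
        id int,
        name string,
        salary decimal(10,2),
        joined date,
        last_login timestamp,
        skills array<string>,
        properties map<string,string>
      );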

Hive Tables

1) Internal Tables (Managed tables)


2) External Tables

Internal Tables (Managed Tables)

 First, we have to create the table and then load the data.
 We can call this "data on schema".
 By dropping this table, both the data and the schema will be removed.
 The storage location of this table will be under /user/hive/warehouse.
 Data is temporary.
 Use it when Hive should manage the table data completely, not allowing any external
source to use the table (external sources mean Sqoop, Pig, MapReduce, etc.).
 Use it when you don’t want the data after deletion.
External Tables

 Data will already be available in HDFS. The table is created on top of the HDFS data.
 We can call this "schema on data".
 At the time of dropping the table, only the schema is dropped; the data will still be
available in HDFS as before.
External tables provide an option to create multiple schemas for the data stored in HDFS,
instead of deleting the data every time the schema is updated.

 Use it if the data to be processed is already available in HDFS.
 Useful when the files are being used outside of Hive (a quick way to check a table's
type is shown below).
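
Whether a given table is managed or external can be checked from the Table Type field of its
metadata; a sketch against the employee table created in the next section:

hive> describe formatted employee;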
Creating Internal tables & loading data

1) create a table and insert data
2) create a table then load the data from the Local File System
3) create a table then load the data from HDFS
4) create a table based on another table
5) create a table then load the data from another table

Method-1

hive (userdb)> create table employees(empid int,ename string,salary int,job string,deptno int);
hive (userdb)> insert into employees values(101,'pavan',60000,'engineer',10);
hive (userdb)> select * from employees;

Method-2

hive (userdb)> create table IF NOT EXISTS employee(eid int,name string,salary string,designation
string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive (userdb)> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt'
OVERWRITE INTO TABLE employee;
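
The emp.txt file referenced above is assumed to contain comma-separated rows matching the column
order of the table, for example (hypothetical contents):

101,pavan,60000,engineer
102,kiran,50000,manager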

Method-3

-> copy the file into HDFS


[cloudera@quickstart hivedata]$ hdfs dfs -put /home/cloudera/training/hivedata/emp.txt
/user/hive/warehouse/

hive (userdb)> Create table IF NOT EXISTS employee ( eid int, name String,salary String, designation
String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

hive (userdb)> LOAD DATA INPATH '/user/hive/warehouse/emp.txt' OVERWRITE INTO TABLE
employee;

Method-4

->Describe table
hive (userdb)> describe employee;

hive (userdb)> create table employee_new as select eid,name from employee;

Method-5

hive (userdb)> create table empnames(names string) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

hive (userdb)> insert into table empnames select name from employee;

External tables

 External tables are created on top of existing data.
 If the table is deleted, the data still persists in HDFS.
1) create external table and insert data.
2) create external table and load the data from Local file system.
3) create external table and load the data from HDFS.
Method-1
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/user/training/employee_external';
hive> insert into employee_external values(111,'pavan','50000','manager');
Method-2
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/user/training/employee_external';
hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' INTO TABLE
employee_external;
hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' OVERWRITE INTO
TABLE employee_external;
** If you use OVERWRITE in the load statement, the existing data will be overwritten.
** Otherwise, a copy of the file will be created alongside the existing data.
** If you drop the table, the data file will not be removed from HDFS (see the check below).
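This can be verified by dropping the external table and listing its location in HDFS; the data
files remain. A sketch using the LOCATION defined above:

hive> drop table employee_external;
[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/training/employee_external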
Method-3
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/user/training/employee_external';
hive> LOAD DATA INPATH '/user/training/emp.txt' INTO TABLE employee_external;
describe
hive> describe employee_external;
OK
eid int
name string
salary string
designation string
Time taken: 0.09 seconds, Fetched: 4 row(s)
hive> describe extended employee_external;
OK
eid int
name string
salary string
designation string

Complex Data Types


array: stores multiple elements of the same data type in a single field.
map: stores key/value pairs in a single field.
struct: stores a group of named fields in a single field.
Array
hive> create table arrayTable(id int,name string,sal bigint,sub array<string>,city string)
row format delimited fields terminated by ',' collection items terminated by '$';
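
The input file arrrayfile.txt is assumed to contain rows like the following, with ',' between
fields and '$' between array elements (contents inferred from the query output below):

1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang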

hive> load data local inpath'/home/cloudera/training/hivedata/arrrayfile.txt' into table arrayTable;


hive> select * from arrayTable;
OK
1 abc 40000 ["a","b","c"] hyd
2 def 3000 ["d","f"] bang
Time taken: 1.029 seconds, Fetched: 2 row(s)
hive> select sub[2] from arrayTable;
OK
c
NULL
Time taken: 0.402 seconds, Fetched: 2 row(s)
hive> select sub[1] from arrayTable;
OK
b
f
Time taken: 0.244 seconds, Fetched: 2 row(s)
hive>
Map
hive> create table maptable(id int,name string,sal bigint,sub array<string>,pfund map<string,int>,city string)
row format delimited fields terminated by ','
collection items terminated by '$' map keys terminated by '#';
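
mapfile.txt is assumed to contain rows like the following, with '$' between map entries and '#'
between each key and its value (hypothetical contents consistent with the delimiters above):

1,abc,40000,a$b$c,pf#500$epf#200,hyd
2,def,3000,d$f,pf#500,bang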
hive> load data local inpath'/home/cloudera/training/hivedata/mapfile.txt' into table maptable;

hive> select * From maptable;


hive> select pfund["pf"],pfund["epf"] from maptable;

Struct
hive> create table structtable(id int,name string,sal bigint,sub array<string>,dud map<string,int>,
address struct<city:string,state:string,pin:bigint>)
row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';
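
structfile.txt is assumed to contain rows like the following, with '$' also separating the struct
fields (contents inferred from the query output below):

1,abc,40000,a$b$c,pf#500$epf#200,hyd$ap$500001
2,def,3000,d$f,pf#500,bang$kar$600038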

hive> load data local inpath '/home/cloudera/training/hivedata/structfile.txt' into table structtable;

hive> select * From structtable;


OK
1 abc 40000 ["a","b","c"] {"pf":500,"epf":200} {"city":"hyd","state":"ap","pin":500001}
2 def 3000 ["d","f"] {"pf":500} {"city":"bang","state":"kar","pin":600038}
Time taken: 0.136 seconds, Fetched: 2 row(s)
hive> select address.city from structtable;
OK
hyd
bang
Time taken: 0.14 seconds, Fetched: 2 row(s)
hive> select address.pin from structtable;
OK
500001
600038
Time taken: 0.13 seconds, Fetched: 2 row(s)
hive>
Partitioning
1) create Hive table
hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields
terminated by ',';
2) Load the data
hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;
3) create partition table
hive> create table students_part(sno int,sname string) PARTITIONED BY(passout_year string);
4) For dynamic partition inserts we have to set this property
set hive.exec.dynamic.partition.mode=nonstrict;
5) Load data into partition table
hive> INSERT OVERWRITE TABLE students_part PARTITION(passout_year) SELECT sno,sname,
passout_year from all_students;
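Each distinct passout_year value becomes its own subdirectory (for example passout_year=2011)
under the table's warehouse directory, and queries that filter on the partition column read only
that directory. A small check, assuming the data contains the year 2011 as in the view example
later in this material:

hive> show partitions students_part;
hive> select * from students_part where passout_year='2011';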

Bucketing
1) create Hive table
hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields
terminated by ',';
2) Load the data
hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;
3) creating bucket table:
create table students_buck(sno int,sname string, passout_year string) clustered by (passout_year) sorted
by (sno) into 3 buckets;
4) Enable bucketing:
set hive.enforce.bucketing=true;
5) Insert data into bucket table:
insert overwrite table students_buck select * from all_students;
** After bucketing splits will be stored in /user/hive/warehouse/students_buck
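Since the table was clustered into 3 buckets, that directory should contain three bucket files;
this can be checked from the shell (a sketch; the exact path may include a database subdirectory):

[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/hive/warehouse/students_buck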
6) Test Bucket table:
select * from students_buck tablesample(bucket 2 out of 3 on sname);
7) Test normal table:
select * from all_students limit 10;
-------------------------------------
Alter table - rename table name, rename column
Views & Indexes
Select ..where
Select ..order by
Select..Group By
-----------------------------
Renaming table
hive> alter table all_students rename to students;
Renaming the column
hive> alter table students change passout_year pyear String;
View ---> A view is a logical object; it can be created over one or more columns of a table.
Creating view
hive> create view students_view as select sno,pyear from students where pyear='2011';
hive> select * from students_view;
OK
5 2011
6 2011
7 2011
18 2011
19 2011
20 2011
Dropping view
hive> drop view students_view;
Indexes
Index --> improves performance when fetching data from tables.
Creating index
hive> create index sno_idx on table students(sno) as 'compact' with deferred rebuild;
hive> show index on students;
Dropping Index
hive> drop index sno_idx on students;
HQL--Select with where clause
where --> used for filtering the rows based on condition.
Display students who passed in the year 2012.
hive> select * from students where pyear=2012;
HQL-Select with order by clause
Display students in ascending order based on pyear.
hive> select * from students order by pyear;
HQL-Select with group by clause
Count number of students passed in each year.
hive> select pyear,count(*) from students group by pyear;
Group by with having clause
having --> applies a condition on top of the group by clause.
hive> select pyear,count(*) from students group by pyear having count(*)>=5;

Hive Built-in functions


dummy --> a table with one row and one column.
hive> select * From dummy;
OK
xxx
Time taken: 2.575 seconds, Fetched: 1 row(s)
hive> describe dummy;

hive> select concat('abc','xyz') from dummy;


hive> select concat(sname,pyear) from students;
hive> select substr('welcome',2,4) from dummy;
elco
hive> select substr('welcome',3,2) from dummy;
lc
hive> select substr(sname,1,4) from students;
hive> select upper('welcome') from dummy;
WELCOME
hive> select ucase('welcome') from dummy;
WELCOME
hive> select lower('WELCOME') from dummy;
welcome
hive> select lcase('WELCOME') from dummy;
welcome
hive> select upper(sname) from students;
hive> select lower(sname) from students;
hive> select ' welcome ' from dummy;
welcome
hive> select trim(' welcome ') from dummy;
welcome
hive> select rtrim(' welcome ') from dummy;
welcome
hive> select ltrim(' welcome ') from dummy;
welcome
hive> select CURRENT_DATE from dummy;
2017-12-28
hive> select year(CURRENT_DATE) from dummy;
hive> select month(CURRENT_DATE) from dummy;
hive> select day(CURRENT_DATE) from dummy;

Aggregation Functions
Aggregation functions applicable for group of columns.
Tables: owners , cars
hive> create table owners(id int,name string,age int,car_id int) row format delimited fields terminated by
','
lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/owners.txt' into table owners;
hive> create table cars(id int,make string,model string,year string)row format delimited fields terminated
by ','
lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/cars.txt' into table cars;

count(*) --> returns the total number of rows, including nulls.
count(expr) --> returns the number of rows for which the supplied expression is non-NULL.
sum(col) --> returns the sum of the elements in the group.
sum(distinct col) --> returns the sum of the unique elements in the group (eliminating duplicates).
avg(col) --> returns the average value of the elements in the group.
min(col) --> returns the minimum value in the group.
max(col) --> returns the maximum value in the group.
Query: table : cars
I want to see the car makers that have more than 1 model in cars table.
hive> select make,count(distinct model) from cars group by make having count(distinct model)>1;
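
A few more aggregate queries against the owners table created above, as a sketch (the column
names follow the table definition):

hive> select count(*) from owners;
hive> select min(age), max(age), avg(age) from owners;
hive> select car_id, count(*) from owners group by car_id;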

Hive User Defined Functions


select count(*) from customers;
Steps to create and test UDF's
1) Implement the code for UDF in Java
2) Package java class into jar file copy in some location
3) Add jar file in to Hive CLI
4) Create temporary function in hive
5) Use hive UDF BY using Query.
Prerequisite: The table should have some data.
Problem statement-1
Find the maximum marks obtained out of four subjects by a student.
Package the Java class into a jar file and copy it to some location.
-----------
Select the class in Eclipse --> Right-click --> Export --> Java --> JAR --> browse the
location --> provide a filename with the .jar extension.
Add jar file in to Hive CLI
---------------------
hive> add jar /home/cloudera/training/HiveUDFS/getMaxMarks.jar;
Create temporary function in hive
-----------------
hive> create temporary function getmaxmarks as 'udfs.GetMaxMarks';
Use hive UDF BY using Query
-------------
hive> select getmaxmarks(10,20,30,40) from dummy; // sanity test
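
After the sanity test, the same function can be applied row by row to a real table. In the sketch
below, social comes from the UDAF example further down; the other subject column names are
assumptions for illustration:

hive> select getmaxmarks(maths, physics, chemistry, social) from t_student_record;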

2 types of UDF
1) Regular UDF (UDF) ---> applied row by row to the rows of a table.
2) User Defined Aggregate Function (UDAF) --> operates on a group of rows and returns a single result.
Problem statement -2: Find the mean of marks obtained in maths by all the students.
Package java class into jar file copy in some location
----------
Right-click on the package --> Export --> Java --> JAR --> provide the jar file name.
Add jar file in to Hive CLI
-----
hive> add jar /home/cloudera/training/HiveUDFS/getMeanMarks.jar;
Create temporary function in hive
----------------
hive> create temporary function getmeanmarks as 'udaf.GetMeanMarks';
Use functions with queries
-----------------------
hive> select getmeanmarks(social)from t_student_record;

Hive clients
------------------
1) hive CLI
2) beeline
3) Hue ( Web UI) ; http://192.168.13.135:8888
4) JDBC Client ( Java program)
Hive Joins:
Join is used for getting the data from multiple tables.
1) Inner join
2) Left outer join
3) right outer join
4) full outer join
T1: 1 3 5 7 (left table)
T2: 1 2 5 8 9 (right table)
inner join: 1, 5 (records common to both tables)
left outer join: 1, 5, 3, 7 (common records + extra records from the left table; where there is no
corresponding value on the right, NULLs are substituted)
right outer join: 1, 5, 2, 8, 9 (common records + extra records from the right table, with NULLs on the left side)
full outer join: 1, 5, 3, 7, 2, 8, 9 (all records from both tables, with NULLs wherever there is no match)
Tables
-----------------
orders
customers
hive> create table customers(id int,name string,age int,address string,salary int) row format delimited
fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/customers.txt' into table customers;
hive> create table orders(oid int,date date,customer_id int,amount int)row format delimited fields
terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/orders.txt' into table orders;
Join Queries
INNER JOIN:
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
LEFT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS
o ON (c.ID = o.CUSTOMER_ID);
RIGHT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN
ORDERS o ON (c.ID = o.CUSTOMER_ID);
FULL OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS
o ON (c.ID = o.CUSTOMER_ID);

Processing XML
hive> create table xmltable(line string);
hive> load data local inpath '/home/cloudera/training/hivedata/sampxml1.xml' into table xmltable;

hive> select * from xmltable;


OK
xmltable.line
<rec><name>Ravi</name><age>25</age><city>hyd</city></rec>
<rec><name>Rani</name><age>24</age><sex>f</sex></rec>
<rec><name>Sampath</name><sex>m</sex><city>Del</city></rec>
Time taken: 0.064 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/name') from xmltable;
OK
_c0
Ravi
Rani
Sampath
Time taken: 0.108 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/city') from xmltable;
OK
_c0
hyd
Del
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive>
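
Numeric values can be pulled out the same way with xpath_int; a sketch, assuming the records
contain an age element as shown above:

hive> select xpath_int(line,'rec/age') from xmltable;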
Hive WebUI (Hue application)
Generating data: http://generatedata.com/
1) Querying tables
2) Create new table and import the data from external file.
step 1) Data Browsers --> Metastore Tables
step 2) Click on the '+' at the right corner.
