
Apache HIVE

Introduction

The term ‘Big Data’ is used for collections of large datasets characterized by huge volume, high
velocity, and a wide variety of data, and these collections keep growing day by day. It is
difficult to process Big Data using traditional data management systems. Therefore, the Apache
Software Foundation introduced a framework called Hadoop to solve Big Data management and
processing challenges.

Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).

 MapReduce: A parallel programming model for processing large amounts of
structured, semi-structured, and unstructured data on large clusters of commodity
hardware.

 HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to
store the datasets. It provides a fault-tolerant file system that runs on
commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive
that complement the core Hadoop modules.

 Sqoop: It is used to import and export data between HDFS and RDBMS.

 Pig: It is a procedural language platform used to develop scripts for MapReduce
operations.

 Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.

Note: There are various ways to execute MapReduce operations:

 The traditional approach using a Java MapReduce program for structured, semi-structured,
and unstructured data.

 The scripting approach for MapReduce to process structured and semi-structured
data using Pig.

 The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive, as illustrated below.
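
For example, an aggregation that would otherwise require a hand-written Java MapReduce job can be
expressed in a single HiveQL statement. A minimal sketch, assuming a hypothetical table
pages(url string, visits int) already exists:

hive> SELECT url, SUM(visits) FROM pages GROUP BY url;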

What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as an open-source project under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive
Hive provides a declarative, SQL-based language, mainly used for data analysis and creating
reports. Hive operates on the server side of a cluster. It provides schema flexibility and
evolution along with data summarization, querying, and analysis in a much easier manner.

In Hive, we can create two types of tables – partitioned and bucketed – which make it
feasible to process data stored in HDFS and improve performance as well.

Hive tables are defined directly on data in the Hadoop Distributed File System (HDFS). Hive
provides JDBC/ODBC drivers. Hive is fast, scalable, and easy to learn. Hive has a rule-based
optimizer for optimizing plans. Using Hive, we can also execute ad-hoc queries to analyze
data.

Here are the features of Hive:

 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:

User Interface: Hive is a data warehouse infrastructure software that enables
interaction between the user and HDFS. The user interfaces that Hive supports are
the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Metastore: Hive chooses a database server to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and is used for querying the schema
information in the Metastore. It is one of the replacements for the traditional
MapReduce programming approach. Instead of writing a MapReduce program in Java, we
can write a HiveQL query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is
the Hive Execution Engine. The execution engine processes the query and generates
the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage
technique used to store the data in the file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Working :

1. First of all, the user submits the query, and the CLI sends that query to the Driver.
2. The driver then takes the help of the query compiler to check the syntax.
3. The compiler requests metadata by sending a metadata request to the Metastore.
4. In response to that request, the Metastore sends the metadata to the compiler.
5. The compiler then checks the requirements and resends the plan to the driver.
6. The driver sends the plan to the execution engine.
7. The execution engine sends the job to the JobTracker, which assigns the job to a
TaskTracker. Here, the query is executed as a MapReduce job. In the meantime, the
execution engine executes metadata operations with the Metastore.
8. The execution engine then fetches the results from the Data Nodes and sends those
results to the driver.
9. At last, the driver sends the results to the Hive interface.

HIVE Metastore –

Hive Metastore is the central repository for metadata. It stores metadata for Hive
tables (like their schema and location) and partitions in a relational database. It provides
client access to this information by using the metastore service API.

Modes:
 Embedded: By default, the metastore service and the Hive service run in the same
JVM, and the metadata is stored in an embedded Derby database on the local file system.
 Local: Hive is a SQL-based framework that should support multiple sessions. In Local
mode, multiple Hive sessions are allowed; the metastore database (for example, MySQL)
runs in a separate process and is accessed through JDBC.
 Remote: In this mode, the metastore service runs in its own JVM, separate from the Hive
service. Thrift network APIs are used by the different processes to communicate with it
(a sample configuration sketch follows this list).
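
For illustration, the metastore backend and service endpoint are configured through properties in
hive-site.xml. A minimal sketch, assuming a MySQL database named metastore on localhost and the
default Thrift port (the host, port, database name, and credentials are assumptions):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<!-- remote mode only: clients reach the metastore service over Thrift -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>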

Limitations –
Apache Hive has some limitations also:
1. Read-only views are allowed but materialized views are not allowed.
2. It does not support triggers.
3. Apache Hive queries have very high latency.
4. No difference between NULL and null values.

How HIVE is different from RDBMS ?


 RDBMS supports schema on write, whereas Hive provides schema on read.
 Hive follows a write-once, read-many model, whereas in RDBMS we can read and write as many times as we want.
 Hive can handle big datasets, whereas RDBMS can’t handle beyond about 10 TB.
 Hive is highly scalable, but scalability in RDBMS costs a lot.
 Hive has a bucketing feature which is not there in RDBMS.
The following table defines how Hive interacts with the Hadoop framework:

Step No. Operation

1 Execute Query

The Hive interface, such as the command line or Web UI, sends the query to the Driver
(any database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan

The driver takes the help of the query compiler, which parses the query to check the
syntax and build the query plan, i.e., the requirements of the query.

3 Get Metadata

The compiler sends metadata request to Metastore (any database).

4 Send Metadata

Metastore sends metadata as a response to the compiler.

5 Send Plan
The compiler checks the requirements and resends the plan to the driver.
Up to here, the parsing and compiling of the query is complete.

6 Execute Plan

The driver sends the execute plan to the execution engine.

7 Execute Job

Internally, the process of executing the job is a MapReduce job. The execution
engine sends the job to the JobTracker, which resides on the Name Node, and the
JobTracker assigns the job to a TaskTracker, which resides on a Data Node. Here, the
query is executed as a MapReduce job.

7.1 Metadata Ops

Meanwhile, during execution, the execution engine can execute metadata
operations with the Metastore.

8 Fetch Result

The execution engine receives the results from Data nodes.

9 Send Results

The execution engine sends those resultant values to the driver.

10 Send Results

The driver sends the results to Hive Interfaces.


2. HIVE INSTALLATION
All Hadoop sub-projects such as Hive, Pig, and HBase support Linux operating system.
Therefore, you need to install a Linux-flavored OS. The following simple steps are
executed for Hive installation:

Step 1: Verifying JAVA Installation


Java must be installed on your system before installing Hive. Let us verify java installation
using the following command:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, then follow the steps given below to install it.

Installing Java
Step I:
Download java (JDK <latest version> - X64.tar.gz) by visiting the following link
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:
Generally, you will find the downloaded java file in the Downloads folder. Verify it and
extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/

$ ls

jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz

$ ls

jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:
To make Java available to all users, you have to move it to the location “/usr/local/”.
Open root and type the following commands.

$ su

password:

# mv jdk1.7.0_71 /usr/local/

# exit

Step IV:
To set up the PATH and JAVA_HOME variables, add the following commands to the
~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71

export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step V:
Use the following commands to configure java alternatives:

# alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2

# alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2

# alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2

# alternatives --set java /usr/local/jdk1.7.0_71/bin/java

# alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac

# alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar
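
This completes the Java prerequisite. After downloading a Hive release and extracting it, the
HIVE_HOME and PATH variables are set in ~/.bashrc in the same way. A minimal sketch, assuming Hive
was extracted to /usr/local/hive (the path is an assumption):

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

$ source ~/.bashrc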

HIVE QUERIES:

Hive Databases & Tables

Create database in Hive

We can connect to Hive in three different ways:


1) Hive CLI
2) Beeline
3) Hue ( browser application)

IP: 192.168.13.132

Browser Hue Application: http://192.168.13.132:8888/accounts/login/?next=/


Login : cloudera/cloudera

Hive databases

Display databases:

hive> show databases;

Creating new databases:

hive> create database userdb;


hive> create schema userdb1;

Pointing to specific database

hive> use userdb;

Display the current database

hive> set hive.cli.print.current.db=true;


hive (userdb)>

dropping databases

hive (userdb)> drop database userdb1;
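
By default, dropping a database that still contains tables fails. Adding the CASCADE keyword drops
the contained tables first; a sketch with a hypothetical database testdb:

hive> drop database testdb cascade;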

Hive data types

Numeric Types:
INT, TINYINT, BIGINT, FLOAT, DOUBLE, DECIMAL
String Types:
CHAR, VARCHAR, STRING
Date/Time Types:
TIMESTAMP, DATE
Complex Types:
ARRAY, MAP, STRUCT (a sample table definition follows)
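
A table definition exercising several of these types might look like the following; this is a
hypothetical sketch (the table and column names are not used elsewhere in this material):

hive> create table type_demo(
        id int,
        name string,
        salary decimal(10,2),
        joined date,
        last_login timestamp,
        skills array<string>,
        properties map<string,string>
      );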

Hive Tables

1) Internal Tables (Managed tables)


2) External Tables

Internal Tables (Managed Tables)

 First, we have to create the table and then load the data.
 We can call this "data on schema".
 By dropping this table, both the data and the schema will be removed.
 The storage location of this table will be under /user/hive/warehouse.
 Data is temporary.
 Use it when Hive should manage the table data completely, not allowing any external
source to use the table (external sources mean Sqoop, Pig, MapReduce, etc.).
 Use it when you don’t want the data after deletion.
External Tables

 Data will already be available in HDFS. The table is created on top of the HDFS data.
 We can call this "schema on data".
 At the time of dropping the table, only the schema is dropped; the data will still be
available in HDFS as before.
External tables provide an option to create multiple schemas for the data stored in HDFS,
instead of deleting the data every time the schema is updated.

 Use it if the data to be processed is already available in HDFS.
 Useful when the files are being used outside of Hive (a quick way to check a table's
type is shown below).
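
Whether a given table is managed or external can be checked from the Table Type field of its
metadata; a sketch against the employee table created in the next section:

hive> describe formatted employee;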
Creating Internal tables & loading data

1) create a table and insert data
2) create a table then load the data from the Local File System
3) create a table then load the data from HDFS
4) create a table based on another table
5) create a table then load the data from another table

Method-1

hive (userdb)> create table employees(empid int,ename string,salary int,job string,deptno int);
hive (userdb)> insert into employees values(101,'pavan',60000,'engineer',10);
hive (userdb)> select * from employees;

Method-2

hive (userdb)> create table IF NOT EXISTS employee(eid int,name string,salary string,designation
string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive (userdb)> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt'
OVERWRITE INTO TABLE employee;
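
The emp.txt file referenced above is assumed to contain comma-separated rows matching the column
order of the table, for example (hypothetical contents):

101,pavan,60000,engineer
102,kiran,50000,manager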

Method-3

-> copy the file into HDFS


[cloudera@quickstart hivedata]$ hdfs dfs -put /home/cloudera/training/hivedata/emp.txt
/user/hive/warehouse/

hive (userdb)> Create table IF NOT EXISTS employee ( eid int, name String,salary String, designation
String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

hive (userdb)> LOAD DATA INPATH '/user/hive/warehouse/emp.txt' OVERWRITE INTO TABLE
employee;

Method-4

->Describe table
hive (userdb)> describe employee;

hive (userdb)> create table employee_new as select eid,name from employee;

Method-5

hive (userdb)> create table empnames(names string) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

hive (userdb)> insert into table empnames select name from employee;

External tables

 External tables are created on top of existing data.
 If the table is deleted, the data still persists in HDFS.
1) create external table and insert data.
2) create external table and load the data from Local file system.
3) create external table and load the data from HDFS.
Method-1
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/user/training/employee_external';
hive> insert into employee_external values(111,'pavan','50000','manager');
Method-2
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/user/training/employee_external';
hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' INTO TABLE
employee_external;
hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' OVERWRITE INTO
TABLE employee_external;
** If you use OVERWRITE in the load statement, the existing data will be overwritten.
** Otherwise, a copy of the file will be created alongside the existing data.
** If you drop the table, the data file will not be removed from HDFS (see the check below).
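This can be verified by dropping the external table and listing its location in HDFS; the data
files remain. A sketch using the LOCATION defined above:

hive> drop table employee_external;
[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/training/employee_external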
Method-3
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/user/training/employee_external';
hive> LOAD DATA INPATH '/user/training/emp.txt' INTO TABLE employee_external;
describe
hive> describe employee_external;
OK
eid int
name string
salary string
designation string
Time taken: 0.09 seconds, Fetched: 4 row(s)
hive> describe extended employee_external;
OK
eid int
name string
salary string
designation string

Complex Data Types


array: stores multiple elements of the same data type in a single field.
map: stores key/value pairs in a single field.
struct: stores a group of named fields in a single field.
Array
hive> create table arrayTable(id int,name string,sal bigint,sub array<string>,city string)
row format delimited fields terminated by ',' collection items terminated by '$';
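
The input file arrrayfile.txt is assumed to contain rows like the following, with ',' between
fields and '$' between array elements (contents inferred from the query output below):

1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang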

hive> load data local inpath'/home/cloudera/training/hivedata/arrrayfile.txt' into table arrayTable;


hive> select * from arrayTable;
OK
1 abc 40000 ["a","b","c"] hyd
2 def 3000 ["d","f"] bang
Time taken: 1.029 seconds, Fetched: 2 row(s)
hive> select sub[2] from arrayTable;
OK
c
NULL
Time taken: 0.402 seconds, Fetched: 2 row(s)
hive> select sub[1] from arrayTable;
OK
b
f
Time taken: 0.244 seconds, Fetched: 2 row(s)
hive>
Map
hive> create table maptable(id int,name string,sal bigint,sub array<string>,pfund map<string,int>,city string)
row format delimited fields terminated by ','
collection items terminated by '$' map keys terminated by '#';
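
mapfile.txt is assumed to contain rows like the following, with '$' between map entries and '#'
between each key and its value (hypothetical contents consistent with the delimiters above):

1,abc,40000,a$b$c,pf#500$epf#200,hyd
2,def,3000,d$f,pf#500,bang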
hive> load data local inpath'/home/cloudera/training/hivedata/mapfile.txt' into table maptable;

hive> select * From maptable;


hive> select pfund["pf"],pfund["epf"] from maptable;

Struct
hive> create table structtable(id int,name string,sal bigint,sub array<string>,dud map<string,int>,
address struct<city:string,state:string,pin:bigint>)
row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';
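
structfile.txt is assumed to contain rows like the following, with '$' also separating the struct
fields (contents inferred from the query output below):

1,abc,40000,a$b$c,pf#500$epf#200,hyd$ap$500001
2,def,3000,d$f,pf#500,bang$kar$600038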

hive> load data local inpath '/home/cloudera/training/hivedata/structfile.txt' into table structtable;

hive> select * From structtable;


OK
1 abc 40000 ["a","b","c"] {"pf":500,"epf":200} {"city":"hyd","state":"ap","pin":500001}
2 def 3000 ["d","f"] {"pf":500} {"city":"bang","state":"kar","pin":600038}
Time taken: 0.136 seconds, Fetched: 2 row(s)
hive> select address.city from structtable;
OK
hyd
bang
Time taken: 0.14 seconds, Fetched: 2 row(s)
hive> select address.pin from structtable;
OK
500001
600038
Time taken: 0.13 seconds, Fetched: 2 row(s)
hive>
Partitioning
1) create Hive table
hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields
terminated by ',';
2) Load the data
hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;
3) create partition table
hive> create table students_part(sno int,sname string) PARTITIONED BY(passout_year string);
4) For dynamic partition inserts we have to set this property
set hive.exec.dynamic.partition.mode=nonstrict;
5) Load data into partition table
hive> INSERT OVERWRITE TABLE students_part PARTITION(passout_year) SELECT sno,sname,
passout_year from all_students;
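Each distinct passout_year value becomes its own subdirectory (for example passout_year=2011)
under the table's warehouse directory, and queries that filter on the partition column read only
that directory. A small check, assuming the data contains the year 2011 as in the view example
later in this material:

hive> show partitions students_part;
hive> select * from students_part where passout_year='2011';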

Bucketing
1) create Hive table
hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields
terminated by ',';
2) Load the data
hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;
3) creating bucket table:
create table students_buck(sno int,sname string, passout_year string) clustered by (passout_year) sorted
by (sno) into 3 buckets;
4) Enable bucketing:
set hive.enforce.bucketing=true;
5) Insert data into bucket table:
insert overwrite table students_buck select * from all_students;
** After bucketing splits will be stored in /user/hive/warehouse/students_buck
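Since the table was clustered into 3 buckets, that directory should contain three bucket files;
this can be checked from the shell (a sketch; the exact path may include a database subdirectory):

[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/hive/warehouse/students_buck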
6) Test Bucket table:
select * from students_buck tablesample(bucket 2 out of 3 on sname);
7) Test normal table:
select * from all_students limit 10;
-------------------------------------
Alter table - rename table name, rename column
Views & Indexes
Select ..where
Select ..order by
Select..Group By
-----------------------------
Renaming table
hive> alter table all_students rename to students;
Renaming the column
hive> alter table students change passout_year pyear String;
View ---> A view is a logical object; it can be created over one or more columns of a table.
Creating view
hive> create view students_view as select sno,pyear from students where pyear='2011';
hive> select * from students_view;
OK
5 2011
6 2011
7 2011
18 2011
19 2011
20 2011
Dropping view
hive> drop view students_view;
Indexes
Index --> improves performance when fetching data from tables.
Creating index
hive> create index sno_idx on table students(sno) as 'compact' with deferred rebuild;
hive> show index on students;
Dropping Index
hive> drop index sno_idx on students;
HQL--Select with where clause
where --> used for filtering the rows based on condition.
Display students who passed in the year 2012.
hive> select * from students where pyear=2012;
HQL-Select with order by clause
Display students in ascending order based on pyear.
hive> select * from students order by pyear;
HQL-Select with group by clause
Count number of students passed in each year.
hive> select pyear,count(*) from students group by pyear;
Group by with having clause
having --> applies a condition on top of the group by clause.
hive> select pyear,count(*) from students group by pyear having count(*)>=5;

Hive Built-in functions


dummy --> a table with one row and one column.
hive> select * From dummy;
OK
xxx
Time taken: 2.575 seconds, Fetched: 1 row(s)
hive> describe dummy;

hive> select concat('abc','xyz') from dummy;


hive> select concat(sname,pyear) from students;
hive> select substr('welcome',2,4) from dummy;
elco
hive> select substr('welcome',3,2) from dummy;
lc
hive> select substr(sname,1,4) from students;
hive> select upper('welcome') from dummy;
WELCOME
hive> select ucase('welcome') from dummy;
WELCOME
hive> select lower('WELCOME') from dummy;
welcome
hive> select lcase('WELCOME') from dummy;
welcome
hive> select upper(sname) from students;
hive> select lower(sname) from students;
hive> select ' welcome ' from dummy;
welcome
hive> select trim(' welcome ') from dummy;
welcome
hive> select rtrim(' welcome ') from dummy;
welcome
hive> select ltrim(' welcome ') from dummy;
welcome
hive> select CURRENT_DATE from dummy;
2017-12-28
hive> select year(CURRENT_DATE) from dummy;
hive> select month(CURRENT_DATE) from dummy;
hive> select day(CURRENT_DATE) from dummy;

Aggregation Functions
Aggregation functions applicable for group of columns.
Tables: owners , cars
hive> create table owners(id int,name string,age int,car_id int) row format delimited fields terminated by
','
lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/owners.txt' into table owners;
hive> create table cars(id int,make string,model string,year string)row format delimited fields terminated
by ','
lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/cars.txt' into table cars;

count(*) --> returns the total number of rows, including nulls.
count(expr) --> returns the number of rows for which the supplied expression is non-NULL.
sum(col) --> returns the sum of the elements in the group.
sum(distinct col) --> returns the sum of the unique elements in the group (eliminating duplicates).
avg(col) --> returns the average value of the elements in the group.
min(col) --> returns the minimum value in the group.
max(col) --> returns the maximum value in the group.
Query: table : cars
I want to see the car makers that have more than 1 model in cars table.
hive> select make,count(distinct model) from cars group by make having count(distinct model)>1;
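
A few more aggregate queries against the owners table created above, as a sketch (the column
names follow the table definition):

hive> select count(*) from owners;
hive> select min(age), max(age), avg(age) from owners;
hive> select car_id, count(*) from owners group by car_id;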

Hive User Defined Functions


select count(*) from customers;
Steps to create and test UDF's
1) Implement the code for UDF in Java
2) Package java class into jar file copy in some location
3) Add jar file in to Hive CLI
4) Create temporary function in hive
5) Use hive UDF BY using Query.
Prerequisite: The table should have some data.
Problem statement-1
Find the maximum marks obtained out of four subjects by a student.
Package the Java class into a jar file and copy it to some location.
-----------
Select the class in Eclipse --> Right-click --> Export --> Java --> JAR --> browse the
location --> provide a filename with the .jar extension.
Add jar file in to Hive CLI
---------------------
hive> add jar /home/cloudera/training/HiveUDFS/getMaxMarks.jar;
Create temporary function in hive
-----------------
hive> create temporary function getmaxmarks as 'udfs.GetMaxMarks';
Use hive UDF BY using Query
-------------
hive> select getmaxmarks(10,20,30,40) from dummy; // sanity test
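
After the sanity test, the same function can be applied row by row to a real table. In the sketch
below, social comes from the UDAF example further down; the other subject column names are
assumptions for illustration:

hive> select getmaxmarks(maths, physics, chemistry, social) from t_student_record;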

2 types of UDF
1) Regular UDF (UDF) ---> applied row by row to the rows of a table.
2) User Defined Aggregate Function (UDAF) --> operates on a group of rows and returns a single result.
Problem statement -2: Find the mean of marks obtained in maths by all the students.
Package java class into jar file copy in some location
----------
Right-click on the package --> Export --> Java --> JAR --> provide the jar file name.
Add jar file in to Hive CLI
-----
hive> add jar /home/cloudera/training/HiveUDFS/getMeanMarks.jar;
Create temporary function in hive
----------------
hive> create temporary function getmeanmarks as 'udaf.GetMeanMarks';
Use functions with queries
-----------------------
hive> select getmeanmarks(social)from t_student_record;

Hive clients
------------------
1) hive CLI
2) beeline
3) Hue ( Web UI) ; http://192.168.13.135:8888
4) JDBC Client ( Java program)
Hive Joins:
Join is used for getting the data from multiple tables.
1) Inner join
2) Left outer join
3) right outer join
4) full outer join
T1: 1 3 5 7 (left table)
T2: 1 2 5 8 9 (right table)
inner join: 1, 5 (records common to both tables)
left outer join: 1, 5, 3, 7 (common records + extra records from the left table; where there is no
corresponding value on the right, NULLs are substituted)
right outer join: 1, 5, 2, 8, 9 (common records + extra records from the right table, with NULLs on the left side)
full outer join: 1, 5, 3, 7, 2, 8, 9 (all records from both tables, with NULLs wherever there is no match)
Tables
-----------------
orders
customers
hive> create table customers(id int,name string,age int,address string,salary int) row format delimited
fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/customers.txt' into table customers;
hive> create table orders(oid int,date date,customer_id int,amount int)row format delimited fields
terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/orders.txt' into table orders;
Join Queries
INNER JOIN:
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
LEFT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS
o ON (c.ID = o.CUSTOMER_ID);
RIGHT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN
ORDERS o ON (c.ID = o.CUSTOMER_ID);
FULL OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS
o ON (c.ID = o.CUSTOMER_ID);

Processing XML
hive> create table xmltable(line string);
hive> load data local inpath '/home/cloudera/training/hivedata/sampxml1.xml' into table xmltable;

hive> select * from xmltable;


OK
xmltable.line
<rec><name>Ravi</name><age>25</age><city>hyd</city></rec>
<rec><name>Rani</name><age>24</age><sex>f</sex></rec>
<rec><name>Sampath</name><sex>m</sex><city>Del</city></rec>
Time taken: 0.064 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/name') from xmltable;
OK
_c0
Ravi
Rani
Sampath
Time taken: 0.108 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/city') from xmltable;
OK
_c0
hyd
Del
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive>
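
Numeric values can be pulled out the same way with xpath_int; a sketch, assuming the records
contain an age element as shown above:

hive> select xpath_int(line,'rec/age') from xmltable;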
Hive WebUI (Hue application)
Generating data: http://generatedata.com/
1) Querying tables
2) Create new table and import the data from external file.
step 1) Data Browsers --> Metastore Tables
step 2) Click on the '+' at the right corner.
