HiveQL Overview

Hive is a data warehouse infrastructure tool that enables SQL-like querying of structured data in Hadoop using HiveQL. It supports various data types, including primitive and complex types, and features like managed and external tables, partitions, and bucketing for efficient data processing. Additionally, Sqoop facilitates data transfer between relational databases and the Hadoop ecosystem, allowing for efficient data import and export.


UNIT: IV
HIVE
• It is a platform used to develop SQL-type scripts that run as MapReduce operations.
• The Hive Query Language (HiveQL or HQL) lets MapReduce process structured data through Hive.
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Hive was initially developed by Facebook and later taken up by the Apache Software Foundation.
Features of Hive
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type query language called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
HiveQL Data Types

• Primitive: INT, STRING, FLOAT, DOUBLE, BOOLEAN, DATE, TIMESTAMP


• Complex: ARRAY, MAP, STRUCT, UNIONTYPE
Integral Types

• Integer data is specified with the integral data types; INT is the default. When the data range exceeds the range of INT, use BIGINT; when the data range is smaller than INT, use SMALLINT. TINYINT is smaller still than SMALLINT.
INT Example

• CREATE TABLE students (id INT, marks INT);

-- FLOAT and DOUBLE Example
• CREATE TABLE products (price FLOAT, discount DOUBLE);
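
For completeness, a small sketch (table and column names are illustrative, not from the slides) using the other integral types described above:

-- TINYINT, SMALLINT, BIGINT Example
CREATE TABLE readings (
  flag TINYINT,        -- smaller than SMALLINT
  room_no SMALLINT,    -- smaller than INT
  total_hits BIGINT    -- for values beyond the range of INT
);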
String
• String type data can be specified using single quotes (' ') or double quotes (" "). Besides STRING, it includes two bounded types: VARCHAR and CHAR.
• VARCHAR: length 1 to 65535
• CHAR: maximum length 255
STRING Example

• CREATE TABLE books (title STRING, author STRING);

-- CHAR and VARCHAR Example
• CREATE TABLE codes (code CHAR(5), description VARCHAR(50));
Timestamp

• It supports the traditional UNIX timestamp with optional nanosecond precision, in the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff".
Dates

• DATE values are described in year/month/day format, in the form YYYY-MM-DD.
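
A sketch combining both types (the login_time column is assumed for illustration; attendance and class_date also appear in the date-function examples later):

-- TIMESTAMP and DATE Example
CREATE TABLE attendance (
  student_id INT,
  class_date DATE,       -- e.g. 2024-08-15
  login_time TIMESTAMP   -- e.g. 2024-08-15 09:30:00
);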
Complex Data Types <ARRAY>
• CREATE TABLE employee (id INT, name STRING, skills ARRAY<STRING>);
• Example Row: (1, "Ankit", ["Java", "Python",
"SQL"])
MAP
CREATE TABLE product_reviews (
product_id INT,
reviews MAP<STRING, STRING>
);
Example Row: (101, {"Alice":"Good",
"Bob":"Excellent"})
WHY HIVE?
Bridges SQL & Hadoop
• Hadoop stores data in HDFS, but querying raw
files with MapReduce is complex and low-
level.
• Hive provides a SQL-like interface (HiveQL) →
making Hadoop accessible to non-
programmers.
Features
• Scalability & Flexibility
• Handles petabyte-scale datasets stored in HDFS,
HBase, or cloud storage (e.g., S3).
• Used in ETL (Extract-Transform-Load) pipelines.
• Common in data analytics, reporting, and BI
(Business Intelligence).
• Widely adopted in companies like Facebook,
Netflix, Amazon, etc.
• Supports Advanced Features
Operators and Functions

• Operators: Arithmetic (+, -, *, /), Relational (=, !=, <, >), Logical (AND, OR,
NOT)
• Functions: String (LENGTH, CONCAT), Date (YEAR, MONTH), Aggregate
(COUNT, SUM, AVG)
Arithmetic
-- Example: Add bonus marks
SELECT id, marks + 10 AS new_marks
FROM students;
Relational Operators
Find students with marks greater than 50
SELECT id, name
FROM students
WHERE marks > 50;
Logical Operators
Find students with marks > 50 and age < 18
SELECT id, name
FROM students
WHERE marks > 50 AND age < 18;
String
Find students whose name starts with 'A'
SELECT name
FROM students
WHERE name LIKE 'A%';
SELECT name
FROM students
WHERE name RLIKE 'sh$';

Output: Ramesh, Rajesh, Satish

RLIKE → full regular expressions, much more powerful for advanced filtering.
Function
• A function in Hive is a predefined operation
that takes one or more inputs, performs some
processing, and returns a result.
• Functions save time → you don’t have to write
complex code for common tasks.
• They are similar to functions in mathematics
or programming.
Aggregate Func.
-- Count students
SELECT COUNT(*) FROM students;

-- Average marks
SELECT AVG(marks) FROM students;

-- Max marks
SELECT MAX(marks) FROM students;
Date Function
-- Extract year from a date
SELECT YEAR(class_date) FROM attendance;

-- Extract month from a date
SELECT MONTH(class_date) FROM attendance;

-- Current date
SELECT CURRENT_DATE();
String Function
-- Length of a string
SELECT LENGTH(name) FROM students;

-- Convert to uppercase
SELECT UPPER(name) FROM students;

-- Concatenate two strings
SELECT CONCAT(first_name, ' ', last_name) FROM students;
Managed vs External Tables

• Managed: Hive manages both metadata and data. Dropping deletes everything.
• External: Hive manages only metadata. Data remains after drop.
• Syntax: CREATE TABLE vs CREATE EXTERNAL TABLE
Managed Tables

• Hive owns the data and the metadata.


• If you drop a managed table → both data and
metadata are deleted.
• Data stored inside Hive’s warehouse directory
(/user/hive/warehouse).
Example
-- Create a managed table
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Load data into managed table
LOAD DATA LOCAL INPATH '/home/hadoop/employees.csv'
INTO TABLE employees;
Example
DROP TABLE employees;
• Both table data and schema are deleted.
External Tables

• Hive does not own the data; only the schema (metadata).
• If you drop an external table → only metadata is deleted, data remains in HDFS.
• Useful when you want to share data across multiple tools.
Examples
-- Create an external table
CREATE EXTERNAL TABLE sales (
id INT,
product STRING,
price FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/sales_data/';
• After DROP TABLE sales; the data in HDFS (/user/hadoop/sales_data/) remains safe. Only the table definition is removed.
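
To check which kind of table you have, DESCRIBE FORMATTED can be used; a quick sketch on the table above:

DESCRIBE FORMATTED sales;
-- Output includes "Table Type: EXTERNAL_TABLE" (a managed table shows MANAGED_TABLE)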
Partitions

• Used to divide a table into sub-directories based on column values.
• Improves query performance by scanning only relevant partitions.
Example
-- Create table partitioned by year
CREATE TABLE orders (
order_id INT,
customer STRING,
amount FLOAT
)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Load data into a specific partition
LOAD DATA LOCAL INPATH '/home/hadoop/orders_2024.csv'
INTO TABLE orders
PARTITION (year=2024);
SELECT * FROM orders WHERE year=2024;
Faster than scanning the whole table
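
The partitions created for a table can also be listed; a small sketch using the orders table above:

SHOW PARTITIONS orders;
-- e.g. year=2024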
Bucket
• Bucketing means dividing data into smaller,
fixed-size files (called buckets) using a
column’s value.
• Hive applies a hash function on that column.
• The result of the hash decides which bucket
(file) the row goes into.
Example
Imagine you have 1000 customers, and you create 4 buckets
using the column id.
• How it works:
• Hive takes each customer’s id.
• Applies a hash function → gets a number.
• Then divides by the number of buckets (e.g., 4).
• Puts the row into bucket 0, 1, 2, or 3.
• So:
• Customer with id=101 → hash → goes to Bucket 0
• Customer with id=202 → hash → goes to Bucket 2
• Customer with id=303 → hash → goes to Bucket 3
...and so on.
If your table is bucketed into 4 buckets, Hive will
create 4 files for the data.
SELECT * FROM customers
TABLESAMPLE(BUCKET 1 OUT OF 4);
→ Picks data only from bucket 1 (25% of data).
CREATE TABLE customers (
id INT,
name STRING,
city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Enable bucketing
SET hive.enforce.bucketing = true;
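
Note that LOAD DATA does not redistribute rows into buckets. A common approach, shown here as a sketch (customers_staging is an assumed plain table with the same columns), is to insert from another table so Hive can apply the hash:

INSERT OVERWRITE TABLE customers
SELECT id, name, city FROM customers_staging;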
Importing Data into Hive

(a) Load Local File

LOAD DATA LOCAL INPATH '/home/hadoop/employees.csv'
INTO TABLE employees;
Load From HDFS
LOAD DATA INPATH
'/user/hadoop/employees.csv'
INTO TABLE employees;
INSERT DATA
INSERT INTO TABLE employees VALUES (1,
'Ankit', 50000);
Import from Another Table

CREATE TABLE employees_copy AS
SELECT * FROM employees;
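
If the target table already exists, INSERT ... SELECT is the usual alternative; a sketch assuming employees_copy has the same schema:

INSERT INTO TABLE employees_copy
SELECT * FROM employees;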
Altering Tables
-- Add a new column
ALTER TABLE employees ADD COLUMNS
(department STRING);

-- Rename a column
ALTER TABLE employees CHANGE salary
salary_amount FLOAT;

-- Rename a table
ALTER TABLE employees RENAME TO staff;
Dropping Tables
-- Drop managed table (data + metadata
deleted)
DROP TABLE employees;

-- Drop external table (only metadata deleted)
DROP TABLE sales;
5. Querying Data in Hive
-- Select all data
SELECT * FROM employees;

-- Filter with WHERE
SELECT name, salary FROM employees WHERE salary > 40000;

-- Aggregation
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- Sorting
SELECT * FROM employees ORDER BY salary DESC;

-- Join
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON (e.department = d.id);
Sorting and Aggregating
• Sorting → arranges rows in
ascending/descending order.
• Aggregating → performs calculations like
SUM(), AVG(), MAX(), etc.
Examples
-- Sorting employees by salary
SELECT * FROM employees ORDER BY salary DESC;

-- Aggregation: average salary by department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- Aggregation with filtering
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING emp_count > 5;
Map Reduce Scripts in Hive

• Hive can call custom MapReduce code for complex logic.
• Typically written in Java, Python, or other languages.
Transform
• Hive lets you plug in your own code (Python,
Perl, Java, etc.) to process rows.
• Each row of Hive data is sent to your script.
• Your script processes it and sends rows back
to Hive.
• This is how you run custom logic inside Hive
queries.
When to Use Transform
• When Hive’s built-in functions are not enough.
• For complex text parsing, machine learning
preprocessing, custom logic, etc.
Example
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Data
1, Ankit, 50000
2, Riya, 60000
3, John, 45000
Python script
import sys

for line in sys.stdin:                 # Read rows from Hive
    fields = line.strip().split('\t')
    emp_id = fields[0]
    name = fields[1]
    salary = float(fields[2])

    # Custom logic: give 10% salary hike
    new_salary = salary * 1.10

    # Output row back to Hive
    print(f"{emp_id}\t{name}\t{new_salary}")
USE Transform in Hive
ADD FILE /home/hadoop/script.py;

SELECT TRANSFORM (id, name, salary)
USING 'python script.py'
AS (id INT, name STRING, new_salary FLOAT)
FROM employees;
• Hive sends each row (e.g., 1\tAnkit\t50000)
into script.py.
• script.py adds 10% → returns
(1\tAnkit\t55000).
• Hive collects the results into a new table-like
output.
Storage Formats

• TEXTFILE – Default plain text


• ORC – Optimized Row Columnar, compressed, fast
• PARQUET – Columnar, good for analytics
• AVRO – Row-based with schema evolution
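
The format is chosen with a STORED AS clause when the table is created; a minimal sketch (table name is illustrative):

CREATE TABLE employees_orc (
  id INT,
  name STRING,
  salary FLOAT
)
STORED AS ORC;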
Dropping Tables

• DROP TABLE managed_table; → Deletes data + metadata


• DROP TABLE external_table; → Deletes metadata only
Querying in HiveQL

• SELECT name, age FROM students WHERE age > 18;


• SELECT dept, COUNT(*) FROM employees GROUP BY dept;
• JOIN: SELECT e.name, d.dept_name FROM emp e JOIN dept d ON
e.dept_id = d.dept_id;
SQOOP
• Sqoop (SQL-to-Hadoop) is used to transfer
data between relational databases (RDBMS)
and Hadoop ecosystem (HDFS, Hive, HBase)
efficiently.
• Hadoop is great for large-scale batch
processing.
• RDBMS is great for structured transactions.
• Sqoop acts as a bridge between the two.
Sqoop Connectors
• Think of connectors as drivers that let Sqoop
talk to different databases.
• Example: Just like JDBC driver lets Java
connect to MySQL, Sqoop connectors enable
Hadoop ↔ MySQL communication.
CONNECTOR
• Sqoop is a tool which is designed for the data
transfers between the RDBMS and the Hadoop
ecosystem. One of the major advantages of
Apache Sqoop is that it is easy to use, and we can
use it with the varieties of systems (inside and
outside) of the Hadoop ecosystem.
• With Apache Sqoop, we can easily import or
export data from all relational databases that
support JDBC interface. This can be done by using
the same command line arguments exposed by
Apache Sqoop.
Sqoop Connectors types:

• Built-in Connectors → MySQL, Oracle, PostgreSQL, SQL Server, DB2.
• Third-party Connectors → Netezza, Teradata, Couchbase, etc.
Example
sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root \
--password 1234
Sqoop Import
Import = Copy data from RDBMS → Hadoop
(HDFS, Hive, HBase)
• Sqoop tool import is used to import table data
from the table to the Hadoop file system as a
text file.
Example
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1
NOTE: By default, Sqoop uses 4 mappers (--m 4),
meaning it splits the table data into 4 parts
and imports in parallel.
• This parallelism makes imports faster for large
datasets.
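
A sketch of making the parallelism explicit (the split column and target directory are assumptions for illustration):

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--split-by id \
--num-mappers 4 \
--target-dir /user/hadoop/emp_import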
Generated Code
• Sqoop can generate Java classes representing
the DB schema.
• Each column → one field in the Java class.
• Useful for writing custom MapReduce jobs
directly on the imported data.
Example
sqoop codegen \
--connect jdbc:mysql://localhost/company \
--username root \
--password 1234 \
--table employees
employees.java (generated class, simplified)
public class employees implements Writable, DBWritable {
    int id;
    String name;
    float salary;
    // getters and setters
}
Code Generation
• Sqoop can generate Java classes that map
directly to a relational database table’s
schema.
• These classes are POJOs (Plain Old Java
Objects) with fields, getters, and setters for
each column.
• They implement Writable and DBWritable so
they can be used in Hadoop MapReduce jobs.
Code Generation
NOTE:
• For just moving data → No Java code
required.
• For custom Hadoop jobs → Sqoop generates
the Java schema classes so you only focus on
logic, not data structure.
• For Hive/Spark → No Java at all, only SQL-like
queries.
Ways of Working with Imported Data
• Imported into HDFS (Default)
• By default, Sqoop writes table data into HDFS
files.
• Files are usually stored in TextFile,
SequenceFile, Avro, or Parquet format.
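
As a sketch (connection details as in the earlier examples), the file format can be selected with a flag at import time:

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--as-parquetfile \
--target-dir /user/hadoop/emp_parquet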
