HiveQL Overview

Hive is a data warehouse infrastructure tool that enables SQL-like querying of structured data in Hadoop using HiveQL. It supports various data types, including primitive and complex types, and features like managed and external tables, partitions, and bucketing for efficient data processing. Additionally, Sqoop facilitates data transfer between relational databases and the Hadoop ecosystem, allowing for efficient data import and export.


UNIT: IV
HIVE
• It is a platform used to develop SQL-type scripts that run as MapReduce operations.
• The Hive Query Language (HiveQL or HQL) lets MapReduce process structured data through Hive.
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Hive was initially developed by Facebook and later taken up by the Apache Software Foundation.
Features of Hive
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type query language called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
HiveQL Data Types

• Primitive: INT, STRING, FLOAT, DOUBLE, BOOLEAN, DATE, TIMESTAMP


• Complex: ARRAY, MAP, STRUCT, UNIONTYPE
Integral Types

• Integer data is specified with the integral data types; INT is the default. When the data range exceeds the range of INT, use BIGINT; when the data range is smaller than INT, use SMALLINT. TINYINT is smaller still than SMALLINT.
INT Example

• CREATE TABLE students (id INT, marks INT);

-- FLOAT and DOUBLE Example
• CREATE TABLE products (price FLOAT, discount DOUBLE);
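
For completeness, a small sketch (table and column names are illustrative, not from the slides) using the other integral types described above:

-- TINYINT, SMALLINT, BIGINT Example
CREATE TABLE readings (
  flag TINYINT,        -- smaller than SMALLINT
  room_no SMALLINT,    -- smaller than INT
  total_hits BIGINT    -- for values beyond the range of INT
);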
String
• String type data can be specified using single quotes (' ') or double quotes (" "). Besides STRING, it includes two bounded types: VARCHAR and CHAR.
• VARCHAR: length 1 to 65535
• CHAR: maximum length 255
STRING Example

• CREATE TABLE books (title STRING, author STRING);

-- CHAR and VARCHAR Example
• CREATE TABLE codes (code CHAR(5), description VARCHAR(50));
Timestamp

• It supports the traditional UNIX timestamp with optional nanosecond precision, in the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff".
Dates

• DATE values are described in year/month/day format, in the form YYYY-MM-DD.
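
A sketch combining both types (the login_time column is assumed for illustration; attendance and class_date also appear in the date-function examples later):

-- TIMESTAMP and DATE Example
CREATE TABLE attendance (
  student_id INT,
  class_date DATE,       -- e.g. 2024-08-15
  login_time TIMESTAMP   -- e.g. 2024-08-15 09:30:00
);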
Complex Data Types <ARRAY>
• CREATE TABLE employee (id INT, name STRING, skills ARRAY<STRING>);
• Example Row: (1, "Ankit", ["Java", "Python",
"SQL"])
MAP
CREATE TABLE product_reviews (
product_id INT,
reviews MAP<STRING, STRING>
);
Example Row: (101, {"Alice":"Good",
"Bob":"Excellent"})
WHY HIVE?
Bridges SQL & Hadoop
• Hadoop stores data in HDFS, but querying raw
files with MapReduce is complex and low-
level.
• Hive provides a SQL-like interface (HiveQL) →
making Hadoop accessible to non-
programmers.
Features
• Scalability & Flexibility
• Handles petabyte-scale datasets stored in HDFS,
HBase, or cloud storage (e.g., S3).
• Used in ETL (Extract-Transform-Load) pipelines.
• Common in data analytics, reporting, and BI
(Business Intelligence).
• Widely adopted in companies like Facebook,
Netflix, Amazon, etc.
• Supports Advanced Features
Operators and Functions

• Operators: Arithmetic (+, -, *, /), Relational (=, !=, <, >), Logical (AND, OR,
NOT)
• Functions: String (LENGTH, CONCAT), Date (YEAR, MONTH), Aggregate
(COUNT, SUM, AVG)
Arithmetic
-- Example: Add bonus marks
SELECT id, marks + 10 AS new_marks
FROM students;
Relational Operators
Find students with marks greater than 50
SELECT id, name
FROM students
WHERE marks > 50;
Logical Operators
Find students with marks > 50 and age < 18
SELECT id, name
FROM students
WHERE marks > 50 AND age < 18;
String
Find students whose name starts with 'A'
SELECT name
FROM students
WHERE name LIKE 'A%';
SELECT name
FROM students
WHERE name RLIKE 'sh$';

Output: Ramesh, Rajesh, Satish

RLIKE → full regular expressions, much more powerful for advanced filtering.
Function
• A function in Hive is a predefined operation
that takes one or more inputs, performs some
processing, and returns a result.
• Functions save time → you don’t have to write
complex code for common tasks.
• They are similar to functions in mathematics
or programming.
Aggregate Func.
-- Count students
SELECT COUNT(*) FROM students;

-- Average marks
SELECT AVG(marks) FROM students;

-- Max marks
SELECT MAX(marks) FROM students;
Date Function
-- Extract year from a date
SELECT YEAR(class_date) FROM attendance;

-- Extract month from a date
SELECT MONTH(class_date) FROM attendance;

-- Current date
SELECT CURRENT_DATE();
String Function
-- Length of a string
SELECT LENGTH(name) FROM students;

-- Convert to uppercase
SELECT UPPER(name) FROM students;

-- Concatenate two strings
SELECT CONCAT(first_name, ' ', last_name) FROM students;
Managed vs External Tables

• Managed: Hive manages both metadata and data. Dropping deletes everything.
• External: Hive manages only metadata. Data remains after drop.
• Syntax: CREATE TABLE vs CREATE EXTERNAL TABLE
Managed Tables

• Hive owns the data and the metadata.


• If you drop a managed table → both data and
metadata are deleted.
• Data stored inside Hive’s warehouse directory
(/user/hive/warehouse).
Example
-- Create a managed table
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Load data into managed table
LOAD DATA LOCAL INPATH '/home/hadoop/employees.csv'
INTO TABLE employees;
Example
DROP TABLE employees;
• Both table data and schema are deleted.
External Tables

• Hive does not own the data; only the schema (metadata).
• If you drop an external table → only metadata is deleted, data remains in HDFS.
• Useful when you want to share data across multiple tools.
Examples
-- Create an external table
CREATE EXTERNAL TABLE sales (
id INT,
product STRING,
price FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/sales_data/';
• After DROP TABLE sales; the data in HDFS (/user/hadoop/sales_data/) remains safe. Only the table definition is removed.
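
To check which kind of table you have, DESCRIBE FORMATTED can be used; a quick sketch on the table above:

DESCRIBE FORMATTED sales;
-- Output includes "Table Type: EXTERNAL_TABLE" (a managed table shows MANAGED_TABLE)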
Partitions

• Used to divide a table into sub-directories based on column values.
• Improves query performance by scanning only relevant partitions.
Example
-- Create table partitioned by year
CREATE TABLE orders (
order_id INT,
customer STRING,
amount FLOAT
)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Load data into a specific partition
LOAD DATA LOCAL INPATH '/home/hadoop/orders_2024.csv'
INTO TABLE orders
PARTITION (year=2024);
SELECT * FROM orders WHERE year=2024;
Faster than scanning the whole table
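
The partitions created for a table can also be listed; a small sketch using the orders table above:

SHOW PARTITIONS orders;
-- e.g. year=2024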
Bucket
• Bucketing means dividing data into smaller,
fixed-size files (called buckets) using a
column’s value.
• Hive applies a hash function on that column.
• The result of the hash decides which bucket
(file) the row goes into.
Example
Imagine you have 1000 customers, and you create 4 buckets
using the column id.
• How it works:
• Hive takes each customer’s id.
• Applies a hash function → gets a number.
• Then divides by the number of buckets (e.g., 4).
• Puts the row into bucket 0, 1, 2, or 3.
• So:
• Customer with id=101 → hash → goes to Bucket 0
• Customer with id=202 → hash → goes to Bucket 2
• Customer with id=303 → hash → goes to Bucket 3
...and so on.
If your table is bucketed into 4 buckets, Hive will
create 4 files for the data.
SELECT * FROM customers
TABLESAMPLE(BUCKET 1 OUT OF 4);
→ Picks data only from bucket 1 (25% of data).
CREATE TABLE customers (
id INT,
name STRING,
city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Enable bucketing
SET hive.enforce.bucketing = true;
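
Note that LOAD DATA does not redistribute rows into buckets. A common approach, shown here as a sketch (customers_staging is an assumed plain table with the same columns), is to insert from another table so Hive can apply the hash:

INSERT OVERWRITE TABLE customers
SELECT id, name, city FROM customers_staging;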
Importing Data into Hive

(a) Load Local File

LOAD DATA LOCAL INPATH '/home/hadoop/employees.csv'
INTO TABLE employees;
Load From HDFS
LOAD DATA INPATH
'/user/hadoop/employees.csv'
INTO TABLE employees;
INSERT DATA
INSERT INTO TABLE employees VALUES (1,
'Ankit', 50000);
Import from Another Table

CREATE TABLE employees_copy AS
SELECT * FROM employees;
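
If the target table already exists, INSERT ... SELECT is the usual alternative; a sketch assuming employees_copy has the same schema:

INSERT INTO TABLE employees_copy
SELECT * FROM employees;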
Altering Tables
-- Add a new column
ALTER TABLE employees ADD COLUMNS
(department STRING);

-- Rename a column
ALTER TABLE employees CHANGE salary
salary_amount FLOAT;

-- Rename a table
ALTER TABLE employees RENAME TO staff;
Dropping Tables
-- Drop managed table (data + metadata
deleted)
DROP TABLE employees;

-- Drop external table (only metadata deleted)
DROP TABLE sales;
5. Querying Data in Hive
-- Select all data
SELECT * FROM employees;

-- Filter with WHERE
SELECT name, salary FROM employees WHERE salary > 40000;

-- Aggregation
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- Sorting
SELECT * FROM employees ORDER BY salary DESC;

-- Join
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON (e.department = d.id);
Sorting and Aggregating
• Sorting → arranges rows in
ascending/descending order.
• Aggregating → performs calculations like
SUM(), AVG(), MAX(), etc.
Examples
-- Sorting employees by salary
SELECT * FROM employees ORDER BY salary DESC;

-- Aggregation: average salary by department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- Aggregation with filtering
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING emp_count > 5;
Map Reduce Scripts in Hive

• Hive can call custom MapReduce code for complex logic.
• Typically written in Java, Python, or other languages.
Transform
• Hive lets you plug in your own code (Python,
Perl, Java, etc.) to process rows.
• Each row of Hive data is sent to your script.
• Your script processes it and sends rows back
to Hive.
• This is how you run custom logic inside Hive
queries.
When to Use Transform
• When Hive’s built-in functions are not enough.
• For complex text parsing, machine learning
preprocessing, custom logic, etc.
Example
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Data
1, Ankit, 50000
2, Riya, 60000
3, John, 45000
Python script
import sys

for line in sys.stdin:                 # Read rows from Hive
    fields = line.strip().split('\t')
    emp_id = fields[0]
    name = fields[1]
    salary = float(fields[2])

    # Custom logic: give 10% salary hike
    new_salary = salary * 1.10

    # Output row back to Hive
    print(f"{emp_id}\t{name}\t{new_salary}")
USE Transform in Hive
ADD FILE /home/hadoop/script.py;

SELECT TRANSFORM (id, name, salary)
USING 'python script.py'
AS (id INT, name STRING, new_salary FLOAT)
FROM employees;
• Hive sends each row (e.g., 1\tAnkit\t50000)
into script.py.
• script.py adds 10% → returns
(1\tAnkit\t55000).
• Hive collects the results into a new table-like
output.
Storage Formats

• TEXTFILE – Default plain text


• ORC – Optimized Row Columnar, compressed, fast
• PARQUET – Columnar, good for analytics
• AVRO – Row-based with schema evolution
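
The format is chosen with a STORED AS clause when the table is created; a minimal sketch (table name is illustrative):

CREATE TABLE employees_orc (
  id INT,
  name STRING,
  salary FLOAT
)
STORED AS ORC;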
Dropping Tables

• DROP TABLE managed_table; → Deletes data + metadata


• DROP TABLE external_table; → Deletes metadata only
Querying in HiveQL

• SELECT name, age FROM students WHERE age > 18;


• SELECT dept, COUNT(*) FROM employees GROUP BY dept;
• JOIN: SELECT e.name, d.dept_name FROM emp e JOIN dept d ON
e.dept_id = d.dept_id;
SQOOP
• Sqoop (SQL-to-Hadoop) is used to transfer
data between relational databases (RDBMS)
and Hadoop ecosystem (HDFS, Hive, HBase)
efficiently.
• Hadoop is great for large-scale batch
processing.
• RDBMS is great for structured transactions.
• Sqoop acts as a bridge between the two.
Sqoop Connectors
• Think of connectors as drivers that let Sqoop
talk to different databases.
• Example: Just like JDBC driver lets Java
connect to MySQL, Sqoop connectors enable
Hadoop ↔ MySQL communication.
CONNECTOR
• Sqoop is a tool which is designed for the data
transfers between the RDBMS and the Hadoop
ecosystem. One of the major advantages of
Apache Sqoop is that it is easy to use, and we can
use it with the varieties of systems (inside and
outside) of the Hadoop ecosystem.
• With Apache Sqoop, we can easily import or
export data from all relational databases that
support JDBC interface. This can be done by using
the same command line arguments exposed by
Apache Sqoop.
Sqoop Connectors types:

• Built-in Connectors → MySQL, Oracle, PostgreSQL, SQL Server, DB2.
• Third-party Connectors → Netezza, Teradata, Couchbase, etc.
Example
sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root \
--password 1234
Sqoop Import
Import = Copy data from RDBMS → Hadoop
(HDFS, Hive, HBase)
• Sqoop tool import is used to import table data
from the table to the Hadoop file system as a
text file.
Example
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1
NOTE: By default, Sqoop uses 4 mappers (--m 4),
meaning it splits the table data into 4 parts
and imports in parallel.
• This parallelism makes imports faster for large
datasets.
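
A sketch of making the parallelism explicit (the split column and target directory are assumptions for illustration):

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--split-by id \
--num-mappers 4 \
--target-dir /user/hadoop/emp_import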
Generated Code
• Sqoop can generate Java classes representing
the DB schema.
• Each column → one field in the Java class.
• Useful for writing custom MapReduce jobs
directly on the imported data.
Example
sqoop codegen \
--connect jdbc:mysql://localhost/company \
--username root \
--password 1234 \
--table employees
employees.java (generated class, simplified)
public class employees implements Writable, DBWritable {
    int id;
    String name;
    float salary;
    // getters and setters
}
Code Generation
• Sqoop can generate Java classes that map
directly to a relational database table’s
schema.
• These classes are POJOs (Plain Old Java
Objects) with fields, getters, and setters for
each column.
• They implement Writable and DBWritable so
they can be used in Hadoop MapReduce jobs.
Code Generation
NOTE:
• For just moving data → No Java code
required.
• For custom Hadoop jobs → Sqoop generates
the Java schema classes so you only focus on
logic, not data structure.
• For Hive/Spark → No Java at all, only SQL-like
queries.
Ways of Working with Imported Data
• Imported into HDFS (Default)
• By default, Sqoop writes table data into HDFS
files.
• Files are usually stored in TextFile,
SequenceFile, Avro, or Parquet format.
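
As a sketch (connection details as in the earlier examples), the file format can be selected with a flag at import time:

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--as-parquetfile \
--target-dir /user/hadoop/emp_parquet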
