UNIT – IV: Apache Hive and NOSQL Database - HBase:
Features of HIVE, HIVE Architecture, HIVE Meta store, Datatypes in HIVE,
HIVEQL, Tables, File Format Types–Text, Sequence, AVRO, Parquet,
Querying Data, Types of NOSQL Database, Characteristics of HBASE,
Architecture, HBase Vs RDBMS, HBASE Shell Commands.
Apache Hive
Apache Hive is a tool for processing structured data stored in a Hadoop Distributed File
System.
It is also a platform used to develop SQL type scripts to do MapReduce operations.
Apache Hive is a data warehouse infrastructure.
It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.
Hive uses a language called HiveQL (HQL), which is similar to SQL.
HiveQL automatically translates SQL-like queries into MapReduce jobs.
Apache Hive organizes data into tables.
This provides a means for attaching the structure to data stored in HDFS.
Since HiveQL is similar to SQL, it becomes very easy for SQL developers to learn and use Hive.
Features
Hive provides data summarization, querying, and analysis in a much easier manner.
It also supports partitioning of data at the level of tables to improve performance.
Hive has a rule-based optimizer for optimizing logical plans.
Using HiveQL does not require knowledge of any programming language; knowledge of basic
SQL queries is enough.
We can easily process structured data in Hadoop using Hive.
Querying in Hive is very simple as it is similar to SQL.
It is designed for OLAP.
HIVE Architecture
User Interface
Hive is a data warehouse infrastructure software that can create interaction between user
and HDFS.
The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD
Insight (In Windows server).
Meta Store
Hive chooses the respective database server to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine
HiveQL is similar to SQL and is used for querying the schema information in the Metastore.
Instead of writing a MapReduce program in Java, we can write a HiveQL query for the
MapReduce job and have it processed.
Execution Engine
The Hive Execution Engine is the component that combines the HiveQL Process Engine and MapReduce.
The execution engine processes the query and generates the same results as MapReduce
would.
HDFS or HBASE
The Hadoop Distributed File System (HDFS) or HBase is the data storage technique used to
store the data in the file system.
The major components of Apache Hive are:
❑ Hive Client
❑ Hive Services
❑ Processing and Resource Management
❑ Distributed Storage
Hive Client
Hive supports applications written in languages like Python, Java, C++, Ruby, etc.
using JDBC, ODBC, and Thrift drivers for performing queries against Hive.
Hence, one can easily write a Hive client application in a language of their own choice.
Hive clients are categorized into three types:
1. Thrift Clients
The Hive server is based on Apache Thrift so that it can serve the request from a thrift
client.
2. JDBC client
Hive allows for the Java applications to connect to it using the JDBC driver.
JDBC driver uses Thrift to communicate with the Hive Server.
3. ODBC client
The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive.
Similar to the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
Hive Services
To perform all queries, Hive provides various services like HiveServer2, Beeline,
etc. The various services offered by Hive are:
Beeline
Beeline is a command shell supported by HiveServer2, where users can submit their
queries and commands to the system.
It is a JDBC client based on the SQLLine CLI.
Hive Server 2
HiveServer2 is the successor of HiveServer1.
HiveServer2 enables clients to execute queries against the Hive.
It allows multiple clients to submit requests to Hive and retrieve the final results.
It is basically designed to provide the best support for open API clients like JDBC and
ODBC.
Hive Driver
The Hive driver receives the HiveQL statements submitted by the user through the command
shell.
It creates the session handles for the query and sends the query to the compiler.
Hive Compiler
The Hive compiler parses the query.
It performs semantic analysis and type-checking on the different query blocks and query
expressions using the metadata stored in the metastore, and generates an execution plan.
The execution plan created by the compiler is a DAG (Directed Acyclic Graph), where each
stage is a MapReduce job, an HDFS operation, or a metadata operation.
Optimizer
The optimizer performs transformation operations on the execution plan and splits the tasks to
improve efficiency and scalability.
Execution Engine
Execution engine, after the compilation and optimization steps, executes the execution
plan created by the compiler in order of their dependencies using Hadoop.
Metastore
Metastore is a central repository that stores the metadata information about the structure
of tables and partitions, including column and column type information.
It also stores the serializer and deserializer information required for read/write
operations, and the HDFS files where the data is stored. This metastore is generally a relational
database.
Processing and Resource Management:
Hive internally uses a MapReduce framework as a default engine for executing the
queries.
MapReduce job works by splitting data into chunks, which are processed by map-reduce
tasks.
Distributed Storage
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File System for
the distributed storage.
Working of Hive
Execute Query- The Hive interface such as the Command Line or Web UI sends the query to the
Driver to execute.
Get Plan- The driver takes the help of the query compiler, which parses the query to check the
syntax and prepare the query plan or the requirements of the query.
Get Metadata- The compiler sends a metadata request to the Metastore.
Send Metadata- Metastore sends metadata as a response to the compiler.
Send Plan- The compiler checks the requirement and resends the plan to the driver. Up to
here, the parsing and compiling of a query is complete.
Execute Plan- The driver sends the execution plan to the execution engine.
Execute Job-
Internally, the process of execution job is a MapReduce job.
The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns
this job to the TaskTracker, which is in the Data node.
Here, the query executes the MapReduce job.
Metadata Ops-
Meanwhile in execution, the execution engine can execute metadata operations with
Metastore.
Fetch Result-
The execution engine receives the results from Data nodes.
Send Results-
The execution engine sends those resultant values to the driver. The driver then sends the results to the Hive interfaces.
HIVE Data Types
Data Types in Hive specifies the column/field type in the Hive table.
It specifies the type of values that can be inserted into the specified column.
Primitive Type
1. Numeric
2. Date/time
3. String
4. Miscellaneous
Complex Type
1. Array
2. Map
3. Struct
4. Union
Numeric Type
The Numeric data type in Hive is categorized into
Integral data type
Floating data type
Integral data type
a. TINYINT – (1-byte signed integer ranging from -128 to 127)
b. SMALLINT – (2-byte signed integer ranging from -32,768 to 32,767)
c. INTEGER – (4-byte signed integer ranging from -2,147,483,648 to 2,147,483,647)
d. BIGINT – (8-byte signed integer ranging from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807)
Floating data type
a. FLOAT - It is a 4-byte single-precision floating-point number.
b. DOUBLE - It is an 8-byte double-precision floating-point number.
c. DOUBLE PRECISION -It is an alias for DOUBLE. It is only available starting with Hive
2.2.0
d. DECIMAL - It was introduced in Hive 0.11.0. It is based on Java’s BigDecimal. DECIMAL
types support both scientific and non-scientific notations.
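For illustration, a minimal sketch (hypothetical table and column names) combining the integral and floating types in one table definition:
CREATE TABLE sales (
item_id BIGINT,
quantity SMALLINT,
rating TINYINT,
price DECIMAL(10,2),
discount FLOAT
);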
Date/Time data type:
TIMESTAMP
Timestamps were introduced in Hive 0.8.0. It supports traditional UNIX timestamp with the
optional nanosecond precision.
The supported Timestamps format is yyyy-mm-dd hh:mm:ss[.f…] in the text files.
DATE
Dates were introduced in Hive 0.12.0. DATE value describes a particular year/month/day
in the form of YYYY-MM-DD.
For example- DATE ‘2020-02-04’
INTERVAL
Hive INTERVAL data types are available starting with Hive version 1.2.
Hive accepts the interval syntax with unit specifications; we have to specify the units
along with the interval value.
For example, INTERVAL ‘1’ DAY refers to a time span of one day.
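A small sketch (hypothetical table and column names) showing the date/time types in a table definition and a query using a DATE literal:
CREATE TABLE events (
event_id INT,
event_time TIMESTAMP,
event_date DATE
);
SELECT event_id FROM events WHERE event_date = DATE '2020-02-04';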
String data type
a. STRING
In Hive, String literals are represented either with the single quotes(‘ ’) or with double-quotes(“ ”).
Hive uses C-style escaping.
b. VARCHAR
In Hive, VARCHAR data types are of different lengths, but we have to specify the maximum
number of characters allowed in the character string.
If the string value assigned to the varchar is less than the maximum length, then the remaining
space will be freed out.
Also, if the string value assigned is more than the maximum length, then the string is silently
truncated.
The length of a VARCHAR is between 1 and 65,535.
Trailing whitespace is important in varchar and will affect the comparison results.
c. CHAR
CHAR data types are fixed-length.
Values shorter than the specified length are padded with spaces.
Unlike VARCHAR, trailing spaces are not significant in CHAR types during comparisons.
The maximum length of CHAR is fixed at 255.
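A minimal sketch (hypothetical table and column names) showing the three string types together:
CREATE TABLE customers (
full_name STRING,
country_code CHAR(2),
email VARCHAR(100)
);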
Miscellaneous data type
a. BOOLEAN
Boolean types in Hive store either true or false.
b. BINARY
BINARY type in Hive is an array of bytes.
Arrays
Array in Hive is an ordered sequence of similar type elements that are indexable using the
zero-based integers.
Arrays in Hive are similar to the arrays in JAVA.
array<datatype>
Example: array(‘AIML’,’Dept’). The second element is accessed as array[1].
Maps
Map in Hive is a collection of key-value pairs, where the fields are accessed using array
notations of keys (e.g., [‘key’]).
map<primitive_type, data_type>
Example: ‘first’ -> ‘John’, ‘last’ -> ‘Deo’, represented as map(‘first’, ‘John’, ‘last’, ‘Deo’). Now
‘John’ can be accessed with map[‘first’].
Structs
STRUCT in Hive is similar to the STRUCT in C language. It is a record type that
encapsulates a set of named fields, which can be any primitive data type.
We can access the elements in STRUCT type using DOT (.) notation.
Example: address STRUCT<street:STRING, city:STRING, pincode:INT>
SELECT address.city FROM table_name;
Union
UNION type in Hive is similar to the UNION in C. UNION types at any point of time can
hold exactly one data type from its specified data types.
The full support for UNIONTYPE data type in Hive is still incomplete.
UNIONTYPE<data_type, data_type, ...>
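To see the complex types together, here is a minimal sketch (hypothetical table and column names) with the corresponding access syntax:
CREATE TABLE employee_details (
id INT,
skills ARRAY<STRING>,
phone MAP<STRING, STRING>,
address STRUCT<street:STRING, city:STRING, pincode:INT>
);
SELECT skills[0], phone['home'], address.city FROM employee_details;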
HiveQL
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore.
HiveQL provides a familiar SQL-like interface for querying and managing large datasets
stored in HDFS or other compatible storage systems.
SQL-Like Syntax:
HiveQL syntax is similar to traditional SQL, making it accessible to users who are already
familiar with relational databases and SQL queries.
It allows users to express queries without dealing directly with MapReduce or other
underlying technologies.
Data Definition Language (DDL):
HiveQL supports DDL statements for defining and managing database objects such as
tables, views, and partitions.
Example:
CREATE TABLE my_table (
id INT,
name STRING,
age INT
);
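DDL also covers other objects such as views; for instance, a sketch defining a view over the table above:
CREATE VIEW adults AS SELECT id, name FROM my_table WHERE age >= 18;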
Data Manipulation Language (DML):
DML statements in HiveQL are used for querying, inserting, updating, and deleting data
in tables.
Example:
SELECT * FROM my_table WHERE age > 25;
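Rows can also be inserted directly (supported from Hive 0.14 onward); a sketch using the my_table defined above, with made-up values:
INSERT INTO TABLE my_table VALUES (1, 'Asha', 30);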
Data Loading and Insertion:
HiveQL provides commands for loading data into tables, either from existing files or
external sources.
Example:
LOAD DATA INPATH '/path/to/data' INTO TABLE my_table;
Complex Data Types:
HiveQL supports complex data types such as arrays, maps, and structs, allowing users to
work with semi-structured or nested data.
Example:
CREATE TABLE complex_table (
id INT,
names ARRAY<STRING>,
details MAP<STRING, INT>
);
User-Defined Functions (UDFs):
Users can define custom functions, known as User-Defined Functions (UDFs), in HiveQL to
perform operations not natively supported.
Example:
CREATE FUNCTION my_udf AS 'com.example.MyUDF' USING JAR
'hdfs:///path/to/my_udf.jar';
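Once registered, the UDF can be invoked like a built-in function; a sketch assuming my_udf accepts a single string argument:
SELECT my_udf(name) FROM my_table;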
Join Operations:
HiveQL supports various types of join operations, including INNER JOIN, LEFT JOIN,
RIGHT JOIN, and FULL OUTER JOIN.
Example:
SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;
Group By and Aggregation:
HiveQL allows users to perform aggregation functions using GROUP BY, such as COUNT,
SUM, AVG, etc.
Example:
SELECT department, AVG(salary) FROM employee GROUP BY department;
Partitioning and Bucketing:
Users can define and work with partitioned tables in HiveQL, allowing for more efficient data
organization and retrieval.
Example
CREATE TABLE partitioned_table (col1 INT, col2 STRING)
PARTITIONED BY (date STRING);
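Data can then be written into a specific partition; a sketch assuming a staging table raw_table (hypothetical) whose columns match col1 and col2 (the partition column is quoted with backticks, since date is a reserved word in recent Hive versions):
INSERT INTO TABLE partitioned_table PARTITION (`date`='2024-01-01')
SELECT col1, col2 FROM raw_table;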
Types of Hive Tables
Hive has mainly two types of tables:
1. Managed Table (Internal Table)
2. External Table
Managed Table (Internal Table)
Hive controls both metadata and the data in a table.
Data is stored in Hive’s warehouse directory in HDFS, by default /user/hive/warehouse/tablename/
If you drop the table, both schema and data are deleted.
Example:
CREATE TABLE employees (id INT, name STRING, department STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
It creates a table named employees.
The data is stored inside /user/hive/warehouse/employees/ in HDFS.
If you run: DROP TABLE employees;
Table and all its data will be deleted from HDFS.
External Table
Hive manages only the metadata of the table.
The actual data remains outside Hive's warehouse directory.
If you drop the table, metadata is deleted but the data remains.
It is useful when:
❑ You want to share data with other tools like Pig, Spark, etc.
❑ The data is already present in HDFS and you don’t want to move it
Example:
CREATE EXTERNAL TABLE employees (id INT, name STRING, department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/xyz/employees/';
If you run: DROP TABLE employees;
Only the schema (metadata) is deleted.
The actual data at the HDFS path remains the same.
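To check whether an existing table is managed or external, you can inspect its metadata; for example:
DESCRIBE FORMATTED employees;
The output includes a Table Type field showing MANAGED_TABLE or EXTERNAL_TABLE.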
File Format Types
File formats decide how data is stored inside the files.
Hive supports several file formats:
❑ Text File
❑ SequenceFile
❑ Avro Files
❑ Parquet
TEXT FILE FORMAT:
This stores data as a plain text file and is also the default file format.
Delimited by — the separator used in the data (comma, period, tab, any complex symbol, or even
some characters).
NULL format — changes how null values in the data are represented, e.g. NULL DEFINED
AS 'NULL'. The default is '\N'.
It is plain text (like CSV or TSV), and easy to read and write.
It can be used when you want easy access for viewing/editing.
Example:
CREATE TABLE students (id INT, name STRING, marks FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load Data:
LOAD DATA LOCAL INPATH '/home/cloudera/students.csv' INTO TABLE students;
students.csv
1,Ravi,80.5
2,Meena,90.0
SequenceFile Format:
Sequence files are flat files that store values in binary key-value pairs.
The sequence files are in binary format and are splittable.
They can be used when the data size is smaller than the block size (which does not happen in most cases), since many small files can be packed into a single sequence file.
It is faster to read/write than plain text.
It is used when the data is generated by MapReduce; this format is written by programs, not manually.
It is not human-readable. It is efficient for internal Hive use and not often used by analysts.
Example:
CREATE TABLE students_seq (id INT, name STRING, marks FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
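Because sequence files are written by programs rather than edited by hand, such a table is usually populated from another table; a sketch assuming the students text-file table created earlier:
INSERT OVERWRITE TABLE students_seq SELECT * FROM students;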
Avro Format:
AVRO is a row-based format with a high degree of splittability.
The schema is stored in JSON format, while the data is stored in binary format, which
minimizes file size and maximizes efficiency.
It supports data serialization and is most reliable for schema evolution.
With schema evolution, you can change schemas over time.
Example:
CREATE TABLE students_avro
STORED AS AVRO
TBLPROPERTIES (
'avro.schema.literal'='{
"type":"record",
"name":"Student",
"fields":[
{"name":"id", "type":"int"},
{"name":"name", "type":"string"},
{"name":"marks", "type":"float"}
]
}'
);
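From Hive 0.14 onward, the Avro schema can also be derived from the column definitions, so an equivalent table can be declared more simply; a minimal sketch:
CREATE TABLE students_avro2 (id INT, name STRING, marks FLOAT) STORED AS AVRO;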
Parquet Format:
Parquet is a column-oriented binary file format.
Good for scanning large scale queries.
Parquet is optimized for the WORM model (write once, read many).
It's a good choice for heavy read loads.
Very fast for analytical queries that read only a few columns.
Highly compressed and optimized for large-scale queries.
Example
CREATE TABLE students_parquet (id INT, name STRING, marks FLOAT) STORED AS PARQUET;
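An analytical query such as the following only needs to read the marks column from the Parquet files, which is where the columnar layout pays off:
SELECT AVG(marks) FROM students_parquet;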
NOSQL
NoSQL is a type of database that stores and retrieves data without needing to define
its structure first; it is an alternative to relational databases.
NoSQL, which stands for “not only SQL,” is an approach to database design that provides
flexible schemas for the storage and retrieval of data beyond the traditional table
structures found in relational databases.
NoSQL database doesn't use tables for storing data.
It is generally used to store big data and real-time web applications.
Relational databases also face the problem that they cannot handle big data; due to this
problem, there was a need for a database that could handle such workloads, and the
NoSQL database was developed.
Advantages of NoSQL
It supports query language.
It provides fast performance.
It provides horizontal scalability.
Types
four main types of NoSQL databases:
Document databases
Key-value stores
Column-oriented databases
Graph databases
HBase
HBase is a distributed column-oriented database built on top of the Hadoop file
system.
It is a part of the Hadoop ecosystem that provides random real-time read/write access
to data in the Hadoop File System.
Data producers store the data in HDFS either directly or through HBase.
Data consumers read/access the data in HDFS randomly using HBase.
Storage Mechanism
HBase is a column-oriented database and the tables in it are sorted by row.
The table schema defines only column families.
A table can have multiple column families and each column family can have any number
of columns.
Features
Column-Oriented Storage: HBase stores data in column families rather than rows
Schema-less: Each row can have a different number of columns.
Distributed and Scalable: HBase supports horizontal scalability using region servers.
Strong Consistency: HBase follows a strong consistency model, meaning any read
returns the most recent write for a given key.
Automatic Sharding: Data is split into regions and automatically distributed across
servers.
Versioned Data: HBase stores multiple versions of a cell, with each write tagged by a
timestamp.
NoSQL Database: HBase is a non-relational (NoSQL) database.
HBase vs RDBMS
❑ HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of tables.
❑ HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
❑ There are no transactions in HBase. An RDBMS is transactional.
❑ HBase has de-normalized data. An RDBMS has normalized data.
❑ HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
Architecture
In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into Stores. Stores are saved as files
in HDFS.
HBase has three major components: the client library, a master server, and region
servers.
MasterServer
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
Handles load balancing of the regions across region servers.
It unloads the busy servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
Region server
Communicates with the clients and handles data-related operations.
Handles read and write requests for all the regions under it.
Decides the size of the region by following the region size thresholds.