Hive and Impala

Hive and Impala provide SQL interfaces for querying data stored in Hadoop. Hive was developed earlier and uses MapReduce while Impala uses its own query engine for faster performance. Both tools allow querying data from HDFS and other Hadoop components in an SQL-like manner.

Uploaded by

Joe1

Big Data Hadoop and Spark Developer

Lesson 4—Basics of Hive and Impala

© Simplilearn. All rights reserved.

Learning Objectives

Identify the features of Hive and Impala

Understand the methods to interact with Hive and Impala

Basics of Hive and Impala
Topic 1: Features of Hive and Impala
Introduction to Hive and Impala

SELECT t1.a1 as c1, t2.b1 as c2
FROM t1 JOIN t2 ON (t1.a2=t2.b2);

• Hive and Impala provide an SQL-like interface for users to extract data from the Hadoop system.
• They reside on top of Hadoop and can be used to query data from the underlying storage components.

[Diagram layers: Batch Processing, Interactive SQL; Resource Management; Storage (HDFS, HBase)]
Hive and Impala: Similarities

• Hive is very similar to Impala in the following ways:

Hive and Impala: Differences

Hive:
• Was developed by Facebook around 2007.
• Is an open source Apache project.
• Has a high-level abstraction layer on top of MapReduce and Apache Spark.
• Uses HiveQL to query the structured data in a metastore.
• Is suitable for structured data.

Impala:
• Was developed by Cloudera around 2012.
• Is an Apache project (it entered the Apache Incubator in 2015 and became a top-level project in 2017).
• Has a high-performance dedicated SQL engine.
• Uses Impala SQL for ad hoc queries.
• Is designed for high concurrency and ad hoc queries.
Hive and Impala: Comparison

Hive:
• Provides more features than Impala
• Is highly extensible
• Used mostly for batch processing

Impala:
• Comprises a specialized SQL engine that offers five to fifty times faster performance than Hive
• Used mainly for interactive queries and data analysis
• Accommodates many concurrent users
Relational Databases vs. Hive vs. Impala
Use Case: Hive and Impala

Hive and Impala are commonly used to analyze social media coverage.
Basics of Hive and Impala
Topic 2: Interacting with Hive and Impala
Executing a Query in Hive and Impala

Both Hive and Impala first receive the SQL query. The remaining steps differ:

Hive:
1. Parse HiveQL
2. Make optimizations
3. Plan execution
4. Submit job(s) to cluster
5. Monitor progress
6. Process data (MapReduce or Apache Spark)
7. Store the data in HDFS

Impala:
1. Parse Impala SQL
2. Make optimizations
3. Plan execution
4. Execute query on cluster
5. Store the data in HDFS
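
The two flows above can be compared side by side. The following is a minimal Python sketch that only models the stage names listed on the slide; the variable and function names are illustrative and are not part of Hive or Impala.

```python
# Stage names copied from the two execution flows above. This models the
# slide's lists only and does not talk to a real cluster.
HIVE_STAGES = [
    "Parse HiveQL",
    "Make optimizations",
    "Plan execution",
    "Submit job(s) to cluster",
    "Monitor progress",
    "Process data (MapReduce or Apache Spark)",
    "Store the data in HDFS",
]

IMPALA_STAGES = [
    "Parse Impala SQL",
    "Make optimizations",
    "Plan execution",
    "Execute query on cluster",
    "Store the data in HDFS",
]


def stages_only_in(first, second):
    """Return the stages of `first` that do not appear in `second`, in order."""
    return [stage for stage in first if stage not in second]


if __name__ == "__main__":
    for stage in stages_only_in(HIVE_STAGES, IMPALA_STAGES):
        print("Hive only:", stage)
```

Apart from parsing its own dialect, the stages unique to Hive are the job submission, monitoring, and MapReduce or Spark processing steps, which is where the batch-oriented overhead in the slide's flow sits.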

Interfaces to Run Hive and Impala Queries

Hive and Impala offer numerous interfaces to run queries:

• Command-line shell:
– Impala: Impala shell
– Hive: Beeline
• Hue Web UI:
– Hive Query Editor
– Impala Query Editor
• Metastore Manager:
– ODBC/JDBC
Impala Lab Access Details

• The steps to start Impala in the lab are as follows:

Step 1: Log in to the cloud lab web console with your credentials.
Step 2: Connect to any daemon server with the help of the command below:
impala-shell -i cloudera-slavenode3.cloudlab.com
Demo
Starting Impala Lab

Demonstrate the method to start and connect to the Impala lab from the command line.
Connecting with Hive and Impala Shell

• To execute Impala commands from Impala shell:

• To run Hive using Beeline:

Running Impala Queries from Command Line

To check all options of Impala using the help option:
impala-shell --help

To run direct queries from the shell using the -q option:
impala-shell -q 'select * from simple'

To issue a use database on startup using the -d option:
impala-shell -d Simplilearn
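
When these invocations are generated from a script, it can help to build the argument list programmatically so that quoting is handled consistently. The following is a minimal Python sketch, not an official client: the helper name `impala_shell_args` is hypothetical, and it only assembles the `-i`, `-d`, `-q`, and `-f` options mentioned in this lesson without contacting a cluster.

```python
import shlex


def impala_shell_args(host=None, database=None, query=None, script=None):
    """Assemble an impala-shell argument list (hypothetical helper).

    Only options named in this lesson are used:
    -i (daemon host), -d (database to use on startup),
    -q (single query), -f (file of statements).
    """
    if query is not None and script is not None:
        raise ValueError("pass either query or script, not both")
    args = ["impala-shell"]
    if host:
        args += ["-i", host]
    if database:
        args += ["-d", database]
    if query is not None:
        args += ["-q", query]
    if script is not None:
        args += ["-f", script]
    return args


if __name__ == "__main__":
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(impala_shell_args(query="select * from simple")))
    print(shlex.join(impala_shell_args(database="Simplilearn")))
```

On a machine where impala-shell is installed, the returned list could be passed to `subprocess.run(args, check=True)`; keeping the arguments as a list avoids hand-written shell quoting around the query text.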
Demo
Connecting with Hive and Impala Shell

Demonstrate the method to connect with Hive and Impala shell, along with some basic
operations.
Sample Queries

To explore a new Impala instance:
SELECT version();
SELECT current_database();

To create a database:
CREATE DATABASE IF NOT EXISTS sample;

To verify a database:
SHOW databases;

To specify the location where the database is to be created:
CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;

To switch the current session to another database:
USE db_name;

To create a table in Parquet format:
CREATE TABLE stockprice
(stock_id INT,
date STRING,
open_price FLOAT,
high_price FLOAT,
low_price FLOAT,
close_price FLOAT,
stock_volume INT,
adjclose_price FLOAT)
STORED AS PARQUET
LOCATION '/home/singh25nov_gmail/input';

To expose CSV data files as an external table:
CREATE EXTERNAL TABLE stop_loss
(
stock_id INT,
stock_volume FLOAT,
stock_current_rate DOUBLE,
stock_trigger_price DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';

To list all tables in the current database in Impala:
SHOW tables;

To insert a single row:
INSERT INTO stockprice
(date, open_price, high_price, low_price, close_price, stock_volume, adjclose_price)
VALUES ('15112017', 102, 105, 98.6, 100, 154711, 100);

To run an existing SQL script (for example, when migrating from another SQL system):
impala-shell -i <impala-daemon-uri> -f <filename>.sql

To aggregate and join:
SELECT stockprice.open_price,
MAX(stockprice.stock_volume),
MIN(stop_loss.stock_current_rate)
FROM stop_loss JOIN stockprice USING (stock_id)
GROUP BY stockprice.open_price
ORDER BY 1
LIMIT 5;
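
To see what a join with aggregation returns without access to a cluster, the same query shape, grouped on the selected open_price column, can be tried against an in-memory database. The following is a sketch that uses Python's built-in `sqlite3` module as a stand-in for Impala. The rows are made-up sample values, the schemas are trimmed to the columns the query touches, and SQLite can differ from Impala in details such as type handling.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for Impala.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stockprice (stock_id INT, open_price REAL, stock_volume INT);
    CREATE TABLE stop_loss (stock_id INT, stock_current_rate REAL);
    -- Made-up sample rows for illustration only.
    INSERT INTO stockprice VALUES (1, 102.0, 500), (2, 102.0, 900), (3, 98.5, 300);
    INSERT INTO stop_loss VALUES (1, 101.0), (2, 99.5), (3, 97.0);
    """
)

# Join the two tables on stock_id, then aggregate per opening price.
rows = conn.execute(
    """
    SELECT stockprice.open_price,
           MAX(stockprice.stock_volume),
           MIN(stop_loss.stock_current_rate)
    FROM stop_loss JOIN stockprice USING (stock_id)
    GROUP BY stockprice.open_price
    ORDER BY 1
    LIMIT 5
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Each output row holds one opening price together with the largest volume and the lowest current rate among the stocks that opened at that price.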

To drop a database:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT | CASCADE];

To query the Impala table:

• Interactive mode:
SELECT count(*) FROM stockprice;
• Set of commands contained in a file:
impala-shell -i impala-host -f <filename>.sql
• Single command to the impala-shell:
impala-shell -i impala-host -q 'select count(*) from stockprice'
Executing Queries in the Impala Shell

[localhost.localdomain:21000] > select * from webpage where page_id > 40;

Demonstrate the sample Impala queries.

Running Hive Queries Using Beeline

• The character “!” is used to execute Beeline commands.

Some commonly used Beeline commands:

• !exit: Exits the shell
• !help: Shows the list of all commands
• !verbose: Shows additional details of queries
Demo
Running Hive Queries Using Beeline

Demonstrate the method to connect with Beeline and execute basic queries.
Running Beeline from Command Line

To execute a file using the -f option (the -u option supplies the JDBC connection URL):
beeline -u … -f simplilearn.hql

To use HiveQL directly from the command line using the -e option:
beeline -u … -e 'SELECT * FROM users'

To continue running a script even after an error:
beeline -u … --force=true
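
The Beeline options above can be combined in one invocation. The following is a minimal Python sketch that assembles such a command line; the helper name `beeline_args` is hypothetical, and the JDBC URL in the usage lines is only a placeholder for a HiveServer2 address, not a value taken from this lesson.

```python
import shlex


def beeline_args(jdbc_url, script=None, query=None, force=False):
    """Assemble a Beeline argument list (hypothetical helper).

    -u supplies the JDBC connection URL, -f runs a script file,
    -e runs a single HiveQL statement, and --force=true keeps a
    script running after an error.
    """
    args = ["beeline", "-u", jdbc_url]
    if script is not None:
        args += ["-f", script]
    if query is not None:
        args += ["-e", query]
    if force:
        args.append("--force=true")
    return args


if __name__ == "__main__":
    # Placeholder URL (assumption); replace it with your HiveServer2 URL.
    url = "jdbc:hive2://localhost:10000"
    print(shlex.join(beeline_args(url, script="simplilearn.hql", force=True)))
    print(shlex.join(beeline_args(url, query="SELECT * FROM users")))
```

Building the command as a list keeps the HiveQL text as a single argument, so the query does not need hand-written shell quoting.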

Running Hive Query

• All SQL commands are terminated with a semicolon “;”

hive> select * from device
    > LIMIT 5;
OK
1 2008-10-21 00:00:00 Sorrento F00L phone
2 2010-04-19 00:00:00 Titanic 2100 phone
3 2011-02-18 00:00:00 MeeToo 3.0 phone
4 2011-09-21 00:00:00 MeeToo 3.1 phone
5 2008-10-21 00:00:00 iFruit 1 phone
Time taken: 0.296 seconds, Fetched: 5 row(s)
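
The session above spreads one statement over two input lines, and the shell keeps reading until it sees the terminating semicolon. The following is a simplified Python sketch of that rule. The function name is illustrative, and the sketch ignores semicolons inside string literals or comments, which a real shell has to handle.

```python
def split_statements(lines):
    """Group input lines into statements, each ended by a trailing ';'.

    Simplified: a statement ends when a line's last non-blank character is
    a semicolon. Semicolons inside quotes or comments are not handled.
    Returns (complete_statements, pending_lines).
    """
    statements, pending = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank input lines
        pending.append(stripped)
        if stripped.endswith(";"):
            statements.append(" ".join(pending))
            pending = []
    return statements, pending


if __name__ == "__main__":
    done, unfinished = split_statements(["select * from device", "LIMIT 5;"])
    print(done)        # complete statements
    print(unfinished)  # lines still waiting for a ';'
```

Lines that have not yet reached a semicolon are returned separately, which mirrors the continuation prompt shown in the session.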
Connecting Hive and Impala Shell with Hue

• Hue can be used to write Hive and Impala queries from the user interface.
Demo
Connecting Hive and Impala Shell with Hue

Demonstrate the method to connect Hive and Impala shell using Hue.
Hive and Impala Editors in Hue

[Diagram 1 and Diagram 2: the Hive and Impala editors in Hue]
Key Takeaways

Hive and Impala are tools to perform SQL queries on data residing on HDFS
or HBase.

Hive and Impala are easy to learn for experienced SQL developers.

Hive and Impala solve the Big Data problem but cannot replace a traditional
RDBMS.

Hive runs MapReduce or Spark jobs on Hadoop based on HiveQL statements.

Impala uses a very fast specialized SQL engine that is faster than MapReduce.
Quiz
QUIZ 1
Which of the following components can be used to accept command inputs from users?

a. Command Line Interface

b. Query compiler

c. Execution engine

d. Thrift server

The correct answer is a.

The Command Line Interface is used as an input medium to accept command input from users.
QUIZ 2
Hive can be accessed from Hue using ________.

a. Impala editor

b. Hive Editor

c. File browser

d. YARN UI

The correct answer is b.

Hive can be accessed through the Hive editor in Hue.
QUIZ 3
Impala can be accessed from Hue using ________.

a. Impala editor

b. Hive Editor

c. File browser

d. YARN UI

The correct answer is a.

Impala can be accessed through the Impala editor in Hue.
QUIZ 4
Updating an individual record is possible in ______.

a. Impala

b. Hive

c. RDBMS

d. All of the above

The correct answer is c.

Hive and Impala cannot update individual records, but an RDBMS can.
QUIZ 5
Deleting an individual record is possible in _______.

a. RDBMS

b. Hive

c. Impala

d. All of the above

The correct answer is a.

Hive and Impala cannot delete individual records, but an RDBMS can.
This concludes “Basics of Hive and Impala.”
The next lesson is “Working with Hive and Impala.”
