Let's break down the process of how a Hive query is executed in simple terms:
1. Execute Query
● What Happens: You submit a query using a Hive interface, like a command line or web
interface.
● Example: Imagine you want to find the average sales for your store. You write a SQL-like query in Hive, such as:
SELECT AVG(sales) FROM store_data;
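Any HiveQL statement is submitted the same way. For example, before writing the query you could explore what is available, straight from the same interface:
SHOW DATABASES;  -- list the databases you can query
SHOW TABLES;     -- list the tables in the current database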
2. Get Plan
● What Happens: The query is passed to a "driver," which uses a query compiler to first
check if your query is written correctly (syntax check) and then decide on the best way to
get the answer (query plan).
● Example: The compiler checks if you've written "AVG" and "sales" correctly and figures
out which parts of your data need to be read.
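You can ask Hive to show you this plan with the EXPLAIN statement. A minimal sketch, reusing the query from step 1:
EXPLAIN SELECT AVG(sales) FROM store_data;
The output describes the stages Hive intends to run, such as a map stage that scans the table and a reduce stage that computes the aggregate.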
3. Get Metadata
● What Happens: The compiler now asks the Metastore (a database storing metadata) for
information about the tables in your query, like their structure.
● Example: It might ask, "What is the structure of the store_data table? Does it have a
sales column?"
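You can look up the same metadata yourself from the Hive interface, again assuming the store_data table exists:
DESCRIBE store_data;            -- column names and types
DESCRIBE FORMATTED store_data;  -- adds storage location, file format, and other details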
4. Send Metadata
● What Happens: The Metastore sends back details like the table's schema, location, and
column types.
● Example: The Metastore might respond, "Yes, the store_data table has a column
called sales and it's stored in this specific format on these servers."
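That response reflects how the table was originally defined. A hypothetical definition of store_data that would produce such an answer (the sale_date column and ORC format are illustrative assumptions, not details from this walkthrough):
CREATE TABLE store_data (
  sale_date DATE,   -- hypothetical column, reused in the step 7 example below
  sales     DOUBLE  -- the column our AVG query reads
)
STORED AS ORC;      -- the storage format is one of the details the Metastore tracks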
5. Send Plan
● What Happens: After getting the metadata, the compiler finalizes the plan to run your
query and gives it back to the driver.
● Example: The driver now knows how it will execute the query, what data to read, and in
what order.
6. Execute Plan
● What Happens: The driver sends this plan to the execution engine.
● Example: The execution engine starts preparing for the actual work, much like a chef
getting ingredients ready based on a recipe.
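One note on the engine: the flow described here is the classic MapReduce path, but newer Hive versions can also hand the plan to Tez or Spark. The engine is chosen with a configuration property:
SET hive.execution.engine=mr;  -- classic MapReduce, as in this walkthrough
-- newer deployments may use 'tez' or 'spark' instead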
7. Execute Job (MapReduce)
● What Happens: The execution engine runs the query as MapReduce jobs, which break the work into small tasks distributed across different machines. It sends each job to the JobTracker, which assigns tasks to TaskTrackers running on the data nodes (the computers in the cluster).
● Example: The task might be, "Find the total sales per day across many servers," and
each server handles a chunk of the data, reporting results back.
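The per-day example above would look like this in HiveQL, using the hypothetical sale_date column from step 4:
-- Map tasks read chunks of store_data and emit (sale_date, sales) pairs;
-- reduce tasks then sum the values for each date.
SELECT sale_date, SUM(sales) AS total_sales
FROM store_data
GROUP BY sale_date;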
7.1 Metadata Ops (During Execution)
● What Happens: While the query is running, the execution engine may go back to the Metastore for additional metadata.
● Example: It might need to check details about a table's partitions or where certain data
is stored.
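Partition listings are a typical mid-run lookup. For a partitioned table, you can see the same information yourself:
SHOW PARTITIONS store_data;  -- only works if the table is actually partitioned
(Our hypothetical store_data definition in step 4 is unpartitioned, so this line is illustrative only.)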
8. Fetch Results
● What Happens: After the MapReduce job finishes, the execution engine collects all the
results from different nodes (servers).
● Example: Each server that processed part of the data sends back its partial result, such as the sum and count of the sales values it handled.
9. Send Results to Driver
● What Happens: The execution engine sends the final results to the driver.
● Example: The execution engine combines those partial results into the final average and passes that figure to the driver.
10. Send Results to Hive Interface
● What Happens: The driver sends the final result back to the interface where you
submitted the query, like the command line or web UI.
● Example: Finally, you see the result on your screen, say, an average sales figure of $500.
Summary:
In simple terms, Hive takes your SQL-like query, checks if it's written correctly, figures out how
to run it, and distributes the work across many machines in the background. It uses MapReduce
to break down the job, fetches results from different parts of the system, and then sends the
answer back to you.