On-going work. This version is the result of a 3 months internship at NTU Singapore and is to be continued.
This repository contains a skeleton of QHICS (Quantile Hypothetical Index for Column-oriented Storage).
One can find a working implementation for Singlestore on the singlestore
branch.
This project provides a lightweight hypothetical index benefit estimator designed specifically for column-oriented storage.
If you’re not familiar with column-oriented storage, you can read about C-store for an introduction, and explore Singlestore as a modern example.
The implementation uses new heuristics tailored to the columnar architecture. Detailed explanations, including technical insights and results, can be found in the internship report.
QHICS is composed of interconnected components, each in its own root folder, to be adapted for your environment:
-
Database
Functions for communication with your database (getting cardinalities, NDV, schema info, etc.). This provides the abstraction layer used by all components. Implement caching as calls here may be frequent. -
WorkloadAnalysis
This section needs to be completed for your system. You must develop a tool that takes a SQL query (string) and returnsJoinMap
andAccessMap
(see thetypes
folder for definitions). A hit factor file is provided to estimate access costs and can be directly used. -
Estimator
Contains the Encoder (which converts an index into encoding vectors) and the regression models. -
Main
The main entry point connecting all subfolders.
You will need to build some components for your database (detailled below). However, a working example is built on the Singlestore branch. You can check singlestore_example.ipynb
on it.
Before doing anything, be sure to have install everyting in requirements.txt
You first need to get a DbWrapper
instance connected to your database and a DbUtilities
. Once you have them, you can create a QHICS instance using whatif = Qhics(db_wrapper,dbu)
.
Suppose that you know what is the usual kind of queries issued to your database, you can set the original workload of QHICS using set_workload
:
db_wrapper.run_query("USE tpch")
known_workload = [
"SELECT c_nationkey FROM CUSTOMER WHERE c_acctbal > 150",
"SELECT o_orderstatus, o_totalprice, o_shippriority FROM ORDERS WHERE o_orderdate >= '2004-02-04'",
"SELECT l_shipinstruct FROM LINEITEM WHERE L_ORDERKEY = 12345"
]
whatif.set_workload(known_workload)
To be able to work, QHICS needs both an estimator and a cost model. You can simply create them using
whatif.create_encoder()
whatif.create_cost_model(fit=True)
If fit
is set to True
, the
QHICS also supports online edit of the workload without recomputing every query metadata. In the following example you can see how QHICS can be set aware of sales in France and then set the sales off:
# Some sales are happening in France,
# there is new queries related to it:
new_queries = [
"SELECT n_regionkey FROM NATION WHERE n_name = 'France'",
"SELECT c_nationkey FROM CUSTOMER WHERE c_nationkey = 5",
"SELECT O_ORDERKEY FROM ORDERS JOIN CUSTOMER ON c_nationkey = 5 AND O_CUSTKEY = C_CUSTKEY"
]
whatif.add_to_workload(new_queries)
# Now that the sales are over,
# these queries are not index relevant anymore.
whatif.remove_from_workload(new_queries)
You can also change the configuration online (this is indexes that are already materialized), for example:
# Erasing previous configuration and writing the new one
existing_indexes = [Index("LINEITEM",["l_orderkey"],["int"],"Hash")]
whatif.set_configuration(existing_indexes)
# Adding new indexes
# For the sale in France you've created a special index.
new_indexes = [Index("CUSTOMER",["C_NATIONKEY"],["int"],"Hash")]
whatif.add_to_configuration(new_indexes)
# Removing indexes
# Now that the sales are over, you can
# remove the special index to save resources.
whatif.remove_from_configuration(new_indexes)
Once QHICS is ready, you can estimate a benefit using estimate_benefit
:
candidate1 = Index("LINEITEM",["L_QUANTITY"],["decimal"])
candidate2 = Index("LINEITEM",["L_LINENUMBER"],["integer"])
whatif.estimate_benefit(candidate1),whatif.estimate_benefit(candidate2)
The estimation gets better as you get more historical data. Preliminary tests report that you can use QHICS to compare two indexes even without historical data. However, you should not rely on the value itself if you do not have some historical data.
An easy way to get historical data is to launch a script similar to this one on a configuration that looks like your configuration, or in off-peak scenario:
from QHICS.types.wa_types import Index
known_workload = [
f"SELECT * FROM LINEITEM WHERE L_QUANTITY = {v}"
for v in range(1,20)
]
whatif.set_workload(known_workload)
i = Index("LINEITEM",["L_QUANTITY"],["decimal"])
# You must have set an encoder and a cost model (to have history_data_path set).
whatif.create_historical_data(i)
Obviously, by doing so QHICS will learn the capacity of your database at the given time, hence if you want to simulate a highly-demanding time of the day you will need to simulate this pressure while QHICS is getting data.
This consumes many resources, use it with caution.
This repository contains all code needed to take a tuple[list[AccessMap], list[JoinMap]]
and use QHICS to estimate index benefits. However, to produce these types, you need to develop components for your target database. The singlestore
branch provides an example implementation for Singlestore (note this is just a research example and is not affiliated or funded by Singlestore).
Specifically, you need to:
-
Edit the Database folder to implement functions using your database language’s input/output syntax (including getting column cardinalities with predicates, fetching the information schema, etc.).
-
Edit the Utils folder to ensure all functions related to query optimizer behaviors are correct for your database.
-
Complete the WorkloadAnalysis folder by implementing the
analyze_one_query
function, which takes a query string and returns the discussed types. Check for existing tools (e.g., hit factor approximation) to avoid re-implementing functionality. -
( Optional ) Add index types in the encoding formula or SQL operators in the parsing.
After these are complete and verified, you can run QHICS to get index benefit estimation.
This work was developed as part of Clément Rouvroy’s internship at Nanyang Technological University (NTU), Singapore, supervised by PhD SHI Jiachen and Prof. CONG Gao. It was made possible thanks to funding from ENS-PSL.