🟢🍋 LimeQO: Low Rank Learning for Offline Query Optimization

🌟 Overview

LimeQO is a novel offline query optimizer that leverages low-rank structure in the workload matrix to optimize query performance efficiently.

What is the workload matrix?
  • Each row represents a query.
  • Each column represents a hint (e.g., disable Nested Loop Join).
  • Each cell represents the runtime of the query under the given hint (in seconds).
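As a concrete illustration, a workload matrix can be represented as a 2-D array with `NaN` marking query/hint combinations that have not been executed yet. The numbers and dimensions below are hypothetical, not taken from the paper's datasets:

```python
import numpy as np

# Hypothetical 3-query x 4-hint workload matrix:
# rows = queries, columns = hints (e.g. "disable nested-loop join"),
# cells = runtime in seconds, NaN = not yet observed.
W = np.array([
    [12.4,   3.1, np.nan,  12.9],   # q1: hint 2 is much faster
    [ 0.8, np.nan,   0.7,   0.9],   # q2: hints barely matter
    [45.0,  22.5, np.nan, np.nan],  # q3: mostly unexplored
])

observed = ~np.isnan(W)
print(f"{int(observed.sum())} of {W.size} cells observed")
```

The offline optimizer's job is to decide which of the `NaN` cells are worth spending execution time on.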
Offline query optimization?
  • What: Effectively explore alternative query execution plans offline, minimizing total workload runtime while keeping offline computation time low.
  • Why: Focus on repetitive workloads and prevent regressions.
  • How: Explore optimal plans within the available query-hint space by steering the existing query optimizer offline.
How does LimeQO work, in one sentence?
  • LimeQO uses the ALS (Alternating Least Squares) algorithm to predict runtimes, then leverages the completed workload matrix to efficiently select query hints for offline exploration.

🚩 Features

  • Simple implementation: Uses a straightforward matrix factorization algorithm (ALS); no featurization needed!
  • Low overhead: Negligible additional computation time.
  • High performance: Achieves a 2x speedup within just one hour.


🔗 Dataset

Download the dataset from Dropbox. It contains the four workloads used in the experiments: ceb, job, stack, and dsb. Each zip includes query runtimes and corresponding EXPLAIN plans for all queries and hints.

wget -O dataset/qo_dataset.zip "https://www.dropbox.com/scl/fo/y4e88tmcx7ywo4ou1unnh/ABN6iqV1t_ecktO51gsPKRc?rlkey=hedjnmkpak3r3gxjzx48s9etu&st=uxnr4s17&dl=1"
unzip dataset/qo_dataset.zip -d dataset/
rm dataset/qo_dataset.zip

📂 Project Structure

.
├── dataset/              # Workload datasets (CEB, JOB, Stack, DSB)
├── src/                  
│   ├── data/             # Data loading and preprocessing
│   ├── models/           # Core LimeQO implementation
│   ├── utils/            # Helper functions and utilities
│   ├── strategies/       # Optimization strategies: Random, Greedy, LimeQO, LimeQO+
│   └── run_experiment.py # Run main experiments
├── draw/                 # Figures and visualizations
├── experiment/           # Experiment results
└── limeqo.ipynb          # Interactive demo notebook

🚀 Try LimeQO!

Check out the interactive demo in limeqo.ipynb to see how LimeQO works, achieving a 2x speedup in less than one hour! ⬇️


🧐 Detailed instructions on reproducing the results in the paper:

Download code and set up environment
git clone https://github.com/zixy17/LimeQO.git
pip install -e .
Data and Experiment Results
  • Experiment results
    The directory experiment/ already contains all results used in the paper.
    You can directly reproduce every figure from the paper using these results, without downloading any dataset.

  • Datasets
    The directory dataset/ contains partially packaged datasets.
    To fully re-run experiments, you need to download the complete datasets (including all EXPLAIN plans) from Dropbox:

    wget -O dataset/qo_dataset.zip "https://www.dropbox.com/scl/fo/y4e88tmcx7ywo4ou1unnh/ABN6iqV1t_ecktO51gsPKRc?rlkey=hedjnmkpak3r3gxjzx48s9etu&st=uxnr4s17&dl=1" 
    unzip dataset/qo_dataset.zip -d dataset/ 
    rm dataset/qo_dataset.zip
    
    • The EXPLAIN plans are required only for LimeQO+ (training requires the plan trees).
    • They are not needed for LimeQO or for reproducing figures.
Reproduce Figures from Experiment Results
  1. All experiment results are stored in experiment/:
    • ceb/, job/, stack/, dsb/ correspond to each dataset.
    • These results were produced by src/run_experiment.py and are used for figure generation.
  2. To reproduce the plots from the paper, run the notebooks in draw/.
    • Example: Figure 5 (the first figure in the experiments section) can be reproduced by running: draw_ceb_fig1.ipynb, draw_job_fig1.ipynb, draw_stack_fig1.ipynb, draw_dsb_fig1.ipynb
Reproduce ALL experiment results
  1. Download datasets as described above.
  2. Run experiments: python src/run_experiment.py --dataset ceb
    • --dataset can be one of ceb, job, stack, dsb.
    • This will run all methods, including baselines (QOAdvisor, Random, Greedy) and ours (LimeQO, LimeQO+).
    • Each baseline is repeated 20 times; LimeQO+ is repeated 5 times (since it involves training a neural network).
    • Running each dataset may take more than 1 hour. For example, the LimeQO+ method on the CEB dataset took ~2 hours per run (on CPU), so the total time is ~10 hours.
  3. Figures can then be regenerated by running the notebooks in draw/.
Demo: limeqo.ipynb
  • This demo reproduces LimeQO results on the CEB benchmark (one figure from the paper).
  • It is designed as a quick exploratory example and does not reproduce baselines or LimeQO+.
  • No dataset download is required for this demo.
  • Running the demo may produce a RuntimeWarning: overflow encountered in expm1. This does not affect correctness of the results or the reproduced figure.
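This warning can be reproduced in isolation. The snippet below is a minimal numpy reproduction, unrelated to LimeQO's internals: `expm1` overflows to `inf` for large inputs, and numpy reports it as a RuntimeWarning while still returning a well-defined floating-point result.

```python
import warnings
import numpy as np

# expm1(x) = exp(x) - 1 exceeds float64 range for large x; numpy emits
# "RuntimeWarning: overflow encountered in expm1" but returns inf, which
# downstream comparisons and plots handle correctly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    val = np.expm1(np.array([800.0]))[0]

print(np.isinf(val))                                                 # True
print(any(issubclass(w.category, RuntimeWarning) for w in caught))   # True
```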

❤️ If you find our data, code, or the paper useful, please cite our paper 📑:

@article{yi2025low,
  title={Low Rank Learning for Offline Query Optimization},
  author={Yi, Zixuan and Tian, Yao and Ives, Zachary G and Marcus, Ryan},
  journal={Proceedings of the ACM on Management of Data},
  volume={3},
  number={3},
  pages={1--26},
  year={2025},
  publisher={ACM New York, NY, USA}
}

About

[SIGMOD 25] LimeQO: Low Rank Learning for Offline Query Optimization
