OPERA is a learned execution-time predictor that uses optimizer cost signals and operator-level cost-formula features as priors; the method details are described in the paper.The release is organized as a runnable Python project: one main entry point, operator-cost feature generation utilities, and the support code needed to run OPERA on user-provided workloads.
run_opera.py: main command-line entry for loading data, training OPERA, evaluating on the selected test split, and writing metrics.requirements.txt: minimal Python dependency list.data/: place user-provided train/test trace files here. Each row should contain SQL text, a PostgreSQL JSON plan, observed runtime, and database configuration fields.data/schemainfo/: generated schema and configuration metadata cache for local workloads.cost_file/: PostgreSQL operator cost-formula hooks used to derive operator-level prior features.common/: shared parsing, plan traversal, catalog lookup, and runtime cost-model helpers.db_info/: schema metadata loading and optional live PostgreSQL connection helpers.tools/build_node_cost_cache.py: generates the operator-cost feature artifact from selected traces andcost_file/hooks.tools/validate_query_total_runtime_based_cost.py: support validator used by the cache builder.
This release intentionally excludes raw trace files, generated schema caches, operator-cost feature artifacts, plotting code, historical ablation scripts, generated run directories, and paper-table post-processing scripts.
The release was prepared with the following software environment:
Python: 3.11.15
PostgreSQL: 13.3
PostgreSQL 13.3 is recommended when collecting plans, rebuilding schema metadata, or preparing operator-cost features for a custom workload. The included cost hooks and schema-statistics helpers follow PostgreSQL 13 behavior.
Install the Python dependency set before training or rebuilding artifacts:
pip install -r requirements.txtrequirements.txt includes the packages needed for model training, operator-cost feature preparation, and PostgreSQL metadata access:
numpy==2.4.4
torch==2.11.0+cu128
psycopg2-binary==2.9.12
paramiko==4.0.0
pandas==3.0.2
Place your train/test traces under data/, then generate an operator-cost feature artifact for the same train/test workload selection. The builder reads the selected trace files, applies the PostgreSQL operator formula hooks in cost_file/, and writes a compressed NPZ file consumed by run_opera.py.
python tools/build_node_cost_cache.py \
--train-dbs imdb,tpch,tpcds \
--test-dbs imdb,tpch,tpcds \
--forceThe --jobs option is optional; it only controls parallel worker count during feature generation.
python run_opera.py \
--device cuda \
--train-dbs imdb,tpch,tpcds \
--test-dbs imdb,tpch,tpcds \
--outdir outputs/opera_retrainThe command trains on the selected train splits, evaluates on the selected test splits, and writes:
outputs/opera_retrain/metrics.json
The supported dataset names in the release entry point are imdb, tpch, and tpcds.
Schema and configuration metadata can be loaded from data/schemainfo/ when matching cache files are available.
If the schema cache for a workload is missing, db_info/database_info.py falls back to a live PostgreSQL connection. In that path, db_info/infos.py reads connection settings from environment variables:
OPERA_DB_HOST
OPERA_DB_USER
OPERA_DB_PASSWORD
OPERA_DB_NAME
OPERA_DB_PORT
OPERA_PG_COMMAND_CTRL
OPERA_SSH_USER
OPERA_SSH_PASSWORD
OPERA_SSH_PORT