This repo hosts a minimal end-to-end experiment for the BIRD Mini-Dev text-to-SQL benchmark using GPT-5-mini. It downloads a small slice of the dataset, evaluates a hand-written baseline prompt, then runs DSPy-based prompt optimizers to see whether they improve execution accuracy. Results are logged so that runs are easy to compare.
- Create a virtual environment and install dependencies:

      python -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt

- Populate .env with an API key that has access to gpt-5-mini:

      cp .env .env.local  # optional backup
      # edit .env and set OPENAI_API_KEY=...

The scripts load environment variables via python-dotenv.
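Because every run depends on OPENAI_API_KEY, a fail-fast check right after loading .env can save a wasted run. The sketch below is an illustration, not code from this repo: require_api_key and load_env_and_get_key are hypothetical helper names, and only load_dotenv comes from python-dotenv.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI key from `env`, or raise if it is missing or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; edit .env before running.")
    return key


def load_env_and_get_key() -> str:
    """Load .env via python-dotenv, then validate the key (hypothetical helper)."""
    # Imported lazily so the validation helper above works without the package.
    from dotenv import load_dotenv

    # By default load_dotenv does not override variables already set in the shell.
    load_dotenv()
    return require_api_key()
```

Calling load_env_and_get_key() once at startup would surface a missing key before any dataset download or API call begins.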
The main entry point is run_experiment.py. By default it:
- downloads/unpacks the BIRD Mini-Dev dataset (SQLite flavor) via the Hugging Face Hub,
- samples 5 examples for DSPy optimization and 15 examples for evaluation,
- runs a hand-written baseline prompt with GPT-5-mini,
- runs two DSPy teleprompters (BootstrapFewShot and MIPROv2 when available),
- evaluates execution accuracy by executing both gold and predicted SQL on the actual databases,
- records per-example predictions plus aggregate metrics.
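The execution-accuracy step in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the logic in run_experiment.py: the function names are hypothetical, results are compared as unordered sets of rows (one common convention; the script's actual rule may differ), and an in-memory table stands in for a real benchmark database.

```python
import sqlite3


def run_query(conn, sql):
    """Return all rows for `sql`, or None if SQLite rejects the statement."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, gold_sql, pred_sql):
    """True when both queries run and return the same set of rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, pred_sql)
    if gold is None or pred is None:
        return False
    # Assumption: row order and duplicate rows are ignored.
    return set(gold) == set(pred)


def execution_accuracy(conn, pairs):
    """Fraction of (gold, predicted) pairs whose results match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(execution_match(conn, g, p) for g, p in pairs) / len(pairs)


# Toy database standing in for one of the benchmark's SQLite files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 41), ("Cy", 25)]
)

pairs = [
    # Different SQL, same rows: counts as correct.
    ("SELECT name FROM singer WHERE age > 28",
     "SELECT name FROM singer WHERE age >= 30"),
    # Prediction references a table that does not exist: counts as wrong.
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(name) FROM singers"),
    # Same rows in a different order: counts as correct under set comparison.
    ("SELECT name FROM singer ORDER BY age", "SELECT name FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2f}")
```

Comparing result sets rather than SQL strings means a prediction is credited whenever it retrieves the right rows, even if it is written differently from the gold query.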
    python run_experiment.py --train-size 5 --eval-size 15

You can adjust sample sizes, the OpenAI model, temperature, and maximum tokens via CLI flags. The script prints a quick summary at the end and produces artifacts inside results/:
- scores.csv accumulates metrics for every run (baseline and each DSPy optimizer).
- baseline_<model>_<timestamp>_predictions.jsonl and dspy_<model>_<timestamp>_*_predictions.jsonl contain per-example outputs.
- summary_<timestamp>.json captures the aggregated scores for that experiment.
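To re-derive an aggregate score from one of the per-example JSONL files, something like the sketch below might work. The record schema is not documented here, so the boolean field name "correct" is purely an assumption, as are the helper name and the demo records; check a real predictions file and adjust the field before relying on it.

```python
import json
import tempfile
from pathlib import Path


def accuracy_from_jsonl(path, field="correct"):
    """Share of JSONL records whose `field` is truthy (field name is assumed)."""
    total = 0
    hits = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            hits += bool(record.get(field))
    return hits / total if total else 0.0


# Demo on made-up records; real files live under results/.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_predictions.jsonl"
    demo_records = [
        {"correct": True},
        {"correct": False},
        {"correct": True},
        {"correct": True},
    ]
    demo_file.write_text(
        "\n".join(json.dumps(r) for r in demo_records) + "\n", encoding="utf-8"
    )
    demo_accuracy = accuracy_from_jsonl(demo_file)

print(f"accuracy: {demo_accuracy:.2f}")
```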
Dataset files are cached under data/bird_mini_dev. If the download fails, remove that directory and run the script again once connectivity is restored.
- The experiment prioritizes a small slice of the dataset to stay within a low API budget; increase the evaluation size if you need more stable statistics.
- Execution accuracy is computed by running SQL on the shipped SQLite databases in read-only mode; errors are captured in the per-example JSONL files for inspection.
- DSPy support requires the dspy package; the script automatically skips optimizers that are unavailable in your install.
- If you already have the Mini-Dev package locally (e.g., minidev_*.zip), unpack it under data/ before running the script. The loader detects a data/minidev/MINIDEV/dev_databases/ tree and skips any network download.