
BIRD Mini-Dev Text-to-SQL MVP

This repo hosts a minimal end-to-end experiment for the BIRD Mini-Dev text-to-SQL benchmark using GPT-5-mini. It downloads a small slice of the dataset, establishes a baseline with a hand-written prompt, then runs DSPy-based prompt optimizers to see whether they improve execution accuracy. Results are logged so runs are easy to compare.

Setup

  1. Create a virtual environment and install dependencies:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  2. Populate .env with an API key that has access to gpt-5-mini:

    cp .env .env.local  # optional backup
    # edit .env and set OPENAI_API_KEY=...

    The scripts load environment variables via python-dotenv.
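To illustrate what python-dotenv does for the scripts, here is a stdlib-only stand-in that mimics the relevant behavior (parse `KEY=VALUE` lines from `.env` into the environment); the real scripts call python-dotenv's `load_dotenv`, and the function name here is illustrative:

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv: parse KEY=VALUE lines."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # like load_dotenv's default, don't clobber variables already set
            os.environ.setdefault(key.strip(), value.strip())
```

The point is simply that `OPENAI_API_KEY=...` placed in `.env` becomes visible as `os.environ["OPENAI_API_KEY"]` once the scripts start.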

Running the experiment

The main entry point is run_experiment.py. By default it:

  • downloads/unpacks the BIRD Mini-Dev dataset (SQLite flavor) via HuggingFace Hub,
  • samples 5 examples for DSPy optimization and 15 examples for evaluation,
  • runs a hand-written baseline prompt with GPT-5-mini,
  • runs two DSPy teleprompters (BootstrapFewShot and MIPROv2 when available),
  • evaluates execution accuracy by executing both gold and predicted SQL on the actual databases,
  • records per-example predictions plus aggregate metrics.

    python run_experiment.py --train-size 5 --eval-size 15
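Execution accuracy here means: both the gold SQL and the predicted SQL run against the example's SQLite database, and the prediction counts as correct when the two result sets match. A minimal sketch of that check, assuming an unordered row-set comparison (the function name and exact comparison rule are illustrative, not the repo's code):

```python
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Run both queries read-only and compare results as unordered row sets."""
    uri = f"file:{db_path}?mode=ro"  # read-only, as the Notes section describes
    with sqlite3.connect(uri, uri=True) as conn:
        try:
            gold_rows = set(conn.execute(gold_sql).fetchall())
            pred_rows = set(conn.execute(pred_sql).fetchall())
        except sqlite3.Error:
            return False  # execution errors count as a miss
    return gold_rows == pred_rows
```

Opening the database with `mode=ro` guarantees a generated query can never mutate the shipped databases, which is why reruns are safe.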

You can adjust sample sizes, the OpenAI model, temperature, and maximum tokens via CLI flags. The script prints a quick summary at the end and produces artifacts inside results/:

  • scores.csv accumulates metrics for every run (baseline and each DSPy optimizer).
  • baseline_<model>_<timestamp>_predictions.jsonl and dspy_<model>_<timestamp>_*_predictions.jsonl contain per-example outputs.
  • summary_<timestamp>.json captures the aggregated scores for that experiment.

Dataset files are cached under data/bird_mini_dev. If the download fails, remove that directory and run the script again once connectivity is restored.

Notes

  • The experiment prioritizes a small slice of the dataset to stay within a low API budget; increase the evaluation size if you need more stable statistics.
  • Execution accuracy is computed by running SQL on the shipped SQLite databases in read-only mode; errors are captured in the per-example JSONL files for inspection.
  • DSPy support requires the dspy package; the script automatically skips optimizers that are unavailable in your install.
  • If you already have the Mini-Dev package locally (e.g., minidev_*.zip), unpack it under data/ before running the script. The loader detects a data/minidev/MINIDEV/dev_databases/ tree and skips any network download.
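The local-detection behavior in the last note can be sketched as a simple directory check on the layout described above (the function name is an assumption for illustration):

```python
from pathlib import Path

def have_local_minidev(data_dir: str = "data") -> bool:
    """True when the unpacked Mini-Dev tree exists, so the download can be skipped."""
    return (Path(data_dir) / "minidev" / "MINIDEV" / "dev_databases").is_dir()
```

If this returns True, the loader uses the local databases; otherwise it falls back to the HuggingFace Hub download.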
