
BIRD Mini-Dev Text-to-SQL MVP

This repo hosts a minimal end-to-end experiment for the BIRD Mini-Dev text-to-SQL benchmark using GPT-5-mini. It downloads a small slice of the dataset, establishes a baseline with a hand-written prompt, then runs DSPy-based prompt optimizers to see whether they improve execution accuracy. Results are logged so runs are easy to compare.

Setup

  1. Create a virtual environment and install dependencies:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  2. Populate .env with an API key that has access to gpt-5-mini:

    cp .env .env.local  # optional backup
    # edit .env and set OPENAI_API_KEY=...

    The scripts load environment variables via python-dotenv.
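To illustrate what python-dotenv does for the scripts, here is a stdlib-only stand-in that mimics the relevant behavior (parse `KEY=VALUE` lines from `.env` into the environment); the real scripts call python-dotenv's `load_dotenv`, and the function name here is illustrative:

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv: parse KEY=VALUE lines."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # like load_dotenv's default, don't clobber variables already set
            os.environ.setdefault(key.strip(), value.strip())
```

The point is simply that `OPENAI_API_KEY=...` placed in `.env` becomes visible as `os.environ["OPENAI_API_KEY"]` once the scripts start.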

Running the experiment

The main entry point is run_experiment.py. By default it:

  • downloads/unpacks the BIRD Mini-Dev dataset (SQLite flavor) via HuggingFace Hub,
  • samples 5 examples for DSPy optimization and 15 examples for evaluation,
  • runs a hand-written baseline prompt with GPT-5-mini,
  • runs two DSPy teleprompters (BootstrapFewShot and MIPROv2 when available),
  • evaluates execution accuracy by executing both gold and predicted SQL on the actual databases,
  • records per-example predictions plus aggregate metrics.

    python run_experiment.py --train-size 5 --eval-size 15
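Execution accuracy here means: both the gold SQL and the predicted SQL run against the example's SQLite database, and the prediction counts as correct when the two result sets match. A minimal sketch of that check, assuming an unordered row-set comparison (the function name and exact comparison rule are illustrative, not the repo's code):

```python
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Run both queries read-only and compare results as unordered row sets."""
    uri = f"file:{db_path}?mode=ro"  # read-only, as the Notes section describes
    with sqlite3.connect(uri, uri=True) as conn:
        try:
            gold_rows = set(conn.execute(gold_sql).fetchall())
            pred_rows = set(conn.execute(pred_sql).fetchall())
        except sqlite3.Error:
            return False  # execution errors count as a miss
    return gold_rows == pred_rows
```

Opening the database with `mode=ro` guarantees a generated query can never mutate the shipped databases, which is why reruns are safe.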

You can adjust sample sizes, the OpenAI model, temperature, and maximum tokens via CLI flags. The script prints a quick summary at the end and produces artifacts inside results/:

  • scores.csv accumulates metrics for every run (baseline and each DSPy optimizer).
  • baseline_<model>_<timestamp>_predictions.jsonl and dspy_<model>_<timestamp>_*_predictions.jsonl contain per-example outputs.
  • summary_<timestamp>.json captures the aggregated scores for that experiment.

Dataset files are cached under data/bird_mini_dev. If the download fails, remove that directory and run the script again once connectivity is restored.

Notes

  • The experiment prioritizes a small slice of the dataset to stay within a low API budget; increase the evaluation size if you need more stable statistics.
  • Execution accuracy is computed by running SQL on the shipped SQLite databases in read-only mode; errors are captured in the per-example JSONL files for inspection.
  • DSPy support requires the dspy package; the script automatically skips optimizers that are unavailable in your install.
  • If you already have the Mini-Dev package locally (e.g., minidev_*.zip), unpack it under data/ before running the script. The loader detects a data/minidev/MINIDEV/dev_databases/ tree and skips any network download.
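The local-detection behavior in the last note can be sketched as a simple directory check on the layout described above (the function name is an assumption for illustration):

```python
from pathlib import Path

def have_local_minidev(data_dir: str = "data") -> bool:
    """True when the unpacked Mini-Dev tree exists, so the download can be skipped."""
    return (Path(data_dir) / "minidev" / "MINIDEV" / "dev_databases").is_dir()
```

If this returns True, the loader uses the local databases; otherwise it falls back to the HuggingFace Hub download.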
