@vetertann
Linked Issue

Closes #207

Description

This PR imports the Python-based generation benchmark suite into the repository.

The files are placed in a dedicated benchmarks/generation directory to keep the Python environment isolated from the main TS project. The benchmark runs the token efficiency and accuracy tests documented in the generation/readme.md update.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Changes Made

  • Created benchmarks/generation/ directory.
  • Added Python source code (src/), Pydantic models, and prompt templates.
  • Added Gold Standard datasets (data/*.gold.json and *.toon).
  • Added run results as *.csv files.
  • Added requirements.txt and a dedicated README.md with a benchmark description and setup instructions for the Python environment.
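
For reviewers who want a feel for how the gold datasets could be used, here is a minimal sketch of an exact-match accuracy check against the data/*.gold.json files. The function names `load_gold` and `exact_match` are hypothetical and are not taken from the code in src/; the suite's actual scoring logic may differ.

```python
import json
from pathlib import Path


def load_gold(data_dir):
    """Map each *.gold.json file name in data_dir to its parsed content.

    Hypothetical helper: the directory layout is assumed from the PR
    description (data/*.gold.json), not from the benchmark source.
    """
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(data_dir).glob("*.gold.json"))
    }


def exact_match(candidate_text, gold):
    """Return True if candidate_text parses as JSON equal to the gold value.

    Output that is not valid JSON counts as a miss instead of raising.
    """
    try:
        return json.loads(candidate_text) == gold
    except json.JSONDecodeError:
        return False
```

Comparing parsed values, not raw strings, means key order and whitespace in the model output do not affect the result.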

SPEC Compliance

  • This PR implements/fixes spec compliance
  • Spec section(s) affected: N/A (Tooling only)
  • Spec version: N/A

Testing

  • All existing tests pass
  • Added new tests for changes
  • Tests cover edge cases and spec compliance

Pre-submission Checklist

  • My code follows the project's coding standards
  • I have run code formatting/linting tools
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation if needed
  • I have reviewed the TOON specification for relevant sections

Breaking Changes

  • No breaking changes
  • Breaking changes (describe migration path below)

Additional Context

This benchmark suite is self-contained and does not affect the core TypeScript build process. It requires an API key (e.g., Nebius, OpenAI) to run.
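
As an illustration of the API key requirement, a minimal sketch of reading the key from the environment might look like the following. The variable names `NEBIUS_API_KEY` and `OPENAI_API_KEY` and the helper `find_api_key` are assumptions made for this example; the benchmark's README.md is the authority on the names actually used.

```python
import os

# Assumed variable names, for illustration only; check the benchmark
# README.md for the names the suite actually reads.
CANDIDATE_KEY_VARS = ("NEBIUS_API_KEY", "OPENAI_API_KEY")


def find_api_key(env=None):
    """Return (variable_name, key) for the first non-empty candidate, else None."""
    env = os.environ if env is None else env
    for name in CANDIDATE_KEY_VARS:
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None


if __name__ == "__main__":
    found = find_api_key()
    if found is None:
        print("No API key set; export one of:", ", ".join(CANDIDATE_KEY_VARS))
    else:
        print("Using key from", found[0])
```

Keeping the key in an environment variable avoids committing credentials alongside the benchmark data.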

Development

Successfully merging this pull request may close these issues.

TOON benchmark for generation tasks