
Conversation

@vetertann commented Dec 5, 2025

Linked Issue

Closes #207

Description

This PR adds a Generation Benchmarks section to the documentation. It details the performance of TOON compared with JSON and JSON Structured Output (JSO) across 21 different LLMs, focusing on token efficiency, accuracy, and repair capabilities.
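
To make the token-efficiency comparison concrete, here is an illustrative example (not taken from the benchmark data) of the same records in JSON and in TOON's tabular array form; the field names and values are made up for illustration.

```json
{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}
```

```
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

Because TOON declares the field names once per array instead of repeating keys for every object, uniform tabular data typically tokenizes to fewer tokens, which is what the token-budget columns in the new tables measure.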

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Changes Made

  • Added a `## 2. Generation benchmarks` section to docs/guide/benchmarks.md.
  • Documented the benchmark methodology.
  • Added performance tables comparing 1-shot accuracy, final accuracy (after repair loops), and token budgets across 21 models (see the repair-loop sketch after this list).
  • Included qualitative analysis of accuracy, repair-loop behavior, and token-efficiency scaling.
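
As a rough illustration of the repair-loop idea (not the actual harness code), a generate-validate-retry loop might look like the following Python sketch; `call_model`, `parse_output`, and `MAX_REPAIRS` are hypothetical names introduced here.

```python
# Hypothetical sketch of a generate -> validate -> repair loop.
# "1-shot accuracy" counts only the first attempt; "final accuracy"
# counts a task as solved if any attempt within the repair budget
# parses and matches the expected answer.

MAX_REPAIRS = 2  # assumed budget, not necessarily the benchmark's setting

def run_task(prompt: str, expected: dict, call_model, parse_output) -> dict:
    attempt_prompt = prompt
    for attempt in range(1 + MAX_REPAIRS):
        raw = call_model(attempt_prompt)
        try:
            result = parse_output(raw)  # e.g. TOON or JSON parsing/validation
        except ValueError as err:
            # Feed the parse error back so the model can repair its output.
            attempt_prompt = (
                f"{prompt}\n\nYour previous output failed to parse:\n{err}\n"
                "Please return a corrected version."
            )
            continue
        correct = result == expected
        return {
            "attempts": attempt + 1,
            "one_shot": attempt == 0 and correct,
            "final": correct,
        }
    return {"attempts": 1 + MAX_REPAIRS, "one_shot": False, "final": False}
```

In this sketch repairs are only triggered by parse failures; a parsed-but-wrong answer is simply scored as incorrect.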

SPEC Compliance

  • This PR implements/fixes spec compliance
  • Spec section(s) affected: N/A (Documentation only)
  • Spec version: N/A

Testing

  • All existing tests pass
  • Added new tests for changes
  • Tests cover edge cases and spec compliance

Pre-submission Checklist

  • My code follows the project's coding standards
  • I have run code formatting/linting tools (Markdown linting)
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation if needed
  • I have reviewed the TOON specification for relevant sections

Breaking Changes

  • No breaking changes
  • Breaking changes (describe migration path below)

Additional Context

Benchmarks were run via the Nebius API.
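
For context, Nebius AI Studio exposes an OpenAI-compatible chat completions endpoint, so a single benchmark call can be made with the standard `openai` client. The environment variable names and the model id below are placeholders to verify against the Nebius documentation.

```python
import os

from openai import OpenAI

# Base URL and model id are placeholders; check the Nebius AI Studio docs
# for the exact endpoint and available model identifiers.
client = OpenAI(
    base_url=os.environ["NEBIUS_API_BASE"],  # Nebius OpenAI-compatible endpoint
    api_key=os.environ["NEBIUS_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "Return the answer as TOON."}],
    temperature=0,
)
print(response.choices[0].message.content)
```

The same request shape can be reused across all 21 models by swapping the model id.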

@johannschopplich (Collaborator) commented Dec 5, 2025

Hi there!
Thanks for the benchmark results. However, the benchmark docs are auto-generated from the internal benchmarks, and I don't intend to add benchmarks without the code to reproduce them.

Can you please enhance my benchmarks package with your code and share the tool results? As a hint, the final generation result that gets embedded in benchmarks.md is generated by scripts/accuracy-benchmark.ts.

@vetertann (Author) commented

Oh, ok... I did this PR just because in your comment on issue #207 you wrote:
"Once you're happy with the setup and have stable results, I'd definitely be interested in:

A write‑up or summary table we can link to, and
If you're up for it, a PR that adds a short "generation benchmarks" section under docs/guide/benchmarks (even if the harness itself stays in your Python repo and we just describe the methodology and link out)."

@johannschopplich (Collaborator) commented
I see, sorry, missed that. Could you add the generation benchmarks (tho in Python, no problem) to this repo as well? For the sake of reproducibility? Thanks.

@vetertann (Author) commented
Sure, I’ll open a PR adding it under benchmarks/generation.
