
Conversation

@vetertann commented Dec 5, 2025

Linked Issue

Closes #207

Description

This PR adds a Generation Benchmarks section to the documentation. It details the performance of TOON compared with JSON and JSON Structured Output (JSO) across 21 different LLMs, focusing on token efficiency, accuracy, and repair capabilities.
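
To make the token-efficiency comparison concrete, here is an illustrative example (not taken from the benchmark data) of the same records in JSON and in TOON's tabular array form; the field names and values are made up for illustration.

```json
{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}
```

```
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

Because TOON declares the field names once per array instead of repeating keys for every object, uniform tabular data typically tokenizes to fewer tokens, which is what the token-budget columns in the new tables measure.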

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Changes Made

  • Added a `## 2. Generation benchmarks` section to docs/guide/benchmarks.md.
  • Documented the benchmark methodology.
  • Added performance tables comparing 1-shot accuracy, final accuracy (after repair loops), and token budgets across 21 models (see the repair-loop sketch after this list).
  • Included qualitative analysis of accuracy, repair-loop behavior, and token-efficiency scaling.
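
As a rough illustration of the repair-loop idea (not the actual harness code), a generate-validate-retry loop might look like the following Python sketch; `call_model`, `parse_output`, and `MAX_REPAIRS` are hypothetical names introduced here.

```python
# Hypothetical sketch of a generate -> validate -> repair loop.
# "1-shot accuracy" counts only the first attempt; "final accuracy"
# counts a task as solved if any attempt within the repair budget
# parses and matches the expected answer.

MAX_REPAIRS = 2  # assumed budget, not necessarily the benchmark's setting

def run_task(prompt: str, expected: dict, call_model, parse_output) -> dict:
    attempt_prompt = prompt
    for attempt in range(1 + MAX_REPAIRS):
        raw = call_model(attempt_prompt)
        try:
            result = parse_output(raw)  # e.g. TOON or JSON parsing/validation
        except ValueError as err:
            # Feed the parse error back so the model can repair its output.
            attempt_prompt = (
                f"{prompt}\n\nYour previous output failed to parse:\n{err}\n"
                "Please return a corrected version."
            )
            continue
        correct = result == expected
        return {
            "attempts": attempt + 1,
            "one_shot": attempt == 0 and correct,
            "final": correct,
        }
    return {"attempts": 1 + MAX_REPAIRS, "one_shot": False, "final": False}
```

In this sketch repairs are only triggered by parse failures; a parsed-but-wrong answer is simply scored as incorrect.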

SPEC Compliance

  • This PR implements/fixes spec compliance
  • Spec section(s) affected: N/A (Documentation only)
  • Spec version: N/A

Testing

  • All existing tests pass
  • Added new tests for changes
  • Tests cover edge cases and spec compliance

Pre-submission Checklist

  • My code follows the project's coding standards
  • I have run code formatting/linting tools (Markdown linting)
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation if needed
  • I have reviewed the TOON specification for relevant sections

Breaking Changes

  • No breaking changes
  • Breaking changes (describe migration path below)

Additional Context

Benchmarks were run via the Nebius API.
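
For context, Nebius AI Studio exposes an OpenAI-compatible chat completions endpoint, so a single benchmark call can be made with the standard `openai` client. The environment variable names and the model id below are placeholders to verify against the Nebius documentation.

```python
import os

from openai import OpenAI

# Base URL and model id are placeholders; check the Nebius AI Studio docs
# for the exact endpoint and available model identifiers.
client = OpenAI(
    base_url=os.environ["NEBIUS_API_BASE"],  # Nebius OpenAI-compatible endpoint
    api_key=os.environ["NEBIUS_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "Return the answer as TOON."}],
    temperature=0,
)
print(response.choices[0].message.content)
```

The same request shape can be reused across all 21 models by swapping the model id.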

@johannschopplich (Collaborator) commented Dec 5, 2025

Hi there!
Thanks for the benchmark results. However, the benchmark docs are auto-generated from the internal benchmarks, and I don't intend to add benchmarks without the code to reproduce them.

Can you please enhance my benchmarks package with your code and share the tool results? As a hint, the final generation result that gets embedded in benchmarks.md is generated by scripts/accuracy-benchmark.ts.

@vetertann (Author) commented

Oh, ok... I did this PR just because in your comment on issue #207 you wrote:
"Once you're happy with the setup and have stable results, I'd definitely be interested in:

A write‑up or summary table we can link to, and
If you're up for it, a PR that adds a short "generation benchmarks" section under docs/guide/benchmarks (even if the harness itself stays in your Python repo and we just describe the methodology and link out)."

@johannschopplich (Collaborator) commented
I see, sorry, missed that. Could you add the generation benchmarks (tho in Python, no problem) to this repo as well? For the sake of reproducibility? Thanks.

@vetertann (Author) commented
Sure, I’ll open a PR adding it under benchmarks/generation.
