ToolCall-15

ToolCall-15 is an official BenchLocal Bench Pack for evaluating deterministic tool use, tool selection, parameter precision, multi-step tool chains, restraint, and recovery from tool errors.

A Bench Pack is an installable benchmark package that runs inside the BenchLocal desktop app. BenchLocal provides the shared app experience for provider setup, model selection, sampling controls, run histories, and side-by-side comparison across benchmark packs.

This repository contains the benchmark source: scenario definitions, scoring logic, methodology, a BenchLocal adapter, and a CLI runner for local development. The main branch tracks the maintained Bench Pack version for BenchLocal.

If you need the older standalone web app implementation, use the legacy/web-app branch. That branch is preserved for compatibility, but BenchLocal is now the recommended way to run ToolCall-15.

Run With BenchLocal

Download BenchLocal from the latest BenchLocal release.
Open BenchLocal and install ToolCall-15 from the official Bench Pack registry.
Add one or more models, select ToolCall-15, and start a run.

What It Measures

ToolCall-15 is organized into 5 categories, with 3 scenarios per category:

Tool Selection
Parameter Precision
Multi-Step Chains
Restraint and Refusal
Error Recovery

Each scenario is scored as:

2 points for a pass
1 point for a partial pass
0 points for a fail

Each category is worth 6 points. The final score is the average of the 5 category percentages, rounded to a whole number.
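
The scoring rule above can be sketched as follows. This is an illustrative sketch, not the code in lib/benchmark.ts: the names ScenarioResult, categoryPercent, and finalScore are hypothetical, the sample results are made up, and Math.round is an assumption about how "rounded to a whole number" is applied.

```typescript
// Hypothetical types and names for illustration; not taken from lib/benchmark.ts.
type ScenarioResult = "pass" | "partial" | "fail";

// Points per scenario outcome, as stated in the README.
const POINTS: Record<ScenarioResult, number> = { pass: 2, partial: 1, fail: 0 };

// Percentage of available points earned in one category (3 scenarios, 6 points max).
function categoryPercent(results: ScenarioResult[]): number {
  const earned = results.reduce((sum, r) => sum + POINTS[r], 0);
  const max = results.length * POINTS.pass;
  return (earned / max) * 100;
}

// Final score: mean of the category percentages, rounded to a whole number.
// Math.round is assumed here; the README does not name the rounding mode.
function finalScore(categories: Record<string, ScenarioResult[]>): number {
  const percents = Object.values(categories).map(categoryPercent);
  const mean = percents.reduce((a, b) => a + b, 0) / percents.length;
  return Math.round(mean);
}

// Made-up results for one model across the 5 categories.
const example: Record<string, ScenarioResult[]> = {
  "Tool Selection": ["pass", "pass", "pass"],          // 6/6
  "Parameter Precision": ["pass", "partial", "fail"],  // 3/6
  "Multi-Step Chains": ["pass", "pass", "partial"],    // 5/6
  "Restraint and Refusal": ["partial", "partial", "partial"], // 3/6
  "Error Recovery": ["fail", "pass", "pass"],          // 4/6
};

console.log(finalScore(example)); // 70
```

Because every category has the same maximum, averaging category percentages weights each category equally regardless of how points are distributed inside it.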

Bench Pack Structure

lib/                    Benchmark core, scoring, tool loop, and transport
benchlocal/             Thin BenchLocal SDK adapter
cli/                    Non-UI runner
benchlocal.pack.json    Canonical Bench Pack manifest
METHODOLOGY.md          Published benchmark methodology

BenchLocal Adapter

benchlocal/index.ts is the only place that imports @benchlocal/sdk.
lib/ stays framework-agnostic and is shared by the CLI and BenchLocal.
benchlocal.pack.json is the canonical Bench Pack metadata manifest used for install, inspection, and runtime metadata.
Per-pack default sampling settings belong in the manifest. ToolCall-15 defaults to temperature: 0.

Methodology

The benchmark spec is documented in METHODOLOGY.md and implemented in lib/benchmark.ts.

Design Goals

Reproducible: the system prompt, tool schema, mocked tool outputs, and scoring logic are all versioned in the repo.
Balanced: the suite spreads scenarios across distinct tool-use failure modes instead of over-indexing on one skill.
Deterministic: tool results are mocked and the benchmark uses temperature: 0.
Inspectable: every scenario stores a raw trace so failures can be audited.
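
As a rough illustration of the Deterministic and Inspectable goals, the sketch below shows one way mocked tool results and a raw trace could fit together. It is a sketch under assumptions, not the repository's tool loop: ToolCall, TraceEntry, MOCK_TOOLS, runMockedTool, and the get_weather tool are all hypothetical names invented for this example.

```typescript
// Hypothetical shapes for illustration; the real definitions live in lib/.
type ToolCall = { name: string; args: Record<string, unknown> };
type TraceEntry = { call: ToolCall; result: string };

// Mocked tools return fixed outputs derived only from their arguments,
// so the same call always yields the same result.
const MOCK_TOOLS: Record<string, (args: Record<string, unknown>) => string> = {
  get_weather: (args) => JSON.stringify({ city: args.city, tempC: 18 }),
};

// Resolve a tool call against the mocks and record it in the trace
// so a failed scenario can be audited afterwards.
function runMockedTool(call: ToolCall, trace: TraceEntry[]): string {
  const handler = MOCK_TOOLS[call.name];
  const result = handler
    ? handler(call.args)
    : JSON.stringify({ error: `unknown tool: ${call.name}` });
  trace.push({ call, result });
  return result;
}

const trace: TraceEntry[] = [];
const first = runMockedTool({ name: "get_weather", args: { city: "Berlin" } }, trace);
const second = runMockedTool({ name: "get_weather", args: { city: "Berlin" } }, trace);
console.log(first === second, trace.length);
```

Keeping the mocks as pure functions of their arguments means that repeated runs can differ only in the model's own output, which is what the benchmark is meant to measure.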

BenchLocal and CLI

Install: npm install
BenchLocal build: npm run build:benchlocal
CLI runner: npm run cli
Methodology: METHODOLOGY.md

Validation

npm run typecheck
npm run build:benchlocal
