LongContextBench is a long-context benchmark CLI for testing how well LLMs recall facts, compose multiple facts, and resolve conflicting information at large context sizes.
The longctx binary can generate synthetic benchmark suites, run them against OpenAI-compatible chat completion APIs, and turn JSONL results into a local HTML report.
Design and configuration notes live in docs/configuration.md, docs/productionization.md, and docs/schema-migration.md.
Minimum supported Rust version: 1.86.
cargo install --git https://github.com/LibSkills/longctx.git
Tagged releases publish platform archives for Linux, macOS, and Windows with SHA256 checksum files. After a crates.io release exists, install with:
cargo install longctx
Generate synthetic benchmark cases and context files.
Supported suites:
- needle: hides one target fact in a long filler context.
- multi-needle: hides several facts that must be returned together.
- conflict: includes older or adjacent facts and asks for the scoped current answer.
- multi-hop: chains two related facts (e.g. project lead and badge code) and asks for a transitive answer.
- order-dependent: generates sequenced facts with timestamps and asks for a value at a specific step.
- position-sweep: places a needle at 5 positions (start/early/middle/late/end) to measure the "lost in the middle" effect.
- hallucination: pure filler context with questions about non-existent facts; tests whether the model fabricates answers.
longctx generate needle --tokens 100000 --out ./bench
longctx generate multi-needle --tokens 100000 --out ./bench
longctx generate conflict --tokens 100000 --out ./bench
longctx generate multi-hop --tokens 100000 --out ./bench
longctx generate order-dependent --tokens 100000 --out ./bench
longctx generate position-sweep --tokens 100000 --out ./bench
longctx generate hallucination --tokens 100000 --out ./bench
Add --seed <value> to make generation reproducible.
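To illustrate what a fixed seed buys you, here is a minimal Python sketch of hiding one needle in filler text. It is not longctx's generator: the function name `build_needle_context`, the filler sentence, and the use of word counts as a stand-in for tokens are all assumptions made for illustration.

```python
import random

# Hypothetical filler sentence; a real generator would use varied text.
FILLER = "The archive room was quiet and the shelves were dusty."


def build_needle_context(needle: str, approx_words: int, seed: int) -> tuple[str, int]:
    """Return (context, sentence index where the needle was inserted).

    Words approximate tokens here; a real tool would count tokenizer tokens.
    """
    rng = random.Random(seed)  # private RNG, independent of global random state
    words_per_sentence = len(FILLER.split())
    n_sentences = max(1, approx_words // words_per_sentence)
    sentences = [FILLER] * n_sentences
    # Any gap between sentences is a valid slot, including the start and the end.
    position = rng.randrange(n_sentences + 1)
    sentences.insert(position, needle)
    return " ".join(sentences), position


ctx_a, pos_a = build_needle_context("The vault code is 4417.", 1000, seed=7)
ctx_b, pos_b = build_needle_context("The vault code is 4417.", 1000, seed=7)
```

With the same seed, both calls produce identical text and the same needle position, which is the property that makes a generated suite comparable across runs.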
Run generated benchmark cases against an OpenAI-compatible provider. The benchmark directory must contain a config.toml file.
longctx run ./bench
longctx run ./bench --force
longctx run ./bench --dry-run
longctx run ./bench --filter needle --limit 2
Results are written to ./bench/results.jsonl.
By default, run refuses to overwrite an existing results.jsonl, run.json, or request log. Use --force when intentionally replacing prior output.
Use --dry-run to validate config, select tests, and resolve context = "auto" without reading the API key, sending provider requests, or writing results. Use --filter <text> to select tests whose ID or suite contains the text, and --limit <n> to cap the selected set.
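The selection rule for --filter and --limit can be sketched in a few lines of Python. This is an illustration only: the `Case` type, its field names, and the case-sensitive comparison are assumptions, not the tool's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    # Hypothetical field names for a benchmark test case.
    id: str
    suite: str


def select_tests(cases: list[Case], filter_text: Optional[str] = None,
                 limit: Optional[int] = None) -> list[Case]:
    """Keep cases whose ID or suite contains filter_text, then cap the result."""
    selected = [
        c for c in cases
        if filter_text is None or filter_text in c.id or filter_text in c.suite
    ]
    if limit is not None:
        selected = selected[:limit]  # the cap applies to the already-filtered set
    return selected


cases = [
    Case("needle-100000-001", "needle"),
    Case("conflict-100000-001", "conflict"),
    Case("multi-needle-100000-001", "multi-needle"),
    Case("needle-100000-002", "needle"),
]
picked = select_tests(cases, filter_text="needle", limit=2)
```

Under this reading, substring matching means that a filter of needle also selects multi-needle cases, so a narrower filter text is needed to isolate a single suite.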
Run a complete scoring profile and write a terminal summary plus score.json and score.html.
longctx score ./score --config ./bench/config.toml
longctx score ./score --config ./bench/config.toml --profile standard
longctx score ./score --config ./bench/config.toml --profile max-context
longctx score ./score --config ./bench/config.toml --profile quick --json
Profiles:
- quick: low-cost smoke score, probing 8K through 64K and running 8K capability suites.
- standard: recommended comparable score, probing up to 1M with 8K resolution and running 32K capability suites.
- deep: higher-cost score, probing up to 1M with 4K resolution and running 128K capability suites.
- max-context: only find the usable context boundary; skip capability suites.
Score runs are written under score/probe-runs/<id>/ with score.json, score.html, probe-summary.json, and standard per-attempt benchmark artifacts.
Automatically probe a provider's usable context boundary with generated needle tests. The command starts at --min-tokens, doubles until --max-tokens or the first failure, then binary-searches the success/failure boundary until it reaches --resolution-tokens.
longctx probe-context ./probe --config ./bench/config.toml
longctx probe-context ./probe --config ./bench/config.toml --max-tokens 1000000 --resolution-tokens 8000
longctx probe-context ./probe --config ./bench/config.toml --skip-capabilities
longctx probe-context ./probe --config ./bench/config.toml --json
By default, the command also runs a 32K-token capability pass for multi-needle, conflict, multi-hop, order-dependent, position-sweep, and hallucination. Each token attempt is written as an ordinary benchmark directory under probe-runs/, with standard contexts/, manifests/, results.jsonl, run.json, and reports/ outputs.
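The doubling-then-bisecting search that probe-context performs can be sketched as follows. This is a minimal Python illustration under stated assumptions: `passes` is a hypothetical callback standing in for one generated needle attempt, and the exact midpoint and stopping rules in longctx may differ.

```python
from typing import Callable, Optional


def probe_boundary(passes: Callable[[int], bool], min_tokens: int,
                   max_tokens: int, resolution: int) -> Optional[int]:
    """Return the largest token count observed to pass, or None if min_tokens fails."""
    if min_tokens <= 0 or resolution <= 0:
        raise ValueError("min_tokens and resolution must be positive")
    if not passes(min_tokens):
        return None
    good, bad = min_tokens, None
    # Phase 1: double until the cap is reached or an attempt fails.
    while good < max_tokens:
        candidate = min(good * 2, max_tokens)
        if passes(candidate):
            good = candidate
        else:
            bad = candidate
            break
    if bad is None:
        return good  # no failure seen up to max_tokens
    # Phase 2: bisect between the last success and the first failure.
    while bad - good > resolution:
        mid = (good + bad) // 2
        if passes(mid):
            good = mid
        else:
            bad = mid
    return good


# Simulated provider whose usable context ends at 200,000 tokens.
boundary = probe_boundary(lambda n: n <= 200_000, 8_000, 1_000_000, 8_000)
```

In the simulated run, doubling succeeds through 128,000 and fails at 256,000, and bisection then narrows the gap until the success and failure points are within the resolution of each other.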
Validate a benchmark directory before running it.
longctx validate ./bench
Use --skip-api-key-check when you only want to check files and schema.
Build or refresh the context index used by automatic context routing.
longctx index ./bench
Generated suites write this index automatically with deterministic metadata for reproducible benchmark data. Run index after hand-editing manifests or context files.
Generate a standalone HTML report from a results JSONL file.
longctx report ./bench/results.jsonl
longctx report ./bench/results.jsonl --out report.html
longctx report ./bench/results.jsonl --json
longctx report ./bench/results.jsonl --json --out report.json
The report includes per-suite summaries, per-token-count summaries, trend charts, and failure groups.
Without --out, HTML reports are written next to the results file under reports/report.html.
When result rows include automatic routing decisions, the report shows the selected context, routing method, status, and confidence.
--json emits a machine-readable summary.
Compare two result files.
longctx compare baseline.jsonl candidate.jsonl
longctx compare baseline.jsonl candidate.jsonl --json
longctx compare baseline.jsonl candidate.jsonl --html-out compare.html
longctx compare baseline.jsonl candidate.jsonl --fail-on-regression
compare --json emits machine-readable deltas, including average latency and token changes.
compare --html-out writes a standalone comparison report.
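As a rough picture of what an average-latency or token delta means, the Python sketch below averages a numeric field over each result set and subtracts baseline from candidate. The field names `latency_ms` and `total_tokens` are hypothetical placeholders, not the documented result schema, and the real compare output may be structured differently.

```python
from statistics import mean
from typing import Optional


def average(rows: list[dict], field: str) -> Optional[float]:
    """Mean of a numeric field, ignoring rows where it is missing or null."""
    values = [row[field] for row in rows if row.get(field) is not None]
    return mean(values) if values else None


def deltas(baseline: list[dict], candidate: list[dict],
           fields=("latency_ms", "total_tokens")) -> dict:
    """Candidate average minus baseline average, per field (None if unavailable)."""
    out = {}
    for field in fields:
        b, c = average(baseline, field), average(candidate, field)
        out[field] = None if b is None or c is None else c - b
    return out


baseline_rows = [
    {"latency_ms": 100, "total_tokens": 10},
    {"latency_ms": 300, "total_tokens": 30},
]
candidate_rows = [{"latency_ms": 150, "total_tokens": 20}]
result = deltas(baseline_rows, candidate_rows)
```

A negative latency delta here means the candidate was faster on average than the baseline.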
Generate a 100K-token needle retrieval test:
longctx generate needle --tokens 100000 --out ./bench
Create ./bench/config.toml:
[provider]
base_url = "https://api.openai.com/v1"
api_key_env = "OPENAI_API_KEY"
model = "gpt-4.1"
[run]
request_timeout_secs = 120
max_retries = 2
retry_backoff_ms = 500
concurrency = 4
Run the benchmark:
export OPENAI_API_KEY="sk-..."
longctx run ./bench
Generate the report:
longctx report ./bench/results.jsonl
Or run a single scoring command:
longctx score ./score --config ./bench/config.toml --profile quick
longctx run reads config.toml from the benchmark directory.
[provider]
base_url = "https://api.openai.com/v1"
api_key_env = "OPENAI_API_KEY"
model = "gpt-4.1"
Fields:
- base_url: Provider base URL. It must be an absolute http or https URL with no query string or fragment. The runner posts to {base_url}/chat/completions or {base_url}/responses.
- api_key_env: Name of the environment variable containing the API key.
- model: Model name sent in the chat completion request.
- request_timeout_secs: Per-request timeout in seconds.
- max_retries: Retry budget for transient failures.
- retry_backoff_ms: Base retry delay in milliseconds.
- concurrency: Number of benchmark requests to run in parallel.
- request_style: Provider request style, either chat-completions or responses.
- log_requests: Write opt-in redacted HTTP exchange logs under reports/.
- request_log_path: Relative path under reports/ for the request log file, defaulting to reports/http-log.jsonl.
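The base_url rules can be checked with the Python standard library, as in the sketch below. It is an illustration of the stated constraints, not the runner's own validation code; the function names are hypothetical, and stripping a trailing slash before joining is an assumption.

```python
from urllib.parse import urlsplit

# Endpoint suffix per request_style value, as listed in the field descriptions.
SUFFIXES = {"chat-completions": "/chat/completions", "responses": "/responses"}


def validate_base_url(base_url: str) -> str:
    """Reject URLs that are not absolute http(s) or that carry a query or fragment."""
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("base_url must be an absolute http or https URL")
    if parts.query or parts.fragment:
        raise ValueError("base_url must not contain a query string or fragment")
    return base_url.rstrip("/")  # assumption: avoid a doubled slash when joining


def endpoint_url(base_url: str, request_style: str) -> str:
    """Build the URL the runner would post to for the given request style."""
    if request_style not in SUFFIXES:
        raise ValueError(f"unknown request_style: {request_style}")
    return validate_base_url(base_url) + SUFFIXES[request_style]


chat_url = endpoint_url("https://api.openai.com/v1", "chat-completions")
responses_url = endpoint_url("https://api.openai.com/v1/", "responses")
```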
The config also supports an optional [grader] section for LLM-as-judge grading:
[grader]
judge_model = "gpt-4.1-mini" # optional, defaults to the same provider model
When a test case uses Grader::LlmJudge, the runner sends the answer to the judge model for evaluation instead of using exact string matching. Judge requests use the configured provider request style plus the same retry, backoff, and timeout controls as benchmark provider requests. Judge failures are reported separately with error_kind = "Judge" and include judge HTTP status, attempts, latency, token counts, and error text in the result row.
Any provider that exposes an OpenAI-compatible /chat/completions endpoint can be used by changing base_url, api_key_env, and model.
Each generated suite writes:
- Context text files under contexts/, with token-suffixed names so token sweeps do not overwrite earlier contexts.
- Suite manifest JSON files under manifests/, with token-suffixed names so token sweeps can share one benchmark directory.
- A context routing index at context.index.json.
The runner accepts generated suite manifests and writes newline-delimited JSON results to results.jsonl.
Result rows include the suite name, token count, provider model, HTTP status, provider request ID, rate-limit headers, attempt count, structured error kind when a run fails, optional judge audit details, and optional routing audit details. Readers reject result rows with a newer unsupported schema_version, and compare rejects duplicate result IDs.
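A reader that enforces both checks might look like the Python sketch below. It combines the two rules in one function for brevity, whereas the text above attributes the duplicate check to compare. The key name `id` and the supported version number are assumptions; only `schema_version` is named in this document.

```python
import json
from typing import Iterable

SUPPORTED_SCHEMA_VERSION = 1  # hypothetical value for illustration


def load_results(lines: Iterable[str]) -> dict[str, dict]:
    """Parse JSONL rows, rejecting newer schema versions and duplicate IDs."""
    rows: dict[str, dict] = {}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, such as a trailing newline
        row = json.loads(line)
        # Assumption: rows without a version are treated as the supported one.
        version = row.get("schema_version", SUPPORTED_SCHEMA_VERSION)
        if version > SUPPORTED_SCHEMA_VERSION:
            raise ValueError(f"line {line_no}: unsupported schema_version {version}")
        row_id = row["id"]  # hypothetical key name for the result ID
        if row_id in rows:
            raise ValueError(f"line {line_no}: duplicate result id {row_id!r}")
        rows[row_id] = row
    return rows


sample = [
    '{"schema_version": 1, "id": "needle-1", "suite": "needle"}',
    "",
    '{"schema_version": 1, "id": "needle-2", "suite": "needle"}',
]
loaded = load_results(sample)
```

Failing on the first bad row, rather than skipping it, keeps a comparison from silently running on partial data.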
Each run also writes a run.json snapshot with config, timing metadata, and SHA256 fingerprints for benchmark manifests, contexts, and context.index.json.
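The fingerprints make it possible to check later that a results file was produced from the same inputs. As a sketch of the idea, the Python below hashes the files in a benchmark directory with SHA256; the function names and the shape of the returned mapping are assumptions and do not reflect the actual run.json layout.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large context files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_dir(bench_dir: Path) -> dict[str, str]:
    """Map each manifest, context, and index file to its SHA256 hex digest."""
    files = []
    for sub in ("manifests", "contexts"):
        folder = bench_dir / sub
        if folder.is_dir():
            files.extend(sorted(p for p in folder.iterdir() if p.is_file()))
    index = bench_dir / "context.index.json"
    if index.is_file():
        files.append(index)
    # Relative POSIX-style keys keep the mapping stable across machines.
    return {p.relative_to(bench_dir).as_posix(): sha256_file(p) for p in files}
```

Calling `fingerprint_dir(Path("./bench"))` before and after an edit shows which inputs changed, since any byte-level difference produces a different digest.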
When run.log_requests is enabled, redacted HTTP exchange logs are written to reports/http-log.jsonl.
Reports default to reports/report.html next to the results file unless --out is provided.
Direct context file paths in manifests must resolve under the benchmark directory. Absolute paths and .. paths that escape the benchmark directory are rejected before provider requests are sent.
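The containment rule can be illustrated with pathlib. The sketch below is an assumption about how such a check could be written, not the runner's implementation; `resolve_context_path` is a hypothetical name, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_context_path(bench_dir: Path, manifest_path: str) -> Path:
    """Resolve a manifest path, rejecting anything outside the benchmark directory."""
    candidate = Path(manifest_path)
    if candidate.is_absolute():
        raise ValueError(f"absolute context path rejected: {manifest_path}")
    root = bench_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks where they exist.
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):
        raise ValueError(f"context path escapes benchmark directory: {manifest_path}")
    return resolved
```

A path such as contexts/../contexts/a.txt still passes, because it resolves to a location inside the directory; only paths whose resolved form leaves the directory are refused.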
score writes score.json, score.html, probe-summary.json, and standard per-attempt benchmark artifacts under probe-runs/<id>/.
probe-context writes timestamped probe runs under the selected output directory. Every individual probe attempt remains a standard benchmark directory, so existing report, compare, and result readers can inspect the generated artifacts.
Set a test case's context = "auto" to let the runner select a context file from context.index.json instead of naming a file directly.
The first implementation is a zero-token local router. It scores context-level index entries using safe manifest metadata, test IDs, suite names, token counts, and lexical overlap with the question. It does not call an LLM router and does not inspect expected answers. If routing is ambiguous or top candidates tie, the result fails with error_kind = "ContextRoute" instead of silently choosing a weak candidate.
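To make the fail-on-tie behavior concrete, here is a deliberately reduced Python sketch that scores candidates by lexical overlap alone. The real router also weighs manifest metadata, test IDs, suite names, and token counts; the function names, the flat description strings, and the use of LookupError are assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased alphanumeric word set, used for lexical overlap."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(question: str, index_entries: dict[str, str]) -> str:
    """Pick the context whose description shares the most words with the question.

    index_entries maps a context path to a hypothetical description string.
    Raises LookupError when no candidate matches or the top candidates tie.
    """
    q = tokenize(question)
    scores = {path: len(q & tokenize(desc)) for path, desc in index_entries.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        raise LookupError("ContextRoute: no candidate overlaps the question")
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise LookupError("ContextRoute: top candidates tie")
    return ranked[0][0]


entries = {
    "contexts/multi-hop-100000.txt": "multi-hop suite badge code project lead",
    "contexts/conflict-100000.txt": "conflict suite budget owner",
}
chosen = route("What is the badge code for the project lead?", entries)
```

Refusing to choose on a tie trades some coverage for auditability: a failed route is visible in the results, while a weak guess would look like a model error.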
Routing decisions are written into each result row under routing, including selected context path, candidate scores, method, status, confidence, token usage, and latency fields. The token and latency fields are currently zero because the local router does not spend model tokens.
Existing context.index.json files are validated during run; stale hashes, unsupported schema versions, absolute paths, and paths escaping the benchmark directory are rejected before provider requests are sent.
Contributions are welcome. Before opening a pull request:
- Run cargo fmt.
- Run cargo test.
- Keep changes focused on one behavior or feature.
- Include documentation updates when CLI behavior, config, or output formats change.
Useful local checks:
cargo fmt --check
cargo test
cargo clippy --all-targets -- -D warnings
cargo build
LongContextBench is licensed under the MIT License. See LICENSE.