ATLAS Benchmark Suite

V3.1 (Qwen3.5-9B) — In Progress

The V3 pipeline runner (v3_runner.py plus the modules in v3/) targets the V3.1.0 model and topology: Qwen3.5-9B-Q6_K with the PlanSearch / DivSampling / Budget Forcing / PR-CoT / Refinement / Derivation stack. Ablation work is tracked under scripts/run_v31_ablation.sh; the 6-condition study (A–F) is mid-run on hardware tier medium. Headline 9B numbers are not yet published; the "Known Limitations" section of docs/SOURCES.md lists this as a V3.1.x roadmap item.

Until the 9B numbers land, the V3 (14B) results in docs/reports/V3_ABLATION_STUDY.md are the canonical published evidence (74.6% LiveCodeBench v5 pass@1, 599 tasks, 4 ablation conditions).

V2 Benchmark (Historical)

The V2 path runs against the older topology (Fox 9B / Qwen3-14B with --method best-of-K selection). It is preserved here for reproducibility of the V2 results table below and for the legacy runner.py + v2_runner.py modules.

Run: ./run_v2_benchmark.sh

Config: see config.py for parameters.

What's Measured

LiveCodeBench v5 (primary): Coding problem-solving, 599 tasks, stdin/stdout evaluation
GPQA Diamond: Knowledge reasoning, 198 multiple-choice questions
IFBench: Instruction following, 300 tasks. Note: IFBench evaluation is incomplete: evaluate_ifbench_loose() defaults to True for roughly 11 of the 15 instruction categories. IFBench is excluded from headline results until the evaluation is properly implemented.
SciCode: Scientific coding, ~80 multi-step problems
Custom: Real-world coding tasks, 100 problems from benchmark/custom/tasks.json

V2 Results Summary

| Benchmark | Tasks | pass@1 | Conditions |
| --- | --- | --- | --- |
| LiveCodeBench v5 | 599 | 36-41% | k=3, Geometric Lens selection |
| GPQA Diamond | 198 | 47.0% | k=5, MCQ |
| Custom | 100 | 53-55% | k=1 |
| SciCode | 341 | 14.7% (sub-problems) | k=1 |

Run ID: v2_run_20260217_125310
Hardware: RTX 5060 Ti 16GB VRAM
Throughput: 109 tasks/hr aggregate

All results come from a single benchmark run. They are not averaged across multiple runs, so run-to-run variance is unknown.
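
A single run cannot tell us the run-to-run variance, but the finite task count alone already bounds how precise each percentage can be. The sketch below estimates that sampling uncertainty for the GPQA Diamond row, assuming each task is an independent pass/fail trial. The function name is illustrative and is not part of the repository.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate, treating tasks as independent Bernoulli trials."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# GPQA Diamond row from the table above: 47.0% over 198 questions.
se = binomial_standard_error(0.470, 198)
half_width = 1.96 * se  # normal-approximation 95% interval half-width
print(f"GPQA Diamond: 47.0% +/- {half_width * 100:.1f} percentage points")
```

This covers only the uncertainty from evaluating a finite set of tasks. Variation from sampling temperature, seeds, or hardware would add to it and can only be measured with repeated runs.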

Reproducing Results

Ensure the cluster is running: kubectl get pods -n atlas
Lock the codebase: run git stash to shelve any uncommitted changes
Run: ./run_v2_benchmark.sh
Results are written to: benchmark/results/
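
The steps above can be strung together in a small wrapper. This is a minimal sketch, not a script that ships with the repository: the `REPRO_STEPS` list and the `reproduce` function are hypothetical names, and the commands are copied from the steps as written. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Commands taken from the reproduction steps, in order.
REPRO_STEPS = [
    ("check cluster", ["kubectl", "get", "pods", "-n", "atlas"]),
    ("stash local changes", ["git", "stash"]),
    ("run V2 benchmark", ["./run_v2_benchmark.sh"]),
]


def reproduce(dry_run: bool = True) -> list[str]:
    """Print each step and run it only when dry_run is False.

    Returns the command lines in the order they were (or would be) run.
    """
    lines = []
    for label, cmd in REPRO_STEPS:
        line = " ".join(cmd)
        lines.append(line)
        print(f"[{label}] {line}")
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)
    return lines


if __name__ == "__main__":
    reproduce(dry_run=True)
```

Note that `kubectl get pods` exits successfully even when pods are not in the Running state, so the first step still needs a visual check of its output.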

Files

| File | Purpose |
| --- | --- |
| `runner.py` | Base benchmark runner (function + stdio modes, ChatML formatting, code extraction) |
| `cli.py` | Command-line interface (`atlas benchmark --humaneval --dry-run`, etc.) |
| `config.py` | Benchmark configuration loaded from `atlas.conf` |
| `models.py` | Data models: BenchmarkTask, AttemptResult, TaskResult, BenchmarkRun |
| `datasets/` | Dataset loaders (HumanEval, MBPP, EvalPlus, LiveCodeBench v5, GPQA, IFBench, SciCode) |
| `analysis/` | Cost analysis, hardware info, pass@k metric |
| `custom/` | 100 custom benchmark tasks + validator |
| V3 pipeline (active) | |
| `v3_runner.py` | V3 benchmark runner entry point (PlanSearch + DivSampling + PR-CoT, ablation conditions A–F) |
| `v3/` | 19 V3 pipeline modules (plan_search.py, div_sampling.py, budget_forcing.py, pr_cot.py, refinement_loop.py, etc.) |
| V2 (historical) | |
| `v2_runner.py` | V2 benchmark orchestrator (phases 0–6, telemetry, Mode A/B) |
| `v2_report.py` | V2 result analysis and reporting |
| `best_of_k.py` | Best-of-K selection with Geometric Lens (V2-era, superseded by V3 S* candidate selection) |
| `geo_learning.py` | Geometric Lens retraining pipeline |
| `run_v2_benchmark.sh` | V2 benchmark launch script with pre-flight checks |
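
The `analysis/` entry lists a pass@k metric. The repository's own implementation is not reproduced here. As a point of reference, a common definition is the unbiased estimator introduced with HumanEval, sketched below under the assumption that `n` samples were drawn per task and `c` of them passed. The function name and signature are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the task
    c: samples that passed all tests
    k: evaluation budget (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 5 samples, 2 correct, budget of 3.
print(pass_at_k(n=5, c=2, k=3))
```

A benchmark-level score is the mean of this value over all tasks. This differs from the V2 table's pass@1 figures at k=3 or k=5, which appear to score the single candidate chosen by a selection step, not an average over samples.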

V1 Benchmark (Archived)

V1 is preserved in the git history under the v1.0.0 tag — check it out (git checkout v1.0.0) if you need the original runner or report text. The V1 results file (v1_benchmark_report.md) was removed during the V2 → V2.5 reorg and is no longer reachable from the V3.1.0 working tree.
