Documentation: https://kim-em.github.io/lean-bench/
Microbenchmark library for Lean 4. Declare a benchmark and the
complexity model you expect; run it; the harness reports per-call
times across an adaptive parameter ladder and the ratio
C = perCallNanos / complexity(n). If your model fits, C is
roughly constant.
The ladder shape is auto-picked from the declared complexity: doubling for polynomial growth, linear-inside-the-cap-bracket for exponential.
From the repo root:
$ lake build && ./.lake/build/bin/fib_benchmark_example run LeanBench.Examples.Fib.goodFib
Builds the library + examples and runs a linear-Fibonacci benchmark
over params 0, 1, 2, 4, 8, …, 134_217_728. Tail of the output:
32_768 197.481 µs ×2^9 C= 6.027
65_536 374.301 µs ×2^9 C= 5.711
131_072 747.345 µs ×2^8 C= 5.702
262_144 1.496 ms ×2^7 C= 5.708
524_288 2.989 ms ×2^6 C= 5.702
1_048_576 5.979 ms ×2^6 C= 5.702
…
verdict: consistent with declared complexity (cMin=5.702, cMax=6.863, β=-0.005)
Tombstones † on earlier rungs mark the warmup-trim region — those
rows aren't included in the slope fit.
goodFib is declared as O(n); observed per-call time scales
linearly across 22 doublings (n=2 through n=134M); C stabilises
near 5.7 ns per iteration once the param is large enough that the
small fixed per-call cost amortises. β is the log-log slope of C
vs param over the trimmed tail — β ≈ 0 is what a correct complexity
declaration looks like.
import LeanBench
def myFib (n : Nat) : UInt64 := Id.run do
let mut a : UInt64 := 0; let mut b : UInt64 := 1
for _ in [0:n] do let c := a + b; a := b; b := c
return a
setup_benchmark myFib n => n + 1
-- Or with per-benchmark overrides on `BenchmarkConfig`:
-- setup_benchmark myFib n => n + 1 where {
-- maxSecondsPerCall := 0.5
-- paramCeiling := 4096
-- }
def main (args : List String) : IO UInt32 :=
LeanBench.Cli.dispatch argsCLI flags layer on top of declared defaults, so you can tighten a single run without recompiling:
$ lake exe my_benchmarks run myFib --max-seconds-per-call 0.25 --param-ceiling 1024If you don't yet know the right complexity model, declare your best
guess (or 1) and pass --auto-fit to see a ranked list of
suggestions from a fixed catalog (1, n, n log n, n², n³,
2ⁿ) — the heuristic ranks each model by how constant C = perCallNanos / model(n) ends up across the ladder. See
the deployed quickstart.
See the deployed quickstart for the full flag list.
lakefile.toml:
[[require]]
name = "lean-bench"
git = "https://github.com/kim-em/lean-bench.git"
rev = "main"
[[lean_exe]]
name = "my_benchmarks"
root = "MyBenchmarks"Then lake exe my_benchmarks list / run NAME / compare A B / verify / profile NAME --profiler "perf record -g --".
See the deployed quickstart and
the Profiling page.
For workloads with a single canonical input — "this hard problem takes ~1.2 s; the FLINT equivalent takes ~0.8 s" — there's a second registration form with no parameter sweep:
def factorXOverFTwo : Unit → Polynomial Bool := fun () => …
setup_fixed_benchmark factorXOverFTwo(Bare def : α registrations are rejected — closed pure
expressions get folded into compile-time constants. See issue #54
and the quickstart.)
The runner does one warmup call followed by repeats measured calls
(default 5), reports median / min / max wall time, and hash-checks
across repeats. run, compare, list, and verify all dispatch
by registration kind. See the deployed quickstart
for the full guide.
v0.1. Linux and macOS are exercised in CI. Windows builds in CI but
the test suite isn't run there yet — reports of breakage welcome.
See PLAN.md for the
v0.2+ roadmap, the deployed quickstart for the
user guide, the Pitfalls page for Lean-specific
benchmarking pitfalls (bignum Nat, forcing evaluation, sharing,
warm vs. cold), the Profiling page for the
external-profiler integration (perf, samply, heaptrack,
/usr/bin/time -v), and the Result schema page for the
JSONL result-row schema and its evolution rules.
Subprocess-per-batch. The child does the timing and the auto-tuned
inner-repeat loop; the parent spawns directly via IO.Process.spawn,
reads JSONL stdout, and kicks off a sibling Task that kills the
child via IO.Process.Child.kill if a single batch exceeds
maxSecondsPerCall + killGraceMs. No external timeout(1)
dependency. We use subprocesses because Lean has no portable
in-process interrupt for arbitrary computation, so the only reliable
way to enforce a wallclock cap is to give each measurement its own
process.
See the Design page for the full architectural rationale, including limits and known caveats.
Each measurement is a child process, so very fast operations have a
per-spawn noise floor in the milliseconds. The harness measures this
floor itself and prints it with every report; rows whose total wall
time is below signalFloorMultiplier × floor (default 10×) are
flagged [<floor] and excluded from the verdict, with an actionable
advisory line at the bottom of the report. For workloads that don't
fit the built-in auto / doubling / linear ladders, declare a
custom one with where { paramSchedule := .custom #[…] }.
Lean's Nat is bignum, the compiler hoists pure work out of loops
unless its result is observed, and reference-counted sharing can
flip operations between O(1) and O(n) depending on aliasing — these
and other Lean-specific traps are covered in
the Pitfalls page. Read it before declaring a
benchmark you're going to commit to a CI baseline.
The verdict is a thresholded log-log slope (|β| ≤ 0.15 over the
trimmed tail), not a statistical test. β's sign tells you direction;
read the raw ratios for magnitude.