Hi @stefanzetzsche,
Thanks for the great contribution of releasing this benchmark of applying Dafny in classical mathematics verification tasks. The results are very impressive and suggest that Dafny-style verification tools could become a strong alternative to traditional automated theorem provers in certain domains.
While experimenting with the benchmark, I (with the help of AI tools) noticed that a few tasks appear to be unprovable. From my investigation, several of these issues seem to stem from errors introduced during the translation from Lean to Dafny, while others may already exist in the original Lean benchmark itself. Please double check since I might also made a few mistakes while understandting the original problems.
I have not yet gone through the entire benchmark systematically, but I would be very interested in collaborating with you to identify more issues, improve the translation pipeline, and potentially help build a more robust and automated benchmark for the community.
Here are the issues I found (descriptions polished with AI assistance).
A. translation bugs (14)
The Lean miniF2F original is correct but the Dafny translation introduced a false statement.
ensures -1.0 <= x < 1.0 - sqrt(31.0/8.0)
Bug: division pulled inside the radical. Lean: 1 - real.sqrt 31 / 8 (= √31/8 ≈ 0.696)
Fix: ensures -1.0 <= x < 1.0 - sqrt(31.0)/8.0.
requires forall x: real, n: nat | x >= 0.0 && n != 0 :: f(n+1)(x) == f(n)(x) * f(n)(x) + 1.0/(n as real)
Bug: dropped parentheses. Lean: f (n+1) x = f n x * (f n x + 1/n); you wrote f·f + 1/n instead of f·(f + 1/n).
Fix: f(n+1)(x) == f(n)(x) * (f(n)(x) + 1.0/(n as real)).
ensures forall y: nat :: f(4)(y+1) == Int.pow(2, f(4)(y+3)) - 3
Bug: +3 moved inside the argument. Lean: f 4 (y+1) = 2^(f 4 y + 3) - 3
Fix: f(4)(y+1) == Int.pow(2, f(4)(y) + 3) - 3.
ensures Int.pow(19, 19) + Int.pow(99, 99) % 10 == 8
Bug: Lean: (19^19+99^99)%10.
Fix: ensures (Int.pow(19,19) + Int.pow(99,99)) % 10 == 8.
lemma algebra_amgm_sum1toneqn_prod1tonleq1(a: nat -> real, n: nat)
requires Real.sum(range(n), k => a(k)) == n as real
ensures Real.prod(range(n), k => a(k)) <= 1.0
Bug: 1Lean types a : ℕ → nnreal; this one uses real, dropping a(k) ≥ 0
Fix: add requires forall k :: 0 <= k < n ==> a(k) >= 0.0.
lemma algebra_amgm_prod1toneq1_sum1tongeqn(a: nat -> real, n: nat)
requires Real.prod(range(n), k => a(k)) == 1.0
ensures Real.sum(range(n), k => a(k)) >= n as real
Bug: as A5.
Fix: add requires forall k :: 0 <= k < n ==> a(k) >= 0.0.
lemma mathd_numbertheory_227(x: nat, y: nat, n: nat)
requires x/4 + y/6 == (x+y)/n
ensures n == 5
Bug: lossy real→int coercion. (x,y,n)=(1,1,3) — all integer quotients are 0.
Fix: make x, y, n real (or cast) so the equation uses real division.
ensures ((100*100 - 7*7)) / (70*70 - 11*11) * ((70 - 11) * (70 + 11) / ((100 - 7) * (100 + 7))) == 1
Bug: same lossy real→int coercion. 9951 / 4779 = 2 and 4779 / 9951 = 0 in integer division, so the product is 0 ≠ 1.
Fix: evaluate the expression over real.
ensures Real.sum(set k: int | 1 <= k <= p-2, (k: int) => if k % p == 0 then 0.0 else 1.0/((k % p) as real) * 1.0/((k % p) as real + 1.0)) == 2.0
Bug: translation bug of zmod
Fix: the sum must be taken in zmod p, not real.
requires a == Int.sum(divisors(500), k => k)
ensures Int.sum(filter(prime, divisors(a as int)), k => k) == 25
ensures Int.sum(divisors(198), k => k) == 270
ensures Int.sum(divisors(Int.pow(30, 4)), k => 1) - 2 == 123
Bug (A10–A12): the library's divisors(n: int) returns signed divisors (both +d and −d), whereas Lean's nat.divisors / nat.proper_divisors are positive-only.
Fix: restrict the library divisors to positive divisors, or define extra functions for that.
requires Real.sum(range(100), x => f(x+1)*f(x+1)) == 1.0
ensures Real.sum(range(99), x => f(x+1)*f(x+1) + f(x+2) + f(100)*f(100)*f(1)) < 12.0/25.0
Bug: Lean: ∑ x in finset.range 99, ((a (x+1))^2 * a (x+2)) + (a 100)^2 * a 1 < 12/25, with a : ℕ → nnreal.
Fix: ensures Real.sum(range(99), x => f(x+1)*f(x+1)*f(x+2)) + f(100)*f(100)*f(1) < 12.0/25.0, and add requires forall i :: 1 <= i <= 100 ==> f(i) >= 0.0.
ensures Complex.Real.abs(Complex.sub(a, b)) <= 2.0 * sqrt(21.0)
Bug: It does not resolve.
These are false in the Lean miniF2F source too, and Dafny extend the same error.
ensures !((exists i, j :: a == 2*i && b == 2*j) <==> (exists k :: a*a + b*b == 8*k))
Counterexample: (a,b)=(0,0)
ensures (Int.pow(29, 13) - Int.pow(5, 13)) % 7 == 0
bug: the Lean theorem states the identical (false) equation
(29^13 - 5^13) % 7 = 0. 29 ≡ 1 (mod 7) ⇒ 29¹³ ≡ 1; 5¹³ ≡ 5; so
(29¹³−5¹³) % 7 = 3 ≠ 0.
ensures |filter(x => x % 19 == 7, set x | 100 <= x <= 999)| == 52
bug: the Lean theorem also claims the count is 52. The actual count of x ∈ [100,999] with x % 19 == 7 is 48 (x = 102, 121, …, 995).
requires forall x: real :: 10.0*x*x - x - 24.0 == ((a as real)*x - 8.0) * ((b as real)*x + 3.0)
ensures a + b == 12
Counterexample: (a,b)=(5,2).
lemma mathd_numbertheory_618(n: nat, p: nat -> nat)
requires forall x: nat :: p(x) == x*x - x + 41
requires 1 < gcd(p(n), p(n+1))
ensures 41 <= n
Counterexample: n=0.
requires i != 0
requires m != 0
requires o != 0
requires i*m*o == 2001
ensures i+m+o <= 671
Counterexample: (i,m,o)=(1,1,2001).
Hi @stefanzetzsche,
Thanks for the great contribution of releasing this benchmark of applying Dafny in classical mathematics verification tasks. The results are very impressive and suggest that Dafny-style verification tools could become a strong alternative to traditional automated theorem provers in certain domains.
While experimenting with the benchmark, I (with the help of AI tools) noticed that a few tasks appear to be unprovable. From my investigation, several of these issues seem to stem from errors introduced during the translation from Lean to Dafny, while others may already exist in the original Lean benchmark itself. Please double check since I might also made a few mistakes while understandting the original problems.
I have not yet gone through the entire benchmark systematically, but I would be very interested in collaborating with you to identify more issues, improve the translation pipeline, and potentially help build a more robust and automated benchmark for the community.
Here are the issues I found (descriptions polished with AI assistance).
A. translation bugs (14)
The Lean miniF2F original is correct but the Dafny translation introduced a false statement.
A1.
dataset/test/imo_1962_p2.dfyBug: division pulled inside the radical. Lean:
1 - real.sqrt 31 / 8(=√31/8 ≈ 0.696)Fix:
ensures -1.0 <= x < 1.0 - sqrt(31.0)/8.0.A2.
dataset/test/imo_1985_p6.dfyBug: dropped parentheses. Lean:
f (n+1) x = f n x * (f n x + 1/n); you wrotef·f + 1/ninstead off·(f + 1/n).Fix:
f(n+1)(x) == f(n)(x) * (f(n)(x) + 1.0/(n as real)).A3.
dataset/test/imo_1981_p6.dfyBug:
+3moved inside the argument. Lean:f 4 (y+1) = 2^(f 4 y + 3) - 3Fix:
f(4)(y+1) == Int.pow(2, f(4)(y) + 3) - 3.A4.
dataset/valid/mathd_numbertheory_202.dfyBug: Lean:
(19^19+99^99)%10.Fix:
ensures (Int.pow(19,19) + Int.pow(99,99)) % 10 == 8.A5.
dataset/test/algebra_amgm_sum1toneqn_prod1tonleq1.dfyBug: 1Lean types
a : ℕ → nnreal; this one usesreal, droppinga(k) ≥ 0Fix: add
requires forall k :: 0 <= k < n ==> a(k) >= 0.0.A6.
dataset/valid/algebra_amgm_prod1toneq1_sum1tongeqn.dfyBug: as A5.
Fix: add
requires forall k :: 0 <= k < n ==> a(k) >= 0.0.A7.
dataset/test/mathd_numbertheory_227.dfyBug: lossy real→int coercion.
(x,y,n)=(1,1,3)— all integer quotients are0.Fix: make
x, y, nreal (or cast) so the equation uses real division.A8.
dataset/test/amc12b_2020_p2.dfyBug: same lossy real→int coercion.
9951 / 4779 = 2and4779 / 9951 = 0in integer division, so the product is0 ≠ 1.Fix: evaluate the expression over
real.A9.
dataset/test/mathd_numbertheory_764.dfyBug: translation bug of
zmodFix: the sum must be taken in
zmod p, notreal.A10.
dataset/test/mathd_numbertheory_427.dfyA11.
dataset/valid/mathd_numbertheory_403.dfyA12.
dataset/valid/mathd_numbertheory_543.dfyBug (A10–A12): the library's
divisors(n: int)returns signed divisors (both+dand−d), whereas Lean'snat.divisors/nat.proper_divisorsare positive-only.Fix: restrict the library
divisorsto positive divisors, or define extra functions for that.A13.
dataset/test/imo_2007_p6.dfyBug: Lean:
∑ x in finset.range 99, ((a (x+1))^2 * a (x+2)) + (a 100)^2 * a 1 < 12/25, witha : ℕ → nnreal.Fix:
ensures Real.sum(range(99), x => f(x+1)*f(x+1)*f(x+2)) + f(100)*f(100)*f(1) < 12.0/25.0, and addrequires forall i :: 1 <= i <= 100 ==> f(i) >= 0.0.A14.
dataset/test/amc12a_2020_p15.dfyBug: It does not resolve.
These are false in the Lean miniF2F source too, and Dafny extend the same error.
B1.
dataset/test/numbertheory_notequiv2i2jasqbsqdiv8.dfyCounterexample:
(a,b)=(0,0)B2.
dataset/test/mathd_numbertheory_728.dfybug: the Lean theorem states the identical (false) equation
(29^13 - 5^13) % 7 = 0.29 ≡ 1 (mod 7)⇒29¹³ ≡ 1;5¹³ ≡ 5; so(29¹³−5¹³) % 7 = 3 ≠ 0.B3.
dataset/valid/mathd_numbertheory_155.dfybug: the Lean theorem also claims the count is
52. The actual count ofx ∈ [100,999]withx % 19 == 7is 48 (x = 102, 121, …, 995).B4.
dataset/test/mathd_algebra_276.dfyCounterexample:
(a,b)=(5,2).B5.
dataset/test/mathd_numbertheory_618.dfyCounterexample:
n=0.B6.
dataset/test/amc12_2000_p1.dfyCounterexample:
(i,m,o)=(1,1,2001).