Skip to content

Soundness check: a few unprovable tasks in the benchmark #8

@OlivierTOMATO

Description

@OlivierTOMATO

Hi @stefanzetzsche,

Thanks for the great contribution of releasing this benchmark of applying Dafny in classical mathematics verification tasks. The results are very impressive and suggest that Dafny-style verification tools could become a strong alternative to traditional automated theorem provers in certain domains.

While experimenting with the benchmark, I (with the help of AI tools) noticed that a few tasks appear to be unprovable. From my investigation, several of these issues seem to stem from errors introduced during the translation from Lean to Dafny, while others may already exist in the original Lean benchmark itself. Please double check since I might also made a few mistakes while understandting the original problems.

I have not yet gone through the entire benchmark systematically, but I would be very interested in collaborating with you to identify more issues, improve the translation pipeline, and potentially help build a more robust and automated benchmark for the community.

Here are the issues I found (descriptions polished with AI assistance).

A. translation bugs (14)

The Lean miniF2F original is correct but the Dafny translation introduced a false statement.

A1. dataset/test/imo_1962_p2.dfy

ensures -1.0 <= x < 1.0 - sqrt(31.0/8.0)

Bug: division pulled inside the radical. Lean: 1 - real.sqrt 31 / 8 (= √31/8 ≈ 0.696)

Fix: ensures -1.0 <= x < 1.0 - sqrt(31.0)/8.0.

A2. dataset/test/imo_1985_p6.dfy

requires forall x: real, n: nat | x >= 0.0 && n != 0 :: f(n+1)(x) == f(n)(x) * f(n)(x) + 1.0/(n as real)

Bug: dropped parentheses. Lean: f (n+1) x = f n x * (f n x + 1/n); you wrote f·f + 1/n instead of f·(f + 1/n).

Fix: f(n+1)(x) == f(n)(x) * (f(n)(x) + 1.0/(n as real)).

A3. dataset/test/imo_1981_p6.dfy

ensures forall y: nat :: f(4)(y+1) == Int.pow(2, f(4)(y+3)) - 3

Bug: +3 moved inside the argument. Lean: f 4 (y+1) = 2^(f 4 y + 3) - 3

Fix: f(4)(y+1) == Int.pow(2, f(4)(y) + 3) - 3.

A4. dataset/valid/mathd_numbertheory_202.dfy

ensures Int.pow(19, 19) + Int.pow(99, 99) % 10 == 8

Bug: Lean: (19^19+99^99)%10.

Fix: ensures (Int.pow(19,19) + Int.pow(99,99)) % 10 == 8.

A5. dataset/test/algebra_amgm_sum1toneqn_prod1tonleq1.dfy

lemma algebra_amgm_sum1toneqn_prod1tonleq1(a: nat -> real, n: nat)
  requires Real.sum(range(n), k => a(k)) == n as real
  ensures Real.prod(range(n), k => a(k)) <= 1.0

Bug: 1Lean types a : ℕ → nnreal; this one uses real, dropping a(k) ≥ 0

Fix: add requires forall k :: 0 <= k < n ==> a(k) >= 0.0.

A6. dataset/valid/algebra_amgm_prod1toneq1_sum1tongeqn.dfy

lemma algebra_amgm_prod1toneq1_sum1tongeqn(a: nat -> real, n: nat)
  requires Real.prod(range(n), k => a(k)) == 1.0
  ensures Real.sum(range(n), k => a(k)) >= n as real

Bug: as A5.

Fix: add requires forall k :: 0 <= k < n ==> a(k) >= 0.0.

A7. dataset/test/mathd_numbertheory_227.dfy

lemma mathd_numbertheory_227(x: nat, y: nat, n: nat)
  requires x/4 + y/6 == (x+y)/n
  ensures n == 5

Bug: lossy real→int coercion. (x,y,n)=(1,1,3) — all integer quotients are 0.

Fix: make x, y, n real (or cast) so the equation uses real division.

A8. dataset/test/amc12b_2020_p2.dfy

ensures ((100*100 - 7*7)) / (70*70 - 11*11) * ((70 - 11) * (70 + 11) / ((100 - 7) * (100 + 7))) == 1

Bug: same lossy real→int coercion. 9951 / 4779 = 2 and 4779 / 9951 = 0 in integer division, so the product is 0 ≠ 1.

Fix: evaluate the expression over real.

A9. dataset/test/mathd_numbertheory_764.dfy

ensures Real.sum(set k: int | 1 <= k <= p-2, (k: int) => if k % p == 0 then 0.0 else 1.0/((k % p) as real) * 1.0/((k % p) as real + 1.0)) == 2.0

Bug: translation bug of zmod

Fix: the sum must be taken in zmod p, not real.

A10. dataset/test/mathd_numbertheory_427.dfy

requires a == Int.sum(divisors(500), k => k)
ensures Int.sum(filter(prime, divisors(a as int)), k => k) == 25

A11. dataset/valid/mathd_numbertheory_403.dfy

ensures Int.sum(divisors(198), k => k) == 270

A12. dataset/valid/mathd_numbertheory_543.dfy

ensures Int.sum(divisors(Int.pow(30, 4)), k => 1) - 2 == 123

Bug (A10–A12): the library's divisors(n: int) returns signed divisors (both +d and −d), whereas Lean's nat.divisors / nat.proper_divisors are positive-only.

Fix: restrict the library divisors to positive divisors, or define extra functions for that.

A13. dataset/test/imo_2007_p6.dfy

requires Real.sum(range(100), x => f(x+1)*f(x+1)) == 1.0
ensures Real.sum(range(99), x => f(x+1)*f(x+1) + f(x+2) + f(100)*f(100)*f(1)) < 12.0/25.0

Bug: Lean: ∑ x in finset.range 99, ((a (x+1))^2 * a (x+2)) + (a 100)^2 * a 1 < 12/25, with a : ℕ → nnreal.

Fix: ensures Real.sum(range(99), x => f(x+1)*f(x+1)*f(x+2)) + f(100)*f(100)*f(1) < 12.0/25.0, and add requires forall i :: 1 <= i <= 100 ==> f(i) >= 0.0.

A14. dataset/test/amc12a_2020_p15.dfy

ensures Complex.Real.abs(Complex.sub(a, b)) <= 2.0 * sqrt(21.0)

Bug: It does not resolve.

These are false in the Lean miniF2F source too, and Dafny extend the same error.

B1. dataset/test/numbertheory_notequiv2i2jasqbsqdiv8.dfy

ensures !((exists i, j :: a == 2*i && b == 2*j) <==> (exists k :: a*a + b*b == 8*k))

Counterexample: (a,b)=(0,0)

B2. dataset/test/mathd_numbertheory_728.dfy

ensures (Int.pow(29, 13) - Int.pow(5, 13)) % 7 == 0

bug: the Lean theorem states the identical (false) equation
(29^13 - 5^13) % 7 = 0. 29 ≡ 1 (mod 7)29¹³ ≡ 1; 5¹³ ≡ 5; so
(29¹³−5¹³) % 7 = 3 ≠ 0.

B3. dataset/valid/mathd_numbertheory_155.dfy

ensures |filter(x => x % 19 == 7, set x | 100 <= x <= 999)| == 52

bug: the Lean theorem also claims the count is 52. The actual count of x ∈ [100,999] with x % 19 == 7 is 48 (x = 102, 121, …, 995).

B4. dataset/test/mathd_algebra_276.dfy

requires forall x: real :: 10.0*x*x - x - 24.0 == ((a as real)*x - 8.0) * ((b as real)*x + 3.0)
ensures a + b == 12

Counterexample: (a,b)=(5,2).

B5. dataset/test/mathd_numbertheory_618.dfy

lemma mathd_numbertheory_618(n: nat, p: nat -> nat)
  requires forall x: nat :: p(x) == x*x - x + 41
  requires 1 < gcd(p(n), p(n+1))
  ensures 41 <= n

Counterexample: n=0.

B6. dataset/test/amc12_2000_p1.dfy

requires i != 0
requires m != 0
requires o != 0
requires i*m*o == 2001
ensures i+m+o <= 671

Counterexample: (i,m,o)=(1,1,2001).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions