Skip to content

scidonia/vericode

Repository files navigation

VericodeTutorial

A small Lean 4 tutorial project demonstrating a verified transducer for streaming citation recognition.

The motivating scenario is an LLM streaming API that emits a token stream as short String chunks (often 3–4 characters) and may split citation markers across token boundaries.

Problem statement

Inputs

  • A stream of tokens: List String.
  • A fixed citation map of size n: CitationMap n := Vector Source n.

Citation syntax

A citation marker is a literal substring of the form

  • Citation followed by one or more digits

For example: Citation19.

Digits stop at the first non-digit character.

Desired streaming output

The server should output an event stream (think SSE).

  • When a valid citation marker Citation{k} is recognized (1-based, in-range for the citation map):

    1. emit a citation metadata event identifying the cited source
    2. then emit a text event whose payload still contains the literal marker Citation{k}.
  • When an out-of-range marker Citation{k} is recognized:

    • emit a warning event
    • do not emit any citation event for that marker
    • do not emit any text containing the dropped Citation{k} marker (only the marker is dropped; the stream continues).

Implementation overview

Event protocol

Defined in VericodeTutorial/Protocol.lean.

  • Source is the citation metadata payload.
  • OutEvent n is the output event type:
    • OutEvent.text s
    • OutEvent.citation id src
    • OutEvent.warning k msg
  • OutEvent.renderSse renders an event as an SSE-like string.

Verified transducer

Implemented in VericodeTutorial/Transducer.lean.

  • A small state machine (Machine n) that processes the token stream incrementally.
  • Parser modes:
    • Mode.normal
    • Mode.matching matched candRev (matching the Citation prefix)
    • Mode.digits value digitsRev (reading digits)

Important design choice:

  • In Mode.digits, we store only the digits (digitsRev : List Char, stored in reverse order). The Citation prefix is implicit.
  • Therefore the literal marker text emitted on success is definitionally String.ofList (citationPrefix ++ digitsRev.reverse).

Core functions:

  • feedChar / feedChars: process characters
  • feedToken: process one token
  • finish: finalize end-of-stream
  • runTokens: process the full token list

Glossary for theorem statements

This section explains the "words" and notations that appear in the theorem statements below.

  • n : Nat is the number of citations available.
  • CitationId n is Fin n (a natural number proven to be < n).
  • citations : CitationMap n is a Vector Source n.
  • citationPrefix : List Char is the character list for the literal string "Citation".
  • xs ++ ys is list concatenation.
  • ∃ tail, events = prefix ++ tail means events has prefix as a prefix, and tail is the rest of the list.
  • head? : List α → Option α returns the first element, or none.
  • p <+: q is the proposition “p is a prefix of q” for lists.
    • Example: citationPrefix <+: s.toList means the string s begins with "Citation".
  • In Lean, for a pair x : A × B, x.1 is the first component and x.2 is the second.

Proven correctness theorems

All proofs are in Lean, and the project builds with lake build.

Run-level (whole-program) guarantees

These are the properties about runTokens output that correspond to the main intended safety/consistency guarantees.

Definitions (from VericodeTutorial/RunTheorems.lean):

/-- Predicate: an event is a `citation`. -/
def IsCitation : OutEvent n → Prop

/-- No emitted stream ends with a `citation` event. -/
def NoEndCitation : List (OutEvent n) → Prop

/-- Adjacency invariant: every `citation` is immediately followed by a `text`. -/
def AdjOK : List (OutEvent n) → Prop

/-- Combined run invariant used for fold proofs. -/
def StreamOK (events : List (OutEvent n)) : Prop :=
  AdjOK (n := n) events ∧ NoEndCitation (n := n) events

Theorem (base run invariant):

theorem runTokens_streamOK (citations : CitationMap n) (tokens : List String) :
    StreamOK (n := n) (runTokens citations tokens)

Meaning:

  • The entire output list satisfies StreamOK.
  • In particular, every emitted citation event is immediately followed by a text event, and the stream never ends with a dangling citation.

Corollary (adjacency extracted from StreamOK):

theorem runTokens_adjOK (citations : CitationMap n) (tokens : List String) :
    AdjOK (n := n) (runTokens citations tokens)

Stronger adjacency that captures the Citation... marker shape at least up to prefix (requirement (a)):

/-- Stronger adjacency: every `citation` is immediately followed by a `text`
that starts with the literal `Citation` prefix. -/
def AdjOKWithCitationPrefix : List (OutEvent n) → Prop


theorem runTokens_adjOKWithCitationPrefix (citations : CitationMap n) (tokens : List String) :
    AdjOKWithCitationPrefix (n := n) (runTokens citations tokens)

Meaning:

  • Every citation event is immediately followed by a text s event.
  • That string satisfies citationPrefix <+: s.toList, i.e. it begins with "Citation".

Step-level citation completion behavior

These theorems are about a single-character step, in a specific parser state. They are used as building blocks for the run-level proofs.

Valid citation completion emits metadata before marker text

From VericodeTutorial/TransducerTheorems.lean:

theorem feedChar_digits_delim_valid_prefix
    (citations : CitationMap n)
    (textRev digitsRev : List Char)
    (value : Nat)
    (c : Char)
    (hn : digitVal? c = none)
    {id : CitationId n}
    (hid : toCitationId? n value = some id) :
    ∃ tail,
      (feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2
        =
      [ OutEvent.citation id (citations.get id)
      , OutEvent.text (String.ofList (citationPrefix ++ digitsRev.reverse))
      ] ++ tail

Meaning:

  • When we are in digits mode and see a non-digit delimiter c, and the parsed citation number is in-range (hid), the emitted event list begins with citation and then the literal marker text Citation{k}.
  • The tail accounts for any additional events produced by reprocessing the delimiter character.

Out-of-range citation completion emits warning

theorem feedChar_digits_delim_oob_prefix
    (citations : CitationMap n)
    (textRev digitsRev : List Char)
    (value : Nat)
    (c : Char)
    (hn : digitVal? c = none)
    (hid : toCitationId? n value = none) :
    ∃ tail,
      (feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2
        =
      [ OutEvent.warning value s!"Citation{value} out of range" ] ++ tail

Meaning:

  • If the parsed citation number is out-of-range (hid), the output has a warning event prefix.

A convenient “warning is the first event” corollary:

theorem feedChar_digits_delim_oob_head
    (citations : CitationMap n)
    (textRev digitsRev : List Char)
    (value : Nat)
    (c : Char)
    (hn : digitVal? c = none)
    (hid : toCitationId? n value = none) :
    (feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2.head?
      = some (OutEvent.warning value s!"Citation{value} out of range")

Out-of-range completion emits no citations (requirement (b), step-level)

This one formalizes “out-of-range markers do not produce any citation events”.

/-- Predicate: an event is a `citation`. (Local to this file.) -/
def IsCitationEvent : OutEvent n → Prop


theorem feedChar_digits_delim_oob_noCitation
    (citations : CitationMap n)
    (textRev digitsRev : List Char)
    (value : Nat)
    (c : Char)
    (hn : digitVal? c = none)
    (hid : toCitationId? n value = none) :
    ∀ e ∈ (feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2,
      ¬ IsCitationEvent (n := n) e

Meaning:

  • For every event e in the list emitted by this out-of-range completion step, e is not a citation event.

Properties we would like to prove (not done yet)

These are natural next steps for extending the tutorial. The statements here are intentionally informal sketches.

  1. Exact marker shape (stronger than prefix):

    • For every citation event, the next text event is exactly a marker of the form Citation followed by a nonempty list of digits (and nothing else), or at least contains a Citation marker at a known position.
  2. Run-level out-of-range removal from the text stream:

    • Out-of-range markers Citation{k} never appear in any emitted text event of runTokens.
    • This is subtle because a warning can be followed by unrelated text events (e.g. flushing buffered text from earlier).
  3. Referential correctness (allowing reuse):

    • Each marker occurrence Citation{k} in the text stream is preceded by a matching citation metadata event.
    • The same citation id may legitimately appear multiple times (reuse), so the goal is correctness of each occurrence rather than a one-to-one bijection.
  4. No spurious citations:

    • Every emitted citation event corresponds to an actual Citation{k} marker in the input stream.
  5. End-to-end SSE trace constraints:

    • For example: forbidding certain event interleavings, or enforcing that warnings include the exact numeric k that was parsed.

Demo

Main.lean runs the transducer on a small fixture stream and prints SSE-like output.

  • Build: lake build
  • Run: lake exe vericodetutorial

About

Vericoding tutorial

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages