A small Lean 4 tutorial project demonstrating a verified transducer for streaming citation recognition.
The motivating scenario is an LLM streaming API that emits a token stream as
short String chunks (often 3–4 characters) and may split citation markers
across token boundaries.
- A stream of tokens:
List String. - A fixed citation map of size
n:CitationMap n := Vector Source n.
A citation marker is a literal substring of the form
Citationfollowed by one or more digits
For example: Citation19.
Digits stop at the first non-digit character.
The server should output an event stream (think SSE).
-
When a valid citation marker
Citation{k}is recognized (1-based, in-range for the citation map):- emit a
citationmetadata event identifying the cited source - then emit a
textevent whose payload still contains the literal markerCitation{k}.
- emit a
-
When an out-of-range marker
Citation{k}is recognized:- emit a
warningevent - do not emit any
citationevent for that marker - do not emit any
textcontaining the droppedCitation{k}marker (only the marker is dropped; the stream continues).
- emit a
Defined in VericodeTutorial/Protocol.lean.
Sourceis the citation metadata payload.OutEvent nis the output event type:OutEvent.text sOutEvent.citation id srcOutEvent.warning k msg
OutEvent.renderSserenders an event as an SSE-like string.
Implemented in VericodeTutorial/Transducer.lean.
- A small state machine (
Machine n) that processes the token stream incrementally. - Parser modes:
Mode.normalMode.matching matched candRev(matching theCitationprefix)Mode.digits value digitsRev(reading digits)
Important design choice:
- In
Mode.digits, we store only the digits (digitsRev : List Char, stored in reverse order). TheCitationprefix is implicit. - Therefore the literal marker text emitted on success is definitionally
String.ofList (citationPrefix ++ digitsRev.reverse).
Core functions:
feedChar/feedChars: process charactersfeedToken: process one tokenfinish: finalize end-of-streamrunTokens: process the full token list
This section explains the "words" and notations that appear in the theorem statements below.
n : Natis the number of citations available.CitationId nisFin n(a natural number proven to be< n).citations : CitationMap nis aVector Source n.citationPrefix : List Charis the character list for the literal string"Citation".xs ++ ysis list concatenation.∃ tail, events = prefix ++ tailmeanseventshasprefixas a prefix, andtailis the rest of the list.head? : List α → Option αreturns the first element, ornone.p <+: qis the proposition “pis a prefix ofq” for lists.- Example:
citationPrefix <+: s.toListmeans the stringsbegins with"Citation".
- Example:
- In Lean, for a pair
x : A × B,x.1is the first component andx.2is the second.
All proofs are in Lean, and the project builds with lake build.
These are the properties about runTokens output that correspond to the main
intended safety/consistency guarantees.
Definitions (from VericodeTutorial/RunTheorems.lean):
/-- Predicate: an event is a `citation`. -/
def IsCitation : OutEvent n → Prop
/-- No emitted stream ends with a `citation` event. -/
def NoEndCitation : List (OutEvent n) → Prop
/-- Adjacency invariant: every `citation` is immediately followed by a `text`. -/
def AdjOK : List (OutEvent n) → Prop
/-- Combined run invariant used for fold proofs. -/
def StreamOK (events : List (OutEvent n)) : Prop :=
AdjOK (n := n) events ∧ NoEndCitation (n := n) eventsTheorem (base run invariant):
theorem runTokens_streamOK (citations : CitationMap n) (tokens : List String) :
StreamOK (n := n) (runTokens citations tokens)Meaning:
- The entire output list satisfies
StreamOK. - In particular, every emitted
citationevent is immediately followed by atextevent, and the stream never ends with a danglingcitation.
Corollary (adjacency extracted from StreamOK):
theorem runTokens_adjOK (citations : CitationMap n) (tokens : List String) :
AdjOK (n := n) (runTokens citations tokens)Stronger adjacency that captures the Citation... marker shape at least up to
prefix (requirement (a)):
/-- Stronger adjacency: every `citation` is immediately followed by a `text`
that starts with the literal `Citation` prefix. -/
def AdjOKWithCitationPrefix : List (OutEvent n) → Prop
theorem runTokens_adjOKWithCitationPrefix (citations : CitationMap n) (tokens : List String) :
AdjOKWithCitationPrefix (n := n) (runTokens citations tokens)Meaning:
- Every
citationevent is immediately followed by atext sevent. - That string satisfies
citationPrefix <+: s.toList, i.e. it begins with"Citation".
These theorems are about a single-character step, in a specific parser state. They are used as building blocks for the run-level proofs.
From VericodeTutorial/TransducerTheorems.lean:
theorem feedChar_digits_delim_valid_prefix
(citations : CitationMap n)
(textRev digitsRev : List Char)
(value : Nat)
(c : Char)
(hn : digitVal? c = none)
{id : CitationId n}
(hid : toCitationId? n value = some id) :
∃ tail,
(feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2
=
[ OutEvent.citation id (citations.get id)
, OutEvent.text (String.ofList (citationPrefix ++ digitsRev.reverse))
] ++ tailMeaning:
- When we are in digits mode and see a non-digit delimiter
c, and the parsed citation number is in-range (hid), the emitted event list begins withcitationand then the literal marker textCitation{k}. - The
tailaccounts for any additional events produced by reprocessing the delimiter character.
theorem feedChar_digits_delim_oob_prefix
(citations : CitationMap n)
(textRev digitsRev : List Char)
(value : Nat)
(c : Char)
(hn : digitVal? c = none)
(hid : toCitationId? n value = none) :
∃ tail,
(feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2
=
[ OutEvent.warning value s!"Citation{value} out of range" ] ++ tailMeaning:
- If the parsed citation number is out-of-range (
hid), the output has awarningevent prefix.
A convenient “warning is the first event” corollary:
theorem feedChar_digits_delim_oob_head
(citations : CitationMap n)
(textRev digitsRev : List Char)
(value : Nat)
(c : Char)
(hn : digitVal? c = none)
(hid : toCitationId? n value = none) :
(feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2.head?
= some (OutEvent.warning value s!"Citation{value} out of range")This one formalizes “out-of-range markers do not produce any citation events”.
/-- Predicate: an event is a `citation`. (Local to this file.) -/
def IsCitationEvent : OutEvent n → Prop
theorem feedChar_digits_delim_oob_noCitation
(citations : CitationMap n)
(textRev digitsRev : List Char)
(value : Nat)
(c : Char)
(hn : digitVal? c = none)
(hid : toCitationId? n value = none) :
∀ e ∈ (feedChar citations { textRev := textRev, mode := .digits value digitsRev } c).2,
¬ IsCitationEvent (n := n) eMeaning:
- For every event
ein the list emitted by this out-of-range completion step,eis not acitationevent.
These are natural next steps for extending the tutorial. The statements here are intentionally informal sketches.
-
Exact marker shape (stronger than prefix):
- For every
citationevent, the nexttextevent is exactly a marker of the formCitationfollowed by a nonempty list of digits (and nothing else), or at least contains aCitationmarker at a known position.
- For every
-
Run-level out-of-range removal from the text stream:
- Out-of-range markers
Citation{k}never appear in any emittedtextevent ofrunTokens. - This is subtle because a warning can be followed by unrelated text events (e.g. flushing buffered text from earlier).
- Out-of-range markers
-
Referential correctness (allowing reuse):
- Each marker occurrence
Citation{k}in the text stream is preceded by a matchingcitationmetadata event. - The same citation id may legitimately appear multiple times (reuse), so the goal is correctness of each occurrence rather than a one-to-one bijection.
- Each marker occurrence
-
No spurious citations:
- Every emitted
citationevent corresponds to an actualCitation{k}marker in the input stream.
- Every emitted
-
End-to-end SSE trace constraints:
- For example: forbidding certain event interleavings, or enforcing that
warnings include the exact numeric
kthat was parsed.
- For example: forbidding certain event interleavings, or enforcing that
warnings include the exact numeric
Main.lean runs the transducer on a small fixture stream and prints SSE-like
output.
- Build:
lake build - Run:
lake exe vericodetutorial