An MLT-XML compiler with a render-in-the-loop oracle: turn an edit spec into a melt-renderable project that provably loads and renders frame-exact — and the dataset that pinpoints where LLMs break on timeline files.
Live dashboard · code MIT · data CC BY 4.0
Results at a glance (5-task slice): an `expert` timeline scores 100%, a `naive-llm` that makes the documented mistake scores 0%, and every verdict is a real `melt` render verified frame by frame (offline and live in CI).
The open-source A/V stack (Shotcut, Kdenlive) stores projects as MLT XML — schema-bound, cross-referenced text rendered headlessly by melt. That's exactly what an LLM should be able to write, and routinely can't: MLT XML needs unique producer ids and their cross-references, per-clip in/out frame math, and cumulative playlist offsets. LLM output crashes on load or silently desyncs; even agentic frameworks top out ~78% on the related ffmpeg command task. The MLT docs themselves call localizing an edit "horrifically impractical."
mlt-forge turns that into a machine-graded benchmark. An edit spec (IR) compiles to MLT XML; every candidate is rendered by melt and fingerprinted, and the verifier proves the correct timeline renders frame-exact while wrong ones fail.
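As a rough illustration of the compile step, the sketch below turns a tiny edit spec into MLT XML with unique producer ids, inclusive in/out values, and a running playlist offset. The IR shape (a list of dicts with `color` and `frames` keys) and the function names `compile_ir` and `dangling_refs` are assumptions made for this example, not mltforge's actual API, and the output omits the `profile` element a full project would normally carry.

```python
import xml.etree.ElementTree as ET


def compile_ir(clips):
    """Compile a hypothetical IR (list of {'color', 'frames'}) to MLT XML.

    Returns (xml_text, total_frames).
    """
    root = ET.Element("mlt")
    playlist = ET.Element("playlist", {"id": "playlist0"})
    offset = 0  # cumulative timeline position, in frames
    for i, clip in enumerate(clips):
        pid = f"producer{i}"           # unique id per clip
        out = clip["frames"] - 1       # MLT in/out points are inclusive
        prod = ET.SubElement(root, "producer",
                             {"id": pid, "in": "0", "out": str(out)})
        ET.SubElement(prod, "property", name="mlt_service").text = "color"
        ET.SubElement(prod, "property", name="resource").text = clip["color"]
        # The entry must reference an id that exists above.
        ET.SubElement(playlist, "entry",
                      {"producer": pid, "in": "0", "out": str(out)})
        offset += clip["frames"]
    root.append(playlist)  # playlist goes after the producers it references
    return ET.tostring(root, encoding="unicode"), offset


def dangling_refs(xml_text):
    """List playlist entries whose producer id is not defined."""
    root = ET.fromstring(xml_text)
    ids = {p.get("id") for p in root.iter("producer")}
    return [e.get("producer") for e in root.iter("entry")
            if e.get("producer") not in ids]
```

A 25-frame clip therefore gets `out="24"`, which is the off-by-one that hand-written timelines tend to get wrong.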
Why this is hard to fake: the dangerous failures render without erroring — a reordered timeline still produces a video of the right length. So the grader can't trust "did it run"; it checks exact frame count, duration, and a per-frame hash of the decoded pixels against a golden render.
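The check described above can be sketched as follows. The field names `loads`, `nframes`, `duration_s`, and `frame_hash` follow the fingerprint fields used in this README; the `Fingerprint` class, the `fingerprint_frames` helper, the choice of SHA-256, and the duration tolerance are assumptions for illustration. Frames are stand-in byte strings here, not real decoded pixels.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    loads: bool
    nframes: int
    duration_s: float
    frame_hash: str


def fingerprint_frames(frames, fps):
    """Fold per-frame digests into one order-sensitive hash."""
    digest = hashlib.sha256()
    for frame in frames:
        digest.update(hashlib.sha256(frame).digest())
    return Fingerprint(True, len(frames), len(frames) / fps,
                       digest.hexdigest())


def render_match(candidate, golden, tol=1e-6):
    """Pass only if the render loads and matches on count, duration, pixels."""
    return (candidate.loads
            and candidate.nframes == golden.nframes
            and abs(candidate.duration_s - golden.duration_s) <= tol
            and candidate.frame_hash == golden.frame_hash)
```

Because each frame digest is fed into the running hash in order, a reordered timeline with the same frame count and duration still produces a different `frame_hash` and fails.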
mltforge (Python)
compiler ── edit IR (+ perturbation) -> MLT XML (unique ids, in/out math, playlist wiring)
render ── melt render (lossless FFV1) -> {loads, nframes, duration_s, frame_hash}
runner ── grade a candidate (offline recorded fingerprint, or live melt render)
verifier ── render_match (loads + exact frames + duration + per-frame hash)
score ── aggregate -> results/results.json (leaderboard, failure taxonomy)
export ── AI-training data (records + chosen/rejected MLT-XML pairs)
dashboard (Next.js, static export)
embedded video player (the actual melt render) · the compiled MLT XML (correct vs wrong) · verdicts
The wrong candidates are generated by perturbing the IR — exactly the documented MLT failure modes:
| Task | Category | The trap it catches |
|---|---|---|
| `01-sequence-integrity` | integrity | a playlist entry referencing a missing producer id → fails to load |
| `02-clip-duration` | timing | a clip with the wrong frame count → wrong total duration |
| `03-clip-order` | ordering | clips in the wrong order → same duration, wrong content (caught by frame hash) |
| `04-dropped-clip` | timing | a dropped clip → silently shorter timeline |
| `05-montage-order` | ordering | a subtle mid-montage swap → identical length, wrong sequence |
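The perturbations in the table can be sketched as small pure functions over the IR. The IR shape used here (a `producers` map of id to frame count plus a `playlist` of ids) and all function names are hypothetical, chosen only to make the four failure modes concrete.

```python
import copy


def total_frames(ir):
    """Timeline length, counting only entries that resolve to a producer."""
    return sum(ir["producers"][p] for p in ir["playlist"]
               if p in ir["producers"])


def dangling_reference(ir, index=0):
    """Integrity: point a playlist entry at an id that does not exist."""
    bad = copy.deepcopy(ir)
    bad["playlist"][index] = "missing_producer"
    return bad


def wrong_duration(ir, producer_id, delta):
    """Timing: change one clip's frame count."""
    bad = copy.deepcopy(ir)
    bad["producers"][producer_id] += delta
    return bad


def swap_clips(ir, i, j):
    """Ordering: swap two entries; total length is unchanged."""
    bad = copy.deepcopy(ir)
    bad["playlist"][i], bad["playlist"][j] = bad["playlist"][j], bad["playlist"][i]
    return bad


def drop_clip(ir, index):
    """Timing: remove one entry, silently shortening the timeline."""
    bad = copy.deepcopy(ir)
    del bad["playlist"][index]
    return bad
```

Each function returns a modified copy, so the correct IR stays available as the reference for the wrong candidate it produced.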
make build # worker image (melt + ffmpeg + python)
make run # offline: grade recorded fingerprints -> results/results.json
make test # schema + verifiers; correct passes, wrong fails
make run-live # re-render every candidate with melt and grade
make build-data # recompile + re-render -> golden fingerprints, manifests, web mp4s
make export # dataset/{records,preference_pairs}.jsonl
make dashboard # static site: video player + MLT XML + verdicts

- Proven, not asserted. Golden fingerprints come from a real `melt` render; the verifier re-renders and re-checks. Renders are lossless (FFV1), so the per-frame hash is exact and codec-independent.
- Offline-first. Grader + dashboard run on recorded fingerprints with no renderer; live mode re-renders.
- Contrastive + self-mining. Wrong candidates are auto-generated by perturbing the correct IR — clean chosen/rejected training pairs.
- No external media. Timelines use `color:` producers, so everything is deterministic and self-contained.
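To make the contrastive pairing concrete, here is a minimal sketch of how chosen/rejected records might be serialized to JSONL. The record keys (`task`, `prompt`, `chosen`, `rejected`, `failure_mode`) and both function names are assumptions; the real schema of `dataset/preference_pairs.jsonl` may differ.

```python
import json


def preference_pairs(task_id, spec, correct_xml, wrong_xmls):
    """Yield one record per wrong candidate, paired with the correct XML.

    wrong_xmls maps a failure-mode label to the perturbed MLT XML.
    """
    for label, wrong_xml in wrong_xmls.items():
        yield {
            "task": task_id,
            "prompt": spec,
            "chosen": correct_xml,
            "rejected": wrong_xml,
            "failure_mode": label,
        }


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

One correct timeline can anchor several pairs, one per perturbation, which is what makes the wrong candidates cheap to mine.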
No credentials are needed or committed. The live runner invokes melt/ffmpeg on generated projects (color producers only). See SECURITY.md.