Kevin1289/mlt-forge

mlt-forge

An MLT-XML compiler with a render-in-the-loop oracle: turn an edit spec into a melt-renderable project that provably loads and renders frame-exact — and the dataset that pinpoints where LLMs break on timeline files.

ci deploy-dashboard  ·  ▶ Live dashboard  ·  code MIT · data CC BY 4.0

mlt-forge dashboard — leaderboard, failure taxonomy, an embedded player of the actual melt render, and the compiled MLT XML

Results at a glance (5-task slice) — an expert timeline scores 100%, a naive-llm that makes the documented mistake scores 0%, and every verdict is a real melt render verified frame-by-frame (offline and live in CI).

The open-source A/V stack (Shotcut, Kdenlive) stores projects as MLT XML: schema-bound, cross-referenced text that melt renders headlessly. That is exactly the kind of file an LLM should be able to write, and routinely can't. MLT XML needs unique producer ids with consistent cross-references, per-clip in/out frame math, and cumulative playlist offsets. LLM output either crashes on load or silently desyncs; even agentic frameworks top out at about 78% on the related ffmpeg command task. The MLT docs themselves call localizing an edit "horrifically impractical."

mlt-forge turns that into a machine-graded benchmark. An edit spec (IR) compiles to MLT XML; every candidate is rendered by melt and fingerprinted, and the verifier proves the correct timeline renders frame-exact while wrong ones fail.
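To make the bookkeeping concrete, here is a minimal sketch of the compile step, not the mltforge compiler itself. The function name `compile_timeline` and the `(color, n_frames)` input shape are assumptions for illustration. It relies on the MLT convention that `in` and `out` are inclusive frame indices, so a clip of `n` frames ends at `out = n - 1`.

```python
import xml.etree.ElementTree as ET


def compile_timeline(clips):
    """Compile [(color, n_frames), ...] into a minimal MLT XML string.

    Hypothetical helper for illustration; returns the XML, the timeline
    start frame of each clip, and the total frame count.
    """
    root = ET.Element("mlt")
    playlist = ET.Element("playlist", id="playlist0")
    starts = []
    offset = 0  # cumulative playlist offset, in frames
    for i, (color, n_frames) in enumerate(clips):
        pid = f"producer{i}"   # unique id per producer
        out = n_frames - 1     # in/out are inclusive frame indices
        prod = ET.SubElement(root, "producer",
                             {"id": pid, "in": "0", "out": str(out)})
        ET.SubElement(prod, "property", name="mlt_service").text = "color"
        ET.SubElement(prod, "property", name="resource").text = color
        # The playlist entry must reference an id that was declared above.
        ET.SubElement(playlist, "entry",
                      {"producer": pid, "in": "0", "out": str(out)})
        starts.append(offset)
        offset += n_frames
    root.append(playlist)
    return ET.tostring(root, encoding="unicode"), starts, offset


xml_text, starts, total = compile_timeline([("red", 25), ("blue", 50)])
print(starts, total)
```

An off-by-one in `out`, a duplicated `pid`, or an entry that points at an undeclared producer is the kind of mistake the benchmark is built to catch.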

Why this is hard to fake: the dangerous failures render without erroring — a reordered timeline still produces a video of the right length. So the grader can't trust "did it run"; it checks exact frame count, duration, and a per-frame hash of the decoded pixels against a golden render.
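The check described above can be sketched as a comparison of two fingerprints. The field names follow the `{loads, nframes, duration_s, frame_hash}` record listed under "What's in the box"; the `Fingerprint` class, the function signature, and the duration tolerance are assumptions, not the project's actual verifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    loads: bool
    nframes: int
    duration_s: float
    frame_hash: str


def render_match(candidate, golden, tol_s=1e-6):
    """Return (passed, reasons). tol_s is an assumed float tolerance."""
    if not candidate.loads:
        return False, ["failed to load"]
    reasons = []
    if candidate.nframes != golden.nframes:
        reasons.append("frame count")
    if abs(candidate.duration_s - golden.duration_s) > tol_s:
        reasons.append("duration")
    if candidate.frame_hash != golden.frame_hash:
        reasons.append("frame hash")
    return (not reasons), reasons


golden = Fingerprint(True, 75, 3.0, "aaa")
reordered = Fingerprint(True, 75, 3.0, "bbb")  # same length, wrong content
print(render_match(reordered, golden))
```

A reordered timeline passes the load, frame-count, and duration checks, so only the frame hash separates it from the golden render.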

What's in the box

mltforge (Python)
  compiler ── edit IR (+ perturbation) -> MLT XML (unique ids, in/out math, playlist wiring)
  render   ── melt render (lossless FFV1) -> {loads, nframes, duration_s, frame_hash}
  runner   ── grade a candidate (offline recorded fingerprint, or live melt render)
  verifier ── render_match (loads + exact frames + duration + per-frame hash)
  score    ── aggregate -> results/results.json (leaderboard, failure taxonomy)
  export   ── AI-training data (records + chosen/rejected MLT-XML pairs)

dashboard (Next.js, static export)
  embedded video player (the actual melt render) · the compiled MLT XML (correct vs wrong) · verdicts

The wrong candidates are generated by perturbing the IR — exactly the documented MLT failure modes:

Task                   Category   The trap it catches
01-sequence-integrity  integrity  a playlist entry referencing a missing producer id → fails to load
02-clip-duration       timing     a clip with the wrong frame count → wrong total duration
03-clip-order          ordering   clips in the wrong order → same duration, wrong content (caught by frame hash)
04-dropped-clip        timing     a dropped clip → silently shorter timeline
05-montage-order       ordering   a subtle mid-montage swap → identical length, wrong sequence
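As a rough illustration of how such traps might be produced, the sketch below perturbs a toy IR. The IR shape (`producers` plus a `playlist` of `ref`/`frames` entries), the `perturb` function, and the perturbation names are all hypothetical; the real mltforge IR may look different.

```python
import copy


def perturb(ir, kind):
    """Return a deliberately wrong copy of the IR; the input is not mutated."""
    bad = copy.deepcopy(ir)
    entries = bad["playlist"]
    if kind == "dangling_ref":
        entries[0]["ref"] = "missing_producer"  # id that is never declared
    elif kind == "wrong_duration":
        entries[0]["frames"] += 1               # off-by-one frame count
    elif kind == "swap_order":
        entries[0], entries[1] = entries[1], entries[0]
    elif kind == "drop_clip":
        entries.pop()                           # silently shorter timeline
    else:
        raise ValueError(f"unknown perturbation: {kind}")
    return bad


ir = {
    "producers": {"p0": "red", "p1": "blue"},
    "playlist": [{"ref": "p0", "frames": 25}, {"ref": "p1", "frames": 50}],
}
swapped = perturb(ir, "swap_order")
print([e["ref"] for e in swapped["playlist"]])
```

Because each wrong candidate differs from the correct IR by exactly one known change, the failure category is known before anything is rendered.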

Quickstart

make build        # worker image (melt + ffmpeg + python)
make run          # offline: grade recorded fingerprints -> results/results.json
make test         # schema + verifiers; correct passes, wrong fails
make run-live     # re-render every candidate with melt and grade
make build-data   # recompile + re-render -> golden fingerprints, manifests, web mp4s
make export       # dataset/{records,preference_pairs}.jsonl
make dashboard    # static site: video player + MLT XML + verdicts
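For orientation, a chosen/rejected pair written by the export step might be serialized roughly as follows. This is a sketch only: the field names (`task`, `prompt`, `chosen`, `rejected`) and both helper functions are assumptions, and the actual schema of dataset/preference_pairs.jsonl is defined by the repository.

```python
import json


def preference_record(task_id, prompt, chosen_xml, rejected_xml):
    """Build one hypothetical chosen/rejected training pair."""
    return {
        "task": task_id,
        "prompt": prompt,
        "chosen": chosen_xml,      # MLT XML that renders frame-exact
        "rejected": rejected_xml,  # perturbed MLT XML that fails the verifier
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


pairs = [preference_record("03-clip-order", "red then blue",
                           "<mlt>correct</mlt>", "<mlt>swapped</mlt>")]
print(to_jsonl(pairs), end="")
```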

Design principles

  • Proven, not asserted. Golden fingerprints come from a real melt render; the verifier re-renders and re-checks. Renders are lossless (FFV1) so the per-frame hash is exact and codec-independent.
  • Offline-first. The grader and dashboard run on recorded fingerprints and need no renderer; live mode re-renders every candidate with melt.
  • Contrastive + self-mining. Wrong candidates are auto-generated by perturbing the correct IR, which yields clean chosen/rejected training pairs.
  • No external media. Timelines use color: producers, so everything is deterministic and self-contained.
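The per-frame hash idea can be sketched on raw decoded frames. The choice of SHA-256, the rollup scheme, and the helper names are assumptions, and the synthetic `solid_frame` stands in for frames decoded from a lossless render of color: producers.

```python
import hashlib


def frame_hashes(frames):
    """Hash each decoded frame (raw pixel bytes) on its own."""
    return [hashlib.sha256(f).hexdigest() for f in frames]


def timeline_hash(frames):
    """Fold the per-frame hashes, in order, into one digest."""
    rollup = hashlib.sha256()
    for h in frame_hashes(frames):
        rollup.update(h.encode("ascii"))
    return rollup.hexdigest()


def solid_frame(rgb, width=4, height=4):
    """Synthetic stand-in for a decoded frame of a solid color."""
    return bytes(rgb) * (width * height)


red, blue = solid_frame((255, 0, 0)), solid_frame((0, 0, 255))
correct = [red] * 3 + [blue] * 3
swapped = [blue] * 3 + [red] * 3
print(len(correct) == len(swapped),
      timeline_hash(correct) == timeline_hash(swapped))
```

Because the rollup is order-sensitive, a swapped timeline with the same frame count still produces a different digest, and identical frames always reproduce the same one.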

Security

No credentials are needed or committed. The live runner invokes melt/ffmpeg on generated projects (color producers only). See SECURITY.md.

License

Code: MIT. Benchmark data (tasks/, dataset/): CC BY 4.0.
