mlt-forge

An MLT-XML compiler with a render-in-the-loop oracle: turn an edit spec into a melt-renderable project that provably loads and renders frame-exact — and the dataset that pinpoints where LLMs break on timeline files.

Live dashboard · Code: MIT · Data: CC BY 4.0

Results at a glance (5-task slice) — an expert timeline scores 100%, a naive-llm that makes the documented mistake scores 0%, and every verdict is a real melt render verified frame-by-frame (offline and live in CI).

The open-source A/V stack (Shotcut, Kdenlive) stores projects as MLT XML: schema-bound, cross-referenced text that melt renders headlessly. That is exactly what an LLM should be able to write, and routinely can't. MLT XML needs unique producer ids with consistent cross-references, per-clip in/out frame math, and cumulative playlist offsets. LLM output crashes on load or silently desyncs; even agentic frameworks top out at ~78% on the related ffmpeg command task. The MLT docs themselves call localizing an edit "horrifically impractical."
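The three requirements above (unique ids, in/out frame math, cumulative offsets) can be illustrated with a minimal sketch. This is not the mltforge compiler; `compile_timeline` and its `(color, n_frames)` input shape are hypothetical, and the output is a bare-bones MLT document without a profile or tractor. It relies on the MLT convention that `in` and `out` are inclusive frame indices, so a clip of `n` frames starting at 0 ends at `n - 1`.

```python
import xml.etree.ElementTree as ET


def compile_timeline(clips):
    """Build a minimal MLT XML string from (color, n_frames) pairs.

    Hypothetical helper for illustration; returns (xml_string, offsets),
    where offsets[i] is the timeline frame at which clip i starts.
    """
    root = ET.Element("mlt")
    playlist = ET.Element("playlist", id="playlist0")
    offsets = []
    start = 0
    for i, (color, n_frames) in enumerate(clips):
        pid = f"producer{i}"           # unique id per producer
        out = n_frames - 1             # MLT in/out points are inclusive
        producer = ET.SubElement(
            root, "producer", {"id": pid, "in": "0", "out": str(out)}
        )
        ET.SubElement(producer, "property", name="mlt_service").text = "color"
        ET.SubElement(producer, "property", name="resource").text = color
        # The playlist entry must reference an id that actually exists.
        ET.SubElement(
            playlist, "entry", {"producer": pid, "in": "0", "out": str(out)}
        )
        offsets.append(start)          # cumulative playlist offset
        start += n_frames
    root.append(playlist)
    return ET.tostring(root, encoding="unicode"), offsets


xml_text, offsets = compile_timeline([("red", 50), ("blue", 25), ("green", 10)])
print(offsets)
print(xml_text)
```

An off-by-one in `out` or a stale `producer` reference is exactly the kind of slip that either fails to load or shifts every later clip on the timeline.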

mlt-forge turns that into a machine-graded benchmark. An edit spec (IR) compiles to MLT XML; every candidate is rendered by melt and fingerprinted, and the verifier proves the correct timeline renders frame-exact while wrong ones fail.

Why this is hard to fake: the dangerous failures render without erroring — a reordered timeline still produces a video of the right length. So the grader can't trust "did it run"; it checks exact frame count, duration, and a per-frame hash of the decoded pixels against a golden render.
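A sketch of what such a check might look like, using the fingerprint fields named in this README (`loads`, `nframes`, `duration_s`, `frame_hash`). The `Fingerprint` class, the `render_match` signature, the `fps` parameter, and the half-frame duration tolerance are assumptions for illustration, not the project's actual verifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Assumed shape of a render fingerprint (field names from the README)."""
    loads: bool
    nframes: int
    duration_s: float
    frame_hash: str


def render_match(candidate, golden, fps=25.0):
    """Return (passed, reason). Checks run from cheapest to strictest."""
    if not candidate.loads:
        return False, "failed to load"
    if candidate.nframes != golden.nframes:
        return False, "frame count mismatch"
    # Assumed tolerance: durations must agree to within half a frame.
    if abs(candidate.duration_s - golden.duration_s) > 0.5 / fps:
        return False, "duration mismatch"
    if candidate.frame_hash != golden.frame_hash:
        return False, "frame hash mismatch"
    return True, "ok"


golden = Fingerprint(True, 85, 3.4, "aaa")
reordered = Fingerprint(True, 85, 3.4, "bbb")  # same length, wrong content
print(render_match(golden, golden))
print(render_match(reordered, golden))
```

The reordered candidate passes every check except the last one, which is why a "did it run" or length-only grader would accept it.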

What's in the box

mltforge (Python)
  compiler ── edit IR (+ perturbation) -> MLT XML (unique ids, in/out math, playlist wiring)
  render   ── melt render (lossless FFV1) -> {loads, nframes, duration_s, frame_hash}
  runner   ── grade a candidate (offline recorded fingerprint, or live melt render)
  verifier ── render_match (loads + exact frames + duration + per-frame hash)
  score    ── aggregate -> results/results.json (leaderboard, failure taxonomy)
  export   ── AI-training data (records + chosen/rejected MLT-XML pairs)

dashboard (Next.js, static export)
  embedded video player (the actual melt render) · the compiled MLT XML (correct vs wrong) · verdicts

The wrong candidates are generated by perturbing the IR so that each one reproduces a documented MLT failure mode:

| Task | Category | The trap it catches |
| --- | --- | --- |
| `01-sequence-integrity` | integrity | a playlist entry referencing a missing producer id → fails to load |
| `02-clip-duration` | timing | a clip with the wrong frame count → wrong total duration |
| `03-clip-order` | ordering | clips in the wrong order → same duration, wrong content (caught by frame hash) |
| `04-dropped-clip` | timing | a dropped clip → silently shorter timeline |
| `05-montage-order` | ordering | a subtle mid-montage swap → identical length, wrong sequence |
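The perturbations in the table can be sketched on a toy IR. The IR shape (a dict with `clips` and a `playlist` of ids), the `perturb` function, and the perturbation names are all hypothetical; the real edit IR lives in the repository's schema and may look quite different.

```python
import copy


def perturb(ir, kind):
    """Return a deliberately broken copy of a toy edit IR (hypothetical shape)."""
    bad = copy.deepcopy(ir)  # never mutate the correct IR
    if kind == "dangling_ref":      # playlist entry points at a missing id
        bad["playlist"][0] = "missing_producer"
    elif kind == "wrong_duration":  # one clip has the wrong frame count
        bad["clips"][0]["frames"] += 5
    elif kind == "swap_order":      # same clips, wrong order
        bad["playlist"][0], bad["playlist"][1] = (
            bad["playlist"][1],
            bad["playlist"][0],
        )
    elif kind == "drop_clip":       # last clip silently removed
        bad["playlist"].pop()
    else:
        raise ValueError(f"unknown perturbation: {kind}")
    return bad


def total_frames(ir):
    """Timeline length in frames; raises KeyError on a dangling reference."""
    frames = {clip["id"]: clip["frames"] for clip in ir["clips"]}
    return sum(frames[pid] for pid in ir["playlist"])


ir = {
    "clips": [
        {"id": "a", "color": "red", "frames": 50},
        {"id": "b", "color": "blue", "frames": 25},
        {"id": "c", "color": "green", "frames": 10},
    ],
    "playlist": ["a", "b", "c"],
}
for kind in ("wrong_duration", "swap_order", "drop_clip"):
    print(kind, total_frames(perturb(ir, kind)))
```

Note that `swap_order` leaves the total length unchanged, which mirrors the ordering tasks above: only a content-level check can tell it apart from the correct timeline.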

Quickstart

make build        # worker image (melt + ffmpeg + python)
make run          # offline: grade recorded fingerprints -> results/results.json
make test         # schema + verifiers; correct passes, wrong fails
make run-live     # re-render every candidate with melt and grade
make build-data   # recompile + re-render -> golden fingerprints, manifests, web mp4s
make export       # dataset/{records,preference_pairs}.jsonl
make dashboard    # static site: video player + MLT XML + verdicts

Design principles

Proven, not asserted. Golden fingerprints come from a real melt render; the verifier re-renders and re-checks. Renders are lossless (FFV1) so the per-frame hash is exact and codec-independent.
Offline-first. Grader + dashboard run on recorded fingerprints with no renderer; live mode re-renders.
Contrastive + self-mining. Wrong candidates are auto-generated by perturbing the correct IR — clean chosen/rejected training pairs.
No external media. Timelines use color: producers, so everything is deterministic and self-contained.
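To make the "exact per-frame hash" principle concrete, here is one possible way to fingerprint decoded frames. The README does not say how `frame_hash` is actually computed, so the SHA-256 chaining below, the `frame_hash` function signature, and the `solid_frame` stand-in for decoded pixels are assumptions.

```python
import hashlib


def frame_hash(frames):
    """Order-sensitive digest over an iterable of raw frame byte buffers.

    Each frame is hashed on its own, then the per-frame digests are fed
    into an outer hash, so both frame content and frame order matter.
    """
    outer = hashlib.sha256()
    for frame in frames:
        outer.update(hashlib.sha256(frame).digest())
    return outer.hexdigest()


def solid_frame(rgb, width=4, height=4):
    """Stand-in for a decoded frame: a solid color as packed RGB bytes."""
    return bytes(rgb) * (width * height)


red = solid_frame((255, 0, 0))
blue = solid_frame((0, 0, 255))

correct = [red, red, blue, blue]
swapped = [blue, blue, red, red]   # same frames, same length, wrong order

print(frame_hash(correct) == frame_hash(list(correct)))
print(frame_hash(correct) == frame_hash(swapped))
```

Because the renders are lossless and the sources are solid colors, decoded bytes should be reproducible across runs, which is what lets an exact digest stand in for a frame-by-frame comparison.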

Security

No credentials are needed or committed. The live runner invokes melt/ffmpeg on generated projects (color producers only). See SECURITY.md.

License

Code: MIT. Benchmark data (tasks/, dataset/): CC BY 4.0.
