This repository will host the official implementation of Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent rewrites a raw user request into a typed edit plan (instruction, task label, image-search query, mask phrase) and dispatches it to the video DiT, resolving textual and visual underspecification before generation.
- Unified video editing - replacement, removal, style transfer, and reference-driven insertion under one set of weights
- Tool-using VLM agent - rewrites a raw user request into a four-field edit plan
- Resolves underspecification - fills in missing reference images via web image search and missing masks via grounded segmentation
- AgentEdit-Bench - evaluates agent-enhanced video editing under textual and visual underspecification
- [ ] The code is being prepared for release. ETA: late May 2026.
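
Until the code is released, the four-field edit plan described above (instruction, task label, image-search query, mask phrase) can be pictured with a small Python sketch. This is not Aurora's actual API: the names `EditPlan` and `TaskLabel`, the enum values, and the choice to make the image-search query and mask phrase optional are all assumptions made for illustration, based only on the fields and edit types listed in this README.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskLabel(Enum):
    """Hypothetical labels, mirroring the edit types named in this README."""
    REPLACEMENT = "replacement"
    REMOVAL = "removal"
    STYLE_TRANSFER = "style_transfer"
    INSERTION = "insertion"


@dataclass(frozen=True)
class EditPlan:
    """Hypothetical container for the four fields of an edit plan."""
    instruction: str                          # rewritten edit instruction
    task_label: TaskLabel                     # which kind of edit to run
    image_search_query: Optional[str] = None  # assumed optional: set when a reference image is missing
    mask_phrase: Optional[str] = None         # assumed optional: set when a mask must be grounded

    def needs_reference_image(self) -> bool:
        # A query is present only when the agent must fetch a reference image.
        return self.image_search_query is not None

    def needs_mask(self) -> bool:
        # A phrase is present only when the agent must segment a region.
        return self.mask_phrase is not None


# Illustrative plan for a request such as "swap the dog for a corgi".
plan = EditPlan(
    instruction="Replace the dog with a corgi wearing a red scarf",
    task_label=TaskLabel.REPLACEMENT,
    image_search_query="corgi wearing a red scarf",
    mask_phrase="the dog",
)
```

In this sketch, a removal request would typically carry a mask phrase but no image-search query, so `needs_reference_image()` would return `False`. How the released agent actually encodes absent fields may differ.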