Skip to content

bytedance/Bernini

Repository files navigation

Bernini

Latent Semantic Planning for Video Diffusion

Chenchen Liu*, Junyi Chen*, Lei Li*, Lu Chi*,§, Mingzhen Sun*, Zhuoying Li*, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

* Equal contribution   Corresponding author  § Project lead

arXiv Project Page HuggingFace

🎉 News

✨ Highlights

Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.

On video editing, Bernini reaches the first tier among leading closed-source commercial models. The leaderboard below comes from our self-built arena platform, where human annotators blindly vote on paired edits and the votes are aggregated into a Bradley-Terry score and a pairwise win-rate matrix.

Video editing arena: Bradley-Terry leaderboard and pairwise win-rate matrix

Benchmark results across the released models:

Model EditVerse OpenVE OpenS2V VBench Bernini-v2v (OS) Bernini-rv2v (OS)
Bernini-R 1.3B 7.74 3.65 62.18 84.69 3.15 3.21
Bernini-R 14B 7.99 3.78 62.94 84.64 3.25 3.34
Bernini 7B+14B 8.02 4.03 62.30 84.37 3.49 3.48

🧾 Models

The repository provides two model families. Pick one and follow its guide for weight download, inference commands, and ready-to-run scripts:

Bernini Bernini-R
What it is Full pipeline: MLLM-based semantic planner + DiT-based renderer Renderer-only model fine-tuned from the Wan diffusion renderer
Strengths Decomposes complex instructions and plans semantic changes before rendering; stronger instruction following Strong rendering and consistency with fewer moving parts; simpler setup
Checkpoints ByteDance/Bernini-Diffusers ByteDance/Bernini-R-Diffusers (14B) · ByteDance/Bernini-R-1.3B-Diffusers · ByteDance/Bernini-R (separate ckpts)
Guide docs/bernini.md docs/bernini_r.md

Both families share the same task interface: t2i, i2i, t2v, v2v, rv2v, and r2v.

📦 Installation

Requirements

  • Python 3.11.2.
  • CUDA GPU — a Hopper GPU (H100/H800/H200) is recommended so FlashAttention-3 can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.
  • CUDA toolkit 12.4 (matches the pinned torch==2.5.1+cu124; 12.3+ is the minimum if you build FlashAttention-3).
  • Pinned in requirements.txt: torch==2.5.1+cu124, diffusers==0.35.2, accelerate==0.34.2, transformers==4.57.3.

Reference environment (developed and tested on this setup):

Component Version
GPU NVIDIA H100
CUDA 12.4
Python 3.11.2
PyTorch 2.5.1+cu124

Install

git clone https://github.com/bytedance/Bernini.git bernini && cd bernini
pip install -r requirements.txt
# Open-VeOmni is required. Install it with --no-deps so it does not pull in a
# different torch build and override the pinned torch==2.5.1+cu124:
pip install --no-deps git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.10

Open-VeOmni (Apache-2.0, Python 3.11) is a required dependency — all inference paths import it, including single-GPU.

Optional extras:

  • Faster attention (FlashAttention-2 by default):
    • FlashAttention-2 — general CUDA GPUs (incl. A100/A800): pip install flash-attn==2.8.3.
    • FlashAttention-3 — Hopper only (H100/H800/H200, CUDA ≥ 12.3, PyTorch ≥ 2.4). flash_attn_interface is not on PyPI; build it from the flash-attention repo's hopper/ directory at tag v2.8.3:
      git clone https://github.com/Dao-AILab/flash-attention.git
      cd flash-attention && git checkout v2.8.3
      cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user

🚀 Usage

Weight download and per-task inference commands are model-specific — follow docs/bernini.md or docs/bernini_r.md. The pieces below are shared by both pipelines.

Case files

A run is described by a case file — a small JSON under assets/testcases/ that bundles one task's routing and inputs (task_type, guidance_mode, prompt, source media, output). This keeps long prompts out of the command line. Each task has a directory under assets/testcases/ with one or more bundled examples; see the case-file format.

Prompt enhancer (highly recommended)

--use_pe enhances the prompt through an OpenAI-compatible endpoint and is recommended for best generation quality. The openai SDK is installed by requirements.txt; configure the endpoint with environment variables:

export BERNINI_PE_API_KEY=...      # or OPENAI_API_KEY
export BERNINI_PE_BASE_URL=...     # or OPENAI_BASE_URL
export BERNINI_PE_MODEL=...        # vision-capable chat model

Gradio demo

gradio_demo.py exposes the same pipeline through a Gradio UI for both Bernini and Bernini-R: the task-type dropdown auto-fills guidance_mode (still user-editable), uploaded media is routed to the matching slot, and the result is rendered inline. Launch commands are in each model's guide (Bernini · Bernini-R).

Add --use_pe (with the prompt-enhancer environment variables above) to enable prompt enhancement; the in-UI checkbox is a per-request switch on top of this flag.

📑 Citation

If you use Bernini in your research, please cite:

@article{bernini,
  title   = {Bernini: Latent Semantic Planning for Video Diffusion},
  author  = {Chenchen Liu and Junyi Chen and Lei Li and Lu Chi and Mingzhen Sun and Zhuoying Li and Yi Fu and Ruoyu Guo and Yiheng Wu and Ge Bai and Zehuan Yuan},
  journal = {arXiv preprint arXiv:2605.22344},
  year    = {2026}
}

🙏 Acknowledgements

Bernini builds on several outstanding open-source projects:

We thank the authors and communities of these projects for their contributions.

📄 License

Apache License 2.0. See LICENSE.

About

Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors