WillWu111/ViBe

ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images 🎥✨

Teaser – Ultra-High-Resolution Results

Teaser: ViBe generates ultra-high-resolution videos with rich details and coherent global structure. ✨

Introduction 🎬

ViBe presents an image-only training framework for ultra-high-resolution video generation, aiming to unlock 4K video synthesis from pre-trained video Diffusion Transformers without expensive high-resolution video training. Instead of relying on large-scale high-resolution video data, ViBe explores how a low-resolution video generator can be effectively adapted using only image supervision.

A central challenge in this setting is the modality gap between images and videos. Directly fine-tuning a video model on high-resolution images often leads to artifacts and residual noise during video inference. To address this, ViBe introduces Relay LoRA, a two-stage adaptation strategy that separates modality alignment from spatial extrapolation, enabling high-resolution generation while reducing image-induced degradation.

To further improve generation quality, ViBe proposes Global-Coarse-Local-Fine (GCLF) Attention, which combines local fine-grained modeling with compact global context aggregation. This design improves detail fidelity while maintaining semantic consistency across the entire frame.

ViBe also introduces a High-Frequency-Awareness Training Objective (HFATO) to enhance the recovery of fine textures and sharp structures from degraded latent inputs. This leads to clearer and more detailed high-resolution video synthesis.

Extensive experiments on modern video DiTs, such as Wan2.2, show that ViBe can generate 4K videos with strong semantic coherence and rich visual details, achieving state-of-the-art performance on VBench.

Methodology 🧠

ViBe Framework

Figure: Overview of the ViBe framework. 🧩

Installation 📦

ViBe requires torch==2.7.1 (for flex attention). You can set up a new environment with:

conda create -n ViBe python=3.12
conda activate ViBe
cd DiffSynth-Studio-main
pip install -e .

Clone this repo to your project directory:

git clone https://github.com/WillWu111/ViBe.git

Models 🤖

We provide our trained LoRA models for ultra-high-resolution video synthesis. You can download them from Hugging Face:

| Model | Description | Download |
|-------|-------------|----------|
| ViBe LoRA | Relay LoRA for high-resolution video generation | Download |

Inference 🚀

You can specify target_height, target_width, and spatial_down_factor as command-line arguments:

python ViBe.py --target_height 1408 --target_width 2560 --spatial_down_factor 2

Note: Ensure that the latent height and width are divisible by spatial_down_factor.
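The divisibility requirement in the note can be checked before launching a run. The helper below is a minimal sketch, not part of ViBe: `latent_size` and its `vae_stride` parameter are hypothetical names, and the default stride of 8 is an assumption about the VAE's spatial compression that should be replaced with the value for the checkpoint you actually use.

```python
def latent_size(target_height: int, target_width: int,
                spatial_down_factor: int, vae_stride: int = 8) -> tuple[int, int]:
    """Return the latent (height, width) for a target resolution, or raise if invalid.

    vae_stride is an ASSUMED spatial compression factor of the VAE; it is not
    stated in this README, so check it against your checkpoint.
    """
    # The pixel resolution must map onto a whole number of latent cells
    for name, value in (("target_height", target_height), ("target_width", target_width)):
        if value % vae_stride != 0:
            raise ValueError(f"{name}={value} is not a multiple of vae_stride={vae_stride}")
    latent_h, latent_w = target_height // vae_stride, target_width // vae_stride

    # The latent grid must split evenly by spatial_down_factor
    for name, value in (("latent height", latent_h), ("latent width", latent_w)):
        if value % spatial_down_factor != 0:
            raise ValueError(
                f"{name}={value} is not divisible by spatial_down_factor={spatial_down_factor}"
            )
    return latent_h, latent_w


# The command shown above, under the assumed stride of 8
print(latent_size(1408, 2560, spatial_down_factor=2))  # (176, 320)
```

Under the same assumption, a height of 1400 would be rejected, because its latent height of 175 is not divisible by 2.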

Training 🏋️

Our training loss functions are defined in diffsynth/diffusion/loss.py. ViBe uses a two-stage training approach with different loss functions:

Stage 1: Modality Alignment (First Stage)

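# First Stage: standard flow-matching objective (simplified excerpt).
# min_timestep_boundary, max_timestep_boundary, and models are not defined in
# this excerpt; see the full function in diffsynth/diffusion/loss.py.
import torch
import torch.nn.functional as F

def FlowMatchSFTLoss(pipe, **inputs):
    # Sample one training timestep
    timestep_id = torch.randint(min_timestep_boundary, max_timestep_boundary, (1,))
    timestep = pipe.scheduler.timesteps[timestep_id]

    # Noise the clean latents and build the regression target
    noise = torch.randn_like(inputs["input_latents"])
    x_noisy = pipe.scheduler.add_noise(inputs["input_latents"], noise, timestep)
    inputs["latents"] = x_noisy  # assumption: model_fn reads the noised latents from this key
    training_target = pipe.scheduler.training_target(inputs["input_latents"], noise, timestep)

    # Predict the target from the noised latents and regress onto it
    noise_pred = pipe.model_fn(**models, **inputs, timestep=timestep)
    loss = F.mse_loss(noise_pred, training_target)
    return loss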
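
For reference, the Stage 1 objective can be written out as below. This is a sketch that assumes the scheduler uses the usual rectified-flow interpolation with a noise level σ in [0, 1]; that assumption is consistent with the recovery step `x0_pred = x_noisy - sigma * flow_pred` used in Stage 2. The symbols are notation chosen here, not taken from the paper.

```latex
\begin{aligned}
x_t &= (1-\sigma)\,x_0 + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
v^\star &= \epsilon - x_0 \\
\mathcal{L}_{\text{stage 1}} &= \operatorname{MSE}\bigl(v_\theta(x_t, t),\; v^\star\bigr)
\end{aligned}
```

With these definitions, `x_t - σ v* = x_0`, so a perfect flow prediction recovers the clean latents exactly.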

Stage 2: Spatial Extrapolation (Second Stage)

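# Second Stage: x0 prediction with down-up cycle (HFATO), simplified excerpt.
# timestep and models are obtained as in Stage 1; sigma is the noise level
# at that timestep. None of the three is defined in this excerpt.
import torch
import torch.nn.functional as F

def FlowMatchSFTLoss(pipe, **inputs):
    x0 = inputs["input_latents"]  # clean latents, shape (B, C, T, H, W)
    T, H, W = x0.shape[-3:]
    noise = torch.randn_like(x0)

    # Down-sample and up-sample x0 spatially (time axis unchanged) to create the degraded input
    x0_down = F.interpolate(x0, scale_factor=(1.0, 0.5, 0.5), mode="trilinear")
    x0_up = F.interpolate(x0_down, size=(T, H, W), mode="trilinear")

    # Add noise to the degraded input
    x_noisy = pipe.scheduler.add_noise(x0_up, noise, timestep)
    inputs["latents"] = x_noisy  # assumption: model_fn reads the noised latents from this key

    # Predict flow and recover x0
    flow_pred = pipe.model_fn(**models, **inputs, timestep=timestep)
    x0_pred = x_noisy - sigma * flow_pred

    # Loss: predict original x0 (not degraded x0_up)
    loss = F.mse_loss(x0_pred, x0)
    return loss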

Key Difference: Stage 1 trains on original latents, while Stage 2 trains the model to recover high-frequency details from degraded latent inputs, enhancing the model's ability to synthesize fine textures.
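
To make the Stage 2 idea concrete outside the training pipeline, here is a small self-contained sketch in plain PyTorch. It is an illustration under stated assumptions, not the ViBe implementation: `hfato_loss` is a hypothetical name, the `Conv3d` layer is a toy stand-in for the video DiT, and the noising rule `(1 - sigma) * x + sigma * noise` is assumed rather than taken from the scheduler.

```python
import torch
import torch.nn.functional as F


def hfato_loss(model, x0, sigma):
    """Toy HFATO-style loss: denoise a blurred latent, but regress onto the sharp one.

    model : callable (x_noisy, sigma) -> predicted flow, same shape as x_noisy
            (hypothetical stand-in for the video DiT)
    x0    : clean latents, shape (B, C, T, H, W)
    sigma : float noise level in (0, 1]
    """
    T, H, W = x0.shape[-3:]

    # Down-up cycle on the spatial axes only: removes high-frequency detail
    x0_down = F.interpolate(x0, scale_factor=(1.0, 0.5, 0.5), mode="trilinear")
    x0_up = F.interpolate(x0_down, size=(T, H, W), mode="trilinear")

    # ASSUMED rectified-flow noising, applied to the degraded latents
    noise = torch.randn_like(x0)
    x_noisy = (1.0 - sigma) * x0_up + sigma * noise

    # Recover an x0 estimate from the predicted flow, then compare with the sharp x0
    flow_pred = model(x_noisy, sigma)
    x0_pred = x_noisy - sigma * flow_pred
    return F.mse_loss(x0_pred, x0)


# Tiny stand-in network so the sketch runs on CPU
net = torch.nn.Conv3d(4, 4, kernel_size=3, padding=1)
x0 = torch.randn(1, 4, 2, 16, 16)
loss = hfato_loss(lambda x, s: net(x), x0, sigma=0.7)
loss.backward()
print(loss.item())
```

Because the target is the sharp `x0` while the input is built from the blurred `x0_up`, the loss can only reach zero if the model restores the detail removed by the down-up cycle, in addition to removing the noise.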

Results 🏆

ViBe outperforms previous state-of-the-art methods on VBench, with significant improvements in aesthetic appeal, imaging quality, and overall consistency. 🌈

Citation 📚

If you use ViBe in your research, please cite our paper:

@misc{wu2026vibeultrahighresolutionvideosynthesis,
      title={ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images}, 
      author={Yunfeng Wu and Hongying Cheng and Zihao He and Songhua Liu},
      year={2026},
      eprint={2603.23326},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.23326}, 
}
