SVG: Latent Diffusion Model without Variational Autoencoder

Official PyTorch Implementation



arXiv · GitHub · PDF · License · 🤗 HF Model: SVG · 🤗 HF Model: SVG-T2I

Minglei Shi1*, Haolin Wang1*, Wenzhao Zheng1†, Ziyang Yuan2, Xiaoshi Wu2, Xintao Wang2, Pengfei Wan2, Jie Zhou1, Jiwen Lu1
(*equal contribution, listed in alphabetical order; †project lead)
1Department of Automation, Tsinghua University  2Kling Team, Kuaishou Technology


🔥 News

  • [2025.12.13] 🚀📢🎉 We are thrilled to announce the official release of SVG-T2I!
    The project is now fully open-sourced, featuring complete training and inference code as well as pre-trained model weights.
    🔧 SVG-T2I Code: GitHub  |  🤗 SVG-T2I Models: Hugging Face

  • [2025.11.20] 🧠⚙️📦 We release pre-trained weights for both the SVG Autoencoder and the SVG-XL diffusion backbone, providing strong foundations for high-quality text-to-image generation.

  • [2025.09.12] 📄✨🔓 The paper is officially released, together with full training and inference pipelines, enabling easy adoption and further research.

🧠 Overview

We introduce SVG, a novel latent diffusion model without variational autoencoders, which unleashes Self-supervised representations for Visual Generation.

Key Components:

  1. SVG Autoencoder - Pairs a frozen representation encoder with a residual branch that compensates for the information loss, plus a learned convolutional decoder that maps the SVG latent space back to pixel space (a rough sketch follows this list).
  2. Latent Diffusion Transformer - Performs diffusion modeling directly on SVG latent space.
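
The idea can be summarized in a minimal PyTorch sketch. This is an illustration only, not the repository's implementation: the module names, the residual design, and the assumption that the frozen encoder returns a spatial feature map are all simplifications.

import torch
import torch.nn as nn

class SVGAutoencoderSketch(nn.Module):
    """Illustration only: frozen representation encoder + residual branch + conv decoder."""
    def __init__(self, frozen_encoder: nn.Module, latent_dim: int = 384):
        super().__init__()
        self.encoder = frozen_encoder                     # e.g. a pre-trained DINOv3 backbone
        for p in self.encoder.parameters():               # the representation encoder stays frozen
            p.requires_grad = False
        # Hypothetical residual branch: recovers details the frozen encoder discards.
        self.residual = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)
        # Hypothetical convolutional decoder: maps SVG latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, kernel_size=4, stride=4),
            nn.GELU(),
            nn.ConvTranspose2d(128, 3, kernel_size=4, stride=4),
        )

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)                       # frozen self-supervised features
        latent = feats + self.residual(x)                 # compensate for the information loss
        return self.decoder(latent), latent               # reconstruction and the SVG latent

# Quick shape check with a stand-in "encoder" (a real run would use DINOv3 features):
stand_in = nn.Conv2d(3, 384, kernel_size=16, stride=16)
recon, latent = SVGAutoencoderSketch(stand_in)(torch.randn(1, 3, 256, 256))
print(recon.shape, latent.shape)  # torch.Size([1, 3, 256, 256]) torch.Size([1, 384, 16, 16])

The diffusion transformer then operates on the latent produced above rather than on a VAE latent.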

Repository Features:

  • ✅ PyTorch implementation of SVG Autoencoder
  • ✅ PyTorch implementation of Latent Diffusion Transformer
  • ✅ End-to-end training and sampling scripts
  • ✅ Multi-GPU distributed training support
  • ✅ Pretrained weights for the SVG Autoencoder and SVG-XL

⚙️ Installation

1. Create Environment

conda create -n svg python=3.10 -y
conda activate svg

2. Install Dependencies

pip install -r requirements.txt
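
A quick sanity check that PyTorch sees your GPUs can save time before launching distributed jobs (this assumes requirements.txt installs torch):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| visible GPUs:", torch.cuda.device_count())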

3. Download Pretrained Weights

Pretrained models are available on HuggingFace.
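
One way to fetch the weights programmatically is with huggingface_hub; the repository ID below is a placeholder, so substitute the actual SVG model repository from the Hugging Face links above:

from huggingface_hub import snapshot_download

# Placeholder repo ID: replace with the SVG repository on Hugging Face.
snapshot_download(repo_id="<org>/<svg-model-repo>", local_dir="pretrained/checkpoints")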


📦 Data Preparation

1. Download DINOv3

git clone https://github.com/facebookresearch/dinov3.git

Follow the official DINOv3 repository instructions to download pre-trained checkpoints.
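
For reference, DINOv3 backbones can typically be loaded from the local clone via torch.hub. Treat the snippet below as a sketch: the entry point name mirrors the model_name used in the autoencoder config, but verify the hub interface against the DINOv3 version you clone.

import torch

# Paths are placeholders; adjust to where you cloned DINOv3 and stored its weights.
dinov3 = torch.hub.load(
    "/path/to/your/dinov3",            # local DINOv3 clone
    "dinov3_vits16plus",               # model variant expected by the autoencoder config
    source="local",
    weights="/path/to/your/dinov3_vits16plus_pretrain.pth",
)
dinov3.eval()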

2. Prepare Dataset

  • Download ImageNet-1k
  • Update the dataset paths in the configuration files (an example directory layout is sketched below)
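
The configs point at an ImageNet-1k root plus a train_images subfolder. A typical class-subfolder layout looks like the sketch below; the exact folder names, especially for the validation split, depend on your copy and on the config defaults:

/path/to/your/ImageNet-1k/
├── train_images/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   └── ...
│   └── ...
└── <validation split>/
    └── ...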

🚀 Quick Start

1. Configure Paths

Before running the pipeline, update the placeholder paths in the following configuration files to match your local file/directory structure.

1.1 Autoencoder Config

File path: autoencoder/configs/example_svg_autoencoder_vitsp.yaml

Modify the paths under dinoconfig (for DINOv3 dependencies) and train/validation (for dataset) as shown below:

dinoconfig:
  dinov3_location: /path/to/your/dinov3  # Path to the directory storing the DINOv3 codebase
  model_name: dinov3_vits16plus          # Fixed DINOv3 model variant (no need to change)
  weights: /path/to/your/dinov3_vits16plus_pretrain.pth  # Path to the pre-trained DINOv3 weights file
...
train:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for training)
validation:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for validation)

1.2 Diffusion Config

File path: configs/example_SVG_XL.yaml

Update the data_path (for training data) and encoder_config (path to the Autoencoder config above) as follows:

basic:
  data_path: /path/to/your/ImageNet-1k/train_images  # Path to the "train_images" subfolder in ImageNet-1k
  encoder_config: ../autoencoder/svg/configs/example_svg_autoencoder_vitsp.yaml  # Relative/absolute path to your edited Autoencoder config

Note: Ensure the encoder_config path is valid (use an absolute path if the relative ../ path does not match your project’s folder hierarchy). In addition, the ckpt parameter must be set to the full path of your trained decoder checkpoint file, as illustrated below.
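
For instance (the exact key location within the YAML may differ, so check the example config for where ckpt is defined):

ckpt: /path/to/your/trained_svg_autoencoder_decoder.ckpt  # full path to the trained decoder checkpoint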

2. Train SVG Autoencoder

cd autoencoder/svg
bash run_train.sh configs/example_svg_autoencoder_vitsp.yaml

3. Train Latent Diffusion Transformer

torchrun --nnodes=1 --nproc_per_node=8 train_svg.py --config ./configs/example_SVG_XL.yaml

4. Eval Latent Diffusion Transformer

torchrun --nnodes=1 --nproc_per_node=1 sample_ddp_feature_svg.py --cfg-scale 1.0 --sample-dir ./samples --ckpt pretrained/checkpoints/V1-SVG-XL-7000K-256x256.pt

This writes an .npy file to the samples directory; then compute FID with:

cd evaluation
python fid.py <path of npy file>

Note: Place the reference batch VIRTUAL_imagenet256_labeled.npz in the evaluation directory before running fid.py.
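
The reference batch comes from the ADM (guided-diffusion) evaluation suite; at the time of writing it can be downloaded as below, but verify the URL against the guided-diffusion repository:

wget -P evaluation/ https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz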

🎨 Image Generation

Generate images using a trained model:

# Update ckpt_path in sample_svg.py with your checkpoint
python sample_svg.py

Generated images will be saved to the current directory.


🛠️ Configuration

Key Configuration Files:

  • autoencoder/configs/ - SVG autoencoder training configurations
  • configs/ - Diffusion transformer training configurations

Multi-GPU Training:

Adjust --nproc_per_node based on your available GPUs. The example uses 8 GPUs.
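
For example, on a machine with 4 GPUs the diffusion training launch becomes:

torchrun --nnodes=1 --nproc_per_node=4 train_svg.py --config ./configs/example_SVG_XL.yaml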


📄 Citation

If you use this work in your research, please cite our paper:

@misc{shi2025latentdiffusionmodelvariational,
      title={Latent Diffusion Model without Variational Autoencoder}, 
      author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2510.15301},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15301}, 
}

🙏 Acknowledgments

This implementation builds upon several excellent open-source projects:

  • dinov3 - DINOv3 official architecture
  • SigLIP2 - SigLIP2 official architecture
  • MAE - MAE baseline architecture
  • SiT - Diffusion framework and training codebase
  • VAVAE - PyTorch convolutional decoder implementation

📧 Contact

For questions and issues, please open an issue on GitHub or contact the authors.


Made with ❤️ by the SVG Team
