SVG: Latent Diffusion Model without Variational Autoencoder

Official PyTorch Implementation



arXiv · GitHub · PDF · License · 🤗 HF Model: SVG · 🤗 HF Model: SVG-T2I

Minglei Shi1*, Haolin Wang1*, Wenzhao Zheng1†, Ziyang Yuan2, Xiaoshi Wu2, Xintao Wang2, Pengfei Wan2, Jie Zhou1, Jiwen Lu1
(*equal contribution, listed in alphabetical order; †project lead)
1Department of Automation, Tsinghua University  2Kling Team, Kuaishou Technology


🔥 News

  • [2025.12.13] 🚀📢🎉 We are thrilled to announce the official release of SVG-T2I!
    The project is now fully open-sourced, featuring complete training and inference code as well as pre-trained model weights.
    🔧 SVG-T2I Code: GitHub  |  🤗 SVG-T2I Models: Hugging Face

  • [2025.11.20] 🧠⚙️📦 We release pre-trained weights for both the SVG Autoencoder and the SVG-XL diffusion backbone, providing strong foundations for high-quality text-to-image generation.

  • [2025.09.12] 📄✨🔓 The paper is officially released, together with full training and inference pipelines, enabling easy adoption and further research.

🧠 Overview

We introduce SVG, a novel latent diffusion model without variational autoencoders, which unleashes Self-supervised representations for Visual Generation.

Key Components:

  1. SVG Autoencoder - Pairs a frozen representation encoder with a residual branch that compensates for the information loss, plus a learned convolutional decoder that maps the SVG latent space back to pixel space (a rough sketch follows this list).
  2. Latent Diffusion Transformer - Performs diffusion modeling directly on SVG latent space.
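
The idea can be summarized in a minimal PyTorch sketch. This is an illustration only, not the repository's implementation: the module names, the residual design, and the assumption that the frozen encoder returns a spatial feature map are all simplifications.

import torch
import torch.nn as nn

class SVGAutoencoderSketch(nn.Module):
    """Illustration only: frozen representation encoder + residual branch + conv decoder."""
    def __init__(self, frozen_encoder: nn.Module, latent_dim: int = 384):
        super().__init__()
        self.encoder = frozen_encoder                     # e.g. a pre-trained DINOv3 backbone
        for p in self.encoder.parameters():               # the representation encoder stays frozen
            p.requires_grad = False
        # Hypothetical residual branch: recovers details the frozen encoder discards.
        self.residual = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)
        # Hypothetical convolutional decoder: maps SVG latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, kernel_size=4, stride=4),
            nn.GELU(),
            nn.ConvTranspose2d(128, 3, kernel_size=4, stride=4),
        )

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)                       # frozen self-supervised features
        latent = feats + self.residual(x)                 # compensate for the information loss
        return self.decoder(latent), latent               # reconstruction and the SVG latent

# Quick shape check with a stand-in "encoder" (a real run would use DINOv3 features):
stand_in = nn.Conv2d(3, 384, kernel_size=16, stride=16)
recon, latent = SVGAutoencoderSketch(stand_in)(torch.randn(1, 3, 256, 256))
print(recon.shape, latent.shape)  # torch.Size([1, 3, 256, 256]) torch.Size([1, 384, 16, 16])

The diffusion transformer then operates on the latent produced above rather than on a VAE latent.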

Repository Features:

  • ✅ PyTorch implementation of SVG Autoencoder
  • ✅ PyTorch implementation of Latent Diffusion Transformer
  • ✅ End-to-end training and sampling scripts
  • ✅ Multi-GPU distributed training support
  • ✅ Pretrained weights for the SVG Autoencoder and SVG-XL

⚙️ Installation

1. Create Environment

conda create -n svg python=3.10 -y
conda activate svg

2. Install Dependencies

pip install -r requirements.txt
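
A quick sanity check that PyTorch sees your GPUs can save time before launching distributed jobs (this assumes requirements.txt installs torch):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| visible GPUs:", torch.cuda.device_count())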

3. Download Pretrained Weights

Pretrained models are available on HuggingFace.
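
One way to fetch the weights programmatically is with huggingface_hub; the repository ID below is a placeholder, so substitute the actual SVG model repository from the Hugging Face links above:

from huggingface_hub import snapshot_download

# Placeholder repo ID: replace with the SVG repository on Hugging Face.
snapshot_download(repo_id="<org>/<svg-model-repo>", local_dir="pretrained/checkpoints")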


📦 Data Preparation

1. Download DINOv3

git clone https://github.com/facebookresearch/dinov3.git

Follow the official DINOv3 repository instructions to download pre-trained checkpoints.
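
For reference, DINOv3 backbones can typically be loaded from the local clone via torch.hub. Treat the snippet below as a sketch: the entry point name mirrors the model_name used in the autoencoder config, but verify the hub interface against the DINOv3 version you clone.

import torch

# Paths are placeholders; adjust to where you cloned DINOv3 and stored its weights.
dinov3 = torch.hub.load(
    "/path/to/your/dinov3",            # local DINOv3 clone
    "dinov3_vits16plus",               # model variant expected by the autoencoder config
    source="local",
    weights="/path/to/your/dinov3_vits16plus_pretrain.pth",
)
dinov3.eval()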

2. Prepare Dataset

  • Download ImageNet-1k
  • Update the dataset paths in the configuration files (an example directory layout is sketched below)
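
The configs point at an ImageNet-1k root plus a train_images subfolder. A typical class-subfolder layout looks like the sketch below; the exact folder names, especially for the validation split, depend on your copy and on the config defaults:

/path/to/your/ImageNet-1k/
├── train_images/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   └── ...
│   └── ...
└── <validation split>/
    └── ...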

🚀 Quick Start

1. Configure Paths

Before running the pipeline, update the placeholder paths in the following configuration files to match your local file/directory structure.

1.1 Autoencoder Config

File path: autoencoder/configs/example_svg_autoencoder_vitsp.yaml

Modify the paths under dinoconfig (for DINOv3 dependencies) and train/validation (for dataset) as shown below:

dinoconfig:
  dinov3_location: /path/to/your/dinov3  # Path to the directory storing the DINOv3 codebase
  model_name: dinov3_vits16plus          # Fixed DINOv3 model variant (no need to change)
  weights: /path/to/your/dinov3_vits16plus_pretrain.pth  # Path to the pre-trained DINOv3 weights file
...
train:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for training)
validation:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for validation)

1.2 Diffusion Config

File path: configs/example_SVG_XL.yaml

Update the data_path (for training data) and encoder_config (path to the Autoencoder config above) as follows:

basic:
  data_path: /path/to/your/ImageNet-1k/train_images  # Path to the "train_images" subfolder in ImageNet-1k
  encoder_config: ../autoencoder/svg/configs/example_svg_autoencoder_vitsp.yaml  # Relative/absolute path to your edited Autoencoder config

Note: Ensure the encoder_config path is valid (use an absolute path if the relative ../ path does not match your project’s folder hierarchy). In addition, the ckpt parameter must be set to the full path of your trained decoder checkpoint file, as illustrated below.
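
For instance (the exact key location within the YAML may differ, so check the example config for where ckpt is defined):

ckpt: /path/to/your/trained_svg_autoencoder_decoder.ckpt  # full path to the trained decoder checkpoint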

2. Train SVG Autoencoder

cd autoencoder/svg
bash run_train.sh configs/example_svg_autoencoder_vitsp.yaml

3. Train Latent Diffusion Transformer

torchrun --nnodes=1 --nproc_per_node=8 train_svg.py --config ./configs/example_SVG_XL.yaml

4. Eval Latent Diffusion Transformer

torchrun --nnodes=1 --nproc_per_node=1 sample_ddp_feature_svg.py --cfg-scale 1.0 --sample-dir ./samples --ckpt pretrained/checkpoints/V1-SVG-XL-7000K-256x256.pt

This writes an .npy file to the samples directory; then compute FID with:

cd evaluation
python fid.py <path of npy file>

Note: Place the reference batch VIRTUAL_imagenet256_labeled.npz in the evaluation directory before running fid.py.
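
The reference batch comes from the ADM (guided-diffusion) evaluation suite; at the time of writing it can be downloaded as below, but verify the URL against the guided-diffusion repository:

wget -P evaluation/ https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz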

🎨 Image Generation

Generate images using a trained model:

# Update ckpt_path in sample_svg.py with your checkpoint
python sample_svg.py

Generated images will be saved to the current directory.


🛠️ Configuration

Key Configuration Files:

  • autoencoder/configs/ - SVG autoencoder training configurations
  • configs/ - Diffusion transformer training configurations

Multi-GPU Training:

Adjust --nproc_per_node based on your available GPUs. The example uses 8 GPUs.
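
For example, on a machine with 4 GPUs the diffusion training launch becomes:

torchrun --nnodes=1 --nproc_per_node=4 train_svg.py --config ./configs/example_SVG_XL.yaml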


📄 Citation

If you use this work in your research, please cite our paper:

@misc{shi2025latentdiffusionmodelvariational,
      title={Latent Diffusion Model without Variational Autoencoder}, 
      author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2510.15301},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15301}, 
}

🙏 Acknowledgments

This implementation builds upon several excellent open-source projects:

  • dinov3 - DINOv3 official architecture
  • SigLIP2 - SigLIP2 official architecture
  • MAE - MAE baseline architecture
  • SiT - Diffusion framework and training codebase
  • VAVAE - PyTorch convolutional decoder implementation

📧 Contact

For questions and issues, please open an issue on GitHub or contact the authors.


Made with ❤️ by the SVG Team
