I wrote a blog post about the motivation behind this line of work.
Spherical Leech Quantization for Visual Tokenization and Generation
Yue Zhao1,2, Hanwen Jiang1,3, Zhenlin Xu4, Chutong Yang1, Ehsan Adeli2, Philipp Krähenbühl1
1UT Austin, 2Stanford University, 3Adobe Research, 4Mistral AI
arxiv | bibtex | Image Generation with Infinity+L24SQ
You can pronounce BSQ-ViT like "biskvit" (a kind of Russian sponge cake) or simply "biscuit".
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao1, Yuanjun Xiong2, Philipp Krähenbühl1
1UT Austin, 2Predera
arxiv | bibtex
- Install Miniforge3
- Create the environment
mamba env create -f bsqvit-env.yaml
mamba activate bsqvit

| | Use approx. (Eq 8) | #bits | PSNR↑ | SSIM↑ | LPIPS↓ | rFID↓ | config & ckpt | md5sum |
|---|---|---|---|---|---|---|---|---|
| SDXL-VAE | N/A | 64 | 25.38 | .7276 | .0666 | 0.72 | External | N/A |
| BSQ-ViT | | 18 | 24.79 | .7319 | .0836 | 1.34 | UTBox | 7abf5a |
| BSQ-ViT (EMA) | | 18 | 24.80 | .7314 | .0820 | 1.23 | UTBox | 7abf5a |
| BSQ-ViT | ✓ | 18 | 25.36 | .7578 | .0761 | 1.14 | UTBox | 8f5422 |
| BSQ-ViT (EMA) | ✓ | 18 | 25.80 | .7680 | .0729 | 1.30 | UTBox | 8f5422 |
| BSQ-ViT | ✓ | 36 | 27.88 | .8410 | .0432 | 0.41 | UTBox | b5ce5f |
| BSQ-ViT (EMA) | ✓ | 36 | 28.14 | .8448 | .0400 | 0.45 | UTBox | b5ce5f |
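
The md5sum column can be used to check a downloaded checkpoint. The snippet below is a minimal sketch, not part of this repository: it assumes the table lists the first six hex characters of the file's MD5 digest, and the function name `md5_matches` and the file name in the usage note are hypothetical.

```python
import hashlib


def md5_matches(path, expected_prefix, chunk_size=1 << 20):
    """Return True if the MD5 hex digest of the file starts with expected_prefix.

    Assumption: the table shows a prefix of the digest, not the full hash.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(expected_prefix.lower())
```

For example, `md5_matches("checkpoint.pt", "7abf5a")` would be expected to return `True` for an intact download of the first BSQ-ViT checkpoint, where `checkpoint.pt` stands in for whatever file name the download actually has.
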

| | #bits | PSNR↑ | SSIM↑ | LPIPS↓ | rFVD↓ | config & ckpt |
|---|---|---|---|---|---|---|
| MAGVIT-L | 10 | 22.0 | .7010 | .0990 | 25 | N/A |
| MAGVITv2 | 18 | - | - | .0694 | 16.12 | N/A |
| MAGVITv2 (deeper) | 18 | - | - | .0537 | 8.62 | N/A |
| BSQ-bcViT | 18 | 32.08 | .9421 | .0244 | 8.08 | TBA |
| BSQ-bcViT | 36 | 33.80 | .9606 | .0159 | 4.10 | TBA |
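
To make the #bits column concrete: in binary spherical quantization each token is a binary code, so 18 bits correspond to an implicit codebook of 2^18 = 262,144 entries. The following is a minimal pure-Python sketch of the quantization step for a single latent vector, not the repository's implementation. The function names, the bit-to-index ordering, and the choice to map a zero coordinate to +1 are assumptions made for illustration.

```python
import math


def bsq_quantize(v):
    """Quantize one latent vector (a list of floats) to a corner of the
    hypercube inscribed in the unit hypersphere.

    Returns (u_hat, token): the quantized unit vector and an integer id.
    """
    num_bits = len(v)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the sphere")
    u = [x / norm for x in v]                      # project onto the unit sphere
    bits = [1 if x >= 0 else 0 for x in u]         # assumption: zero maps to +1
    scale = 1.0 / math.sqrt(num_bits)              # keeps u_hat at unit norm
    u_hat = [scale if b else -scale for b in bits]
    # Assumption: coordinate i contributes bit i of the token id.
    token = sum(b << i for i, b in enumerate(bits))
    return u_hat, token


def codebook_size(num_bits):
    """Number of distinct codes representable with num_bits binary dimensions."""
    return 2 ** num_bits
```

Because the codebook is implicit in the sign pattern, no lookup table has to be stored, which is why sizes such as `codebook_size(36)` remain practical.
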

| | FID↓ | IS↑ | Prec↑ | Rec↑ | pre-computed samples | config & ckpt |
|---|---|---|---|---|---|---|
| BigGAN | 6.02 | 145.8 | 0.86 | 0.35 | External | External |
| ADM | 5.91 | 93.3 | 0.70 | 0.65 | External | External |
| Ours BSQ-ViT + Masked-LM | 5.44 | 139.6 | 0.80 | 0.50 | UTBox | TBA |

@article{zhao2024lsqvit,
  title={Spherical Leech Quantization for Visual Tokenization and Generation},
  author={Zhao, Yue and Jiang, Hanwen and Xu, Zhenlin and Yang, Chutong and Adeli, Ehsan and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2512.14697},
  year={2025}
}

@article{zhao2024bsqvit,
  title={Image and Video Tokenization with Binary Spherical Quantization},
  author={Zhao, Yue and Xiong, Yuanjun and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2406.07548},
  year={2024}
}