This repository contains the PyTorch/GPU implementation of our paper: Improved Baselines with Representation Autoencoders.
Improved Baselines with Representation Autoencoders
Jaskirat Singh1,2, Boyang Zheng3, Zongze Wu1, Richard Zhang1, Eli Shechtman1, Saining Xie3
1Adobe Research, 2ANU, 3New York University
@article{singh2026raev2,
title={Improved Baselines with Representation Autoencoders},
author={Singh, Jaskirat and Zheng, Boyang and Wu, Zongze and Zhang, Richard and Shechtman, Eli and Xie, Saining},
journal={arXiv preprint arXiv:2605.18324},
year={2026}
}

RAEv2 simplifies and improves representation autoencoders, achieving over 10x faster convergence, better generation, and better reconstruction. RAEv2 achieves state-of-the-art gFID and FDr6 in just 80 epochs, compared to 800 epochs for prior baselines, without any post-training. We also validate the improved training recipe in diverse settings, including T2I generation and world models, and observe consistent improvements.
git clone https://github.com/nanovisionx/RAEv2.git
cd RAEv2
# install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# install dependencies
uv sync

Pre-processed datasets at 256x256. All rights belong to the original owners.
| Subset | Task | Source | Format | Notes |
|---|---|---|---|---|
| imagenet-256 | ImageNet | ImageNet | Arrow | Or use your own ImageNet |
| blip3o-256 | T2I | BLIP3o | WDS | Captioned image pairs |
| rendertext-256 | T2I | RenderedText | WDS | Rendered-text images |
| scale-rae-256 | T2I | Scale-RAE | WDS | Synthetic FLUX images |
| recon-256 | NWM | RECON | WDS | Robot navigation frames |
# (Recommended) ~5-10x faster downloads
export HF_HUB_ENABLE_HF_TRANSFER=1
# Download all subsets into data/
uv run hf download nanovisionx/RAEv2-data --repo-type dataset --exclude .gitattributes --local-dir data/
# Or download a specific subset (uncomment one):
# uv run hf download nanovisionx/RAEv2-data --repo-type dataset --exclude .gitattributes --include "imagenet-256/**" --local-dir data/ # ImageNet
# uv run hf download nanovisionx/RAEv2-data --repo-type dataset --exclude .gitattributes --include "blip3o-256/**" --local-dir data/ # BLIP3o
# uv run hf download nanovisionx/RAEv2-data --repo-type dataset --exclude .gitattributes --include "rendertext-256/**" --local-dir data/ # RenderedText
# uv run hf download nanovisionx/RAEv2-data --repo-type dataset --exclude .gitattributes --include "scale-rae-256/**" --local-dir data/ # Scale-RAE
# uv run hf download nanovisionx/RAEv2-data --repo-type dataset --exclude .gitattributes --include "recon-256/**" --local-dir data/ # RECON

# Download all (encoders + stage 1 + stage 2)
uv run hf download nyu-visionx/RAEv2-models --exclude .gitattributes --local-dir pretrained_models/
# Or download a specific subset (uncomment one):
# uv run hf download nyu-visionx/RAEv2-models --include "encoders/**" --exclude .gitattributes --local-dir pretrained_models/ # Pretrained vision encoders
# uv run hf download nyu-visionx/RAEv2-models --include "stage1/**" --exclude .gitattributes --local-dir pretrained_models/ # RAEv2 stage 1 checkpoints
# uv run hf download nyu-visionx/RAEv2-models --include "stage2/**" --exclude .gitattributes --local-dir pretrained_models/ # RAEv2 stage 2 checkpoints

We support 80+ pre-trained vision encoders across different encoder families and sizes (DINOv2, DINOv3, WebSSL, EUPE, MAE, iJEPA, MoCov3, CLIP, SigLIP2, etc.). See src/encoders/ for the full list and naming spec.

Naming, e.g.:
- DINOv3-L: dinov3-vit-l16
- DINOv3-L-K7 (multi-layer-sum, last 7 layers): dinov3mls-vit-l16[layers=11.13.15.17.19.21.23]
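
To make the naming spec concrete, here is a minimal sketch of how such a name could be split into a base encoder name and an optional layer list. It assumes only the grammar implied by the two examples above (a base name, optionally followed by `[layers=...]` with dot-separated indices). `EncoderSpec` and `parse_encoder_name` are hypothetical helpers written for illustration, not the parser used in src/encoders/.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed grammar: <base>[layers=<i>.<j>...], where the bracketed part is optional.
_NAME_RE = re.compile(
    r"^(?P<base>[A-Za-z0-9_.\-]+)(?:\[layers=(?P<layers>\d+(?:\.\d+)*)\])?$"
)


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical container for a parsed encoder name."""

    base: str                # e.g. "dinov3mls-vit-l16"
    layers: tuple[int, ...]  # empty tuple when no layer list is given


def parse_encoder_name(name: str) -> EncoderSpec:
    """Split an encoder name into its base name and layer indices."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized encoder name: {name!r}")
    raw_layers = match.group("layers")
    layers = tuple(int(tok) for tok in raw_layers.split(".")) if raw_layers else ()
    return EncoderSpec(base=match.group("base"), layers=layers)


if __name__ == "__main__":
    print(parse_encoder_name("dinov3-vit-l16"))
    print(parse_encoder_name("dinov3mls-vit-l16[layers=11.13.15.17.19.21.23]"))
```

Names using other bracketed options, if the repository supports any, would be rejected by this sketch; consult src/encoders/ for the authoritative spec.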
export WANDB_ENTITY=<your-entity>
export WANDB_PROJECT=<your-project>
export EXPERIMENT_NAME=<your-run-name>
uv run torchrun --nproc_per_node=8 \
src/train_stage1.py \
--config <CONFIG_PATH> \
--results-dir ckpts/stage1 \
--precision bf16 \
--compile \
--wandb

ImageNet config:
- configs/stage1/training/dinov3l-k7-imagenet.yaml
- configs/stage1/training/dinov3l-k23-imagenet.yaml

General Config: Similar to proprietary VAEs, training with more data helps further improve reconstruction performance.
- configs/stage1/training/dinov3l-k7-general.yaml
- configs/stage1/training/dinov3l-k23-general.yaml

After training: extract the EMA decoder and compute encoder statistics for latent normalization.
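
As a rough illustration of what latent normalization involves, the sketch below accumulates per-channel statistics over batches of encoder latents and then uses them to normalize and denormalize a latent vector. It assumes the statistics are a per-channel mean and standard deviation, which this README does not spell out; `ChannelStats`, `normalize`, and `denormalize` are hypothetical names, and plain Python lists stand in for tensors to keep the example dependency-free.

```python
import math


class ChannelStats:
    """Streaming per-channel mean/std over batches of latent vectors (illustrative)."""

    def __init__(self, num_channels):
        self.count = 0
        self.total = [0.0] * num_channels     # running sum per channel
        self.total_sq = [0.0] * num_channels  # running sum of squares per channel

    def update(self, batch):
        # batch: iterable of latent vectors, each with num_channels entries
        for vec in batch:
            self.count += 1
            for c, value in enumerate(vec):
                self.total[c] += value
                self.total_sq[c] += value * value

    def finalize(self, eps=1e-6):
        # mean = E[x], var = E[x^2] - E[x]^2 (clamped at 0 against rounding error)
        mean = [s / self.count for s in self.total]
        var = [max(sq / self.count - m * m, 0.0) for sq, m in zip(self.total_sq, mean)]
        std = [math.sqrt(v + eps) for v in var]
        return mean, std


def normalize(vec, mean, std):
    """Map a latent vector to roughly zero mean and unit variance per channel."""
    return [(v - m) / s for v, m, s in zip(vec, mean, std)]


def denormalize(vec, mean, std):
    """Invert normalize(), e.g. before passing latents to a decoder."""
    return [v * s + m for v, m, s in zip(vec, mean, std)]


if __name__ == "__main__":
    stats = ChannelStats(num_channels=2)
    stats.update([[1.0, 10.0], [3.0, 30.0]])
    mean, std = stats.finalize()
    print(mean, std, normalize([3.0, 30.0], mean, std))
```

Keeping running sums rather than per-batch means is a deliberate choice in this sketch: sums are additive, so partial results from several workers can simply be added together before `finalize`.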
# 1. Extract EMA decoder from the final checkpoint
uv run python scripts/stage1/extract_decoder.py \
--config <CONFIG_PATH> \
--ckpt ckpts/stage1/<RUN_NAME>/checkpoints/ep-XXXXXXX.pt \
--use-ema \
--out pretrained_models/stage1/<imagenet|general>/<encoder>-k<N>/decoder.pt
# 2. Compute encoder stats (multi-GPU, single node)
uv run torchrun --nproc_per_node=8 \
scripts/stage1/compute_encoder_stats.py \
--config <CONFIG_PATH> \
--use-hf-dataset \
--hf-data-dir data/imagenet-256 \
--batch-size 256 \
--output-path pretrained_models/stage1/<imagenet|general>/<encoder>-k<N>/stats.pt

Sampling: reconstruct an image with a trained RAE. See configs/stage1/sampling/ for the full list. E.g.,
uv run python scripts/stage1/sample.py \
--config configs/stage1/sampling/dinov3l-k23-general.yaml \
--image assets/samples/sample_1.png

Evaluation: offline reconstruction metrics (rFID, PSNR, SSIM, LPIPS) on eval datasets (e.g., ImageNet, RenderedText).
export EXPERIMENT_NAME=<your-run-name>
uv run torchrun --nproc_per_node=8 \
src/offline_eval_stage1.py \
--config configs/stage1/sampling/dinov3l-k23-general.yaml

We support training RAEv2 across diverse settings: ImageNet, text-to-image (T2I), and navigation world models (NWM).
export WANDB_ENTITY=<your-entity>
export WANDB_PROJECT=<your-project>
export EXPERIMENT_NAME=<your-run-name>
uv run torchrun --nproc_per_node=8 \
src/train.py \
--config <CONFIG_PATH> \
--results-dir ckpts/stage2 \
--precision bf16 \
--compile \
--wandb

Example training configs for different tasks (all under configs/stage2/training/):
| Task | k=1 | k=7 | k=23 |
|---|---|---|---|
| ImageNet | imagenet-dinov3l-k1.yaml | imagenet-dinov3l-k7.yaml | imagenet-dinov3l-k23.yaml |
| T2I | t2i-dinov3l-k1.yaml | t2i-dinov3l-k7.yaml | t2i-dinov3l-k23.yaml |
| NWM | nwm-dinov3l-k1.yaml | nwm-dinov3l-k7.yaml | nwm-dinov3l-k23.yaml |

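
The file names in the table above follow a `<task>-dinov3l-k<k>.yaml` pattern, so a config path can be assembled programmatically, for example when launching a sweep over tasks. The helper below is a small sketch under that assumption; `stage2_training_config` is a hypothetical name, and it covers only the nine configs listed in the table.

```python
from pathlib import PurePosixPath

# Directory and file-name pattern taken from the table above.
STAGE2_TRAINING_DIR = PurePosixPath("configs/stage2/training")
TASKS = ("imagenet", "t2i", "nwm")
K_VALUES = (1, 7, 23)


def stage2_training_config(task: str, k: int) -> str:
    """Return the stage 2 training config path for a listed (task, k) pair."""
    task = task.lower()
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if k not in K_VALUES:
        raise ValueError(f"unsupported k={k}; expected one of {K_VALUES}")
    return str(STAGE2_TRAINING_DIR / f"{task}-dinov3l-k{k}.yaml")


if __name__ == "__main__":
    # The result can be passed as --config to the training command.
    print(stage2_training_config("T2I", 7))
```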
Online Evaluation: Similar to JiT, we support online evaluation during training. See the eval block in any config under configs/stage2/training/.
| Task | Supported metrics |
|---|---|
| Stage 1 - Reconstruction | rFID, PSNR, LPIPS, SSIM |
| Stage 2 - ImageNet | gFID, Inception Score, FDr6 (6 representation spaces), MIND / torch-fidelity (6 representation spaces) |
| Stage 2 - T2I | GenEval, DPGBench, GenAI Bench, gFID, VQAScore |
| Stage 2 - NWM | LPIPS, gFID |
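
For reference, PSNR, one of the reconstruction metrics listed above, follows the standard definition 10 · log10(MAX² / MSE). The sketch below computes it over flat lists of pixel values in plain Python. The function name and the `max_value` parameter are illustrative, and the repository's own evaluation code may differ, for example in value range or batching.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [0.0, 0.5, 1.0, 0.25]
    decoded = [0.1, 0.4, 0.9, 0.35]
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```

Higher PSNR means the reconstruction is closer to the reference; pixel values here are assumed to lie in [0, 1], so `max_value` should be changed to 255 for 8-bit images.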
Offline Evaluation: Model checkpoints can also be evaluated after training.
export EXPERIMENT_NAME=<your-run-name>
uv run torchrun --nproc_per_node=8 src/offline_eval.py \
--config configs/stage2/sampling/imagenet-dinov3l-k7.yaml

The codebase is built upon some amazing projects:
We thank the authors for making their work publicly available. We also sincerely thank Xingjian Leng for support and help with online GenEval and DPGBench evaluation during T2I training.