Official implementation of Pixio from the paper In Pursuit of Pixel Supervision for Visual Pre-training.
Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu
[arXiv] [HuggingFace] [BibTeX]
Pixio is largely built on MAE, with three minimal yet critical algorithmic updates:
- deeper decoder
- larger masking granularity
- more class tokens
Pixio also updates MAE's pre-training data from ImageNet-1K to MetaCLIP-2B with a simple self-curation strategy.
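To make the "larger masking granularity" update concrete, here is a minimal sketch (not the repo's implementation; `grain` and `mask_ratio` below are illustrative values, not the paper's settings) of block-wise masking in which the random mask is drawn on a coarser grid, so each masked cell covers a whole group of patches:

```python
import torch

def coarse_random_mask(patches_per_side: int, grain: int, mask_ratio: float) -> torch.Tensor:
    """Illustrative only: sample the mask on a (patches_per_side // grain)^2 grid,
    then expand it so every masked cell spans a grain x grain block of patches."""
    assert patches_per_side % grain == 0
    coarse = patches_per_side // grain
    num_cells = coarse * coarse
    num_masked = int(mask_ratio * num_cells)

    # randomly pick which coarse cells to mask (True = masked)
    ids = torch.rand(num_cells).argsort()
    mask = torch.zeros(num_cells, dtype=torch.bool)
    mask[ids[:num_masked]] = True

    # expand each coarse cell to a grain x grain block at patch resolution
    mask = mask.view(coarse, coarse)
    mask = mask.repeat_interleave(grain, dim=0).repeat_interleave(grain, dim=1)
    return mask  # (patches_per_side, patches_per_side)

# e.g. a 256x256 image with 16x16 patches -> a 16x16 patch grid, masked in 2x2 patch blocks
mask = coarse_random_mask(patches_per_side=16, grain=2, mask_ratio=0.75)
```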
This codebase is developed with PyTorch 2.8.0 + CUDA 12.8.
```bash
conda create -n pixio python=3.10.18
conda activate pixio
pip install -r requirements.txt
```

You can either use the source code from this repo or call the Transformers APIs.
Pixio ViT models pre-trained on a web-scale dataset (MetaCLIP-2B):
| Model | Parameters | Pre-training Dataset | Download |
|---|---|---|---|
| Pixio-B/16 | 86M | MetaCLIP-2B | [link] |
| Pixio-L/16 | 303M | MetaCLIP-2B | [link] |
| Pixio-H/16 | 631M | MetaCLIP-2B | [link] |
| Pixio-1B/16 | 1362M | MetaCLIP-2B | [link] |
| Pixio-5B/16 | 5441M | MetaCLIP-2B | [link] |
To use the source code from this repo:

```bash
cd pixio
```

Then test as follows:
```python
import torch
from PIL import Image
from torchvision import transforms

from pixio import pixio_vith16

model = pixio_vith16(pretrained="your/checkpoint/path")
model.eval()

# you can try a larger resolution, but ensure both sides are divisible by 16
transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=3),  # 3 is bicubic
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

img = Image.open("your/image/path").convert("RGB")
img = transform(img)

# block-wise features containing class tokens and patch tokens
with torch.inference_mode():
    features = model(img.unsqueeze(0))
```
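If you need the class tokens and the patch tokens separately (e.g. a dense feature map for segmentation), a rough sketch is shown below. It assumes the last entry of the returned block-wise features is a tensor of shape (B, num_cls + num_patches, C) and that `num_cls` matches the model's class-token count; check the model definition in this repo for the actual output format and the exact number of class tokens.

```python
# continuing from the snippet above; `num_cls` is a placeholder, not the model's real value
num_cls = 8                    # assumed number of class tokens
patch_size = 16                # Pixio-H/16 patch size
H = W = 256                    # the resolution used in the transform above

feat = features[-1] if isinstance(features, (list, tuple)) else features  # last block
B, _, C = feat.shape
cls_tokens = feat[:, :num_cls]       # (B, num_cls, C)
patch_tokens = feat[:, num_cls:]     # (B, H/16 * W/16, C)
feature_map = patch_tokens.transpose(1, 2).reshape(B, C, H // patch_size, W // patch_size)
```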
You can find all HuggingFace paths under this collection.
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

img = Image.open("your/image/path")

processor = AutoImageProcessor.from_pretrained("facebook/pixio-vith16")
model = AutoModel.from_pretrained("facebook/pixio-vith16")

inputs = processor(images=img, return_tensors="pt")
# request hidden_states so the pre-LayerNorm features below are populated
outputs = model(**inputs, output_hidden_states=True)

features_norm = outputs.last_hidden_state  # class tokens + patch tokens after the last LayerNorm
features = outputs.hidden_states[-1]       # class tokens + patch tokens before the last LayerNorm
```

We provide examples using ImageNet-1K and ImageNet-21K. We use the ImageNet datasets organized as tar files from HuggingFace.
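A rough sketch of fetching the tar shards with `huggingface_hub` (the dataset repo id and file pattern below are assumptions, and the ImageNet repos on the Hub are gated, so accept their terms and log in first):

```python
from huggingface_hub import snapshot_download

# assumption: ImageNet-1K hosted as the gated ILSVRC/imagenet-1k dataset,
# with the tar.gz shards stored under data/
snapshot_download(
    repo_id="ILSVRC/imagenet-1k",
    repo_type="dataset",
    allow_patterns=["data/*.tar.gz"],
    local_dir="your/data/path",
)
```

Then launch pre-training: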
```bash
cd pretraining

# specify your data path in the script
bash scripts/pretrain_pixio_vith16_imagenet.sh
```

We provide the evaluation code for monocular depth estimation (NYUv2, KITTI), semantic segmentation (ADE20K, Pascal VOC, LoveDA), and k-NN classification (ImageNet-1K).
We follow ZoeDepth and BTS, preparing the data as follows:
- NYUv2: training set | validation set
- KITTI: images | annotations
Please organize the data as follows:
```
├── [Your NYUv2 Path]
    ├── sync
    │   ├── basement_0001a
    │   ├── bathroom_0001
    │   └── ...
    └── official_splits
        └── test
            ├── bathroom
            ├── bedroom
            └── ...

├── [Your KITTI Path]
    ├── images
    │   ├── 2011_09_26
    │   ├── 2011_09_28
    │   └── ...
    └── annotations    # extracted from data_depth_annotated.zip
        ├── 2011_09_26_drive_0001_sync
        ├── 2011_09_26_drive_0002_sync
        └── ...
```
We mainly follow UniMatch V2, preparing the data as follows:
- ADE20K: images | annotations
- Pascal: images | annotations
- LoveDA: data (run `evaluation/semseg/util/process_loveda.py` to convert the masks; an illustrative sketch of the typical remapping follows this list)
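For reference, a sketch of the kind of remapping such a conversion typically performs. The provided script is authoritative; the exact mapping here (0 = no-data mapped to ignore index 255, labels 1-7 shifted to 0-6) and the helper name are assumptions based on common LoveDA conventions:

```python
import os
import numpy as np
from PIL import Image

def convert_loveda_masks(mask_dir: str, out_dir: str) -> None:
    """Hypothetical helper: shift LoveDA labels 1..7 to 0..6 and map 0 (no-data) to 255."""
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(mask_dir):
        if not name.endswith(".png"):
            continue
        mask = np.array(Image.open(os.path.join(mask_dir, name))).astype(np.int16)
        mask = np.where(mask == 0, 255, mask - 1).astype(np.uint8)
        Image.fromarray(mask).save(os.path.join(out_dir, name))
```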
Please organize the data as follows:
```
├── [Your ADE20K Path]
    ├── images
    │   ├── training
    │   └── validation
    └── annotations
        ├── training
        └── validation

├── [Your Pascal Path]
    ├── JPEGImages
    └── SegmentationClass

├── [Your LoveDA Path]
    ├── Train/Train
    └── Val/Val
```
Follow this script to prepare ImageNet-1K.
```bash
cd evaluation

model="pixio_vith16"
pretrained="your/checkpoint/path"

# specify the data path in config files or script
sbatch launch_monodepth.sh monodepth/configs/nyuv2_dpt.yaml $model $pretrained
sbatch launch_semseg.sh semseg/configs/ade20k_linear.yaml $model $pretrained
sbatch launch_knn.sh $model $pretrained

# or run all evaluations together
bash run_all.sh $model $pretrained
```
```bash
cd distillation

# specify your data path and teacher checkpoint path in the scripts
bash scripts/distill_pixio_vit5b16_to_vit1b16+vith16_imagenet.sh
```

Pixio is licensed under the Facebook license.
We sincerely thank the authors of MAE, DINO, DINOv2, and DINOv3 for open-sourcing their code and models.
```bibtex
@article{pixio,
  title={In Pursuit of Pixel Supervision for Visual Pre-training},
  author={Yang, Lihe and Li, Shang-Wen and Li, Yang and Lei, Xinjie and Wang, Dong and Mohamed, Abdelrahman and Zhao, Hengshuang and Xu, Hu},
  journal={arXiv:2512.15715},
  year={2025}
}
```