Pixio

Official implementation of Pixio from the paper In Pursuit of Pixel Supervision for Visual Pre-training.

Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu

[arXiv] [HuggingFace] [BibTeX]

Pixio is largely built on MAE, with three minimal yet critical algorithm updates:

  • deeper decoder
  • larger masking granularity
  • more class tokens

Pixio also updates MAE's pre-training data from ImageNet-1K to MetaCLIP-2B with a simple self-curation strategy.
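
Of the updates listed above, larger masking granularity admits a compact illustration: instead of dropping individual 16x16 patches independently, the random mask can be sampled on a coarser grid and repeated so that contiguous groups of patches are masked together. The snippet below is only a hedged sketch of this idea; the actual granularity, mask ratio, and implementation used by Pixio are defined in the pre-training code, not here.

import torch

def coarse_random_mask(grid_h, grid_w, block=2, mask_ratio=0.75):
    # sample the random mask on a coarse grid, then repeat it so that
    # contiguous block x block groups of patches are masked together
    ch, cw = grid_h // block, grid_w // block
    num_coarse = ch * cw
    num_keep = int(num_coarse * (1 - mask_ratio))

    noise = torch.rand(num_coarse)
    ids = noise.argsort()                 # random permutation of coarse cells
    coarse_mask = torch.ones(num_coarse)  # 1 = masked, 0 = visible
    coarse_mask[ids[:num_keep]] = 0
    coarse_mask = coarse_mask.reshape(ch, cw)

    # expand each coarse cell over its block x block patch group
    patch_mask = coarse_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return patch_mask                     # (grid_h, grid_w), 1 = masked

# e.g. a 256 x 256 image with 16 x 16 patches gives a 16 x 16 patch grid
mask = coarse_random_mask(16, 16, block=2, mask_ratio=0.75)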

Installation

This codebase is developed with PyTorch 2.8.0 + CUDA 12.8.

conda create -n pixio python=3.10.18
conda activate pixio
pip install -r requirements.txt

Inference (may need Huggingface login)

You can either use the source code in this repo or call the HuggingFace Transformers API.

Source Code

Pixio ViT models pre-trained on a web-scale dataset (MetaCLIP-2B):

Model        Parameters  Pre-training Dataset  Download
Pixio-B/16   86M         MetaCLIP-2B           [link]
Pixio-L/16   303M        MetaCLIP-2B           [link]
Pixio-H/16   631M        MetaCLIP-2B           [link]
Pixio-1B/16  1362M       MetaCLIP-2B           [link]
Pixio-5B/16  5441M       MetaCLIP-2B           [link]

First, enter the repository directory:

cd pixio

Then run inference as follows:

from PIL import Image
from torchvision import transforms

from pixio import pixio_vith16

model = pixio_vith16(pretrained="your/checkpoint/path")

# you can try larger resolution, but ensure both sides are divisible by 16
transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=3), # 3 is bicubic
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

img = Image.open("your/image/path").convert("RGB")
img = transform(img)

# block-wise features containing class tokens and patch tokens
features = model(img.unsqueeze(0))
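
If you need the patch tokens separately (e.g., as a dense feature map), here is a minimal sketch. It assumes the class tokens precede the patch tokens and that the patch-token count equals (H / 16) * (W / 16) for the 256 x 256 input above; check the Pixio source for the exact output layout and the number of class tokens.

# hedged sketch, not part of the official API
num_patches = (256 // 16) * (256 // 16)                      # 256 patch tokens

last = features[-1] if isinstance(features, (list, tuple)) else features
patch_tokens = last[:, -num_patches:, :]                     # (1, 256, C)
class_tokens = last[:, :-num_patches, :]                     # (1, num_class_tokens, C)

# reshape patch tokens into a 2D feature map for dense tasks
feature_map = patch_tokens.transpose(1, 2).reshape(1, -1, 256 // 16, 256 // 16)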

Transformers (may need Huggingface login)

You can find all HuggingFace paths under this collection.

from transformers import AutoImageProcessor, AutoModel
from PIL import Image

img = Image.open("your/image/path").convert("RGB")

processor = AutoImageProcessor.from_pretrained("facebook/pixio-vith16")
model = AutoModel.from_pretrained("facebook/pixio-vith16")

inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
features_norm = outputs.last_hidden_state # class tokens + patch tokens after the last LayerNorm
features = outputs.hidden_states[-1] # class tokens + patch tokens before the last LayerNorm
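
To turn these tokens into a single global embedding (e.g., for retrieval or k-NN), one option is to pool the class tokens. The sketch below is an assumption-laden example, not the official API: it assumes the class tokens come first in the sequence and that the patch size is 16.

import torch.nn.functional as F

# hedged sketch: pool the class tokens into one L2-normalized embedding
_, _, h, w = inputs["pixel_values"].shape
num_patches = (h // 16) * (w // 16)

class_tokens = features_norm[:, :-num_patches, :]           # (1, num_class_tokens, C)
embedding = F.normalize(class_tokens.mean(dim=1), dim=-1)   # (1, C)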

Pre-training

Data Preparation

We provide examples using ImageNet-1K and ImageNet-21K. We use ImageNet datasets organized as tar files from HuggingFace.
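
One way to fetch those tar shards is with the huggingface_hub library, as in the hedged sketch below. The dataset repo id and file pattern here are examples only; the ImageNet datasets on HuggingFace are gated, so accept the dataset terms and log in (e.g., with huggingface-cli login) before downloading.

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ILSVRC/imagenet-1k",    # example dataset id; replace with the dataset you use
    repo_type="dataset",
    allow_patterns=["data/*.tar*"],  # pull only the tar shards; adjust to the repo's layout
)
print(local_dir)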

Launch Pre-training

cd pretraining

# specify your data path in the script
bash scripts/pretrain_pixio_vith16_imagenet.sh

Evaluation

We provide the evaluation code for monocular depth estimation (NYUv2, KITTI), semantic segmentation (ADE20K, Pascal VOC, LoveDA), and k-NN classification (ImageNet-1K).

Data Preparation

Monocular Depth Estimation

We follow ZoeDepth and BTS for data preparation. Please organize the data as follows:

├── [Your NYUv2 Path]
    ├── sync
    │   ├── basement_0001a
    │   ├── bathroom_0001
    │   └── ...    
    └── official_splits
        └── test
            ├── bathroom
            ├── bedroom
            └── ...

├── [Your KITTI Path]
    ├── images
    │   ├── 2011_09_26
    │   ├── 2011_09_28
    │   └── ...    
    └── annotations # extracted from data_depth_annotated.zip
        ├── 2011_09_26_drive_0001_sync
        ├── 2011_09_26_drive_0002_sync
        └── ...

Semantic Segmentation

We mainly follow UniMatch V2 for data preparation. Please organize the data as follows:

├── [Your ADE20K Path]
    ├── images
    │   ├── training
    │   └── validation
    └── annotations
        ├── training
        └── validation

├── [Your Pascal Path]
    ├── JPEGImages
    └── SegmentationClass

├── [Your LoveDA Path]
    ├── Train/Train
    └── Val/Val

k-NN Classification

Follow this script to prepare ImageNet-1K.
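
The k-NN protocol itself is implemented in the evaluation code; for orientation only, below is a minimal sketch of the standard weighted k-NN classifier over frozen, L2-normalized features. The hyperparameters k and T are common defaults, not necessarily the ones Pixio uses, and a real run would process the test features in chunks to save memory.

import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    # weighted k-NN over cosine similarities of L2-normalized features
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sim = test_feats @ train_feats.T                     # (N_test, N_train)
    topk_sim, topk_idx = sim.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]                 # (N_test, k)

    # temperature-scaled similarity weights, accumulated per class
    weights = (topk_sim / T).exp()
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)
    return votes.argmax(dim=1)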

Launch Evaluation

cd evaluation

model="pixio_vith16"
pretrained="your/checkpoint/path"

# specify the data path in config files or script
sbatch launch_monodepth.sh monodepth/configs/nyuv2_dpt.yaml $model $pretrained
sbatch launch_semseg.sh semseg/configs/ade20k_linear.yaml $model $pretrained
sbatch launch_knn.sh $model $pretrained

# or run all evaluations together
bash run_all.sh $model $pretrained

Distillation

Launch Distillation

cd distillation

# specify your data path and teacher checkpoint path in the scripts
bash scripts/distill_pixio_vit5b16_to_vit1b16+vith16_imagenet.sh
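
The distillation objective itself is defined in the script above. As a rough orientation only, here is a minimal sketch of generic feature distillation, in which a student (plus a linear projection to the teacher's width) regresses a frozen teacher's output tokens with a smooth L1 loss. The function names, the projection head, and the loss are assumptions for illustration; Pixio's actual recipe lives in the distillation code.

import torch
import torch.nn.functional as F

def distill_step(student, teacher, proj, images, optimizer):
    # one generic feature-distillation step (illustrative, not Pixio's exact recipe);
    # assumes student(images) and teacher(images) return token tensors with matching token counts
    with torch.no_grad():
        target = teacher(images)              # (B, N, C_teacher), teacher stays frozen

    pred = proj(student(images))              # project student tokens to the teacher's width
    loss = F.smooth_l1_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()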

License

Pixio is licensed under the Facebook license; see the LICENSE file for details.

Acknowledgement

We sincerely thank the authors of MAE, DINO, DINOv2, and DINOv3 for open-sourcing their code and models.

Citation

@article{pixio,
  title={In Pursuit of Pixel Supervision for Visual Pre-training},
  author={Yang, Lihe and Li, Shang-Wen and Li, Yang and Lei, Xinjie and Wang, Dong and Mohamed, Abdelrahman and Zhao, Hengshuang and Xu, Hu},
  journal={arXiv:2512.15715},
  year={2025}
}
