Pixio

Official implementation of Pixio from the paper In Pursuit of Pixel Supervision for Visual Pre-training.

Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu

[arXiv] [HuggingFace] [BibTeX]

Pixio is largely built on MAE, with three minimal yet critical algorithm updates:

  • deeper decoder
  • larger masking granularity
  • more class tokens

Pixio also updates MAE's pre-training data from ImageNet-1K to MetaCLIP-2B with a simple self-curation strategy.
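
Of the updates listed above, larger masking granularity admits a compact illustration: instead of dropping individual 16x16 patches independently, the random mask can be sampled on a coarser grid and repeated so that contiguous groups of patches are masked together. The snippet below is only a hedged sketch of this idea; the actual granularity, mask ratio, and implementation used by Pixio are defined in the pre-training code, not here.

import torch

def coarse_random_mask(grid_h, grid_w, block=2, mask_ratio=0.75):
    # sample the random mask on a coarse grid, then repeat it so that
    # contiguous block x block groups of patches are masked together
    ch, cw = grid_h // block, grid_w // block
    num_coarse = ch * cw
    num_keep = int(num_coarse * (1 - mask_ratio))

    noise = torch.rand(num_coarse)
    ids = noise.argsort()                 # random permutation of coarse cells
    coarse_mask = torch.ones(num_coarse)  # 1 = masked, 0 = visible
    coarse_mask[ids[:num_keep]] = 0
    coarse_mask = coarse_mask.reshape(ch, cw)

    # expand each coarse cell over its block x block patch group
    patch_mask = coarse_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return patch_mask                     # (grid_h, grid_w), 1 = masked

# e.g. a 256 x 256 image with 16 x 16 patches gives a 16 x 16 patch grid
mask = coarse_random_mask(16, 16, block=2, mask_ratio=0.75)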

Installation

This codebase is developed with PyTorch 2.8.0 + CUDA 12.8.

conda create -n pixio python=3.10.18
conda activate pixio
pip install -r requirements.txt

Inference (may need Huggingface login)

You can either use the source code in this repo or call the HuggingFace Transformers API.

Source Code

Pixio ViT models pre-trained on a web-scale dataset (MetaCLIP-2B):

Model        Parameters  Pre-training Dataset  Download
Pixio-B/16   86M         MetaCLIP-2B           [link]
Pixio-L/16   303M        MetaCLIP-2B           [link]
Pixio-H/16   631M        MetaCLIP-2B           [link]
Pixio-1B/16  1362M       MetaCLIP-2B           [link]
Pixio-5B/16  5441M       MetaCLIP-2B           [link]

First, enter the repository directory:

cd pixio

Then run inference as follows:

from PIL import Image
from torchvision import transforms

from pixio import pixio_vith16

model = pixio_vith16(pretrained="your/checkpoint/path")

# you can try larger resolution, but ensure both sides are divisible by 16
transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=3), # 3 is bicubic
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

img = Image.open("your/image/path").convert("RGB")
img = transform(img)

# block-wise features containing class tokens and patch tokens
features = model(img.unsqueeze(0))
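
If you need the patch tokens separately (e.g., as a dense feature map), here is a minimal sketch. It assumes the class tokens precede the patch tokens and that the patch-token count equals (H / 16) * (W / 16) for the 256 x 256 input above; check the Pixio source for the exact output layout and the number of class tokens.

# hedged sketch, not part of the official API
num_patches = (256 // 16) * (256 // 16)                      # 256 patch tokens

last = features[-1] if isinstance(features, (list, tuple)) else features
patch_tokens = last[:, -num_patches:, :]                     # (1, 256, C)
class_tokens = last[:, :-num_patches, :]                     # (1, num_class_tokens, C)

# reshape patch tokens into a 2D feature map for dense tasks
feature_map = patch_tokens.transpose(1, 2).reshape(1, -1, 256 // 16, 256 // 16)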

Transformers (may need Huggingface login)

You can find all HuggingFace paths under this collection.

from transformers import AutoImageProcessor, AutoModel
from PIL import Image

img = Image.open("your/image/path").convert("RGB")

processor = AutoImageProcessor.from_pretrained("facebook/pixio-vith16")
model = AutoModel.from_pretrained("facebook/pixio-vith16")

inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
features_norm = outputs.last_hidden_state # class tokens + patch tokens after the last LayerNorm
features = outputs.hidden_states[-1] # class tokens + patch tokens before the last LayerNorm
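
To turn these tokens into a single global embedding (e.g., for retrieval or k-NN), one option is to pool the class tokens. The sketch below is an assumption-laden example, not the official API: it assumes the class tokens come first in the sequence and that the patch size is 16.

import torch.nn.functional as F

# hedged sketch: pool the class tokens into one L2-normalized embedding
_, _, h, w = inputs["pixel_values"].shape
num_patches = (h // 16) * (w // 16)

class_tokens = features_norm[:, :-num_patches, :]           # (1, num_class_tokens, C)
embedding = F.normalize(class_tokens.mean(dim=1), dim=-1)   # (1, C)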

Pre-training

Data Preparation

We provide examples using ImageNet-1K and ImageNet-21K. We use ImageNet datasets organized as tar files from HuggingFace.
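
One way to fetch those tar shards is with the huggingface_hub library, as in the hedged sketch below. The dataset repo id and file pattern here are examples only; the ImageNet datasets on HuggingFace are gated, so accept the dataset terms and log in (e.g., with huggingface-cli login) before downloading.

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ILSVRC/imagenet-1k",    # example dataset id; replace with the dataset you use
    repo_type="dataset",
    allow_patterns=["data/*.tar*"],  # pull only the tar shards; adjust to the repo's layout
)
print(local_dir)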

Launch Pre-training

cd pretraining

# specify your data path in the script
bash scripts/pretrain_pixio_vith16_imagenet.sh

Evaluation

We provide the evaluation code for monocular depth estimation (NYUv2, KITTI), semantic segmentation (ADE20K, Pascal VOC, LoveDA), and k-NN classification (ImageNet-1K).

Data Preparation

Monocular Depth Estimation

We follow ZoeDepth and BTS for data preparation. Please organize the data as follows:

├── [Your NYUv2 Path]
    ├── sync
    │   ├── basement_0001a
    │   ├── bathroom_0001
    │   └── ...    
    └── official_splits
        └── test
            ├── bathroom
            ├── bedroom
            └── ...

├── [Your KITTI Path]
    ├── images
    │   ├── 2011_09_26
    │   ├── 2011_09_28
    │   └── ...    
    └── annotations # extracted from data_depth_annotated.zip
        ├── 2011_09_26_drive_0001_sync
        ├── 2011_09_26_drive_0002_sync
        └── ...

Semantic Segmentation

We mainly follow UniMatch V2 for data preparation. Please organize the data as follows:

├── [Your ADE20K Path]
    ├── images
    │   ├── training
    │   └── validation
    └── annotations
        ├── training
        └── validation

├── [Your Pascal Path]
    ├── JPEGImages
    └── SegmentationClass

├── [Your LoveDA Path]
    ├── Train/Train
    └── Val/Val

k-NN Classification

Follow this script to prepare ImageNet-1K.
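
The k-NN protocol itself is implemented in the evaluation code; for orientation only, below is a minimal sketch of the standard weighted k-NN classifier over frozen, L2-normalized features. The hyperparameters k and T are common defaults, not necessarily the ones Pixio uses, and a real run would process the test features in chunks to save memory.

import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    # weighted k-NN over cosine similarities of L2-normalized features
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sim = test_feats @ train_feats.T                     # (N_test, N_train)
    topk_sim, topk_idx = sim.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]                 # (N_test, k)

    # temperature-scaled similarity weights, accumulated per class
    weights = (topk_sim / T).exp()
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)
    return votes.argmax(dim=1)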

Launch Evaluation

cd evaluation

model="pixio_vith16"
pretrained="your/checkpoint/path"

# specify the data path in config files or script
sbatch launch_monodepth.sh monodepth/configs/nyuv2_dpt.yaml $model $pretrained
sbatch launch_semseg.sh semseg/configs/ade20k_linear.yaml $model $pretrained
sbatch launch_knn.sh $model $pretrained

# or run all evaluations together
bash run_all.sh $model $pretrained

Distillation

Launch Distillation

cd distillation

# specify your data path and teacher checkpoint path in the scripts
bash scripts/distill_pixio_vit5b16_to_vit1b16+vith16_imagenet.sh
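
The distillation objective itself is defined in the script above. As a rough orientation only, here is a minimal sketch of generic feature distillation, in which a student (plus a linear projection to the teacher's width) regresses a frozen teacher's output tokens with a smooth L1 loss. The function names, the projection head, and the loss are assumptions for illustration; Pixio's actual recipe lives in the distillation code.

import torch
import torch.nn.functional as F

def distill_step(student, teacher, proj, images, optimizer):
    # one generic feature-distillation step (illustrative, not Pixio's exact recipe);
    # assumes student(images) and teacher(images) return token tensors with matching token counts
    with torch.no_grad():
        target = teacher(images)              # (B, N, C_teacher), teacher stays frozen

    pred = proj(student(images))              # project student tokens to the teacher's width
    loss = F.smooth_l1_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()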

License

Pixio is licensed under the Facebook license; see the LICENSE file for details.

Acknowledgement

We sincerely thank the authors of MAE, DINO, DINOv2, and DINOv3 for open-sourcing their code and models.

Citation

@article{pixio,
  title={In Pursuit of Pixel Supervision for Visual Pre-training},
  author={Yang, Lihe and Li, Shang-Wen and Li, Yang and Lei, Xinjie and Wang, Dong and Mohamed, Abdelrahman and Zhao, Hengshuang and Xu, Hu},
  journal={arXiv:2512.15715},
  year={2025}
}
