Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir
Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.
We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than the baselines.
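For intuition, below is a minimal, self-contained sketch of the kind of multi-modal token masking described above: a global budget of visible tokens is split across modalities and the rest are masked out as reconstruction targets. The modality names, patch counts, and Dirichlet concentration are illustrative placeholders, not the repository's actual API.

```python
import torch

def sample_visible_indices(num_patches_per_modality, num_visible_total, alpha=1.0):
    """Illustrative multi-modal masking: split a budget of visible tokens
    across modalities via a symmetric Dirichlet, then pick random patch
    indices per modality. All remaining patches are masked out and serve
    as reconstruction targets."""
    modalities = list(num_patches_per_modality.keys())
    # Sample per-modality shares of the visible-token budget.
    shares = torch.distributions.Dirichlet(
        torch.full((len(modalities),), alpha)
    ).sample()
    visible = {}
    for modality, share in zip(modalities, shares):
        n_patches = num_patches_per_modality[modality]
        n_visible = min(int(torch.round(share * num_visible_total)), n_patches)
        perm = torch.randperm(n_patches)
        visible[modality] = perm[:n_visible]  # indices of unmasked patches
    return visible

# Example: three modalities, each a 14x14 grid of patches, 98 visible tokens in total.
print(sample_visible_indices({"rgb": 196, "depth": 196, "semseg": 196}, num_visible_total=98))
```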
- Pre-trained models
- MultiMAE pre-training code
- ImageNet-1K classification fine-tuning code
- Semantic segmentation fine-tuning code (single-modal & multi-modal)
- Depth estimation fine-tuning code
- Taskonomy fine-tuning code
- Colab & Hugging Face demos
- Download links for ImageNet-1K depth and semantic segmentation pseudo labels
We provide the weights of our pre-trained MultiMAE ViT-B model in both the MultiViT (multi-modal) format and the timm (RGB-only) format.
For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.
| Method | Arch. | Pre-training modalities | Pre-training epochs | Weights (MultiViT) | Weights (timm) | Config |
|---|---|---|---|---|---|---|
| MAE | ViT-B | RGB | 1600 | download | download | See MAE |
| MultiMAE | ViT-B | RGB+D+S | 1600 | download | download | link |
These pre-trained models can then be fine-tuned using this codebase to reach the following performance:
| Method | ImageNet-1K Classif. @1 (RGB) | ADE20K Sem. Seg. mIoU (RGB) | Hypersim Sem. Seg. mIoU (RGB) | Hypersim Sem. Seg. mIoU (D) | Hypersim Sem. Seg. mIoU (RGB+D) | NYUv2 Sem. Seg. mIoU (RGB) | NYUv2 Sem. Seg. mIoU (D) | NYUv2 Sem. Seg. mIoU (RGB+D) | NYUv2 Depth δ1 (RGB) |
|---|---|---|---|---|---|---|---|---|---|
| Sup. (DeiT) | 81.8 | 45.8 | 33.9 | - | - | 50.1 | - | - | 80.7 |
| MAE | 83.3 | 46.2 | 36.5 | - | - | 50.8 | - | - | 85.1 |
| MultiMAE | 83.3 | 46.2 | 37.0 | 38.5 | 47.6 | 52.0 | 41.4 | 56.0 | 86.4 |
We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See multimae/multimae.py for the documentation and implementation of MultiMAE / MultiViT.
You can convert between these formats using the provided vit2multimae_converter.py and multimae2vit_converter.py scripts.
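As a quick sanity check, the RGB-only (timm-format) weights can be loaded into a standard timm ViT-B/16 backbone. This is a hedged sketch: the checkpoint path is a placeholder for wherever you saved the downloaded weights, and the nesting under a `model` key is an assumption that may differ depending on how the checkpoint was exported.

```python
import timm
import torch

# Standard ViT-B/16 backbone from timm (num_classes=0 drops the classification head).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Placeholder path to the downloaded timm-format MultiMAE weights.
checkpoint = torch.load("path/to/multimae_vitb_timm.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # some checkpoints nest weights under a "model" key

# strict=False lets us inspect any mismatched keys instead of failing outright.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```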
See SETUP.md for set-up instructions.
See PRETRAINING.md for pre-training instructions.
See FINETUNING.md for fine-tuning instructions.
For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.
This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
If you find this repository helpful, please consider citing our work:
@inproceedings{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  booktitle = {European Conference on Computer Vision},
  year      = {2022},
}