Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir
Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.
We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than the baselines.
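For intuition, below is a minimal, self-contained sketch of the kind of multi-modal token masking described above: a global budget of visible tokens is split across modalities and the rest are masked out as reconstruction targets. The modality names, patch counts, and Dirichlet concentration are illustrative placeholders, not the repository's actual API.

```python
import torch

def sample_visible_indices(num_patches_per_modality, num_visible_total, alpha=1.0):
    """Illustrative multi-modal masking: split a budget of visible tokens
    across modalities via a symmetric Dirichlet, then pick random patch
    indices per modality. All remaining patches are masked out and serve
    as reconstruction targets."""
    modalities = list(num_patches_per_modality.keys())
    # Sample per-modality shares of the visible-token budget.
    shares = torch.distributions.Dirichlet(
        torch.full((len(modalities),), alpha)
    ).sample()
    visible = {}
    for modality, share in zip(modalities, shares):
        n_patches = num_patches_per_modality[modality]
        n_visible = min(int(torch.round(share * num_visible_total)), n_patches)
        perm = torch.randperm(n_patches)
        visible[modality] = perm[:n_visible]  # indices of unmasked patches
    return visible

# Example: three modalities, each a 14x14 grid of patches, 98 visible tokens in total.
print(sample_visible_indices({"rgb": 196, "depth": 196, "semseg": 196}, num_visible_total=98))
```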
- Pre-trained models
- MultiMAE pre-training code
- ImageNet-1K classification fine-tuning code
- Semantic segmentation fine-tuning code (single-modal & multi-modal)
- Depth estimation fine-tuning code
- Taskonomy fine-tuning code
- Colab & Hugging Face demos
- Download links for ImageNet-1K depth and semantic segmentation pseudo labels
We provide the weights of our pre-trained MultiMAE ViT-B model in both the MultiViT (multi-modal) format and the timm (RGB-only) format.
For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.
| Method | Arch. | Pre-training modalities | Pre-training epochs | Weights (MultiViT) | Weights (timm) | Config |
|---|---|---|---|---|---|---|
| MAE | ViT-B | RGB | 1600 | download | download | See MAE |
| MultiMAE | ViT-B | RGB+D+S | 1600 | download | download | link |
These pre-trained models can then be fine-tuned using this codebase to reach the following performance:
| Method | ImageNet-1K Classif. @1 (RGB) | ADE20K Sem. Seg. mIoU (RGB) | Hypersim Sem. Seg. mIoU (RGB) | Hypersim Sem. Seg. mIoU (D) | Hypersim Sem. Seg. mIoU (RGB+D) | NYUv2 Sem. Seg. mIoU (RGB) | NYUv2 Sem. Seg. mIoU (D) | NYUv2 Sem. Seg. mIoU (RGB+D) | NYUv2 Depth δ1 (RGB) |
|---|---|---|---|---|---|---|---|---|---|
| Sup. (DeiT) | 81.8 | 45.8 | 33.9 | - | - | 50.1 | - | - | 80.7 |
| MAE | 83.3 | 46.2 | 36.5 | - | - | 50.8 | - | - | 85.1 |
| MultiMAE | 83.3 | 46.2 | 37.0 | 38.5 | 47.6 | 52.0 | 41.4 | 56.0 | 86.4 |
We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See multimae/multimae.py for the documentation and implementation of MultiMAE / MultiViT.
You can convert between these formats using the provided vit2multimae_converter.py and multimae2vit_converter.py scripts.
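As a quick sanity check, the RGB-only (timm-format) weights can be loaded into a standard timm ViT-B/16 backbone. This is a hedged sketch: the checkpoint path is a placeholder for wherever you saved the downloaded weights, and the nesting under a `model` key is an assumption that may differ depending on how the checkpoint was exported.

```python
import timm
import torch

# Standard ViT-B/16 backbone from timm (num_classes=0 drops the classification head).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Placeholder path to the downloaded timm-format MultiMAE weights.
checkpoint = torch.load("path/to/multimae_vitb_timm.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # some checkpoints nest weights under a "model" key

# strict=False lets us inspect any mismatched keys instead of failing outright.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```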
See SETUP.md for set-up instructions.
See PRETRAINING.md for pre-training instructions.
See FINETUNING.md for fine-tuning instructions.
For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.
This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
If you find this repository helpful, please consider citing our work:
@inproceedings{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  booktitle = {European Conference on Computer Vision},
  year      = {2022},
}