On par with or surpassing the 10B-level industrial SOTA generalist (FLUX.1-Fill-Dev) on 6 benchmarks across natural and portrait scenes, with only 2% (0.2B) of the parameters and 15× faster inference
Kangsheng Duan1,*, Ziyang Xu1,*,†, Wenyu Liu1, Xiaohu Ruan2, Xiaoxin Chen2, Xinggang Wang1,📧
(*) Equal Contribution, (†) Project Leader, (📧) Corresponding Author.
1 Huazhong University of Science and Technology. 2 VIVO AI Lab.
Moebius is our latest AI image inpainting endeavor and a direct continuation of our previous work, PixelHacker. Named after the concepts of "infinity" and "master painter," Moebius embodies our vision: maintaining exceptional generation quality under highly constrained computational resources while pushing the efficiency of image inpainting to its limits.
Under the iron grip of the Scaling Law, AI research has long devolved into a grueling arms race of burning capital, compute, and data. Consequently, the academic community finds it increasingly difficult to keep pace with the ever-expanding model scales driven by the tech industry.
"But is this brute-force scaling truly the only path forward?"
Using general-purpose image inpainting as our strategic entry point, we challenge the "scale-at-all-costs" path dependency dictated by the Scaling Law narrative. Through the synergistic optimization of architectural design and knowledge distillation, Moebius achieves a remarkably compact footprint of just 0.22B parameters. It liberates high-quality image inpainting from the heavy-compute narrative of 10B+ foundation models: across six comprehensive benchmarks spanning both natural and portrait scenes, Moebius performs on par with, and in certain scenarios surpasses, 10B+ industrial state-of-the-art (SOTA) generalist models such as FLUX.1-Fill-Dev in inpainting quality, while delivering a >15× inference speedup.
💡 The core insight of Moebius can be summarized in a single equation:
$$\begin{aligned} \text{Synergy} \times (\text{Architecture} + \text{Distillation}) = & \text{Shattering the "Impossible Triangle" of} \\ & \text{Low Parameters, Fast Inference, and High Quality} \end{aligned}$$

--- written on June 16, 2026 ---
- Extreme Parametric Efficiency (< 2%): Moebius operates with a mere 0.22B (226M) parameters, less than 2% of the size of the industrial giant FLUX.1-Fill-Dev (11.9B). It breaks the heavy-compute narrative, making high-quality inpainting accessible on consumer-grade and edge devices.
- ⚡ 15× Inference Speedup (26ms/step): Achieves an inference latency of only 26.01 ms per step on a single GPU. Combined with optimized sampling steps, Moebius delivers a >15× total runtime acceleration compared to 10B-level models.
- 10B-Level Inpainting Quality (on par with or surpassing FLUX.1-Fill-Dev across 6 benchmarks): A smaller model does not have to mean degraded representations. Through the synergistic optimization of architecture and distillation, Moebius performs on par with, and in certain scenarios (such as complex textures and facial plausibility) surpasses, 10B-level state-of-the-art (SOTA) generalist models (FLUX.1-Fill-Dev, SD3.5 Large-Inpainting) across 6 comprehensive benchmarks spanning both natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ).
- 💡 Synergistic Core Innovations:
  - Architecture Design (LλMI Block): Reformulates both self- and cross-attention by condensing spatial context and global semantic priors into fixed-size linear matrices, bypassing quadratic computational overhead.
  - Adaptive Multi-Granularity Distillation Strategy: Transfers the representational capacity of our PixelHacker (teacher) strictly within the latent space, avoiding expensive pixel-space decoding. It bridges the large capacity gap by aligning multi-granularity supervision, ranging from microscopic intermediate features to macroscopic diffusion trajectories, while dynamically balancing training via a gradient-norm adaptive loss weighting mechanism.
  - Optimal Synergistic Balancing: Systematically explores the mutual constraint and upper bound between compact structure and distillation. By mapping this architecture-distillation synergy frontier, we ensure our 0.22B Moebius (student) absorbs the maximum semantic reasoning of PixelHacker (teacher) without triggering representation saturation.
- Task-Specific Specialist over Bloated Generalists: Rather than blindly scaling up, Moebius answers a fundamental question: Can a model be smarter, lighter, and faster when the task is explicitly defined? It serves as a highly optimized specialist that liberates real-world image inpainting and AI object removal from parameter bloat.
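To make the "fixed-size linear matrices" idea above more concrete, here is a minimal sketch of generic kernelized linear attention in plain Python. It is not the LλMI block: how Moebius actually builds its matrices is not described in this README, and the names `feature_map` and `linear_attention` as well as the `elu(x) + 1` feature map are assumptions chosen for illustration. The point is only that keys and values are folded once into a `d × d_v` summary whose size does not depend on the number of tokens, so the cost grows linearly with sequence length rather than quadratically.

```python
import math


def feature_map(x):
    # Assumed positive feature map elu(x) + 1; the real block may differ.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Generic linear attention (illustrative, not the Moebius implementation).

    queries, keys: lists of N vectors with d floats each.
    values: list of N vectors with d_v floats each.
    Returns (outputs, kv), where kv is the fixed-size d x d_v summary.
    """
    d = len(keys[0])
    d_v = len(values[0])

    # Fold all keys/values into summaries whose size is independent of N.
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for i in range(d):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]

    # Each query reads from the summary only: O(d * d_v) work per token.
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[i] * k_sum[i] for i in range(d))
        outputs.append(
            [sum(phi_q[i] * kv[i][j] for i in range(d)) / norm for j in range(d_v)]
        )
    return outputs, kv
```

Because the normalized weights sum to one, every output row is a convex combination of the value vectors, just as in softmax attention, but no `N × N` attention map is ever materialized.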
- June 19, 2026: Moebius has achieved the No. 1 daily ranking on Hugging Face!
- June 18, 2026: 🔥🔥 We have released the training and inference code, and open-sourced the model weights on Hugging Face.
- June 18, 2026: Moebius has been accepted by ECCV'26! The preprint is available on arXiv.
- June 16, 2026: 🔥 We have submitted the GitHub repo for the first time, and there will be more updates soon. Stay tuned!
The masks of the evaluation set are shared on Google Drive, and the corresponding images can be downloaded from the open-source releases of the benchmark datasets (Places2, CelebA-HQ, FFHQ).
- torch=2.7.1
- diffusers=0.38.0
- transformers=4.56.2
- flash-linear-attention=0.3.2
- See 'requirements.txt' for detailed Python libraries required
conda create -n moebius python=3.14.4
conda activate moebius
# cd /xx/xx/Moebius
pip install -r requirements.txt

- Download the checkpoint of the VAE and put it into ./weight/vae.
- Download the checkpoints of the pretrained version, fine-tuned version (places2), fine-tuned version (celeba-hq), and fine-tuned version (ffhq), and put them into ./weight/Moebius.
- The resulting directory layout is as follows:
├── weight
|   ├── Moebius
|       ├── pretrained
|           ├── diffusion_pytorch_model.bin
|       ├── ft_places2
|           ├── diffusion_pytorch_model.bin
|       ├── ft_celebahq
|           ├── diffusion_pytorch_model.bin
|       ├── ft_ffhq
|           ├── diffusion_pytorch_model.bin
|   ├── vae
|       ├── config.json
|       ├── diffusion_pytorch_model.bin
├── ...

You can run the following commands to start training. The training script supports distributed training, and you can configure the GPU count via environment variables.
# For single GPU training:
PY_TRAINER=train_distillation.py bash run/run_ddp_1node.sh config/train_demo.sh
# For multi GPU training:
NUM_GPUS_PER_MACHINE=4 bash run/run_ddp_1node.sh config/train_demo.sh

You can run the infer.infer_moebius command directly to get the inpainting result for the example image-mask pair; the result will be written to ./outputs. To run inference on custom data, place each image and its mask under the same file name in ./dataset.local/imgs and ./dataset.local/masks, respectively, and pass those directories as --real-dir and --mask-dir.
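Before running the inference command that follows this sketch on custom data, it may help to confirm that every image really has a mask with the same name. The helper below is a hypothetical pre-flight check, not part of the Moebius repository; `find_unpaired` is an illustrative name, and it assumes "same name" means an identical file name including the extension.

```python
from pathlib import Path


def find_unpaired(img_dir, mask_dir):
    """Return (images without a mask, masks without an image).

    Assumption: an image and its mask share the exact same file name,
    extension included. Adjust the comparison if your masks differ.
    """
    imgs = {p.name for p in Path(img_dir).iterdir() if p.is_file()}
    masks = {p.name for p in Path(mask_dir).iterdir() if p.is_file()}
    return sorted(imgs - masks), sorted(masks - imgs)
```

For example, `find_unpaired("./dataset.local/imgs", "./dataset.local/masks")` returns two lists that should both be empty when the folders are correctly paired.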
python -m infer.infer_moebius \
--model-config config/model_cfg/moebius.yaml \
--model-weight weight/Moebius/ft_celebahq/diffusion_pytorch_model.bin \
--real-dir data/images \
--mask-dir data/masks \
--save-dir ./outputs \
--cfg 2.0 \
--batch-size 8 \
--num-workers 8

@misc{DuanAndXu2026Moebius,
title={Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
author={Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
year={2026},
eprint={2606.19195},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.19195},
}

We sincerely thank the authors of the following open-source repositories for their contributions to the community, which have greatly facilitated our research and development of Moebius: Sana, flash-linear-attention, lambda-networks, timm, Muon, diffusers.