Haotian Wang¹, Yusong Huang¹, Zhaonian Kuang²,¹, Hongliang Lu¹, Xinhu Zheng¹,†, Meng Yang²,†, and Gang Hua³
¹The Hong Kong University of Science and Technology (Guangzhou); ²Xi'an Jiaotong University; ³Amazon.com, Inc.
- 2026-05-21: Paper, project page, and Hugging Face demo are released.
UniT is a unified feed-forward model that brings a wide range of geometry perception capabilities into a single framework, covering diverse view configurations, modality combinations, metric-scale perception, and long-horizon scalability. It supports both online and offline inference over an arbitrary number of views, flexibly incorporates auxiliary modalities such as camera parameters and depth maps, recovers geometry at metric scale (in meters), and maintains bounded complexity over long horizons in unconstrained, in-the-wild environments.
The paper is currently under review, and the code is not publicly available at this stage. In the meantime, we provide a Hugging Face demo for testing UniT.
If you find UniT useful in your research, please consider citing:
@misc{wang2026unit,
  title={UniT: Unified Geometry Learning with Group Autoregressive Transformer},
  author={Haotian Wang and Yusong Huang and Zhaonian Kuang and Hongliang Lu and Xinhu Zheng and Meng Yang and Gang Hua},
  year={2026},
  eprint={2605.21131},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.21131},
}