We introduce DexVLG, a vision-language-grasp model trained on a large-scale synthetic dataset. It generates instruction-aligned dexterous grasp poses and achieves state-of-the-art grasp success rates and part-grasp accuracy.
- DexGraspNet3.0, a large-scale dataset containing 170M part-aligned dexterous grasp poses on 174k objects, each pose annotated with a semantic caption.
- DexVLG, a vision-language model that generates language-instructed dexterous grasp poses end-to-end (see the hypothetical usage sketch after this list).
- We curate benchmarks and conduct extensive experiments to evaluate DexVLG in simulation and the real world.
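Since the training/inference code and weights are not yet released, the snippet below is only a minimal sketch of what a part-aligned grasp record and an instruction-conditioned inference call might look like. Every name in it (`GraspRecord`, `generate_grasp`, the assumed joint count) is an illustrative assumption, not the released DexVLG API.

```python
# Hypothetical sketch: a part-aligned grasp record and a language-conditioned
# grasp generator. Names and shapes are assumptions for illustration only.
from dataclasses import dataclass

import numpy as np


@dataclass
class GraspRecord:
    object_id: str                 # object the grasp is annotated on
    part_caption: str              # semantic caption of the grasped part, e.g. "the mug handle"
    wrist_translation: np.ndarray  # (3,) hand-root position in the object frame
    wrist_rotation: np.ndarray     # (3, 3) hand-root orientation
    joint_angles: np.ndarray       # (J,) dexterous-hand joint configuration


def generate_grasp(point_cloud: np.ndarray, instruction: str) -> GraspRecord:
    """Placeholder for the language-conditioned grasp-pose generator.

    A real model would consume the object observation and the instruction and
    regress a full hand pose; here we return a dummy record to show the shape
    of the expected output.
    """
    num_joints = 22  # assumed hand DoF, purely illustrative
    return GraspRecord(
        object_id="example_object",
        part_caption=instruction,
        wrist_translation=np.zeros(3),
        wrist_rotation=np.eye(3),
        joint_angles=np.zeros(num_joints),
    )


if __name__ == "__main__":
    cloud = np.random.rand(4096, 3)  # stand-in for a captured object point cloud
    grasp = generate_grasp(cloud, "grasp the mug by its handle")
    print(grasp.part_caption, grasp.joint_angles.shape)
```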
TODO List:
- Release the training/inference code of DexVLG
- Release model weights
```bibtex
@article{dexvlg25,
  title={DexVLG: Dexterous Vision-Language-Grasp Model at Scale},
  author={He, Jiawei and Li, Danshi and Yu, Xinqiang and Qi, Zekun and Zhang, Wenyao and Chen, Jiayi and Zhang, Zhaoxiang and Zhang, Zhizheng and Yi, Li and Wang, He},
  journal={arXiv preprint arXiv:2507.02747},
  year={2025}
}
```