GesVLA: Gesture-Aware Vision-Language-Action Model with Embedded Representations

Project Page   Paper   Dataset

🛠️ Installation

We manage Python dependencies with uv. If uv is not installed yet, follow the uv installation instructions to set it up first.

Run the following to set up the environment:

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

NOTE: GIT_LFS_SKIP_SMUDGE=1 is needed to pull LeRobot as a dependency.

For more details, refer to the original openpi repository.

🚀 Training GesVLA

Data Preparation

All data should be placed under the data/ directory at the project root. The directory structure should look like the following:

data/
├── datasets/
│   ├── gestureVLA_2026xxxx/           # Robot arm action dataset
│   └── pointing_dataset_xxxx/         # Synthetic gesture reasoning data
├── reasoning/
│   └── reference_images/              # Visual prompt reference images
├── gesture_data/
│   └── realgesdata/                   # Real gesture reasoning data (aligned with robot actions)
...

All datasets can be obtained from here. We provide synthetic data for the block scene, which can be used to train and validate the Intent Reasoning model of GesVLA. Additional synthetic data can be generated with the data generation pipeline in data_generation/. We also provide robot arm action data for the block, jelly, and fruit/vegetable scenes. More data is provided than training requires, so you can use a subset as needed.
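
Before starting a training run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name find_missing and the glob patterns are assumptions made for illustration, and the elided "xxxx" suffixes are matched with wildcards rather than guessed.

```python
from pathlib import Path

# Sub-directories named in the layout above. The "xxxx" suffixes are elided in
# this README, so they are matched with glob patterns instead of fixed names.
EXPECTED_PATTERNS = {
    "robot arm action dataset": "datasets/gestureVLA_*",
    "synthetic gesture reasoning data": "datasets/pointing_dataset_*",
    "visual prompt reference images": "reasoning/reference_images",
    "real gesture reasoning data": "gesture_data/realgesdata",
}


def find_missing(data_root: Path) -> list[str]:
    """Return descriptions of expected entries that have no matching directory."""
    missing = []
    for description, pattern in EXPECTED_PATTERNS.items():
        if not any(path.is_dir() for path in data_root.glob(pattern)):
            missing.append(description)
    return missing


if __name__ == "__main__":
    missing = find_missing(Path("data"))
    if missing:
        print("Missing under data/: " + ", ".join(missing))
    else:
        print("data/ layout looks complete")
```

The first training stage only needs the synthetic gesture reasoning data, so a report that lists the other three entries as missing does not block it.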

Training Pipeline

This project adopts a two-stage training paradigm. First, the Intent Reasoning model is pre-trained on synthetic gesture reasoning data. Before training, ensure that data/datasets/pointing_dataset_xxxx exists. Run:

bash train_scripts/train_ges_reasoning.sh

After training, the model checkpoint is saved automatically to the models/ directory.

The second stage trains the GesVLA model. GesVLA loads the Intent Reasoning checkpoint and keeps it frozen for intent reasoning, while the remaining parts of the model are fine-tuned. Before training, make sure the Intent Reasoning checkpoint path is configured correctly.

This stage requires three inputs: real robot action data in data/datasets/gestureVLA_2026xxxx; real gesture pointing data in data/gesture_data/realgesdata/, aligned with the real robot actions at the episode level; and visual prompt images in data/reasoning/reference_images/. The visual prompt images are obtained by running batch inference on the real gesture data with the pre-trained Intent Reasoning model, which accelerates training. Run:

bash train_scripts/train_gesvla_2vlm.sh

Configuration for both training stages can be adjusted in the training scripts.
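
The freeze-then-fine-tune idea in the second stage can be illustrated with a toy example. This is a conceptual sketch only, not the repository's training code: the parameter names, the "intent_reasoning/" prefix, and the scalar "weights" are all hypothetical, and a plain SGD step stands in for the real optimizer.

```python
# Hypothetical name prefix for the Intent Reasoning sub-model; the real
# parameter names in the repository may differ.
FROZEN_PREFIX = "intent_reasoning/"


def trainable_mask(params: dict[str, float]) -> dict[str, bool]:
    """Mark every parameter outside the frozen sub-model as trainable."""
    return {name: not name.startswith(FROZEN_PREFIX) for name in params}


def sgd_step(
    params: dict[str, float],
    grads: dict[str, float],
    mask: dict[str, bool],
    lr: float = 0.1,
) -> dict[str, float]:
    """Apply a plain SGD update, leaving masked-out parameters unchanged."""
    return {
        name: (value - lr * grads[name]) if mask[name] else value
        for name, value in params.items()
    }


# Toy scalar "weights" standing in for real parameter tensors.
params = {"intent_reasoning/w": 1.0, "action_head/w": 1.0}
grads = {"intent_reasoning/w": 0.5, "action_head/w": 0.5}

updated = sgd_step(params, grads, trainable_mask(params))
print(updated)
```

Only the parameter outside the frozen prefix changes after the step; the frozen one keeps its loaded value even though a gradient was computed for it.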

🦾 Deployment

We adopt a policy server + hardware client architecture for deployment. We provide the server-side deployment code. The client-side code should provide observation data to the server and receive action data for execution. You can implement the client based on your own robot hardware configuration.

Configure the model checkpoint path and run:

bash scripts/serve_2vlm.sh
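
Since the client is left to the user, here is a hardware-agnostic sketch of the control loop it would run. Everything in it is an assumption made for illustration: the Policy and Robot interfaces, the "actions" key holding a chunk of actions, and the observation keys are hypothetical, and the network transport to the server is deliberately left out. The two stub classes exist only so the loop can be run without a robot or a server.

```python
from typing import Any, Protocol


class Policy(Protocol):
    """Anything that maps one observation dict to a result dict."""

    def infer(self, observation: dict[str, Any]) -> dict[str, Any]: ...


class Robot(Protocol):
    """Hypothetical hardware interface; implement it for your own robot."""

    def get_observation(self) -> dict[str, Any]: ...

    def apply_action(self, action: Any) -> None: ...


def run_episode(policy: Policy, robot: Robot, max_steps: int) -> int:
    """Query the policy max_steps times and execute every returned action."""
    queries = 0
    for _ in range(max_steps):
        observation = robot.get_observation()
        result = policy.infer(observation)
        # "actions" is an assumed key holding a chunk of consecutive actions.
        for action in result["actions"]:
            robot.apply_action(action)
        queries += 1
    return queries


class ConstantPolicy:
    """Stub standing in for the remote policy server."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def infer(self, observation: dict[str, Any]) -> dict[str, Any]:
        # Seven zeros per action, mirroring a 7-DoF arm.
        return {"actions": [[0.0] * 7 for _ in range(self.chunk_size)]}


class FakeRobot:
    """Stub robot that records the actions it was asked to execute."""

    def __init__(self) -> None:
        self.executed: list[Any] = []

    def get_observation(self) -> dict[str, Any]:
        # Hypothetical keys; a real client would send camera images and state.
        return {"global_image": None, "side_image": None, "gripper_image": None}

    def apply_action(self, action: Any) -> None:
        self.executed.append(action)


if __name__ == "__main__":
    robot = FakeRobot()
    run_episode(ConstantPolicy(chunk_size=3), robot, max_steps=4)
    print(len(robot.executed), "actions executed")
```

In a real client, ConstantPolicy would be replaced by an object that forwards each observation to the running server and returns its reply, and FakeRobot by a driver for the actual hardware.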

🎬 Demo

Real-world rollouts with a 7-DoF manipulator and three-camera observations (global, side, gripper). Each clip follows combined gesture-and-language instructions. For clearer videos, see the project page.

Pick-and-Place Block

Grasp a specified block from clutter and place it on the designated plate.

Pick-and-Place Block demo

Select Jelly

Pick specified jelly cups in pointing order and place them into a target plate.

Select Jelly demo

Select Fruit and Vegetable

Pick specified bell peppers and bananas in pointing order and place them into a basket.

Select Fruit and Vegetable demo

🙏 Acknowledgements

We express our sincere gratitude to the developers of openpi and OneTwoVLA for open-sourcing their code, which has provided strong support for our project.

📜 Citation

@article{guo2026gesvla,
      title={GesVLA: Gesture-Aware Vision-Language-Action Model with Embedded Representations},
      author={Wenxuan Guo and Ziyuan Li and Meng Zhang and Yichen Liu and Yimeng Dong and Chuxi Xu and Yunfei Wei and Ze Chen and Erjin Zhou and Jianjiang Feng},
      journal={arXiv preprint arXiv:2605.22812},
      year={2026},
      url={https://arxiv.org/abs/2605.22812}, 
}
