GesVLA: Gesture-Aware Vision-Language-Action Model with Embedded Representations

🛠️ Installation

We manage Python dependencies with uv. If you haven't installed uv yet, follow the uv installation instructions to set it up.

Run the following to set up the environment:

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

NOTE: GIT_LFS_SKIP_SMUDGE=1 is needed to pull LeRobot as a dependency.

For more details, refer to the original openpi repository.

🚀 Training GesVLA

Data Preparation

All data should be placed under the data/ directory at the project root. The directory structure should look like the following:

data/
├── datasets/
│   ├── gestureVLA_2026xxxx/           # Robot arm action dataset
│   └── pointing_dataset_xxxx/         # Synthetic gesture reasoning data
├── reasoning/
│   └── reference_images/              # Visual prompt reference images
├── gesture_data/
│   └── realgesdata/                   # Real gesture reasoning data (aligned with robot actions)
...

All datasets can be obtained from here. We provide synthetic data for the block scene, which can be used to train and validate the Intent Reasoning model of GesVLA. Additional synthetic data can be generated with the data generation pipeline in data_generation/. We also provide robot arm action data for the block, jelly, and fruit/vegetable scenes. More data is provided than training requires, so you can select a subset as needed.
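Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name `find_missing` is hypothetical, and because the `xxxx` suffixes in the tree are placeholders, the check matches `gestureVLA_*` and `pointing_dataset_*` with wildcards, which is an assumption about how the downloaded folders are named.

```python
from pathlib import Path

# Glob patterns relative to data/. The "xxxx" suffixes in the directory tree
# are placeholders, so wildcards are used here (an assumption about naming).
REQUIRED_PATTERNS = {
    "robot arm action dataset": "datasets/gestureVLA_*",
    "synthetic gesture reasoning data": "datasets/pointing_dataset_*",
    "visual prompt reference images": "reasoning/reference_images",
    "real gesture reasoning data": "gesture_data/realgesdata",
}


def find_missing(data_root: Path) -> list[str]:
    """Return descriptions of required directories that are absent under data_root."""
    missing = []
    for description, pattern in REQUIRED_PATTERNS.items():
        # A pattern counts as satisfied if at least one matching directory exists.
        if not any(p.is_dir() for p in data_root.glob(pattern)):
            missing.append(f"{description} ({pattern})")
    return missing


if __name__ == "__main__":
    problems = find_missing(Path("data"))
    if problems:
        print("Missing under data/:")
        for item in problems:
            print("  -", item)
    else:
        print("data/ layout looks complete.")
```

Run it from the project root so that the relative `data` path resolves correctly. Only the first training stage strictly needs the synthetic pointing dataset; the other three entries matter for the second stage.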

Training Pipeline

This project adopts a two-stage training paradigm. First, the Intent Reasoning model is pre-trained on synthetic gesture reasoning data. Before training, ensure that data/datasets/pointing_dataset_xxxx exists. Run:

bash train_scripts/train_ges_reasoning.sh

After training, the model checkpoint is saved automatically to the models/ directory.

Next, train the GesVLA model. Before training, make sure the Intent Reasoning checkpoint path is configured correctly: GesVLA loads and freezes this checkpoint for intent reasoning and fine-tunes the remaining parts of the model.

This stage requires real robot action data in data/datasets/gestureVLA_2026xxxx, real gesture pointing data in data/gesture_data/realgesdata/ (aligned with the real robot actions at the episode level), and visual prompt images in data/reasoning/reference_images/ (obtained by batch inference on the real gesture data with the pre-trained Intent Reasoning model, to accelerate training). Run:

bash train_scripts/train_gesvla_2vlm.sh

Configuration for both training stages can be adjusted in the training scripts.
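To make the "load and freeze, then fine-tune the rest" idea from the second stage concrete, here is a toy sketch in plain Python. It is not the repository's training code: the parameter names, the `intent_reasoning/` prefix, and the `sgd_step` helper are all hypothetical, and a real implementation would operate on the framework's parameter tree rather than a flat dict of floats.

```python
# Toy parameter tree: a flat dict keyed by "module/param".
# All names here are hypothetical and chosen only for illustration.
FROZEN_PREFIX = "intent_reasoning/"


def is_trainable(name: str) -> bool:
    """A parameter is trainable unless it belongs to the frozen sub-model."""
    return not name.startswith(FROZEN_PREFIX)


def sgd_step(params: dict[str, float], grads: dict[str, float], lr: float) -> dict[str, float]:
    """Apply one SGD update to trainable parameters; frozen ones are copied unchanged."""
    return {
        name: value - lr * grads[name] if is_trainable(name) else value
        for name, value in params.items()
    }


params = {"intent_reasoning/w": 1.0, "action_expert/w": 1.0}
grads = {"intent_reasoning/w": 0.5, "action_expert/w": 0.5}
updated = sgd_step(params, grads, lr=0.1)
print(updated)
```

Both parameters receive the same gradient, but only the one outside the frozen prefix moves. This is why the Intent Reasoning checkpoint path has to be correct before the second stage starts: its weights are used as-is and are not repaired by further training.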

🦾 Deployment

We adopt a policy server + hardware client architecture for deployment. We provide the server-side deployment code. The client-side code should provide observation data to the server and receive action data for execution. You can implement the client based on your own robot hardware configuration.

Configure the model checkpoint path and run:

bash scripts/serve_2vlm.sh
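On the client side, the loop is roughly: read an observation, send it to the server, execute the returned actions, repeat. The sketch below shows that loop against a stand-in policy so it runs offline. The `EchoPolicy` class, the `run_episode` helper, the observation keys, and the action format are all assumptions for illustration. Upstream openpi provides `openpi_client.websocket_client_policy.WebsocketClientPolicy` with an `infer(observation)` method that returns a dict containing `"actions"`; if this repository keeps that interface (an assumption to verify against packages/openpi-client), such a client could be passed in place of `EchoPolicy`.

```python
from typing import Any, Callable, Protocol


class Policy(Protocol):
    """Anything with an infer(observation) -> dict method, e.g. a policy-server client."""

    def infer(self, observation: dict[str, Any]) -> dict[str, Any]: ...


class EchoPolicy:
    """Stand-in for a real server connection so the loop can be run offline."""

    def infer(self, observation: dict[str, Any]) -> dict[str, Any]:
        # Fixed chunk of two zero actions for a 7-DoF arm (hypothetical format).
        return {"actions": [[0.0] * 7, [0.0] * 7]}


def run_episode(
    policy: Policy,
    get_observation: Callable[[], dict[str, Any]],
    execute_action: Callable[[Any], None],
    max_steps: int,
) -> int:
    """Query the policy for action chunks and execute them until max_steps actions are sent."""
    steps = 0
    while steps < max_steps:
        chunk = policy.infer(get_observation())["actions"]
        if not chunk:
            break  # Avoid spinning forever if the server returns no actions.
        for action in chunk:
            if steps >= max_steps:
                break
            execute_action(action)
            steps += 1
    return steps


# Usage with placeholder hardware hooks; the observation keys are hypothetical.
executed = []
n = run_episode(
    EchoPolicy(),
    get_observation=lambda: {"images": {}, "state": [0.0] * 7, "prompt": "pick up this block"},
    execute_action=executed.append,
    max_steps=3,
)
print(n, len(executed))
```

On real hardware, `get_observation` would read the cameras and joint state, and `execute_action` would send the command to the arm. The observation keys the served model actually expects should be taken from the server-side code.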

🎬 Demo

Real-world rollouts with a 7-DoF manipulator and three-camera observations (global, side, gripper). Each clip follows combined gesture-and-language instructions. For clearer videos, see the project page.

Pick-and-Place Block

Grasp a specified block from clutter and place it on the designated plate.

Select Jelly

Pick specified jelly cups in pointing order and place them onto a target plate.

Select Fruit and Vegetable

Pick specified bell peppers and bananas in pointing order and place them into a basket.

🙏 Acknowledgements

We express our sincere gratitude to the developers of openpi and OneTwoVLA for open-sourcing their code, which has provided strong support for our project.

📜 Citation

@article{guo2026gesvla,
      title={GesVLA: Gesture-Aware Vision-Language-Action Model with Embedded Representations},
      author={Wenxuan Guo and Ziyuan Li and Meng Zhang and Yichen Liu and Yimeng Dong and Chuxi Xu and Yunfei Wei and Ze Chen and Erjin Zhou and Jianjiang Feng},
      journal={arXiv preprint arXiv:2605.22812},
      year={2026},
      url={https://arxiv.org/abs/2605.22812}, 
}
