This is the official implementation of A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping, accepted at ICRA 2025. The project is split across three separate GitHub repos, one each for the ETOG, ETRG-A, and ETRG-B models. Stay tuned for code release. Here is our Project Page.
- This git repo includes the ETOG model, which is designed for parameter-efficient tuning on the Referring Expression Segmentation (RES) task.
- The ETRG-A model designed for the Referring Grasp Synthesis (RGS) task can be found here.
- The ETRG-B model designed for the Referring Grasp Affordance (RGA) task can be found here.
- Conda env: we used PyTorch (2.1.0+cu118); the other packages are listed in requirements.txt
- Refcoco-related datasets
  - Detailed instructions are in prepare_datasets.md
  - After preparation, the folder layout should look like this:
$ETOG
├── config
├── model
├── engine
├── pretrain (manually download from CLIP -> R50, R101, ViT-B-16)
├── tools
│   ├── data_process.py
│   └── ...
├── ...
└── datasets
    ├── anns
    ├── lmdb
    │   ├── refcoco
    │   ├── refcoco+
    │   ├── refcocog
    │   └── ...
    ├── masks
    │   ├── refcoco
    │   ├── refcoco+
    │   ├── refcocog
    │   └── ...
    └── images
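Before training, the layout above can be sanity-checked with a few lines of Python. This is a minimal sketch and not part of the repository: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the list assumes that `anns`, `lmdb`, `masks`, and `images` sit under `datasets`, as the tree suggests. Only directories named explicitly in the tree are checked.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the repo root.
# Assumption: anns/lmdb/masks/images are nested under datasets/.
EXPECTED_DIRS = (
    "config",
    "model",
    "pretrain",
    "tools",
    "datasets/anns",
    "datasets/lmdb/refcoco",
    "datasets/lmdb/refcoco+",
    "datasets/lmdb/refcocog",
    "datasets/masks/refcoco",
    "datasets/masks/refcoco+",
    "datasets/masks/refcocog",
    "datasets/images",
)


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Calling `missing_dirs(".")` from the repo root returns an empty list when every listed folder is in place; otherwise it returns the relative paths that still need to be created or downloaded.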
Performance (mIoU) on Refcoco dataset:
| Backbone | val | test A | test B | Weights | Train log | Test log |
|---|---|---|---|---|---|---|
| CLIP-R50 | 72.31 | 75.49 | 66.62 | models | log | log |
| CLIP-R101 | 73.37 | 76.16 | 68.54 | models | log | log |
| CLIP-ViT-B | 73.37 | 76.90 | 69.34 | models | log | log |
We release all Refcoco-related pretrained weights reported in our paper.
More training/testing logs and model weights for the Refcoco+ and Refcocog benchmarks are available here on our Google Drive.
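For readers unfamiliar with the metric in the table above, mIoU is commonly computed as the average, over all test samples, of the intersection-over-union between the predicted and ground-truth binary masks. The sketch below illustrates that definition in plain Python. It is not the repository's evaluation code: `mask_iou` and `mean_iou` are hypothetical names, masks are assumed to be equally sized 2D lists of 0/1, and the handling of two empty masks is a convention chosen here. In practice the same computation would run on NumPy arrays or PyTorch tensors.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-sample IoU over paired predictions and ground truths."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy samples: the first overlaps on 1 of 2 pixels, the second matches exactly.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(mean_iou(preds, targets))  # (0.5 + 1.0) / 2 = 0.75
```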
Quick run (training):
bash run_scripts/train.sh
Please modify the config files (e.g. config/refcoco/bridge_r50.yaml) to change values such as batch_size, the data directories, and the test split.
Our default setup: bs=16 on one NVIDIA RTX 2080 Ti GPU.
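If a config value needs to be changed from a script rather than by hand, a small text-level helper is one option. The following is a sketch under stated assumptions: `set_yaml_scalar` is a hypothetical helper, not something this repo provides, and the sample config text (including the `TRAIN` section and the `test_split` key) is illustrative only. Check the actual key names in your config file. The helper rewrites the first matching `key: value` line and keeps indentation and any trailing comment.

```python
import re


def set_yaml_scalar(text, key, value):
    """Replace the value on the first `key: value` line of a YAML text.

    Indentation and a trailing `# comment` are preserved. Raises KeyError
    if no line starts with the given key.
    """
    pattern = re.compile(
        rf"^(?P<indent>[ \t]*){re.escape(key)}:[ \t]*"
        rf"(?P<old>[^#\n]*?)(?P<tail>[ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group('indent')}{key}: {value}{m.group('tail')}",
        text,
        count=1,
    )
    if count == 0:
        raise KeyError(key)
    return new_text


# Hypothetical config fragment; the real files may use different keys.
sample = "TRAIN:\n  batch_size: 64  # per run\nTEST:\n  test_split: val\n"
updated = set_yaml_scalar(sample, "batch_size", 16)
print(updated)
```

To apply it to a real file, read the file's text, pass it through `set_yaml_scalar`, and write the result back. A full YAML parser is the more robust choice when values span several lines or contain `#` characters.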
Quick run (testing):
bash run_scripts/test.sh
or run test.py directly, pointing --config at the desired config file.
We also support saving prediction visualizations. To enable this, set
TEST:
  visualizate: True
in the .yaml config files. Currently, attention visualizations in heatmap style are supported for the R50 and R101 backbones, but not for the ViT backbone.
The code is heavily adapted from ETRIS. We thank the authors for their wonderful codebase.
If ETOG-ETRG is useful for your research, please consider citing:
@article{yu2024parameter,
title={A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping},
author={Yu, Houjian and Li, Mingen and Rezazadeh, Alireza and Yang, Yang and Choi, Changhyun},
journal={arXiv preprint arXiv:2409.19457},
year={2024}
}