Implementation of Streamlined Open-Vocabulary Human-Object Interaction Detection (CVPR 2026)
In this paper, we present SL-HOI, a streamlined one-stage framework for open-vocabulary HOI detection built upon the DINOv3 model. We leverage the complementary strengths of DINOv3's backbone and vision head to effectively address both interactive human-object detection and open-vocabulary interaction classification tasks. Our design includes a novel two-step interaction classification process that bridges representation gaps and enhances feature utilization. Extensive experiments on two popular benchmarks, SWIG-HOI and HICO-DET, demonstrate that SL-HOI achieves state-of-the-art performance in open-vocabulary HOI detection while maintaining a simple architecture with few trainable parameters.
- Python 3.10
- PyTorch 2.5.1
- CUDA ≥ 12.1
- transformers
- accelerate
- deepspeed
A requirements.txt file will be provided later.
git clone https://github.com/MPI-Lab/SL-HOI.git
cd SL-HOI
pip install -r requirements.txt

SWIG-HOI dataset preparation follows THID. Please refer to their documentation for download and setup instructions.
swig_hoi
|─ images_512
|─ annotations
| |─ swig_train_1000.json
| |─ swig_val_1000.json
| |─ swig_trainval_1000.json
| |─ swig_test_1000.json
HICO-DET dataset preparation follows GEN-VLKT. Please refer to their documentation for download and setup instructions.
hico_20160224_det
|─ images
| |─ train2015
| |─ test2015
|─ annotations
| |─ trainval_hico.json
| |─ test_hico.json
| |─ corre_hico.npy
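
Before training, it can help to confirm that each dataset folder matches the layouts listed above. The following is a minimal sketch and not part of the repository: `missing_entries`, `SWIG_HOI_LAYOUT`, and `HICO_DET_LAYOUT` are hypothetical names introduced for illustration, and the paths in the usage section are placeholders.

```python
from pathlib import Path

# Expected entries, copied from the directory trees above.
SWIG_HOI_LAYOUT = [
    "images_512",
    "annotations/swig_train_1000.json",
    "annotations/swig_val_1000.json",
    "annotations/swig_trainval_1000.json",
    "annotations/swig_test_1000.json",
]
HICO_DET_LAYOUT = [
    "images/train2015",
    "images/test2015",
    "annotations/trainval_hico.json",
    "annotations/test_hico.json",
    "annotations/corre_hico.npy",
]


def missing_entries(dataset_root, layout):
    """Return the entries of `layout` that do not exist under `dataset_root`."""
    root = Path(dataset_root)
    return [rel for rel in layout if not (root / rel).exists()]


if __name__ == "__main__":
    # Placeholder paths (assumption); replace with your own dataset locations.
    for name, root, layout in [
        ("SWIG-HOI", "/path/to/your/datasets/swig_hoi", SWIG_HOI_LAYOUT),
        ("HICO-DET", "/path/to/your/datasets/hico_20160224_det", HICO_DET_LAYOUT),
    ]:
        missing = missing_entries(root, layout)
        print(name, "OK" if not missing else f"missing: {missing}")
```

An empty list means every expected file or folder was found; the check only looks at names, not at file contents.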
All model weights are available on HuggingFace: Thatmakes11/SL-HOI-weights
- params/ - Pre-computed HOI classifier weights (swig/ and hico/)
- pretrained/ - Trained checkpoints (swig/, hico/, hico_ov/)
DINOv3 pretrained weights are available at facebookresearch/dinov3.
HOI classifier weights can also be generated using the provided scripts:
python swig_offline_classifier.py \
--dinotxt_weights <path_to_dinov3_text_head_and_vision_head_weights> \
--backbone_weights <path_to_dinov3_backbone_weights> \
--bpe_path_or_url <path_or_url_to_bpe_vocab>
python hico_offline_classifier.py \
--dinotxt_weights <path_to_dinov3_text_head_and_vision_head_weights> \
--backbone_weights <path_to_dinov3_backbone_weights> \
--bpe_path_or_url <path_or_url_to_bpe_vocab>

By default, the classifier weights will be saved in params/.
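
Both classifier scripts take the same three flags, so the two invocations can be assembled from one helper. This is a hedged sketch rather than repository code: `classifier_command` is a hypothetical function, the paths are placeholders, and only the script names and flags shown above are taken from the source.

```python
import shlex
import sys


def classifier_command(script, dinotxt_weights, backbone_weights, bpe_path_or_url):
    """Build the argv list for one of the offline classifier scripts."""
    return [
        sys.executable, script,
        "--dinotxt_weights", dinotxt_weights,
        "--backbone_weights", backbone_weights,
        "--bpe_path_or_url", bpe_path_or_url,
    ]


if __name__ == "__main__":
    # Placeholder paths (assumption); substitute the locations of your DINOv3 files.
    for script in ("swig_offline_classifier.py", "hico_offline_classifier.py"):
        cmd = classifier_command(
            script,
            dinotxt_weights="/path/to/dinotxt_weights",
            backbone_weights="/path/to/backbone_weights",
            bpe_path_or_url="/path/to/bpe_vocab",
        )
        # Only print the command here; to execute it, pass `cmd` to
        # subprocess.run(cmd, check=True) from the repository root.
        print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems when a weight path contains spaces.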
Training scripts are provided in scripts/:
- scripts/swig.sh - Training on SWIG-HOI
- scripts/hico.sh - Training on HICO-DET
- scripts/hico_ov.sh - Training on HICO-DET in the zero-shot setting
Modify the following variables in the scripts to match your environment:
EXP_DIR="exps/swig" # Experiment output directory
DATA_DIR="/path/to/your/datasets" # Path to dataset
DINO_DIR="/path/to/your/weights" # Path to DINOv3 weights

Then run:

bash scripts/swig.sh

Evaluation scripts are provided in scripts/:
- scripts/swig_eval.sh - Evaluate on SWIG-HOI
- scripts/hico_eval.sh - Evaluate on HICO-DET
- scripts/hico_ov_eval.sh - Evaluate on HICO-DET in the zero-shot setting
Place the provided checkpoints in the pretrained folder. Modify only DATA_DIR in the evaluation scripts to point to your dataset, then run:
bash scripts/swig_eval.sh

| Dataset | Setting | Unseen | Rare | Non-rare / Seen | Full | Checkpoint |
|---|---|---|---|---|---|---|
| SWIG-HOI | - | 19.04 | 24.69 | 30.62 | 24.67 | pretrained/swig/pytorch_model.bin |
| HICO-DET | Default | - | 47.71 | 44.25 | 45.05 | pretrained/hico/pytorch_model.bin |
| HICO-DET | Zero-shot | 40.53 | - | 42.99 | 42.49 | pretrained/hico_ov/pytorch_model.bin |
Checkpoints are available in the HuggingFace repository.
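
The evaluation scripts expect the checkpoints at the paths listed in the table. A small pre-flight check, sketched below under the assumption that it is run from the repository root, reports which of them are not yet in place. `CHECKPOINTS` and `missing_checkpoints` are hypothetical names introduced for illustration; only the paths come from the table.

```python
from pathlib import Path

# (dataset, setting) -> checkpoint path, copied from the results table.
CHECKPOINTS = {
    ("SWIG-HOI", "-"): "pretrained/swig/pytorch_model.bin",
    ("HICO-DET", "Default"): "pretrained/hico/pytorch_model.bin",
    ("HICO-DET", "Zero-shot"): "pretrained/hico_ov/pytorch_model.bin",
}


def missing_checkpoints(repo_root="."):
    """Return the (dataset, setting) keys whose checkpoint file is absent."""
    root = Path(repo_root)
    return [key for key, rel in CHECKPOINTS.items() if not (root / rel).is_file()]


if __name__ == "__main__":
    for dataset, setting in missing_checkpoints():
        print(f"missing checkpoint for {dataset} ({setting})")
```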
@inproceedings{slhoi2026,
title={Streamlined Open-Vocabulary Human-Object Interaction Detection},
author={Chang Sun and Dongliang Liao and Changxing Ding},
booktitle={CVPR},
year={2026}
}

This code builds upon QPIC, GEN-VLKT, THID, and DINOv3. We thank their authors for making their code publicly available.