Technical Report (arXiv) • 🤗 Model (EchoInk-R1-7B) • 🤗 Dataset (AVQA-R1-6K)
EchoInk-R1 is the first general framework for unified audio-visual reasoning via reinforcement learning, built upon Qwen2.5-Omni-7B and optimized using Group Relative Policy Optimization (GRPO). It supports structured reasoning over synchronized audio-image inputs through multiple-choice question answering.
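For intuition, GRPO samples a group of candidate answers for each audio-image question, scores each candidate with rule-based rewards, and normalizes the rewards within the group to obtain advantages. The snippet below is only a minimal sketch of that group-relative normalization, not the repository's training code.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO.
# Illustration only; the actual trainer is launched via
# ./src/scripts/run_grpo_image_audio_avqa.sh.
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-completion rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for a group of four sampled answers to the same question.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```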
We introduce AVQA-R1-6K, a dataset derived from OmniInstruct-v1, comprising:
- 4,490 training samples
- 1,911 validation samples
- Each sample includes a synchronized audio-image pair with a multiple-choice question and four options.
Beyond our core study, EchoInk-R1 provides an extensible RL fine-tuning framework for Qwen2.5-Omni, enabling easy adaptation to new multimodal reasoning tasks with minimal modifications.
EchoInk-R1-7B achieves 85.77% accuracy on the AVQA-R1-6K validation set, surpassing the base Qwen2.5-Omni-7B model (80.53%) using only 562 RL steps.
All code, models, and data are released to support transparency and reproducibility.
- [2025/05/08] Released the AVQA-R1-6K dataset, EchoInk-R1-7B model, full training & evaluation pipeline, and technical report.
- Built on Qwen2.5-Omni-7B with GRPO-based RL
- Supports audio, image, video, and text modalities
- Provides a complete pipeline: dataset, training, and evaluation
During training, EchoInk-R1 exhibits reflective reasoning behaviors, where it revisits initial assumptions and refines its responses under ambiguous multimodal cues. These “aha moments” reveal its capacity for belief revision and deeper cross-modal understanding.
- Accuracy reward steadily improves throughout training, indicating that GRPO effectively guides the model toward more accurate and reasoned outputs.
- Completion length exhibits a two-phase trend: an initial increase as the model explores elaborated reasoning, followed by a gradual decline toward more concise and efficient answers.
- Format reward converges rapidly, showing that the model quickly internalizes the required response structure; a rough sketch of both reward terms follows this list.
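The accuracy and format rewards above can be sketched roughly as follows. This is an illustration only: it assumes the model is prompted to answer inside `<think>...</think><answer>...</answer>` tags with a single option letter, which may not match the repository's exact reward implementation.

```python
# Rough sketch of rule-based rewards for GRPO training (assumed tag and
# option-letter conventions; the repository's reward code may differ).
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected <think>/<answer> structure."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the option letter inside <answer>...</answer> matches the ground truth."""
    match = re.search(r"<answer>\s*([A-D])", completion)
    return 1.0 if match and match.group(1) == ground_truth else 0.0
```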
```bash
git clone https://github.com/HarryHsing/EchoInk
cd EchoInk
conda create -n echoink-r1 python=3.11
conda activate echoink-r1
bash setup.sh
```

To download and extract the AVQA-R1-6K dataset:

```bash
git lfs install
git clone https://huggingface.co/datasets/harryhsing/AVQA-R1-6K
cd AVQA-R1-6K
tar -xzvf AVQA_R1.tar.gz
```
Dataset Structure

```
AVQA_R1/
├── train/
│   ├── audios/
│   ├── images/
│   └── omni_rl_format_train.json
└── valid/
    ├── audios/
    ├── images/
    └── omni_rl_format_valid.json
```

First, download the base model: Qwen2.5-Omni-7B.
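One way to fetch the checkpoint is via the Hugging Face Hub client. The sketch below is just one option, with an arbitrary target directory, not a required step of the pipeline.

```python
# Download the Qwen2.5-Omni-7B checkpoint from the Hugging Face Hub.
# The local directory name "Qwen2.5-Omni-7B" is only an example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-Omni-7B",
    local_dir="Qwen2.5-Omni-7B",
)
```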
Modify config.json of Qwen2.5-Omni-7B to include "hidden_size": 3584 at the root level.
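For example, a small script along these lines applies the edit (a sketch; the path assumes the checkpoint was downloaded to `./Qwen2.5-Omni-7B`):

```python
# Add "hidden_size": 3584 to the root of the downloaded config.json.
import json

config_path = "Qwen2.5-Omni-7B/config.json"  # assumed local checkpoint path
with open(config_path) as f:
    config = json.load(f)

config["hidden_size"] = 3584  # required at the root level

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```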
Then launch GRPO training:

```bash
bash ./src/scripts/run_grpo_image_audio_avqa.sh
```

- Set `per_device_train_batch_size=1`, as in previous R1-V setups.
- To use custom data, follow the JSON format in `./src/make_omniInstruct_r1_dataset.py` for audio–image or audio–video tasks (a hedged example entry is sketched after these notes).
- See Qwen2.5-Omni issue #205 if you run into a dtype mismatch error.
- Training was run on 8×A100 (80G) GPUs; 4×A100 (80G) is also supported.
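The authoritative schema is defined in `./src/make_omniInstruct_r1_dataset.py`; the entry below only illustrates the kind of fields described earlier (paths to a synchronized audio-image pair, a question, four options, and the answer). All field names here are hypothetical.

```python
# Hypothetical example of a single training entry; the real field names are
# defined in ./src/make_omniInstruct_r1_dataset.py and may differ.
import json

example_entry = {
    "audio": "train/audios/sample_0001.wav",   # assumed relative path
    "image": "train/images/sample_0001.jpg",   # assumed relative path
    "question": "What is producing the sound in the scene?",
    "options": ["A. A violin", "B. A dog", "C. A car engine", "D. Rainfall"],
    "answer": "A",
}

print(json.dumps(example_entry, indent=2, ensure_ascii=False))
```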
Evaluate on the AVQA-R1-6K validation set:
```bash
python ./src/omniInstruct-v1_eval_valid.py         # Run the model on the validation set
python ./src/omniInstruct-v1_cal_metrics_valid.py  # Compute accuracy
```

We thank the open-source community. This work builds on Qwen2.5-Omni, Video-R1, Open-R1-Video, R1-V, and DeepSeek-R1.
If you find EchoInk-R1 useful, please cite:
```bibtex
@article{xing2025echoink,
  title   = {{EchoInk-R1}: Exploring Audio-Visual Reasoning in Multimodal {LLMs} via Reinforcement Learning},
  author  = {Zhenghao Xing and Xiaowei Hu and Chi-Wing Fu and Wenhai Wang and Jifeng Dai and Pheng-Ann Heng},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.04623}
}
```