Technical Report (arXiv) • 🤗 Model (EchoInk-R1-7B) • 🤗 Dataset (AVQA-R1-6K)
EchoInk-R1 is the first general framework for unified audio-visual reasoning via reinforcement learning, built upon Qwen2.5-Omni-7B and optimized using Group Relative Policy Optimization (GRPO). It supports structured reasoning over synchronized audio-image inputs through multiple-choice question answering.
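For intuition, GRPO samples a group of candidate answers for each audio-image question, scores each candidate with rule-based rewards, and normalizes the rewards within the group to obtain advantages. The snippet below is only a minimal sketch of that group-relative normalization, not the repository's training code.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO.
# Illustration only; the actual trainer is launched via
# ./src/scripts/run_grpo_image_audio_avqa.sh.
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-completion rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for a group of four sampled answers to the same question.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```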
We introduce AVQA-R1-6K, a dataset derived from OmniInstruct-v1, comprising:
- 4,490 training samples
- 1,911 validation samples
- Each sample includes a synchronized audio-image pair with a multiple-choice question and four options.
Beyond our core study, EchoInk-R1 provides an extensible RL fine-tuning framework for Qwen2.5-Omni, enabling easy adaptation to new multimodal reasoning tasks with minimal modifications.
EchoInk-R1-7B achieves 85.77% accuracy on the AVQA-R1-6K validation set, surpassing the base Qwen2.5-Omni-7B model (80.53%) using only 562 RL steps.
All code, models, and data are released to support transparency and reproducibility.
- [2025/05/08] Released the AVQA-R1-6K dataset, EchoInk-R1-7B model, full training & evaluation pipeline, and technical report.
- Built on Qwen2.5-Omni-7B with GRPO-based RL
- Supports audio, image, video, and text modalities
- Provides a complete pipeline: dataset, training, and evaluation
During training, EchoInk-R1 exhibits reflective reasoning behaviors, where it revisits initial assumptions and refines its responses under ambiguous multimodal cues. These “aha moments” reveal its capacity for belief revision and deeper cross-modal understanding.
- Accuracy reward steadily improves throughout training, indicating that GRPO effectively guides the model toward more accurate and reasoned outputs.
- Completion length exhibits a two-phase trend: an initial increase as the model explores elaborated reasoning, followed by a gradual decline toward more concise and efficient answers.
- Format reward converges rapidly, showing that the model quickly internalizes the required response structure; a rough sketch of both reward terms follows this list.
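The accuracy and format rewards above can be sketched roughly as follows. This is an illustration only: it assumes the model is prompted to answer inside `<think>...</think><answer>...</answer>` tags with a single option letter, which may not match the repository's exact reward implementation.

```python
# Rough sketch of rule-based rewards for GRPO training (assumed tag and
# option-letter conventions; the repository's reward code may differ).
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected <think>/<answer> structure."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the option letter inside <answer>...</answer> matches the ground truth."""
    match = re.search(r"<answer>\s*([A-D])", completion)
    return 1.0 if match and match.group(1) == ground_truth else 0.0
```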
```bash
git clone https://github.com/HarryHsing/EchoInk
cd EchoInk
conda create -n echoink-r1 python=3.11
conda activate echoink-r1
bash setup.sh
```

To download and extract the AVQA-R1-6K dataset:

```bash
git lfs install
git clone https://huggingface.co/datasets/harryhsing/AVQA-R1-6K
cd AVQA-R1-6K
tar -xzvf AVQA_R1.tar.gz
```
Dataset Structure

```
AVQA_R1/
├── train/
│   ├── audios/
│   ├── images/
│   └── omni_rl_format_train.json
└── valid/
    ├── audios/
    ├── images/
    └── omni_rl_format_valid.json
```

First, download the base model: Qwen2.5-Omni-7B.
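One way to fetch the checkpoint is via the Hugging Face Hub client. The sketch below is just one option, with an arbitrary target directory, not a required step of the pipeline.

```python
# Download the Qwen2.5-Omni-7B checkpoint from the Hugging Face Hub.
# The local directory name "Qwen2.5-Omni-7B" is only an example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-Omni-7B",
    local_dir="Qwen2.5-Omni-7B",
)
```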
Modify config.json of Qwen2.5-Omni-7B to include "hidden_size": 3584 at the root level.
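For example, a small script along these lines applies the edit (a sketch; the path assumes the checkpoint was downloaded to `./Qwen2.5-Omni-7B`):

```python
# Add "hidden_size": 3584 to the root of the downloaded config.json.
import json

config_path = "Qwen2.5-Omni-7B/config.json"  # assumed local checkpoint path
with open(config_path) as f:
    config = json.load(f)

config["hidden_size"] = 3584  # required at the root level

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```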
Then launch GRPO training:

```bash
bash ./src/scripts/run_grpo_image_audio_avqa.sh
```

- Set `per_device_train_batch_size=1`, as in previous R1-V setups.
- To use custom data, follow the JSON format in `./src/make_omniInstruct_r1_dataset.py` for audio–image or audio–video tasks (a hedged example entry is sketched after these notes).
- See Qwen2.5-Omni issue #205 if you run into a dtype mismatch error.
- Training was run on 8×A100 (80G) GPUs; 4×A100 (80G) is also supported.
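The authoritative schema is defined in `./src/make_omniInstruct_r1_dataset.py`; the entry below only illustrates the kind of fields described earlier (paths to a synchronized audio-image pair, a question, four options, and the answer). All field names here are hypothetical.

```python
# Hypothetical example of a single training entry; the real field names are
# defined in ./src/make_omniInstruct_r1_dataset.py and may differ.
import json

example_entry = {
    "audio": "train/audios/sample_0001.wav",   # assumed relative path
    "image": "train/images/sample_0001.jpg",   # assumed relative path
    "question": "What is producing the sound in the scene?",
    "options": ["A. A violin", "B. A dog", "C. A car engine", "D. Rainfall"],
    "answer": "A",
}

print(json.dumps(example_entry, indent=2, ensure_ascii=False))
```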
Evaluate on the AVQA-R1-6K validation set:
```bash
python ./src/omniInstruct-v1_eval_valid.py         # Run the model on the validation set
python ./src/omniInstruct-v1_cal_metrics_valid.py  # Compute accuracy
```

We thank the open-source community. This work builds on Qwen2.5-Omni, Video-R1, Open-R1-Video, R1-V, and DeepSeek-R1.
If you find EchoInk-R1 useful, please cite:
```bibtex
@article{xing2025echoink,
  title   = {{EchoInk-R1}: Exploring Audio-Visual Reasoning in Multimodal {LLMs} via Reinforcement Learning},
  author  = {Zhenghao Xing and Xiaowei Hu and Chi-Wing Fu and Wenhai Wang and Jifeng Dai and Pheng-Ann Heng},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.04623}
}
```