
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

📄 Technical Report (arXiv) • 🤗 Model (EchoInk-R1-7B) • 🤗 Dataset (AVQA-R1-6K)

Overview

EchoInk-R1 is the first general framework for unified audio-visual reasoning via reinforcement learning, built upon Qwen2.5-Omni-7B and optimized using Group Relative Policy Optimization (GRPO). It supports structured reasoning over synchronized audio-image inputs through multiple-choice question answering.
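
For readers new to GRPO, the core idea is to score each sampled completion relative to the other completions drawn for the same prompt, which removes the need for a learned critic. The sketch below is a minimal illustration of this group-relative advantage computation, not the project's training code; the function name and tensor shapes are our own.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one row per prompt and one column
    # per sampled completion for that prompt.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored against its own group, so no value network
    # (critic) is needed; this is the key simplification in GRPO.
    return (rewards - mean) / (std + eps)

# Example: 2 prompts with 4 sampled completions each (binary rewards)
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))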

We introduce AVQA-R1-6K, a dataset derived from OmniInstruct-v1, comprising:

  • 4,490 training samples
  • 1,911 validation samples
  • Each sample includes a synchronized audio-image pair with a multiple-choice question and four options.
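
To make the sample format concrete, below is a purely hypothetical record showing the kind of information each entry carries: an audio clip, an image, a question, four options, and the correct choice. The field names and values are illustrative only; the actual schema is defined in omni_rl_format_train.json.

# Hypothetical AVQA-R1-6K entry; field names are illustrative, not the real schema.
sample = {
    "audio": "train/audios/example_0001.wav",
    "image": "train/images/example_0001.jpg",
    "question": "What is producing the sound in the scene?",
    "options": ["A. A passing train", "B. A barking dog",
                "C. A church bell", "D. A car horn"],
    "answer": "C",
}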

Beyond our core study, EchoInk-R1 provides an extensible RL fine-tuning framework for Qwen2.5-Omni, enabling easy adaptation to new multimodal reasoning tasks with minimal modifications.

Performance

EchoInk-R1-7B achieves 85.77% accuracy on the AVQA-R1-6K validation set, surpassing the base Qwen2.5-Omni-7B model (80.53%) using only 562 RL steps.

All code, models, and data are released to support transparency and reproducibility.

Highlights

  • Built on Qwen2.5-Omni-7B with GRPO-based RL
  • Supports audio, image, video, and text modalities
  • Provides a complete pipeline: dataset, training, and evaluation

Reflective Reasoning: Aha Moments

During training, EchoInk-R1 exhibits reflective reasoning behaviors, where it revisits initial assumptions and refines its responses under ambiguous multimodal cues. These "aha moments" reveal its capacity for belief revision and deeper cross-modal understanding.

Figure: Case 1 reasoning

Figure: Case 2 reasoning

Learning Dynamics

  • Accuracy reward steadily improves throughout training, indicating that GRPO effectively guides the model toward more accurate and reasoned outputs.
  • Completion length exhibits a two-phase trend: an initial increase as the model explores elaborated reasoning, followed by a gradual decline toward more concise and efficient answers.
  • Format reward converges rapidly, showing that the model quickly internalizes the required response structure.

Figure: Training dynamics
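
As a rough illustration of the two reward signals discussed above, the sketch below shows one way an accuracy reward and a format reward can be computed for multiple-choice answers. It assumes an R1-style response template with <think> and <answer> tags; the repository's actual reward functions live in the training code and may differ in detail.

import re

# Hedged sketch: assumes responses look like
# "<think> ... </think> <answer> C </answer>"; the real reward code may differ.

def format_reward(response: str) -> float:
    # 1.0 if the response matches the required structure, else 0.0.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # 1.0 if the option letter inside <answer> matches the ground-truth letter.
    match = re.search(r"<answer>\s*([A-D])", response)
    return 1.0 if match and match.group(1) == ground_truth else 0.0

response = "<think>The image shows a bell tower and the audio contains ringing.</think> <answer>C</answer>"
print(format_reward(response), accuracy_reward(response, "C"))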

Setup & Installation

Environment Setup

git clone https://github.com/HarryHsing/EchoInk
cd EchoInk

conda create -n echoink-r1 python=3.11
conda activate echoink-r1
bash setup.sh
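
After setup.sh finishes, a quick sanity check (a generic sketch, not a script shipped with this repository) is to confirm that PyTorch can see your GPUs:

import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())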

Download Dataset

To download and extract the AVQA-R1-6K dataset:

git lfs install
git clone https://huggingface.co/datasets/harryhsing/AVQA-R1-6K
cd AVQA-R1-6K
tar -xzvf AVQA_R1.tar.gz

πŸ“ Dataset Structure
AVQA_R1/
β”œβ”€β”€ train/
β”‚   β”œβ”€β”€ audios/
β”‚   β”œβ”€β”€ images/
β”‚   └── omni_rl_format_train.json
β”œβ”€β”€ valid/
β”‚   β”œβ”€β”€ audios/
β”‚   β”œβ”€β”€ images/
β”‚   └── omni_rl_format_valid.json
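
As a quick check after extraction, the snippet below (our own, not part of the repository) loads the training annotations and reports how many entries were found. It only assumes the file is standard JSON with one record per sample.

import json

# Assumes AVQA_R1/ was extracted into the current directory.
with open("AVQA_R1/train/omni_rl_format_train.json", "r") as f:
    train_data = json.load(f)

print(f"Loaded {len(train_data)} training entries")  # expected: about 4,490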

Training

Download Qwen2.5-Omni-7B Model

First, download the base model: Qwen2.5-Omni-7B

Modify config.json of Qwen2.5-Omni-7B to include "hidden_size": 3584 at the root level.
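
If you prefer to apply this change programmatically, the short sketch below adds the key described above. The model path is a placeholder; point it at your local Qwen2.5-Omni-7B download.

import json

config_path = "Qwen2.5-Omni-7B/config.json"  # placeholder path to your local download

with open(config_path, "r") as f:
    config = json.load(f)

# Add "hidden_size": 3584 at the root level, as required before GRPO training.
config["hidden_size"] = 3584

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)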

Launch GRPO Training

bash ./src/scripts/run_grpo_image_audio_avqa.sh

πŸ“ Set per_device_train_batch_size=1 as in previous R1-V setups
πŸ“ To use custom data, follow the JSON format in ./src/make_omniInstruct_r1_dataset.py for audio–image or audio–video tasks.
πŸ“ See Qwen2.5-Omni issue #205 if you run into a dtype mismatch error.
βš™οΈ Trained on 8Γ—A100 (80G) GPUs; also supported on 4Γ—A100 (80G).

Evaluation

Evaluate on the AVQA-R1-6K validation set:

python ./src/omniInstruct-v1_eval_valid.py # Run the model on the validation set
python ./src/omniInstruct-v1_cal_metrics_valid.py # Compute accuracy

Acknowledgements

We thank the open-source community. This work builds on Qwen2.5-Omni, Video-R1, Open-R1-Video, R1-V, and DeepSeek-R1.

Citation

If you find EchoInk-R1 useful, please cite:

@article{xing2025echoink,
  title={{EchoInk-R1}: Exploring Audio-Visual Reasoning in Multimodal {LLMs} via Reinforcement Learning},
  author={Zhenghao Xing and Xiaowei Hu and Chi-Wing Fu and Wenhai Wang and Jifeng Dai and Pheng-Ann Heng},
  journal={arXiv preprint arXiv:2505.04623},
  year={2025}
}
