🎬 LongVideoAgent: Multi-Agent Reasoning with Long Videos

Runtao Liu*, Ziyi Liu*, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

ACL 2026 Main

Hong Kong University of Science and Technology

* Equal contribution

This is the official repository for our ACL 2026 Main paper, LongVideoAgent: Multi-Agent Reasoning with Long Videos. Training and evaluation code are now available, and model weights are released on Hugging Face. This README provides a compact code overview, while the 📚 Docs contain the full setup, workflow, and argument details.

🚀 Latest News

• [2026/04/08]: 🎉 Our paper has been accepted to ACL 2026 Main Conference.

• [2026/03/22]: 🤗 We released the LongVideoAgent-Qwen2.5-7B model weights on Hugging Face. This checkpoint was trained on the newversion branch.

• [2026/03/15]: 🌿 We added the newversion branch for users who want to try the updated verl-based experimental training stack.

• [2026/03/07]: 🤗 We released the LongVideoAgent-Qwen2.5-3B model weights on Hugging Face.

• [2026/03/06]: 🚀 We released the training and evaluation code for LongVideoAgent.

• [2026/02/14]: 📦 We released the LongTVQA dataset on Hugging Face.

• [2025/12/30]: 📦 We released the LongTVQA+ dataset on Hugging Face.

• [2025/12/24]: 🚀 We released our paper "LongVideoAgent: Multi-Agent Reasoning with Long Videos" on arXiv!

📅 Roadmap

[2026/03/22]: Released the LongVideoAgent-Qwen2.5-7B model weights on Hugging Face.
[2026/03/07]: Released the LongVideoAgent-Qwen2.5-3B model weights on Hugging Face.
[2026/03/06]: Released training and evaluation code.

📦 Dataset

LongTVQA: https://huggingface.co/datasets/longvideoagent/LongTVQA
LongTVQA+: https://huggingface.co/datasets/longvideoagent/LongTVQA_plus

🛠️ Installation

We recommend creating a clean Python 3.11 environment and installing the project from the repository root:

conda create -n lvagent python=3.11
conda activate lvagent
pip install vllm
pip install -e .
pip install flash-attn --no-build-isolation
pip install wandb

If dependency resolution fails, you can install the package first without dependencies and then install from requirements.txt:

pip install -e . --no-deps
pip install -r requirements.txt

For step-by-step installation details, see docs/installation.md.

🏋️ Train

Note: the main branch documents the current default training path. An experimental training stack based on the updated verl is kept on the newversion branch for users who want to try it separately. Training configs, behavior, and outcomes may differ noticeably across branches, so switch to the branch you want first and follow that branch's documentation.

The recommended training flow is: prepare datasets, build an offline grounding cache, convert data to GRPO parquet files, and then launch the quickstart script.

1. Download and prepare LongTVQA assets

bash scripts/download_and_prepare_longtvqa.sh

2. Build the offline grounding cache

python src/dataset/build_grounding_cache.py \
  --dataset tvqa_plus \
  --questions-path /path/to/train.json \
  --subs-path /path/to/all_episodes_subtitles_by_clips.json \
  --grounding-model "grok-4-fast-reasoning" \
  --grounding-base-url "https://api2.aigcbest.top/v1" \
  --output-dir /path/to/cache_dir \
  --threads 8

3. Convert the dataset to training parquet files

python src/dataset/convert_tvqa_json_to_grpo_parquet.py \
  --questions-path /path/to/LongTVQA_or_LongTVQA_plus_questions.jsonl_or_json \
  --grounding-cache-json /path/to/grounding_cache.json \
  --subtitles-dir /path/to/subtitles_dir \
  --output-dir ./data \
  --seed 42

4. Launch quickstart GRPO training

bash scripts/quickstart_qwen_2_5_3B_grpo.sh

The quickstart script expects ./data/train.parquet and ./data/val.parquet. Grounding and vision API credentials can be passed by CLI or read from environment variables in the training pipeline.
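Before launching the script, it can help to confirm that both parquet files are in place. The following is a minimal sketch that assumes only the two paths stated above; the helper name `missing_training_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names the quickstart script expects, per the note above.
EXPECTED_FILES = ("train.parquet", "val.parquet")


def missing_training_files(data_dir="./data"):
    """Return the expected parquet files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_training_files()
    if missing:
        print("Missing before training:", ", ".join(missing))
    else:
        print("Found train.parquet and val.parquet in ./data")
```

Run it from the repository root after step 3, so that ./data resolves to the conversion output directory.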

For more detailed training instructions, see the Quickstart, Offline Grounding Cache, Convert to Parquet, and GRPO Config Details pages in the 📚 Docs.

📊 Evaluation

LongVideoAgent provides evaluation scripts for both LongTVQA and LongTVQA+. The difference between the local and API versions lies in how the Master Agent performs reasoning: the local version runs the Master Agent with a local LLM, while the API version calls an API-hosted model for the Master Agent.

Local evaluation

python src/evaluation/lvagent/evaluate_local_unified.py \
  --dataset tvqa_plus \
  --llm-path "/path/to/your/local_llm" \
  --max_turn 5 \
  --gpu_memory_utilization 0.4

API evaluation

python src/evaluation/lvagent/evaluate_api_unified.py \
  --dataset tvqa_plus \
  --checkpoint_step api \
  --max_turn 5 \
  --threads 30

To prepare evaluation-ready dataset files, you can also run:

bash scripts/download_and_prepare_longtvqa.sh

If API keys are not passed explicitly, the evaluation scripts read them from environment variables such as qdd_api and aliyun_api.
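The fallback from an explicit argument to an environment variable can be sketched as below. This is an illustration only: `resolve_api_key` is a hypothetical helper, not the function the evaluation scripts use, and which variable serves which agent is not assumed here.

```python
import os


def resolve_api_key(cli_value, env_name):
    """Prefer an explicitly passed key; otherwise fall back to the environment."""
    if cli_value:
        return cli_value
    key = os.environ.get(env_name)
    if not key:
        raise RuntimeError(
            f"No API key given and environment variable {env_name!r} is unset"
        )
    return key
```

Exporting qdd_api and aliyun_api in the shell before running the scripts avoids putting keys on the command line, where they would end up in shell history.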

For full evaluation setups and argument descriptions, see docs/evaluation.md.

📝 Abstract

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose LongVideoAgent, a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent.

🌟 Overview

Traditional single-pass MLLMs that ingest an entire long video in one context, typically after heavy downsampling and compression, often miss crucial evidence and produce wrong answers. LongVideoAgent instead conducts multi-agent, multi-round, multimodal reasoning to extract sparse, task-relevant cues and answer correctly.

🤖 Method: Multi-Agent Framework

Architecture of LongVideoAgent. A MasterAgent runs for up to $K$ rounds, collaborating with a GroundingAgent to localize relevant clips from videos and a VisionAgent to read fine-grained cues from the localized frames. Evidence accumulates until the MasterAgent is confident enough to answer the user.

🔄 Iterative Reasoning Loop

Unlike single-pass models, LongVideoAgent operates in a bounded loop (max $K$ steps). At each step, the MasterAgent generates a "thinking" trace and emits a structured action token:

<request_grounding>: Calls the GroundingAgent to localize relevant video segments based on subtitles. The agent returns a symbolic tag <clip_X>.
<visual_query>: Calls the VisionAgent to extract specific visual details (objects, actions, text) from the localized clip. The agent returns textual observations.
<answer>: Terminates the loop and provides the final response when sufficient evidence is gathered.
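The control flow of this loop can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: the three agents are plain callables standing in for real models, the closing-tag form `<tag>...</tag>` is an assumption about how payloads are delimited, and `parse_action` and `run_episode` are hypothetical names.

```python
import re

# Action tags named in the list above; the closing-tag payload format is an assumption.
ACTION_PATTERN = re.compile(
    r"<(request_grounding|visual_query|answer)>(.*?)</\1>", re.DOTALL
)


def parse_action(text):
    """Return (action, payload) for the first well-formed action tag, else None."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def run_episode(master, grounding_agent, vision_agent, question, max_steps=5):
    """Run the bounded loop for at most max_steps (K) master turns."""
    context = [f"Question: {question}"]
    clip = None
    for _ in range(max_steps):
        output = master("\n".join(context))
        context.append(output)
        parsed = parse_action(output)
        if parsed is None:
            context.append("Observation: no valid action tag found.")
            continue
        action, payload = parsed
        if action == "answer":
            return payload, context
        if action == "request_grounding":
            clip = grounding_agent(question, payload)
            context.append(f"Observation: {clip}")
        elif action == "visual_query":
            context.append(f"Observation: {vision_agent(clip, payload)}")
    return None, context  # step budget exhausted without an answer


if __name__ == "__main__":
    # Scripted master outputs stand in for a real LLM.
    scripted = iter([
        "<request_grounding>who opens the door</request_grounding>",
        "<visual_query>what is the person holding</visual_query>",
        "<answer>B</answer>",
    ])
    answer, trace = run_episode(
        master=lambda prompt: next(scripted),
        grounding_agent=lambda question, payload: "<clip_3>",
        vision_agent=lambda clip, payload: f"{clip}: a person holds a red mug",
        question="What is the person holding when the door opens?",
    )
    print(answer)
```

Keeping the agents as injected callables makes the step limit and the accumulated evidence easy to inspect: `trace` holds every master output and observation in order, which mirrors the interpretable trajectories described above.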

🧠 Reinforcement Learning (GRPO)

We optimize the MasterAgent using Group Relative Policy Optimization (GRPO). The training objective includes two reward terms:

1. Structural validity: rewarding outputs that follow the required action format.
2. Answer correctness: rewarding the agent for reaching the correct final answer.
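The two reward terms, and the group-relative normalization that gives GRPO its name, can be sketched as follows. This is a minimal illustration under stated assumptions: the weights `format_weight` and `answer_weight`, the per-step validity flags, and both function names are hypothetical, and the policy update itself (clipping, KL regularization) is omitted.

```python
from statistics import mean, pstdev


def trajectory_reward(step_valid_flags, predicted, gold,
                      format_weight=0.1, answer_weight=1.0):
    """Combine structural validity and answer correctness (weights are assumptions)."""
    if step_valid_flags:
        structure = sum(1.0 for ok in step_valid_flags if ok) / len(step_valid_flags)
    else:
        structure = 0.0
    correct = 1.0 if predicted == gold else 0.0
    return format_weight * structure + answer_weight * correct


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one question: two correct, two wrong, all well-formed.
    rewards = [
        trajectory_reward([True, True], "B", "B"),
        trajectory_reward([True, True], "A", "B"),
        trajectory_reward([True, True], "C", "B"),
        trajectory_reward([True, True], "B", "B"),
    ]
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group mean, a rollout is only rewarded for doing better than its siblings on the same question; when every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.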

📈 Experimental Results

We evaluate LongVideoAgent on LongTVQA and LongTVQA+, which are episode-level datasets.

Main Results

Performance on LongTVQA and LongTVQA+. The left block lists model attributes (Agentic, Input, RL fine-tune); the right block reports validation accuracy (%). GPT-4o and Gemini-2.5 Pro are multimodal baselines that take the full long video as direct input. Methods labeled Agentic indicate the model operates as the MasterAgent; methods labeled AgenticRL additionally denote RL fine-tuning. Parenthesized green numbers denote absolute gains over the immediately preceding (non-agentic or non-RL) setting. We observe that: (i) our multi-agent framework, LongVideoAgent, consistently outperforms its non-agentic counterparts; (ii) agentic RL yields additional gains, especially for smaller open-source models; (iii) using frames provides visual evidence beyond subtitles and generally outperforms subtitle-only inputs; (iv) closed-source models remain strong, but the gap narrows considerably when open-source models adopt agentic designs and agentic RL.

🔍 Ablation Analysis

We conduct comprehensive ablation studies to validate our design choices. First, we observe that both grounding and vision agents are essential, with the full multi-agent system achieving the highest accuracy. Second, increasing the reasoning step limit $K$ improves performance until saturation, confirming the value of iterative planning. Finally, stronger vision backbones and larger temporal windows provide richer context, further boosting the agent's reasoning capabilities.

📝 Citation

If you find our work helpful, please cite:

@misc{liu2025longvideoagentmultiagentreasoninglong,
      title={LongVideoAgent: Multi-Agent Reasoning with Long Videos}, 
      author={Runtao Liu and Ziyi Liu and Jiaqi Tang and Yue Ma and Renjie Pi and Jipeng Zhang and Qifeng Chen},
      year={2025},
      eprint={2512.20618},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.20618}, 
}
