longvideoagent/LongVideoAgent

🎬 LongVideoAgent: Multi-Agent Reasoning with Long Videos

🇨🇳 Chinese | 📚 Docs | 🤗 Hugging Face | 🧠 Models | 🌐 Project Page | 📄 arXiv

Runtao Liu*, Ziyi Liu*, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

ACL 2026 Main

Hong Kong University of Science and Technology

* Equal contribution


This is the official repository for our ACL 2026 Main paper, LongVideoAgent: Multi-Agent Reasoning with Long Videos. Training and evaluation code are now available, and model weights are released on Hugging Face. This README provides a compact code overview, while the 📚 Docs contain the full setup, workflow, and argument details.

🚀 Latest News

• [2026/04/08]: 🎉 Our paper has been accepted to ACL 2026 Main Conference.
• [2026/03/22]: 🤗 We released the LongVideoAgent-Qwen2.5-7B model weights on Hugging Face. This checkpoint was trained on the newversion branch.
• [2026/03/15]: 🌿 We added the newversion branch for users who want to try the updated verl-based experimental training stack.
• [2026/03/07]: 🤗 We released the LongVideoAgent-Qwen2.5-3B model weights on Hugging Face.
• [2026/03/06]: 🚀 We released the training and evaluation code for LongVideoAgent.
• [2026/02/14]: 📦 We released the LongTVQA dataset on Hugging Face.
• [2025/12/30]: 📦 We released the LongTVQA+ dataset on Hugging Face.
• [2025/12/24]: 🚀 We released our paper "LongVideoAgent: Multi-Agent Reasoning with Long Videos" on arXiv!


📅 Roadmap

  • [2026/03/22]: Released the LongVideoAgent-Qwen2.5-7B model weights on Hugging Face.
  • [2026/03/06]: Released training and evaluation code.
  • [2026/03/07]: Released the LongVideoAgent-Qwen2.5-3B model weights on Hugging Face.

📦 Dataset


🛠️ Installation

We recommend creating a clean Python 3.11 environment and installing the project from the repository root:

conda create -n lvagent python=3.11
conda activate lvagent
pip install vllm
pip install -e .
pip install flash-attn --no-build-isolation
pip install wandb

If dependency resolution fails, you can install the package first without dependencies and then install from requirements.txt:

pip install -e . --no-deps
pip install -r requirements.txt

For step-by-step installation details, see docs/installation.md.
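
After installing, a quick import check can confirm that the main packages are visible in the active environment. The sketch below is not part of the repository; it uses only the standard library and assumes the usual import names (`vllm`, `flash_attn`, `wandb`) for the packages installed above.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed in the steps above.
REQUIRED = ["vllm", "flash_attn", "wandb"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only looks the modules up without importing them, so it stays fast and does not touch the GPU.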


πŸ‹οΈ Train

Note: The main branch documents the current default training path. An experimental training stack based on the updated verl is kept on the newversion branch for users who want to try it separately. Training configs, behavior, and outcomes may differ noticeably across branches, so switch to the branch you want first and follow that branch's documentation.

The recommended training flow is: prepare datasets, build an offline grounding cache, convert data to GRPO parquet files, and then launch the quickstart script.

1. Download and prepare LongTVQA assets

bash scripts/download_and_prepare_longtvqa.sh

2. Build the offline grounding cache

python src/dataset/build_grounding_cache.py \
  --dataset tvqa_plus \
  --questions-path /path/to/train.json \
  --subs-path /path/to/all_episodes_subtitles_by_clips.json \
  --grounding-model "grok-4-fast-reasoning" \
  --grounding-base-url "https://api2.aigcbest.top/v1" \
  --output-dir /path/to/cache_dir \
  --threads 8

3. Convert the dataset to training parquet files

python src/dataset/convert_tvqa_json_to_grpo_parquet.py \
  --questions-path /path/to/LongTVQA_or_LongTVQA_plus_questions.jsonl_or_json \
  --grounding-cache-json /path/to/grounding_cache.json \
  --subtitles-dir /path/to/subtitles_dir \
  --output-dir ./data \
  --seed 42

4. Launch quickstart GRPO training

bash scripts/quickstart_qwen_2_5_3B_grpo.sh

The quickstart script expects ./data/train.parquet and ./data/val.parquet. Grounding and vision API credentials can be passed by CLI or read from environment variables in the training pipeline.
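
Because the quickstart script expects both parquet files to exist, it can help to check for them before launching a long training job. This is a minimal sketch rather than repository code; the helper name `missing_parquet_files` is hypothetical, and only the two file paths come from the text above.

```python
from pathlib import Path


def missing_parquet_files(data_dir="./data", names=("train.parquet", "val.parquet")):
    """Return the expected parquet files that are absent from data_dir."""
    root = Path(data_dir)
    return [str(root / name) for name in names if not (root / name).is_file()]


missing = missing_parquet_files()
if missing:
    print("Missing before training:", missing)
else:
    print("Found train.parquet and val.parquet.")
```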

For more detailed training instructions, see Quickstart, Offline Grounding Cache, Convert to Parquet, and GRPO Config Details.


📊 Evaluation

LongVideoAgent provides evaluation scripts for both LongTVQA and LongTVQA+. The difference between the local and API versions lies in how the Master Agent performs reasoning: the local version runs the Master Agent with a local LLM, while the API version calls an API-hosted model for the Master Agent.

Local evaluation

python src/evaluation/lvagent/evaluate_local_unified.py \
  --dataset tvqa_plus \
  --llm-path "/path/to/your/local_llm" \
  --max_turn 5 \
  --gpu_memory_utilization 0.4

API evaluation

python src/evaluation/lvagent/evaluate_api_unified.py \
  --dataset tvqa_plus \
  --checkpoint_step api \
  --max_turn 5 \
  --threads 30

To prepare evaluation-ready dataset files, you can also run:

bash scripts/download_and_prepare_longtvqa.sh

If API keys are not passed explicitly, the evaluation scripts read them from environment variables such as qdd_api and aliyun_api.
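
The fallback described above can be sketched in a few lines. This illustrates the pattern only; the evaluation scripts' actual lookup code is not shown here. The function name `resolve_api_key` is hypothetical, and which service each variable belongs to is not specified in this README.

```python
import os


def resolve_api_key(cli_value, env_name):
    """Prefer an explicitly passed key, otherwise fall back to an environment variable."""
    if cli_value:
        return cli_value
    value = os.environ.get(env_name)
    if not value:
        raise RuntimeError(f"No API key was passed and {env_name} is not set.")
    return value
```

For example, `resolve_api_key(None, "qdd_api")` returns the value of the `qdd_api` environment variable, and raises a clear error if it is unset instead of failing later inside an API call.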

For full evaluation setups and argument descriptions, see docs/evaluation.md.


📝 Abstract

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose LongVideoAgent, a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent.


🌟 Overview

Traditional single-pass MLLMs ingest an entire long video in one context, typically after heavy downsampling and compression, so they often miss crucial evidence and produce wrong answers. LongVideoAgent instead conducts multi-agent, multi-round, multimodal reasoning to extract sparse, task-relevant cues and answer correctly.


🤖 Method: Multi-Agent Framework

Architecture

Architecture of LongVideoAgent. A MasterAgent runs for up to $K$ rounds, collaborating with a GroundingAgent to localize relevant clips in the video and a VisionAgent to read fine-grained cues from the localized frames. Evidence accumulates until the MasterAgent is confident enough to answer the user.

🔄 Iterative Reasoning Loop

Unlike single-pass models, LongVideoAgent operates in a bounded loop (max $K$ steps). At each step, the MasterAgent generates a "thinking" trace and emits a structured action token:

  • <request_grounding>: Calls the GroundingAgent to localize relevant video segments based on subtitles. The agent returns a symbolic tag <clip_X>.
  • <visual_query>: Calls the VisionAgent to extract specific visual details (objects, actions, text) from the localized clip. The agent returns textual observations.
  • <answer>: Terminates the loop and provides the final response when sufficient evidence is gathered.
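
The loop above can be sketched as a small dispatcher. This is an illustrative sketch, not the repository's implementation: the closing-tag format (`</answer>` and so on), the function names, and the scripted stand-in agents are all assumptions, and `max_steps` plays the role of $K$.

```python
import re

# Assumed tag syntax: <name>payload</name> for the three actions named above.
ACTION_PATTERN = re.compile(
    r"<(request_grounding|visual_query|answer)>(.*?)</\1>", re.DOTALL
)


def parse_action(text):
    """Return (action, payload) for the first action tag in text, or None."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def run_episode(question, master, grounding_agent, vision_agent, max_steps=5):
    """Run the bounded loop: query the master, dispatch its action, collect evidence."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        output = master(history)
        history.append(output)
        action = parse_action(output)
        if action is None:
            history.append("Observation: no valid action tag found.")
            continue
        name, payload = action
        if name == "answer":
            return payload, history
        if name == "request_grounding":
            observation = grounding_agent(payload)  # returns a symbolic clip tag
        else:  # visual_query
            observation = vision_agent(payload)  # returns textual observations
        history.append(f"Observation: {observation}")
    return None, history  # step budget exhausted without an answer


# Scripted stand-ins so the sketch runs without any model or API.
scripted = iter([
    "I need the relevant clip. <request_grounding>Who opens the door?</request_grounding>",
    "Now check the frames. <visual_query>Who is at the door?</visual_query>",
    "Evidence is sufficient. <answer>B</answer>",
])
answer, trace = run_episode(
    "Who opens the door?",
    master=lambda history: next(scripted),
    grounding_agent=lambda query: "<clip_3>",
    vision_agent=lambda query: "A man in a red jacket opens the door.",
)
print(answer)
```

Keeping the agents as plain callables makes the step limit and the dispatch logic easy to test separately from any model backend.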

🧠 Reinforcement Learning (GRPO)

We optimize the MasterAgent using Group Relative Policy Optimization (GRPO). The training objective includes:

  1. Structural validity: rewarding outputs that follow the expected structured action format.
  2. Answer correctness: rewarding the agent for reaching the correct final answer.
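
To make the two reward terms concrete, here is a simplified sketch of a per-rollout reward together with the group-relative normalization that GRPO applies to rollouts sampled for the same prompt. The weights, the function names, and the choice to check only the final `<answer>` tag are assumptions made for illustration; the repository's actual reward shaping may differ.

```python
import re
import statistics

ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def reward(output, gold, format_weight=0.2, answer_weight=1.0):
    """Score one rollout: a format term plus a correctness term (illustrative weights)."""
    match = ANSWER_PATTERN.search(output)
    if match is None:
        return 0.0  # no well-formed answer tag, so neither term is earned
    correct = match.group(1).strip() == gold
    return format_weight + (answer_weight if correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one question whose correct option is "B".
rollouts = ["<answer>B</answer>", "<answer>A</answer>", "no tag", "<answer>B</answer>"]
rewards = [reward(text, "B") for text in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that score above their group's mean receive positive advantages and those below receive negative ones, so no separate value model is needed to form the baseline.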


📈 Experimental Results

We evaluate LongVideoAgent on LongTVQA and LongTVQA+, which are episode-level datasets.

Main Results

Performance on LongTVQA and LongTVQA+. The left block lists model attributes (Agentic, Input, RL fine-tune); the right block reports validation accuracy (%). GPT-4o and Gemini-2.5 Pro are multimodal baselines that take the full long video as direct input. Methods labeled Agentic use the model as the MasterAgent; methods labeled AgenticRL are additionally fine-tuned with RL. Parenthesized green numbers denote absolute gains over the immediately preceding (non-agentic or non-RL) setting. We observe that: (i) our multi-agent framework, LongVideoAgent, consistently outperforms its non-agentic counterparts; (ii) agentic RL yields additional gains, especially for smaller open-source models; (iii) frames provide visual evidence beyond subtitles, and frame inputs generally outperform subtitle-only inputs; (iv) closed-source models remain strong, but the gap narrows considerably when open-source models adopt agentic designs and agentic RL.

🔍 Ablation Analysis

We conduct comprehensive ablation studies to validate our design choices. First, we observe that both grounding and vision agents are essential, with the full multi-agent system achieving the highest accuracy. Second, increasing the reasoning step limit $K$ improves performance until saturation, confirming the value of iterative planning. Finally, stronger vision backbones and larger temporal windows provide richer context, further boosting the agent's reasoning capabilities.


📝 Citation

If you find our work helpful, please cite:

@misc{liu2025longvideoagentmultiagentreasoninglong,
      title={LongVideoAgent: Multi-Agent Reasoning with Long Videos}, 
      author={Runtao Liu and Ziyi Liu and Jiaqi Tang and Yue Ma and Renjie Pi and Jipeng Zhang and Qifeng Chen},
      year={2025},
      eprint={2512.20618},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.20618}, 
}
