中文 | Docs | Hugging Face | Models | Project Page | arXiv
Runtao Liu*, Ziyi Liu*, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
ACL 2026 Main
Hong Kong University of Science and Technology
* Equal contribution
This is the official repository for our ACL 2026 Main paper, LongVideoAgent: Multi-Agent Reasoning with Long Videos. Training and evaluation code are now available, and model weights are released on Hugging Face. This README provides a compact code overview, while the Docs contain the full setup, workflow, and argument details.
- [2026/04/08]: Our paper has been accepted to the ACL 2026 Main Conference.
- [2026/03/22]: We released the LongVideoAgent-Qwen2.5-7B model weights on Hugging Face. This checkpoint was trained on the newversion branch.
- [2026/03/15]: We added the newversion branch for users who want to try the updated verl-based experimental training stack.
- [2026/03/07]: We released the LongVideoAgent-Qwen2.5-3B model weights on Hugging Face.
- [2026/03/06]: We released the training and evaluation code for LongVideoAgent.
- [2026/02/14]: We released the LongTVQA dataset on Hugging Face.
- [2025/12/30]: We released the LongTVQA+ dataset on Hugging Face.
- [2025/12/24]: We released our paper "LongVideoAgent: Multi-Agent Reasoning with Long Videos" on arXiv!
- LongTVQA: https://huggingface.co/datasets/longvideoagent/LongTVQA
- LongTVQA+: https://huggingface.co/datasets/longvideoagent/LongTVQA_plus
We recommend creating a clean Python 3.11 environment and installing the project from the repository root:
conda create -n lvagent python=3.11
conda activate lvagent
pip install vllm
pip install -e .
pip install flash-attn --no-build-isolation
pip install wandb
If dependency resolution fails, you can install the package first without dependencies and then install from requirements.txt:
pip install -e . --no-deps
pip install -r requirements.txt
For step-by-step installation details, see docs/installation.md.
Note
The main branch documents the current default training path. An experimental training stack based on the updated verl is kept on the newversion branch for users who want to try it separately. Training configs, behavior, and outcomes may differ noticeably across branches, so switch branches first and follow the documentation of the branch you are on.
The recommended training flow is: prepare datasets, build an offline grounding cache, convert data to GRPO parquet files, and then launch the quickstart script.
bash scripts/download_and_prepare_longtvqa.sh
python src/dataset/build_grounding_cache.py \
--dataset tvqa_plus \
--questions-path /path/to/train.json \
--subs-path /path/to/all_episodes_subtitles_by_clips.json \
--grounding-model "grok-4-fast-reasoning" \
--grounding-base-url "https://api2.aigcbest.top/v1" \
--output-dir /path/to/cache_dir \
--threads 8
python src/dataset/convert_tvqa_json_to_grpo_parquet.py \
--questions-path /path/to/LongTVQA_or_LongTVQA_plus_questions.jsonl_or_json \
--grounding-cache-json /path/to/grounding_cache.json \
--subtitles-dir /path/to/subtitles_dir \
--output-dir ./data \
--seed 42
bash scripts/quickstart_qwen_2_5_3B_grpo.sh
The quickstart script expects ./data/train.parquet and ./data/val.parquet. Grounding and vision API credentials can be passed by CLI or read from environment variables in the training pipeline.
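Before launching the quickstart script, it can help to confirm that the two parquet files it expects are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_parquet_files` is hypothetical, and only the `./data/train.parquet` and `./data/val.parquet` paths come from this README.

```python
from pathlib import Path


def missing_parquet_files(data_dir="./data", names=("train.parquet", "val.parquet")):
    """Return the expected parquet files that are absent from data_dir.

    Hypothetical helper: it only checks that the files exist, not that
    their contents match what the training pipeline expects.
    """
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_parquet_files()
    if missing:
        print("Missing before quickstart:", ", ".join(missing))
    else:
        print("Found train.parquet and val.parquet in ./data.")
```

If either file is reported missing, rerun the conversion step with `--output-dir ./data`.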
For more detailed training instructions, see Quickstart, Offline Grounding Cache, Convert to Parquet, and GRPO Config Details.
LongVideoAgent provides evaluation scripts for both LongTVQA and LongTVQA+. The difference between the local and API versions lies in how the Master Agent performs reasoning: the local version runs the Master Agent with a local LLM, while the API version calls an API-hosted model for the Master Agent.
python src/evaluation/lvagent/evaluate_local_unified.py \
--dataset tvqa_plus \
--llm-path "/path/to/your/local_llm" \
--max_turn 5 \
--gpu_memory_utilization 0.4
python src/evaluation/lvagent/evaluate_api_unified.py \
--dataset tvqa_plus \
--checkpoint_step api \
--max_turn 5 \
--threads 30
To prepare evaluation-ready dataset files, you can also run:
bash scripts/download_and_prepare_longtvqa.sh
If API keys are not passed explicitly, the evaluation scripts read them from environment variables such as qdd_api and aliyun_api.
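When relying on environment variables, a quick check before a long evaluation run can save a failed job. This is a small sketch under assumptions: the helper `unset_api_keys` is hypothetical, and the two variable names are simply the ones mentioned above; the evaluation scripts may read additional variables.

```python
import os

# Names taken from this README; the scripts may consult other variables too.
EXPECTED_KEYS = ("qdd_api", "aliyun_api")


def unset_api_keys(names=EXPECTED_KEYS, environ=None):
    """Return the variable names that are unset or empty.

    `environ` defaults to os.environ; passing a dict makes the check
    easy to try out without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    unset = unset_api_keys()
    if unset:
        print("Unset API key variables:", ", ".join(unset))
    else:
        print("All expected API key variables are set.")
```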
For full evaluation setups and argument descriptions, see docs/evaluation.md.
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose LongVideoAgent, a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent.
Traditional single-pass MLLMs that ingest an entire long video in one context, typically after heavy downsampling and compression, often miss crucial evidence and produce wrong answers, whereas LongVideoAgent conducts multi-agent, multi-round, multimodal reasoning to extract sparse, task-relevant cues and answer correctly.
Architecture of LongVideoAgent. A MasterAgent runs for up to
Unlike single-pass models, LongVideoAgent operates in a bounded loop (max
- <request_grounding>: Calls the GroundingAgent to localize relevant video segments based on subtitles. The agent returns a symbolic tag <clip_X>.
- <visual_query>: Calls the VisionAgent to extract specific visual details (objects, actions, text) from the localized clip. The agent returns textual observations.
- <answer>: Terminates the loop and provides the final response when sufficient evidence is gathered.
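The bounded action loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the repository's implementation: `run_master_loop`, the callable signatures, the closing-tag syntax, and the default of 5 turns (borrowed from the `--max_turn 5` evaluation flag) are all assumptions, and the three agents are replaced by toy stand-ins.

```python
import re
from typing import Callable, List

# Matches the three action tags; the closing-tag syntax is an assumption.
ACTION_RE = re.compile(r"<(request_grounding|visual_query|answer)>(.*?)</\1>", re.S)


def run_master_loop(
    question: str,
    master: Callable[[List[str]], str],
    grounding_agent: Callable[[str], str],
    vision_agent: Callable[[str, str], str],
    max_turns: int = 5,  # assumed default, mirroring --max_turn 5
) -> str:
    """Run the bounded loop and return the final answer ('' if none)."""
    context = [f"Question: {question}"]
    clip = ""  # most recent <clip_X> tag from the grounding agent
    for _ in range(max_turns):
        output = master(context)
        context.append(output)
        match = ACTION_RE.search(output)
        if match is None:
            context.append("Observation: no valid action tag found.")
            continue
        action, payload = match.group(1), match.group(2).strip()
        if action == "answer":
            return payload  # terminates the loop
        if action == "request_grounding":
            clip = grounding_agent(payload)
            context.append(f"Observation: {clip}")
        else:  # visual_query on the currently localized clip
            context.append(f"Observation: {vision_agent(clip, payload)}")
    return ""  # step limit reached without an <answer>


def scripted_master(context: List[str]) -> str:
    """Toy policy: ground first, then query vision, then answer."""
    seen = sum(1 for line in context if line.startswith("Observation:"))
    steps = [
        "<request_grounding>Who opens the door?</request_grounding>",
        "<visual_query>Who is at the door?</visual_query>",
        "<answer>B</answer>",
    ]
    return steps[min(seen, 2)]


answer = run_master_loop(
    "Who opens the door?",
    master=scripted_master,
    grounding_agent=lambda query: "<clip_3>",
    vision_agent=lambda clip, query: f"In {clip}: a man in a red coat opens the door.",
)
print(answer)  # prints "B"
```

In the real system the `master` callable would be the LLM policy and the two agents would call the grounding and vision models; the sketch only shows how the step limit and the three tags interact.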
We optimize the MasterAgent using Group Relative Policy Optimization (GRPO). The training objective includes:
1. Structural validity.
2. Answer correctness: rewarding the agent for reaching the correct final answer.
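To make the two reward terms and the group-relative step concrete, here is a hedged sketch. The reward weights, the exact-match answer check, the tag regexes, and the function names are assumptions made for illustration and are not taken from the training code; only the general GRPO idea of normalizing each rollout's reward by the mean and standard deviation of its sampling group is standard.

```python
import re
import statistics
from typing import List, Sequence

TAG_RE = re.compile(r"<(request_grounding|visual_query|answer)>.*?</\1>", re.S)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)


def trajectory_reward(
    turn_outputs: Sequence[str],
    gold: str,
    format_weight: float = 0.1,  # assumed weight, not from the paper
    answer_weight: float = 1.0,  # assumed weight, not from the paper
) -> float:
    """Hypothetical reward: structural validity plus answer correctness."""
    if not turn_outputs:
        return 0.0
    # Fraction of turns that contain a well-formed action tag.
    valid = sum(1 for out in turn_outputs if TAG_RE.search(out))
    format_score = valid / len(turn_outputs)
    # Exact-match check on the final turn's <answer> payload.
    match = ANSWER_RE.search(turn_outputs[-1])
    correct = match is not None and match.group(1).strip() == gold
    return format_weight * format_score + answer_weight * float(correct)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]


group = [
    ["<request_grounding>q</request_grounding>", "<answer>B</answer>"],  # correct
    ["<request_grounding>q</request_grounding>", "<answer>C</answer>"],  # wrong
]
rewards = [trajectory_reward(turns, gold="B") for turns in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Rollouts that score above their group's mean receive positive advantages and the rest negative ones, which is what lets GRPO train without a separate value model.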
We evaluate LongVideoAgent on LongTVQA and LongTVQA+, which are episode-level datasets.
Performance on LongTVQA and LongTVQA+. The left block lists model attributes (Agentic, Input, RL fine-tune); the right block reports validation accuracy (%). GPT-4o and Gemini-2.5 Pro are multimodal baselines that take the full long video as input directly. Methods labeled Agentic indicate that the model operates as the MasterAgent; methods labeled AgenticRL additionally denote RL fine-tuning. Parenthesized green numbers denote absolute gains over the immediately preceding (non-agentic or non-RL) setting. We observe that: (i) our multi-agent framework, LongVideoAgent, consistently outperforms its non-agentic counterparts; (ii) agentic RL yields additional gains, especially for smaller open-source models; (iii) using frames provides visual evidence beyond subtitles and generally outperforms subtitle-only inputs; (iv) closed-source models remain strong, but the gap narrows considerably when open-source models adopt agentic designs and agentic RL.
We conduct comprehensive ablation studies to validate our design choices. First, we observe that both grounding and vision agents are essential, with the full multi-agent system achieving the highest accuracy. Second, increasing the reasoning step limit
If you find our work helpful, please cite:
@misc{liu2025longvideoagentmultiagentreasoninglong,
title={LongVideoAgent: Multi-Agent Reasoning with Long Videos},
author={Runtao Liu and Ziyi Liu and Jiaqi Tang and Yue Ma and Renjie Pi and Jipeng Zhang and Qifeng Chen},
year={2025},
eprint={2512.20618},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.20618},
}