Scaling RL to long videos

Y Chen, W Huang, B Shi, Q Hu, H Ye… - Advances in …, 2026 - proceedings.neurips.cc
scale VLMs for reasoning over long videos. LongVILA-R1 encompasses a meticulously
constructed large-scale … Leveraging our curated dataset of 104K long video question-reasoning-…

Video-RTS: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning

Z Wang, J Yoon, S Yu, MM Islam… - Proceedings of the …, 2025 - aclanthology.org
… large-scale supervised fine-tuning (SFT) data with long CoT … and directly utilize pure RL
training on simple video question-… pure RL training and sparse-to-dense video test-time scaling

ScaleLong: A multi-timescale benchmark for long video understanding

D Ma, H Yuan, X Wang, Q Zang, T Liu, X He… - arXiv preprint arXiv …, 2025 - arxiv.org
… distinct scale. ScaleLong includes 269 diverse long videos (averaging 86 minutes), with 4-8
questions per video (at least one per scale), across 5 major categories and 36 subcategories. …

Video-R1: Reinforcing video reasoning in MLLMs

K Feng, K Gong, B Li, Z Guo, Y Wang… - Advances in …, 2026 - proceedings.neurips.cc
… based reinforcement learning (RL), we introduce Video-R1 as the … To further explore the
impact of scaling up reinforcement … strategies that allow scaling to longer videos, enabling more …

VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

Y Ding, Y Zhang, X Lai, R Chu, Y Yang - arXiv preprint arXiv:2512.22315, 2025 - arxiv.org
… for single turn, combining RL-driven reasoning and multi-turn tool use strategy for long …
The x-axis (log scale) represents the fixed frame budget for the baseline and the average …

Kimi k1.5: Scaling reinforcement learning with LLMs

K Team, A Du, B Gao, B Xing, C Jiang, C Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
… By scaling up RL training, we aim to train a model that … our work is to scale long-context RL
training. Partial rollouts … handling long-CoT features by managing the rollouts of both long and …

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

H Zhang, X Gu, J Li, C Ma, S Bai, C Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
… We construct two large-scale, high-quality multi-task video … frame sampling strategy for
efficient long video understanding. … RL framework for efficient and accurate long video reasoning. …

Time-R1: Post-training large vision language model for temporal video grounding

Y Wang, Z Wang, B Xu, Y Du, K Lin… - Advances in …, 2026 - proceedings.neurips.cc
… Existing benchmarks for temporal video grounding either focus on large-scale datasets
tailored for … We also compare RL and SFT strategies across TVG, short video QA, and long

RL-VideoAlign: Reinforcement Learning for Long-Horizon Aligned, Temporally Consistent, and Interaction-Credible Video Generation

B Run, S Li, S Wang - researchgate.net
… -scale datasets like CityFlow [26] and VERI-Wild [27], which emphasize the difficulty of
cross-camera and long-… keeping subjects coherent during complex 3D rotations in RL-VideoAlign. …

EasyVideoR1: Easier RL for Video Understanding

C Qin, C Yang, Q Si, N Gu, D Yao, Z Lin, P Fu… - arXiv preprint arXiv …, 2026 - arxiv.org
… To further prevent long video sequences from … -source video-language models at this scale.
We train on approximately 100K video samples assembled from publicly available video RL