Organizations

@TencentARC

Python 87 2 Updated Jun 2, 2026
Python 94 3 Updated Jun 2, 2026

Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

43 2 Updated Nov 19, 2025

Structured Video Comprehension of Real-World Shorts

Python 238 7 Updated Sep 21, 2025
Python 794 24 Updated Jun 10, 2026

[ACL2026 Findings] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Python 84 2 Updated Jun 23, 2025

[NeurIPS2025] The official implementation of MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Python 140 3 Updated Oct 15, 2025

Awesome Unified Multimodal Models

1,284 40 Updated Mar 24, 2026

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Python 236 6 Updated Aug 18, 2025

ICML2025

Python 65 5 Updated Aug 28, 2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Python 94 2 Updated Jul 13, 2025

[ICCV 2025] AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Python 347 29 Updated Apr 9, 2025
Python 101 2 Updated Jun 23, 2025

[ICCV 2025, Oral] TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Python 856 42 Updated Dec 17, 2025

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

Python 2,295 154 Updated Jun 11, 2026

Tarsier: a family of large-scale video-language models designed to generate high-quality video descriptions, with strong general video understanding capability.

Python 549 31 Updated Aug 14, 2025

[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Python 1,574 93 Updated Apr 16, 2026

EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL

Python 5,018 373 Updated Apr 6, 2026

[arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Jupyter Notebook 97 3 Updated Mar 1, 2025

Minimal reproduction of DeepSeek R1-Zero

Python 13,170 1,585 Updated Feb 27, 2026

Fully open reproduction of DeepSeek-R1

Python 26,327 2,444 Updated Apr 2, 2026

🔥🔥🔥 [IEEE TCSVT] Latest Papers, Codes and Datasets on Vid-LLMs.

3,210 144 Updated Jun 13, 2026

A fork to add multimodal model training to open-r1

Python 1,568 72 Updated Feb 8, 2025

NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more.

Jupyter Notebook 10,306 679 Updated Jun 17, 2026
Jupyter Notebook 31 1 Updated Apr 11, 2025

Diffusion Powers Video Tokenizer for Comprehension and Generation (CVPR 2025)

Python 87 3 Updated Feb 27, 2025

[ICCV2025 Oral] Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

Python 179 8 Updated Oct 1, 2025
Python 110 9 Updated Nov 27, 2024

Awesome papers & datasets specifically focused on long-term videos.

380 14 Updated Oct 9, 2025