- Beijing, China
- https://geyixiao.com/
- @ge_yixiao
Stars
Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
Structured Video Comprehension of Real-World Shorts
[ACL2026 Findings] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
[NeurIPS2025] The official implementation of MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Awesome Unified Multimodal Models
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
[ICCV 2025] AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
[ICCV 2025, Oral] TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, with strong general video understanding capability.
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
[arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Minimal reproduction of DeepSeek R1-Zero
Fully open reproduction of DeepSeek-R1
🔥🔥🔥 [IEEE TCSVT] Latest Papers, Codes and Datasets on Vid-LLMs.
A fork to add multimodal model training to open-r1
NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more.
Diffusion Powers Video Tokenizer for Comprehension and Generation (CVPR 2025)
[ICCV2025 Oral] Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Awesome papers & datasets specifically focused on long-term videos.