Stars
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
🚀🚀 [LLM] Train a 26M-parameter GPT completely from scratch in just 2 hours! 🌏
Native Multimodal Models are World Learners
Official repository of the paper "LOVE-R1: Advancing Long Video Understanding with Adaptive Zoom-in Mechanism via Multi-Step Reasoning"
Implementation of a single layer of the MMDiT, proposed in Stable Diffusion 3, in PyTorch
Official Implementation of "UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation"
[NeurIPS 2025 D&B🔥] ImgEdit: A Unified Image Editing Dataset and Benchmark
An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"
Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"
A benchmark for evaluating vision-centric, complex video reasoning.
Implementation of the Denoising Diffusion Probabilistic Model in PyTorch
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
Structured Video Comprehension of Real-World Shorts
[ICCV 2025] LVBench: An Extreme Long Video Understanding Benchmark
Official repo and evaluation implementation of VSI-Bench
The Next Step Forward in Multimodal LLM Alignment
Ultra-high-performance, secure, all-in-one acceleration engine for developer resources
Tracking the latest and greatest research papers on video generation.
Long-RL: Scaling RL to Long Sequences (NeurIPS 2025)
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning