Stars
Step3-VL-10B: A compact yet frontier multimodal model achieving SOTA performance at the 10B scale, matching open-source models 10-20x its size.
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
My learning notes for ML SYS.
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing [ICLR 2026]
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
🚀🚀 "Large Models": Train a small 26M-parameter GPT completely from scratch in just 2 hours! 🌏
Native Multimodal Models are World Learners
Official repository of paper "LOVE-R1: Advancing Long Video Understanding with Adaptive Zoom-in Mechanism via Multi-Step Reasoning"
Implementation of a single layer of the MMDiT, proposed in Stable Diffusion 3, in PyTorch
Official Implementation of "UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation"
[NeurIPS 2025 D&B🔥] ImgEdit: A Unified Image Editing Dataset and Benchmark
(ICLR 2026) An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"
Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"
[ICLR 2026] "VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?", Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun
Implementation of Denoising Diffusion Probabilistic Model in PyTorch
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
Structured Video Comprehension of Real-World Shorts
[ICCV 2025] LVBench: An Extreme Long Video Understanding Benchmark
Official repo and evaluation implementation of VSI-Bench
The Next Step Forward in Multimodal LLM Alignment