Stars
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Fully Open Framework for Democratized Multimodal Training
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
20+ high-performance LLMs with recipes to pretrain, finetune, and deploy at scale.
Minimalistic large language model 3D-parallelism training
Training Large Language Model to Reason in a Continuous Latent Space
Official codebase for the paper Latent Visual Reasoning
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
[NeurIPS 2024] Visual Perception by Large Language Model’s Weights
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
A toolkit for developing and comparing reinforcement learning algorithms.
SGLang is a fast serving framework for large language models and vision language models.
An Easy-to-use, Scalable, and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & TIS & vLLM & Ray & Dynamic Sampling & Async Agentic RL)
Enforce the output format (JSON Schema, Regex, etc.) of a language model
This is the della guide for Zhuang's group at Princeton University.
[ICCV 2025] Video-T1: Test-Time Scaling for Video Generation
A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.