Stars
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
verl: Volcano Engine Reinforcement Learning for LLMs
A PyTorch native platform for training generative AI models
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
Official implementation of LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment.
Official repo of the paper "Reconstruction Alignment Improves Unified Multimodal Models". Unlocking the Massive Zero-shot Potential in Unified Multimodal Models through Self-supervised Learning.
DDPO for finetuning diffusion models, implemented in PyTorch with LoRA support
Unified Multimodal Model for image generation/editing/understanding
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Official PyTorch Implementation of "Latent Denoising Makes Good Visual Tokenizers"
Official code for ICCV 2025 paper, X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
A framework that allows you to apply sparse autoencoders to any model
SigLIP-based Aesthetic Score Predictor
Open protocol for communication between AI agents, applications, and humans.
Official implementation of the paper "Transfer between Modalities with MetaQueries"
DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
Official codebase for "Self Forcing: Bridging Training and Inference in Autoregressive Video Diffusion" (NeurIPS 2025 Spotlight)
Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Repo for SeedVR2 & SeedVR (CVPR 2025 Highlight)
[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
PyTorch implementation for the paper titled "SimpleAR: Pushing the Frontier of Autoregressive Visual Generation"
[ICML 2025] This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?"
[ICCV 2025] Code release of Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
[CVPR 2025] EgoLife: Towards Egocentric Life Assistant