Stars
Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.
The code for "InstructSAM: Segment Any Instance with Any Instructions"
A 3B-active-parameter native unified multimodal model for image and video understanding, generation, and editing.
🔥 Official code repository for "Unlocking Dense Metric Depth Estimation in VLMs"
Official codebase for "Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation"
Official Implementation of CoInteract: Spatially-Structured Co-Generation for Interactive Human-Object Video Synthesis
Generative Refinement Networks for Visual Synthesis (Support C2I & T2I & T2V)
Official implementation of "OmniForcing: Unleashing Real-time Joint Audio-Visual Generation" [arXiv:2603.11647]. OmniForcing is the first framework to distill bidirectional audio-visual diffusion mo…
Helios: Real Real-Time Long Video Generation Model
Reinforcement Learning Framework for Visual Generation
[Tech Report] Alive: A Unified Audio-Video Generation Model
[ICML 2026] Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
Implementation of "Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length"
[NeurIPS 2025 Oral] Infinity⭐️: Unified Spacetime AutoRegressive Modeling for Visual Generation
[AAAI 2026] Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
Unlimited-length talking video generation that supports image-to-video and video-to-video generation
We present StableAvatar, the first end-to-end video diffusion transformer, which synthesizes infinite-length high-quality audio-driven avatar videos without any post-processing, conditioned on a re…
[AAAI 2026] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
Industry-level video foundation model for unified Text-to-Video (T2V) and Image-to-Video (I2V) generation.
Resources list for multimodal agentic reasoning
[ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
[CVPR2025 Highlight] Video Generation Foundation Models: https://saiyan-world.github.io/goku/
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
Official implementation of HCoG: Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation [CVPR 2025]
[ICCV 2025] RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents