Stars
🔥 An open-source survey of the latest video reasoning tasks, paradigms, and benchmarks.
slime is an LLM post-training framework for RL Scaling.
An interactive AI voice agent that can capture and transcribe speech in real-time, generate intelligent responses using the DeepSeek R1 (7B model) AI, and convert the responses back to natural spee…
GPT-ImgEval: Evaluating GPT-4o’s state-of-the-art image generation capabilities
The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
Open-source and strong foundation image recognition models.
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Official Repo for "TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding" [ACL 2025 oral]
This repo is meant to serve as a guide for Machine Learning/AI technical interviews.
[ICCV'25] DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
[SIGGRAPH Asia 2024, Journal Track] ToonCrafter: Generative Cartoon Interpolation
A collection of resources and papers on Diffusion Models
A collection of diffusion model papers categorized by their subareas
Extend BoxDiff to SDXL (SDXL-based layout-to-image generation)
[ECCV 2024] OMG: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models
Apply unlimited masks to unlimited LoRA models
🚀 Cross attention map tools for huggingface/diffusers
StoryMaker: Towards consistent characters in text-to-image generation
[AAAI 2025] Official implementation of "OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on"