Stars
[ICLR’26] Official PyTorch Implementation of “HDR-NSFF: High Dynamic Range Neural Scene Flow Fields”
VGGSounder, a multi-label audio-visual classification dataset with modality annotations.
Open-source video dataset with dynamic scenes and camera movement annotations
[ICCV 2025] A simple training-free approach adapting DUSt3R for dynamic scenes.
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
Official Repo For Pixel-LLM Codebase: Sa2VA (Arxiv-25), SAMTok (CVPR-26), VRT, SaSaSa2VA (1st-place solution for LSVOS)
✨✨Latest Advances on Multimodal Large Language Models
[NeurIPS 2024] Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba
Simulation platform for general-purpose robotics & embodied AI learning.
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
[ICCV 2025] LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
Code for the project "MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos"
Official repository for "MMM: Generative Masked Motion Model" (CVPR 2024 -- Highlight)
[ICLR 2024] The official implementation of the paper "VDT: General-purpose Video Diffusion Transformers via Mask Modeling", by Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding.
[CVPR 2024] AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
mok0102 / Video-LLaVA
Forked from PKU-YuanGroup/Video-LLaVA
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Code for the paper Physics-as-Inverse-Graphics: Joint Unsupervised Learning of Objects and Physics from Video
A data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
Paper list about multimodal and large language models, only used to record papers I read in the daily arxiv for personal needs.
SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion
📖 A curated list of resources dedicated to talking face.
Implementation of Korean FastSpeech2
The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
Summary of publicly available resources such as code, datasets, and scientific papers for the FLAME 3D head model
A curated list of audio-visual learning methods and datasets.
The repo for studying and sharing diffusion models.
[ICCV 2023] Understanding 3D Object Interaction from a Single Image
Track-Anything is a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything, XMem, and E2FGVI.