Stars
PyTorch code and models for V-JEPA 2 self-supervised learning from video.
The repository provides code for running inference and finetuning with the Meta Segment Anything Model 3 (SAM 3), links for downloading the trained model checkpoints, and example notebooks that sho…
A real-time approach for mapping all human pixels of 2D RGB images to a 3D surface-based model of the body
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
MOVA: Towards Scalable and Synchronized Video–Audio Generation
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enablin…
[NeurIPS 2025 D&B🔥] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
[CVPR 2026] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
InteractAvatar is a novel dual-stream DiT framework that enables talking avatars to perform Grounded Human-Object Interaction (GHOI)
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
[CVPR 2026] OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
Unlimited-length talking video generation that supports image-to-video and video-to-video generation
Towards Scalable Pre-training of Visual Tokenizers for Generation
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
Official code of Motus: A Unified Latent Action World Model
LLM algorithm-role interview questions (with answers): common questions and concept explanations. Keywords: "LLM interview questions", "algorithm-role interviews", "common interview questions", "LLM algorithm interviews", "LLM application fundamentals"
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 of 60 public benchmarks.
[TPAMI 2025] Official Code for "SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation"
[AAAI 2026] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
[NeurIPS 2025] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
HunyuanVideo-1.5: A leading lightweight video generation model
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.