Lists (28)
3D-from-mono
3d_recon
3DGS
AIGC
Anything-Model
autonomous driving
C++
CG
chatgpt
CUDA_tools
deep-learning
diffusion
digital-avatar
image_edit
LLM-AI
Localization
mono-depth
MVS
nerf
occupancy-network
optical flow
pose_est
sample_utils
simulation
SLAM
stylization
tools
world-model
Stars
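🌟 This project automatically crawls and indexes article metadata from 科学空间 (Scientific Spaces) and classifies it by research topic using rules, making it easy to browse on GitHub and jump to the original articles.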
[ICML'26] Code and website for Self-Flow: Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Code to pretrain, fine-tune, and evaluate DreamZero and run sim & real-world evals
GLM-Image: Auto-regressive for Dense-knowledge and High-fidelity Image Generation.
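An AI-native slides (PPT) generator based on nano banana pro🍌, moving toward "Vibe PPT"; supports uploading any template image, uploading any material with intelligent parsing, automatic PPT generation from a single sentence, an outline, or page descriptions, verbal edits to selected regions, and one-click export to an editable PPT.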
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
StreamDiffusion, Live Stream APP
Native Multimodal Models are World Learners
A minimal implementation of DeepMind's Genie world model
HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation
[ICLR 2026] Pyramidal Patchification Flow for Visual Generation (PPFlow)
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Latent Bridge Matching for Fast Image-to-Image Translation (ICCV 2025 Highlight)
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
[ICLR 2026] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
[CVPR 2026] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Multi-lingual large voice generation model with full-stack support for inference, training, and deployment.
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
✨✨Latest Advances on Multimodal Large Language Models
ViPE: Video Pose Engine for Geometric 3D Perception
Reference PyTorch implementation and models for DINOv3
High-resolution models for human tasks.
Industry-level video foundation model for unified Text-to-Video (T2V) and Image-to-Video (I2V) generation.