Stars
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
[CVPR 2026 Highlight] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
Text-driven human motion generation surveys, datasets and models.
This repository collects papers on Human-Interaction-Motion-Generation applications. New papers are added irregularly.
A paper list of recent works on token compression for ViT and VLM
A curated list of awesome LLM/VLM/VLA/World Model for Autonomous Driving (LLM4AD) resources (continually updated)
🍀 PyTorch implementations of various attention mechanisms, MLPs, re-parameterization, and convolution modules, helpful for understanding the papers further.⭐⭐⭐
Awesome-LLM-3D: a curated list of resources on Multi-modal Large Language Models in the 3D world
A collection of papers on diffusion models for 3D generation.
Track-Anything is a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything, XMem, and E2FGVI.
Caption-Anything is a versatile tool combining image segmentation, visual captioning, and ChatGPT, generating tailored captions with diverse controls for user preferences. https://huggingface.co/sp…