Stars
From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
High-Resolution 3D Assets Generation with Large Scale Hunyuan3D Diffusion Models.
A survey on MM-LLMs for long video understanding: From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
A modern static resume template and theme. Powered by Jekyll and GitHub pages.
ATS and Human-friendly Resume Writer in Markdown.
📄 Awesome CV is a LaTeX template for your outstanding job application
An elegant \LaTeX\ résumé template. Mainland China mirror: https://gods.coding.net/p/resume/git
Masked Depth Modeling for Spatial Perception
The repository provides code for running inference and finetuning with the Meta Segment Anything Model 3 (SAM 3), links for downloading the trained model checkpoints, and example notebooks that sho…
[CVPR'25 Oral] MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
[ICCV 2019] Monocular depth estimation from a single image
TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding
Papers related to wireless large AI models and wireless foundation models.
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding
Open-source and strong foundation image recognition models.
[CVPR24] Official Implementation of GEM (Grounding Everything Module)
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
Official repository of the paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted fo…
This repository is intended to host tools and demos for ActivityNet
[ICLR 2026] FastVGGT: Fast Visual Geometry Transformer
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
TALL: Temporal Activity Localization via Language Query
Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
PyTorch code and models for VJEPA2 self-supervised learning from video.
[ICCV 2023] UniVTG: Towards Unified Video-Language Temporal Grounding