Stars
[AAAI 2026] ✨ TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
[CVPR2025] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything
New generation of CLIP with strong fine-grained discrimination capability, ICML 2025
FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models
[ECCV 2024] Official PyTorch implementation of TC-CLIP "Leveraging Temporal Contextualization for Video Action Recognition"
[ICLR 2024] FROSTER: Frozen CLIP is a Strong Teacher for Open-Vocabulary Action Recognition
Solve Visual Understanding with Reinforced VLMs
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
official implementation of "Interpreting CLIP's Image Representation via Text-Based Decomposition"
[ACM MM 2024] Hierarchical Multimodal Fine-grained Modulation for Visual Grounding.
[ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models" (tmlr-group/WCA, forked from JinhaoLee/WCA)
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
[NeurIPS2024] - SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
[CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
[ICCV 2025] LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025
[ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[ACM MM 2024] Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning
Question and Answer based on Anything.
Local model support for Microsoft's graphrag using ollama (llama3, mistral, gemma2, phi3) — LLM & embedding extraction