hongbo-sun

[AAAI 2026] ✨ TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Python 122 11 Updated Nov 12, 2025

Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Python 146 7 Updated Jun 30, 2025

[CVPR2025] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Python 62 5 Updated Oct 10, 2025

RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything

Jupyter Notebook 1,075 81 Updated Jun 14, 2024

New generation of CLIP with strong fine-grained discrimination capability, ICML 2025

Python 563 35 Updated Oct 27, 2025
Python 3 1 Updated Dec 31, 2024

FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Python 114 14 Updated May 3, 2025

[ECCV 2024] Official PyTorch implementation of TC-CLIP "Leveraging Temporal Contextualization for Video Action Recognition"

Python 92 10 Updated Feb 25, 2025

[ICLR 2024] FROSTER: Frozen CLIP is a Strong Teacher for Open-Vocabulary Action Recognition

Python 97 9 Updated Jan 14, 2025

Solve Visual Understanding with Reinforced VLMs

Python 5,927 378 Updated Mar 12, 2026
Python 8 5 Updated Oct 30, 2023

VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

Python 341 18 Updated Apr 17, 2025

Official implementation of "Interpreting CLIP's Image Representation via Text-Based Decomposition"

Jupyter Notebook 233 28 Updated Jun 1, 2025

[ACM MM 2024] Hierarchical Multimodal Fine-grained Modulation for Visual Grounding.

Python 60 6 Updated Nov 10, 2025

[ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models"

Python 58 4 Updated Sep 3, 2024

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Jupyter Notebook 871 58 Updated Jul 20, 2025

[NeurIPS2024] - SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Python 101 4 Updated Oct 29, 2025

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.

Python 9,947 770 Updated Sep 22, 2025

[CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Jupyter Notebook 153 7 Updated Sep 3, 2025

code for MANET

Python 2 1 Updated Mar 3, 2023

[ICCV 2025] LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning

Python 2,131 82 Updated Dec 12, 2025

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025

Python 278 9 Updated May 26, 2025

[ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Python 330 19 Updated Jul 4, 2025

Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Jupyter Notebook 18,892 1,715 Updated Jan 30, 2026

[ACM MM 2024] Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning

Python 8 1 Updated Jul 22, 2024
Python 353 27 Updated May 25, 2024

Question and Answer based on Anything.

Python 13,928 1,342 Updated Mar 24, 2025

Local model support for Microsoft's graphrag using ollama (llama3, mistral, gemma2, phi3): LLM and embedding extraction

Python 1,096 166 Updated Sep 30, 2024