Stars
A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explor…
Open source implementation of "Vision Transformers Need Registers"
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th… (a minimal inference sketch appears after this list)
[ICCV 2023] Tracking Anything with Decoupled Video Segmentation
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
An open-source project dedicated to tracking and segmenting any objects in videos, either automatically or interactively. The primary algorithms utilized include the Segment Anything Model (SAM) fo…
Code accompanying our paper "Improved Baselines for Data-efficient Perceptual Augmentation of LLMs"
Locating and editing factual associations in GPT (NeurIPS 2022)
Code for paper: "What’s in the Image? A Deep-Dive into the Vision of Vision Language Models" (CVPR 2025)
The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing"
Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs
Official code for the CVPR 2024 paper: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
[ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight)
[NeurIPS 2024] Official code for (IMA) Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Code for the paper "Head Pursuit: Probing Attention Specialization in Multimodal Transformers" [NeurIPS 2025 spotlight]
Official implementation of "MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model". Our codes are borrowed from Tang's language specific neurons imple…
Code for "Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers" (Findings of ACL 2024)
[ICLR'25] Official code for "Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models"
Code for the paper: Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery. ECCV 2024.
A framework that allows you to apply a Sparse AutoEncoder (SAE) to any model (a generic SAE sketch appears after this list)
[ICCV 2025] Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis.
Code for reproducing our paper "Not All Language Model Features Are Linear"
A curated list of resources for activation engineering
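The SAM 2 entry above mentions inference code and example notebooks; below is a minimal sketch of prompted image prediction with that repository's `SAM2ImagePredictor`, assuming the `sam2` package is installed and a GPU is available. The checkpoint/config paths, image file, and point prompt are illustrative placeholders, not values taken from the repo.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Illustrative paths: substitute the checkpoint and config you actually downloaded.
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Placeholder image path.
image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # A single foreground point prompt at pixel (x=500, y=375); label 1 = foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```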
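For the SAE framework entry above: as a generic illustration of the technique (not that framework's API), a sparse autoencoder is trained to reconstruct a model's hidden activations through a wider latent layer with an L1 sparsity penalty. A minimal PyTorch sketch, with hypothetical dimensions:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over a model's hidden activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps latent codes non-negative; the L1 penalty below makes them sparse.
        latents = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(latents)
        return reconstruction, latents

def sae_loss(activations, reconstruction, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latent code.
    mse = (activations - reconstruction).pow(2).mean()
    return mse + l1_coeff * latents.abs().mean()

# One training step on stand-in activations (in practice, capture them from a hooked model).
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)
recon, lat = sae(acts)
loss = sae_loss(acts, recon, lat)
loss.backward()
optimizer.step()
```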