- Environment Setup
- Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens
- Layers of Interest in a VLM
- Feature Extraction using HuggingFace Datasets
- Output Database
- Demo: Principal Component Analysis over Primitive Concept
- Contributing to VLM-Lens
- Miscellaneous
## Environment Setup

We recommend using a virtual environment to manage your dependencies. You can create one under `venv/vlm-lens-base` with the following commands:

```bash
virtualenv --no-download "venv/vlm-lens-base" --prompt "vlm-lens-base"  # Or "python3.10 -m venv venv/vlm-lens-base"
source venv/vlm-lens-base/bin/activate
```

Then, install the required dependencies:

```bash
pip install --upgrade pip
pip install -r envs/base/requirements.txt
```

Some models require different dependencies, and we recommend creating a separate virtual environment for each of them to avoid conflicts.
For such models, we offer a separate requirements file under `envs/<model_name>/requirements.txt`, which can be installed in the same way as above.
All model-specific environments are independent of the base environment and can be installed individually.
Notes:
- There may be local constraints (e.g., cluster regulations) that cause the above commands to fail. In such cases, feel free to modify them as needed. We welcome issues and pull requests to help us keep the dependencies up to date.
- Some models, due to the resources available at development time, may not be fully supported on modern GPUs. Our released environments are tested on L40s GPUs; we recommend following the error messages to adjust the environment setup for your specific hardware.
## Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens

The general command to run the quick command-line demo is:

```bash
python -m src.main \
    --config <config-file-path> \
    --debug
```

where the optional `--debug` flag prints more detailed outputs.

Note that the config file should be in YAML format, and that any arguments you want to pass to the HuggingFace API should be placed under the `model` key.
See `configs/models/qwen/qwen-2b.yaml` as an example.
The file configs/models/qwen/qwen-2b.yaml contains the configuration for running the Qwen2-VL-2B model.
```yaml
architecture: qwen  # Architecture of the model, see more options in src/models/configs.py
model_path: Qwen/Qwen2-VL-2B-Instruct  # HuggingFace model path
model:  # Model configuration, i.e., arguments to pass to the model
  - torch_dtype: auto
output_db: output/qwen.db  # Output database file to store embeddings
input_dir: ./data/  # Directory containing images to process
prompt: "Describe the color in this image in one word."  # Textual prompt
pooling_method: None  # Pooling method for aggregating token embeddings (options: None, mean, max)
modules:  # List of modules to extract embeddings from
  - lm_head
  - visual.blocks.31
```
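To make the `pooling_method` option concrete, here is a small conceptual sketch of what `None`, `mean`, and `max` pooling mean for per-token embeddings; the token count and hidden size below are illustrative, not Qwen2-VL's actual dimensions:

```python
import torch

# Per-token embeddings of shape (num_tokens, hidden_dim); with pooling_method: None
# the tokens are kept as-is, while mean/max aggregate them into one vector.
hidden_states = torch.randn(12, 1536)  # illustrative: 12 tokens, hidden size 1536

pooled_mean = hidden_states.mean(dim=0)        # pooling_method: mean
pooled_max = hidden_states.max(dim=0).values   # pooling_method: max
print(pooled_mean.shape, pooled_max.shape)     # torch.Size([1536]) torch.Size([1536])
```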
To run the extraction on an available GPU, use the following command:

```bash
python -m src.main --config configs/models/qwen/qwen-2b.yaml --debug
```

If there is no GPU available, you can run it on CPU with:
```bash
python -m src.main --config configs/models/qwen/qwen-2b.yaml --device cpu --debug
```

## Layers of Interest in a VLM

Unfortunately, there is no way to find which layers can potentially be matched without loading the model, and loading can take a considerable amount of time.
Instead, we offer cached results under `logs/` for each model, generated by including the `-l` or `--log-named-modules` flag when running `python -m src.main`.
When using this flag, it is not necessary to set `modules` or anything besides the architecture and the HuggingFace model path.

To specify which layers to match, use Unix-style patterns, where `*` denotes a wildcard.
For example, to match the query projection layers of all attention blocks in Qwen, simply add the following lines to the `.yaml` file:

```yaml
modules:
  - model.layers.*.self_attn.q_proj
```
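For illustration, here is a minimal sketch (not VLM-Lens's own matching code) of how such a wildcard pattern selects module names; Python's `fnmatch` is one standard way to do Unix-style matching, and the hard-coded names below stand in for what `model.named_modules()` or the cached lists under `logs/` would provide:

```python
import fnmatch

# Unix-style pattern from the config above.
pattern = "model.layers.*.self_attn.q_proj"

# In practice these names come from model.named_modules() or the cached logs/ files;
# a few literal names are used here purely for illustration.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "visual.blocks.31",
]

matched = [name for name in module_names if fnmatch.fnmatch(name, pattern)]
print(matched)  # ['model.layers.0.self_attn.q_proj', 'model.layers.1.self_attn.q_proj']
```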
## Feature Extraction using HuggingFace Datasets

To use VLM-Lens with either hosted or local datasets, there are multiple approaches depending on where the input images are located.

First, your dataset must be standardized to a format that includes the attributes `prompt`, `label`, and `image_path`. Here is a snippet of the `compling/coco-val2017-obj-qa-categories` dataset, adjusted to include these attributes:
| id | prompt | label | image_path |
|---|---|---|---|
| 397133 | Is this A photo of a dining table on the bottom | yes | /path/to/397133.png |
| 37777 | Is this A photo of a dining table on the top | no | /path/to/37777.png |
This can be achieved manually or using the helper script in scripts/map_datasets.py.
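As a hypothetical sketch of what this standardization can look like with the HuggingFace `datasets` library (the dataset name and source column names below are made-up placeholders; `scripts/map_datasets.py` is the supported helper):

```python
from datasets import load_dataset

# Hypothetical source dataset and column names, used only to illustrate the
# prompt / label / image_path format that VLM-Lens expects.
ds = load_dataset("some-org/some-vqa-dataset", split="train")

def standardize(row):
    return {
        "prompt": row["question"],
        "label": row["answer"],
        "image_path": row["file_name"],  # filename or relative path
    }

# Replace the original columns with the three standardized attributes.
ds = ds.map(standardize, remove_columns=ds.column_names)
```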
If you are using datasets hosted on a platform such as HuggingFace, you will either use images that are also hosted, or images downloaded locally with an identifier (e.g., the filename) that maps back to the hosted dataset.
You must set the `dataset_path` attribute in your configuration file, along with the appropriate `dataset_split` (if one exists; otherwise leave it out).
```yaml
dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
```

🚨 NOTE: The `image_path` attribute in the dataset must contain either filenames or relative paths, such that a cell value of `train/00023.png` can be joined with `image_dataset_path` to form the full absolute path `/path/to/local/images/train/00023.png`. If the `image_path` attribute does not require any additional path joining, you can leave out the `image_dataset_path` attribute.
```yaml
dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
  - image_dataset_path: /path/to/local/images  # downloaded using configs/dataset/download-coco.yaml
```

If your dataset is stored locally instead, use the `local_dataset_path` attribute:

```yaml
dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train  # leave out if unspecified
```

🚨 NOTE: The same `image_path` requirement applies to local datasets: provide `image_dataset_path` if additional path joining is needed, and leave it out otherwise.
```yaml
dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train  # leave out if unspecified
  - image_dataset_path: /path/to/local/CLEVR/images
```

## Output Database

The output database, specified by the `-o` / `--output-db` flag (or the `output_db` key in the config), stores the extracted embeddings. It contains a single SQL table named `tensors` with the following columns:
`name`, `architecture`, `timestamp`, `image_path`, `prompt`, `label`, `layer`, `tensor_dim`, `tensor`

where each column contains:

- `name`: the model path from HuggingFace.
- `architecture`: one of the supported architecture flags listed above.
- `timestamp`: the time at which the model was run.
- `image_path`: the absolute path to the image.
- `prompt`: the prompt used in that instance.
- `label`: an optional cell storing the "ground-truth" answer, which is helpful in use cases such as classification.
- `layer`: the matched layer from `model.named_modules()`.
- `pooling_method`: the pooling method used for aggregating token embeddings.
- `tensor_dim`: the dimension of the saved tensor.
- `tensor`: the saved embedding.
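As an illustration of reading embeddings back out, here is a minimal sketch that assumes the `tensor` column stores a torch-serialized blob; adjust the deserialization step if the database stores tensors in a different format:

```python
import io
import sqlite3

import torch  # assumption: embeddings are stored as torch-serialized blobs

# Read rows for one layer from the output database produced above.
conn = sqlite3.connect("output/qwen.db")
rows = conn.execute(
    "SELECT name, layer, image_path, tensor_dim, tensor FROM tensors WHERE layer = ?",
    ("lm_head",),
).fetchall()

for name, layer, image_path, tensor_dim, blob in rows:
    embedding = torch.load(io.BytesIO(blob))  # adjust if a different serialization is used
    print(name, layer, image_path, tensor_dim, tuple(embedding.shape))

conn.close()
```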
## Demo: Principal Component Analysis over Primitive Concept

Download license-free images for primitive concepts (e.g., colors):
```bash
pip install -r data/concepts/requirements.txt
python -m data.concepts.download --config configs/concepts/colors.yaml
```

Run the LLaVA model to obtain embeddings of the concept images:

```bash
python -m src.main --config configs/models/llava-7b/llava-7b-concepts-colors.yaml --device cuda
```

Also, run the LLaVA model to obtain embeddings of the test images:

```bash
python -m src.main --config configs/models/llava-7b/llava-7b.yaml --device cuda
```

Several PCA-based analysis scripts are provided:
```bash
pip install -r src/concepts/requirements.txt
python -m src.concepts.pca
python -m src.concepts.pca_knn
python -m src.concepts.pca_separation
```
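As a conceptual sketch only (not what the `src.concepts` scripts do internally), the following shows one way to fit a 2-D PCA over concept embeddings read from an output database with scikit-learn; the database path, layer name, and deserialization are assumptions:

```python
import io
import sqlite3

import numpy as np
import torch
from sklearn.decomposition import PCA

def load_embeddings(db_path, layer):
    """Load (label, embedding) pairs for one layer; assumes torch-serialized blobs."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT label, tensor FROM tensors WHERE layer = ?", (layer,)
    ).fetchall()
    conn.close()
    labels = [label for label, _ in rows]
    vectors = np.stack(
        [torch.load(io.BytesIO(blob)).float().flatten().numpy() for _, blob in rows]
    )
    return labels, vectors

# Hypothetical database path and layer name, for illustration only.
labels, X = load_embeddings("output/llava-concepts-colors.db", "lm_head")
pca = PCA(n_components=2).fit(X)
projected = pca.transform(X)  # 2-D points, e.g., for plotting concepts by label
```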
To launch the interactive demo app, install the additional dependencies and start it with:

```bash
pip install -r demo/requirements.txt
python -m demo.launch_gradio
```

## Contributing to VLM-Lens

We welcome contributions to VLM-Lens! If you have suggestions, improvements, or bug fixes, please consider submitting a pull request; we actively review them.
We generally follow the Google Python Style Guide to ensure readability, with a few exceptions stated in `.flake8`.
We use pre-commit hooks to ensure code quality and consistency. Please make sure to run the following commands before committing:

```bash
pip install pre-commit
pre-commit install
```

## Miscellaneous

To use a specific cache directory, set the `HF_HOME` environment variable as follows:

```bash
HF_HOME=./cache/ python -m src.main --config configs/models/clip/clip.yaml --debug
```
Some models, such as Glamm (GroundingLMM), require separate submodules to be cloned. To use these models, please follow the instructions below to download the submodules.

For Glamm, clone the required submodules with the following command:

```bash
git submodule update --recursive --init
```

See our documentation for details on the installation.
If you find VLM-Lens useful in your work, please cite our paper:

```bibtex
@inproceedings{vlmlens,
  title={From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens},
  author={Hala Sheta and Eric Huang and Shuyu Wu and Ilia Alenabi and Jiajun Hong and Ryker Lin and Ruoxi Ning and Daniel Wei and Jialin Yang and Jiawei Zhou and Ziqiao Ma and Freda Shi},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year={2025}
}
```