- Environment Setup
- Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens
- Layers of Interest in a VLM
- Feature Extraction using HuggingFace Datasets
- Output Database
- Demo: Principal Component Analysis over Primitive Concept
- Contributing to VLM-Lens
- Miscellaneous
## Environment Setup

We recommend using a virtual environment to manage your dependencies. You can create one under `venv/vlm-lens-base` with the following commands:

```bash
virtualenv --no-download "venv/vlm-lens-base" --prompt "vlm-lens-base"  # Or "python3.10 -m venv venv/vlm-lens-base"
source venv/vlm-lens-base/bin/activate
```

Then, install the required dependencies:

```bash
pip install --upgrade pip
pip install -r envs/base/requirements.txt
```

Some models require different dependencies, and we recommend creating a separate virtual environment for each of them to avoid conflicts.
For such models, we offer a separate requirements file under `envs/<model_name>/requirements.txt`, which can be installed in the same way as above.
All model-specific environments are independent of the base environment and can be installed individually.
Notes:
- There may be local constraints (e.g., cluster regulations) that cause the above commands to fail. In such cases, feel free to modify them as needed. We welcome issues and pull requests to help us keep the dependencies up to date.
- Some models, due to the resources available at development time, may not be fully supported on modern GPUs. Our released environments are tested on L40s GPUs; we recommend following the error messages to adjust the environment setup for your specific hardware.
## Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens

The general command to run the quick command-line demo is:

```bash
python -m src.main \
    --config <config-file-path> \
    --debug
```

where the optional `--debug` flag prints more detailed outputs.

Note that the config file should be in YAML format, and that any arguments you want to pass to the HuggingFace API should be placed under the `model` key.
See `configs/models/qwen/qwen-2b.yaml` as an example.
The file configs/models/qwen/qwen-2b.yaml contains the configuration for running the Qwen2-VL-2B model.
```yaml
architecture: qwen  # Architecture of the model, see more options in src/models/configs.py
model_path: Qwen/Qwen2-VL-2B-Instruct  # HuggingFace model path
model:  # Model configuration, i.e., arguments to pass to the model
  - torch_dtype: auto
output_db: output/qwen.db  # Output database file to store embeddings
input_dir: ./data/  # Directory containing images to process
prompt: "Describe the color in this image in one word."  # Textual prompt
pooling_method: None  # Pooling method for aggregating token embeddings (options: None, mean, max)
modules:  # List of modules to extract embeddings from
  - lm_head
  - visual.blocks.31
```
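To make the `pooling_method` option concrete, here is a small conceptual sketch of what `None`, `mean`, and `max` pooling mean for per-token embeddings; the token count and hidden size below are illustrative, not Qwen2-VL's actual dimensions:

```python
import torch

# Per-token embeddings of shape (num_tokens, hidden_dim); with pooling_method: None
# the tokens are kept as-is, while mean/max aggregate them into one vector.
hidden_states = torch.randn(12, 1536)  # illustrative: 12 tokens, hidden size 1536

pooled_mean = hidden_states.mean(dim=0)        # pooling_method: mean
pooled_max = hidden_states.max(dim=0).values   # pooling_method: max
print(pooled_mean.shape, pooled_max.shape)     # torch.Size([1536]) torch.Size([1536])
```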
To run the extraction on an available GPU, use the following command:

```bash
python -m src.main --config configs/models/qwen/qwen-2b.yaml --debug
```

If there is no GPU available, you can run it on CPU with:
```bash
python -m src.main --config configs/models/qwen/qwen-2b.yaml --device cpu --debug
```

## Layers of Interest in a VLM

Unfortunately, there is no way to find which layers can potentially be matched without loading the model, and loading can take a considerable amount of time.
Instead, we offer cached results under `logs/` for each model, generated by including the `-l` or `--log-named-modules` flag when running `python -m src.main`.
When using this flag, it is not necessary to set `modules` or anything besides the architecture and the HuggingFace model path.

To specify which layers to match, use Unix-style patterns, where `*` denotes a wildcard.
For example, to match the query projection layers of all attention blocks in Qwen, simply add the following lines to the `.yaml` file:

```yaml
modules:
  - model.layers.*.self_attn.q_proj
```
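For illustration, here is a minimal sketch (not VLM-Lens's own matching code) of how such a wildcard pattern selects module names; Python's `fnmatch` is one standard way to do Unix-style matching, and the hard-coded names below stand in for what `model.named_modules()` or the cached lists under `logs/` would provide:

```python
import fnmatch

# Unix-style pattern from the config above.
pattern = "model.layers.*.self_attn.q_proj"

# In practice these names come from model.named_modules() or the cached logs/ files;
# a few literal names are used here purely for illustration.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "visual.blocks.31",
]

matched = [name for name in module_names if fnmatch.fnmatch(name, pattern)]
print(matched)  # ['model.layers.0.self_attn.q_proj', 'model.layers.1.self_attn.q_proj']
```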
## Feature Extraction using HuggingFace Datasets

To use VLM-Lens with either hosted or local datasets, there are multiple approaches depending on where the input images are located.

First, your dataset must be standardized to a format that includes the attributes `prompt`, `label`, and `image_path`. Here is a snippet of the `compling/coco-val2017-obj-qa-categories` dataset, adjusted to include these attributes:
| id | prompt | label | image_path |
|---|---|---|---|
| 397133 | Is this A photo of a dining table on the bottom | yes | /path/to/397133.png |
| 37777 | Is this A photo of a dining table on the top | no | /path/to/37777.png |
This can be achieved manually or using the helper script in scripts/map_datasets.py.
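As a hypothetical sketch of what this standardization can look like with the HuggingFace `datasets` library (the dataset name and source column names below are made-up placeholders; `scripts/map_datasets.py` is the supported helper):

```python
from datasets import load_dataset

# Hypothetical source dataset and column names, used only to illustrate the
# prompt / label / image_path format that VLM-Lens expects.
ds = load_dataset("some-org/some-vqa-dataset", split="train")

def standardize(row):
    return {
        "prompt": row["question"],
        "label": row["answer"],
        "image_path": row["file_name"],  # filename or relative path
    }

# Replace the original columns with the three standardized attributes.
ds = ds.map(standardize, remove_columns=ds.column_names)
```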
If you are using datasets hosted on a platform such as HuggingFace, you will either use images that are also hosted, or images downloaded locally with an identifier (e.g., the filename) that maps back to the hosted dataset.
You must set the `dataset_path` attribute in your configuration file, along with the appropriate `dataset_split` (if one exists; otherwise leave it out).
```yaml
dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
```

🚨 NOTE: The `image_path` attribute in the dataset must contain either filenames or relative paths, such that a cell value of `train/00023.png` can be joined with `image_dataset_path` to form the full absolute path `/path/to/local/images/train/00023.png`. If the `image_path` attribute does not require any additional path joining, you can leave out the `image_dataset_path` attribute.
```yaml
dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
  - image_dataset_path: /path/to/local/images  # downloaded using configs/dataset/download-coco.yaml
```

If your dataset is stored locally instead, use the `local_dataset_path` attribute:

```yaml
dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train  # leave out if unspecified
```

🚨 NOTE: The same `image_path` requirement applies to local datasets: provide `image_dataset_path` if additional path joining is needed, and leave it out otherwise.
```yaml
dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train  # leave out if unspecified
  - image_dataset_path: /path/to/local/CLEVR/images
```

## Output Database

The output database, specified by the `-o` / `--output-db` flag (or the `output_db` key in the config), stores the extracted embeddings. It contains a single SQL table named `tensors` with the following columns:
`name`, `architecture`, `timestamp`, `image_path`, `prompt`, `label`, `layer`, `tensor_dim`, `tensor`

where each column contains:

- `name`: the model path from HuggingFace.
- `architecture`: one of the supported architecture flags listed above.
- `timestamp`: the time at which the model was run.
- `image_path`: the absolute path to the image.
- `prompt`: the prompt used in that instance.
- `label`: an optional cell storing the "ground-truth" answer, which is helpful in use cases such as classification.
- `layer`: the matched layer from `model.named_modules()`.
- `pooling_method`: the pooling method used for aggregating token embeddings.
- `tensor_dim`: the dimension of the saved tensor.
- `tensor`: the saved embedding.
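As an illustration of reading embeddings back out, here is a minimal sketch that assumes the `tensor` column stores a torch-serialized blob; adjust the deserialization step if the database stores tensors in a different format:

```python
import io
import sqlite3

import torch  # assumption: embeddings are stored as torch-serialized blobs

# Read rows for one layer from the output database produced above.
conn = sqlite3.connect("output/qwen.db")
rows = conn.execute(
    "SELECT name, layer, image_path, tensor_dim, tensor FROM tensors WHERE layer = ?",
    ("lm_head",),
).fetchall()

for name, layer, image_path, tensor_dim, blob in rows:
    embedding = torch.load(io.BytesIO(blob))  # adjust if a different serialization is used
    print(name, layer, image_path, tensor_dim, tuple(embedding.shape))

conn.close()
```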
## Demo: Principal Component Analysis over Primitive Concept

Download license-free images for primitive concepts (e.g., colors):
```bash
pip install -r data/concepts/requirements.txt
python -m data.concepts.download --config configs/concepts/colors.yaml
```

Run the LLaVA model to obtain embeddings of the concept images:

```bash
python -m src.main --config configs/models/llava-7b/llava-7b-concepts-colors.yaml --device cuda
```

Also, run the LLaVA model to obtain embeddings of the test images:

```bash
python -m src.main --config configs/models/llava-7b/llava-7b.yaml --device cuda
```

Several PCA-based analysis scripts are provided:
```bash
pip install -r src/concepts/requirements.txt
python -m src.concepts.pca
python -m src.concepts.pca_knn
python -m src.concepts.pca_separation
```
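As a conceptual sketch only (not what the `src.concepts` scripts do internally), the following shows one way to fit a 2-D PCA over concept embeddings read from an output database with scikit-learn; the database path, layer name, and deserialization are assumptions:

```python
import io
import sqlite3

import numpy as np
import torch
from sklearn.decomposition import PCA

def load_embeddings(db_path, layer):
    """Load (label, embedding) pairs for one layer; assumes torch-serialized blobs."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT label, tensor FROM tensors WHERE layer = ?", (layer,)
    ).fetchall()
    conn.close()
    labels = [label for label, _ in rows]
    vectors = np.stack(
        [torch.load(io.BytesIO(blob)).float().flatten().numpy() for _, blob in rows]
    )
    return labels, vectors

# Hypothetical database path and layer name, for illustration only.
labels, X = load_embeddings("output/llava-concepts-colors.db", "lm_head")
pca = PCA(n_components=2).fit(X)
projected = pca.transform(X)  # 2-D points, e.g., for plotting concepts by label
```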
To launch the interactive demo app, install the additional dependencies and start it with:

```bash
pip install -r demo/requirements.txt
python -m demo.launch_gradio
```

## Contributing to VLM-Lens

We welcome contributions to VLM-Lens! If you have suggestions, improvements, or bug fixes, please consider submitting a pull request; we actively review them.
We generally follow the Google Python Style Guide to ensure readability, with a few exceptions stated in `.flake8`.
We use pre-commit hooks to ensure code quality and consistency. Please make sure to run the following commands before committing:

```bash
pip install pre-commit
pre-commit install
```

## Miscellaneous

To use a specific cache directory, set the `HF_HOME` environment variable as follows:

```bash
HF_HOME=./cache/ python -m src.main --config configs/models/clip/clip.yaml --debug
```
Some models, such as Glamm (GroundingLMM), require separate submodules to be cloned. To use these models, please follow the instructions below to download the submodules.

For Glamm, clone the required submodules with the following command:

```bash
git submodule update --recursive --init
```

See our documentation for details on the installation.
If you find VLM-Lens useful in your work, please cite our paper:

```bibtex
@inproceedings{vlmlens,
  title={From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens},
  author={Hala Sheta and Eric Huang and Shuyu Wu and Ilia Alenabi and Jiajun Hong and Ryker Lin and Ruoxi Ning and Daniel Wei and Jialin Yang and Jiawei Zhou and Ziqiao Ma and Freda Shi},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year={2025}
}
```