📄 Paper | 🌐 Project Page
Shengao Wang, Arjun Chandra, Aoming Liu, Venkatesh Saligrama, Boqing Gong
Boston University
This is the codebase of the BabyVLM evaluation suite, integrated with the lmms-eval framework. Specifically, this repository provides four additional evaluation tasks (Labeled-S, Visual Two-Word Test, Baby Winoground, SAYCam Caption) and implements the model wrapper for the BabyLLaVA series from the paper.
Install this package by cloning the repository and running the following commands:
git clone https://github.com/ShawnKing98/babylmms-eval.git
cd babylmms-eval
conda create -n babyvlm python=3.10
conda activate babyvlm
pip install -e .

Optionally, install the dependencies for BabyLLaVA by following the instructions in the BabyLLaVA repository.
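To sanity-check the editable install, a minimal check might look like the following; it only verifies that the package is importable and does not exercise any task or model:

```bash
# Quick sanity check: confirm the editable install of lmms_eval is importable.
python -c "import lmms_eval; print('lmms_eval imported successfully')"
```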
The BabyVLM evaluation tasks use data from the SAYCam dataset, along with our own synthetic data. The SAYCam dataset is hosted on the Databrary platform, and we are still seeking an appropriate platform to host our synthetic data. All the data labels are already included in this repository, but due to the terms of use we cannot publicly share the images here; interested researchers can apply for access on Databrary with approval from their institution's IRB.
Below are the steps to prepare the data:
- Acquire SAYCam images: Instead of directly using the raw SAYCam videos, we use the frames extracted by the authors of this paper. Download the `frames.txt` file from Databrary, change the suffix from `.txt` to `.zip`, and unzip the file into your local directory; you should end up with a folder containing 600,285 images (~14 GB), referred to below as `path/to/saycam_images/` (see the example commands after this list).
- Acquire synthetic data: The synthetic data is generated by GPT-4o and used in the Baby Winoground task. As we are still seeking a platform to host the synthetic data, please contact us to get access to it.
- Post-process: After acquiring the SAYCam images and synthetic data, run the following commands to put the images in the right place:
cd babylmms-eval
ln -s path/to/saycam_images/ dataset/labeled_s/images
ln -s path/to/saycam_images/ dataset/vtwt/images
ln -s path/to/saycam_images/ dataset/SAYCam_caption/images
ln -s path/to/saycam_images/ dataset/baby_winoground/positive_images
ln -s path/to/synthetic_images/ dataset/baby_winoground/negative_images
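As referenced in the "Acquire SAYCam images" step above, the download-and-unzip part might look like the following shell sketch. The paths are placeholders, and the assumption that the renamed archive extracts directly into a flat folder of frames may need adjusting for your download:

```bash
# Hypothetical walk-through of the frames.txt preparation described above.
mv frames.txt frames.zip                     # the Databrary download is really a zip archive
unzip frames.zip -d path/to/saycam_images/   # extract the frames (~14 GB)
ls path/to/saycam_images/ | wc -l            # expect roughly 600,285 images
```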
In order to evaluate the LLaVA and BabyLLaVA series models, please check our BabyLLaVA repository and install the necessary dependencies before running the evaluation.

Evaluation of LLaVA-v1.5 on BabyVLM tasks
accelerate launch --num_processes=1 -m lmms_eval \
--model llava \
--model_args pretrained=liuhaotian/llava-v1.5-7b,conv_template=plain \
--task vtwt,labeled_s,baby_winoground,saycam_caption \
--batch_size 16 \
--output_path ./logs \
--trust_remote_code

Evaluation of BabyLLaVA on BabyVLM tasks
accelerate launch --num_processes=1 -m lmms_eval \
--model babyllava \
--model_args pretrained=wsashawn/babyllava_resnext_gpt2,conv_template=plain \
--task vtwt,labeled_s,baby_winoground,saycam_caption \
--batch_size 16 \
--output_path ./logs \
--trust_remote_code

More details about the usage of this package can be found in the original lmms-eval repository.
Please refer to the model guide documentation for instructions on how to add your own model. Note that both the `generate_until` and `loglikelihood` methods need to be implemented, as both are used in the BabyVLM evaluation tasks.
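As a rough illustration, a custom wrapper typically has the following shape. This is only a sketch: the import paths, base class, registry decorator, and request/return formats are assumed to mirror the existing lmms-eval model wrappers (e.g. the LLaVA one used above), so check the model guide and an existing wrapper for the authoritative signatures before relying on it.

```python
# Hypothetical skeleton of a custom model wrapper for the BabyVLM tasks.
# Imports and method signatures are assumed from existing lmms-eval wrappers;
# verify them against the lmms-eval version you have installed.
from typing import List, Tuple

from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model


@register_model("my_vlm")  # hypothetical name, used as --model my_vlm
class MyVLM(lmms):
    def __init__(self, pretrained: str = "path/or/hub-id", **kwargs) -> None:
        super().__init__()
        # Load your vision-language model, tokenizer, and image processor here.
        self.pretrained = pretrained

    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        # Score each candidate continuation; used by the multiple-choice style
        # tasks (e.g. Labeled-S, Visual Two-Word Test, Baby Winoground).
        results = []
        for request in requests:
            # ... compute the log-probability of the continuation and whether
            # it matches the model's greedy output ...
            results.append((0.0, False))  # placeholder values
        return results

    def generate_until(self, requests: List[Instance]) -> List[str]:
        # Produce free-form generations; used by the SAYCam Caption task.
        outputs = []
        for request in requests:
            # ... run generation with the model ...
            outputs.append("")  # placeholder generation
        return outputs
```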
Please cite us if you use this repository in your work.
@misc{wang2025babyvlmdataefficientpretrainingvlms,
title={BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning},
author={Shengao Wang and Arjun Chandra and Aoming Liu and Venkatesh Saligrama and Boqing Gong},
year={2025},
eprint={2504.09426},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.09426},
}