Notice (2026.2.21): This work has been accepted to CVPR 2026.
Official implementation of VITAL: Vision-Encoder-centered pretraining for LMMs in visual quality assessment.
VITAL-Series contains two major components:
- VITAL-LMM: training/evaluation code for VITAL main models.
- VITAL-linear-probe: visual encoder extension workflows (e.g., linear-probe and lightweight downstream adaptation).
Use the provided environment file:
conda env create -f environment.yml
If needed, adjust the CUDA/PyTorch versions to match your machine.
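To decide whether the CUDA/PyTorch versions need adjusting, it can help to see what was actually installed. The following is a minimal sketch, not part of the repository: it assumes `environment.yml` has a standard top-level `name:` field and that `conda` is on your `PATH`. The helper names `env_name_from_yml` and `report_torch_cuda` are hypothetical.

```shell
# Sketch: read the conda environment name from an environment file.
# Assumption: the file has a standard top-level "name:" entry.
env_name_from_yml() {
    # Print the value of the first top-level "name:" line.
    sed -n 's/^name:[[:space:]]*//p' "$1" | head -n 1
}

# Print the PyTorch version, the CUDA version it was built against,
# and whether a GPU is visible, inside the named environment.
report_torch_cuda() {
    conda run -n "$1" python -c \
        'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
}

# Usage (from the repository root, after creating the environment):
#   name="$(env_name_from_yml environment.yml)"
#   report_torch_cuda "$name"
```

If the reported CUDA version is newer than what your driver supports, that is a sign the versions pinned in `environment.yml` need changing.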
- Download VITAL-Assistant-8B, VITAL-Base-8B, and VITAL-Vision-Encoder-300M.
- Place LMM-related models under VITAL-LMM.
- For visual-encoder extension experiments (e.g., linear-probe), place VITAL-Vision-Encoder-300M under VITAL-linear-probe.
- Additional zero/warm-up series models are available on Hugging Face (see Model Zoo).
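One possible way to fetch and place the checkpoints is the Hugging Face CLI. This is a hedged sketch: the repository IDs are those listed in the Model Zoo, but the per-model subfolder names are an assumption (the scripts may expect a different layout), and the helper names `download_cmd` and `plan_downloads` are hypothetical. It requires the `huggingface_hub` package, which provides `huggingface-cli`.

```shell
# Sketch: build Hugging Face CLI download commands for the three models.
# Assumption: each model goes into a subfolder named after its repo.
download_cmd() {
    # $1 = Hugging Face repo ID, $2 = destination parent directory
    echo "huggingface-cli download $1 --local-dir $2/${1##*/}"
}

plan_downloads() {
    download_cmd JZHWS/VITAL-Assistant-8B VITAL-LMM
    download_cmd JZHWS/VITAL-Base-8B VITAL-LMM
    download_cmd JZHWS/VITAL-Vision-Encoder-300M VITAL-linear-probe
}

# Usage (from the repository root):
#   plan_downloads          # dry run: only prints the commands
#   plan_downloads | sh     # actually download
```

Printing the commands first lets you check the destination paths against what the evaluation and training scripts expect before downloading several gigabytes.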
cd VITAL-LMM
cd VITAL-LMM/test
- Edit the JSON configs in shell/eval/eval_data:
  - Update root and annotation to point to your image/video paths and annotation files.
- Run evaluation scripts:
For quality scoring:
bash shell/eval/evaluate_image.sh
bash shell/eval/evaluate_video.sh
For text generation:
bash shell/eval/evaluate_qbench.sh
bash shell/eval/evaluate_qbench_video_single_dev.sh
- Evaluation entry scripts are in internvl/eval:
  - Default scoring: scoring.py
  - Faster video scoring: scoring_less_token.py
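Before launching an evaluation, it can save time to confirm that the `root` and `annotation` entries in a JSON config point to paths that exist. The sketch below is not part of the repository. It assumes each value is a plain quoted string on the same line as its key, and the helper names `config_paths` and `check_config` are hypothetical.

```shell
# Sketch: extract "root" and "annotation" values from a JSON config.
# Assumption: values are plain quoted strings next to their keys.
config_paths() {
    grep -oE '"(root|annotation)"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
        sed -E 's/^"[^"]*"[[:space:]]*:[[:space:]]*"([^"]*)"$/\1/'
}

# Report every referenced path that does not exist.
# Returns non-zero if at least one path is missing.
check_config() {
    config_paths "$1" | {
        missing=0
        while IFS= read -r p; do
            if [ ! -e "$p" ]; then
                echo "missing: $p"
                missing=1
            fi
        done
        [ "$missing" -eq 0 ]
    }
}

# Usage:
#   for f in shell/eval/eval_data/*.json; do check_config "$f"; done
```

This only checks that the paths exist. It does not validate the annotation file format.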
If you want to use scoring_less_token.py, modify line 31 in shell/eval/evaluate_custom_scoring.sh accordingly.
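The change to line 31 can be sketched as follows. This is an illustration under a stated assumption: that line 31 currently references `scoring.py`. This README does not show that line, so inspect it first. The helper names `show_line_31` and `use_less_token_scoring` are hypothetical.

```shell
# Sketch: switch line 31 of a script from scoring.py to scoring_less_token.py.
# Assumption: line 31 contains the text "scoring.py".
show_line_31() {
    sed -n '31p' "$1"
}

use_less_token_scoring() {
    # Keep a .bak copy, then rewrite only line 31 of the original file.
    cp "$1" "$1.bak"
    sed '31s/scoring\.py/scoring_less_token.py/' "$1.bak" > "$1"
}

# Usage:
#   show_line_31 shell/eval/evaluate_custom_scoring.sh
#   use_less_token_scoring shell/eval/evaluate_custom_scoring.sh
#   show_line_31 shell/eval/evaluate_custom_scoring.sh
```

If line 31 does not mention `scoring.py`, the substitution leaves the file unchanged and the line should be edited by hand.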
cd VITAL-LMM/train
Use scripts in training_shell (update data/model paths before running):
bash shell/pretrain.sh
bash shell/warm_up.sh
cd VITAL-linear-probe
This module supports training and testing with non-LLM heads (e.g., linear probes) on top of VITAL-Vision-Encoder.
bash shell/probe_finetune.sh
bash shell/evaluate_video.sh
Please update the file paths in the scripts for your local setup.
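To locate the paths that need updating, one option is to list every line in the shell scripts that contains an absolute path. This is a rough sketch, not part of the repository: the pattern is a heuristic and misses relative paths, and the helper name `list_abs_paths` is hypothetical.

```shell
# Sketch: list lines in *.sh files that contain an absolute path.
# A path counts when "/name/" follows line start, whitespace, "=", or a quote,
# so shebang lines such as "#!/usr/bin/env bash" are not reported.
list_abs_paths() {
    # $1 = directory containing the shell scripts
    grep -HnE '(^|[[:space:]="'\''])/[A-Za-z0-9._-]+/' "$1"/*.sh
}

# Usage (from VITAL-linear-probe):
#   list_abs_paths shell
```

Each output line has the form `file:line:content`, which makes it easy to jump to the right place in an editor.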
- VITAL-Base-8B: https://huggingface.co/JZHWS/VITAL-Base-8B
- VITAL-Assistant-8B: https://huggingface.co/JZHWS/VITAL-Assistant-8B
- VITAL-Warm-up-1B: https://huggingface.co/JZHWS/VITAL-Warm-up-1B
- VITAL-Warm-up-2B: https://huggingface.co/JZHWS/VITAL-Warm-up-2B
- VITAL-Warm-up-14B: https://huggingface.co/JZHWS/VITAL-Warm-up-14B
- VITAL-Vision-Encoder-300M: https://huggingface.co/JZHWS/VITAL-Vision-Encoder-300M
- VITAL-Linear-Probe: https://huggingface.co/JZHWS/VITAL-Linear-Probe
If you use this project, please cite:
@article{jia2025vital,
title={VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment},
author={Jia, Ziheng and Cao, Linhan and Han, Jinliang and Zhang, Zicheng and Qian, Jiaying and Wang, Jiarui and Chen, Zijian and Zhai, Guangtao and Min, Xiongkuo},
journal={arXiv preprint arXiv:2511.17962},
year={2025}
}
@inproceedings{jia2025vqa2,
title={Vqa2: visual question answering for video quality assessment},
author={Jia, Ziheng and Zhang, Zicheng and Qian, Jiaying and Wu, Haoning and Sun, Wei and Li, Chunyi and Liu, Xiaohong and Lin, Weisi and Zhai, Guangtao and Min, Xiongkuo},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
pages={6751--6760},
year={2025}
}
@inproceedings{zhang2025q,
title={Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs},
author={Zhang, Zicheng and Jia, Ziheng and Wu, Haoning and Li, Chunyi and Chen, Zijian and Zhou, Yingjie and Sun, Wei and Liu, Xiaohong and Min, Xiongkuo and Lin, Weisi and others},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={3229--3239},
year={2025}
}
For custom environments, adjust file paths and parameters as needed. If you encounter problems, feel free to open an issue in this repository.