
w1oves/hqclip


[ICCV 2025] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets 🚀

Paper PDF | Project Page | Demo | Dataset VLM-150M | Dataset VLM-1B

Authors
Zhixiang Wei¹,², Guangting Wang², et al.
¹ University of Science and Technology of China
² WeChat Vision, Tencent Inc.


🔍 Key Contributions

  • 🏭 Efficient Data Generation Pipeline
    Multi-grained annotation pipeline using Large Vision-Language Models (LVLMs)
  • πŸ—‚οΈ High-Quality Image-Text Datasets
    Generated by state-of-the-art LVLMs with positive/negative examples and rich text descriptions:
  • 🧠 HQ-CLIP Training Framework
    Novel CLIP training paradigm extending contrastive learning with:
    • Negative description supervision
    • Short tag augmentation
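
To give a rough feel for what negative description supervision can look like in a contrastive setup, the sketch below computes an image-to-text InfoNCE loss in which each image also sees one extra negative caption as an additional logit. This is a generic hard-negative formulation assumed purely for illustration; it is not necessarily the loss used in the paper. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def image_to_text_loss(image_embs, text_embs, negative_embs=None, temperature=0.07):
    """Mean image-to-text InfoNCE loss over a batch.

    image_embs[i] and text_embs[i] form the positive pair. If negative_embs
    is given, negative_embs[i] is an extra negative description for image i
    and is appended as one more logit (an assumed formulation, not taken
    from the paper).
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    negatives = [normalize(v) for v in negative_embs] if negative_embs else None

    total = 0.0
    for i, img in enumerate(images):
        # Similarity of image i to every caption in the batch.
        logits = [dot(img, t) / temperature for t in texts]
        if negatives is not None:
            # Extra logit for the negative description of image i.
            logits.append(dot(img, negatives[i]) / temperature)
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(images)


# Toy 2-D embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.9, 0.1], [0.1, 0.9]]

base_loss = image_to_text_loss(images, texts)
loss_with_negatives = image_to_text_loss(images, texts, hard_negatives)
```

Because the negative description only adds a term to the softmax denominator, the loss with negatives is never lower than the plain loss for the same batch; the model has to push the negative caption away to recover the difference.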

Model Overview


Model Zoo

| Model | Pretraining Data | ImageNet Top-1 (%) | DataComp Score |
|---|---|---|---|
| CLIP-B-16 | VLM-150M-Medium | 70.6 | 58.6 |
| CLIP-L-14-CLIPA | VLM-1B | 78.6 | 63.8 |
| CLIP-L-14-OPENAI | VLM-1B | 76.5 | 63.7 |

Recaption Model: Qwen2VL

Datasets

| Dataset | Samples | URL |
|---|---|---|
| VLM-150M | 147M | https://huggingface.co/datasets/zhixiangwei/VLM-150M |
| VLM-1B | 1.37B | https://huggingface.co/datasets/zhixiangwei/VLM-1B |

Dataset Usage Guide

Preparation Steps

  1. (Optional) Download CommonPool Foundation Datasets
    Access CommonPool Large and XLarge versions via:
    CommonPool GitHub Repository

  2. Acquire DFN Base Datasets
    Download DFN Large and XLarge from:
    DFN Hugging Face Datasets

  3. Download HQ-CLIP Datasets
    Obtain our enhanced datasets:

    • VLM-150M
    • VLM-1B

Integration Instructions

Each JSON entry in VLM-150M and VLM-1B corresponds directly to a DFN dataset UID through matching filenames. To utilize our enhanced annotations:

  • Option 1: Direct Caption Replacement
    Substitute the original DFN captions with our generated annotations

  • Option 2: Dynamic Data Loading
    Modify the OpenCLIP data loader to load our annotations at training time
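
Option 1 can be sketched as a UID-keyed lookup. The example below is a minimal illustration under stated assumptions: it supposes that each annotation is a JSON object with `uid` and `caption` fields and that each DFN sample is a dict with `uid` and `text` fields. These field names, and both helper functions, are hypothetical; check the released files for the actual schema and adjust the keys accordingly.

```python
import json


def build_caption_index(json_lines, uid_key="uid", caption_key="caption"):
    """Map UID -> generated caption from an iterable of JSON strings.

    The key names are assumptions; pass the real ones for the released data.
    """
    index = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        index[entry[uid_key]] = entry[caption_key]
    return index


def replace_captions(samples, index, uid_key="uid", text_key="text"):
    """Yield samples with the caption swapped in when the UID is annotated.

    Samples without a matching annotation are passed through unchanged.
    """
    for sample in samples:
        new_caption = index.get(sample[uid_key])
        if new_caption is None:
            yield sample
        else:
            # Copy the sample so the original dict is not mutated.
            yield {**sample, text_key: new_caption}


# Toy data standing in for annotation files and DFN samples (hypothetical).
annotations = [
    '{"uid": "a", "caption": "a detailed description of image a"}',
    '{"uid": "b", "caption": "a detailed description of image b"}',
]
dfn_samples = [
    {"uid": "a", "text": "original alt text"},
    {"uid": "z", "text": "sample without an annotation"},
]

caption_index = build_caption_index(annotations)
updated = list(replace_captions(dfn_samples, caption_index))
```

The same index could back Option 2 by performing the lookup inside the data loader instead of rewriting the shards ahead of time, at the cost of holding the index in memory during training.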

🔜 Detailed implementation guidance will be published in future releases.

Model Loading Instructions

The uploaded weights are compatible with both open_clip and Hugging Face Transformers.

For open_clip users:

import open_clip

# Initialize model with transforms

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
tokenizer = open_clip.get_tokenizer(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
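
Once the model and tokenizer are loaded, a typical next step is zero-shot classification. The sketch below follows the usage pattern from the open_clip README (`encode_image`, `encode_text`, normalized features, and a fixed similarity scale of 100); it is an illustration rather than an official HQ-CLIP evaluation script. The `classify` and `softmax` helpers, the image path, and the label list are hypothetical, and the heavy dependencies are imported lazily so the small `softmax` helper can be used on its own.

```python
import math

MODEL_ID = "hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16"


def softmax(scores, scale=100.0):
    """Turn cosine similarities into probabilities.

    scale=100 mirrors the fixed factor in the open_clip README example;
    the model's learned logit scale may differ.
    """
    scaled = [scale * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_path, labels, model_id=MODEL_ID):
    """Return {label: probability} for one image (downloads the weights)."""
    # Imported here so that importing this module does not require
    # torch, open_clip, or Pillow to be installed.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(model_id)
    tokenizer = open_clip.get_tokenizer(model_id)
    model.eval()

    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer(labels)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)

    # Normalize so the dot product is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    similarities = (image_features @ text_features.T)[0].tolist()
    return dict(zip(labels, softmax(similarities)))
```

A call such as `classify("example.jpg", ["a photo of a cat", "a photo of a dog"])` would return one probability per label, assuming `example.jpg` exists locally and the weights can be downloaded from the Hub.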

For Hugging Face Transformers users:

from transformers import AutoModel

# Load model directly from hub

model = AutoModel.from_pretrained(
    'zhixiangwei/vlm150m-hqclip-large-vitb16'
)

📝 Citation

@misc{hqclip,
      title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models}, 
      author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
      year={2025},
      eprint={2507.22431},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.22431}, 
}

🙏 Acknowledgments

These works have greatly inspired us, providing us with codebases, data, and support. We thank their authors!
