✨ We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets.
✨ Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher resolution models are less robust against LR.
✨ Our analysis further reveals that the model makes semantically reasonable predictions at LR, and the lack of fine-grained details in input adversely impacts the model’s initial layers more than the deeper layers.
✨ Our proposed LR-TK0 enhances model robustness to low resolution without altering pre-trained weights, demonstrating effectiveness across several datasets and generalization across backbones and other approaches.
✨ This work will also appear in ICCV'25 (Non-Proceedings Tracks)
The x-axis represents relative robustness for each dataset, for all models. The last column indicates SAR (using relative robustness) and WAR (using improved relative robustness).
Setup has instructions for setting up the conda environment to run and train models.
| Model | Backbones | Validation Code | Results |
|---|---|---|---|
| CLIP | CLIP-ViT-B/32, CLIP-ViT-B/16, CLIP-ViT-L/14, CLIP-ViT-L/14@336px, CLIP-RN50, CLIP-RN101, CLIP-RN50x4, CLIP-RN50x16, CLIP-RN50x64 | Code, Script | results_csv, results_pdf |
| BLIP | BLIP-ViT-B/16 (14M), BLIP-ViT-B/16 (129M), BLIP-ViT-B/16 & CapFilt-L (129M), BLIP-ViT-L/16 (129M), BLIP-ViT-B/16 (129M + COCO), BLIP-ViT-B/16 (129M + Flickr), BLIP-ViT-L/16 (129M + COCO), BLIP-ViT-L/16 (129M + Flickr) | code | results_csv, results_pdf |
| MetaCLIP | MetaCLIP-ViT-B/32 (400M), MetaCLIP-ViT-B/32 (2.5B), MetaCLIP-ViT-B/16 (400M), MetaCLIP-ViT-B/16 (2.5B), MetaCLIP-ViT-L/14 (400M), MetaCLIP-ViT-L/14 (2.5B), MetaCLIP-ViT-H/14 (2.5B), MetaCLIP-ViT-G/14 (2.5B) | Code, Script | results_csv, results_pdf |
| EVA-CLIP | EVA-01-CLIP-g/14, EVA-01-CLIP-g/14+, EVA-02-CLIP-B/16, EVA-02-CLIP-E/14, EVA-02-CLIP-E/14+, EVA-02-CLIP-L/14, EVA-02-CLIP-L/14+ | Code, Script | results_csv, results_pdf |
| EVA-CLIP-18B | EVA-CLIP-8B | Code | Last column above |
| CLIPA-v2 | CLIPA(v2)-ViT-G/14, CLIPA(v2)-ViT-G/14@336px, CLIPA(v2)-ViT-H/14, CLIPA(v2)-ViT-H/14@336px (DataComp-1B), CLIPA(v2)-ViT-H/14@336px (LAION-2B), CLIPA(v2)-ViT-L/14, CLIPA(v2)-ViT-L/14@336px | Code | results_csv, results_pdf |
| $M^2$-Encoder | | code | results_csv, results_pdf |
| CoCa | CoCa-ViT-B/32, CoCa-ViT-L/14 (laion2b_s13b_b90k), CoCa-ViT-L/14(laion2b_s13b_b90k + mscoco) | Code | results_csv, results_pdf |
| SigLIP | SigLIP-ViT-B/16, SigLIP-ViT-B/16@256px, SigLIP-ViT-B/16@384px, SigLIP-ViT-B/16@512px, SigLIP-ViT-L/16@256px, SigLIP-ViT-L/16@384px, SigLIP-ViT-SO400M, SigLIP-ViT-SO400M@384px | Code | results_csv, results_pdf |
| OpenCLIP | OpenCLIP-ViT-B/16, OpenCLIP-ViT-B/32@256px, OpenCLIP-ViT-L/14 (laion2b_s32b_b82k), OpenCLIP-ViT-L/14 (datacomp_xl_s13b_b90k), OpenCLIP-ViT-H/14, OpenCLIP-ViT-H/14-quickgelu, OpenCLIP-ViT-H/14-quickgelu@378px, OpenCLIP-ViT-G/14 | Code, Script | results_csv, results_pdf |
| ALBEF | ALBEF (4M), ALBEF (14M), ALBEF (14M + coco_finetuned), ALBEF (14M + flickr_finetuned) | code | results_csv, results_pdf |
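All of the validation scripts above evaluate zero-shot classification on resolution-degraded inputs. As a rough illustration only (not the repo's actual evaluation code), the sketch below uses OpenAI's `clip` package with an assumed 16 px degradation and placeholder image path and class names: the image is bicubically downsampled, and the model's own preprocessing resizes it back up before zero-shot classification.

```python
import clip                      # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # any CLIP backbone from the table

# Assumed 16 px degradation: bicubically downsample, then let the model's own
# preprocessing resize the image back up to its expected input size.
img = Image.open("example.jpg").convert("RGB")              # placeholder image path
lr_img = img.resize((16, 16), Image.BICUBIC)
image = preprocess(lr_img).unsqueeze(0).to(device)

class_names = ["dog", "cat", "car"]                         # placeholder class names
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("predicted class:", class_names[probs.argmax().item()])
```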
Dataset weights.
| Dataset | Weight |
|---|---|
| ImageNet | 0.15556157429688613 |
| ImageNet-A | 0.970498446080589 |
| ImageNet-V2 | 0.2854574367981364 |
| ImageNet-R | 0.01 |
| ImageNet-Sketch | 0.021456095637452655 |
| Caltech101 (300x200) | 0.01 |
| DTD split-1 (300x300 - 640x640) | 0.505922498560715 |
| Food101 (512x512) | 0.01 |
| SUN397 | 0.407563119725743 |
| Stanford Cars (360x240) | 0.13583821249199218 |
| FGVC Aircraft | 0.8229545014750042 |
| Oxford Pets | 0.08995285864599148 |
| Flowers102 | 0.08972060770047119 |
| EuroSAT | 1.0 |
| UCF101 | 0.01 |
Code to compute WAR & Improved Robustness (Eq. 1 in the paper) is shown here. Run `python generate_SAR_WAR.py 16` to generate SAR & WAR scores for all models. Results are dumped inside `MetaData/WAR_SAR_Ranking/`.
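For intuition only, here is a minimal sketch of how per-dataset robustness scores could be aggregated with the weights listed above. It is not `generate_SAR_WAR.py`, the exact definition of improved robustness is Eq. 1 in the paper, and the robustness values below are made-up placeholders.

```python
# Illustrative aggregation sketch (not the repo's generate_SAR_WAR.py).
# dataset_weights come from the table above; the robustness scores are placeholders.
dataset_weights = {
    "ImageNet": 0.15556157429688613,
    "ImageNet-A": 0.970498446080589,
    "EuroSAT": 1.0,
    # ... remaining datasets and weights from the table above
}

robustness = {  # hypothetical per-dataset (improved) relative robustness at 16 px
    "ImageNet": 0.62,
    "ImageNet-A": 0.35,
    "EuroSAT": 0.48,
}

# Unweighted aggregate (SAR-like): plain mean across datasets.
sar_like = sum(robustness.values()) / len(robustness)

# Weighted aggregate (WAR-like): weight-normalized average using the dataset weights.
total_w = sum(dataset_weights[d] for d in robustness)
war_like = sum(dataset_weights[d] * r for d, r in robustness.items()) / total_w

print(f"SAR-like: {sar_like:.3f}  WAR-like: {war_like:.3f}")
```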
A total of 7,000 captions were used to generate images. These captions were randomly sampled from the Google caption dataset and are placed in https://github.com/shyammarjit/LR0.FM/tree/main/MetaData/Captions.
The captions are fed to the diffusion model as follows:
import os
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

ROOT = "synthetic_hr"  # output root directory (placeholder path)
for i, line in enumerate(open("captions.txt")):  # placeholder path to the sampled captions (see MetaData/Captions)
    line = line.strip()  # caption line
    offset = 0
    for fold in range(5):  # 5 folds x 10 images = 50 images per caption
        images = pipe(line, num_images_per_prompt=10).images
        for k, img in enumerate(images):
            os.makedirs(f"{ROOT}/{k + 1 + offset}", exist_ok=True)
            img.save(f"{ROOT}/{k + 1 + offset}/{i}.png")
        offset += 10
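The generated high-resolution images are presumably paired with low-resolution counterparts for training, which can be produced by downsampling. The sketch below shows one such degradation pass; it is illustrative only, not the repo's pipeline, and the paths (`synthetic_hr`, `synthetic_lr`) and the 16 px target are assumptions.

```python
# Hedged sketch (not the repo's code): build low-resolution counterparts of the
# generated images by bicubic downsampling. Paths and the 16 px target are assumptions.
import os
from PIL import Image

HR_ROOT, LR_ROOT, LR_RES = "synthetic_hr", "synthetic_lr", 16

for dirpath, _, filenames in os.walk(HR_ROOT):
    for name in filenames:
        if not name.endswith(".png"):
            continue
        hr = Image.open(os.path.join(dirpath, name)).convert("RGB")
        lr = hr.resize((LR_RES, LR_RES), Image.BICUBIC)   # degrade resolution
        out_dir = dirpath.replace(HR_ROOT, LR_ROOT, 1)    # mirror the HR folder layout
        os.makedirs(out_dir, exist_ok=True)
        lr.save(os.path.join(out_dir, name))
```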
Training code is provided for EVA, MetaCLIP, and OpenCLIP.
Setup has instructions for setting up the conda environment to run and train models.
If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:
@inproceedings{pathak2025lrfm,
  title={{LR0.FM: Low-Res Benchmark and Improving robustness for Zero-Shot Classification in Foundation Models}},
  author={Priyank Pathak and Shyam Marjit and Shruti Vyas and Yogesh S Rawat},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=AsFxRSLtqR}
}