
🎨 Multimodal LLMs Can Reason about Aesthetics in Zero-Shot (ACM MM 2025)


We demonstrate that visual aesthetics can be reasoned about in a zero-shot manner, outperforming SOTA image aesthetic assessment models.

Paper (ACM) | arXiv


🖼️ The FineArtBench Dataset

Figure: overview of the FineArtBench dataset.

FineArtBench is by far the largest and most comprehensive benchmark for evaluating aesthetic-judgment ability on fine arts. It contains 1,000 content images and 1,000 style images, with high-quality human semantic and judgment annotations.

⬇️ Download Guide (Click to Expand)

Option 1: via HuggingFace Hub

First, ensure you have a stable connection to Hugging Face. The dataset can be found at this link. You can use the following code to download it programmatically:

from datasets import load_dataset

# Download and cache FineArtBench from the Hugging Face Hub
dataset = load_dataset("Ruixiang/FineArtBench")
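
After the download completes, you can print the returned object to inspect the available splits and fields:

print(dataset)  # shows the dataset's splits and features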

Option 2: via Manual Download

Download Link: 百度网盘 (Baidu NetDisk) | Google Drive

After downloading, please extract the dataset and place it under the data/ folder in the repository root. The directory structure should be as follows:

MLLM4Art/
├── data/
│   ├── 2AFC/               # the pair-wise comparison tasks
│   ├── base/               # the content and style images and their annotations
│   ├── human_annotation/   # the human aesthetic judgments for the 2AFC tasks
│   └── painting/           # the paintings generated by different models

🛠️ Benchmark Guide (Click to Expand)

Environment Setup

Create a virtual environment and install the required packages:
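
For example, using Python's built-in venv module (any environment manager works; the environment name is just a placeholder):

python -m venv mllm4art
source mllm4art/bin/activate   # on Windows: mllm4art\Scripts\activate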

pip install -r requirements.txt

Evaluate Custom Aesthetic Evaluators on FineArtBench

To test your own local model on the 2AFC tasks of FineArtBench, please refer to custom_model_evaluate.py. Specifically, you should implement the Evaluator interface with your own model inference logic and map the model's predictions to pairwise judgments. Then you can run the benchmark with the following command:

python custom_model_evaluate.py 
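
As a rough sketch of what an implementation might look like (the method name, signature, and the .score() call below are illustrative assumptions; the actual Evaluator interface is defined in custom_model_evaluate.py):

from typing import Literal

class MyEvaluator:  # in practice, implement the Evaluator interface from custom_model_evaluate.py
    """Wraps a local aesthetic model and maps its scores to pairwise judgments."""

    def __init__(self, model):
        self.model = model

    def judge(self, image_a: str, image_b: str) -> Literal["A", "B"]:
        # Score each painting independently with your own inference logic,
        # then declare the higher-scoring one the winner of the 2AFC pair.
        score_a = self.model.score(image_a)
        score_b = self.model.score(image_b)
        return "A" if score_a >= score_b else "B"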

To test API-based MLLMs (e.g., ChatGPT), please refer to mllm_API_evaluate.py. You need to set up your own API key and base URL in the script. Example configs can be found in ./APIConfig/. Then you can run the benchmark with the following command:

python mllm_API_evaluate.py --config <REPLACE_WITH_YOUR_CONFIG_PATH>
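
For orientation, such a config might look like the following; the field names are assumptions for illustration, so consult the files in ./APIConfig/ for the actual schema:

{
  "_comment": "hypothetical schema; see ./APIConfig/ for the real format",
  "api_key": "<YOUR_API_KEY>",
  "base_url": "https://api.openai.com/v1",
  "model": "<MODEL_NAME>"
}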

The expected output is a JSON file in the same format as ./data/2AFC/2AFC_global_N_5000.json, but with the winner field filled according to your model's predictions. We suggest always evaluating all 5,000 pairwise tasks, as the correlation script will handle any missing human annotations.
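
For intuition, a single entry in the output might look like the following; only the winner field is described above, and the other field names are illustrative guesses, so treat ./data/2AFC/2AFC_global_N_5000.json as the authoritative schema:

{
  "image_A": "data/painting/<model_A>/<image_id>.png",
  "image_B": "data/painting/<model_B>/<image_id>.png",
  "winner": "A"
}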

Benchmark Correlation Performance

Once you have the model predictions in the required JSON format, you can evaluate the correlation performance with human judgments using benchmark.py. This is the script that gives you the quantitative scores (correlation and statistical significance). You can run the benchmark with the following command:

python benchmark.py --human_annotation <PATH_TO_HUMAN_ANNOTATION_JSON> \
--model_annotation <PATH_TO_YOUR_MODEL_PREDICTION_JSON> --mode <global/instance>

The --human_annotation files can be found in the ./data/human_annotation/ folder. The --mode argument specifies whether to compute the global (per-artist) correlation or the instance-level (per-task) correlation, corresponding to the two columns in Table 1 of our paper. For the global correlation, please always report the score over 5 random splits.

For the human annotations, the JSON files without the _paper suffix are recommended; they contain 40% more annotations and have undergone additional quality control. To exactly reproduce the results in our paper, please use the files with the _paper suffix.


🚀 ArtCoT for Human-Aligned Aesthetic Reasoning

We propose ArtCoT to enhance the inference-time reasoning capability of MLLMs on aesthetic judgment. An example conversation is provided below. A detailed quantitative comparison can be found in the paper. The full responses from the MLLMs in our experiments will also be released to facilitate further research.

Figure: an example ArtCoT conversation.
