The first NR-IQA model enhanced by reinforcement learning to rank (RL2R), capable of both quality description and quality rating through reasoning.
Our series of works: [MLLMs for IQA] [A-FINE] [MANIQA]
- Reinforcement learning to rank (see the illustrative sketch below)
- Training with multiple datasets
- Capability of visual quality captioning and rating
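To make the learning-to-rank highlight concrete, here is a minimal, self-contained sketch of a pairwise ranking reward, assuming a Thurstone-style comparison model and a fidelity measure between predicted and ground-truth comparison probabilities. This is an illustration only; the exact training objective is specified in the paper.

import math

def comparative_probability(score_a, score_b, sigma=1.0):
    # Thurstone-style probability that image A is rated above image B,
    # given scalar quality scores and a comparison noise level sigma.
    z = (score_a - score_b) / (math.sqrt(2.0) * sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

def fidelity_reward(p_pred, p_gt):
    # Fidelity between predicted and ground-truth comparison probabilities;
    # equals 1 when the two binary distributions agree exactly.
    return math.sqrt(p_pred * p_gt) + math.sqrt((1.0 - p_pred) * (1.0 - p_gt))

# Toy usage: the model scores two images 4.1 vs 2.8, and human MOS says A is better.
p = comparative_probability(4.1, 2.8)
print(fidelity_reward(p, 1.0))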
- [18/09/25] 🔥 Our paper has been accepted as a Spotlight at NeurIPS 2025!
- [05/06/25] 🔥 We have released the training code of VisualQuality-R1.
- [25/05/25] 🤗 We have released the final version of VisualQuality-R1-7B.
- [16/05/25] 🤗 We have released a preview version, VisualQuality-R1-7B-preview, trained on three datasets: KADID-10K, TID2013, and KonIQ-10k.
We currently support training on A100 and A800 GPUs. To set up the training environment, use the following commands:
conda create -n visualquality python=3.11.10
conda activate visualquality
bash setup.sh
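After setup, a quick sanity check (our suggestion, not part of the official scripts) confirms that PyTorch sees the GPU and that FlashAttention is importable, since the inference examples below request attn_implementation="flash_attention_2":

python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"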
When you run inference with VisualQuality-R1 as a reward/evaluation model, you can use the non-thinking mode to reduce inference time: the model skips the reasoning trace and generates only a short answer containing the score, using the following prompt:
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
For single-image quality rating, use the following code:
Example Code (VisualQuality-R1: Single Image Quality Rating with non-thinking mode)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import random
import re
import os
def score_image(image_path, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': image_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
batch_messages = [message]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
reasoning = None
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
score = random.randint(1, 5)
return reasoning, score
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
reasoning, score = score_image(
image_path, model, processor
)
print(score)

Example Code (VisualQuality-R1: Batch Images Quality Rating with non-thinking mode)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm
import torch
import random
import re
import os
def get_image_paths(folder_path):
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
image_paths = []
for root, dirs, files in os.walk(folder_path):
for file in files:
_, ext = os.path.splitext(file)
if ext.lower() in image_extensions:
image_paths.append(os.path.join(root, file))
return image_paths
def score_batch_image(image_paths, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
messages = []
for img_path in image_paths:
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': img_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
messages.append(message)
BSZ = 32
all_outputs = [] # List to store all answers
for i in tqdm(range(0, len(messages), BSZ)):
batch_messages = messages[i:i + BSZ]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
all_outputs.extend(batch_output_text)
path_score_dict = {}
for img_path, model_output in zip(image_paths, all_outputs):
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
print(f"Meet error with {img_path}, please generate again.")
score = random.randint(1, 5)
path_score_dict[img_path] = score
return path_score_dict
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
image_root = ""
image_paths = get_image_paths(image_root) # It should be a list
path_score_dict = score_batch_image(
image_paths, model, processor
)
file_name = "output.txt"
with open(file_name, "w") as file:
for key, value in path_score_dict.items():
file.write(f"{key} {value}\n")
print("Done!")Example Code (VisualQuality-R1: Single Image Quality Rating with thinking)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import random
import re
import os
def score_image(image_path, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
    # Alternative: ask for a quality description instead of a score
    # QUESTION_TEMPLATE = "Please describe the quality of this image."
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': image_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
batch_messages = [message]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
    reasoning = re.findall(r'<think>(.*?)</think>', batch_output_text[0], re.DOTALL)
    reasoning = reasoning[-1].strip() if reasoning else None  # guard against a missing <think> block
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
score = random.randint(1, 5)
return reasoning, score
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
reasoning, score = score_image(
image_path, model, processor
)
print(reasoning)
print(score)

Example Code (VisualQuality-R1: Batch Images Quality Rating with thinking mode)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm
import torch
import random
import re
import os
def get_image_paths(folder_path):
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
image_paths = []
for root, dirs, files in os.walk(folder_path):
for file in files:
_, ext = os.path.splitext(file)
if ext.lower() in image_extensions:
image_paths.append(os.path.join(root, file))
return image_paths
def score_batch_image(image_paths, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
messages = []
for img_path in image_paths:
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': img_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
messages.append(message)
BSZ = 32
all_outputs = [] # List to store all answers
for i in tqdm(range(0, len(messages), BSZ)):
batch_messages = messages[i:i + BSZ]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
all_outputs.extend(batch_output_text)
path_score_dict = {}
for img_path, model_output in zip(image_paths, all_outputs):
        reasoning = re.findall(r'<think>(.*?)</think>', model_output, re.DOTALL)
        reasoning = reasoning[-1].strip() if reasoning else None  # not returned; print it here if you need the rationale
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
print(f"Meet error with {img_path}, please generate again.")
score = random.randint(1, 5)
path_score_dict[img_path] = score
return path_score_dict
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
image_root = ""
image_paths = get_image_paths(image_root) # It should be a list
path_score_dict = score_batch_image(
image_paths, model, processor
)
file_name = "output.txt"
with open(file_name, "w") as file:
for key, value in path_score_dict.items():
file.write(f"{key} {value}\n")
print("Done!")Example Code (VisualQuality-R1: Batch Images Quality Rating with thinking, using vLLM)
# Please install vLLM first: https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html
from transformers import Qwen2_5_VLProcessor, AutoProcessor
from vllm import LLM, RequestOutput, SamplingParams
from qwen_vl_utils import process_vision_info
import torch
import random
import re
import os
IMAGE_PATH = "./images"
MODEL_PATH = "TianheWu/VisualQuality-R1-7B"
def get_image_paths(folder_path):
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
image_paths = []
for root, dirs, files in os.walk(folder_path):
for file in files:
_, ext = os.path.splitext(file)
if ext.lower() in image_extensions:
image_paths.append(os.path.join(root, file))
return image_paths
def score_batch_image(image_paths, model: LLM, processor: Qwen2_5_VLProcessor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
messages = []
for img_path in image_paths:
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': img_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
messages.append(message)
all_outputs = [] # List to store all answers
# Preparation for inference
print("preprocessing ...")
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in messages]
image_inputs, video_inputs = process_vision_info(messages)
inputs = [{
"prompt": texts[i],
"multi_modal_data": {
"image": image_inputs[i]
},
} for i in range(len(messages))]
output: list[RequestOutput] = model.generate(
inputs,
sampling_params=SamplingParams(
max_tokens=512,
temperature=0.1,
top_k=50,
top_p=1.0,
stop_token_ids=[processor.tokenizer.eos_token_id],
),
)
batch_output_text = [o.outputs[0].text for o in output]
all_outputs.extend(batch_output_text)
path_score_dict = {}
for img_path, model_output in zip(image_paths, all_outputs):
print(f"{model_output = }")
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
print(f"Meet error with {img_path}, please generate again.")
score = random.randint(1, 5)
path_score_dict[img_path] = score
return path_score_dict
random.seed(1)
model = LLM(
model=MODEL_PATH,
tensor_parallel_size=1,
trust_remote_code=True,
seed=1,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
image_paths = get_image_paths(IMAGE_PATH) # It should be a list
path_score_dict = score_batch_image(
image_paths, model, processor
)
file_name = "output.txt"
with open(file_name, "w") as file:
for key, value in path_score_dict.items():
file.write(f"{key} {value}\n")
print("Done!")- To smoothly execute the training procedure, first download the IQA images and place them all in a single folder.
- Given an original MOS file (e.g., KADID-10K_mos.txt), first execute cd datasets, then run python make_data.py (with moderate modifications) to generate a JSON file for model training.
- Download Qwen/Qwen2.5-VL-7B-Instruct into a folder.
After generating the per-dataset JSONL files, you can merge them into a single file with re-ordered ids. An example can be seen in datasets/combined/RL-622-KADID-TID2013-KONIQ-LIVEC_train_scoring.jsonl.
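As a minimal sketch of such a merge (our own illustration, assuming each JSONL line is a JSON object carrying an "id" field; the actual schema in this repository may differ):

import json

def merge_jsonl(paths, out_path):
    # Concatenate per-dataset JSONL files and renumber ids sequentially.
    idx = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    record = json.loads(line)
                    record["id"] = idx  # re-order ids across datasets (assumed field)
                    idx += 1
                    out.write(json.dumps(record) + "\n")

# Hypothetical file names, for illustration only:
merge_jsonl(["kadid.jsonl", "tid2013.jsonl", "koniq.jsonl"], "combined_train_scoring.jsonl")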
Please modify three elements in src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh:
--model_name_or_path [Your Qwen2.5-VL-7B-Instruct path] \
--image_folders [Your dataset images path] \
--data_file_paths [Your JSON file path] \
Then, run:
bash src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh
For multi-node training, after making the same modifications to src/open-r1-multimodal/run_scripts/KADID-10K/multi_run_kadid.sh, run the following command:
bash src/open-r1-multimodal/run_scripts/KADID-10K/multi_run_kadid.sh
The image shown was captured using the OPPO Find X6 Pro.
- VLM-R1: Our codebase is built upon VLM-R1.
I would like to sincerely thank Zhuoyan Luo for generously supporting this project and for invaluable guidance in the field of AR generation.
If you have any questions, please email sigstianhewu@gmail.com or tianhewu-c@my.cityu.edu.hk.
@article{wu2025visualquality,
title={{VisualQuality-R1}: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank},
author={Wu, Tianhe and Zou, Jian and Liang, Jie and Zhang, Lei and Ma, Kede},
journal={arXiv preprint arXiv:2505.14460},
year={2025}
}