The first NR-IQA model enhanced by reinforcement learning to rank (RL2R), capable of both quality description and quality rating through reasoning.
Our series of works: [MLLMs for IQA] [A-FINE] [MANIQA]
- Reinforcement learning to rank (see the illustrative sketch below)
- Training with multiple datasets
- Capability of visual quality captioning and rating
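To make the learning-to-rank highlight concrete, here is a minimal, self-contained sketch of a pairwise ranking reward, assuming a Thurstone-style comparison model and a fidelity measure between predicted and ground-truth comparison probabilities. This is an illustration only; the exact training objective is specified in the paper.

import math

def comparative_probability(score_a, score_b, sigma=1.0):
    # Thurstone-style probability that image A is rated above image B,
    # given scalar quality scores and a comparison noise level sigma.
    z = (score_a - score_b) / (math.sqrt(2.0) * sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

def fidelity_reward(p_pred, p_gt):
    # Fidelity between predicted and ground-truth comparison probabilities;
    # equals 1 when the two binary distributions agree exactly.
    return math.sqrt(p_pred * p_gt) + math.sqrt((1.0 - p_pred) * (1.0 - p_gt))

# Toy usage: the model scores two images 4.1 vs 2.8, and human MOS says A is better.
p = comparative_probability(4.1, 2.8)
print(fidelity_reward(p, 1.0))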
- [18/09/25] 🔥 Our paper has been accepted as a Spotlight at NeurIPS 2025!
- [05/06/25] 🔥 We have released the training code of VisualQuality-R1.
- [25/05/25] 🤗 We have released the final version of VisualQuality-R1-7B.
- [16/05/25] 🤗 We have released a preview version, VisualQuality-R1-7B-preview, trained on three datasets: KADID-10K, TID2013, and KonIQ-10k.
We currently support training on A100 and A800 GPUs. To set up the training environment, use the following commands:
conda create -n visualquality python=3.11.10
conda activate visualquality
bash setup.sh
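After setup, a quick sanity check (our suggestion, not part of the official scripts) confirms that PyTorch sees the GPU and that FlashAttention is importable, since the inference examples below request attn_implementation="flash_attention_2":

python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"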
When you run inference with VisualQuality-R1 as a reward/evaluation model, you can use the non-thinking mode to reduce inference time: the model skips the reasoning trace and generates only a short answer containing the score, using the following prompt:
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
For single-image quality rating, use the following code:
Example Code (VisualQuality-R1: Single Image Quality Rating with non-thinking mode)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import random
import re
import os
def score_image(image_path, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': image_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
batch_messages = [message]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
reasoning = None
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
score = random.randint(1, 5)
return reasoning, score
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
reasoning, score = score_image(
image_path, model, processor
)
print(score)

Example Code (VisualQuality-R1: Batch Images Quality Rating with non-thinking mode)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm
import torch
import random
import re
import os
def get_image_paths(folder_path):
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
image_paths = []
for root, dirs, files in os.walk(folder_path):
for file in files:
_, ext = os.path.splitext(file)
if ext.lower() in image_extensions:
image_paths.append(os.path.join(root, file))
return image_paths
def score_batch_image(image_paths, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
messages = []
for img_path in image_paths:
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': img_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
messages.append(message)
BSZ = 32
all_outputs = [] # List to store all answers
for i in tqdm(range(0, len(messages), BSZ)):
batch_messages = messages[i:i + BSZ]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
all_outputs.extend(batch_output_text)
path_score_dict = {}
for img_path, model_output in zip(image_paths, all_outputs):
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
print(f"Meet error with {img_path}, please generate again.")
score = random.randint(1, 5)
path_score_dict[img_path] = score
return path_score_dict
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
image_root = ""
image_paths = get_image_paths(image_root) # It should be a list
path_score_dict = score_batch_image(
image_paths, model, processor
)
file_name = "output.txt"
with open(file_name, "w") as file:
for key, value in path_score_dict.items():
file.write(f"{key} {value}\n")
print("Done!")Example Code (VisualQuality-R1: Single Image Quality Rating with thinking)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import random
import re
import os
def score_image(image_path, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
    # Alternative: ask for a quality description instead of a score
    # QUESTION_TEMPLATE = "Please describe the quality of this image."
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': image_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
batch_messages = [message]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=2048, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
    reasoning = re.findall(r'<think>(.*?)</think>', batch_output_text[0], re.DOTALL)
    reasoning = reasoning[-1].strip() if reasoning else None  # guard against a missing <think> block
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
    except Exception:
        print(f"================= Meet error with {image_path}, please generate again. =================")
score = random.randint(1, 5)
return reasoning, score
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
image_path = ""
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
reasoning, score = score_image(
image_path, model, processor
)
print(reasoning)
print(score)

Example Code (VisualQuality-R1: Batch Images Quality Rating with thinking mode)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from tqdm import tqdm
import torch
import random
import re
import os
def get_image_paths(folder_path):
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
image_paths = []
for root, dirs, files in os.walk(folder_path):
for file in files:
_, ext = os.path.splitext(file)
if ext.lower() in image_extensions:
image_paths.append(os.path.join(root, file))
return image_paths
def score_batch_image(image_paths, model, processor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
messages = []
for img_path in image_paths:
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': img_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
messages.append(message)
BSZ = 32
all_outputs = [] # List to store all answers
for i in tqdm(range(0, len(messages), BSZ)):
batch_messages = messages[i:i + BSZ]
# Preparation for inference
text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
image_inputs, video_inputs = process_vision_info(batch_messages)
inputs = processor(
text=text,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
all_outputs.extend(batch_output_text)
path_score_dict = {}
for img_path, model_output in zip(image_paths, all_outputs):
        reasoning = re.findall(r'<think>(.*?)</think>', model_output, re.DOTALL)
        reasoning = reasoning[-1].strip() if reasoning else None  # not returned; print it here if you need the rationale
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
print(f"Meet error with {img_path}, please generate again.")
score = random.randint(1, 5)
path_score_dict[img_path] = score
return path_score_dict
random.seed(1)
MODEL_PATH = ""
device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
image_root = ""
image_paths = get_image_paths(image_root) # It should be a list
path_score_dict = score_batch_image(
image_paths, model, processor
)
file_name = "output.txt"
with open(file_name, "w") as file:
for key, value in path_score_dict.items():
file.write(f"{key} {value}\n")
print("Done!")Example Code (VisualQuality-R1: Batch Images Quality Rating with thinking, using vLLM)
# Please install vLLM first: https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html
from transformers import Qwen2_5_VLProcessor, AutoProcessor
from vllm import LLM, RequestOutput, SamplingParams
from qwen_vl_utils import process_vision_info
import torch
import random
import re
import os
IMAGE_PATH = "./images"
MODEL_PATH = "TianheWu/VisualQuality-R1-7B"
def get_image_paths(folder_path):
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
image_paths = []
for root, dirs, files in os.walk(folder_path):
for file in files:
_, ext = os.path.splitext(file)
if ext.lower() in image_extensions:
image_paths.append(os.path.join(root, file))
return image_paths
def score_batch_image(image_paths, model: LLM, processor: Qwen2_5_VLProcessor):
PROMPT = (
"You are doing the image quality assessment task. Here is the question: "
"What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
"rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)
QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
messages = []
for img_path in image_paths:
message = [
{
"role": "user",
"content": [
{'type': 'image', 'image': img_path},
{"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
],
}
]
messages.append(message)
all_outputs = [] # List to store all answers
# Preparation for inference
print("preprocessing ...")
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in messages]
image_inputs, video_inputs = process_vision_info(messages)
inputs = [{
"prompt": texts[i],
"multi_modal_data": {
"image": image_inputs[i]
},
} for i in range(len(messages))]
output: list[RequestOutput] = model.generate(
inputs,
sampling_params=SamplingParams(
max_tokens=512,
temperature=0.1,
top_k=50,
top_p=1.0,
stop_token_ids=[processor.tokenizer.eos_token_id],
),
)
batch_output_text = [o.outputs[0].text for o in output]
all_outputs.extend(batch_output_text)
path_score_dict = {}
for img_path, model_output in zip(image_paths, all_outputs):
print(f"{model_output = }")
try:
model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
        except Exception:
print(f"Meet error with {img_path}, please generate again.")
score = random.randint(1, 5)
path_score_dict[img_path] = score
return path_score_dict
random.seed(1)
model = LLM(
model=MODEL_PATH,
tensor_parallel_size=1,
trust_remote_code=True,
seed=1,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor.tokenizer.padding_side = "left"
image_paths = get_image_paths(IMAGE_PATH) # It should be a list
path_score_dict = score_batch_image(
image_paths, model, processor
)
file_name = "output.txt"
with open(file_name, "w") as file:
for key, value in path_score_dict.items():
file.write(f"{key} {value}\n")
print("Done!")- To smoothly execute the training procedure, first download the IQA images and place them all in a single folder.
- Given an original MOS file (e.g., KADID-10K_mos.txt), first execute cd datasets, then run python make_data.py (with moderate modifications) to generate a JSON file for model training.
- Download Qwen/Qwen2.5-VL-7B-Instruct into a folder.
After generating the per-dataset JSONL files, you can merge them into a single file with re-ordered ids. An example can be seen in datasets/combined/RL-622-KADID-TID2013-KONIQ-LIVEC_train_scoring.jsonl.
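As a minimal sketch of such a merge (our own illustration, assuming each JSONL line is a JSON object carrying an "id" field; the actual schema in this repository may differ):

import json

def merge_jsonl(paths, out_path):
    # Concatenate per-dataset JSONL files and renumber ids sequentially.
    idx = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    record = json.loads(line)
                    record["id"] = idx  # re-order ids across datasets (assumed field)
                    idx += 1
                    out.write(json.dumps(record) + "\n")

# Hypothetical file names, for illustration only:
merge_jsonl(["kadid.jsonl", "tid2013.jsonl", "koniq.jsonl"], "combined_train_scoring.jsonl")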
Please modify three elements in src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh:
--model_name_or_path [Your Qwen2.5-VL-7B-Instruct path] \
--image_folders [Your dataset images path] \
--data_file_paths [Your JSON file path] \
Then, run:
bash src/open-r1-multimodal/run_scripts/KADID-10K/one_node_run_kadid.sh
For multi-node training, after making the same modifications to src/open-r1-multimodal/run_scripts/KADID-10K/multi_run_kadid.sh, run the following command:
bash src/open-r1-multimodal/run_scripts/KADID-10K/multi_run_kadid.sh
The image shown was captured using the OPPO Find X6 Pro.
- VLM-R1: Our codebase is built upon VLM-R1.
I would like to sincerely thank Zhuoyan Luo for generously supporting this project and for invaluable guidance in the field of AR generation.
If you have any questions, please email sigstianhewu@gmail.com or tianhewu-c@my.cityu.edu.hk.
@article{wu2025visualquality,
title={{VisualQuality-R1}: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank},
author={Wu, Tianhe and Zou, Jian and Liang, Jie and Zhang, Lei and Ma, Kede},
journal={arXiv preprint arXiv:2505.14460},
year={2025}
}