TherA: Thermal-Aware Visual-Language Prompting for
Controllable RGB-to-Thermal Infrared Translation

Dong-Guw Lee^1* Tai Hyoung Rhee^1* Hyunsoo Jang¹
Young-Sik Shin² Ukcheol Shin³ Ayoung Kim^1†

¹Seoul National University ²Kyungpook National University ³KENTECH
^* Equal Contribution ^† Corresponding Author

CVPR 2026

News

2026-04-03: TherA github repo opening
2026-05-22: TherA inference code and R2T2 dataset release.

Overview

TherA is a controllable RGB-to-thermal infrared translation framework. Given an RGB image, TherA synthesizes a long-wave thermal infrared image using a latent-diffusion translator conditioned on thermal-aware visual-language features.

TherA is designed for:

RGB → TIR translation for thermal perception research.
Thermal-aware VLM conditioning using LLaVA hidden-state features.
Scene- and object-level controllability across weather, time of day, and object state.
Reference-cache inference, allowing deployment without loading LLaVA at runtime.

Key Idea

TherA does not condition directly on raw text during diffusion inference. Instead, it uses a 4096-dimensional LLaVA hidden state, either:

loaded from a precomputed .pt reference cache, or
extracted on the fly using LLaVA.

For resource limited environments, we recommend reference-cache mode. This mode uses precomputed LLaVA features such as SUNNY.pt, CLOUDY.pt, RAINY.pt, or NIGHT.pt, and therefore does not require loading LLaVA weights at runtime. An alternative would be to compute pre-computed LLaVA feature first followed by inferencing with reference-cache mode (upcoming feature).

Repository Layout

TherA/
├── infer_custom.py             # Batch RGB → TIR inference on a folder
├── infer_example_guided.py     # Single-image / example-guided inference
├── infer_palette.sh        # Run multiple weather/style palettes
├── lavi_ip2p/                  # UNet 8-channel + adapter wrapper
├── LaVi-Bridge/modules/        # TextAdapter architecture
├── llava/                      # LLaVA code, only needed for on-the-fly mode
├── thera_paths.py              # Default local weight paths
├── thera_llava.py              # Lazy LLaVA loader
└── weights/                    # Download weights here; not tracked by git
    ├── model.pt                # TherA Model
    ├── merged_models/          # Initialization model
    │   ├── unet/
    │   └── adapter/
    ├── stable-diffusion/       
    │   ├── vae/
    │   └── scheduler/
    ├── reference_caches/
    │   ├── SUNNY.pt
    │   ├── CLOUDY.pt
    │   ├── RAINY.pt
    │   └── NIGHT.pt
    ├── reference_caches/
    │   │   ├── SUNNY.pt
    │   │   ├── CLOUDY.pt
    │   │   ├── RAINY.pt
    │   │   └── NIGHT.pt
    └── TherA-VLM/                  # Optional; only for on-the-fly mode
        ├── adaptor_config.json/
        └── adapter_model.safetensors
        └── config.json
        └── non_lora_trainables.bin
        └── trainer_state.json

Installation

Option 1: Local Python Environment

git clone https://github.com/donkeymouse/TherA.git
cd TherA

python -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

Recommended environment

Python 3.10+
CUDA-capable GPU
16 GB+ VRAM recommended for comfortable inference

Option 2: Docker

A prebuilt Docker image is available at:

docker pull donkeymouse/thera:latest

Example interactive run:

docker run --gpus all --rm -it \
  -v "$(pwd)":/workspace/TherA \
  -w /workspace/TherA \
  donkeymouse/thera:latest \
  bash

Then run inference commands from inside the container.

Download Weights

TherA weights are hosted on Hugging Face:

pip install -U huggingface_hub

huggingface-cli download donkeymouse/TherA \
  --local-dir weights

After downloading, your weights/ directory should contain:

Path	Description	Required?
`weights/model.pt`	TherA trained UNet and adapter checkpoint	Yes
`weights/merged_models/unet/`	UNet architecture/config files	Yes
`weights/merged_models/adapter/`	TextAdapter architecture/config files	Yes
`weights/stable-diffusion/vae/`	Stable Diffusion VAE	Yes
`weights/stable-diffusion/scheduler/`	DDIM scheduler config	Yes
`weights/reference_caches/*.pt`	Precomputed LLaVA hidden states for inference palettes	Recommended
`weights/TherA-VLM/`	LLaVA weights for on-the-fly feature extraction	Optional

Optional: Download LLaVA Weights

LLaVA is only required for on-the-fly feature extraction or two-image guided mode. It is not required for reference-cache inference.

huggingface-cli download llava-hf/llava-1.5-7b-hf \
  --local-dir weights/llava-1.5-7b-hf

Quick Start

Full RGB-TIR translation using TherA-VLM

Use this mode if you want to extract hidden states from TherA directly at runtime from an RGB image and prompt.

python infer_custom.py \
  --rgb-dir examples/rgb \
  --output-dir preds \
  --llava-base-path weights/llava-1.5-7b-hf \
  --llava-lora-path weights/TherA-VLM \
  --llava-prompt "How would this RGB scene appear in long-wave thermal infrared spectrum."

This mode is more expensive because it loads LLaVA during inference.

Reference-Guided Image Translation Mode

This mode extracts LLaVA features from a reference RGB image and applies them to a target RGB image.

python infer_example_guided.py \
  --mode two-image \
  --reference-image examples/ref/rgb.jpg \
  --input-image examples/rgb/scene.jpg \
  --output preds/scene_tir.png \
  --llava-base-path weights/llava-1.5-7b-hf \
  --llava-lora-path weights/TherA-VLM

Recursive Folder Inference

python infer_custom.py \
  --rgb-dir /path/to/dataset/RGB \
  --output-dir preds \
  --reference-cache weights/reference_caches/SUNNY.pt \
  --recursive

When --recursive is used, the output folder preserves the input directory structure.

Reference-cache Mode

Reference-cache mode is the recommended if you are lacking GPU memory. It does not load LLaVA at runtime.

python infer_custom.py \
  --rgb-dir examples/rgb \
  --output-dir preds/sunny \
  --reference-cache weights/reference_caches/SUNNY.pt

The script reads all images in examples/rgb and writes translated TIR images to preds/sunny.

A lighter version of the text-guided image translation module.

Example palette caches:

weights/reference_caches/SUNNY.pt
weights/reference_caches/CLOUDY.pt
weights/reference_caches/RAINY.pt
weights/reference_caches/NIGHT.pt

You can use different pallete cache to achieve different translation effects.

Inference Modes

Mode	Main flag / script	LLaVA weights needed?	Recommended use
Reference cache	`--reference-cache path.pt`	No	Default deployment and fast inference
Per-image cache directory	`--cache-dir dir/`	No	Precomputed feature per image
Full RGB-TIR translation	`--llava-base-path ...`	Yes	Runtime prompt/image conditioning
Reference image-guided translation	`infer_example_guided.py --mode two-image`	Yes	Apply reference-image conditioning

Reference Cache Format

Reference caches are precomputed LLaVA hidden states saved as .pt files.

Supported tensor shapes:

[1, L, 4096]
[L, 4096]

A single reference cache can be applied to all input images as a global thermal/weather/style condition.

Important Arguments

Argument	Default	Description
`--checkpoint`	`weights/checkpoint`	Directory containing `model.pt`
`--merged-model-path`	`weights/merged_models`	Directory containing UNet and adapter configs
`--pretrained-sd`	`weights/stable-diffusion`	Directory containing VAE and scheduler
`--rgb-dir`	Required	Folder of RGB images for batch inference
`--output-dir`	`custom_predictions`	Output folder for predictions
`--reference-cache`	`None`	Single `.pt` cache used for all images
`--cache-dir`	`None`	Folder of per-image `.pt` caches matched by filename stem
`--llava-base-path`	`None`	Base LLaVA model path for on-the-fly mode
`--llava-lora-path`	`None`	Optional LLaVA LoRA path
`--llava-prompt`	thermal prompt	Prompt used for default inference/text-guided translation
`--num-steps`	`100`	DDIM sampling steps
`--cfg-text`	`3.5`	Text/VLM guidance strength
`--cfg-image`	`1.5`	Image guidance strength
`--target-size`	Auto	Resize image to this square size; otherwise dimensions are rounded to multiples of 32
`--recursive`	Off	Recursively process subdirectories
`--device`	`cuda`	Device for inference

Architecture

RGB image
   │
   ▼
VAE encoder ──► RGB latents ───────────────────┐
                                                │
                                                ├──► 8-channel diffusion UNet ──► VAE decoder ──► TIR image
                                                │
LLaVA hidden state, 4096-d ──► TextAdapter ─────┘
                              768-d cross-attention tokens

TherA uses dual classifier-free guidance at inference by combining:

full conditioning,
image-only conditioning,
text/VLM-only conditioning.

R2T2 Dataset

TherA is trained with R2T2, a large-scale RGB–TIR–Text dataset.

R2T2 includes:

112,970 aligned triplets: RGB image, TIR image, and canonical thermal schema.
Scene diversity across driving, CCTV, aerial, and ego-view settings.
Temporal diversity across day/night and diurnal transitions.
Environmental diversity across weather, season, and illumination.
Material- and object-level annotations with structured canonicalization.
Data compiled from multiple aligned RGB–TIR datasets with additional pseudo-aligned pairs.

Dataset page:

https://huggingface.co/datasets/donkeymouse/TherA-R2T2

Example structure:

R2T2/
├── ${DATASET_NAME}/
│   └── ${SEQUENCE_NAME}/
│       ├── RGB/
│       │   ├── 1.jpg
│       │   └── ...
│       └── TIR/
│           ├── 1.jpg
│           └── ...
├── ViVID/
│   ├── img_campus_day1/
│   │   ├── RGB/
│   │   │   ├── 000001.png
│   │   │   └── ...
│   │   └── TIR/
│   │       ├── 000001.png
│   │       └── ...
│   └── ...
└── ...

Troubleshooting

`Checkpoint not found: weights/model.pt`

Download the TherA weights and make sure model.pt is located at:

weights/model.pt

`OSError: ... stable-diffusion/vae`

Make sure the Stable Diffusion VAE and scheduler folders are present:

weights/stable-diffusion/vae/
weights/stable-diffusion/scheduler/

Outputs look identical across palettes

Try increasing text/VLM guidance:

--cfg-text 7.5

Also verify that your reference cache files are distinct:

SUNNY.pt
CLOUDY.pt
RAINY.pt
NIGHT.pt

CUDA out of memory

Try reducing the image size:

--target-size 512

You can also run one image at a time with:

python infer_example_guided.py

LLaVA import or loading errors

Use reference-cache mode if you do not need runtime LLaVA extraction:

--reference-cache weights/reference_caches/SUNNY.pt

For on-the-fly mode, make sure the LLaVA base model and TherA weights are correctly loaded`.

TODOs

inference code and R2T2 dataset
[] Upload cache extraction code
[] Improve text-guidance

Citation

If you find TherA useful for your research, please cite:

@inproceedings{lee2026thera,
  title     = {TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation},
  author    = {Lee, Dong-Guw and Rhee, Tai Hyoung and Jang, Hyunsoo and Shin, Young-Sik and Shin, Ukcheol and Kim, Ayoung},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}

You may also cite the arXiv version:

@article{lee2026thera_arxiv,
  title   = {TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation},
  author  = {Lee, Dong-Guw and Rhee, Tai Hyoung and Jang, Hyunsoo and Shin, Young-Sik and Shin, Ukcheol and Kim, Ayoung},
  journal = {arXiv preprint arXiv:2602.19430},
  year    = {2026}
}

Acknowledgements

TherA builds on open-source components from the vision-language and diffusion communities, including LLaVA, Stable Diffusion, Diffusers, and LaVi-Bridge-style adapter architectures.

License

See LICENSE for details.

Third-party models, datasets, and libraries retain their own licenses. Please review the licenses for LLaVA, Stable Diffusion, Hugging Face model files, and any external datasets before use.

Contact

If you have any questions, contact here please

donkeymouse@snu.ac.kr

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
LaVi-Bridge/modules		LaVi-Bridge/modules
__pycache__		__pycache__
assets		assets
lavi_ip2p		lavi_ip2p
llava		llava
LICENSE		LICENSE
README.md		README.md
infer_custom.py		infer_custom.py
infer_example_guided.py		infer_example_guided.py
infer_palette.sh		infer_palette.sh
requirements.txt		requirements.txt
thera_llava.py		thera_llava.py
thera_paths.py		thera_paths.py

Folders and files

Latest commit

History

Repository files navigation

TherA: Thermal-Aware Visual-Language Prompting forControllable RGB-to-Thermal Infrared Translation

News

Overview

Key Idea

Repository Layout

Installation

Option 1: Local Python Environment

Option 2: Docker

Download Weights

Optional: Download LLaVA Weights

Quick Start

Full RGB-TIR translation using TherA-VLM

Reference-Guided Image Translation Mode

Recursive Folder Inference

Reference-cache Mode

Inference Modes

Reference Cache Format

Important Arguments

Architecture

R2T2 Dataset

Troubleshooting

Checkpoint not found: weights/model.pt

OSError: ... stable-diffusion/vae

Outputs look identical across palettes

CUDA out of memory

LLaVA import or loading errors

TODOs

Citation

Acknowledgements

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

TherA: Thermal-Aware Visual-Language Prompting for
Controllable RGB-to-Thermal Infrared Translation

`Checkpoint not found: weights/model.pt`

`OSError: ... stable-diffusion/vae`

Packages