Dong-Guw Lee1*
Tai Hyoung Rhee1*
Hyunsoo Jang1
Young-Sik Shin2
Ukcheol Shin3
Ayoung Kim1†
1Seoul National University
2Kyungpook National University
3KENTECH
* Equal Contribution
† Corresponding Author
CVPR 2026
- 2026-04-03: TherA github repo opening
- 2026-05-22: TherA inference code and R2T2 dataset release.
TherA is a controllable RGB-to-thermal infrared translation framework. Given an RGB image, TherA synthesizes a long-wave thermal infrared image using a latent-diffusion translator conditioned on thermal-aware visual-language features.
TherA is designed for:
- RGB → TIR translation for thermal perception research.
- Thermal-aware VLM conditioning using LLaVA hidden-state features.
- Scene- and object-level controllability across weather, time of day, and object state.
- Reference-cache inference, allowing deployment without loading LLaVA at runtime.
TherA does not condition directly on raw text during diffusion inference. Instead, it uses a 4096-dimensional LLaVA hidden state, either:
- loaded from a precomputed
.ptreference cache, or - extracted on the fly using LLaVA.
For resource limited environments, we recommend reference-cache mode. This mode uses precomputed LLaVA features such as SUNNY.pt, CLOUDY.pt, RAINY.pt, or NIGHT.pt, and therefore does not require loading LLaVA weights at runtime. An alternative would be to compute pre-computed LLaVA feature first followed by inferencing with reference-cache mode (upcoming feature).
TherA/
├── infer_custom.py # Batch RGB → TIR inference on a folder
├── infer_example_guided.py # Single-image / example-guided inference
├── infer_palette.sh # Run multiple weather/style palettes
├── lavi_ip2p/ # UNet 8-channel + adapter wrapper
├── LaVi-Bridge/modules/ # TextAdapter architecture
├── llava/ # LLaVA code, only needed for on-the-fly mode
├── thera_paths.py # Default local weight paths
├── thera_llava.py # Lazy LLaVA loader
└── weights/ # Download weights here; not tracked by git
├── model.pt # TherA Model
├── merged_models/ # Initialization model
│ ├── unet/
│ └── adapter/
├── stable-diffusion/
│ ├── vae/
│ └── scheduler/
├── reference_caches/
│ ├── SUNNY.pt
│ ├── CLOUDY.pt
│ ├── RAINY.pt
│ └── NIGHT.pt
├── reference_caches/
│ │ ├── SUNNY.pt
│ │ ├── CLOUDY.pt
│ │ ├── RAINY.pt
│ │ └── NIGHT.pt
└── TherA-VLM/ # Optional; only for on-the-fly mode
├── adaptor_config.json/
└── adapter_model.safetensors
└── config.json
└── non_lora_trainables.bin
└── trainer_state.json
git clone https://github.com/donkeymouse/TherA.git
cd TherA
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txtRecommended environment
- Python 3.10+
- CUDA-capable GPU
- 16 GB+ VRAM recommended for comfortable inference
A prebuilt Docker image is available at:
docker pull donkeymouse/thera:latestExample interactive run:
docker run --gpus all --rm -it \
-v "$(pwd)":/workspace/TherA \
-w /workspace/TherA \
donkeymouse/thera:latest \
bashThen run inference commands from inside the container.
TherA weights are hosted on Hugging Face:
pip install -U huggingface_hub
huggingface-cli download donkeymouse/TherA \
--local-dir weightsAfter downloading, your weights/ directory should contain:
| Path | Description | Required? |
|---|---|---|
weights/model.pt |
TherA trained UNet and adapter checkpoint | Yes |
weights/merged_models/unet/ |
UNet architecture/config files | Yes |
weights/merged_models/adapter/ |
TextAdapter architecture/config files | Yes |
weights/stable-diffusion/vae/ |
Stable Diffusion VAE | Yes |
weights/stable-diffusion/scheduler/ |
DDIM scheduler config | Yes |
weights/reference_caches/*.pt |
Precomputed LLaVA hidden states for inference palettes | Recommended |
weights/TherA-VLM/ |
LLaVA weights for on-the-fly feature extraction | Optional |
LLaVA is only required for on-the-fly feature extraction or two-image guided mode. It is not required for reference-cache inference.
huggingface-cli download llava-hf/llava-1.5-7b-hf \
--local-dir weights/llava-1.5-7b-hfUse this mode if you want to extract hidden states from TherA directly at runtime from an RGB image and prompt.
python infer_custom.py \
--rgb-dir examples/rgb \
--output-dir preds \
--llava-base-path weights/llava-1.5-7b-hf \
--llava-lora-path weights/TherA-VLM \
--llava-prompt "How would this RGB scene appear in long-wave thermal infrared spectrum."This mode is more expensive because it loads LLaVA during inference.
This mode extracts LLaVA features from a reference RGB image and applies them to a target RGB image.
python infer_example_guided.py \
--mode two-image \
--reference-image examples/ref/rgb.jpg \
--input-image examples/rgb/scene.jpg \
--output preds/scene_tir.png \
--llava-base-path weights/llava-1.5-7b-hf \
--llava-lora-path weights/TherA-VLMpython infer_custom.py \
--rgb-dir /path/to/dataset/RGB \
--output-dir preds \
--reference-cache weights/reference_caches/SUNNY.pt \
--recursiveWhen --recursive is used, the output folder preserves the input directory structure.
Reference-cache mode is the recommended if you are lacking GPU memory. It does not load LLaVA at runtime.
python infer_custom.py \
--rgb-dir examples/rgb \
--output-dir preds/sunny \
--reference-cache weights/reference_caches/SUNNY.ptThe script reads all images in examples/rgb and writes translated TIR images to preds/sunny.
A lighter version of the text-guided image translation module.
Example palette caches:
weights/reference_caches/SUNNY.pt
weights/reference_caches/CLOUDY.pt
weights/reference_caches/RAINY.pt
weights/reference_caches/NIGHT.pt
You can use different pallete cache to achieve different translation effects.
| Mode | Main flag / script | LLaVA weights needed? | Recommended use |
|---|---|---|---|
| Reference cache | --reference-cache path.pt |
No | Default deployment and fast inference |
| Per-image cache directory | --cache-dir dir/ |
No | Precomputed feature per image |
| Full RGB-TIR translation | --llava-base-path ... |
Yes | Runtime prompt/image conditioning |
| Reference image-guided translation | infer_example_guided.py --mode two-image |
Yes | Apply reference-image conditioning |
Reference caches are precomputed LLaVA hidden states saved as .pt files.
Supported tensor shapes:
[1, L, 4096]
[L, 4096]
A single reference cache can be applied to all input images as a global thermal/weather/style condition.
| Argument | Default | Description |
|---|---|---|
--checkpoint |
weights/checkpoint |
Directory containing model.pt |
--merged-model-path |
weights/merged_models |
Directory containing UNet and adapter configs |
--pretrained-sd |
weights/stable-diffusion |
Directory containing VAE and scheduler |
--rgb-dir |
Required | Folder of RGB images for batch inference |
--output-dir |
custom_predictions |
Output folder for predictions |
--reference-cache |
None |
Single .pt cache used for all images |
--cache-dir |
None |
Folder of per-image .pt caches matched by filename stem |
--llava-base-path |
None |
Base LLaVA model path for on-the-fly mode |
--llava-lora-path |
None |
Optional LLaVA LoRA path |
--llava-prompt |
thermal prompt | Prompt used for default inference/text-guided translation |
--num-steps |
100 |
DDIM sampling steps |
--cfg-text |
3.5 |
Text/VLM guidance strength |
--cfg-image |
1.5 |
Image guidance strength |
--target-size |
Auto | Resize image to this square size; otherwise dimensions are rounded to multiples of 32 |
--recursive |
Off | Recursively process subdirectories |
--device |
cuda |
Device for inference |
RGB image
│
▼
VAE encoder ──► RGB latents ───────────────────┐
│
├──► 8-channel diffusion UNet ──► VAE decoder ──► TIR image
│
LLaVA hidden state, 4096-d ──► TextAdapter ─────┘
768-d cross-attention tokens
TherA uses dual classifier-free guidance at inference by combining:
- full conditioning,
- image-only conditioning,
- text/VLM-only conditioning.
TherA is trained with R2T2, a large-scale RGB–TIR–Text dataset.
R2T2 includes:
- 112,970 aligned triplets: RGB image, TIR image, and canonical thermal schema.
- Scene diversity across driving, CCTV, aerial, and ego-view settings.
- Temporal diversity across day/night and diurnal transitions.
- Environmental diversity across weather, season, and illumination.
- Material- and object-level annotations with structured canonicalization.
- Data compiled from multiple aligned RGB–TIR datasets with additional pseudo-aligned pairs.
Dataset page:
https://huggingface.co/datasets/donkeymouse/TherA-R2T2
Example structure:
R2T2/
├── ${DATASET_NAME}/
│ └── ${SEQUENCE_NAME}/
│ ├── RGB/
│ │ ├── 1.jpg
│ │ └── ...
│ └── TIR/
│ ├── 1.jpg
│ └── ...
├── ViVID/
│ ├── img_campus_day1/
│ │ ├── RGB/
│ │ │ ├── 000001.png
│ │ │ └── ...
│ │ └── TIR/
│ │ ├── 000001.png
│ │ └── ...
│ └── ...
└── ...
Download the TherA weights and make sure model.pt is located at:
weights/model.pt
Make sure the Stable Diffusion VAE and scheduler folders are present:
weights/stable-diffusion/vae/
weights/stable-diffusion/scheduler/
Try increasing text/VLM guidance:
--cfg-text 7.5Also verify that your reference cache files are distinct:
SUNNY.pt
CLOUDY.pt
RAINY.pt
NIGHT.pt
Try reducing the image size:
--target-size 512You can also run one image at a time with:
python infer_example_guided.pyUse reference-cache mode if you do not need runtime LLaVA extraction:
--reference-cache weights/reference_caches/SUNNY.ptFor on-the-fly mode, make sure the LLaVA base model and TherA weights are correctly loaded`.
- inference code and R2T2 dataset
- [] Upload cache extraction code
- [] Improve text-guidance
If you find TherA useful for your research, please cite:
@inproceedings{lee2026thera,
title = {TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation},
author = {Lee, Dong-Guw and Rhee, Tai Hyoung and Jang, Hyunsoo and Shin, Young-Sik and Shin, Ukcheol and Kim, Ayoung},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026}
}You may also cite the arXiv version:
@article{lee2026thera_arxiv,
title = {TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation},
author = {Lee, Dong-Guw and Rhee, Tai Hyoung and Jang, Hyunsoo and Shin, Young-Sik and Shin, Ukcheol and Kim, Ayoung},
journal = {arXiv preprint arXiv:2602.19430},
year = {2026}
}TherA builds on open-source components from the vision-language and diffusion communities, including LLaVA, Stable Diffusion, Diffusers, and LaVi-Bridge-style adapter architectures.
See LICENSE for details.
Third-party models, datasets, and libraries retain their own licenses. Please review the licenses for LLaVA, Stable Diffusion, Hugging Face model files, and any external datasets before use.
If you have any questions, contact here please
donkeymouse@snu.ac.kr