Zixin Yin1,2, Xili Dai3, Ling-Hao Chen2,4, Deyu Zhou3,6, Jianan Wang5, Duomin Wang6, Gang Yu6, Lionel Ni1,3, Lei Zhang2, Heung-Yeung Shum1
1HKUST, 2IDEA Research, 3HKUST(GZ), 4Tsinghua University, 5Astribot, 6StepFun
✨ICLR 2026✨
| Source Video | Edited Video |
|---|---|

    pip install -r requirements.txt

Download the required diffusion models:

- FLUX.1-dev: `/path/to/FLUX.1-dev`
- Stable Diffusion 3: `/path/to/stable-diffusion-3-medium-diffusers`
- CogVideoX-2b: `/path/to/CogVideoX-2b`

Update the model paths in the scripts accordingly.
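
Before launching any script, it can help to confirm that every placeholder path has been replaced with a real directory. The following is a minimal sketch of such a check. `MODEL_PATHS` and `missing_models` are hypothetical names that are not part of this repository, and the paths are the placeholders from the list above.

```python
from pathlib import Path

# Placeholder paths copied from the list above; replace them with real locations.
MODEL_PATHS = {
    "FLUX.1-dev": "/path/to/FLUX.1-dev",
    "Stable Diffusion 3": "/path/to/stable-diffusion-3-medium-diffusers",
    "CogVideoX-2b": "/path/to/CogVideoX-2b",
}


def missing_models(paths):
    """Return the model names whose path is not an existing directory."""
    return [name for name, path in paths.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    for name in missing_models(MODEL_PATHS):
        print(f"{name}: no directory found at {MODEL_PATHS[name]}")
```
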
We provide two FLUX-based demonstration scripts in the script/ directory:

    bash script/flux_consist_edit.sh
    bash script/flux_inconsist_edit.sh

FLUX synthesis editing:

    python run_synthesis_flux.py \
      --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
      --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
      --edit_object "dress" \
      --out_dir "output" \
      --alpha 1.0 \
      --model_path "/path/to/FLUX.1-dev"

    python run_synthesis_flux.py \
      --src_prompt "a woman is standing in a town facing front, realistic style" \
      --tgt_prompt "a woman is standing in a town facing front, cartoon style" \
      --out_dir "output" \
      --alpha 0.1 \
      --no_mask \
      --model_path "/path/to/FLUX.1-dev"

SD3 synthesis editing:

    python run_synthesis_sd3.py \
      --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
      --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
      --edit_object "dress" \
      --out_dir "output" \
      --alpha 1.0 \
      --model_path "/path/to/stable-diffusion-3-medium-diffusers"

    python run_synthesis_sd3.py \
      --src_prompt "a portrait of a woman in a red dress, realistic style, best quality" \
      --tgt_prompt "a portrait of a woman in a yellow dress, cartoon style, best quality" \
      --edit_object "dress" \
      --out_dir "output" \
      --alpha 0.3 \
      --model_path "/path/to/stable-diffusion-3-medium-diffusers"

CogVideo editing:

    python run_synthesis_cog.py \
      --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
      --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
      --edit_object "dress" \
      --out_dir "output" \
      --alpha 1.0 \
      --model_path "/path/to/CogVideoX-2b"

Real-image editing follows the same FLUX-based ColorCtrl attention-map swapping pipeline, but starts from an input image and uses masked value preservation throughout inversion and denoising.

    python run_real_flux.py \
      --src_prompt "Yoshua Bengio is wearing a red shirt" \
      --tgt_prompt "Yoshua Bengio is wearing a black shirt" \
      --edit_object "shirt" \
      --source_image_path "assets/bengio.png" \
      --out_dir "output" \
      --alpha 0.35 \
      --model_path "/path/to/FLUX.1-dev"

What it does (with `--no_mask`): Disables mask-guided value preservation.

Result: Colors outside the edited region can drift more easily.

    python run_synthesis_flux.py --no_mask --alpha 0.3 ...

What it does (default, without `--no_mask`): Uses the ColorCtrl attention-map swap with mask-guided value preservation.

Technical Details:
- Mask Calculation: Computes masks from averaged attention maps
- Image Generation: Swaps only the vision-vision attention block while preserving masked values

Result: Better background preservation with a cleaner public implementation
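
The two technical details above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names, the min-max normalisation, the `threshold` of 0.5, and the use of nested lists in place of tensors are all assumptions made for illustration. It averages several attention maps into a binary edit mask, then keeps source values outside the mask and edited values inside it.

```python
def average_maps(maps):
    """Element-wise mean of equally sized 2-D maps (each a list of rows)."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)]
            for r in range(rows)]


def mask_from_attention(maps, threshold=0.5):
    """Average the maps, min-max normalise, and threshold into a boolean mask.

    The normalisation and the threshold value are assumptions of this sketch.
    """
    avg = average_maps(maps)
    flat = [v for row in avg for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant maps
    return [[(v - lo) / span >= threshold for v in row] for row in avg]


def preserve_outside_mask(src, tgt, mask):
    """Take edited values inside the mask and source values everywhere else."""
    return [[t if m else s for s, t, m in zip(src_row, tgt_row, mask_row)]
            for src_row, tgt_row, mask_row in zip(src, tgt, mask)]


if __name__ == "__main__":
    attention = [
        [[1.0, 0.0], [0.0, 0.0]],
        [[0.5, 0.0], [0.0, 0.5]],
    ]
    mask = mask_from_attention(attention)
    source = [["s00", "s01"], ["s10", "s11"]]
    edited = [["t00", "t01"], ["t10", "t11"]]
    print(mask)
    print(preserve_outside_mask(source, edited, mask))
```

With `--no_mask`, the blending step would simply be skipped, which is consistent with the note that colors outside the edited region can then drift.
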

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--src_prompt` | str | Required | Source image prompt: Text description used to generate the source image. This defines the initial state before editing. |
| `--tgt_prompt` | str | Required | Target image prompt: Text description for the edited result. |
| `--edit_object` | str | Required | Edit object word: Single word or phrase that appears in `src_prompt` and specifies what object to edit. Used for mask generation. |
| `--out_dir` | str | `"output"` | Output directory: Directory where generated images and masks will be saved. |
| `--alpha` | float | `1.0` | Consistency strength: Controls the strength of cross-attention injection (`consistency_strength` in the paper). Range: 0.0-1.0. |
| `--model_path` | str | Required | Model path: Local path to the diffusion model directory. |
| `--no_mask` | flag | `False` | Disable masking: When set, no mask is generated and no content fusion is applied. Use this to observe uncontrolled changes. |

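
For readers who want to wrap or script these options, the sketch below mirrors the parameter table with `argparse`. It is an illustration only: the actual parsers in the `run_synthesis_*.py` scripts may be organised differently, and `build_parser` and `parse_args` are hypothetical names. One judgment call is flagged in the comments: the table lists `--edit_object` as required, yet the `--no_mask` example omits it, so this sketch only insists on it when masking is enabled.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options (a sketch, not the repo's code)."""
    parser = argparse.ArgumentParser(description="Documented synthesis options")
    parser.add_argument("--src_prompt", type=str, required=True,
                        help="Prompt used to generate the source image")
    parser.add_argument("--tgt_prompt", type=str, required=True,
                        help="Prompt describing the edited result")
    # Assumption: only enforced when masking is enabled (see parse_args).
    parser.add_argument("--edit_object", type=str, default=None,
                        help="Word in src_prompt used for mask generation")
    parser.add_argument("--out_dir", type=str, default="output",
                        help="Directory for generated images and masks")
    parser.add_argument("--alpha", type=float, default=1.0,
                        help="Consistency strength, documented range 0.0-1.0")
    parser.add_argument("--model_path", type=str, required=True,
                        help="Local path to the diffusion model directory")
    parser.add_argument("--no_mask", action="store_true",
                        help="Disable mask generation and content fusion")
    return parser


def parse_args(argv=None):
    """Parse argv and apply the range and masking checks described above."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.no_mask and args.edit_object is None:
        parser.error("--edit_object is required unless --no_mask is set")
    if not 0.0 <= args.alpha <= 1.0:
        parser.error("--alpha must lie in the documented range 0.0-1.0")
    return args
```
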

Parameters for `run_real_flux.py`:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--src_prompt` | str | `"Yoshua Bengio is wearing a red shirt"` | Source prompt paired with the input image during inversion. |
| `--tgt_prompt` | str | `"Yoshua Bengio is wearing a black shirt"` | Edited prompt used for reconstruction. |
| `--edit_object` | str | `"shirt"` | Word used to derive the edit mask. |
| `--source_image_path` | str | `"assets/bengio.png"` | Input real image path. |
| `--out_dir` | str | `"output"` | Directory for reconstructed source image, edited result, mask, and latent. |
| `--alpha` | float | `0.35` | Real-image consistency strength used by the preserve variant. |
| `--model_path` | str | `"/path/to/FLUX.1-dev"` | Local FLUX model path. |

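
Since `--alpha` is the main knob for real-image editing, comparing a few values side by side can be useful. The sketch below builds one `run_real_flux.py` command per value using only the flags documented above. The helper names, the chosen `alpha` values, and the `output/alpha_<value>` directory layout are assumptions made for illustration. By default it only prints the commands.

```python
import shlex
import subprocess
import sys


def real_edit_command(alpha, out_dir, model_path="/path/to/FLUX.1-dev"):
    """Build the argument list for one run_real_flux.py call."""
    return [
        sys.executable, "run_real_flux.py",
        "--src_prompt", "Yoshua Bengio is wearing a red shirt",
        "--tgt_prompt", "Yoshua Bengio is wearing a black shirt",
        "--edit_object", "shirt",
        "--source_image_path", "assets/bengio.png",
        "--out_dir", out_dir,
        "--alpha", str(alpha),
        "--model_path", model_path,
    ]


def sweep_alpha(alphas, dry_run=True):
    """Print (dry_run=True) or execute one command per alpha value."""
    commands = []
    for alpha in alphas:
        # Hypothetical layout: one output directory per alpha value.
        cmd = real_edit_command(alpha, f"output/alpha_{alpha:.2f}")
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `sweep_alpha([0.2, 0.35, 0.5])` prints three commands for inspection; passing `dry_run=False` runs them one after another and stops at the first failure.
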

- The ColorCtrl-Bench annotation mapping is bundled in this repo at `evaluation/colorctrl_bench_mapping.json`.
- When you use `--benchmark colorctrl-bench`, both `run_metric.py` and `evaluate.py` load that local JSON automatically.

To generate results for PIE-Bench or ColorCtrl-Bench:

    python run_metric.py \
      --model_path "/path/to/FLUX.1-dev" \
      --data_path "/path/to/benchmark-root" \
      --benchmark piebench

Switch to ColorCtrl-Bench with:

    python run_metric.py \
      --model_path "/path/to/FLUX.1-dev" \
      --data_path "/path/to/colorctrl-bench-root" \
      --benchmark colorctrl-bench

To compute evaluation metrics for either benchmark:

    python evaluate.py --benchmark piebench

Project structure:

    ColorCtrl/
    ├── run_synthesis_flux.py              # FLUX synthesis editing
    ├── run_synthesis_sd3.py               # SD3 synthesis editing
    ├── run_synthesis_cog.py               # CogVideo editing
    ├── run_real_flux.py                   # Real image editing
    ├── run_metric.py                      # Benchmark generation script
    ├── evaluate.py                        # Metric calculation script
    ├── script/
    │   ├── flux_consist_edit.sh           # Consistent editing demo
    │   └── flux_inconsist_edit.sh         # Inconsistent editing demo
    ├── colorctrl/
    │   ├── attention_control.py           # Cross-attention mechanisms
    │   ├── solver.py                      # Diffusion solvers
    │   ├── utils.py                       # Utility functions
    │   └── global_var.py                  # Global variables
    ├── evaluation/
    │   ├── colorctrl_bench_mapping.json   # ColorCtrl-Bench annotations
    │   └── matric_calculator.py           # Evaluation metrics
    └── assets/                            # Sample images

This codebase is built upon and inspired by several excellent open-source projects:
- MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- PnPInversion: Plug-and-Play diffusion features for text-driven image-to-image translation
- UniEdit-Flow: UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
- DiTCtrl: DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
- ConsistEdit: ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
We thank the authors of these works for their valuable contributions to the diffusion model editing community.
If you find this work useful, please cite our paper:

    @inproceedings{yin2026training,
      title={Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer},
      author={Yin, Zixin and Dai, Xili and Chen, Ling-Hao and Zhou, Deyu and Wang, Jianan and Wang, Duomin and Yu, Gang and Ni, Lionel M and Zhang, Lei and Shum, Heung-Yeung},
      booktitle={The Fourteenth International Conference on Learning Representations},
      year={2026}
    }