zxYin/ColorCtrl_Code

ColorCtrl: Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Zixin Yin1,2, Xili Dai3, Ling-Hao Chen2,4, Deyu Zhou3,6, Jianan Wang5, Duomin Wang6, Gang Yu6, Lionel Ni1,3, Lei Zhang2, Heung-Yeung Shum1

1HKUST, 2IDEA Research, 3HKUST(GZ), 4Tsinghua University, 5Astribot, 6StepFun

✨ICLR 2026✨

🎯 Demo

ColorCtrl Multi-Step Editing Demo

[Demo media omitted: Source Video / Edited Video; Source / Target]

🔧 Setup

Requirements

pip install -r requirements.txt

Model Preparation

Download the required diffusion models:

  • FLUX.1-dev: /path/to/FLUX.1-dev
  • Stable Diffusion 3: /path/to/stable-diffusion-3-medium-diffusers
  • CogVideoX-2b: /path/to/CogVideoX-2b

Update the model paths in the scripts accordingly.
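Before launching a run, it can help to confirm that each placeholder path has been replaced with a real directory. The sketch below is not part of the repository: the `MODEL_DIRS` mapping and the `missing_models` helper are hypothetical names, and the check for `model_index.json` assumes the checkpoints are stored in the standard diffusers directory layout.

```python
from pathlib import Path

# Hypothetical mapping; replace the placeholders with your local paths.
MODEL_DIRS = {
    "FLUX.1-dev": "/path/to/FLUX.1-dev",
    "Stable Diffusion 3": "/path/to/stable-diffusion-3-medium-diffusers",
    "CogVideoX-2b": "/path/to/CogVideoX-2b",
}


def missing_models(model_dirs):
    """Return the names whose directory lacks a model_index.json file.

    Assumption: checkpoints use the diffusers layout, which keeps a
    model_index.json file at the top level of the model directory.
    """
    missing = []
    for name, directory in model_dirs.items():
        if not (Path(directory) / "model_index.json").is_file():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_models(MODEL_DIRS):
        print(f"Model not found: {name} -> {MODEL_DIRS[name]}")
```

Any name printed by this check points at a path that still needs updating in the scripts.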

🚀 Quick Start

Using Scripts

We provide two FLUX-based demonstration scripts in the script/ directory:

1. Consistent Editing (Change Color/Material)

bash script/flux_consist_edit.sh

2. Inconsistent Editing (Change Style/Object)

bash script/flux_inconsist_edit.sh

Manual Usage

FLUX

python run_synthesis_flux.py \
    --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 1.0 \
    --model_path "/path/to/FLUX.1-dev"
python run_synthesis_flux.py \
    --src_prompt "a woman is standing in a town facing front, realistic style" \
    --tgt_prompt "a woman is standing in a town facing front, cartoon style" \
    --out_dir "output" \
    --alpha 0.1 \
    --no_mask \
    --model_path "/path/to/FLUX.1-dev"
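To compare several `--alpha` values on the same prompt pair, the commands above can be assembled programmatically. This is a minimal sketch, not a script shipped with the repository: `build_flux_command` is a hypothetical helper, it uses only the flags shown above, and it assumes `--edit_object` is omitted whenever `--no_mask` is set, as in the second example.

```python
import shlex


def build_flux_command(src_prompt, tgt_prompt, model_path, alpha,
                       edit_object=None, out_dir="output", no_mask=False):
    """Assemble an argv list for run_synthesis_flux.py from the documented flags."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f"alpha must lie in [0.0, 1.0], got {alpha}")
    if not no_mask and edit_object is None:
        raise ValueError("edit_object is needed unless no_mask is set")
    cmd = [
        "python", "run_synthesis_flux.py",
        "--src_prompt", src_prompt,
        "--tgt_prompt", tgt_prompt,
        "--out_dir", out_dir,
        "--alpha", str(alpha),
        "--model_path", model_path,
    ]
    if no_mask:
        cmd.append("--no_mask")
    else:
        cmd += ["--edit_object", edit_object]
    return cmd


if __name__ == "__main__":
    # Print one command per alpha value; each run writes to its own directory.
    for alpha in (0.1, 0.3, 1.0):
        cmd = build_flux_command(
            "a portrait of a woman in a red dress in a forest, best quality",
            "a portrait of a woman in a yellow dress in a forest, best quality",
            "/path/to/FLUX.1-dev",
            alpha,
            edit_object="dress",
            out_dir=f"output/alpha_{alpha}",
        )
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argv list can be passed to `subprocess.run(cmd, check=True)` to launch the run directly.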

Stable Diffusion 3

python run_synthesis_sd3.py \
    --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 1.0 \
    --model_path "/path/to/stable-diffusion-3-medium-diffusers"
python run_synthesis_sd3.py \
    --src_prompt "a portrait of a woman in a red dress, realistic style, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress, cartoon style, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 0.3 \
    --model_path "/path/to/stable-diffusion-3-medium-diffusers"

CogVideo

python run_synthesis_cog.py \
    --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 1.0 \
    --model_path "/path/to/CogVideoX-2b"

Real Image Editing

Real-image editing follows the same FLUX-based ColorCtrl attention-map swapping pipeline, but starts from an input image and uses masked value preservation throughout inversion and denoising.

python run_real_flux.py \
    --src_prompt "Yoshua Bengio is wearing a red shirt" \
    --tgt_prompt "Yoshua Bengio is wearing a black shirt" \
    --edit_object "shirt" \
    --source_image_path "assets/bengio.png" \
    --out_dir "output" \
    --alpha 0.35 \
    --model_path "/path/to/FLUX.1-dev"

🎭 Masking Strategies

Available Modes

1. No Mask (--no_mask)

What it does: Disables mask-guided value preservation.

Result: Colors outside the edited region can drift more easily.

python run_synthesis_flux.py --no_mask --alpha 0.3 ...

2. ColorCtrl Mask (Default)

What it does: Uses the ColorCtrl attention-map swap with mask-guided value preservation.

Technical Details:

  • Mask Calculation: Computes masks from averaged attention maps
  • Image Generation: Swaps only the vision-vision attention block while preserving masked values

Result: ✅ Better background preservation with a cleaner public implementation
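The two technical steps above (a mask derived from averaged attention maps, then value preservation guided by that mask) can be illustrated on toy data. This is a conceptual sketch in plain Python, not the code in `colorctrl/attention_control.py`: the function names, the min-max normalisation, the 0.5 threshold, and the convention that the mask marks the edit region while source values are kept elsewhere are all assumptions made for illustration.

```python
def average_attention(maps):
    """Element-wise mean of equally sized 2D attention maps."""
    count = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / count for c in range(cols)]
            for r in range(rows)]


def to_mask(avg_map, threshold=0.5):
    """Min-max normalise the map and mark cells above the threshold as 1."""
    flat = [v for row in avg_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[1 if (v - lo) / span > threshold else 0 for v in row]
            for row in avg_map]


def preserve_outside_mask(src_values, tgt_values, mask):
    """Keep source values where the mask is 0 and target values where it is 1."""
    return [[t if m else s for s, t, m in zip(src_row, tgt_row, mask_row)]
            for src_row, tgt_row, mask_row in zip(src_values, tgt_values, mask)]


if __name__ == "__main__":
    # Two toy attention maps for the edit word; the right column responds strongly.
    maps = [
        [[0.1, 0.9], [0.2, 0.8]],
        [[0.3, 0.7], [0.0, 1.0]],
    ]
    mask = to_mask(average_attention(maps))
    src = [["src", "src"], ["src", "src"]]
    tgt = [["tgt", "tgt"], ["tgt", "tgt"]]
    print(mask)
    print(preserve_outside_mask(src, tgt, mask))
```

With `--no_mask`, the last step is what gets skipped, which is why colors outside the edited region can drift in that mode.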

βš™οΈ Parameters

Common Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| --src_prompt | str | Required | Source image prompt: Text description used to generate the source image. This defines the initial state before editing. |
| --tgt_prompt | str | Required | Target image prompt: Text description for the edited result. |
| --edit_object | str | Required | Edit object word: Single word or phrase that appears in src_prompt and specifies what object to edit. Used for mask generation. |
| --out_dir | str | "output" | Output directory: Directory where generated images and masks will be saved. |
| --alpha | float | 1.0 | Consistency strength: Controls the strength of cross-attention injection (consistency_strength in paper). Range: 0.0-1.0. |
| --model_path | str | Required | Model path: Local path to the diffusion model directory. |
| --no_mask | flag | False | Disable masking: When set, no mask is generated and no content fusion is applied. Use this to observe uncontrolled changes. |
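For readers who want to see how the common parameters fit together, here is a minimal `argparse` sketch that mirrors the table above. It is not the parser used by the repository scripts: `unit_interval`, `build_parser`, and `parse_args` are hypothetical names, and relaxing `--edit_object` when `--no_mask` is set is an assumption based on the `--no_mask` example in Quick Start, which omits it.

```python
import argparse


def unit_interval(text):
    """Parse a float and reject values outside the documented 0.0-1.0 range."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(
            f"expected a value in [0.0, 1.0], got {value}")
    return value


def build_parser():
    """Declare the common parameters with the types and defaults from the table."""
    parser = argparse.ArgumentParser(description="Illustrative ColorCtrl-style CLI")
    parser.add_argument("--src_prompt", type=str, required=True)
    parser.add_argument("--tgt_prompt", type=str, required=True)
    parser.add_argument("--edit_object", type=str, default=None)
    parser.add_argument("--out_dir", type=str, default="output")
    parser.add_argument("--alpha", type=unit_interval, default=1.0)
    parser.add_argument("--model_path", type=str, required=True)
    parser.add_argument("--no_mask", action="store_true")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the edit word is only needed when a mask is generated.
    if not args.no_mask and args.edit_object is None:
        parser.error("--edit_object is required unless --no_mask is set")
    return args


if __name__ == "__main__":
    demo = parse_args([
        "--src_prompt", "a woman in a red dress",
        "--tgt_prompt", "a woman in a yellow dress",
        "--edit_object", "dress",
        "--model_path", "/path/to/FLUX.1-dev",
    ])
    print(demo.alpha, demo.out_dir, demo.no_mask)
```

Validating `--alpha` at parse time means an out-of-range value fails immediately, before any model weights are loaded.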

Model-Specific Parameters

Real Image Editing (run_real_flux.py)

| Parameter | Type | Default | Description |
|---|---|---|---|
| --src_prompt | str | "Yoshua Bengio is wearing a red shirt" | Source prompt paired with the input image during inversion. |
| --tgt_prompt | str | "Yoshua Bengio is wearing a black shirt" | Edited prompt used for reconstruction. |
| --edit_object | str | "shirt" | Word used to derive the edit mask. |
| --source_image_path | str | "assets/bengio.png" | Input real image path. |
| --out_dir | str | "output" | Directory for reconstructed source image, edited result, mask, and latent. |
| --alpha | float | 0.35 | Real-image consistency strength used by the preserve variant. |
| --model_path | str | "/path/to/FLUX.1-dev" | Local FLUX model path. |

📊 Evaluation

Benchmark Generation

To generate results for PIE-Bench or ColorCtrl-Bench:

  • ColorCtrl-Bench annotation mapping is bundled in this repo at evaluation/colorctrl_bench_mapping.json.
  • When you use --benchmark colorctrl-bench, both run_metric.py and evaluate.py load that local JSON automatically.
python run_metric.py \
    --model_path "/path/to/FLUX.1-dev" \
    --data_path "/path/to/benchmark-root" \
    --benchmark piebench

Switch to ColorCtrl-Bench with:

python run_metric.py \
    --model_path "/path/to/FLUX.1-dev" \
    --data_path "/path/to/colorctrl-bench-root" \
    --benchmark colorctrl-bench

Metric Calculation

To compute evaluation metrics for either benchmark:

python evaluate.py --benchmark piebench
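Generation and metric calculation are separate steps, so a small driver can chain them per benchmark. The following is a sketch under stated assumptions rather than a repository script: `run_benchmark` and `BENCHMARKS` are hypothetical names, and only the flags documented above are passed to `run_metric.py` and `evaluate.py`.

```python
import subprocess

BENCHMARKS = ("piebench", "colorctrl-bench")


def run_benchmark(benchmark, model_path, data_path, runner=subprocess.run):
    """Generate edited results, then score them, for one benchmark.

    `runner` defaults to subprocess.run; pass a stub to inspect the
    commands without launching anything.
    """
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    generate = [
        "python", "run_metric.py",
        "--model_path", model_path,
        "--data_path", data_path,
        "--benchmark", benchmark,
    ]
    score = ["python", "evaluate.py", "--benchmark", benchmark]
    for command in (generate, score):
        # check=True stops the pipeline if generation fails before scoring.
        runner(command, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_benchmark(
        "colorctrl-bench",
        "/path/to/FLUX.1-dev",
        "/path/to/colorctrl-bench-root",
        runner=lambda command, check: print(" ".join(command)),
    )
```

Dropping the `runner` argument executes both steps for real, in order.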

πŸ“ Project Structure

ColorCtrl/
├── run_synthesis_flux.py     # FLUX synthesis editing
├── run_synthesis_sd3.py      # SD3 synthesis editing
├── run_synthesis_cog.py      # CogVideo editing
├── run_real_flux.py          # Real image editing
├── run_metric.py             # Benchmark generation script
├── evaluate.py               # Metric calculation script
├── script/
│   ├── flux_consist_edit.sh   # Consistent editing demo
│   └── flux_inconsist_edit.sh # Inconsistent editing demo
├── colorctrl/
│   ├── attention_control.py   # Cross-attention mechanisms
│   ├── solver.py              # Diffusion solvers
│   ├── utils.py               # Utility functions
│   └── global_var.py          # Global variables
├── evaluation/
│   ├── colorctrl_bench_mapping.json # ColorCtrl-Bench annotations
│   └── matric_calculator.py         # Evaluation metrics
└── assets/                   # Sample images

πŸ™ Acknowledgments

This codebase is built upon and inspired by several excellent open-source projects:

  • MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
  • PnPInversion: Plug-and-Play diffusion features for text-driven image-to-image translation
  • UniEdit-Flow: UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
  • DiTCtrl: DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
  • ConsistEdit: ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

We thank the authors of these works for their valuable contributions to the diffusion model editing community.

📖 Citation

If you find this work useful, please cite our paper:

@inproceedings{yin2026training,
  title={Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer},
  author={Yin, Zixin and Dai, Xili and Chen, Ling-Hao and Zhou, Deyu and Wang, Jianan and Wang, Duomin and Yu, Gang and Ni, Lionel M and Zhang, Lei and Shum, Heung-Yeung},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

About

[ICLR 2026] Official Implementation of "ColorCtrl: Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer"
