An XML-RPC server that exposes a GPU-accelerated colorization pipeline for black-and-white images and video frames.
Built on top of the Nunchaku SVDQuant FP4/INT4 transformer and the Qwen-Image-Edit-2511 diffusion model.
Optimized for NVIDIA RTX 50-Series (Blackwell) & CUDA 12.8.
- 🎨 Batch colorization – process entire directories of B&W images via filesystem paths
- 🖼️ Paired inference – colorize two images in a single forward pass (faster, temporally consistent)
- 📡 In-memory RPC – pass raw PNG frames over XML-RPC without touching the filesystem (ideal for video pipelines)
- ⚡ 4-step lightning model – SVDQuant FP4 quantized transformer for maximum throughput
- 🔒 Thread-safe – pipeline loading and stop control are protected by locks; every RPC call runs in its own thread
- ⚙️ Startup preload – optional `--load-pipeline` flag loads the model at boot from a JSON config file
- 🚀 Shared memory transport – zero-copy image transfer for same-host deployments (~23% faster than standard RPC)
| Requirement | Details |
|---|---|
| OS | Windows 10/11 or Linux |
| Python | 3.12 |
| GPU | NVIDIA RTX 3090 / 4090 / 5070 Ti / 5090 (16 GB+ VRAM recommended) |
| CUDA | 12.8 or newer |
| CUDA Toolkit | Must match the PyTorch build (see below) |
RTX 40-Series and older: use `"model_precision": "int4"` in the pipeline config file. FP4 quantization requires Blackwell hardware; INT4 is the correct precision for Ampere (RTX 30) and Ada Lovelace (RTX 40) GPUs.
Before setting up the project environment, make sure both Git and Python 3.12 are installed on your system.
Windows: download and install Git for Windows.
Accept the default options – in particular keep `core.autocrlf=true` (the default),
which ensures correct line endings for `.cmd` files.
Linux:

```
sudo apt install git    # Debian / Ubuntu
sudo dnf install git    # Fedora / RHEL
```

Verify: `git --version`
Windows: download the installer from python.org/downloads.
During installation, check "Add Python to PATH" – without this, `python` will not be
recognized in the terminal.
Linux:

```
sudo apt install python3.12 python3.12-venv    # Debian / Ubuntu
sudo dnf install python3.12                    # Fedora / RHEL
```

Verify: `python --version` (Windows) or `python3.12 --version` (Linux)
Clone the repository with git – this ensures correct line endings for all files
(`.gitattributes` is applied automatically at checkout):

```
git clone https://github.com/dan64/DiTServerRPC.git
cd DiTServerRPC
```

Then create and activate the virtual environment inside the project directory:
```
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux / macOS
source .venv/bin/activate
```

Windows quick-start: once the venv is active you can run `install.cmd` to execute steps 2–6 automatically instead of running them one by one.
Use the stable build for all GPU generations (RTX 30 / 40 / 50):
```
pip install torch==2.9.1+cu128 torchvision==0.24.1+cu128 torchaudio==2.9.1+cu128 \
    --index-url https://download.pytorch.org/whl/cu128
```

Verify the installation:

```
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
```
⚠️ Do NOT use `pip install nunchaku` – that installs an unrelated package from PyPI with the same name that will fail with `ModuleNotFoundError: No module named 'nunchaku.models'`.
Install the correct MIT Han Lab build directly from the GitHub release:
```
# Windows / Python 3.12 / CUDA 12.8 / PyTorch 2.9
pip install https://github.com/nunchaku-ai/nunchaku/releases/download/v1.2.1/nunchaku-1.2.1+cu12.8torch2.9-cp312-cp312-win_amd64.whl
```

For other platforms or Python versions, browse the full list of available wheels on the Nunchaku releases page and replace the filename accordingly.
Verify the correct package is installed – the version string must contain the build tags:

```
pip show nunchaku
# Expected: Version: 1.2.1+cu12.8torch2.9
```

Nunchaku 1.2.1 contains a bug in its transformer forward pass: `txt_seq_lens` is always
`None` at the point where it is passed to `pos_embed`, causing a `ValueError` with
diffusers >= 0.37.0.dev0. The included `patch_nunchaku.py` fixes this by deriving
`max_txt_seq_len` directly from `encoder_hidden_states`:

```
python patch_nunchaku.py
```

On Windows you can also double-click `patch_nunchaku.cmd` or run it from a terminal:
```
patch_nunchaku.cmd             # apply the patch
patch_nunchaku.cmd --check     # check status without modifying files
patch_nunchaku.cmd --revert    # revert to original (.bak backup)
```
You can verify the patch status at any time:

```
python patch_nunchaku.py --check
```

And revert to the original if needed (a `.bak` backup is created automatically):

```
python patch_nunchaku.py --revert
```
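For reference, the gist of the workaround looks roughly like this (a conceptual sketch only, not the actual patch contents; the real change is applied by `patch_nunchaku.py` inside the installed nunchaku sources):

```python
def _max_txt_seq_len(encoder_hidden_states, txt_seq_lens=None):
    """Conceptual fallback: when txt_seq_lens arrives as None (the nunchaku 1.2.1 bug),
    derive the text sequence length from the encoder hidden states instead.
    Assumes encoder_hidden_states has shape (batch, txt_seq_len, dim)."""
    if txt_seq_lens is not None:
        return max(txt_seq_lens)
    return encoder_hidden_states.shape[1]
```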
⚠️ Do NOT install diffusers from GitHub (`pip install git+https://...`). Nunchaku 1.2.1 requires exactly `0.37.0.dev0`. Later dev builds (>= 0.39.0) changed the `QwenEmbedRope` API in a way that is incompatible even after the nunchaku patch.
A tested compatible wheel is included in the packages/ folder.
Install it directly:
```
pip install packages\diffusers-0.37.0.dev0-py3-none-any.whl
```

Verify:

```
python -c "import diffusers; print(diffusers.__version__)"
# Expected: 0.37.0.dev0
```

Pin the versions to match the tested working environment:
```
pip install \
    transformers==4.57.6 \
    accelerate==1.12.0 \
    "huggingface_hub>=0.26.0" \
    "Pillow>=10.0.0"
```
`safetensors` is intentionally not pinned here – diffusers pulls the correct version automatically as a dependency (`>=0.8.0-rc.0`).
```
dit-colorize-rpc/
├── dit_rpc_server.py              # XML-RPC server (entry point)
├── dit_colorize_main.py           # Colorization pipeline and image utilities
├── dit_client_example.py          # Example RPC client – single frame
├── dit_client_pair_example.py     # Example RPC client – paired inference
├── patch_nunchaku.py              # Compatibility patch for nunchaku 1.2.1
├── qwen_config_fp4.json           # Config for RTX 50-Series (FP4)
├── qwen_config_int4.json          # Config for RTX 30 / 40-Series (INT4)
├── install.cmd                    # Windows automated installer
├── start_server.cmd               # Windows launcher – server
├── run_client_example.cmd         # Windows launcher – single frame example
├── run_client_pair_example.cmd    # Windows launcher – paired inference example
├── patch_nunchaku.cmd             # Windows launcher – nunchaku patch
├── assets/
│   ├── santa_bw.png               # Sample B&W image (single frame test)
│   ├── sample1_bw.jpg             # Sample B&W image 1 (paired inference test)
│   └── sample2_bw.jpg             # Sample B&W image 2 (paired inference test)
├── packages/
│   └── diffusers-0.37.0.dev0-py3-none-any.whl   # Tested compatible diffusers build
└── README.md
```
Two ready-to-use config files are provided. Pick the one that matches your GPU and pass it to `--pipeline-config`.
`qwen_config_fp4.json` (RTX 50-Series):

```json
{
    "model_name": "nunchaku-qwen",
    "model_precision": "fp4",
    "model_rank": "32",
    "model_inference_steps": "4",
    "cache_dir": "",
    "full_model_path": ""
}
```

`qwen_config_int4.json` (RTX 30 / 40-Series):

```json
{
    "model_name": "nunchaku-qwen",
    "model_precision": "int4",
    "model_rank": "32",
    "model_inference_steps": "4",
    "cache_dir": "",
    "full_model_path": ""
}
```
⚠️ `model_precision`: use `"fp4"` only on RTX 50-Series (Blackwell). On RTX 30 / 40-Series use `"int4"` – FP4 kernels require sm_120 and will fail on older architectures.
| Key | Required | Description |
|---|---|---|
| `model_name` | ✅ | Must be `"nunchaku-qwen"` |
| `model_precision` | ✅ | `"fp4"` (RTX 50) or `"int4"` (RTX 30/40) |
| `model_rank` | ✅ | SVD rank – `"32"` is a good default |
| `model_inference_steps` | ✅ | Diffusion steps used to select the model file to download – must be `"4"` (no 2-step model file exists). To run inference faster, pass `steps=2` in the RPC call – this is independent of the downloaded model and reduces latency by ~40% |
| `cache_dir` | ❌ | HuggingFace cache directory. Omit or set to `""` to use the default (`~/.cache/huggingface`) |
| `full_model_path` | ❌ | Absolute path to a local `.safetensors` file. Omit or set to `""` to download from HuggingFace |
Start the server without preloading the pipeline (it can be loaded later via the `load_pipeline` RPC call):

```
python dit_rpc_server.py
```

Or preload it at startup:

```
# RTX 50-Series
python dit_rpc_server.py --load-pipeline --pipeline-config qwen_config_fp4.json

# RTX 30 / 40-Series
python dit_rpc_server.py --load-pipeline --pipeline-config qwen_config_int4.json
```

On Windows you can also use the provided `start_server.cmd` (see Windows launch script).
```
usage: dit_rpc_server.py [-h] [--host HOST] [--port PORT]
                         [--logfile LOGFILE] [--module-dir MODULE_DIR]
                         [--load-pipeline] [--pipeline-config CONFIG.json]

options:
  --host HOST              Address to listen on (default: 127.0.0.1)
  --port PORT              TCP port (default: 8765)
  --logfile LOGFILE        Optional path for a log file
  --module-dir MODULE_DIR  Directory containing dit_colorize_main.py
                           (default: same directory as this script)
  --load-pipeline          Load the colorization pipeline at startup
  --pipeline-config CONFIG.json
                           Path to the JSON pipeline config file
                           (required when --load-pipeline is set)
```
Connect from any Python client using `xmlrpc.client`:

```python
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)
```

All colorization and pipeline-management methods return a dict with at least `{"ok": bool, "msg": str}`.
| Method | Returns | Description |
|---|---|---|
| `ping()` | `"pong"` | Connectivity check |
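A quick connectivity check before doing any work might look like this (reusing the `proxy` created above):

```python
if proxy.ping() != "pong":
    raise RuntimeError("DiT RPC server is not responding")
```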
| Method | Returns | Description |
|---|---|---|
| `load_pipeline(model_name, model_precision, model_rank, model_inference_steps, cache_dir="", full_model_path="")` | `{"ok", "msg"}` | Load the model into VRAM |
| `is_pipeline_loaded()` | `bool` | `True` if the pipeline is ready |
| `unload_pipeline()` | `{"ok", "msg"}` | Release VRAM |
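If the server was started without `--load-pipeline`, a client can load the model remotely. A minimal sketch that reuses the values from `qwen_config_fp4.json` (pick the config that matches your GPU):

```python
import json
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)

with open("qwen_config_fp4.json") as f:
    cfg = json.load(f)

if not proxy.is_pipeline_loaded():
    res = proxy.load_pipeline(
        cfg["model_name"],
        cfg["model_precision"],
        cfg["model_rank"],
        cfg["model_inference_steps"],
        cfg.get("cache_dir", ""),
        cfg.get("full_model_path", ""),
    )
    if not res["ok"]:
        raise RuntimeError(res["msg"])
```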
| Method | Returns | Description |
|---|---|---|
| `request_stop()` | `bool` | Ask the server to refuse new colorization calls |
| `clear_stop()` | `bool` | Reset the stop flag before a new batch |
| `is_stop_requested()` | `bool` | Check the current stop flag |
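A sketch of how the stop flags are typically used around a batch; the `frame_paths` list, the output paths and the prompt are placeholders for your own data:

```python
proxy.clear_stop()                        # reset any stale stop flag before starting
for in_path, out_path in frame_paths:     # hypothetical (input, output) path pairs
    if proxy.is_stop_requested():         # e.g. set by another client via request_stop()
        print("Stop requested - aborting batch")
        break
    proxy.colorize_image(in_path, out_path, "colorize this black and white photo")
```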
| Method | Returns | Description |
|---|---|---|
| `colorize_image(in_path, out_path, prompt, img_size=0, steps=2)` | `{"ok", "elapsed", "skipped", "msg"}` | Single image, paths on the server filesystem |
| `colorize_image_pair(img1_path, img2_path, out_dir, prompt, gap_px=8)` | `{"ok", "elapsed", "msg"}` | Two images, single inference pass |
| `colorize_single_image(img_path, out_dir, prompt)` | `{"ok", "elapsed", "msg"}` | Single image fallback (odd batch end) |
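For example, a client could batch-colorize a whole directory with `colorize_image`; the paths are interpreted by the server, so they must be valid on the server host (directory names and prompt below are placeholders):

```python
from pathlib import Path

in_dir = Path("bw_frames")                        # B&W inputs, as seen by the server
out_dir = Path("color_frames")                    # output directory on the server
prompt = "colorize this black and white photo"    # example prompt

for src in sorted(in_dir.glob("*.png")):
    res = proxy.colorize_image(str(src), str(out_dir / src.name), prompt)
    if not res["ok"]:
        print(f"{src.name}: {res['msg']}")
    elif res["skipped"]:
        print(f"{src.name}: skipped (frame too dark)")
```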
| Method | Returns | Description |
|---|---|---|
| `colorize_frame(img_data, prompt, img_size=0, steps=2)` | `{"ok", "data", "elapsed", "skipped", "msg"}` | Single frame as raw PNG bytes |
| `colorize_frame_pair(img1_data, img2_data, prompt, gap_px=8)` | `{"ok", "data1", "data2", "elapsed", "skipped1", "skipped2", "msg"}` | Two frames, single inference pass |

`skipped=True` means the frame was too dark to colorize (average brightness < 9/255). The returned `data` field contains the unchanged input in that case.
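A minimal in-memory round trip with `colorize_frame`, assuming the `proxy` from above and the sample image shipped in `assets/` (the prompt is only an example):

```python
import io
from PIL import Image

# Encode the B&W frame as PNG bytes entirely in memory
img = Image.open("assets/santa_bw.png").convert("RGB")
buf = io.BytesIO()
img.save(buf, format="PNG")

res = proxy.colorize_frame(buf.getvalue(), "colorize this black and white photo")
if res["ok"] and not res["skipped"]:
    Image.open(io.BytesIO(res["data"])).save("santa_colorized.png")
```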
| Method | Returns | Description |
|---|---|---|
| `colorize_frame_shm(shm_in, shm_out, h, w, prompt, img_size=0, steps=2)` | `{"ok", "elapsed", "skipped", "msg"}` | Single frame via shared memory |
| `colorize_frame_pair_shm(shm_in1, shm_out1, h1, w1, shm_in2, shm_out2, h2, w2, prompt, gap_px=8)` | `{"ok", "elapsed", "skipped1", "skipped2", "msg"}` | Two frames via shared memory, single inference pass |
See Shared Memory Transport for usage details.
Both clients support two transport modes selectable via `--use-shm`:

| Mode | Flag | When to use | Measured speed (1480×1080 px pair) |
|---|---|---|---|
| Standard RPC | (default) | Any deployment, including remote server | ~5.25s/image |
| Shared memory | `--use-shm` | Server and client on the same host only | ~4.06s/image (~23% faster) |
The pipeline must be loaded on the server before running the clients. Start the server with
`--load-pipeline --pipeline-config CONFIG.json`.
Colorizes assets/santa_bw.png and saves the result as assets/santa_colorized.png.
```
# standard RPC – works with local and remote server
python dit_client_example.py

# shared memory – same-host only, lower latency
python dit_client_example.py --use-shm
```

Windows: `run_client_example.cmd`
To enable shared memory edit `run_client_example.cmd` and set `USE_SHM=1`.
Colorizes assets/sample1_bw.jpg and assets/sample2_bw.jpg in a single forward
pass, saving assets/sample1_colorized.jpg and assets/sample2_colorized.jpg.
Paired inference places the two images side-by-side and runs one inference instead of two, roughly halving the per-image cost (~5.25s/image vs ~11s standalone). Combined with shared memory transport this reaches ~4.06s/image.
```
# standard RPC
python dit_client_pair_example.py

# shared memory – same-host only
python dit_client_pair_example.py --use-shm
```

Windows: `run_client_pair_example.cmd`
To enable shared memory edit `run_client_pair_example.cmd` and set `USE_SHM=1`.
```
--host HOST          Server host (default: 127.0.0.1)
--port PORT          Server port (default: 8765)
--prompt PROMPT      Text prompt for the model
--use-shm            Use shared memory transport (same-host only)
```

Additional argument for the paired client:

```
--gap-px N           Separator width in pixels between the two
                     images in the merged input (default: 8)
```
The standard RPC transport serializes each image as a PNG byte stream, encodes it in Base64, sends it over a TCP socket, and decodes it on the other side. For a 1480×1080 frame this is roughly 4–5 MB per round trip.
The shared memory transport bypasses the network entirely. The client writes the raw pixel array directly into a shared memory segment; the server attaches to the same segment and reads the pixels without any copy. Only the metadata (segment name, dimensions, prompt) travels over the XML-RPC socket.
Requirement: server and client must run on the same machine.
If the server is on a dedicated GPU machine and the client is on a separate workstation,
shared memory is not available – use the standard RPC transport instead (default).
The clients detect this automatically: passing `--use-shm` when the host is not
`127.0.0.1` / `localhost` prints a warning and falls back to standard RPC.
Measured on a 1480×1080 pixel pair (RTX 5070 Ti, FP4, paired inference):
| Transport | Per-image time | Round-trip overhead |
|---|---|---|
| Standard RPC (PNG) | ~5.25s | ~1.1s |
| Shared memory | ~4.06s | ~0.16s |
| Gain | ~23% faster | ~7× less overhead |
The round-trip overhead with shared memory is essentially zero – the 0.16s gap between inference time and wall-clock time is just Python function call and numpy overhead.
On a 100k-frame video processed as pairs (50k inference calls) the cumulative saving is:
(5.25 - 4.06) s × 50,000 ≈ 59,500 s ≈ 16.5 hours
The client owns and manages all shared memory segments. The server is fully stateless with respect to shared memory β it only attaches, reads/writes, and detaches.
```
Client                                           Server
  │                                                │
  │ create shm_in  (h × w × 3 bytes)               │
  │ create shm_out (h × w × 3 bytes)               │
  │ write raw RGB pixels → shm_in                  │
  │                                                │
  │ RPC(shm_in_name, shm_out_name, h, w, …) ──────►│
  │                                                │ attach shm_in → PIL Image
  │                                                │ inference
  │                                                │ result → shm_out
  │◄── return {elapsed, skipped, …} ───────────────│
  │                                                │ detach both segments
  │ read shm_out → PIL Image                       │
  │ unlink shm_in + shm_out                        │
```
From the command line:
```
python dit_client_pair_example.py --use-shm
python dit_client_example.py --use-shm
```

From the Windows .cmd launchers, edit the user configuration block and set:

```
set USE_SHM=1
```

The banner will confirm the active transport:

```
Transport : 1 (0=RPC 1=shared memory)
```

And the Python client will print:

```
[INFO] Transport: shared memory
```
```python
import uuid
import numpy as np
from multiprocessing.shared_memory import SharedMemory
from PIL import Image


def colorize_pair_shm(proxy, img1: Image.Image, img2: Image.Image, prompt: str):
    arr1, arr2 = np.array(img1), np.array(img2)
    h1, w1 = arr1.shape[:2]
    h2, w2 = arr2.shape[:2]
    uid = uuid.uuid4().hex[:12]

    # Create all four segments (client owns them)
    segs = {
        tag: SharedMemory(name=f"dit_{tag}_{uid}", create=True, size=h * w * 3)
        for tag, h, w in [("in1", h1, w1), ("out1", h1, w1), ("in2", h2, w2), ("out2", h2, w2)]
    }
    try:
        np.ndarray((h1, w1, 3), dtype=np.uint8, buffer=segs["in1"].buf)[:] = arr1
        np.ndarray((h2, w2, 3), dtype=np.uint8, buffer=segs["in2"].buf)[:] = arr2

        result = proxy.colorize_frame_pair_shm(
            segs["in1"].name, segs["out1"].name, h1, w1,
            segs["in2"].name, segs["out2"].name, h2, w2,
            prompt, 8,  # gap_px
        )

        out1 = Image.fromarray(
            np.ndarray((h1, w1, 3), dtype=np.uint8, buffer=segs["out1"].buf).copy())
        out2 = Image.fromarray(
            np.ndarray((h2, w2, 3), dtype=np.uint8, buffer=segs["out2"].buf).copy())
        return result, out1, out2
    finally:
        for shm in segs.values():
            shm.close(); shm.unlink()
```

`start_server.cmd` is a ready-to-use launcher for Windows.
Edit the variables at the top of the file to match your setup, then double-click it or run it from a terminal.
```
start_server.cmd [fp4|int4]
```

If no argument is passed it defaults to `fp4`. Pass `int4` for RTX 30 / 40-Series:

```
start_server.cmd int4
```
CUDA out of memory
Close other GPU applications. On 16 GB cards the server automatically enables sequential CPU offload for layers that do not fit in VRAM.
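For reference only, sequential CPU offload is a standard diffusers feature; the server configures it internally, so you would only need this if you adapt `dit_colorize_main.py` to build your own pipeline (the model id and dtype below are illustrative assumptions):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509",      # the base model named in the acknowledgments below
    torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()  # submodules move to the GPU only while they execute
```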
`dit_colorize_main.py` NOT FOUND

Use `--module-dir` to point the server to the directory that contains `dit_colorize_main.py`:

```
python dit_rpc_server.py --module-dir /path/to/dit_colorize_main
```

Model 'xxx' is not supported

The only supported value for `model_name` is `"nunchaku-qwen"`.
Pipeline takes a long time to load
On the first run the model weights (~15–30 GB) are downloaded from HuggingFace.
Subsequent runs load from the local cache. Set `cache_dir` in the config to control where the cache is stored.
- Model: Qwen/Qwen-Image-Edit-2509
- Quantization: Nunchaku / SVDQuant
- Pipeline: Hugging Face Diffusers