DiT Colorize RPC Server

An XML-RPC server that exposes a GPU-accelerated colorization pipeline for black-and-white images and video frames. Built on top of the Nunchaku SVDQuant FP4/INT4 transformer and the Qwen-Image-Edit-2511 diffusion model.

Optimized for NVIDIA RTX 50-Series (Blackwell) & CUDA 12.8.


✨ Features

  • 🎨 Batch colorization β€” process entire directories of B&W images via filesystem paths
  • πŸ–ΌοΈ Paired inference β€” colorize two images in a single forward pass (faster, temporally consistent)
  • πŸ“‘ In-memory RPC β€” pass raw PNG frames over XML-RPC without touching the filesystem (ideal for video pipelines)
  • ⚑ 4-step lightning model β€” SVDQuant FP4 quantized transformer for maximum throughput
  • πŸ”’ Thread-safe β€” pipeline loading and stop control are protected by locks; every RPC call runs in its own thread
  • βš™οΈ Startup preload β€” optional --load-pipeline flag loads the model at boot from a JSON config file
  • πŸš€ Shared memory transport β€” zero-copy image transfer for same-host deployments (~23% faster than standard RPC)

📋 Prerequisites

| Requirement | Details |
|---|---|
| OS | Windows 10/11 or Linux |
| Python | 3.12 |
| GPU | NVIDIA RTX 3090 / 4090 / 5070 Ti / 5090 (16 GB+ VRAM recommended) |
| CUDA | 12.8 or newer |
| CUDA Toolkit | Must match the PyTorch build (see below) |

RTX 40-Series and older: use "model_precision": "int4" in the pipeline config file. FP4 quantization requires Blackwell hardware; INT4 is the correct precision for Ampere (RTX 30) and Ada Lovelace (RTX 40) GPUs.


🛠️ Installing Git and Python

Before setting up the project environment, make sure both Git and Python 3.12 are installed on your system.

Git

Windows: download and install Git for Windows. Accept the default options – in particular keep core.autocrlf=true (the default), which ensures correct line endings for .cmd files.

Linux:

sudo apt install git        # Debian / Ubuntu
sudo dnf install git        # Fedora / RHEL

Verify: git --version


Python 3.12

Windows: download the installer from python.org/downloads. During installation, check "Add Python to PATH" – without this, python will not be recognized in the terminal.

Linux:

sudo apt install python3.12 python3.12-venv   # Debian / Ubuntu
sudo dnf install python3.12                   # Fedora / RHEL

Verify: python --version (Windows) or python3.12 --version (Linux)


⚙️ Environment Setup

1 – Clone the repository and create a virtual environment

Clone the repository with git – this ensures correct line endings for all files (.gitattributes is applied automatically at checkout):

git clone https://github.com/dan64/DiTServerRPC.git
cd DiTServerRPC

Then create and activate the virtual environment inside the project directory:

python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate

Windows quick-start: once the venv is active you can run install.cmd to execute steps 2–6 automatically instead of running them one by one.


2 – Install PyTorch 2.9.1 + CUDA 12.8

Use the stable build for all GPU generations (RTX 30 / 40 / 50):

pip install torch==2.9.1+cu128 torchvision==0.24.1+cu128 torchaudio==2.9.1+cu128 \
    --index-url https://download.pytorch.org/whl/cu128

Verify the installation:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

3 – Install Nunchaku

⚠️ Do NOT use pip install nunchaku – that installs an unrelated PyPI package with the same name, which will fail with ModuleNotFoundError: No module named 'nunchaku.models'.

Install the correct MIT Han Lab build directly from the GitHub release:

# Windows / Python 3.12 / CUDA 12.8 / PyTorch 2.9
pip install https://github.com/nunchaku-ai/nunchaku/releases/download/v1.2.1/nunchaku-1.2.1+cu12.8torch2.9-cp312-cp312-win_amd64.whl

For other platforms or Python versions, browse the full list of available wheels on the Nunchaku releases page and replace the filename accordingly.

Verify the correct package is installed – the version string must contain the build tags:

pip show nunchaku
# Expected: Version: 1.2.1+cu12.8torch2.9
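
As an extra sanity check, the submodule that the unrelated PyPI package lacks should now import cleanly:

python -c "import nunchaku.models; print('ok')"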

4 – Patch Nunchaku

Nunchaku 1.2.1 contains a bug in its transformer forward pass: txt_seq_lens is always None at the point where it is passed to pos_embed, causing a ValueError with diffusers >= 0.37.0.dev0. The included patch_nunchaku.py fixes this by deriving max_txt_seq_len directly from encoder_hidden_states:

python patch_nunchaku.py

On Windows you can also double-click patch_nunchaku.cmd or run it from a terminal:

patch_nunchaku.cmd            # apply the patch
patch_nunchaku.cmd --check    # check status without modifying files
patch_nunchaku.cmd --revert   # revert to original (.bak backup)

You can verify the patch status at any time:

python patch_nunchaku.py --check

And revert to the original if needed (a .bak backup is created automatically):

python patch_nunchaku.py --revert

5 – Install Diffusers

⚠️ Do NOT install diffusers from GitHub (pip install git+https://...). Nunchaku 1.2.1 requires exactly 0.37.0.dev0. Later dev builds (≥ 0.39.0) changed the QwenEmbedRope API in a way that is incompatible even after the nunchaku patch.

A tested compatible wheel is included in the packages/ folder. Install it directly:

# Windows
pip install packages\diffusers-0.37.0.dev0-py3-none-any.whl
# Linux
pip install packages/diffusers-0.37.0.dev0-py3-none-any.whl

Verify:

python -c "import diffusers; print(diffusers.__version__)"
# Expected: 0.37.0.dev0

6 – Install remaining dependencies

Pin the versions to match the tested working environment:

pip install \
    transformers==4.57.6 \
    accelerate==1.12.0 \
    "huggingface_hub>=0.26.0" \
    "Pillow>=10.0.0"

safetensors is intentionally not pinned here – diffusers pulls the correct version automatically as a dependency (>=0.8.0-rc.0).
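
To confirm the whole environment in one shot before starting the server, a quick version dump of the packages installed above can help:

python -c "from importlib.metadata import version; [print(p, version(p)) for p in ('torch', 'diffusers', 'transformers', 'accelerate', 'nunchaku')]"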


📂 Project Structure

DiTServerRPC/
├── dit_rpc_server.py            # XML-RPC server (entry point)
├── dit_colorize_main.py         # Colorization pipeline and image utilities
├── dit_client_example.py        # Example RPC client – single frame
├── dit_client_pair_example.py   # Example RPC client – paired inference
├── patch_nunchaku.py            # Compatibility patch for nunchaku 1.2.1
├── qwen_config_fp4.json         # Config for RTX 50-Series (FP4)
├── qwen_config_int4.json        # Config for RTX 30 / 40-Series (INT4)
├── install.cmd                  # Windows automated installer
├── start_server.cmd             # Windows launcher – server
├── run_client_example.cmd       # Windows launcher – single frame example
├── run_client_pair_example.cmd  # Windows launcher – paired inference example
├── patch_nunchaku.cmd           # Windows launcher – nunchaku patch
├── assets/
│   ├── santa_bw.png             # Sample B&W image (single frame test)
│   ├── sample1_bw.jpg           # Sample B&W image 1 (paired inference test)
│   └── sample2_bw.jpg           # Sample B&W image 2 (paired inference test)
├── packages/
│   └── diffusers-0.37.0.dev0-py3-none-any.whl  # Tested compatible diffusers build
└── README.md

🔧 Pipeline Configuration

Two ready-to-use config files are provided. Pick the one that matches your GPU and pass it to --pipeline-config.

qwen_config_fp4.json – RTX 50-Series (Blackwell)

{
    "model_name":            "nunchaku-qwen",
    "model_precision":       "fp4",
    "model_rank":            "32",
    "model_inference_steps": "4",
    "cache_dir":             "",
    "full_model_path":       ""
}

qwen_config_int4.json – RTX 30 / 40-Series (Ampere / Ada Lovelace)

{
    "model_name":            "nunchaku-qwen",
    "model_precision":       "int4",
    "model_rank":            "32",
    "model_inference_steps": "4",
    "cache_dir":             "",
    "full_model_path":       ""
}

⚠️ model_precision: use "fp4" only on RTX 50-Series (Blackwell). On RTX 30 / 40-Series use "int4" – FP4 kernels require sm_120 and will fail on older architectures.

Key reference

| Key | Required | Description |
|---|---|---|
| model_name | ✅ | Must be "nunchaku-qwen" |
| model_precision | ✅ | "fp4" (RTX 50) or "int4" (RTX 30/40) |
| model_rank | ✅ | SVD rank – "32" is a good default |
| model_inference_steps | ✅ | Diffusion steps used to select the model file to download – must be "4" (no 2-step model file exists). To run inference faster, pass steps=2 in the RPC call – this is independent of the downloaded model and reduces latency by ~40% |
| cache_dir | ➖ | HuggingFace cache directory. Omit or set to "" to use the default (~/.cache/huggingface) |
| full_model_path | ➖ | Absolute path to a local .safetensors file. Omit or set to "" to download from HuggingFace |
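
As an example, a config that loads a local checkpoint instead of downloading from HuggingFace might look like this (the full_model_path value is a placeholder – point it at your own .safetensors file):

{
    "model_name":            "nunchaku-qwen",
    "model_precision":       "int4",
    "model_rank":            "32",
    "model_inference_steps": "4",
    "cache_dir":             "",
    "full_model_path":       "D:/models/my-local-model.safetensors"
}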

🚀 Usage

Start the server (no preload – pipeline loaded later via RPC)

python dit_rpc_server.py

Start the server with pipeline preloaded at boot

# RTX 50-Series
python dit_rpc_server.py --load-pipeline --pipeline-config qwen_config_fp4.json

# RTX 30 / 40-Series
python dit_rpc_server.py --load-pipeline --pipeline-config qwen_config_int4.json

On Windows you can also use the provided start_server.cmd (see Windows launch script).

Full list of CLI arguments

usage: dit_rpc_server.py [-h] [--host HOST] [--port PORT]
                         [--logfile LOGFILE] [--module-dir MODULE_DIR]
                         [--load-pipeline] [--pipeline-config CONFIG.json]

options:
  --host HOST                  Address to listen on (default: 127.0.0.1)
  --port PORT                  TCP port (default: 8765)
  --logfile LOGFILE            Optional path for a log file
  --module-dir MODULE_DIR      Directory containing dit_colorize_main.py
                               (default: same directory as this script)
  --load-pipeline              Load the colorization pipeline at startup
  --pipeline-config CONFIG.json
                               Path to the JSON pipeline config file
                               (required when --load-pipeline is set)
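
For example, to accept connections from other machines, write a log file, and preload the FP4 pipeline in one go:

python dit_rpc_server.py --host 0.0.0.0 --port 8765 --logfile server.log --load-pipeline --pipeline-config qwen_config_fp4.json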

📡 RPC API Reference

Connect from any Python client using xmlrpc.client:

import xmlrpc.client
proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)

All methods return a dict with at least {"ok": bool, "msg": str}.

Health

| Method | Returns | Description |
|---|---|---|
| ping() | "pong" | Connectivity check |

Pipeline management

| Method | Returns | Description |
|---|---|---|
| load_pipeline(model_name, model_precision, model_rank, model_inference_steps, cache_dir="", full_model_path="") | {"ok", "msg"} | Load the model into VRAM |
| is_pipeline_loaded() | bool | True if the pipeline is ready |
| unload_pipeline() | {"ok", "msg"} | Release VRAM |
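
If the server was started without --load-pipeline, a client can load the model itself. A minimal sketch using the qwen_config_fp4.json values, assuming all arguments are passed as strings to mirror the config file:

import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)
if not proxy.is_pipeline_loaded():
    result = proxy.load_pipeline("nunchaku-qwen", "fp4", "32", "4")
    print(result["ok"], result["msg"])   # ok=True once the model is in VRAM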

Stop control

| Method | Returns | Description |
|---|---|---|
| request_stop() | bool | Ask the server to refuse new colorization calls |
| clear_stop() | bool | Reset the stop flag before a new batch |
| is_stop_requested() | bool | Check the current stop flag |
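
A typical batch loop clears the flag once up front and polls it between frames, as in this sketch (frame_paths and the prompt are illustrative):

proxy.clear_stop()
for in_path in frame_paths:                    # hypothetical list of B&W frame paths
    if proxy.is_stop_requested():              # another client may have requested a stop
        break
    proxy.colorize_image(in_path, in_path.replace("_bw", "_color"), prompt)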

Colorization – filesystem-based

| Method | Returns | Description |
|---|---|---|
| colorize_image(in_path, out_path, prompt, img_size=0, steps=2) | {"ok", "elapsed", "skipped", "msg"} | Single image, paths on the server filesystem |
| colorize_image_pair(img1_path, img2_path, out_dir, prompt, gap_px=8) | {"ok", "elapsed", "msg"} | Two images, single inference pass |
| colorize_single_image(img_path, out_dir, prompt) | {"ok", "elapsed", "msg"} | Single image fallback (odd batch end) |
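
A single filesystem-based call might look like this – both paths must be valid on the server's filesystem, not the client's (paths and prompt are illustrative):

result = proxy.colorize_image(
    "/data/frames/frame0001_bw.png",      # input path, resolved on the server
    "/data/frames/frame0001_color.png",   # output path, written by the server
    "colorize this black and white photo",
    0,   # img_size (default)
    2,   # steps – 2-step inference, as recommended in the key reference above
)
if not result["ok"]:
    print(result["msg"])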

Colorization – in-memory (PNG bytes over RPC)

| Method | Returns | Description |
|---|---|---|
| colorize_frame(img_data, prompt, img_size=0, steps=2) | {"ok", "data", "elapsed", "skipped", "msg"} | Single frame as raw PNG bytes |
| colorize_frame_pair(img1_data, img2_data, prompt, gap_px=8) | {"ok", "data1", "data2", "elapsed", "skipped1", "skipped2", "msg"} | Two frames, single inference pass |

skipped=True means the frame was too dark to colorize (average brightness < 9/255). The returned data field contains the unchanged input in that case.
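
A round trip with raw PNG bytes – Python 3's xmlrpc.client marshals bytes as base64 automatically, and use_builtin_types=True returns bytes on the way back (file names and prompt are illustrative):

with open("frame_bw.png", "rb") as f:
    png_in = f.read()

result = proxy.colorize_frame(png_in, "colorize this photo", 0, 2)
if result["ok"]:
    # on skipped frames "data" holds the unchanged input (see the note above)
    with open("frame_color.png", "wb") as f:
        f.write(result["data"])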

Colorization – shared memory (same-host only, zero-copy)

| Method | Returns | Description |
|---|---|---|
| colorize_frame_shm(shm_in, shm_out, h, w, prompt, img_size=0, steps=2) | {"ok", "elapsed", "skipped", "msg"} | Single frame via shared memory |
| colorize_frame_pair_shm(shm_in1, shm_out1, h1, w1, shm_in2, shm_out2, h2, w2, prompt, gap_px=8) | {"ok", "elapsed", "skipped1", "skipped2", "msg"} | Two frames via shared memory, single inference pass |

See Shared Memory Transport for usage details.


🧪 Example Clients

Both clients support two transport modes selectable via --use-shm:

| Mode | Flag | When to use | Measured speed (1480×1080 px pair) |
|---|---|---|---|
| Standard RPC | (default) | Any deployment, including remote server | ~5.25 s/image |
| Shared memory | --use-shm | Server and client on the same host only | ~4.06 s/image (~23% faster) |

The pipeline must be loaded on the server before running the clients. Start the server with --load-pipeline --pipeline-config CONFIG.json.

Single frame – dit_client_example.py

Colorizes assets/santa_bw.png and saves the result as assets/santa_colorized.png.

# standard RPC – works with local and remote server
python dit_client_example.py

# shared memory – same-host only, lower latency
python dit_client_example.py --use-shm

Windows: run_client_example.cmd. To enable shared memory, edit run_client_example.cmd and set USE_SHM=1.


Paired inference – dit_client_pair_example.py

Colorizes assets/sample1_bw.jpg and assets/sample2_bw.jpg in a single forward pass, saving assets/sample1_colorized.jpg and assets/sample2_colorized.jpg.

Paired inference places the two images side-by-side and runs one inference instead of two, roughly halving the per-image cost (~5.25s/image vs ~11s standalone). Combined with shared memory transport this reaches ~4.06s/image.

# standard RPC
python dit_client_pair_example.py

# shared memory – same-host only
python dit_client_pair_example.py --use-shm

Windows: run_client_pair_example.cmd. To enable shared memory, edit run_client_pair_example.cmd and set USE_SHM=1.

Full list of arguments (both clients)

  --host HOST                  Server host (default: 127.0.0.1)
  --port PORT                  Server port (default: 8765)
  --prompt PROMPT              Text prompt for the model
  --use-shm                    Use shared memory transport (same-host only)

Additional argument for the paired client:

  --gap-px N                   Separator width in pixels between the two
                               images in the merged input (default: 8)

🚀 Shared Memory Transport (same-host only)

What it is

The standard RPC transport serializes each image as a PNG byte stream, encodes it in Base64, sends it over a TCP socket, and decodes it on the other side. For a 1480×1080 frame this is roughly 4–5 MB per round trip.

The shared memory transport bypasses the network entirely. The client writes the raw pixel array directly into a shared memory segment; the server attaches to the same segment and reads the pixels without any copy. Only the metadata (segment name, dimensions, prompt) travels over the XML-RPC socket.

When you can use it

Requirement: server and client must run on the same machine.

If the server is on a dedicated GPU machine and the client is on a separate workstation, shared memory is not available – use the standard RPC transport instead (default). The clients detect this automatically: passing --use-shm when the host is not 127.0.0.1 / localhost prints a warning and falls back to standard RPC.

Performance

Measured on a 1480×1080 pixel pair (RTX 5070 Ti, FP4, paired inference):

| Transport | Per-image time | Round-trip overhead |
|---|---|---|
| Standard RPC (PNG) | ~5.25 s | ~1.1 s |
| Shared memory | ~4.06 s | ~0.16 s |
| Gain | ~23% faster | ~7× less overhead |

The round-trip overhead with shared memory is essentially zero – the 0.16 s gap between inference time and wall-clock time is just Python function-call and numpy overhead.

On a 100k-frame video processed as pairs (50k inference calls) the cumulative saving is:

(5.25 - 4.06) s × 50,000 = 59,500 s ≈ 16.5 hours

How the protocol works

The client owns and manages all shared memory segments. The server is fully stateless with respect to shared memory – it only attaches, reads/writes, and detaches.

Client                                     Server
  │                                           │
  │  create shm_in  (h × w × 3 bytes)         │
  │  create shm_out (h × w × 3 bytes)         │
  │  write raw RGB pixels → shm_in            │
  │                                           │
  │  RPC(shm_in_name, shm_out_name, h, w, …) ─►│
  │                                           │  attach shm_in  → PIL Image
  │                                           │  inference
  │                                           │  result → shm_out
  │◄─ return {elapsed, skipped, …} ───────────│
  │                                           │  detach both segments
  │  read shm_out → PIL Image                 │
  │  unlink shm_in + shm_out                  │

Enabling shared memory

From the command line:

python dit_client_pair_example.py --use-shm
python dit_client_example.py      --use-shm

From the Windows .cmd launchers, edit the user configuration block and set:

set USE_SHM=1

The banner will confirm the active transport:

Transport   : 1 (0=RPC 1=shared memory)

And the Python client will print:

[INFO] Transport: shared memory

Implementing shared memory in your own client

import uuid
import numpy as np
from multiprocessing.shared_memory import SharedMemory
from PIL import Image

def colorize_pair_shm(proxy, img1: Image.Image, img2: Image.Image, prompt: str):
    # ensure 3-channel RGB – B&W sources are often single-channel ("L" mode)
    arr1 = np.array(img1.convert("RGB"))
    arr2 = np.array(img2.convert("RGB"))
    h1, w1 = arr1.shape[:2]
    h2, w2 = arr2.shape[:2]
    uid = uuid.uuid4().hex[:12]

    # Create all four segments (client owns them)
    segs = {
        tag: SharedMemory(name=f"dit_{tag}_{uid}", create=True, size=h*w*3)
        for tag, h, w in [("in1",h1,w1),("out1",h1,w1),("in2",h2,w2),("out2",h2,w2)]
    }
    try:
        np.ndarray((h1,w1,3), dtype=np.uint8, buffer=segs["in1"].buf)[:] = arr1
        np.ndarray((h2,w2,3), dtype=np.uint8, buffer=segs["in2"].buf)[:] = arr2

        result = proxy.colorize_frame_pair_shm(
            segs["in1"].name, segs["out1"].name, h1, w1,
            segs["in2"].name, segs["out2"].name, h2, w2,
            prompt, 8,  # gap_px
        )

        out1 = Image.fromarray(
            np.ndarray((h1,w1,3), dtype=np.uint8, buffer=segs["out1"].buf).copy())
        out2 = Image.fromarray(
            np.ndarray((h2,w2,3), dtype=np.uint8, buffer=segs["out2"].buf).copy())
        return result, out1, out2
    finally:
        # the client owns the segments: close and unlink all four
        for shm in segs.values():
            shm.close()
            shm.unlink()
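
Usage is then a drop-in replacement for the PNG-based pair call (prompt text is illustrative):

import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)
img1 = Image.open("assets/sample1_bw.jpg")
img2 = Image.open("assets/sample2_bw.jpg")
result, out1, out2 = colorize_pair_shm(proxy, img1, img2, "colorize this photo")
out1.save("assets/sample1_colorized.jpg")
out2.save("assets/sample2_colorized.jpg")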

🪟 Windows Launch Script

start_server.cmd is a ready-to-use launcher for Windows. Edit the variables at the top of the file to match your setup, then double-click it or run it from a terminal.

start_server.cmd [fp4|int4]

If no argument is passed it defaults to fp4. Pass int4 for RTX 30 / 40-Series:

start_server.cmd int4

🔧 Troubleshooting

CUDA out of memory – Close other GPU applications. On 16 GB cards the server automatically enables sequential CPU offload for layers that do not fit in VRAM.

dit_colorize_main.py NOT FOUND – Use --module-dir to point the server to the directory that contains dit_colorize_main.py:

python dit_rpc_server.py --module-dir /path/to/dit_colorize_main

Model 'xxx' is not supported – The only supported value for model_name is "nunchaku-qwen".

Pipeline takes a long time to load – On the first run the model weights (~15–30 GB) are downloaded from HuggingFace. Subsequent runs load from the local cache. Set cache_dir in the config to control where the cache is stored.

