[3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (2.75 x faster!) by yiakwy-xpu-ml-framework-team · Pull Request #402 · antirez/ds4

yiakwy-xpu-ml-framework-team · 2026-06-12T19:13:15Z

Introduction

We have verified that our SFT/RL model (a much stronger dsv4), quantized to 2 bits, runs at 15 tokens/sec.

Then it occurred to me: what if I run it on other GPUs using UVM (mapping CPU memory into the GPU address space) together with a prefetch cache?

For example, on an 80 GB GPU we preload 64 GB of tensors from the 154 GB model.

Now it works.
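To make the example concrete, here is a minimal sketch of the arithmetic behind it. The helper name `resident_pct` is hypothetical and is not part of ds4; it only computes what share of the model fits in the preloaded cache, using the 64 GB and 154 GB figures quoted above.

```shell
# Illustrative arithmetic only; resident_pct is a hypothetical helper, not a ds4 command.
resident_pct() {
  # $1 = cache size in GB, $2 = model size in GB
  awk -v c="$1" -v m="$2" 'BEGIN { printf "%.1f\n", 100 * c / m }'
}

# Share of the 154 GB model that stays resident in a 64 GB cache.
resident_pct 64 154
```

With these numbers about 41.6% of the weights are resident on the GPU, and the remaining 90 GB or so would have to be reached through the UVM mapping.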

Server side snapshot

Client side snapshot

Acceleration

| Config | Speed (tokens/sec) | Model size (GB) |
| --- | --- | --- |
| dsv4 iq2_xxs | 15 | 81 |
| q4 | 2 | 154 |
| q4 + 64 GB cache | 5.5 | 154 |
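The "2.75 x faster" figure in the title follows directly from the q4 rows of the table. A small sketch of that calculation, where `speedup` is a hypothetical helper and the inputs are the tokens/sec values reported above:

```shell
# Illustrative arithmetic only; speedup is a hypothetical helper, not a ds4 command.
speedup() {
  # $1 = new speed, $2 = baseline speed (both in tokens/sec)
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# q4 with the 64 GB cache (5.5 tokens/sec) against plain q4 (2 tokens/sec).
speedup 5.5 2
```

The same helper shows that the cached q4 run is still slower than the 2-bit model: `speedup 5.5 15` gives roughly 0.37.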

Discussion

This is a follow-up to #368 and #377, but it can be merged independently since it changes the runtime engine, not the quantization toolkit.


yiakwy-xpu-ml-framework-team · 2026-06-12T19:14:43Z

@antirez Sorry to disturb you again! But this is a really important feature: prefetch cache support in the CUDA backend (any CUDA GPU)!

yiakwy-xpu-ml-framework-team · 2026-06-12T19:18:25Z

How to use it:

DS4_SFT_E2_FP4_MODEL=./gguf/DeepSeek-V4-Flash_e2_v1_Q4KExperts-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

# ds4: --ssd-streaming is currently supported only with --metal
#   --ssd-streaming \

# export DS4_CUDA_MODEL_PRELOAD_SIZE_GB=64
export DS4_CUDA_WEIGHT_CACHE_LIMIT_GB=64

# export DS4_CUDA_WEIGHT_PRELOAD=1
export DS4_CUDA_WEIGHT_CACHE=1

# for debugging
export DS4_CUDA_MODEL_COPY_VERBOSE=1

export DS4_CUDA_WEIGHT_CACHE_VERBOSE=1

# important!
export DS4_CUDA_COPY_MODEL_CHUNKED=1

CUDA_VISIBLE_DEVICES=2 \
DS4_MODEL_NAME="deepseek-v4-flash-rl-e5" \
DS4_LOCK_FILE=/tmp/ds4-server-2.lock ./ds4-server \
  --cuda \
  -m "$DS4_SFT_E2_FP4_MODEL" \
  --ssd-streaming-cache-experts 64GB \
  --ctx 256000 \
  --kv-disk-dir /raid/yiakwy/tmp/ds4-kv-gpu4 \
  --kv-disk-space-mb 102400 \
  --host 127.0.0.1 \
  --port 8001
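Before launching with a given cache limit, it may help to check that the limit leaves room on the device. The sketch below assumes that `DS4_CUDA_WEIGHT_CACHE_LIMIT_GB` is the weight cache size in GB held in device memory; the helper `cache_fits` and the 8 GB default headroom are illustrative assumptions, not part of ds4.

```shell
# Hypothetical pre-flight check, not a ds4 command.
# Succeeds when cache + headroom (GB) fits in the GPU's total memory (MiB).
cache_fits() {
  # $1 = total GPU memory in MiB, $2 = cache limit in GB,
  # $3 = headroom in GB for KV cache and activations (assumed default: 8)
  awk -v t="$1" -v c="$2" -v h="${3:-8}" \
    'BEGIN { exit !((c + h) * 1024 <= t) }'
}

# Example with an 80 GB card (81920 MiB) and the 64 GB limit used above.
if cache_fits 81920 64; then
  echo "cache limit fits"
else
  echo "cache limit too large"
fi
```

On a real machine the first argument could come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports total memory in MiB.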

add prefetch for CUDA backend , running ds4 for any GPU with cache

cc18da2


yiakwy-xpu-ml-framework-team changed the title ~~[3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (3x faster!)~~ [3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (2.75 x faster!) Jun 12, 2026

yiakwy-xpu-ml-framework-team wants to merge 1 commit into antirez:main from yiakwy-xpu-ml-framework-team:add_prefetch_cache_support_for_cuda
