ds4.c

Fork Status

This repository is a fork of antirez/ds4. The upstream project is a narrow DeepSeek V4 Flash Metal inference engine. This fork keeps that scope and adds an experimental opt-in disk-backed weight offload path for machines that cannot eagerly register the full GGUF weight range with Metal.

ds4.c is a small native inference engine for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state, and server API glue.

This project would not exist without llama.cpp and GGML. Make sure to read the acknowledgements section; a big thank you to Georgi Gerganov and all the other contributors.

Now, back to this project. Why do we believe DeepSeek v4 Flash is a special model that deserves a stand-alone engine? Because after comparing it with powerful smaller dense models, we can report that:

  1. DeepSeek v4 Flash is faster because it has fewer active parameters.
  2. In thinking mode, if you avoid max thinking, it produces a thinking section that is a lot shorter than other models, even 1/5 of other models in many cases, and crucially, the thinking section length is proportional to the problem complexity. This makes DeepSeek v4 Flash usable with thinking enabled when other models are practically impossible to use in the same conditions.
  3. The model features a context window of 1 million tokens.
  4. Being so large, it knows more when you sample at the edge of knowledge. For instance, asking about Italian shows or political questions quickly reveals that 284B parameters are a lot more than 27B or 35B parameters.
  5. It writes much better English and Italian. It feels like a quasi-frontier model.
  6. The KV cache is incredibly compressed, allowing long-context inference on local computers and on-disk KV cache persistence.
  7. It works well with 2-bit quantization, if quantized in a special way (read later). This allows it to run on MacBooks with 128GB of RAM.
  8. We expect DeepSeek to release updated versions of v4 Flash in the future, even better than the current one.

That said, a few important things about this project:

  • The local inference landscape contains many excellent projects, but new models are released continuously, and the attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation (logits obtained with the official implementation), long-context tests, and enough agent integration to know if it really works. The exact model may change as the landscape evolves, but the constraint remains: local inference credible on high end personal machines or Mac Studios, starting from 128GB of memory.
  • This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you. The acknowledgement below is equally important: this would not exist without llama.cpp and GGML, largely written by hand.
  • This implementation is based on the idea that compressed KV caches like the one in DeepSeek v4, combined with the fast SSDs of modern MacBooks, should change our idea that the KV cache belongs in RAM. The KV cache is actually a first-class disk citizen.
  • Our vision is that local inference should be a set of three things working well together, out of the box: A) an inference engine with an HTTP API, B) a GGUF specially crafted to run well under a given engine and given assumptions, and C) testing and validation with coding agent implementations. This inference engine only runs with the GGUF files provided. It gets tested against officially obtained logits at different context sizes. This project exists because we wanted to make one local model feel finished end to end, not just runnable. However, this is just alpha-quality code, so we are probably not there yet.
  • This is Metal-only. We may implement CUDA support in the future, but nothing more. The CPU path is only for correctness checks, but be warned: current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code. Remember? Software sucks. It was not possible to fix the CPU inference to avoid the crash, since each attempt requires restarting the computer, which is no fun. Help us, if you have the guts.

Acknowledgements to llama.cpp and GGML

ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are thankful and indebted to llama.cpp and its contributors. Their implementation, kernels, tests, and design choices were an essential reference while building this DeepSeek V4 Flash-specific inference path. Some source-level pieces are retained or adapted here under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain Metal kernels. For this reason, and because we are genuinely grateful, we keep the GGML authors copyright notice in our LICENSE file.

Model Weights

This implementation only works with the DeepSeek V4 Flash GGUFs published for this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files will not have the tensor layout, quantization mix, metadata, or optional MTP state expected by the engine. The 2-bit quantizations provided here are not a joke: they behave well, work under coding agents, and call tools reliably. The 2-bit quants use a very asymmetrical quantization: only the routed MoE experts are quantized, up/gate at IQ2_XXS and down at Q2_K. The routed experts account for most of the model's size; the other components (shared experts, projections, routing) are left untouched to guarantee quality.

Download one main model:

./download_model.sh q2   # 128 GB RAM machines
./download_model.sh q4   # >= 256 GB RAM machines

The script downloads from https://hf-mirror.com/antirez/deepseek-v4-gguf by default, stores files under ./gguf/, resumes partial downloads with curl -C -, checks that completed files start with the GGUF magic, and updates ./ds4flash.gguf to point at the selected q2/q4 model. Set HF_ENDPOINT to use another endpoint, for example:

HF_ENDPOINT=https://huggingface.co ./download_model.sh q2

Authentication is optional for public downloads, but --token TOKEN, HF_TOKEN, or the local Hugging Face token cache are used when present.
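The magic check mentioned above is easy to repeat by hand on a finished download. The following is a minimal Python sketch, assuming only that a GGUF file begins with the four ASCII bytes GGUF; the function name is illustrative and is not part of download_model.sh:

```python
GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic.

    This only inspects the first four bytes; it says nothing about
    whether the tensors inside match what the engine expects.
    """
    with open(path, "rb") as f:
        return f.read(len(GGUF_MAGIC)) == GGUF_MAGIC
```

A truncated download that stopped before the first byte, or an HTML error page saved under a .gguf name, fails this check; a file that passes may still be incomplete, which is why the script also resumes partial downloads.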

./download_model.sh mtp fetches the optional speculative decoding support GGUF. It can be used with both q2 and q4, but must be enabled explicitly with --mtp. The current MTP/speculative decoding path is still experimental: it is correctness-gated and currently provides at most a slight speedup, not a meaningful generation-speed win.

Then build:

make

./ds4flash.gguf is the default model path used by both binaries. Pass -m to select another supported GGUF from ./gguf/. Run ./ds4 --help and ./ds4-server --help for the full flag list.

Disk Weight Offload

This fork adds --disk-offload-weights for the Metal backend. The mode keeps the GGUF file as the only weight store: startup registers the legal tensor-data range, then each kernel lazily wraps the needed page-aligned GGUF range as a no-copy Metal buffer. It does not create a second weight format and does not copy weights into a separate offload directory.
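To make "page-aligned GGUF range" concrete: a tensor rarely starts or ends on a page boundary, so a no-copy wrapper has to cover a slightly larger range and remember where the tensor begins inside it. The sketch below shows that arithmetic in Python. The 16 KiB page size matches Apple Silicon macOS; the function name and return shape are assumptions for illustration, not the engine's code:

```python
PAGE_SIZE = 16384  # assumption: 16 KiB pages, as on Apple Silicon macOS


def page_aligned_range(offset, length, page=PAGE_SIZE):
    """Expand the byte range [offset, offset + length) to page boundaries.

    Returns (aligned_start, aligned_length, inner_offset), where
    inner_offset is where the tensor begins inside the aligned range.
    """
    start = (offset // page) * page              # round start down
    end = -(-(offset + length) // page) * page   # round end up
    return start, end - start, offset - start


# A 1000-byte tensor at file offset 20000 needs one whole page,
# and the tensor data begins 3616 bytes into that page.
print(page_aligned_range(20000, 1000))  # (16384, 16384, 3616)
```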

Use it when eager Metal model-view registration or residency pressure prevents startup:

DS4_METAL_MEMORY_REPORT=1 ./ds4 --disk-offload-weights \
  -p "Say hi." --nothink -n 1 --ctx 2048

The option is intentionally incompatible with --warm-weights, because warming the full tensor range defeats the purpose of demand paging:

./ds4 --disk-offload-weights --warm-weights
# ds4: --disk-offload-weights cannot be used with --warm-weights

ds4-server accepts the same flag:

./ds4-server --disk-offload-weights --ctx 100000 --kv-disk-dir /tmp/ds4-kv

Offload Observability

Set DS4_METAL_MEMORY_REPORT=1 to print memory reports during graph allocation, prefill, decode, and teardown. This is the default performance-safe report mode: it does not print after every offload layer. Add DS4_METAL_MEMORY_REPORT_LAYERS=1 only when you need the old layer-by-layer resident-memory curve while weights are being demand-paged.

The report includes:

  • process resident memory, physical footprint, compressed memory, and virtual size
  • runtime Metal tensor live and peak bytes
  • eager mmap model wrappers, if the normal path is used
  • lazy wrapper count, cumulative wrapped bytes, max single wrapped range
  • lazy wrapper-cache entry limit, live size, hot-reserve size, hit/miss counts, and evictions
  • lazy wrapper window size
  • scratch-buffer categories for attention, compressor, router, indexer, MoE, and rounding

For a 64 GB machine running the q2 GGUF in disk-offload mode, the current balanced preset is a 14 GiB lazy cache plus a 4 GiB hot reserve. Disk-offload prefill automatically keeps attention in the full prompt chunk while running routed MoE/FFN in 32-token microbatches for prompts above 32 tokens:

DS4_METAL_NO_RESIDENCY=1 \
DS4_METAL_OFFLOAD_MEMORY_CAP_MB=32768 \
DS4_METAL_OFFLOAD_MEMORY_GUARD_MB=512 \
DS4_METAL_LAZY_CACHE_MB=14336 \
DS4_METAL_HOT_WEIGHT_MB=4096 \
DS4_METAL_HOT_WEIGHT_MAX_ENTRY_MB=8 \
DS4_METAL_MOE_STATIC_PIN=shared \
DS4_METAL_MOE_MV_NSG=4 \
DS4_METAL_MOE_GHOST_ADMISSION=1 \
DS4_METAL_LAZY_CACHE_MAX_ENTRIES=16384 \
DS4_METAL_LAZY_WINDOW_MB=128 \
./ds4 --disk-offload-weights -p "Say hi." --nothink --temp 0 -n 100 --ctx 2048

Add observability only while measuring. These reports are useful, but they change timing and should stay off for normal runs:

DS4_METAL_MEMORY_REPORT=1 \
DS4_METAL_OFFLOAD_PROFILE_SUMMARY=1 \
DS4_METAL_NO_RESIDENCY=1 \
DS4_METAL_OFFLOAD_MEMORY_CAP_MB=32768 \
DS4_METAL_OFFLOAD_MEMORY_GUARD_MB=512 \
DS4_METAL_LAZY_CACHE_MB=14336 \
DS4_METAL_HOT_WEIGHT_MB=4096 \
DS4_METAL_HOT_WEIGHT_MAX_ENTRY_MB=8 \
DS4_METAL_MOE_STATIC_PIN=shared \
DS4_METAL_MOE_MV_NSG=4 \
DS4_METAL_MOE_GHOST_ADMISSION=1 \
DS4_METAL_LAZY_CACHE_MAX_ENTRIES=16384 \
DS4_METAL_LAZY_WINDOW_MB=128 \
./ds4 --disk-offload-weights -p "Say hi." --nothink --temp 0 -n 100 --ctx 2048

The main disk-offload optimization knobs are:

DS4_METAL_LAZY_CACHE_MB=2048   # default
DS4_METAL_LAZY_CACHE_MB=4096   # larger cache, more memory pressure
DS4_METAL_LAZY_CACHE_MB=0      # disable wrapper caching
DS4_METAL_LAZY_CACHE_MAX_ENTRIES=4096  # default
DS4_METAL_LAZY_CACHE_MAX_ENTRIES=8192  # larger working set, capped at 16384
DS4_METAL_HOT_WEIGHT_MB=2048   # default extra reserve for small hot wrappers
DS4_METAL_HOT_WEIGHT_MB=4096   # current 64 GB disk-offload decode preset
DS4_METAL_HOT_WEIGHT_MAX_ENTRY_MB=8    # default max wrapper size to pin
DS4_METAL_HOT_WEIGHT_MIN_HITS=2        # optional hits before a small wrapper is pinned
DS4_METAL_OFFLOAD_MEMORY_CAP_MB=30720  # optional total tracked offload cap
DS4_METAL_OFFLOAD_MEMORY_GUARD_MB=512  # default guard inside the cap
DS4_METAL_MOE_STATIC_PIN=all           # optional; default pins router only
DS4_METAL_MOE_GHOST_ADMISSION=1        # optional two-hit admission for compact MoE slices
DS4_METAL_NO_RESIDENCY=1               # skip Metal model residency requests
DS4_METAL_PREFILL_MOE_MICROBATCH=16    # override default 32-token disk-offload MoE/FFN microbatch
DS4_METAL_PREFILL_MOE_TELEMETRY=1      # per-layer prefill MoE unique expert and byte diagnostics
DS4_METAL_PREFILL_MOE_CACHE_COMPACT=1  # experiment: allow prefill compact MoE wrappers into cache
DS4_METAL_PREFILL_COMPACT_MOE_MAX_TOKENS=2048  # legacy experiment: compact MoE for larger prefill batches
DS4_METAL_PREFILL_SPLIT_LAYER_PARTS=0           # experiment: one command buffer per prefill layer

Use tracked offload memory, not process footprint, as the disk-offload budget number. The default startup pinning keeps only the small MoE router weights resident. DS4_METAL_MOE_GHOST_ADMISSION=1 is useful after the profile summary shows that one-shot compact-MoE wrappers are crowding out the cache. DS4_METAL_HOT_WEIGHT_MIN_HITS=2 is a narrower diagnostic knob; keep the default unless the hot reserve is visibly filled by one-shot wrappers. Prefill compact-MoE wrappers are transient by default, so they do not fill the decode cache. Set DS4_METAL_PREFILL_MOE_CACHE_COMPACT=1 only to compare cache pollution against the transient behavior. For a strict 30 GiB tracked Metal-offload budget, keep the same cache preset and set DS4_METAL_OFFLOAD_MEMORY_CAP_MB=30720.

On a 64 GB Mac with the q2 GGUF, larger lazy caches are not always faster. The following single-prompt --ctx 2048 -n 100 measurements showed the local sweet spot around 14-18 GiB of total wrapper cache:

| Lazy cache | Hot cache | Total wrapper cache | Prefill | Generation | Use when |
|---|---|---|---|---|---|
| 12288 MiB | 2048 MiB | 14 GiB | 5.11 t/s | 6.65 t/s | Short prompts and faster prefill |
| 14336 MiB | 2048 MiB | 16 GiB | 3.03 t/s | 8.35 t/s | Chat and longer decode |
| 14336 MiB | 4096 MiB | 18 GiB | 13.04 t/s | 10.08 t/s | Fastest clean short-prompt decode |
| 16384 MiB | 2048 MiB | 18 GiB | 4.86 t/s | 7.07 t/s | Middle ground |

If decode speed matters, start with DS4_METAL_LAZY_CACHE_MB=14336 and raise DS4_METAL_HOT_WEIGHT_MB to 4096. If first-token or short prompt latency matters more, try DS4_METAL_LAZY_CACHE_MB=12288. Treat higher cache sizes as machine-dependent; they may reduce evictions but still slow down due to Metal/VM and file-cache pressure. A later local hot-reserve sweep found 4048, 4096, and 4608 MiB close enough that 4096 is the recommended rounded setting.

Short-prompt --ctx 2048 -n 100 measurements are sensitive to system state, so pick the best repeatable local result rather than a single high-water mark:

| Lazy cache | Hot cache | Ghost admission | Generation |
|---|---|---|---|
| 13824 MiB | 3584 MiB | on | 10.28, 10.43, 10.40 t/s in one run; 4-6 t/s in later repeats |
| 14336 MiB | 4048 MiB | on | 10.08 t/s with 13.04 t/s prefill in a clean short-prompt run |
| 14336 MiB | 4096 MiB | on | 9.07, 9.68 t/s |
| 14336 MiB | 4096 MiB | off | 9.71 t/s |

DS4_METAL_MOE_GHOST_ADMISSION=1 is still useful when the summary reports many one-shot compact-MoE wrappers, but it is not always the dominant speed factor. For long prompts, the key prefill rule is: keep attention in the full prompt chunk, but run the routed MoE/FFN section in a smaller microbatch. Disk-offload mode defaults to a 32-token MoE microbatch for prompts above 32 tokens. On the local 64 GB q2 run, a 555-byte prompt improved from 4.97 t/s prefill with full 256-expert batch wrappers to 25-29 t/s prefill with the default 32-token MoE microbatch. Tracked offload memory stayed around 9.65 GiB because prefill compact wrappers were kept transient.
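The prefill rule above only changes how the routed MoE/FFN part of a chunk is scheduled. As a rough illustration of that schedule, here is a Python sketch that splits a prefill chunk into MoE microbatches; it models only the token ranges, the function name is hypothetical, and nothing here reflects the actual Metal code:

```python
def moe_microbatches(n_tokens, microbatch=32):
    """Split a prefill chunk into (start, count) MoE/FFN microbatches.

    Attention is assumed to have already run over all n_tokens at once;
    only the routed MoE/FFN section is walked in smaller pieces.
    """
    if microbatch <= 0:
        raise ValueError("microbatch must be positive")
    return [(start, min(microbatch, n_tokens - start))
            for start in range(0, n_tokens, microbatch)]


# A 100-token chunk with the default 32-token microbatch:
print(moe_microbatches(100))  # [(0, 32), (32, 32), (64, 32), (96, 4)]
```

Smaller microbatches touch fewer unique experts per step, which is why they reduce the bytes wrapped at once, at the cost of more command submissions.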

| Knob | What it optimizes | What to watch |
|---|---|---|
| DS4_METAL_LAZY_CACHE_MB | Total no-copy wrapper reuse | lazy wrapper cache ... live / ... limit and evictions |
| DS4_METAL_LAZY_CACHE_MAX_ENTRIES | Many small wrapper entries | cache entry count / entry limit; raise if the entry limit fills before bytes |
| DS4_METAL_LAZY_WINDOW_MB | Adjacent small tensor wrappers | lazy model wrapper spans ... total; 128 is a good first test |
| DS4_METAL_HOT_WEIGHT_MB | Small high-frequency wrappers | Evictions after the main cache fills; set 0 to disable |
| DS4_METAL_HOT_WEIGHT_MAX_ENTRY_MB | Which wrappers may be pinned | Keep small, for example 8, so large expert windows do not consume the hot reserve |
| DS4_METAL_HOT_WEIGHT_MIN_HITS | Avoid pinning one-shot wrappers | Default 0 for speed; try 2 only if the summary shows hot reserve churn from one-shot wrappers |
| DS4_METAL_OFFLOAD_MEMORY_CAP_MB | Hard tracked offload budget | tracked offload memory peak; cache shrinks before runtime tensors and scratch are counted over the cap |
| DS4_METAL_OFFLOAD_PROFILE_SUMMARY | Aggregated miss source diagnosis | Size/source summary for hits, misses, wraps, evictions, and MoE ghost skips |
| DS4_METAL_MOE_STATIC_PIN | Eager static MoE pinning | Default pins router weights only; set all or shared to include shared experts, 0 or DS4_METAL_DISABLE_MOE_STATIC_PIN=1 to disable |
| DS4_METAL_MOE_MV_NSG | Decode routed-MoE matvec threadgroup shape | Default 2; try 4 on M2 Max class machines. 1 is intentionally not accepted because it produced unstable greedy output in local testing |
| DS4_METAL_MOE_GHOST_ADMISSION | Reduce cache growth from one-shot compact MoE slices | Enable only after the summary shows compact MoE dominates misses |
| DS4_METAL_NO_RESIDENCY | Avoid Metal residency request overhead for mapped weights | Useful on macOS when disk-offload wrappers are already mmap-backed and residency requests do not improve speed |
| DS4_METAL_PREFILL_MOE_MICROBATCH | Independent MoE/FFN chunk size after full-chunk prefill attention | Default 32 in disk-offload mode; lower values reduce expert bytes, higher values reduce command overhead |
| DS4_METAL_PREFILL_MOE_TELEMETRY | Per-layer prefill MoE diagnosis | Prints chunk tokens, unique experts, compact bytes, full bytes, and wrapper sources |
| DS4_METAL_PREFILL_MOE_CACHE_COMPACT | Cache prefill compact-MoE wrappers | Off by default; enabling it can fill cache quickly and is mainly for A/B testing |
| DS4_METAL_PREFILL_COMPACT_MOE_MAX_TOKENS | Compact-MoE prefill admission threshold | Default 32; keep this low and prefer DS4_METAL_PREFILL_MOE_MICROBATCH for long prompts |
| DS4_METAL_PREFILL_SPLIT_LAYER_PARTS | Attention/FFN command-buffer scheduling during prefill | Default splits attention and FFN under disk offload; set 0 to test one command buffer per layer |

The most important relationship is the cache budget:

wrapper cache limit = min(
    DS4_METAL_LAZY_CACHE_MB + DS4_METAL_HOT_WEIGHT_MB,
    DS4_METAL_OFFLOAD_MEMORY_CAP_MB - runtime tensors - scratch - guard
)

DS4_METAL_LAZY_CACHE_MB is the main wrapper cache for no-copy Metal views over the mapped GGUF file. DS4_METAL_HOT_WEIGHT_MB is an extra reserve for small frequently reused wrappers, but it is still part of the tracked offload memory when a hard cap is set. DS4_METAL_OFFLOAD_MEMORY_CAP_MB is therefore not the same thing as cache size: it is the upper bound for lazy cache plus live runtime tensors plus scratch buffers. Use the reported tracked offload memory line to judge the cap, not process footprint.
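The budget relationship can be checked with a few lines of arithmetic before launching a run. This Python sketch restates the min(...) formula; the runtime-tensor and scratch figures in the example are made-up placeholders (read the real ones from the memory report), and clamping the result at zero is an assumption of this sketch:

```python
def wrapper_cache_limit_mb(lazy_mb, hot_mb, cap_mb=None,
                           runtime_mb=0, scratch_mb=0, guard_mb=512):
    """Effective wrapper cache limit in MiB.

    lazy_mb, hot_mb, cap_mb, guard_mb mirror DS4_METAL_LAZY_CACHE_MB,
    DS4_METAL_HOT_WEIGHT_MB, DS4_METAL_OFFLOAD_MEMORY_CAP_MB and
    DS4_METAL_OFFLOAD_MEMORY_GUARD_MB. runtime_mb and scratch_mb are
    whatever the memory report shows for live tensors and scratch.
    """
    configured = lazy_mb + hot_mb
    if cap_mb is None:          # no hard cap: only the configured sizes apply
        return configured
    under_cap = cap_mb - runtime_mb - scratch_mb - guard_mb
    return max(0, min(configured, under_cap))  # clamp is an assumption


# Hypothetical 10 GiB of runtime tensors and 2 GiB of scratch:
print(wrapper_cache_limit_mb(14336, 4096, cap_mb=32768,
                             runtime_mb=10240, scratch_mb=2048))  # 18432
print(wrapper_cache_limit_mb(14336, 4096, cap_mb=30720,
                             runtime_mb=10240, scratch_mb=2048))  # 17920
```

In the second call the cap, not the configured cache sizes, is the binding limit, which is the situation where raising DS4_METAL_LAZY_CACHE_MB alone changes nothing.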

The hot reserve is added on top of the lazy cache byte limit. By default, small eligible wrappers can enter it immediately; when the hot reserve fills, the least-recently used pinned entry is demoted. With DS4_METAL_OFFLOAD_MEMORY_CAP_MB set, the cache limit is also reduced by live runtime tensors, scratch buffers, and the guard. Disk-offload startup pins the small MoE router weights by default because every layer uses them. Shared-expert pinning is opt-in (DS4_METAL_MOE_STATIC_PIN=all or shared) since it costs about a GiB of cache space and can reduce room for routed expert slices under a 30 GiB cap. On the 64 GB M2 Max test machine, DS4_METAL_MOE_STATIC_PIN=shared was the most useful decode-side knob in this round: it reduced repeated shared-expert wrapper churn and helped short-prompt decode more than MTP or compact-MoE exact-set caching. If it helps, retune DS4_METAL_LAZY_CACHE_MB afterwards because pinned shared experts consume part of the wrapper-cache budget.

For decode profiling without flooding the terminal, set:

DS4_METAL_OFFLOAD_PROFILE=1

Each decoded token prints one compact line with elapsed time, lazy cache hit/miss/eviction deltas, newly wrapped GiB, cache entries, free cache budget, hot-reserve use, and tracked offload memory. Add DS4_METAL_OFFLOAD_PROFILE_SUMMARY=1 to print an end-of-run source and size bucket summary.

Small adjacent weights can be grouped into fixed no-copy wrapper windows:

DS4_METAL_LAZY_WINDOW_MB=128
DS4_METAL_LAZY_WINDOW_MB=256

The default is 0, which keeps the exact page-aligned range behavior. Tensors larger than the selected window continue to use exact wrappers.

Disk-offload mode compacts routed MoE expert weights by default on decode and short prefill MoE chunks. After router selection, it copies only the unique experts needed by the current token or prefill MoE microbatch into a small contiguous Metal buffer and runs the normal kernels against local expert IDs. This avoids wrapping all 256 experts every token and avoids turning long-prefill expert slices into long-lived decode cache entries. Disable compact MoE only for comparison:

DS4_METAL_DISABLE_COMPACT_MOE=1
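The core of compact MoE is an ID remap: global expert IDs chosen by the router become small local indices into the packed buffer. A minimal Python sketch of that remap follows. Packing experts in ascending global-ID order is an assumption made here for determinism, and the function name is hypothetical; the engine's actual ordering is not documented in this README:

```python
def compact_experts(routed_ids):
    """Remap global expert IDs to local indices into a compact buffer.

    routed_ids: one list of global expert IDs per token.
    Returns (unique_ids, local_ids): unique_ids is the packing order of
    the experts that must be copied, local_ids mirrors routed_ids with
    every ID replaced by its index in unique_ids.
    """
    unique_ids = sorted({e for token in routed_ids for e in token})
    local = {g: i for i, g in enumerate(unique_ids)}
    return unique_ids, [[local[e] for e in token] for token in routed_ids]


# Two tokens that share experts 7 and 13: only 4 of 256 experts are copied.
unique, local = compact_experts([[7, 200, 13], [13, 7, 42]])
print(unique)  # [7, 13, 42, 200]
print(local)   # [[0, 3, 1], [1, 0, 2]]
```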

Set DS4_METAL_COMPACT_MOE_PROFILE=1 or DS4_METAL_DECODE_STAGE_PROFILE=1 to time compact MoE router readback, expert wrap/blit, kernel encoding, and execution. Set DS4_METAL_PREFILL_MOE_TELEMETRY=1 when tuning prefill: each MoE chunk prints its layer, position, token count, unique expert count, compact bytes, full 256-expert bytes, compact/full mode, and wrapper source names.

Tune in this order:

  1. Use a fixed prompt, --temp 0, the same -n, and the same --ctx for all runs.
  2. Enable DS4_METAL_MEMORY_REPORT=1 and DS4_METAL_OFFLOAD_PROFILE_SUMMARY=1.
  3. Keep DS4_METAL_HOT_WEIGHT_MB=4096, DS4_METAL_HOT_WEIGHT_MAX_ENTRY_MB=8, DS4_METAL_LAZY_CACHE_MAX_ENTRIES=16384, and DS4_METAL_LAZY_WINDOW_MB=128 fixed for the first pass.
  4. Sweep only DS4_METAL_LAZY_CACHE_MB, for example 12288, 14336, 16384, 18432, and 20480.
  5. Pick by measured prefill/decode speed, not just eviction count. More cache can reduce evictions and still run slower because it increases Metal/VM and file-cache pressure.
  6. If compact-MoE gate/up/down dominate misses or evictions, try DS4_METAL_MOE_GHOST_ADMISSION=1 before raising cache size.
  7. If decode is still the priority, try DS4_METAL_MOE_STATIC_PIN=shared, then retune DS4_METAL_LAZY_CACHE_MB because shared pinning costs about 1.17 GiB of cache space.
  8. Try DS4_METAL_MOE_MV_NSG=4 only after output stability is confirmed with a greedy smoke. Keep the default 2 if it does not repeatably improve generation.
  9. For prompts above 32 tokens, keep the default 32-token MoE microbatch first; test DS4_METAL_PREFILL_MOE_MICROBATCH=16 only if memory pressure is more important than prefill speed.
  10. Use DS4_METAL_PREFILL_MOE_TELEMETRY=1 to confirm the chosen microbatch has fewer than 256 unique experts and compact bytes are well below full bytes.
  11. Raise DS4_METAL_LAZY_CACHE_MB further only if generation improves on the target machine.
  12. Add DS4_METAL_OFFLOAD_MEMORY_CAP_MB=30720 only when the tracked offload footprint must stay below 30 GiB.
  13. Try DS4_METAL_LAZY_WINDOW_MB=128 before 256; larger windows can reduce wrapper count but may waste cache space.
  14. Test DS4_METAL_PREFILL_SPLIT_LAYER_PARTS=0 separately. If it hurts decode or prefill, keep the default split scheduling.
  15. Disable compact MoE only to compare behavior, not as a normal setting.

A compact local sweep looks like this:

for lazy in 12288 14336 16384 18432 20480; do
  echo "=== DS4_METAL_LAZY_CACHE_MB=$lazy ==="
  DS4_METAL_MEMORY_REPORT=1 \
  DS4_METAL_OFFLOAD_PROFILE_SUMMARY=1 \
  DS4_METAL_OFFLOAD_MEMORY_CAP_MB=32768 \
  DS4_METAL_OFFLOAD_MEMORY_GUARD_MB=512 \
  DS4_METAL_LAZY_CACHE_MB=$lazy \
  DS4_METAL_HOT_WEIGHT_MB=4096 \
  DS4_METAL_HOT_WEIGHT_MAX_ENTRY_MB=8 \
  DS4_METAL_LAZY_CACHE_MAX_ENTRIES=16384 \
  DS4_METAL_LAZY_WINDOW_MB=128 \
  ./ds4 --disk-offload-weights -p "你是什么模型" --nothink --temp 0 -n 100 --ctx 2048
done

For each run, record prefill, generation, tracked offload memory peak, lazy wrapper cache ... live / ... limit, and the final summary's MoE evictions. If two settings are close, prefer the one with lower tracked peak; it leaves more room for longer context, OS file cache, and other apps.
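The "if two settings are close, prefer the lower tracked peak" rule can be applied mechanically once the numbers are written down. The Python sketch below does that for hand-recorded results; the dictionary keys, the 5% tolerance, and the sample figures are all hypothetical and are not output produced by ds4:

```python
def pick_setting(runs, tolerance=0.05):
    """Pick a sweep result: near-best generation speed, lowest tracked peak.

    runs: list of dicts with 'lazy_mb', 'gen_tps' and 'tracked_peak_mb',
    filled in by hand from each run's report.
    """
    best = max(r["gen_tps"] for r in runs)
    close = [r for r in runs if r["gen_tps"] >= best * (1 - tolerance)]
    return min(close, key=lambda r: r["tracked_peak_mb"])


# Made-up numbers, only to show the selection rule:
runs = [
    {"lazy_mb": 12288, "gen_tps": 9.9, "tracked_peak_mb": 20000},
    {"lazy_mb": 14336, "gen_tps": 10.0, "tracked_peak_mb": 22000},
    {"lazy_mb": 16384, "gen_tps": 7.0, "tracked_peak_mb": 24000},
]
print(pick_setting(runs)["lazy_mb"])  # 12288
```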

Performance Boundary

Disk weight offload is a memory-pressure escape hatch, not the normal fast path. On a 64 GB machine with the q2 GGUF, it may make a tiny smoke run possible, but generation can be orders of magnitude slower than the eager path because weights are repeatedly faulted from disk or file cache. Treat it as experimental and watch DS4_METAL_MEMORY_REPORT=1 before increasing context size, generated tokens, or DS4_METAL_LAZY_CACHE_MB.

Speed

These are single-run Metal CLI numbers with --ctx 32768, --nothink, greedy decoding, and -n 256. The short prompt is a normal small Italian story prompt. The long prompts exercise chunked prefill plus long-context decode. Q4 requires the larger-memory machine class, so M3 Max Q4 numbers are N/A.

| Machine | Quant | Prompt | Prefill | Generation |
|---|---|---|---|---|
| MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 t/s | 26.68 t/s |
| MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 t/s | 21.47 t/s |
| MacBook Pro M3 Max, 128 GB | q4 | short | N/A | N/A |
| MacBook Pro M3 Max, 128 GB | q4 | long | N/A | N/A |
| Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 t/s | 36.86 t/s |
| Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 t/s | 27.39 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 t/s | 35.50 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 t/s | 26.62 t/s |

CLI

One-shot prompt:

./ds4 -p "Explain Redis streams in one paragraph."

No -p starts the interactive prompt:

./ds4
ds4>

The interactive CLI is a real multi-turn DS4 chat. It keeps the rendered chat transcript and the live Metal KV checkpoint, so each turn extends the previous conversation. Useful commands are /help, /think, /think-max, /nothink, /ctx N, /read FILE, and /quit. Ctrl+C interrupts the current generation and returns to ds4>.

The CLI defaults to thinking mode. Use /nothink or --nothink for direct answers. --mtp MTP.gguf --mtp-draft 2 enables the optional MTP speculative path; it is useful only for greedy decoding, currently uses a confidence gate (--mtp-margin) to avoid slow partial accepts, and should be treated as an experimental slight-speedup path.

Server

Start a local OpenAI/Anthropic-compatible server:

./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The server is Metal-only. It keeps one mutable graph/KV checkpoint in memory, so stateless clients that resend a longer version of the same prompt can reuse the shared prefix instead of pre-filling from token zero.

Request parsing and sockets run in client threads, but inference itself is serialized through one Metal worker. The current server does not batch multiple independent requests together; concurrent requests wait their turn on the single live graph/session.

Supported endpoints:

  • GET /v1/models
  • GET /v1/models/deepseek-v4-flash
  • POST /v1/chat/completions
  • POST /v1/completions
  • POST /v1/messages

/v1/chat/completions accepts the usual OpenAI-style messages, max_tokens/max_completion_tokens, temperature, top_p, top_k, min_p, seed, stream, stream_options.include_usage, tools, and tool_choice. Tool schemas are rendered into DeepSeek's DSML tool format, and generated DSML tool calls are mapped back to OpenAI tool calls.

/v1/messages is the Anthropic-compatible endpoint used by Claude Code style clients. It accepts system, messages, tools, tool_choice, max_tokens, temperature, top_p, top_k, stream, stop_sequences, and thinking controls. Tool uses are returned as Anthropic tool_use blocks.

Both APIs support SSE streaming. In thinking mode, reasoning is streamed in the native API shape instead of being mixed into final text.

Minimal OpenAI example:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"deepseek-v4-flash",
    "messages":[{"role":"user","content":"List three Redis design principles."}],
    "stream":true
  }'
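The same request can be sent from a script without extra dependencies. This is a sketch using only the Python standard library; it assumes the stream follows the usual OpenAI SSE shape ("data: {json}" lines ending with "data: [DONE]", text under choices[0].delta.content), and the helper names are illustrative:

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8000/v1"):
    """Build the streaming chat request shown in the curl example."""
    body = {
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delta_text(sse_line):
    """Return the text carried by one SSE line, or '' if it has none."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return ""
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return ""
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return ""
    return choices[0].get("delta", {}).get("content") or ""


def stream_chat(prompt):
    """Print the reply as it streams in. Needs a running ds4-server."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        for raw in resp:
            text = delta_text(raw.decode("utf-8"))
            if text:
                print(text, end="", flush=True)
    print()
```

Call stream_chat("List three Redis design principles.") with the server running. In thinking mode the reasoning arrives in a separate field, which this sketch ignores.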

Agent Client Usage

ds4-server can be used by local coding agents that speak OpenAI-compatible chat completions. Start the server first, and set the client context limit no higher than the --ctx value you started the server with:

./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

You can use a larger context and a larger cache if you wish. A full 1M-token context uses roughly 26GB of memory (the compressed indexer alone is about 22GB), so configure a context that makes sense for your system. With 128GB of RAM you would run the 2-bit quants, which already take 81GB, so an extra 26GB is likely too much; a context window of 100~300k tokens is wiser.
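For a first guess at a sensible --ctx, the 26GB figure can be scaled down. The Python sketch below assumes context memory grows linearly with the number of tokens, which is a simplification made here and not a measured property of the engine; treat the result as a starting point and confirm with a real run:

```python
FULL_CTX_TOKENS = 1_000_000
FULL_CTX_GB = 26.0  # figure quoted above for a full 1M-token context


def estimate_ctx_gb(ctx_tokens):
    """Rough context memory estimate, assuming linear scaling."""
    return FULL_CTX_GB * ctx_tokens / FULL_CTX_TOKENS


for ctx in (100_000, 300_000, 1_000_000):
    print(f"--ctx {ctx}: about {estimate_ctx_gb(ctx):.1f} GB")
```

Under this assumption 100k tokens comes to about 2.6GB and 300k to about 7.8GB, which fits more comfortably next to an 81GB model on a 128GB machine than the full 26GB does.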

The 384000 output limit below avoids client-side token caps, since the model can otherwise generate very long replies (up to 384k tokens). The server still stops when the configured context window is full.

For opencode, add a provider and agent entry to ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ds4": {
      "name": "ds4.c (local)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:8000/v1",
        "apiKey": "dsv4-local"
      },
      "models": {
        "deepseek-v4-flash": {
          "name": "DeepSeek V4 Flash (ds4.c local)",
          "limit": {
            "context": 100000,
            "output": 384000
          }
        }
      }
    }
  },
  "agent": {
    "ds4": {
      "description": "DeepSeek V4 Flash served by local ds4-server",
      "model": "ds4/deepseek-v4-flash",
      "temperature": 0
    }
  }
}

For Pi, add a provider to ~/.pi/agent/models.json:

{
  "providers": {
    "ds4": {
      "name": "ds4.c local",
      "baseUrl": "http://127.0.0.1:8000/v1",
      "api": "openai-completions",
      "apiKey": "dsv4-local",
      "compat": {
        "supportsStore": false,
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": true,
        "supportsUsageInStreaming": true,
        "maxTokensField": "max_tokens",
        "supportsStrictMode": false,
        "thinkingFormat": "deepseek",
        "requiresReasoningContentOnAssistantMessages": true
      },
      "models": [
        {
          "id": "deepseek-v4-flash",
          "name": "DeepSeek V4 Flash (ds4.c local)",
          "reasoning": true,
          "thinkingLevelMap": {
            "off": null,
            "minimal": "low",
            "low": "low",
            "medium": "medium",
            "high": "high",
            "xhigh": "xhigh"
          },
          "input": ["text"],
          "contextWindow": 100000,
          "maxTokens": 384000,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

Optionally make it the default Pi model in ~/.pi/agent/settings.json:

{
  "defaultProvider": "ds4",
  "defaultModel": "deepseek-v4-flash"
}

For Claude Code, use the Anthropic-compatible endpoint. A wrapper like this matches the local ~/bin/claude-ds4 setup:

#!/bin/sh
unset ANTHROPIC_API_KEY

export ANTHROPIC_BASE_URL="${DS4_ANTHROPIC_BASE_URL:-http://127.0.0.1:8000}"
export ANTHROPIC_AUTH_TOKEN="${DS4_API_KEY:-dsv4-local}"
export ANTHROPIC_MODEL="deepseek-v4-flash"

export ANTHROPIC_CUSTOM_MODEL_OPTION="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="DeepSeek V4 Flash local ds4"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="ds4.c local GGUF"

export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_SUBAGENT_MODEL="deepseek-v4-flash"

export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_DISABLE_NONSTREAMING_FALLBACK=1
export CLAUDE_STREAM_IDLE_TIMEOUT_MS=600000

exec "$HOME/.local/bin/claude" "$@"

Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.

Thinking Modes

DeepSeek V4 Flash has distinct non-thinking, thinking, and Think Max modes. The server defaults to thinking mode. reasoning_effort=max requests Think Max, but it is only applied when the context size is large enough for the model card recommendation; smaller contexts fall back to normal thinking. OpenAI reasoning_effort=xhigh still maps to normal thinking, not Think Max.

For direct replies, use thinking: {"type":"disabled"}, think:false, or a non-thinking model alias such as deepseek-chat.

Disk KV Cache

Chat/completion APIs are stateless: agent clients usually resend the whole conversation every request. ds4-server handles this by comparing the rendered token stream with cached token prefixes. The live in-memory checkpoint covers the current session; the disk KV cache makes useful prefixes survive session switches and server restarts.
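
As an illustration of the prefix comparison, here is a minimal sketch over plain token ID arrays. The function names are hypothetical, and it assumes a checkpoint is only resumed when all of its tokens are a prefix of the request; the server's actual matching logic may be more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading token IDs shared by the checkpoint and the request. */
size_t common_prefix_len(const uint32_t *cached, size_t n_cached,
                         const uint32_t *req, size_t n_req)
{
    size_t n = n_cached < n_req ? n_cached : n_req;
    size_t i = 0;
    while (i < n && cached[i] == req[i])
        i++;
    return i;
}

/* Assumption: a checkpoint is reusable only if every cached token matches
 * the start of the request, so the saved KV state is still valid. */
int checkpoint_is_prefix(const uint32_t *cached, size_t n_cached,
                         const uint32_t *req, size_t n_req)
{
    return n_cached <= n_req &&
           common_prefix_len(cached, n_cached, req, n_req) == n_cached;
}
```

With a matching checkpoint, only the request tokens after the shared prefix need to be prefilled.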

For RAM reasons there is currently only one live KV cache in memory. When a new unrelated session replaces it, the old checkpoint can only be resumed without re-processing if it was written to the disk KV cache. In other words, the memory cache handles the active session; the disk cache is the resume mechanism for all other sessions.

Enable it with:

./ds4-server --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The cache key is the SHA1 of exact token IDs, not raw text. Each token ID is hashed as a little-endian 32-bit integer, and files are named <sha1>.kv. The file is intentionally written with ordinary read/write I/O, not mmap, so restoring cache entries does not add more VM mappings to a process that already maps the model.
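
To make the key derivation concrete, here is a self-contained sketch that feeds each token ID to SHA-1 as four little-endian bytes and formats the file name. It is not the ds4.c code: the function names are made up, the SHA-1 routine is a plain textbook implementation, and lowercase hex is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t rol32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* One SHA-1 compression step over a 64-byte block. */
static void sha1_block(uint32_t h[5], const uint8_t blk[64])
{
    uint32_t w[80];
    for (int i = 0; i < 16; i++)
        w[i] = (uint32_t)blk[4*i] << 24 | (uint32_t)blk[4*i+1] << 16 |
               (uint32_t)blk[4*i+2] << 8 | (uint32_t)blk[4*i+3];
    for (int i = 16; i < 80; i++)
        w[i] = rol32(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1);
    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    for (int i = 0; i < 80; i++) {
        uint32_t f, k;
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        uint32_t t = rol32(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rol32(b, 30); b = a; a = t;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Writes "<40 hex chars>.kv" plus NUL into name (44 bytes).
 * Hypothetical helper; lowercase hex is an assumption. */
void kv_cache_file_name(const uint32_t *tokens, size_t n, char name[44])
{
    uint32_t h[5] = {0x67452301u, 0xEFCDAB89u, 0x98BADCFEu,
                     0x10325476u, 0xC3D2E1F0u};
    uint8_t blk[64];
    size_t fill = 0;
    for (size_t i = 0; i < n; i++) {
        /* Each token ID is hashed as a little-endian 32-bit integer. */
        blk[fill++] = (uint8_t)(tokens[i]);
        blk[fill++] = (uint8_t)(tokens[i] >> 8);
        blk[fill++] = (uint8_t)(tokens[i] >> 16);
        blk[fill++] = (uint8_t)(tokens[i] >> 24);
        if (fill == 64) { sha1_block(h, blk); fill = 0; }
    }
    /* SHA-1 padding: 0x80, zeros, then the bit length as 64-bit big-endian. */
    uint64_t bits = (uint64_t)n * 32;
    blk[fill++] = 0x80;
    if (fill > 56) {
        memset(blk + fill, 0, 64 - fill);
        sha1_block(h, blk);
        fill = 0;
    }
    memset(blk + fill, 0, 56 - fill);
    for (int i = 0; i < 8; i++)
        blk[56 + i] = (uint8_t)(bits >> (56 - 8 * i));
    sha1_block(h, blk);
    for (int i = 0; i < 5; i++)
        snprintf(name + 8 * i, 9, "%08x", (unsigned)h[i]);
    memcpy(name + 40, ".kv", 4);
}
```

Because the key is computed over token IDs, two prompts that render to the same text but tokenize differently get different cache files.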

On disk, a cache file is:

KVC fixed header, 48 bytes
u32 rendered_text_bytes
rendered_text_bytes of UTF-8-ish token text
DS4 session payload, payload_bytes from the KVC header

The fixed header is little-endian:

0   u8[3]  magic = "KVC"
3   u8     version = 1
4   u8     routed expert quant bits, currently 2 or 4
5   u8     save reason: 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown
6   u8[2]  reserved
8   u32    cached token count
12  u32    hit count
16  u32    context size the snapshot was written for
20  u8[4]  reserved
24  u64    creation Unix time
32  u64    last-used Unix time
40  u64    DS4 session payload byte count
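
The table above translates directly into a byte-wise parser. The sketch below uses explicit little-endian reads so it does not depend on host byte order or struct padding; the struct and function names are illustrative and not taken from ds4.c.

```c
#include <stdint.h>
#include <string.h>

/* Field names are illustrative; offsets and widths follow the table above. */
typedef struct {
    uint8_t  version, quant_bits, save_reason;
    uint32_t token_count, hit_count, ctx_size;
    uint64_t created_at, last_used_at, payload_bytes;
} kvc_header;

static uint32_t rd_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t rd_u64le(const uint8_t *p)
{
    return (uint64_t)rd_u32le(p) | (uint64_t)rd_u32le(p + 4) << 32;
}

/* Returns 0 on success, -1 if the magic or version is not recognized. */
int kvc_parse_header(const uint8_t buf[48], kvc_header *h)
{
    if (memcmp(buf, "KVC", 3) != 0 || buf[3] != 1)
        return -1;
    h->version       = buf[3];
    h->quant_bits    = buf[4];   /* 2 or 4 */
    h->save_reason   = buf[5];   /* 0 unknown .. 4 shutdown */
    h->token_count   = rd_u32le(buf + 8);
    h->hit_count     = rd_u32le(buf + 12);
    h->ctx_size      = rd_u32le(buf + 16);
    h->created_at    = rd_u64le(buf + 24);
    h->last_used_at  = rd_u64le(buf + 32);
    h->payload_bytes = rd_u64le(buf + 40);
    return 0;
}
```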

The rendered text is the tokenizer-decoded text for the cached token prefix. It is stored only for observability, so humans can inspect a cache directory without decoding token IDs. It is not used as the key and it is not trusted when loading; after load, the stored checkpoint tokens must still match the incoming request prefix.

The DS4 session payload starts with thirteen little-endian u32 fields:

0   magic = "DSV4"
1   payload version = 1
2   saved context size
3   prefill chunk size
4   raw KV ring capacity
5   raw sliding-window length
6   compressed KV capacity
7   checkpoint token count
8   layer count
9   raw/head KV dimension
10  indexer head dimension
11  vocabulary size
12  live raw rows serialized below

Then it stores:

  • u32[token_count] checkpoint token IDs.
  • float32[vocab_size] logits for the next token after that checkpoint.
  • u32[layer_count] compressed attention row counts.
  • u32[layer_count] ratio-4 indexer row counts.
  • For every layer: the live raw sliding-window KV rows, written in logical position order rather than physical ring order.
  • For compressed layers: live compressed KV rows and compressor frontier tensors.
  • For ratio-4 compressed layers: live indexer compressed rows and indexer frontier tensors.
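
"Logical position order rather than physical ring order" can be illustrated with a small copy routine. This sketch assumes the row for token position p lives in physical slot p % ring_cap; that mapping, the function name, and the byte-oriented row type are assumptions, and ds4.c may index its ring differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies the live rows of a ring buffer into dst, oldest position first.
 * Assumes position p is stored in slot p % ring_cap (not confirmed by the
 * text). Requires live_rows <= n_tokens and live_rows <= ring_cap. */
void ring_copy_logical(uint8_t *dst, const uint8_t *ring, size_t ring_cap,
                       size_t row_bytes, size_t n_tokens, size_t live_rows)
{
    size_t first = n_tokens - live_rows;   /* oldest live position */
    for (size_t i = 0; i < live_rows; i++) {
        size_t slot = (first + i) % ring_cap;
        memcpy(dst + i * row_bytes, ring + slot * row_bytes, row_bytes);
    }
}
```

Writing rows this way means a loader does not need to know where the ring's write cursor was when the snapshot was taken.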

The logits are raw IEEE-754 float32 values from the host ds4_session buffer. They are saved immediately after the checkpoint tokens so a loaded snapshot can sample or continue from the exact next-token distribution without running one extra decode step. MTP draft logits/state are not persisted; after loading a disk checkpoint the draft state is invalidated and rebuilt by normal generation.
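
For orientation, the byte offsets of the fixed-size arrays can be derived from the thirteen-field header. The sketch below assumes the sections are stored back to back with no padding, in the order listed above; that layout detail and all names here are assumptions, not something confirmed by the ds4.c source.

```c
#include <stdint.h>

enum { DSV4_HEADER_FIELDS = 13 };   /* thirteen u32 header fields */

typedef struct {
    uint64_t tokens_off;        /* u32[token_count] checkpoint token IDs */
    uint64_t logits_off;        /* float32[vocab_size] next-token logits */
    uint64_t comp_rows_off;     /* u32[layer_count] compressed row counts */
    uint64_t indexer_rows_off;  /* u32[layer_count] indexer row counts */
    uint64_t tensors_off;       /* start of the per-layer KV tensor data */
} dsv4_layout;

static uint32_t rd_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* payload points at the first byte of the DS4 session payload. */
dsv4_layout dsv4_fixed_layout(const uint8_t *payload)
{
    uint32_t token_count = rd_u32le(payload + 4 * 7);   /* field 7 */
    uint32_t layer_count = rd_u32le(payload + 4 * 8);   /* field 8 */
    uint32_t vocab_size  = rd_u32le(payload + 4 * 11);  /* field 11 */
    dsv4_layout l;
    l.tokens_off       = 4u * DSV4_HEADER_FIELDS;
    l.logits_off       = l.tokens_off + 4 * (uint64_t)token_count;
    l.comp_rows_off    = l.logits_off + 4 * (uint64_t)vocab_size;
    l.indexer_rows_off = l.comp_rows_off + 4 * (uint64_t)layer_count;
    l.tensors_off      = l.indexer_rows_off + 4 * (uint64_t)layer_count;
    return l;
}
```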

The tensor payload is DS4-specific KV/session state, not a generic inference graph dump. It is expected to be portable only across compatible ds4.c builds for this model layout.

The cache stores checkpoints at four moments:

  • cold: after a long first prompt reaches a stable prefix, before generation.
  • continued: when prefill or generation advances the live conversation by the configured interval.
  • evict: before an unrelated request replaces the live in-memory session.
  • shutdown: when the server exits cleanly.

Cold saves intentionally trim a small token suffix and align down to a prefill chunk boundary. This avoids common BPE boundary retokenization misses when a future request appends text to the same prompt. The defaults are conservative: store prefixes of at least 512 tokens, cold-save prompts up to 30000 tokens, trim 32 tail tokens, and align to 2048-token chunks. The important knobs are:

  • --kv-cache-min-tokens
  • --kv-cache-cold-max-tokens
  • --kv-cache-continued-interval-tokens
  • --kv-cache-boundary-trim-tokens
  • --kv-cache-boundary-align-tokens
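
The trim-then-align rule for cold saves can be sketched as follows. The struct and function names are hypothetical, the mapping of flags to fields is inferred from the defaults listed above, and the edge cases (prompts over the cold maximum, results below the minimum) are assumptions about behavior the text does not spell out.

```c
#include <stdint.h>

/* Hypothetical policy struct; defaults are the ones quoted in the text. */
typedef struct {
    uint32_t min_tokens;       /* --kv-cache-min-tokens, 512 */
    uint32_t cold_max_tokens;  /* --kv-cache-cold-max-tokens, 30000 */
    uint32_t trim_tokens;      /* --kv-cache-boundary-trim-tokens, 32 */
    uint32_t align_tokens;     /* --kv-cache-boundary-align-tokens, 2048 */
} kv_cold_policy;

/* Returns how many prompt tokens a cold save would checkpoint,
 * or 0 for "do not save". */
uint32_t kv_cold_save_len(uint32_t prompt_tokens, const kv_cold_policy *p)
{
    if (prompt_tokens > p->cold_max_tokens)
        return 0;                        /* assumed: too long, skip */
    if (prompt_tokens <= p->trim_tokens)
        return 0;
    uint32_t n = prompt_tokens - p->trim_tokens;   /* drop the tail */
    if (p->align_tokens > 1)
        n -= n % p->align_tokens;        /* align down to a chunk boundary */
    return n >= p->min_tokens ? n : 0;   /* assumed: too short, skip */
}
```

With the defaults, a 25000-token prompt is trimmed to 24968 tokens and aligned down to 24576, which is 12 chunks of 2048 tokens.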

By default, checkpoints may be reused across the 2-bit and 4-bit routed-expert variants if the token prefix matches. Use --kv-cache-reject-different-quant when you want strict same-quant reuse only.

The cache directory is disposable. If behavior looks suspicious, stop the server and remove it. To see what is cached, run hexdump on the .kv files: each one contains the verbatim text of the cached prompt.
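
The rendered text can also be extracted programmatically. The sketch below skips the 48-byte KVC header, reads the u32 length, and returns the text; the function name is hypothetical, and it assumes the length field is little-endian like the header fields.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the rendered text that follows the 48-byte KVC header.
 * Returns a malloc'd NUL-terminated string (caller frees) or NULL on error.
 * Assumes the u32 length is little-endian like the header fields. */
char *kvc_read_rendered_text(FILE *f)
{
    uint8_t len_le[4];
    if (fseek(f, 48, SEEK_SET) != 0)
        return NULL;
    if (fread(len_le, 1, 4, f) != 4)
        return NULL;
    uint32_t n = (uint32_t)len_le[0] | (uint32_t)len_le[1] << 8 |
                 (uint32_t)len_le[2] << 16 | (uint32_t)len_le[3] << 24;
    char *text = malloc((size_t)n + 1);
    if (text == NULL)
        return NULL;
    if (fread(text, 1, n, f) != n) {
        free(text);
        return NULL;
    }
    text[n] = '\0';
    return text;
}
```

Since the text is stored only for observability, a tool like this is for inspection and should not be used to decide whether a checkpoint matches a request.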

Backends

The default backend is Metal:

./ds4 -p "Hello" --metal

There is also a CPU reference/debug path:

./ds4 -p "Hello" --cpu

Do not treat the CPU path as the production target. The server is Metal-only, and the optimized implementation lives in the Metal graph path. This may change in the future.

Test Vectors

tests/test-vectors contains short and long-context continuation vectors captured from the official DeepSeek V4 Flash API. The requests use deepseek-v4-flash, greedy decoding, thinking disabled, and the maximum top_logprobs slice exposed by the API. Local vectors are generated with ./ds4 --dump-logprobs and compared by token bytes, so tokenizer/template or attention regressions show up before they become long generation failures.

All project tests are driven by the C runner:

make test                  # ./ds4_test --all
./ds4_test --logprob-vectors
./ds4_test --server
