Skip to content

HerculeWu/ds4

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

271 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🌋 ds4 — run a frontier-scale LLM on the computer you already own

🚀 An 81 GB DeepSeek-V4-Flash model, generating coherent text on a 6 GB gaming GPU. Not in the cloud, not on a $10k workstation — on an ordinary desktop. This fork breaks the assumption that big models need big iron.

A fork of antirez/ds4 (the DwarfStar engine). Fork: https://github.com/HerculeWu/ds4 · Full engine docs: README.upstream.md


💥 The breakthrough

Local LLMs have a memory wall. The conventional rule is that the model has to fit: enough RAM, or enough VRAM, to hold the weights. That rule quietly locks frontier-scale models away from everyone without a high-memory Mac or a datacenter card.

This project breaks that rule. We run DeepSeek-V4-Flash — an 81 GB mixture-of-experts model that punches far above small local models — on a consumer machine whose GPU holds barely a fifteenth of it in VRAM. The model never fits, and it runs anyway.

The key is that a MoE model doesn't use all of itself at once. Each token activates only 6 of 256 experts per layer. So instead of demanding the model fit in memory, we stream the few weights each token actually needs through a three-tier cache — disk to RAM to VRAM — and let the GPU act as a pure accelerator over whatever fits. Most tokens never touch the disk.

The point isn't one clever machine. The point is generality: if a too-big model can run on a too-small GPU at a usable speed, then the floor of "what hardware can run a real LLM" drops out from under the old assumption. 🏠 The computer on your desk is very likely already enough.

⚙️ How it works: SSD → RAM → VRAM

   SSD (the full ~81 GB model, mmap-backed)
      │   cold experts stream up on demand
      ▼
   Host RAM (warm-expert LRU cache)
      │   hot experts served over PCIe, no disk hit
      ▼
   VRAM (the small slot-bank the GPU computes on)

The CPU is the memory manager; the GPU is the accelerator:

  • A VRAM slot-bank holds only as many experts as your VRAM allows, against a ~258-expert-per-token working set. An LRU keeps the hot set resident and reuses cross-token carryover.
  • A host-RAM LRU tier sits between disk and VRAM. On a VRAM miss it serves from RAM over PCIe with no disk read. Warm, this tier hits ~65–80 % of the time — close to the locality ceiling we measured before writing a line of GPU code.
  • The dense backbone (attention, shared expert, output head) lives in a host-RAM residency cache instead of being re-read from disk every token — which, it turned out, was the single biggest early bottleneck.

📈 Whatever memory you bring, more of it simply means a higher hit rate and a faster decode. The design scales up with your machine, not just down to ours.

🔧 What we changed relative to upstream

Upstream DwarfStar is a superb, narrow engine for DeepSeek V4 Flash — but its entry point is big iron (96–128 GB Macs, DGX Spark). Our work, ~3,600 lines confined to the CUDA/loader path (the Metal path and public contract are untouched), brings the floor down to a commodity GPU. It went in as phases:

  • Phase 1 — measurement. CPU diagnostics (--tensor-budget, DS4_LOG_ROUTER) to measure byte budget and expert locality before writing cache code. Verdict: ~77 % LRU hit rate achievable → proceed.
  • Phase 2 — CUDA slot-bank. A pure-C, host-unit-tested LRU/hash core (ds4_slotbank_core.h) wired into the routed-MoE launch — the VRAM tier, and the fix for the Turing host-pointer crash.
  • Phase 3 — backbone streaming ring. A lazy, VRAM-aware streaming ring for the dense backbone, per-row token_embd gather (no 1 GiB transient per embed), VRAM-aware prefill caps, and fail-loud guards so a cache miss can never silently corrupt logits.
  • Phase 4 — backbone host-RAM cache. Keep the backbone in RAM instead of re-reading ~8.8 GB from disk every token. ⚡ ~4.3× on decode by itself.
  • Phase 5 — routed-expert host-RAM LRU tier. The RAM tier of the hierarchy: a pinned-host slab reusing the same LRU core, feeding the VRAM slot-bank. ⚡ ~1.9× more on decode.
  • Generalization / scale-up. Fail-loud capability gates (unsupported quant or a 384-expert PRO router now abort()s with a named message instead of emitting silent garbage), shape-derived sizing, and a RAM tier that auto-grows to the full expert pool on a big-memory box — so disk drops out entirely after warmup. Still V4-architecture-specific by design; not a generic GGUF runner.

We kept upstream's rules: mmap-backed loading (no eager full copy), no C++, CPU path as reference only, diagnostics behind flags.

📊 The progress, in token speed

Here is the honest journey, measured on the kind of machine that was never supposed to load this model — a single GTX 1660 Ti with 6 GB of VRAM, 31 GB of system RAM, and a plain SSD. Same model, same prompt, greedy decode:

  • 🐌 0.06 t/s — the naive floor, every token bottlenecked on disk reads.
  • 🚶 ~0.30 t/s — once the backbone stays resident in RAM (Phase 4).
  • 🏃 ~0.42 t/s — once hot experts are cached in RAM too (Phase 5).
  • 🚀 ~0.44 t/s — with a larger warm cache.

That's roughly a climb end to end, on a six-year-old gaming GPU. If that card can sustain a coherent stream of tokens from an 81 GB model, a more modern desktop — more VRAM, more RAM, a faster SSD — has more of every resource the cache feeds on, and the same code simply runs warmer and quicker on it.

🔒 Correctness was non-negotiable: decode with the cache on vs off produces byte-identical tokens (the cache changes only where a weight comes from, never its value), and the loud capability gates mean an unsupported configuration stops immediately rather than producing plausible nonsense.

We also learned where the ceiling is. After Phase 5, warm decode is dominated by fixed per-layer overhead and dequant/matmul compute, not weight movement — the tiering approach is near its practical floor here. The remaining multiplicative lever is MTP / speculative decode, currently blocked because this GGUF ships with the MTP head stripped.

🛠️ Install & use

You need an NVIDIA GPU with CUDA, nvcc, a C compiler, and the DeepSeek-V4-Flash GGUF (~81 GB) on an SSD.

1. Get the model

./download_model.sh          # see the script for the GGUF it fetches
ln -s /path/to/DeepSeek-V4-Flash-IQ2XXS-...gguf ds4flash.gguf

2. Build (CUDA)

make cuda CUDA_ARCH=native     # detect the local GPU (easiest)
make cuda CUDA_ARCH=sm_75      # or name your card's compute capability
make cpu                       # CPU-only reference build (debug, slow)

This produces ./ds4 (CLI), ./ds4-server (OpenAI/Anthropic-compatible HTTP API), plus ds4-bench, ds4-eval, and ds4-agent.

3. Run the demo

./flash-demo.sh "Explain why the sky is blue."
VERBOSE=1 ./flash-demo.sh           # watch the SSD/RAM/VRAM hit rates climb
echo "long prompt..." | ./flash-demo.sh -

Or drive the CLI directly:

./ds4 -m ds4flash.gguf -c 4096 -n 200 --temp 0 \
      -sys "You are a helpful assistant." \
      -p "Write a haiku about tiered memory."

The first token is slow — the cache is warming from disk. After that it settles into steady-state decode.

Tuning knobs

Environment variable Effect
DS4_CUDA_EXPERT_RAM_CACHE_GB Host-RAM expert-cache size in GiB. 0 disables; unset = sensible default; larger = higher hit rate, more RAM used.
DS4_CUDA_WEIGHT_CACHE_VERBOSE=1 Print per-layer host/VRAM hit/miss and tier init.
DS4_CUDA_SLOTBANK_RESERVE_MB VRAM reserved outside the slot-bank.
DS4_BIGMEM=1 Scale-up mode (see below). Turn the big-hardware profile on at runtime on any build; DS4_BIGMEM=0 turns it off on a cuda-bigmem build.
DS4_CUDA_BACKBONE_VRAM=0 Force the dense backbone to stream through the ring even in scale-up mode (per-tier override of the VRAM-resident backbone).

A bigger cache helps most in long, single-session runs (a coding agent, say), where cross-token expert reuse keeps climbing past a short benchmark's average.

Scale-up mode: don't smash the ceiling with the floor

The tiering above is built so a frontier model fits on a tiny GPU. But the same engine should also use a big one. Scale-up mode raises the defaults of the tiers so that, when VRAM and host RAM are plentiful, the SSD→RAM→VRAM 3-tier collapses toward 2-tier (RAM↔VRAM, disk inert after warmup) and, for the dense weights, toward 1-tier (all-VRAM):

make clean && make cuda-bigmem      # native arch; or: make cuda-bigmem CUDA_ARCH=sm_86
# …or build normally and flip it at runtime, no rebuild (easiest for A/B):
DS4_BIGMEM=1 ./ds4 -m ds4flash.gguf -c 8192 -n 400 -p ""

(make clean is needed when switching between a normal and a cuda-bigmem build because make tracks file timestamps, not compiler flags — same as switching CUDA_ARCH. The runtime DS4_BIGMEM=1 avoids that entirely.)

On e.g. an A40 (48 GB) + 128 GB RAM box this changes the defaults to:

  • Dense backbone → VRAM-resident. Uploaded once to a persistent device slab instead of re-streaming ~8.8 GiB host→device every token (and the per-token output-head transient disappears too). This is the biggest decode win when VRAM is large.
  • Full routed-expert pool → RAM-resident. The host expert tier grows to hold every expert (the disk tier goes inert after warmup). This already happens with the env left unset; scale-up also makes it fire when you've set an explicit (smaller) DS4_CUDA_EXPERT_RAM_CACHE_GB — the value becomes a floor, not a cap.
  • VRAM expert slot-bank → clamped to the model. It stops greedily grabbing all free VRAM and claims only what the full expert set can use, leaving room for the resident backbone and KV.
  • KV cache → device VRAM for normal contexts, spilling to host (managed memory) only for genuinely huge contexts that don't fit VRAM.
  • Prefill → larger VRAM-safe chunks (fewer batches; identical logits).

Every individual DS4_CUDA_* knob still overrides the scale-up default, and the inference path is unchanged — scale-up only moves numeric defaults, never forks the math. Run with DS4_CUDA_WEIGHT_CACHE_VERBOSE=1 to confirm the residency plan engaged (you'll see the backbone-VRAM-resident line, the expert RAM auto-scale-to-full-pool line, and the slot-bank sizing).

⚠️ Honest caveat: testing is thin

We want to be upfront: this fork is under-tested.

  • Almost all measurement happened on one machine with one IQ2-XXS GGUF. We have not validated the scale-up path on a bigger box; the portable probe (tests/scaleup_family_probe.sh) exists for that but hasn't been run there yet.
  • The PRO (384-expert) variant and non-Flash quantizations are not exercised. They're gated to fail loud — a safety net, not proof they work.
  • The golden long-context vector (long_story_4096) is currently red on the 6 GB card — identically on pristine upstream and on our branch, so it's a pre-existing wide-prefill / low-VRAM issue, not a regression we introduced. Still: we have not demonstrated a correct full 4096-token prefill on this hardware. We've proven decode is byte-identical cache-on vs cache-off, and that short prompts work.
  • Performance numbers are wall-clock observations from individual runs, not a rigorous statistical benchmark, and they're sensitive to VRAM pressure.

Treat this as beta-quality research code that clearly works on the hardware it was built for, with correctness invariants checked carefully where we checked them — but broad cross-hardware validation is exactly the gap we know about. Reports from other machines are very welcome.

🙏 Acknowledgements

This fork stands entirely on antirez/ds4 / DwarfStar, which in turn exists thanks to llama.cpp and GGML. See README.upstream.md for the full engine documentation and acknowledgements — all of which we keep and second.

About

DeepSeek 4 Flash local inference engine for Metal and CUDA

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • C 59.8%
  • Objective-C 15.4%
  • Cuda 12.4%
  • Metal 7.3%
  • Python 3.8%
  • Shell 1.0%
  • Other 0.3%