Tags · val1813/kaiwu

v0.3.2

fix: VRAM detection inflated by Resizable BAR + MoE partial OOM guard + RTX PRO bandwidth

- Windows Resizable BAR causes nvidia-smi XML to report shared GPU memory
  as dedicated VRAM (e.g. 4070 showing 31GB). Now cross-checks XML vs CSV
  and caps to knownMaxVRAM() lookup table.
- MoE partial mode now refuses when model_size > 1.2x VRAM, forcing
  moe_offload to prevent guaranteed OOM on small-VRAM cards.
- Added RTX PRO 6000/5000/4500/4000/2000 to estimateBandwidth() and
  knownMaxVRAM() tables.
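
The two guards above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `capVRAM`, `allowMoEPartial`, and the `knownMaxVRAMMB` map are hypothetical names, the table entry is illustrative, and taking the smaller of the XML and CSV readings is an assumption about how the cross-check might work.

```go
package main

// knownMaxVRAMMB is a hypothetical stand-in for the knownMaxVRAM() lookup
// table. The single entry is illustrative (an RTX 4070 ships with 12 GB).
var knownMaxVRAMMB = map[string]int{
	"NVIDIA GeForce RTX 4070": 12288,
}

// capVRAM cross-checks the XML and CSV readings (both in MB), prefers the
// smaller one (assumption), and then caps the result to the table value
// when the card is known.
func capVRAM(name string, xmlMB, csvMB int) int {
	vram := xmlMB
	if csvMB > 0 && csvMB < vram {
		vram = csvMB
	}
	if limit, ok := knownMaxVRAMMB[name]; ok && vram > limit {
		vram = limit
	}
	return vram
}

// allowMoEPartial reports whether partial MoE mode is acceptable. It refuses
// when the model is larger than 1.2x VRAM, in which case the caller would
// fall back to moe_offload.
func allowMoEPartial(modelGB, vramGB float64) bool {
	return modelGB <= 1.2*vramGB
}
```

With these definitions, a 4070 whose XML reading is inflated to about 31GB is capped at 12288 MB even if the CSV reading is inflated too, and a 20GB model on a 16GB card is refused partial mode because 20 > 19.2.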

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

May 6, 2026
e4ed413

v0.3.1

docs: add v0.3.1 changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 29, 2026
f8e3dde

v0.3.0

fix: --fit on conflicts with --cpu-moe/--n-cpu-moe (root cause of MoE OOM)

ik_llama.cpp docs explicitly state --fit cannot be combined with
--cpu-moe, --n-cpu-moe, or -ot. We were passing both, causing
--fit to override MoE layer placement and try to fit everything
in VRAM → OOM on 16GB cards with 20GB models.

Now:
- full_gpu: uses --fit on (automatic layer allocation)
- moe_offload: uses --cpu-moe only (no --fit)
- moe_partial: uses --n-cpu-moe N only (no --fit)

All MoE modes rely on -ngl 999 (already in args) to put all
non-expert layers on GPU.

Also increased calcMoEMode overhead from 1GB to 2.5GB to account
for KV cache + compute buffer space.
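
The per-mode flag selection described above might look like the sketch below. The function name `moeFlags` and its signature are hypothetical; only the mode names and the flags themselves come from the commit message.

```go
package main

import "strconv"

// moeFlags returns the placement flags for one mode. --fit is only emitted
// for full_gpu, so it is never combined with --cpu-moe or --n-cpu-moe.
// The caller is assumed to add -ngl 999 separately.
func moeFlags(mode string, nCPUMoE int) []string {
	switch mode {
	case "moe_offload":
		return []string{"--cpu-moe"}
	case "moe_partial":
		return []string{"--n-cpu-moe", strconv.Itoa(nCPUMoE)}
	default: // full_gpu
		return []string{"--fit", "on"}
	}
}
```

Keeping the choice in a single switch makes the mutual exclusion structural: no code path can emit both `--fit` and an MoE placement flag.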

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 29, 2026
0940690

v0.2.9

fix: MoE models skip -sm graph on multi-GPU (expert buffer explosion)

-sm graph (tensor parallel) splits each compute graph node across GPUs.
For MoE models with 128 expert layers, this causes massive buffer
allocation that exceeds VRAM even on dual 3090 (48GB) with a 25GB model.

MoE models now always use layer split (--tensor-split or default),
which distributes layers across GPUs without duplicating expert buffers.

Also added missing-.so detection to the isLikelyOOM exclusion list.
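
A minimal sketch of the split-mode decision, assuming a hypothetical helper `splitModeArgs`; returning no flags stands for the default layer split mentioned above.

```go
package main

// splitModeArgs picks the multi-GPU split flags. MoE models never get
// -sm graph; they fall through to the default layer split (no flag).
func splitModeArgs(gpuCount int, isMoE, graphSupported bool) []string {
	if gpuCount > 1 && !isMoE && graphSupported {
		return []string{"-sm", "graph"}
	}
	return nil
}
```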

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 28, 2026
2d85db8

v0.2.8

fix: -sm graph runtime detection + isLikelyOOM parameter error exclusion

1. SupportsGraphSplit(): checks --help output for "graph" support
   before passing -sm graph. Uses sync.Once for thread safety.
   Prevents process exit on binaries that don't support graph split.

2. buildArgs/BuildArgs: added binaryPath parameter so graph split
   detection uses the correct binary (not modelPath).

3. isLikelyOOM: excludes parameter errors ("invalid value",
   "unknown argument", "unrecognized option") and timeouts from
   OOM detection. Prevents false ctx-halving retry loops.
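
The detection and exclusion logic above could be sketched as follows. All names are hypothetical lowercase stand-ins for the functions mentioned in the commit, and the positive "out of memory" check is an assumption, since the commit only describes what is excluded.

```go
package main

import (
	"os/exec"
	"strings"
	"sync"
)

var (
	graphOnce sync.Once
	graphOK   bool
)

// helpMentionsGraph reports whether the --help text mentions "graph".
func helpMentionsGraph(helpText string) bool {
	return strings.Contains(helpText, "graph")
}

// supportsGraphSplit runs `<binary> --help` at most once and caches the
// answer. Note that sync.Once caches the result for the first binaryPath
// only. A missing binary yields empty output and therefore false.
func supportsGraphSplit(binaryPath string) bool {
	graphOnce.Do(func() {
		out, _ := exec.Command(binaryPath, "--help").CombinedOutput()
		graphOK = helpMentionsGraph(string(out))
	})
	return graphOK
}

// paramErrorMarkers are the parameter-error strings listed in the commit.
var paramErrorMarkers = []string{
	"invalid value", "unknown argument", "unrecognized option",
}

// isLikelyOOM excludes timeouts and parameter errors before looking for an
// OOM marker. The "out of memory" substring is an assumed positive signal.
func isLikelyOOM(stderr string, timedOut bool) bool {
	if timedOut {
		return false
	}
	lower := strings.ToLower(stderr)
	for _, m := range paramErrorMarkers {
		if strings.Contains(lower, m) {
			return false
		}
	}
	return strings.Contains(lower, "out of memory")
}
```

Checking the exclusions first is what stops a bad flag from being misread as OOM and triggering a ctx-halving retry.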

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 28, 2026
1c92429

v0.2.7

feat: MTP speculative decoding + n-gram lookup + KV defrag

Three inference optimizations:

1. MTP (--num-speculative-tokens 3): Qwen3.6 models have native MTP
   heads, 40-80% speed boost. NativeMTP field was already in profile
   but never passed to llama-server.

2. N-gram lookup (--lookup 8): zero-cost speculative decoding for
   models without MTP. 20-50% speedup on code/structured output.

3. KV defrag (--defrag-thold 0.1): auto-compact KV cache when
   fragmentation > 10%. Prevents effective ctx from shrinking
   during long conversations.
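
One way the three flags might be assembled is sketched below. `inferenceFlags` is a hypothetical helper, and treating MTP and n-gram lookup as mutually exclusive is an assumption drawn from "for models without MTP"; the flag names and values are the ones quoted in the commit.

```go
package main

// inferenceFlags always enables KV defrag, then picks one speculative
// decoding strategy: native MTP when the model has MTP heads, otherwise
// n-gram lookup.
func inferenceFlags(nativeMTP bool) []string {
	args := []string{"--defrag-thold", "0.1"}
	if nativeMTP {
		return append(args, "--num-speculative-tokens", "3")
	}
	return append(args, "--lookup", "8")
}
```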

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 28, 2026
3336bfc

v0.2.6

fix: skip --kv-unified on multi-GPU (KV cache lands on GPU 0 only)

--kv-unified allocates the entire KV cache as one contiguous block
on a single device (GPU 0). On dual 3090s (24GB each), the model splits
across both cards (~12.5GB each), but the whole KV cache goes to GPU 0.
That leaves only 11.5GB on GPU 0, so even an 8K ctx OOMs.

Now skip --kv-unified when GPUCount > 1. llama.cpp falls back to
paged allocation which spreads KV cache across all devices.
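
The headroom arithmetic and the resulting guard can be sketched like this. Both function names are hypothetical, and the even split of the model across GPUs is a simplifying assumption.

```go
package main

// freeOnGPU0GB estimates the VRAM left on GPU 0 after the model is split
// evenly across all GPUs (simplifying assumption).
func freeOnGPU0GB(vramPerGPUGB, modelGB float64, gpuCount int) float64 {
	return vramPerGPUGB - modelGB/float64(gpuCount)
}

// kvArgs emits --kv-unified only on single-GPU setups, so the KV cache is
// not pinned to GPU 0 when several devices are present.
func kvArgs(gpuCount int) []string {
	if gpuCount > 1 {
		return nil
	}
	return []string{"--kv-unified"}
}
```

For the dual 3090 case, `freeOnGPU0GB(24, 25, 2)` gives 24 - 12.5 = 11.5, which is all that a unified KV cache would have to fit into.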

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 28, 2026
893506d

v0.2.5

fix: skip --kv-unified on multi-GPU (KV cache lands on GPU 0 only)

--kv-unified allocates the entire KV cache as one contiguous block
on a single device (GPU 0). On dual 3090s (24GB each), the model splits
across both cards (~12.5GB each), but the whole KV cache goes to GPU 0.
That leaves only 11.5GB on GPU 0, so even an 8K ctx OOMs.

Now skip --kv-unified when GPUCount > 1. llama.cpp falls back to
paged allocation which spreads KV cache across all devices.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 28, 2026
893506d

v0.2.4

fix: Fingerprint panic on empty ComputeCap (P6000 and older GPUs)

Old code used hardcoded slice indices to remove dot from ComputeCap
string, which panics when the string is empty or too short (e.g.
nvidia-smi returns empty compute_cap on some older drivers).

Replaced with strings.ReplaceAll which handles any input safely.
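
A minimal sketch of the safe variant, assuming a hypothetical helper name `computeCapDigits`; the actual Fingerprint code is not shown in the commit message.

```go
package main

import "strings"

// computeCapDigits removes the dot from a compute capability string such
// as "8.9". Unlike fixed slice indices (e.g. cc[0:1] + cc[2:3]), which
// panic on an empty or short string, ReplaceAll accepts any input.
func computeCapDigits(cc string) string {
	return strings.ReplaceAll(cc, ".", "")
}
```

An empty `compute_cap` from an older driver simply yields an empty string, which the caller can handle without a panic.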

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 27, 2026
bbf1971

v0.2.3

docs: add v0.2.3 changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Apr 27, 2026
e30acab
