
Tags: val1813/kaiwu


v0.3.2

fix: VRAM detection inflated by Resizable BAR + MoE partial OOM guard + RTX PRO bandwidth

- On Windows, Resizable BAR causes the nvidia-smi XML output to report
  shared GPU memory as dedicated VRAM (e.g. a 4070 showing 31GB).
  Detection now cross-checks the XML value against the CSV value and
  caps the result at the knownMaxVRAM() lookup table.
- moe_partial mode is now refused when model_size > 1.2x VRAM;
  moe_offload is forced instead, preventing a guaranteed OOM on
  small-VRAM cards.
- Added RTX PRO 6000/5000/4500/4000/2000 to estimateBandwidth() and
  knownMaxVRAM() tables.
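
The two guards above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `sanitizeVRAM`, `resolveMoEMode`, and the tiny `knownCeilings` table are hypothetical names, and preferring the smaller of the two readings is an assumption about what "cross-checks" means.

```go
package main

import (
	"fmt"
	"strings"
)

// vramCeiling pairs a GPU name fragment with its dedicated VRAM in MiB.
type vramCeiling struct {
	substr string
	mib    int
}

// knownCeilings is a hypothetical, deliberately tiny stand-in for the
// knownMaxVRAM() table. More specific names come first so that
// "RTX 4070 Ti SUPER" is matched before "RTX 4070".
var knownCeilings = []vramCeiling{
	{"RTX 4070 Ti SUPER", 16384},
	{"RTX 4070", 12288},
	{"RTX 3090", 24576},
}

// sanitizeVRAM takes the smaller positive reading of the two sources
// (assumption) and then caps it at the table ceiling for that GPU.
func sanitizeVRAM(gpuName string, xmlMiB, csvMiB int) int {
	v := xmlMiB
	if csvMiB > 0 && (v <= 0 || csvMiB < v) {
		v = csvMiB
	}
	for _, c := range knownCeilings {
		if strings.Contains(gpuName, c.substr) {
			if v > c.mib {
				v = c.mib
			}
			break
		}
	}
	return v
}

// resolveMoEMode downgrades moe_partial to moe_offload when the model
// is more than 1.2x the available VRAM, as described in the commit.
func resolveMoEMode(requested string, modelGB, vramGB float64) string {
	if requested == "moe_partial" && modelGB > 1.2*vramGB {
		return "moe_offload"
	}
	return requested
}

func main() {
	// Both readings inflated to 31GB: the table cap still applies.
	fmt.Println(sanitizeVRAM("NVIDIA GeForce RTX 4070", 31744, 31744)) // 12288
	// 20GB model on a 16GB card: 20 > 19.2, so partial mode is refused.
	fmt.Println(resolveMoEMode("moe_partial", 20, 16)) // moe_offload
}
```

The cap matters because the cross-check alone cannot help when both sources are inflated in the same way.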

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.3.1

docs: add v0.3.1 changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.3.0

fix: --fit on conflicts with --cpu-moe/--n-cpu-moe (root cause of MoE OOM)

ik_llama.cpp docs explicitly state --fit cannot be combined with
--cpu-moe, --n-cpu-moe, or -ot. We were passing both, causing
--fit to override MoE layer placement and try to fit everything
in VRAM → OOM on 16GB cards with 20GB models.

Now:
- full_gpu: uses --fit on (automatic layer allocation)
- moe_offload: uses --cpu-moe only (no --fit)
- moe_partial: uses --n-cpu-moe N only (no --fit)

All MoE modes rely on -ngl 999 (already in args) to put all
non-expert layers on GPU.

Also increased calcMoEMode overhead from 1GB to 2.5GB to account
for KV cache + compute buffer space.
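
The per-mode flag selection described above might look like the sketch below. `placementArgs` is a hypothetical helper, not the project's buildArgs; the flags are the ones named in this commit message, and `-ngl 999` is assumed to already be in the base arguments.

```go
package main

import (
	"fmt"
	"strconv"
)

// placementArgs appends exactly one placement strategy to a copy of
// base, so --fit is never combined with --cpu-moe or --n-cpu-moe.
func placementArgs(base []string, mode string, nCPUMoE int) []string {
	args := append([]string{}, base...) // copy; leave the caller's slice alone
	switch mode {
	case "full_gpu":
		args = append(args, "--fit", "on")
	case "moe_offload":
		args = append(args, "--cpu-moe")
	case "moe_partial":
		args = append(args, "--n-cpu-moe", strconv.Itoa(nCPUMoE))
	}
	return args
}

func main() {
	base := []string{"-ngl", "999"}
	for _, mode := range []string{"full_gpu", "moe_offload", "moe_partial"} {
		fmt.Println(mode, placementArgs(base, mode, 24))
	}
}
```

Putting the choice in a single switch makes the mutual exclusion structural: there is no code path that can emit both flags.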

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.9

fix: MoE models skip -sm graph on multi-GPU (expert buffer explosion)

-sm graph (tensor parallel) splits each compute graph node across GPUs.
For MoE models with 128 expert layers, this causes massive buffer
allocation that exceeds VRAM even on dual 3090 (48GB) with a 25GB model.

MoE models now always use layer split (--tensor-split or default),
which distributes layers across GPUs without duplicating expert buffers.

Also added .so missing detection to isLikelyOOM exclusion list.
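
As a rough sketch of the rule above, with `splitModeArgs` as a hypothetical helper: graph split is requested only for non-MoE models on multi-GPU systems whose binary supports it, and every other case falls through to the default layer split.

```go
package main

import "fmt"

// splitModeArgs returns the extra split-mode flags, or nil to keep the
// default layer split. MoE models never get -sm graph.
func splitModeArgs(isMoE bool, gpuCount int, graphSupported bool) []string {
	if gpuCount > 1 && !isMoE && graphSupported {
		return []string{"-sm", "graph"}
	}
	return nil
}

func main() {
	fmt.Println(splitModeArgs(true, 2, true))  // MoE on dual GPU: no extra flags
	fmt.Println(splitModeArgs(false, 2, true)) // dense model on dual GPU
}
```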

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.8

fix: -sm graph runtime detection + isLikelyOOM parameter error exclusion

1. SupportsGraphSplit(): checks --help output for "graph" support
   before passing -sm graph. Uses sync.Once for thread safety.
   Prevents process exit on binaries that don't support graph split.

2. buildArgs/BuildArgs: added binaryPath parameter so graph split
   detection uses the correct binary (not modelPath).

3. isLikelyOOM: excludes parameter errors ("invalid value",
   "unknown argument", "unrecognized option") and timeouts from
   OOM detection. Prevents false ctx-halving retry loops.
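
A minimal sketch of items 1 and 3, under stated assumptions: the function names are lowercase stand-ins rather than the project's own, the help check is a plain substring match, and "out of memory" as the positive OOM signal is an assumption (the commit only lists what is excluded).

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

var (
	graphOnce sync.Once
	graphOK   bool
)

// helpMentionsGraph is the pure part of the check, kept separate so it
// can be tested without running a binary.
func helpMentionsGraph(help string) bool {
	return strings.Contains(strings.ToLower(help), "graph")
}

// supportsGraphSplit runs `<binary> --help` at most once per process.
// Note that sync.Once caches the result for the first path it sees.
func supportsGraphSplit(binaryPath string) bool {
	graphOnce.Do(func() {
		// A failed exec leaves out empty, which reads as "unsupported".
		out, _ := exec.Command(binaryPath, "--help").CombinedOutput()
		graphOK = helpMentionsGraph(string(out))
	})
	return graphOK
}

// notOOM lists the parameter-error markers named in the commit message.
var notOOM = []string{"invalid value", "unknown argument", "unrecognized option"}

// isLikelyOOM rejects timeouts and parameter errors before looking for
// an out-of-memory marker (the marker string is an assumption).
func isLikelyOOM(stderr string, timedOut bool) bool {
	if timedOut {
		return false
	}
	s := strings.ToLower(stderr)
	for _, m := range notOOM {
		if strings.Contains(s, m) {
			return false
		}
	}
	return strings.Contains(s, "out of memory")
}

func main() {
	fmt.Println(supportsGraphSplit("/nonexistent/llama-server")) // false
	fmt.Println(isLikelyOOM("error: unknown argument: -sm", false))
	fmt.Println(isLikelyOOM("CUDA error: out of memory", false))
}
```

Checking the exclusions first is what stops a rejected flag from being retried with a halved context, since no context size will make an unknown argument valid.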

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.7

feat: MTP speculative decoding + n-gram lookup + KV defrag

Three inference optimizations:

1. MTP (--num-speculative-tokens 3): Qwen3.6 models have native MTP
   heads, 40-80% speed boost. NativeMTP field was already in profile
   but never passed to llama-server.

2. N-gram lookup (--lookup 8): zero-cost speculative decoding for
   models without MTP. 20-50% speedup on code/structured output.

3. KV defrag (--defrag-thold 0.1): auto-compact KV cache when
   fragmentation > 10%. Prevents effective ctx from shrinking
   during long conversations.
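
One way to wire the three options together, as a sketch: `speedArgs` is a hypothetical helper, the flags and values are copied from this commit message and not verified against any particular llama-server build, and treating MTP and n-gram lookup as mutually exclusive is an assumption based on "for models without MTP".

```go
package main

import "fmt"

// speedArgs picks one speculative-decoding strategy based on whether
// the model profile reports native MTP heads, then always adds the
// KV defrag threshold.
func speedArgs(nativeMTP bool) []string {
	var args []string
	if nativeMTP {
		args = append(args, "--num-speculative-tokens", "3")
	} else {
		args = append(args, "--lookup", "8")
	}
	return append(args, "--defrag-thold", "0.1")
}

func main() {
	fmt.Println(speedArgs(true))
	fmt.Println(speedArgs(false))
}
```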

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.6

fix: skip --kv-unified on multi-GPU (KV cache lands on GPU 0 only)

--kv-unified allocates the entire KV cache as one contiguous block
on a single device (GPU 0). On dual 3090s (24GB each), the model splits
across both cards (~12.5GB each), but the whole KV cache goes to GPU 0.
That leaves only 11.5GB free on GPU 0, so even an 8K ctx OOMs.

Now skip --kv-unified when GPUCount > 1. llama.cpp falls back to
paged allocation which spreads KV cache across all devices.
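
The headroom arithmetic and the new rule can be sketched as follows. `headroomGB` and `kvArgs` are hypothetical helpers, and the even split of the model across GPUs is a simplification.

```go
package main

import "fmt"

// headroomGB estimates the VRAM left on one GPU after an even split of
// the model weights across all GPUs.
func headroomGB(vramPerGPU, modelGB float64, gpuCount int) float64 {
	return vramPerGPU - modelGB/float64(gpuCount)
}

// kvArgs emits --kv-unified only on single-GPU systems.
func kvArgs(gpuCount int) []string {
	if gpuCount > 1 {
		return nil
	}
	return []string{"--kv-unified"}
}

func main() {
	// Dual 3090 with a 25GB model: 24 - 12.5 = 11.5GB per card, and with
	// --kv-unified the entire KV cache has to fit in GPU 0's share.
	fmt.Println(headroomGB(24, 25, 2)) // 11.5
	fmt.Println(kvArgs(2), kvArgs(1))
}
```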

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.5

fix: skip --kv-unified on multi-GPU (KV cache lands on GPU 0 only)

--kv-unified allocates the entire KV cache as one contiguous block
on a single device (GPU 0). On dual 3090s (24GB each), the model splits
across both cards (~12.5GB each), but the whole KV cache goes to GPU 0.
That leaves only 11.5GB free on GPU 0, so even an 8K ctx OOMs.

Now skip --kv-unified when GPUCount > 1. llama.cpp falls back to
paged allocation which spreads KV cache across all devices.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.4

fix: Fingerprint panic on empty ComputeCap (P6000 and older GPUs)

The old code used hardcoded slice indices to remove the dot from the
ComputeCap string, which panics when the string is empty or too short
(e.g. nvidia-smi returns an empty compute_cap on some older drivers).

Replaced with strings.ReplaceAll which handles any input safely.
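
The difference can be shown with a small sketch. `indexStrip` is a hypothetical stand-in for the old approach (the actual indices used are not given here), and `panics` is only a test helper.

```go
package main

import (
	"fmt"
	"strings"
)

// indexStrip assumes the form "X.Y" and slices around index 1.
// It panics on "" and on any one-character string.
func indexStrip(cc string) string {
	return cc[:1] + cc[2:]
}

// safeStrip removes every dot and is safe for any input, including "".
func safeStrip(cc string) string {
	return strings.ReplaceAll(cc, ".", "")
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println(safeStrip("8.9"))                     // 89
	fmt.Println(safeStrip("") == "")                  // true
	fmt.Println(panics(func() { indexStrip("") }))    // true
	fmt.Println(indexStrip("12.0"), safeStrip("12.0")) // 1.0 120
}
```

Besides the panic, fixed indices also give the wrong result for two-digit major versions, which the last line illustrates.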

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.2.3

docs: add v0.2.3 changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>