Tags: val1813/kaiwu
fix: VRAM detection inflated by Resizable BAR + MoE partial OOM guard + RTX PRO bandwidth
- Windows Resizable BAR causes nvidia-smi XML to report shared GPU memory as dedicated VRAM (e.g. 4070 showing 31GB). Now cross-checks XML vs CSV and caps to knownMaxVRAM() lookup table.
- MoE partial mode now refuses when model_size > 1.2x VRAM, forcing moe_offload to prevent guaranteed OOM on small-VRAM cards.
- Added RTX PRO 6000/5000/4500/4000/2000 to estimateBandwidth() and knownMaxVRAM() tables.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
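The capping step and the 1.2x guard described in this commit could look roughly like the sketch below. The table contents and the names `knownMaxMiB`, `capVRAM`, and `allowMoEPartial` are illustrative assumptions, not the project's actual code. Matching on the longest table entry is also an assumption, chosen so that a name such as "RTX 4070 Ti SUPER" is not capped by the shorter "RTX 4070" entry.

```go
package main

import (
	"fmt"
	"strings"
)

// knownMaxMiB is a hypothetical stand-in for the knownMaxVRAM() table.
// The entries are illustrative and not copied from the project.
var knownMaxMiB = map[string]int{
	"RTX 4070":          12288,
	"RTX 4070 Ti SUPER": 16384,
	"RTX 3090":          24576,
}

// capVRAM clamps a reported VRAM figure to the limit of the longest
// table entry contained in the GPU name. Unknown GPUs pass through.
func capVRAM(gpuName string, reportedMiB int) int {
	best, limit := "", 0
	for model, lim := range knownMaxMiB {
		if strings.Contains(gpuName, model) && len(model) > len(best) {
			best, limit = model, lim
		}
	}
	if best != "" && reportedMiB > limit {
		return limit
	}
	return reportedMiB
}

// allowMoEPartial mirrors the guard from the commit message:
// partial mode is refused when the model exceeds 1.2x VRAM.
func allowMoEPartial(modelGB, vramGB float64) bool {
	return modelGB <= 1.2*vramGB
}

func main() {
	fmt.Println(capVRAM("NVIDIA GeForce RTX 4070", 31744))
	fmt.Println(allowMoEPartial(20, 16))
}
```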
fix: --fit on conflicts with --cpu-moe/--n-cpu-moe (root cause of MoE OOM)
ik_llama.cpp docs explicitly state --fit cannot be combined with --cpu-moe, --n-cpu-moe, or -ot. We were passing both, causing --fit to override MoE layer placement and try to fit everything in VRAM → OOM on 16GB cards with 20GB models.
Now:
- full_gpu: uses --fit on (automatic layer allocation)
- moe_offload: uses --cpu-moe only (no --fit)
- moe_partial: uses --n-cpu-moe N only (no --fit)
All MoE modes rely on -ngl 999 (already in args) to put all non-expert layers on GPU.
Also increased calcMoEMode overhead from 1GB to 2.5GB to account for KV cache + compute buffer space.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MoE models skip -sm graph on multi-GPU (expert buffer explosion)
-sm graph (tensor parallel) splits each compute graph node across GPUs. For MoE models with 128 expert layers, this causes massive buffer allocation that exceeds VRAM even on dual 3090 (48GB) with a 25GB model.
MoE models now always use layer split (--tensor-split or default), which distributes layers across GPUs without duplicating expert buffers.
Also added .so missing detection to isLikelyOOM exclusion list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
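A minimal sketch of the split-mode decision, assuming the caller already knows whether the model is MoE and whether the binary supports graph split. `splitModeArgs` and its parameters are hypothetical names; returning no flags stands for the default layer split.

```go
package main

import "fmt"

// splitModeArgs adds -sm graph only for dense models on multi-GPU setups
// whose binary supports it. MoE models always get the default layer split.
func splitModeArgs(isMoE bool, gpuCount int, graphSupported bool) []string {
	if gpuCount > 1 && !isMoE && graphSupported {
		return []string{"-sm", "graph"}
	}
	return nil
}

func main() {
	fmt.Println(splitModeArgs(false, 2, true)) // dense model, dual GPU
	fmt.Println(splitModeArgs(true, 2, true))  // MoE model, dual GPU
}
```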
fix: -sm graph runtime detection + isLikelyOOM parameter error exclusion
1. SupportsGraphSplit(): checks --help output for "graph" support
before passing -sm graph. Uses sync.Once for thread safety.
Prevents process exit on binaries that don't support graph split.
2. buildArgs/BuildArgs: added binaryPath parameter so graph split
detection uses the correct binary (not modelPath).
3. isLikelyOOM: excludes parameter errors ("invalid value",
"unknown argument", "unrecognized option") and timeouts from
OOM detection. Prevents false ctx-halving retry loops.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
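The exclusion logic in point 3 might be sketched as follows. The three parameter-error strings come from the commit message; the `"timeout"` and `"out of memory"` markers and the name `looksLikeOOM` are assumptions made for illustration, since the message does not say how the real isLikelyOOM recognizes timeouts or genuine OOMs.

```go
package main

import (
	"fmt"
	"strings"
)

// notOOMMarkers lists substrings that rule out an OOM diagnosis.
// "timeout" is an assumed marker; the others are quoted in the commit.
var notOOMMarkers = []string{
	"invalid value",
	"unknown argument",
	"unrecognized option",
	"timeout",
}

// looksLikeOOM reports whether stderr suggests an out-of-memory failure,
// after first excluding parameter errors and timeouts.
func looksLikeOOM(stderr string) bool {
	s := strings.ToLower(stderr)
	for _, m := range notOOMMarkers {
		if strings.Contains(s, m) {
			return false
		}
	}
	// Assumed positive marker; the real detection may check more patterns.
	return strings.Contains(s, "out of memory")
}

func main() {
	fmt.Println(looksLikeOOM("CUDA error: out of memory"))
	fmt.Println(looksLikeOOM("error: unknown argument: -sm graph"))
}
```

Checking the exclusions first means a bad flag is reported as such rather than triggering another ctx-halving retry.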
feat: MTP speculative decoding + n-gram lookup + KV defrag
Three inference optimizations:
1. MTP (--num-speculative-tokens 3): Qwen3.6 models have native MTP heads, 40-80% speed boost. NativeMTP field was already in profile but never passed to llama-server.
2. N-gram lookup (--lookup 8): zero-cost speculative decoding for models without MTP. 20-50% speedup on code/structured output.
3. KV defrag (--defrag-thold 0.1): auto-compact KV cache when fragmentation > 10%. Prevents effective ctx from shrinking during long conversations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: skip --kv-unified on multi-GPU (KV cache lands on GPU 0 only)
--kv-unified allocates the entire KV cache as one contiguous block on a single device (GPU 0). On dual 3090 (24GB each), the model splits across both cards (~12.5GB each), but the whole KV cache goes to GPU 0. That leaves only 11.5GB on GPU 0, so even 8K ctx OOMs.
Now skip --kv-unified when GPUCount > 1. llama.cpp falls back to paged allocation, which spreads the KV cache across all devices.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
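The rule itself is a one-line condition. The sketch below also reproduces the headroom arithmetic from the dual-3090 example. `kvCacheArgs` and `gpu0HeadroomGB` are hypothetical names, and the headroom calculation ignores compute buffers and driver overhead.

```go
package main

import "fmt"

// kvCacheArgs passes --kv-unified only on single-GPU setups.
func kvCacheArgs(gpuCount int) []string {
	if gpuCount > 1 {
		return nil
	}
	return []string{"--kv-unified"}
}

// gpu0HeadroomGB is what remains on GPU 0 after its share of the model.
func gpu0HeadroomGB(vramGB, modelShareGB float64) float64 {
	return vramGB - modelShareGB
}

func main() {
	fmt.Println(kvCacheArgs(1), kvCacheArgs(2))
	fmt.Println(gpu0HeadroomGB(24, 12.5)) // the dual-3090 case from the message
}
```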
fix: Fingerprint panic on empty ComputeCap (P6000 and older GPUs)
Old code used hardcoded slice indices to remove dot from ComputeCap string, which panics when the string is empty or too short (e.g. nvidia-smi returns empty compute_cap on some older drivers).
Replaced with strings.ReplaceAll which handles any input safely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
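The difference between the two approaches can be shown in a few lines. The old code is not quoted in the commit, so the slicing form in the comment is only a guess at what "hardcoded slice indices" looked like, and `capDigits` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// capDigits removes the dot from a compute capability string such as "8.9".
// A slicing form like computeCap[0:1] + computeCap[2:3] (assumed shape of
// the old code) panics with a bounds error on "" or any string shorter
// than three bytes. strings.ReplaceAll accepts any input, including "".
func capDigits(computeCap string) string {
	return strings.ReplaceAll(computeCap, ".", "")
}

func main() {
	fmt.Printf("%q %q %q\n", capDigits("8.9"), capDigits("6.1"), capDigits(""))
}
```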