llm-fit

Run more of an oversized LLM on your GPU. llm-fit is an LD_PRELOAD shim that lifts the number of layers tools like Ollama place on the GPU right up to the real edge of VRAM, with no Modelfile tweaks and no model variants.

It redirects large CUDA allocations to managed (unified) memory and adjusts the free VRAM the application sees, so layer placement is decided against the true capacity of the card plus a small, capped overflow into system RAM.

There are two distinct wins. For runtimes with a CPU fallback like Ollama, llm-fit is a speed upgrade: on an RTX 3070 Laptop (8GB), llama3.1:8b q8 goes from 27/33 layers at ~13 tok/s stock to 31/33 layers at ~22 tok/s. For GPU-only runtimes like vLLM, it is a capability upgrade: a 9.4GB model that stock vLLM refuses with CUDA out of memory loads and runs. See Findings for the measurements.


Background

Capable coding models are often just over the VRAM of a consumer GPU. An 8GB card, for example, falls a few hundred MB short of a quantized 14B model, forcing the runtime to keep some layers on the CPU.

CUDA Unified Memory lets a single allocation span both VRAM and system RAM. The NVIDIA driver migrates pages on demand, keeping active pages in VRAM and moving idle ones to system RAM, transparently to the application. Routing a model's allocations through unified memory places the whole model on the GPU and lets it overflow into RAM as needed.

llm-fit does this automatically. It also reports additional free VRAM to the application so more layers are assigned to the GPU than a stock install would dare, while a built-in cap keeps the placement short of the overflow that kills throughput.


Findings

Measured on an RTX 3070 Laptop (8GB VRAM, PCIe 4.0 x8) with llama3.1:8b-instruct-q8_0 (9.2GB), using the same prompt, a 20-token budget, 8 threads, and three runs per row:

| Config | Layers on GPU | Generation |
| --- | --- | --- |
| Stock Ollama | 27/33 | ~12.7 tok/s |
| llm-fit, default cap | 31/33 | ~21.6 tok/s |
| llm-fit forcing 100% GPU | 33/33 | ~0.23 tok/s |

The middle row is the point of the tool. A couple of layers on CPU cost little, while unified-memory overflow costs every forward pass a page-migration round trip over PCIe at ~15 GB/s against DDR4's ~50 GB/s, which is why the bottom row collapses. The default overflow cap (LLM_FIT_MAX_OVERFLOW_MB=1024) exists to keep placement on the fast side of that cliff while still claiming several more layers than stock.
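The ratios implied by the table can be checked with a few lines of Python. The throughput figures are the ones reported above; the dictionary keys and variable names are only illustrative.

```python
# Throughput figures copied from the Findings table (tok/s).
measured = {
    "stock": 12.7,
    "llm-fit default cap": 21.6,
    "llm-fit 100% GPU": 0.23,
}

baseline = measured["stock"]
for config, tok_s in measured.items():
    # Show each configuration relative to stock Ollama.
    print(f"{config:22s} {tok_s:6.2f} tok/s  {tok_s / baseline:5.2f}x stock")

# How much faster the default cap is than stock.
speedup_default = measured["llm-fit default cap"] / baseline
# How many times slower the forced 100% GPU placement is than stock.
slowdown_forced = baseline / measured["llm-fit 100% GPU"]
```

On these numbers the default cap runs at about 1.7x stock, while forcing every layer onto the GPU is roughly 55x slower than stock.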

Forcing every layer onto the GPU (bottom row) is still possible with a num_gpu 99 model variant, and it does produce a genuine 100% GPU placement on a card the model does not fit in, but the measured throughput makes it a curiosity rather than a configuration to run.

Runtimes without a CPU fallback get a capability win rather than a speed win. vLLM refuses outright to load a model larger than VRAM, where Ollama would quietly split layers to the CPU. On the same 8GB card, stock vLLM fails to load Qwen2.5-Coder-14B-Instruct-AWQ (9.4GB) with CUDA out of memory, while under llm-fit with LLM_FIT_EXTRA_MB=5000 the same model loads, answers correctly, and generates at ~0.33 tok/s with the overflow paged over PCIe. Slow, but the alternative is not running at all, which makes it useful for occasional large-model work on hardware vLLM would otherwise reject.

Models more than ~1-2GB over VRAM capacity fail differently: cudaMallocManaged returns successfully, but tensor loading stalls indefinitely. This was tested across multiple configurations, including reduced context windows, different load timeout values, and llama.cpp's --mmap path, and none of them changed the outcome. The cause is memory pressure rather than a CUDA error: with every layer committed to the GPU, the overflow is faulted into system RAM, RAM fills, and the load blocks behind kernel memory reclaim.

To prevent this, llm-fit caps the overflow it will advertise and keeps a system-RAM reserve free (see Configuration). With the cap in place, the same model that used to hang (qwen3-coder:30b, 18GB) loads cleanly at 21/49 layers against stock's 18/49 and generates at ~4 tok/s.

If you also run nbd-vram, which uses the same GPU's VRAM as swap space, the two tools compete for VRAM and the stall above is much more likely. llm-fit detects an active nbd-vram swap device and skips VRAM inflation while it is in use; set LLM_FIT_FORCE=1 to override.
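How llm-fit detects nbd-vram is not described here. One plausible approach, sketched below purely as an assumption, is to look for an NBD block device among the active swap areas listed in /proc/swaps. The function names are hypothetical, and a /dev/nbd* entry only shows that some NBD device is used as swap, not that VRAM backs it.

```python
def nbd_swap_active(proc_swaps_text):
    """Return True if any active swap area is an NBD block device.

    Assumption: nbd-vram exposes its swap as /dev/nbdN. This is a guess at a
    detection strategy, not llm-fit's actual implementation.
    """
    lines = proc_swaps_text.splitlines()[1:]  # first line is the column header
    return any(
        line.split()[0].startswith("/dev/nbd")
        for line in lines
        if line.strip()  # ignore blank lines
    )


def read_proc_swaps(path="/proc/swaps"):
    """Read the kernel's list of active swap areas (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux machine, `nbd_swap_active(read_proc_swaps())` would report whether such a device is currently in use.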

MoE models looked like a candidate for beating that overflow penalty, since only a few billion of their parameters are active per token and unified memory could keep the hot experts in VRAM. Measured on a Q2 quant of Qwen3-Coder-30B-A3B (11.8GB), the idea holds only partially: 100% GPU placement through unified memory reaches ~3.1 tok/s, far better than a dense model under the same overflow (~0.23 tok/s), but expert routing scatters across tokens faster than page migration can follow, and a plain 30/49 layer split still wins by roughly 10x at ~30 tok/s.

For large MoE models, ktransformers is more appropriate. It routes only active experts through VRAM per token rather than managing the full weight set, so the bandwidth constraint works in its favor.


How it works

cudaMalloc and cuMemAlloc_v2 are intercepted via LD_PRELOAD and redirected to their managed equivalents for allocations above 64MB. Smaller allocations pass through unchanged to avoid interfering with internal CUDA bookkeeping.

Managed memory is bounded by physical VRAM plus RAM, because unified-memory pages are wired and never swap. An allocation request beyond that bound would appear to succeed (managed allocation is lazy) and then trigger the kernel's OOM killer while tensors load, taking unrelated applications with it. llm-fit checks every redirect against free VRAM plus spare RAM and passes oversized requests through to plain cudaMalloc instead, so the application gets an ordinary allocation failure rather than the system losing processes.
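The redirect rule described in the last two paragraphs can be modelled in a few lines. This is a sketch in Python rather than the shim's C code, and the function and constant names are hypothetical; only the 64MB threshold and the "free VRAM plus spare RAM" bound come from the text.

```python
MB = 1024 * 1024
REDIRECT_THRESHOLD = 64 * MB  # allocations above this size are redirected


def choose_allocator(size, free_vram, spare_ram):
    """Return "managed" or "plain" for an allocation request of `size` bytes.

    free_vram and spare_ram are byte counts the caller has already queried.
    """
    if size <= REDIRECT_THRESHOLD:
        # Small allocations pass through so CUDA's internal bookkeeping is untouched.
        return "plain"
    if size > free_vram + spare_ram:
        # A managed allocation is lazy and would appear to succeed, so hand the
        # request to plain cudaMalloc and let it fail in the ordinary way.
        return "plain"
    return "managed"
```

For example, with 8000MB of free VRAM and 4000MB of spare RAM, a 512MB request would be redirected to managed memory, while a 20000MB request would be passed through and fail normally.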

Ollama bundles its own CUDA runtime loaded with RTLD_LOCAL, which hides symbols from the standard RTLD_NEXT lookup. llm-fit falls back to explicit path resolution against Ollama's bundled libraries when the standard lookup fails.

Memory reporting hooks (cudaMemGetInfo, cuMemGetInfo_v2, and NVML equivalents) inflate reported free VRAM so the application assigns more layers to the GPU. The amount is recomputed on every call from current system memory, holds a system-RAM reserve back so the machine does not fall into swap, and is capped at a configurable maximum overflow. The cap matters because, past ~1GB of overflow, unified memory (UVM) is slower than Ollama's native CPU split, so advertising more memory only buys slower inference and risks the load-time stall.
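A minimal model of that computation, assuming the extra amount is simply the spare RAM above the reserve, clamped to the cap. The function names are hypothetical, the defaults mirror the documented environment variables, and the order in which the nbd-vram check and the explicit override are applied is an assumption.

```python
def advertised_extra_mb(avail_ram_mb, max_overflow_mb=1024, ram_reserve_mb=2048,
                        extra_mb=None, nbd_vram_active=False, force=False):
    """Extra VRAM (in MB) to add to the real free figure.

    extra_mb models LLM_FIT_EXTRA_MB, force models LLM_FIT_FORCE.
    """
    if nbd_vram_active and not force:
        # Skip inflation while nbd-vram swap is in use (assumed to take priority).
        return 0
    if extra_mb is not None:
        # An explicit amount bypasses both the cap and the reserve.
        return extra_mb
    spare = max(avail_ram_mb - ram_reserve_mb, 0)  # never dip into the reserve
    return min(spare, max_overflow_mb)             # never exceed the cap


def reported_free_mb(real_free_vram_mb, **kwargs):
    """Free VRAM as the application would see it through the hooks."""
    return real_free_vram_mb + advertised_extra_mb(**kwargs)
```

With 16000MB of available RAM the defaults advertise the full 1024MB, whereas with only 2500MB available the reserve leaves 452MB to offer.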

Unified memory works on all consumer GeForce GPUs, unlike the NVIDIA P2P API, which is restricted to datacenter SKUs.


Configuration

All settings are environment variables. The defaults are tuned for an 8GB GPU and need no changes for typical use.

| Env var | Default | Effect |
| --- | --- | --- |
| LLM_FIT_MAX_OVERFLOW_MB | 1024 | Maximum overflow advertised, i.e. how far past real VRAM a model is allowed to spill. |
| LLM_FIT_RAM_RESERVE_MB | 2048 | System RAM held in reserve and never offered as overflow. |
| LLM_FIT_EXTRA_MB | unset | Advertise an exact amount of extra VRAM instead of computing it. Bypasses the cap and reserve. |
| LLM_FIT_FORCE | unset | Advertise extra VRAM even when nbd-vram swap is active. |
| LLM_FIT_VERBOSE | unset | Log allocations and memory reports to stderr. |

Requirements

  • NVIDIA GPU with CUDA support
  • NVIDIA driver (includes nvidia-uvm.ko)
  • An inference runtime: Ollama, vLLM, llama.cpp, or any CUDA application
  • gcc, make
  • Enough system RAM to hold the overflow

Install

git clone https://github.com/c0dejedi/llm-fit
cd llm-fit
sudo ./install.sh

Usage

No model variants, no Modelfile changes. Preload the shim into the process that loads the model and run as normal.

Ollama loads models in its server process, so the preload goes on the service rather than the ollama run client:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/llm-fit.conf << 'EOF'
[Service]
Environment="LD_PRELOAD=/usr/local/lib/llm-fit.so"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

ollama run qwen2.5-coder:14b

Add Environment="LLM_FIT_VERBOSE=1" to the drop-in to see allocations and memory reports in journalctl -u ollama.

Works with any CUDA application that loads the model in its own process:

LD_PRELOAD=/usr/local/lib/llm-fit.so python inference.py

vLLM and other GPU-only runtimes

vLLM sizes its KV cache from reported free VRAM and has no CPU fallback, so give it an explicit extra budget with LLM_FIT_EXTRA_MB, chosen so the visible VRAM covers the model weights plus a working KV cache:

LD_PRELOAD=/usr/local/lib/llm-fit.so LLM_FIT_EXTRA_MB=5000 \
PYTORCH_CUDA_ALLOC_CONF=backend:native VLLM_USE_FLASHINFER_SAMPLER=0 \
python my_vllm_script.py

PYTORCH_CUDA_ALLOC_CONF=backend:native keeps PyTorch on the cudaMalloc allocator that llm-fit intercepts. VLLM_USE_FLASHINFER_SAMPLER=0 skips a JIT compile that wants a newer system nvcc than many distros ship; the issue is unrelated to llm-fit but commonly hit on the same machines.
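One way to pick a starting value for LLM_FIT_EXTRA_MB is to work out the shortfall between what the run needs and what the card has free. The helper below is a hypothetical sketch: the KV-cache budget and the free-VRAM figure in the example are assumed inputs, not measurements, and vLLM's own memory settings also affect how much it claims, so treat the result as a first guess to refine.

```python
import math


def extra_mb_for(weights_mb, kv_cache_mb, free_vram_mb, round_to_mb=100):
    """Suggest an LLM_FIT_EXTRA_MB value, rounded up to a multiple of round_to_mb.

    Returns 0 when weights plus KV cache already fit in free VRAM.
    """
    shortfall = weights_mb + kv_cache_mb - free_vram_mb
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / round_to_mb) * round_to_mb


# Example with assumed numbers: ~9400MB of weights, a 2000MB KV cache budget,
# and 7600MB of VRAM actually free on an 8GB card.
suggested = extra_mb_for(weights_mb=9400, kv_cache_mb=2000, free_vram_mb=7600)
print(f"LLM_FIT_EXTRA_MB={suggested}")
```

A larger value leaves more room for KV cache at the cost of more paging over PCIe.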

Forcing 100% GPU (not recommended)

The llm-fit-create.sh script creates a model variant with num_gpu 99 and a reduced context window, which together with a raised LLM_FIT_MAX_OVERFLOW_MB puts every layer on the GPU through unified-memory overflow. It works, but it benchmarks at ~0.23 tok/s against the default configuration's ~21.6 tok/s (see Findings), so treat it as an experiment rather than a way to run models.

bash llm-fit-create.sh qwen2.5-coder:14b
# creates: qwen2.5-coder:14b-fit

Uninstall

sudo rm -f /etc/systemd/system/ollama.service.d/llm-fit.conf
sudo systemctl daemon-reload && sudo systemctl restart ollama
sudo rm /usr/local/lib/llm-fit.so
sudo ldconfig

License

MIT - Sean Lobjoit (c0dejedi)

About

Research into CUDA Unified Memory as a VRAM extension for LLM inference
