oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models such as gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with 100k-token contexts on a ~$200 consumer GPU with 8 GB of VRAM. No quantization is used; only fp16/bf16 precision.
Latest updates (0.4.2) 🔥
- .safetensors files are now read without `mmap`, so they no longer consume RAM through the page cache (a minimal sketch of the idea follows this list)
- qwen3-next-80B DiskCache support added
- qwen3-next-80B (160 GB model) added with ⚡️ 1 tok / 2 s throughput (our fastest model so far)
- Llama3 custom chunked attention replaced with FlashAttention-2 for stability
- gpt-oss-20B: FlashAttention-like implementation added to reduce VRAM usage
- gpt-oss-20B: chunked MLP added to reduce VRAM usage
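To make the no-`mmap` change above concrete, here is a minimal sketch of reading a single tensor from a `.safetensors` file with plain file reads and copying it straight to the GPU. The helper name `load_tensor_no_mmap` and the small dtype table are made up for this example, and real loaders handle more dtypes and edge cases; this illustrates the idea only, not oLLM's actual loader.

```python
import json
import struct

import torch

# Subset of safetensors dtype codes used here; real readers handle more (assumption for this sketch).
_DTYPES = {"F16": torch.float16, "BF16": torch.bfloat16, "F32": torch.float32}

def load_tensor_no_mmap(path, name, device="cuda:0"):
    """Hypothetical helper: read one tensor from a .safetensors file with plain
    file reads (no mmap) and copy it straight to the GPU."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # first 8 bytes: JSON header size
        header = json.loads(f.read(header_len))          # tensor metadata (dtype, shape, offsets)
        meta = header[name]
        begin, end = meta["data_offsets"]                # byte range relative to the end of the header
        f.seek(8 + header_len + begin)
        buf = bytearray(f.read(end - begin))             # read only this tensor's bytes
    tensor = torch.frombuffer(buf, dtype=_DTYPES[meta["dtype"]]).reshape(meta["shape"])
    return tensor.to(device, non_blocking=True)
```

Because only the requested byte range is read, the file is never mapped into the process, so the page cache is not filled with the whole checkpoint.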
| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| qwen3-next-80B | 160 GB (bf16) | 50k | 20 GB | ~190 GB | ~7.5 GB | 180 GB |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |
By "Baseline" we mean typical inference without any offloading
How we achieve this:
- Loading layer weights from SSD directly to the GPU, one layer at a time
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention (a toy sketch appears after the usage snippet below)
- Offloading layer weights to the CPU if needed
- FlashAttention-2 with online softmax, so the full attention matrix is never materialized
- Chunked MLP: the intermediate up-projection activations can get large, so the MLP is chunked as well (see the sketch after this list)
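As a rough sketch of the last two points, the snippet below shows key-chunked attention with an online softmax and a sequence-chunked MLP. This is illustrative only, not oLLM's or FlashAttention-2's implementation: the function names are made up, a single head is assumed, the causal mask is omitted, and the MLP is a simplified up/down projection.

```python
import torch
import torch.nn.functional as F

def chunked_attention(q, k, v, chunk=1024):
    """Single-head attention over key/value chunks with an online softmax:
    the full (seq x seq) score matrix is never materialized.
    q, k, v: (seq_len, head_dim)."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)
    running_sum = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)
    for start in range(0, k.shape[0], chunk):
        k_c, v_c = k[start:start + chunk], v[start:start + chunk]
        scores = (q @ k_c.T) * scale                                   # (seq_len, chunk) only
        new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - new_max)
        correction = torch.exp(running_max - new_max)                  # rescale earlier partial results
        running_sum = running_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_c
        running_max = new_max
    return out / running_sum

def chunked_mlp(x, up_proj, down_proj, chunk=2048):
    """Simplified up/down MLP applied to sequence slices so the large
    (seq_len x intermediate_size) activation is never fully materialized."""
    parts = [down_proj(F.silu(up_proj(x[i:i + chunk])))
             for i in range(0, x.shape[0], chunk)]
    return torch.cat(parts, dim=0)
```

In a real decoder the same loops also apply the causal mask and run per batch and per head, but the bookkeeping stays the same.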
Typical use cases include:
- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
Supported Nvidia GPUs: Ampere (RTX 30xx, A30, A4000, A10), Ada Lovelace (RTX 40xx, L4), Hopper (H100), and newer
It is recommended to create a venv or conda environment first:
python3 -m venv ollm_env
source ollm_env/bin/activate
Install oLLM with `pip install ollm`
or from source:
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
Then install kvikio for your CUDA version: `pip install kvikio-cu{cuda_version}`, e.g. `pip install kvikio-cu12`
💡 Note
qwen3-next requires the 4.57.0.dev version of transformers, installed as `pip install git+https://github.com/huggingface/transformers.git`
Sample code snippet:
from ollm import Inference, file_get_contents, TextStreamer
o = Inference("llama3-1B-chat", device="cuda:0", logging=True) #llama3-1B/3B/8B-chat, gpt-oss-20B, qwen3-next-80B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2) #(optional) offload some layers to CPU for speed boost
past_key_values = o.DiskCache(cache_dir="./kv_cache/") #set None if context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)
messages = [{"role":"system", "content":"You are helpful AI assistant"}, {"role":"user", "content":"List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)
Or run the sample Python script as `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py`
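The `DiskCache` passed as `past_key_values` in the snippet keeps the KV cache on SSD, as described in the technique list above. The class below is a toy illustration of that idea, not oLLM's actual `DiskCache`: it simply persists each layer's keys/values to disk and reloads them to the GPU on demand.

```python
import os

import torch

class ToyDiskKVCache:
    """Toy sketch (not oLLM's DiskCache): keep each layer's K/V tensors on SSD
    and reload them to the GPU only when that layer runs."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, layer_idx):
        return os.path.join(self.cache_dir, f"layer_{layer_idx}.pt")

    def append(self, layer_idx, k_new, v_new):
        # Append this decoding step's keys/values for one layer, keeping them on disk.
        path = self._path(layer_idx)
        if os.path.exists(path):
            k, v = torch.load(path)
            k = torch.cat([k, k_new.cpu()], dim=-2)
            v = torch.cat([v, v_new.cpu()], dim=-2)
        else:
            k, v = k_new.cpu(), v_new.cpu()
        torch.save((k, v), path)

    def get(self, layer_idx, device="cuda:0"):
        # Bring one layer's cached K/V back onto the GPU just before it is needed.
        k, v = torch.load(self._path(layer_idx))
        return k.to(device, non_blocking=True), v.to(device, non_blocking=True)
```

A real implementation streams asynchronously and avoids rewriting whole files on every step; the point is only that the growing KV cache lives on SSD instead of VRAM or RAM.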
What's coming next (subject to change):
- gemma3-27B (or alternative) coming Tue, Sep 30
- Voxtral-small-24B ASR model coming Fri, Oct 3
- Qwen3-VL or an alternative vision model by Fri, Oct 10
- Qwen3-Next MultiTokenPrediction in R&D
- Efficient weight loading in R&D
If there’s a model you’d like to see supported, feel free to reach out at anuarsh@ailabs.us—I’ll do my best to make it happen.