oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models such as gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with 100k-token contexts on a ~$200 consumer GPU with 8 GB of VRAM. No quantization is used; only fp16/bf16 precision.
Latest updates (0.4.2) 🔥
- .safetensors files are now read without `mmap`, so they no longer consume RAM through the page cache (a minimal sketch of the idea follows this list)
- qwen3-next-80B DiskCache support added
- qwen3-next-80B (160 GB model) added with ⚡️ 1 tok / 2 s throughput (our fastest model so far)
- Llama3 custom chunked attention replaced with FlashAttention-2 for stability
- gpt-oss-20B: FlashAttention-like implementation added to reduce VRAM usage
- gpt-oss-20B: chunked MLP added to reduce VRAM usage
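To make the no-`mmap` change above concrete, here is a minimal sketch of reading a single tensor from a `.safetensors` file with plain file reads and copying it straight to the GPU. The helper name `load_tensor_no_mmap` and the small dtype table are made up for this example, and real loaders handle more dtypes and edge cases; this illustrates the idea only, not oLLM's actual loader.

```python
import json
import struct

import torch

# Subset of safetensors dtype codes used here; real readers handle more (assumption for this sketch).
_DTYPES = {"F16": torch.float16, "BF16": torch.bfloat16, "F32": torch.float32}

def load_tensor_no_mmap(path, name, device="cuda:0"):
    """Hypothetical helper: read one tensor from a .safetensors file with plain
    file reads (no mmap) and copy it straight to the GPU."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # first 8 bytes: JSON header size
        header = json.loads(f.read(header_len))          # tensor metadata (dtype, shape, offsets)
        meta = header[name]
        begin, end = meta["data_offsets"]                # byte range relative to the end of the header
        f.seek(8 + header_len + begin)
        buf = bytearray(f.read(end - begin))             # read only this tensor's bytes
    tensor = torch.frombuffer(buf, dtype=_DTYPES[meta["dtype"]]).reshape(meta["shape"])
    return tensor.to(device, non_blocking=True)
```

Because only the requested byte range is read, the file is never mapped into the process, so the page cache is not filled with the whole checkpoint.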
| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| qwen3-next-80B | 160 GB (bf16) | 50k | 20 GB | ~190 GB | ~7.5 GB | 180 GB |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |
By "Baseline" we mean typical inference without any offloading
How we achieve this:
- Loading layer weights from SSD directly to the GPU, one layer at a time
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention (a toy sketch appears after the usage snippet below)
- Offloading layer weights to the CPU if needed
- FlashAttention-2 with online softmax, so the full attention matrix is never materialized
- Chunked MLP: the intermediate up-projection activations can get large, so the MLP is chunked as well (see the sketch after this list)
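As a rough sketch of the last two points, the snippet below shows key-chunked attention with an online softmax and a sequence-chunked MLP. This is illustrative only, not oLLM's or FlashAttention-2's implementation: the function names are made up, a single head is assumed, the causal mask is omitted, and the MLP is a simplified up/down projection.

```python
import torch
import torch.nn.functional as F

def chunked_attention(q, k, v, chunk=1024):
    """Single-head attention over key/value chunks with an online softmax:
    the full (seq x seq) score matrix is never materialized.
    q, k, v: (seq_len, head_dim)."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)
    running_sum = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)
    for start in range(0, k.shape[0], chunk):
        k_c, v_c = k[start:start + chunk], v[start:start + chunk]
        scores = (q @ k_c.T) * scale                                   # (seq_len, chunk) only
        new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - new_max)
        correction = torch.exp(running_max - new_max)                  # rescale earlier partial results
        running_sum = running_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_c
        running_max = new_max
    return out / running_sum

def chunked_mlp(x, up_proj, down_proj, chunk=2048):
    """Simplified up/down MLP applied to sequence slices so the large
    (seq_len x intermediate_size) activation is never fully materialized."""
    parts = [down_proj(F.silu(up_proj(x[i:i + chunk])))
             for i in range(0, x.shape[0], chunk)]
    return torch.cat(parts, dim=0)
```

In a real decoder the same loops also apply the causal mask and run per batch and per head, but the bookkeeping stays the same.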
Typical use cases include:
- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
Supported Nvidia GPUs: Ampere (RTX 30xx, A30, A4000, A10), Ada Lovelace (RTX 40xx, L4), Hopper (H100), and newer
It is recommended to create a venv or conda environment first:
python3 -m venv ollm_env
source ollm_env/bin/activate
Install oLLM with `pip install ollm`
or from source:
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
Then install kvikio for your CUDA version: `pip install kvikio-cu{cuda_version}`, e.g. `pip install kvikio-cu12`
💡 Note
qwen3-next requires the 4.57.0.dev version of transformers, installed as `pip install git+https://github.com/huggingface/transformers.git`
Sample code snippet:
from ollm import Inference, file_get_contents, TextStreamer
o = Inference("llama3-1B-chat", device="cuda:0", logging=True) #llama3-1B/3B/8B-chat, gpt-oss-20B, qwen3-next-80B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2) #(optional) offload some layers to CPU for speed boost
past_key_values = o.DiskCache(cache_dir="./kv_cache/") #set None if context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)
messages = [{"role":"system", "content":"You are helpful AI assistant"}, {"role":"user", "content":"List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)
Or run the sample Python script as `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py`
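The `DiskCache` passed as `past_key_values` in the snippet keeps the KV cache on SSD, as described in the technique list above. The class below is a toy illustration of that idea, not oLLM's actual `DiskCache`: it simply persists each layer's keys/values to disk and reloads them to the GPU on demand.

```python
import os

import torch

class ToyDiskKVCache:
    """Toy sketch (not oLLM's DiskCache): keep each layer's K/V tensors on SSD
    and reload them to the GPU only when that layer runs."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, layer_idx):
        return os.path.join(self.cache_dir, f"layer_{layer_idx}.pt")

    def append(self, layer_idx, k_new, v_new):
        # Append this decoding step's keys/values for one layer, keeping them on disk.
        path = self._path(layer_idx)
        if os.path.exists(path):
            k, v = torch.load(path)
            k = torch.cat([k, k_new.cpu()], dim=-2)
            v = torch.cat([v, v_new.cpu()], dim=-2)
        else:
            k, v = k_new.cpu(), v_new.cpu()
        torch.save((k, v), path)

    def get(self, layer_idx, device="cuda:0"):
        # Bring one layer's cached K/V back onto the GPU just before it is needed.
        k, v = torch.load(self._path(layer_idx))
        return k.to(device, non_blocking=True), v.to(device, non_blocking=True)
```

A real implementation streams asynchronously and avoids rewriting whole files on every step; the point is only that the growing KV cache lives on SSD instead of VRAM or RAM.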
What's coming next (subject to change):
- gemma3-27B (or alternative) coming Tue, Sep 30
- Voxtral-small-24B ASR model coming Fri, Oct 3
- Qwen3-VL or an alternative vision model by Fri, Oct 10
- Qwen3-Next MultiTokenPrediction in R&D
- Efficient weight loading in R&D
If there’s a model you’d like to see supported, feel free to reach out at anuarsh@ailabs.us—I’ll do my best to make it happen.