gllm — JIT-Compiled LLM Inference Engine in Rust

A pure Rust inference library for local text generation, embeddings, and reranking. Built on gllm-kernels, a JIT kernel-fusion engine that generates machine code specialized for the detected hardware at runtime.

Core Philosophy

JIT-First: Every operator is compiled through a four-stage pipeline (Scalar → SymExec → IR → ISA Lowering). No precompiled libraries, no hand-written assembly.
Accuracy > Throughput: Deterministic scheduling, strict causal ordering, phase-isolated prefill/decode.
Zero Fallback: No scalar fallback, no silent degradation. If JIT compilation fails, the engine returns an error rather than silently producing wrong results.
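
The Zero Fallback rule can be pictured as a compile step that returns a `Result` and whose caller has no slower path to retry. The following is a minimal sketch of that pattern only. `JitError`, `CompiledKernel`, and `compile_operator` are hypothetical names chosen for illustration and are not taken from the gllm or gllm-kernels API.

```rust
use std::fmt;

/// Hypothetical error type: every JIT failure is surfaced to the caller.
#[derive(Debug, PartialEq)]
enum JitError {
    UnsupportedIsa(String),
    EmptyOperator,
}

impl fmt::Display for JitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            JitError::UnsupportedIsa(isa) => write!(f, "no code generator for ISA `{isa}`"),
            JitError::EmptyOperator => write!(f, "operator name is empty"),
        }
    }
}

impl std::error::Error for JitError {}

/// Hypothetical handle standing in for generated machine code.
#[derive(Debug)]
struct CompiledKernel {
    op: String,
    isa: String,
}

/// Toy stand-in for a JIT compile step. For this sketch we assume only
/// "avx2" and "neon" have code generators; anything else is an error.
fn compile_operator(op: &str, isa: &str) -> Result<CompiledKernel, JitError> {
    if op.is_empty() {
        return Err(JitError::EmptyOperator);
    }
    match isa {
        "avx2" | "neon" => Ok(CompiledKernel {
            op: op.to_string(),
            isa: isa.to_string(),
        }),
        // No scalar fallback: an unknown ISA is reported, never papered over.
        other => Err(JitError::UnsupportedIsa(other.to_string())),
    }
}

fn main() -> Result<(), JitError> {
    let kernel = compile_operator("rmsnorm", "avx2")?;
    println!("compiled {} for {}", kernel.op, kernel.isa);

    // The failure is reported as an error value; nothing is retried.
    if let Err(e) = compile_operator("rmsnorm", "sse2") {
        println!("error: {e}");
    }
    Ok(())
}
```

The design point is that the error type is the only alternative to a compiled kernel, so a caller cannot accidentally continue with a degraded implementation.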

Features

20+ Model Architectures — Qwen3, Llama 4, GLM-5, Mistral 3, DeepSeek V3/R1, GPT-OSS, Phi-4, Gemma 4, and more
Multi-Format Loader — SafeTensors (zero-copy), GGUF (21 quantization types), ONNX (graph pattern matching), PyTorch
Multi-Source Download — HuggingFace with automatic ModelScope fallback
Fused Kernels — FlashAttention, SwiGLU, FusedQkvNormRope, MoE routing, RMSNorm-into-GEMM
PagedAttention + Continuous Batching — KV cache as virtual memory pages with prefix sharing
Quantization — 22 quantization types (INT4/INT8/FP4/FP8/AWQ/GPTQ/MXFP4/NVFP4) with JIT dequantization fusion
Distributed Inference — Multi-GPU tensor/pipeline parallelism via gllm-nccl
Advanced Features — Semantic Gatekeeper, Head Routing, Guardrail, Intent Recall, CoT Reasoner, MoE, MLA, MTP, AltUp
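
To make the "KV cache as virtual memory pages with prefix sharing" idea concrete, here is a small sketch of a reference-counted page pool with per-sequence block tables. It illustrates the general technique and is not gllm's scheduler. `PagePool`, `Sequence`, and the page size of 16 tokens are assumptions made for this example.

```rust
/// Assumed number of tokens stored per KV-cache page.
const PAGE_TOKENS: usize = 16;

/// Pool of physical pages with a reference count per page (0 = free).
struct PagePool {
    ref_counts: Vec<u32>,
    free: Vec<usize>,
}

impl PagePool {
    fn new(num_pages: usize) -> Self {
        PagePool {
            ref_counts: vec![0; num_pages],
            free: (0..num_pages).rev().collect(),
        }
    }

    /// Take a free page, or return None when the pool is exhausted.
    fn alloc(&mut self) -> Option<usize> {
        let page = self.free.pop()?;
        self.ref_counts[page] = 1;
        Some(page)
    }

    /// Record one more sequence sharing this page.
    fn retain(&mut self, page: usize) {
        self.ref_counts[page] += 1;
    }

    /// Drop one reference; the page returns to the free list at zero.
    fn release(&mut self, page: usize) {
        self.ref_counts[page] -= 1;
        if self.ref_counts[page] == 0 {
            self.free.push(page);
        }
    }

    fn free_pages(&self) -> usize {
        self.free.len()
    }
}

/// Block table: maps a sequence's logical pages to physical pages.
struct Sequence {
    pages: Vec<usize>,
    len: usize,
}

impl Sequence {
    fn new() -> Self {
        Sequence { pages: Vec::new(), len: 0 }
    }

    /// Append one token, allocating a new page at each page boundary.
    fn append_token(&mut self, pool: &mut PagePool) -> Option<()> {
        if self.len % PAGE_TOKENS == 0 {
            self.pages.push(pool.alloc()?);
        }
        self.len += 1;
        Some(())
    }

    /// Start a new sequence that shares every *full* page of this one.
    /// Partially filled pages are not shared, so no copy-on-write is needed.
    fn fork_prefix(&self, pool: &mut PagePool) -> Sequence {
        let full = self.len / PAGE_TOKENS;
        let pages = self.pages[..full].to_vec();
        for &p in &pages {
            pool.retain(p);
        }
        Sequence { pages, len: full * PAGE_TOKENS }
    }

    /// Release every page held by this sequence.
    fn release_all(self, pool: &mut PagePool) {
        for p in self.pages {
            pool.release(p);
        }
    }
}

fn main() {
    let mut pool = PagePool::new(8);
    let mut parent = Sequence::new();
    for _ in 0..40 {
        parent.append_token(&mut pool).expect("pool exhausted");
    }
    let child = parent.fork_prefix(&mut pool);
    println!("shared prefix tokens: {}, free pages: {}", child.len, pool.free_pages());
    parent.release_all(&mut pool);
    child.release_all(&mut pool);
}
```

Forking costs no additional pages because the shared prefix is referenced rather than copied; a page only returns to the free list once the last sequence using it is released.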

Quick Start

Text Generation

use gllm::Client;

let client = Client::new_chat("Qwen/Qwen3-7B-Instruct")?;
let response = client
    .generate("Explain the theory of relativity:")
    .max_tokens(512)
    .temperature(0.7)
    .generate()?;
println!("{}", response.text);

Text Embeddings

use gllm::Client;

let client = Client::new_embedding("intfloat/e5-small-v2")?;
let response = client.embed(["Hello, world!", "Test sentence"])?;
for emb in &response.embeddings {
    println!("dim: {}", emb.embedding.len());
}
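
Embedding vectors are usually compared with cosine similarity. The helper below is a plain-Rust sketch that works on `f32` slices and does not depend on gllm. `cosine_similarity` is a name introduced here for illustration, and the sketch assumes each embedding can be viewed as a `&[f32]`.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for empty, mismatched, or zero-norm inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    let a = [0.6_f32, 0.8, 0.0];
    let b = [0.8_f32, 0.6, 0.0];
    match cosine_similarity(&a, &b) {
        Some(sim) => println!("cosine similarity: {sim:.4}"),
        None => println!("vectors are not comparable"),
    }
}
```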

Document Reranking

use gllm::Client;

let client = Client::new_reranker("BAAI/bge-reranker-v3")?;
let response = client
    .rerank("Efficient storage", [
        "Columnar databases compress well.",
        "Rust has zero-cost abstractions.",
    ])
    .top_n(1)?;
println!("Best match score: {:.4}", response.results[0].score);

Supported Models

Category	Models	Architecture
Generator	Qwen3 (7B/MoE 235B/Thinking 32B), Llama 4 (8B MoE/Scout 17B), Mistral 3 (14B), GLM-4.7/5, GPT-OSS (1.5B/12B/20B), Phi-4 (14B), Gemma 4 (E2B/E4B/31B/26B-A4B), SmolLM, InternLM3	Dense / MoE / Thinking
Embedding	Qwen3-Embed (2048D), BGE-M3 (1024D), BGE-M4 (1536D), E5 (384/768/1024D), M3E, Jina v2/v4	Bi-encoder
Reranker	Qwen3-Rerank, BGE-Reranker-v2-m3, BGE-Rerank-v3	Cross-encoder

Backend Support

The backend is auto-detected at runtime; no configuration is required:

Backend	Hardware	JIT Method
CUDA	NVIDIA GPU	PTX codegen, SM version specialized (sm_70/80/90/100+)
ROCm	AMD GPU	HIP codegen via HSA runtime
Metal	Apple GPU	MSL codegen via Metal framework
CPU	x86_64	AVX2 / AVX-512 / AMX / VNNI / BF16
CPU	AArch64	NEON / SVE / SME2
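
For the CPU rows, runtime detection of this kind can be done with the standard library's feature-detection macros. The sketch below shows one possible approach and is not gllm's actual detection code. `detected_cpu_features`, `pick_tier`, and the priority order are assumptions for illustration, and only a few of the listed extensions are probed.

```rust
/// Assumed preference order, widest vector extension first.
const PRIORITY: [&str; 3] = ["avx512f", "avx2", "neon"];

/// Probe the current CPU using the standard library's detection macros.
#[allow(unused_mut)]
fn detected_cpu_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
        if std::is_x86_feature_detected!("avx512f") {
            found.push("avx512f");
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            found.push("neon");
        }
    }
    found
}

/// Pick the most preferred feature that is present, or None if the CPU
/// offers none of them (a caller would then report an error).
fn pick_tier(features: &[&str]) -> Option<&'static str> {
    PRIORITY.iter().copied().find(|tier| features.contains(tier))
}

fn main() {
    let features = detected_cpu_features();
    println!("detected: {features:?}");
    match pick_tier(&features) {
        Some(tier) => println!("selected code path: {tier}"),
        None => println!("no supported SIMD extension found"),
    }
}
```

Keeping the selection logic in a pure function (`pick_tier`) separate from the hardware probe makes the choice testable on any machine.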

Architecture

┌─────────────────────────────────────────────────────┐
│  Client API — generate / embed / rerank / classify  │
├─────────────────────────────────────────────────────┤
│  Loader — safetensors / GGUF / ONNX / PyTorch       │
├─────────────────────────────────────────────────────┤
│  Graph Optimizer — pattern fusion + HW constraints   │
├─────────────────────────────────────────────────────┤
│  Scheduler — PagedAttention + Continuous Batching    │
├─────────────────────────────────────────────────────┤
│  Executor — Mega-Kernel block routing                │
├─────────────────────────────────────────────────────┤
│  gllm-kernels — JIT: Scalar→SymExec→IR→ISA           │
├─────────────────────────────────────────────────────┤
│  gllm-nccl — Distributed: NCCL/RCCL/oneCCL          │
└─────────────────────────────────────────────────────┘

Triple-Repo Architecture

Repository	Role	Dependency
gllm	Inference client, scheduling, model loading	→ gllm-kernels
gllm-kernels	JIT compiler, codegen, operator registry	→ gllm-nccl
gllm-nccl	Distributed GPU communication	Standalone

License

MIT License
