eLLM can infer LLM on CPUs faster than on GPUs
-
Updated
Jun 10, 2026 - Rust
eLLM can infer LLM on CPUs faster than on GPUs
Local LLM inference engine written from scratch in Rust — hand-written AVX-512 assembly kernels, Metal & Vulkan compute shaders. Supports Qwen3, Mistral3, ... Q4/INT8/BF16 quantization.
Rust-native MoE inference runtime with custom CUDA kernels for Blackwell GPUs. Includes DFlash speculative decoding, multi-tier Engram memory, and entropy-adaptive routing. Targets Qwen3.5-35B-A3B on a single RTX 5060 Ti 16GB.
Run Qwen3.5-122B-A10B on a 16 GB MacBook Air via SSD-streamed MoE expert weights.
SSD-streaming MoE inference engine for consumer hardware. Run 80B parameter models on a 24GB Mac.
LLM inference in Rust - Metal & CUDA
PERSPECTIVE v2 — A 1.05 trillion parameter sparse Mixture-of-Experts language model that runs on consumer hardware (4 GB VRAM + 32 GB RAM). Features O(1) perspective decay recurrence, 3D torus manifold routing, native ternary {-1,0,+1} weights, holographic distributed memory, and hard geometric safety constraints. Built in Rust.
Runs a Rust inference server for hybrid State-Space and MoE language models with fast GPU throughput on consumer hardware
Enabling inference of large mixture-of-experts (MoE) models on Apple Silicon using dynamic offloading.
Heterogeneous Compute Cascade (HCC) — distributed 400B-parameter MoE inference across dual AMD Ryzen AI MAX+ 395 'Strix Halo' workstations via USB4.
Frontier AI on the Macs you already own. Treats your SSD as memory and splits models across paired devices, so 18 GB models run on 8 GB Macs. Local, private, OpenAI-compatible — works with Claude Code, Cursor, any agent. Built on iroh + SwiftLM.
Add a description, image, and links to the moe topic page so that developers can more easily learn about it.
To associate your repository with the moe topic, visit your repo's landing page and select "manage topics."