Run LLMs 2×–10× faster on half the hardware

Works with open-source or custom fine-tuned models.
Up to 2×–10× faster inference
Up to ~36% less VRAM (Qwen3‑32B: 63 GiB → 40 GiB)
1–5% accuracy delta vs baseline
On‑prem or private cloud
Why businesses choose Minima
Performance-first
Up to 2×–10× faster LLM inference via mnma’s compression, kernel optimizations and speculative decoding — without rewriting your application or switching models.
Local-first & compliant
On-prem or private cloud (VPC) deployment ensures full data control — no third-party AI services, no external data transfers, and easier compliance with GDPR, HIPAA and SOC 2.
Cost-efficient GPUs
Cut VRAM usage roughly in half and run the same workloads on half the hardware. Keep your existing GPUs while unlocking much higher throughput and a lower cost per token.
How mnma makes LLMs smaller and faster
  1. Analyze model & hardware
    We profile your LLM and GPU stack. A custom CNN predicts layer and patch sensitivity so we know exactly where compression is safe and where we must preserve detail.
    Custom CNN predicting layer and patch sensitivity
  2. Compress what’s safe
    Low-sensitivity patches are compressed using tensor networks (Tucker, TT, TR), while high-sensitivity regions are protected. This typically yields around 2× compression without catastrophic drift (the code sketch below illustrates the basic idea).
  3. Heal accuracy
    A fast “healing” fine-tune brings performance back to near-baseline, keeping accuracy loss within ~1–5% depending on the benchmark.
  4. Optimize kernels
    We ship custom GPU kernels written in Triton so the compressed model runs as fast as, or faster than, the baseline implementation on your GPUs.
  5. Turbo decode with speculation
    With VRAM freed up, we enable speculative decoding: a small draft model proposes tokens and the main LLM verifies them in batches, further boosting throughput without sacrificing quality.
End-to-end mnma optimization pipeline
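To make the compression step concrete, here is a minimal, hypothetical sketch of the underlying idea: replace a large weight matrix with smaller factors. It uses a plain truncated SVD on a single linear layer rather than mnma’s sensitivity-guided Tucker/TT/TR decompositions, and none of the names below are part of the mnma API.

```python
# Illustrative sketch only: compress one nn.Linear by replacing its weight
# matrix with a rank-r factorization. mnma uses tensor-network decompositions
# (Tucker/TT/TR) chosen per layer by the sensitivity model; this stand-in just
# shows how trading a big matrix for smaller factors saves memory and compute.
import torch
import torch.nn as nn


def low_rank_compress(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate linear.weight (out x in) with two smaller matrices via truncated SVD."""
    W = linear.weight.data                       # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_features, rank), singular values folded in
    V_r = Vh[:rank, :]                           # (rank, in_features)

    down = nn.Linear(linear.in_features, rank, bias=False)                    # x -> V_r @ x
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)   # -> U_r @ (V_r @ x) + b
    down.weight.data.copy_(V_r)
    up.weight.data.copy_(U_r)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)


# A 4096x4096 projection (~16.8M parameters) at rank 512 shrinks to ~4.2M parameters.
layer = nn.Linear(4096, 4096)
compressed = low_rank_compress(layer, rank=512)
x = torch.randn(1, 4096)
print(torch.nn.functional.mse_loss(layer(x), compressed(x)).item())
```

The approximation error this kind of factorization introduces is exactly what the healing fine-tune in step 3 is there to recover.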
Real performance on real models
Live demo: mnma‑compressed Qwen3-32B vs baseline
mnma-optimized Qwen3-32B (bfloat16)
  • Baseline: ~63 GiB VRAM, ~40 tokens/s
  • mnma: ~40 GiB VRAM, ~70 tokens/s
  • ≈1.75× speedup, ≈36% less VRAM, ~1–5% accuracy delta depending on the benchmark.
Numbers from our internal demo. Actual gains vary by model and hardware; across tests we see 2×–10× speedups on different workloads.
Under the hood: how mnma works (explainer)
This short explainer walks through the pipeline — sensitivity analysis, tensor decompositions (Tucker, TT, TR), the healing fine-tune, Triton kernels and speculative decoding — and how they combine into a practical optimization stack.
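For readers who want the decoding step spelled out, below is a minimal, self-contained sketch of the draft-and-verify loop behind speculative decoding. The “models” are plain Python callables standing in for a small draft LLM and the large target LLM; a production engine such as mnma’s verifies the whole draft in a single batched forward pass on the GPU and typically uses a probabilistic accept/reject rule rather than this greedy one.

```python
# Toy greedy speculative decoding: a cheap draft model proposes k tokens, the
# target model checks them, and we keep the longest agreeing prefix plus one
# corrected token. Real engines verify the entire draft in one batched forward
# pass of the large model; this sketch only illustrates the control flow.
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token "model"


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int, k: int = 4) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposal: accept tokens it agrees
        #    with, and emit its own token at the first disagreement.
        accepted: List[Token] = []
        for t in proposal:
            expected = target(tokens + accepted)
            if expected == t:
                accepted.append(t)
            else:
                accepted.append(expected)
                break
        else:
            accepted.append(target(tokens + accepted))  # bonus token after a full match
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


# Stand-in "models": both continue a counting sequence, but the draft slips up
# occasionally, so some proposals get rejected and corrected by the target.
target_model: NextTokenFn = lambda ctx: ctx[-1] + 1
draft_model: NextTokenFn = lambda ctx: ctx[-1] + (2 if len(ctx) % 5 == 0 else 1)
print(speculative_decode(target_model, draft_model, prompt=[0], max_new_tokens=10))
```

Because several tokens can be accepted per verification step, the large model runs fewer sequential forward passes, which is where the extra throughput comes from.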
Products built on the mnma runtime
mnma turbo inference engine
Drop‑in optimization runtime that compresses and accelerates LLMs for your specific GPUs. Achieve 2×–10× higher throughput with less VRAM and only a 1–5% accuracy delta, without changing your application code.
  • Supports open-weights and internal LLMs
  • Automatic hardware-aware optimization
  • On-prem or private cloud
Minima AI Agent (local RAG)
Local-first RAG and agents platform built on mnma. “Search in your data, chat with your data, agents on your data” across files, databases, and internal tools – fully on‑prem or in your VPC.
  • On-prem or VPC deployment
  • Connectors to your documents and systems
  • Chat UI, APIs, and workflows
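For orientation, here is a minimal, hypothetical sketch of the retrieve-then-generate pattern that Minima AI Agent productizes: embed local document chunks, pull the most relevant ones for a question, and build a grounded prompt for an on-prem LLM. The embed() function below is a toy bag-of-words stand-in, not Minima’s embedding stack, and none of the names are part of the product’s API.

```python
# Minimal local RAG sketch: embed document chunks, retrieve the closest ones
# for a question, and assemble a grounded prompt for a locally hosted LLM.
# embed() is a toy bag-of-words stand-in; a real deployment would use a local
# embedding model plus a vector index, and connectors to files and databases.
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "Invoices are stored on the finance share under /finance/invoices.",
    "The VPN must be enabled before opening internal dashboards.",
    "GPU clusters are reserved through the infra ticketing queue.",
]
question = "Where are invoices stored?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)  # this prompt would then go to the on-prem LLM served by mnma
```

Every piece of this loop (documents, index, and model) stays inside your network, which is the point of the local-first design.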
Who benefits from faster local LLMs?
ML platform & infra teams
Serve 7B–70B LLMs with lower latency and fewer GPUs. Increase throughput per node and make autoscaling cheaper while staying fully on-prem for sensitive workloads.
AI & product teams
Launch AI features without burning GPU budget. Use mnma to optimize Qwen, Llama, Gemma, or internal models, then layer RAG and agents on top with Minima AI Agent.
Regulated industries
Run GPT-class LLMs without sending data to third-party APIs. mnma keeps models local; Minima AI Agent brings compliant RAG and agent workflows to your data.
Edge and hybrid deployments
Fit powerful LLMs into constrained VRAM budgets in on-prem clusters or edge appliances. mnma squeezes more performance out of every GPU.
Qwen3‑32B benchmark comparison
Baseline Qwen3-32B (bfloat16)
  • Throughput: ~40 tokens/s
  • VRAM usage: ~63 GiB
  • Accuracy: 100% (baseline reference)
  • Hardware: H100 80 GiB
mnma-optimized Qwen3-32B (bfloat16)
  • Throughput: ~70 tokens/s (1.75× speedup)
  • VRAM usage: ~40 GiB (~36% less VRAM)
  • Accuracy: ~95–99% of baseline (1–5% delta)
  • Hardware: A100 48 GiB
Team
David – Co-founder of Minima
David – Co-founder
David is an AI architect based in Silicon Valley, with experience designing and scaling LLM systems and ML platforms. He’s focused on making large models efficient and practical on real hardware.
Sergii – Co-founder of Minima
Sergii – Co-founder
Sergii is an ex-Principal AI Engineer at Atlassian with a background in shipping search, recommendations, and generative AI features at scale. He brings deep expertise in math, systems, tooling, and developer experience.
Frequently Asked Questions
What does “2×–10× faster” actually mean?
We measure speed in tokens per second for a specific model and hardware configuration. mnma combines safe compression, optimized kernels, and speculative decoding to increase tokens/second. In our Qwen3‑32B demo, throughput improves from ~40 to ~70 tokens/s with ~36% less VRAM. On some model–hardware combos we see up to 10× gains.
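For concreteness, here is a minimal illustration of how a tokens-per-second figure like this can be measured: a small harness that times any generate function and reports throughput. The generate_stub below is a placeholder that simulates roughly 40 tokens/s so the script runs anywhere; it is not part of mnma.

```python
# Simple throughput harness: time how long it takes to produce N new tokens
# and report tokens/second. generate_stub is a placeholder that simulates a
# ~40 tokens/s model; swap in a call to your own model or serving endpoint
# to benchmark real hardware.
import time
from typing import Callable, List


def generate_stub(prompt: str, max_new_tokens: int) -> List[str]:
    tokens = []
    for i in range(max_new_tokens):
        time.sleep(0.025)          # pretend each token costs ~25 ms
        tokens.append(f"tok{i}")
    return tokens


def tokens_per_second(generate: Callable[[str, int], List[str]],
                      prompt: str, max_new_tokens: int = 64) -> float:
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


print(f"{tokens_per_second(generate_stub, 'Hello'):.1f} tokens/s")
```

Running the same measurement against a baseline and an mnma-optimized deployment of the same model gives the kind of before/after comparison shown above.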
Which models can mnma optimize?
mnma is designed to work with modern transformer‑based LLMs, including open‑weights and internal models. If it runs as a standard transformer today, mnma can usually analyze and optimize it. We work with customers to support their specific stacks.
Is mnma only for RAG?
No. mnma is model‑ and use‑case‑agnostic. You can use it to accelerate chatbots, code assistants, RAG systems, document automation, or any other LLM‑powered service. Minima AI Agent is our turnkey local RAG and agents product built on top of mnma.
How is Minima deployed and priced?
We deploy on-premises or inside your private cloud (VPC). mnma, the turbo inference engine, is offered as a usage-based subscription, and Minima AI Agent is sold as an annual platform subscription (ARR). Contact us for details about the full platform.
Unlock faster, smaller on-prem LLMs
  • Turbo inference on your hardware: compression, Triton kernels, and speculative decoding tuned to your GPUs.
  • Fits big models into less VRAM: run GPT-class performance with roughly half the memory footprint.
  • Fully local, fully private: deployed on-prem or in your VPC, with no external LLM calls.