Run LLMs 2×–10× faster on half the hardware

Works with open-source or custom fine-tuned models.
Up to 2×–10× faster inference
Up to ~36% less VRAM (Qwen3‑32B: 63 GiB → 40 GiB)
1–5% accuracy delta vs baseline
On‑prem or private cloud
Why businesses choose Minima
Performance-first
Up to 2×–10× faster LLM inference via mnma’s compression, kernel optimizations and speculative decoding — without rewriting your application or switching models.
Local-first & compliant
On-prem or private cloud (VPC) deployment ensures full data control — no third-party AI services, no external data transfers, and easier compliance with GDPR, HIPAA and SOC 2.
Cost-efficient GPUs
Cut VRAM usage roughly in half and run the same workloads on half the hardware. Keep your existing GPUs while unlocking much higher throughput and a lower cost per token.
How mnma makes LLMs smaller and faster
  1. Analyze model & hardware
    We profile your LLM and GPU stack. A custom CNN predicts layer and patch sensitivity so we know exactly where compression is safe and where we must preserve detail.
    Custom CNN predicting layer and patch sensitivity
  2. Compress what’s safe
    Low-sensitivity patches are compressed using tensor networks (Tucker, TT, TR), while high-sensitivity regions are protected. This typically yields around 2× compression without catastrophic drift (the code sketch below illustrates the basic idea).
  3. Heal accuracy
    A fast “healing” fine-tune brings performance back to near-baseline, keeping accuracy loss within ~1–5% depending on the benchmark.
  4. Optimize kernels
    We ship custom GPU kernels written in Triton so the compressed model runs as fast as, or faster than, the baseline implementation on your GPUs.
  5. Turbo decode with speculation
    With VRAM freed up, we enable speculative decoding: a small draft model proposes tokens and the main LLM verifies them in batches, further boosting throughput without sacrificing quality.
End-to-end mnma optimization pipeline
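To make the compression step concrete, here is a minimal, hypothetical sketch of the underlying idea: replace a large weight matrix with smaller factors. It uses a plain truncated SVD on a single linear layer rather than mnma’s sensitivity-guided Tucker/TT/TR decompositions, and none of the names below are part of the mnma API.

```python
# Illustrative sketch only: compress one nn.Linear by replacing its weight
# matrix with a rank-r factorization. mnma uses tensor-network decompositions
# (Tucker/TT/TR) chosen per layer by the sensitivity model; this stand-in just
# shows how trading a big matrix for smaller factors saves memory and compute.
import torch
import torch.nn as nn


def low_rank_compress(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate linear.weight (out x in) with two smaller matrices via truncated SVD."""
    W = linear.weight.data                       # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_features, rank), singular values folded in
    V_r = Vh[:rank, :]                           # (rank, in_features)

    down = nn.Linear(linear.in_features, rank, bias=False)                    # x -> V_r @ x
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)   # -> U_r @ (V_r @ x) + b
    down.weight.data.copy_(V_r)
    up.weight.data.copy_(U_r)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)


# A 4096x4096 projection (~16.8M parameters) at rank 512 shrinks to ~4.2M parameters.
layer = nn.Linear(4096, 4096)
compressed = low_rank_compress(layer, rank=512)
x = torch.randn(1, 4096)
print(torch.nn.functional.mse_loss(layer(x), compressed(x)).item())
```

The approximation error this kind of factorization introduces is exactly what the healing fine-tune in step 3 is there to recover.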
Real performance on real models
Live demo: mnma‑compressed Qwen3-32B vs baseline
mnma-optimized Qwen3-32B (bfloat16)
  • Baseline: ~63 GiB VRAM, ~40 tokens/s
  • mnma: ~40 GiB VRAM, ~70 tokens/s
  • ≈1.75× speedup, ≈36% less VRAM, ~1–5% accuracy delta depending on the benchmark.
Numbers from our internal demo. Actual gains vary by model and hardware; across tests we see 2×–10× speedups on different workloads.
Under the hood: how mnma works (explainer)
This short explainer walks through the pipeline — sensitivity analysis, tensor decompositions (Tucker, TT, TR), the healing fine-tune, Triton kernels and speculative decoding — and how they combine into a practical optimization stack.
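For readers who want the decoding step spelled out, below is a minimal, self-contained sketch of the draft-and-verify loop behind speculative decoding. The “models” are plain Python callables standing in for a small draft LLM and the large target LLM; a production engine such as mnma’s verifies the whole draft in a single batched forward pass on the GPU and typically uses a probabilistic accept/reject rule rather than this greedy one.

```python
# Toy greedy speculative decoding: a cheap draft model proposes k tokens, the
# target model checks them, and we keep the longest agreeing prefix plus one
# corrected token. Real engines verify the entire draft in one batched forward
# pass of the large model; this sketch only illustrates the control flow.
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token "model"


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int, k: int = 4) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposal: accept tokens it agrees
        #    with, and emit its own token at the first disagreement.
        accepted: List[Token] = []
        for t in proposal:
            expected = target(tokens + accepted)
            if expected == t:
                accepted.append(t)
            else:
                accepted.append(expected)
                break
        else:
            accepted.append(target(tokens + accepted))  # bonus token after a full match
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


# Stand-in "models": both continue a counting sequence, but the draft slips up
# occasionally, so some proposals get rejected and corrected by the target.
target_model: NextTokenFn = lambda ctx: ctx[-1] + 1
draft_model: NextTokenFn = lambda ctx: ctx[-1] + (2 if len(ctx) % 5 == 0 else 1)
print(speculative_decode(target_model, draft_model, prompt=[0], max_new_tokens=10))
```

Because several tokens can be accepted per verification step, the large model runs fewer sequential forward passes, which is where the extra throughput comes from.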
Products built on the mnma runtime
mnma turbo inference engine
Drop‑in optimization runtime that compresses and accelerates LLMs for your specific GPUs. Achieve 2×–10× higher throughput with less VRAM and only a 1–5% accuracy delta, without changing your application code.
  • Supports open-weights and internal LLMs
  • Automatic hardware-aware optimization
  • On-prem or private cloud
Minima AI Agent (local RAG)
Local-first RAG and agents platform built on mnma. “Search in your data, chat with your data, agents on your data” across files, databases, and internal tools – fully on‑prem or in your VPC.
  • On-prem or VPC deployment
  • Connectors to your documents and systems
  • Chat UI, APIs, and workflows
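For orientation, here is a minimal, hypothetical sketch of the retrieve-then-generate pattern that Minima AI Agent productizes: embed local document chunks, pull the most relevant ones for a question, and build a grounded prompt for an on-prem LLM. The embed() function below is a toy bag-of-words stand-in, not Minima’s embedding stack, and none of the names are part of the product’s API.

```python
# Minimal local RAG sketch: embed document chunks, retrieve the closest ones
# for a question, and assemble a grounded prompt for a locally hosted LLM.
# embed() is a toy bag-of-words stand-in; a real deployment would use a local
# embedding model plus a vector index, and connectors to files and databases.
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "Invoices are stored on the finance share under /finance/invoices.",
    "The VPN must be enabled before opening internal dashboards.",
    "GPU clusters are reserved through the infra ticketing queue.",
]
question = "Where are invoices stored?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)  # this prompt would then go to the on-prem LLM served by mnma
```

Every piece of this loop (documents, index, and model) stays inside your network, which is the point of the local-first design.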
Who benefits from faster local LLMs?
ML platform & infra teams
Serve 7B–70B LLMs with lower latency and fewer GPUs. Increase throughput per node and make autoscaling cheaper while staying fully on-prem for sensitive workloads.
AI & product teams
Launch AI features without burning GPU budget. Use mnma to optimize Qwen, Llama, Gemma, or internal models, then layer RAG and agents on top with Minima AI Agent.
Regulated industries
Run GPT-class LLMs without sending data to third-party APIs. mnma keeps models local; Minima AI Agent brings compliant RAG and agent workflows to your data.
Edge and hybrid deployments
Fit powerful LLMs into constrained VRAM budgets in on-prem clusters or edge appliances. mnma squeezes more performance out of every GPU.
Qwen3‑32B benchmark comparison
Baseline Qwen3-32B (bfloat16)
  • Throughput: ~40 tokens/s
  • VRAM usage: ~63 GiB
  • Accuracy: 100% (baseline reference)
  • Hardware: H100 80 GiB
mnma-optimized Qwen3-32B (bfloat16)
  • Throughput: ~70 tokens/s (1.75× speedup)
  • VRAM usage: ~40 GiB (~36% less VRAM)
  • Accuracy: ~95–99% of baseline (1–5% delta)
  • Hardware: A100 48 GiB
Team
David – Co-founder of Minima
David – Co-founder
David is an AI architect based in Silicon Valley, with experience designing and scaling LLM systems and ML platforms. He’s focused on making large models efficient and practical on real hardware.
Sergii – Co-founder of Minima
Sergii – Co-founder
Sergii is an ex-Principal AI Engineer at Atlassian with a background in shipping search, recommendations, and generative AI features at scale. He brings deep expertise in math, systems, tooling, and developer experience.
Frequently Asked Questions
What does “2×–10× faster” actually mean?
We measure speed in tokens per second for a specific model and hardware configuration. mnma combines safe compression, optimized kernels, and speculative decoding to increase tokens/second. In our Qwen3‑32B demo, throughput improves from ~40 to ~70 tokens/s with ~36% less VRAM. On some model–hardware combos we see up to 10× gains.
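For concreteness, here is a minimal illustration of how a tokens-per-second figure like this can be measured: a small harness that times any generate function and reports throughput. The generate_stub below is a placeholder that simulates roughly 40 tokens/s so the script runs anywhere; it is not part of mnma.

```python
# Simple throughput harness: time how long it takes to produce N new tokens
# and report tokens/second. generate_stub is a placeholder that simulates a
# ~40 tokens/s model; swap in a call to your own model or serving endpoint
# to benchmark real hardware.
import time
from typing import Callable, List


def generate_stub(prompt: str, max_new_tokens: int) -> List[str]:
    tokens = []
    for i in range(max_new_tokens):
        time.sleep(0.025)          # pretend each token costs ~25 ms
        tokens.append(f"tok{i}")
    return tokens


def tokens_per_second(generate: Callable[[str, int], List[str]],
                      prompt: str, max_new_tokens: int = 64) -> float:
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


print(f"{tokens_per_second(generate_stub, 'Hello'):.1f} tokens/s")
```

Running the same measurement against a baseline and an mnma-optimized deployment of the same model gives the kind of before/after comparison shown above.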
Which models can mnma optimize?
mnma is designed to work with modern transformer‑based LLMs, including open‑weights and internal models. If it runs as a standard transformer today, mnma can usually analyze and optimize it. We work with customers to support their specific stacks.
Is mnma only for RAG?
No. mnma is model‑ and use‑case‑agnostic. You can use it to accelerate chatbots, code assistants, RAG systems, document automation, or any other LLM‑powered service. Minima AI Agent is our turnkey local RAG and agents product built on top of mnma.
How is Minima deployed and priced?
We deploy on-premises or inside your private cloud (VPC). mnma, the turbo inference engine, is offered as a usage-based subscription, and Minima AI Agent is sold as an annual platform subscription (ARR). Contact us for details about the full platform.
Unlock faster, smaller on-prem LLMs
  • Turbo inference on your hardware: compression, Triton kernels, and speculative decoding tuned to your GPUs.
  • Fits big models into less VRAM: run GPT-class performance with roughly half the memory footprint.
  • Fully local, fully private: deployed on-prem or in your VPC, with no external LLM calls.