LLM inference engine. Rust+CUDA on GPU, JAX+XLA on TPU.
Three Gemma 4 models on TPU v6e-4: E4B (16,794 tok/s peak, 78.3 tok/s B=1, PPL 5.87), 26B-A4B MoE (14,899 tok/s peak), 31B (9,600 tok/s peak, 128K context). GPU: 31B on H100 at 63 tok/s single-user decode (84 with speculative decoding on real text), 8,786 tok/s peak batch (FP8, CUDA graph). Zero custom kernels on TPU -- ~500 lines of JAX. Native Rust binary on GPU -- zero Python in the serving path.
Full benchmarks | June 2026 H100 session record
| | E4B (4B) | 26B-A4B (MoE) | 31B TPU | 31B GPU | vLLM H100 |
|---|---|---|---|---|---|
| B=1 tok/s | 78.3 | 52.9 | 44.2 | 63.0 (83.9 spec) | 66.9 |
| Peak tok/s | 16,794 | 14,899 | 9,600 | 8,786 | 3,848 |
| PPL | 5.87 | 90.21 | 24.76 | 14.75* | - |
| Cached TTFT | 25.9 ms | 35.3 ms | 73.3 ms | 63 ms | - |
| Peak tok/s/$ | 3,230 | 2,865 | 1,846 | 4,576 | 2,004 |
TPU: v6e-4, $5.20/hr, int8, max-ctx 2048 (measured 2026-04/05). GPU: H100 SXM, $1.92/hr, FP8; B=1 measured 2026-06-09 (commit 0d5f276a5), peak-batch and vLLM comparison measured 2026-04. *The GPU PPL row is flagged for re-verification -- see the perplexity section.
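The "Peak tok/s/$" row is plain arithmetic: peak throughput divided by the hourly price, so its unit is really tok/s per ($/hr). A minimal sketch that reproduces the row from the numbers quoted above; the dictionary keys and function names are illustrative, not part of the repo:

```python
# Hourly prices and peak throughputs copied from the summary table and its footnote.
PRICE_PER_HOUR = {"tpu_v6e4": 5.20, "h100_sxm": 1.92}

PEAK_TOK_S = {
    "E4B": (16_794, "tpu_v6e4"),
    "26B-A4B": (14_899, "tpu_v6e4"),
    "31B TPU": (9_600, "tpu_v6e4"),
    "31B GPU": (8_786, "h100_sxm"),
    "vLLM H100": (3_848, "h100_sxm"),
}


def tok_s_per_dollar(tok_s, price_per_hour):
    """Peak tokens/second divided by the hourly price in dollars."""
    return tok_s / price_per_hour


def cost_row():
    """Rebuild the 'Peak tok/s/$' row, rounded to whole numbers."""
    return {
        name: round(tok_s_per_dollar(tok_s, PRICE_PER_HOUR[hw]))
        for name, (tok_s, hw) in PEAK_TOK_S.items()
    }


if __name__ == "__main__":
    for name, value in cost_row().items():
        print(f"{name:10s} {value:,}")
```

The metric only compares peak batch throughput; it says nothing about single-user latency, which the B=1 row covers.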
Pure JAX + XLA. No custom kernels. XLA compiles the entire forward pass to TPU machine code from a ~500 line JAX script. Three models, one codebase.
| Property | E4B (4B) | 26B-A4B (MoE) | 31B |
|---|---|---|---|
| Total / active params | ~4B / 4B | 26B / ~4B | 31B / 31B |
| Layers | 42 | 30 | 60 |
| Hidden size | 2,560 | 2,816 | 5,376 |
| Q / KV heads (sliding) | 8 / 2 | 16 / 8 | 32 / 16 |
| Q / KV heads (global) | 8 / 2 | 16 / 2 (V=K) | 32 / 4 (V=K) |
| Head dim (sliding / global) | 256 / 512 | 256 / 512 | 256 / 512 |
| Sliding window | 512 | 1,024 | 1,024 |
| MoE | none | 128 experts, top-8 | none |
| KV-shared layers | 18 (of 42) | 0 | 0 |
| Per-layer input injection | 256-d gated (5.6 GB embed) | none | none |
| Batch | E4B tok/s | 26B-A4B tok/s | 31B tok/s | vLLM H100 |
|---|---|---|---|---|
| 1 | 78 | 53 | 44 | 66.9 |
| 8 | 542 | 390 | 318 | 515 |
| 64 | 3,661 | 2,662 | 2,112 | 2,794 |
| 128 | 6,298 | 4,915 | 3,853 | 3,848 |
| 256 | 10,214 | 8,192 | 6,246 | 3,709 |
| 512 | 13,773 | 12,390 | 8,550 | 3,788 |
| 768 | 15,514 | 14,899 | 9,600 | 3,671 |
| 1024 | 16,794 | - | - | - |
| Context | ms/step | tok/s | Architecture | KV type |
|---|---|---|---|---|
| 512 | 12.79 | 78.2 | Single-scan, 60-layer scan + cond | bf16 |
| 2,048 | 22.6 | 44.2 | Single-scan | bf16 |
| 32K | ~66 | ~15 | Single-scan | bf16 |
| 64K | ~91 | ~11 | Split-cache, 10 groups x 6 | int8 |
| 128K | 40.56 | 24.7 | Split-cache + blockwise global | int8 |
Dual-path architecture auto-switches at the 32K boundary.
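A rough sketch of the two ideas in the context table: tok/s at B=1 is just 1000 / ms-per-step, and the path is picked from the requested context. The threshold constant and `select_path` are hypothetical names, and treating 32K itself as single-scan is an assumption read off the table (32K is listed as single-scan, 64K as split-cache); the harness may decide differently.

```python
# Assumption: contexts up to and including 32K tokens stay on the single-scan path.
SPLIT_CACHE_THRESHOLD = 32 * 1024


def select_path(max_ctx):
    """Pick the decode path for a requested maximum context length."""
    return "single-scan" if max_ctx <= SPLIT_CACHE_THRESHOLD else "split-cache"


def tok_per_s(ms_per_step, batch=1):
    """Convert a measured step time into tokens per second."""
    return batch * 1000.0 / ms_per_step


if __name__ == "__main__":
    for ctx, ms in [(512, 12.79), (2048, 22.6), (131072, 40.56)]:
        print(ctx, select_path(ctx), round(tok_per_s(ms), 1))
```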
# Create TPU v6e-4 ($5.20/hr)
gcloud compute tpus tpu-vm create rvllm-gemma4 \
--zone=us-east5-b --accelerator-type=v6e-4 --version=v2-alpha-tpuv6e \
--boot-disk-size=200
# Install (30 seconds)
pip3 install 'jax[tpu]' huggingface_hub tokenizers \
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html
# Download model
huggingface-cli download google/gemma-4-E4B-it --local-dir ~/models/gemma-4-E4B-it
# Run E4B (78.3 tok/s B=1)
python3 tpu/harness/gemma4_tpu_infer.py \
--model-dir ~/models/gemma-4-E4B-it --max-tokens 200 --max-ctx 2048
# Run 31B batched (9,600 tok/s B=768)
LIBTPU_INIT_ARGS="--xla_tpu_enable_async_collective_fusion=true \
--xla_tpu_enable_async_collective_fusion_fuse_all_gather=true \
--xla_tpu_enable_async_collective_fusion_multiple_steps=true \
--xla_tpu_overlap_compute_collective_tc=true \
--xla_tpu_scoped_vmem_limit_kib=131072" \
python3 tpu/harness/gemma4_tpu_infer.py \
--model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 2048 --batch 768
# 128K context (24.7 tok/s)
python3 tpu/harness/gemma4_tpu_infer.py \
--model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 131072
# API server (OpenAI-compatible)
python3 tpu/harness/api_server.py --model-dir ~/models/gemma-4-31B-it --port 8080
# Perplexity
python3 tpu/harness/gemma4_tpu_infer.py \
--model-dir ~/models/gemma-4-31B-it --perplexity --max-ctx 2048

No Docker. No conda. No torch. No vLLM. One pip install, one Python file, one command.
450M-param draft head proposes K=5 tokens per cycle; the full 31B verifies K+1=6 in one forward pass. Lossless for greedy decode.
| Metric | Value |
|---|---|
| Baseline (B=1, 512 ctx) | 78.2 tok/s, 12.79 ms/step |
| EAGLE-3 fused cycle | 31.0 ms/cycle |
| Projected @ tau=3.5 | ~145 tok/s (1.8x) |
| Hardware ceiling | ~300 tok/s (3.8x) |
Requires 50K+ training examples for production tau. Current: 2K examples, loss 7.1, pipeline validated end-to-end. See tpu/harness/EAGLE3_SPEC.md.
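One reading that reproduces the ~145 tok/s projection: each fused cycle emits tau accepted draft tokens plus one token from the verify pass, so throughput is (tau + 1) / cycle time. That reading is an assumption, not something EAGLE3_SPEC.md is quoted as saying here; the sketch only checks the arithmetic against the table.

```python
BASELINE_TOK_S = 78.2   # B=1, 512 ctx, from the table above
CYCLE_MS = 31.0         # measured EAGLE-3 fused cycle


def projected_tok_s(tau, cycle_ms=CYCLE_MS):
    """Tokens per second if each cycle yields tau drafts plus one verified token.

    Assumption: tau counts accepted draft tokens only, hence the '+ 1'.
    """
    return (tau + 1.0) * 1000.0 / cycle_ms


def speedup(tau, baseline=BASELINE_TOK_S, cycle_ms=CYCLE_MS):
    """Projected throughput relative to plain B=1 decode."""
    return projected_tok_s(tau, cycle_ms) / baseline


if __name__ == "__main__":
    print(round(projected_tok_s(3.5)), round(speedup(3.5), 2))
```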
Rust + CUDA on H100 SXM 80GB. FP8 weights with per-channel scales, FP8 or F16 paged KV, FA3 SM90 attention for sliding layers + a split-KV FP8 decode kernel for the fallback/global path. All 60 layers captured in a single CUDA graph. 63.0 tok/s single-user decode (2026-06-09, commit 0d5f276a5), 83.9 tok/s with speculative decoding on real text at real context, 8,786 tok/s peak (B=512, 2026-04).
A single user generates one token at a time (batch=1, M=1). This is the hardest case for a 31B model and is purely weight-bandwidth bound: every token reads all ~30 GB of FP8 e4m3 weights once, so on an H100 (HBM3 peak 3.35 TB/s) the floor is ~9 ms/token ≈ 104-109 tok/s for plain decode. Speculative decoding is the only way past that roofline (it amortizes the weight read over multiple verified tokens).
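The roofline is one division. A minimal sketch, assuming GB means 1e9 bytes and that nothing but the weight read costs time: with the rounded 30 GB it gives about 8.96 ms and about 112 tok/s, and the quoted 104-109 tok/s range corresponds to roughly 31-32 GB of weights under the same model.

```python
HBM3_PEAK_BYTES_PER_S = 3.35e12  # H100 SXM peak, as quoted above


def decode_roofline(weight_bytes, bandwidth=HBM3_PEAK_BYTES_PER_S):
    """Return (ms per token, tokens per second) if every token reads all weights once."""
    ms_per_token = weight_bytes / bandwidth * 1e3
    return ms_per_token, 1e3 / ms_per_token


if __name__ == "__main__":
    for gigabytes in (30, 31, 32):
        ms, tok_s = decode_roofline(gigabytes * 1e9)
        print(f"{gigabytes} GB -> {ms:.2f} ms/token, {tok_s:.1f} tok/s")
```

Speculative decoding changes the numerator, not the bandwidth: one weight read is shared by every token accepted in that verify pass.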
What actually makes B=1 fast (each item measured, June 2026):
1. Route small-M GEMMs through cuBLASLt (RVLLM_FP8_GEMM_LT_M1=1). The CUTLASS channelscale GemmUniversal that serves large-batch GEMMs runs M=1 at only ~51% of HBM in situ (69.7 µs avg per GEMM, node-level nsys); cuBLASLt runs the same shapes at ~80% (v3/M1_OUTCOME.md had this right). Rerouting M≤16 through cuBLASLt + a scale_cols pass + f32→f16 cast is +27% end-to-end (44.4 → 56.5 tok/s at B=1, 388.8 → 495.9 at B=8) even though it adds ~480 graph nodes per step. Two non-obvious rules survive: cuBLASLt OUTER_VEC channelscale fails the sm_90 heuristic (measured LaunchFailed) so the scale stays a separate pass, and do not hand-roll an FP8 GEMV -- measured to lose to cuBLASLt on every shape.
2. CUDA-graph the decode loop (default-on; RVLLM_DECODE_GRAPH=0 for eager). Eager re-issues ~122 sync HtoD copies per step and host-serializes the GPU: 17 → 44.9 tok/s (2.6×), token-hash identical. One correction to an earlier analysis that claimed ~12 ms/step of "inter-node dispatch gap": node-level tracing (nsys --cuda-graph-trace=node) shows the real gap is ~1.1 ms/step. The missing time was GEMM efficiency (point 1) and the attention kernel (point 3), not dispatch. Fusing dispatch (megakernel) was built anyway and measured 1.7× slower -- see v3/MEGAKERNEL_OUTCOME.md.
3. A paged-attention kernel that doesn't scale with context like a brick. The original FP8 decode kernel walked the KV window one token at a time (two __syncthreads() per token, one block per query head): 551 µs/call at a full 1024-token sliding window, which put long-context FP8 decode at 14 tok/s. The split-KV GQA-grouped rewrite (v3/kernels/fp8_decode_v2.cu) loads each KV chunk once for all query heads of its KV head, zero barriers in the token loop: 15.7 µs at the full window (35×), 40 µs at ctx 8192 (171×), parity ≤ 4.9e-4 vs the old kernel across 16 shape/scale variants. FP8-KV decode now beats the F16-KV path at every context length.
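The split-KV idea in the attention rewrite can be checked on the CPU: attend to each KV chunk independently, keep each chunk's log-sum-exp (LSE), then merge the partial outputs weighted by exp(LSE). The sketch below is a pure-Python illustration of that math only; it is not the kernel. It leaves out FP8, paging, GQA grouping and the 1/sqrt(d) score scaling, and all function names are made up for the example.

```python
import math
import random


def _scores(q, keys):
    """Dot product of the query with every key."""
    return [sum(a * b for a, b in zip(q, k)) for k in keys]


def _softmax_weighted(scores, values):
    """Return (normalized output, log-sum-exp) for one set of scores."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, top + math.log(total)


def attend_full(q, keys, values):
    """Reference: one softmax over the whole KV window."""
    return _softmax_weighted(_scores(q, keys), values)[0]


def attend_split_kv(q, keys, values, chunk):
    """Attend per chunk, then combine partial outputs using their LSEs."""
    partials = [
        _softmax_weighted(_scores(q, keys[i:i + chunk]), values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    top = max(lse for _, lse in partials)
    weights = [math.exp(lse - top) for _, lse in partials]
    total = sum(weights)
    dim = len(partials[0][0])
    return [sum(w * out[d] for w, (out, _) in zip(weights, partials)) / total
            for d in range(dim)]


def max_abs_error(n_tokens=1024, head_dim=8, chunk=128, seed=0):
    """Largest difference between the split and the full result on random data."""
    rng = random.Random(seed)
    q = [rng.uniform(-1, 1) for _ in range(head_dim)]
    keys = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
    values = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
    full = attend_full(q, keys, values)
    split = attend_split_kv(q, keys, values, chunk)
    return max(abs(a - b) for a, b in zip(full, split))
```

Because the chunks are independent until the combine step, nothing in the per-token loop needs a barrier, which is the property the rewrite exploits.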
Measured B=1 generate (H100 SXM5, FP8 e4m3 weights, graphed, 2026-06-09/10):
| path | tok/s | notes |
|---|---|---|
| FP8 KV, short ctx | 63.0 | production kernel set, RVLLM_FP8_GEMM_LT_M1=1 |
| FP8 KV, ctx 1200-1500 | 61.4 | the production serving regime (fast prefill forces FP8 KV) |
| F16 KV, short ctx | 56.5 | FA3 sliding decode |
| + speculative decoding (K=4, real text, ctx 1200+) | 83.9 | n-gram drafter, see below |
| + speculative decoding (K=4, repetitive text) | 153.0 | acceptance-dependent upper range |
| HBM weight roofline (plain decode) | ~104-109 | ~30 GB FP8 ÷ 3.35 TB/s |
| historical: graphed baseline before June 2026 | 44.9 | F16 KV; FP8-KV long-context was 14 |
Draft up to K tokens by n-gram prompt lookup (zero extra model), verify [last, drafts...] in ONE forward at M=K+1 -- the 30 GB weight read amortizes over every accepted token. Greedy acceptance; every emitted token is a model argmax. The verify forward is graph-captured per chunk size (RVLLM_SPEC_GRAPH=0 to force eager). Gates: K=0 is bit-identical to plain decode (400-token hash equality, including past the sliding-window ring wrap); graphed and eager spec runs are bit-identical to each other; K>0 vs plain decode is quality-identical but not bit-identical -- batched verify GEMMs differ from M=1 by ulps and flip genuine near-tie argmaxes (measured example: "PPL (Peak)" vs "PPL (Cached)", both coherent). Enable: RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4. Measured accept rate on real prose: 0.42-0.56/draft, 2.4-3.1 tokens per verify cycle. K=4 is the sweet spot; K=6-8 lose acceptance. Next: Gemma4-E4B as a model drafter (same tokenizer family) to lift acceptance on novel text.
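A toy version of the draft/verify loop, to make the acceptance rule concrete. Everything here is illustrative: `ngram_draft`, `verify`, `generate` and the toy model are hypothetical names, the n-gram length of 2 is an arbitrary choice, and the toy calls the model once per position where the engine runs one batched forward at M=K+1. Because the toy model is exact, it also cannot show the ulp-level near-tie flips described above.

```python
def ngram_draft(tokens, k, n=2):
    """Propose up to k tokens: find the latest earlier match of the last n tokens
    and copy what followed it."""
    if k <= 0 or len(tokens) <= n:
        return []
    tail = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []


def verify(argmax_after, tokens, drafts):
    """Greedy acceptance: emit the model's argmax at each position and stop at
    the first position where the draft disagrees with it."""
    emitted = []
    context = list(tokens)
    for draft in drafts:
        predicted = argmax_after(context)
        emitted.append(predicted)
        if predicted != draft:
            return emitted
        context.append(draft)
    emitted.append(argmax_after(context))  # extra token after a fully accepted draft
    return emitted


def generate(argmax_after, prompt, max_new, k=4):
    """Return (tokens, verify cycles). k=0 degenerates to plain greedy decode."""
    tokens, cycles = list(prompt), 0
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(verify(argmax_after, tokens, ngram_draft(tokens, k)))
        cycles += 1
    return tokens[:len(prompt) + max_new], cycles


def toy_argmax(context):
    """Stand-in for the model: next token depends on the last two tokens."""
    return (3 * context[-1] + context[-2]) % 7
```

Running `generate(toy_argmax, [1, 2, 3, 1, 2, 3, 1, 2], 40, k=4)` returns the same tokens as `k=0` in fewer cycles, which is the whole point: fewer weight reads for the same output.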
Decode-step bench (run_bench, FP8 weights, 40 iters/8 warmup), default
configuration -- the engine now auto-routes small-M GEMMs through cuBLASLt
up to the measured crossover (M≤64) and CUTLASS above it:
| Batch | tok/s (default) | ms/step | vs CUTLASS-only | route |
|---|---|---|---|---|
| 1 | 64.4 | 15.5 | +32% | cuBLASLt |
| 2 | 125.3 | 16.0 | +29% | cuBLASLt |
| 4 | 249.1 | 16.1 | +27% | cuBLASLt |
| 8 | 495.5 | 16.1 | +27% | cuBLASLt |
| 16 | 949.2 | 16.9 | +21% | cuBLASLt |
| 32 | 1,741 | 18.4 | +16% | cuBLASLt |
| 64 | 2,997 | 21.4 | +4% | cuBLASLt |
| 128 | 5,211 | 24.6 | (CUTLASS wins +12%) | CUTLASS |
| 256 | 7,607 | 33.7 | (CUTLASS wins +20%) | CUTLASS |
The crossover (RVLLM_FP8_GEMM_LT_MAX_M, default 64) was calibrated by
running every batch size both ways. Historical April-2026 table (different
harness settings, 100 iters): B=512 reached 8,786 tok/s; B≥64 April rows
ran ~5-10% above the fresh 40-iter numbers -- treat cross-date deltas at
B≥64 as methodology noise, not regression, until re-run at matched iters.
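The routing rule itself is a one-line threshold. A sketch of the decision in Python (the engine is Rust; `gemm_route` is a hypothetical name), assuming the override is read as a plain integer from RVLLM_FP8_GEMM_LT_MAX_M:

```python
import os

DEFAULT_MAX_M = 64  # measured crossover from the table above


def gemm_route(m, env=None):
    """Return which backend handles an FP8 GEMM with M rows.

    `env` defaults to the process environment; pass a dict to test overrides.
    """
    env = os.environ if env is None else env
    max_m = int(env.get("RVLLM_FP8_GEMM_LT_MAX_M", DEFAULT_MAX_M))
    return "cuBLASLt" if m <= max_m else "CUTLASS"


if __name__ == "__main__":
    for batch in (1, 16, 64, 128, 256):
        print(batch, gemm_route(batch))
```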
Single-user / greedy generate (full pipeline incl. lm_head + sampling, same commit, same day):
| config | decode tok/s | e2e tok/s |
|---|---|---|
| FP8 KV, short prompt | 63.0 | 57.9 |
| FP8 KV, 1200-token prompt (production shape) | 61.4 | 11.8* |
| F16 KV, short prompt | 56.6 | 44.7 |
| spec-decode K=4, real text, 1200-token prompt | 83.9 | 12.6* |
| spec-decode K=4, repetitive text, short | 153.1 | 126.0 |
*Long-prompt e2e is dominated by per-token prefill (20.5 s TTFT for 1200
tokens on the bench; the open work item). Opt-in RVLLM_FP8_GEMM_LT_F16OUT=1
(cuBLASLt writes f16 directly, one in-place channel-scale kernel) adds
another +1.4-2% deterministically but changes rounding (f16-before-scale);
it stays opt-in until a unified ppl gate clears it.
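Why a 61.4 tok/s decode rate turns into 11.8 tok/s end-to-end: e2e is new tokens divided by (TTFT + decode time). The sketch below assumes about 300 generated tokens, which is not stated anywhere above; it is simply the value at which this model lands on the table's long-prompt row.

```python
def e2e_tok_s(new_tokens, ttft_s, decode_tok_s):
    """End-to-end rate: tokens divided by prefill wait plus decode time."""
    return new_tokens / (ttft_s + new_tokens / decode_tok_s)


if __name__ == "__main__":
    n = 300  # assumption: generated-token count, not given in the tables
    print(round(e2e_tok_s(n, 20.5, 61.4), 1))   # per-token prefill TTFT
    print(round(e2e_tok_s(n, 0.67, 61.4), 1))   # same decode rate with a 0.67 s TTFT
```

Under the same assumption, a 0.67 s TTFT would put long-prompt e2e above 50 tok/s, so prefill, not decode, is the lever for that row.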
vLLM column measured 2026-04 (vLLM 0.19, FP8, CUDA graphs); rvLLM B=1 updated 2026-06 -- an apples-to-apples re-run against current vLLM is owed before claiming the B=1 row.
| Batch | rvLLM tok/s | vLLM tok/s (2026-04) | Delta |
|---|---|---|---|
| 1 | 63.0 (83.9 spec) | 69 | -9% (+22% with spec) |
| 32 | 1,743 | 1,748 | ~0% |
| 64 | 3,265 | 3,130 | +4% |
| 128 | 5,802 | 4,689 | +24% |
| 256 | 7,808 | 7,077 | +10% |
| 512 | 8,786 | 8,243 | +7% |
| Weight path | KV cache | PPL | tok/s (B=1, at the time) |
|---|---|---|---|
| FP8-Dynamic + CUTLASS channelscale epilogue | F16 | 14.75 | 53 |
| BF16 split QKV per-tensor FP8 | F16 | 17.96 | 37.9 |
| F16 weights (no FP8) | F16 | 19.79 | 37.9 |
| HuggingFace BF16 reference | -- | 19.62 | -- |
Honesty flag: FP8 weights measuring 25% better perplexity than the BF16 reference (14.75 vs 19.62) is not plausible as a quantization effect; it almost certainly reflects an eval-config difference between the paths (the logit-softcap is applied on the ppl path only, and the reference was run through a different harness). Treat the relative ordering of the first three rows as informative and the absolute comparison to the HF reference as unverified until the eval is unified. Re-verification is queued.
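To see why a softcap applied on one path only makes the numbers incomparable, here is a small self-contained perplexity calculation: exp of the mean negative log-likelihood, with an optional 30 * tanh(logits / 30) cap. The logits are invented for illustration and the function names are not the harness API.

```python
import math


def softcap(logits, cap=30.0):
    """cap * tanh(logits / cap), applied element-wise."""
    return [cap * math.tanh(x / cap) for x in logits]


def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return lse - logits[target]


def perplexity(logit_rows, targets, cap=None):
    """exp(mean NLL); pass cap=30.0 to softcap the logits first."""
    total = 0.0
    for logits, target in zip(logit_rows, targets):
        if cap is not None:
            logits = softcap(logits, cap)
        total += token_nll(logits, target)
    return math.exp(total / len(targets))


if __name__ == "__main__":
    rows = [[60.0, 0.0, -10.0], [2.0, 1.0, 0.0]]   # made-up logits
    targets = [1, 0]
    print(perplexity(rows, targets), perplexity(rows, targets, cap=30.0))
```

The same logits give two different perplexities depending on whether the cap is on, and the direction depends on the data, so a unified eval has to fix that switch before any cross-path comparison means anything.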
For each layer in 0..60:
1. fused_rmsnorm_fp8_quant input layernorm + FP8 quantize
2. cutlass_fp8_gemm_channelscale fused Q||K||V + channelscale epilogue
3. fused_qkv_rmsnorm Q/K norm (learned) + V norm (parameter-free)
4. fused_rope_partial_f16kv partial RoPE + F16 KV cache write
5. paged_decode (FA3 SM90) attention (head_dim=256 sliding, 512 global)
6. quantize_fp8_per_token attn output to FP8
7. fp8_gemm O projection
8. fused_norm_add_residual channelscale + rmsnorm + residual add
9. fused_rmsnorm_fp8_quant pre-FFN layernorm + FP8 quantize
10. cutlass_fp8_gemm_channelscale fused gate||up + channelscale epilogue
11. fused_gelu_mul_fp8_quant GELU(tanh)(gate) * up to FP8
12. fp8_gemm down projection
13. fused_norm_add_residual channelscale + rmsnorm + residual + layer_scalar
Sampling tail:
fused_rmsnorm final layernorm
f16_gemm_f32 lm_head
logit_softcap 30 * tanh(logits / 30)
argmax_kernel token selection
Four rounds of fusion + custom CUTLASS epilogue reduced graph nodes from 1776 to ~935 (47% reduction):
| Fusion | Kernels eliminated | Nodes saved |
|---|---|---|
| f32_to_bf16 + rmsnorm + vector_add -> fused_norm_add_residual | 3 -> 1 (x2/layer) | 240 |
| scale_cols_f32 fused into norm+add kernel (O-proj, down) | 1 -> 0 (x2/layer) | 120 |
| residual_scale_f16 fused into post-ff norm+add | 1 -> 0 (x1/layer) | 60 |
| vnorm_f16 fused into qk_rmsnorm -> fused_qkv_rmsnorm | 2 -> 1 (x1/layer) | 60 |
| CUTLASS channelscale epilogue (QKV, gate_up) | 3 -> 1 (x2/layer) | 240+ |
The CUTLASS channelscale kernel uses a custom SM90 EVT epilogue that applies per-token activation scale (ColBroadcast) and per-channel weight scale (RowBroadcast) directly in the GEMM epilogue while the accumulator is still F32, then casts to F16. At M≤16 this kernel is not used in the fast path anymore: RVLLM_FP8_GEMM_LT_M1=1 reroutes small-M GEMMs through cuBLASLt (+27% measured) -- the smaller-tile CUTLASS variant that used to be "help wanted" here is obsolete.
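What the channelscale epilogue computes, shown on tiny matrices: multiplying the raw accumulator by the per-token scale (one per row) and the per-channel scale (one per column) gives the same result as dequantizing both operands first. Small integers stand in for FP8 values and the scales are powers of two so the comparison is exact; none of this models e4m3 rounding, and the function names are illustrative.

```python
def matmul(a, b):
    """Plain [M,K] x [K,N] product on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][n] for k in range(inner)) for n in range(cols)]
            for row in a]


def channelscale_epilogue(acc, act_scale, weight_scale):
    """Scale accumulator row m by act_scale[m] and column n by weight_scale[n]."""
    return [[acc[m][n] * act_scale[m] * weight_scale[n]
             for n in range(len(weight_scale))]
            for m in range(len(act_scale))]


def dequant_then_matmul(q_act, act_scale, q_weight, weight_scale):
    """Reference path: rescale both operands, then multiply."""
    act = [[v * act_scale[m] for v in row] for m, row in enumerate(q_act)]
    weight = [[v * weight_scale[n] for n, v in enumerate(row)] for row in q_weight]
    return matmul(act, weight)


# Quantized operands (integer stand-ins for FP8) and their scales.
Q_ACT = [[1, 2], [3, 4]]        # [tokens, K]
Q_WEIGHT = [[1, 0], [2, 1]]     # [K, channels]
ACT_SCALE = [0.5, 2.0]          # one per token
WEIGHT_SCALE = [0.25, 4.0]      # one per output channel

if __name__ == "__main__":
    fused = channelscale_epilogue(matmul(Q_ACT, Q_WEIGHT), ACT_SCALE, WEIGHT_SCALE)
    print(fused)
```

Doing the scaling in the epilogue means the GEMM runs on the quantized operands and the scales are applied once per output element instead of once per input element.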
Help wanted (current, real -- updated 2026-06-11):
- Close out the two-source prefill wiring (#58): the kernel is parity-proven bit-identical and measures TTFT 20.5 s -> 0.67 s at 1200-token prompts behind RVLLM_PREFILL_TWO_SOURCE=1, but an engine wiring divergence keeps it opt-in. The issue has the full evidence chain and a precise repro.
- Serve-session spec-decode optimization: spec behind the API measures 39.7 tok/s where the bench-side machinery does 83.9 -- the graphed verify isn't fully exploited in the session loop yet.
- Model drafter for spec-decode (Gemma4-E4B), replacing n-gram lookup for novel text.
Shipped from the previous help-wanted list (2026-06-11): cross-request graph persistence (served long-prompt decode 27.4 -> 59.6 tok/s) and the full API sampling contract (temperature/top_p/top_k/seed/stop -- the endpoint was greedy-only before).
# One-time on H100 box (~15 min)
bash kernels/build.sh # fused PTX
bash kernels/build_cutlass_so.sh # libcutlass_kernels.so
bash kernels/build_fa3.sh # libfa3_kernels.so (real FA3 -- needs flash-attention checkout;
# includes the hdim256 combine instantiation upstream lacks)
bash kernels/build_fa_sm89_so.sh # libfa_sm89_kernels.so (split-KV FP8 decode + global hd512)
# Build
cargo build --release --features cuda --manifest-path v3/Cargo.toml -p rvllm-bench
# Run
RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it \
RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90 \
RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so \
RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so \
RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json \
RVLLM_BATCH=128 RVLLM_ITERS=30 RVLLM_WARMUP=5 \
./v3/target/release/rvllm-bench

The server is a Rust-only Gemma 4 path with an OpenAI-compatible HTTP surface.
It keeps CUDA execution on a single engine owner thread and accepts requests
through /v1/chat/completions.
For the solidSF agents production shape, including 256K context, four-seat
admission, the paid-plan busy response, CAD harness prompting, systemd service
shape, and verification scripts, see
docs/solidsf-agent-serving.md.
export CUDA_ARCH=sm_90
export RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it
export RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90
export RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so
export RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so
export RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json
export RVLLM_SERVED_MODEL_NAME=gemma4-31b
export RUST_LOG=info
bash kernels/build.sh sm_90
bash kernels/build_cutlass_so.sh sm_90
bash kernels/build_fa3.sh
cargo build --release --features cuda,cublaslt --manifest-path v3/Cargo.toml -p rvllm-serve
./v3/target/release/rvllm-server \
--host 127.0.0.1 \
--port 8080 \
--max-model-len 8192 \
--max-num-seqs 1 \
--max-num-batched-tokens 2048 \
--max-prefill-chunk 128

The server exposes GET /health, GET /v1/models, and
POST /v1/chat/completions with non-stream and SSE streaming responses.
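The smoke requests below use greedy decoding (temperature: 0); the sampling parameters (temperature/top_p/top_k/seed/stop) shipped 2026-06-11, before which the endpoint was greedy-only.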
Smoke:
curl -fsS http://127.0.0.1:8080/health
curl -fsS http://127.0.0.1:8080/v1/models
curl -fsS http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"gemma4-31b","messages":[{"role":"user","content":"Reply exactly: RVLLM_RUST_OK"}],"max_tokens":16,"temperature":0}'
curl -fsS --no-buffer http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"gemma4-31b","messages":[{"role":"user","content":"hi"}],"max_tokens":16,"temperature":0,"stream":true}'

For bind-only local checks without CUDA:
RVLLM_DRY_RUN=1 cargo run --manifest-path v3/Cargo.toml -p rvllm-serve -- \
--host 127.0.0.1 \
--port 8080

Every kernel has a known purpose, a pinned variant, and a workspace contract. No dispatch fallback chains.
| Kernel | Purpose |
|---|---|
| cutlass_fp8_gemm_channelscale | SM90 FP8 GEMM with EVT channelscale epilogue (QKV, gate_up) |
| fused_rmsnorm_fp8_quant | layernorm + FP8 quantize in one launch |
| fused_qkv_rmsnorm | per-head RMSNorm on Q, K (learned) and V (parameter-free) |
| fused_rope_partial_f16kv | partial RoPE + F16 KV cache write |
| fused_gelu_mul_fp8_quant | GELU(tanh)(gate) * up to FP8 |
| fused_norm_add_residual | channelscale + RMSNorm + residual add (+ optional layer_scalar) |
| logit_softcap | 30 * tanh(logits / 30) |
| quantize_fp8_per_token | activation to FP8 with per-token scale |
| argmax | f32 logits to i32 token |
| fp8_decode_v2 (fa_sm89_* .so) | split-KV GQA-grouped FP8 paged decode + LSE combine (35-171x over the serial kernel) |
No fallbacks. Missing kernel .so = engine refuses to start. One earned scar: the loader probes the FA3 .so for fa3_sm90_* symbols and quietly selects the Ada-generation kernel set if they're absent -- which is how a production H100 served sm_89 attention for four days when a fallback .so got copied over the FA3 filename. Verify with nm -D $RVLLM_FA3_SO | grep fa3_sm90 after any kernel deploy; the engine now also refuses this combination on sm_90 unless explicitly overridden.
v3/crates/
rvllm-core typed errors, IDs, dtype, shape, config, env
rvllm-mem HbmArena, Region, Stream, Event, PinnedBuf, CudaContextHandle
rvllm-kernels manifest (sha-pinned), PTX loader, kernel catalog
rvllm-fused 8 fused-kernel launchers + pure-Rust f32 references
rvllm-attention FA3 SM90 paged decode/prefill dlopen
rvllm-cutlass FP8 variant catalog + schedule pairing trait + cuBLASLt wrapper
rvllm-metadata frozen-layout metadata per bucket (one upload path)
rvllm-loader safetensors mmap -> HBM + CPU-path FP8 quant + clamp gate
rvllm-sampling argmax tail, pinned DtoH
rvllm-graph captured-graph pool keyed on MetaLayoutHash
rvllm-runtime Engine, scheduler, layer_exec, bring_up
rvllm-bench RVLLM_* env-driven bench binary
rvllm-invariants DAG-dep test, no-megakernel gate
- No fallbacks. Missing autotune entry = engine panic. Missing .so = refuse start. No silent degradation.
- Graph-capture invariant. Metadata buffer layout frozen per (bucket, max_blocks_per_seq). Captured graphs bind exact offsets.
- CUTLASS schedule/epilogue pairing. Mainloop and epilogue schedules must match. Enforced via static_assert.
- No unwrap() in libraries. Result<T, RvllmError> end-to-end with structured context.
- Real block-change detection. Scheduler emits block table updates; missing signals = stale KV reads caught at the type level.
Apache-2.0.
- docs/bench.html - interactive benchmark results with charts
- v3/H100_MAXPERF_PLAN.md - the June 2026 measured session record: GEMM routing, FA3 incident, split-KV FP8 attention, spec-decode, production deploy
- v3/GEMMA4_SPEC.md - 31B Gemma 4 architecture details and weight shapes
- v3/M1_OUTCOME.md - batch=1 decode: cuBLASLt FP8 vs hand-GEMV measurements (vindicated June 2026 -- now the production route)
- v3/MEGAKERNEL_OUTCOME.md - persistent megakernel: built, measured, refuted (26 vs 45 tok/s)
- v3/SPECDECODE_SPEC.md - GPU speculative decoding design + lossless gates
- v3/SPEC_FP8_DECODE_ATTN_REWRITE.md - the split-KV FP8 attention kernel spec
- v3/SPEC.md, v3/IMPL_PLAN.md - v3 rewrite plan, 16 agent specs
- tpu/harness/EAGLE3_SPEC.md - EAGLE-3 speculative decoding spec (TPU, experimental)
- docs/arch.md - crate architecture (April 2026 snapshot; numbers therein superseded)
2026-06-09/10 -- H100 single-user maximization + production deploy (commits f70b4bf..0d5f276, PR #54)
Full record: v3/H100_MAXPERF_PLAN.md. Summary:
- Production-regime decode 14 -> 61.4 tok/s (4.4x); short-ctx FP8 31.2 -> 63.0; spec-decode 83.9 on real text at ctx 1200+, 153 on repetitive text. Deployed to production 2026-06-10 (API-measured 3.38x short / 1.78x long).
- Corrected the published B=1 narrative: the "~12 ms inter-node dispatch gap" analysis was a graph-level-trace artifact -- node-level nsys shows ~1.1 ms gap; the real costs were the CUTLASS channelscale GEMM at ~51% HBM at M=1 (fixed: cuBLASLt routing, +27%) and the serial-token FP8 attention kernel (fixed: split-KV rewrite, 35-171x).
- Found and fixed a production incident: the deployed libfa3_kernels.so was the Ada fallback under the FA3 filename (silent symbol-probe fallback) -- H100 served sm_89 attention 2026-06-06..10. The engine now refuses that combination on sm_90; kernels/build_fa3.sh gained the missing hdim256 combine instantiation so real FA3 is rebuildable from upstream.
- Speculative decoding shipped for GPU (n-gram drafter, graphed verify, lossless gates): RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4.
- Known weak spots, in the open: per-token prefill TTFT at >1024-token prompts (sound chunked prefill is the fix, design proven in the verify chunk), serve-side graph lifecycle at long context (27.4 API vs 61.4 bench), and the GPU PPL table needs a unified re-eval.
- Next: TPU revisit (v6e access incoming), E4B model drafter, spec-decode in rvllm-serve.