rvLLM

LLM inference engine. Rust+CUDA on GPU, JAX+XLA on TPU.

Three Gemma 4 models on TPU v6e-4: E4B (16,794 tok/s peak, 78.3 tok/s B=1, PPL 5.87), 26B-A4B MoE (14,899 tok/s peak), 31B (9,600 tok/s peak, 128K context). GPU: 31B on H100 at 63 tok/s single-user decode (84 with speculative decoding on real text), 8,786 tok/s peak batch (FP8, CUDA graph). Zero custom kernels on TPU -- ~500 lines of JAX. Native Rust binary on GPU -- zero Python in the serving path.

Full benchmarks | June 2026 H100 session record

At a glance

| Metric | E4B (4B) | 26B-A4B (MoE) | 31B TPU | 31B GPU | vLLM H100 |
|---|---|---|---|---|---|
| B=1 tok/s | 78.3 | 52.9 | 44.2 | 63.0 (83.9 spec) | 66.9 |
| Peak tok/s | 16,794 | 14,899 | 9,600 | 8,786 | 3,848 |
| PPL | 5.87 | 90.21 | 24.76 | 14.75* | - |
| Cached TTFT | 25.9 ms | 35.3 ms | 73.3 ms | 63 ms | - |
| Peak tok/s/$ | 3,230 | 2,865 | 1,846 | 4,576 | 2,004 |

TPU: v6e-4, $5.20/hr, int8, max-ctx 2048 (measured 2026-04/05). GPU: H100 SXM, $1.92/hr, FP8; B=1 measured 2026-06-09 (commit 0d5f276a5), peak-batch and vLLM comparison measured 2026-04. *The GPU PPL row is flagged for re-verification -- see the perplexity section.
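
The Peak tok/s/$ row is consistent with peak throughput divided by the hourly instance price quoted in the note above. A minimal Python sketch that reproduces the row under that reading (the function name is illustrative and not part of the repo):

```python
def peak_tok_s_per_dollar_hour(peak_tok_s: float, price_per_hour: float) -> float:
    """Peak tokens/second per $/hr of instance price."""
    return peak_tok_s / price_per_hour


# (peak tok/s, $/hr) copied from the table and the pricing note above.
configs = {
    "E4B TPU": (16_794, 5.20),
    "26B-A4B TPU": (14_899, 5.20),
    "31B TPU": (9_600, 5.20),
    "31B GPU": (8_786, 1.92),
    "vLLM H100": (3_848, 1.92),
}

for name, (peak, price) in configs.items():
    print(f"{name:12s} {peak_tok_s_per_dollar_hour(peak, price):,.0f}")
```

Multiplying any of these values by 3,600 gives tokens per dollar at peak batch, which may be the more intuitive unit for capacity planning.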

TPU: Gemma 4 on v6e-4

Pure JAX + XLA. No custom kernels. XLA compiles the entire forward pass to TPU machine code from a ~500-line JAX script. Three models, one codebase.

Models supported

| Property | E4B (4B) | 26B-A4B (MoE) | 31B |
|---|---|---|---|
| Total / active params | ~4B / 4B | 26B / ~4B | 31B / 31B |
| Layers | 42 | 30 | 60 |
| Hidden size | 2,560 | 2,816 | 5,376 |
| Q / KV heads (sliding) | 8 / 2 | 16 / 8 | 32 / 16 |
| Q / KV heads (global) | 8 / 2 | 16 / 2 (V=K) | 32 / 4 (V=K) |
| Head dim (sliding / global) | 256 / 512 | 256 / 512 | 256 / 512 |
| Sliding window | 512 | 1,024 | 1,024 |
| MoE | none | 128 experts, top-8 | none |
| KV-shared layers | 18 (of 42) | 0 | 0 |
| Per-layer input injection | 256-d gated (5.6 GB embed) | none | none |

Batch scaling (max-ctx 2048)

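| Batch | E4B tok/s | 26B-A4B tok/s | 31B tok/s | vLLM H100 |
|---|---|---|---|---|
| 1 | 78 | 53 | 44 | 66.9 |
| 8 | 542 | 390 | 318 | 515 |
| 64 | 3,661 | 2,662 | 2,112 | 2,794 |
| 128 | 6,298 | 4,915 | 3,853 | 3,848 |
| 256 | 10,214 | 8,192 | 6,246 | 3,709 |
| 512 | 13,773 | 12,390 | 8,550 | 3,788 |
| 768 | 15,514 | 14,899 | 9,600 | 3,671 |
| 1024 | 16,794 | - | - | - |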

31B context scaling (B=1)

| Context | ms/step | tok/s | Architecture | KV type |
|---|---|---|---|---|
| 512 | 12.79 | 78.2 | Single-scan, 60-layer scan + cond | bf16 |
| 2,048 | 22.6 | 44.2 | Single-scan | bf16 |
| 32K | ~66 | ~15 | Single-scan | bf16 |
| 64K | ~91 | ~11 | Split-cache, 10 groups x 6 | int8 |
| 128K | 40.56 | 24.7 | Split-cache + blockwise global | int8 |

Dual-path architecture auto-switches at the 32K boundary.

TPU deployment

# Create TPU v6e-4 ($5.20/hr)
gcloud compute tpus tpu-vm create rvllm-gemma4 \
  --zone=us-east5-b --accelerator-type=v6e-4 --version=v2-alpha-tpuv6e \
  --boot-disk-size=200

# Install (30 seconds)
pip3 install 'jax[tpu]' huggingface_hub tokenizers \
  -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Download model
huggingface-cli download google/gemma-4-E4B-it --local-dir ~/models/gemma-4-E4B-it

# Run E4B (78.3 tok/s B=1)
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-E4B-it --max-tokens 200 --max-ctx 2048

# Run 31B batched (9,600 tok/s B=768)
LIBTPU_INIT_ARGS="--xla_tpu_enable_async_collective_fusion=true \
  --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true \
  --xla_tpu_enable_async_collective_fusion_multiple_steps=true \
  --xla_tpu_overlap_compute_collective_tc=true \
  --xla_tpu_scoped_vmem_limit_kib=131072" \
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 2048 --batch 768

# 128K context (24.7 tok/s)
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 131072

# API server (OpenAI-compatible)
python3 tpu/harness/api_server.py --model-dir ~/models/gemma-4-31B-it --port 8080

# Perplexity
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-31B-it --perplexity --max-ctx 2048

No Docker. No conda. No torch. No vLLM. One pip install, one Python file, one command.

EAGLE-3 Speculative Decoding (TPU, experimental)

450M-param draft head proposes K=5 tokens per cycle; the full 31B verifies K+1=6 in one forward pass. Lossless for greedy decode.

| Metric | Value |
|---|---|
| Baseline (B=1, 512 ctx) | 78.2 tok/s, 12.79 ms/step |
| EAGLE-3 fused cycle | 31.0 ms/cycle |
| Projected @ tau=3.5 | ~145 tok/s (1.8x) |
| Hardware ceiling | ~300 tok/s (3.8x) |

Requires 50K+ training examples for production tau. Current: 2K examples, loss 7.1, pipeline validated end-to-end. See tpu/harness/EAGLE3_SPEC.md.
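
The ~145 tok/s projection can be reproduced from the table if tau is read as the mean number of accepted draft tokens per cycle and each cycle also emits one token from the verify pass. That reading is an assumption; it is not spelled out here, but it matches the quoted figure. A short Python sketch:

```python
def projected_tok_s(tau: float, cycle_ms: float) -> float:
    """Tokens/second if each draft+verify cycle emits tau accepted drafts
    plus one token from the verify forward (assumed interpretation of tau)."""
    tokens_per_cycle = tau + 1.0
    return tokens_per_cycle / (cycle_ms / 1000.0)


baseline_tok_s = 78.2   # B=1, 512 ctx, from the table
cycle_ms = 31.0         # EAGLE-3 fused cycle, from the table

projected = projected_tok_s(3.5, cycle_ms)
print(f"projected: {projected:.1f} tok/s, "
      f"speedup: {projected / baseline_tok_s:.2f}x")
```

Under this model throughput is linear in tau at a fixed cycle time, which is why the training-set size (and hence acceptance) is the lever that matters.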

GPU: 31B Gemma 4 on H100

Rust + CUDA on H100 SXM 80GB. FP8 weights with per-channel scales, FP8 or F16 paged KV, FA3 SM90 attention for sliding layers + a split-KV FP8 decode kernel for the fallback/global path. All 60 layers captured in a single CUDA graph. 63.0 tok/s single-user decode (2026-06-09, commit 0d5f276a5), 83.9 tok/s with speculative decoding on real text at real context, 8,786 tok/s peak (B=512, 2026-04).

Single-user decode (batch=1): the memory-bound regime

A single user generates one token at a time (batch=1, M=1). This is the hardest case for a 31B model and is purely weight-bandwidth bound: every token reads all ~30 GB of FP8 e4m3 weights once, so on an H100 (HBM3 peak 3.35 TB/s) the floor is ~9 ms/token ≈ 104-109 tok/s for plain decode. Speculative decoding is the only way past that roofline (it amortizes the weight read over multiple verified tokens).
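
The roofline arithmetic above is a single division: bytes read per token over memory bandwidth. A minimal Python sketch follows. The 31 and 32 GB rows are illustrative inputs, not measured figures; they are included because exactly 30 GB gives about 112 tok/s, while the quoted 104-109 range corresponds to roughly 31-32 GB read per token.

```python
def weight_roofline(bytes_per_token: float, hbm_bytes_per_s: float):
    """Return (ms per token, tokens per second) for purely
    weight-bandwidth-bound decode at batch size 1."""
    seconds = bytes_per_token / hbm_bytes_per_s
    return seconds * 1e3, 1.0 / seconds


H100_HBM3_PEAK = 3.35e12  # bytes/s, peak figure quoted above

for gigabytes in (30, 31, 32):
    ms, tok_s = weight_roofline(gigabytes * 1e9, H100_HBM3_PEAK)
    print(f"{gigabytes} GB/token -> {ms:.2f} ms/token, {tok_s:.1f} tok/s")
```

Real decode also reads KV cache and activations and never reaches peak HBM bandwidth, so measured throughput sits below whatever this division returns.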

What actually makes B=1 fast (each item measured, June 2026):

  1. Route small-M GEMMs through cuBLASLt (RVLLM_FP8_GEMM_LT_M1=1). The CUTLASS channelscale GemmUniversal that serves large-batch GEMMs runs M=1 at only ~51% of HBM in situ (69.7 µs avg per GEMM, node-level nsys); cuBLASLt runs the same shapes at ~80% (v3/M1_OUTCOME.md had this right). Rerouting M≤16 through cuBLASLt + a scale_cols pass + f32→f16 cast is +27% end-to-end (44.4 → 56.5 tok/s at B=1, 388.8 → 495.9 at B=8) even though it adds ~480 graph nodes per step. Two non-obvious rules survive: cuBLASLt OUTER_VEC channelscale fails the sm_90 heuristic (measured LaunchFailed) so the scale stays a separate pass, and do not hand-roll an FP8 GEMV -- measured to lose to cuBLASLt on every shape.

  2. CUDA-graph the decode loop (default-on; RVLLM_DECODE_GRAPH=0 for eager). Eager re-issues ~122 sync HtoD copies per step and host-serializes the GPU: 17 → 44.9 tok/s (2.6×), token-hash identical. One correction to an earlier analysis that claimed ~12 ms/step of "inter-node dispatch gap": node-level tracing (nsys --cuda-graph-trace=node) shows the real gap is ~1.1 ms/step. The missing time was GEMM efficiency (point 1) and the attention kernel (point 3), not dispatch. Fusing dispatch (megakernel) was built anyway and measured 1.7× slower -- see v3/MEGAKERNEL_OUTCOME.md.

  3. A paged-attention kernel that doesn't scale with context like a brick. The original FP8 decode kernel walked the KV window one token at a time (two __syncthreads() per token, one block per query head): 551 µs/call at a full 1024-token sliding window, which put long-context FP8 decode at 14 tok/s. The split-KV GQA-grouped rewrite (v3/kernels/fp8_decode_v2.cu) loads each KV chunk once for all query heads of its KV head, zero barriers in the token loop: 15.7 µs at the full window (35×), 40 µs at ctx 8192 (171×), parity ≤ 4.9e-4 vs the old kernel across 16 shape/scale variants. FP8-KV decode now beats the F16-KV path at every context length.
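
The split-KV idea in point 3 rests on one identity: softmax attention over a full window equals a log-sum-exp-weighted combination of softmax attention over its chunks. The Python sketch below illustrates that math only. It is not the CUDA kernel, it omits FP8 dequantization and the 1/sqrt(d) score scaling, and all function names are hypothetical.

```python
import math


def attend_chunk(q, keys, values):
    """Softmax attention of one query over one KV chunk.
    Returns (output vector, log-sum-exp of the chunk's scores)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def combine(partials):
    """Merge per-chunk (output, lse) pairs into the full-window output."""
    lse_max = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / total
            for d in range(dim)]


def split_kv_attend(queries, keys, values, chunk):
    """Each KV chunk is sliced once and reused by every query head that
    shares the KV head (the GQA grouping), then the partials are combined."""
    partials = [[] for _ in queries]
    for start in range(0, len(keys), chunk):
        k_chunk = keys[start:start + chunk]
        v_chunk = values[start:start + chunk]
        for head, q in enumerate(queries):
            partials[head].append(attend_chunk(q, k_chunk, v_chunk))
    return [combine(p) for p in partials]
```

Because the chunks are independent until the final combine, they can run in parallel with no barrier inside the token loop, which is the property the rewrite exploits.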

Measured B=1 generate (H100 SXM5, FP8 e4m3 weights, graphed, 2026-06-09/10):

| path | tok/s | notes |
|---|---|---|
| FP8 KV, short ctx | 63.0 | production kernel set, RVLLM_FP8_GEMM_LT_M1=1 |
| FP8 KV, ctx 1200-1500 | 61.4 | the production serving regime (fast prefill forces FP8 KV) |
| F16 KV, short ctx | 56.5 | FA3 sliding decode |
| + speculative decoding (K=4, real text, ctx 1200+) | 83.9 | n-gram drafter, see below |
| + speculative decoding (K=4, repetitive text) | 153.0 | acceptance-dependent upper range |
| HBM weight roofline (plain decode) | ~104-109 | ~30 GB FP8 ÷ 3.35 TB/s |
| historical: graphed baseline before June 2026 | 44.9 | F16 KV; FP8-KV long-context was 14 |

Speculative decoding (GPU, shipped)

Draft up to K tokens by n-gram prompt lookup (zero extra model), verify [last, drafts...] in ONE forward at M=K+1 -- the 30 GB weight read amortizes over every accepted token. Greedy acceptance; every emitted token is a model argmax. The verify forward is graph-captured per chunk size (RVLLM_SPEC_GRAPH=0 to force eager). Gates: K=0 is bit-identical to plain decode (400-token hash equality, including past the sliding-window ring wrap); graphed and eager spec runs are bit-identical to each other; K>0 vs plain decode is quality-identical but not bit-identical -- batched verify GEMMs differ from M=1 by ulps and flip genuine near-tie argmaxes (measured example: "PPL (Peak)" vs "PPL (Cached)", both coherent). Enable: RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4. Measured accept rate on real prose: 0.42-0.56/draft, 2.4-3.1 tokens per verify cycle. K=4 is the sweet spot; K=6-8 lose acceptance. Next: Gemma4-E4B as a model drafter (same tokenizer family) to lift acceptance on novel text.
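
A minimal Python sketch of the draft-and-verify cycle described above. The matching policy (longest trailing n-gram first, most recent earlier occurrence) and both function names are assumptions made for illustration; the engine's actual lookup rules are not documented here.

```python
def ngram_draft(tokens, k, max_ngram=3):
    """Propose up to k draft tokens by finding an earlier occurrence of the
    trailing n-gram and copying the tokens that followed it."""
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + k]
                if follow:
                    return follow
    return []


def greedy_verify(drafts, model_argmax):
    """model_argmax[i] is the model's argmax after consuming the last
    emitted token followed by drafts[:i]; it has len(drafts) + 1 entries,
    all produced by one forward pass. Drafts are accepted while they match,
    then exactly one more model token is emitted."""
    emitted = []
    for i, draft in enumerate(drafts):
        if model_argmax[i] != draft:
            break
        emitted.append(draft)
    emitted.append(model_argmax[len(emitted)])
    return emitted


history = [1, 2, 3, 4, 1, 2, 3]
drafts = ngram_draft(history, k=4)
print(drafts)
print(greedy_verify(drafts, [4, 1, 9, 0, 0]))
```

Every token returned by `greedy_verify` equals the model's own argmax at that position, which is the sense in which greedy acceptance is lossless; the drafter only decides how many positions one verify pass gets to confirm.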

GPU batch scaling (fresh spread, measured 2026-06-10, commit 544b1309e)

Decode-step bench (run_bench, FP8 weights, 40 iters/8 warmup), default configuration -- the engine now auto-routes small-M GEMMs through cuBLASLt up to the measured crossover (M≤64) and CUTLASS above it:

| Batch | tok/s (default) | ms/step | vs CUTLASS-only | route |
|---|---|---|---|---|
| 1 | 64.4 | 15.5 | +32% | cuBLASLt |
| 2 | 125.3 | 16.0 | +29% | cuBLASLt |
| 4 | 249.1 | 16.1 | +27% | cuBLASLt |
| 8 | 495.5 | 16.1 | +27% | cuBLASLt |
| 16 | 949.2 | 16.9 | +21% | cuBLASLt |
| 32 | 1,741 | 18.4 | +16% | cuBLASLt |
| 64 | 2,997 | 21.4 | +4% | cuBLASLt |
| 128 | 5,211 | 24.6 | (CUTLASS wins +12%) | CUTLASS |
| 256 | 7,607 | 33.7 | (CUTLASS wins +20%) | CUTLASS |

The crossover (RVLLM_FP8_GEMM_LT_MAX_M, default 64) was calibrated by running every batch size both ways. Historical April-2026 table (different harness settings, 100 iters): B=512 reached 8,786 tok/s; B≥64 April rows ran ~5-10% above the fresh 40-iter numbers -- treat cross-date deltas at B≥64 as methodology noise, not regression, until re-run at matched iters.
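
The routing rule and its calibration can be sketched in a few lines of Python. This illustrates the decision only, not the engine's Rust code: the function names are hypothetical, and the gains are the "vs CUTLASS-only" percentages from the June table, with CUTLASS wins entered as negative values.

```python
import os


def gemm_route(m, env=os.environ):
    """Pick the FP8 GEMM backend for M rows: cuBLASLt at or below the
    crossover, CUTLASS above it (default crossover 64, as stated above)."""
    max_m = int(env.get("RVLLM_FP8_GEMM_LT_MAX_M", "64"))
    return "cublaslt" if m <= max_m else "cutlass"


def calibrate_crossover(gain_pct_by_batch):
    """Largest batch size at which the cuBLASLt route still wins."""
    winners = [m for m, gain in gain_pct_by_batch.items() if gain > 0]
    return max(winners) if winners else 0


# cuBLASLt gain over the CUTLASS-only route, in percent, per batch size.
measured = {1: 32, 2: 29, 4: 27, 8: 27, 16: 21, 32: 16, 64: 4,
            128: -12, 256: -20}
print(calibrate_crossover(measured))
```

Keeping the threshold in one environment variable means a re-calibration on different hardware changes a setting rather than the code path.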

Single-user / greedy generate (full pipeline incl. lm_head + sampling, same commit, same day):

| config | decode tok/s | e2e tok/s |
|---|---|---|
| FP8 KV, short prompt | 63.0 | 57.9 |
| FP8 KV, 1200-token prompt (production shape) | 61.4 | 11.8* |
| F16 KV, short prompt | 56.6 | 44.7 |
| spec-decode K=4, real text, 1200-token prompt | 83.9 | 12.6* |
| spec-decode K=4, repetitive text, short | 153.1 | 126.0 |

*Long-prompt e2e is dominated by per-token prefill (20.5 s TTFT for 1200 tokens on the bench; the open work item). Opt-in RVLLM_FP8_GEMM_LT_F16OUT=1 (cuBLASLt writes f16 directly, one in-place channel-scale kernel) adds another +1.4-2% deterministically but changes rounding (f16-before-scale); it stays opt-in until a unified ppl gate clears it.

rvLLM vs vLLM on H100

vLLM column measured 2026-04 (vLLM 0.19, FP8, CUDA graphs); rvLLM B=1 updated 2026-06 -- an apples-to-apples re-run against current vLLM is owed before claiming the B=1 row.

| Batch | rvLLM tok/s | vLLM tok/s (2026-04) | Delta |
|---|---|---|---|
| 1 | 63.0 (83.9 spec) | 69 | -9% (+22% with spec) |
| 32 | 1,743 | 1,748 | ~0% |
| 64 | 3,265 | 3,130 | +4% |
| 128 | 5,802 | 4,689 | +24% |
| 256 | 7,808 | 7,077 | +10% |
| 512 | 8,786 | 8,243 | +7% |

GPU perplexity (measured 2026-04; FLAGGED for re-verification)

| Weight path | KV cache | PPL | tok/s (B=1, at the time) |
|---|---|---|---|
| FP8-Dynamic + CUTLASS channelscale epilogue | F16 | 14.75 | 53 |
| BF16 split QKV per-tensor FP8 | F16 | 17.96 | 37.9 |
| F16 weights (no FP8) | F16 | 19.79 | 37.9 |
| HuggingFace BF16 reference | -- | 19.62 | -- |

Honesty flag: FP8 weights measuring 25% better perplexity than the BF16 reference (14.75 vs 19.62) is not plausible as a quantization effect; it almost certainly reflects an eval-config difference between the paths (the logit-softcap is applied on the ppl path only, and the reference was run through a different harness). Treat the relative ordering of the first three rows as informative and the absolute comparison to the HF reference as unverified until the eval is unified. Re-verification is queued.
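
To show why an eval-config difference can move perplexity independently of the weights, here is a small Python sketch of perplexity with and without the 30 * tanh(logits / 30) softcap. The toy logits are made up and the function names are hypothetical; this is not the repo's eval harness.

```python
import math


def softcap(logits, cap=30.0):
    """Squash logits into [-cap, cap] with cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def perplexity(logit_rows, targets, use_softcap):
    """exp(mean negative log-likelihood) of the target tokens."""
    nll = 0.0
    for logits, target in zip(logit_rows, targets):
        if use_softcap:
            logits = softcap(logits)
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(targets))


# Toy data: one confident correct prediction, one confident miss.
rows = [[45.0, 0.0, -10.0], [50.0, 5.0, 0.0]]
targets = [0, 1]
print(perplexity(rows, targets, use_softcap=False))
print(perplexity(rows, targets, use_softcap=True))
```

The two printed values differ on identical logits, so a unified eval has to apply the same post-processing to every row before the rows can be compared in absolute terms.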

Gemma 4 forward pass (per-layer launch sequence, 13 steps listed)

For each layer in 0..60:
  1.  fused_rmsnorm_fp8_quant           input layernorm + FP8 quantize
  2.  cutlass_fp8_gemm_channelscale     fused Q||K||V + channelscale epilogue
  3.  fused_qkv_rmsnorm                 Q/K norm (learned) + V norm (parameter-free)
  4.  fused_rope_partial_f16kv          partial RoPE + F16 KV cache write
  5.  paged_decode (FA3 SM90)           attention (head_dim=256 sliding, 512 global)
  6.  quantize_fp8_per_token            attn output to FP8
  7.  fp8_gemm                          O projection
  8.  fused_norm_add_residual           channelscale + rmsnorm + residual add
  9.  fused_rmsnorm_fp8_quant           pre-FFN layernorm + FP8 quantize
  10. cutlass_fp8_gemm_channelscale     fused gate||up + channelscale epilogue
  11. fused_gelu_mul_fp8_quant          GELU(tanh)(gate) * up to FP8
  12. fp8_gemm                          down projection
  13. fused_norm_add_residual           channelscale + rmsnorm + residual + layer_scalar

Sampling tail:
  fused_rmsnorm                       final layernorm
  f16_gemm_f32                        lm_head
  logit_softcap                       30 * tanh(logits / 30)
  argmax_kernel                       token selection

Kernel fusion summary

Four rounds of fusion + custom CUTLASS epilogue reduced graph nodes from 1776 to ~935 (47% reduction):

| Fusion | Kernels eliminated | Nodes saved |
|---|---|---|
| f32_to_bf16 + rmsnorm + vector_add -> fused_norm_add_residual | 3 -> 1 (x2/layer) | 240 |
| scale_cols_f32 fused into norm+add kernel (O-proj, down) | 1 -> 0 (x2/layer) | 120 |
| residual_scale_f16 fused into post-ff norm+add | 1 -> 0 (x1/layer) | 60 |
| vnorm_f16 fused into qk_rmsnorm -> fused_qkv_rmsnorm | 2 -> 1 (x1/layer) | 60 |
| CUTLASS channelscale epilogue (QKV, gate_up) | 3 -> 1 (x2/layer) | 240+ |

The CUTLASS channelscale kernel uses a custom SM90 EVT epilogue that applies per-token activation scale (ColBroadcast) and per-channel weight scale (RowBroadcast) directly in the GEMM epilogue while the accumulator is still F32, then casts to F16. At small M this kernel is no longer on the fast path: RVLLM_FP8_GEMM_LT_M1=1 reroutes small-M GEMMs through cuBLASLt (+27% measured at M≤16; the default crossover, RVLLM_FP8_GEMM_LT_MAX_M, is now 64). The smaller-tile CUTLASS variant that used to be "help wanted" here is obsolete.
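
The epilogue's math is an outer-product rescale: output element (i, j) is the accumulator times the activation scale of row i times the weight scale of column j. The Python sketch below checks that identity on a tiny example. It deliberately skips the FP8 rounding step, so it shows only why the two scales factor out of the GEMM; it assumes no all-zero rows, and all names are hypothetical.

```python
E4M3_MAX = 448.0  # largest finite value in FP8 e4m3


def row_scales(mat):
    """One scale per row so that the row's max magnitude maps to E4M3_MAX."""
    return [max(abs(v) for v in row) / E4M3_MAX for row in mat]


def scale_rows(mat, scales):
    """Divide each row by its scale (quantization without rounding)."""
    return [[v / s for v in row] for row, s in zip(mat, scales)]


def matmul_nt(a, b):
    """a is M x K, b is N x K; returns the M x N product a @ b^T."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def channelscale_epilogue(acc, act_scale, weight_scale):
    """out[i][j] = acc[i][j] * act_scale[i] * weight_scale[j]."""
    return [[acc[i][j] * act_scale[i] * weight_scale[j]
             for j in range(len(weight_scale))]
            for i in range(len(act_scale))]


x = [[1.0, -2.0, 0.5], [4.0, 0.25, -8.0]]    # activations, M x K
w = [[0.1, 0.2, -0.3], [1.5, -0.5, 2.0]]     # weights, N x K

a_scale = row_scales(x)   # per-token activation scale
w_scale = row_scales(w)   # per-output-channel weight scale
acc = matmul_nt(scale_rows(x, a_scale), scale_rows(w, w_scale))
out = channelscale_epilogue(acc, a_scale, w_scale)
reference = matmul_nt(x, w)
```

Applying both scales while the accumulator is still F32 avoids a separate scale pass and a round trip through memory, which is the node saving the fusion table attributes to this epilogue.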

Help wanted (current, real -- updated 2026-06-11):

  • Close out the two-source prefill wiring (#58): the kernel is parity-proven bit-identical and measures TTFT 20.5 s -> 0.67 s at 1200-token prompts behind RVLLM_PREFILL_TWO_SOURCE=1, but an engine wiring divergence keeps it opt-in. The issue has the full evidence chain and a precise repro.
  • Serve-session spec-decode optimization: spec behind the API measures 39.7 tok/s where the bench-side machinery does 83.9 -- the graphed verify isn't fully exploited in the session loop yet.
  • Model drafter for spec-decode (Gemma4-E4B), replacing n-gram lookup for novel text.

Shipped from the previous help-wanted list (2026-06-11): cross-request graph persistence (served long-prompt decode 27.4 -> 59.6 tok/s) and the full API sampling contract (temperature/top_p/top_k/seed/stop -- the endpoint was greedy-only before).

GPU build and run

# One-time on H100 box (~15 min)
bash kernels/build.sh               # fused PTX
bash kernels/build_cutlass_so.sh    # libcutlass_kernels.so
bash kernels/build_fa3.sh           # libfa3_kernels.so (real FA3 -- needs flash-attention checkout;
                                    # includes the hdim256 combine instantiation upstream lacks)
bash kernels/build_fa_sm89_so.sh    # libfa_sm89_kernels.so (split-KV FP8 decode + global hd512)

# Build
cargo build --release --features cuda --manifest-path v3/Cargo.toml -p rvllm-bench

# Run
RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it \
RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90 \
RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so \
RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so \
RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json \
RVLLM_BATCH=128 RVLLM_ITERS=30 RVLLM_WARMUP=5 \
  ./v3/target/release/rvllm-bench

OpenAI-compatible Gemma 4 server

The server is a Rust-only Gemma 4 path with an OpenAI-compatible HTTP surface. It keeps CUDA execution on a single engine owner thread and accepts requests through /v1/chat/completions.

For the solidSF agents production shape, including 256K context, four-seat admission, the paid-plan busy response, CAD harness prompting, systemd service shape, and verification scripts, see docs/solidsf-agent-serving.md.

export CUDA_ARCH=sm_90
export RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it
export RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90
export RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so
export RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so
export RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json
export RVLLM_SERVED_MODEL_NAME=gemma4-31b
export RUST_LOG=info

bash kernels/build.sh sm_90
bash kernels/build_cutlass_so.sh sm_90
bash kernels/build_fa3.sh
cargo build --release --features cuda,cublaslt --manifest-path v3/Cargo.toml -p rvllm-serve

./v3/target/release/rvllm-server \
  --host 127.0.0.1 \
  --port 8080 \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --max-prefill-chunk 128

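The server exposes GET /health, GET /v1/models, and POST /v1/chat/completions with non-stream and SSE streaming responses. The endpoint was greedy-only until 2026-06-11; it now accepts temperature, top_p, top_k, seed, and stop. The smoke requests below set temperature: 0 for deterministic output.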

Smoke:

curl -fsS http://127.0.0.1:8080/health
curl -fsS http://127.0.0.1:8080/v1/models
curl -fsS http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma4-31b","messages":[{"role":"user","content":"Reply exactly: RVLLM_RUST_OK"}],"max_tokens":16,"temperature":0}'
curl -fsS --no-buffer http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma4-31b","messages":[{"role":"user","content":"hi"}],"max_tokens":16,"temperature":0,"stream":true}'

For bind-only local checks without CUDA:

RVLLM_DRY_RUN=1 cargo run --manifest-path v3/Cargo.toml -p rvllm-serve -- \
  --host 127.0.0.1 \
  --port 8080

Kernels

Every kernel has a known purpose, a pinned variant, and a workspace contract. No dispatch fallback chains.

| Kernel | Purpose |
|---|---|
| cutlass_fp8_gemm_channelscale | SM90 FP8 GEMM with EVT channelscale epilogue (QKV, gate_up) |
| fused_rmsnorm_fp8_quant | layernorm + FP8 quantize in one launch |
| fused_qkv_rmsnorm | per-head RMSNorm on Q, K (learned) and V (parameter-free) |
| fused_rope_partial_f16kv | partial RoPE + F16 KV cache write |
| fused_gelu_mul_fp8_quant | GELU(tanh)(gate) * up to FP8 |
| fused_norm_add_residual | channelscale + RMSNorm + residual add (+ optional layer_scalar) |
| logit_softcap | 30 * tanh(logits / 30) |
| quantize_fp8_per_token | activation to FP8 with per-token scale |
| argmax | f32 logits to i32 token |
| fp8_decode_v2 (fa_sm89_* .so) | split-KV GQA-grouped FP8 paged decode + LSE combine (35-171x over the serial kernel) |

No fallbacks. Missing kernel .so = engine refuses to start. One earned scar: the loader probes the FA3 .so for fa3_sm90_* symbols and quietly selects the Ada-generation kernel set if they're absent -- which is how a production H100 served sm_89 attention for four days when a fallback .so got copied over the FA3 filename. Verify with nm -D $RVLLM_FA3_SO | grep fa3_sm90 after any kernel deploy; the engine now also refuses this combination on sm_90 unless explicitly overridden.
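
The `nm -D ... | grep fa3_sm90` check can be wrapped into a deploy gate. A minimal Python sketch, assuming `nm` is on PATH; the helper names and the symbol names in the sample text are made up for illustration, and only the `fa3_sm90` substring comes from the text above.

```python
import subprocess


def has_fa3_sm90_symbols(nm_output: str) -> bool:
    """True if any symbol name in `nm -D` output contains 'fa3_sm90'
    (the same substring match that grep performs)."""
    for line in nm_output.splitlines():
        parts = line.split()
        if parts and "fa3_sm90" in parts[-1]:
            return True
    return False


def check_fa3_so(path: str) -> bool:
    """Run `nm -D` on a shared object and test for FA3 SM90 symbols."""
    result = subprocess.run(["nm", "-D", path],
                            capture_output=True, text=True, check=True)
    return has_fa3_sm90_symbols(result.stdout)


# Hypothetical nm output, for illustration only.
real_fa3 = "0000000000012340 T fa3_sm90_paged_decode\n                 U cudaMalloc\n"
fallback = "0000000000012340 T fa_sm89_paged_decode\n                 U cudaMalloc\n"
print(has_fa3_sm90_symbols(real_fa3), has_fa3_sm90_symbols(fallback))
```

Calling `check_fa3_so` on the path in RVLLM_FA3_SO right after a kernel deploy, and failing the deploy when it returns False, would catch the mislabeled-file case described above before the engine starts.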

v3 crate map

v3/crates/
  rvllm-core         typed errors, IDs, dtype, shape, config, env
  rvllm-mem          HbmArena, Region, Stream, Event, PinnedBuf, CudaContextHandle
  rvllm-kernels      manifest (sha-pinned), PTX loader, kernel catalog
  rvllm-fused        8 fused-kernel launchers + pure-Rust f32 references
  rvllm-attention    FA3 SM90 paged decode/prefill dlopen
  rvllm-cutlass      FP8 variant catalog + schedule pairing trait + cuBLASLt wrapper
  rvllm-metadata     frozen-layout metadata per bucket (one upload path)
  rvllm-loader       safetensors mmap -> HBM + CPU-path FP8 quant + clamp gate
  rvllm-sampling     argmax tail, pinned DtoH
  rvllm-graph        captured-graph pool keyed on MetaLayoutHash
  rvllm-runtime      Engine, scheduler, layer_exec, bring_up
  rvllm-bench        RVLLM_* env-driven bench binary
  rvllm-invariants   DAG-dep test, no-megakernel gate

Correctness discipline

  1. No fallbacks. Missing autotune entry = engine panic. Missing .so = refuse start. No silent degradation.
  2. Graph-capture invariant. Metadata buffer layout frozen per (bucket, max_blocks_per_seq). Captured graphs bind exact offsets.
  3. CUTLASS schedule/epilogue pairing. Mainloop and epilogue schedules must match. Enforced via static_assert.
  4. No unwrap() in libraries. Result<T, RvllmError> end-to-end with structured context.
  5. Real block-change detection. Scheduler emits block table updates; missing signals = stale KV reads caught at the type level.

License

Apache-2.0.

Further reading

Updates

2026-06-09/10 -- H100 single-user maximization + production deploy (commits f70b4bf..0d5f276, PR #54)

Full record: v3/H100_MAXPERF_PLAN.md. Summary:

  • Production-regime decode 14 -> 61.4 tok/s (4.4x); short-ctx FP8 31.2 -> 63.0; spec-decode 83.9 on real text at ctx 1200+, 153 on repetitive text. Deployed to production 2026-06-10 (API-measured 3.38x short / 1.78x long).
  • Corrected the published B=1 narrative: the "~12 ms inter-node dispatch gap" analysis was a graph-level-trace artifact -- node-level nsys shows ~1.1 ms gap; the real costs were the CUTLASS channelscale GEMM at ~51% HBM at M=1 (fixed: cuBLASLt routing, +27%) and the serial-token FP8 attention kernel (fixed: split-KV rewrite, 35-171x).
  • Found and fixed a production incident: the deployed libfa3_kernels.so was the Ada fallback under the FA3 filename (silent symbol-probe fallback) -- H100 served sm_89 attention 2026-06-06..10. The engine now refuses that combination on sm_90; kernels/build_fa3.sh gained the missing hdim256 combine instantiation so real FA3 is rebuildable from upstream.
  • Speculative decoding shipped for GPU (n-gram drafter, graphed verify, lossless gates): RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4.
  • Known weak spots, in the open: per-token prefill TTFT at >1024-token prompts (sound chunked prefill is the fix, design proven in the verify chunk), serve-side graph lifecycle at long context (27.4 API vs 61.4 bench), and the GPU PPL table needs a unified re-eval.
  • Next: TPU revisit (v6e access incoming), E4B model drafter, spec-decode in rvllm-serve.

About

rvLLM: High-performance LLM inference in Rust. Drop-in vLLM replacement.
