1Cat-vLLM 1.1.0 Beta / Experimental Release Notes

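1Cat-vLLM 1.1.0 Beta / Experimental is an experimental update for V100/SM70, focused on providing an early validation package for the FP8 path.

Highlights:

Beta/Experimental: Qwen3.6 FP8 inference path, including TurboMind/LMDeploy FP8 GEMM adaptation work.
Optimized FP8 KV cache and dequantization overhead.
Fixed and improved MTP stability, MoE/MTP routing, and verifier graph-capture configuration.
Fixed compatibility issues between prefix caching, Mamba align, and async scheduling.
Improved tool calling, Qwen3.x template compatibility, FA2 long-context stability, and output quality stability.
DFlash remains an experimental feature.

Notes:

FP8, DFlash, and some MTP optimizations are still experimental and are not recommended as production defaults.
Target environment for the release package: CUDA 12.8, Python 3.12, V100/SM70.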

1Cat-vLLM 1.0.0 Release Notes

This release focuses on the V100/SM70 serving path for Qwen3.5/Qwen3.6-class models, long-context stability, and reproducible local wheel deployment.

Highlights

Added FP8 KV cache support for the V100 FlashAttention path, including regression coverage for operator correctness and model-level generation behavior.
Fixed model output quality regressions observed in long-context and MTP serving, including repeated output, abnormal punctuation-only output, and unstable NaN-like decode behavior.
Fixed OpenAI-compatible tool calling and chat serving behavior, improving compatibility with Cherry Studio, OpenClaw, OpenCode, and OpenAI-style clients.
Improved runtime stability for Qwen3.5/Qwen3.6 model families, including Qwen3.6-27B-AWQ serving with 256K context.
Reduced unnecessary startup memory pressure and tightened V100 memory defaults for more predictable deployment.
Added MTP speculative decoding support and serving flags, with regression tooling for acceptance length, quality, and throughput.
Introduced DFlash as an experimental speculative decoding path. It is included for validation and continued tuning, not as the default production path.
Improved FlashAttention V100 dense prefill and paged decode paths, with stricter operator-level quality and speed regression scripts.
Added benchmark and audit utilities for FA2/Triton comparison, prefix-cache plus MTP serving, Qwen3.6 output quality checks, and OpenAI API compatibility testing.

Recommended Runtime Baseline

GPU: NVIDIA V100 / SM70
CUDA: 12.8
Python: 3.12
PyTorch: 2.9.1 + cu128
Default context target: 256K
Recommended deployment path: install the local flash_attn_v100 wheel together with the local vllm wheel.
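A quick way to compare a host against this baseline is sketched below. It is a minimal check, not part of the release: the helper name matches_baseline is hypothetical, and the script assumes python3 is on PATH and reads the CUDA version from torch.version.cuda (the CUDA version PyTorch was built against), which is empty when PyTorch is not installed.

```shell
#!/bin/sh
# matches_baseline PYTHON_VERSION CUDA_VERSION
# Succeeds for Python 3.12.x together with CUDA 12.8 (hypothetical helper).
matches_baseline() {
    case "$1" in
        3.12|3.12.*) ;;
        *) return 1 ;;
    esac
    [ "$2" = "12.8" ]
}

# Read the interpreter version; fall back to an empty string on failure.
py=$(python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])' 2>/dev/null) || py=""
# CUDA version PyTorch was built against; empty if torch cannot be imported.
cuda=$(python3 -c 'import torch; print(torch.version.cuda)' 2>/dev/null) || cuda=""

if matches_baseline "$py" "$cuda"; then
    echo "baseline ok: Python $py, CUDA $cuda"
else
    echo "baseline mismatch: Python '$py', CUDA '$cuda' (expected 3.12 / 12.8)" >&2
fi
```

The check does not cover the GPU model or the PyTorch version itself; those still need to be confirmed separately.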

Packaging

The local release wheel directory for this build is expected to be:

../dist-cu128-sm70-1.0.0

Install both wheels together:

python -m pip install \
  ../dist-cu128-sm70-1.0.0/flash_attn_v100-*.whl \
  ../dist-cu128-sm70-1.0.0/vllm-*.whl
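Because the install command relies on shell globs, it can help to confirm that the directory holds exactly one wheel of each kind first. The sketch below is an assumption-laden pre-flight check, not part of the release: count_wheels, have_both_wheels, and the WHEEL_DIR variable are hypothetical names.

```shell
#!/bin/sh
# count_wheels DIR PREFIX
# Prints how many files named PREFIX*.whl exist in DIR.
count_wheels() {
    n=0
    for f in "$1"/"$2"*.whl; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# have_both_wheels DIR
# Succeeds only if DIR holds exactly one flash_attn_v100 wheel and one vllm wheel.
have_both_wheels() {
    [ "$(count_wheels "$1" flash_attn_v100-)" -eq 1 ] &&
    [ "$(count_wheels "$1" vllm-)" -eq 1 ]
}

WHEEL_DIR="${WHEEL_DIR:-../dist-cu128-sm70-1.0.0}"
if have_both_wheels "$WHEEL_DIR"; then
    echo "ok: both wheels found in $WHEEL_DIR"
else
    echo "expected one flash_attn_v100 wheel and one vllm wheel in $WHEEL_DIR" >&2
fi
```

Requiring exactly one match per pattern avoids passing several builds of the same package to pip by accident.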

Notes

DFlash remains experimental and should be benchmarked against the normal MTP path before production use.
The V100 FP8 work targets the FP8 KV cache and V100-specific fast paths. It is not Hopper-style native FP8 Tensor Core W8A8 inference.
Generated benchmark outputs, server logs, model weights, and local cache artifacts are not part of the source release.

0.0.3 Highlights

1Cat-vLLM-0.0.3 is a larger V100 / SM70 release than 0.0.2. The previous public release mainly provided a CUDA 12.8 / Python 3.12 vLLM wheel for Tesla V100. This release turns the V100 path into a more complete serving profile for modern AWQ models, with a dedicated attention backend, clearer 4-GPU launch defaults, separate V100 FlashAttention runtime wheels, and public regression artifacts.

What Improved Over 0.0.2

Dedicated V100 attention path. FLASH_ATTN_V100 is now the recommended runtime backend for SM70. It is wired into vLLM's attention backend registry and is selected with VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100, replacing the older public recommendation that leaned on the Triton attention fallback.
V100 FlashAttention runtime extensions. The new flash_attn_v100 wheel provides the SM70 CUDA extensions used by this backend, including flash_attn_v100_cuda and paged_kv_utils. These extensions support the V100 decode/prefill path used by the 0.0.3 launch profiles.
Better long-context serving defaults. The public commands now use explicit max_model_len, max_num_seqs=1, and max_num_batched_tokens values for 4 x 32 GB V100 systems. The 122B profile is documented as a long-context configuration with a reduced prefill chunk budget to leave room for MoE temporary workspace.
Expanded Qwen3.5 / Qwen3.6 coverage. 0.0.3 documents and validates public profiles for Qwen3.5-27B-AWQ, Qwen3.6-35B-A3B-AWQ, and Qwen3.5-122B-A10B-AWQ, including dense and MoE serving paths on V100.
SM70 AWQ and MoE runtime work. The release keeps the TurboMind SM70 WMMA AWQ path and updates the MoE execution path for V100, including compressed-tensors MoE integration and runtime defaults intended to work with CUDA graphs.
Qwen reasoning parser compatibility. The Qwen3 reasoning parser handles Qwen3.5-style thinking prompts where the chat template may open the <think> block before generation begins.
Cleaner source and wheel distribution. The public install path is now explicit: install the V100 FlashAttention wheel together with the vLLM wheel. The source tree was cleaned for open release, while retaining the build dependencies needed for SM70 source builds.
Public regression artifacts. 0.0.3 includes benchmark/regression chart assets under docs/test-table for the three recommended public models on a 4-card V100 32 GB reference host.
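The reasoning-parser item above concerns output where the chat template has already emitted the opening <think> tag, so the generated text contains only the closing tag. The sketch below illustrates that splitting rule in isolation. It is not the parser shipped in this release: split_reasoning, REASONING, and CONTENT are hypothetical names, and treating text without a closing tag as plain content is a simplification chosen for this example.

```shell
#!/bin/sh
# split_reasoning TEXT sets two variables:
#   REASONING  text before the first </think>, minus a leading <think> if present
#   CONTENT    text after the first </think>
# Text without </think> is treated entirely as content (a simplification).
split_reasoning() {
    case "$1" in
        *"</think>"*)
            REASONING="${1%%</think>*}"
            CONTENT="${1#*</think>}"
            REASONING="${REASONING#<think>}"
            ;;
        *)
            REASONING=""
            CONTENT="$1"
            ;;
    esac
}

# The template opened <think>, so the generated text starts mid-reasoning.
split_reasoning 'compare both options</think>Option B is faster.'
printf 'reasoning=[%s]\ncontent=[%s]\n' "$REASONING" "$CONTENT"
```

The same function also accepts output that does include the opening tag, which is why the leading <think> is stripped rather than required.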

Recommended 4-card V100 Profiles

These are the release defaults documented in the README:

| Model | TP | `max_model_len` | `max_num_seqs` | `max_num_batched_tokens` | Notes |
| --- | --- | --- | --- | --- | --- |
| `Qwen3.5-27B-AWQ` | 4 | `36000` | 1 | `16384` | stable public dense-model default |
| `Qwen3.6-35B-A3B-AWQ` | 4 | `33000` | 1 | `16384` | stable public MoE default |
| `Qwen3.5-122B-A10B-AWQ` | 4 | `256000` | 1 | `8096` | long-context large-model default |

Recommended environment switches:

export VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100
export VLLM_SM70_ENABLE_LM_HEAD_FASTPATH=1

Recommended compilation setting used by the public commands:

--compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1,2]}'
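The environment switches, profile values, and compilation setting above can be combined into one launch command. The sketch below only prints such a command for review and does not start a server. It rests on assumptions: the `vllm serve` entry point and the flag names are taken from upstream vLLM and are assumed to be unchanged in this fork, serve_cmd is a hypothetical helper, and MODEL_PATH is a placeholder for a local model directory. The README remains the authority for the exact public commands.

```shell
#!/bin/sh
# Environment switches recommended in these notes.
export VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100
export VLLM_SM70_ENABLE_LM_HEAD_FASTPATH=1

# Compilation setting quoted from these notes.
COMPILATION_CONFIG='{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1,2]}'

# serve_cmd MODEL MAX_MODEL_LEN MAX_NUM_BATCHED_TOKENS
# Prints a launch command; flag names are assumed from upstream vLLM.
serve_cmd() {
    printf "vllm serve %s --tensor-parallel-size 4 --max-model-len %s --max-num-seqs 1 --max-num-batched-tokens %s --compilation-config '%s'\n" \
        "$1" "$2" "$3" "$COMPILATION_CONFIG"
}

# Hypothetical model location; values are the Qwen3.5-27B-AWQ profile.
MODEL_PATH="${MODEL_PATH:-/models/Qwen3.5-27B-AWQ}"
serve_cmd "$MODEL_PATH" 36000 16384
```

Printing the command first makes it easy to compare against the README before running it on a 4 x V100 host.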

Wheel Install

Install both wheels in one pip install command. The flash_attn_v100 wheel is required by the FLASH_ATTN_V100 backend, and the vLLM wheel contains the 0.0.3 runtime.

After installing from wheels, run python -m vllm... from outside the source checkout, such as cd ~ or cd /tmp. Running inside the cloned repository makes Python import the local source tree instead of the wheel package that contains vllm/_C.abi3.so.

python -m pip install \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v0.0.3/flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl" \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v0.0.3/vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl"
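The source-checkout pitfall described above comes from `python -m` placing the current directory first on sys.path, so a local vllm/ directory is imported ahead of the installed wheel. A minimal guard, assuming nothing beyond a POSIX shell, is sketched here; shadows_wheel is a hypothetical helper name.

```shell
#!/bin/sh
# shadows_wheel DIR
# Succeeds if DIR contains a vllm/ directory that `python -m` would
# import ahead of the installed wheel when launched from DIR.
shadows_wheel() {
    [ -d "$1/vllm" ]
}

if shadows_wheel "$PWD"; then
    echo "warning: $PWD contains vllm/; cd elsewhere (for example /tmp) before launching" >&2
else
    echo "ok: no local vllm/ directory in $PWD"
fi
```

Running this from the intended launch directory gives an early warning before a missing vllm/_C.abi3.so shows up as an import error.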

Published assets:

flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl
vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl

Validated Stack

Python 3.12
CUDA 12.8
PyTorch 2.9.1+cu128
Target GPU: Tesla V100 / SM70, with public commands written for 4 x V100 32 GB

Notes

First-request warmup on V100 can be slow because kernels and CUDA graphs are being prepared. Steady-state throughput should be measured after warmup.
FLASH_ATTN_V100 is the recommended public attention backend for 0.0.3.
Direct paged prefill remains experimental and is not enabled in the public default commands.
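To keep warmup out of throughput numbers, the first few measurements can be dropped before averaging. The sketch below shows one way to do that. The helper steady_mean is hypothetical, and the sample readings are made-up tokens-per-second values for illustration, not results from this release.

```shell
#!/bin/sh
# steady_mean WARMUP VALUE...
# Prints the mean of VALUE... after dropping the first WARMUP samples,
# with three decimal places, or "nan" if no samples remain.
steady_mean() {
    warmup="$1"
    shift
    printf '%s\n' "$@" | awk -v skip="$warmup" '
        NR > skip { sum += $1; n++ }
        END { if (n > 0) printf "%.3f\n", sum / n; else print "nan" }'
}

# Made-up readings: the first two requests include kernel and CUDA graph
# preparation, so they are excluded from the average.
steady_mean 2 3.1 9.8 41.0 43.0 42.0
```

How many requests count as warmup depends on the model and the configured CUDA graph capture sizes, so the skip count should be chosen by inspecting the raw readings.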

1Cat-vLLM-0.0.2 public release.

Included asset:

vllm-0.15.2rc1.dev2+g72bb24e2d.d20260320-cp312-cp312-linux_x86_64.whl

Target environment:

Python 3.12
CUDA 12.8
SM70 / Tesla V100
