Releases: 1CatAI/1Cat-vLLM
1Cat-vLLM 1.1.0 Beta / Experimental
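1Cat-vLLM 1.1.0 Beta / Experimental is an experimental update for V100/SM70. Its main purpose is to provide an early validation package for the FP8 path.
Update highlights:
- Beta/Experimental: Qwen3.6 FP8 inference path, including TurboMind/LMDeploy FP8 GEMM adaptation work.
- Reduced FP8 KV cache and dequantization overhead.
- Fixed and improved MTP stability, MoE/MTP routing, and the verifier graph-capture configuration.
- Fixed compatibility issues when prefix caching, Mamba align, and async scheduling are used together.
- Improved tool calling, Qwen3.x template compatibility, FA2 long-context stability, and output quality stability.
- DFlash remains an experimental feature.
Notes:
- FP8, DFlash, and some of the MTP optimizations are still experimental and are not recommended as production defaults.
- Target environment for the release package: CUDA 12.8, Python 3.12, V100/SM70.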
1Cat-vLLM 1.0.0
1Cat-vLLM 1.0.0 Release Notes
This release focuses on the V100/SM70 serving path for Qwen3.5/Qwen3.6-class models, long-context stability, and reproducible local wheel deployment.
Highlights
- Added FP8 KV cache support for the V100 FlashAttention path, including regression coverage for operator correctness and model-level generation behavior.
- Fixed model output quality regressions observed in long-context and MTP serving, including repeated output, abnormal punctuation-only output, and unstable NaN-like decode behavior.
- Fixed OpenAI-compatible tool calling and chat serving behavior, improving compatibility with Cherry Studio, OpenClaw, OpenCode, and OpenAI-style clients.
- Improved runtime stability for Qwen3.5/Qwen3.6 model families, including Qwen3.6-27B-AWQ serving with 256K context.
- Reduced unnecessary startup memory pressure and tightened V100 memory defaults for more predictable deployment.
- Added MTP speculative decoding support and serving flags, with regression tooling for acceptance length, quality, and throughput.
- Introduced DFlash as an experimental speculative decoding path. It is included for validation and continued tuning, not as the default production path.
- Improved FlashAttention V100 dense prefill and paged decode paths, with stricter operator-level quality and speed regression scripts.
- Added benchmark and audit utilities for FA2/Triton comparison, prefix-cache plus MTP serving, Qwen3.6 output quality checks, and OpenAI API compatibility testing.
Recommended Runtime Baseline
- GPU: NVIDIA V100 / SM70
- CUDA: 12.8
- Python: 3.12
- PyTorch: 2.9.1 + cu128
- Default context target: 256K
- Recommended deployment path: install the local flash_attn_v100 wheel together with the local vllm wheel.
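The baseline above can be turned into a small pre-flight check. The following is a minimal sketch, not part of the release tooling: the function name `baseline_mismatches` and its constants are hypothetical, and it assumes that SM70 corresponds to CUDA compute capability 7.0.

```python
from typing import List, Optional, Tuple

# Expected values copied from the recommended runtime baseline.
EXPECTED_PYTHON = (3, 12)
EXPECTED_CUDA = "12.8"
EXPECTED_CAPABILITY = (7, 0)  # assumption: SM70 == compute capability 7.0


def baseline_mismatches(
    python_version: Tuple[int, int],
    cuda_version: Optional[str],
    capability: Optional[Tuple[int, int]],
) -> List[str]:
    """Return human-readable mismatches against the recommended baseline."""
    problems = []
    if tuple(python_version[:2]) != EXPECTED_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} != 3.12"
        )
    if cuda_version != EXPECTED_CUDA:
        problems.append(f"CUDA {cuda_version} != {EXPECTED_CUDA}")
    if capability != EXPECTED_CAPABILITY:
        problems.append(
            f"compute capability {capability} != {EXPECTED_CAPABILITY}"
        )
    return problems
```

On a host with PyTorch installed, the inputs could be read from `sys.version_info[:2]`, `torch.version.cuda`, and `torch.cuda.get_device_capability(0)`. An empty result means the three checked values match the baseline.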
Packaging
The local release wheel directory for this build is expected to be:
../dist-cu128-sm70-1.0.0
Install both wheels together:
python -m pip install \
  ../dist-cu128-sm70-1.0.0/flash_attn_v100-*.whl \
  ../dist-cu128-sm70-1.0.0/vllm-*.whl
Notes
- DFlash remains experimental and should be benchmarked against the normal MTP path before production use.
- The V100 FP8 work targets the FP8 KV cache and V100-specific fast paths. It is not Hopper-style native FP8 Tensor Core W8A8 inference.
- Generated benchmark outputs, server logs, model weights, and local cache artifacts are not part of the source release.
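For scripted deployments, the two-wheel install from the Packaging section can be assembled programmatically instead of relying on shell globbing. This is a rough sketch under stated assumptions: `build_install_command` is a hypothetical helper, and it only assumes that the directory contains exactly one wheel for each of the two patterns used in the command above.

```python
import sys
from pathlib import Path
from typing import List

# Glob patterns taken from the pip command in the Packaging section.
WHEEL_PATTERNS = ("flash_attn_v100-*.whl", "vllm-*.whl")


def build_install_command(wheel_dir: Path) -> List[str]:
    """Build one pip command that installs exactly one wheel per pattern."""
    wheels = []
    for pattern in WHEEL_PATTERNS:
        matches = sorted(wheel_dir.glob(pattern))
        if len(matches) != 1:
            # Refuse to guess when a wheel is missing or ambiguous.
            raise FileNotFoundError(
                f"expected exactly one match for {pattern} in {wheel_dir}, "
                f"found {len(matches)}"
            )
        wheels.append(str(matches[0]))
    # Both wheels go into a single pip invocation, as the notes recommend.
    return [sys.executable, "-m", "pip", "install", *wheels]
```

The returned list can be passed to `subprocess.run(..., check=True)`. Failing on zero or multiple matches avoids silently installing a stale wheel left in the directory.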
1Cat-vLLM-0.0.3
0.0.3 Highlights
1Cat-vLLM-0.0.3 is a larger V100 / SM70 release than 0.0.2. The previous public release mainly provided a vLLM wheel for Tesla V100 built for CUDA 12.8 and Python 3.12. This release turns the V100 path into a more complete serving profile for modern AWQ models, with a dedicated attention backend, clearer 4-GPU launch defaults, separate V100 FlashAttention runtime wheels, and public regression artifacts.
What Improved Over 0.0.2
- Dedicated V100 attention path. FLASH_ATTN_V100 is now the recommended runtime backend for SM70. It is wired into vLLM's attention backend registry and is selected with VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100, replacing the older public recommendation that leaned on the Triton attention fallback.
- V100 FlashAttention runtime extensions. The new flash_attn_v100 wheel provides the SM70 CUDA extensions used by this backend, including flash_attn_v100_cuda and paged_kv_utils. These extensions support the V100 decode/prefill path used by the 0.0.3 launch profiles.
- Better long-context serving defaults. The public commands now use explicit max_model_len, max_num_seqs=1, and max_num_batched_tokens values for 4 x 32 GB V100 systems. The 122B profile is documented as a long-context configuration with a reduced prefill chunk budget to leave room for MoE temporary workspace.
- Expanded Qwen3.5 / Qwen3.6 coverage. 0.0.3 documents and validates public profiles for Qwen3.5-27B-AWQ, Qwen3.6-35B-A3B-AWQ, and Qwen3.5-122B-A10B-AWQ, including dense and MoE serving paths on V100.
- SM70 AWQ and MoE runtime work. The release keeps the TurboMind SM70 WMMA AWQ path and updates the MoE execution path for V100, including compressed-tensors MoE integration and runtime defaults intended to work with CUDA graphs.
- Qwen reasoning parser compatibility. The Qwen3 reasoning parser handles Qwen3.5-style thinking prompts where the chat template may open the <think> block before generation begins.
- Cleaner source and wheel distribution. The public install path is now explicit: install the V100 FlashAttention wheel together with the vLLM wheel. The source tree was cleaned for open release, while retaining the build dependencies needed for SM70 source builds.
- Public regression artifacts. 0.0.3 includes benchmark/regression chart assets under docs/test-table for the three recommended public models on a 4-card V100 32 GB reference host.
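The reasoning-parser item is easier to see with a concrete case. When the chat template has already opened the <think> block in the prompt, the generated text contains only the closing tag. The sketch below illustrates that situation; it is not the parser shipped in this release, `split_reasoning` is a hypothetical name, and the handling of output with no closing tag is an arbitrary choice made for this example.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(generated: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Covers both cases: the output carries its own <think> tag, or the
    chat template opened <think> in the prompt so only </think> appears.
    """
    text = generated.lstrip()
    if text.startswith(THINK_OPEN):
        # Drop the opening tag when the model emitted it itself.
        text = text[len(THINK_OPEN):]
    reasoning, sep, answer = text.partition(THINK_CLOSE)
    if not sep:
        # No closing tag: this sketch returns the text unchanged as the answer.
        return "", generated.strip()
    return reasoning.strip(), answer.strip()
```

For example, `split_reasoning("plan</think>answer")` and `split_reasoning("<think>plan</think>answer")` both yield the same reasoning and answer parts.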
Recommended 4-card V100 Profiles
These are the release defaults documented in the README:
| Model | TP | max_model_len | max_num_seqs | max_num_batched_tokens | Notes |
|---|---|---|---|---|---|
| Qwen3.5-27B-AWQ | 4 | 36000 | 1 | 16384 | stable public dense-model default |
| Qwen3.6-35B-A3B-AWQ | 4 | 33000 | 1 | 16384 | stable public MoE default |
| Qwen3.5-122B-A10B-AWQ | 4 | 256000 | 1 | 8096 | long-context large-model default |
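One way to keep launch scripts in sync with the table is to store the profiles as data and derive the command-line flags from them. The sketch below is illustrative only: `PROFILES` and `serve_flags` are hypothetical names, and the flag spellings assume that this build keeps upstream vLLM's `--tensor-parallel-size`, `--max-model-len`, `--max-num-seqs`, `--max-num-batched-tokens`, and `--compilation-config` options. The numbers and the compilation setting are copied from this section.

```python
import json
from typing import Dict, List

# Values copied from the recommended 4-card V100 profile table.
PROFILES: Dict[str, Dict[str, int]] = {
    "Qwen3.5-27B-AWQ": {
        "tp": 4, "max_model_len": 36000,
        "max_num_seqs": 1, "max_num_batched_tokens": 16384,
    },
    "Qwen3.6-35B-A3B-AWQ": {
        "tp": 4, "max_model_len": 33000,
        "max_num_seqs": 1, "max_num_batched_tokens": 16384,
    },
    "Qwen3.5-122B-A10B-AWQ": {
        "tp": 4, "max_model_len": 256000,
        "max_num_seqs": 1, "max_num_batched_tokens": 8096,
    },
}

# Compilation setting quoted for the public commands in this section.
COMPILATION_CONFIG = {
    "cudagraph_mode": "full_and_piecewise",
    "cudagraph_capture_sizes": [1, 2],
}


def serve_flags(model: str) -> List[str]:
    """Return CLI flags for one profile (flag names assumed from upstream vLLM)."""
    p = PROFILES[model]
    return [
        "--tensor-parallel-size", str(p["tp"]),
        "--max-model-len", str(p["max_model_len"]),
        "--max-num-seqs", str(p["max_num_seqs"]),
        "--max-num-batched-tokens", str(p["max_num_batched_tokens"]),
        # Compact separators reproduce the JSON string used in this section.
        "--compilation-config",
        json.dumps(COMPILATION_CONFIG, separators=(",", ":")),
    ]
```

The flag list is deliberately separate from the entrypoint and the model path, which depend on the local setup and should be taken from the README commands.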
Recommended environment switches:
export VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100
export VLLM_SM70_ENABLE_LM_HEAD_FASTPATH=1
Recommended compilation setting used by the public commands:
--compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1,2]}'
Wheel Install
Install both wheels in one pip install command. The flash_attn_v100 wheel is required by the FLASH_ATTN_V100 backend, and the vLLM wheel contains the 0.0.3 runtime.
After installing from wheels, run python -m vllm... from outside the source checkout, for example after cd ~ or cd /tmp. Running it inside the cloned repository makes Python import the local source tree instead of the installed wheel package, and only the wheel package contains vllm/_C.abi3.so.
python -m pip install \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v0.0.3/flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl" \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v0.0.3/vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl"
Published assets:
- flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl
- vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl
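The import-shadowing problem described in the Wheel Install notes can be detected before launching the server. The following is a minimal sketch using only the standard library; `is_under` and `shadowed_by_cwd` are hypothetical helper names, not part of vLLM.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def is_under(path: Path, directory: Path) -> bool:
    """True if path is located inside directory (after resolving symlinks)."""
    try:
        path.resolve().relative_to(directory.resolve())
        return True
    except ValueError:
        return False


def shadowed_by_cwd(module_name: str, cwd: Optional[Path] = None) -> bool:
    """True if module_name would be imported from the current directory tree."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.has_location or spec.origin is None:
        # Not importable, or a built-in/namespace module without a file.
        return False
    return is_under(Path(spec.origin), cwd or Path.cwd())
```

Calling `shadowed_by_cwd("vllm")` from the intended working directory returns `True` when Python would pick up a local source tree rather than the installed wheel, which is the situation the note above warns about.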
Validated Stack
- Python 3.12
- CUDA 12.8
- PyTorch 2.9.1+cu128
- Target GPU: Tesla V100 / SM70, with public commands written for 4 x V100 32 GB
Notes
- First-request warmup on V100 can be slow because kernels and CUDA graphs are being prepared. Steady-state throughput should be measured after warmup.
- FLASH_ATTN_V100 is the recommended public attention backend for 0.0.3.
- Direct paged prefill remains experimental and is not enabled in the public default commands.
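The warmup note above implies that the first request or requests should be excluded when reporting throughput. A simple way to do that is sketched below. `steady_state_throughput` is a hypothetical helper, and the choice of how many requests count as warmup is an assumption to tune per host.

```python
from typing import List, Tuple


def steady_state_throughput(
    samples: List[Tuple[int, float]], warmup: int = 1
) -> float:
    """Tokens per second over all requests after the first `warmup` ones.

    Each sample is (generated_tokens, elapsed_seconds) for one request.
    """
    measured = samples[warmup:]
    total_time = sum(seconds for _, seconds in measured)
    if not measured or total_time <= 0:
        raise ValueError("need at least one timed request after warmup")
    total_tokens = sum(tokens for tokens, _ in measured)
    return total_tokens / total_time
```

With made-up samples such as `[(100, 50.0), (100, 2.0), (100, 2.0)]`, including the slow first request in the average would understate the steady-state rate considerably, which is why the default skips it.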
1Cat-vLLM-0.0.2
1Cat-vLLM-0.0.2 public release.
Included asset:
- vllm-0.15.2rc1.dev2+g72bb24e2d.d20260320-cp312-cp312-linux_x86_64.whl
Target environment:
- Python 3.12
- CUDA 12.8
- SM70 / Tesla V100