Releases: 1CatAI/1Cat-vLLM

1Cat-vLLM 1.1.0 Beta / Experimental

27 May 13:40

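1Cat-vLLM 1.1.0 Beta / Experimental is an experimental update for V100/SM70. Its main purpose is to provide an early validation package for the FP8 path.

Highlights:

  • Beta/Experimental: Qwen3.6 FP8 inference path, including TurboMind/LMDeploy FP8 GEMM adaptation work.
  • Optimized FP8 KV cache handling and dequantization overhead.
  • Fixed and optimized MTP stability, MoE/MTP routing, and the verifier graph-capture configuration.
  • Fixed compatibility issues when prefix caching, Mamba align, and async scheduling are combined.
  • Improved tool calling, Qwen3.x template compatibility, FA2 long-context stability, and output quality stability.
  • DFlash remains an experimental feature.

Notes:

  • FP8, DFlash, and some of the MTP optimizations are still experimental and are not recommended as production defaults.
  • Release package target environment: CUDA 12.8, Python 3.12, V100/SM70.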

1Cat-vLLM 1.0.0

13 May 06:38

1Cat-vLLM 1.0.0 Release Notes

This release focuses on the V100/SM70 serving path for Qwen3.5/Qwen3.6-class models, long-context stability, and reproducible local wheel deployment.

Highlights

  • Added FP8 KV cache support for the V100 FlashAttention path, including regression coverage for operator correctness and model-level generation behavior.
  • Fixed model output quality regressions observed in long-context and MTP serving, including repeated output, abnormal punctuation-only output, and unstable NaN-like decode behavior.
  • Fixed OpenAI-compatible tool calling and chat serving behavior, improving compatibility with Cherry Studio, OpenClaw, OpenCode, and OpenAI-style clients.
  • Improved runtime stability for Qwen3.5/Qwen3.6 model families, including Qwen3.6-27B-AWQ serving with 256K context.
  • Reduced unnecessary startup memory pressure and tightened V100 memory defaults for more predictable deployment.
  • Added MTP speculative decoding support and serving flags, with regression tooling for acceptance length, quality, and throughput.
  • Introduced DFlash as an experimental speculative decoding path. It is included for validation and continued tuning, not as the default production path.
  • Improved FlashAttention V100 dense prefill and paged decode paths, with stricter operator-level quality and speed regression scripts.
  • Added benchmark and audit utilities for FA2/Triton comparison, prefix-cache plus MTP serving, Qwen3.6 output quality checks, and OpenAI API compatibility testing.

Recommended Runtime Baseline

  • GPU: NVIDIA V100 / SM70
  • CUDA: 12.8
  • Python: 3.12
  • PyTorch: 2.9.1 + cu128
  • Default context target: 256K
  • Recommended deployment path: install the local flash_attn_v100 wheel together with the local vllm wheel.
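
A host can be compared against this baseline before installing anything. The following is a minimal sketch, not part of the release: the `Baseline` class and `check_baseline` function are hypothetical names, and the probe at the bottom assumes PyTorch is importable (it prints a message and skips the GPU checks otherwise).

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    # Values copied from the "Recommended Runtime Baseline" list.
    python: tuple = (3, 12)
    cuda: str = "12.8"
    torch_prefix: str = "2.9.1"
    compute_capability: tuple = (7, 0)  # SM70


def check_baseline(python_version, cuda_version, torch_version, capability,
                   baseline=Baseline()):
    """Return a list of mismatch messages; an empty list means the host matches."""
    problems = []
    if tuple(python_version[:2]) != baseline.python:
        problems.append(f"Python {python_version[0]}.{python_version[1]} != 3.12")
    if cuda_version != baseline.cuda:
        problems.append(f"CUDA {cuda_version} != {baseline.cuda}")
    if not str(torch_version).startswith(baseline.torch_prefix):
        problems.append(f"PyTorch {torch_version} != {baseline.torch_prefix}")
    if capability is None or tuple(capability) != baseline.compute_capability:
        problems.append(f"GPU capability {capability} != (7, 0)")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot probe CUDA or the GPU.")
    else:
        cap = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else None
        issues = check_baseline(sys.version_info, torch.version.cuda,
                                torch.__version__, cap)
        print("\n".join(issues) if issues else "Host matches the baseline.")
```

An empty result only says the versions line up; it does not confirm that the wheels load or that the model fits in memory.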

Packaging

The local release wheel directory for this build is expected to be:

../dist-cu128-sm70-1.0.0

Install both wheels together:

python -m pip install \
  ../dist-cu128-sm70-1.0.0/flash_attn_v100-*.whl \
  ../dist-cu128-sm70-1.0.0/vllm-*.whl

Notes

  • DFlash remains experimental and should be benchmarked against the normal MTP path before production use.
  • The V100 FP8 work targets the FP8 KV cache and V100-specific fast paths. It is not Hopper-style native FP8 Tensor Core W8A8 inference.
  • Generated benchmark outputs, server logs, model weights, and local cache artifacts are not part of the source release.
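
The difference between FP8 storage and native FP8 compute, mentioned in the second note, can be illustrated with a generic sketch. This is not the project's kernel: the E4M3 format, the per-tensor scale, and all function names below are assumptions chosen for illustration. Values are rounded to an FP8-like grid when written to the cache and multiplied back by the scale before any arithmetic, which then runs in a wider type.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value (software emulation)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)          # saturate instead of overflowing
    _, exp = math.frexp(a)             # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)               # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_kv(values):
    """Scale values into the E4M3 range and round them; returns (codes, scale)."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize_kv(codes, scale):
    """Recover approximate values; attention math would run on these."""
    return [c * scale for c in codes]


codes, scale = quantize_kv([0.5, -2.0, 4.0, 0.013])
print(dequantize_kv(codes, scale))
```

The cache holds one byte per element, but the GPU never multiplies FP8 operands directly, which is the distinction the note draws against W8A8 inference.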

1Cat-vLLM-0.0.3

01 May 04:02

0.0.3 Highlights

1Cat-vLLM-0.0.3 is a larger V100 / SM70 release than 0.0.2. The previous public release mainly provided a vLLM wheel built for CUDA 12.8 and Python 3.12 on Tesla V100. This release turns the V100 path into a more complete serving profile for modern AWQ models, with a dedicated attention backend, clearer 4-GPU launch defaults, separate V100 FlashAttention runtime wheels, and public regression artifacts.

What Improved Over 0.0.2

  • Dedicated V100 attention path. FLASH_ATTN_V100 is now the recommended runtime backend for SM70. It is wired into vLLM's attention backend registry and is selected with VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100, replacing the older public recommendation that leaned on the Triton attention fallback.
  • V100 FlashAttention runtime extensions. The new flash_attn_v100 wheel provides the SM70 CUDA extensions used by this backend, including flash_attn_v100_cuda and paged_kv_utils. These extensions support the V100 decode/prefill path used by the 0.0.3 launch profiles.
  • Better long-context serving defaults. The public commands now use explicit max_model_len, max_num_seqs=1, and max_num_batched_tokens values for 4 x 32 GB V100 systems. The 122B profile is documented as a long-context configuration with a reduced prefill chunk budget to leave room for MoE temporary workspace.
  • Expanded Qwen3.5 / Qwen3.6 coverage. 0.0.3 documents and validates public profiles for Qwen3.5-27B-AWQ, Qwen3.6-35B-A3B-AWQ, and Qwen3.5-122B-A10B-AWQ, including dense and MoE serving paths on V100.
  • SM70 AWQ and MoE runtime work. The release keeps the TurboMind SM70 WMMA AWQ path and updates the MoE execution path for V100, including compressed-tensors MoE integration and runtime defaults intended to work with CUDA graphs.
  • Qwen reasoning parser compatibility. The Qwen3 reasoning parser handles Qwen3.5-style thinking prompts where the chat template may open the <think> block before generation begins.
  • Cleaner source and wheel distribution. The public install path is now explicit: install the V100 FlashAttention wheel together with the vLLM wheel. The source tree was cleaned for open release, while retaining the build dependencies needed for SM70 source builds.
  • Public regression artifacts. 0.0.3 includes benchmark/regression chart assets under docs/test-table for the three recommended public models on a 4-card V100 32 GB reference host.

Recommended 4-card V100 Profiles

These are the release defaults documented in the README:

Model                 | TP | max_model_len | max_num_seqs | max_num_batched_tokens | Notes
Qwen3.5-27B-AWQ       | 4  | 36000         | 1            | 16384                  | stable public dense-model default
Qwen3.6-35B-A3B-AWQ   | 4  | 33000         | 1            | 16384                  | stable public MoE default
Qwen3.5-122B-A10B-AWQ | 4  | 256000        | 1            | 8096                   | long-context large-model default
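
The profiles above map directly onto launch arguments. The sketch below turns a table row into an argument list, assuming this fork keeps upstream vLLM's `vllm serve` entry point and flag names; the `PROFILES` dictionary, the `serve_argv` helper, and the model path are hypothetical and are not taken from the README.

```python
import shlex

# Values copied from the profile table; keys are illustrative.
PROFILES = {
    "Qwen3.5-27B-AWQ": {"tp": 4, "max_model_len": 36000,
                        "max_num_seqs": 1, "max_num_batched_tokens": 16384},
    "Qwen3.6-35B-A3B-AWQ": {"tp": 4, "max_model_len": 33000,
                            "max_num_seqs": 1, "max_num_batched_tokens": 16384},
    "Qwen3.5-122B-A10B-AWQ": {"tp": 4, "max_model_len": 256000,
                              "max_num_seqs": 1, "max_num_batched_tokens": 8096},
}


def serve_argv(profile_name, model_path):
    """Build a `vllm serve` argument list for one of the table profiles."""
    p = PROFILES[profile_name]
    return [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(p["tp"]),
        "--max-model-len", str(p["max_model_len"]),
        "--max-num-seqs", str(p["max_num_seqs"]),
        "--max-num-batched-tokens", str(p["max_num_batched_tokens"]),
    ]


# The model path is a placeholder; substitute the local checkpoint directory.
print(shlex.join(serve_argv("Qwen3.5-27B-AWQ", "/models/Qwen3.5-27B-AWQ")))
```

Keeping the numbers in one table-shaped structure makes it harder for a launch script to drift from the documented defaults.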

Recommended environment switches:

export VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100
export VLLM_SM70_ENABLE_LM_HEAD_FASTPATH=1

Recommended compilation setting used by the public commands:

--compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1,2]}'
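
When the server is started from a script rather than an interactive shell, the two environment switches and the compilation setting can be assembled programmatically. This is a minimal sketch; `launch_env` and `compilation_args` are hypothetical helper names, and only the variable names and JSON keys shown above come from the release notes.

```python
import json
import os
import shlex


def launch_env(base=None):
    """Copy an environment mapping and add the two recommended switches."""
    env = dict(os.environ if base is None else base)
    env["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN_V100"
    env["VLLM_SM70_ENABLE_LM_HEAD_FASTPATH"] = "1"
    return env


def compilation_args(capture_sizes=(1, 2)):
    """Return the --compilation-config flag with a compact JSON payload."""
    config = {
        "cudagraph_mode": "full_and_piecewise",
        "cudagraph_capture_sizes": list(capture_sizes),
    }
    # Compact separators reproduce the exact string used in the public commands.
    return ["--compilation-config", json.dumps(config, separators=(",", ":"))]


# shlex.join adds the shell quoting the JSON payload needs.
print(shlex.join(compilation_args()))
```

The returned environment can be passed as `env=` to `subprocess.run`, and the argument pair appended to the rest of the launch command.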

Wheel Install

Install both wheels in one pip install command. The flash_attn_v100 wheel is required by the FLASH_ATTN_V100 backend, and the vLLM wheel contains the 0.0.3 runtime.

After installing from wheels, run python -m vllm... from outside the source checkout, such as cd ~ or cd /tmp. Running inside the cloned repository makes Python import the local source tree instead of the wheel package that contains vllm/_C.abi3.so.

python -m pip install \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v0.0.3/flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl" \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v0.0.3/vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl"
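
After installation, the import-location caveat above can be checked without importing vLLM itself. The sketch below uses the standard-library `importlib.util.find_spec` to see where a package would be loaded from; the helper names are hypothetical. It reports a false positive if the virtual environment itself lives inside the current directory.

```python
import importlib.util
import os


def is_under(path, directory):
    """True if path is located inside directory (after resolving symlinks)."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    try:
        return os.path.commonpath([path, directory]) == directory
    except ValueError:  # e.g. paths on different drives
        return False


def shadowed_by_cwd(module_name, cwd=None):
    """True if module_name would be imported from the current directory tree."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.has_location or spec.origin is None:
        return False  # not installed, built-in, or a namespace package
    return is_under(spec.origin, cwd if cwd is not None else os.getcwd())


if shadowed_by_cwd("vllm"):
    print("vllm resolves to the current directory tree; cd elsewhere first.")
else:
    print("vllm does not resolve to the current directory tree.")
```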

Published assets:

  • flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl
  • vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl
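
The asset names encode the interpreter and platform they were built for. As a rough illustration, the sketch below splits a wheel filename into its standard fields (name, version, Python tag, ABI tag, platform tag) and checks the Python tag against an interpreter version. The function names are hypothetical, only CPython tags are considered, and pip performs a far more complete compatibility check.

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # the optional sixth field is a build tag
        raise ValueError(f"unexpected wheel name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {"name": parts[0], "version": parts[1], "python": python_tag,
            "abi": abi_tag, "platform": platform_tag}


def matches_interpreter(filename, version_info=None):
    """True if the wheel's Python tag names this CPython major.minor version."""
    info = sys.version_info if version_info is None else version_info
    expected = f"cp{info[0]}{info[1]}"
    return expected in wheel_tags(filename)["python"].split(".")


for asset in (
    "flash_attn_v100-26.2-cp312-cp312-linux_x86_64.whl",
    "vllm-0.0.3.dev0+g72bb24e2d.d20260430.cu128-cp312-cp312-linux_x86_64.whl",
):
    print(asset, "->", "ok" if matches_interpreter(asset) else "tag mismatch")
```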

Validated Stack

  • Python 3.12
  • CUDA 12.8
  • PyTorch 2.9.1+cu128
  • Target GPU: Tesla V100 / SM70, with public commands written for 4 x V100 32 GB

Notes

  • First-request warmup on V100 can be slow because kernels and CUDA graphs are being prepared. Steady-state throughput should be measured after warmup.
  • FLASH_ATTN_V100 is the recommended public attention backend for 0.0.3.
  • Direct paged prefill remains experimental and is not enabled in the public default commands.
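
The warmup note translates into a simple measurement pattern: issue a few untimed requests first, then time the rest. The sketch below is a generic harness, not one of the project's benchmark scripts; `measure_steady_state` and its parameters are hypothetical, and `request_fn` stands for whatever callable sends one request to the server.

```python
import time


def measure_steady_state(request_fn, warmup=2, iterations=5,
                         clock=time.perf_counter):
    """Run untimed warmup calls, then return the mean duration of timed calls."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    for _ in range(warmup):
        request_fn()  # kernels and CUDA graphs are prepared here; not timed
    durations = []
    for _ in range(iterations):
        start = clock()
        request_fn()
        durations.append(clock() - start)
    return sum(durations) / len(durations)


# Stand-in workload; replace with a function that calls the serving endpoint.
mean_seconds = measure_steady_state(lambda: sum(range(10_000)))
print(f"mean steady-state latency: {mean_seconds:.6f} s")
```

How many warmup requests are enough depends on the model and the configured CUDA graph capture sizes, so the default of two is only a placeholder.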

1Cat-vLLM-0.0.2

21 Mar 04:01

1Cat-vLLM-0.0.2 public release.

Included asset:

  • vllm-0.15.2rc1.dev2+g72bb24e2d.d20260320-cp312-cp312-linux_x86_64.whl

Target environment:

  • Python 3.12
  • CUDA 12.8
  • SM70 / Tesla V100