Releases · thu-pacman/chitu

支持 LLaDA2.1-mini、LLaDA2.1-flash 模型，是赤兔首次支持扩散语言模型。
支持 Kimi-K2.6。
向后兼容 Qwen3-Coder-Next-FP8、GLM4.7-FP8。
进一步优化 DeepSeek-V3.2 及类似模型中的稀疏 attention。
优化 Qwen3-Next 和 Qwen3.5 模型开启 MTP 时的性能。
面向前缀缓存优化 DP 路由策略。
优化模型加载速度。
工具调用兼容 PD 分离。
支持单独控制各模块的日志级别（文档）。
将 mooncake 作为 pip 安装时的可选依赖，不再需要单独安装。
更新 flashinfer 可选依赖。
修复在海光平台上的一些兼容问题。
修复 infer.prefix_chunk_size 较大时的溢出问题。
修复多 stream 导致的显存不能及时释放的问题。
修复只有 PCIe 互联的环境上的 allreduce 性能。
删除了 infer.use_cuda_graph=auto 时在昇腾平台上默认关闭 graph 的一项过时判断。
重构算子分发逻辑。
重构异步调度。
改进代码仓库中的若干测试。

Added support for the LLaDA2.1-mini and LLaDA2.1-flash models, the first diffusion language models supported by Chitu.
Added support for Kimi-K2.6.
Added backward compatibility for Qwen3-Coder-Next-FP8 and GLM4.7-FP8.
Further optimized sparse attention in DeepSeek-V3.2 and similar models.
Optimized the performance of Qwen3-Next and Qwen3.5 models when MTP is enabled.
Optimized DP routing strategy with respect to prefix caching.
Optimized model loading speed.
Made tool calling compatible with PD-disaggregation.
Added log level control for individual modules (doc).
Added mooncake as an optional dependency of the pip install; it no longer needs to be installed separately.
Upgraded the flashinfer optional dependency.
Fixed some compatibility issues on the Hygon platform.
Fixed overflow issues when infer.prefix_chunk_size is large.
Fixed delayed device memory release caused by using multiple streams.
Fixed allreduce performance on platforms where PCIe is the only interconnect.
Removed an outdated check that disabled graphs by default on Ascend platforms when infer.use_cuda_graph=auto.
Refactored operator dispatching.
Refactored asynchronous scheduling.
Improved multiple tests in the repository.

Official Docker images / 官方 docker 镜像:

英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.5
英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.5
沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.5
昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.5
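
The image references above all share one registry prefix and one tag, so they can be assembled programmatically. The following is a minimal sketch, not part of Chitu: the helper name `image_ref` and the platform keys are hypothetical, and only the four v0.5.5 images listed above are covered.

```python
# Registry prefix and tag copied from the v0.5.5 image list above.
REGISTRY = "qingcheng-ai-cn-beijing.cr.volces.com/public"
TAG = "v0.5.5"

# Hypothetical platform keys mapped to the image names in the list above.
V0_5_5_IMAGES = {
    "nvidia_arch_80_89": "chitu-nvidia_arch_80_89",
    "nvidia_arch_90": "chitu-nvidia_arch_90",
    "muxi": "chitu-muxi",
    "ascend_a2": "chitu-ascend_a2",
}


def image_ref(platform: str) -> str:
    """Return the full v0.5.5 image reference for a platform key."""
    if platform not in V0_5_5_IMAGES:
        known = ", ".join(sorted(V0_5_5_IMAGES))
        raise ValueError(f"unknown platform {platform!r}; expected one of: {known}")
    return f"{REGISTRY}/{V0_5_5_IMAGES[platform]}:{TAG}"


if __name__ == "__main__":
    # Print a pull command for each listed image.
    for key in V0_5_5_IMAGES:
        print("docker pull " + image_ref(key))
```

Running the script prints one `docker pull` line per image; it does not contact the registry.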

新模型：GLM-5.1 系列和 Kimi-K2.5。
兼容 OpenAI /response API（文档）。
支持在 Qwen3-Next 和 Qwen3.5 系列模型中使用 MTP。
支持在 DP 并行时例外地 TP 并行 embed_token 和 lm_head 层（设置 infer.embed_tokens_lm_head_tp_size 使用）。
新增优雅关闭服务的功能。
新增内置的 profiling 功能。
优化 DeepSeek-V3.2 及类似模型中的稀疏 attention。
弃用 infer.max_reqs 选项，改为意义更明确的 infer.max_batch_size 和 infer.max_concurrent_requests 选项。
增加 infer.mla_absorb=auto 选项，并将 auto 作为默认值。
重构有状态的采样器。
重构 KV cache，使其更能适配不同种类的模型。
重构代码仓库中包含的若干测试。
修复前缀缓存中的若干问题。
修复 MoE 算子分发以及 DeepGEMM 集成的若干问题。
修复 log 不能被完整存储到文件的问题。
修复带有稠密层的 MoE 模型无法使用 DP+EP+PP 并行的问题。
修复依据 infer.memory_utilization 选项自动分配 KV cache 时的计算错误。
修复监控数据中显存占用量的显示错误。
进一步修复 tokenizer 解码特殊字符时的问题。

New models: GLM-5.1 series and Kimi-K2.5.
Compatibility with OpenAI /response API (doc).
MTP support in Qwen3-Next and Qwen3.5 series models.
Support for using TP for the embed_token and lm_head layers as an exception while the other layers use DP (set infer.embed_tokens_lm_head_tp_size to enable).
New feature to gracefully shut down the service.
New built-in profiling support.
Optimized sparse attention in DeepSeek-V3.2 and similar models.
Deprecated the infer.max_reqs argument in favor of the more explicit infer.max_batch_size and infer.max_concurrent_requests arguments.
New infer.mla_absorb=auto option, with auto as the new default value.
Refactoring of stateful samplers.
Refactoring of the KV cache to better support different types of models.
Refactoring of some tests in the repo.
Fixes for prefix caching.
Fixes for MoE operator dispatching and DeepGEMM integration.
Fix for logs not being fully saved to files.
Fix for MoE models with dense layers being unable to use DP+EP+PP parallelism.
Fix for a calculation error when automatically allocating KV cache based on infer.memory_utilization.
Fix for incorrect device memory usage shown in monitoring metrics.
Further fixes for tokenizer decoding of special characters.
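
For configurations that still set the deprecated infer.max_reqs, a small migration step can rewrite the key before launch. The sketch below is illustrative only: the function name `migrate_max_reqs` is hypothetical, and reusing the old value for both new options is an assumption, since the release notes do not state how the old option maps onto the new ones.

```python
from __future__ import annotations


def migrate_max_reqs(overrides: dict[str, str]) -> dict[str, str]:
    """Return a copy of `overrides` without the deprecated infer.max_reqs key.

    ASSUMPTION (not from the release notes): the old value is used as the
    starting value for both infer.max_batch_size and
    infer.max_concurrent_requests, unless the caller already set them.
    """
    result = dict(overrides)  # leave the caller's mapping untouched
    old_value = result.pop("infer.max_reqs", None)
    if old_value is not None:
        # setdefault keeps any explicit value the caller already provided.
        result.setdefault("infer.max_batch_size", old_value)
        result.setdefault("infer.max_concurrent_requests", old_value)
    return result


if __name__ == "__main__":
    print(migrate_max_reqs({"infer.max_reqs": "64", "infer.mla_absorb": "auto"}))
```

Explicitly set new options take precedence over the migrated value, so the helper is safe to apply to configurations that have already been partly updated.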

Official Docker images / 官方 docker 镜像:

英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.4
英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.4
沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.4
昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.4
昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.4

支持 Qwen3.5 系列模型。
支持前缀缓存（设置 infer.enable_prefix_caching 选项）。
监控数据支持 Grafana 可视化。
/tokenize 接口新增包含 chat template 的变体。
改进 Anthropic 接口的兼容性。
DeepSeek-V3.2 及类似模型中的 indexer 模块支持 soft FP8（设置 infer.raise_lower_bit_float_to 选项）。
升级 NVIDIA 镜像的基础软件版本。
重构 VL 多模态模型支持。
重构对不同 moe 算子实现的分发逻辑。
确保 ue8m0 scale 的处理与 DeepSeek 模型的原始定义一致。
优化 DeepGEMM 接入过程中的若干开销。
修复量化算子除以 0 的问题。
修复 EP 通信算子与共享专家算子的重叠优化。
修复 moe 临时激活值显存均衡功能错误地将阈值设置得过低的问题。
修复个别算子中的 int32 溢出问题。
修复 tokenizer 解码特殊字符时的若干问题。
修复流水线并行时对共享 embedding - lm_head 层的处理。
避免在构建镜像时在工作目录中留下 root 权限的临时文件。

New models: Qwen3.5 series.
Prefix caching (set infer.enable_prefix_caching argument).
Visualizing monitoring metrics with Grafana.
New variant of /tokenize endpoint that includes chat templates.
Improved compatibility with the Anthropic API.
Soft FP8 support for the indexer module in DeepSeek-V3.2 and similar models (set infer.raise_lower_bit_float_to argument).
Upgraded base dependencies in NVIDIA image build.
Refactoring of VL multimodal model support.
Refactoring of dispatching to different implementations of MoE operators.
Handling ue8m0 scales consistently with the original definition in DeepSeek models.
Reduced overhead in DeepGEMM integration.
Fix for a division-by-zero issue in some quantization operators.
Fix for overlapping EP communication with shared expert computation.
Fix for a too-low threshold in MoE temporary activation memory balancing.
Fix for an int32 overflow issue in some operators.
Fixes for tokenizer decoding of special characters.
Fix for handling the shared embedding - lm_head layer under Pipeline Parallelism.
Image builds no longer leave root-owned temporary files in the working directory.

Official Docker images / 官方 docker 镜像:

英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.3
英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.3
沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.3
昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.3
昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.3

工具调用支持。
新模型：GLM-4.6V、GLM-4.7-Flash、GLM-5、Qwen3-Coder-Next、DeepSeek-V3.2（非 Exp）。
支持在 DeepSeek-V3.2 及同类模型使用 FP8 KV cache。
针对 OpenAI 和 Anthropic 接口兼容性的多项改进。
改进 Triton MoE kernel 的 auto-tuning。
针对显存分配器在显存压力较大时隐式性能下降的问题增加警告。
增加记录请求 trace 的调试选项。
增加昇腾 MLA 优化算子。
沐曦 native layout kernel 兼容性更新。
针对 MTP 多项修复。
针对 KV cache 管理的多项修复。
针对 DeepSeek-V3.2、Qwen3-VL 等模型的修复。
修复 Interrupt（Ctrl+C）有时无法停止赤兔的问题（#127）。
修复针对低速 I/O 场景的分别进行模型权重预处理和模型推理的功能。
重构 PD 分离。
重构流水线并行（PP）。
重构 Qwen3-Next 及同类模型中的权重合并优化。
重构镜像构建脚本。

Tool calling support.
New models: GLM-4.6V, GLM-4.7-Flash, GLM-5, Qwen3-Coder-Next, DeepSeek-V3.2 (non-Exp).
Support for using FP8 KV cache in DeepSeek-V3.2 and similar models.
Multiple improvements to OpenAI and Anthropic API compatibility.
Improved auto-tuning of Triton MoE kernels.
New warnings about silent performance degradation of the device memory allocator under high memory pressure.
New debugging option for saving request trace.
New optimized kernel for MLA on Ascend platforms.
Compatibility update for native layout kernels on MetaX platforms.
Multiple fixes for MTP.
Multiple fixes for KV cache management.
Fixes for DeepSeek-V3.2, Qwen3-VL, and other models.
Fix for interruption (Ctrl+C) sometimes failing to stop Chitu (#127).
Fix for the feature that separates model weight preprocessing from model inference on slow-I/O platforms.
Refactoring of PD-disaggregation.
Refactoring of Pipeline Parallelism (PP).
Refactoring of the weight-merging optimization for Qwen3-Next and similar models.
Refactoring of image-building scripts.

Official Docker images / 官方 docker 镜像:

英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.2
英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.2
沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.2
昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.2
昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.2

初步支持摩尔线程 GPU。
MTP（multi-token prediction）支持。
优化开启 PP 或 DP 时的任务预处理性能。
针对 TP、EP 并行以及 MoE 类模型（特别是 Qwen3-Next 和 DeepSeek-V3.2 模型）进行算子融合和优化。
支持 Qwen3-VL 系列模型。
支持 FP8 KV cache 量化。
降低 MoE 模型在 prefill 时的峰值显存占用。
降低模型加载时的主存占用。
改进多机任务启动脚本。
性能测试脚本支持更多场景。
改进服务启动时的 auto tuning。
初步兼容 Anthropic API。
修复关于任务调度、KV cache 管理、性能监控等的多处问题。
对 MoE、元信息通信、异构 attention 的 KV cache 管理、算子性能测试等进行重构。

Initial support for MooreThreads GPUs.
MTP (multi-token prediction) support.
Optimize task preprocessing when PP or DP is enabled.
Operator fusion and optimization targeting TP and EP parallelism and MoE models (especially Qwen3-Next and DeepSeek-V3.2).
Support Qwen3-VL model series.
Support FP8 KV cache quantization.
Reduce peak device memory usage when prefilling MoE models.
Reduce main memory usage when loading models.
Improve multi-node launching scripts.
Cover more cases in benchmarking scripts.
Improve auto-tuning at service startup.
Initial support for the Anthropic API.
Multiple bug fixes for task scheduling, KV cache management, performance monitoring, etc.
Multiple refactorings for MoE, metadata communication, KV cache management for heterogeneous attention, operator benchmarking, etc.

Official Docker images / 官方 docker 镜像:

英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.1
英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.1
沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.1
昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.1
昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.1

针对集群部署性能的多项改进：

更好的 DP+TP+EP 混合并行支持。
MoE 负载均衡策略。
针对预处理和后处理的性能优化。
多处问题修复。

Multiple improvements to cluster deployment performance:

Better support for hybrid DP+TP+EP parallelism.
Load balancing strategy for MoE.
Optimizations for pre-processing and post-processing.
Multiple bug fixes.

Official Docker images / 官方 docker 镜像:

NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.5.0
Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.0
Ascend A2: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.5.0
Ascend A3: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend-a3:v0.5.0

Fixed some performance issues.

修复了一些性能问题。

Official Docker images / 官方 docker 镜像:

NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.3
Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.3
Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.3

Added support for some new models.
- Seed-OSS-36B-Instruct
- DeepSeek-V3.1
- See SUPPORTED_MODELS for details
Performance Optimization
- Support Chunked Prefill
- Support using DeepEP to optimize EP communication
  - requires extra installation of nvshmem (see installation guide)
  - CUDA Graph can be enabled when using DeepEP
Fixed some bugs

新增模型支持
- Seed-OSS-36B-Instruct
- DeepSeek-V3.1
- 详见 SUPPORTED_MODELS
性能优化
- 支持 Chunked Prefill
- 支持利用 DeepEP 优化 EP 通信
  - 需要额外安装 nvshmem（参考官方安装说明）
  - 利用 DeepEP 时可开启 CUDA graph
修复若干缺陷

Official Docker images / 官方 docker 镜像:

NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.2
Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.2
Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.2

Supported Expert Parallelism (EP). Enable it by setting infer.ep_size (which currently must equal infer.tp_size, so that the attention part is parallelized with TP at the same degree of parallelism).
Supported PD-disaggregated inference (requires additional dependencies; for now, please build it manually based on the Dockerfile, following the mooncake configuration guideline).
Supported hardware fp4 computation on NVIDIA Blackwell GPUs (requires additional dependencies; available when building from blackwell.Dockerfile).
Added support for some new models. See chitu/docs/en/SUPPORTED_MODELS.md at public-main · thu-pacman/chitu for details.
Fixed multiple bugs.
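
The constraint between infer.ep_size and infer.tp_size described above can be checked before launching a deployment. This is a minimal sketch under stated assumptions: `check_ep_config` is a hypothetical helper, not a Chitu API, and it encodes only the v0.4.1 rule that the two values must be equal.

```python
def check_ep_config(tp_size: int, ep_size: int) -> None:
    """Raise ValueError if the sizes violate the v0.4.1 EP constraint.

    Hypothetical pre-flight check: the release notes only state that
    infer.ep_size should currently equal infer.tp_size.
    """
    if tp_size < 1 or ep_size < 1:
        raise ValueError("tp_size and ep_size must be positive integers")
    if ep_size != tp_size:
        raise ValueError(
            f"infer.ep_size ({ep_size}) must equal infer.tp_size ({tp_size}) "
            "in this release"
        )


if __name__ == "__main__":
    check_ep_config(tp_size=8, ep_size=8)  # valid: same degree of parallelism
    try:
        check_ep_config(tp_size=8, ep_size=4)
    except ValueError as err:
        print(f"rejected: {err}")
```

Later releases mention broader hybrid parallelism (for example DP+TP+EP in v0.5.0), so this check should be treated as specific to the release it describes.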

支持专家并行（EP），设置 infer.ep_size 使用（目前需要与 infer.tp_size 相等，表示 attention 部分以相同的并行度进行 TP 并行）。
支持 PD 分离（需要额外依赖，当前请基于赤兔基础镜像，参考 mooncake 配置指南手动构建）。
支持在 NVIDIA Blackwell GPU 上进行硬件 fp4 计算（需要额外依赖，建议通过 blackwell.Dockerfile 构建镜像）。
新增部分模型支持，详见 chitu/docs/zh/SUPPORTED_MODELS.md at public-main · thu-pacman/chitu。
修复若干缺陷。

Official Docker images / 官方 docker 镜像:

NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.1
Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.1
Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.1

v0.4.0 marks a significant improvement over v0.3.x in performance and availability. We recommend that all medium-sized deployments (about 1-4 servers) upgrade to this version.

Highlighted changes:

Optimizations for platforms including NVIDIA GPUs, Ascend NPUs, MetaX GPUs, and Hygon DCUs.
Optimizations for models including DeepSeek-R1, Qwen3-32B, Kimi K2, GLM-4.5.

Official Docker images:

NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.0
Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.0
Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.0
