Releases: thu-pacman/chitu

v0.5.5

23 Apr 07:54

  • 支持 LLaDA2.1-mini、LLaDA2.1-flash 模型,是赤兔首次支持扩散语言模型。
  • 支持 Kimi-K2.6。
  • 向后兼容 Qwen3-Coder-Next-FP8、GLM4.7-FP8。
  • 进一步优化 DeepSeek-V3.2 及类似模型中的稀疏 attention。
  • 优化 Qwen3-Next 和 Qwen3.5 模型开启 MTP 时的性能。
  • 面向前缀缓存优化 DP 路由策略。
  • 优化模型加载速度。
  • 工具调用兼容 PD 分离。
  • 支持单独控制各模块的日志级别(文档)。
  • 将 mooncake 作为 pip 安装时的可选依赖,不再需要单独安装。
  • 更新 flashinfer 可选依赖。
  • 修复在海光平台上的一些兼容问题。
  • 修复 infer.prefix_chunk_size 较大时的溢出问题。
  • 修复多 stream 导致的显存不能及时释放的问题。
  • 修复只有 PCIe 互联的环境上的 allreduce 性能。
  • 删除了 infer.use_cuda_graph=auto 时在昇腾平台上默认关闭 graph 的一项过时判断。
  • 重构算子分发逻辑。
  • 重构异步调度。
  • 改进代码仓库中的若干测试。

  • Added support for the LLaDA2.1-mini and LLaDA2.1-flash models, the first diffusion language models supported by Chitu.
  • Added support for Kimi-K2.6.
  • Added backward compatibility with Qwen3-Coder-Next-FP8 and GLM4.7-FP8.
  • Further optimized sparse attention in DeepSeek-V3.2 and similar models.
  • Optimized the performance of Qwen3-Next and Qwen3.5 models when MTP is enabled.
  • Optimized DP routing strategy with respect to prefix caching.
  • Optimized model loading speed.
  • Made tool calling compatible with PD-disaggregation.
  • Added support for controlling the log level of individual modules (doc).
  • Added mooncake as an optional dependency of the pip install; it no longer needs to be installed separately.
  • Upgraded flashinfer optional dependency.
  • Fixed some compatibility issues on the Hygon platform.
  • Fixed overflow issues when infer.prefix_chunk_size is large.
  • Fixed late memory freeing caused by multi-streaming.
  • Fixed allreduce performance on platforms where PCIe is the only interconnect.
  • Removed an outdated check that turned off the graph by default on Ascend platforms when infer.use_cuda_graph=auto.
  • Refactored operator dispatching.
  • Refactored asynchronous scheduling.
  • Improved multiple tests in the repository.
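
The per-module log level item above can be pictured with Python's standard `logging` hierarchy. This is only a sketch of the general mechanism: the logger names (`myapp.scheduler`, `myapp.kv_cache`) and the helper `set_module_levels` are hypothetical, and Chitu's actual option names are described in its linked doc, not here.

```python
import logging

def set_module_levels(levels):
    """Apply {logger_name: level_name} overrides via the standard logging hierarchy."""
    for name, level in levels.items():
        # setLevel accepts level names such as "DEBUG"; unknown names raise ValueError.
        logging.getLogger(name).setLevel(level.upper())

# Everything defaults to WARNING; two hypothetical modules get their own levels.
logging.basicConfig(level=logging.WARNING)
set_module_levels({"myapp.scheduler": "DEBUG", "myapp.kv_cache": "ERROR"})
```

Child loggers such as `myapp.scheduler.worker` inherit the level of their nearest configured ancestor, which is what makes per-module control convenient.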

Official Docker images / 官方 docker 镜像:

  • 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.5
  • 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.5
  • 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.5
  • 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.5
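
As a rough aid for choosing between the two NVIDIA images listed above, the sketch below maps a GPU compute capability to an image reference. The helper name `nvidia_image_for` and its lookup table are illustrative assumptions derived only from the architecture labels in this list; it is not a tool shipped with Chitu.

```python
REGISTRY = "qingcheng-ai-cn-beijing.cr.volces.com/public"

# Architecture groups exactly as labeled in the image list above.
_NVIDIA_IMAGES = {
    (8, 0): "chitu-nvidia_arch_80_89",
    (8, 9): "chitu-nvidia_arch_80_89",
    (9, 0): "chitu-nvidia_arch_90",
}

def nvidia_image_for(major: int, minor: int, version: str = "v0.5.5") -> str:
    """Return the image reference for a compute capability (hypothetical helper)."""
    try:
        name = _NVIDIA_IMAGES[(major, minor)]
    except KeyError:
        raise ValueError(
            f"no listed image for compute capability {major}.{minor}"
        ) from None
    return f"{REGISTRY}/{name}:{version}"
```

If PyTorch is available, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair that this helper expects.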

v0.5.4

09 Apr 16:29

  • 新模型:GLM-5.1 系列和 Kimi-K2.5。
  • 兼容 OpenAI /response API(文档)。
  • 支持在 Qwen3-Next 和 Qwen3.5 系列模型中使用 MTP。
  • 支持在 DP 并行时例外地 TP 并行 embed_token 和 lm_head 层(设置 infer.embed_tokens_lm_head_tp_size 使用)。
  • 新增优雅关闭服务的功能。
  • 新增内置的 profiling 功能。
  • 优化 DeepSeek-V3.2 及类似模型中的稀疏 attention。
  • 弃用 infer.max_reqs 选项,改为意义更明确的 infer.max_batch_size 和 infer.max_concurrent_requests 选项。
  • 增加 infer.mla_absorb=auto 选项,并将 auto 作为默认值。
  • 重构有状态的采样器。
  • 重构 KV cache,使其更能适配不同种类的模型。
  • 重构代码仓库中包含的若干测试。
  • 修复前缀缓存中的若干问题。
  • 修复 MoE 算子分发以及 DeepGEMM 集成的若干问题。
  • 修复 log 不能被完整存储到文件的问题。
  • 修复带有稠密层的 MoE 模型无法使用 DP+EP+PP 并行的问题。
  • 修复依据 infer.memory_utilization 选项自动分配 KV cache 时的计算错误。
  • 修复监控数据中显存占用量的显示错误。
  • 进一步修复 tokenizer 解码特殊字符时的问题。

  • New models: GLM-5.1 series and Kimi-K2.5.
  • Compatibility with OpenAI /response API (doc).
  • MTP support in Qwen3-Next and Qwen3.5 series models.
  • Support for using TP for the embed_token and lm_head layers as an exception while the other layers use DP (set infer.embed_tokens_lm_head_tp_size to enable).
  • New feature to gracefully shut down the service.
  • New built-in profiling support.
  • Optimized sparse attention in DeepSeek-V3.2 and similar models.
  • Deprecated the infer.max_reqs argument in favor of the more explicit infer.max_batch_size and infer.max_concurrent_requests arguments.
  • New infer.mla_absorb=auto argument, where auto is the new default value.
  • Refactoring of stateful samplers.
  • Refactoring of the KV cache to better support different types of models.
  • Refactoring of some tests in the repo.
  • Fixes for prefix caching.
  • Fixes for MoE operator dispatching and DeepGEMM integration.
  • Fix for logs not being fully saved to files.
  • Fix for MoE models with dense layers being unable to use DP+EP+PP parallelism.
  • Fix for a calculation error when automatically allocating KV cache based on infer.memory_utilization.
  • Fix for incorrect device memory usage shown in the monitoring statistics.
  • Further fixes for tokenizer decoding of special characters.
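
For the infer.max_reqs deprecation noted above, a small check like the following could flag old launch arguments. It is a sketch under two assumptions: options are passed as `key=value` strings (the form used for `infer.mla_absorb=auto` in these notes), and the helper `find_deprecated` is hypothetical. It deliberately does not guess how the old value should be split between the two new options.

```python
# Deprecated option -> replacements, as named in the v0.5.4 notes above.
DEPRECATED = {
    "infer.max_reqs": ("infer.max_batch_size", "infer.max_concurrent_requests"),
}

def find_deprecated(overrides):
    """Return (key, replacements) for each deprecated key=value override."""
    hits = []
    for item in overrides:
        key, sep, _value = item.partition("=")
        key = key.strip()
        if sep and key in DEPRECATED:
            hits.append((key, DEPRECATED[key]))
    return hits

if __name__ == "__main__":
    args = ["infer.tp_size=8", "infer.max_reqs=64"]
    for key, new_keys in find_deprecated(args):
        print(f"{key} is deprecated; use {' and '.join(new_keys)} instead")
```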

Official Docker images / 官方 docker 镜像:

  • 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.4
  • 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.4
  • 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.4
  • 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.4
  • 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.4

v0.5.3

27 Mar 06:05

  • 支持 Qwen3.5 系列模型。
  • 支持前缀缓存(设置 infer.enable_prefix_caching 选项)。
  • 监控数据支持 Grafana 可视化。
  • /tokenize 接口新增包含 chat template 的变体。
  • 改进 Anthropic 接口的兼容性。
  • DeepSeek-V3.2 及类似模型中的 indexer 模块支持 soft FP8(设置 infer.raise_lower_bit_float_to 选项)。
  • 升级 NVIDIA 镜像的基础软件版本。
  • 重构 VL 多模态模型支持。
  • 重构对不同 moe 算子实现的分发逻辑。
  • 确保 ue8m0 scale 的处理与 DeepSeek 模型的原始定义一致。
  • 优化 DeepGEMM 接入过程中的若干开销。
  • 修复量化算子除以 0 的问题。
  • 修复 EP 通信算子与共享专家算子的重叠优化。
  • 修复 moe 临时激活值显存均衡功能错误地将阈值设置得过低的问题。
  • 修复个别算子中的 int32 溢出问题。
  • 修复 tokenizer 解码特殊字符时的若干问题。
  • 修复流水线并行时对共享 embedding - lm_head 层的处理。
  • 避免在构建镜像时在工作目录中留下 root 权限的临时文件。

  • New models: Qwen3.5 series.
  • Prefix caching (set infer.enable_prefix_caching argument).
  • Visualizing monitoring metrics with Grafana.
  • New variant of /tokenize endpoint that includes chat templates.
  • Improved compatibility with Anthropic API.
  • Soft FP8 support for the indexer module in DeepSeek-V3.2 and similar models (set infer.raise_lower_bit_float_to argument).
  • Upgraded base dependencies in NVIDIA image build.
  • Refactoring of VL multimodal model support.
  • Refactoring of dispatching to different implementations of MoE operators.
  • Handling of ue8m0 scales now matches the original DeepSeek model definition.
  • Reduced several overheads in the DeepGEMM integration.
  • Fix for a division-by-zero issue in quantization operators.
  • Fix for overlapping EP communication with shared expert computation.
  • Fix for MoE temporary activation memory balancing incorrectly setting its threshold too low.
  • Fix for an int32 overflow issue in some operators.
  • Fixes for tokenizer decoding of special characters.
  • Fix for handling of the shared embedding - lm_head layer under pipeline parallelism.
  • No longer leaving root-owned temporary files in the working directory when building images.
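
To illustrate what the prefix caching item above refers to in general terms, here is a toy sketch of block-level prefix matching with chained hashes. It shows the common idea only and is not Chitu's implementation; the class `ToyPrefixCache`, the block size, and the hashing scheme are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cached block (arbitrary for this toy)

def block_keys(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys, prev = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        keys.append(digest)
        prev = digest
    return keys

class ToyPrefixCache:
    def __init__(self):
        self._seen = set()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache all blocks."""
        keys = block_keys(tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self._seen:
                break
            hit_blocks += 1
        self._seen.update(keys)
        return hit_blocks * BLOCK_SIZE
```

Because each key depends on all earlier blocks, a block with identical tokens but a different preceding context does not count as a hit; only a genuinely shared prefix can be reused.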

Official Docker images / 官方 docker 镜像:

  • 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.3
  • 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.3
  • 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.3
  • 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.3
  • 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.3

v0.5.2

12 Mar 05:53

  • 工具调用支持。
  • 新模型:GLM-4.6V、GLM-4.7-Flash、GLM-5、Qwen3-Coder-Next、DeepSeek-V3.2(非 Exp)。
  • 支持在 DeepSeek-V3.2 及同类模型使用 FP8 KV cache。
  • 针对 OpenAI 和 Anthropic 接口兼容性的多项改进。
  • 改进 Triton MoE kernel 的 auto-tuning。
  • 针对显存分配器在显存压力较大时隐式性能下降的问题增加警告。
  • 增加记录请求 trace 的调试选项。
  • 增加昇腾 MLA 优化算子。
  • 沐曦 native layout kernel 兼容性更新。
  • 针对 MTP 多项修复。
  • 针对 KV cache 管理的多项修复。
  • 针对 DeepSeek-V3.2、Qwen3-VL 等模型的修复。
  • 修复 Interrupt(Ctrl+C)有时无法停止赤兔的问题(#127)。
  • 修复针对低速 I/O 场景的分别进行模型权重预处理和模型推理的功能。
  • 重构 PD 分离。
  • 重构流水线并行(PP)。
  • 重构 Qwen3-Next 及同类模型中的权重合并优化。
  • 重构镜像构建脚本。

  • Tool calling support.
  • New models: GLM-4.6V, GLM-4.7-Flash, GLM-5, Qwen3-Coder-Next, DeepSeek-V3.2 (non-Exp).
  • Support for using FP8 KV cache in DeepSeek-V3.2 and similar models.
  • Multiple improvements on OpenAI and Anthropic API compatibility.
  • Improved auto-tuning of Triton MoE kernels.
  • New warning about silent performance degradation in the device memory allocator under high memory pressure.
  • New debugging option for recording request traces.
  • New optimized kernel for MLA on Ascend platforms.
  • Compatibility update on native layout kernels on MetaX platforms.
  • Multiple fixes on MTP.
  • Multiple fixes on KV cache management.
  • Fixes for DeepSeek-V3.2, Qwen3-VL, and other models.
  • Fix for interrupts (Ctrl+C) sometimes failing to stop Chitu (#127).
  • Fix for the feature that separates model weight preprocessing from model inference on slow-I/O platforms.
  • Refactoring of PD-disaggregation.
  • Refactoring of Pipeline Parallelism (PP).
  • Refactoring of the weight merging optimization for Qwen3-Next and similar models.
  • Refactoring of image-building scripts.

Official Docker images / 官方 docker 镜像:

  • 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.2
  • 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.2
  • 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.2
  • 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.2
  • 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.2

v0.5.1

06 Feb 06:21

  • 初步支持摩尔线程 GPU。
  • MTP(multi-token prediction)支持。
  • 优化开启 PP 或 DP 时的任务预处理性能。
  • 针对 TP、EP 并行以及 MoE 类模型(特别是 Qwen3-Next 和 DeepSeek-V3.2 模型)进行算子融合和优化。
  • 支持 Qwen3-VL 系列模型。
  • 支持 FP8 KV cache 量化。
  • 降低 MoE 模型在 prefill 时的峰值显存占用。
  • 降低模型加载时的主存占用。
  • 改进多机任务启动脚本。
  • 性能测试脚本支持更多场景。
  • 改进服务启动时的 auto tuning。
  • 初步兼容 Anthropic API。
  • 修复关于任务调度、KV cache 管理、性能监控等的多处问题。
  • 对 MoE、元信息通信、异构 attention 的 KV cache 管理、算子性能测试等进行重构。

  • Initial support for MooreThreads GPUs.
  • MTP (multi-token prediction) support.
  • Optimized task preprocessing performance when PP or DP is enabled.
  • Operator fusion and optimization targeting TP and EP parallelism and MoE models (especially Qwen3-Next and DeepSeek-V3.2).
  • Support for the Qwen3-VL model series.
  • Support for FP8 KV cache quantization.
  • Reduced peak device memory usage when prefilling MoE models.
  • Reduced main memory usage when loading models.
  • Improved multi-node launching scripts.
  • Benchmarking scripts now cover more scenarios.
  • Improved auto-tuning at service startup.
  • Initial support for Anthropic API.
  • Multiple bug fixes for task scheduling, KV cache management, performance monitoring, etc.
  • Refactoring of MoE, metadata communication, KV cache management for heterogeneous attention, operator benchmarking, etc.

Official Docker images / 官方 docker 镜像:

  • 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.1
  • 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.1
  • 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.1
  • 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.1
  • 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.1

v0.5.0

12 Dec 11:58

针对集群部署性能的多项改进:

  • 更好的 DP+TP+EP 混合并行支持。
  • MoE 负载均衡策略。
  • 针对预处理和后处理的性能优化。
  • 多处问题修复。

Multiple improvements to cluster deployment performance:

  • Better support for hybrid DP+TP+EP parallelism.
  • Load balancing strategy for MoE.
  • Performance optimizations for pre-processing and post-processing.
  • Multiple bug fixes.

Official Docker images / 官方 docker 镜像:

  • NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.5.0
  • Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.0
  • Ascend A2: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.5.0
  • Ascend A3: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend-a3:v0.5.0

v0.4.3

18 Sep 17:56

Fixed some performance issues.

修复了一些性能问题。


Official Docker images / 官方 docker 镜像:

NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.3
Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.3
Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.3

v0.4.2

28 Aug 16:39

  • Added support for some new models.

  • Performance Optimization

    • Support Chunked Prefill
    • Support using DeepEP to optimize EP communication
      • requires extra installation of nvshmem (see installation guide)
      • CUDA Graph can be enabled when using DeepEP
  • Fixed some bugs


  • 新增模型支持

  • 性能优化

    • 支持 Chunked Prefill
    • 支持利用 DeepEP 优化 EP 通信
      • 需要额外安装 nvshmem(参考官方安装说明)
      • 利用 DeepEP 时可开启 CUDA graph
  • 修复若干缺陷


Official Docker images / 官方 docker 镜像:

  • NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.2
  • Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.2
  • Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.2

v0.4.1

14 Aug 12:47

  • Added support for Expert Parallelism (EP). Enable it by setting infer.ep_size (which currently must equal infer.tp_size, meaning the attention part is parallelized with TP at the same degree of parallelism).
  • Added support for PD-disaggregated inference (requires additional dependencies; for now, please build it manually based on the Dockerfile, following the mooncake configuration guideline).
  • Added support for hardware FP4 computation on NVIDIA Blackwell GPUs (requires additional dependencies; available when building from blackwell.Dockerfile).
  • Added support for some new models. See chitu/docs/en/SUPPORTED_MODELS.md at public-main · thu-pacman/chitu for details.
  • Fixed multiple bugs.
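
The v0.4.1 constraint that infer.ep_size must equal infer.tp_size can be checked before launching. The sketch below is an assumption-laden illustration: the helper `check_parallel_sizes` is hypothetical, the settings are modeled as a plain dict keyed by option name, and a missing infer.tp_size is treated as 1 purely for this example.

```python
def check_parallel_sizes(config: dict) -> None:
    """Raise ValueError if infer.ep_size is set but differs from infer.tp_size."""
    if "infer.ep_size" not in config:
        return  # EP not requested; nothing to check
    ep_size = int(config["infer.ep_size"])
    # Assumption for this sketch only: an unset tp_size counts as 1.
    tp_size = int(config.get("infer.tp_size", 1))
    if ep_size != tp_size:
        raise ValueError(
            f"infer.ep_size ({ep_size}) must equal infer.tp_size ({tp_size}) "
            "according to the v0.4.1 notes"
        )

check_parallel_sizes({"infer.tp_size": 8, "infer.ep_size": 8})  # passes silently
```

Later releases mention hybrid DP+TP+EP parallelism, so this restriction should be read as specific to v0.4.1.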

  • 支持专家并行(EP),设置 infer.ep_size 使用(目前需要与 infer.tp_size 相等,表示 attention 部分以相同的并行度进行 TP 并行)。
  • 支持 PD 分离(需要额外依赖,当前请基于赤兔基础镜像,参考 mooncake 配置指南手动构建)。
  • 支持在 NVIDIA Blackwell GPU 上进行硬件 fp4 计算(需要额外依赖,建议通过 blackwell.Dockerfile 构建镜像)。
  • 新增部分模型支持,详见 chitu/docs/zh/SUPPORTED_MODELS.md at public-main · thu-pacman/chitu
  • 修复若干缺陷。

Official Docker images / 官方 docker 镜像:

  • NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.1
  • Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.1
  • Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.1

v0.4.0

01 Aug 03:38

v0.4.0 marks a significant improvement over v0.3.x in performance and availability. We recommend that all medium-sized deployments (about 1-4 servers) upgrade to this version.

Highlighted changes:

  • Optimizations for platforms including NVIDIA GPUs, Ascend NPUs, MetaX GPUs, and Hygon DCUs.
  • Optimizations for models including DeepSeek-R1, Qwen3-32B, Kimi K2, GLM-4.5.

Official Docker images:

  • NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.0
  • Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.0
  • Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.0