Releases: thu-pacman/chitu
v0.5.5
- 支持 LLaDA2.1-mini、LLaDA2.1-flash 模型,是赤兔首次支持扩散语言模型。
- 支持 Kimi-K2.6。
- 向后兼容 Qwen3-Coder-Next-FP8、GLM4.7-FP8。
- 进一步优化 DeepSeek-V3.2 及类似模型中的稀疏 attention。
- 优化 Qwen3-Next 和 Qwen3.5 模型开启 MTP 时的性能。
- 面向前缀缓存优化 DP 路由策略。
- 优化模型加载速度。
- 工具调用兼容 PD 分离。
- 支持单独控制各模块的日志级别(文档)。
- 将 mooncake 作为 pip 安装时的可选依赖,不再需要单独安装。
- 更新 flashinfer 可选依赖。
- 修复在海光平台上的一些兼容问题。
- 修复 infer.prefix_chunk_size 较大时的溢出问题。
- 修复多 stream 导致的显存不能及时释放的问题。
- 修复只有 PCIe 互联的环境上的 allreduce 性能。
- 删除了 infer.use_cuda_graph=auto 时在昇腾平台上默认关闭 graph 的一项过时判断。
- 重构算子分发逻辑。
- 重构异步调度。
- 改进代码仓库中的若干测试。
- Added support for LLaDA2.1-mini and LLaDA2.1-flash, the first diffusion language models supported by Chitu.
- Added support for Kimi-K2.6.
- Added backward compatibility for Qwen3-Coder-Next-FP8 and GLM4.7-FP8.
- Further optimized sparse attention in DeepSeek-V3.2 and similar models.
- Optimized performance of Qwen3-Next and Qwen3.5 models when MTP is enabled.
- Optimized DP routing strategy with respect to prefix caching.
- Optimized model loading speed.
- Made tool calling compatible with PD-disaggregation.
- Supported logging level control on specific modules (doc).
- Added mooncake as an optional dependency during pip install. It no longer needs to be installed separately.
- Upgraded flashinfer optional dependency.
- Fixed some compatibility issues on the Hygon platform.
- Fixed overflow issues when infer.prefix_chunk_size is large.
- Fixed late memory freeing caused by multi-streaming.
- Fixed allreduce performance on platforms where PCIe is the only interconnect.
- Removed an outdated check that turned off the graph by default on Ascend platforms when infer.use_cuda_graph=auto.
- Refactored operator dispatching.
- Refactored asynchronous scheduling.
- Improved multiple tests in the repository.
Official Docker images / 官方 docker 镜像:
- 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.5
- 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.5
- 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.5
- 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.5
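
The image references above follow a regular registry/name:tag pattern, so they can be assembled programmatically. The snippet below is a minimal sketch, not part of Chitu: the helper names `image_ref` and `pull_command` and the dictionary keys are hypothetical, while the registry path and image names are copied from the v0.5.5 list above.

```python
# Registry path and image names are copied from the v0.5.5 list above.
REGISTRY = "qingcheng-ai-cn-beijing.cr.volces.com/public"

# Hypothetical short keys mapped to the published image names.
IMAGES = {
    "nvidia_arch_80_89": "chitu-nvidia_arch_80_89",
    "nvidia_arch_90": "chitu-nvidia_arch_90",
    "muxi": "chitu-muxi",
    "ascend_a2": "chitu-ascend_a2",
}


def image_ref(platform: str, version: str = "v0.5.5") -> str:
    """Return the full image reference for a platform key and release tag."""
    if platform not in IMAGES:
        raise ValueError(f"unknown platform {platform!r}; expected one of {sorted(IMAGES)}")
    return f"{REGISTRY}/{IMAGES[platform]}:{version}"


def pull_command(platform: str, version: str = "v0.5.5") -> list[str]:
    """Return the argument list for a `docker pull` of the chosen image."""
    return ["docker", "pull", image_ref(platform, version)]


if __name__ == "__main__":
    print(" ".join(pull_command("nvidia_arch_90")))
```

Image names differ between releases (for example, v0.5.0 and earlier use names such as chitu-nvidia), so the mapping should be checked against the list for the release being pulled.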
v0.5.4
- 新模型:GLM-5.1 系列和 Kimi-K2.5。
- 兼容 OpenAI /response API(文档)。
- 支持在 Qwen3-Next 和 Qwen3.5 系列模型中使用 MTP。
- 支持在 DP 并行时例外地 TP 并行 embed_token 和 lm_head 层(设置 infer.embed_tokens_lm_head_tp_size 使用)。
- 新增优雅关闭服务的功能。
- 新增内置的 profiling 功能。
- 优化 DeepSeek-V3.2 及类似模型中的稀疏 attention。
- 弃用 infer.max_reqs 选项,改为意义更明确的 infer.max_batch_size 和 infer.max_concurrent_requests 选项。
- 增加 infer.mla_absorb=auto 选项,并将 auto 作为默认值。
- 重构有状态的采样器。
- 重构 KV cache,使其更能适配不同种类的模型。
- 重构代码仓库中包含的若干测试。
- 修复前缀缓存中的若干问题。
- 修复 MoE 算子分发以及 DeepGEMM 集成的若干问题。
- 修复 log 不能被完整存储到文件的问题。
- 修复带有稠密层的 MoE 模型无法使用 DP+EP+PP 并行的问题。
- 修复依据 infer.memory_utilization 选项自动分配 KV cache 时的计算错误。
- 修复监控数据中显存占用量的显示错误。
- 进一步修复 tokenizer 解码特殊字符时的问题。
- New models: GLM-5.1 series and Kimi-K2.5.
- Compatibility with the OpenAI /response API (doc).
- MTP support in Qwen3-Next and Qwen3.5 series models.
- Support for exceptionally using TP for the embed_token and lm_head layers when using DP for other layers (set infer.embed_tokens_lm_head_tp_size to use).
- New feature to gracefully shut down the service.
- New built-in profiling support.
- Optimize sparse attention in DeepSeek-V3.2 and similar models.
- Deprecated the infer.max_reqs argument, replacing it with the more explicit infer.max_batch_size and infer.max_concurrent_requests arguments.
- New infer.mla_absorb=auto argument, where auto is the new default value.
- Refactoring of stateful samplers.
- Refactoring of the KV cache to better support different types of models.
- Refactoring of some tests in the repo.
- Fixes on prefix caching.
- Fixes on MoE operator dispatching and DeepGEMM integration.
- Fix for logs not being fully saved to files.
- Fix for MoE models with dense layers being unable to use DP+EP+PP parallelism.
- Fix for the calculation when automatically allocating KV cache based on infer.memory_utilization.
- Fix for the device memory usage shown in monitoring statistics.
- Further fix for tokenizer decoding of special characters.
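
Because infer.max_reqs is deprecated in this release, existing launch scripts may still pass it. The snippet below is a minimal sketch for spotting such leftovers, assuming options are given as key=value override strings (the form used in these notes, e.g. infer.mla_absorb=auto). The function name `find_deprecated` is hypothetical and not part of Chitu, and the sketch deliberately does not guess how an old value should map onto the two new options.

```python
# Deprecated option -> replacement options, as stated in the v0.5.4 notes.
DEPRECATED = {
    "infer.max_reqs": ("infer.max_batch_size", "infer.max_concurrent_requests"),
}


def find_deprecated(overrides: list[str]) -> list[tuple[str, tuple[str, ...]]]:
    """Return (deprecated_key, replacement_keys) for each deprecated override found.

    Assumption: each override is a "key=value" string; the value is ignored.
    """
    hits = []
    for item in overrides:
        key, _, _value = item.partition("=")
        key = key.strip()
        if key in DEPRECATED:
            hits.append((key, DEPRECATED[key]))
    return hits


if __name__ == "__main__":
    args = ["infer.max_reqs=8", "infer.mla_absorb=auto"]
    for old, new in find_deprecated(args):
        print(f"{old} is deprecated; consider setting {' and '.join(new)} instead")
```

The sketch only reports matches; choosing values for the new options is left to the operator, since the notes do not specify a conversion rule.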
Official Docker images / 官方 docker 镜像:
- 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.4
- 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.4
- 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.4
- 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.4
- 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.4
v0.5.3
- 支持 Qwen3.5 系列模型。
- 支持前缀缓存(设置 infer.enable_prefix_caching 选项)。
- 监控数据支持 Grafana 可视化。
- /tokenize 接口新增包含 chat template 的变体。
- 改进 Anthropic 接口的兼容性。
- DeepSeek-V3.2 及类似模型中的 indexer 模块支持 soft FP8(设置 infer.raise_lower_bit_float_to 选项)。
- 升级 NVIDIA 镜像的基础软件版本。
- 重构 VL 多模态模型支持。
- 重构对不同 moe 算子实现的分发逻辑。
- 确保 ue8m0 scale 的处理与 DeepSeek 模型的原始定义一致。
- 优化 DeepGEMM 接入过程中的若干开销。
- 修复量化算子除以 0 的问题。
- 修复 EP 通信算子与共享专家算子的重叠优化。
- 修复 moe 临时激活值显存均衡功能错误地将阈值设置得过低的问题。
- 修复个别算子中的 int32 溢出问题。
- 修复 tokenizer 解码特殊字符时的若干问题。
- 修复流水线并行时对共享 embedding - lm_head 层的处理。
- 避免在构建镜像时在工作目录中留下 root 权限的临时文件。
- New models: Qwen3.5 series.
- Prefix caching (set the infer.enable_prefix_caching argument).
- Visualizing monitoring metrics with Grafana.
- New variant of the /tokenize endpoint that includes chat templates.
- Improved compatibility with the Anthropic API.
- Soft FP8 support for the indexer module in DeepSeek-V3.2 and similar models (set the infer.raise_lower_bit_float_to argument).
- Upgraded base dependencies in the NVIDIA image build.
- Refactoring of VL multimodal model support.
- Refactoring of dispatching to different implementations of MoE operators.
- Treating ue8m0 scales consistently with the original definition in DeepSeek models.
- Reduced overhead in DeepGEMM integration.
- Fix for a division-by-zero issue in quantization operators.
- Fix for overlapping EP communication with shared expert computation.
- Fix for a too-low threshold in MoE temporary activation memory balancing.
- Fix for an int32 overflow issue in some operators.
- Fixes for tokenizer decoding of special characters.
- Fix for handling of the shared embedding - lm_head layer under Pipeline Parallelism.
- No longer leaving root-owned temporary files in the working directory when building images.
Official Docker images / 官方 docker 镜像:
- 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.3
- 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.3
- 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.3
- 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.3
- 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.3
v0.5.2
- 工具调用支持。
- 新模型:GLM-4.6V、GLM-4.7-Flash、GLM-5、Qwen3-Coder-Next、DeepSeek-V3.2(非 Exp)。
- 支持在 DeepSeek-V3.2 及同类模型使用 FP8 KV cache。
- 针对 OpenAI 和 Anthropic 接口兼容性的多项改进。
- 改进 Triton MoE kernel 的 auto-tuning。
- 针对显存分配器在显存压力较大时隐式性能下降的问题增加警告。
- 增加记录请求 trace 的调试选项。
- 增加昇腾 MLA 优化算子。
- 沐曦 native layout kernel 兼容性更新。
- 针对 MTP 多项修复。
- 针对 KV cache 管理的多项修复。
- 针对 DeepSeek-V3.2、Qwen3-VL 等模型的修复。
- 修复 Interrupt(Ctrl+C)有时无法停止赤兔的问题(#127)。
- 修复针对低速 I/O 场景的分别进行模型权重预处理和模型推理的功能。
- 重构 PD 分离。
- 重构流水线并行(PP)。
- 重构 Qwen3-Next 及同类模型中的权重合并优化。
- 重构镜像构建脚本。
- Tool calling support.
- New models: GLM-4.6V, GLM-4.7-Flash, GLM-5, Qwen3-Coder-Next, DeepSeek-V3.2 (w/o -Exp).
- Support using FP8 KV cache in DeepSeek-V3.2 and similar models.
- Multiple improvements on OpenAI and Anthropic API compatibility.
- Improvement on auto-tuning on Triton MoE kernels.
- New warnings about silent performance degradation of the device memory allocator under high memory pressure.
- New debugging option for saving request trace.
- New optimized kernel for MLA on Ascend platforms.
- Compatibility update on native layout kernels on MetaX platforms.
- Multiple fixes on MTP.
- Multiple fixes on KV cache management.
- Fixes for models including DeepSeek-V3.2 and Qwen3-VL.
- Fix for interruption (Ctrl+C) sometimes failing to stop Chitu (#127).
- Fix for the feature that separates model weight preprocessing from model inference on slow I/O platforms.
- Refactoring of PD-disaggregation.
- Refactoring of Pipeline Parallelism (PP).
- Refactoring of weight merging for Qwen3-Next and similar models.
- Refactoring of image-building scripts.
Official Docker images / 官方 docker 镜像:
- 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.2
- 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.2
- 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.2
- 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.2
- 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.2
v0.5.1
- 初步支持摩尔线程 GPU。
- MTP(multi-token prediction)支持。
- 优化开启 PP 或 DP 时的任务预处理性能。
- 针对 TP、EP 并行以及 MoE 类模型(特别是 Qwen3-Next 和 DeepSeek-V3.2 模型)进行算子融合和优化。
- 支持 Qwen3-VL 系列模型。
- 支持 FP8 KV cache 量化。
- 降低 MoE 模型在 prefill 时的峰值显存占用。
- 降低模型加载时的主存占用。
- 改进多机任务启动脚本。
- 性能测试脚本支持更多场景。
- 改进服务启动时的 auto tuning。
- 初步兼容 Anthropic API。
- 修复关于任务调度、KV cache 管理、性能监控等的多处问题。
- 对 MoE、元信息通信、异构 attention 的 KV cache 管理、算子性能测试等进行重构。
- Initial support for MooreThreads GPUs.
- MTP (multi-token prediction) support.
- Optimize task preprocessing when PP or DP is enabled.
- Operator fusion and optimization targeting TP and EP parallelism and MoE models (especially Qwen3-Next and DeepSeek-V3.2).
- Support Qwen3-VL model series.
- Support FP8 KV cache quantization.
- Reduce peak device memory usage when prefilling MoE models.
- Reduce main memory usage when loading models.
- Improve multi-node launching scripts.
- Cover more cases in benchmarking scripts.
- Improve auto-tuning when starting the service.
- Initial support for Anthropic API.
- Multiple bug fixes for task scheduling, KV cache management, performance monitoring, etc.
- Multiple refactorings for MoE, metadata communication, KV cache management for heterogeneous attention, operator benchmarking, etc.
Official Docker images / 官方 docker 镜像:
- 英伟达 / NVIDIA (arch 8.0, 8.9): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_80_89:v0.5.1
- 英伟达 / NVIDIA (arch 9.0): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia_arch_90:v0.5.1
- 沐曦 / MetaX: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.1
- 昇腾 / Ascend (A2): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a2:v0.5.1
- 昇腾 / Ascend (A3): qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend_a3:v0.5.1
v0.5.0
针对集群部署性能的多项改进:
- 更好的 DP+TP+EP 混合并行支持。
- MoE 负载均衡策略。
- 针对预处理和后处理的性能优化。
- 多处问题修复。
Multiple improvements on cluster deployments:
- Better support for hybrid DP+TP+EP parallelism.
- Load balancing strategy for MoE.
- Optimizations on pre-processing and post-processing.
- Multiple bug fixes.
Official Docker images / 官方 docker 镜像:
- NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.5.0
- Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.5.0
- Ascend A2: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.5.0
- Ascend A3: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend-a3:v0.5.0
v0.4.3
Fixed some performance issues.
修复了一些性能问题。
Official Docker images / 官方 docker 镜像:
- NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.3
- Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.3
- Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.3
v0.4.2
- Added support for some new models.
  - Seed-OSS-36B-Instruct
  - DeepSeek-V3.1
  - See SUPPORTED_MODELS for details
- Performance optimization
  - Support Chunked Prefill
  - Support using DeepEP to optimize EP communication
    - Requires extra installation of nvshmem (see installation guide)
    - CUDA Graph can be enabled when using DeepEP
- Fixed some bugs
- 新增模型支持
  - Seed-OSS-36B-Instruct
  - DeepSeek-V3.1
  - 详见 SUPPORTED_MODELS
- 性能优化
  - 支持 Chunked Prefill
  - 支持利用 DeepEP 优化 EP 通信
    - 需要额外安装 nvshmem(参考官方安装说明)
    - 利用 DeepEP 时可开启 CUDA graph
- 修复若干缺陷
Official Docker images / 官方 docker 镜像:
- NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.2
- Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.2
- Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.2
v0.4.1
- Supported Expert Parallelism (EP). Enable it by setting infer.ep_size (which currently should be equal to infer.tp_size, parallelizing the attention part with TP in the same degree of parallelism).
- Supported PD-disaggregated inference (requiring additional dependencies; currently, please build it manually based on the Dockerfile, following the mooncake configuration guideline).
- Supported hardware fp4 computation on NVIDIA Blackwell GPUs (requiring additional dependencies, available when building from blackwell.Dockerfile).
- Added support for some new models. See docs/en/SUPPORTED_MODELS.md (public-main branch of thu-pacman/chitu) for details.
- Fixed multiple bugs.
- 支持专家并行(EP),设置 infer.ep_size 使用(目前需要与 infer.tp_size 相等,表示 attention 部分以相同的并行度进行 TP 并行)。
- 支持 PD 分离(需要额外依赖,当前请基于赤兔基础镜像,参考 mooncake 配置指南手动构建)。
- 支持在 NVIDIA Blackwell GPU 上进行硬件 fp4 计算(需要额外依赖,建议通过 blackwell.Dockerfile 构建镜像)。
- 新增部分模型支持,详见 docs/zh/SUPPORTED_MODELS.md(thu-pacman/chitu 的 public-main 分支)。
- 修复若干缺陷。
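
The v0.4.1 notes state that infer.ep_size should currently equal infer.tp_size. The snippet below is a minimal sketch of a pre-launch check for that constraint, assuming the configuration is available as a nested dict with an "infer" section. The function name `check_ep_tp` is hypothetical and not part of Chitu, and treating a missing ep_size or a value of 1 as "EP disabled" is an assumption made for illustration.

```python
def check_ep_tp(config: dict) -> None:
    """Raise ValueError if EP is requested with a degree different from TP.

    Assumptions (not from the release notes): the config is a nested dict,
    and ep_size missing or equal to 1 means EP is not in use.
    """
    infer = config.get("infer", {})
    ep_size = infer.get("ep_size")
    tp_size = infer.get("tp_size", 1)
    if ep_size is None or ep_size == 1:
        return  # EP not requested; nothing to check
    if ep_size != tp_size:
        raise ValueError(
            f"infer.ep_size ({ep_size}) must equal infer.tp_size ({tp_size}) in v0.4.1"
        )


if __name__ == "__main__":
    check_ep_tp({"infer": {"ep_size": 8, "tp_size": 8}})  # passes silently
    try:
        check_ep_tp({"infer": {"ep_size": 8, "tp_size": 4}})
    except ValueError as err:
        print(err)
```

Later releases mention improved hybrid DP+TP+EP support, so the constraint may not hold beyond v0.4.1 and the check should be treated as version-specific.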
Official Docker images / 官方 docker 镜像:
- NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.1
- Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.1
- Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.1
v0.4.0
v0.4.0 marks a significant improvement over v0.3.x in performance and availability. We recommend that all medium-sized deployments (about 1-4 servers) upgrade to this version.
Highlighted changes:
- Optimizations for platforms including NVIDIA GPUs, Ascend NPUs, MetaX GPUs, and Hygon DCUs.
- Optimizations for models including DeepSeek-R1, Qwen3-32B, Kimi K2, GLM-4.5.
Official Docker images:
- NVIDIA: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-nvidia:v0.4.0
- Muxi: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-muxi:v0.4.0
- Ascend: qingcheng-ai-cn-beijing.cr.volces.com/public/chitu-ascend:v0.4.0