Releases: thu-pacman/chitu

v0.3.4

29 May 16:29

What's new:

  • Fused MoE kernel for Qwen3 MoE models.
  • Optimized metadata communication in data parallelism (DP) and pipeline parallelism (PP).
  • Users can now explicitly configure the PP micro-batch size (see the sketch below).
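
A minimal launch sketch, assuming Chitu's usual Hydra-style command-line overrides; the entry point, model, and parallel sizes are illustrative, and infer.pp_micro_batch_size is a hypothetical name for the new option, not a confirmed key:

    # infer.pp_micro_batch_size is a hypothetical key name; check the docs for the real one.
    torchrun --nproc_per_node 8 -m chitu \
        models=<model_name> \
        infer.pp_size=2 infer.tp_size=4 \
        infer.pp_micro_batch_size=4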

v0.3.3

22 May 11:28

Initial support for Ascend NPU.

v0.3.2

15 May 09:45

What's new:

[NEW] Qwen3 Models:
The following Qwen3 models are now supported:

  • Qwen3-0.6B
  • Qwen3-1.7B
  • Qwen3-4B
  • Qwen3-8B
  • Qwen3-14B
  • Qwen3-30B-A3B
  • Qwen3-32B
  • Qwen3-235B-A22B

Usage: pass the models=Qwen3-<desired_model_size> argument when starting Chitu. For example, to run Qwen3-32B, launch with models=Qwen3-32B, as shown in the sketch below.
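
A minimal launch sketch, assuming the same Hydra-style overrides used elsewhere in these notes; the torchrun entry point, checkpoint path, and parallelism settings are illustrative placeholders, while models=Qwen3-32B is the documented argument:

    # Entry point, path, and parallel sizes are illustrative; models=Qwen3-32B is documented above.
    torchrun --nproc_per_node 8 -m chitu \
        models=Qwen3-32B \
        models.ckpt_dir=/path/to/Qwen3-32B \
        infer.tp_size=8 infer.pp_size=1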

v0.3.1

30 Apr 07:21

Better support for MetaX (沐曦) GPUs:

  • Support for both Llama-like models and DeepSeek models, tested with DeepSeek-R1-Distill-Llama-70B and DeepSeek-R1-671B in bf16, fp16, and soft fp8 precision.
  • New infer.op_impl=muxi_custom_kernel mode optimized for small batches (see the sketch below).
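
A minimal sketch of enabling the new mode; infer.op_impl=muxi_custom_kernel comes from this release, while the entry point and remaining arguments are illustrative placeholders:

    # Only infer.op_impl=muxi_custom_kernel is from this release; the rest is illustrative.
    torchrun --nproc_per_node 8 -m chitu \
        models=DeepSeek-R1-Distill-Llama-70B \
        infer.op_impl=muxi_custom_kernel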

v0.3.0

29 Apr 09:59

Added online conversion from FP4 to FP8 and BF16, enabling the FP4-quantized version of DeepSeek-R1 671B to run on non-Blackwell GPUs.

v0.2.3

24 Apr 11:04

Multiple bugs fixed.

v0.2.2

22 Apr 13:48

Performance improvements for hybrid CPU+GPU inference.

v0.2.1

20 Apr 08:30

What's new:

  • [HIGHLIGHT] Hybrid CPU+GPU inference (compatible with multi-GPU and multi-request serving).
  • Support for new models (see below for the full list).
  • Multiple optimizations to operator kernels.

Officially supported models:

v0.2.0

18 Apr 16:57

(This release has been yanked)

v0.1.2

01 Apr 14:01

HOT FIX: Fixed a major performance regression when CUDA graph is enabled (via infer.use_cuda_graph=True); see the sketch below.
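
For reference, a minimal sketch of enabling CUDA graph with the same override style; only infer.use_cuda_graph=True is confirmed by this note, and the entry point and model are illustrative placeholders:

    # infer.use_cuda_graph=True is the documented option; everything else is illustrative.
    torchrun --nproc_per_node 1 -m chitu \
        models=<model_name> \
        infer.use_cuda_graph=True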