Add compile_codec option for 3-4x faster batch decoding #191

Open

Finrandojin wants to merge 1 commit into QwenLM:main from Finrandojin:feature/compile-codec

Conversation

@Finrandojin

Summary

  • Add compile_codec parameter to Qwen3TTSModel.from_pretrained() that applies torch.compile(mode="max-autotune", dynamic=True) to the speech tokenizer codec decoder
  • Add _compile_codec() instance method for post-construction compilation
  • Document the optimization in a new "Performance Tips" README section

Motivation

The codec decoder contains 136+ attention modules whose Python dispatch overhead dominates waveform decoding time. Profiling shows:

Component                          CPU time   CUDA time
Codec decoder (single generation)  9.8 s      3.5 s

The codec decoder accounts for ~85% of total batch generation time.
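Such a profile can be reproduced with torch.profiler; below is a minimal sketch (the generate_custom_voice call is illustrative, not the exact benchmark harness):

from torch.profiler import profile, ProfilerActivity

# Hypothetical profiling run; the generation call is illustrative.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate_custom_voice("profiling sentence")

# Sorting by CUDA time surfaces the codec decoder's attention modules.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))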

Applying torch.compile eliminates this overhead:

Metric               Before          After
Single generation    ~14 s           ~9 s
Batch=12 throughput  1.3x real-time  4.3x real-time

Tested on an AMD RX 7900 XTX (ROCm 6.3). The optimization is hardware-agnostic: torch.compile with dynamic=True works on NVIDIA, AMD, and CPU backends.

API

# Opt-in via from_pretrained (default: no compilation, fully backward-compatible)
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    compile_codec=True,  # or "max-autotune", "reduce-overhead", "default"
)

# Or compile after construction
model._compile_codec()
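For context, the hook amounts to wrapping the codec decoder module in torch.compile. A minimal sketch, assuming the decoder lives at an attribute path like speech_tokenizer.decoder (that path is an assumption, not the PR's actual code):

import torch

class Qwen3TTSModel:
    ...

    def _compile_codec(self, mode: str = "max-autotune") -> None:
        # Swap the eager codec decoder for a compiled wrapper. dynamic=True
        # avoids recompiling when token sequence lengths vary across calls.
        # `speech_tokenizer.decoder` is an assumed attribute path.
        self.speech_tokenizer.decoder = torch.compile(
            self.speech_tokenizer.decoder,
            mode=mode,
            dynamic=True,
        )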

Trade-offs

  • First call incurs ~30-60s warmup (one-time compilation cost; see the warmup sketch after this list)
  • Recommended for batch/repeated generation (audiobooks, datasets), not single one-off calls
  • No behavior change when compile_codec=False (default)
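To keep the one-time compilation cost out of latency-sensitive paths, the warmup can be paid up front before serving real requests (the generate_custom_voice arguments here are illustrative):

# One-time warmup: the first generation after compilation triggers
# torch.compile's autotuning (~30-60s); later calls run at compiled speed.
model._compile_codec()
_ = model.generate_custom_voice("warmup sentence")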

Test plan

  • Verify compile_codec=False produces identical results to current behavior (see the equivalence sketch after this list)
  • Verify compile_codec=True compiles codec and generation works correctly
  • Verify compile_codec="max-autotune" string mode works
  • End-to-end batch generation benchmarks showing 3-4x improvement
  • Tested with CustomVoice model (generate_custom_voice)
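A hedged sketch of the equivalence check from the first item above. Note that torch.compile does not guarantee bitwise-identical output, so a tolerance comparison is the realistic test (the generation call and return type are assumptions):

import torch

# Compiled kernels may reorder floating-point ops, so compare within a
# tolerance instead of expecting bitwise-identical waveforms.
wav_eager = model_eager.generate_custom_voice("test sentence")
wav_compiled = model_compiled.generate_custom_voice("test sentence")
assert torch.allclose(
    torch.as_tensor(wav_eager), torch.as_tensor(wav_compiled), atol=1e-2
)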

Commit message

The speech tokenizer codec decoder contains 100+ attention modules whose
Python dispatch overhead dominates decoding time (~47% single, ~85% batch).
Applying torch.compile with mode="max-autotune" and dynamic=True fuses
these into optimized kernels, improving batch throughput by 3-4x.

This adds:
- compile_codec parameter to from_pretrained() (False/True/mode string)
- _compile_codec() method for post-construction compilation
- Performance Tips section in README with usage examples and benchmarks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Iamgoofball

Iamgoofball commented Mar 8, 2026

Set this up locally on top of that Faster Qwen3-TTS fork that's using CUDAGraphs and I'm getting massive speedups with no change in generation quality. PR works 👍


@risan-raja left a comment


Looks like it works
