Add compile_codec option for 3-4x faster batch decoding #191

Open

Finrandojin wants to merge 1 commit into QwenLM:main from Finrandojin:feature/compile-codec

Conversation

@Finrandojin

Summary

  • Add compile_codec parameter to Qwen3TTSModel.from_pretrained() that applies torch.compile(mode="max-autotune", dynamic=True) to the speech tokenizer codec decoder
  • Add _compile_codec() instance method for post-construction compilation
  • Document the optimization in a new "Performance Tips" README section

Motivation

The codec decoder contains 136+ attention modules whose Python dispatch overhead dominates waveform decoding time. Profiling shows:

Component                          CPU time   CUDA time
Codec decoder (single generation)  9.8 s      3.5 s

The codec decoder accounts for ~85% of total batch generation time.
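Such a profile can be reproduced with torch.profiler; below is a minimal sketch (the generate_custom_voice call is illustrative, not the exact benchmark harness):

from torch.profiler import profile, ProfilerActivity

# Hypothetical profiling run; the generation call is illustrative.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate_custom_voice("profiling sentence")

# Sorting by CUDA time surfaces the codec decoder's attention modules.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))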

Applying torch.compile eliminates this overhead:

Metric               Before          After
Single generation    ~14 s           ~9 s
Batch=12 throughput  1.3x real-time  4.3x real-time

Tested on an AMD RX 7900 XTX (ROCm 6.3). The optimization is hardware-agnostic: torch.compile with dynamic=True works on NVIDIA, AMD, and CPU backends.

API

# Opt-in via from_pretrained (default: no compilation, fully backward-compatible)
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    compile_codec=True,  # or "max-autotune", "reduce-overhead", "default"
)

# Or compile after construction
model._compile_codec()
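For context, the hook amounts to wrapping the codec decoder module in torch.compile. A minimal sketch, assuming the decoder lives at an attribute path like speech_tokenizer.decoder (that path is an assumption, not the PR's actual code):

import torch

class Qwen3TTSModel:
    ...

    def _compile_codec(self, mode: str = "max-autotune") -> None:
        # Swap the eager codec decoder for a compiled wrapper. dynamic=True
        # avoids recompiling when token sequence lengths vary across calls.
        # `speech_tokenizer.decoder` is an assumed attribute path.
        self.speech_tokenizer.decoder = torch.compile(
            self.speech_tokenizer.decoder,
            mode=mode,
            dynamic=True,
        )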

Trade-offs

  • First call incurs ~30-60s warmup (one-time compilation cost; see the warmup sketch after this list)
  • Recommended for batch/repeated generation (audiobooks, datasets), not single one-off calls
  • No behavior change when compile_codec=False (default)
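To keep the one-time compilation cost out of latency-sensitive paths, the warmup can be paid up front before serving real requests (the generate_custom_voice arguments here are illustrative):

# One-time warmup: the first generation after compilation triggers
# torch.compile's autotuning (~30-60s); later calls run at compiled speed.
model._compile_codec()
_ = model.generate_custom_voice("warmup sentence")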

Test plan

  • Verify compile_codec=False produces identical results to current behavior (see the equivalence sketch after this list)
  • Verify compile_codec=True compiles codec and generation works correctly
  • Verify compile_codec="max-autotune" string mode works
  • End-to-end batch generation benchmarks showing 3-4x improvement
  • Tested with CustomVoice model (generate_custom_voice)
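A hedged sketch of the equivalence check from the first item above. Note that torch.compile does not guarantee bitwise-identical output, so a tolerance comparison is the realistic test (the generation call and return type are assumptions):

import torch

# Compiled kernels may reorder floating-point ops, so compare within a
# tolerance instead of expecting bitwise-identical waveforms.
wav_eager = model_eager.generate_custom_voice("test sentence")
wav_compiled = model_compiled.generate_custom_voice("test sentence")
assert torch.allclose(
    torch.as_tensor(wav_eager), torch.as_tensor(wav_compiled), atol=1e-2
)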

Commit message

The speech tokenizer codec decoder contains 100+ attention modules whose
Python dispatch overhead dominates decoding time (~47% single, ~85% batch).
Applying torch.compile with mode="max-autotune" and dynamic=True fuses
these into optimized kernels, improving batch throughput by 3-4x.

This adds:
- compile_codec parameter to from_pretrained() (False/True/mode string)
- _compile_codec() method for post-construction compilation
- Performance Tips section in README with usage examples and benchmarks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Iamgoofball

Iamgoofball commented Mar 8, 2026

Set this up locally on top of that Faster Qwen3-TTS fork that's using CUDAGraphs and I'm getting massive speedups with no change in generation quality. PR works 👍


@risan-raja left a comment


Looks like it works
