crane-oai

An OpenAI- and SGLang-compatible inference API server built on the Crane framework, with continuous batching support.

Features

OpenAI-compatible API — Chat Completions, Text Completions, Text-to-Speech, Models, Tokenize/Detokenize
SGLang native API — /generate, /model_info, /server_info and related endpoints
Continuous batching — Dedicated inference thread with prefill-priority scheduling, dynamic KV memory budget, and automatic sequence eviction/recovery
Multi-model support — Auto-detects and loads Hunyuan Dense, Qwen 2.5, Qwen 3, Qwen3-TTS architectures
Qwen3-TTS — Full two-level TTS inference (Talker + Code Predictor) with native Candle speech-tokenizer decoder (ONNX optional fallback); exposes OpenAI-compatible /v1/audio/speech
Streaming — SSE (Server-Sent Events) token streaming
Cross-platform acceleration — CPU / CUDA / Apple Metal, selected automatically

Quick Start

Build

# CPU only
cargo build -p crane-oai --release

# CUDA (NVIDIA GPU)
cargo build -p crane-oai --release --features cuda

Run

# Auto-detect model type and device
crane-oai --model-path /path/to/model

# Specify model type and port
crane-oai --model-path /path/to/Qwen2.5-7B-Instruct \
    --model-type qwen25 \
    --port 8000

# GGUF weights
crane-oai --model-path /path/to/model.gguf \
    --format gguf

# Force CPU
crane-oai --model-path /path/to/model --cpu

Qwen3-TTS Quick Start

Native speech-tokenizer decoding is enabled by default. ONNX export is optional as a compatibility fallback.

Step 1 — Download the model

mkdir -p checkpoints/
# Base model (multi-speaker, 12 Hz)
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --local-dir checkpoints/Qwen3-TTS-12Hz-0.6B-Base

# Custom-voice model (add your own speaker embedding)
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
    --local-dir checkpoints/Qwen3-TTS-12Hz-0.6B-CustomVoice

Step 2 — (Optional) Export the ONNX speech-tokenizer decoder fallback

# Install the Qwen3-TTS Python package first (needed for export only)
pip install -e vendor/Qwen3-TTS

python scripts/export_qwen_tts_tokenizer_onnx.py \
    checkpoints/Qwen3-TTS-12Hz-0.6B-Base/speech_tokenizer \
    checkpoints/Qwen3-TTS-12Hz-0.6B-Base/speech_tokenizer/speech_tokenizer_decoder.onnx

Step 3 — Build and start the server

# CPU (always available)
cargo build -p crane-oai --release

# CUDA
cargo build -p crane-oai --release --features "cuda"

# macOS Metal (auto-enabled by target)
cargo build -p crane-oai --release

# Start the TTS server (auto-detected from config.json)
./target/release/crane-oai \
    --model-path checkpoints/Qwen3-TTS-12Hz-0.6B-Base \
    --model-type qwen3_tts \
    --port 8080

Step 4 — Synthesize speech

# Generate WAV audio from text (predefined speakers; serve the CustomVoice checkpoint for this)
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "今天天气真好，我们去公园吧！",
    "voice": "Serena",
    "language": "chinese"
  }' \
  --output speech.wav

# English
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "Hello, this is a test of the Qwen3-TTS model.",
    "voice": "Ryan",
    "language": "english"
  }' \
  --output speech.wav

# Voice cloning (Base model — reference audio + transcript)
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "そんな何もない今日が 少しだけでもいい日になったと思えたら",
    "language": "japanese",
    "reference_audio": "/path/to/reference.wav",
    "reference_text": "Reference audio transcript goes here"
  }' \
  --output voice_clone.wav

Qwen3-TTS

Crane implements a full Qwen3-TTS inference pipeline in pure Rust + Candle, making it the first non-Python framework to support this model.

Architecture overview

Text input
  │
  ▼
Tokenizer (Qwen chat tokenizer: tokenizer.json OR vocab.json + merges.txt)
  │
  ▼
┌──────────────────────────────────────────────────┐
│  Talker  (28-layer transformer, MRoPE)           │
│  - Embeds text via text_projection MLP           │
│  - Autoregressively generates codec group 0      │
└──────────────────────────────────────────────────┘
  │  hidden state + group-0 token
  ▼
┌──────────────────────────────────────────────────┐
│  Code Predictor  (5-layer transformer)           │
│  - Predicts codec groups 1–15 from hidden state  │
└──────────────────────────────────────────────────┘
  │  [T, 16] codec tokens
  ▼
┌──────────────────────────────────────────────────┐
│  Speech Tokenizer Decoder  (native Candle)       │
│  - 12 Hz, 24 kHz, 16 RVQ quantizers             │
│  - Converts [1, 16, T] tokens → [1, 1, S] audio │
└──────────────────────────────────────────────────┘
  │
  ▼
WAV audio (24 kHz, 16-bit PCM)

Setup

1. (Optional) Export the speech tokenizer ONNX fallback (one-time step):

# Install required Python package (for export only — not needed at runtime)
pip install -e vendor/Qwen3-TTS

# Base model
python scripts/export_qwen_tts_tokenizer_onnx.py \
    checkpoints/Qwen3-TTS-12Hz-0.6B-Base/speech_tokenizer \
    checkpoints/Qwen3-TTS-12Hz-0.6B-Base/speech_tokenizer/speech_tokenizer_decoder.onnx

# CustomVoice model
python scripts/export_qwen_tts_tokenizer_onnx.py \
    checkpoints/Qwen3-TTS-12Hz-0.6B-CustomVoice/speech_tokenizer \
    checkpoints/Qwen3-TTS-12Hz-0.6B-CustomVoice/speech_tokenizer/speech_tokenizer_decoder.onnx

2. Build:

# CPU
cargo build -p crane-oai --release

# CUDA
cargo build -p crane-oai --release --features "cuda"

# macOS Metal (auto-enabled by target)
cargo build -p crane-oai --release

3. Start the server:

./target/release/crane-oai \
    --model-path checkpoints/Qwen3-TTS-12Hz-0.6B-Base \
    --model-type qwen3_tts \
    --port 8080

The model type is auto-detected from config.json — --model-type qwen3_tts is optional if the path contains Qwen3-TTS.

Model variants

Variant	HuggingFace ID	Description	Voice selection
CustomVoice	`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`	Predefined speakers	`voice` field (e.g. "Serena")
Base	`Qwen/Qwen3-TTS-12Hz-0.6B-Base`	Voice cloning via reference audio	`reference_audio` + `reference_text`

Voice cloning (Base model)

The Base model supports in-context learning (ICL) voice cloning: given a reference audio clip and its transcript, the model synthesizes new text in the same voice.

# Voice-clone a Japanese sentence using a reference audio clip
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "こうして君に直接ありがとうを言える時間をくれたこと それが多分一番私は嬉しい",
    "language": "japanese",
    "reference_audio": "data/audio/kinsenka_3.wav",
    "reference_text": "こうして君に直接ありがとうを言える時間をくれたこと それが多分一番私は嬉しい"
  }' \
  --output voice_clone.wav

Voice-clone fields:

Field	Type	Description
`reference_audio`	string	Local file path to reference WAV audio
`reference_text`	string	Transcript of the reference audio (required)

Note: When reference_audio is set, the model runs in voice-clone mode regardless of the voice field. The reference audio must be a WAV file. The language field should match the target text language.
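The rules in the note above can be checked on the client before a request is sent. The following is a minimal sketch: `build_speech_request` is a hypothetical helper, not part of crane-oai, and it encodes only the constraints stated in this README (WAV reference audio, transcript required in voice-clone mode, `voice` ignored when cloning).

```python
from typing import Optional


def build_speech_request(
    text: str,
    language: str = "auto",
    voice: Optional[str] = None,
    reference_audio: Optional[str] = None,
    reference_text: Optional[str] = None,
) -> dict:
    """Build a JSON body for POST /v1/audio/speech (hypothetical helper)."""
    body = {"model": "Qwen3-TTS", "input": text, "language": language}
    if reference_audio is not None:
        # Voice-clone mode: this README requires a WAV file plus its transcript.
        if not reference_audio.lower().endswith(".wav"):
            raise ValueError("reference_audio must be a WAV file")
        if not reference_text:
            raise ValueError("reference_text is required with reference_audio")
        body["reference_audio"] = reference_audio
        body["reference_text"] = reference_text
        # `voice` is left out because it is ignored in voice-clone mode.
    elif voice is not None:
        body["voice"] = voice
    return body
```

The returned dict can be passed as the `json=` argument of `requests.post`, as in the low-level examples later in this README.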

Speaker list (CustomVoice model)

The available speakers are defined in config.json under talker_config.spk_id. To print all names:

python3 -c "
import json
with open('checkpoints/Qwen3-TTS-12Hz-0.6B-CustomVoice/config.json') as f:
    c = json.load(f)
for name in sorted(c['talker_config']['spk_id']):
    print(name)
"

Built-in speakers for the CustomVoice model include: Serena, Vivian, Uncle_fu, Ryan, Aiden, Ono_anna, Sohee, Eric, Dylan. Larger model variants may include additional speakers. Check your model's config.json for the full list.

Generation parameters

Parameter	Default	Recommended	Notes
`temperature`	`0.9`	`0.7`–`0.9`	Lower = more stable prosody; higher = more expressive
`max_tokens`	`8192`	`2048`–`8192`	12 tokens ≈ 1 second of audio at 12 Hz
`repetition_penalty`	`1.05`	`1.0`–`1.1`	Helps avoid repeating codec tokens
`top_p`	`null`	`null` or `1.0`	Nucleus sampling; `1.0` matches reference defaults
`language`	`auto`	match input	Wrong language hint degrades quality

Troubleshooting

native speech tokenizer load failed: Check that <model_dir>/speech_tokenizer/config.json and the safetensors files are complete. For older checkpoints that lack quantizer.*._codebook.cluster_usage, Crane falls back to a ones vector during load. If loading still fails, export the ONNX decoder as a fallback.

TTS model not loaded: The server was started with an LLM checkpoint rather than a TTS one. Restart it with --model-type qwen3_tts and a Qwen3-TTS checkpoint.

Garbled audio / silence: Lower temperature to 0.5–0.7 and make sure language matches the input text.

Speech tokenizer ONNX not found (fallback path only): This appears only when the native decoder fails to load and the ONNX fallback is requested. Run the export script and place the ONNX file at <model_dir>/speech_tokenizer/speech_tokenizer_decoder.onnx.

CUDA Usage

Note: CUDA support requires the cuda feature flag at build time (see above). The server automatically uses the first available CUDA device.

Basic CUDA inference

crane-oai --model-path /path/to/Qwen3-8B-Instruct

On CUDA, /model_info reports the device as Cuda(0).

GPU memory control

GPU memory grows as KV caches accumulate. Use --gpu-memory-limit to keep usage bounded:

# Hard cap at 8 GB — recommended starting point for a 12 GB GPU
crane-oai --model-path /path/to/model \
    --gpu-memory-limit 8G \
    --max-seq-len 4096

# Cap at 5 GB for 8 GB VRAM cards
crane-oai --model-path /path/to/model \
    --gpu-memory-limit 5G \
    --max-seq-len 2048 \
    --max-concurrent 4

# Use 75% of total VRAM
crane-oai --model-path /path/to/model \
    --gpu-memory-limit 0.75

When the KV memory budget is exceeded, the engine evicts the longest-output sequence (preserving its state), tightens the concurrency cap, and resumes that sequence automatically once load subsides. This avoids OOM without crashing the server.
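The eviction behaviour described above can be pictured with a toy model. This is an illustrative Python sketch, not the Rust scheduler in crane-oai: the class and field names are hypothetical, and the resume rule (re-admit an evicted sequence as soon as it fits under the budget) is an assumption, since this README only says sequences resume "once load subsides".

```python
from dataclasses import dataclass, field


@dataclass
class Seq:
    rid: str
    output_tokens: int  # tokens generated so far
    kv_bytes: int       # KV cache footprint of this sequence


@dataclass
class ToyScheduler:
    kv_budget: int
    max_running: int
    running: list = field(default_factory=list)
    evicted: list = field(default_factory=list)

    def kv_used(self) -> int:
        return sum(s.kv_bytes for s in self.running)

    def enforce_budget(self) -> None:
        """Evict longest-output sequences until usage fits the budget."""
        while self.running and self.kv_used() > self.kv_budget:
            victim = max(self.running, key=lambda s: s.output_tokens)
            self.running.remove(victim)
            self.evicted.append(victim)  # state is kept, not dropped
            # Tighten the cap so new work is not admitted immediately.
            self.max_running = max(1, len(self.running))

    def try_resume(self) -> None:
        """Re-admit evicted sequences, oldest first, while they fit."""
        while self.evicted:
            cand = self.evicted[0]
            if self.kv_used() + cand.kv_bytes > self.kv_budget:
                break
            self.running.append(self.evicted.pop(0))
            self.max_running = max(self.max_running, len(self.running))
```

Calling `enforce_budget()` after each decode step and `try_resume()` whenever a sequence finishes reproduces the evict-then-resume cycle in miniature.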

Recommended values by GPU size:

GPU VRAM	`--gpu-memory-limit`	`--max-seq-len`
8 GB	`6G` or `0.7`	`2048`
12 GB	`8G` or `0.7`	`4096`
24 GB	`20G` or `0.8`	`8192`
48 GB+	(omit)	(omit)

GGUF quantized models on CUDA

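GGUF quantization reduces weight VRAM compared to FP16. The saving depends on the quantization type: 8-bit formats roughly halve it, and 4-bit formats such as Q4_K_M save more: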

crane-oai --model-path /path/to/Qwen3-8B-Q4_K_M.gguf \
    --format gguf \
    --gpu-memory-limit 8G

Multi-GPU note

Currently crane-oai runs on a single CUDA device (device 0). Multi-GPU tensor parallelism is not yet supported.

CLI Parameters

Flag	Default	Description
`--model-path`	(required)	Path to model directory or GGUF file
`--model-type`	`auto`	Architecture: `auto`, `hunyuan`, `qwen25`, `qwen3`, `qwen3_tts`
`--model-name`	directory name	Model name shown in API responses
`--host`	`0.0.0.0`	Bind address
`--port`	`8080`	Bind port
`--cpu`	`false`	Force CPU even when a GPU is available
`--max-concurrent`	`16`	Hard cap on concurrently decoding sequences. Actual concurrency may be lower when `--gpu-memory-limit` is active.
`--decode-tokens-per-seq`	`16`	Max decode rounds per scheduling step. Higher = less scheduling overhead, higher TTFT for queued requests.
`--format`	`auto`	Weight format: `auto`, `safetensors`, `gguf`
`--max-seq-len`	`0`	Max sequence length (prompt + generation); `0` = unlimited
`--gpu-memory-limit`	(none)	VRAM cap: absolute (`5G`, `8G`, `5120M`) or fractional (`0.7` = 70% of total)
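To make the two `--gpu-memory-limit` notations concrete, here is a small sketch of how such a value could be resolved to bytes. It is not the server's actual parser: the function name is hypothetical, and binary units (1G = 1024³ bytes) are an assumption, suggested only by `5G` and `5120M` being listed side by side.

```python
def parse_gpu_memory_limit(value: str, total_vram_bytes: int) -> int:
    """Resolve a limit string to bytes (illustrative; binary units assumed)."""
    text = value.strip().upper()
    units = {"M": 1024**2, "G": 1024**3}
    if text and text[-1] in units:
        # Absolute form, e.g. "5G" or "5120M".
        return int(float(text[:-1]) * units[text[-1]])
    # Fractional form, e.g. "0.7" = 70% of total VRAM.
    fraction = float(text)
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fractional limit must be in (0, 1]")
    return int(total_vram_bytes * fraction)
```

Under these assumptions, `5G` and `5120M` resolve to the same byte count, and `0.75` on an 8 GiB card resolves to 6 GiB.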

Parameter tuning guide

Goal	Recommendation
Constrained VRAM (≤12 GB)	Set `--gpu-memory-limit`; use `--max-concurrent 4–8` as a safety ceiling
Maximum throughput	Increase `--decode-tokens-per-seq` to `32` to reduce scheduling round-trips
Lowest time-to-first-token	Decrease `--decode-tokens-per-seq` to `4–8` so prefill slots in sooner
Long context generation	Set `--max-seq-len` to avoid unbounded KV growth

API Reference

OpenAI-compatible

`POST /v1/audio/speech` — Text-to-Speech (Qwen3-TTS)

Synthesizes speech from text. Returns a WAV audio file (or raw PCM, depending on response_format).

# Basic Chinese TTS (CustomVoice)
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "今天天气真好，我们去公园吧。",
    "voice": "Chelsie",
    "language": "chinese"
  }' \
  --output speech.wav

# English TTS with higher temperature
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "Hello! This is Crane, an ultra-fast inference framework written in Rust.",
    "voice": "Chelsie",
    "language": "english",
    "temperature": 0.8,
    "max_tokens": 2048
  }' \
  --output hello.wav

# Voice cloning (Base model)
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "そんな何もない今日が 少しだけでもいい日になったと思えたら",
    "language": "japanese",
    "reference_audio": "data/audio/kinsenka_3.wav",
    "reference_text": "こうして君に直接ありがとうを言える時間をくれたこと それが多分一番私は嬉しい"
  }' \
  --output voice_clone.wav

# Auto-detect language
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "This is a bilingual test. 这是一个中英文混合测试。",
    "language": "auto"
  }' \
  --output bilingual.wav

Request fields:

Field	Type	Default	Description
`model`	string	—	Model name (e.g. `"Qwen3-TTS"`)
`input`	string	—	The text to synthesize. UTF-8, up to a few thousand characters
`voice`	string	`null`	Speaker name (CustomVoice model). See speaker list below. `null` uses default voice
`language`	string	`"auto"`	Language hint: `"chinese"`, `"english"`, `"japanese"`, `"korean"`, `"auto"`
`instructions`	string	`null`	Optional system-level prompt to guide speaking style
`response_format`	string	`"wav"`	Output audio format: `"wav"` or `"pcm"` (raw 16-bit LE at 24 kHz). `"mp3"`, `"opus"`, `"aac"`, `"flac"` currently return `400`
`speed`	float	`1.0`	Speaking speed multiplier (reserved, not yet applied to generation)
`temperature`	float	`0.9`	Sampling temperature. Lower = more deterministic
`top_p`	float	`null`	Nucleus sampling threshold (default `null` = no nucleus filtering, equivalent to `1.0`)
`repetition_penalty`	float	`1.05`	Repetition penalty for codec token generation
`max_tokens`	int	`8192`	Max codec tokens to generate. Controls maximum audio duration (~83 ms per token at 12 Hz)
`reference_audio`	string	`null`	Local path to reference WAV audio for voice cloning (Base model only)
`reference_text`	string	`null`	Transcript of the reference audio (required when `reference_audio` is set)

Response: Binary audio bytes with Content-Type: audio/wav (response_format="wav") or audio/pcm (response_format="pcm"). Save WAV as .wav, or raw PCM as .pcm.
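Raw PCM has no header, so most players cannot open a .pcm file directly. The sketch below wraps PCM bytes in a WAV container using Python's standard `wave` module. The sample rate and 16-bit width come from the `response_format` description above; mono output is an assumption based on the decoder's [1, 1, S] output shape, and `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)   # mono is assumed, see above
        wf.setsampwidth(2)          # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

For example, `pathlib.Path("speech.wav").write_bytes(pcm_to_wav(r.content))` would save a `response_format="pcm"` reply as a playable file.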

Approximate duration cap: max_tokens / 12 seconds (e.g. 8192 tokens ≈ 683 seconds, 2048 ≈ 171 seconds).
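The same arithmetic can be turned around to pick `max_tokens` for a target clip length. This is a small sketch using the 12 tokens-per-second figure stated in this README; the helper names are illustrative only.

```python
import math

CODEC_FRAME_RATE_HZ = 12  # codec tokens per second, as stated in this README


def max_tokens_for(seconds: float) -> int:
    """Smallest max_tokens value that allows `seconds` of audio."""
    return math.ceil(seconds * CODEC_FRAME_RATE_HZ)


def duration_cap(max_tokens: int) -> float:
    """Approximate upper bound on audio length, in seconds."""
    return max_tokens / CODEC_FRAME_RATE_HZ
```

A 30-second clip therefore needs `max_tokens` of at least 360. Generation may stop earlier than the cap when the model finishes the utterance.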

Available speakers (Qwen3-TTS-12Hz-0.6B-Base):

Name	Gender	Language
Chelsie	Female	English / Chinese
Ethan	Male	English / Chinese
Cherry	Female	Chinese
Serena	Female	Chinese
...	—	See `config.json → talker_config.spk_id` for full list

Tip: The available speaker names are stored in config.json under talker_config.spk_id. Open the file to see all supported names for your model variant.

`POST /v1/chat/completions`

# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Streaming (SSE)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'
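For clients that do not use the OpenAI SDK, the stream can be consumed by reading `data:` lines. The sketch below assumes the server follows the OpenAI chunk layout (`choices[].delta.content`, a final `[DONE]` sentinel, and an empty `choices` list on the usage chunk); that layout is an assumption here, inferred from the SDK example in this README rather than from the server source. `iter_sse_content` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # The usage chunk carries an empty `choices` list and yields nothing.
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

With the `requests` library, the lines can come from `resp.iter_lines(decode_unicode=True)` on a response opened with `stream=True`.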

Multimodal / Vision (PaddleOCR-VL-1.5)

For VLM requests, use an array in content to provide the image URL and the prompt text:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "paddleocr_vl-1.5",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://i0.hdslb.com/bfs/new_dyn/1824ac967aca31d7ac9da4fdda678c4639471072.png"}},
          {"type": "text", "text": "OCR:"}
        ]
      }
    ],
    "max_tokens": 1024
  }'

Request fields:

Field	Type	Default	Description
`model`	string	—	Model name
`messages`	array	—	`[{role, content}]`. `content` can be a string, or an array of `{"type": "text", "text": "..."}` and `{"type": "image_url", "image_url": {"url": "..."}}` items for VLM models.
`max_tokens`	int	`512`	Max tokens to generate
`temperature`	float	`0.8`	Sampling temperature; `0` = greedy
`top_p`	float	`0.95`	Nucleus sampling threshold
`top_k`	int	`40`	Top-k sampling
`repetition_penalty`	float	`1.05`	Repetition penalty
`stream`	bool	`false`	Enable SSE streaming
`stream_options`	object	—	`{"include_usage": true}` to include token counts in the final chunk
`seed`	int	—	Random seed for reproducibility

`POST /v1/completions`

Raw text completion (no chat template applied).

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5-7B-Instruct", "prompt": "The capital of France is", "max_tokens": 64}'

prompt accepts a single string or an array of strings (concatenated).

`GET /v1/models` · `GET /v1/models/:model_id`

List available models or fetch metadata for a specific one.

`POST /v1/tokenize` · `POST /v1/detokenize`

# Text → token IDs
curl http://localhost:8080/v1/tokenize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'

# Token IDs → text
curl http://localhost:8080/v1/detokenize \
  -H "Content-Type: application/json" \
  -d '{"tokens": [9707, 1917]}'

SGLang-compatible

`POST /generate`

curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The meaning of life is",
    "sampling_params": {"max_new_tokens": 128, "temperature": 0.8, "top_p": 0.95}
  }'

Multimodal / Vision (PaddleOCR-VL-1.5)

To run a multimodal inference request with a VLM (PaddleOCR-VL), include the image_url parameter:

curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "OCR:",
    "image_url": "https://i0.hdslb.com/bfs/new_dyn/1824ac967aca31d7ac9da4fdda678c4639471072.png",
    "sampling_params": {"max_new_tokens": 1024}
  }'

sampling_params fields:

Field	Type	Default	Description
`max_new_tokens`	int	`128`	Max tokens to generate
`temperature`	float	`0.8`	Sampling temperature
`top_p`	float	`0.95`	Nucleus sampling
`top_k`	int	`20`	Top-k sampling
`repetition_penalty`	float	`1.0`	Repetition penalty
`stop`	string/array	—	Stop string(s)
`stop_token_ids`	array	—	Stop token IDs
`seed`	int	—	Random seed
`n`	int	`1`	Number of parallel completions

`GET /model_info`

Returns model metadata including device (Cuda(0), Metal(0), Cpu).

`GET /server_info`

Returns server config and live engine stats.

{
  "version": "0.1.0",
  "model_path": "/models/Qwen2.5-7B-Instruct",
  "max_concurrent": 16,
  "decode_tokens_per_seq": 16,
  "stats": {
    "total_requests": 42,
    "completed_requests": 40,
    "avg_decode_tokens_per_sec": 35.2,
    "active_sequences": 2,
    "waiting_sequences": 0
  }
}
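A payload like the one above can be reduced to a couple of load indicators for monitoring. This is a minimal sketch with a hypothetical helper name. It treats `max_concurrent` as the denominator even though the effective cap may be lower when `--gpu-memory-limit` is active, so the utilization figure is a rough guide only.

```python
def summarize_load(server_info: dict) -> dict:
    """Derive simple load indicators from a /server_info payload."""
    stats = server_info["stats"]
    return {
        # Share of the configured hard cap that is currently decoding.
        "slot_utilization": stats["active_sequences"] / server_info["max_concurrent"],
        # Any waiting sequence means requests are queueing behind the cap.
        "backlogged": stats["waiting_sequences"] > 0,
    }


sample = {
    "max_concurrent": 16,
    "stats": {"active_sequences": 2, "waiting_sequences": 0},
}
print(summarize_load(sample))
```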

`GET /health_generate`

Deep health check — runs a 1-token inference probe with a 30-second timeout.

`POST /abort_request`

Cancel an in-flight request by ID.

curl http://localhost:8080/abort_request \
  -H "Content-Type: application/json" \
  -d '{"rid": "gen-xxxx-xxxx"}'

Management

Endpoint	Method	Description
`/health`	GET	Liveness check, returns `{"status": "ok"}`
`/v1/stats`	GET	Engine stats snapshot (requests, throughput, active sequences)
`/flush_cache`	POST	KV cache flush (reserved for compatibility; caches are managed automatically)

Using with OpenAI SDK

Chat (LLM)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # crane-oai does not require an API key
)

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Text-to-Speech (Qwen3-TTS)

Start crane-oai with --model-type qwen3_tts and a Qwen3-TTS checkpoint.

CustomVoice model (predefined speakers):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

# --- Basic synthesis (returns WAV bytes) ---
response = client.audio.speech.create(
    model="Qwen3-TTS",
    voice="Chelsie",
    input="今天天气真好，我们去公园吧！",
    extra_body={"language": "chinese"},
)
response.stream_to_file("output.wav")

# --- English ---
response = client.audio.speech.create(
    model="Qwen3-TTS",
    voice="Ethan",
    input="Hello, this is a test of the Crane TTS engine.",
    extra_body={"language": "english", "temperature": 0.7},
)
response.stream_to_file("english.wav")

Base model (voice cloning):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

# --- Voice clone: synthesize new text in the reference speaker's voice ---
response = client.audio.speech.create(
    model="Qwen3-TTS",
    voice="clone",  # voice field is ignored in voice-clone mode
    input="そんな何もない今日が 少しだけでもいい日になったと思えたら",
    extra_body={
        "language": "japanese",
        "reference_audio": "data/audio/kinsenka_3.wav",
        "reference_text": "こうして君に直接ありがとうを言える時間をくれたこと それが多分一番私は嬉しい",
    },
)
response.stream_to_file("voice_clone.wav")

Low-level requests (requests library):

import requests, pathlib

# CustomVoice
r = requests.post(
    "http://localhost:8080/v1/audio/speech",
    json={
        "model": "Qwen3-TTS",
        "input": "今天天气真好，我们去公园吧！",
        "voice": "Chelsie",
        "language": "chinese",
        "temperature": 0.7,
        "max_tokens": 2048,
    },
)
r.raise_for_status()
pathlib.Path("speech.wav").write_bytes(r.content)
print(f"Saved {len(r.content):,} bytes")

# Voice clone
r = requests.post(
    "http://localhost:8080/v1/audio/speech",
    json={
        "model": "Qwen3-TTS",
        "input": "こんにちは、今日はいい天気ですね。",
        "language": "japanese",
        "reference_audio": "data/audio/kinsenka_3.wav",
        "reference_text": "Reference transcript here",
        "max_tokens": 2048,
    },
)
r.raise_for_status()
pathlib.Path("voice_clone.wav").write_bytes(r.content)

Source Structure

crane-oai/src/
├── main.rs              # CLI entry point, AppState, route registration
├── openai_api.rs        # OpenAI request/response types (incl. SpeechRequest)
├── sglang_api.rs        # SGLang native API types
├── chat_template.rs     # Chat template rendering (Jinja / Hunyuan hard-coded)
├── handlers/
│   ├── common.rs        # /health, /v1/stats
│   ├── openai.rs        # OpenAI endpoint handlers
│   ├── sglang.rs        # SGLang endpoint handlers
│   ├── tts.rs           # /v1/audio/speech handler (Qwen3-TTS)
│   ├── vlm.rs           # VLM handler (PaddleOCR-VL)
│   └── sse.rs           # SSE stream builder
└── engine/
    ├── mod.rs           # InferenceEngine core loop (continuous batching)
    ├── types.rs         # EngineRequest / EngineResponse / EngineHandle
    ├── stats.rs         # Lock-free atomic counters
    ├── sampling.rs      # Token sampling (top-k/p, Gumbel-max, repetition penalty)
    ├── scheduler.rs     # Prefill-priority scheduler (dynamic effective_max_running cap)
    ├── sequence.rs      # Sequence lifecycle management
    ├── backend.rs       # ModelBackend trait and per-model implementations
    └── model_factory.rs # Model auto-detection and factory (incl. Qwen3TTS)

Model Backend Support

Model	Batch decode	KV Swap	Formats	Notes
Hunyuan Dense	✅	✅	Safetensors / GGUF	KV pre-alloc, GQA 4D matmul, RoPE cache growth
Qwen 3	✅	✅	Safetensors / GGUF	+ QK Norm 4D, GGUF quantization
Qwen 2.5	sequential	❌	Safetensors	—
Qwen3-TTS	N/A	N/A	Safetensors (ONNX fallback optional)	Dedicated thread; no continuous batching

Model type is auto-detected from config.json (model_type / architectures) or can be set explicitly with --model-type.

Environment Variables

Variable	Default	Description
`CRANE_FORCE_GPU_TOPK`	`0`	Force GPU top-k even for large vocabularies
`CRANE_TOPP_FALLBACK_TOPK`	`64`	k value for GPU top-k fallback
`CRANE_TOPK_SAMPLE_ON_CPU`	`0`	Sample on CPU after GPU top-k
`CRANE_SAMPLE_TRACE`	`0`	Verbose sampling timing logs

Notes

No API key required — crane-oai does not authenticate requests.
Single CUDA device — The server uses CUDA device 0. Multi-GPU tensor parallelism is not yet supported.
KV eviction is lossless — Evicted sequences preserve their full state and resume automatically; in-flight requests are not dropped or errored.
--max-seq-len 0 means no limit. On constrained hardware, always set an explicit value to avoid runaway memory growth.
GGUF quantization is supported for Hunyuan Dense and Qwen 3. Qwen 2.5 requires Safetensors format.
--decode-tokens-per-seq controls decode rounds per engine step, not per request. Requests always complete fully regardless of this value.
Log diagnostics — The startup log prints kv_bytes and kv_budget. Monitor these to validate your --gpu-memory-limit headroom.
Qwen3-TTS runs on a dedicated thread — No continuous batching; each /v1/audio/speech request is processed sequentially. Concurrent requests are queued in an unbounded channel.
Qwen3-TTS decoder backend — The speech-tokenizer decoder (codes → waveform) uses native Candle by default. ONNX export is optional as a compatibility fallback.
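The sequential TTS behaviour noted above (one dedicated thread fed by an unbounded queue) can be pictured with a small Python analogy. This is not the Rust implementation: the function names are hypothetical, and `synthesize` stands in for the real inference call.

```python
import queue
import threading


def tts_worker(jobs: "queue.Queue", synthesize) -> None:
    """Process jobs one at a time until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        text, reply = job
        reply.put(synthesize(text))  # each caller waits on its own reply queue


def submit(jobs: "queue.Queue", text: str) -> "queue.Queue":
    """Enqueue a request and return the queue its result will arrive on."""
    reply = queue.Queue(maxsize=1)
    jobs.put((text, reply))  # never blocks: the job queue is unbounded
    return reply


if __name__ == "__main__":
    jobs = queue.Queue()  # maxsize=0 means unbounded
    worker = threading.Thread(
        target=tts_worker, args=(jobs, lambda t: t.encode()), daemon=True
    )
    worker.start()
    first, second = submit(jobs, "hello"), submit(jobs, "world")
    print(first.get(), second.get())  # results arrive in submission order
    jobs.put(None)
```

Because requests are served strictly in order, a long synthesis delays everything queued behind it, which is worth keeping in mind when choosing `max_tokens`.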

Testing

# All unit tests (125 crane-oai + 11 crane-core)
cargo test -p crane-oai
cargo test -p crane-core

# Specific modules
cargo test -p crane-oai engine::scheduler
cargo test -p crane-oai openai_api::tests
cargo test -p crane-oai sglang_api::tests
cargo test -p crane-core autotokenizer

License

MIT
