The Pipecat STT benchmark is an open-source evaluation of leading speech-to-text models for real-time voice agents. It measures what matters most in production voice AI: transcription accuracy and latency. The benchmark is public and reproducible, and the results table is the single source of truth for every number published on this page.
- Accuracy: Soniox stt-rt-v4 reaches 1.25% semantic WER and 84.1% perfect transcripts, placing it among the most accurate models in the benchmark.
- Latency: 249ms median time to final segment, with 281ms P95 and 310ms P99, making Soniox one of the lowest-latency models in the benchmark.
- Dataset: 1,000 real-world samples from the pipecat-ai/smart-turn-data-v3.1-train dataset, with ground truth generated by Gemini and human-reviewed.
- Mode: Real-time streaming transcription, the setting that defines voice agent performance.
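To make the latency figures concrete, here is a minimal sketch of how a median, P95, and P99 can be derived from per-sample latencies. The function name `latency_summary`, the sample values, and the interpolation method are assumptions for illustration; the benchmark's own percentile method may differ.

```python
import statistics


def latency_summary(samples_ms):
    """Return median, P95, and P99 for a list of latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points P1..P99.
    # The "inclusive" interpolation method is an assumption of this sketch.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(samples_ms),
        "p95": cuts[94],  # 95th percentile
        "p99": cuts[98],  # 99th percentile
    }


# Hypothetical latencies (not benchmark data): nine fast responses, one slow outlier.
example = [240, 245, 249, 250, 255, 260, 270, 280, 300, 900]
summary = latency_summary(example)
print(summary)
```

A single slow response barely moves the median but dominates the tail percentiles, which is why the median is reported alongside P95 and P99.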
Results
Sorted by mean semantic WER (lower is better). Latency is reported as time to final segment (TTFS) in milliseconds. Price is the public pay-as-you-go real-time rate per hour of audio, shown where the benchmarked model maps to a listed price.
| Provider | Model | Price / hr | Semantic WER (mean) | Pooled WER | Perfect transcripts | TTFS median | TTFS P95 | TTFS P99 |
|---|---|---|---|---|---|---|---|---|
| Azure | — | $1.00 | 1.21% | 1.18% | 82.9% | 1016ms | 1345ms | 1791ms |
| Soniox | stt-rt-v4 | $0.12 | 1.25% | 1.29% | 84.1% | 249ms | 281ms | 310ms |
| Speechmatics | — | $0.56 | 1.40% | 1.07% | 83.2% | 495ms | 676ms | 736ms |
| Cartesia | ink-2 | $0.43 | 1.47% | 1.25% | 84.2% | 299ms | 328ms | 1584ms |
| AWS | — | — | 1.68% | 1.75% | 77.4% | 1136ms | 1527ms | 1897ms |
| Deepgram | nova-3-general | $0.55 | 1.71% | 1.62% | 76.5% | 247ms | 298ms | 326ms |
| AssemblyAI | u3-rt-pro | $0.57 | 1.74% | 1.34% | 83.9% | 335ms | 534ms | 613ms |
| NVIDIA | Nemotron 3.0 ASR (en) | — | 1.90% | 1.95% | 76.1% | 221ms | 238ms | 252ms |
| Smallest AI | pulse | — | 2.30% | 2.37% | 72.4% | 398ms | 533ms | 1593ms |
| Google | latest-long | $0.96 | 2.84% | 2.85% | 69.0% | 878ms | 1155ms | 1570ms |
| ElevenLabs | scribe_v2_realtime | $0.39 | 3.16% | 3.12% | 81.3% | 281ms | 348ms | 407ms |
| OpenAI | gpt-4o-transcribe | — | 3.24% | 3.06% | 75.9% | 637ms | 965ms | 1655ms |
| AssemblyAI | universal-streaming-english | — | 3.49% | 3.02% | 66.8% | 256ms | 362ms | 417ms |
| Gradium | default | — | 3.72% | 3.96% | 65.3% | 570ms | 595ms | 614ms |
| Cartesia | ink-whisper | — | 3.92% | 4.36% | 60.5% | 266ms | 364ms | 898ms |
| Mistral | voxtral-mini-transcribe-realtime-2602 | — | 4.44% | 4.97% | 68.8% | 525ms | 973ms | 1913ms |
| NVIDIA | Nemotron 3.5 ASR (multilingual) | — | 4.54% | 4.58% | 62.0% | 236ms | 253ms | 266ms |
Pricing reflects public pay-as-you-go rates and may not match every benchmarked configuration. See Soniox pricing for details.
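
The table reports both a mean WER and a pooled WER, and the two can diverge for the same model. The sketch below shows one conventional way to compute them, assuming that the mean averages per-sample WER, that pooled WER divides total errors by total reference words, and that a perfect transcript has zero errors. It uses plain word-level edit distance rather than the benchmark's semantic scoring, and the function names and sample sentences are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def score(pairs):
    """Return (mean WER, pooled WER, share of perfect transcripts)."""
    per_sample, total_errors, total_words = [], 0, 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        if words == 0:
            continue  # skip empty references to avoid division by zero
        per_sample.append(errors / words)
        total_errors += errors
        total_words += words
    if not per_sample:
        raise ValueError("no scorable samples")
    mean_wer = sum(per_sample) / len(per_sample)
    pooled_wer = total_errors / total_words
    perfect = sum(1 for w in per_sample if w == 0) / len(per_sample)
    return mean_wer, pooled_wer, perfect


# Hypothetical samples: one short utterance with an error, one long perfect one.
pairs = [
    ("yes", "yeah"),
    ("please book a table for two at seven tonight",
     "please book a table for two at seven tonight"),
]
mean_wer, pooled_wer, perfect = score(pairs)
print(f"mean={mean_wer:.2%} pooled={pooled_wer:.2%} perfect={perfect:.1%}")
```

Under these assumptions, the mean weights every utterance equally, so errors on short utterances count heavily, while the pooled figure weights every word equally. That is one plausible reason a model's two WER columns can differ.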