Speech-to-text benchmarks

June 2026

The Pipecat STT benchmark is an open-source evaluation of leading speech-to-text models for real-time voice agents. It measures what matters most in production voice AI: transcription accuracy and latency. The benchmark is public, reproducible, and the results table is the single source of truth for every number published on this page.

  • Accuracy: Soniox stt-rt-v4 reaches a 1.25% mean semantic WER and 84.1% perfect transcripts, the second-best result in the benchmark on each measure (behind Azure at 1.21% and Cartesia ink-2 at 84.2%, respectively).
  • Latency: 249ms median time to final segment (TTFS), with 281ms at P95 and 310ms at P99. The P99 sits only 61ms above the median, placing Soniox among the lowest-latency models in the benchmark.
  • Dataset: 1,000 real-world samples from the pipecat-ai/smart-turn-data-v3.1-train dataset, with ground-truth transcripts generated by Gemini and then reviewed by humans.
  • Mode: Real-time streaming transcription, the setting that defines voice agent performance.

Results

Sorted by mean semantic WER (the WER mean column; lower is better). Perfect is the share of perfect transcripts. Latency is reported as time to final segment (TTFS) in milliseconds. Price is the public pay-as-you-go real-time rate in USD per hour of audio, shown only where the benchmarked model maps to a listed price.

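| Provider | Model | Price / hr | WER mean | Pooled WER | Perfect | TTFS median | TTFS P95 | TTFS P99 |
|---|---|---|---|---|---|---|---|---|
| Azure | - | $1.00 | 1.21% | 1.18% | 82.9% | 1016ms | 1345ms | 1791ms |
| Soniox | stt-rt-v4 | $0.12 | 1.25% | 1.29% | 84.1% | 249ms | 281ms | 310ms |
| Speechmatics | - | $0.56 | 1.40% | 1.07% | 83.2% | 495ms | 676ms | 736ms |
| Cartesia | ink-2 | $0.43 | 1.47% | 1.25% | 84.2% | 299ms | 328ms | 1584ms |
| AWS | - | - | 1.68% | 1.75% | 77.4% | 1136ms | 1527ms | 1897ms |
| Deepgram | nova-3-general | $0.55 | 1.71% | 1.62% | 76.5% | 247ms | 298ms | 326ms |
| AssemblyAI | u3-rt-pro | $0.57 | 1.74% | 1.34% | 83.9% | 335ms | 534ms | 613ms |
| NVIDIA | Nemotron 3.0 ASR (en) | - | 1.90% | 1.95% | 76.1% | 221ms | 238ms | 252ms |
| Smallest AI | pulse | - | 2.30% | 2.37% | 72.4% | 398ms | 533ms | 1593ms |
| Google | latest-long | $0.96 | 2.84% | 2.85% | 69.0% | 878ms | 1155ms | 1570ms |
| ElevenLabs | scribe_v2_realtime | $0.39 | 3.16% | 3.12% | 81.3% | 281ms | 348ms | 407ms |
| OpenAI | gpt-4o-transcribe | - | 3.24% | 3.06% | 75.9% | 637ms | 965ms | 1655ms |
| AssemblyAI | universal-streaming-english | - | 3.49% | 3.02% | 66.8% | 256ms | 362ms | 417ms |
| Gradium | default | - | 3.72% | 3.96% | 65.3% | 570ms | 595ms | 614ms |
| Cartesia | ink-whisper | - | 3.92% | 4.36% | 60.5% | 266ms | 364ms | 898ms |
| Mistral | voxtral-mini-transcribe-realtime-2602 | - | 4.44% | 4.97% | 68.8% | 525ms | 973ms | 1913ms |
| NVIDIA | Nemotron 3.5 ASR (multilingual) | - | 4.54% | 4.58% | 62.0% | 236ms | 253ms | 266ms |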

Pricing reflects public pay-as-you-go rates and may not match every benchmarked configuration. See Soniox pricing for details.
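
The results table reports both a WER mean and a pooled WER, but this page does not define either column. The sketch below assumes the common convention: mean WER averages the per-sample error rates, while pooled WER divides total word errors by total reference words. It uses a plain word-level edit distance, not the benchmark's semantic scoring, and the function names and sample sentences are hypothetical.

```python
from typing import List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def mean_and_pooled_wer(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """pairs holds (reference, hypothesis) strings; references must be non-empty."""
    counts = [word_errors(ref, hyp) for ref, hyp in pairs]
    # Assumed definition of "WER mean": average of the per-sample rates.
    mean_wer = sum(errors / words for errors, words in counts) / len(counts)
    # Assumed definition of "Pooled WER": all errors over all reference words.
    pooled_wer = sum(e for e, _ in counts) / sum(w for _, w in counts)
    return mean_wer, pooled_wer


pairs = [
    ("yes", "yet"),  # 1 error in 1 word
    ("please book a table for two at seven tonight thanks",
     "please book a table for two at seven tonight thanks"),  # 0 errors in 10 words
]
mean_wer, pooled_wer = mean_and_pooled_wer(pairs)
print(f"mean WER = {mean_wer:.1%}, pooled WER = {pooled_wer:.1%}")
# prints: mean WER = 50.0%, pooled WER = 9.1%
```

Under these assumed definitions, a single error in a short utterance moves the mean far more than the pooled figure, so a pooled WER well below the mean (as in the Speechmatics row) would suggest errors concentrated in shorter samples.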

Real-time transcription accuracy (semantic WER)

Source: Pipecat STT benchmark, 1,000 samples

How the benchmark works

The Pipecat benchmark scores every provider on the same audio with two purpose-built metrics for streaming voice applications.

  • Semantic WER measures only transcription errors that change meaning for a downstream LLM agent. Punctuation, capitalization, contractions, filler words, and number formats are ignored, so the score reflects real-world impact rather than surface differences.
  • TTFS (time to final segment) measures latency from the moment the user stops speaking to when the final transcription segment arrives. For streaming voice agents, lower TTFS means faster responses, and P95 latency matters more than the median because occasional spikes break conversational flow.
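
To make the TTFS definition concrete, here is a minimal sketch that turns per-utterance timestamps into the median, P95, and P99 figures used in the results table. The timestamps are invented, the function names are hypothetical, and the nearest-rank percentile method is an assumption; the benchmark may compute percentiles differently.

```python
import statistics
from typing import Dict, List, Tuple


def ttfs_ms(speech_end_ms: int, final_segment_ms: int) -> int:
    """Latency from the end of speech to arrival of the final segment."""
    return final_segment_ms - speech_end_ms


def nearest_rank(sorted_values: List[int], pct: int) -> int:
    """Smallest value with at least pct percent of samples at or below it."""
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms: List[int]) -> Dict[str, float]:
    ordered = sorted(latencies_ms)
    return {
        "median": statistics.median(ordered),
        "p95": nearest_rank(ordered, 95),
        "p99": nearest_rank(ordered, 99),
    }


# Hypothetical (speech_end, final_segment) timestamps in milliseconds:
# 18 typical turns plus two slow ones.
events: List[Tuple[int, int]] = (
    [(10_000, 10_250)] * 18 + [(10_000, 10_900), (10_000, 11_600)]
)
stats = summarize([ttfs_ms(end, final) for end, final in events])
print(stats)
```

In this made-up sample the median stays at 250ms while P95 and P99 rise to 900ms and 1600ms, which is the kind of tail spike the median alone hides.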

The benchmark dataset is published on Hugging Face as pipecat-ai/stt-benchmark-data, and anyone can rerun the evaluation to reproduce these results.

Compare speech-to-text pricing

Top accuracy does not have to cost more. Pick a provider and your monthly volume to compare pay-as-you-go speech-to-text pricing.

Pricing calculator (interactive): compares Soniox against a selected provider at a monthly volume of 10 to 100,000 hours of audio. The default shown is 1,000 hours of audio per month.

Pricing assumptions

Based on public pay-as-you-go pricing. Enterprise discounts and committed-use contracts may differ. Some providers charge separately for certain features. The calculator uses the public price for the provider configuration that most closely matches Soniox.
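
The comparison reduces to price per hour multiplied by monthly hours. The sketch below works that arithmetic for 1,000 hours per month, assuming the rates listed in the results table; the calculator on the page may use a different provider configuration, and the names `PRICE_PER_HOUR_USD` and `monthly_cost` are hypothetical.

```python
from decimal import Decimal

# Public pay-as-you-go real-time rates in USD per hour of audio, copied from
# the results table. Rows without a listed price are omitted.
PRICE_PER_HOUR_USD = {
    "Soniox (stt-rt-v4)": Decimal("0.12"),
    "ElevenLabs (scribe_v2_realtime)": Decimal("0.39"),
    "Cartesia (ink-2)": Decimal("0.43"),
    "Deepgram (nova-3-general)": Decimal("0.55"),
    "Speechmatics": Decimal("0.56"),
    "AssemblyAI (u3-rt-pro)": Decimal("0.57"),
    "Google (latest-long)": Decimal("0.96"),
    "Azure": Decimal("1.00"),
}


def monthly_cost(provider: str, hours_per_month: int) -> Decimal:
    """Pay-as-you-go cost in USD, ignoring discounts and feature surcharges."""
    return PRICE_PER_HOUR_USD[provider] * hours_per_month


hours = 1000
baseline = monthly_cost("Soniox (stt-rt-v4)", hours)
# List providers from cheapest to most expensive at this volume.
for provider in sorted(PRICE_PER_HOUR_USD, key=PRICE_PER_HOUR_USD.get):
    cost = monthly_cost(provider, hours)
    print(f"{provider:<32} ${cost:>9,.2f}  ({cost / baseline:.1f}x the Soniox cost)")
```

At 1,000 hours per month this gives $120.00 for Soniox and $1,000.00 for Azure under the listed rates. `Decimal` is used so that currency amounts are not subject to binary floating-point rounding.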
