This repository provides a complete streaming Automatic Speech Recognition (ASR) pipeline for Russian, specialized for the telephony domain. The pipeline includes a pretrained streaming CTC-based acoustic model, a streaming log-probability splitter, and a decoder, making it a ready-to-use solution for real-time transcription. It processes audio in 300 ms chunks and detects phrase boundaries using a custom log-probability splitter. The final transcription is generated using either greedy decoding or a KenLM-based CTC beam search decoder. Originally developed by T-Software DC, this project is a practical low-latency, high-throughput ASR solution with modular components.
For a quick demo, simply run the service with a web interface using our pre-built Docker image. You can transcribe audio files or use your microphone for real-time streaming recognition, right in your browser. A computer with at least 4 CPU cores and 8 GB RAM is recommended for smooth performance.
Run the container:

```bash
docker run -it --rm -p 8080:8080 tinkoffcreditsystems/t-one:0.1.0
```

Open the website http://localhost:8080.
Alternatively, you can build the image manually using the Dockerfile and run the container:

```bash
docker build -t t-one .
docker run -it --rm -p 8080:8080 t-one
```

Ensure you have Python (3.9 or higher) and Poetry (2.1 or newer is recommended) installed on your Linux or macOS system.
On Windows, it is recommended to use a containerized environment such as Docker or the Windows Subsystem for Linux (WSL), as the KenLM dependency does not officially support Windows. While it might be possible to install it on Windows (you can remove kenlm from the Poetry dependencies and try to build it manually via Vcpkg), doing so is generally more complex and prone to dependency issues.
- Clone the repository:

  ```bash
  git clone https://github.com/voicekit-team/T-one.git
  ```

- Navigate to the repository directory:

  ```bash
  cd T-one
  ```

- Create and activate a Python virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
  ```

- Use the Makefile to install the package and its dependencies locally and run the demo web service:

  ```bash
  make install
  make up_dev
  ```

- Open the website http://localhost:8080.
Alternatively, you can install the package directly with Poetry after activating the environment:

- Install the package and its demo dependencies:

  ```bash
  poetry install -E demo
  ```

- Run the demo web service:

  ```bash
  uvicorn --host 0.0.0.0 --port 8080 tone.demo.website:app --reload
  ```

- Open the website http://localhost:8080.
Prepare an audio file with Russian speech in a common format (.wav, .mp3, .flac, .ogg) or use our example audio file.

You can also run offline recognition directly in Python:

```python
from tone import StreamingCTCPipeline, read_audio, read_example_audio

audio = read_example_audio()  # or read_audio("your_audio.flac")

pipeline = StreamingCTCPipeline.from_hugging_face()

print(pipeline.forward_offline(audio))  # run offline recognition
```

See the "Advanced usage example" section for an example of streaming.
See the manual for a detailed guide on how to export the T-one acoustic model to a TensorRT engine and run it efficiently with Triton Inference Server.
- Install the package and its fine-tuning dependencies:

  ```bash
  poetry install -E finetune
  ```

- See the fine-tuning example.
Incoming audio is sliced into 300 ms segments on the fly (see the picture below). Each segment is fed to a Conformer-based acoustic model. The model receives an audio segment and a hidden state generated for the previous segment, preserves acoustic context across segment boundaries, and returns a new hidden state along with frame-level log-probabilities (aka logprobs) for each symbol in its output alphabet. The alphabet consists of Russian letters, a space, and a blank token (a special token that serves as a hard boundary between groups of characters).
The splitter takes the log-probabilities of the current audio segment and its own internal state as input. It returns an updated state and a list of any newly detected phrases. For each input frame, the splitter determines whether it contains speech. When the splitter detects a speech frame, the phrase is considered started (the first speech frame marks the beginning of the phrase). A phrase ends when the splitter finds N consecutive non-speech frames (the last speech frame before the silence marks the end of the phrase). Each phrase contains the corresponding logprobs along with start and end timestamps.
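To make the boundary rule concrete, here is a minimal, illustrative sketch of that state machine. It operates only on per-frame speech/non-speech flags and frame indices; the actual splitter in this repository additionally carries the phrase logprobs and timestamps, and names such as `N_SILENCE_FRAMES` and `split_segment` are hypothetical:

```python
from dataclasses import dataclass

N_SILENCE_FRAMES = 20  # hypothetical threshold of consecutive non-speech frames


@dataclass
class SplitterSketchState:
    in_phrase: bool = False
    silence_run: int = 0
    start_frame: int = 0
    last_speech_frame: int = 0
    frame_offset: int = 0  # global index of the next frame across segments


def split_segment(is_speech_flags, state):
    """Consume per-frame speech flags for one segment; return (finished_phrases, state)."""
    phrases = []
    for is_speech in is_speech_flags:
        frame = state.frame_offset
        if is_speech:
            if not state.in_phrase:
                state.in_phrase = True
                state.start_frame = frame          # first speech frame opens the phrase
            state.last_speech_frame = frame
            state.silence_run = 0
        elif state.in_phrase:
            state.silence_run += 1
            if state.silence_run >= N_SILENCE_FRAMES:
                # the last speech frame before the silence closes the phrase
                phrases.append((state.start_frame, state.last_speech_frame))
                state.in_phrase = False
                state.silence_run = 0
        state.frame_offset += 1
    return phrases, state
```

Calling such a function once per incoming segment, with the state carried over between calls, mirrors the streaming interface described above: phrases can end mid-segment or span several segments.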
The decoder takes the logprobs of the phrases produced by the splitter as input and converts them into text. Two decoding methods are available: greedy decoding or beam search decoding with a KenLM language model. The resulting text, along with the phrase timings, is returned to the client.
(Blank and space symbols are marked as '-' and '|' here)
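As a concrete illustration of the greedy path, the sketch below takes the per-frame argmax symbols, collapses consecutive repeats, and drops blanks, which is the standard greedy CTC rule. It is a simplified stand-in with a hypothetical toy alphabet; the decoder in this repository also tracks timestamps and offers the KenLM beam search alternative:

```python
import numpy as np

# Toy alphabet: blank first, then space, then letters (the real model uses
# Russian letters; Latin letters are used here only for readability).
ALPHABET = ["<blank>", " ", "a", "b", "c"]
BLANK_ID = 0


def greedy_ctc_decode(logprobs: np.ndarray) -> str:
    """logprobs: (num_frames, alphabet_size) array of per-frame log-probabilities."""
    best_ids = logprobs.argmax(axis=-1)             # most likely symbol per frame
    chars = []
    prev_id = None
    for idx in best_ids:
        if idx != prev_id and idx != BLANK_ID:      # collapse repeats, drop blanks
            chars.append(ALPHABET[idx])
        prev_id = idx
    return "".join(chars)


if __name__ == "__main__":
    # Toy example: 6 frames favouring "a", "a", blank, "b", space, "c"
    frame_ids = [2, 2, 0, 3, 1, 4]
    toy = np.full((6, len(ALPHABET)), -10.0)
    toy[np.arange(6), frame_ids] = 0.0
    print(greedy_ctc_decode(toy))  # -> "ab c"
```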
For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying article.
```python
from tone import StreamingCTCPipeline, read_stream_example_audio

pipeline = StreamingCTCPipeline.from_hugging_face()

state = None  # Current state of the ASR pipeline (None - initial)
for audio_chunk in read_stream_example_audio():  # Use any source of audio chunks
    new_phrases, state = pipeline.forward(audio_chunk, state)
    print(new_phrases)

# Finalize the pipeline and get the remaining phrases
new_phrases, _ = pipeline.finalize(state)
print(new_phrases)
```

Word Error Rate (WER) is used to evaluate the quality of automatic speech recognition systems; it can be interpreted as the percentage of incorrectly recognized words relative to a reference transcript.
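For reference, WER is the word-level edit distance between the hypothesis and the reference, normalized by the reference length: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference (this is the standard definition, not something specific to this repository). Lower is better.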
| Category | T-one (71M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
|---|---|---|---|---|---|---|
| Call-center | 8.63 | 10.22 | 10.57 | 11.28 | 15.53 | 19.39 |
| Other telephony | 6.20 | 7.88 | 8.15 | 8.69 | 13.49 | 17.29 |
| Named entities | 5.83 | 9.55 | 9.81 | 12.12 | 17.65 | 17.87 |
| CommonVoice 19 (test split) | 5.32 | 2.68 | 3.14 | 6.22 | 11.3 | 5.78 |
| OpenSTT asr_calls_2_val original | 20.27 | 20.07 | 21.24 | 22.64 | 29.45 | 29.02 |
| OpenSTT asr_calls_2_val re-labeled | 7.94 | 11.14 | 12.43 | 13.22 | 21.03 | 20.82 |
Below is the model throughput, as reported by trtexec during export to a TensorRT engine, across different GPUs:
| Device | Configuration | Max & Optimal Batch Size | RPS (Requests Per Second) | SPS (Seconds Per Second) |
|---|---|---|---|---|
| T4 | TensorRT | 32 | 5952 | 1786 |
| A30 | TensorRT | 128 | 17408 | 5222 |
| A100 | TensorRT | 256 | 26112 | 7833 |
| H100 | TensorRT | 1024 | 57344 | 17203 |
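Assuming each request corresponds to a single 300 ms audio chunk (the chunk size used by the pipeline), SPS follows directly from RPS as SPS ≈ 0.3 × RPS (for example, 5952 × 0.3 ≈ 1786 on the T4), so SPS can be read as seconds of audio processed per second of wall-clock time.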
Find more details on the performance metrics calculation method here.
The code and models in this repository are released under the Apache 2.0 License.