This repository provides a complete streaming Automatic Speech Recognition (ASR) pipeline for Russian, specialized for the telephony domain. The pipeline includes a pretrained streaming CTC-based acoustic model, a streaming log-probability splitter, and a decoder, making it a ready-to-use solution for real-time transcription. It processes audio in 300 ms chunks and detects phrase boundaries using a custom log-probability splitter. The final transcription is generated using either greedy decoding or a KenLM-based CTC beam search decoder. Originally developed by T-Software DC, this project is a practical low-latency, high-throughput ASR solution with modular components.
For a quick demo, simply run the service with a web interface using our pre-built Docker image. You can transcribe audio files or use your microphone for real-time streaming recognition, right in your browser. A computer with at least 4 CPU cores and 8 GB RAM is recommended for smooth performance.
Run the container:

```bash
docker run -it --rm -p 8080:8080 tinkoffcreditsystems/t-one:0.1.0
```

Open the website http://localhost:8080.
Alternatively, you can build the image manually using the Dockerfile and run the container:

```bash
docker build -t t-one .
docker run -it --rm -p 8080:8080 t-one
```

Ensure you have Python (3.9 or higher) and Poetry (2.1 or newer is recommended) installed on your Linux or macOS system.
On Windows, it is recommended to use a containerized environment such as Docker or the Windows Subsystem for Linux (WSL), as the KenLM dependency does not officially support Windows. While it might be possible to install it on Windows (you can remove kenlm from the Poetry dependencies and try to build it manually via Vcpkg), doing so is generally more complex and prone to dependency issues.
- Clone the repository:

  ```bash
  git clone https://github.com/voicekit-team/T-one.git
  ```

- Navigate to the repository directory:

  ```bash
  cd T-one
  ```

- Create and activate a Python virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
  ```

- Use the Makefile to install the package and its dependencies locally and run the demo web service:

  ```bash
  make install
  make up_dev
  ```

- Open the website http://localhost:8080.
Alternatively, you can install the package directly with Poetry after activating the environment:

- Install the package and its demo dependencies:

  ```bash
  poetry install -E demo
  ```

- Run the demo web service:

  ```bash
  uvicorn --host 0.0.0.0 --port 8080 tone.demo.website:app --reload
  ```

- Open the website http://localhost:8080.
Prepare an audio file with Russian speech in a common format (.wav, .mp3, .flac, .ogg) or use our example audio file.

You can also run offline recognition directly in Python:

```python
from tone import StreamingCTCPipeline, read_audio, read_example_audio

audio = read_example_audio()  # or read_audio("your_audio.flac")

pipeline = StreamingCTCPipeline.from_hugging_face()

print(pipeline.forward_offline(audio))  # run offline recognition
```

See the "Advanced usage example" section for an example of streaming.
See the manual for a detailed guide on how to export the T-one acoustic model to a TensorRT engine and run it efficiently with Triton Inference Server.
- Install the package and its fine-tuning dependencies:

  ```bash
  poetry install -E finetune
  ```

- See the fine-tuning example.
Incoming audio is sliced into 300 ms segments on the fly (see the picture below). Each segment is fed to a Conformer-based acoustic model. The model receives an audio segment and a hidden state generated for the previous segment, preserves acoustic context across segment boundaries, and returns a new hidden state along with frame-level log-probabilities (aka logprobs) for each symbol in its output alphabet. The alphabet consists of Russian letters, a space, and a blank token (a special token that serves as a hard boundary between groups of characters).
The splitter takes the log-probabilities of the current audio segment and its own internal state as input. It returns an updated state and a list of any newly detected phrases. For each input frame, the splitter determines whether it contains speech. When the splitter detects a speech frame, the phrase is considered started (the first speech frame marks the beginning of the phrase). A phrase ends when the splitter finds N consecutive non-speech frames (the last speech frame before the silence marks the end of the phrase). Each phrase contains the corresponding logprobs along with start and end timestamps.
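To make the boundary rule concrete, here is a minimal, illustrative sketch of that state machine. It operates only on per-frame speech/non-speech flags and frame indices; the actual splitter in this repository additionally carries the phrase logprobs and timestamps, and names such as `N_SILENCE_FRAMES` and `split_segment` are hypothetical:

```python
from dataclasses import dataclass

N_SILENCE_FRAMES = 20  # hypothetical threshold of consecutive non-speech frames


@dataclass
class SplitterSketchState:
    in_phrase: bool = False
    silence_run: int = 0
    start_frame: int = 0
    last_speech_frame: int = 0
    frame_offset: int = 0  # global index of the next frame across segments


def split_segment(is_speech_flags, state):
    """Consume per-frame speech flags for one segment; return (finished_phrases, state)."""
    phrases = []
    for is_speech in is_speech_flags:
        frame = state.frame_offset
        if is_speech:
            if not state.in_phrase:
                state.in_phrase = True
                state.start_frame = frame          # first speech frame opens the phrase
            state.last_speech_frame = frame
            state.silence_run = 0
        elif state.in_phrase:
            state.silence_run += 1
            if state.silence_run >= N_SILENCE_FRAMES:
                # the last speech frame before the silence closes the phrase
                phrases.append((state.start_frame, state.last_speech_frame))
                state.in_phrase = False
                state.silence_run = 0
        state.frame_offset += 1
    return phrases, state
```

Calling such a function once per incoming segment, with the state carried over between calls, mirrors the streaming interface described above: phrases can end mid-segment or span several segments.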
The decoder takes the logprobs of the phrases produced by the splitter as input and converts them into text. Two decoding methods are available: greedy decoding or beam search decoding with a KenLM language model. The resulting text, along with the phrase timings, is returned to the client.
(Blank and space symbols are marked as '-' and '|' here)
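As a concrete illustration of the greedy path, the sketch below takes the per-frame argmax symbols, collapses consecutive repeats, and drops blanks, which is the standard greedy CTC rule. It is a simplified stand-in with a hypothetical toy alphabet; the decoder in this repository also tracks timestamps and offers the KenLM beam search alternative:

```python
import numpy as np

# Toy alphabet: blank first, then space, then letters (the real model uses
# Russian letters; Latin letters are used here only for readability).
ALPHABET = ["<blank>", " ", "a", "b", "c"]
BLANK_ID = 0


def greedy_ctc_decode(logprobs: np.ndarray) -> str:
    """logprobs: (num_frames, alphabet_size) array of per-frame log-probabilities."""
    best_ids = logprobs.argmax(axis=-1)             # most likely symbol per frame
    chars = []
    prev_id = None
    for idx in best_ids:
        if idx != prev_id and idx != BLANK_ID:      # collapse repeats, drop blanks
            chars.append(ALPHABET[idx])
        prev_id = idx
    return "".join(chars)


if __name__ == "__main__":
    # Toy example: 6 frames favouring "a", "a", blank, "b", space, "c"
    frame_ids = [2, 2, 0, 3, 1, 4]
    toy = np.full((6, len(ALPHABET)), -10.0)
    toy[np.arange(6), frame_ids] = 0.0
    print(greedy_ctc_decode(toy))  # -> "ab c"
```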
For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying article.
```python
from tone import StreamingCTCPipeline, read_stream_example_audio

pipeline = StreamingCTCPipeline.from_hugging_face()

state = None  # Current state of the ASR pipeline (None - initial)
for audio_chunk in read_stream_example_audio():  # Use any source of audio chunks
    new_phrases, state = pipeline.forward(audio_chunk, state)
    print(new_phrases)

# Finalize the pipeline and get the remaining phrases
new_phrases, _ = pipeline.finalize(state)
print(new_phrases)
```

Word Error Rate (WER) is used to evaluate the quality of automatic speech recognition systems; it can be interpreted as the percentage of incorrectly recognized words relative to a reference transcript.
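For reference, WER is the word-level edit distance between the hypothesis and the reference, normalized by the reference length: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference (this is the standard definition, not something specific to this repository). Lower is better.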
| Category | T-one (71M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
|---|---|---|---|---|---|---|
| Call-center | 8.63 | 10.22 | 10.57 | 11.28 | 15.53 | 19.39 |
| Other telephony | 6.20 | 7.88 | 8.15 | 8.69 | 13.49 | 17.29 |
| Named entities | 5.83 | 9.55 | 9.81 | 12.12 | 17.65 | 17.87 |
| CommonVoice 19 (test split) | 5.32 | 2.68 | 3.14 | 6.22 | 11.3 | 5.78 |
| OpenSTT asr_calls_2_val original | 20.27 | 20.07 | 21.24 | 22.64 | 29.45 | 29.02 |
| OpenSTT asr_calls_2_val re-labeled | 7.94 | 11.14 | 12.43 | 13.22 | 21.03 | 20.82 |
Below is the model throughput, as reported by trtexec during export to a TensorRT engine, across different GPUs:
| Device | Configuration | Max & Optimal Batch Size | RPS (Requests Per Second) | SPS (Seconds Per Second) |
|---|---|---|---|---|
| T4 | TensorRT | 32 | 5952 | 1786 |
| A30 | TensorRT | 128 | 17408 | 5222 |
| A100 | TensorRT | 256 | 26112 | 7833 |
| H100 | TensorRT | 1024 | 57344 | 17203 |
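Assuming each request corresponds to a single 300 ms audio chunk (the chunk size used by the pipeline), SPS follows directly from RPS as SPS ≈ 0.3 × RPS (for example, 5952 × 0.3 ≈ 1786 on the T4), so SPS can be read as seconds of audio processed per second of wall-clock time.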
Find more details on the performance metrics calculation method here.
The code and models in this repository are released under the Apache 2.0 License.