Sherpa
Release 1.3
1 Introduction
2 Download pdf
3 Social groups
  3.1 WeChat
  3.2 QQ
  3.3 Bilibili (B站)
  3.4 YouTube
4 Run Next-gen Kaldi in your browser
5 Pre-trained models
  5.1 Pre-trained models for different projects
  5.2 How to download
6 sherpa
  6.1 Installation
  6.2 Pre-trained models
7 sherpa-ncnn
  7.1 Tutorials
  7.2 Installation
  7.3 Python API
  7.4 WebAssembly
  7.5 C API
  7.6 Endpointing
  7.7 Android
  7.8 iOS
  7.9 Pre-trained models
  7.10 Examples
  7.11 FAQs
8 sherpa-onnx
  8.1 Tutorials
  8.2 Installation
  8.3 Frequently Asked Questions (FAQs)
  8.4 Python
  8.5 C API
  8.6 Java API
  8.7 Javascript API
  8.8 Kotlin API
  8.9 Swift API
  8.10 Go API
  8.11 C# API
  8.12 Pascal API
  8.13 Lazarus
  8.14 WebAssembly
  8.15 Android
  8.16 iOS
  8.17 Flutter
  8.18 WebSocket
  8.19 Hotwords (Contextual biasing)
  8.20 Keyword spotting
  8.21 Punctuation
  8.22 Audio tagging
  8.23 Spoken language identification
  8.24 VAD
  8.25 Pre-trained models
  8.26 Moonshine
  8.27 SenseVoice
  8.28 Speaker Diarization
  8.29 Speaker Identification
  8.30 Text-to-speech (TTS)
9 Triton
  9.1 Installation
  9.2 Triton-server
  9.3 Triton-client
  9.4 Perf Analyzer
  9.5 TensorRT acceleration
CHAPTER ONE
INTRODUCTION
OS Support
• sherpa: Linux, Windows, macOS
• sherpa-onnx: Linux, Windows, macOS, iOS, Android
• sherpa-ncnn: Linux, Windows, macOS, iOS, Android

Supported functions
• sherpa: streaming speech recognition, non-streaming speech recognition
• sherpa-onnx: streaming speech recognition, non-streaming speech recognition, text-to-speech, speaker diarization, speaker identification, speaker verification, spoken language identification, audio tagging, VAD, keyword spotting
• sherpa-ncnn: streaming speech recognition, VAD
CHAPTER TWO
DOWNLOAD PDF
Hint: All Chinese-related content is excluded from the PDF file.
Note: For Chinese users, you can use the following mirror:
https://hf-mirror.com/csukuangfj/sherpa-doc/blob/main/sherpa.pdf
CHAPTER THREE
SOCIAL GROUPS
3.1 WeChat
If you have a WeChat account, you can scan the QR code shown in the online documentation to join the WeChat group of next-gen Kaldi to get help.
3.2 QQ
3.3 Bilibili (B站)
3.4 YouTube
To get the latest news of next-gen Kaldi, please subscribe to the following YouTube channel by Nadira Povey:
https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw
CHAPTER FOUR
RUN NEXT-GEN KALDI IN YOUR BROWSER
Hint: You don’t need to download or install anything. All you need is a browser.
The server runs on CPU within a docker container provided by Huggingface, and you use a browser to interact with it. The browser can run on Windows, macOS, Linux, or even on your phone or iPad.
You can upload a file for recognition, record your speech via a microphone from within the browser and submit it for recognition, or even provide a URL to an audio file for speech recognition.
Now let’s get started.
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://hf-mirror.com/spaces/k2-fsa/automatic-speech-recognition
You can:
1. Select a language for recognition. Currently, we provide pre-trained models from icefall for the following languages: Chinese, English, and Chinese+English.
2. After selecting the target language, you can select a pre-trained model corresponding to the language.
3. Select the decoding method. Currently, it provides greedy search and modified_beam_search.
4. If you selected modified_beam_search, you can choose the number of active paths during the search.
5. Either upload a file or record your speech for recognition.
6. Click the button Submit for recognition.
7. Wait for a moment and you will get the recognition results.
The following screenshot shows an example when selecting Chinese+English:
In the bottom part of the page, you can find a table of examples. You can click one of them and then click Submit for
recognition.
Note: To get the latest news of next-gen Kaldi, please subscribe to the following YouTube channel by Nadira Povey:
https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw
https://youtu.be/ElN3r9dkKE4
CHAPTER FIVE
PRE-TRAINED MODELS
We are hosting our pre-trained models on Huggingface as git repositories managed by Git LFS.
There are at least two methods for downloading:
• Using git lfs
• Using wget
In the following, we use the pre-trained model pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615 (Chinese) as an example.
# apt/deb
sudo apt-get install git-lfs
# yum/rpm
sudo yum install git-lfs
# Linux/macOS
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615
cd icefall-asr-zipformer-streaming-wenetspeech-20230615
git lfs pull --include "exp/*chunk-16-left-128.*onnx"

# Windows (PowerShell)
$env:GIT_LFS_SKIP_SMUDGE="1"
git clone https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615
cd icefall-asr-zipformer-streaming-wenetspeech-20230615
git lfs pull --include "exp/*chunk-16-left-128.*onnx"

# Windows (cmd)
set GIT_LFS_SKIP_SMUDGE="1"
git clone https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615
cd icefall-asr-zipformer-streaming-wenetspeech-20230615
git lfs pull --include "exp/*chunk-16-left-128.*onnx"
Note: It is very important to set the environment variable GIT_LFS_SKIP_SMUDGE to 1. We don’t recommend using
git lfs install as it will download many large files that we don’t need.
First, let us visit the huggingface git repository of the pre-trained model:
Click Files and versions and navigate to the directory containing files for downloading:
Right-click the arrow that indicates downloading and copy the link address. After that, you can use, for instance, wget
to download the file with the following command:
wget https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615/resolve/main/exp/decoder-epoch-12-avg-4-chunk-16-left-128.int8.onnx
Repeat the process until you have downloaded all the required files.
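If you script downloads in Python, the huggingface_hub package offers the same thing. This is our suggestion and is not part of the original instructions; the file name below is the one from the wget example above:

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615",
    filename="exp/decoder-epoch-12-avg-4-chunk-16-left-128.int8.onnx",
)
print(path)  # local path of the cached file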
CHAPTER SIX
SHERPA
Hint: During speech recognition, sherpa does not need to access the Internet. Everything is processed locally on your device.
6.1 Installation
Note: This method supports only Linux and macOS for now. If you want to use Windows, please refer to From source.
Linux (CPU)
https://huggingface.co/csukuangfj/kaldifeat/resolve/main/ubuntu-cpu/k2_sherpa-1.3.dev20230725+cpu.torch2.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
macOS (CPU)
Linux (CUDA)
Install dependencies
Before installing k2-fsa/sherpa from source, we have to install the following dependencies.
• PyTorch
• k2
• kaldifeat
CPU
CUDA
Suppose that we select torch==2.0.1. We can use the following commands to install the dependencies:
Linux
macOS
Windows
To be done.
Suppose that we select torch==2.0.1+cu117. We can use the following commands to install the dependencies:
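The concrete pip commands for this step did not survive in this copy. As a rough sketch (the wheel index URLs follow the k2 and kaldifeat installation pages; exact version pins are assumptions), the CPU setup on Linux could look like:

pip install torch==2.0.1
pip install k2 -f https://k2-fsa.github.io/k2/cpu.html
pip install kaldifeat -f https://csukuangfj.github.io/kaldifeat/cpu.html

For torch==2.0.1+cu117, install the matching +cu117 builds of k2 and kaldifeat from the corresponding CUDA wheel pages instead.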
# Please make sure you have installed PyTorch, k2, and kaldifeat
# before you continue
#
git clone https://github.com/k2-fsa/sherpa
cd sherpa
python3 -m pip install --verbose .
mkdir build
cd build
cmake ..
make -j
export PATH=$PWD/bin:$PATH
export PYTHONPATH=$PWD/lib:$PWD/../sherpa/python:$PYTHONPATH
sherpa-online --help
sherpa-offline --help
sherpa-online-microphone --help
sherpa-online-websocket-server --help
sherpa-online-websocket-client --help
sherpa-online-websocket-client-microphone --help
sherpa-offline-websocket-server --help
sherpa-offline-websocket-client --help
Congratulations! You have installed k2-fsa/sherpa successfully. Please refer to Pre-trained models to download pre-
trained models.
Have fun with k2-fsa/sherpa!
We suggest that you install k2-fsa/sherpa by following From pre-compiled wheels.
If you have any issues with the installation, please create an issue at the following address:
https://github.com/k2-fsa/sherpa/issues
Hint: For transducer-based models, we only support stateless transducers. To the best of our knowledge, only icefall
supports that. In other words, only transducer models from icefall are currently supported.
For CTC-based models, we support any type of model trained using CTC loss, as long as you can export the model via TorchScript. Models from the following frameworks are currently supported: icefall, WeNet, and torchaudio (Wav2Vec 2.0). If you have a CTC model and want it to be supported in k2-fsa/sherpa, please create an issue at https://github.com/k2-fsa/sherpa/issues.
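To illustrate what exporting via TorchScript means, here is a generic, self-contained sketch; the model below is a hypothetical stand-in, not a real icefall or WeNet model:

import torch


class TinyCTCModel(torch.nn.Module):
    # A hypothetical stand-in for a real CTC acoustic model.
    def __init__(self, feat_dim: int = 80, vocab_size: int = 500):
        super().__init__()
        self.linear = torch.nn.Linear(feat_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, feat_dim) -> log-probs of shape (N, T, vocab_size)
        return torch.nn.functional.log_softmax(self.linear(x), dim=-1)


model = TinyCTCModel()
model.eval()
torch.jit.script(model).save("cpu_jit.pt")  # the file passed to --nn-model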
Hint: You can try the pre-trained models in your browser without installing anything. See https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition.
This page lists all available pre-trained models that you can download.
• Tibetan
Hint: We provide a colab notebook for you to try offline recognition step by step.
It shows how to install sherpa and use it as an offline recognizer, which supports models from icefall, the WeNet framework, and torchaudio.
This section lists pre-trained CTC models from the following frameworks:
icefall
Hint: We use the binary sherpa-offline below for demonstration. You can replace sherpa-offline with
sherpa-offline-websocket-server.
icefall-asr-gigaspeech-conformer-ctc (English)
cd icefall-asr-gigaspeech-conformer-ctc
git lfs pull --include "exp/cpu_jit.pt"
git lfs pull --include "data/lang_bpe_500/HLG.pt"
git lfs pull --include "data/lang_bpe_500/tokens.txt"
mkdir test_wavs
cd test_wavs
wget https://huggingface.co/csukuangfj/wav2vec2.0-torchaudio/resolve/main/test_wavs/1089-134686-0001.wav
wget https://huggingface.co/csukuangfj/wav2vec2.0-torchaudio/resolve/main/test_wavs/1221-135766-0001.wav
wget https://huggingface.co/csukuangfj/wav2vec2.0-torchaudio/resolve/main/test_wavs/1221-135766-0002.wav
cd ..
# Decode with H
sherpa-offline \
--nn-model=./exp/cpu_jit.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 (English)
cd icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09
# Decode with H
sherpa-offline \
--nn-model=./exp/cpu_jit.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
--use-gpu=false \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-tedlium3-conformer-ctc2 (English)
cd icefall-asr-tedlium3-conformer-ctc2
git lfs pull --include "exp/cpu_jit.pt"
# Decode with H
sherpa-offline \
--nn-model=./exp/cpu_jit.pt \
--tokens=./data/lang_bpe/tokens.txt \
./test_wavs/DanBarber_2010-219.wav \
./test_wavs/DanielKahneman_2010-157.wav \
./test_wavs/RobertGupta_2010U-15.wav
icefall_asr_librispeech_conformer_ctc (English)
cd icefall_asr_librispeech_conformer_ctc
# Decode with H
icefall_asr_aishell_conformer_ctc (Chinese)
cd icefall_asr_aishell_conformer_ctc
git lfs pull --include "exp/cpu_jit.pt"
git lfs pull --include "data/lang_char/HLG.pt"
icefall-asr-mgb2-conformer_ctc-2022-27-06 (Arabic)
cd icefall-asr-mgb2-conformer_ctc-2022-27-06
git lfs pull --include "exp/cpu_jit.pt"
git lfs pull --include "data/lang_bpe_5000/HLG.pt"
git lfs pull --include "data/lang_bpe_5000/tokens.txt"
WeNet
wenet-english-model (English)
sherpa-offline \
--normalize-samples=false \
--modified=true \
--nn-model=./final.zip \
--tokens=./units.txt \
--use-gpu=false \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
wenet-chinese-model (Chinese)
sherpa-offline \
--normalize-samples=false \
--modified=true \
--nn-model=./final.zip \
--tokens=./units.txt \
./test_wavs/BAC009S0764W0121.wav \
./test_wavs/BAC009S0764W0122.wav \
./test_wavs/BAC009S0764W0123.wav \
./test_wavs/DEV_T0000000000.wav \
./test_wavs/DEV_T0000000001.wav \
./test_wavs/DEV_T0000000002.wav
torchaudio
wav2vec2_asr_base (English)
sherpa-offline \
--nn-model=wav2vec2_asr_base_10m.pt \
--tokens=tokens.txt \
--use-gpu=false \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
voxpopuli_asr_base (German)
sherpa-offline \
--nn-model=voxpopuli_asr_base_10k_de.pt \
NeMo
sherpa-nemo-ctc-en-citrinet-512 (English)
cd sherpa-nemo-ctc-en-citrinet-512
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=false \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 kuangfangjun root 142M Mar 9 21:23 model.pt
sherpa-nemo-ctc-zh-citrinet-512 (Chinese)
cd sherpa-nemo-ctc-zh-citrinet-512
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=true \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 kuangfangjun root 153M Mar 10 15:07 model.pt
Hint: Since the vocabulary size of this model is very large, i.e., 5207, we use --modified=true to use a modified CTC topology.
sherpa-nemo-ctc-zh-citrinet-1024-gamma-0-25 (Chinese)
cd sherpa-nemo-ctc-zh-citrinet-1024-gamma-0-25
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=true \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 kuangfangjun root 557M Mar 10 16:29 model.pt
Hint: Since the vocabulary size of this model is very large, i.e., 5207, we use --modified=true to use a modified CTC topology.
sherpa-nemo-ctc-de-citrinet-1024 (German)
cd sherpa-nemo-ctc-de-citrinet-1024
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=false \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 kuangfangjun root 541M Mar 10 16:55 model.pt
sherpa-nemo-ctc-en-conformer-small (English)
cd sherpa-nemo-ctc-en-conformer-small
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=false \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 fangjun staff 82M Mar 10 19:55 model.pt
sherpa-nemo-ctc-en-conformer-medium (English)
cd sherpa-nemo-ctc-en-conformer-medium
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=false \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 fangjun staff 152M Mar 10 20:26 model.pt
sherpa-nemo-ctc-en-conformer-large (English)
cd sherpa-nemo-ctc-en-conformer-large
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=false \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 fangjun staff 508M Mar 10 20:44 model.pt
sherpa-nemo-ctc-de-conformer-large (German)
cd sherpa-nemo-ctc-de-conformer-large
git lfs pull --include "model.pt"
sherpa-offline \
--nn-model=./model.pt \
--tokens=./tokens.txt \
--use-gpu=false \
--modified=false \
--nemo-normalize=per_feature \
./test_wavs/0.wav \
./test_wavs/1.wav \
./test_wavs/2.wav
ls -lh model.pt
-rw-r--r-- 1 fangjun staff 508M Mar 10 21:34 model.pt
This section describes how to export NeMo pre-trained CTC models to sherpa.
You can find a list of pre-trained models from NeMo by visiting:
https://catalog.ngc.nvidia.com/orgs/nvidia/collections/nemo_asr.
Let us take stt_en_conformer_ctc_small as an example.
You can use the following code to obtain model.pt and tokens.txt:
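The snippet itself is not reproduced in this copy. A minimal sketch, assuming NeMo's from_pretrained/export API and that the decoder exposes its vocabulary, is:

import nemo.collections.asr as nemo_asr

m = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")
m.export("model.pt")  # TorchScript export

# sherpa expects the blank token to have ID 0, so write it first
# (see the note below about shifting the log_prob columns).
with open("tokens.txt", "w", encoding="utf-8") as f:
    f.write("<blk> 0\n")
    for i, token in enumerate(m.decoder.vocabulary):
        f.write(f"{token} {i + 1}\n")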
One thing to note is that the blank token has the largest token ID in NeMo, whereas it is always 0 in sherpa. During network computation, we therefore shift the last column of the log_prob tensor to the first column so that it matches sherpa's convention of using 0 for the blank.
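As an illustration of the idea (a sketch, not the actual sherpa code), the shift amounts to:

import torch

# Dummy log-probabilities of shape (batch, time, vocab); in NeMo the
# blank occupies the last column, while sherpa expects it first.
log_prob = torch.randn(1, 5, 4).log_softmax(dim=-1)

# Rolling by one along the vocab dimension moves the last column to index 0.
shifted = torch.roll(log_prob, shifts=1, dims=-1)
assert torch.equal(shifted[..., 0], log_prob[..., -1])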
You can find the exported model.pt and tokens.txt by visiting
https://huggingface.co/csukuangfj/sherpa-nemo-ctc-en-conformer-small
Hint: We use the binary sherpa-offline below for demonstration. You can replace sherpa-offline with
sherpa-offline-websocket-server.
icefall
English
icefall-asr-librispeech-zipformer-2023-05-15
cd icefall-asr-librispeech-zipformer-2023-05-15
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/jit_script.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-librispeech-zipformer-small-2023-05-16
cd icefall-asr-librispeech-zipformer-small-2023-05-16
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/jit_script.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-librispeech-zipformer-large-2023-05-16
cd icefall-asr-librispeech-zipformer-large-2023-05-16
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/jit_script.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04
# This model is trained using GigaSpeech + LibriSpeech + Common Voice 13.0 with zipformer
#
# See https://github.com/k2-fsa/icefall/pull/1010
#
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/yfyeung/icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04
cd icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04
git lfs pull --include "exp/cpu_jit-epoch-30-avg-4.pt"
cd exp
ln -s cpu_jit-epoch-30-avg-4.pt cpu_jit.pt
cd ..
icefall-asr-librispeech-pruned-transducer-stateless8-2022-12-02
cd icefall-asr-librispeech-pruned-transducer-stateless8-2022-12-02
git lfs pull --include "exp/cpu_jit-torch-1.10.pt"
git lfs pull --include "data/lang_bpe_500/LG.pt"
cd exp
rm cpu_jit.pt
ln -sv cpu_jit-torch-1.10.pt cpu_jit.pt
cd ..
icefall-asr-librispeech-pruned-transducer-stateless8-2022-11-14
cd icefall-asr-librispeech-pruned-transducer-stateless8-2022-11-14
git lfs pull --include "exp/cpu_jit.pt"
git lfs pull --include "data/lang_bpe_500/LG.pt"
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13
cd icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13
git lfs pull --include "exp/cpu_jit.pt"
git lfs pull --include "data/lang_bpe_500/LG.pt"
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-gigaspeech-pruned-transducer-stateless2
cd icefall-asr-gigaspeech-pruned-transducer-stateless2
git lfs pull --include "exp/cpu_jit-iter-3488000-avg-15.pt"
git lfs pull --include "data/lang_bpe_500/bpe.model"
cd exp
ln -s cpu_jit-iter-3488000-avg-15.pt cpu_jit.pt
cd ..
# Since this repo does not provide tokens.txt, we generate it from bpe.model
# by ourselves
/path/to/sherpa/scripts/bpe_model_to_tokens.py ./data/lang_bpe_500/bpe.model > ./data/lang_bpe_500/tokens.txt
mkdir test_wavs
cd test_wavs
wget https://huggingface.co/csukuangfj/wav2vec2.0-torchaudio/resolve/main/test_wavs/1089-134686-0001.wav
wget https://huggingface.co/csukuangfj/wav2vec2.0-torchaudio/resolve/main/test_wavs/1221-135766-0001.wav
wget https://huggingface.co/csukuangfj/wav2vec2.0-torchaudio/resolve/main/test_wavs/1221-135766-0002.wav
Chinese
icefall_asr_wenetspeech_pruned_transducer_stateless2
cd icefall_asr_wenetspeech_pruned_transducer_stateless2
git lfs pull --include "exp/cpu_jit_epoch_10_avg_2_torch_1.7.1.pt"
git lfs pull --include "data/lang_char/LG.pt"
cd exp
ln -s cpu_jit_epoch_10_avg_2_torch_1.7.1.pt cpu_jit.pt
cd ..
for m in greedy_search modified_beam_search fast_beam_search; do
  sherpa-offline \
    --decoding-method=$m \
    --nn-model=./exp/cpu_jit.pt \
    --lg=./data/lang_char/LG.pt \
    --tokens=./data/lang_char/tokens.txt \
    ./test_wavs/DEV_T0000000000.wav \
    ./test_wavs/DEV_T0000000001.wav \
    ./test_wavs/DEV_T0000000002.wav
done
icefall_asr_aidatatang-200zh_pruned_transducer_stateless2
cd icefall_asr_aidatatang-200zh_pruned_transducer_stateless2
git lfs pull --include "exp/cpu_jit_torch.1.7.1.pt"
cd exp
ln -sv cpu_jit_torch.1.7.1.pt cpu_jit.pt
icefall-asr-alimeeting-pruned-transducer-stateless7
cd icefall-asr-alimeeting-pruned-transducer-stateless7
Chinese + English
icefall_asr_tal-csasr_pruned_transducer_stateless5
cd icefall_asr_tal-csasr_pruned_transducer_stateless5
git lfs pull --include "exp/cpu_jit.pt"
Tibetan
icefall-asr-xbmu-amdo31-pruned-transducer-stateless7-2022-12-02
cd icefall-asr-xbmu-amdo31-pruned-transducer-stateless7-2022-12-02
git lfs pull --include "exp/cpu_jit.pt"
git lfs pull --include "data/lang_bpe_500/LG.pt"
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/a_0_cacm-A70_31116.wav \
./test_wavs/a_0_cacm-A70_31117.wav \
./test_wavs/a_0_cacm-A70_31118.wav
icefall-asr-xbmu-amdo31-pruned-transducer-stateless5-2022-11-29
cd icefall-asr-xbmu-amdo31-pruned-transducer-stateless5-2022-11-29
git lfs pull --include "data/lang_bpe_500/LG.pt"
git lfs pull --include "data/lang_bpe_500/tokens.txt"
git lfs pull --include "exp/cpu_jit-epoch-28-avg-23-torch-1.10.0.pt"
git lfs pull --include "test_wavs/a_0_cacm-A70_31116.wav"
git lfs pull --include "test_wavs/a_0_cacm-A70_31117.wav"
git lfs pull --include "test_wavs/a_0_cacm-A70_31118.wav"
cd exp
rm cpu_jit.pt
ln -sv cpu_jit-epoch-28-avg-23-torch-1.10.0.pt cpu_jit.pt
cd ..
sherpa-offline \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/a_0_cacm-A70_31116.wav \
./test_wavs/a_0_cacm-A70_31117.wav \
./test_wavs/a_0_cacm-A70_31118.wav
Hint: We use the binary sherpa-online below for demonstration. You can replace sherpa-online with
sherpa-online-websocket-server and sherpa-online-microphone.
Hint: At present, only streaming transducer models from icefall are supported.
icefall
English
icefall-asr-librispeech-streaming-zipformer-2023-05-17
cd icefall-asr-librispeech-streaming-zipformer-2023-05-17
icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05
./build/bin/sherpa-online \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
cd exp
ln -sv encoder_jit_trace-iter-468000-avg-16.pt encoder_jit_trace.pt
ln -sv decoder_jit_trace-iter-468000-avg-16.pt decoder_jit_trace.pt
ln -sv joiner_jit_trace-iter-468000-avg-16.pt joiner_jit_trace.pt
cd ..
icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01
cd icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01
sherpa-online \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_wavs/1089-134686-0001.wav \
./test_wavs/1221-135766-0001.wav \
./test_wavs/1221-135766-0002.wav
icefall_librispeech_streaming_pruned_transducer_stateless4_20220625
cd icefall_librispeech_streaming_pruned_transducer_stateless4_20220625
sherpa-online \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_bpe_500/LG.pt \
--tokens=./data/lang_bpe_500/tokens.txt \
./test_waves/1089-134686-0001.wav \
./test_waves/1221-135766-0001.wav \
./test_waves/1221-135766-0002.wav
Chinese
icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming
cd icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming
sherpa-online \
--decoding-method=fast_beam_search \
--nn-model=./exp/cpu_jit.pt \
--lg=./data/lang_char/LG.pt \
--tokens=./data/lang_char/tokens.txt \
./test_wavs/DEV_T0000000000.wav \
./test_wavs/DEV_T0000000001.wav \
./test_wavs/DEV_T0000000002.wav
pfluo/k2fsa-zipformer-chinese-english-mixed
cd k2fsa-zipformer-chinese-english-mixed
git lfs pull --include "exp/cpu_jit.pt"
icefall-asr-conv-emformer-transducer-stateless2-zh
It is a ConvEmformer model.
cd icefall-asr-conv-emformer-transducer-stateless2-zh
git lfs pull --include "exp/cpu_jit-epoch-11-avg-1.pt"
cd exp
ln -sv cpu_jit-epoch-11-avg-1.pt cpu_jit.pt
cd ..
CHAPTER SEVEN
SHERPA-NCNN
Hint: During speech recognition, sherpa-ncnn does not need to access the Internet. Everything is processed locally on your device.
We support using ncnn to replace PyTorch for neural network computation. The code lives in a separate repository, sherpa-ncnn.
sherpa-ncnn is self-contained and everything can be compiled from source.
Please refer to https://k2-fsa.github.io/icefall/model-export/export-ncnn.html for how to export models to ncnn format.
In the following, we describe how to build sherpa-ncnn for Linux, macOS, Windows, embedded systems, Android, and
iOS.
Also, we show how to use it for speech recognition with pre-trained models.
7.1 Tutorials
7.2 Installation
Hint: Please refer to Python API for its usage with Python.
In this section, we describe how to install sherpa-ncnn for the following platforms:
This section presents some videos about how to install and use sherpa-ncnn.
Windows (64-bit)
The following video shows how to install and use sherpa-ncnn on 64-bit Windows.
Thanks to https://space.bilibili.com/7990701 for his contribution.
Caution: It is in Chinese.
7.2.2 Linux
Hint: You can follow this section if you want to build sherpa-ncnn directly on your board.
After building, you will find two executables inside the bin directory:
$ ls -lh bin/
total 13M
-rwxr-xr-x 1 kuangfangjun root 6.5M Dec 18 11:31 sherpa-ncnn
-rwxr-xr-x 1 kuangfangjun root 6.5M Dec 18 11:31 sherpa-ncnn-microphone
That’s it!
Please read Pre-trained models for usages about the generated binaries.
Read below if you want to learn more.
You can strip the binaries by
$ strip bin/sherpa-ncnn
$ strip bin/sherpa-ncnn-microphone
$ ls -lh bin/
total 12M
-rwxr-xr-x 1 kuangfangjun root 5.8M Dec 18 11:35 sherpa-ncnn
-rwxr-xr-x 1 kuangfangjun root 5.8M Dec 18 11:36 sherpa-ncnn-microphone
Hint: By default, all external dependencies are statically linked. That means, the generated binaries are self-contained.
You can use the following commands to check that and you will find they depend only on system libraries.
$ readelf -d bin/sherpa-ncnn
$ readelf -d bin/sherpa-ncnn-microphone
7.2.3 macOS
After building, you will find two executables inside the bin directory:
$ ls -lh bin/
total 24232
-rwxr-xr-x 1 fangjun staff 5.9M Dec 18 12:39 sherpa-ncnn
-rwxr-xr-x 1 fangjun staff 6.0M Dec 18 12:39 sherpa-ncnn-microphone
That’s it!
Please read Pre-trained models for usages about the generated binaries.
Read below if you want to learn more.
You can strip the binaries by
$ strip bin/sherpa-ncnn
$ strip bin/sherpa-ncnn-microphone
$ ls -lh bin/
total 23000
-rwxr-xr-x 1 fangjun staff 5.6M Dec 18 12:40 sherpa-ncnn
-rwxr-xr-x 1 fangjun staff 5.6M Dec 18 12:40 sherpa-ncnn-microphone
Hint: By default, all external dependencies are statically linked. That means, the generated binaries are self-contained.
You can use the following commands to check that and you will find they depend only on system libraries.
$ otool -L bin/sherpa-ncnn
bin/sherpa-ncnn:
    /usr/local/opt/libomp/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)

$ otool -L bin/sherpa-ncnn-microphone
bin/sherpa-ncnn-microphone:
    /System/Library/Frameworks/CoreAudio.framework/Versions/A/CoreAudio (compatibility version 1.0.0, current version 1.0.0)
    /System/Library/Frameworks/AudioToolbox.framework/Versions/A/AudioToolbox (compatibility version 1.0.0, current version 1000.0.0)
    /System/Library/Frameworks/AudioUnit.framework/Versions/A/AudioUnit (compatibility version 1.0.0, current version 1.0.0)
    /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1677.104.0)
    /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices (compatibility version 1.0.0, current version 1069.24.0)
7.2.4 Windows
Hint: MinGW is known not to work. Please install Visual Studio before you continue.
# Please select one toolset among VS 2015, 2017, 2019, and 2022 below
# We use VS 2022 as an example.
This page describes how to build sherpa-ncnn for embedded Linux (arm, 32-bit) with cross-compiling on an x86
machine with Ubuntu OS.
Caution: If you want to build sherpa-ncnn directly on your board, please don’t use this document. Refer to Linux
instead.
Install toolchain
Warning: You can use any toolchain that is suitable for your platform. The toolchain we use below is just an
example.
Visit https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-a/downloads/8-3-2019-03 to download the toolchain.
We are going to download gcc-arm-8.3-2019.03-x86_64-arm-linux-gnueabihf.tar.xz, which has been uploaded to https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains.
Assume you want to install it in the folder $HOME/software:

mkdir -p $HOME/software
cd $HOME/software
wget https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains/resolve/main/gcc-arm-8.3-2019.03-x86_64-arm-linux-gnueabihf.tar.xz
tar xvf gcc-arm-8.3-2019.03-x86_64-arm-linux-gnueabihf.tar.xz
export PATH=$HOME/software/gcc-arm-8.3-2019.03-x86_64-arm-linux-gnueabihf/bin:$PATH
To check that we have installed the cross-compiling toolchain successfully, please run:
arm-linux-gnueabihf-gcc --version
Build sherpa-ncnn
$ ls -lh build-arm-linux-gnueabihf/install/bin/
total 6.6M
-rwxr-xr-x 1 kuangfangjun root 2.2M Jan 14 21:46 sherpa-ncnn
-rwxr-xr-x 1 kuangfangjun root 2.2M Jan 14 21:46 sherpa-ncnn-alsa
That’s it!
Hint:
• sherpa-ncnn is for decoding a single file
• sherpa-ncnn-alsa is for real-time speech recognition by reading the microphone with ALSA
Caution: We recommend that you use sherpa-ncnn-alsa on embedded systems such as Raspberry Pi.
You need to provide a device_name when invoking sherpa-ncnn-alsa. We describe below how to find the
device name for your microphone.
Run the following command:
arecord -l
to list all available microphones for recording. If it complains that arecord: command not found, please use
sudo apt-get install alsa-utils to install it.
If the above command gives the following output:
**** List of CAPTURE Hardware Devices ****
card 0: Audio [Axera Audio], device 0: 49ac000.i2s_mst-es8328-hifi-analog es8328-hifi-analog-0 []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
In this case, I only have 1 microphone. It is card 0 and that card has only device 0. To select card 0
and device 0 on that card, we need to pass plughw:0,0 to sherpa-ncnn-alsa. (Note: It has the format
plughw:card_number,device_index.)
For instance, you have to use
# Note: We use int8 models for encoder and joiner below.
./bin/sherpa-ncnn-alsa \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.int8.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.int8.bin \
  "plughw:0,0"
Please change the card number and also the device index on the selected card accordingly in your own situation.
Otherwise, you won’t be able to record with your microphone.
Please read Pre-trained models for usages about the generated binaries.
Read below if you want to learn more.
Hint: By default, all external dependencies are statically linked. That means, the generated binaries are self-contained.
You can use the following commands to check that and you will find they depend only on system libraries.
$ readelf -d build-arm-linux-gnueabihf/install/bin/sherpa-ncnn
$ readelf -d build-arm-linux-gnueabihf/install/bin/sherpa-ncnn-alsa
This page describes how to build sherpa-ncnn for embedded Linux (aarch64, 64-bit) with cross-compiling on an x86
machine with Ubuntu OS.
Caution: If you want to build sherpa-ncnn directly on your board, please don’t use this document. Refer to Linux
instead.
Install toolchain
Warning: You can use any toolchain that is suitable for your platform. The toolchain we use below is just an
example.
mkdir -p $HOME/software
cd $HOME/software
wget https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains/resolve/main/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
tar xvf gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
export PATH=$HOME/software/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu/bin:$PATH
To check that we have installed the cross-compiling toolchain successfully, please run:
aarch64-linux-gnu-gcc --version
Build sherpa-ncnn
$ ls -lh build-aarch64-linux-gnu/install/bin/
total 10M
-rwxr-xr-x 1 kuangfangjun root 3.4M Jan 13 21:16 sherpa-ncnn
-rwxr-xr-x 1 kuangfangjun root 3.4M Jan 13 21:16 sherpa-ncnn-alsa
That’s it!
Hint:
• sherpa-ncnn is for decoding a single file
• sherpa-ncnn-alsa is for real-time speech recognition by reading the microphone with ALSA
sherpa-ncnn-alsa
Caution: We recommend that you use sherpa-ncnn-alsa on embedded systems such as Raspberry Pi.
You need to provide a device_name when invoking sherpa-ncnn-alsa. We describe below how to find the
device name for your microphone.
Run the following command:
arecord -l
to list all available microphones for recording. If it complains that arecord: command not found, please use
sudo apt-get install alsa-utils to install it.
If the above command gives the following output:
**** List of CAPTURE Hardware Devices ****
card 3: UACDemoV10 [UACDemoV1.0], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
In this case, I only have 1 microphone. It is card 3 and that card has only device 0. To select card 3
and device 0 on that card, we need to pass plughw:3,0 to sherpa-ncnn-alsa. (Note: It has the format
plughw:card_number,device_index.)
For instance, you have to use
./bin/sherpa-ncnn-alsa \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.bin \
  "plughw:3,0"
Please change the card number and also the device index on the selected card accordingly in your own situation.
Otherwise, you won’t be able to record with your microphone.
Please read Pre-trained models for usages about the generated binaries.
Hint: If you want to select a pre-trained model for Raspberry that can be run on real-time, we recommend you to use
marcoyang/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09 (English).
Hint: By default, all external dependencies are statically linked. That means, the generated binaries are self-contained.
You can use the following commands to check that and you will find they depend only on system libraries.
$ readelf -d build-aarch64-linux-gnu/install/bin/sherpa-ncnn
$ readelf -d build-aarch64-linux-gnu/install/bin/sherpa-ncnn-alsa
This page describes how to build sherpa-ncnn for embedded Linux (RISC-V, 64-bit) with cross-compiling on an x64
machine with Ubuntu OS.
Hint: We provide a colab notebook for you to try this section step by step.
If you are using Windows/macOS or you don’t want to setup your local environment for cross-compiling, please use
the above colab notebook.
Install toolchain
To check that you have installed the toolchain successfully, please run
$ riscv64-linux-gnu-gcc --version
riscv64-linux-gnu-gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
$ riscv64-linux-gnu-g++ --version
riscv64-linux-gnu-g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Build sherpa-ncnn
$ ls -lh build-riscv64-linux-gnu/install/bin/
total 3.8M
-rwxr-xr-x 1 kuangfangjun root 1.9M May 23 22:12 sherpa-ncnn
-rwxr-xr-x 1 kuangfangjun root 1.9M May 23 22:12 sherpa-ncnn-alsa
That’s it!
Hint:
• sherpa-ncnn is for decoding a single file
• sherpa-ncnn-alsa is for real-time speech recognition by reading the microphone with ALSA
Please read Pre-trained models for usages about the generated binaries.
Hint: If you want to select a pre-trained model for VisionFive 2 that can run in real time, we recommend csukuangfj/sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16 (Bilingual, Chinese + English).
You can use the following command with the above model:
./sherpa-ncnn \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/joiner_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/5.wav \
  4 \
  greedy_search
Hint: By default, all external dependencies are statically linked. That means, the generated binaries are self-contained.
You can use the following commands to check that and you will find they depend only on system libraries.
$ readelf -d build-riscv64-linux-gnu/install/bin/sherpa-ncnn
$ readelf -d build-riscv64-linux-gnu/install/bin/sherpa-ncnn-alsa
Hint: It is known to work for Python >= 3.6 on Linux, macOS, and Windows.
7.3.1 Installation
You can use one of the four methods below to install the Python package sherpa-ncnn:
Method 1
Hint: This method supports x86_64, arm64 (e.g., Mac M1, 64-bit Raspberry Pi), and arm32 (e.g., 32-bit Raspberry
Pi).
If you use Method 1, it will install pre-compiled libraries. The disadvantage is that it may not be optimized for
your platform, while the advantage is that you don’t need to install cmake or a C++ compiler.
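For reference, Method 1 boils down to installing the pre-built wheel from PyPI:

pip install sherpa-ncnn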
For the following methods, you have to first install:
• cmake, which can be installed using pip install cmake
• A C++ compiler, e.g., GCC on Linux and macOS, Visual Studio on Windows
Method 2
Method 3
x86/x86_64
32-bit ARM
64-bit ARM
git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
-DCMAKE_C_FLAGS="-march=armv7-a -mfloat-abi=hard -mfpu=neon" \
-DCMAKE_CXX_FLAGS="-march=armv7-a -mfloat-abi=hard -mfpu=neon" \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
-DCMAKE_C_FLAGS="-march=armv8-a" \
-DCMAKE_CXX_FLAGS="-march=armv8-a" \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
Hint: If you use Method 1, Method 2, or Method 3, you can also use
Next, we describe how to use the sherpa-ncnn Python API for speech recognition:
• (1) Real-time speech recognition with a microphone
• (2) Recognize a file
The following Python code shows how to use the sherpa-ncnn Python API for real-time speech recognition with a microphone.
Hint: We use sounddevice for recording. Please run pip install sounddevice before you run the code below.
import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)
def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer
def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(
        channels=1, dtype="float32", samplerate=sample_rate
    ) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print(result)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
Code explanation:
1. Import the required packages

import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn
2. Create the recognizer

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer
def main():
print("Started! Please speak")
recognizer = create_recognizer()
Hint: The above example uses a float16 encoder and joiner. You can also use the following code to switch to the 8-bit (i.e., int8) quantized encoder and joiner.

recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin",
    num_threads=4,
)
3. Start recording
sample_rate = recognizer.sample_rate

with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
Note that:
• We set channels to 1 since the model supports only a single channel
• We use dtype float32 so that the resulting audio samples are normalized to the range [-1, 1].
• The sampling rate has to be recognizer.sample_rate, which is 16 kHz for all models at present.
Note that:
• It reads 100 ms of audio samples at a time. You can choose a larger value, e.g., 200 ms.
• No queue or callback is used. Instead, we use a blocking read here.
• The samples array is reshaped to a 1-D array
Note that:
• samples has to be a 1-D tensor and should be normalized to the range [-1, 1].
• Upon accepting the audio samples, the recognizer starts the decoding automatically. There is no separate call for
decoding.
samples = samples.reshape(-1)
recognizer.accept_waveform(sample_rate, samples)
result = recognizer.text
if last_result != result:
We use recognizer.text to get the recognition result. To avoid unnecessary output, we check whether recognizer.text contains a new result and don't print to the console if nothing new has been recognized.
That’s it!
Summary
Hint: If you don’t have access to YouTube, please see the corresponding video on bilibili (available in the online documentation).
Note: https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/speech-recognition-from-microphone-with-endpoint-detection.py supports endpoint detection. Please see the demo video in the online documentation for its usage.
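That script is the authoritative reference; the following is only a minimal sketch, assuming the Recognizer accepts the endpoint-related keyword arguments used in that example:

import sherpa_ncnn

recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
    num_threads=4,
    enable_endpoint_detection=True,  # assumed keyword, mirroring the example
    rule1_min_trailing_silence=2.4,  # seconds of silence after decoded text
    rule2_min_trailing_silence=1.2,
    rule3_min_utterance_length=300,
)

# Inside the microphone loop, after recognizer.accept_waveform(...):
#     if recognizer.is_endpoint:  # assumed property
#         print(recognizer.text)
#         recognizer.reset()      # start a new utterance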
The following Python code shows how to use sherpa-ncnn Python API to recognize a wave file.
Caution: The sampling rate of the wave file has to be 16 kHz. Also, it should contain only a single channel and
samples should be 16-bit (i.e., int16) encoded.
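If your audio is in another format, convert it first, e.g., with ffmpeg (our suggestion; not part of the original example):

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav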
import wave

import numpy as np
import sherpa_ncnn


def main():
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )

    filename = (
        "./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/1.wav"
    )
    with wave.open(filename) as f:
        assert f.getframerate() == recognizer.sample_rate, (
            f.getframerate(),
            recognizer.sample_rate,
        )
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32)
        samples_float32 = samples_float32 / 32768  # normalize to [-1, 1]

    recognizer.accept_waveform(recognizer.sample_rate, samples_float32)

    tail_paddings = np.zeros(
        int(recognizer.sample_rate * 0.5), dtype=np.float32
    )
    recognizer.accept_waveform(recognizer.sample_rate, tail_paddings)

    recognizer.input_finished()

    print(recognizer.text)


if __name__ == "__main__":
    main()
Hint: The above example uses a float16 encoder and joiner. You can also use the following code to switch to 8-bit
(i.e., int8) quantized encoder and joiner.
recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin",
    num_threads=4,
)
7.4 WebAssembly
In this section, we describe how to build sherpa-ncnn for WebAssembly so that you can run real-time speech recognition
with WebAssembly.
Please follow the steps below to build and run sherpa-ncnn for WebAssembly.
We need to compile the C/C++ files in sherpa-ncnn with the help of emscripten.
Please refer to https://emscripten.org/docs/getting_started/downloads for detailed installation instructions.
The following is an example to show you how to install it on Linux/macOS.
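The install commands themselves are not reproduced in this copy; the standard emsdk workflow from emscripten's documentation is:

git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh

You can then verify the installation with emcc -v, whose output looks like: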
emcc -v
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /Users/fangjun/open-source/emsdk/upstream/bin
7.4.2 Build
cd wasm/assets
wget -q https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13.tar.bz2
./build-wasm-simd.sh
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/lib/libncnn.a
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/./sherpa-ncnn.pc
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/lib/libsherpa-ncnn-core.a
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/lib/libsherpa-ncnn-c-api.a
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/include/sherpa-ncnn/c-api/c-api.h
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/sherpa-ncnn-wasm-main.js
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/sherpa-ncnn.js
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/app.js
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/index.html
-- Up-to-date: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/sherpa-ncnn-wasm-main.js
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/sherpa-ncnn-wasm-main.wasm
-- Installing: /Users/fangjun/open-source/sherpa-ncnn/build-wasm-simd/install/bin/wasm/sherpa-ncnn-wasm-main.data
+ ls -lh install/bin/wasm
total 280152
-rw-r--r-- 1 fangjun staff 9.0K Feb 6 15:42 app.js
-rw-r--r-- 1 fangjun staff 936B Feb 6 15:42 index.html
-rw-r--r-- 1 fangjun staff 135M Feb 6 17:06 sherpa-ncnn-wasm-main.data
-rw-r--r-- 1 fangjun staff 79K Feb 6 17:06 sherpa-ncnn-wasm-main.js
-rw-r--r-- 1 fangjun staff 1.7M Feb 6 17:06 sherpa-ncnn-wasm-main.wasm
-rw-r--r-- 1 fangjun staff 6.9K Feb 6 15:42 sherpa-ncnn.js
cd build-wasm-simd/install/bin/wasm/
python3 -m http.server 6006
Start your browser and visit http://localhost:6006/; you should see the following page:
Now click start and speak! You should see the recognition results in the text box.
Warning: We are using a bilingual model (Chinese + English) in the above example, which means you can only
speak Chinese or English in this case.
Congratulations! You have successfully run real-time speech recognition with WebAssembly in your browser.
7.4.3 Use the pre-built WebAssembly library

In this section, we describe how to use the pre-built WebAssembly library of sherpa-ncnn for real-time speech recognition.

Note: The pre-built library used in this section uses a bilingual model (Chinese + English), which is from
csukuangfj/sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13 (Bilingual, Chinese + English).
Download
Please use the following command to download the pre-built library for version v2.1.7, which is the latest release as
of 2024.02.06.
Hint: Please always use the latest release. You can visit https://github.com/k2-fsa/sherpa-ncnn/releases to find the
latest release.
wget -q https://github.com/k2-fsa/sherpa-ncnn/releases/download/v2.1.7/sherpa-ncnn-wasm-simd-v2.1.7.tar.bz2
tar xvf sherpa-ncnn-wasm-simd-v2.1.7.tar.bz2
cd sherpa-ncnn-wasm-simd-v2.1.7

Then serve the extracted files with a local HTTP server, e.g.:

python3 -m http.server 6006

Start your browser and visit http://localhost:6006/; you should see the following page:
Now click start and speak! You should see the recognition results in the text box.
Warning: We are using a bilingual model (Chinese + English) in the above example, which means you can only
speak Chinese or English in this case.
7.4.4 Huggingface Spaces

We provide two Huggingface spaces so that you can try real-time speech recognition with WebAssembly in your
browser.
English only
https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-ncnn-en
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-ncnn-en/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-ncnn/blob/master/.github/workflows/wasm-simd-hf-space-en.yaml
Chinese + English
https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-ncnn-zh-en
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-ncnn-zh-en/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-ncnn/blob/master/.github/workflows/wasm-simd-hf-space-zh-en.yaml
7.5 C API
Before using the C API of sherpa-ncnn, we first need to build the required libraries. You can choose to build either
static libraries or shared libraries.
Assume that we want to put the library files and header files in the directory /tmp/sherpa-ncnn/shared. From a fresh build directory inside the sherpa-ncnn source tree (e.g., mkdir build && cd build), run:
cmake \
-DSHERPA_NCNN_ENABLE_C_API=ON \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_INSTALL_PREFIX=/tmp/sherpa-ncnn/shared \
..
make -j6
make install
$ tree /tmp/sherpa-ncnn/shared/
/tmp/sherpa-ncnn/shared/
├── bin
│   ├── sherpa-ncnn
│   └── sherpa-ncnn-microphone
├── include
│   └── sherpa-ncnn
│       └── c-api
│           └── c-api.h
├── lib
│   ├── libkaldi-native-fbank-core.dylib
│   ├── libncnn.dylib
│   ├── libsherpa-ncnn-c-api.dylib
│   └── libsherpa-ncnn-core.dylib
└── sherpa-ncnn.pc

5 directories, 8 files
The above listing is for macOS; on Linux the shared libraries end in .so instead:

$ tree /tmp/sherpa-ncnn/shared/
/tmp/sherpa-ncnn/shared/
├── bin
│   ├── sherpa-ncnn
│   └── sherpa-ncnn-microphone
├── include
│   └── sherpa-ncnn
│       └── c-api
│           └── c-api.h
├── lib
│   ├── libkaldi-native-fbank-core.so
│   ├── libncnn.so
│   ├── libsherpa-ncnn-c-api.so
│   └── libsherpa-ncnn-core.so
└── sherpa-ncnn.pc

5 directories, 8 files
Assume that we want to put the library files and header files in the directory /tmp/sherpa-ncnn/static. From a fresh build directory (e.g., mkdir build-static && cd build-static), run:
cmake \
-DSHERPA_NCNN_ENABLE_C_API=ON \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_INSTALL_PREFIX=/tmp/sherpa-ncnn/static \
..
make -j6
make install
$ tree /tmp/sherpa-ncnn/static/
/tmp/sherpa-ncnn/static/
├── bin
│   ├── sherpa-ncnn
│   └── sherpa-ncnn-microphone
├── include
│   └── sherpa-ncnn
│       └── c-api
│           └── c-api.h
├── lib
│   ├── libkaldi-native-fbank-core.a
│   ├── libncnn.a
│   ├── libsherpa-ncnn-c-api.a
│   └── libsherpa-ncnn-core.a
└── sherpa-ncnn.pc

5 directories, 8 files
To compile the C API example c-api-examples/decode-file-c-api.c against the static libraries:

export PKG_CONFIG_PATH=/tmp/sherpa-ncnn/static:$PKG_CONFIG_PATH

cd ./c-api-examples
gcc -o decode-file-c-api $(pkg-config --cflags sherpa-ncnn) ./decode-file-c-api.c $(pkg-config --libs sherpa-ncnn)

To compile it against the shared libraries instead, point pkg-config at the shared install:

export PKG_CONFIG_PATH=/tmp/sherpa-ncnn/shared:$PKG_CONFIG_PATH

cd ./c-api-examples
gcc -o decode-file-c-api $(pkg-config --cflags sherpa-ncnn) ./decode-file-c-api.c $(pkg-config --libs sherpa-ncnn)
7.6 Endpointing
We have three rules for endpoint detection. If any of them is activated, we assume an endpoint is detected.
7.6.1 Rule 1
In Rule 1, we count the duration of trailing silence. If it is larger than a user-specified value, Rule 1 is activated. The
following is an example, which uses 2.4 seconds as the threshold.
Hint: In the Python API, you can specify rule1_min_trailing_silence while constructing an instance of
sherpa_ncnn.Recognizer.
In the C++ API, you can specify rule1.min_trailing_silence when creating EndpointConfig.
7.6.2 Rule 2
In Rule 2, we require that something has been decoded before we start counting the trailing silence. In the following
example, after something has been decoded, Rule 2 is activated when the duration of trailing silence exceeds the
user-specified value of 1.2 seconds.
Hint: In the Python API, you can specify rule2_min_trailing_silence while constructing an instance of
sherpa_ncnn.Recognizer.
In the C++ API, you can specify rule2.min_trailing_silence when creating EndpointConfig.
7.6.3 Rule 3
Rule 3 is activated when the utterance length in seconds exceeds a given value. In the following example, Rule 3 is
activated once the current segment reaches the given value, which is 20 seconds in this case.
Hint: In the Python API, you can specify rule3_min_utterance_length while constructing an instance of
sherpa_ncnn.Recognizer.
In the C++ API, you can specify rule3.min_utterance_length when creating EndpointConfig.
Note: If you want to deactivate this rule, provide a very large value for rule3_min_utterance_length or
rule3.min_utterance_length.
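Putting the three rules together, the following is a minimal Python sketch of setting them when constructing the recognizer. The rule parameter names come from the hints above; the enable_endpoint_detection flag and the model paths are assumptions used only for illustration.

import sherpa_ncnn

d = "./sherpa-ncnn-conv-emformer-transducer-2022-12-06"  # example model directory
recognizer = sherpa_ncnn.Recognizer(
    tokens=f"{d}/tokens.txt",
    encoder_param=f"{d}/encoder_jit_trace-pnnx.ncnn.param",
    encoder_bin=f"{d}/encoder_jit_trace-pnnx.ncnn.bin",
    decoder_param=f"{d}/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin=f"{d}/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param=f"{d}/joiner_jit_trace-pnnx.ncnn.param",
    joiner_bin=f"{d}/joiner_jit_trace-pnnx.ncnn.bin",
    num_threads=4,
    enable_endpoint_detection=True,  # assumption: flag name
    rule1_min_trailing_silence=2.4,  # Rule 1: trailing silence, nothing decoded yet
    rule2_min_trailing_silence=1.2,  # Rule 2: trailing silence after decoded text
    rule3_min_utterance_length=20,   # Rule 3: maximum utterance length in seconds
)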
7.6.4 Demo
The following video demonstrates using the Python API of sherpa-ncnn for real-time speech recognition with end-
pointing.
7.6.5 FAQs

How is the duration of trailing silence computed?

For each frame to be decoded, the model outputs either a blank or a non-blank token. We record the number of
contiguous blanks that have been decoded so far. In the current default setting, each frame is 10 ms, so we can get the
duration of trailing silence by counting the number of contiguous trailing blanks.
Note: If a model uses a subsampling factor of 4, the time resolution becomes 10 * 4 = 40 ms.
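As a quick sanity check of the arithmetic above, here is the computation in Python (a sketch; the numbers follow the defaults just described):

FRAME_SHIFT_MS = 10     # each acoustic frame is 10 ms in the default setting
SUBSAMPLING_FACTOR = 4  # each decoded token then covers 10 * 4 = 40 ms


def trailing_silence_seconds(num_trailing_blanks: int) -> float:
    # Duration of trailing silence implied by a run of contiguous blank tokens.
    return num_trailing_blanks * FRAME_SHIFT_MS * SUBSAMPLING_FACTOR / 1000


# 60 contiguous trailing blanks correspond to 2.4 s of silence,
# which is exactly the Rule 1 threshold used in the example above.
print(trailing_silence_seconds(60))  # 2.4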
7.7 Android
In this section, we describe how to build an Android app for real-time speech recognition with sherpa-ncnn. We
also provide real-time speech recognition video demos.
Hint: During speech recognition, the app does not need to access the Internet. Everything is processed locally on your phone.
In this page, we list some videos about using sherpa-ncnn for real-time speech recognition on Android.
Hint: You can find pre-built APK packages used by the following videos at:
https://huggingface.co/csukuangfj/sherpa-ncnn-apk/tree/main
• CPU versions require Android >= 5.0
• GPU versions with Vulkan require Android >= 7.0
Note: You can also find the latest APK for each release at
https://github.com/k2-fsa/sherpa-ncnn/releases
Video 1: Chinese
Hint: Any recent version of Android Studio should work fine. Also, you can use the default settings of Android Studio
during installation.
For reference, we post the version we are using below:
Download sherpa-ncnn
Install NDK
In the following, we assume the Android SDK location was set to /Users/fangjun/software/my-android. You
can change it accordingly below.
After installing NDK, you can find it in
/Users/fangjun/software/my-android/ndk/22.1.7171670
Warning: If you selected a different version of NDK, please replace 22.1.7171670 accordingly.
Next, let us set the environment variable ANDROID_NDK for later use.
export ANDROID_NDK=/Users/fangjun/software/my-android/ndk/22.1.7171670
list(APPEND ANDROID_COMPILER_FLAGS
    -g
    -DANDROID

Caution: If you don't delete the line containing -g above, the generated library libncnn.so can be as large as 21
MB or even larger!
Caution: You need to select one and only one ABI. arm64-v8a is probably the most common one.
If you want to test the app on an emulator, you probably need x86_64.
Hint: The build scripts for this section are for macOS and Linux. If you are using Windows, or if you don't want to
build the shared libraries yourself, you can download pre-compiled shared libraries for this section by visiting
https://github.com/k2-fsa/sherpa-ncnn/releases

Hint: We provide a colab notebook for you to try this section step by step.
If you are using Windows or you don't want to set up your local environment to build the C++ libraries, please use the
above colab notebook.
$ ls -lh build-android-arm64-v8a/install/lib/lib*.so
-rwxr-xr-x 1 fangjun staff 848K Dec 18 16:49 build-android-arm64-v8a/install/lib/libkaldi-native-fbank-core.so

$ cp build-android-arm64-v8a/install/lib/lib*.so android/SherpaNcnn/app/src/main/jniLibs/arm64-v8a/

You should see the following screenshot after running the above cp command.
Note: If you have Android >= 7.0 and want to run sherpa-ncnn on GPU, please replace ./build-android-arm64-v8a.sh
with ./build-android-arm64-v8a-with-vulkan.sh and replace build-android-arm64-v8a/install/lib/lib*.so with
./build-android-arm64-v8a-with-vulkan/install/lib/lib*.so. That is all you need to do; you don't need to change any code.
Also, you need to install the Vulkan SDK. Please see https://github.com/k2-fsa/sherpa-ncnn/blob/master/install-vulkan-macos.md for details.
$ ls -lh build-android-armv7-eabi/install/lib/lib*.so
-rwxr-xr-x 1 fangjun staff 513K Dec 18 17:04 build-android-armv7-eabi/install/lib/libkaldi-native-fbank-core.so
cp build-android-armv7-eabi/install/lib/lib*.so android/SherpaNcnn/app/src/main/jniLibs/armeabi-v7a/

You should see the following screenshot after running the above cp command.
$ ls -lh build-android-x86-64/install/lib/lib*.so
-rwxr-xr-x 1 fangjun staff 901K Dec 18 17:14 build-android-x86-64/install/lib/libkaldi-native-fbank-core.so
cp build-android-x86-64/install/lib/lib*.so android/SherpaNcnn/app/src/main/jniLibs/x86_64/

You should see the following screenshot after running the above cp command.
Hint: The model is trained using icefall and the original torchscript model is from https://huggingface.co/ptrnull/icefall-asr-conv-emformer-transducer-stateless2-zh.
Use the following commands to download the pre-trained model and place it into android/SherpaNcnn/app/src/main/assets/:

cd android/SherpaNcnn/app/src/main/assets/

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06
cd sherpa-ncnn-conv-emformer-transducer-2022-12-06
git lfs pull --include "*.bin"

# Now, remove extra files to reduce the file size of the generated apk
rm -rf .git test_wavs scripts/
rm export-for-ncnn.sh *.png README.md
$ ls -lh
total 525224
-rw-r--r-- 1 fangjun staff 5.9M Dec 18 17:40 decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 fangjun staff 439B Dec 18 17:39 decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 fangjun staff 141M Dec 18 17:40 encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 fangjun staff 99M Dec 18 17:40 encoder_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 fangjun staff 78K Dec 18 17:40 encoder_jit_trace-pnnx.ncnn.int8.param
-rw-r--r-- 1 fangjun staff 79K Dec 18 17:39 encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 fangjun staff 6.9M Dec 18 17:40 joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 fangjun staff 3.5M Dec 18 17:40 joiner_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 fangjun staff 498B Dec 18 17:40 joiner_jit_trace-pnnx.ncnn.int8.param
-rw-r--r-- 1 fangjun staff 490B Dec 18 17:39 joiner_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 fangjun staff 53K Dec 18 17:39 tokens.txt
$ du -h -d1 .
256M .
You should see the following screen shot after downloading the pre-trained model:
Hint: If you select a different pre-trained model, make sure that you also change the corresponding code listed in the
following screenshot:
Generate APK
$ ls -lh android/SherpaNcnn/app/build/outputs/apk/debug/app-debug.apk
-rw-r--r-- 1 fangjun staff 152M Dec 18 17:53 android/SherpaNcnn/app/build/outputs/apk/debug/app-debug.apk
Select Build -> Analyze APK ... in the above screenshot; in the pop-up dialog, select the generated APK
app-debug.apk, and you will see the following screenshot:
You can see from the above screenshot that most of the APK is occupied by the pre-trained model, while the
runtime, including the shared libraries, is only 1.7 MB.
7.8 iOS
In this section, we describe how to build an iOS app for real-time speech recognition with sherpa-ncnn and run it
within a simulator on your Mac, or on your iPhone or iPad.
We also provide video demos for real-time speech recognition.
Hint: During speech recognition, the app does not need to access the Internet. Everything is processed locally on your
iPhone or iPad.
In this page, we list some videos about using sherpa-ncnn for real-time speech recognition on iPhone and iPad.
This section describes how to build sherpa-ncnn for iPhone and iPad.
Requirement
Warning: The minimum deployment target is iOS 13.0.
Before we continue, please make sure the following requirements are satisfied:
• macOS. It won’t work on Windows or Linux.
• Xcode. The version 14.2 (14C18) is known to work. Other versions may also work.
• CMake. CMake 3.25.1 is known to work. Other versions may also work.
• (Optional) iPhone or iPad. This is for testing the app on your device. If you don’t have a device, you can still run
the app within a simulator on your Mac.
Caution:
If you get the following error:
CMake Error at toolchains/ios.toolchain.cmake:544 (get_filename_component):
  get_filename_component called with incorrect number of arguments
Call Stack (most recent call first):
  /usr/local/Cellar/cmake/3.29.0/share/cmake/Modules/CMakeDetermineSystem.cmake:146 (include)
  CMakeLists.txt:2 (project)
please run:
sudo xcode-select --install
sudo xcodebuild -license
Download sherpa-ncnn
mkdir -p $HOME/open-source
cd $HOME/open-source
git clone https://github.com/k2-fsa/sherpa-ncnn
cd $HOME/open-source/sherpa-ncnn/
./build-ios.sh
Hint: You don’t have to look at the generated files in $HOME/open-source/sherpa-ncnn/build-ios to build an
app. We have pre-configured it for you.
If you are eager to learn more about the generated files or want to use sherpa-ncnn in your own iOS project, please have
a look at For the more curious.
cd $HOME/open-source/sherpa-ncnn/ios-swift/SherpaNcnn
open SherpaNcnn.xcodeproj
It will start Xcode and you will see the following screenshot:
Please select Product -> Build to build the project. See the screenshot below:
After finishing the build, you should see the following screenshot:
Congratulations! You have successfully built the project. Let us run the project by selecting Product -> Run, which
is shown in the following screenshot:
Please wait for a few seconds before Xcode starts the simulator.
Unfortunately, it will throw the following error:
The reason for the above error is that we have not provided the pre-trained model yet.
The file ViewController.swift pre-selects the pre-trained model to be csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English), shown in the screenshot below:
In the popup dialog, switch to the folder where you just downloaded the pre-trained model.
In the screenshot below, it is the folder /Users/fangjun/open-source/icefall-models/sherpa-ncnn-conv-emformer-transducer-2022-12-06:
Fig. 7.12: Screenshot for navigating to the folder containing the downloaded pre-trained model
After adding pre-trained model files to Xcode, you should see the following screenshot:
At this point, you should be able to select the menu Product -> Run to run the project and you should finally see the
following screenshot:
Congratulations! You have finally succeeded in running sherpa-ncnn with iOS, though it is in a simulator.
Please read below if you want to run sherpa-ncnn on your iPhone or iPad.
First, please make sure the iOS version of your iPhone/iPad is >= 13.0.
Click the menu Xcode -> Settings..., as is shown in the following screenshot:
In the popup dialog, please select Account and click + to add your Apple ID, as is shown in the following screenshots.
After adding your Apple ID, please connect your iPhone or iPad to your Mac and select your device in Xcode. The
following screenshot is an example to select my iPhone.
Now your Xcode should look like below after selecting a device:
Please select Product -> Run again to run sherpa-ncnn on your selected device, as is shown in the following screen-
shot:
After a successful build, check your iPhone/iPad; you should see the following screenshot, which indicates that iOS
does not yet trust the app's developer certificate.
To fix that, please select Settings -> General -> Device Management on your device.
Please click Apple Development: csukuangfj... and click Trust "Apple Development:
csukuangfj@g..." in the subsequent dialog, as is shown below:
At this point, you should be able to run the app on your device. The following is a screenshot about running it on my
iPhone:
This section is for those who want to learn more about how to use sherpa-ncnn in an iOS project.
After running:
./build-ios.sh
Hint: Please have a look at ./build-ios.sh so that you know what it does for you.
What is interesting here are the two framework folders openmp.xcframework and sherpa-ncnn.xcframework. All
other folders can be safely removed; we only need these two framework folders.
In the following, we describe the content in these two framework folders.
openmp.xcframework
$ tree build-ios/openmp.xcframework/
build-ios/openmp.xcframework/
├── Headers
│   └── omp.h
├── Info.plist
├── ios-arm64
│   └── libomp.a
└── ios-arm64_x86_64-simulator
    └── libomp.a

3 directories, 4 files
Explanation:
• omp.h: The header file, which is used by ncnn
• Info.plist: A file that is dedicated for frameworks on macOS/iOS
• ios-arm64/libomp.a: A static library for iOS devices, e.g., iPhone
• ios-arm64_x86_64-simulator/libomp.a: A static library for iOS simulators, including simulators for Intel
chips and Apple Silicon (e.g., M1)
sherpa-ncnn.xcframework

$ tree build-ios/sherpa-ncnn.xcframework/
build-ios/sherpa-ncnn.xcframework/
├── Headers
│   └── sherpa-ncnn
│       └── c-api
│           └── c-api.h
├── Info.plist
├── ios-arm64
│   └── sherpa-ncnn.a
└── ios-arm64_x86_64-simulator
    └── sherpa-ncnn.a

5 directories, 4 files
Explanation:
• c-api.h: The header file, which is copied from https://github.com/k2-fsa/sherpa-ncnn/blob/master/sherpa-ncnn/c-api/c-api.h
• Info.plist: A file that is dedicated for frameworks on macOS/iOS
• ios-arm64/sherpa-ncnn.a: A static library for iOS devices, e.g., iPhone
• ios-arm64_x86_64-simulator/sherpa-ncnn.a: A static library for simulators, including simulators for
Intel chips and Apple Silicon (e.g., M1)
Running file on the static library for simulators shows that it is a universal binary containing both architectures:

build-ios/sherpa-ncnn.xcframework/ios-arm64_x86_64-simulator/sherpa-ncnn.a: Mach-O universal binary with 2 architectures: [x86_64:current ar archive] [arm64]
build-ios/sherpa-ncnn.xcframework/ios-arm64_x86_64-simulator/sherpa-ncnn.a (for architecture x86_64): current ar archive
build-ios/sherpa-ncnn.xcframework/ios-arm64_x86_64-simulator/sherpa-ncnn.a (for architecture arm64): current ar archive
To call the C API from Swift, add a bridging header like the following to your project:

#ifndef SWIFT_API_EXAMPLES_SHERPANCNN_BRIDGING_HEADER_H_
#define SWIFT_API_EXAMPLES_SHERPANCNN_BRIDGING_HEADER_H_
#import "sherpa-ncnn/c-api/c-api.h"
#endif // SWIFT_API_EXAMPLES_SHERPANCNN_BRIDGING_HEADER_H_
After adding the bridging header to your iOS project, Xcode will complain that it cannot find sherpa-ncnn/c-api/c-api.h.
The fix is to add the path build-ios/sherpa-ncnn.xcframework/Headers to Header Search Paths by changing
Build Settings -> Search Paths -> Header Search Paths, as is shown in the following screenshot:
Please also add SherpaNcnn.swift to your iOS project, which is a utility to make it easier to access functions from the
bridging header.
The next thing is to add openmp.xcframework and sherpa-ncnn.xcframework as dependencies to your iOS project.
Select Build Phases -> Link Binary with Libraries and then click + to add sherpa-ncnn.xcframework
and openmp.xcframework. See the screenshot below for reference:
Hint: After clicking +, please select Add Other... -> Add Files..., and then add the path to sherpa-ncnn.xcframework.
Repeat the step for openmp.xcframework.
See the screenshot below for reference:
One more thing you need to do after adding the frameworks is to set up the framework search path. Click Build
Settings -> Search Paths -> Framework Search Paths and add the path to build-ios/. See the screenshot
below:
If you encounter link errors about the C++ standard library, please add -lc++ to link against libc++ by clicking Build
Settings -> Linking -> Other Linker Flags and adding -lc++. See the screenshot below for reference:
7.9 Pre-trained models

In this section, we describe how to download and use all available pre-trained models.

Below we list models with fewer parameters that are suitable for resource-constrained embedded systems.

Hint: If you are using a Raspberry Pi 4, this section is not so helpful for you, since all models in sherpa-ncnn are able
to run in real-time on it.

This page is especially useful for systems with fewer resources than a Raspberry Pi 4.
• marcoyang/sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23 (Chinese)
• marcoyang/sherpa-ncnn-streaming-zipformer-20M-2023-02-17 (English)
• marcoyang/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09 (English)
• csukuangfj/sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16 (Bilingual, Chinese + English)
• marcoyang/sherpa-ncnn-lstm-transducer-small-2023-02-13 (Bilingual, Chinese + English)
Hint: Please refer to Installation to install sherpa-ncnn before you read this section.
marcoyang/sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23 (Chinese)
This model is a streaming Zipformer model which has around 14 million parameters. It is trained on the WenetSpeech
corpus, so it supports only Chinese.
You can find the training code at https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming
In the following, we describe how to download it and use it with sherpa-ncnn.
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

for method in greedy_search modified_beam_search; do
  ./build/bin/sherpa-ncnn \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/tokens.txt \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/encoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/encoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/decoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/decoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/joiner_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/joiner_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/test_wavs/0.wav \
    2 \
    $method
done
Each run prints the full ModelConfig and DecoderConfig in use, followed by the recognition result.
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/tokens.txt \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23/joiner_jit_trace-pnnx.ncnn.bin \
2 \
greedy_search
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-ncnn-alsa to do real-time speech
recognition with your microphone if sherpa-ncnn-microphone does not work for you.
marcoyang/sherpa-ncnn-streaming-zipformer-20M-2023-02-17 (English)
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-20M-2023-02-17.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

To decode a file, run ./build/bin/sherpa-ncnn with the same argument pattern as for the other models in this section:
tokens.txt, the encoder/decoder/joiner .param and .bin files of sherpa-ncnn-streaming-zipformer-20M-2023-02-17,
and a test wave file. The command prints the ModelConfig and DecoderConfig in use, followed by the recognition result.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/tokens.txt \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-20M-2023-02-17/joiner_jit_trace-pnnx.ncnn.bin \
2 \
greedy_search
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-ncnn-alsa to do real-time speech
recognition with your microphone if sherpa-ncnn-microphone does not work for you.
csukuangfj/sherpa-ncnn-streaming-zipformer-en-2023-02-13 (English)
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-en-2023-02-13.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

To decode a file, run ./build/bin/sherpa-ncnn with the same argument pattern as for the other models in this section:
tokens.txt, the encoder/decoder/joiner .param and .bin files of sherpa-ncnn-streaming-zipformer-en-2023-02-13, and
a test wave file. Each run prints the ModelConfig and DecoderConfig in use, followed by the recognition result:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
Elapsed seconds: 0.569 s
Real time factor (RTF): 0.569 / 4.825 = 0.118
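The real time factor (RTF) printed above is simply the decoding time divided by the duration of the audio; an RTF below 1 means the model decodes faster than real time. A tiny Python sketch of the computation, using the numbers from the output above:

elapsed_seconds = 0.569  # time spent decoding
wave_seconds = 4.825     # duration of the decoded wave file

rtf = elapsed_seconds / wave_seconds
print(f"RTF = {elapsed_seconds:.3f} / {wave_seconds:.3f} = {rtf:.3f}")  # 0.118
assert rtf < 1  # faster than real time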
The output of a second run is:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
Elapsed seconds: 0.554 s
Real time factor (RTF): 0.554 / 4.825 = 0.115
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/tokens.txt \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin \
2 \
greedy_search
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-ncnn-alsa to do real-time speech
recognition with your microphone if sherpa-ncnn-microphone does not work for you.
csukuangfj/sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13 (Bilingual, Chinese + English)

cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

for method in greedy_search modified_beam_search; do
  ./build/bin/sherpa-ncnn \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/tokens.txt \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/test_wavs/1.wav \
    2 \
    $method
done
Each run prints the ModelConfig and DecoderConfig in use, followed by the recognition result:
ALWAYS ALWAYS
Elapsed seconds: 0.598 s
Real time factor (RTF): 0.598 / 5.100 = 0.117
The output of a second run is:
ALWAYS ALWAYS
Elapsed seconds: 0.943 s
Real time factor (RTF): 0.943 / 5.100 = 0.185
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn-microphone \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/tokens.txt \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin \
  2 \
  greedy_search
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-ncnn-alsa to do real-time speech
recognition with your microphone if sherpa-ncnn-microphone does not work for you.
csukuangfj/sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16 (Bilingual, Chinese + English)

cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

for method in greedy_search modified_beam_search; do
  ./build/bin/sherpa-ncnn \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/1.wav \
    2 \
    $method
done
Each run prints the ModelConfig and DecoderConfig in use, followed by the recognition result:
ALWAYS
The output of a second run is:
ALWAYS
cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn-microphone \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner_jit_trace-pnnx.ncnn.bin \
  2 \
  greedy_search
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-ncnn-alsa to do real-time speech
recognition with your microphone if sherpa-ncnn-microphone does not work for you.
We provide a second version of the model that is exported with --decode-chunk-len=96 instead of 32.

Note: You can also find a third version in the folder 64.

The advantage of this model is that it decodes much faster, while the downside is that you will see a longer delay
before the recognition result appears after you speak.
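To see where the delay comes from, note that the chunk length is measured in 10 ms acoustic frames (assuming the 10 ms frame shift mentioned in the endpointing FAQs), so the model must buffer a whole chunk of audio before it can decode it. A small Python sketch of the trade-off:

FRAME_MS = 10  # assumption: 10 ms per acoustic frame, as in the endpointing FAQs

for decode_chunk_len in (32, 64, 96):
    chunk_ms = decode_chunk_len * FRAME_MS
    print(f"--decode-chunk-len={decode_chunk_len}: audio is consumed in {chunk_ms} ms chunks")

# Larger chunks mean fewer, larger model invocations (faster overall decoding),
# but also a longer wait before the result for each chunk appears.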
for method in greedy_search modified_beam_search; do
  ./build/bin/sherpa-ncnn \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/encoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/encoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/decoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/decoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/joiner_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/joiner_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/1.wav \
    2 \
    $method
done
shaojieli/sherpa-ncnn-streaming-zipformer-fr-2023-04-14 (French)

cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-streaming-zipformer-fr-2023-04-14.tar.bz2

for method in greedy_search modified_beam_search; do
  ./build/bin/sherpa-ncnn \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/tokens.txt \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/encoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/encoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/decoder_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/decoder_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/joiner_jit_trace-pnnx.ncnn.param \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/joiner_jit_trace-pnnx.ncnn.bin \
    ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/test_wavs/common_voice_fr_19364697.wav \
    2 \
    $method
done
Each run prints the ModelConfig and DecoderConfig in use, followed by the recognition result:

text: CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ASHÉMÉNIDE ET SEPT DES SASSANDIDES
timestamps: 0.96 1.44 1.52 1.76 1.96 2.08 2.28 2.56 2.64 2.76 2.8 2.96 3.04 3.2 3.28 3.4 3.48 3.72 3.8 4 4.16 4.24 4.32 4.44 4.6 4.68 4.92 5.2 5.52 5.84 6.04 6.12 6.24 6.56 6.68
The output of a second run is:

text: CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ASHÉMÉNIDE ET SEPT DES SASSANDIDES
timestamps: 0.96 1.44 1.52 1.76 1.96 2.08 2.28 2.56 2.64 2.76 2.8 2.96 3.04 3.2 3.28 3.4 3.48 3.72 3.8 4 4.16 4.24 4.32 4.44 4.6 4.68 4.92 5.2 5.52 5.84 6.04 6.12 6.24 6.56 6.68
cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn-microphone \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/tokens.txt \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-fr-2023-04-14/joiner_jit_trace-pnnx.ncnn.bin \
  2 \
  greedy_search
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-ncnn-alsa to do real-time speech
recognition with your microphone if sherpa-ncnn-microphone does not work for you.
Hint: Please refer to Installation to install sherpa-ncnn before you read this section.
marcoyang/sherpa-ncnn-lstm-transducer-small-2023-02-13 (Bilingual, Chinese + English)

cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-lstm-transducer-small-2023-02-13.tar.bz2
Note: Please refer to Embedded Linux (arm) for how to compile sherpa-ncnn for a 32-bit ARM platform.
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/tokens.txt \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-lstm-transducer-small-2023-02-13/test_wavs/0.wav
Note: The default option uses 4 threads and greedy_search for decoding.
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
csukuangfj/sherpa-ncnn-2022-09-05 (English)
This is a model trained using the GigaSpeech and LibriSpeech datasets.
Please see https://github.com/k2-fsa/icefall/pull/558 for how the model is trained.
You can find the training code at
https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/lstm_transducer_stateless2
In the following, we describe how to download it and use it with sherpa-ncnn.
Please use the following commands to download it.
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-2022-09-05.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

To decode a file, run ./build/bin/sherpa-ncnn with the same argument pattern as for the other models in this section:
tokens.txt, the encoder/decoder/joiner .param and .bin files of sherpa-ncnn-2022-09-05, and a test wave file. Each run
prints the ModelConfig and DecoderConfig in use, followed by the recognition result.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-2022-09-05/tokens.txt \
./sherpa-ncnn-2022-09-05/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-2022-09-05/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-2022-09-05/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-2022-09-05/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-2022-09-05/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-2022-09-05/joiner_jit_trace-pnnx.ncnn.bin \
2 \
greedy_search
Number of threads: 4
num devices: 4
csukuangfj/sherpa-ncnn-2022-09-30 (Chinese)

cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-2022-09-30.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

To decode a file, run ./build/bin/sherpa-ncnn with the same argument pattern as for the other models in this section:
tokens.txt, the encoder/decoder/joiner .param and .bin files of sherpa-ncnn-2022-09-30, and a test wave file. Each run
prints the ModelConfig and DecoderConfig in use, followed by the recognition result.
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-2022-09-30/tokens.txt \
./sherpa-ncnn-2022-09-30/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-2022-09-30/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-2022-09-30/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-2022-09-30/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-2022-09-30/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-2022-09-30/joiner_jit_trace-pnnx.ncnn.bin \
2 \
greedy_search
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
Hint: Please refer to Installation to install sherpa-ncnn before you read this section.
marcoyang/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09 (English)
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09.tar.bz2
Note: Please refer to Embedded Linux (arm) for how to compile sherpa-ncnn for a 32-bit ARM platform. In the
following, we test the pre-trained model on an embedded device, whose CPU is RV1126 (Quad core ARM Cortex-A7).
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/test_wavs/1089-134686-0001.wav
The outputs are shown below. The CPU used for decoding is RV1126 (Quad core ARM Cortex-A7).
Note: The default option uses 4 threads and greedy_search for decoding.
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
Note: We also support int8 quantization to compress the model and speed up inference. Currently, only the encoder
and joiner are quantized.
cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.int8.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.int8.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/test_wavs/1089-134686-0001.wav
The outputs are shown below. The CPU used for decoding is RV1126 (Quad core ARM Cortex-A7).
Compared to the original model in fp16 format, the decoding speed is significantly improved: the decoding time
drops from 3.26 s to 2.44 s.
Note: When the model's weights are quantized to float16, they are converted to float32 during computation.
When the weights are quantized to int8, the computation stays in int8.
Hint: Even if we use only 1 thread for the int8 model, the resulting real time factor (RTF) is still less than 1.
Hint: If you want to train your own model that is able to support both Chinese and English, please refer to our training
code:
https://github.com/k2-fsa/icefall/tree/master/egs/tal_csasr/ASR
You can also try the pre-trained models in your browser without installing anything by visiting:
https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English)

cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-2022-12-06.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin
Number of threads: 4
num devices: 4
Use default device: 2
Name: MacBook Pro Microphone
Max input channels: 1
Started
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-08 (Chinese)
Hint: This is a very small model that can be run in real-time on embedded systems.
This model is trained using the WenetSpeech dataset and it supports only Chinese.
In the following, we describe how to download and use it with sherpa-ncnn.
Please use the following commands to download it.
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-2022-12-08.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/tokens.txt \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/tokens.txt \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.bin
Number of threads: 4
num devices: 4
Use default device: 2
Name: MacBook Pro Microphone
Max input channels: 1
Started
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-04 (English)
This model is trained using GigaSpeech and LibriSpeech. It supports only English.
In the following, we describe how to download and use it with sherpa-ncnn.
Please use the following commands to download it.
cd /path/to/sherpa-ncnn
wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-2022-12-04.tar.bz2
Hint: It supports decoding only wave files with a single channel and the sampling rate should be 16 kHz.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/tokens.txt \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/test_wavs/1089-134686-0001.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/tokens.txt \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.bin
Number of threads: 4
num devices: 4
Use default device: 2
Name: MacBook Pro Microphone
Max input channels: 1
Started
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
7.10 Examples
[Board info, OS release, lscpu/cpuinfo output, and RTF tables (1, 2, and 4 threads) for the boards in this section appear as images in the original document.]
7.10.3 Jetson NX
[Board info and RTF tables (2, 4, and 6 threads) appear as images in the original document.]
7.10.4 VisionFive 2
This page describes how to run sherpa-ncnn on VisionFive2, which is a 64-bit RISC-V board with 4 CPUs.
[Board info and an RTF table (4 threads) appear as images in the original document.]
You can see that the RTF is less than 1, which means it is able to perform streaming (i.e., real-time) speech recognition.
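As a reminder, the real-time factor (RTF) is the ratio of processing time to audio duration, so RTF < 1 means the recognizer processes audio faster than it arrives. A tiny sketch with made-up numbers:

# RTF = processing time / audio duration. RTF < 1 is what streaming
# (real-time) recognition requires. The numbers below are made up.
elapsed_seconds = 3.2  # hypothetical processing time
wave_duration = 5.1    # hypothetical audio length in seconds
rtf = elapsed_seconds / wave_duration
print(f"RTF = {elapsed_seconds}/{wave_duration} = {rtf:.3f}")  # 0.627 < 1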
The following lists the commands used for testing so that you can copy and paste them if you want to try it yourself.
./sherpa-ncnn \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/joiner_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/0.wav \
4 \
greedy_search
Since the board does not have microphones, we use a USB microphone for testing.
After connecting a USB microphone to the board, use the following command to check it:

arecord -l

The output shows card 2 and device 0, so the device name is hw:2,0.
The command to start the program for real-time speech recognition is
./sherpa-ncnn-alsa \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16/64/joiner_jit_trace-pnnx.ncnn.bin \
hw:2,0 \
4 \
greedy_search
7.11 FAQs
If you are using Linux and sherpa-ncnn-microphone throws the following error:
Num device: 0
No default input device found.
cd /path/to/sherpa-ncnn
sudo apt-get install alsa-utils libasound2-dev
cd build
rm CMakeCache.txt # Important, remove the cmake cache file
make -j
After running the above commands, you should see a binary file ./build/bin/sherpa-ncnn-alsa.
Please refer to the sherpa-ncnn-alsa section for how to use it.
8 sherpa-onnx
Hint: During speech recognition, it does not need to access the Internet. Everything is processed locally on your device.
We support using onnx with onnxruntime to replace PyTorch for neural network computation. The code is in a separate repository, sherpa-onnx.
sherpa-onnx is self-contained and everything can be compiled from source.
Please refer to https://k2-fsa.github.io/icefall/model-export/export-onnx.html for how to export models to onnx format.
In the following, we describe how to build sherpa-onnx for Linux, macOS, Windows, embedded systems, Android, and
iOS.
Also, we show how to use it for speech recognition with pre-trained models.
8.1 Tutorials
8.2 Installation
Hint: You can use pip install cmake to install the latest cmake.
8.2.1 Linux
# If you have GCC<=10, e.g., you use Ubuntu <= 18.04 or CentOS <= 7, please
# use the following command to build shared libs; otherwise, you would
# get link errors from libonnxruntime.a:
#
# cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON ..
#
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j6
Hint: You need to install CUDA toolkit. Otherwise, you would get errors at runtime.
You can refer to https://k2-fsa.github.io/k2/installation/cuda-cudnn.html to install CUDA toolkit.
After building, you will find an executable sherpa-onnx inside the bin directory.
That’s it!
Please refer to Pre-trained models for a list of pre-trained models.
8.2.2 macOS
After building, you will find an executable sherpa-onnx inside the bin directory.
That’s it!
Please refer to Pre-trained models for a list of pre-trained models.
8.2.3 Windows
Hint: MinGW is known not to work. Please install Visual Studio before you continue.
Note: You can download pre-compiled binaries for both 32-bit and 64-bit Windows from https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main.
Please always download the latest version.
URLs to download version 1.9.12 are given below.
If you cannot access huggingface.co, please replace huggingface.co with hf-mirror.com.
Hint: You need to install CUDA toolkit. Otherwise, you would get errors at runtime.
After building, you will find an executable sherpa-onnx.exe inside the bin/Release directory.
That’s it!
Please refer to Pre-trained models for a list of pre-trained models.
# Please select one toolset among VS 2015, 2017, 2019, and 2022 below
# We use VS 2022 as an example.
After building, you will find an executable sherpa-onnx.exe inside the bin/Release directory.
That’s it!
Please refer to Pre-trained models for a list of pre-trained models.
Hint: By default, it builds static libraries of sherpa-onnx. To get dynamic/shared libraries, please pass -DBUILD_SHARED_LIBS=ON to cmake. That is, use:

cmake -DBUILD_SHARED_LIBS=ON ..
This page describes how to build sherpa-onnx for embedded Linux (aarch64, 64-bit) with cross-compiling on an x64
machine with Ubuntu OS.
Warning: By cross-compiling we mean that you do the compilation on an x86_64 machine and then copy the generated binaries from the x86_64 machine to an aarch64 machine to run them.
If you want to compile sherpa-onnx on an aarch64 machine directly, please see Linux.
Note: You can download pre-compiled binaries for aarch64 from https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/aarch64
Please always download the latest version.
Example command to download version 1.9.12:
Hint: We provide two colab notebooks for you to try this section step by step.
If you are using Windows/macOS or you don't want to set up your local environment for cross-compiling, please use the above colab notebooks.
Install toolchain
Warning: You can use any toolchain that is suitable for your platform. The toolchain we use below is just an
example.
Visit https://releases.linaro.org/components/toolchain/binaries/latest-7/aarch64-linux-gnu/
We are going to download gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz, which has been
uploaded to https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains.
Assume you want to install it in the folder $HOME/software:
mkdir -p $HOME/software
cd $HOME/software
# Note: the following toolchain gcc 7.5 is for building shared libraries.
# Please see below for using gcc 10 to build static libraries.
#
# You would get link errors if you use gcc 7.5 to build static libraries.
#
wget https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains/resolve/main/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
tar xf gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
export PATH=$HOME/software/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu/bin:$PATH
To check that we have installed the cross-compiling toolchain successfully, please run:
aarch64-linux-gnu-gcc --version
Build sherpa-onnx
Note: Please also copy the onnxruntime lib to your embedded systems and put it into the same directory as
sherpa-onnx and sherpa-onnx-alsa.
sherpa-onnx$ ls -lh build-aarch64-linux-gnu/install/lib/*onnxruntime*
lrw-r--r-- 1 kuangfangjun root 24 Feb 21 21:38 build-aarch64-linux-gnu/install/lib/libonnxruntime.so -> libonnxruntime.so.1.14.0
That’s it!
Hint:
• sherpa-onnx is for decoding a single file
• sherpa-onnx-alsa is for real-time speech recognition by reading the microphone with ALSA
sherpa-onnx-alsa
Caution: We recommend that you use sherpa-onnx-alsa on embedded systems such as Raspberry Pi.
You need to provide a device_name when invoking sherpa-onnx-alsa. We describe below how to find the
device name for your microphone.
Run the following command:
arecord -l
to list all available microphones for recording. If it complains that arecord: command not found, please use sudo apt-get install alsa-utils to install it.
If the above command gives the following output:
**** List of CAPTURE Hardware Devices ****
card 3: UACDemoV10 [UACDemoV1.0], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
In this case, I only have 1 microphone. It is card 3, and that card has only device 0. To select card 3 and device 0 on that card, we need to pass plughw:3,0 to sherpa-onnx-alsa. (Note: It has the format plughw:card_number,device_index.)
For instance, you have to use
./sherpa-onnx-alsa \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
plughw:3,0
Please adjust the card number and the device index on the selected card to match your own setup; otherwise you won't be able to record from your microphone.
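If you prefer to determine the device name programmatically, here is a hedged Python sketch that parses the output of arecord -l; the regular expression is an assumption based on the sample output above:

# Derive the ALSA device name (plughw:card,device) from `arecord -l`.
import re
import subprocess

output = subprocess.run(["arecord", "-l"], capture_output=True, text=True).stdout
for line in output.splitlines():
    # Matches lines such as:
    # card 3: UACDemoV10 [UACDemoV1.0], device 0: USB Audio [USB Audio]
    m = re.match(r"card (\d+):.*device (\d+):", line)
    if m:
        card, device = m.groups()
        print(f"Use plughw:{card},{device} for this microphone")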
Please read Pre-trained models for usage of the generated binaries.
Hint: If you want a pre-trained model for Raspberry Pi that can run in real time, we recommend Zipformer-transducer-based Models.
If you want to build static libraries and statically linked binaries, please first download a cross-compiling toolchain with GCC >= 9.0. The following is an example:
mkdir -p $HOME/software
cd $HOME/software
wget -q https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains/resolve/main/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz
tar xf gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz
export PATH=$HOME/software/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin:$PATH
To check that we have installed the cross-compiling toolchain successfully, please run:
aarch64-none-linux-gnu-gcc --version
Now you can build static libraries and statically linked binaries with the following commands:
You can use the following commands to check that the generated binaries are indeed statically linked:
$ cd build-aarch64-linux-gnu/bin
$ ldd sherpa-onnx-alsa
not a dynamic executable
$ readelf -d sherpa-onnx-alsa
This page describes how to build sherpa-onnx for embedded Linux (arm, 32-bit) with cross-compiling on an x86
machine with Ubuntu OS.
Caution: If you want to build sherpa-onnx directly on your board, please don't use this document. Refer to Linux instead.
Note: You can download pre-compiled binaries for 32-bit ARM from https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/arm32
Please always download the latest version.
Example command to download version 1.9.12:
Hint: We provide two colab notebooks for you to try this section step by step.
If you are using Windows/macOS or you don't want to set up your local environment for cross-compiling, please use the above colab notebooks.
Install toolchain
Warning: You can use any toolchain that is suitable for your platform. The toolchain we use below is just an
example.
mkdir -p $HOME/software
cd $HOME/software
wget -q https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains/resolve/main/gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf.tar.xz
tar xf gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf.tar.xz
export PATH=$HOME/software/gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf/bin:$PATH
To check that we have installed the cross-compiling toolchain successfully, please run:
arm-none-linux-gnueabihf-gcc --version
Build sherpa-onnx
$ ls -lh build-arm-linux-gnueabihf/install/bin/
total 1.2M
-rwxr-xr-x 1 kuangfangjun root 395K Jul 7 16:28 sherpa-onnx
-rwxr-xr-x 1 kuangfangjun root 391K Jul 7 16:28 sherpa-onnx-alsa
-rwxr-xr-x 1 kuangfangjun root 351K Jul 7 16:28 sherpa-onnx-offline
That’s it!
Hint:
• sherpa-onnx is for decoding a single file using a streaming model
• sherpa-onnx-alsa is for real-time speech recognition by reading the microphone with ALSA
Caution: We recommend that you use sherpa-onnx-alsa on embedded systems such as Raspberry pi.
You need to provide a device_name when invoking sherpa-onnx-alsa. We describe below how to find the
device name for your microphone.
Run the following command:
arecord -l
to list all available microphones for recording. If it complains that arecord: command not found, please use sudo apt-get install alsa-utils to install it.
If the above command gives the following output:
**** List of CAPTURE Hardware Devices ****
card 0: Audio [Axera Audio], device 0: 49ac000.i2s_mst-es8328-hifi-analog es8328-hifi-analog-0 []
Subdevices: 1/1
Subdevice #0: subdevice #0
In this case, I only have 1 microphone. It is card 0, and that card has only device 0. To select card 0 and device 0 on that card, we need to pass plughw:0,0 to sherpa-onnx-alsa. (Note: It has the format plughw:card_number,device_index.)
For instance, you have to use
# Note: We use int8 models below.
./bin/sherpa-onnx-alsa \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-64.int8.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-64.int8.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-64.int8.onnx \
"plughw:0,0"
Please adjust the card number and the device index on the selected card to match your own setup; otherwise you won't be able to record from your microphone.
Please read Pre-trained models for usage of the generated binaries.
Read below if you want to learn more.
Hint: By default, all external dependencies are statically linked. That means the generated binaries are self-contained (except that they require the onnxruntime shared library at runtime).
You can use the following commands to check that and you will find they depend only on system libraries.
$ readelf -d build-arm-linux-gnueabihf/install/bin/sherpa-onnx
$ readelf -d build-arm-linux-gnueabihf/install/bin/sherpa-onnx-alsa
If you want to build static libraries and statically linked binaries, please first download a cross-compiling toolchain with GCC >= 9.0. The following is an example:
mkdir -p $HOME/software
cd $HOME/software
wget -q https://huggingface.co/csukuangfj/sherpa-ncnn-toolchains/resolve/main/gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf.tar.xz
tar xf gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf.tar.xz
export PATH=$HOME/software/gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf/bin:$PATH
To check that we have installed the cross-compiling toolchain successfully, please run:
arm-none-linux-gnueabihf-gcc --version
Now you can build static libraries and statically linked binaries with the following commands:
You can use the following commands to check that the generated binaries are indeed statically linked:
$ cd build-arm-linux-gnueabihf/bin
$ ldd sherpa-onnx-alsa
not a dynamic executable
$ readelf -d sherpa-onnx-alsa
This page describes how to build sherpa-onnx for embedded Linux (RISC-V, 64-bit) with cross-compiling on an x64
machine with Ubuntu OS. It also demonstrates how to use qemu to run the compiled binaries.
Hint: We provide a colab notebook for you to try this section step by step.
If you are using Windows/macOS or you don't want to set up your local environment for cross-compiling, please use the above colab notebook.
Note: You can download pre-compiled binaries for riscv64 from https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/riscv64
Please always download the latest version.
Example command to download version 1.9.12:
Install toolchain
mkdir -p $HOME/toolchain
wget -q https://occ-oss-prod.oss-cn-hangzhou.aliyuncs.com/resource//1663142514282/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.6.1-20220906.tar.gz
tar xf Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.6.1-20220906.tar.gz --strip-components 1 -C $HOME/toolchain
export PATH=$HOME/toolchain/bin:$PATH
To check that you have installed the toolchain successfully, please run
$ riscv64-unknown-linux-gnu-gcc --version
$ riscv64-unknown-linux-gnu-g++ --version
Build sherpa-onnx
Hint: Currently, only shared libraries are supported. We will support static linking in the future.
$ ls -lh build-riscv64-linux-gnu/install/bin
$ echo "---"
$ ls -lh build-riscv64-linux-gnu/install/lib
total 292K
-rwxr-xr-x 1 root root 23K Mar 20 09:41 sherpa-onnx
-rwxr-xr-x 1 root root 27K Mar 20 09:41 sherpa-onnx-alsa
-rwxr-xr-x 1 root root 31K Mar 20 09:41 sherpa-onnx-alsa-offline
-rwxr-xr-x 1 root root 40K Mar 20 09:41 sherpa-onnx-alsa-offline-speaker-identification
-rwxr-xr-x 1 root root 23K Mar 20 09:41 sherpa-onnx-keyword-spotter
-rwxr-xr-x 1 root root 27K Mar 20 09:41 sherpa-onnx-keyword-spotter-alsa
-rwxr-xr-x 1 root root 23K Mar 20 09:41 sherpa-onnx-offline
-rwxr-xr-x 1 root root 39K Mar 20 09:41 sherpa-onnx-offline-parallel
-rwxr-xr-x 1 root root 19K Mar 20 09:41 sherpa-onnx-offline-tts
-rwxr-xr-x 1 root root 31K Mar 20 09:41 sherpa-onnx-offline-tts-play-alsa
---
total 30M
-rw-r--r-- 1 root root 256K Mar 20 09:41 libespeak-ng.so
-rw-r--r-- 1 root root 71K Mar 20 09:41 libkaldi-decoder-core.so
-rw-r--r-- 1 root root 67K Mar 20 09:41 libkaldi-native-fbank-core.so
-rw-r--r-- 1 root root 13M Mar 20 09:35 libonnxruntime.so
-rw-r--r-- 1 root root 13M Mar 20 09:35 libonnxruntime.so.1.14.1
lrwxrwxrwx 1 root root 23 Mar 20 09:41 libpiper_phonemize.so -> libpiper_phonemize.so.1
$ file build-riscv64-linux-gnu/install/bin/sherpa-onnx
$ readelf -d build-riscv64-linux-gnu/install/bin/sherpa-onnx
/root/toolchain/sysroot/lib/ld-linux-riscv64-lp64d.so.1
That’s it!
Please create an issue at https://github.com/k2-fsa/sherpa-onnx/issues if you have any problems.
qemu
Caution: Please don’t use any other methods to install qemu-riscv64. Only the method listed in this subsection
is known to work.
mkdir -p /tmp
cd /tmp
wget -q https://files.pythonhosted.org/packages/21/f4/733f29c435987e8bb264a6504c7a4ea4c04d0d431b38a818ab63eef082b9/xuantie_qemu-20230825-py3-none-manylinux1_x86_64.whl
unzip xuantie_qemu-20230825-py3-none-manylinux1_x86_64.whl
mkdir -p $HOME/qemu
cp -v ./qemu/qemu-riscv64 $HOME/qemu
export PATH=$HOME/qemu:$PATH
The tail of the output of qemu-riscv64 --help looks like the following (note the defaults):

-CPF CSKY_PROFILING
-csky-trace CSKY_TRACE [port=<port>][,tb_trace=<on|off>][,mem_trace=<on|off>][,auto_trace=<on|off>][,start=addr][,exit=addr]
Defaults:
QEMU_LD_PREFIX = /usr/gnemul/qemu-riscv64
QEMU_STACK_SIZE = 8388608 byte
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-20M-2023-02-17.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-en-20M-2023-02-17.tar.bz2
Now you can use the following command to run it with qemu-riscv64:
cd /path/to/sherpa-onnx
export PATH=$HOME/qemu:$PATH
qemu-riscv64 build-riscv64-linux-gnu/install/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_wavs/0.wav
If the above command fails because the dynamic linker cannot be found, set QEMU_LD_PREFIX to the toolchain sysroot so that qemu-riscv64 can locate it:

cd /path/to/sherpa-onnx
export PATH=$HOME/qemu:$PATH
export QEMU_LD_PREFIX=$HOME/toolchain/sysroot
qemu-riscv64 build-riscv64-linux-gnu/install/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_wavs/0.wav
If shared libraries such as libonnxruntime.so still cannot be found at runtime, you may also need to export LD_LIBRARY_PATH=$HOME/toolchain/sysroot/lib:$LD_LIBRARY_PATH, as done in the text-to-speech example below.
The program first prints the OnlineRecognizerConfig it is using (feature extractor settings with sampling_rate=16000 and feature_dim=80, the transducer encoder/decoder/joiner model paths, the tokens file, endpoint rules, decoding method greedy_search, and blank_penalty=0), and then the recognition result:
./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_wavs/0.wav
Elapsed seconds: 70, Real time factor (RTF): 11
THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BRAFFLELS
{"text": " THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BRAFFLELS", "tokens": [" THE", " YE", "LL", "OW", " LA", "M", "P", "S", " WOULD", " LIGHT", " UP", " HE", "RE", " AND", " THERE", " THE", " S", "QUA", "LI", "D", " ", "QUA", "R", "TER", " OF", " THE", " B", "RA", "FF", "L", "EL", "S"], "timestamps": [2.04, 2.16, 2.28, 2.36, 2.52, 2.64, 2.68, 2.76, 2.92, 3.08, 3.40, 3.60, 3.72, 3.88, 4.12, 4.48, 4.64, 4.68, 4.84, 4.96, 5.16, 5.20, 5.32, 5.36, 5.60, 5.72, 5.92, 5.96, 6.08, 6.24, ...]}
Hint: As you can see, the RTF is 11, indicating that it is very slow to run the model with the qemu simulator. Running
on a real RISC-V board should be much faster.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-amy-low.tar.bz2
tar xf vits-piper-en_US-amy-low.tar.bz2
rm vits-piper-en_US-amy-low.tar.bz2
After downloading the model, we can use the following command to run it:
cd /path/to/sherpa-onnx
export PATH=$HOME/qemu:$PATH
export QEMU_LD_PREFIX=$HOME/toolchain/sysroot
export LD_LIBRARY_PATH=$HOME/toolchain/sysroot/lib:$LD_LIBRARY_PATH
qemu-riscv64 build-riscv64-linux-gnu/install/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-amy-low/en_US-amy-low.onnx \
--vits-tokens=./vits-piper-en_US-amy-low/tokens.txt \
--vits-data-dir=./vits-piper-en_US-amy-low/espeak-ng-data \
--output-filename=./a-test.wav \
"Friends fell out often because life was changing so fast. The easiest thing in the␣
˓→world was to lose touch with someone."
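The same model can also be run through the sherpa-onnx Python API. The following is a minimal sketch modelled on the Python TTS examples in the repository; the class and argument names follow those examples and may differ between versions, and it assumes the soundfile package is installed:

# A hedged sketch of the same TTS invocation through the Python API.
import sherpa_onnx
import soundfile as sf  # pip install soundfile

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="./vits-piper-en_US-amy-low/en_US-amy-low.onnx",
            tokens="./vits-piper-en_US-amy-low/tokens.txt",
            data_dir="./vits-piper-en_US-amy-low/espeak-ng-data",
        ),
    ),
)
tts = sherpa_onnx.OfflineTts(config)
audio = tts.generate("The easiest thing in the world was to lose touch with someone.")
sf.write("./a-test.wav", audio.samples, audio.sample_rate)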
The program logs the command line it was invoked with and the TTS configuration, and then writes the generated audio to ./a-test.wav.
If you want to build an Android app, please refer to Android. If you want to build an iOS app, please refer to iOS.
8.3.1 Cannot open shared library libasound_module_conf_pulse.so
If you use Linux and get the above error when trying to use the microphone, please do the following:
1. Locate the file libasound_module_conf_pulse.so on your system. It is usually at:
/usr/lib/x86_64-linux-gnu/alsa-lib/libasound_module_conf_pulse.so
/usr/lib/i386-linux-gnu/alsa-lib/libasound_module_conf_pulse.so
3. Please run:
8.3.3 TTS
Please see How to enable UTF-8 on Windows. You need to use UTF-8 encoding for your system.
Please run:
import sounddevice as sd
File "/mnt/sdb/shared/py311/lib/python3.11/site-packages/sounddevice.py", line 71, in
˓→<module>
go env -w CGO_ENABLED=1
If you are using electron >= 21 and get the following error:
8.3.8 The given version [17] is not supported, only version 1 to 10 is supported in this build
If you have such an error, please find the file onnxruntime.dll in your C drive and try to remove it.
The reason is that you have two copies of onnxruntime.dll on your computer, and the one in your C drive is outdated.
8.4 Python
You can select one of the following methods to install the Python package.
which sherpa-onnx
sherpa-onnx --help
or:
Note: This method installs a version of sherpa-onnx supporting both CUDA and CPU. You need to pass the argument
provider=cuda to use NVIDIA GPU, which always uses GPU 0. Otherwise, it uses CPU by default.
Please use the environment variable CUDA_VISIBLE_DEVICES to control which GPU is mapped to GPU 0.
By default, provider is set to cpu.
Remember to follow https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements to install CUDA 11.8.
If you have issues installing CUDA 11.8, please have a look at https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-11-8.
Note that you don't need sudo permission to install CUDA 11.8.
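For illustration, here is a minimal sketch of how the provider argument is passed through the Python API; the model file names below are placeholders for whichever model you have downloaded:

# A hedged sketch showing how provider is selected.
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="./tokens.txt",      # placeholder paths; use your model files
    encoder="./encoder.onnx",
    decoder="./decoder.onnx",
    joiner="./joiner.onnx",
    provider="cuda",  # "cpu" is the default; "cuda" always uses GPU 0
)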
cmake \
-DSHERPA_ONNX_ENABLE_PYTHON=ON \
-DBUILD_SHARED_LIBS=ON \
..
make -j
export PYTHONPATH=$PWD/../sherpa-onnx/python/:$PWD/lib:$PYTHONPATH
cmake \
-DSHERPA_ONNX_ENABLE_PYTHON=ON \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_ENABLE_CHECK=OFF \
-DSHERPA_ONNX_ENABLE_PORTAUDIO=OFF \
-DSHERPA_ONNX_ENABLE_C_API=OFF \
-DSHERPA_ONNX_ENABLE_WEBSOCKET=OFF \
-DSHERPA_ONNX_ENABLE_GPU=ON \
..
make -j
export PYTHONPATH=$PWD/../sherpa-onnx/python/:$PWD/lib:$PYTHONPATH
Hint: You need to install CUDA toolkit. Otherwise, you would get errors at runtime.
You can refer to https://k2-fsa.github.io/k2/installation/cuda-cudnn.html to install CUDA toolkit.
To check where sherpa-onnx is installed, you can run:

python3 -c "import sherpa_onnx; print(sherpa_onnx.__file__)"
/Users/fangjun/py38/lib/python3.8/site-packages/sherpa_onnx/__init__.py
In this section, we demonstrate how to use the Python API of sherpa-onnx to decode files.
Hint: We only support WAVE files of single channel and each sample should have 16-bit, while the sample rate of
the file can be arbitrary and it does not need to be 16 kHz
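Before walking through the complete examples below, here is a minimal sketch of the decoding loop they implement, modelled on online-decode-files.py; it assumes a 16-bit PCM input file, and the tail padding mirrors what the example scripts do to flush the final frames:

# A minimal sketch of decoding one file with a streaming transducer model.
import wave

import numpy as np
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt",
    encoder="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx",
    decoder="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx",
    joiner="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx",
)

with wave.open("./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav", "rb") as f:
    sample_rate = f.getframerate()  # any rate works; it is resampled internally
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768  # int16 PCM -> floats in [-1, 1]

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
# Add some trailing silence so the last frames are decoded, then finish.
stream.accept_waveform(sample_rate, np.zeros(int(0.66 * sample_rate), dtype=np.float32))
stream.input_finished()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))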
Streaming zipformer
cd /path/to/sherpa-onnx
python3 ./python-api-examples/online-decode-files.py \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/1.wav \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/2.wav \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/3.wav \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/8k.wav
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Started!
Done!
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav
MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW
----------
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/1.wav
ALWAYS ALWAYS
----------
Non-streaming zipformer
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-decode-files.py \
--tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Started!
Done!
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----------
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----------
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----------
num_threads: 1
decoding_method: greedy_search
Wave duration: 4.825 s
Elapsed time: 2.567 s
Real time factor (RTF): 2.567/4.825 = 0.532
Non-streaming paraformer
python3 ./python-api-examples/offline-decode-files.py \
--tokens=./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-03-28/model.onnx \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
Note: You can replace model.onnx with model.int8.onnx to use int8 models for decoding.
Started!
Done!
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav
----------
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav
----------
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav
----------
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
----------
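For reference, the same paraformer model can also be used directly from Python. The following minimal sketch is modelled on offline-decode-files.py; the method names follow the Python examples in the repository and it assumes a 16-bit PCM input file:

# A hedged sketch of offline (non-streaming) decoding with a paraformer model.
import wave

import numpy as np
import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    paraformer="./sherpa-onnx-paraformer-zh-2023-03-28/model.onnx",
    tokens="./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt",
)

with wave.open("./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav", "rb") as f:
    sample_rate = f.getframerate()
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)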
In this section, we demonstrate how to use the Python API of sherpa-onnx for real-time speech recognition with a
microphone.
cd /path/to/sherpa-onnx
python3 ./python-api-examples/speech-recognition-from-microphone-with-endpoint-detection.py \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx
Hint: speech-recognition-from-microphone-with-endpoint-detection.py is from https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speech-recognition-from-microphone-with-endpoint-detection.py
In the above demo, the model files are from csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20
(Bilingual, Chinese + English).
cd /path/to/sherpa-onnx
python3 ./python-api-examples/speech-recognition-from-microphone.py \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx
Hint: Only streaming models are currently supported. Please modify the code for non-streaming models as needed.
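The core loop of the microphone examples looks roughly like the following hedged sketch; the 48 kHz device sample rate is an assumption (sherpa-onnx resamples internally), and the model paths are those of the bilingual model used above:

# A hedged sketch of real-time recognition from the default microphone.
import sounddevice as sd
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt",
    encoder="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx",
    decoder="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx",
    joiner="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx",
)
stream = recognizer.create_stream()
sample_rate = 48000                         # assumed device rate
samples_per_read = int(0.1 * sample_rate)   # read 100 ms of audio at a time

with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as mic:
    while True:
        samples, _ = mic.read(samples_per_read)
        stream.accept_waveform(sample_rate, samples.reshape(-1))
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        print(recognizer.get_result(stream), end="\r", flush=True)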
Type             Example
RTMP             rtmp://localhost/live/livestream
OPUS file        https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition/resolve/main/test_wavs/wenetspeech/DEV_T0000000000.opus
WAVE file        https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition/resolve/main/test_wavs/aishell2/ID0012W0030.wav
Local WAVE file  file:///Users/fangjun/open-source/sherpa-onnx/a.wav
Decode a URL
Hint: The file does not need to be a WAVE file. It can be a file of any format supported by ffmpeg.
python3 ./python-api-examples/speech-recognition-from-url.py \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--url https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition/resolve/main/test_wavs/librispeech/1089-134686-0001.wav
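Under the hood, the example decodes the URL with ffmpeg and streams raw PCM into the recognizer. The following is a hedged sketch of that idea (not the example's exact code); ffmpeg converts the input to 16 kHz, 16-bit, mono PCM on stdout:

# A hedged sketch: decode any URL supported by ffmpeg and recognize it.
import subprocess

import numpy as np
import sherpa_onnx

url = ("https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition/"
       "resolve/main/test_wavs/librispeech/1089-134686-0001.wav")
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", url,
     "-f", "s16le", "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", "-"],
    stdout=subprocess.PIPE,
)

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt",
    encoder="./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx",
    decoder="./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx",
    joiner="./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx",
)
stream = recognizer.create_stream()
while True:
    chunk = ffmpeg.stdout.read(3200)  # 0.1 s of 16 kHz, 16-bit, mono audio
    if not chunk:
        break
    chunk = chunk[: len(chunk) // 2 * 2]  # keep an even number of bytes
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768
    stream.accept_waveform(16000, samples)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
print(recognizer.get_result(stream))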
RTMP
In this example, we use ffmpeg to capture a microphone and push the audio stream to a server using RTMP, and then
we start sherpa-onnx to pull the audio stream from the server for recognition.
We will use srs as the server. Let us first install srs from source:
Asan: Please set up the env MallocNanoZone=0 to disable the warning, see https://stackoverflow.com/a/70209891/17679565
./etc/init.d/srs status
Hint: If you fail to start the srs server, please check the log file ./objs/srs.log for a fix.
First, let us list available recording devices on the current computer with the following command:
We will use the device [1] MacBook Pro Microphone. Note that its index is 1, so we will use -i ":1" in the following command to start recording and push the recorded audio stream to the server at rtmp://localhost/live/livestream.
ffmpeg -hide_banner -f avfoundation -i ":1" -acodec aac -ab 64k -ar 16000 -ac 1 -f flv rtmp://localhost/live/livestream
Now we can start sherpa-onnx to pull the audio stream from rtmp://localhost/live/livestream for speech recognition.
python3 ./python-api-examples/speech-recognition-from-url.py \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--url rtmp://localhost/live/livestream
You should see the recognition result printed to the console as you speak.
Hint: You can replace localhost with your server IP and start sherpa-onnx on many computers at the same time to pull the audio stream from the address rtmp://your_server_ip/live/livestream.
This section describes how to use the Python streaming WebSocket server of sherpa-onnx for speech recognition.
Hint: The server supports multiple clients connecting at the same time.
Hint:
If you don't use an X.509 certificate, due to security restrictions imposed by the browser, you are only allowed to use the domain localhost to access the server if you want to access the microphone in the browser. That is, you can only use
http://localhost:port
to access the server. You cannot use http://0.0.0.0:port, http://127.0.0.1:port, or http://public_ip:port.
You can use the following command to generate a self-signed certificate:
cd python-api-examples/web
./generate-certificate.py
The above commands will generate 3 files. You only need to use the file cert.pem. When starting the server, you pass
the following argument:
--certificate=./python-api-examples/web/cert.pem
Please refer to Pre-trained models to download a streaming model before you continue.
We will use csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26 (English) as an example.
First, let us download it:
cd /path/to/sherpa-onnx/
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
cd /path/to/sherpa-onnx/
python3 ./python-api-examples/streaming_server.py \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--port 6006
The server prints its configuration on startup and then the following warning:

Since you are not providing a certificate, you cannot use your microphone from within the browser using public IP addresses. Only localhost can be used. You also cannot use 0.0.0.0 or 127.0.0.1.

The server is now reachable at http://localhost:6006.
We can use the following two methods to interact with the server:
Description                                        URL
Send a file for decoding                           https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/online-websocket-client-decode-file.py
Send audio samples from a microphone for decoding  https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speech-recognition-from-microphone.py
Hint: The example file supports only *.wav files with a single channel, and each sample should be of type int16_t. The sample rate does not need to be 16000 Hz; e.g., it can be 48000 Hz, 8000 Hz, or some other value.
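A minimal Python sketch to verify that a file satisfies these requirements (standard-library wave module only; the path is just an example):

# Verify: single channel and 16-bit (int16_t) samples; any sample rate is fine.
import wave

with wave.open("./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav", "rb") as f:
    assert f.getnchannels() == 1, "expected a single channel"
    assert f.getsampwidth() == 2, "expected 16-bit (int16_t) samples"
    print("sample rate:", f.getframerate())  # 16000, 48000, 8000, ... all fine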
cd /path/to/sherpa-onnx
python3 ./python-api-examples/online-websocket-client-decode-file.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
˓→"segment": 0}
˓→"segment": 0}
˓→", "segment": 0}
cd /path/to/sherpa-onnx
python3 ./python-api-examples/online-websocket-client-microphone.py \
--server-addr localhost \
--server-port 6006
If you speak, you will see the recognition result returned by the server.
Use a browser
Click the button Click me to connect to connect to the server and then you can click the Streaming-Record
button to start recording. You should see the decoded results as you speak.
colab
We provide a colab notebook for you to try the Python streaming websocket server example of sherpa-onnx.
This section describes how to use the Python non-streaming WebSocket server of sherpa-onnx for speech recognition.
Hint: The server supports multiple clients connecting at the same time.
Description                        Model
Non-streaming transducer           csukuangfj/sherpa-onnx-zipformer-en-2023-06-26 (English)
Non-streaming paraformer           csukuangfj/sherpa-onnx-paraformer-zh-2023-03-28 (Chinese + English)
Non-streaming CTC model from NeMo  stt_en_conformer_ctc_medium
Non-streaming Whisper tiny.en      tiny.en
Non-streaming transducer
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
tar xf sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
python3 ./python-api-examples/non_streaming_server.py \
--encoder ./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder ./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner ./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
--tokens ./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
--port 6006
python3 ./python-api-examples/offline-websocket-client-decode-files-paralell.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
The client first logs the list of files to decode and then prints the results:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2023-08-11 18:19:26,609 INFO [offline-websocket-client-decode-files-paralell.py:131] ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
python3 ./python-api-examples/offline-websocket-client-decode-files-sequential.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
Non-streaming paraformer
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
tar xf sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
python3 ./python-api-examples/non_streaming_server.py \
--paraformer ./sherpa-onnx-paraformer-zh-2023-03-28/model.int8.onnx \
--tokens ./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--port 6006
The client logs the list of files to decode and then prints the recognition results (in Chinese for this model).
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
tar xf sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
python3 ./python-api-examples/non_streaming_server.py \
--nemo-ctc ./sherpa-onnx-nemo-ctc-en-conformer-medium/model.onnx \
--tokens ./sherpa-onnx-nemo-ctc-en-conformer-medium/tokens.txt \
--port 6006
The client logs the list of files to decode and then prints the results:
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
yet these thoughts affected hester pryne less with hope than apprehension
2023-08-11 18:31:33,117 INFO [offline-websocket-client-decode-files-paralell.py:131] ./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/1.wav
god as a direct consequence of the sin which man thus punished had given her a lovely child whose place was on that same dishonored bosom to connect her parent for ever with the race and descent of mortals and to be finally a blessed soul in heaven
The sequential client logs the same list of files and then prints:
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
god as a direct consequence of the sin which man thus punished had given her a lovely child whose place was on that same dishonored bosom to connect her parent for ever with the race and descent of mortals and to be finally a blessed soul in heaven
yet these thoughts affected hester pryne less with hope than apprehension
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
tar xf sherpa-onnx-whisper-tiny.en.tar.bz2
python3 ./python-api-examples/non_streaming_server.py \
--whisper-encoder=./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.onnx \
--whisper-decoder=./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.onnx \
--tokens=./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt \
--port 6006
python3 ./python-api-examples/offline-websocket-client-decode-files-paralell.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-whisper-tiny.en/test_wavs/0.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/8k.wav
The client logs the list of files to decode and then prints the results:
After early nightfall, the yellow lamps would light up here and there, the squalid quarter of the brothels.
Yet these thoughts affected Hester Prin less with hope than apprehension.
2023-08-11 18:35:31,592 INFO [offline-websocket-client-decode-files-paralell.py:131] ./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav
God, as a direct consequence of the sin which man thus punished, had given her a lovely child, whose place was on that same dishonored bosom to connect her parent forever with the race and descent of mortals, and to be finally a blessed soul in heaven.
python3 ./python-api-examples/offline-websocket-client-decode-files-sequential.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-whisper-tiny.en/test_wavs/0.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/8k.wav
The client logs the list of files to decode and then prints the results:
After early nightfall, the yellow lamps would light up here and there, the squalid quarter of the brothels.
God, as a direct consequence of the sin which man thus punished, had given her a lovely child, whose place was on that same dishonored bosom to connect her parent forever with the race and descent of mortals, and to be finally a blessed soul in heaven.
Yet these thoughts affected Hester Prin less with hope than apprehension.
colab
We provide a colab notebook for you to try the Python non-streaming websocket server example of sherpa-onnx.
8.5 C API
Before using the C API of sherpa-onnx, we need to first build the required libraries. You can choose to build either static or shared libraries.
Assume that we want to put the library files and header files in the directory /tmp/sherpa-onnx/shared:
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build-shared
cd build-shared
cmake \
-DSHERPA_ONNX_ENABLE_C_API=ON \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_INSTALL_PREFIX=/tmp/sherpa-onnx/shared \
..
make -j6
make install
$ tree /tmp/sherpa-onnx/shared/
/tmp/sherpa-onnx/shared
bin
sherpa-onnx
sherpa-onnx-keyword-spotter
sherpa-onnx-keyword-spotter-microphone
sherpa-onnx-microphone
sherpa-onnx-microphone-offline
sherpa-onnx-microphone-offline-audio-tagging
sherpa-onnx-microphone-offline-speaker-identification
sherpa-onnx-offline
sherpa-onnx-offline-audio-tagging
sherpa-onnx-offline-language-identification
sherpa-onnx-offline-parallel
sherpa-onnx-offline-punctuation
sherpa-onnx-offline-tts
sherpa-onnx-offline-tts-play
sherpa-onnx-offline-websocket-server
sherpa-onnx-online-punctuation
sherpa-onnx-online-websocket-client
sherpa-onnx-online-websocket-server
sherpa-onnx-vad-microphone
sherpa-onnx-vad-microphone-offline-asr
include
sherpa-onnx
c-api
c-api.h
lib
libonnxruntime.1.17.1.dylib
5 directories, 25 files
$ tree /tmp/sherpa-onnx/shared/
/tmp/sherpa-onnx/shared
bin
sherpa-onnx
sherpa-onnx-alsa
sherpa-onnx-alsa-offline
sherpa-onnx-alsa-offline-audio-tagging
sherpa-onnx-alsa-offline-speaker-identification
sherpa-onnx-keyword-spotter
sherpa-onnx-keyword-spotter-alsa
sherpa-onnx-offline
sherpa-onnx-offline-audio-tagging
sherpa-onnx-offline-language-identification
sherpa-onnx-offline-parallel
sherpa-onnx-offline-punctuation
sherpa-onnx-offline-tts
sherpa-onnx-offline-tts-play-alsa
sherpa-onnx-offline-websocket-server
sherpa-onnx-online-punctuation
sherpa-onnx-online-websocket-client
sherpa-onnx-online-websocket-server
sherpa-onnx-vad-alsa
include
sherpa-onnx
c-api
c-api.h
lib
libonnxruntime.so
libsherpa-onnx-c-api.so
sherpa-onnx.pc
6 directories, 23 files
Assume that we want to put library files and header files in the directory /tmp/sherpa-onnx/static:
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build-static
cd build-static
cmake \
-DSHERPA_ONNX_ENABLE_C_API=ON \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_INSTALL_PREFIX=/tmp/sherpa-onnx/static \
..
make -j6
make install
$ tree /tmp/sherpa-onnx/static/
/tmp/sherpa-onnx/static
bin
sherpa-onnx
sherpa-onnx-keyword-spotter
sherpa-onnx-keyword-spotter-microphone
sherpa-onnx-microphone
sherpa-onnx-microphone-offline
sherpa-onnx-microphone-offline-audio-tagging
sherpa-onnx-microphone-offline-speaker-identification
sherpa-onnx-offline
sherpa-onnx-offline-audio-tagging
sherpa-onnx-offline-language-identification
sherpa-onnx-offline-parallel
sherpa-onnx-offline-punctuation
sherpa-onnx-offline-tts
sherpa-onnx-offline-tts-play
sherpa-onnx-offline-websocket-server
sherpa-onnx-online-punctuation
sherpa-onnx-online-websocket-client
sherpa-onnx-online-websocket-server
sherpa-onnx-vad-microphone
sherpa-onnx-vad-microphone-offline-asr
include
sherpa-onnx
c-api
c-api.h
lib
libespeak-ng.a
libkaldi-decoder-core.a
libkaldi-native-fbank-core.a
libonnxruntime.a
libpiper_phonemize.a
libsherpa-onnx-c-api.a
libsherpa-onnx-core.a
libsherpa-onnx-fst.a
libsherpa-onnx-fstfar.a
libsherpa-onnx-kaldifst-core.a
5 directories, 35 files
$ tree /tmp/sherpa-onnx/static/
/tmp/sherpa-onnx/static
bin
sherpa-onnx
sherpa-onnx-alsa
sherpa-onnx-alsa-offline
sherpa-onnx-alsa-offline-audio-tagging
sherpa-onnx-alsa-offline-speaker-identification
sherpa-onnx-keyword-spotter
sherpa-onnx-keyword-spotter-alsa
sherpa-onnx-keyword-spotter-microphone
sherpa-onnx-microphone
sherpa-onnx-microphone-offline
sherpa-onnx-microphone-offline-audio-tagging
sherpa-onnx-microphone-offline-speaker-identification
sherpa-onnx-offline
sherpa-onnx-offline-audio-tagging
sherpa-onnx-offline-language-identification
sherpa-onnx-offline-parallel
sherpa-onnx-offline-punctuation
sherpa-onnx-offline-tts
sherpa-onnx-offline-tts-play
sherpa-onnx-offline-tts-play-alsa
sherpa-onnx-offline-websocket-server
sherpa-onnx-online-punctuation
sherpa-onnx-online-websocket-client
sherpa-onnx-online-websocket-server
sherpa-onnx-vad-alsa
sherpa-onnx-vad-microphone
sherpa-onnx-vad-microphone-offline-asr
include
sherpa-onnx
c-api
c-api.h
lib
libespeak-ng.a
libkaldi-decoder-core.a
libkaldi-native-fbank-core.a
libonnxruntime.a
libpiper_phonemize.a
libsherpa-onnx-c-api.a
libsherpa-onnx-core.a
libsherpa-onnx-fst.a
libsherpa-onnx-fstfar.a
6 directories, 42 files
export PKG_CONFIG_PATH=/tmp/sherpa-onnx/static:$PKG_CONFIG_PATH
cd ./c-api-examples
gcc -o decode-file-c-api $(pkg-config --cflags sherpa-onnx) ./decode-file-c-api.c $(pkg-
˓→config --libs sherpa-onnx)
./decode-file-c-api --help
export PKG_CONFIG_PATH=/tmp/sherpa-onnx/shared:$PKG_CONFIG_PATH
cd ./c-api-examples
gcc -o decode-file-c-api $(pkg-config --cflags sherpa-onnx) ./decode-file-c-api.c $(pkg-
˓→config --libs sherpa-onnx)
./decode-file-c-api --help
8.5.3 colab
In the following, we describe how to build the JNI interface for macOS. It is applicable for both macOS x64 and arm64.
For Linux users, please refer to Build JNI interface (Linux).
Hint: For Windows users, you have to modify the commands by yourself.
Check your environment with:
gcc --version
java -version
javac -version
The above three commands print the following output on my computer. You don't need to use the exact versions that I am using.
javac 19.0.1
Build sherpa-onnx
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake \
-DSHERPA_ONNX_ENABLE_PYTHON=OFF \
-DSHERPA_ONNX_ENABLE_TESTS=OFF \
-DSHERPA_ONNX_ENABLE_CHECK=OFF \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_ENABLE_PORTAUDIO=OFF \
-DSHERPA_ONNX_ENABLE_JNI=ON \
..
make -j4
ls -lh lib
total 8024
-rwxr-xr-x 1 fangjun staff 3.9M Aug 18 19:34 libsherpa-onnx-jni.dylib
If you don’t want to build JNI libs by yourself, please download pre-built JNI libs from
https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/jni
For Chinese users, please use
https://hf-mirror.com/csukuangfj/sherpa-onnx-libs/tree/main/jni
Please always use the latest version. In the following, we describe how to download the version 1.10.23.
Intel CPU (x86_64)
Apple Silicon (arm64)
wget https://huggingface.co/csukuangfj/sherpa-onnx-libs/resolve/main/jni/sherpa-onnx-v1.10.23-osx-x86_64-jni.tar.bz2
tar xf sherpa-onnx-v1.10.23-osx-x86_64-jni.tar.bz2
rm sherpa-onnx-v1.10.23-osx-x86_64-jni.tar.bz2
wget https://huggingface.co/csukuangfj/sherpa-onnx-libs/resolve/main/jni/sherpa-onnx-v1.10.23-osx-arm64-jni.tar.bz2
tar xf sherpa-onnx-v1.10.23-osx-arm64-jni.tar.bz2
rm sherpa-onnx-v1.10.23-osx-arm64-jni.tar.bz2
# For x86_64
ls -lh sherpa-onnx-v1.10.23-osx-x86_64-jni/lib
total 30M
-rw-r--r-- 1 fangjun fangjun 26M Aug 25 00:31 libonnxruntime.1.17.1.dylib
lrwxrwxrwx 1 fangjun fangjun 27 Aug 25 00:35 libonnxruntime.dylib -> libonnxruntime.1.17.1.dylib
# For arm64
ls -lh sherpa-onnx-v1.10.23-osx-arm64-jni/lib/
total 27M
-rw-r--r-- 1 fangjun fangjun 23M Aug 24 23:56 libonnxruntime.1.17.1.dylib
lrwxrwxrwx 1 fangjun fangjun 27 Aug 24 23:59 libonnxruntime.dylib -> libonnxruntime.1.17.1.dylib
In the following, we describe how to build the JNI interface for Linux. It is applicable for both Linux x64 and arm64.
For macOS users, please refer to Build JNI interface (macOS)
Hint: For Windows users, you have to modify the commands by yourself.
gcc --version
java -version
javac -version
The above three commands print the following output on my computer. You don’t need to use the exact versions as I
am using.
javac 17.0.11
Build sherpa-onnx
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake \
-DSHERPA_ONNX_ENABLE_GPU=$SHERPA_ONNX_ENABLE_GPU \
-DSHERPA_ONNX_ENABLE_PYTHON=OFF \
-DSHERPA_ONNX_ENABLE_TESTS=OFF \
-DSHERPA_ONNX_ENABLE_CHECK=OFF \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_ENABLE_PORTAUDIO=OFF \
-DSHERPA_ONNX_ENABLE_JNI=ON \
..
make -j4
ls -lh lib
fangjun@ubuntu23-04:~/sherpa-onnx/build$ ls _deps/onnxruntime-src/lib/
libonnxruntime.so
If you don’t want to build JNI libs by yourself, please download pre-built JNI libs from
https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/jni
For Chinese users, please use
https://hf-mirror.com/csukuangfj/sherpa-onnx-libs/tree/main/jni
Please always use the latest version. In the following, we describe how to download the version 1.10.23.
wget https://huggingface.co/csukuangfj/sherpa-onnx-libs/resolve/main/jni/sherpa-onnx-v1.10.23-linux-x64-jni.tar.bz2
tar xf sherpa-onnx-v1.10.23-linux-x64-jni.tar.bz2
rm sherpa-onnx-v1.10.23-linux-x64-jni.tar.bz2
ls -lh sherpa-onnx-v1.10.23-linux-x64-jni/lib/
total 19M
-rw-r--r-- 1 fangjun fangjun 15M Aug 24 22:18 libonnxruntime.so
-rwxr-xr-x 1 fangjun fangjun 4.2M Aug 24 22:25 libsherpa-onnx-jni.so
If you don’t want to build JNI libs by yourself, please download pre-built JNI libs from
https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/jni
For Chinese users, please use
https://hf-mirror.com/csukuangfj/sherpa-onnx-libs/tree/main/jni
Please always use the latest version. In the following, we describe how to download the version 1.10.23.
wget https://huggingface.co/csukuangfj/sherpa-onnx-libs/resolve/main/jni/sherpa-onnx-v1.10.23-win-x64-jni.tar.bz2
tar xf sherpa-onnx-v1.10.23-win-x64-jni.tar.bz2
rm sherpa-onnx-v1.10.23-win-x64-jni.tar.bz2
ls -lh sherpa-onnx-v1.10.23-win-x64-jni/lib/
total 14M
-rwxr-xr-x 1 fangjun fangjun 11M Aug 24 15:41 onnxruntime.dll
-rwxr-xr-x 1 fangjun fangjun 23K Aug 24 15:41 onnxruntime_providers_shared.dll
-rwxr-xr-x 1 fangjun fangjun 3.1M Aug 24 15:48 sherpa-onnx-jni.dll
-rw-r--r-- 1 fangjun fangjun 51K Aug 24 15:47 sherpa-onnx-jni.lib
Note: Please see the end of this page for how to download pre-built jar.
cd sherpa-onnx/sherpa-onnx/java-api
ls -lh
total 8.0K
-rw-rw-r-- 1 fangjun fangjun 2.5K May 8 06:17 Makefile
drwxrwxr-x 3 fangjun fangjun 4.0K Mar 1 04:29 src
make
total 60K
drwxrwxr-x 3 fangjun fangjun 4.0K May 15 03:58 com
-rw-rw-r-- 1 fangjun fangjun 53K May 15 03:59 sherpa-onnx.jar
If you don’t want to build jar by yourself, you can download pre-built jar from from
https://huggingface.co/csukuangfj/sherpa-onnx-libs/tree/main/jni
For Chinese users, please use
https://hf-mirror.com/csukuangfj/sherpa-onnx-libs/tree/main/jni
Please always use the latest version. In the following, we describe how to download the version 1.10.2.
wget https://huggingface.co/csukuangfj/sherpa-onnx-libs/resolve/main/jni/sherpa-onnx-v1.10.2.jar
8.6.5 Examples
For using Javascript in the browser, please see our WebAssembly doc.
This section describes how to use sherpa-onnx in Node with Javascript API.
8.7.1 Install
8.7.2 Examples
Hint: For Windows users, you may need to adapt the commands yourself.
8.8.2 Examples
8.9.1 Build
Please use the following script to build sherpa-onnx for Swift API:
https://github.com/k2-fsa/sherpa-onnx/blob/master/build-swift-macos.sh
The following is an example command:
8.9.2 Examples
8.10 Go API
• Decode a file with non-streaming models: https://github.com/k2-fsa/sherpa-onnx/tree/master/go-api-examples/non-streaming-decode-files
• Decode a file with streaming models: https://github.com/k2-fsa/sherpa-onnx/tree/master/go-api-examples/streaming-decode-files
• Real-time speech recognition from a microphone: https://github.com/k2-fsa/sherpa-onnx/tree/master/go-api-examples/real-time-speech-recognition-from-microphone
One thing to note is that we have provided pre-built libraries for Go so that you don’t need to build sherpa-onnx by
yourself when using the Go API.
To make supporting multiple platforms easier, we split the Go API of sherpa-onnx into multiple per-platform packages. To simplify the usage, we have also provided a single Go package for sherpa-onnx that supports multiple operating systems.
It can be found at
https://github.com/k2-fsa/sherpa-onnx-go
You can use the following import to import sherpa-onnx-go into your Go project:
import (
sherpa "github.com/k2-fsa/sherpa-onnx-go/sherpa_onnx"
)
Note: Before you continue, please make sure you have installed Go. If not, please follow https://go.dev/doc/install to
install Go.
Usage of ./non-streaming-decode-files:
--debug int Whether to show debug message
--decoder string Path to the decoder model
--decoding-method string Decoding method. Possible values: greedy_search, modified_beam_search (default "greedy_search")
--model-type string Optional. Used for loading the model in a faster way
--nemo-ctc string Path to the NeMo CTC model
--num-threads int Number of threads for computing (default 1)
--paraformer string Path to the paraformer model
--provider string Provider to use (default "cpu")
--tokens string Path to the tokens file
pflag: help requested
Congratulations! You have successfully built your first Go API example for speech recognition.
Note: If you are using Windows and don't see any output after running ./non-streaming-decode-files --help, please copy *.dll from https://github.com/k2-fsa/sherpa-onnx-go-windows/tree/master/lib/x86_64-pc-windows-gnu (for Win64) or https://github.com/k2-fsa/sherpa-onnx-go-windows/tree/master/lib/i686-pc-windows-gnu (for Win32) to the directory sherpa-onnx/go-api-examples/non-streaming-decode-files.
Non-streaming transducer
cd sherpa-onnx/go-api-examples/non-streaming-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
tar xvf sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
./non-streaming-decode-files \
--encoder ./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder ./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner ./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
--tokens ./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
--model-type transducer \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav
Non-streaming paraformer
cd sherpa-onnx/go-api-examples/non-streaming-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
tar xvf sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
./non-streaming-decode-files \
--paraformer ./sherpa-onnx-paraformer-zh-2023-03-28/model.int8.onnx \
--tokens ./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--model-type paraformer \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav
Non-streaming NeMo CTC
cd sherpa-onnx/go-api-examples/non-streaming-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
tar xvf sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
./non-streaming-decode-files \
--nemo-ctc ./sherpa-onnx-nemo-ctc-en-conformer-medium/model.onnx \
--tokens ./sherpa-onnx-nemo-ctc-en-conformer-medium/tokens.txt \
--model-type nemo_ctc \
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/0.wav
Usage of ./streaming-decode-files:
--debug int Whether to show debug message
--decoder string Path to the decoder model
--decoding-method string Decoding method. Possible values: greedy_search, modified_beam_search (default "greedy_search")
Note: If you are using Windows and don't see any output after running ./streaming-decode-files --help, please copy *.dll from https://github.com/k2-fsa/sherpa-onnx-go-windows/tree/master/lib/x86_64-pc-windows-gnu (for Win64) or https://github.com/k2-fsa/sherpa-onnx-go-windows/tree/master/lib/i686-pc-windows-gnu (for Win32) to the directory sherpa-onnx/go-api-examples/streaming-decode-files.
Streaming transducer
cd sherpa-onnx/go-api-examples/streaming-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
./streaming-decode-files \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--model-type zipformer2 \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
# for macOS
brew install portaudio
export PKG_CONFIG_PATH=/usr/local/Cellar/portaudio/19.7.0
# for Ubuntu
sudo apt-get install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0
# for macOS
-I/usr/local/Cellar/portaudio/19.7.0/include -L/usr/local/Cellar/portaudio/19.7.0/lib -lportaudio -framework CoreAudio -framework AudioToolbox -framework
# for Ubuntu
-pthread -lportaudio -lasound -lm -lpthread
Streaming transducer
cd sherpa-onnx/go-api-examples/real-time-speech-recognition-from-microphone
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
./real-time-speech-recognition-from-microphone \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--model-type zipformer2
8.10.4 colab
We provide a colab notebook for you to try the Go API examples of sherpa-onnx.
8.11 C# API
• Decode a file with non-streaming models: https://github.com/k2-fsa/sherpa-onnx/tree/master/dotnet-examples/offline-decode-files
• Decode a file with streaming models: https://github.com/k2-fsa/sherpa-onnx/tree/master/dotnet-examples/online-decode-files
• Real-time speech recognition from a microphone: https://github.com/k2-fsa/sherpa-onnx/tree/master/dotnet-examples/speech-recognition-from-microphone
One thing to note is that we have provided pre-built libraries for C# so that you don’t need to build sherpa-onnx by
yourself when using the C# API.
In the following, we describe how to run our provided C# API examples.
Note: Before you continue, please make sure you have installed .NET. If not, please follow https://dotnet.microsoft.com/en-us/download to install it.
# Zipformer
dotnet run \
--tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx \
--files ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-transducer/index.html
to download pre-trained non-streaming zipformer models.
# Paraformer
dotnet run \
--tokens=./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-03-28/model.onnx \
--files ./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-paraformer/index.html
to download pre-trained paraformer models.
# NeMo CTC
dotnet run \
--tokens=./sherpa-onnx-nemo-ctc-en-conformer-medium/tokens.txt \
--nemo-ctc=./sherpa-onnx-nemo-ctc-en-conformer-medium/model.onnx \
--num-threads=1 \
--files ./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/0.wav \
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/1.wav \
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/8k.wav
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-ctc/index.html
to download pre-trained NeMo CTC models.
Non-streaming transducer
cd sherpa-onnx/dotnet-examples/offline-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
tar xvf sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:117 Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
--------------------
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
--------------------
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
--------------------
Non-streaming paraformer
cd sherpa-onnx/dotnet-examples/offline-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
tar xvf sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:117 Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
--------------------
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav
--------------------
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav
--------------------
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
--------------------
Non-streaming NeMo CTC
cd sherpa-onnx/dotnet-examples/offline-decode-files
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
tar xvf sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:117 Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
--------------------
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/0.wav
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
--------------------
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/1.wav
god as a direct consequence of the sin which man thus punished had given her a lovely child whose place was on that same dishonored bosom to connect her parent for ever with the race and descent of mortals and to be finally a blessed soul in heaven
--------------------
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/8k.wav
yet these thoughts affected hester pryne less with hope than apprehension
--------------------
dotnet run \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx \
--num-threads=2 \
--decoding-method=modified_beam_search \
--debug=false \
--files ./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/1.wav
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/index.html
to download pre-trained streaming models.
Streaming transducer
cd sherpa-onnx/dotnet-examples/online-decode-files/
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
dotnet run \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--files ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/8k.wav
/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/features.cc:AcceptWaveform:76 Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
--------------------
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
--------------------
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
--------------------
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/8k.wav
--encoder ./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.onnx \
--decoder ./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx \
--joiner ./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.onnx \
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/index.html
to download pre-trained streaming models.
Streaming transducer
cd sherpa-onnx/dotnet-examples/speech-recognition-from-microphone
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
dotnet run \
--encoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder ./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner ./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
--tokens ./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt
0: THIS IS A TEST
1: THIS IS A SECOND TEST
8.11.4 colab
We provide a colab notebook for you to try the C# API examples of sherpa-onnx.
8.12 Object Pascal
Hint: For macOS, both Apple Silicon (i.e., macOS arm64, M1/M2/M3) and Intel chips are supported.
Note: We will support text-to-speech, audio tagging, keyword spotting, speaker recognition, speech identification, and spoken language identification with Object Pascal later.
In the following, we describe how to use the Object Pascal API to decode files.
We use macOS below as an example. You can adapt it for Linux and Windows.
Hint: We support both static link and dynamic link; the example below uses dynamic link. You can pass -DBUILD_SHARED_LIBS=OFF to cmake if you want to use static link.
On Windows, only dynamic link is supported.
fpc -h
mkdir -p $HOME/open-source
cd $HOME/open-source
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=./install \
..
cmake --build . --target install
ls -lh install/lib
cd $HOME/open-source/sherpa-onnx
cd pascal-api-examples/non-streaming-asr/
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
tar xvf sherpa-onnx-whisper-tiny.en.tar.bz2
fpc \
-dSHERPA_ONNX_USE_SHARED_LIBS \
-Fu$HOME/open-source/sherpa-onnx/sherpa-onnx/pascal-api \
-Fl$HOME/open-source/sherpa-onnx/build/install/lib \
./whisper.pas
The output logs of the above fpc command are given below:
After running the above fpc command, we will find an executable file whisper in the current directory, i.e., $HOME/open-source/sherpa-onnx/pascal-api-examples/non-streaming-asr/whisper:
If we run it:
Abort trap: 6
early, night, fall, ,, the, yellow, lamps, would, light, up, here, and, there, the, squ, alid, quarter, of, the, bro, the, ls, .], Timestamps := [])
NumThreads 1
Elapsed 0.803 s
Wave duration 6.625 s
RTF = 0.803/6.625 = 0.121
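The RTF (real-time factor) above is simply the decoding time divided by the audio duration; a value below 1 means decoding runs faster than real time. As a quick check in Python:

elapsed = 0.803    # seconds spent decoding (from the log above)
duration = 6.625   # seconds of audio
print(f"RTF = {elapsed / duration:.3f}")  # RTF = 0.121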
Hint: If you are using Linux, please replace DYLD_LIBRARY_PATH with LD_LIBRARY_PATH.
Congratulations! You have successfully used the Object Pascal API with Whisper for speech recognition!
You can find more examples at:
https://github.com/k2-fsa/sherpa-onnx/tree/master/pascal-api-examples
We provide a colab notebook for you to try this section step by step.
8.13 Lazarus
We also provide examples for developing with https://www.lazarus-ide.org/ using Object Pascal.
We provide support for the following platforms and architectures:
• Linux-x64
• Windows-x64
• macOS-x64
• macOS-arm64
This page lists some pre-built apps using Lazarus with Object Pascal.
• Generate subtitles: https://k2-fsa.github.io/sherpa/onnx/lazarus/download-generated-subtitles.html
This page describes how to run the code in the following directory:
https://github.com/k2-fsa/sherpa-onnx/tree/master/lazarus-examples/generate_subtitles
The same code can be compiled without any modifications for different operating systems and architectures.
That is WOCA: write once, compile anywhere.
The following screenshots give an example of that.
Linux x64 screenshot
Windows x64 screenshot
macOS x64 screenshot
sherpa-onnx is implemented in C++. To use it with Object Pascal, we have to get either the static library or the dynamic
library for sherpa-onnx.
To achieve that, you can either build sherpa-onnx from source or download pre-built libraries from
https://github.com/k2-fsa/sherpa-onnx/releases
mkdir -p $HOME/open-source/
cd $HOME/open-source/
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=./install \
..
cmake --build . --target install --config Release
mkdir -p $HOME/open-source/
cd $HOME/open-source/
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build-static
cd build-static
cmake \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=./install \
..
cmake --build . --target install --config Release
Caution:
• For building shared libraries, the build directory must be build.
• For building static libraries, the build directory must be build-static.
If you want to learn why there are such constraints, please search for build-static in the file generate_subtitles.lpi.
If you don't want to build sherpa-onnx from source, please download pre-built libraries from https://github.com/k2-fsa/sherpa-onnx/releases.
We suggest that you always use the latest release.
macOS
Shared libraries:
• libsherpa-onnx-c-api.dylib
• libonnxruntime.1.17.1.dylib
Static libraries:
• libsherpa-onnx-c-api.a
• libsherpa-onnx-core.a
• libkaldi-decoder-core.a
• libsherpa-onnx-kaldifst-core.a
• libsherpa-onnx-fstfar.a
• libsherpa-onnx-fst.a
• libkaldi-native-fbank-core.a
• libpiper_phonemize.a
• libespeak-ng.a
• libucd.a
• libonnxruntime.a
• libssentencepiece_core.a
If you download shared libraries, please create a build directory inside the sherpa-onnx project directory and put
the library files into build/install/lib. An example on my macOS is given below:
(py38) fangjuns-MacBook-Pro:sherpa-onnx fangjun$ pwd
/Users/fangjun/open-source/sherpa-onnx
(py38) fangjuns-MacBook-Pro:sherpa-onnx fangjun$ ls -lh build/install/lib
total 59696
-rw-r--r-- 1 fangjun staff 25M Aug 14 14:09 libonnxruntime.1.17.1.dylib
lrwxr-xr-x 1 fangjun staff 27B Aug 14 14:18 libonnxruntime.dylib -> libonnxruntime.1.17.1.dylib
If you download static libraries, please create a build-static directory inside the sherpa-onnx project directory
and put the library files into build-static/install/lib. An example on my macOS is given below:
(py38) fangjuns-MacBook-Pro:sherpa-onnx fangjun$ pwd
/Users/fangjun/open-source/sherpa-onnx
(py38) fangjuns-MacBook-Pro:sherpa-onnx fangjun$ ls -lh build-static/install/lib
total 138176
-rw-r--r-- 1 fangjun staff 438K Aug 15 15:03 libespeak-ng.a
-rw-r--r-- 1 fangjun staff 726K Aug 15 15:03 libkaldi-decoder-core.a
-rw-r--r-- 1 fangjun staff 198K Aug 15 15:03 libkaldi-native-fbank-core.a
-rw-r--r-- 1 fangjun staff 56M Aug 14 14:25 libonnxruntime.a
-rw-r--r-- 1 fangjun staff 421K Aug 15 15:03 libpiper_phonemize.a
-rw-r--r-- 1 fangjun staff 87K Aug 15 15:03 libsherpa-onnx-c-api.a
-rw-r--r-- 1 fangjun staff 5.7M Aug 15 15:03 libsherpa-onnx-core.a
-rw-r--r-- 1 fangjun staff 2.3M Aug 15 15:03 libsherpa-onnx-fst.a
-rw-r--r-- 1 fangjun staff 30K Aug 15 15:03 libsherpa-onnx-fstfar.a
-rw-r--r-- 1 fangjun staff 1.6M Aug 15 15:03 libsherpa-onnx-kaldifst-core.a
-rw-r--r-- 1 fangjun staff 131K Aug 15 15:03 libsherpa-onnx-portaudio_static.a
-rw-r--r-- 1 fangjun staff 147K Aug 15 15:03 libssentencepiece_core.a
-rw-r--r-- 1 fangjun staff 197K Aug 15 15:03 libucd.a
After building, you should find the following files inside the directory generate_subtitles:
macOS
Windows
Linux
(py38) fangjuns-MacBook-Pro:generate_subtitles fangjun$ pwd
/Users/fangjun/open-source/sherpa-onnx/lazarus-examples/generate_subtitles
(py38) fangjuns-MacBook-Pro:generate_subtitles fangjun$ ls -lh generate_subtitles generate_subtitles.app/
generate_subtitles.app/:
total 0
drwxr-xr-x 6 fangjun staff 192B Aug 14 23:01 Contents
fangjun@M-0LQSDCC2RV398 C:\Users\fangjun\open-source\sherpa-onnx\lazarus-examples\generate_subtitles>dir generate_subtitles.exe
Volume in drive C is
Volume Serial Number is 8E17-A21F
Directory of C:\Users\fangjun\open-source\sherpa-onnx\lazarus-examples\generate_subtitles
cd lazarus-examples/generate_subtitles
ls -lh generate_subtitles
Now you can start the generated executable generate_subtitles, and you should see a window like the screenshot listed at the start of this section.
If you get any issues about shared libraries not being found, please copy the shared library files from build/install/lib to the directory lazarus-examples/generate_subtitles, or set the environment variable DYLD_LIBRARY_PATH (for macOS) or LD_LIBRARY_PATH (for Linux), e.g., export DYLD_LIBRARY_PATH=$PWD/../../build/install/lib:$DYLD_LIBRARY_PATH.
Download models
The generated executable expects that there are model files located in the same directory.
cd lazarus-examples/generate_subtitles
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
The executable expects a non-streaming speech recognition model. Currently, we support the following types of models:
• Whisper
• Moonshine
• Zipformer transducer
• NeMo transducer
• SenseVoice
• Paraformer
• TeleSpeech CTC
You can download them from https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models
Note that you have to rename the model files after downloading.
Expected filenames
Whisper
• tokens.txt
• whisper-encoder.onnx
• whisper-decoder.onnx
Moonshine
• tokens.txt
• moonshine-preprocessor.onnx
• moonshine-encoder.onnx
• moonshine-uncached-decoder.onnx
• moonshine-cached-decoder.onnx
Zipformer transducer
• tokens.txt
• transducer-encoder.onnx
• transducer-decoder.onnx
• transducer-joiner.onnx
NeMo transducer
• tokens.txt
• nemo-transducer-encoder.onnx
• nemo-transducer-decoder.onnx
• nemo-transducer-joiner.onnx
SenseVoice
• tokens.txt
• sense-voice.onnx
Paraformer
• tokens.txt
• paraformer.onnx
TeleSpeech
• tokens.txt
• telespeech.onnx
1. Whisper
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
tar xf sherpa-onnx-whisper-tiny.en.tar.bz2
cd sherpa-onnx-whisper-tiny.en
mv -v tiny.en-encoder.int8.onnx ../whisper-encoder.onnx
mv -v tiny.en-decoder.int8.onnx ../whisper-decoder.onnx
mv -v tiny.en-tokens.txt ../tokens.txt
cd ..
rm -rf sherpa-onnx-whisper-tiny.en
You can replace tiny.en with other types of Whisper models, e.g., tiny, base, etc.
2. Zipformer transducer
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-zipformer-wenetspeech-20230615.tar.bz2
tar xf icefall-asr-zipformer-wenetspeech-20230615.tar.bz2
cd icefall-asr-zipformer-wenetspeech-20230615
mv -v data/lang_char/tokens.txt ../
mv -v exp/encoder-epoch-12-avg-4.int8.onnx ../transducer-encoder.onnx
mv -v exp/decoder-epoch-12-avg-4.onnx ../transducer-decoder.onnx
mv -v exp/joiner-epoch-12-avg-4.int8.onnx ../transducer-joiner.onnx
cd ..
rm -rf icefall-asr-zipformer-wenetspeech-20230615
Example 2
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2
tar xf sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2
cd sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01
mv ./tokens.txt ../
mv encoder-epoch-99-avg-1.int8.onnx ../transducer-encoder.onnx
mv decoder-epoch-99-avg-1.onnx ../transducer-decoder.onnx
mv joiner-epoch-99-avg-1.int8.onnx ../transducer-joiner.onnx
cd ../
rm -rf sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01
3. NeMo transducer
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k.tar.bz2
tar xf sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k.tar.bz2
rm sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k.tar.bz2
cd sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k
mv tokens.txt ../
mv encoder.onnx ../nemo-transducer-encoder.onnx
mv decoder.onnx ../nemo-transducer-decoder.onnx
mv joiner.onnx ../nemo-transducer-joiner.onnx
cd ../
rm -rf sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k
4. SenseVoice
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
tar xf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
cd sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
mv tokens.txt ../
mv model.int8.onnx ../sense-voice.onnx
cd ../
rm -rf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
5. Paraformer
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2
tar xf sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2
cd sherpa-onnx-paraformer-zh-2023-09-14
mv tokens.txt ../
mv model.int8.onnx ../paraformer.onnx
cd ../
rm -rf sherpa-onnx-paraformer-zh-2023-09-14
6. TeleSpeech
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04.tar.bz2
tar xf sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04.tar.bz2
cd sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04
mv tokens.txt ../
mv model.int8.onnx ../telespeech.onnx
cd ../
rm -rf sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04
7. Moonshine
cd lazarus-examples/generate_subtitles
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
tar xf sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
cd sherpa-onnx-moonshine-tiny-en-int8
mv preprocess.onnx ../moonshine-preprocessor.onnx
mv encode.int8.onnx ../moonshine-encoder.onnx
mv uncached_decode.int8.onnx ../moonshine-uncached-decoder.onnx
mv cached_decode.int8.onnx ../moonshine-cached-decoder.onnx
mv tokens.txt ../
cd ../
rm -rf sherpa-onnx-moonshine-tiny-en-int8
8.14 WebAssembly
In this section, we describe how to build sherpa-onnx for WebAssembly so that you can run real-time speech recognition
with WebAssembly.
Please follow the steps below to build and run sherpa-onnx for WebAssembly.
Hint: We provide a colab notebook for you to try this section step by step.
If you are using Windows or you don’t want to setup your local environment to build WebAssembly support, please
use the above colab notebook.
We need to compile the C/C++ files in sherpa-onnx with the help of emscripten.
Please refer to https://emscripten.org/docs/getting_started/downloads for detailed installation instructions.
The following is an example to show you how to install it on Linux/macOS.
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
emcc -v
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /Users/fangjun/open-source/emsdk/upstream/bin
8.14.2 Build
cd wasm/asr/assets
wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
mv sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.int8.onnx encoder.onnx
mv sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx decoder.onnx
mv sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.int8.onnx joiner.onnx
mv sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt ./
rm -rf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/
cd ../../..
./build-wasm-simd-asr.sh
Hint: If you want to use a paraformer model instead, please see asr/assets/README.md#paraformer.
-- Up-to-date: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-asr/install/bin/wasm/asr/sherpa-onnx-wasm-asr-main.js
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-asr/install/bin/wasm/asr/index.html
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-asr/install/bin/wasm/asr/sherpa-onnx.js
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-asr/install/bin/wasm/asr/app.js
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-asr/install/bin/wasm/asr/sherpa-onnx-wasm-asr-main.wasm
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-asr/install/bin/wasm/asr/sherpa-onnx-wasm-asr-main.data
+ ls -lh install/bin/wasm/asr
total 440080
-rw-r--r-- 1 fangjun staff 9.0K Feb 23 17:39 app.js
-rw-r--r-- 1 fangjun staff 978B Feb 23 17:39 index.html
-rw-r--r-- 1 fangjun staff 199M Feb 23 18:34 sherpa-onnx-wasm-asr-main.data
-rw-r--r-- 1 fangjun staff 90K Feb 23 18:38 sherpa-onnx-wasm-asr-main.js
-rw-r--r-- 1 fangjun staff 10M Feb 23 18:38 sherpa-onnx-wasm-asr-main.wasm
-rw-r--r-- 1 fangjun staff 9.1K Feb 23 17:39 sherpa-onnx.js
cd build-wasm-simd-asr/install/bin/wasm/asr
python3 -m http.server 6006
Start your browser and visit http://localhost:6006/; you should see the following page:
Now click start and speak! You should see the recognition results in the text box.
Warning: We are using a bilingual model (Chinese + English) in the above example, which means you can only
speak Chinese or English in this case.
Congratulations! You have successfully run real-time speech recognition with WebAssembly in your browser.
We provide four Huggingface spaces so that you can try real-time speech recognition with WebAssembly in your
browser.
https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-en
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-en/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/wasm-simd-hf-space-en-asr-zipformer.yaml
https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/wasm-simd-hf-space-zh-en-asr-zipformer.yaml
https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en-paraformer
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en-paraformer/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/wasm-simd-hf-space-zh-en-asr-paraformer.yaml
https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-zh-cantonese-en-paraformer
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-zh-cantonese-en-paraformer/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/wasm-simd-hf-space-zh-cantonese-en-asr-paraformer.yaml
8.15 Android
In this section, we describe how to build an Android app with sherpa-onnx.
Hint: For real-time speech recognition, it does not need to access the Internet. Everything is processed locally on your phone.
• Streaming speech recognition: https://k2-fsa.github.io/sherpa/onnx/android/apk.html
• Text-to-speech engine: https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
• Voice activity detection (VAD): https://k2-fsa.github.io/sherpa/onnx/vad/apk.html
• VAD + non-streaming speech recognition: https://k2-fsa.github.io/sherpa/onnx/vad/apk-asr.html
• Two-pass speech recognition: https://k2-fsa.github.io/sherpa/onnx/android/apk-2pass.html
• Audio tagging: https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html
• Audio tagging (WearOS): https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk-wearos.html
• Speaker identification: https://k2-fsa.github.io/sherpa/onnx/speaker-identification/apk.html
• Spoken language identification: https://k2-fsa.github.io/sherpa/onnx/spoken-language-identification/apk.html
• Keyword spotting: https://k2-fsa.github.io/sherpa/onnx/kws/apk.html
You can use this section for both speech-to-text (STT, ASR) and text-to-speech (TTS).
Hint: The build scripts mentioned in this section run on both Linux and macOS.
If you are using Windows or if you don’t want to build the shared libraries, you can download pre-built shared libraries
by visiting the release page https://github.com/k2-fsa/sherpa-onnx/releases/
For instance, for the release v1.10.19, you can visit https://github.com/k2-fsa/sherpa-onnx/releases/tag/v1.10.19 and download the file sherpa-onnx-v1.10.19-android.tar.bz2 using the following command:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.10.19/sherpa-onnx-v1.10.19-android.tar.bz2
Hint: This section is originally written for speech-to-text. However, it is also applicable to other folders in https://github.com/k2-fsa/sherpa-onnx/tree/master/android.
Hint: Any recent version of Android Studio should work fine. Also, you can use the default settings of Android Studio
during installation.
For reference, we post the version we are using below:
Download sherpa-onnx
Install NDK
In the following, we assume Android SDK location was set to /Users/fangjun/software/my-android. You
can change it accordingly below.
After installing NDK, you can find it in
/Users/fangjun/software/my-android/ndk/22.1.7171670
Warning: If you selected a different version of NDK, please replace 22.1.7171670 accordingly.
Next, let us set the environment variable ANDROID_NDK for later use.
export ANDROID_NDK=/Users/fangjun/software/my-android/ndk/22.1.7171670
list(APPEND ANDROID_COMPILER_FLAGS
-g
-DANDROID
Caution: You only need to select one and only one ABI. arm64-v8a is probably the most common one.
If you want to test the app on an emulator, you probably need x86_64.
Hint: The build scripts for this section are for macOS and Linux. If you are using Windows, or if you don't want to build the shared libraries by yourself, you can download pre-compiled shared libraries for this section by visiting
https://github.com/k2-fsa/sherpa-onnx/releases
Hint: We provide a colab notebook for you to try this section step by step.
If you are using Windows or you don’t want to setup your local environment to build the C++ libraries, please use the
above colab notebook.
ls -lh build-android-arm64-v8a/install/lib/
cp build-android-arm64-v8a/install/lib/lib*.so android/SherpaOnnx/app/src/main/jniLibs/arm64-v8a/
You should see the following screenshot after running the above cp command.
ls -lh build-android-armv7-eabi/install/lib
cp build-android-armv7-eabi/install/lib/lib*.so android/SherpaOnnx/app/src/main/jniLibs/armeabi-v7a/
You should see the following screenshot after running the above cp command.
ls -lh build-android-x86-64/install/lib/
cp build-android-x86-64/install/lib/lib*.so android/SherpaOnnx/app/src/main/jniLibs/x86_64/
You should see the following screenshot after running the above cp command.
ls -lh build-android-x86/install/lib/
cp build-android-x86/install/lib/lib*.so android/SherpaOnnx/app/src/main/jniLibs/x86/
You should see the following screenshot after running the above cp command.
Hint: The model is trained using icefall, and the original torchscript model is from https://huggingface.co/pfluo/k2fsa-zipformer-chinese-english-mixed.
Use the following command to download the pre-trained model and place it into android/SherpaOnnx/app/src/main/assets/:
cd android/SherpaOnnx/app/src/main/assets/
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
cd sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20
# Now, remove extra files to reduce the file size of the generated apk
rm -rf test_wavs
rm -f *.sh README.md
rm -f bpe.model
rm -f encoder-epoch-99-avg-1.int8.onnx
rm -f joiner-epoch-99-avg-1.int8.onnx
rm -f decoder-epoch-99-avg-1.int8.onnx
rm -f bpe.vocab
ls -lh
You should see the following screenshot after downloading the pre-trained model:
Hint: If you select a different pre-trained model, make sure that you also change the corresponding code listed in the following screenshot:
Generate APK
ls -lh android/SherpaOnnx/app/build/outputs/apk/debug/app-debug.apk
Select Build -> Analyze APK ... in the above screenshot; in the popped-up dialog, select the generated APK app-debug.apk, and you will see the following screenshot:
You can see from the above screenshot that most of the APK is occupied by the pre-trained model, while the runtime, including the shared libraries, is only 7.2 MB.
Caution: You can see that libonnxruntime.so alone occupies 5.8 MB out of 7.2 MB.
We use a so-called Full build instead of a Mobile build, so the file size of the library is somewhat larger.
libonnxruntime.so is downloaded from
https://mvnrepository.com/artifact/com.microsoft.onnxruntime/onnxruntime-android/1.17.1
Please refer to https://onnxruntime.ai/docs/build/custom.html for a custom build to reduce the file size of
libonnxruntime.so.
Note that we are constantly updating the version of onnxruntime. By the time you are reading this section, we
may be using the latest version of onnxruntime.
Hint: We recommend that you use sherpa-ncnn. Please see Analyze the APK for sherpa-ncnn. The total runtime of sherpa-ncnn is only 1.6 MB, which is much smaller than that of sherpa-onnx.
8.16 iOS
In this section, we describe how to build an iOS app for real-time speech recognition with sherpa-onnx and run it within a simulator on your Mac, or run it on your iPhone or iPad.
Hint: During speech recognition, it does not need to access the Internet. Everything is processed locally on your device.
This section describes how to build sherpa-onnx for iPhone and iPad.
Requirement
Warning: The minimum deployment target is iOS 13.0.
Before we continue, please make sure the following requirements are satisfied:
• macOS. It won’t work on Windows or Linux.
• Xcode. The version 14.2 (14C18) is known to work. Other versions may also work.
• CMake. CMake 3.25.1 is known to work. Other versions may also work.
• (Optional) iPhone or iPad. This is for testing the app on your device. If you don’t have a device, you can still run
the app within a simulator on your Mac.
Caution:
If you get the following error:
CMake Error at toolchains/ios.toolchain.cmake:544 (get_filename_component):
get_filename_component called with incorrect number of arguments
Call Stack (most recent call first):
/usr/local/Cellar/cmake/3.29.0/share/cmake/Modules/CMakeDetermineSystem.cmake:146 (include)
CMakeLists.txt:2 (project)
please run:
sudo xcode-select --install
sudo xcodebuild -license
Download sherpa-onnx
mkdir -p $HOME/open-source
cd $HOME/open-source
git clone https://github.com/k2-fsa/sherpa-onnx
cd $HOME/open-source/sherpa-onnx/
./build-ios.sh
cd $HOME/open-source/sherpa-onnx/ios-swift/SherpaOnnx
open SherpaOnnx.xcodeproj
It will start Xcode and you will see the following screenshot:
Please select Product -> Build to build the project. See the screenshot below:
After finishing the build, you should see the following screenshot:
Congratulations! You have successfully built the project. Let us run the project by selecting Product -> Run, which
is shown in the following screenshot:
Please wait for a few seconds before Xcode starts the simulator.
Unfortunately, it will throw the following error:
The reason for the above error is that we have not provided the pre-trained model yet.
The file ViewController.swift pre-selects the pre-trained model to be csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (Bilingual, Chinese + English), shown in the screenshot below:
In the popup dialog, switch to the folder where you just downloaded the pre-trained model.
In the screenshot below, it is the folder /Users/fangjun/open-source/icefall-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20:
Fig. 8.15: Screenshot for navigating to the folder containing the downloaded pre-trained model
After adding pre-trained model files to Xcode, you should see the following screenshot:
At this point, you should be able to select the menu Product -> Run to run the project and you should finally see the
following screenshot:
Congratulations! You have finally succeeded in running sherpa-onnx with iOS, though it is in a simulator.
Please read below if you want to run sherpa-onnx on your iPhone or iPad.
First, please make sure the iOS version of your iPhone/iPad is >= 13.0.
Click the menu Xcode -> Settings..., as is shown in the following screenshot:
In the popup dialog, please select Account and click + to add your Apple ID, as is shown in the following screenshots.
After adding your Apple ID, please connect your iPhone or iPad to your Mac and select your device in Xcode. The
following screenshot is an example to select my iPhone.
Now your Xcode should look like below after selecting a device:
Please select Product -> Run again to run sherpa-onnx on your selected device, as is shown in the following screenshot:
After a successful build, check your iPhone/iPad and you should see the following screenshot:
At this point, you should be able to run the app on your device. The following is a screenshot about running it on my
iPhone:
8.17 Flutter
Text-to-speech (TTS):
• Android (arm64-v8a, armeabi-v7a, x86_64): https://k2-fsa.github.io/sherpa/onnx/flutter/tts-android.html
• Linux (x64): https://k2-fsa.github.io/sherpa/onnx/flutter/tts-linux.html
• macOS (x64): https://k2-fsa.github.io/sherpa/onnx/flutter/tts-macos-x64.html
• macOS (arm64): https://k2-fsa.github.io/sherpa/onnx/flutter/tts-macos-arm64.html
• Windows (x64): https://k2-fsa.github.io/sherpa/onnx/flutter/tts-win.html
Speech recognition (ASR):
• Streaming speech recognition: https://k2-fsa.github.io/sherpa/onnx/flutter/asr/app.html
8.18 WebSocket
In this section, we describe how to use the WebSocket server and client for real-time speech recognition with sherpa-onnx.
The WebSocket server is implemented in C++ with the help of websocketpp and asio.
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
Please refer to Non-streaming WebSocket server and client for the usage of
sherpa-onnx-offline-websocket-server.
Before starting the server, let us view the help message of sherpa-onnx-online-websocket-server:
build/bin/sherpa-onnx-online-websocket-server
Usage:
./bin/sherpa-onnx-online-websocket-server --help
./bin/sherpa-onnx-online-websocket-server \
--port=6006 \
--num-work-threads=5 \
--tokens=/path/to/tokens.txt \
--encoder=/path/to/encoder.onnx \
--decoder=/path/to/decoder.onnx \
--joiner=/path/to/joiner.onnx \
--log-file=./log.txt \
--max-batch-size=5 \
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
for a list of pre-trained models to download.
Options:
--max-batch-size : Max batch size for recognition. (int, default = 5)
--loop-interval-ms : It determines how often the decoder loop runs. (int, default = 10)
--port : The port on which the server will listen. (int, default = 6006)
Standard options:
--config : Configuration file to read (this option may be repeated) (string, default = "")
./build/bin/sherpa-onnx-online-websocket-server \
--port=6006 \
--num-work-threads=3 \
--num-io-threads=2 \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx \
--log-file=./log.txt \
--max-batch-size=5 \
--loop-interval-ms=20
Hint: In the above demo, the model files are from csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (Bilingual, Chinese + English).
Note: The server supports processing multiple clients in a batch in parallel. You can use --max-batch-size to limit the batch size.
./build/bin/sherpa-onnx-online-websocket-client
[I] /Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:484:int sherpa_onnx::ParseOptions::Read(int, const char *const *) ./build/bin/sherpa-onnx-online-websocket-client
[I] /Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:525:void sherpa_onnx::ParseOptions::PrintUsage(bool) const
Usage:
./bin/sherpa-onnx-online-websocket-client --help
./bin/sherpa-onnx-online-websocket-client \
--server-ip=127.0.0.1 \
--server-port=6006 \
--samples-per-message=8000 \
--seconds-per-message=0.2 \
/path/to/foo.wav
Options:
--seconds-per-message : We will simulate that each message takes this number of seconds to send. If you select a very large value, it will take a long time to send
Standard options:
--help : Print out usage message (bool, default = false)
--print-args : Print the command line arguments (to stderr) (bool, default = true)
build/bin/sherpa-onnx-online-websocket-client \
--seconds-per-message=0.1 \
--server-port=6006 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav
Since the server is able to process multiple clients at the same time, you can start several clients in the background and wait for all of them to finish, e.g.:

build/bin/sherpa-onnx-online-websocket-client \
--seconds-per-message=0.1 \
--server-port=6006 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav &
build/bin/sherpa-onnx-online-websocket-client \
--seconds-per-message=0.1 \
--server-port=6006 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/1.wav &
wait
echo "done"

The following is the help message of the Python client online-websocket-client-decode-file.py:
positional arguments:
sound_file The input sound file. Must be wave with a single
optional arguments:
-h, --help show this help message and exit
--server-addr SERVER_ADDR
Address of the server (default: localhost)
--server-port SERVER_PORT
Port of the server (default: 6006)
--samples-per-message SAMPLES_PER_MESSAGE
Number of samples per message (default: 8000)
--seconds-per-message SECONDS_PER_MESSAGE
We will simulate that the interval between two messages is of this value (default: 0.1)
Hint: For the Python client, you can use either a domain name or an IP address for --server-addr. For instance,
you can use either --server-addr localhost or --server-addr 127.0.0.1.
For the input argument, you can either use --key=value or --key value.
python3 ./python-api-examples/online-websocket-client-decode-file.py \
--server-addr localhost \
--server-port 6006 \
--seconds-per-message 0.1 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/4.wav
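To see what --samples-per-message and --seconds-per-message do, here is a schematic Python sketch of the pacing loop. It is not the actual client; send is a placeholder for the WebSocket send call used by the sherpa-onnx clients:

import time

def stream_samples(samples, send, samples_per_message=8000, seconds_per_message=0.1):
    # Send audio in fixed-size chunks, sleeping between messages to
    # simulate a real-time audio source. `send` is a placeholder.
    for start in range(0, len(samples), samples_per_message):
        send(samples[start:start + samples_per_message])
        time.sleep(seconds_per_message)

# Example: 16000 fake samples (1 second at 16 kHz), printing chunk sizes
stream_samples([0.0] * 16000, lambda chunk: print(len(chunk)))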
python3 ./python-api-examples/online-websocket-client-microphone.py \
--server-addr localhost \
--server-port 6006
online-websocket-client-microphone.py is from https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/online-websocket-client-microphone.py
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
Please refer to Streaming WebSocket server and client for the usage of sherpa-onnx-online-websocket-server
and sherpa-onnx-online-websocket-client.
Before starting the server, let us view the help message of sherpa-onnx-offline-websocket-server:
build/bin/sherpa-onnx-offline-websocket-server
Usage:
./bin/sherpa-onnx-offline-websocket-server --help
./bin/sherpa-onnx-offline-websocket-server \
--port=6006 \
--num-work-threads=5 \
--tokens=/path/to/tokens.txt \
--encoder=/path/to/encoder.onnx \
--decoder=/path/to/decoder.onnx \
--joiner=/path/to/joiner.onnx \
--log-file=./log.txt \
--max-batch-size=5
./bin/sherpa-onnx-offline-websocket-server \
--port=6006 \
--num-work-threads=5 \
--tokens=/path/to/tokens.txt \
--paraformer=/path/to/model.onnx \
--log-file=./log.txt \
--max-batch-size=5
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
Options:
--log-file : Path to the log file. Logs are appended to this file (string, default = "./log.txt")
memory, you can select a large value for it. (float, default = 300)
--port : The port on which the server will listen. (int, default = 6006)
Standard options:
--help : Print out usage message (bool, default = false)
--print-args : Print the command line arguments (to stderr) (bool,␣
˓→default = true)
./build/bin/sherpa-onnx-offline-websocket-server \
--port=6006 \
--num-work-threads=5 \
--tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx \
--log-file=./log.txt \
--max-batch-size=5
Hint: In the above demo, the model files are from csukuangfj/sherpa-onnx-zipformer-en-2023-03-30 (English).
Note: Note that the server supports processing multiple clients in a batch in parallel. You can use --max-batch-size
to limit the batch size.
./build/bin/sherpa-onnx-offline-websocket-server \
--port=6006 \
--num-work-threads=5 \
--tokens=./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-03-28/model.onnx \
--log-file=./log.txt \
--max-batch-size=5
Hint: In the above demo, the model files are from csukuangfj/sherpa-onnx-paraformer-zh-2023-03-28 (Chinese + English).
offline-websocket-client-decode-files-paralell.py
python3 ./python-api-examples/offline-websocket-client-decode-files-paralell.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
offline-websocket-client-decode-files-sequential.py
python3 ./python-api-examples/offline-websocket-client-decode-files-sequential.py \
--server-addr localhost \
--server-port 6006 \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
In this section, we describe how we implement the hotwords (aka contextual biasing) feature with an Aho-Corasick automaton and how to use it in sherpa-onnx.
Caution: Only transducer models support hotwords in sherpa-onnx. That is, only models from Offline transducer
models and Online transducer models support hotwords.
All other models don’t support hotwords.
Also, you have to change the decoding method to modified_beam_search to use hotwords. The default decoding
method greedy_search does not support hotwords.
Current ASR systems work very well for general cases, but they sometimes fail to recognize special words/phrases (aka hotwords) like rare words, personalized information, etc. Usually, those words/phrases will be recognized as words/phrases that sound similar to them (for example, recognizing LOUIS FOURTEEN as LEWIS FOURTEEN). So we have to provide some kind of context information (for example, LOUIS FOURTEEN) to the ASR system to boost those words/phrases. Normally, we call this kind of boosting task contextual biasing (aka hotwords recognition).
We first construct an Aho-Corasick automaton on the given hotwords (after tokenizing them into tokens). Please refer to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm for the construction details of Aho-Corasick.
The figure below shows the Aho-Corasick automaton for "HE/SHE/SHELL/HIS/THIS" with hotwords-score == 1.
The black arrows in the graph are the goto arcs, the red arrows are the failure arcs, and the green arrows are the output arcs. Each goto arc carries a token and a boosting score (note: we boost the path as soon as any partial sequence is matched; if the path finally fails to fully match any hotword, the boosted score is canceled). Currently, the boosting score is distributed evenly on the arcs along the path. Each state has two scores: the first is the node score (mainly used to cancel scores), and the second is the output score, which is the total score of the fully matched hotwords at this state.
The following are several matching examples of the graph above.
Note: For simplicity, we assume that the system emits a token each frame.
Hint: We have an extra finalize step to force the graph state to go back to the root state.
The path is "SHELF".
Frame    | Boost score | Total boost score | Graph state | Matched hotwords
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 3           |
2        | 1           | 2                 | 4           |
3        | 1+5         | 8                 | 5           | HE, SHE
4        | 1           | 9                 | 6           |
5        | -4          | 5                 | 0           |
finalize | 0           | 5                 | 0           |
At frame 3 we reach state 5 and match HE and SHE, so we get a boosting score of 1 + 5; the score 1 is there because SHEL might still be the prefix of other hotwords. At frame 5, F cannot match any tokens and we fail back to the root, so we cancel the score for SHEL, which is 4 (the node score of state 6).
The path is “HI”
Frame    | Boost score | Total boost score | Graph state | Matched hotwords
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 1           |
2        | 1           | 2                 | 8           |
finalize | -2          | 0                 | 0           |
H and I both match tokens in the graph; unfortunately, we have to go back to the root state when we finish matching a path, so we cancel the boosting score of HI, which is 2 (the node score of state 8).
The path is “THE”
Frame    | Boost score | Total boost score | Graph state | Matched hotwords
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 10          |
2        | 1           | 2                 | 11          |
3        | 0+2         | 4                 | 2           | HE
finalize | -2          | 2                 | 0           |
At frame 3 we jump from state 11 to state 2 and get a boosting score of 0 + 2: 0 because the node score of state 2 is the same as that of state 11, so we don't get any score from the partial match (the prefix of state 11, TH, has the same length as the prefix of state 2, HE), but we do get the output score (state 2 outputs HE).
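To make the scheme above concrete, here is a small runnable Python sketch of the boosting logic (goto/failure arcs, per-arc boosting scores, node scores for cancellation, and output scores). It is for illustration only and is not the actual C++ implementation used by sherpa-onnx:

# A runnable sketch of the boosting logic described above; illustration
# only, not the C++ implementation inside sherpa-onnx.
from collections import deque


class Node:
    def __init__(self, depth=0):
        self.next = {}        # goto arcs: token -> Node
        self.fail = None      # failure arc
        self.depth = depth    # node score = depth * hotwords_score
        self.word = 0.0       # score of a full hotword ending exactly here
        self.output = 0.0     # total score of all hotwords matched at this node


def build(hotwords, score=1.0):
    """Build the Aho-Corasick automaton; each goto arc carries `score`."""
    root = Node()
    for word in hotwords:
        node = root
        for tok in word:
            node = node.next.setdefault(tok, Node(node.depth + 1))
        node.word = len(word) * score
    root.fail = root
    q = deque(root.next.values())
    for child in root.next.values():
        child.fail = root
    while q:
        cur = q.popleft()
        cur.output = cur.word + cur.fail.output  # e.g., SHE outputs SHE and HE
        for tok, child in cur.next.items():
            f = cur.fail
            while f is not root and tok not in f.next:
                f = f.fail
            nxt = f.next.get(tok)
            child.fail = nxt if nxt is not None and nxt is not child else root
            q.append(child)
    return root


class HotwordScorer:
    """Tracks the graph state; step() returns the boost for one token."""

    def __init__(self, root, score=1.0):
        self.root, self.score, self.state = root, score, root

    def step(self, tok):
        n = self.state
        while n is not self.root and tok not in n.next:
            n = n.fail  # failure arcs cancel part of the partial score
        nxt = n.next.get(tok, self.root)
        boost = (nxt.depth - self.state.depth) * self.score + nxt.output
        self.state = nxt
        return boost

    def finalize(self):
        """Force the state back to the root, canceling unmatched prefixes."""
        boost = -self.state.depth * self.score
        self.state = self.root
        return boost


root = build(["HE", "SHE", "SHELL", "HIS", "THIS"], score=1.0)
scorer = HotwordScorer(root)
boosts = [scorer.step(t) for t in "SHELF"] + [scorer.finalize()]
print(boosts)       # [1.0, 1.0, 6.0, 1.0, -4.0, -0.0]
print(sum(boosts))  # 5.0, matching the first table above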
Note: We implement the hotwords feature at inference time; you don't have to re-train the models to use this feature.
Caution: Currently, the hotwords feature is only supported in the modified_beam_search decoding method of transducer models (both streaming and non-streaming).
The use of hotwords is no different for streaming and non-streaming models; in fact, it is the same across all the APIs supported by sherpa-onnx. We add FOUR extra arguments for hotwords:
• hotwords-file
The file path of the hotwords, one hotword per line. They could be Chinese words, English words, or both, depending on the modeling units used to train the model. For Chinese models trained on cjkchar, each line is a Chinese word/phrase; for English-like models trained on bpe, it looks like:
SPEECH RECOGNITION
DEEP LEARNING
SPEECH
You can also specify a boosting score for each hotword; the score follows the predefined character : at the end of the line, for example:
SPEECH RECOGNITION :3.5
DEEP LEARNING :2.0
This means the hotword SPEECH RECOGNITION will have a boosting score of 3.5 and the hotword DEEP LEARNING will have a boosting score of 2.0. Hotwords that don't have a specific score will use the global score provided by hotwords-score below.
Caution: The specific score MUST be the last item of each line; you must not break the hotword into two parts with the score (e.g., SPEECH :2.0 RECOGNITION # This is invalid).
• hotwords-score
The boosting score for each matched token.
Note: We match the hotwords at token level, so the hotwords-score is applied at token level.
• modeling-unit
The modeling unit of the model. Currently, cjkchar (for Chinese), bpe (for English-like languages) and
cjkchar+bpe (for multilingual models) are supported. We need the modeling-unit to select the tokenizer that encodes
words/phrases into tokens, so do provide the correct modeling-unit for your model.
• bpe-vocab
The bpe vocabulary generated by the sentencepiece toolkit; it can also be exported from bpe.model (see
script/export_bpe_vocab.py for details). This vocabulary is used to tokenize words/phrases into bpe units. It
is only used when modeling-unit is bpe or cjkchar+bpe.
Hint: We need bpe.vocab rather than bpe.model because we don't pull the sentencepiece C++ codebase into
sherpa-onnx (it has a dependency issue with protobuf); instead, we implement a simple sentencepiece encoder and
decoder that take bpe.vocab as input.
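For illustration, a simple sentencepiece-style encoder driven by bpe.vocab can be sketched as follows. This is a toy Viterbi segmenter written for this document; it assumes one piece and its log-probability per line, separated by a tab, and the real encoder in sherpa-onnx may differ in details.

def load_vocab(path):
    """Load bpe.vocab, assumed to contain '<piece>\t<log-prob>' per line."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2:
                vocab[fields[0]] = float(fields[1])
    return vocab

def encode(word, vocab):
    """Viterbi search for the highest-scoring segmentation of one word."""
    word = "\u2581" + word  # sentencepiece marks a word boundary with U+2581
    n = len(word)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - 16), i):  # cap the piece length for speed
            piece = word[j:i]
            if piece in vocab and best[j] + vocab[piece] > best[i]:
                best[i] = best[j] + vocab[piece]
                back[i] = j
    # Backtrack; if no segmentation exists, this falls back to the whole word.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Example (result depends on the vocabulary): encode("HELLO", vocab)
# might return ['▁HE', 'LL', 'O'].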
The main difference when using the hotwords feature lies in the modeling units. The following shows how to use it
for different modeling units.
Note: The following examples use a non-streaming model; if you are using a streaming model, use sherpa-onnx
rather than sherpa-onnx-offline. sherpa-onnx, sherpa-onnx-alsa, sherpa-onnx-microphone,
sherpa-onnx-microphone-offline, sherpa-onnx-online-websocket-server and
sherpa-onnx-offline-websocket-server all support hotwords.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
tar xvf sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
ln -s sherpa-onnx-zipformer-en-2023-04-01 exp
The hotwords file hotwords_en.txt contains:
QUARTERS
FOREVER
C++ api
./build/bin/sherpa-onnx-offline \
--encoder=exp/encoder-epoch-99-avg-1.onnx \
--decoder=exp/decoder-epoch-99-avg-1.onnx \
--joiner=exp/joiner-epoch-99-avg-1.onnx \
--decoding-method=modified_beam_search \
--tokens=exp/tokens.txt \
exp/test_wavs/0.wav exp/test_wavs/1.wav
/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp/encoder-epoch-99-avg-1.onnx --decoder=exp/decoder-...

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp/encoder-epoch-99-avg-1.onnx", decoder_filename="exp/decoder-epoch-99-avg-1.onnx", joiner_filename="exp/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), ... hotwords_score=1.5)

exp/test_wavs/0.wav
{"text":"ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[1.44, 1.48, 1.56, 1.72, 1.88, 1.96, 2.16, 2.28, 2.36, 2.48, 2.60, 2.80, 3.08, 3.28, 3.40, 3.60, 3.80, 4.08, 4.24, 4.32, 4.48, 4.64, 4.84, 4.88, 5.00, 5.08, 5.32, 5.48, 5.60, 5.68, 5.84, 6.04, 6.24]","tokens":["A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S", ...]}
----
exp/test_wavs/1.wav
{"text":"IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND ...","timestamps":"[..., 4.92, 5.16, 5.44, 5.68, 6.04, 6.24, 6.48, 6.84, 7.08, 7.32, 7.56, 7.84, 8.12, 8.24, 8.32, 8.44, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.56, 9.64, 9.76, 9.96, 10.04, 10.20, 10.40, 10.64, 10.76, 11.04, 11.20, 11.36, 11.60, 11.80, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.60, 13.76, 13.96, 14.12, 14.24, 14.36, 14.52, 14.68, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.16, 16.24, 16.32]","tokens":["IN"," WHICH"," MAN"," TH","US"," P","UN","IS","H","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO", ...]}
./build/bin/sherpa-onnx-offline \
--encoder=exp/encoder-epoch-99-avg-1.onnx \
--decoder=exp/decoder-epoch-99-avg-1.onnx \
--joiner=exp/joiner-epoch-99-avg-1.onnx \
--decoding-method=modified_beam_search \
--tokens=exp/tokens.txt \
--modeling-unit=bpe \
--bpe-vocab=exp/bpe.vocab \
--hotwords-file=hotwords_en.txt \
--hotwords-score=2.0 \
exp/test_wavs/0.wav exp/test_wavs/1.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp/encoder-epoch-99-avg-1.onnx", decoder_filename="exp/decoder-epoch-99-avg-1.onnx", joiner_filename="exp/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="...

exp/test_wavs/0.wav
{"text":"ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTERS OF THE BROTHELS","timestamps":"[1.44, 1.48, 1.56, 1.72, 1.88, 1.96, 2.16, 2.28, 2.36, 2.48, 2.60, 2.80, 3.08, 3.28, 3.40, 3.60, 3.80, 4.08, 4.24, 4.32, 4.48, 4.64, 4.84, 4.88, 5.00, 5.08, 5.12, 5.36, 5.48, 5.60, 5.68, 5.84, 6.04, 6.24]", ...}
----
exp/test_wavs/1.wav
{"text":"IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND ...","timestamps":"[..., 10.20, 10.40, 10.68, 10.76, 11.04, 11.20, 11.36, 11.60, 11.80, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.60, 13.76, ...]","tokens":[..., "ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO", ...]}
Python api
python python-api-examples/offline-decode-files.py \
--encoder exp/encoder-epoch-99-avg-1.onnx \
--decoder exp/decoder-epoch-99-avg-1.onnx \
--joiner exp/joiner-epoch-99-avg-1.onnx \
--decoding-method modified_beam_search \
--tokens exp/tokens.txt \
exp/test_wavs/0.wav exp/test_wavs/1.wav
Started!
Done!
exp/test_wavs/0.wav
ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----------
exp/test_wavs/1.wav
IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 23.340 s
Elapsed time: 2.546 s
Real time factor (RTF): 2.546/23.340 = 0.109
python python-api-examples/offline-decode-files.py \
--encoder exp/encoder-epoch-99-avg-1.onnx \
--decoder exp/decoder-epoch-99-avg-1.onnx \
--joiner exp/joiner-epoch-99-avg-1.onnx \
--decoding-method modified_beam_search \
--tokens exp/tokens.txt \
--modeling-unit bpe \
--bpe-vocab exp/bpe.vocab \
--hotwords-file hotwords_en.txt \
--hotwords-score 2.0 \
exp/test_wavs/0.wav exp/test_wavs/1.wav
Started!
Done!
exp/test_wavs/0.wav
ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTERS OF THE BROTHELS
----------
exp/test_wavs/1.wav
IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 23.340 s
Elapsed time: 2.463 s
Real time factor (RTF): 2.463/23.340 = 0.106
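Hint: Compare the two runs above: with the hotwords QUARTERS and FOREVER boosted, QUARTER becomes QUARTERS in 0.wav and FOR EVER becomes FOREVER in 1.wav.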
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-conformer-zh-stateless2-2023-05-23.tar.bz2
tar xvf sherpa-onnx-conformer-zh-stateless2-2023-05-23.tar.bz2
ln -s sherpa-onnx-conformer-zh-stateless2-2023-05-23 exp-zh
C++ api
./build/bin/sherpa-onnx-offline \
--encoder=exp-zh/encoder-epoch-99-avg-1.onnx \
--decoder=exp-zh/decoder-epoch-99-avg-1.onnx \
--joiner=exp-zh/joiner-epoch-99-avg-1.onnx \
--tokens=exp-zh/tokens.txt \
--decoding-method=modified_beam_search \
exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav
/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp-zh/encoder-epoch-99-avg-1.onnx --decoder=exp-zh/...

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp-zh/encoder-epoch-99-avg-1.onnx", decoder_filename="exp-zh/decoder-epoch-99-avg-1.onnx", joiner_filename="exp-zh/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(... hotwords_score=1.5)

exp-zh/test_wavs/3.wav
{"text":"...","timestamps":"[0.00, 0.16, 0.68, 1.32, 1.72, 2.08, 2.60, 2.88, 3.20, 3.52, 3.92, 4.40, 4.68, 5.12, 5.44, 6.36, 6.96, 7.32]","tokens":[...]}
----
exp-zh/test_wavs/4.wav
{"text":"...","timestamps":"[0.00, 0.20, 0.88, 1.36, 1.76, 2.08, 2.28, 2.68, 2.92, 3.16, 3.44, 3.80]","tokens":[...]}
----
exp-zh/test_wavs/5.wav
{"text":"...", ...}
----
exp-zh/test_wavs/6.wav
{"text":"...","timestamps":"[0.00, 0.16, 0.80, 1.12, 1.44, 1.68, 1.92, 2.16, 2.36, 2.60, 2.84, 3.12]","tokens":[...]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.883 s
Real time factor (RTF): 1.883 / 20.328 = 0.093
./build/bin/sherpa-onnx-offline \
--encoder=exp-zh/encoder-epoch-99-avg-1.onnx \
--decoder=exp-zh/decoder-epoch-99-avg-1.onnx \
--joiner=exp-zh/joiner-epoch-99-avg-1.onnx \
--tokens=exp-zh/tokens.txt \
--decoding-method=modified_beam_search \
--modeling-unit=cjkchar \
--hotwords-file=hotwords_cn.txt \
--hotwords-score=2.0 \
exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp-zh/encoder-epoch-99-avg-1.onnx", decoder_filename="exp-zh/decoder-epoch-99-avg-1.onnx", joiner_filename="exp-zh/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(..., hotwords_file=hotwords_cn.txt, hotwords_score=2)

exp-zh/test_wavs/3.wav
{"text":"...","timestamps":"[0.00, 0.16, 0.64, 1.28, 1.64, 2.04, 2.60, 2.88, 3.20, 3.52, 3.92, 4.40, 4.68, 5.12, 5.44, 6.36, 6.96, 7.32]","tokens":[...]}
----
exp-zh/test_wavs/4.wav
{"text":"...","timestamps":"[0.00, 0.12, 0.80, 1.36, 1.76, 2.08, 2.28, 2.68, 2.92, 3.16, 3.44, 3.80]","tokens":[...]}
----
exp-zh/test_wavs/5.wav
{"text":"...", ...}
----
exp-zh/test_wavs/6.wav
{"text":"...","timestamps":"[0.00, 0.12, 0.80, 1.12, 1.44, 1.68, 1.92, 2.16, 2.36, 2.60, 2.84, 3.12]","tokens":[...]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.810 s
Real time factor (RTF): 1.810 / 20.328 = 0.089
Hint: Compare the two outputs above; the phrases listed in hotwords_cn.txt are recognized correctly after boosting.
Python api
python python-api-examples/offline-decode-files.py \
--encoder exp-zh/encoder-epoch-99-avg-1.onnx \
--decoder exp-zh/decoder-epoch-99-avg-1.onnx \
--joiner exp-zh/joiner-epoch-99-avg-1.onnx \
--tokens exp-zh/tokens.txt \
--decoding-method modified_beam_search \
exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav
Started!
Done!
exp-zh/test_wavs/3.wav
----------
exp-zh/test_wavs/4.wav
----------
exp-zh/test_wavs/5.wav
----------
exp-zh/test_wavs/6.wav
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 20.328 s
Elapsed time: 2.653 s
Real time factor (RTF): 2.653/20.328 = 0.131
python python-api-examples/offline-decode-files.py \
--encoder exp-zh/encoder-epoch-99-avg-1.onnx \
--decoder exp-zh/decoder-epoch-99-avg-1.onnx \
--joiner exp-zh/joiner-epoch-99-avg-1.onnx \
--tokens exp-zh/tokens.txt \
--decoding-method modified_beam_search \
--modeling-unit=cjkchar \
--hotwords-file hotwords_cn.txt \
--hotwords-score 2.0 \
exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav
Started!
Done!
exp-zh/test_wavs/3.wav
----------
exp-zh/test_wavs/4.wav
----------
exp-zh/test_wavs/5.wav
----------
exp-zh/test_wavs/6.wav
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 20.328 s
Elapsed time: 2.636 s
Real time factor (RTF): 2.636/20.328 = 0.130
Hint: As with the C++ api above, compare the two outputs; the boosted phrases from hotwords_cn.txt are now recognized correctly.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
ln -s sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 exp-mixed
C++ api
./build/bin/sherpa-onnx \
--encoder=exp-mixed/encoder-epoch-99-avg-1.onnx \
--decoder=exp-mixed/decoder-epoch-99-avg-1.onnx \
--joiner=exp-mixed/joiner-epoch-99-avg-1.onnx \
--decoding-method=modified_beam_search \
--tokens=exp-mixed/tokens.txt \
exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="exp-mixed/encoder-epoch-99-avg-1.onnx", decoder="exp-mixed/decoder-epoch-99-avg-1.onnx", joiner="exp-mixed/joiner-epoch-99-avg-1.onnx"), ...), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_...

exp-mixed/test_wavs/0.wav
Elapsed seconds: 3, Real time factor (RTF): 0.3
MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW
{"is_final":false,"segment":0,"start_time":0.0,"text":" MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW","timestamps":"[0.64, 1.04, 1.60, 2.08, 2.20, 2.40, 4.16, 4.40, 4.88, 5.56, 5.80, 6.16, 6.84, 7.12, 7.44, 8.04, 8.16, 8.24, 8.28, 9.04, 9.40, 9.64, 9.88]","tokens":[..., " AFTER"," TO","M","OR","ROW","","","",""]}

exp-mixed/test_wavs/2.wav
Elapsed seconds: 1.7, Real time factor (RTF): 0.37
FREQUENTLY
{"is_final":false,"segment":0,"start_time":0.0,"text":" FREQUENTLY","timestamps":"[0.00, 0.40, 0.52, 0.96, 1.08, 1.28, 1.48, 1.68, 1.84, 2.00, 2.24, 2.36, 2.52, 2.68, 2.92, 3...]","tokens":[..., "F","RE","QU","ENT","LY","","",""]}
/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx --decoder=exp-mixed/... exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="exp-mixed/encoder-epoch-99-avg-1.onnx", decoder="exp-mixed/decoder-epoch-99-avg-1.onnx", joiner="exp-mixed/joiner-epoch-99-avg-1.onnx"), ...), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_...

exp-mixed/test_wavs/0.wav
Elapsed seconds: 3.2, Real time factor (RTF): 0.32
MONDAY TODAY IS THE DAY AFTER TOMORROW
{"is_final":false,"segment":0,"start_time":0.0,"text":" MONDAY TODAY IS THE DAY AFTER TOMORROW","timestamps":"[0.64, 1.04, 1.60, 2.08, 2.20, 2.40, 4.16, 4.40, 4.88, 5.56, 5.68, 6.00, 6.84, 7.12, 7.44, 8.04, 8.16, 8.24, 8.28, 9.04, 9.40, 9.64, 9.88]","tokens":[..., "OR","ROW","","","",""]}

exp-mixed/test_wavs/2.wav
Elapsed seconds: 1.9, Real time factor (RTF): 0.4
FREQUENTLY
{"is_final":false,"segment":0,"start_time":0.0,"text":" FREQUENTLY","timestamps":"[0.00, 0.40, 0.52, 0.96, 1.08, 1.28, 1.48, 1.68, 1.84, 2.00, 2.24, 2.36, 2.52, 2.68, 2.92, 3...]","tokens":[..., "F","RE","QU","ENT","LY","","",""]}
Python api
python python-api-examples/online-decode-files.py \
--encoder exp-mixed/encoder-epoch-99-avg-1.onnx \
--decoder exp-mixed/decoder-epoch-99-avg-1.onnx \
--joiner exp-mixed/joiner-epoch-99-avg-1.onnx \
--decoding-method modified_beam_search \
--tokens exp-mixed/tokens.txt \
exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav
Started!
Done!
exp-mixed/test_wavs/0.wav
MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW
----------
exp-mixed/test_wavs/2.wav
FREQUENTLY
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 14.743 s
Elapsed time: 3.052 s
Real time factor (RTF): 3.052/14.743 = 0.207
python python-api-examples/online-decode-files.py \
--encoder exp-mixed/encoder-epoch-99-avg-1.onnx \
--decoder exp-mixed/decoder-epoch-99-avg-1.onnx \
--joiner exp-mixed/joiner-epoch-99-avg-1.onnx \
--decoding-method modified_beam_search \
--tokens exp-mixed/tokens.txt \
--modeling-unit cjkchar+bpe \
--bpe-vocab exp-mixed/bpe.vocab \
--hotwords-file hotwords_mix.txt \
--hotwords-score 2.0 \
exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav
Started!
Done!
exp-mixed/test_wavs/0.wav
MONDAY TODAY IS THE DAY AFTER TOMORROW
----------
exp-mixed/test_wavs/2.wav
FREQUENTLY
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 14.743 s
Elapsed time: 3.060 s
Real time factor (RTF): 3.060/14.743 = 0.208
In this section, we describe how we implement the open-vocabulary keyword spotting (aka customized keyword
spotting) feature and how to use it in sherpa-onnx.
Basically, an open-vocabulary keyword spotting system is just like a tiny ASR system, but it can only decode the
words/phrases in the given keywords. For example, if the given keyword is HELLO WORLD, then the decoded result
is either HELLO WORLD or empty. Open vocabulary (or customized) means you can specify any keywords without
re-training the model. To build a conventional keyword spotting system, people need to prepare a lot of audio-text
pairs for the selected keywords, and the trained model can only detect those selected keywords. An open-vocabulary
keyword spotting system allows people to use one system to detect different keywords, even keywords that are not in
the training data.
For now, we only implement a beam search decoder that restricts the system to trigger only on the given keywords
(i.e. the model itself is actually a tiny ASR). To balance the triggered rate against false alarms, we introduce two
parameters for each keyword: the boosting score and the trigger threshold. The boosting score works like
hotwords recognition: it helps the paths containing keywords survive beam search, and the larger this score is, the
more easily the corresponding keyword is triggered; read Hotwords (Contextual biasing) for more details. The
trigger threshold defines the minimum acoustic probability of the decoded (token) sequences that can be
triggered; it is a float between 0 and 1, and the lower this threshold is, the more easily the corresponding keyword is
triggered.
The input keywords file looks like the following (the keywords are HELLO WORLD, HI GOOGLE and HEY SIRI; the
token sequences below are illustrative and depend on the bpe vocabulary of your model):
▁HE LL O ▁WORLD :1.5 #0.35
▁HI ▁GO O G LE :1.0 #0.25
▁HE Y ▁S I RI :2.0 #0.35
Each line contains one keyword. The first several items (separated by spaces) are the encoded tokens of the keyword;
the item starting with : is the boosting score and the item starting with # is the trigger threshold. Note: there must
be no space between : (or #) and the float value.
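As a concrete illustration of this format, here is a small parsing sketch (a hypothetical helper written for this document, with illustrative defaults; sherpa-onnx parses these lines internally):

def parse_keyword_line(line, default_score=1.0, default_threshold=0.25):
    """Parse one keywords-file line into (tokens, phrase, score, threshold).

    Items starting with ':' set the boosting score, items starting with '#'
    set the trigger threshold, and items starting with '@' give the original
    phrase (required for fpinyin/ppinyin tokens); everything else is a token.
    """
    tokens, phrase = [], None
    score, threshold = default_score, default_threshold
    for item in line.split():
        if item.startswith(":"):
            score = float(item[1:])
        elif item.startswith("#"):
            threshold = float(item[1:])
        elif item.startswith("@"):
            phrase = item[1:]
        else:
            tokens.append(item)
    return tokens, phrase, score, threshold

print(parse_keyword_line("▁HE LL O ▁WORLD :1.5 #0.35"))
# (['▁HE', 'LL', 'O', '▁WORLD'], None, 1.5, 0.35)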
To get the tokens, you need to use the command-line tool provided by sherpa-onnx to convert the original keywords.
Its options include:
Options:
--text TEXT Path to the input texts. Each line in the texts contains the original phrase; it might also contain some
extra items, for example, the boosting score (starting with :) and the trigger threshold (starting with #).
Note: If the tokens-type is fpinyin or ppinyin, you MUST provide the original keyword (starting with @).
Note: If you installed sherpa-onnx from source (i.e. not via pip), you can use the alternative script in scripts; its
usage is almost the same as the command-line tool.
Currently, we provide a command-line tool and an Android app for keyword spotting.
command-line tool
After installing sherpa-onnx, run sherpa-onnx-keyword-spotter --help for the help message.
8.21 Punctuation
This section introduces the models that sherpa-onnx supports for adding punctuation to text.
Hint: After getting text from speech with a speech-to-text model, you can use the models in this section to add
punctuation to the text.
sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12
Hint: If you want to know how the model is converted to sherpa-onnx, please download it and you can find related
scripts in the downloaded model directory.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/punctuation-models/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
tar xvf sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
Only model.onnx is needed in sherpa-onnx. All other files are for your information about how the model is converted
to sherpa-onnx.
After installing sherpa-onnx, you can use the following command to add punctuation to text. The first example is for
text containing only Chinese:
./bin/sherpa-onnx-offline-punctuation \
--ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
"..."
OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx", num_threads=1, debug=False, provider="cpu"))
The second example is for text containing both Chinese and English:
./bin/sherpa-onnx-offline-punctuation \
--ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
"..."
OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx", num_threads=1, debug=False, provider="cpu"))
The third example is for text containing only English:
./bin/sherpa-onnx-offline-punctuation \
--ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
"The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry"
OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx", num_threads=1, debug=False, provider="cpu"))
Output text: The African blogosphere is rapidly expanding，bringing more voices online in the form of commentaries，opinions，analyses，rants and poetry。
Please see https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/add-punctuation.py for a Python API example.
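That example is roughly equivalent to the following sketch. The config class names below appear in the logs above; the add_punctuation method name follows the example script at the time of writing and should be treated as an assumption — consult add-punctuation.py if it has changed.

import sherpa_onnx

config = sherpa_onnx.OfflinePunctuationConfig(
    model=sherpa_onnx.OfflinePunctuationModelConfig(
        ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx",
    ),
)
punct = sherpa_onnx.OfflinePunctuation(config)

text = ("The African blogosphere is rapidly expanding bringing more voices "
        "online in the form of commentaries opinions analyses rants and poetry")
# Prints the same text with punctuation inserted.
print(punct.add_punctuation(text))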
Please see
• https://huggingface.co/spaces/k2-fsa/generate-subtitles-for-videos
• https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
Video demos
This section introduces the models that sherpa-onnx supports for audio tagging, which aims to recognize sound events
within an audio clip without localizing them in time.
sherpa-onnx-zipformer-small-audio-tagging-2024-04-15
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/audio-tagging-models/sherpa-onnx-zipformer-small-audio-tagging-2024-04-15.tar.bz2
tar xvf sherpa-onnx-zipformer-small-audio-tagging-2024-04-15.tar.bz2
Hint: You can find the binary executable file sherpa-onnx-offline-audio-tagging after installing sherpa-onnx
either from source or via pip install sherpa-onnx.
Cat
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/1.wav

Hint: By default, it outputs the top 5 events. The first event has the largest probability.

Whistle
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/2.wav

Music
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/3.wav

Laughter
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/4.wav

Finger snapping
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/5.wav

Baby cry
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/6.wav

Smoke alarm
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/7.wav

Siren
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/8.wav

Stream water
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/10.wav

Meow
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/11.wav

Dog bark
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/12.wav

Oink (pig)
./bin/sherpa-onnx-offline-audio-tagging \
--zipformer-model=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/model.int8.onnx \
--labels=./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/class_labels_indices.csv \
./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15/test_wavs/13.wav
Please see https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/audio-tagging-from-a-file.py for a Python API example.
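You can also drive the pre-built binary from a small script to tag many files at once; for example (a convenience wrapper written for this document around the CLI shown above):

import subprocess
from pathlib import Path

model_dir = Path("./sherpa-onnx-zipformer-small-audio-tagging-2024-04-15")

# Run sherpa-onnx-offline-audio-tagging on every test wave file in turn.
for wav in sorted((model_dir / "test_wavs").glob("*.wav")):
    print(f"===== {wav} =====")
    subprocess.run(
        [
            "./bin/sherpa-onnx-offline-audio-tagging",
            f"--zipformer-model={model_dir / 'model.int8.onnx'}",
            f"--labels={model_dir / 'class_labels_indices.csv'}",
            str(wav),
        ],
        check=True,
    )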
Huggingface space
You can try audio tagging with sherpa-onnx from within your browser by visiting the following URL:
https://huggingface.co/spaces/k2-fsa/audio-tagging
8.22.2 Android
You can find Android APKs for each model at the following page
https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html
Please follow Android to build Android APKs from source.
If you want to run audio tagging on your WearOS watches, please see WearOS.
8.22.3 WearOS
You can find APKs for WearOS of each model at the following page
https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk-wearos.html
Please follow Android to build APKs for WearOS from source.
If you want to run audio tagging on your Android phones, please see Android.
This section describes how to use sherpa-onnx for spoken language identification.
whisper
In the following, we use the tiny model as an example. You can replace tiny with base, small, or medium and
everything still holds.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.tar.bz2
tar xvf sherpa-onnx-whisper-tiny.tar.bz2
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/spoken-language-identification-test-wavs.tar.bz2
tar xvf spoken-language-identification-test-wavs.tar.bz2
After installing sherpa-onnx either from source or via pip install sherpa-onnx, you can run:
python3 ./python-api-examples/spoken-language-identification.py \
--whisper-encoder ./sherpa-onnx-whisper-tiny/tiny-encoder.int8.onnx \
--whisper-decoder ./sherpa-onnx-whisper-tiny/tiny-decoder.onnx \
./spoken-language-identification-test-wavs/de-german.wav
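That script essentially does the following (a sketch; the configuration class names below follow the example script at the time of writing and should be treated as assumptions — consult spoken-language-identification.py if they have changed):

import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.SpokenLanguageIdentificationConfig(
    whisper=sherpa_onnx.SpokenLanguageIdentificationWhisperConfig(
        encoder="./sherpa-onnx-whisper-tiny/tiny-encoder.int8.onnx",
        decoder="./sherpa-onnx-whisper-tiny/tiny-decoder.onnx",
    ),
)
slid = sherpa_onnx.SpokenLanguageIdentification(config)

samples, sample_rate = sf.read(
    "./spoken-language-identification-test-wavs/de-german.wav", dtype="float32"
)
stream = slid.create_stream()
stream.accept_waveform(sample_rate, samples)
print(slid.compute(stream))  # e.g. "de" for the German test file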
Android APKs
You can find pre-built Android APKs for spoken language identification at the following address:
https://k2-fsa.github.io/sherpa/onnx/spoken-language-identification/apk.html
Huggingface space
You can try spoken language identification from within your browser at
https://huggingface.co/spaces/k2-fsa/spoken-language-identification
Note: For Chinese users, you can use the following mirror:
http://hf-mirror.com/spaces/k2-fsa/spoken-language-identification
8.24 VAD
Description                                    URL
Speech recognition (speech to text, ASR)       https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models
Text to speech (TTS)                           https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
VAD                                            https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
Keyword spotting                               https://github.com/k2-fsa/sherpa-onnx/releases/tag/kws-models
Speaker identification (Speaker ID)            https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models
Spoken language identification (Language ID)   https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models (multilingual whisper)
Audio tagging                                  https://github.com/k2-fsa/sherpa-onnx/releases/tag/audio-tagging-models
Punctuation                                    https://github.com/k2-fsa/sherpa-onnx/releases/tag/punctuation-models
In this section, we describe how to download and use all available pre-trained models for speech recognition.
Zipformer-transducer-based Models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
sherpa-onnx-streaming-zipformer-korean-2024-06-16 (Korean)
Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1651. It supports only Korean.
PyTorch checkpoint with optimizer state can be found at https://huggingface.co/johnBamma/icefall-asr-ksponspeech-pruned-transducer-stateless7-streaming-2024-06-12
In the following, we describe how to download it and use it with sherpa-onnx.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-korean-2024-06-16.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-korean-2024-06-16.tar.bz2
rm sherpa-onnx-streaming-zipformer-korean-2024-06-16.tar.bz2
$ ls -lh sherpa-onnx-streaming-zipformer-korean-2024-06-16
total 907104
-rw-r--r-- 1 fangjun staff 307K Jun 16 17:36 bpe.model
-rw-r--r-- 1 fangjun staff 2.7M Jun 16 17:36 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 fangjun staff 11M Jun 16 17:36 decoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 fangjun staff 121M Jun 16 17:36 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 fangjun staff 279M Jun 16 17:36 encoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 fangjun staff 2.5M Jun 16 17:36 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 fangjun staff 9.8M Jun 16 17:36 joiner-epoch-99-avg-1.onnx
drwxr-xr-x 7 fangjun staff 224B Jun 16 17:36 test_wavs
-rw-r--r-- 1 fangjun staff 59K Jun 16 17:36 tokens.txt
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
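If your wave files are not in this format, you can convert them first; for example, with the soundfile package (an illustrative snippet, not part of sherpa-onnx):

import soundfile as sf

# Convert an arbitrary wave file to single-channel, 16-bit PCM.
samples, sample_rate = sf.read("input.wav", always_2d=True)
sf.write("mono-16bit.wav", samples[:, 0], sample_rate, subtype="PCM_16")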
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-korean-2024-06-16/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/test_wavs/0.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.onnx", decoder="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx", joiner="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.onnx"), ctc=OnlineZipformer2CtcModelConfig(model=""), nemo_ctc=OnlineNeMoCtcModelConfig(model="..., endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_..., blank_penalty=0, temperature_scale=2)

./sherpa-onnx-streaming-zipformer-korean-2024-06-16/test_wavs/0.wav
Elapsed seconds: 0.56, Real time factor (RTF): 0.16
{ "text": "...", "tokens": [...], "timestamps": [0.52, 0.96, 1.28, 1.44, 1.52, 1.84, 2.28, 2.48, 2.88, 3.04, 3.20, 3.44], ... }
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-streaming-zipformer-korean-2024-06-16/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-streaming-zipformer-korean-2024-06-16/test_wavs/0.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx", decoder="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx", joiner="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx"), ctc=OnlineZipformer2CtcModelConfig(model=""), nemo_ctc=OnlineNeMoCtcModelConfig(model="..., endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_..., blank_penalty=0, temperature_scale=2)

{ "text": "...", "tokens": [...], "timestamps": [0.52, 0.96, 1.28, 1.44, 1.52, 1.84, 2.28, 2.48, 2.88, 3.04, 3.20, 3.44], ... }
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.int8.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
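The same model can also be used from the Python API. A minimal sketch follows; the from_transducer factory and the method names below follow recent sherpa-onnx Python examples and should be double-checked against python-api-examples/online-decode-files.py.

import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/tokens.txt",
    encoder="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/encoder-epoch-99-avg-1.onnx",
    decoder="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/decoder-epoch-99-avg-1.onnx",
    joiner="./sherpa-onnx-streaming-zipformer-korean-2024-06-16/joiner-epoch-99-avg-1.onnx",
)

samples, sample_rate = sf.read(
    "./sherpa-onnx-streaming-zipformer-korean-2024-06-16/test_wavs/0.wav",
    dtype="float32",
)
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
stream.input_finished()  # signal end of input so the last frames are decoded
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))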
sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12 (Chinese)
Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1369. It supports only Chinese.
Please refer to https://github.com/k2-fsa/icefall/tree/master/egs/multi_zh-hans/ASR#included-training-sets for the de-
tailed information about the training data. In total, there are 14k hours of training data.
In the following, we describe how to download it and use it with sherpa-onnx.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12.tar.bz2
rm sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12.tar.bz2
ls -lh sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx \
./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/test_wavs/DEV_T0000000000.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx --decoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx --joiner=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx ./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/test_wavs/DEV_T0000000000.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx", decoder="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx", joiner="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx"), ... tokens="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt", num_threads=1, ... scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_...

./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/test_wavs/DEV_T0000000000.wav
Elapsed seconds: 0.65, Real time factor (RTF): 0.12
{"text":"...","timestamps":"[..., 3.72, 3.84, 3.96, 4.04, 4.16, 4.28, 4.36, 4.60, 4.72]","tokens":[...]}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx \
./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/test_wavs/DEV_T0000000000.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx --joiner=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx ./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/test_wavs/DEV_T0000000000.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx", decoder="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx", joiner="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx"), ... tokens="./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt", num_threads=1, ... scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_...

./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/test_wavs/DEV_T0000000000.wav
Elapsed seconds: 0.5, Real time factor (RTF): 0.088
{"text":"...","timestamps":"[..., 3.72, 3.84, 3.96, 4.04, 4.16, 4.28, 4.36, 4.60, 4.72]","tokens":[...]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
k2-fsa/icefall-asr-zipformer-wenetspeech-streaming-small (Chinese)
k2-fsa/icefall-asr-zipformer-wenetspeech-streaming-large (Chinese)
pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-zipformer-streaming-wenetspeech-20230615.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt \
--encoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.onnx \
--decoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx \
--joiner=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.onnx \
./icefall-asr-zipformer-streaming-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt --encoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.onnx --decoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx --joiner=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.onnx ./icefall-asr-zipformer-streaming-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.onnx", decoder_filename="./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx", joiner_filename="./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.onnx", tokens="./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt", ...), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, ...

{"text":"...","tokens":[...]}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt \
--encoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.int8.onnx \
--decoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx \
--joiner=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.int8.onnx \
./icefall-asr-zipformer-streaming-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt --encoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.int8.onnx --decoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx --joiner=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.int8.onnx ./icefall-asr-zipformer-streaming-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.int8.onnx", decoder_filename="./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx", joiner_filename="./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.int8.onnx", tokens="./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt", num_threads=2, provider="cpu", debug=False), lm_config=..., endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_...

{"text":"...","timestamps":"[..., 3.60, 3.72, 3.84, 3.92, 4.00, 4.16, 4.28, 4.36, 4.64, 4.68, 5.00]","tokens":[...]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./icefall-asr-zipformer-streaming-wenetspeech-20230615/data/lang_char/tokens.txt \
--encoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/encoder-epoch-12-avg-4-chunk-16-left-128.int8.onnx \
--decoder=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/decoder-epoch-12-avg-4-chunk-16-left-128.onnx \
--joiner=./icefall-asr-zipformer-streaming-wenetspeech-20230615/exp/joiner-epoch-12-avg-4-chunk-16-left-128.int8.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-26.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes below.
-rw-r--r-- 1 1001 127 240K Apr 23 06:45 bpe.model
-rw-r--r-- 1 1001 127 1.3M Apr 23 06:45 decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx", decoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx", tokens="./sherpa-onnx-streaming-..., endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_...

./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
Elapsed seconds: 0.51, Real time factor (RTF): 0.077
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
{"text":"AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.68, 1.04, 1.16, 1.24, 1.60, 1.76, 1.80, 1.92, 2.04, 2.24, 2.32, 2.36, 2.52, 2.68, 2.72, 2.80, 2.92, 3.12, 3.40, 3.64, 3.76, 3.92, 4.12, 4.48, 4.68, 4.72, 4.84, 5.00, 5.20, 5.24, 5.36, 5.40, 5.64, 5.76, 5.92, 5.96, 6.08, 6.24, 6.52]","tokens":[" AFTER"," E...
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx", decoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx", tokens="./sherpa-onnx-..., endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_...

./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/0.wav
Elapsed seconds: 0.41, Real time factor (RTF): 0.062
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
{"text":"AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.68, 1.04, 1.16, 1.24, 1.60, 1.76, 1.80, 1.92, 2.04, 2.24, 2.32, 2.36, 2.52, 2.68, 2.72, 2.80, 2.92, 3.12, 3.40, 3.64, 3.76, 3.92, 4.12, 4.48, 4.68, 4.72, 4.84, 5.00, 5.20, 5.24, 5.36, 5.44, 5.64, 5.76, 5.92, 5.96, 6.08, 6.24, 6.52]","tokens":[" AFTER"," E...
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-21 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-06-21.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-21/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-06-21/test_wavs/0.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-21/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-streaming-zipformer-en-2023-06-21/test_wavs/0.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.onnx", tokens="./sherpa-onnx-streaming-zipformer-en-2023-06-21/tokens.txt", num_threads=2, ...), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_...

./sherpa-onnx-streaming-zipformer-en-2023-06-21/test_wavs/0.wav
Elapsed seconds: 0.5, Real time factor (RTF): 0.076
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
{"text":"AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.64, 1.00, 1.12, 1.20, 1.60, 1.76, 1.84, 1.96, 2.08, 2.24, 2.36, 2.40, 2.60, 2.72, 2.80, 2.88, 3.00, 3.20, 3.44, 3.68, 3.76, 3.96, 4.24, 4.52, 4.72, 4.76, 4.88, 5.04, 5.24, 5.28, 5.36, 5.48, 5.64, 5.76, 5.92, 5.96, 6.04, 6.24, 6.36]","tokens":[" AFTER"," E...
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-21/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-06-21/test_wavs/0.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-21/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-streaming-zipformer-en-2023-06-21/test_wavs/0.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.int8.onnx", ...), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_..., decoding_method="greedy_search")

./sherpa-onnx-streaming-zipformer-en-2023-06-21/test_wavs/0.wav
Elapsed seconds: 0.41, Real time factor (RTF): 0.062
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
{"text":"AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.64, 1.00, 1.12, 1.20, 1.60, 1.76, 1.80, 1.96, 2.08, 2.24, 2.36, 2.40, 2.60, 2.72, 2.80, 2.88, 3.00, 3.20, 3.44, 3.68, 3.76, 3.96, 4.24, 4.52, 4.72, 4.76, 4.88, 5.04, 5.24, 5.28, 5.36, 5.48, 5.64, 5.76, 5.92, 5.96, 6.04, 6.24, 6.36]","tokens":[" AFTER"," E...
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-21/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-21/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-21/joiner-epoch-99-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-02-21 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-2023-02-21.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-en-2023-02-21.tar.bz2
rm sherpa-onnx-streaming-zipformer-en-2023-02-21.tar.bz2
Alternatively, you can clone the model repository from huggingface and use git-lfs to fetch the *.onnx files:
cd /path/to/sherpa-onnx
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-02-21
cd sherpa-onnx-streaming-zipformer-en-2023-02-21
git lfs pull --include "*.onnx"
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-02-21/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-02-21/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-02-21/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-02-21/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-en-2023-02-21/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-
˓→streaming-zipformer-en-2023-02-21/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-streaming-zipformer-en-2023-02-21/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-streaming-zipformer-en-2023-02-21/joiner-epoch-99-avg-1.onnx",␣
˓→tokens="./sherpa-onnx-streaming-zipformer-en-2023-02-21/tokens.txt", num_threads=2,␣
˓→debug=False), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_
˓→decoding_method="greedy_search")
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.825 s
Real time factor (RTF): 0.825 / 6.625 = 0.125
The RTF is the elapsed processing time divided by the duration of the audio (6.625 s here), so values below 1 mean decoding runs faster than real time.
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-02-21/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-02-21/encoder-epoch-99-avg-1.int8.
˓→onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-02-21/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-02-21/joiner-epoch-99-avg-1.int8.
˓→onnx \
./sherpa-onnx-streaming-zipformer-en-2023-02-21/test_wavs/0.wav
˓→streaming-zipformer-en-2023-02-21/encoder-epoch-99-avg-1.int8.onnx", decoder_filename=
˓→"./sherpa-onnx-streaming-zipformer-en-2023-02-21/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-streaming-zipformer-en-2023-02-21/joiner-epoch-99-avg-1.int8.
˓→decoding_method="greedy_search")
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.633 s
Real time factor (RTF): 0.633 / 6.625 = 0.096
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-en-2023-02-21/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-2023-02-21/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-2023-02-21/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-2023-02-21/joiner-epoch-99-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (Bilingual, Chinese + English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
rm sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-
˓→99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-
˓→99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-
˓→avg-1.onnx \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline. It switches the console code page to UTF-8 so that non-ASCII output is displayed correctly.
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-
˓→streaming-zipformer-en-2023-02-21/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-streaming-zipformer-en-2023-02-21/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-streaming-zipformer-en-2023-02-21/joiner-epoch-99-avg-1.onnx",␣
˓→tokens="./sherpa-onnx-streaming-zipformer-en-2023-02-21/tokens.txt", num_threads=2,␣
˓→debug=False), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_
˓→decoding_method="greedy_search")
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.815 s
Real time factor (RTF): 0.815 / 6.625 = 0.123
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-
˓→99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-
˓→99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-
˓→avg-1.int8.onnx \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-
˓→streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.int8.onnx",␣
˓→decoder_filename="./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-
˓→epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-bilingual-zh-
˓→en-2023-02-20/joiner-epoch-99-avg-1.int8.onnx", tokens="./sherpa-onnx-streaming-
˓→endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_
˓→"greedy_search")
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
ALWAYS ALWAYS
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.551 s
Real time factor (RTF): 0.551 / 5.100 = 0.108
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-
˓→99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-
˓→99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-
˓→avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
shaojieli/sherpa-onnx-streaming-zipformer-fr-2023-04-14 (French)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-fr-2023-04-14.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-fr-2023-04-14.tar.bz2
rm sherpa-onnx-streaming-zipformer-fr-2023-04-14.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/encoder-epoch-29-avg-9-with-
˓→averaged-model.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/decoder-epoch-29-avg-9-with-
˓→averaged-model.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/joiner-epoch-29-avg-9-with-
˓→averaged-model.onnx \
./sherpa-onnx-streaming-zipformer-fr-2023-04-14/test_wavs/common_voice_fr_19364697.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→streaming-zipformer-fr-2023-04-14/encoder-epoch-29-avg-9-with-averaged-model.onnx",␣
˓→decoder_filename="./sherpa-onnx-streaming-zipformer-fr-2023-04-14/decoder-epoch-29-avg-
˓→9-with-averaged-model.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-fr-
˓→2023-04-14/joiner-epoch-29-avg-9-with-averaged-model.onnx", tokens="./sherpa-onnx-
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/encoder-epoch-29-avg-9-with-
˓→averaged-model.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/decoder-epoch-29-avg-9-with-
˓→averaged-model.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/joiner-epoch-29-avg-9-with-
˓→averaged-model.int8.onnx \
./sherpa-onnx-streaming-zipformer-fr-2023-04-14/test_wavs/common_voice_fr_19364697.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→streaming-zipformer-fr-2023-04-14/encoder-epoch-29-avg-9-with-averaged-model.int8.onnx
˓→", decoder_filename="./sherpa-onnx-streaming-zipformer-fr-2023-04-14/decoder-epoch-29-
˓→avg-9-with-averaged-model.onnx", joiner_filename="./sherpa-onnx-streaming-zipformer-fr-
˓→2023-04-14/joiner-epoch-29-avg-9-with-averaged-model.int8.onnx", tokens="./sherpa-onnx-
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/encoder-epoch-29-avg-9-with-
˓→averaged-model.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/decoder-epoch-29-avg-9-with-
˓→averaged-model.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-fr-2023-04-14/joiner-epoch-29-avg-9-with-
˓→averaged-model.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 (Bilingual, Chinese + English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16.tar.bz2
rm sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: There are two sub-folders in the model directory: 64 and 96. The number represents the chunk size: the larger
the chunk size, the lower the RTF. The default chunk size, used by the models in the top-level directory, is 32.
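For example, to try the 96 chunk size, point the model flags at the files in the 96 sub-folder. This is only a sketch: it assumes the sub-folder holds encoder/decoder/joiner files with the same names as the top-level ones.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/96/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/0.wav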
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-epoch-99-avg-1.onnx -
˓→-decoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder-
˓→epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-
˓→2023-02-16/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-streaming-zipformer-small-
˓→bilingual-zh-en-2023-02-16/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→"./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-epoch-99-
˓→avg-1.onnx", decoder="./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-
˓→16/decoder-epoch-99-avg-1.onnx", joiner="./sherpa-onnx-streaming-zipformer-small-
˓→bilingual-zh-en-2023-02-16/joiner-epoch-99-avg-1.onnx"),␣
˓→ctc=OnlineZipformer2CtcModelConfig(model=""), tokens="./sherpa-onnx-streaming-
˓→endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_
./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/0.wav
Elapsed seconds: 1, Real time factor (RTF): 0.1
MONDAY TODAY IS THEY AFTER TOMORROW
{ "text": " MONDAY TODAY IS THEY AFTER TOMORROW", "tokens": [ "", "", "", " MO", "N",
˓→"DAY", " TO", "DAY", " IS", " THEY", " AFTER", " TO", "M", "OR", "ROW", "", "", "", ""␣
˓→], "timestamps": [ 0.64, 1.08, 1.64, 2.08, 2.20, 2.36, 4.16, 4.36, 5.12, 7.16, 7.44, 8.
˓→00, 8.12, 8.20, 8.44, 9.08, 9.44, 9.64, 9.88 ], "ys_probs": [ -0.000507, -0.056152, -0.
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt␣
˓→\
--encoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-
˓→epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder-
˓→epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner-
˓→epoch-99-avg-1.int8.onnx \
./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-epoch-99-avg-1.int8.
˓→onnx --decoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/
˓→decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-small-bilingual-
˓→zh-en-2023-02-16/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-streaming-zipformer-
˓→small-bilingual-zh-en-2023-02-16/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→"./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-epoch-99-
˓→avg-1.int8.onnx", decoder="./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-
˓→2023-02-16/decoder-epoch-99-avg-1.onnx", joiner="./sherpa-onnx-streaming-zipformer-
˓→small-bilingual-zh-en-2023-02-16/joiner-epoch-99-avg-1.int8.onnx"),␣
˓→ctc=OnlineZipformer2CtcModelConfig(model=""), tokens="./sherpa-onnx-streaming-
˓→endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_
˓→], "timestamps": [ 0.64, 1.08, 1.64, 2.08, 2.20, 2.36, 4.20, 4.36, 5.12, 7.16, 7.44, 8.
˓→00, 8.12, 8.20, 8.40, 9.04, 9.44, 9.64, 9.88 ], "ys_probs": [ -0.000305, -0.152557, -0.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/tokens.txt␣
˓→\
--encoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/encoder-
˓→epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/decoder-
˓→epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16/joiner-
˓→epoch-99-avg-1.int8.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23.tar.bz2
rm sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-99-avg-1.
˓→onnx \
--decoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/decoder-epoch-99-avg-1.
˓→onnx \
--joiner=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/joiner-epoch-99-avg-1.
˓→onnx \
./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/
˓→tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-
˓→99-avg-1.onnx --decoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/decoder-
˓→epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/
˓→joiner-epoch-99-avg-1.onnx ./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/test_
˓→wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→"./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-99-avg-1.onnx",␣
˓→decoder="./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/decoder-epoch-99-avg-1.
˓→onnx", joiner="./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/joiner-epoch-99-avg-
˓→sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/tokens.txt", num_threads=1,␣
˓→scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_
˓→context_score=1.5, decoding_method="greedy_search")
./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/test_wavs/0.wav
Elapsed seconds: 0.21, Real time factor (RTF): 0.038
˓→"","","","","","","","","","","","","","",""]}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-99-avg-1.
˓→int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/decoder-epoch-99-avg-1.
˓→onnx \
--joiner=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/
˓→tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-
˓→99-avg-1.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/
˓→decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-
˓→23/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/
˓→test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→"./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-99-avg-1.int8.onnx",
˓→ decoder="./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/decoder-epoch-99-avg-1.
˓→onnx", joiner="./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/joiner-epoch-99-avg-
˓→"./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/tokens.txt", num_threads=1,␣
˓→scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_
˓→context_score=1.5, decoding_method="greedy_search")
./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/test_wavs/0.wav
Elapsed seconds: 0.16, Real time factor (RTF): 0.028
˓→"","","","","","","","","","","","","","",""]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/encoder-epoch-99-avg-1.
˓→onnx \
--decoder=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/decoder-epoch-99-avg-1.
˓→onnx \
--joiner=./sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23/joiner-epoch-99-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-20M-2023-02-17.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-en-20M-2023-02-17.tar.bz2
rm sherpa-onnx-streaming-zipformer-en-20M-2023-02-17.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.
˓→onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.
˓→onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-1.
˓→onnx \
./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_wavs/0.wav
˓→tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-
˓→99-avg-1.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-
˓→epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/
˓→joiner-epoch-99-avg-1.onnx ./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_
˓→wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→"./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.onnx",␣
˓→decoder="./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.
˓→onnx", joiner="./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-
˓→sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt", num_threads=1,␣
˓→scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_
˓→context_score=1.5, decoding_method="greedy_search")
./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_wavs/0.wav
Elapsed seconds: 0.32, Real time factor (RTF): 0.049
˓→2.36, 2.52, 2.64, 2.68, 2.76, 2.92, 3.08, 3.40, 3.60, 3.72, 3.88, 4.12, 4.48, 4.64, 4.
˓→68, 4.84, 4.96, 5.16, 5.20, 5.32, 5.36, 5.60, 5.72, 5.92, 5.96, 6.08, 6.24, 6.36, 6.52]
˓→","FF","L","EL","S"]}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.
˓→int8.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.
˓→onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-1.
˓→int8.onnx \
./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/test_wavs/0.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/
˓→tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-
˓→99-avg-1.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/
˓→decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-
˓→17/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/
˓→test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→"./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.int8.onnx",
˓→ decoder="./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.
˓→onnx", joiner="./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-
˓→"./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt", num_threads=1,␣
˓→scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_
˓→context_score=1.5, decoding_method="greedy_search")
˓→36, 2.52, 2.64, 2.68, 2.76, 2.92, 3.08, 3.40, 3.60, 3.72, 3.88, 4.12, 4.48, 4.64, 4.68,
˓→ 4.84, 4.96, 5.16, 5.20, 5.32, 5.36, 5.60, 5.72, 5.92, 5.96, 6.08, 6.24, 6.36]","tokens
˓→","S"]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/encoder-epoch-99-avg-1.
˓→onnx \
--decoder=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/decoder-epoch-99-avg-1.
˓→onnx \
--joiner=./sherpa-onnx-streaming-zipformer-en-20M-2023-02-17/joiner-epoch-99-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
Conformer-transducer-based Models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
csukuangfj/sherpa-onnx-streaming-conformer-zh-2023-05-23 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-conformer-zh-2023-05-23.tar.bz2
tar xf sherpa-onnx-streaming-conformer-zh-2023-05-23.tar.bz2
rm sherpa-onnx-streaming-conformer-zh-2023-05-23.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-
˓→streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx",␣
˓→tokens="./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt", num_threads=2,␣
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
˓→"","","","","","","","","","","","","","","","","",""]}
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.559 s
Real time factor (RTF): 0.559 / 5.611 = 0.100
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.int8.
˓→onnx \
--decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.int8.
˓→onnx \
./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-
˓→streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.int8.onnx", decoder_filename=
˓→"./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.int8.
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
˓→"","","","","","","","","","","","","","","","","",""]}
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.493 s
Real time factor (RTF): 0.493 / 5.611 = 0.088
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
LSTM-transducer-based Models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
csukuangfj/sherpa-onnx-lstm-en-2023-02-17 (English)
This model is trained using the GigaSpeech and LibriSpeech datasets.
Please see https://github.com/k2-fsa/icefall/pull/558 for how the model is trained.
You can find the training code at
https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/lstm_transducer_stateless2
In the following, we describe how to download it and use it with sherpa-onnx.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-lstm-en-2023-02-17.tar.bz2
tar xf sherpa-onnx-lstm-en-2023-02-17.tar.bz2
rm sherpa-onnx-lstm-en-2023-02-17.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-lstm-
˓→en-2023-02-17/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-lstm-en-
˓→2023-02-17/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-lstm-en-2023-
˓→02-17/joiner-epoch-99-avg-1.onnx", tokens="./sherpa-onnx-lstm-en-2023-02-17/tokens.txt
max_active_paths=4, decoding_method="greedy_search")
2023-03-31 22:53:22.120185169 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_
˓→setaffinity_np failed for thread: 576406, index: 16, mask: {17, 53, }, error code: 22␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 2.927 s
Real time factor (RTF): 2.927 / 6.625 = 0.442
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-lstm-
˓→en-2023-02-17/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-lstm-
˓→en-2023-02-17/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-lstm-en-
˓→2023-02-17/joiner-epoch-99-avg-1.int8.onnx", tokens="./sherpa-onnx-lstm-en-2023-02-17/
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.009 s
Real time factor (RTF): 1.009 / 6.625 = 0.152
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
--encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-lstm-zh-2023-02-20 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-lstm-zh-2023-02-20.tar.bz2
tar xf sherpa-onnx-lstm-zh-2023-02-20.tar.bz2
rm sherpa-onnx-lstm-zh-2023-02-20.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.onnx \
--decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.onnx \
--joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.onnx \
./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→zh-2023-02-20/encoder-epoch-11-avg-1.onnx", decoder_filename="./sherpa-onnx-lstm-zh-
˓→2023-02-20/decoder-epoch-11-avg-1.onnx", joiner_filename="./sherpa-onnx-lstm-zh-2023-
˓→02-20/joiner-epoch-11-avg-1.onnx", tokens="./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt
˓→decoding_method="greedy_search")
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 3.030 s
Real time factor (RTF): 3.030 / 5.611 = 0.540
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.int8.onnx \
--decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.onnx \
--joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.int8.onnx \
./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→zh-2023-02-20/encoder-epoch-11-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-lstm-
˓→zh-2023-02-20/decoder-epoch-11-avg-1.onnx", joiner_filename="./sherpa-onnx-lstm-zh-
˓→2023-02-20/joiner-epoch-11-avg-1.int8.onnx", tokens="./sherpa-onnx-lstm-zh-2023-02-20/
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.091 s
Real time factor (RTF): 1.091 / 5.611 = 0.194
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.onnx \
--decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.onnx \
--joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
Paraformer models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
csukuangfj/sherpa-onnx-streaming-paraformer-bilingual-zh-en (Chinese + English)
Note: This model does not support timestamps. It is a bilingual model, supporting both Chinese and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
tar xf sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
rm sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/tokens.txt \
--paraformer-encoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/encoder.onnx \
--paraformer-decoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/decoder.onnx \
./sherpa-onnx-streaming-paraformer-bilingual-zh-en/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→tokens.txt --paraformer-encoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/
˓→encoder.onnx --paraformer-decoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/
˓→decoder.onnx ./sherpa-onnx-streaming-paraformer-bilingual-zh-en/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→onnx-streaming-paraformer-bilingual-zh-en/encoder.onnx", decoder="./sherpa-onnx-
˓→streaming-paraformer-bilingual-zh-en/decoder.onnx"), tokens="./sherpa-onnx-streaming-
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
./sherpa-onnx-streaming-paraformer-bilingual-zh-en/test_wavs/0.wav
Elapsed seconds: 2.2, Real time factor (RTF): 0.21
monday today day is the day after tomorrow
{"is_final":false,"segment":0,"start_time":0.0,"text":" monday today day is the day␣
˓→after tomorrow ","timestamps":"[]","tokens":["","","","mon@@","day","today","day","is",
˓→"","","","the","day","after","tom@@","or@@","row","","","",""]}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/tokens.txt \
--paraformer-encoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/encoder.int8.onnx \
--paraformer-decoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/decoder.int8.onnx \
./sherpa-onnx-streaming-paraformer-bilingual-zh-en/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx --tokens=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/
˓→tokens.txt --paraformer-encoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/
˓→encoder.int8.onnx --paraformer-decoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-
˓→en/decoder.int8.onnx ./sherpa-onnx-streaming-paraformer-bilingual-zh-en/test_wavs/0.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→onnx-streaming-paraformer-bilingual-zh-en/encoder.int8.onnx", decoder="./sherpa-onnx-
˓→streaming-paraformer-bilingual-zh-en/decoder.int8.onnx"), tokens="./sherpa-onnx-
˓→config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_
./sherpa-onnx-streaming-paraformer-bilingual-zh-en/test_wavs/0.wav
Elapsed seconds: 1.6, Real time factor (RTF): 0.15
monday today day is the day after tomorrow
{"is_final":false,"segment":0,"start_time":0.0,"text":" monday today day is the day␣
˓→after tomorrow ","timestamps":"[]","tokens":["","","","mon@@","day","today","day","is",
˓→"","","","the","day","after","tom@@","or@@","row","","","",""]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/tokens.txt \
--paraformer-encoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/encoder.int8.
˓→onnx \
--paraformer-decoder=./sherpa-onnx-streaming-paraformer-bilingual-zh-en/decoder.int8.
˓→onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
csukuangfj/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en (Chinese + Cantonese + English)
Note: This model does not support timestamps. It is a trilingual model, supporting Chinese, Cantonese, and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
tar xf sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
rm sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/tokens.txt \
--paraformer-encoder=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/
˓→encoder.onnx \
--paraformer-decoder=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/
˓→decoder.onnx \
./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→cantonese-en/tokens.txt --paraformer-encoder=./sherpa-onnx-streaming-paraformer-
˓→trilingual-zh-cantonese-en/encoder.int8.onnx --paraformer-decoder=./sherpa-onnx-
˓→streaming-paraformer-trilingual-zh-cantonese-en/decoder.int8.onnx ./sherpa-onnx-
˓→streaming-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→onnx-streaming-paraformer-trilingual-zh-cantonese-en/encoder.int8.onnx", decoder="./
˓→sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/decoder.int8.onnx"), wenet_
˓→ctc=OnlineZipformer2CtcModelConfig(model=""), tokens="./sherpa-onnx-streaming-
˓→endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_
˓→rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_
˓→utterance_length=20)), enable_endpoint=True, max_active_paths=4, hotwords_score=1.5,␣
{ "text": "", "tokens": [ "", "", "", "", "", "", "", "", "", "", "", "" ], "timestamps
˓→": [ ], "ys_probs": [ ], "lm_probs": [ ], "context_scores": [ ], "segment": 0,
˓→"start_time": 0.00, "is_final": false}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--tokens=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/tokens.txt \
--paraformer-encoder=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/
˓→encoder.int8.onnx \
--paraformer-decoder=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/
˓→decoder.int8.onnx \
./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
˓→cantonese-en/tokens.txt --paraformer-encoder=./sherpa-onnx-streaming-paraformer-
˓→trilingual-zh-cantonese-en/encoder.int8.onnx --paraformer-decoder=./sherpa-onnx-
˓→streaming-paraformer-trilingual-zh-cantonese-en/decoder.int8.onnx ./sherpa-onnx-
˓→streaming-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→onnx-streaming-paraformer-trilingual-zh-cantonese-en/encoder.int8.onnx", decoder="./
˓→sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/decoder.int8.onnx"), wenet_
˓→ctc=OnlineZipformer2CtcModelConfig(model=""), tokens="./sherpa-onnx-streaming-
˓→endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_
˓→trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_
˓→rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_
˓→utterance_length=20)), enable_endpoint=True, max_active_paths=4, hotwords_score=1.5,␣
{ "text": "", "tokens": [ "", "", "", "", "", "", "", "", "", "", "", "" ], "timestamps
˓→": [ ], "ys_probs": [ ], "lm_probs": [ ], "context_scores": [ ], "segment": 0,
˓→"start_time": 0.00, "is_final": false}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--tokens=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/tokens.txt \
--paraformer-encoder=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/
˓→encoder.int8.onnx \
--paraformer-decoder=./sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en/
˓→decoder.int8.onnx
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
Zipformer-CTC-based Models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13 (Chinese)
Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1369. It supports only Chinese.
Please refer to https://github.com/k2-fsa/icefall/tree/master/egs/multi_zh-hans/ASR#included-training-sets for detailed
information about the training data. In total, there are 14k hours of training data.
In the following, we describe how to download it and use it with sherpa-onnx.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13.tar.bz2
rm sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13.tar.bz2
$ ls -lh sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13
total 654136
-rw-r--r--@ 1 fangjun staff 28B Dec 13 16:19 README.md
-rw-r--r--@ 1 fangjun staff 258K Dec 13 16:19 bpe.model
-rw-r--r--@ 1 fangjun staff 68M Dec 13 16:19 ctc-epoch-20-avg-1-chunk-16-left-128.
˓→int8.onnx
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/
˓→ctc-epoch-20-avg-1-chunk-16-left-128.onnx \
--tokens=./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt \
./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx --zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-
˓→multi-zh-hans-2023-12-13/ctc-epoch-20-avg-1-chunk-16-left-128.onnx --tokens=./sherpa-
˓→onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt ./sherpa-onnx-
˓→streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→zipformer2_ctc=OnlineZipformer2CtcModelConfig(model="./sherpa-onnx-streaming-zipformer-
˓→ctc-multi-zh-hans-2023-12-13/ctc-epoch-20-avg-1-chunk-16-left-128.onnx"), tokens="./
˓→sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt", num_
./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.
˓→wav
˓→ 3.72, 3.84, 3.96, 4.04, 4.16, 4.28, 4.36, 4.60, 4.80], "tokens":[" ", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
int8
The following code shows how to use int8 models to decode a wave file:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx \
--zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/
˓→ctc-epoch-20-avg-1-chunk-16-left-128.int8.onnx \
--tokens=./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt \
./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx --zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-
˓→multi-zh-hans-2023-12-13/ctc-epoch-20-avg-1-chunk-16-left-128.int8.onnx --tokens=./
˓→sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt ./sherpa-onnx-
˓→streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_
˓→dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=
˓→zipformer2_ctc=OnlineZipformer2CtcModelConfig(model="./sherpa-onnx-streaming-zipformer-
˓→ctc-multi-zh-hans-2023-12-13/ctc-epoch-20-avg-1-chunk-16-left-128.int8.onnx"), tokens=
˓→"./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt", num_
./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.
˓→wav
˓→ 3.72, 3.84, 3.96, 4.04, 4.16, 4.28, 4.36, 4.60, 4.84], "tokens":[" ", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone \
--zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/
˓→ctc-epoch-20-avg-1-chunk-16-left-128.onnx \
--tokens=./sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt
Hint: If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech
recognition with your microphone if sherpa-onnx-microphone does not work for you.
Zipformer-transducer-based Models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
sherpa-onnx-zipformer-ru-2024-09-18 (Russian)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-ru-2024-09-18.tar.bz2
tar xf sherpa-onnx-zipformer-ru-2024-09-18.tar.bz2
rm sherpa-onnx-zipformer-ru-2024-09-18.tar.bz2
ls -lh sherpa-onnx-zipformer-ru-2024-09-18
total 700352
-rw-r--r-- 1 fangjun staff 240K Sep 18 12:01 bpe.model
-rw-r--r-- 1 fangjun staff 1.2M Sep 18 12:01 decoder.int8.onnx
-rw-r--r-- 1 fangjun staff 2.0M Sep 18 12:01 decoder.onnx
-rw-r--r-- 1 fangjun staff 65M Sep 18 12:01 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 247M Sep 18 12:01 encoder.onnx
-rw-r--r-- 1 fangjun staff 253K Sep 18 12:01 joiner.int8.onnx
-rw-r--r-- 1 fangjun staff 1.0M Sep 18 12:01 joiner.onnx
drwxr-xr-x 4 fangjun staff 128B Sep 18 12:01 test_wavs
-rw-r--r-- 1 fangjun staff 6.2K Sep 18 12:01 tokens.txt
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.onnx \
--decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.onnx \
--num-threads=1 \
./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx \
--num-threads=1 \
./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav
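To run speech recognition from a microphone with this model, use sherpa-onnx-microphone-offline: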
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx
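You can also use silero-vad to detect speech segments from the microphone and recognize only the detected speech: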
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx
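To apply VAD plus non-streaming recognition to a long wave file rather than a microphone, recent builds also ship a sherpa-onnx-vad-with-offline-asr binary; a sketch under that assumption (long.wav is a hypothetical placeholder for your own recording):

cd /path/to/sherpa-onnx

# Split long.wav into speech segments with silero VAD, then decode each segment.
./build/bin/sherpa-onnx-vad-with-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx \
./long.wav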
sherpa-onnx-small-zipformer-ru-2024-09-18 (Russian)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-small-zipformer-ru-2024-09-18.tar.bz2
tar xf sherpa-onnx-small-zipformer-ru-2024-09-18.tar.bz2
rm sherpa-onnx-small-zipformer-ru-2024-09-18.tar.bz2
ls -lh sherpa-onnx-small-zipformer-ru-2024-09-18/
total 257992
-rw-r--r-- 1 fangjun staff 240K Sep 18 12:02 bpe.model
-rw-r--r-- 1 fangjun staff 1.2M Sep 18 12:02 decoder.int8.onnx
-rw-r--r-- 1 fangjun staff 2.0M Sep 18 12:02 decoder.onnx
-rw-r--r-- 1 fangjun staff 24M Sep 18 12:02 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 86M Sep 18 12:02 encoder.onnx
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.onnx \
--decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.onnx \
--num-threads=1 \
./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx \
--num-threads=1 \
./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav
Speech recognition from a microphone

cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx
Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
--encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx \
--decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
--joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx
sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 (Japanese)
This model is from ReazonSpeech and supports only Japanese. It is trained on 35k hours of data.
The code for training the model can be found at https://github.com/k2-fsa/icefall/tree/master/egs/reazonspeech/ASR
The paper describing the dataset is available at https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
In the following, we describe how to download it and use it with sherpa-onnx.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2
tar xf sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2
rm sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2
ls -lh sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.onnx \
--num-threads=1 \
./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx \
--num-threads=1 \
./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx
Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx
sherpa-onnx-zipformer-korean-2024-06-24 (Korean)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-korean-2024-06-24.tar.bz2
tar xf sherpa-onnx-zipformer-korean-2024-06-24.tar.bz2
rm sherpa-onnx-zipformer-korean-2024-06-24.tar.bz2
ls -lh sherpa-onnx-zipformer-korean-2024-06-24
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt \
--encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), ...), ...)

./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav
{"text": "...", "timestamps": [0.12, 0.24, 0.56, 1.00, 1.20, 1.32, 2.00, 2.16, 2.32, 2.52, 2.68, 2.80, 3.08, 3.28], "tokens": [...]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.119 s
Real time factor (RTF): 0.119 / 3.526 = 0.034
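Hint: The real time factor (RTF) reported above is simply the ratio of decoding time to audio duration:

RTF = (elapsed seconds) / (audio duration in seconds)

Here, 0.119 s of computation for 3.526 s of audio gives an RTF of about 0.034, i.e. decoding runs roughly 30 times faster than real time; any value below 1 is faster than real time.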
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt \
--encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), ...), ...)

./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav
{"text": "...", "timestamps": [0.12, 0.24, 0.56, 1.00, 1.20, 1.32, 2.00, 2.16, 2.32, 2.52, 2.68, 2.84, 3.08, 3.28], "tokens": [...]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.092 s
Real time factor (RTF): 0.092 / 3.526 = 0.026
Speech recognition from a microphone

cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt \
--encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx
sherpa-onnx-zipformer-thai-2024-06-20 (Thai)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-thai-2024-06-20.tar.bz2
tar xf sherpa-onnx-zipformer-thai-2024-06-20.tar.bz2
rm sherpa-onnx-zipformer-thai-2024-06-20.tar.bz2
ls -lh sherpa-onnx-zipformer-thai-2024-06-20
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt \
--encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.onnx \
--decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx \
--joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.onnx \
./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt \
--encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.int8.onnx \
--decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx \
--joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.int8.onnx \
./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
Speech recognition from a microphone

cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt \
--encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.int8.onnx \
--decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx \
--joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.int8.onnx
sherpa-onnx-zipformer-cantonese-2024-03-13 (Cantonese)
Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1537. It supports only Cantonese
since it is trained on a Cantonese dataset. The paper for the dataset can be found at https://arxiv.org/pdf/2201.02419.pdf.
In the following, we describe how to download it and use it with sherpa-onnx.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-cantonese-2024-03-13.tar.bz2
tar xf sherpa-onnx-zipformer-cantonese-2024-03-13.tar.bz2
rm sherpa-onnx-zipformer-cantonese-2024-03-13.tar.bz2
ls -lh sherpa-onnx-zipformer-cantonese-2024-03-13
total 340M
-rw-r--r-- 1 1001 127 2.7M Mar 13 09:06 decoder-epoch-45-avg-35.int8.onnx
-rw-r--r-- 1 1001 127 11M Mar 13 09:06 decoder-epoch-45-avg-35.onnx
-rw-r--r-- 1 1001 127 67M Mar 13 09:06 encoder-epoch-45-avg-35.int8.onnx
-rw-r--r-- 1 1001 127 248M Mar 13 09:06 encoder-epoch-45-avg-35.onnx
-rw-r--r-- 1 1001 127 2.4M Mar 13 09:06 joiner-epoch-45-avg-35.int8.onnx
-rw-r--r-- 1 1001 127 9.5M Mar 13 09:06 joiner-epoch-45-avg-35.onnx
drwxr-xr-x 2 1001 127 4.0K Mar 13 09:06 test_wavs
-rw-r--r-- 1 1001 127 42K Mar 13 09:06 tokens.txt
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files. Note the --blank-penalty=1.2 option: it subtracts a constant from the blank symbol's log-probability during decoding, which discourages the model from emitting blanks and thus reduces deletion errors.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--blank-penalty=1.2 \
--tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt \
--encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.onnx \
--decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx \
--joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.onnx \
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav \
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.onnx", decoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx", joiner_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.onnx"), ...), tokens="./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt", ...)

./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav
{"text": "...", "timestamps": [0.00, 0.88, 1.28, 1.52, 1.84, 2.08, 2.32, 2.56, 2.80, 3.04, 3.20, 3.44, 3.68, 3.92], "tokens": [...]}
----
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav
{"text": "...", "timestamps": [0.00, 0.64, 0.88, 1.12, 1.28, 1.60, 1.80, 2.16, 2.36, 2.56, 2.88, 3.08, 3.32, 3.44, 3.60], "tokens": [...]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.349 s
Real time factor (RTF): 1.349 / 10.320 = 0.131
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
--blank-penalty=1.2 \
--tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt \
--encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx \
--decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx \
--joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx \
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav \
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx", joiner_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx"), ...), tokens="./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt", ...)

./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav
{"text": "...", "timestamps": [0.00, 0.88, 1.28, 1.52, 1.84, 2.08, 2.32, 2.56, 2.80, 3.04, 3.20, 3.44, 3.68, 3.92], "tokens": [...]}
----
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav
{"text": "...", "timestamps": [0.00, 0.64, 0.88, 1.12, 1.28, 1.60, 1.80, 2.16, 2.36, 2.56, 2.88, 3.08, 3.32, 3.44, 3.60], "tokens": [...]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.907 s
Real time factor (RTF): 0.907 / 10.320 = 0.088
Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt \
--encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx \
--decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx \
--joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx
sherpa-onnx-zipformer-gigaspeech-2023-12-12 (English)
Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1254. It supports only English since
it is trained on the GigaSpeech dataset.
In the following, we describe how to download it and use it with sherpa-onnx.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-gigaspeech-2023-12-12.tar.bz2
tar xf sherpa-onnx-zipformer-gigaspeech-2023-12-12.tar.bz2
rm sherpa-onnx-zipformer-gigaspeech-2023-12-12.tar.bz2
ls -lh sherpa-onnx-zipformer-gigaspeech-2023-12-12
total 656184
-rw-r--r-- 1 fangjun staff 28B Dec 12 19:00 README.md
-rw-r--r-- 1 fangjun staff 239K Dec 12 19:00 bpe.model
-rw-r--r-- 1 fangjun staff 528K Dec 12 19:00 decoder-epoch-30-avg-1.int8.onnx
-rw-r--r-- 1 fangjun staff 2.0M Dec 12 19:00 decoder-epoch-30-avg-1.onnx
-rw-r--r-- 1 fangjun staff 68M Dec 12 19:00 encoder-epoch-30-avg-1.int8.onnx
-rw-r--r-- 1 fangjun staff 249M Dec 12 19:00 encoder-epoch-30-avg-1.onnx
-rw-r--r-- 1 fangjun staff 253K Dec 12 19:00 joiner-epoch-30-avg-1.int8.onnx
-rw-r--r-- 1 fangjun staff 1.0M Dec 12 19:00 joiner-epoch-30-avg-1.onnx
drwxr-xr-x 5 fangjun staff 160B Dec 12 19:00 test_wavs
-rw-r--r-- 1 fangjun staff 4.9K Dec 12 19:00 tokens.txt
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx \
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav \
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav \
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx"), ...), tokens="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt", ...)
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.00, 0.36, 0.52, 0.68, 0.96, 1.00, 1.
˓→08, 1.28, 1.40, 1.48, 1.60, 1.76, 1.80, 1.88, 1.92, 2.00, 2.20, 2.32, 2.36, 2.48, 2.60,
˓→ 2.80, 2.84, 2.92, 3.12, 3.32, 3.56, 3.76, 4.04, 4.20, 4.32, 4.40, 4.56, 4.80, 4.92, 5.
˓→08, 5.36, 5.48, 5.64, 5.72, 5.88, 6.04, 6.24], "tokens":[" AFTER", " E", "AR", "LY", "
˓→", "N", "IGHT", "F", "AL", "L", " THE", " ", "Y", "E", "LL", "OW", " LA", "M", "P", "S
˓→", " WOULD", " ", "L", "IGHT", " UP", " HERE", " AND", " THERE", " THE", " S", "QU",
˓→"AL", "ID", " QU", "AR", "TER", " OF", " THE", " B", "RO", "TH", "EL", "S"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav
{"text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER␣
˓→A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR␣
˓→EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN",
˓→"timestamps": [0.00, 0.16, 0.40, 0.68, 0.84, 0.96, 1.04, 1.12, 1.32, 1.52, 1.68, 1.76,␣
˓→2.00, 2.12, 2.28, 2.40, 2.64, 2.92, 3.20, 3.32, 3.52, 3.64, 3.76, 3.96, 4.12, 4.36, 4.
˓→52, 4.72, 4.92, 5.16, 5.40, 5.64, 5.76, 5.88, 6.12, 6.28, 6.48, 6.84, 7.08, 7.32, 7.60,
˓→ 7.92, 8.12, 8.24, 8.36, 8.48, 8.64, 8.76, 8.88, 9.12, 9.32, 9.48, 9.56, 9.60, 9.76,␣
˓→10.00, 10.12, 10.20, 10.44, 10.68, 10.80, 11.00, 11.20, 11.36, 11.52, 11.76, 12.00, 12.
˓→12, 12.24, 12.28, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.64, 13.76, 14.00,
˓→ 14.08, 14.24, 14.52, 14.68, 14.80, 15.00, 15.04, 15.28, 15.52, 15.76, 16.00, 16.12,␣
˓→16.20, 16.32], "tokens":[" GO", "D", " AS", " A", " DI", "RE", "C", "T", " CON", "SE",
˓→"QU", "ENCE", " OF", " THE", " S", "IN", " WHICH", " MAN", " TH", "US", " P", "UN",
˓→"ISH", "ED", " HAD", " GIVE", "N", " HER", " A", " LOVE", "LY", " CHI", "L", "D", " WHO
˓→", "SE", " PLACE", " WAS", " ON", " THAT", " SAME", " DIS", "HO", "N", "OR", "ED", " BO
˓→", "S", "OM", " TO", " CON", "NE", "C", "T", " HER", " PA", "R", "ENT", " FOR", " E",
˓→"VER", " WITH", " THE", " RA", "CE", " AND", " DE", "S", "C", "ENT", " OF", " MO", "R",
˓→ "T", "AL", "S", " AND", " TO", " BE", " F", "IN", "ALLY", " A", " B", "LES", "S", "ED
˓→", " SO", "UL", " IN", " HE", "A", "VE", "N"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav
{"text": " YET THESE THOUGHTS AFFECTED HESTER PRYNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps": [0.00, 0.04, 0.12, 0.40, 0.68, 0.88, 0.96, 1.12, 1.20, 1.32, 1.44, 1.48,␣
˓→1.64, 1.76, 1.88, 2.04, 2.16, 2.28, 2.52, 2.68, 2.72, 2.88, 3.12, 3.28, 3.52, 3.80, 4.
˓→00, 4.16, 4.24, 4.40, 4.48], "tokens":[" ", "Y", "ET", " THESE", " THOUGH", "T", "S",
˓→" A", "FF", "E", "C", "TED", " HE", "S", "TER", " P", "RY", "NE", " LE", "S", "S", "␣
˓→WITH", " HO", "PE", " THAN", " APP", "RE", "HE", "N", "S", "ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.407 s
Real time factor (RTF): 1.407 / 28.165 = 0.050
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx \
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav \
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav \
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx"), ...), tokens="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt", ...)
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.00, 0.36, 0.52, 0.68, (continues on next page)
0.96, 1.00, 1.
˓→08, 1.28, 1.40, 1.48, 1.60, 1.76, 1.80, 1.88, 1.92, 2.00, 2.20, 2.32, 2.36, 2.48, 2.60,
˓→ 2.80, 2.84, 2.92, 3.12, 3.32, 3.56, 3.76, 4.04, 4.24, 4.32, 4.40, 4.56, 4.80, 4.92, 5.
422 Chapter 8. sherpa-onnx
˓→08, 5.36, 5.48, 5.64, 5.72, 5.88, 6.04, 6.24], "tokens":[" AFTER", " E", "AR", "LY", "
˓→", "N", "IGHT", "F", "AL", "L", " THE", " ", "Y", "E", "LL", "OW", " LA", "M", "P", "S
˓→", " WOULD", " ", "L", "IGHT", " UP", " HERE", " AND", " THERE", " THE", " S", "QU",
˓→"AL", "ID", " QU", "AR", "TER", " OF", " THE", " B", "RO", "TH", "EL", "S"]}
sherpa, Release 1.3
˓→EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN",
˓→"timestamps": [0.00, 0.16, 0.40, 0.68, 0.84, 0.96, 1.08, 1.12, 1.32, 1.52, 1.68, 1.76,␣
˓→2.00, 2.12, 2.28, 2.40, 2.64, 2.92, 3.20, 3.32, 3.52, 3.64, 3.76, 3.96, 4.12, 4.36, 4.
˓→52, 4.72, 4.92, 5.16, 5.40, 5.64, 5.76, 5.88, 6.12, 6.28, 6.52, 6.84, 7.08, 7.32, 7.60,
˓→ 7.92, 8.12, 8.24, 8.36, 8.48, 8.64, 8.76, 8.88, 9.12, 9.32, 9.48, 9.56, 9.60, 9.76,␣
˓→10.00, 10.12, 10.20, 10.44, 10.68, 10.80, 11.00, 11.20, 11.36, 11.52, 11.76, 12.00, 12.
˓→12, 12.24, 12.28, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.44, 13.64, 13.76, 14.00,
˓→ 14.08, 14.24, 14.52, 14.68, 14.80, 15.00, 15.04, 15.28, 15.48, 15.76, 16.00, 16.12,␣
˓→16.16, 16.32], "tokens":[" GO", "D", " AS", " A", " DI", "RE", "C", "T", " CON", "SE",
˓→"QU", "ENCE", " OF", " THE", " S", "IN", " WHICH", " MAN", " TH", "US", " P", "UN",
˓→"ISH", "ED", " HAD", " GIVE", "N", " HER", " A", " LOVE", "LY", " CHI", "L", "D", " WHO
˓→", "SE", " PLACE", " WAS", " ON", " THAT", " SAME", " DIS", "HO", "N", "OR", "ED", " BO
˓→", "S", "OM", " TO", " CON", "NE", "C", "T", " HER", " PA", "R", "ENT", " FOR", " E",
˓→"VER", " WITH", " THE", " RA", "CE", " AND", " DE", "S", "C", "ENT", " OF", " MO", "R",
˓→ "T", "AL", "S", " AND", " TO", " BE", " F", "IN", "ALLY", " A", " B", "LES", "S", "ED
˓→", " SO", "UL", " IN", " HE", "A", "VE", "N"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav
{"text": " YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps": [0.00, 0.04, 0.12, 0.40, 0.68, 0.88, 0.96, 1.12, 1.24, 1.32, 1.44, 1.48,␣
˓→1.64, 1.76, 1.88, 2.04, 2.16, 2.28, 2.32, 2.52, 2.68, 2.72, 2.88, 3.12, 3.32, 3.52, 3.
˓→80, 4.00, 4.16, 4.24, 4.40, 4.48], "tokens":[" ", "Y", "ET", " THESE", " THOUGH", "T",
˓→"S", " A", "FF", "E", "C", "TED", " HE", "S", "TER", " P", "RY", "N", "NE", " LE", "S",
˓→ "S", " WITH", " HO", "PE", " THAN", " APP", "RE", "HE", "N", "S", "ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.101 s
Real time factor (RTF): 1.101 / 28.165 = 0.039
Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx
Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
--encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx
zrjin/sherpa-onnx-zipformer-multi-zh-hans-2023-9-2 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-multi-zh-hans-2023-9-2.tar.bz2
tar xf sherpa-onnx-zipformer-multi-zh-hans-2023-9-2.tar.bz2
rm sherpa-onnx-zipformer-multi-zh-hans-2023-9-2.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt \
--encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx \
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav \
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav \
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx"), ...), ...)
Done!
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav
{"text":" ","timestamps":"[0.00, 0.16, 0.40, 0.60, 0.84, 1.08, 1.60, 1.72, 1.88, 2.04, 2.
˓→24, 2.44, 2.60, 2.96, 3.12, 3.32, 3.40, 3.60, 3.72, 3.84, 4.00, 4.16, 4.32, 4.52, 4.68]
˓→","tokens":[" ","","","","","","","","","","","","","","","","","","","","","","","","
˓→"]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav
{"text":" <0xE8><0x8D><0xA1>","timestamps":"[0.00, 0.12, 0.48, 0.68, 0.92, 1.12, 1.28, 1.
˓→48, 1.80, 2.04, 2.40, 2.56, 2.76, 2.96, 3.08, 3.32, 3.48, 3.68, 3.84, 4.00, 4.20, 4.24,
˓→","","","","<0xE8>","<0x8D>","<0xA1>","","",""]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav
{"text":" <0xE8><0x8D><0xA1>","timestamps":"[0.00, 0.04, 0.24, 0.52, 0.76, 1.00, 1.40, 1.
˓→64, 1.80, 2.12, 2.32, 2.64, 2.80, 3.00, 3.20, 3.24, 3.28, 3.44, 3.64, 3.76, 3.96, 4.20]
˓→","tokens":[" ","","","","","","","","","","","","","","<0xE8>","<0x8D>","<0xA1>","","
˓→","","",""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.362 s
Real time factor (RTF): 0.362 / 15.289 = 0.024
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt \
--encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.int8.onnx \
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav \
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav \
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.int8.onnx"), ...), ...)
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav
{"text":" ","timestamps":"[0.00, 0.16, 0.40, 0.60, 0.84, 1.08, 1.60, 1.72, 1.88, 2.04, 2.
˓→28, 2.44, 2.60, 2.96, 3.12, 3.32, 3.40, 3.60, 3.76, 3.84, 4.00, 4.16, 4.32, 4.52, 4.56]
˓→","tokens":[" ","","","","","","","","","","","","","","","","","","","","","","","","
˓→"]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav
{"text":" <0xE8><0x8D><0xA1>","timestamps":"[0.00, 0.12, 0.48, 0.68, 0.92, 1.12, 1.28, 1.
˓→48, 1.80, 2.04, 2.40, 2.56, 2.76, 2.96, 3.08, 3.32, 3.48, 3.68, 3.84, 4.00, 4.20, 4.24,
˓→","","","","<0xE8>","<0x8D>","<0xA1>","","",""]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav
{"text":" <0xE8><0x8D><0xA1>","timestamps":"[0.00, 0.04, 0.24, 0.52, 0.76, 1.00, 1.40, 1.
˓→64, 1.80, 2.12, 2.36, 2.64, 2.80, 3.04, 3.16, 3.20, 3.24, 3.44, 3.64, 3.76, 3.96, 4.20]
˓→","tokens":[" ","","","","","","","","","","","","","","<0xE8>","<0x8D>","<0xA1>","","
˓→","","",""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.305 s
Real time factor (RTF): 0.305 / 15.289 = 0.020
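As the in_sample_rate/output_sample_rate lines in the log above show, sherpa-onnx automatically resamples the 8 kHz test file to the 16 kHz expected by the model. To check the sample rate and channel count of your own wave files before decoding, you can use soxi from the sox package (an assumption: sox is installed):

# Print format information (channels, sample rate, bit depth) for a wave file.
soxi ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav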
Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt \
--encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx
yfyeung/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17.tar.bz2
tar xf icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17.tar.bz2
rm icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17 fangjun$ ls -lh exp/*epoch-60-avg-20*.onnx
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt \
--encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx \
--decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx \
--joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx \
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav \
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav \
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx", decoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx", joiner_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt", ...)

./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.00, 0.64, 0.76, 0.84, 1.04, 1.08, 1.16, 1.32, 1.44, 1.56, 1.72, 1.84, 1.88, 1.92, 1.96, 2.04, 2.16, 2.32, 2.48, 2.56, 2.76, 2.80, 2.84, 3.08, 3.28, 3.40, 3.52, 3.68, 4.00, 4.24, 4.28, 4.52, 4.68, 4.84, 4.88, 4.96, 5.04, 5.28, 5.40, 5.52, 5.72, 5.88, 6.08], "tokens": [" AFTER", " E", "AR", "LY", " ", "N", ...]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav
{"text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN", "timestamps": [0.04, 0.44, 0.64, 0.84, 0.96, 1.32, 1.52, 1.68, 1.84, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.44, 3.52, 3.72, 3.88, 4.20, 4.40, 4.48, 4.60, 4.76, 4.96, 5.08, 5.24, 5.36, 5.56, 5.80, 6.20, 6.32, 6.52, 6.92, 7.16, 7.36, 7.60, 7.76, 7.92, 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.84, 9.08, 9.24, 9.44, 9.48, 9.72, 9.88, 10.04, 10.12, 10.52, 10.76, 10.84, 11.08, 11.24, 11.36, 11.60, 11.76, 11.96, 12.08, 12.24, 12.28, 12.48, 12.72, 12.84, 12.92, 13.00, 13.20, 13.52, 13.76, 13.88, 14.08, 14.28, 14.52, 14.64, 14.76, 14.96, 15.04, 15.24, 15.48, 15.68, 15.84, 16.00, 16.04], "tokens": [..., " LO", "VE", "LY", " CHI", "LD", " WHO", "SE", " PLACE", " WAS", " ON", " THAT", " SA", "ME", " DIS", ..., " HE", "A", "VEN"]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav
{"text": " YET THESE THOUGHTS AFFECTED HESTER PRIN LESS WITH HOPE THAN APPREHENSION", "timestamps": [0.00, 0.04, 0.12, 0.56, 0.80, 0.88, 1.00, 1.04, 1.12, 1.20, 1.28, 1.40, 1.52, 1.64, 1.76, 1.84, 2.04, 2.24, 2.40, 2.64, 2.68, 2.84, 3.04, 3.24, 3.44, 3.52, ...], "tokens": [..., "TH", "AN", " APP", "RE", "HE", "N", "S", "ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.611 s
Real time factor (RTF): 1.611 / 28.165 = 0.057
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt \
--encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.int8.onnx \
--decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx \
--joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.int8.onnx \
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav \
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav \
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.int8.onnx", decoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx", joiner_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt", ...)
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.00, 0.64, 0.76, 0.84, 1.04, 1.08, 1.16, 1.36, 1.44, 1.56, 1.72, 1.84, 1.88, 1.92, 1.96, 2.04, 2.20, 2.32, 2.48, 2.56, 2.76, 2.80, 2.84, 3.08, 3.28, 3.40, 3.52, 3.68, 4.00, 4.24, 4.28, 4.52, 4.68, 4.84, 4.88, 4.96, 5.04, 5.28, 5.36, 5.52, 5.72, 5.88, 6.08], "tokens": [" AFTER", " E", "AR", "LY", " ", "N", ...]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav
{"text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN", "timestamps": [0.04, 0.44, 0.64, 0.84, 0.96, 1.32, 1.52, 1.68, 1.84, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.44, 3.52, 3.72, 3.88, 4.20, 4.40, 4.48, 4.60, 4.76, 4.96, 5.08, 5.24, 5.36, 5.56, 5.80, 6.20, 6.32, 6.52, 6.92, 7.16, 7.32, 7.60, 7.76, 7.92, 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.84, 9.08, 9.24, 9.44, 9.48, 9.72, 9.88, 10.04, 10.12, 10.52, 10.76, 10.84, 11.08, 11.24, 11.36, 11.60, 11.76, 11.96, 12.08, 12.24, 12.28, 12.48, 12.72, 12.84, 12.92, 13.00, 13.20, 13.52, 13.76, 13.88, 14.08, 14.28, 14.52, 14.64, 14.76, 14.96, 15.04, 15.24, 15.48, 15.68, 15.84, 16.00, 16.04], "tokens": [..., " LO", "VE", "LY", " CHI", "LD", " WHO", "SE", " PLACE", " WAS", " ON", " THAT", " SA", "ME", " DIS", ..., " HE", "A", "VEN"]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav
{"text": " YET THESE THOUGHTS AFFECTED HESTER PRIN LESS WITH HOPE THAN APPREHENSION", "timestamps": [0.00, 0.04, 0.12, 0.56, 0.80, 0.88, 1.00, 1.04, 1.12, 1.20, 1.28, 1.40, 1.52, 1.64, 1.76, 1.84, 2.04, 2.24, 2.40, 2.64, 2.68, 2.84, 3.04, 3.24, 3.44, 3.52, ...], "tokens": [..., "TH", "AN", " APP", "RE", "HE", "N", "S", "ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.368 s
Real time factor (RTF): 1.368 / 28.165 = 0.049
Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
--tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt \
--encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx \
--decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx \
--joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx
k2-fsa/icefall-asr-zipformer-wenetspeech-small (Chinese)
k2-fsa/icefall-asr-zipformer-wenetspeech-large (Chinese)
pkufool/icefall-asr-zipformer-wenetspeech-20230615 (Chinese)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-zipformer-wenetspeech-20230615.tar.bz2
tar xf icefall-asr-zipformer-wenetspeech-20230615.tar.bz2
rm icefall-asr-zipformer-wenetspeech-20230615.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt \
--encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx \
--decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx \
--joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx \
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav \
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav \
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx", decoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx", joiner_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx"), ...), tokens="./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt", ...)

./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav
{"text": "...", "timestamps": [0.00, 0.12, 0.48, 0.64, 0.88, 1.16, 1.64, 1.76, 1.92, 2.08, 2.32, 2.48, 2.64, 3.08, 3.20, 3.40, 3.48, 3.64, 3.76, 3.88, 3.96, 4.12, 4.28, 4.52, 4.72, 4.84], "tokens": [...]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav
{"text": "...", "timestamps": [0.00, 0.16, 0.48, 0.72, 0.92, 1.08, 1.28, 1.52, 1.92, 2.08, 2.52, 2.64, 2.88, 3.04, 3.20, 3.40, 3.56, 3.76, 3.84, 4.00, 4.16, 4.32, 4.56, 4.84], "tokens": [...]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.458 s
Real time factor (RTF): 0.458 / 15.289 = 0.030
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt \
--encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.int8.onnx \
--decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx \
--joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.int8.onnx \
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav \
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav \
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.int8.onnx", decoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx", joiner_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.int8.onnx"), ...), tokens="./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt", ...)

./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav
{"text": "...", "timestamps": [0.00, 0.12, 0.48, 0.60, 0.80, 1.08, 1.64, 1.76, 1.92, 2.08, 2.32, 2.48, 2.64, 3.08, 3.20, 3.28, 3.44, 3.60, 3.72, 3.84, 3.92, 4.12, 4.28, 4.48, 4.72, 4.84], "tokens": [...]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav
{"text": "...", "timestamps": [0.00, 0.16, 0.48, 0.68, 0.84, 1.08, 1.20, 1.48, 1.64, 2.08, 2.36, 2.52, 2.64, 2.84, 3.00, 3.16, 3.40, 3.52, 3.72, 3.84, 4.00, 4.16, 4.32, 4.56, 4.84], "tokens": [...]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav
{"text": "...", "timestamps": [0.00, 0.12, 0.48, 0.84, 1.08, 1.44, 1.60, 1.84, 2.24, 2.48, 2.76, 2.88, 3.12, 3.24, 3.28, 3.36, 3.60, 3.72, 3.84, 4.16], "tokens": [...]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.338 s
Real time factor (RTF): 0.338 / 15.289 = 0.022
Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
--tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt \
--encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx \
--decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx \
--joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx
csukuangfj/sherpa-onnx-zipformer-large-en-2023-06-26 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-large-en-2023-06-26.tar.bz2
tar xf sherpa-onnx-zipformer-large-en-2023-06-26.tar.bz2
rm sherpa-onnx-zipformer-large-en-2023-06-26.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files with a single channel and 16-bit encoded samples; the sampling rate,
however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx"), ...), tokens="./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt", ..., active_paths=4, context_score=1.5)
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.48, 0.60, 0.72, 1.04, 1.28, 1.
˓→36, 1.48, 1.60, 1.84, 1.96, 2.00, 2.16, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40,
˓→ 3.56, 3.76, 4.04, 4.24, 4.28, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR␣
˓→EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN",
˓→"timestamps":"[0.00, 0.20, 0.48, 0.72, 0.88, 1.04, 1.12, 1.20, 1.36, 1.52, 1.68, 1.84,␣
˓→1.88, 2.00, 2.12, 2.32, 2.36, 2.60, 2.84, 3.12, 3.24, 3.48, 3.56, 3.76, 3.92, 4.12, 4.
˓→36, 4.56, 4.72, 4.96, 5.16, 5.44, 5.68, 6.12, 6.28, 6.48, 6.88, 7.12, 7.36, 7.56, 7.92,
˓→ 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92,␣
˓→10.00, 10.12, 10.48, 10.68, 10.76, 11.00, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.
˓→1.76, 1.88, 2.04, 2.12, 2.24, 2.28, 2.48, 2.56, 2.80, 3.08, 3.28, 3.52, 3.80, 3.92, 4.
˓→" A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.843 s
Real time factor (RTF): 1.843 / 28.165 = 0.065
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx␣
˓→\
--decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_
˓→filename="./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx",␣
˓→joiner_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.48, 0.60, 0.72, 1.04, 1.28, 1.
˓→36, 1.48, 1.60, 1.84, 1.96, 2.00, 2.16, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40,
˓→ 3.56, 3.76, 4.04, 4.24, 4.28, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR␣
˓→EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN",
˓→"timestamps":"[0.00, 0.20, 0.48, 0.72, 0.88, 1.04, 1.12, 1.20, 1.36, 1.52, 1.64, 1.84,␣
˓→1.88, 2.00, 2.12, 2.32, 2.36, 2.60, 2.84, 3.12, 3.24, 3.48, 3.56, 3.76, 3.92, 4.12, 4.
˓→36, 4.52, 4.72, 4.96, 5.16, 5.44, 5.68, 6.12, 6.28, 6.48, 6.88, 7.12, 7.36, 7.56, 7.92,
˓→ 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92,␣
˓→10.00, 10.12, 10.48, 10.68, 10.76, 11.00, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.
˓→28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.64, 13.76, 14.00, 14.08,
˓→ 14.24, 14.36, 14.52, 14.72, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.20, 16.24,␣
˓→"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME",
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps":"[0.00, 0.12, 0.36, 0.48, 0.76, 0.96, 1.12, 1.24, 1.32, 1.44, 1.48, 1.68,␣
˓→1.76, 1.88, 2.04, 2.12, 2.28, 2.32, 2.48, 2.52, 2.80, 3.08, 3.28, 3.52, 3.76, 3.92, 4.
˓→" A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.490 s
Real time factor (RTF): 1.490 / 28.165 = 0.053
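Note that the int8 run finishes in 1.490 s (RTF 0.053) versus 1.843 s (RTF 0.065) for the fp32 run above on the same test waves; faster decoding and smaller model files are the main reasons to prefer the quantized models when their accuracy is acceptable for your use case.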
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx
csukuangfj/sherpa-onnx-zipformer-small-en-2023-06-26 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→zipformer-small-en-2023-06-26.tar.bz2
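Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-zipformer-small-en-2023-06-26.tar.bz2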
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_
˓→filename="./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx",␣
˓→joiner_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-small-
˓→active_paths=4, context_score=1.5)
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.64, 0.76, 0.84, 1.12, 1.36, 1.
˓→44, 1.56, 1.72, 1.84, 1.96, 2.04, 2.20, 2.32, 2.36, 2.44, 2.60, 2.76, 3.04, 3.24, 3.40,
˓→ 3.52, 3.72, 4.04, 4.20, 4.28, 4.48, 4.64, 4.80, 4.84, 4.96, 5.00, 5.28, 5.40, 5.52, 5.
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT␣
˓→FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
˓→","timestamps":"[0.00, 0.32, 0.64, 0.80, 0.96, 1.08, 1.16, 1.20, 1.32, 1.52, 1.68, 1.
˓→80, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.16, 3.20, 3.44, 3.52, 3.72, 3.88, 4.16,
˓→ 4.44, 4.60, 4.76, 4.96, 5.16, 5.36, 5.60, 6.16, 6.32, 6.52, 6.88, 7.16, 7.32, 7.60, 7.
˓→96, 8.16, 8.28, 8.36, 8.48, 8.64, 8.76, 8.84, 9.04, 9.28, 9.44, 9.52, 9.60, 9.68, 9.88,
˓→ 9.92, 10.12, 10.52, 10.76, 10.80, 11.08, 11.20, 11.36, 11.56, 11.76, 11.96, 12.08, 12.
˓→24, 12.28, 12.48, 12.68, 12.80, 12.92, 13.00, 13.20, 13.48, 13.72, 13.84, 14.04, 14.20,
˓→ 14.28, 14.40, 14.56, 14.68, 14.76, 15.00, 15.24, 15.48, 15.68, 15.92, 16.08, 16.12,␣
˓→"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME",
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps":"[0.00, 0.32, 0.48, 0.64, 0.84, 1.08, 1.20, 1.32, 1.36, 1.44, 1.48, 1.64,␣
˓→1.76, 1.88, 2.08, 2.12, 2.24, 2.28, 2.44, 2.48, 2.80, 3.04, 3.24, 3.48, 3.72, 3.88, 3.
˓→" A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.953 s
Real time factor (RTF): 0.953 / 28.165 = 0.034
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx␣
˓→\
--decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_
˓→filename="./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx",␣
˓→joiner_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-small-
˓→active_paths=4, context_score=1.5)
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.64, 0.76, 0.84, 1.08, 1.36, 1.
˓→44, 1.56, 1.72, 1.84, 1.96, 2.04, 2.20, 2.32, 2.36, 2.44, 2.60, 2.76, 3.04, 3.24, 3.40,
˓→ 3.52, 3.72, 4.00, 4.20, 4.28, 4.48, 4.64, 4.80, 4.84, 4.96, 5.00, 5.28, 5.40, 5.52, 5.
˓→FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
˓→","timestamps":"[0.00, 0.32, 0.64, 0.80, 0.96, 1.08, 1.16, 1.20, 1.32, 1.52, 1.68, 1.
˓→80, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.16, 3.20, 3.44, 3.52, 3.72, 3.88, 4.16,
˓→ 4.44, 4.60, 4.76, 4.96, 5.16, 5.36, 5.60, 6.16, 6.32, 6.52, 6.88, 7.16, 7.32, 7.60, 7.
˓→96, 8.16, 8.28, 8.36, 8.48, 8.64, 8.76, 8.84, 9.04, 9.28, 9.44, 9.52, 9.60, 9.68, 9.88,
˓→ 9.92, 10.12, 10.52, 10.76, 10.80, 11.08, 11.20, 11.36, 11.56, 11.76, 11.96, 12.08, 12.
˓→24, 12.28, 12.48, 12.68, 12.80, 12.92, 13.04, 13.16, 13.48, 13.72, 13.84, 14.04, 14.20,
˓→ 14.28, 14.40, 14.56, 14.68, 14.76, 15.00, 15.28, 15.48, 15.68, 15.92, 16.08, 16.12,␣
˓→"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME",
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps":"[0.00, 0.32, 0.48, 0.64, 0.84, 1.08, 1.20, 1.32, 1.36, 1.44, 1.48, 1.64,␣
˓→1.76, 1.88, 2.08, 2.12, 2.24, 2.28, 2.44, 2.48, 2.80, 3.04, 3.24, 3.48, 3.72, 3.88, 3.
˓→" A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.891 s
Real time factor (RTF): 0.891 / 28.165 = 0.032
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx
csukuangfj/sherpa-onnx-zipformer-en-2023-06-26 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→zipformer-en-2023-06-26.tar.bz2
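Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-zipformer-en-2023-06-26.tar.bz2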
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./
˓→sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), nemo_
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-
˓→active_paths=4, context_score=1.5)
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.56, 0.64, 0.80, 1.08, 1.36, 1.
˓→40, 1.52, 1.68, 1.84, 1.96, 2.04, 2.20, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40,
˓→ 3.56, 3.76, 4.08, 4.24, 4.32, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT␣
˓→FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
˓→","timestamps":"[0.00, 0.24, 0.56, 0.76, 0.92, 1.04, 1.16, 1.20, 1.36, 1.52, 1.64, 1.
˓→80, 1.88, 2.00, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.48, 3.56, 3.72, 3.92, 4.12,
˓→ 4.40, 4.52, 4.72, 4.96, 5.16, 5.36, 5.64, 6.12, 6.28, 6.52, 6.88, 7.12, 7.32, 7.56, 7.
˓→92, 8.16, 8.28, 8.40, 8.48, 8.64, 8.76, 8.88, 9.04, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92,
˓→ 9.96, 10.16, 10.48, 10.72, 10.80, 11.04, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.
˓→28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.68, 13.84, 14.00, 14.16,
˓→1.76, 1.88, 2.00, 2.12, 2.24, 2.28, 2.48, 2.52, 2.80, 3.08, 3.28, 3.52, 3.68, 3.84, 3.
˓→" A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.301 s
Real time factor (RTF): 1.301 / 28.165 = 0.046
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_
˓→filename="./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), nemo_
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-
˓→active_paths=4, context_score=1.5)
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.56, 0.64, 0.80, 1.08, 1.36, 1.
˓→40, 1.52, 1.68, 1.84, 1.96, 2.04, 2.20, 2.32, 2.40, 2.48, 2.60, 2.76, 3.04, 3.28, 3.40,
˓→ 3.56, 3.76, 4.08, 4.24, 4.32, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT␣
˓→FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
˓→","timestamps":"[0.00, 0.24, 0.56, 0.76, 0.92, 1.04, 1.16, 1.20, 1.36, 1.52, 1.64, 1.
˓→80, 1.88, 2.00, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.48, 3.56, 3.72, 3.92, 4.12,
˓→ 4.40, 4.52, 4.72, 4.96, 5.12, 5.40, 5.64, 6.12, 6.28, 6.52, 6.88, 7.12, 7.32, 7.60, 7.
˓→92, 8.16, 8.28, 8.40, 8.48, 8.64, 8.76, 8.88, 9.04, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92,
˓→ 9.96, 10.16, 10.48, 10.72, 10.80, 11.04, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.
˓→28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.68, 13.84, 14.00, 14.16,
˓→ 14.28, 14.40, 14.56, 14.72, 14.76, 15.00, 15.28, 15.48, 15.68, 15.96, 16.16, 16.20,␣
˓→"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME",
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps":"[0.00, 0.24, 0.40, 0.60, 0.80, 1.04, 1.16, 1.28, 1.36, 1.44, 1.48, 1.68,␣
˓→1.76, 1.88, 2.00, 2.08, 2.24, 2.28, 2.48, 2.52, 2.80, 3.08, 3.28, 3.52, 3.68, 3.84, 3.
˓→" A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.106 s
Real time factor (RTF): 1.106 / 28.165 = 0.039
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx
icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04 (English)
This model is trained on GigaSpeech, LibriSpeech, and Common Voice 13.0 using Zipformer.
See https://github.com/k2-fsa/icefall/pull/1010 if you are interested in how it is trained.
In the following, we describe how to download it and use it with sherpa-onnx.
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-
˓→multidataset-pruned_transducer_stateless7-2023-05-04.tar.bz2
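Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04.tar.bz2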
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
$ ls -lh *.onnx
-rw-r--r-- 1 fangjun staff 1.2M May 15 11:11 decoder-epoch-30-avg-4.int8.onnx
-rw-r--r-- 1 fangjun staff 2.0M May 15 11:11 decoder-epoch-30-avg-4.onnx
-rw-r--r-- 1 fangjun staff 121M May 15 11:12 encoder-epoch-30-avg-4.int8.onnx
-rw-r--r-- 1 fangjun staff 279M May 15 11:13 encoder-epoch-30-avg-4.onnx
-rw-r--r-- 1 fangjun staff 253K May 15 11:11 joiner-epoch-30-avg-4.int8.onnx
-rw-r--r-- 1 fangjun staff 1.0M May 15 11:11 joiner-epoch-30-avg-4.onnx
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_
˓→bpe_500/tokens.txt \
--encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/
˓→encoder-epoch-30-avg-4.onnx \
--decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/
˓→decoder-epoch-30-avg-4.onnx \
--joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-
˓→epoch-30-avg-4.onnx \
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-
˓→134686-0001.wav \
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-
˓→135766-0001.wav \
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-
˓→135766-0002.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-
˓→avg-4.onnx", decoder_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-
˓→2023-05-04/exp/decoder-epoch-30-avg-4.onnx", joiner_filename="./icefall-asr-
˓→multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), nemo_
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-multidataset-
˓→pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt", num_threads=2,␣
˓→method="greedy_search", max_active_paths=4)
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-
˓→0001.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00,0.40,0.56,0.64,0.96,1.24,1.32,1.
˓→44,1.56,1.76,1.88,1.96,2.16,2.32,2.36,2.48,2.60,2.80,3.08,3.28,3.36,3.56,3.80,4.04,4.
˓→24,4.32,4.48,4.64,4.84,4.88,5.00,5.08,5.32,5.44,5.56,5.64,5.80,5.96,6.20]","tokens":["␣
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-
˓→0001.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR␣
˓→EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN",
˓→"timestamps":"[0.00,0.16,0.44,0.68,0.84,1.00,1.12,1.16,1.32,1.48,1.64,1.80,1.84,2.00,2.
˓→12,2.28,2.40,2.64,2.88,3.16,3.28,3.56,3.60,3.76,3.92,4.12,4.36,4.52,4.72,4.92,5.16,5.
˓→44,5.72,6.12,6.24,6.48,6.84,7.08,7.28,7.56,7.88,8.12,8.28,8.36,8.48,8.60,8.76,8.88,9.
˓→12,9.28,9.48,9.56,9.64,9.80,10.00,10.04,10.20,10.44,10.68,10.80,11.04,11.20,11.40,11.
˓→56,11.80,12.00,12.12,12.28,12.32,12.52,12.72,12.84,12.96,13.04,13.24,13.40,13.64,13.80,
˓→14.00,14.16,14.24,14.36,14.56,14.72,14.80,15.08,15.32,15.52,15.76,16.04,16.16,16.24,16.
˓→" HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME","␣
˓→"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-
˓→0002.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps":"[0.00,0.08,0.32,0.48,0.68,0.92,1.08,1.20,1.28,1.40,1.44,1.64,1.76,1.88,2.
˓→04,2.12,2.24,2.32,2.48,2.56,2.88,3.12,3.32,3.52,3.76,3.92,4.00,4.20,4.28,4.40,4.52]",
˓→"ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.662 s
Real time factor (RTF): 1.662 / 28.165 = 0.059
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_
˓→bpe_500/tokens.txt \
--encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/
˓→encoder-epoch-30-avg-4.int8.onnx \
--decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/
˓→decoder-epoch-30-avg-4.onnx \
--joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-
˓→epoch-30-avg-4.int8.onnx \
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-
˓→134686-0001.wav \
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-
˓→135766-0001.wav \
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-
˓→135766-0002.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-
˓→avg-4.int8.onnx", decoder_filename="./icefall-asr-multidataset-pruned_transducer_
˓→stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx", joiner_filename="./icefall-asr-
˓→multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.int8.
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-multidataset-
˓→pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt", num_threads=2,␣
˓→method="greedy_search", max_active_paths=4)
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-
˓→0001.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE␣
˓→SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00,0.40,0.56,0.64,0.96,1.24,1.32,1.
˓→44,1.56,1.76,1.88,1.96,2.16,2.32,2.36,2.48,2.60,2.80,3.08,3.28,3.36,3.56,3.80,4.04,4.
˓→24,4.32,4.48,4.64,4.84,4.88,5.00,5.08,5.32,5.44,5.56,5.64,5.80,5.96,6.20]","tokens":["␣
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-
˓→0001.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A␣
˓→LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR␣
˓→EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN",
˓→"timestamps":"[0.00,0.12,0.44,0.68,0.80,1.00,1.12,1.16,1.32,1.48,1.64,1.80,1.84,2.00,2.
˓→12,2.28,2.40,2.64,2.88,3.16,3.28,3.56,3.60,3.76,3.92,4.12,4.36,4.52,4.72,4.92,5.16,5.
˓→44,5.72,6.12,6.24,6.48,6.84,7.08,7.28,7.56,7.88,8.12,8.28,8.36,8.48,8.60,8.76,8.88,9.
˓→12,9.28,9.48,9.56,9.64,9.80,10.00,10.04,10.16,10.44,10.68,10.80,11.04,11.20,11.40,11.
˓→56,11.80,12.00,12.16,12.28,12.32,12.52,12.72,12.84,12.96,13.04,13.24,13.40,13.64,13.80,
˓→14.00,14.16,14.24,14.36,14.56,14.72,14.80,15.08,15.32,15.52,15.76,16.04,16.16,16.24,16.
˓→" HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME","␣
˓→"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-
˓→0002.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION",
˓→"timestamps":"[0.00,0.08,0.32,0.48,0.68,0.92,1.08,1.20,1.28,1.40,1.44,1.64,1.76,1.88,2.
˓→04,2.12,2.28,2.32,2.52,2.56,2.88,3.12,3.32,3.52,3.76,3.92,4.00,4.20,4.28,4.40,4.52]",
˓→"ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.424 s
Real time factor (RTF): 1.424 / 28.165 = 0.051
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_
˓→bpe_500/tokens.txt \
--encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/
˓→encoder-epoch-30-avg-4.onnx \
--decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/
˓→decoder-epoch-30-avg-4.onnx \
--joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-
˓→epoch-30-avg-4.onnx
csukuangfj/sherpa-onnx-zipformer-en-2023-04-01 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→zipformer-en-2023-04-01.tar.bz2
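Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-zipformer-en-2023-04-01.tar.bz2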
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx", joiner_filename="./
˓→sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
Started
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID␣
˓→QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY␣
˓→CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER␣
˓→WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 2.151 s
Real time factor (RTF): 2.151 / 28.165 = 0.076
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.int8.onnx", decoder_
˓→filename="./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.int8.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
Done!
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID␣
˓→QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY␣
˓→CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER␣
˓→WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.478 s
Real time factor (RTF): 1.478 / 28.165 = 0.052
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx
csukuangfj/sherpa-onnx-zipformer-en-2023-03-30 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→zipformer-en-2023-03-30.tar.bz2
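Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-zipformer-en-2023-03-30.tar.bz2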
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx", joiner_filename="./
˓→sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
Started
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID␣
˓→QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY␣
˓→CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER␣
˓→WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.950 s
Real time factor (RTF): 1.950 / 28.165 = 0.069
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav \
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav \
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.int8.onnx", decoder_
˓→filename="./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.int8.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
˓→error msg: Invalid argument. Specify the number of threads explicitly so the affinity␣
Started
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID␣
˓→QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY␣
˓→CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER␣
˓→WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.710 s
Real time factor (RTF): 1.710 / 28.165 = 0.061
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
--encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx
Conformer-transducer-based Models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
csukuangfj/sherpa-onnx-conformer-zh-stateless2-2023-05-23 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→conformer-zh-stateless2-2023-05-23.tar.bz2
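Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-conformer-zh-stateless2-2023-05-23.tar.bz2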
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/encoder-epoch-99-avg-1.onnx␣
˓→\
--decoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/decoder-epoch-99-avg-1.onnx␣
˓→\
--joiner=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/0.wav \
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/1.wav \
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/2.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-conformer-zh-stateless2-2023-05-23/encoder-epoch-99-avg-1.onnx", decoder_
˓→filename="./sherpa-onnx-conformer-zh-stateless2-2023-05-23/decoder-epoch-99-avg-1.onnx
˓→", joiner_filename="./sherpa-onnx-conformer-zh-stateless2-2023-05-23/joiner-epoch-99-
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-conformer-zh-
˓→active_paths=4)
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/0.wav
{"text":"","timestamps":"[0.00, 0.12, 0.44, 0.64, 0.84, 1.04, 1.64, 1.72, 1.88, 2.08, 2.
˓→28, 2.44, 2.56, 2.76, 3.08, 3.20, 3.32, 3.48, 3.64, 3.76, 3.88, 4.00, 4.16, 4.24, 4.44,
˓→ 4.60, 4.84]","tokens":["","","","","","","","","","","","","","","","","","","","","",
˓→"","","","","",""]}
----
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/1.wav
{"text":"","timestamps":"[0.00, 0.12, 0.48, 0.64, 0.88, 1.08, 1.28, 1.48, 1.80, 2.12, 2.
˓→40, 2.56, 2.68, 2.88, 3.04, 3.16, 3.36, 3.56, 3.68, 3.84, 4.00, 4.16, 4.32, 4.56, 4.76]
˓→","tokens":["","","","","","","","","","","","","","","","","","","","","","","","","
˓→"]}
----
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/2.wav
{"text":"","timestamps":"[0.00, 0.16, 0.60, 0.88, 1.08, 1.36, 1.64, 1.84, 2.24, 2.52, 2.
˓→72, 2.92, 3.08, 3.24, 3.40, 3.56, 3.72, 3.88, 4.12]","tokens":["","","","","","","","",
˓→"","","","","","","","","","",""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.596 s
Real time factor (RTF): 0.596 / 15.289 = 0.039
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
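Hint: CHCP 65001 switches the Windows console code page to UTF-8 so that non-ASCII output, such as the Chinese recognition results above, is displayed correctly.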
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/encoder-epoch-99-avg-1.int8.
˓→onnx \
--decoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/decoder-epoch-99-avg-1.onnx␣
˓→\
--joiner=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/joiner-epoch-99-avg-1.int8.
˓→onnx \
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/0.wav \
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/1.wav \
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/2.wav
Caution: We did not use int8 for the decoder model above.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-conformer-zh-stateless2-2023-05-
˓→23/tokens.txt --encoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/encoder-epoch-
˓→99-avg-1.int8.onnx --decoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/decoder-
˓→epoch-99-avg-1.onnx --joiner=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/joiner-
˓→epoch-99-avg-1.int8.onnx ./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/0.
˓→wav ./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/1.wav ./sherpa-onnx-
˓→conformer-zh-stateless2-2023-05-23/test_wavs/2.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-conformer-zh-stateless2-2023-05-23/encoder-epoch-99-avg-1.int8.onnx",␣
˓→decoder_filename="./sherpa-onnx-conformer-zh-stateless2-2023-05-23/decoder-epoch-99-
˓→avg-1.onnx", joiner_filename="./sherpa-onnx-conformer-zh-stateless2-2023-05-23/joiner-
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-conformer-zh-
˓→active_paths=4)
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/0.wav
{"text":"","timestamps":"[0.00, 0.12, 0.44, 0.64, 0.84, 1.08, 1.64, 1.72, 1.88, 2.08, 2.
˓→28, 2.44, 2.56, 2.76, 3.08, 3.20, 3.32, 3.48, 3.64, 3.76, 3.88, 4.00, 4.16, 4.24, 4.48,
˓→ 4.60, 4.84]","tokens":["","","","","","","","","","","","","","","","","","","","","",
˓→"","","","","",""]}
----
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/1.wav
{"text":"","timestamps":"[0.00, 0.08, 0.48, 0.64, 0.88, 1.08, 1.28, 1.48, 1.80, 2.08, 2.
˓→40, 2.56, 2.68, 2.88, 3.04, 3.16, 3.36, 3.56, 3.68, 3.84, 4.00, 4.16, 4.32, 4.56, 4.76]
˓→","tokens":["","","","","","","","","","","","","","","","","","","","","","","","","
˓→"]}
----
./sherpa-onnx-conformer-zh-stateless2-2023-05-23/test_wavs/2.wav
{"text":"","timestamps":"[0.00, 0.12, 0.56, 0.84, 1.08, 1.40, 1.64, 1.84, 2.24, 2.52, 2.
˓→72, 2.92, 3.08, 3.24, 3.40, 3.56, 3.72, 3.88, 4.12]","tokens":["","","","","","","","",
˓→"","","","","","","","","","",""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.439 s
Real time factor (RTF): 0.439 / 15.289 = 0.029
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/encoder-epoch-99-avg-1.onnx␣
˓→\
--decoder=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/decoder-epoch-99-avg-1.onnx␣
˓→\
--joiner=./sherpa-onnx-conformer-zh-stateless2-2023-05-23/joiner-epoch-99-avg-1.onnx
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
csukuangfj/sherpa-onnx-conformer-zh-2023-05-23 (Chinese)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→conformer-zh-2023-05-23.tar.bz2
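Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-conformer-zh-2023-05-23.tar.bz2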
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-conformer-zh-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/0.wav \
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/1.wav \
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/2.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx", decoder_filename="./
˓→sherpa-onnx-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx", joiner_filename="./
˓→sherpa-onnx-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), nemo_
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-conformer-zh-2023-
˓→05-23/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_
˓→active_paths=4)
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/0.wav
{"text":"","timestamps":"[0.00, 0.12, 0.52, 0.64, 0.84, 1.04, 1.68, 1.80, 1.92, 2.12, 2.
˓→32, 2.48, 2.64, 2.76, 3.08, 3.20, 3.44, 3.52, 3.64, 3.76, 3.88, 4.00, 4.16, 4.32, 4.48,
˓→ 4.64, 4.84]","tokens":["","","","","","","","","","","","","","","","","","","","","",
˓→"","","","","",""]}
----
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/1.wav
{"text":"","timestamps":"[0.04, 0.16, 0.36, 0.48, 0.68, 0.92, 1.08, 1.24, 1.44, 1.84, 2.
˓→08, 2.36, 2.52, 2.68, 2.88, 3.04, 3.16, 3.40, 3.56, 3.72, 3.84, 4.04, 4.16, 4.32, 4.56,
˓→ 4.76]","tokens":["","","","","","","","","","","","","","","","","","","","","","","",
˓→"","",""]}
----
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/2.wav
{"text":"","timestamps":"[0.00, 0.12, 0.60, 0.84, 1.04, 1.44, 1.68, 1.84, 2.28, 2.52, 2.
˓→80, 2.92, 3.08, 3.24, 3.40, 3.60, 3.72, 3.84, 4.12]","tokens":["","","","","","","","",
˓→"","","","","","","","","","",""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.706 s
Real time factor (RTF): 0.706 / 15.289 = 0.046
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-conformer-zh-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/0.wav \
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/1.wav \
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/2.wav
Caution: We did not use int8 for the decoder model above.
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./
˓→build/bin/sherpa-onnx-offline --decoding-method=greedy_search --tokens=./sherpa-onnx-
˓→conformer-zh-2023-05-23/tokens.txt --encoder=./sherpa-onnx-conformer-zh-2023-05-23/
˓→encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-conformer-zh-2023-05-23/
˓→decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-conformer-zh-2023-05-23/joiner-
˓→epoch-99-avg-1.int8.onnx ./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/0.wav ./
˓→sherpa-onnx-conformer-zh-2023-05-23/test_wavs/1.wav ./sherpa-onnx-conformer-zh-2023-05-
˓→23/test_wavs/2.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000,␣
˓→feature_dim=80), model_
˓→config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./
˓→sherpa-onnx-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.int8.onnx", decoder_
˓→filename="./sherpa-onnx-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx", joiner_
˓→filename="./sherpa-onnx-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.int8.onnx"),␣
˓→paraformer=OfflineParaformerModelConfig(model=""), nemo_
˓→ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-conformer-zh-2023-
˓→active_paths=4)
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/0.wav
{"text":"","timestamps":"[0.00, 0.12, 0.52, 0.64, 0.84, 1.04, 1.68, 1.80, 1.92, 2.08, 2.
˓→32, 2.48, 2.64, 2.76, 3.08, 3.20, 3.44, 3.52, 3.64, 3.76, 3.88, 4.00, 4.16, 4.32, 4.48,
˓→ 4.60, 4.84]","tokens":["","","","","","","","","","","","","","","","","","","","","",
˓→"","","","","",""]}
----
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/1.wav
{"text":"","timestamps":"[0.04, 0.16, 0.36, 0.48, 0.68, 0.92, 1.08, 1.24, 1.44, 1.88, 2.
˓→08, 2.36, 2.52, 2.64, 2.88, 3.00, 3.16, 3.40, 3.56, 3.72, 3.84, 4.04, 4.20, 4.32, 4.56,
˓→ 4.76]","tokens":["","","","","","","","","","","","","","","","","","","","","","","",
˓→"","",""]}
----
./sherpa-onnx-conformer-zh-2023-05-23/test_wavs/2.wav
{"text":"","timestamps":"[0.00, 0.12, 0.60, 0.84, 1.04, 1.44, 1.64, 1.84, 2.28, 2.52, 2.
˓→80, 2.92, 3.08, 3.28, 3.36, 3.60, 3.72, 3.84, 4.12]","tokens":["","","","","","","","",
˓→"","","","","","","","","","",""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.502 s
Real time factor (RTF): 0.502 / 15.289 = 0.033
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-conformer-zh-2023-05-23/tokens.txt \
--encoder=./sherpa-onnx-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
csukuangfj/sherpa-onnx-conformer-en-2023-03-18 (English)
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-
˓→conformer-en-2023-03-18.tar.bz2
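Hint: The commands below assume that you have extracted the downloaded archive, e.g.:

tar xvf sherpa-onnx-conformer-en-2023-03-18.tar.bz2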
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate, however, does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-conformer-en-2023-03-18/tokens.txt \
--encoder=./sherpa-onnx-conformer-en-2023-03-18/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-conformer-en-2023-03-18/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-conformer-en-2023-03-18/joiner-epoch-99-avg-1.onnx \
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/0.wav \
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/1.wav \
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-conformer-en-2023-03-18/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-conformer-en-2023-03-18/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-conformer-en-2023-03-18/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-conformer-en-2023-03-18/tokens.txt", num_threads=2, debug=False), decoding_method="greedy_search")
error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
Done!
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 2.264 s
Real time factor (RTF): 2.264 / 28.165 = 0.080
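The same model can also be used from Python. Below is a minimal sketch assuming the sherpa-onnx Python package is installed (pip install sherpa-onnx); the argument names follow the Python API (see python-api-examples/offline-decode-files.py in the repository) and should be verified against your installed version:

import wave

import numpy as np
import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="./sherpa-onnx-conformer-en-2023-03-18/encoder-epoch-99-avg-1.onnx",
    decoder="./sherpa-onnx-conformer-en-2023-03-18/decoder-epoch-99-avg-1.onnx",
    joiner="./sherpa-onnx-conformer-en-2023-03-18/joiner-epoch-99-avg-1.onnx",
    tokens="./sherpa-onnx-conformer-en-2023-03-18/tokens.txt",
    num_threads=2,
    decoding_method="greedy_search",
)

with wave.open("./sherpa-onnx-conformer-en-2023-03-18/test_wavs/0.wav") as f:
    # Convert 16-bit PCM to float32 in [-1, 1], as accept_waveform() expects.
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768
    sample_rate = f.getframerate()

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)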
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-conformer-en-2023-03-18/tokens.txt \
--encoder=./sherpa-onnx-conformer-en-2023-03-18/encoder-epoch-99-avg-1.int8.onnx \
--decoder=./sherpa-onnx-conformer-en-2023-03-18/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-conformer-en-2023-03-18/joiner-epoch-99-avg-1.int8.onnx \
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/0.wav \
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/1.wav \
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-conformer-en-2023-03-18/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-conformer-en-2023-03-18/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-conformer-en-2023-03-18/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-conformer-en-2023-03-18/tokens.txt", num_threads=2, debug=False), decoding_method="greedy_search")
Started
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/0.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/1.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-conformer-en-2023-03-18/test_wavs/8k.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.370 s
Real time factor (RTF): 1.370 / 28.165 = 0.049
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-conformer-en-2023-03-18/tokens.txt \
--encoder=./sherpa-onnx-conformer-en-2023-03-18/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-conformer-en-2023-03-18/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-conformer-en-2023-03-18/joiner-epoch-99-avg-1.onnx
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 (Russian)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24.tar.bz2
ls -lh sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/
total 548472
-rw-r--r-- 1 fangjun staff 89K Oct 25 13:36 GigaAM%20License_NC.pdf
-rw-r--r-- 1 fangjun staff 318B Oct 25 13:37 README.md
-rw-r--r-- 1 fangjun staff 3.8M Oct 25 13:36 decoder.onnx
-rw-r--r-- 1 fangjun staff 262M Oct 25 13:37 encoder.int8.onnx
-rw-r--r-- 1 fangjun staff 3.8K Oct 25 13:32 export-onnx-rnnt.py
-rw-r--r-- 1 fangjun staff 2.0M Oct 25 13:36 joiner.onnx
-rwxr-xr-x 1 fangjun staff 2.0K Oct 25 13:32 run-rnnt.sh
-rwxr-xr-x 1 fangjun staff 8.7K Oct 25 13:32 test-onnx-rnnt.py
drwxr-xr-x 4 fangjun staff 128B Oct 25 13:37 test_wavs
-rw-r--r-- 1 fangjun staff 5.8K Oct 25 13:36 tokens.txt
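To verify that the download is complete, you can compare the sizes above with what is on disk. A small sketch (plain Python, nothing sherpa-specific):

import os

# List the sizes of the *.onnx files; encoder.int8.onnx should be
# roughly 262 MB. A much smaller file usually means a broken download.
model_dir = "sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24"
for name in sorted(os.listdir(model_dir)):
    if name.endswith(".onnx"):
        size_mb = os.path.getsize(os.path.join(model_dir, name)) / 2**20
        print(f"{name}: {size_mb:.1f} MB")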
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer \
./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/test_wavs/example.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer
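From Python, the same flags map onto the transducer loader; note how model_type mirrors the --model-type flag above. A hedged sketch (verify the argument names against the Python API of your installed sherpa-onnx version):

import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx",
    decoder="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx",
    joiner="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx",
    tokens="./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt",
    model_type="nemo_transducer",
)
# Feed samples via create_stream()/accept_waveform()/decode_stream()
# exactly as in the transducer sketch shown earlier.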
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--encoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/encoder.int8.onnx \
--decoder=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/decoder.onnx \
--joiner=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/joiner.onnx \
--tokens=./sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24/tokens.txt \
--model-type=nemo_transducer
Paraformer models
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
Note: This model does not support timestamps. It is a trilingual model, supporting Chinese, Cantonese, and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-trilingual-zh-cantonese-en.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/model.onnx \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/2.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/5-henan.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/6-zh-en.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/model.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""),
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 1, 15 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 2, 40 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 3, 41 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 4, 37 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 5, 16 vs -1
Done!
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/2.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", ""]}
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/6-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [], "tokens":["yesterday",
˓→"was", "", "", "", "today", "is", "tu@@", "es@@", "day", "", "", "", "", "", ""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 6.871 s
Real time factor (RTF): 6.871 / 42.054 = 0.163
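Paraformer models take a single model file plus tokens.txt, so the Python loader is correspondingly simpler. A minimal sketch (same hedging as before regarding argument names; verify against your installed sherpa-onnx version):

import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    paraformer="./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/model.onnx",
    tokens="./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/tokens.txt",
    num_threads=2,
    decoding_method="greedy_search",
)
# Decode with create_stream()/accept_waveform()/decode_stream() as in
# the earlier transducer sketch, then read stream.result.text.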
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/model.int8.onnx \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/2.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/5-henan.wav \
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/6-zh-en.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/model.int8.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-trilingual-
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 1, 15 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 2, 40 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 3, 41 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 4, 37 vs -1
/project/sherpa-onnx/csrc/offline-paraformer-greedy-search-decoder.cc:Decode:65 time stamp for batch: 5, 16 vs -1
Done!
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/2.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", ""]}
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/test_wavs/6-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [], "tokens":["yesterday",
˓→"was", "", "", "", "today", "is", "tu@@", "es@@", "day", "", "", "", "", "", ""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 6.290 s
Real time factor (RTF): 6.290 / 42.054 = 0.150
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-trilingual-zh-cantonese-en/model.int8.onnx
csukuangfj/sherpa-onnx-paraformer-en-2024-03-09 (English)
Note: This model does not support timestamps. It supports only English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-en-2024-03-09.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-en-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-en-2024-03-09/model.onnx \
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/0.wav \
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/1.wav \
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-en-2024-03-09/model.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-en-2024-03-
Done!
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/0.wav
{"text": " after early nightfall the yellow lamps would light up here and there the␣
˓→squalid quarter of the brothels", "timestamps": [], "tokens":["after", "early", "ni@@",
˓→ "ght@@", "fall", "the", "yel@@", "low", "la@@", "mp@@", "s", "would", "light", "up",
˓→"here", "and", "there", "the", "squ@@", "al@@", "id", "quarter", "of", "the", "bro@@",
˓→"the@@", "ls"]}
----
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/1.wav
{"text": " god as a direct consequence of the sin which man thus punished had given her a lovely child whose place was 'on' that same dishonoured bosom to connect her parent for ever with the race and descent of mortals and to be finally a blessed soul in heaven", "timestamps": [], "tokens":[..., "of", "the", "sin", "which", "man", "thus", "p@@", "uni@@", "shed", "had", "given", "her", "a", "lo@@", "vely", "child", "whose", "place", "was", "'on'", "that", "same", "di@@", "sh@@", "on@@", "ou@@", "red", "bo@@", "so@@", "m", "to", "connect", "her", "paren@@", "t", "for", "ever", "with", "the", "race", "and", "des@@", "cent", "of", "mor@@", "tal@@", "s", "and", "to", "be", "finally", "a", "bl@@", "essed", "soul", "in", ...]}
----
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/8k.wav
{"text": " yet these thoughts affected hester prynne less with hope than apprehension",
˓→"timestamps": [], "tokens":["yet", "these", "thoughts", "aff@@", "ected", "he@@", "ster
˓→", "pr@@", "y@@", "n@@", "ne", "less", "with", "hope", "than", "ap@@", "pre@@", "hen@@
˓→", "sion"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 7.173 s
Real time factor (RTF): 7.173 / 28.165 = 0.255
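In the token lists above, a trailing @@ marks a subword piece that attaches to the following token (BPE-style segmentation); the text field is obtained by gluing such pieces back together. A small sketch of the recombination:

# Merge BPE pieces: a token ending in "@@" continues into the next one.
def join_bpe_tokens(tokens):
    words, current = [], ""
    for t in tokens:
        if t.endswith("@@"):
            current += t[:-2]
        else:
            words.append(current + t)
            current = ""
    return " ".join(words)

print(join_bpe_tokens(["after", "early", "ni@@", "ght@@", "fall"]))
# -> after early nightfall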
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-en-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-en-2024-03-09/model.int8.onnx \
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/0.wav \
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/1.wav \
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-en-2024-03-09/model.int8.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-en-2024-03-
Done!
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/0.wav
{"text": " after early nightfall the yellow lamps would light up here and there the␣
˓→squalid quarter of the brothels", "timestamps": [], "tokens":["after", "early", "ni@@",
˓→ "ght@@", "fall", "the", "yel@@", "low", "la@@", "mp@@", "s", "would", "light", "up",
˓→"here", "and", "there", "the", "squ@@", "al@@", "id", "quarter", "of", "the", "bro@@",
˓→"the@@", "ls"]}
----
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/1.wav
{"text": " god as a direct consequence of the sin which man thus punished had given her␣
˓→a lovely child whose place was 'on' that same dishonoured bosom to connect her parent␣
˓→for ever with the race and descent of mortals and to be finally a blessed soul in␣
˓→"of", "the", "sin", "which", "man", "thus", "p@@", "uni@@", "shed", "had", "given",
˓→"her", "a", "lo@@", "vely", "child", "whose", "place", "was", "'on'", "that", "same",
˓→"di@@", "sh@@", "on@@", "ou@@", "red", "bo@@", "so@@", "m", "to", "connect", "her",
˓→"paren@@", "t", "for", "ever", "with", "the", "race", "and", "des@@", "cent", "of",
˓→"mor@@", "tal@@", "s", "and", "to", "be", "finally", "a", "bl@@", "essed", "soul", "in
----
./sherpa-onnx-paraformer-en-2024-03-09/test_wavs/8k.wav
{"text": " yet these thoughts affected hester prynne less with hope than apprehension",
˓→"timestamps": [], "tokens":["yet", "these", "thoughts", "aff@@", "ected", "he@@", "ster
˓→", "pr@@", "y@@", "n@@", "ne", "less", "with", "hope", "than", "ap@@", "pre@@", "hen@@
˓→", "sion"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 5.492 s
Real time factor (RTF): 5.492 / 28.165 = 0.195
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-paraformer-en-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-en-2024-03-09/model.int8.onnx
Note: This model does not support timestamps. It is a bilingual model, supporting both Chinese and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-small-2024-03-09.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-zh-small-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-small-2024-03-09/model.int8.onnx \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/8k.wav \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/2-zh-en.wav \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/5-henan.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-zh-small-2024-03-09/model.int8.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-zh-small-
Done!
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/0.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/8k.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", ""]} (continues on next page)
˓→""]}
----
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-small-2024-03-09/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 3.562 s
Real time factor (RTF): 3.562 / 47.023 = 0.076
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-paraformer-zh-small-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-small-2024-03-09/model.int8.onnx
Note: This model does not support timestamps. It is a bilingual model, supporting both Chinese and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2024-03-09.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-zh-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2024-03-09/model.onnx \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/8k.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/2-zh-en.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/5-henan.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-zh-2024-03-09/model.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-zh-2024-03-
Done!
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/0.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/8k.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/2-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [], "tokens":["ye@@", "ster@@
˓→", "day", "was", "", "", "", "today", "is", "tu@@", "es@@", "day", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 6.829 s
Real time factor (RTF): 6.829 / 47.023 = 0.145
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-zh-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2024-03-09/model.int8.onnx \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/8k.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/2-zh-en.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/5-henan.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-zh-2024-03-09/model.int8.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-zh-2024-03-
Done!
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/0.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/8k.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/2-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [], "tokens":["ye@@", "ster@@
˓→", "day", "was", "", "", "", "today", "is", "tu@@", "es@@", "day", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2024-03-09/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
(continues
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", on"",
next "",
page)
˓→ "", "", "", ""]}
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-paraformer-zh-2024-03-09/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2024-03-09/model.int8.onnx
Note: This model does not support timestamps. It is a bilingual model, supporting both Chinese and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
fp32
The following code shows how to use fp32 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-03-28/model.onnx \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/5-henan.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/6-zh-en.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-zh-2023-03-28/model.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-zh-2023-03-
Done!
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/6-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [], "tokens":["ye@@", "ster@@
˓→", "day", "was", "", "", "", "today", "is", "tu@@", "es@@", "day", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 8.547 s
Real time factor (RTF): 8.547 / 51.236 = 0.167
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-03-28/model.int8.onnx \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/5-henan.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/6-zh-en.wav \
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-zh-2023-03-28/model.int8.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-zh-2023-03-
Done!
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/0.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/1.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/2.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/5-henan.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/6-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [], "tokens":["ye@@", "ster@@
˓→", "day", "was", "", "", "", "today", "is", "tu@@", "es@@", "day", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-zh-2023-03-28/test_wavs/8k.wav
{"text": "", "timestamps": [], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→""]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 6.439 s
Real time factor (RTF): 6.439 / 51.236 = 0.126
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-paraformer-zh-2023-03-28/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-03-28/model.int8.onnx
Note: This model supports timestamps. It is a bilingual model, supporting both Chinese and English.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
int8
The following code shows how to use int8 models to decode wave files:
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-paraformer-zh-2023-09-14/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-09-14/model.int8.onnx \
--model-type=paraformer \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/0.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/1.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/2.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/3-sichuan.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/4-tianjin.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/5-henan.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/6-zh-en.wav \
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/8k.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your commandline.
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model="./sherpa-onnx-paraformer-zh-2023-09-14/model.int8.onnx"), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-paraformer-zh-2023-09-
Done!
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/0.wav
{"text": "", "timestamps": [0.36, 0.48, 0.62, 0.72, 0.86, 1.02, 1.32, 1.74, 1.90, 2.12,␣
˓→2.20, 2.38, 2.50, 2.62, 2.74, 3.18, 3.32, 3.52, 3.62, 3.74, 3.82, 3.90, 3.98, 4.08, 4.
˓→20, 4.34, 4.56, 4.74, 5.10], "tokens":["", "", "", "", "", "", "", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/1.wav
{"text": "", "timestamps": [0.16, 0.30, 0.42, 0.56, 0.72, 0.96, 1.08, 1.20, 1.30, 2.08,␣
˓→2.26, 2.44, 2.58, 2.72, 2.98, 3.14, 3.26, 3.46, 3.62, 3.80, 3.88, 4.02, 4.12, 4.20, 4.
˓→36, 4.56], "tokens":["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "
˓→", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/2.wav
{"text": "", "timestamps": [0.34, 0.54, 0.66, 0.80, 1.08, 1.52, 1.72, 1.90, 2.40, 2.68,␣
˓→2.86, 2.96, 3.16, 3.26, 3.46, 3.54, 3.66, 3.80, 3.90], "tokens":["", "", "", "", "", "
˓→", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [0.16, 0.30, 0.56, 0.72, 0.92, 1.18, 1.32, 1.88, 2.24, 2.40,␣
˓→3.16, 3.28, 3.40, 3.54, 3.76, 3.88, 4.06, 4.24, 4.36, 4.56, 4.66, 4.88, 5.14, 5.30, 5.
˓→44, 5.60, 5.72, 5.84, 5.96, 6.14, 6.24, 6.38, 6.56, 6.78, 6.98, 7.08, 7.22, 7.38, 7.50,
˓→ 7.62], "tokens":["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "
˓→", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
˓→""]}
----
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [0.08, 0.24, 0.36, 0.56, 0.66, 0.78, 1.04, 1.14, 1.26, 1.38,␣
˓→1.50, 1.58, 1.70, 1.84, 2.28, 2.38, 2.64, 2.74, 3.08, 3.28, 3.66, 3.80, 3.94, 4.14, 4.
˓→34, 4.64, 4.84, 4.94, 5.12, 5.24, 5.84, 6.10, 6.24, 6.44, 6.54, 6.66, 6.86, 7.02, 7.14,
˓→ 7.24, 7.44], "tokens":["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
˓→ "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "
----
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/5-henan.wav
{"text": "", "timestamps": [0.08, 0.20, 0.30, 0.42, 0.94, 1.14, 1.26, 1.46, 1.66, 2.28,␣
˓→2.50, 2.62, 2.70, 2.82, 2.98, 3.14, 3.28, 3.52, 3.70, 3.86, 4.94, 5.06, 5.18, 5.30, 5.
˓→42, 5.66, 5.76, 5.94, 6.08, 6.24, 6.38, 6.60, 6.78, 6.96, 7.10, 7.30, 7.50, 7.62],
˓→"tokens":["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "
˓→", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-paraformer-zh-2023-09-14/test_wavs/6-zh-en.wav
{"text": " yesterday was today is tuesday ", "timestamps": [0.36, 0.60, 0.84, 1.22, 2.
˓→24, 2.44, 2.74, 3.52, 4.06, 4.68, 5.00, 5.12, 5.76, 5.96, 6.24, 6.82, 7.02, 7.26],
˓→"tokens":["ye@@", "ster@@", "day", "was", "", "", "", "today", "is", "tu@@", "es@@",
˓→"day", "", "", "", "", "", ""]} (continues on next page)
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 9.206 s
Real time factor (RTF): 9.206 / 51.236 = 0.180
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-paraformer-zh-2023-09-14/tokens.txt \
--paraformer=./sherpa-onnx-paraformer-zh-2023-09-14/model.int8.onnx \
--model-type=paraformer
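Because this model emits timestamps, each token in a result can be paired with its start time in seconds. A tiny sketch using the values from the 6-zh-en.wav output above:

# Pair each decoded token with its start time (seconds).
tokens = ["ye@@", "ster@@", "day", "was"]
timestamps = [0.36, 0.60, 0.84, 1.22]
for start, token in zip(timestamps, tokens):
    print(f"{start:5.2f}s  {token}")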
NeMo
This section describes how to export CTC models from NeMo to sherpa-onnx.
Hint: You can find the exported files in this example by visiting
https://huggingface.co/csukuangfj/sherpa-onnx-nemo-ctc-en-conformer-small
wget https://huggingface.co/csukuangfj/sherpa-onnx-nemo-ctc-en-conformer-small/resolve/main/add-model-metadata.py
wget https://huggingface.co/csukuangfj/sherpa-onnx-nemo-ctc-en-conformer-small/resolve/main/quantize-model.py
English
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
Note: We use ./build/bin/sherpa-onnx-offline as an example in this section. You can use other scripts such as
• ./build/bin/sherpa-onnx-microphone-offline
• ./build/bin/sherpa-onnx-offline-websocket-server
• python-api-examples/offline-decode-files.py
This page lists offline CTC models from NeMo for English.
stt_en_citrinet_512
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-citrinet-512.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
The following code shows how to use fp32 models to decode wave files. Please replace model.onnx with model.int8.onnx to use the int8 quantized model.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-nemo-ctc-en-citrinet-512/tokens.txt \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-en-citrinet-512/model.onnx \
--num-threads=2 \
--decoding-method=greedy_search \
--debug=false \
./sherpa-onnx-nemo-ctc-en-citrinet-512/test_wavs/0.wav \
./sherpa-onnx-nemo-ctc-en-citrinet-512/test_wavs/1.wav \
./sherpa-onnx-nemo-ctc-en-citrinet-512/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model="./sherpa-onnx-nemo-ctc-en-citrinet-512/
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-nemo-ctc-en-citrinet-512/test_wavs/0.wav
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
----
./sherpa-onnx-nemo-ctc-en-citrinet-512/test_wavs/1.wav
... with the race and descent of mortals and to be finally a blessed soul in heaven
----
./sherpa-onnx-nemo-ctc-en-citrinet-512/test_wavs/8k.wav
yet these thoughts affected hester prynne less with hope than apprehension
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 4.963 s
Real time factor (RTF): 4.963 / 28.165 = 0.176
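NeMo CTC models also load from a single ONNX file in Python. A minimal sketch (argument names as in recent sherpa-onnx Python releases; please verify against your installation):

import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_nemo_ctc(
    model="./sherpa-onnx-nemo-ctc-en-citrinet-512/model.onnx",
    tokens="./sherpa-onnx-nemo-ctc-en-citrinet-512/tokens.txt",
    num_threads=2,
    decoding_method="greedy_search",
)
# Decode via create_stream()/accept_waveform()/decode_stream() as in
# the transducer sketch shown earlier in this chapter.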
stt_en_conformer_ctc_small
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-small.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
The following code shows how to use fp32 models to decode wave files. Please replace model.onnx with model.int8.onnx to use the int8 quantized model.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-nemo-ctc-en-conformer-small/tokens.txt \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-en-conformer-small/model.onnx \
--num-threads=2 \
--decoding-method=greedy_search \
--debug=false \
./sherpa-onnx-nemo-ctc-en-conformer-small/test_wavs/0.wav \
./sherpa-onnx-nemo-ctc-en-conformer-small/test_wavs/1.wav \
./sherpa-onnx-nemo-ctc-en-conformer-small/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model="./sherpa-onnx-nemo-ctc-en-conformer-small/
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-nemo-ctc-en-conformer-small/test_wavs/0.wav
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
----
./sherpa-onnx-nemo-ctc-en-conformer-small/test_wavs/1.wav
... with the race and descent of mortals and to be finally a blessed soul in heaven
----
./sherpa-onnx-nemo-ctc-en-conformer-small/test_wavs/8k.wav
yet these thoughts affected hester prin less with hope than apprehension
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.665 s
Real time factor (RTF): 0.665 / 28.165 = 0.024
stt_en_conformer_ctc_medium
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-medium.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
The following code shows how to use fp32 models to decode wave files. Please replace model.onnx with model.int8.onnx to use the int8 quantized model.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-nemo-ctc-en-conformer-medium/tokens.txt \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-en-conformer-medium/model.onnx \
--num-threads=2 \
--decoding-method=greedy_search \
--debug=false \
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/0.wav \
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/1.wav \
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model="./sherpa-onnx-nemo-ctc-en-conformer-medium/
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/0.wav
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
----
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/1.wav
... with the race and descent of mortals and to be finally a blessed soul in heaven
----
./sherpa-onnx-nemo-ctc-en-conformer-medium/test_wavs/8k.wav
yet these thoughts affected hester pryne less with hope than apprehension
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.184 s
Real time factor (RTF): 1.184 / 28.165 = 0.042
stt_en_conformer_ctc_large
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-large.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
The following code shows how to use fp32 models to decode wave files. Please replace model.onnx with model.int8.onnx to use the int8 quantized model.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-nemo-ctc-en-conformer-large/tokens.txt \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-en-conformer-large/model.onnx \
--num-threads=2 \
--decoding-method=greedy_search \
--debug=false \
./sherpa-onnx-nemo-ctc-en-conformer-large/test_wavs/0.wav \
./sherpa-onnx-nemo-ctc-en-conformer-large/test_wavs/1.wav \
./sherpa-onnx-nemo-ctc-en-conformer-large/test_wavs/8k.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model="./sherpa-onnx-nemo-ctc-en-conformer-large/
Creating a resampler:
in_sample_rate: 8000
output_sample_rate: 16000
Done!
./sherpa-onnx-nemo-ctc-en-conformer-large/test_wavs/0.wav
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
----
./sherpa-onnx-nemo-ctc-en-conformer-large/test_wavs/1.wav
... with the race and descent of mortals and to be finally a blesed soul in heaven
----
./sherpa-onnx-nemo-ctc-en-conformer-large/test_wavs/8k.wav
yet these thoughts afected hester pryne les with hope than aprehension
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 3.553 s
Real time factor (RTF): 3.553 / 28.165 = 0.126
Russian
Hint: Please refer to Installation to install sherpa-onnx before you read this section.
This page lists offline CTC models from NeMo for Russian.
sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24.tar.bz2
ls -lh sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/
total 558904
-rw-r--r-- 1 fangjun staff 89K Oct 24 21:20 GigaAM%20License_NC.pdf
-rw-r--r-- 1 fangjun staff 318B Oct 24 21:20 README.md
-rwxr-xr-x 1 fangjun staff 3.5K Oct 24 21:20 export-onnx-ctc.py
-rw-r--r-- 1 fangjun staff 262M Oct 24 21:24 model.int8.onnx
-rwxr-xr-x 1 fangjun staff 1.2K Oct 24 21:20 run-ctc.sh
-rwxr-xr-x 1 fangjun staff 4.1K Oct 24 21:20 test-onnx-ctc.py
drwxr-xr-x 4 fangjun staff 128B Oct 24 21:24 test_wavs
-rw-r--r--@ 1 fangjun staff 196B Oct 24 21:31 tokens.txt
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/model.int8.onnx \
--tokens=./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/tokens.txt \
./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/test_wavs/example.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/model.int8.onnx \
--tokens=./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/tokens.txt
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--nemo-ctc-model=./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/model.int8.onnx \
--tokens=./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/tokens.txt
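If you prefer Python to the pre-built binaries, the following is a minimal sketch using the sherpa-onnx Python package (pip install sherpa-onnx). It assumes that OfflineRecognizer.from_nemo_ctc accepts the arguments shown below and that the soundfile package is available for reading wave files; please treat python-api-examples in the sherpa-onnx repository as the authoritative reference.

import sherpa_onnx
import soundfile as sf  # assumed extra dependency for reading wave files

# Create a recognizer from the downloaded GigaAM Russian CTC model.
recognizer = sherpa_onnx.OfflineRecognizer.from_nemo_ctc(
    model="./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/model.int8.onnx",
    tokens="./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/tokens.txt",
    num_threads=2,
)

# Read a single-channel wave file; the sample rate does not need to be 16 kHz.
samples, sample_rate = sf.read(
    "./sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24/test_wavs/example.wav",
    dtype="float32",
)

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)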
yesno
This section describes how to use the tdnn model of the yesno dataset from icefall in sherpa-onnx.
Note: It is a non-streaming model and it can only recognize two words in Hebrew: yes and no.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-tdnn-yesno.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
The following code shows how to use the fp32 model to decode wave files. Please replace model-epoch-14-avg-2.onnx with model-epoch-14-avg-2.int8.onnx to use the int8 quantized model.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--sample-rate=8000 \
--feat-dim=23 \
--tokens=./sherpa-onnx-tdnn-yesno/tokens.txt \
--tdnn-model=./sherpa-onnx-tdnn-yesno/model-epoch-14-avg-2.onnx \
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_0_1_0_0_0_1.wav \
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_0_0_0_1_0.wav \
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_0_0_1_1_1.wav \
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_0_1_0_0_1.wav \
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_1_0_0_0_1.wav \
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_1_0_1_1_0.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=8000, feature_dim=23), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model="./sherpa-onnx-tdnn-yesno/model-epoch-14-avg-2.onnx
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_0_1_0_0_0_1.wav
{"text":"NNNYNNNY","timestamps":"[]","tokens":["N","N","N","Y","N","N","N","Y"]}
----
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_0_0_0_1_0.wav
{"text":"NNYNNNYN","timestamps":"[]","tokens":["N","N","Y","N","N","N","Y","N"]}
----
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_0_0_1_1_1.wav
{"text":"NNYNNYYY","timestamps":"[]","tokens":["N","N","Y","N","N","Y","Y","Y"]}
----
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_0_1_0_0_1.wav
{"text":"NNYNYNNY","timestamps":"[]","tokens":["N","N","Y","N","Y","N","N","Y"]}
----
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_1_0_0_0_1.wav
{"text":"NNYYNNNY","timestamps":"[]","tokens":["N","N","Y","Y","N","N","N","Y"]}
----
./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_1_0_1_1_0.wav
{"text":"NNYYNYYN","timestamps":"[]","tokens":["N","N","Y","Y","N","Y","Y","N"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.071 s
Real time factor (RTF): 0.071 / 38.530 = 0.002
Note: In the above output, N represents NO, while Y is YES. So for the last wave, NNYYNYYN means NO NO YES YES NO YES YES NO.
In the filename of the last wave 0_0_1_1_0_1_1_0.wav, 0 means NO and 1 means YES. So the ground truth of the last wave is NO NO YES YES NO YES YES NO, which matches the recognition result.
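The same decoding can be scripted in Python. Below is a minimal sketch; it assumes your sherpa-onnx version provides OfflineRecognizer.from_tdnn_ctc with these arguments (check python-api-examples in the repository) and uses soundfile to read the 8 kHz wave files.

import sherpa_onnx
import soundfile as sf  # assumed extra dependency for reading wave files

# The yesno model uses an 8 kHz sample rate and 23-dimensional features.
recognizer = sherpa_onnx.OfflineRecognizer.from_tdnn_ctc(
    model="./sherpa-onnx-tdnn-yesno/model-epoch-14-avg-2.onnx",
    tokens="./sherpa-onnx-tdnn-yesno/tokens.txt",
    sample_rate=8000,
    feature_dim=23,
)

samples, sample_rate = sf.read(
    "./sherpa-onnx-tdnn-yesno/test_wavs/0_0_1_1_0_1_1_0.wav", dtype="float32"
)
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)  # expected: NNYYNYYN, i.e., NO NO YES YES NO YES YES NO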
8.25.7 TeleSpeech
This section describes how to export CTC models from Tele-AI/TeleSpeech-ASR to sherpa-onnx.
Models
sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
$ ls -lh *.onnx
-rw-r--r-- 1 fangjun staff 325M Jun 4 11:56 model.int8.onnx
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--telespeech-ctc=./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/model.int8.onnx \
--tokens=./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/tokens.txt \
--model-type=telespeech_ctc \
--num-threads=1 \
./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/3-sichuan.wav \
./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/4-tianjin.wav \
./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/5-henan.wav
Caution: If you use Windows and get encoding issues, please run:
CHCP 65001
in your command line.
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe
ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/model.int8.onnx", tokens="./sherpa-onnx-telespeech-ctc-int8-zh-2024-
./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/3-sichuan.wav
{"text": "", "timestamps": [0.08, 0.36, 0.52, 0.72, 0.92, 1.16, 1.36, 1.88, 2.20, 2.36,␣
˓→3.16, 3.28, 3.40, 3.60, 3.80, 3.92, 4.08, 4.24, 4.40, 4.56, 4.76, 5.16, 5.32, 5.44, 5.
˓→64, 5.76, 5.88, 6.04, 6.16, 6.28, 6.40, 6.60, 6.88, 7.12, 7.40, 7.52, 7.64], "tokens":[
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
˓→ "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}
----
./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/4-tianjin.wav
{"text": "", "timestamps": [0.36, 0.56, 1.04, 1.16, 1.24, 1.64, 1.88, 2.24, 2.40, 2.60,␣
˓→2.80, 3.12, 3.32, 3.64, 3.80, 3.96, 4.16, 4.44, 4.68, 4.80, 5.00, 5.16, 5.28, 6.12, 6.
˓→28, 6.44, 6.60, 6.72, 6.88, 7.04, 7.12, 7.32, 7.52], "tokens":["", "", "", "", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
----
./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/5-henan.wav
{"text": "", "timestamps": [0.04, 0.12, 0.24, 0.40, 1.00, 1.24, 1.44, 1.68, 2.32, 2.48,␣
˓→2.60, 2.64, 2.80, 3.00, 3.16, 3.32, 3.52, 3.68, 3.92, 5.00, 5.16, 5.28, 5.32, 5.44, 5.
˓→84, 6.00, 6.12, 6.48, 6.68, 6.84, 7.00, 7.16, 7.32, 7.56, 7.68], "tokens":["", "", "",
˓→"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
˓→ "", "", "", "", "", "", "", "", "", ""]}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 3.406 s
Real time factor (RTF): 3.406 / 23.634 = 0.144
Note: The feature_dim=80 is incorrect in the above logs. The actual value is 40.
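If you want to script the above run, here is a hedged Python sketch. It assumes OfflineRecognizer.from_telespeech_ctc is available in your sherpa-onnx Python package (compare with python-api-examples in the repository) and uses soundfile for reading the wave files.

import sherpa_onnx
import soundfile as sf  # assumed extra dependency for reading wave files

recognizer = sherpa_onnx.OfflineRecognizer.from_telespeech_ctc(
    model="./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/model.int8.onnx",
    tokens="./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/tokens.txt",
    num_threads=1,
)

for wav in (
    "./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/3-sichuan.wav",
    "./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/4-tianjin.wav",
    "./sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04/test_wavs/5-henan.wav",
):
    samples, sample_rate = sf.read(wav, dtype="float32")
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    print(wav, stream.result.text)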
8.25.8 Whisper
This section describes how to use models from Whisper with sherpa-onnx for non-streaming speech recognition.
Available models
Note that we have already exported Whisper models to onnx and made them available in Huggingface repositories.
If you want to export the models by yourself or/and want to learn how the models are exported, please read below.
Export to onnx
We use
https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py
to export Whisper models to onnx.
First, let us install dependencies and download the export script
optional arguments:
  -h, --help            show this help message and exit
  --model {tiny,tiny.en,base,base.en,small,small.en,medium,medium.en,large,large-v1,large-v2,large-v3,distil-medium.en,distil-small.en,distil-large-v2,medium-aishell}
python3 ./test.py \
--encoder ./tiny.en-encoder.onnx \
--decoder ./tiny.en-decoder.onnx \
--tokens ./tiny.en-tokens.txt \
./0.wav
python3 ./test.py \
--encoder ./tiny.en-encoder.int8.onnx \
--decoder ./tiny.en-decoder.int8.onnx \
--tokens ./tiny.en-tokens.txt \
./0.wav
python3 ./test.py \
--encoder ./large-v3-encoder.onnx \
--decoder ./large-v3-decoder.onnx \
--tokens ./large-v3-tokens.txt \
./0.wav
Hint: We provide a colab notebook for you to try the exported large-v3 onnx model with sherpa-onnx on CPU as
well as on GPU.
You will find the RTF on GPU (Tesla T4) is less than 1.
tiny.en
You can use the following commands to download the exported onnx models of tiny.en:
Hint: Please replace tiny.en with base.en, small.en, medium.en, distil-small.en, tiny, base, small, and
medium if you want to try a different type of model.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
Please check that the file sizes of the downloaded models are correct. See the file sizes of the *.onnx files below.
Hint: Please first follow Installation to build sherpa-onnx before you continue.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--whisper-encoder=./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.onnx \
--whisper-decoder=./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.onnx \
--tokens=./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt \
./sherpa-onnx-whisper-tiny.en/test_wavs/0.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/8k.wav
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--whisper-encoder=./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.int8.onnx \
--whisper-decoder=./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.int8.onnx \
--tokens=./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt \
./sherpa-onnx-whisper-tiny.en/test_wavs/0.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav \
./sherpa-onnx-whisper-tiny.en/test_wavs/8k.wav
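You can also run the same model from Python. The sketch below is not the official example; it assumes OfflineRecognizer.from_whisper accepts the keyword arguments shown (compare with python-api-examples/offline-decode-files.py in the repository) and uses soundfile for reading the test wave.

import sherpa_onnx
import soundfile as sf  # assumed extra dependency for reading wave files

recognizer = sherpa_onnx.OfflineRecognizer.from_whisper(
    encoder="./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.onnx",
    decoder="./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.onnx",
    tokens="./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt",
    language="en",
    task="transcribe",
    num_threads=1,
)

samples, sample_rate = sf.read(
    "./sherpa-onnx-whisper-tiny.en/test_wavs/0.wav", dtype="float32"
)
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)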
./sherpa-onnx-offline \
--num-threads=1 \
--whisper-encoder=./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.onnx \
--whisper-decoder=./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.onnx \
--tokens=./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt \
./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav
/root/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./sherpa-onnx-offline --num-threads=1 --whisper-encoder=./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.onnx --whisper-decoder=./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.onnx --tokens=./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt ./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav
OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./sherpa-onnx-whisper-tiny.en/tiny.en-
./sherpa-onnx-whisper-tiny.en/test_wavs/1.wav
{"text":" God, as a direct consequence of the sin which man thus punished, had given her␣
˓→a lovely child, whose place was on that same dishonored bosom to connect her parent␣
˓→forever with the race and descent of mortals, and to be finally a blessed soul in␣
˓→of"," the"," sin"," which"," man"," thus"," punished",","," had"," given"," her"," a",
˓→" lovely"," child",","," whose"," place"," was"," on"," that"," same"," dishon","ored",
˓→" bos","om"," to"," connect"," her"," parent"," forever"," with"," the"," race"," and",
˓→" descent"," of"," mortals",","," and"," to"," be"," finally"," a"," blessed"," soul",
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 11.454 s
Real time factor (RTF): 11.454 / 16.715 = 0.685
The following table compares the RTF between different numbers of threads and types of onnx models:
large-v3
cd /content
mkdir -p build
cd build
cmake \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_ENABLE_GPU=ON ..
You can use the following commands to download the exported onnx models of large-v3:
Hint: Please replace large-v3 with large, large-v1, large-v2, and distil-large-v2 if you want to try a
different type of model.
cd /content
ls -lh sherpa-onnx-whisper-large-v3
Caution: Please remember to run git lfs install before you run git clone. If you have any issues with git lfs install, please follow https://git-lfs.com/ to install git-lfs.
Caution: Please check the file sizes are correct before proceeding. Otherwise, you would be SAD later.
cd /content
exe=$PWD/sherpa-onnx/build/bin/sherpa-onnx-offline
cd sherpa-onnx-whisper-large-v3
time $exe \
--whisper-encoder=./large-v3-encoder.onnx \
--whisper-decoder=./large-v3-decoder.onnx \
--tokens=./large-v3-tokens.txt \
--num-threads=2 \
./test_wavs/0.wav
/content/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 /content/sherpa-onnx/build/bin/sherpa-onnx-offline --whisper-encoder=./large-v3-encoder.onnx --whisper-decoder=./large-v3-decoder.onnx --tokens=./large-v3-tokens.txt --num-threads=2 ./test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./large-v3-encoder.onnx", decoder="./large-
tdnn=OfflineTdnnModelConfig(model=""), zipformer_
./test_wavs/0.wav
{"text": " after early nightfall the yellow lamps would light up here and there the␣
˓→squalid quarter of the brothels", "timestamps": [], "tokens":[" after", " early", "␣
˓→night", "fall", " the", " yellow", " lamps", " would", " light", " up", " here", " and
˓→", " there", " the", " squ", "alid", " quarter", " of", " the", " broth", "els"],
˓→"words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 54.070 s
Real time factor (RTF): 54.070 / 6.625 = 8.162
real 1m32.107s
user 1m39.877s
sys 0m10.405s
cd /content
exe=$PWD/sherpa-onnx/build/bin/sherpa-onnx-offline
cd sherpa-onnx-whisper-large-v3
time $exe \
--whisper-encoder=./large-v3-encoder.int8.onnx \
--whisper-decoder=./large-v3-decoder.int8.onnx \
--tokens=./large-v3-tokens.txt \
--num-threads=2 \
./test_wavs/0.wav
/content/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 /content/sherpa-onnx/build/bin/sherpa-onnx-offline --whisper-encoder=./large-v3-encoder.int8.onnx --whisper-decoder=./large-v3-decoder.int8.onnx --tokens=./large-v3-tokens.txt --num-threads=2 ./test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./large-v3-encoder.int8.onnx", decoder="./
tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_
./test_wavs/0.wav
{"text": " after early nightfall the yellow lamps would light up here and there the␣
˓→squalid quarter of the brothels", "timestamps": [], "tokens":[" after", " early", "␣
˓→night", "fall", " the", " yellow", " lamps", " would", " light", " up", " here", " and
˓→", " there", " the", " squ", "alid", " quarter", " of", " the", " broth", "els"],
˓→"words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 49.991 s
Real time factor (RTF): 49.991 / 6.625 = 7.546
real 1m15.555s
user 1m41.488s
sys 0m9.156s
cd /content
exe=$PWD/sherpa-onnx/build/bin/sherpa-onnx-offline
cd sherpa-onnx-whisper-large-v3
time $exe \
--whisper-encoder=./large-v3-encoder.onnx \
--whisper-decoder=./large-v3-decoder.onnx \
--tokens=./large-v3-tokens.txt \
--provider=cuda \
--num-threads=2 \
./test_wavs/0.wav
/content/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 /content/sherpa-onnx/build/bin/sherpa-onnx-offline --whisper-encoder=./large-v3-encoder.onnx --whisper-decoder=./large-v3-decoder.onnx --tokens=./large-v3-tokens.txt --provider=cuda --num-threads=2 ./test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./large-v3-encoder.onnx", decoder="./large-
tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_
./test_wavs/0.wav
{"text": " after early nightfall the yellow lamps would light up here and there the␣
˓→squalid quarter of the brothels", "timestamps": [], "tokens":[" after", " early", "␣
˓→night", "fall", " the", " yellow", " lamps", " would", " light", " up", " here", " and
˓→", " there", " the", " squ", "alid", " quarter", " of", " the", " broth", "els"],
˓→"words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 5.910 s
Real time factor (RTF): 5.910 / 6.625 = 0.892
real 0m26.996s
user 0m12.854s
sys 0m4.486s
Note: The above command is run within a colab notebook using a Tesla T4 GPU. You can see that the RTF is less than 1. If you have a more performant GPU, you will get an even lower RTF.
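The Python API mirrors the --provider=cuda flag. Below is a minimal sketch, assuming your sherpa-onnx wheel was built with GPU support and that the factory methods accept a provider argument (check python-api-examples for the authoritative usage):

import sherpa_onnx

# provider="cuda" asks onnxruntime to run the model on the GPU; if the CUDA
# provider cannot be loaded, onnxruntime falls back to the CPU provider.
recognizer = sherpa_onnx.OfflineRecognizer.from_whisper(
    encoder="./large-v3-encoder.onnx",
    decoder="./large-v3-decoder.onnx",
    tokens="./large-v3-tokens.txt",
    num_threads=2,
    provider="cuda",
)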
cd /content
exe=$PWD/sherpa-onnx/build/bin/sherpa-onnx-offline
cd sherpa-onnx-whisper-large-v3
time $exe \
--whisper-encoder=./large-v3-encoder.int8.onnx \
--whisper-decoder=./large-v3-decoder.int8.onnx \
--tokens=./large-v3-tokens.txt \
--provider=cuda \
--num-threads=2 \
./test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./large-v3-encoder.int8.onnx", decoder="./large-v3-decoder.int8.onnx", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_
./test_wavs/0.wav
{"text": " after early nightfall the yellow lamps would light up here and there the␣
˓→squalid quarter of the brothels", "timestamps": [], "tokens":[" after", " early", "␣
˓→night", "fall", " the", " yellow", " lamps", " would", " light", " up", " here", " and
˓→", " there", " the", " squ", "alid", " quarter", " of", " the", " broth", "els"],
˓→"words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 19.190 s
Real time factor (RTF): 19.190 / 6.625 = 2.897
real 0m46.850s
user 0m50.007s
sys 0m8.013s
If onnxruntime cannot find the CUDA libraries at runtime, you may see an error like:
what(): /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426
onnxruntime::Provider& onnxruntime::ProviderLibrary::Get()
[ONNXRuntimeError] : 1 : FAIL :
Failed to load library libonnxruntime_providers_cuda.so with error:
libcublasLt.so.11: cannot open shared object file: No such file or directory
Non-large models
We provide a colab notebook for you to try Whisper models with sherpa-onnx step by step.
Large models
For large models of whisper, please see the following colab notebook. It walks you step by step through trying the exported large-v3 onnx model with sherpa-onnx on CPU as well as on GPU.
You will find the RTF on GPU (Tesla T4) is less than 1.
Huggingface space
You can try Whisper models from within your browser without installing anything.
Please visit
https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
8.25.9 WeNet
mv units.txt tokens.txt
Now you can use the following command for speech recognition with the exported models:
FAQs
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/online-wenet-ctc-model.cc:Init:144 head does not exist in the metadata
To fix the above error, please check the following two items:
• Make sure you are using model-streaming.onnx or model-streaming.int8.onnx. The executable you are running requires a streaming model as input.
• Make sure you use the script from sherpa-onnx to export your model.
• https://huggingface.co/csukuangfj/sherpa-onnx-zh-wenet-aishell
• https://huggingface.co/csukuangfj/sherpa-onnx-zh-wenet-aishell2
• https://huggingface.co/csukuangfj/sherpa-onnx-en-wenet-gigaspeech
• https://huggingface.co/csukuangfj/sherpa-onnx-en-wenet-librispeech
• https://huggingface.co/csukuangfj/sherpa-onnx-zh-wenet-multi-cn
• https://huggingface.co/csukuangfj/sherpa-onnx-zh-wenet-wenetspeech
Colab
We provide a colab notebook for you to try the exported WeNet models with sherpa-onnx.
In this section, we list online/streaming models with fewer parameters that are suitable for resource-constrained embedded systems.
Hint: You can use them as a first pass model in a two-pass system, where the second pass uses a non-streaming model.
Hint: If you are using Raspberry Pi 4, this section is not so helpful for you since all models in sherpa-onnx are able to run in real-time on it.
This page is especially useful for systems with fewer resources than Raspberry Pi 4.
• csukuangfj/sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 (Chinese)
• csukuangfj/sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 (English)
• sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 (Bilingual, Chinese + English)
8.26 Moonshine
You can try Moonshine with sherpa-onnx with the following huggingface spaces:
• For short audio: https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
• For generating subtitles (supports very long audio/video files): https://huggingface.co/spaces/k2-fsa/generate-subtitles-for-videos
Hint: You don’t need to install anything. All you need is a browser.
You can even run it on your phone or tablet.
Fig. 8.33: Try Moonshine in our Huggingface space with sherpa-onnx for short audio
Fig. 8.34: Try Moonshine in our Huggingface space with sherpa-onnx for generating subtitles
8.26.2 Models
sherpa-onnx-moonshine-tiny-en-int8
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-moonshine-tiny-en-int8.tar.bz2
ls -lh sherpa-onnx-moonshine-tiny-en-int8/
total 242160
-rw-r--r-- 1 fangjun staff 1.0K Oct 26 09:42 LICENSE
-rw-r--r-- 1 fangjun staff 175B Oct 26 09:42 README.md
-rw-r--r-- 1 fangjun staff 43M Oct 26 09:42 cached_decode.int8.onnx
-rw-r--r-- 1 fangjun staff 17M Oct 26 09:42 encode.int8.onnx
-rw-r--r-- 1 fangjun staff 6.5M Oct 26 09:42 preprocess.onnx
drwxr-xr-x 6 fangjun staff 192B Oct 26 09:42 test_wavs
-rw-r--r-- 1 fangjun staff 426K Oct 26 09:42 tokens.txt
-rw-r--r-- 1 fangjun staff 51M Oct 26 09:42 uncached_decode.int8.onnx
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--moonshine-preprocessor=./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx \
--moonshine-encoder=./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx \
--moonshine-uncached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx \
--moonshine-cached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx \
--tokens=./sherpa-onnx-moonshine-tiny-en-int8/tokens.txt \
--num-threads=1 \
./sherpa-onnx-moonshine-tiny-en-int8/test_wavs/0.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --moonshine-preprocessor=./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx --moonshine-encoder=./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx --moonshine-uncached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx --moonshine-cached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx --tokens=./sherpa-onnx-moonshine-tiny-en-int8/tokens.txt --num-threads=1 ./sherpa-onnx-moonshine-tiny-en-int8/test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""),
ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_
moonshine=OfflineMoonshineModelConfig(preprocessor="./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx", encoder="./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx", uncached_decoder="./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx", cached_decoder="./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx"), rule_fsts="", rule_fars="")
./sherpa-onnx-moonshine-tiny-en-int8/test_wavs/0.wav
{"lang": "", "emotion": "", "event": "", "text": " After early nightfall, the yellow␣
˓→lamps would light up here and there the squalid quarter of the brothels.", "timestamps
˓→": [], "tokens":[" After", " early", " night", "fall", ",", " the", " yellow", " l",
˓→"amps", " would", " light", " up", " here", " and", " there", " the", " squ", "al", "id
˓→", " quarter", " of", " the", " bro", "th", "els", "."], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.213 s
Real time factor (RTF): 0.213 / 6.625 = 0.032
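For scripting, a minimal Python sketch follows; it assumes OfflineRecognizer.from_moonshine takes the four model files shown above (python-api-examples/offline-moonshine-decode-files.py is the maintained example) and uses soundfile for reading the wave file.

import sherpa_onnx
import soundfile as sf  # assumed extra dependency for reading wave files

recognizer = sherpa_onnx.OfflineRecognizer.from_moonshine(
    preprocessor="./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx",
    encoder="./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx",
    uncached_decoder="./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx",
    cached_decoder="./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx",
    tokens="./sherpa-onnx-moonshine-tiny-en-int8/tokens.txt",
    num_threads=1,
)

samples, sample_rate = sf.read(
    "./sherpa-onnx-moonshine-tiny-en-int8/test_wavs/0.wav", dtype="float32"
)
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)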
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--moonshine-preprocessor=./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx \
--moonshine-encoder=./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx \
--moonshine-uncached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx \
--moonshine-cached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx \
--tokens=./sherpa-onnx-moonshine-tiny-en-int8/tokens.txt
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--moonshine-preprocessor=./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx \
--moonshine-encoder=./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx \
--moonshine-uncached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx \
--moonshine-cached-decoder=./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx \
--tokens=./sherpa-onnx-moonshine-tiny-en-int8/tokens.txt
sherpa-onnx-moonshine-base-en-int8
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-moonshine-base-en-int8.tar.bz2
ls -lh sherpa-onnx-moonshine-base-en-int8/
total 560448
-rw-r--r-- 1 fangjun staff 1.0K Oct 26 09:42 LICENSE
-rw-r--r-- 1 fangjun staff 175B Oct 26 09:42 README.md
-rw-r--r-- 1 fangjun staff 95M Oct 26 09:42 cached_decode.int8.onnx
-rw-r--r-- 1 fangjun staff 48M Oct 26 09:42 encode.int8.onnx
-rw-r--r-- 1 fangjun staff 13M Oct 26 09:42 preprocess.onnx
drwxr-xr-x 6 fangjun staff 192B Oct 26 09:42 test_wavs
-rw-r--r-- 1 fangjun staff 426K Oct 26 09:42 tokens.txt
-rw-r--r-- 1 fangjun staff 116M Oct 26 09:42 uncached_decode.int8.onnx
Hint: It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate
does not need to be 16 kHz.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline \
--moonshine-preprocessor=./sherpa-onnx-moonshine-base-en-int8/preprocess.onnx \
--moonshine-encoder=./sherpa-onnx-moonshine-base-en-int8/encode.int8.onnx \
--moonshine-uncached-decoder=./sherpa-onnx-moonshine-base-en-int8/uncached_decode.int8.onnx \
--moonshine-cached-decoder=./sherpa-onnx-moonshine-base-en-int8/cached_decode.int8.onnx \
--tokens=./sherpa-onnx-moonshine-base-en-int8/tokens.txt \
--num-threads=1 \
./sherpa-onnx-moonshine-base-en-int8/test_wavs/0.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""),
ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_
moonshine=OfflineMoonshineModelConfig(preprocessor="./sherpa-onnx-moonshine-base-en-int8/preprocess.onnx", encoder="./sherpa-onnx-moonshine-base-en-int8/encode.int8.onnx", uncached_decoder="./sherpa-onnx-moonshine-base-en-int8/uncached_decode.int8.onnx", cached_decoder="./sherpa-onnx-moonshine-base-en-int8/cached_decode.int8.onnx"),
./sherpa-onnx-moonshine-base-en-int8/test_wavs/0.wav
{"lang": "", "emotion": "", "event": "", "text": " After early nightfall, the yellow␣
˓→lamps would light up here and there the squalid quarter of the brothels.", "timestamps
˓→": [], "tokens":[" After", " early", " night", "fall", ",", " the", " yellow", " l",
˓→"amps", " would", " light", " up", " here", " and", " there", " the", " squ", "al", "id
˓→", " quarter", " of", " the", " bro", "th", "els", "."], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.438 s
Real time factor (RTF): 0.438 / 6.625 = 0.066
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-microphone-offline \
--moonshine-preprocessor=./sherpa-onnx-moonshine-base-en-int8/preprocess.onnx \
--moonshine-encoder=./sherpa-onnx-moonshine-base-en-int8/encode.int8.onnx \
--moonshine-uncached-decoder=./sherpa-onnx-moonshine-base-en-int8/uncached_decode.int8.onnx \
--moonshine-cached-decoder=./sherpa-onnx-moonshine-base-en-int8/cached_decode.int8.onnx \
--tokens=./sherpa-onnx-moonshine-base-en-int8/tokens.txt
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--moonshine-preprocessor=./sherpa-onnx-moonshine-base-en-int8/preprocess.onnx \
--moonshine-encoder=./sherpa-onnx-moonshine-base-en-int8/encode.int8.onnx \
--moonshine-uncached-decoder=./sherpa-onnx-moonshine-base-en-int8/uncached_decode.int8.onnx \
--moonshine-cached-decoder=./sherpa-onnx-moonshine-base-en-int8/cached_decode.int8.onnx \
--tokens=./sherpa-onnx-moonshine-base-en-int8/tokens.txt
You can find Android APKs for Moonshine with VAD at the following page:
https://k2-fsa.github.io/sherpa/onnx/vad/apk-asr.html
Fig. 8.35: Android APKs about Moonshine + VAD for speech recognition
Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/c-api-examples/moonshine-c-api.c
and C API.
If you want to use the C++ API, which is just a wrapper around the C API, please see the following example:
https://github.com/k2-fsa/sherpa-onnx/blob/master/cxx-api-examples/moonshine-cxx-api.cc
Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/dotnet-examples/offline-decode-files/run-moonshine.sh
and
https://github.com/k2-fsa/sherpa-onnx/blob/master/dotnet-examples/offline-decode-files/Program.cs
Please see
• Decoding a file: https://github.com/k2-fsa/sherpa-onnx/blob/master/dart-api-examples/non-streaming-asr/run-sense-voice.sh
and
• Decoding a file with VAD: https://github.com/k2-fsa/sherpa-onnx/blob/master/dart-api-examples/vad-with-non-streaming-asr/run-moonshine.sh
Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/go-api-examples/non-streaming-decode-files/run-moonshine.sh
and
https://github.com/k2-fsa/sherpa-onnx/blob/master/go-api-examples/non-streaming-decode-files/main.go
Please see
• https://github.com/k2-fsa/sherpa-onnx/blob/master/java-api-examples/NonStreamingDecodeFileMoonshine.java
• https://github.com/k2-fsa/sherpa-onnx/blob/master/java-api-examples/VadFromMicWithNonStreamingMoonshine.java
and Java API.
Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/kotlin-api-examples/test_offline_asr.kt
Please see
• https://github.com/k2-fsa/sherpa-onnx/blob/master/pascal-api-examples/non-streaming-asr/run-moonshine.sh
• https://github.com/k2-fsa/sherpa-onnx/blob/master/pascal-api-examples/vad-with-non-streaming-asr/run-vad-with-moonshine.sh
Please see
• https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/generate-subtitles.py
• https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/offline-moonshine-decode-files.py
• https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/vad-with-non-streaming-asr.py
• https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/non_streaming_server.py
for usage examples.
Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/swift-api-examples/decode-file-non-streaming.swift
8.27 SenseVoice
Note that you can use SenseVoice with sherpa-onnx on the following platforms:
• Linux (x64, aarch64, arm, riscv64)
• macOS (x64, arm64)
• Windows (x64, x86, arm64)
• Android (arm64-v8a, armv7-eabi, x86, x86_64)
• iOS (arm64)
In the following, we describe how to download pre-trained SenseVoice models and use them in sherpa-onnx.
You can try SenseVoice with sherpa-onnx with the following huggingface space
https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
Hint: You don’t need to install anything. All you need is a browser.
You can even run it on your phone or tablet.
This page describes how to export SenseVoice to onnx so that you can use it with sherpa-onnx.
The code
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
total 1.1G
-rw-r--r-- 1 runner docker 71 Jul 18 13:06 LICENSE
-rw-r--r-- 1 runner docker 104 Jul 18 13:06 README.md
-rwxr-xr-x 1 runner docker 5.8K Jul 18 13:06 export-onnx.py
-rw-r--r-- 1 runner docker 229M Jul 18 13:06 model.int8.onnx
-rw-r--r-- 1 runner docker 895M Jul 18 13:06 model.onnx
ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs
total 940K
-rw-r--r-- 1 runner docker 224K Jul 18 13:06 en.wav
-rw-r--r-- 1 runner docker 226K Jul 18 13:06 ja.wav
-rw-r--r-- 1 runner docker 145K Jul 18 13:06 ko.wav
-rw-r--r-- 1 runner docker 161K Jul 18 13:06 yue.wav
-rw-r--r-- 1 runner docker 175K Jul 18 13:06 zh.wav
sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
Download
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
total 1.1G
-rw-r--r-- 1 runner docker 71 Jul 18 13:06 LICENSE
ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs
total 940K
-rw-r--r-- 1 runner docker 224K Jul 18 13:06 en.wav
-rw-r--r-- 1 runner docker 226K Jul 18 13:06 ja.wav
-rw-r--r-- 1 runner docker 145K Jul 18 13:06 ko.wav
-rw-r--r-- 1 runner docker 161K Jul 18 13:06 yue.wav
-rw-r--r-- 1 runner docker 175K Jul 18 13:06 zh.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""),
ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-
rule_fsts="", rule_fars="")
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90,␣
˓→4.20, 4.56, 4.74], "tokens":["", "", "", "", "", "", "", "", "", "", "", "", ""],
˓→"words": []}
----
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
{"text": "the tribal chieftain called for the boy and presented him with fifty pieces of␣
˓→gold", "timestamps": [0.90, 1.26, 1.56, 1.80, 2.16, 2.46, 2.76, 2.94, 3.12, 3.60, 3.96,
˓→ 4.50, 4.74, 5.10, 5.52, 5.88, 6.18], "tokens":["the", " tri", "bal", " chief", "tain",
˓→ " called", " for", " the", " boy", " and", " presented", " him", " with", " fifty", "␣
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 2.320 s
Real time factor (RTF): 2.320 / 12.744 = 0.182
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
--sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
--num-threads=1 \
--sense-voice-use-itn=1 \
--debug=0 \
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav \
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""),
voice=OfflineSenseVoiceModelConfig(model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx", language="auto", use_itn=True), telespeech_ctc="", tokens="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt", num_threads=1,
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "95", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90,
˓→ 4.20, 4.56, 4.74, 5.46], "tokens":["", "", "", "", "", "", "9", "", "", "", "", "5", "
----
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
{"text": "The tribal chieftain called for the boy and presented him with 50 pieces of␣
˓→gold.", "timestamps": [0.90, 1.26, 1.56, 1.80, 2.16, 2.46, 2.76, 2.94, 3.12, 3.60, 3.
˓→96, 4.50, 4.74, 4.92, 5.10, 5.28, 5.52, 5.88, 6.18, 7.02], "tokens":["The", " tri",
˓→"bal", " chief", "tain", " called", " for", " the", " boy", " and", " presented", " him
˓→", " with", " ", "5", "0", " pieces", " of", " gold", "."], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 1.543 s
Real time factor (RTF): 1.543 / 12.744 = 0.121
Hint: When inverse text normalization is enabled, the results also contain punctuation.
Specify a language
./build/bin/sherpa-onnx-offline \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
--sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
--num-threads=1 \
--sense-voice-language=zh \
--debug=0 \
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_
voice=OfflineSenseVoiceModelConfig(model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90,␣
˓→4.20, 4.56, 4.74], "tokens":["", "", "", "", "", "", "", "", "", "", "", "", ""],
˓→"words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.625 s
Real time factor (RTF): 0.625 / 5.592 = 0.112
Hint: Valid values for --sense-voice-language are auto, zh, en, ko, ja, and yue, where zh is for Chinese, en for English, ko for Korean, ja for Japanese, and yue for Cantonese.
./build/bin/sherpa-onnx-microphone-offline \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
--sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
./build/bin/sherpa-onnx-vad-microphone-offline-asr \
--silero-vad-model=./silero_vad.onnx \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
--sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx
If you copy, paste, and run the following commands in your terminal, you should be able to see the following recognition
result:
Decoded text: The tribal chieftain called for the boy and presented him with 50 pieces of gold.
cd /tmp
ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
echo "---"
ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs
mkdir build
cd build
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D BUILD_SHARED_LIBS=ON \
-D CMAKE_INSTALL_PREFIX=./install \
-D SHERPA_ONNX_ENABLE_BINARY=OFF \
..
ls -lh install/lib
ls -lh install/include
cd ..
ls -lh sense-voice-c-api
export LD_LIBRARY_PATH=$PWD/build/install/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=$PWD/build/install/lib:$DYLD_LIBRARY_PATH
./sense-voice-c-api
Hint: Since we are using shared libraries in the above example, you have to set the environment variable LD_LIBRARY_PATH for Linux and DYLD_LIBRARY_PATH for macOS. Otherwise, you would get runtime errors when running ./sense-voice-c-api.
Explanations
1. Download sherpa-onnx
cd /tmp
In this example, we download sherpa-onnx and place it inside the directory /tmp/. You can replace /tmp/ with any
directory you like.
Please always download the latest master of sherpa-onnx.
3. Build sherpa-onnx
mkdir build
cd build
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D BUILD_SHARED_LIBS=ON \
-D CMAKE_INSTALL_PREFIX=./install \
-D SHERPA_ONNX_ENABLE_BINARY=OFF \
..
We build a Release version of sherpa-onnx. Also, we use shared libraries here. The header file c-api.h and shared
libraries are installed into the directory ./build/install.
If you are using Linux, you should see the following content:
-- Installing: /tmp/sherpa-onnx/build/install/include/sherpa-onnx/c-api/c-api.h
ls -lh install/lib
ls -lh install/include
If you are using Linux, you should see the following content:
total 19M
-rw-r--r-- 1 runner docker 15M Jul 22 08:47 libonnxruntime.so
-rw-r--r-- 1 runner docker 4.1M Jul 22 08:47 libsherpa-onnx-c-api.so
drwxr-xr-x 2 runner docker 4.0K Jul 22 08:47 pkgconfig
total 4.0K
drwxr-xr-x 3 runner docker 4.0K Jul 22 08:47 sherpa-onnx
If you are using macOS, you should see the following content:
total 53976
-rw-r--r-- 1 runner staff 23M Jul 22 08:48 libonnxruntime.1.17.1.dylib
lrwxr-xr-x 1 runner staff 27B Jul 22 08:48 libonnxruntime.dylib -> libonnxruntime.1.
˓→17.1.dylib
cd ..
ls -lh sense-voice-c-api
Note that:
• -I ./build/install/include adds the directory ./build/install/include to the header search path so that #include "sherpa-onnx/c-api/c-api.h" won't throw an error.
• -L ./build/install/lib/ adds the directory ./build/install/lib to the library search path so that it can find -l sherpa-onnx-c-api.
• -l sherpa-onnx-c-api links the library libsherpa-onnx-c-api.so for Linux and libsherpa-onnx-c-api.dylib for macOS.
6. Run it
export LD_LIBRARY_PATH=$PWD/build/install/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=$PWD/build/install/lib:$DYLD_LIBRARY_PATH
./sense-voice-c-api
# For Linux
export LD_LIBRARY_PATH=$PWD/build/install/lib:$LD_LIBRARY_PATH
and:
# for macOS
export DYLD_LIBRARY_PATH=$PWD/build/install/lib:$DYLD_LIBRARY_PATH
This page describes how to use the Dart API to run SenseVoice models in sherpa-onnx.
Note that we have published the package sherpa_onnx at https://pub.dev/packages/sherpa_onnx.
Note that the package supports the following platforms:
• Android
• iOS
• Linux
• macOS
• Windows
In the following, we show how to use the pure Dart API to decode files with SenseVoice models.
cd /tmp
cd sherpa-onnx
cd dart-api-examples
cd non-streaming-asr
dart pub get
./run-sense-voice.sh
Explanations
cd /tmp
In this example, we download sherpa-onnx and place it inside the directory /tmp/. You can replace /tmp/ with any
directory you like.
cd sherpa-onnx
cd dart-api-examples
cd non-streaming-asr
dart pub get
The command dart pub get will download the sherpa_onnx package automagically from pub.dev.
You should see something like below after running dart pub get:
3. Run it
./run-sense-voice.sh
The above script downloads models and runs the code automatically.
You can find run-sense-voice.sh at the following address:
https://github.com/k2-fsa/sherpa-onnx/blob/master/dart-api-examples/non-streaming-asr/run-sense-voice.sh
The Dart API example code can be found at:
https://github.com/k2-fsa/sherpa-onnx/blob/master/dart-api-examples/non-streaming-asr/bin/sense-voice.dart
This page describes how to use the Python API for SenseVoice.
Please refer to Install the Python Package for how to install the Python package of sherpa-onnx.
The following is a quick way to do that:
pip install sherpa-onnx
Decode a file
After installing the Python package, you can download the Python example code and run it with the following commands:
cd /tmp
git clone https://github.com/k2-fsa/sherpa-onnx.git/
cd sherpa-onnx
python3 ./python-api-examples/offline-sense-voice-ctc-decode-files.py
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "95", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90,
˓→ 4.20, 4.56, 4.74, 5.46], "tokens":["", "", "", "", "", "", "9", "", "", "", "", "5", "
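The core of that example boils down to a few lines. Here is a minimal sketch; it assumes OfflineRecognizer.from_sense_voice accepts the arguments below (offline-sense-voice-ctc-decode-files.py in the repository is the maintained version) and uses soundfile for reading the wave file.

import sherpa_onnx
import soundfile as sf  # assumed extra dependency for reading wave files

recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
    model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx",
    tokens="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
    num_threads=1,
    use_itn=True,  # enable inverse text normalization, as in the CLI example above
)

samples, sample_rate = sf.read(
    "./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav",
    dtype="float32",
)
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)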
The following example shows how to use a microphone with SenseVoice and silero-vad for speech recognition:
cd /tmp/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
python3 ./python-api-examples/vad-with-non-streaming-asr.py \
--silero-vad-model=./silero_vad.onnx \
--sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt
If you start speaking, you should see some output after you stop speaking.
Generate subtitles
This section describes how to use SenseVoice and silero-vad to generate subtitles.
Chinese
cd /tmp/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/lei-jun-test.wav
python3 ./python-api-examples/generate-subtitles.py \
--silero-vad-model=./silero_vad.onnx \
--sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
--num-threads=2 \
./lei-jun-test.wav
(The command prints the generated subtitles in SRT format: 70 numbered cues spanning 0:00:28,934 to 0:04:31,398. The cue text is in Chinese and is not reproduced here.)
English
cd /tmp/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav
python3 ./python-api-examples/generate-subtitles.py \
--silero-vad-model=./silero_vad.onnx \
--sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
--num-threads=2 \
./Obama.wav
1
0:00:09,286 --> 0:00:12,486
Everybody, all right, everybody, go ahead and have a seat.

2
0:00:13,094 --> 0:00:15,014
How's everybody doing today?

3
0:00:18,694 --> 0:00:20,742
How about Tim Sper.

4

5
0:00:32,710 --> 0:00:48,326
And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today. And I want to thank Wakefield for being such an outstanding host. Give yourselves a big line.

6
0:00:54,406 --> 0:00:59,238
Now I know that for many of you, today is the first day of school.

7
0:00:59,590 --> 0:01:09,798
And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.

8
0:01:10,630 --> 0:01:16,006
I imagine there's some seniors out there who are feeling pretty good right now with just one more year to go.

9
0:01:18,790 --> 0:01:27,142
And no matter what grade you're in, some of you are probably wishing it we're still summer, and you could have stayed in bed just a little bit longer this morning.

10
0:01:27,942 --> 0:01:29,414
I know that field.

11
0:01:31,654 --> 0:01:51,708
When I was young, my family lived overseas, I lived in Indonesia for a few years and my mother, she didn't have the money to send me where all the American kids went to school, but she thought it was important for me to keep up with American education, so

12
0:01:52,230 --> 0:01:58,790
Monday through Friday, but because she had to go to work, the only time she could do it was at 430 in the morning.

13
0:02:00,038 --> 0:02:03,750
Now as you might imagine, I wasn't too happy about getting up that early.

14
0:02:04,102 --> 0:02:07,302
A lot of times I'd fall asleep right there at the kitchen table.

16
0:02:17,094 --> 0:02:25,382
So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.

17
0:02:25,798 --> 0:02:33,798
I'm here because I want to talk with you about your education and what's expected of all of you in this new school year.

18
0:02:34,470 --> 0:02:40,422
I've given a lot of speeches about education, and I've talked about responsibility a lot.

19
0:02:40,806 --> 0:02:47,174
I've talked about teachers responsibility for inspiring students and pushing you to learn.

20
0:02:47,430 --> 0:02:58,726
I talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the TV or with the Xbox.

21
0:02:59,078 --> 0:03:00,774
I've talked a lot about.

22
0:03:01,350 --> 0:03:13,286
Your government's responsibility for setting high standards and supporting teachers and principals and turning around schools that aren't working where students aren't

23
0:03:13,990 --> 0:03:15,366
But at the end of the day.

24
0:03:16,006 --> 0:03:26,054
We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference, none of it will matter.

25
0:03:26,694 --> 0:03:30,694
Unless all of you fulfill your responsibilities.

26
0:03:31,238 --> 0:03:43,814
Unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults and put in the hard work

27
0:03:44,646 --> 0:03:46,598
That's what I want to focus on today.

28
0:03:47,110 --> 0:03:50,918
The responsibility each of you has for your education.

29
0:03:51,718 --> 0:03:54,854
I want to start with the responsibility you have to yourself.

30
0:03:55,654 --> 0:03:59,078
Every single one of you has something that you're good at.

31
0:03:59,782 --> 0:04:02,406
Every single one of you has something to offer.

32
0:04:02,982 --> 0:04:07,590
And you have a responsibility to yourself to discover what that is.

33
0:04:08,326 --> 0:04:11,494
That's the opportunity an education can provide.

34
0:04:12,326 --> 0:04:22,598
Maybe you could be a great writer, maybe even good enough to write a book or articles in a newspaper, but you might not know it until you write that English paper.

35
0:04:23,078 --> 0:04:25,894
That English class paper that's assigned to you.

36
0:04:26,694 --> 0:04:38,726
Maybe you could be an innovator or an inventor, maybe even good enough to come up with the next iPhone or the new medicine or a vaccine, but you might not know it until you

37
0:04:39,814 --> 0:04:44,838

38
0:04:45,350 --> 0:04:50,182
But you might not know that until you join student government or the debate team.

39
0:04:51,558 --> 0:04:56,774
And no matter what you want to do with your life, I guarantee that you'll need an education to do it.

40
0:04:57,318 --> 0:05:00,710
You want to be a doctor or a teacher or a police officer?

41
0:05:00,998 --> 0:05:09,702
You want to be a nurse or an architect, a lawyer or a member of our military, you're going to need a good education for every single one of those careers.

42
0:05:10,054 --> 0:05:14,278
You cannot drop out of school and just drop into a good job.

43
0:05:15,174 --> 0:05:19,846
You've got to train for it and work for it and learn for it.

44
0:05:20,518 --> 0:05:23,654
And this isn't just important for your own life and your own future.

45
0:05:24,678 --> 0:05:29,670
What you make of your education will decide nothing less than the future of this country.

46
0:05:29,958 --> 0:05:32,998
The future of America depends on you.
This example shows how to use a WebSocket server with SenseVoice for speech recognition.
Please run
cd /tmp/sherpa-onnx
python3 ./python-api-examples/non_streaming_server.py \
--sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
--tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt
You should see the following output after starting the server:
http://localhost:6006
You can either visit the address http://localhost:6006 or write code to interact with the server.
In the following, we describe possible approaches for interacting with the WebSocket server.
Hint: The WebSocket server is able to handle multiple clients/connections at the same time.
The following code sends the files sequentially, one by one, to the server for decoding.
cd /tmp/sherpa-onnx
python3 ./python-api-examples/offline-websocket-client-decode-files-sequential.py ./
˓→sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav ./sherpa-onnx-
˓→sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
2024-07-28 20:28:15,934 INFO [offline-websocket-client-decode-files-sequential.py:114]␣
˓→Sending ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
The tribal chieftain called for the boy and presented him with 50 pieces of gold.
The following code sends the files in parallel to the server for decoding.
cd /tmp/sherpa-onnx
python3 ./python-api-examples/offline-websocket-client-decode-files-paralell.py \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
2024-07-28 20:31:10,542 INFO [offline-websocket-client-decode-files-paralell.py:131] ./
˓→sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
The tribal chieftain called for the boy and presented him with 50 pieces of gold.
You can also start a browser to interact with the WebSocket server.
Please visit http://localhost:6006.
Warning: We are not using a certificate to start the server, so the only correct URL is http://localhost:6006.
All of the following addresses are incorrect:
• Incorrect/Wrong address: https://localhost:6006
• Incorrect/Wrong address: http://127.0.0.1:6006
• Incorrect/Wrong address: https://127.0.0.1:6006
• Incorrect/Wrong address: http://a.b.c.d:6006
• Incorrect/Wrong address: https://a.b.c.d:6006
After starting the browser, you should see the following page:
After clicking Click me to connect and Choose File, you will see the recognition result returned from the server:
Please click the button Click me to connect, then click the button Offline-Record and speak; finally, click the button Offline-Stop.
You should see the results from the server. A screenshot is given below:
Note that you can save the recorded audio into a wave file for debugging.
The recorded audio from the above screenshot is saved to test.wav and is given below:
Colab notebook
We provide a colab notebook for you to try this section step by step.
sherpa-onnx-pyannote-segmentation-3-0
This model is converted from https://huggingface.co/pyannote/segmentation-3.0. You can find the conversion script at
https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/pyannote/segmentation.
In the following, we describe how to use it together with a speaker embedding extraction model for speaker diarization.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
tar xvf sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
rm sherpa-onnx-pyannote-segmentation-3-0.tar.bz2

ls -lh sherpa-onnx-pyannote-segmentation-3-0/{*.onnx,LICENSE,README.md}
First, let’s download a test wave file. The model expects wave files of 16kHz, 16-bit and a single channel.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/
˓→0-four-speakers-zh.wav
Next, let’s download a model for extracting speaker embeddings. You can find many models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models. We download two models in this example:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/
˓→3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/
˓→nemo_en_titanet_small.onnx
3D-Speaker + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"), num_threads=1, debug=False,␣
˓→provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_
Started
Duration : 56.861 s
Elapsed seconds: 16.870 s
Real time factor (RTF): 16.870 / 56.861 = 0.297
3D-Speaker + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./
˓→build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --
˓→segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx --
˓→embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-
˓→speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx"), num_threads=1, debug=False,
˓→ provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_
˓→3, min_duration_off=0.5)
Started
Duration : 56.861 s
Elapsed seconds: 13.679 s
Real time factor (RTF): 13.679 / 56.861 = 0.241
NeMo + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./
˓→build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --
˓→segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx --
˓→embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"), num_threads=1, debug=False,␣
˓→provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_
˓→min_duration_off=0.5)
Started
Duration : 56.861 s
NeMo + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./
˓→build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --
˓→segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx --
˓→embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx"), num_threads=1, debug=False,
˓→ provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_
˓→min_duration_off=0.5)
Started
Duration : 56.861 s
Elapsed seconds: 6.231 s
Real time factor (RTF): 6.231 / 56.861 = 0.110
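Hint: You can run the same pipeline from Python. The sketch below mirrors python-api-examples/offline-speaker-diarization.py from the sherpa-onnx repo; the class and field names follow the config dump shown above, but please treat the details as assumptions and check that example if anything differs.
#!/usr/bin/env python3
# A minimal sketch of the Python speaker-diarization API (see
# python-api-examples/offline-speaker-diarization.py for the real example).
import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineSpeakerDiarizationConfig(
    segmentation=sherpa_onnx.OfflineSpeakerSegmentationModelConfig(
        pyannote=sherpa_onnx.OfflineSpeakerSegmentationPyannoteModelConfig(
            model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"
        ),
    ),
    embedding=sherpa_onnx.SpeakerEmbeddingExtractorConfig(
        model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx"
    ),
    clustering=sherpa_onnx.FastClusteringConfig(num_clusters=4),
)
sd = sherpa_onnx.OfflineSpeakerDiarization(config)

samples, sample_rate = sf.read("./0-four-speakers-zh.wav", dtype="float32")
assert sample_rate == sd.sample_rate  # the models expect 16 kHz input

segments = sd.process(samples).sort_by_start_time()
for seg in segments:
    print(f"{seg.start:.3f} -- {seg.end:.3f} speaker_{seg.speaker:02d}")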
sherpa-onnx-reverb-diarization-v1
This model is converted from https://huggingface.co/Revai/reverb-diarization-v1. You can find the conversion script
at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/pyannote/segmentation.
Caution: It is accessible under a non-commercial license. You can find its license at https://huggingface.co/
Revai/reverb-diarization-v1/blob/main/LICENSE.
In the following, we describe how to use it together with a speaker embedding extraction model for speaker diarization.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-reverb-diarization-v1.tar.bz2
tar xvf sherpa-onnx-reverb-diarization-v1.tar.bz2
rm sherpa-onnx-reverb-diarization-v1.tar.bz2

ls -lh sherpa-onnx-reverb-diarization-v1/{*.onnx,LICENSE,README.md}
First, let’s download a test wave file. The model expects wave files of 16kHz, 16-bit and a single channel.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/
˓→0-four-speakers-zh.wav
Next, let’s download a model for extracting speaker embeddings. You can find many models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models. We download two models in this example:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/
˓→3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/
˓→nemo_en_titanet_small.onnx
3D-Speaker + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./
˓→build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --
˓→segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx --embedding.
˓→model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-
˓→zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-reverb-diarization-v1/model.onnx"), num_threads=1, debug=False,␣
˓→provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_
˓→3, min_duration_off=0.5)
Started
Duration : 56.861 s
Elapsed seconds: 25.715 s
Real time factor (RTF): 25.715 / 56.861 = 0.452
3D-Speaker + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./
˓→build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --
˓→segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx --
˓→embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-
˓→speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-reverb-diarization-v1/model.int8.onnx"), num_threads=1, debug=False,␣
˓→provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_
˓→3, min_duration_off=0.5)
Started
Duration : 56.861 s
Elapsed seconds: 22.323 s
Real time factor (RTF): 22.323 / 56.861 = 0.393
NeMo + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./
˓→build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --
˓→segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx --embedding.
˓→model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-reverb-diarization-v1/model.onnx"), num_threads=1, debug=False,␣
˓→provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_
˓→min_duration_off=0.5)
Started
Duration : 56.861 s
Elapsed seconds: 11.465 s
Real time factor (RTF): 11.465 / 56.861 = 0.202
NeMo + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeak
˓→"./sherpa-onnx-reverb-diarization-v1/model.int8.onnx"), num_threads=1, debug=False,␣
˓→provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_
˓→min_duration_off=0.5)
Started
Duration : 56.861 s
Elapsed seconds: 9.688 s
Real time factor (RTF): 9.688 / 56.861 = 0.170
Please visit
https://huggingface.co/spaces/k2-fsa/speaker-diarization
You don’t need to install anything to try it in your browser.
You can find Android APKs for speaker diarization at the following page
https://k2-fsa.github.io/sherpa/onnx/speaker-diarization/apk.html
For users from China, you can also visit
https://k2-fsa.github.io/sherpa/onnx/speaker-diarization/apk-cn.html
The source code for the APKs can be found at
https://github.com/k2-fsa/sherpa-onnx/tree/master/android/SherpaOnnxSpeakerDiarization
You can find the script for building the APKs at
https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/apk/build-apk-speaker-diarization.sh
Please see Android for more details.
Please see the following examples for the various APIs:
• C API: https://github.com/k2-fsa/sherpa-onnx/blob/master/c-api-examples/offline-speaker-diarization-c-api.c (see also C API)
• C++ API: https://github.com/k2-fsa/sherpa-onnx/blob/master/sherpa-onnx/csrc/sherpa-onnx-offline-speaker-diarization.cc
• C# API: https://github.com/k2-fsa/sherpa-onnx/tree/master/dotnet-examples/offline-speaker-diarization
• Dart API: https://github.com/k2-fsa/sherpa-onnx/tree/master/dart-api-examples/speaker-diarization
• Go API: https://github.com/k2-fsa/sherpa-onnx/tree/master/go-api-examples/non-streaming-speaker-diarization
• Java API: https://github.com/k2-fsa/sherpa-onnx/blob/master/java-api-examples/OfflineSpeakerDiarizationDemo.java (see also Java API)
• Kotlin API: https://github.com/k2-fsa/sherpa-onnx/blob/master/kotlin-api-examples/test_offline_speaker_diarization.kt
• Pascal API: https://github.com/k2-fsa/sherpa-onnx/tree/master/pascal-api-examples/speaker-diarization
• Swift API: https://github.com/k2-fsa/sherpa-onnx/blob/master/swift-api-examples/speaker-diarization.swift
Hint: You can find Android APKs for each model at the following page
https://k2-fsa.github.io/sherpa/onnx/speaker-identification/apk.html
We provide a huggingface space where you can try text-to-speech with sherpa-onnx from within your browser without
installing anything.
Hint: We also have spaces using WebAssembly for text-to-speech. Please see Huggingface Spaces (WebAssembly).
All you need is a browser, whether on your desktop computer, your phone, or your iPad.
Please visit
https://huggingface.co/spaces/k2-fsa/text-to-speech
vits
The following table summarizes the information of all models on this page.
Note: Since there are more than 100 pre-trained models for over 40 languages, we don’t list all of them on this page.
Please find them at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.
You can try all the models at the following Huggingface space: https://huggingface.co/spaces/k2-fsa/text-to-speech.
Hint: You can find Android APKs for each model at the following page
https://k2-fsa.github.io/sherpa/onnx/tts/apk.html
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2
tar xvf vits-melo-tts-zh_en.tar.bz2
rm vits-melo-tts-zh_en.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
ls -lh vits-melo-tts-zh_en/
total 346848
-rw-r--r-- 1 fangjun staff 1.0K Jul 16 13:38 LICENSE
-rw-r--r-- 1 fangjun staff 156B Jul 16 13:38 README.md
-rw-r--r-- 1 fangjun staff 58K Jul 16 13:38 date.fst
drwxr-xr-x 9 fangjun staff 288B Apr 19 20:42 dict
-rw-r--r-- 1 fangjun staff 6.5M Jul 16 13:38 lexicon.txt
-rw-r--r-- 1 fangjun staff 163M Jul 16 13:38 model.onnx
-rw-r--r-- 1 fangjun staff 63K Jul 16 13:38 number.fst
-rw-r--r-- 1 fangjun staff 87K Jul 16 13:38 phone.fst
-rw-r--r-- 1 fangjun staff 655B Jul 16 13:38 tokens.txt
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-melo-tts-zh_en/model.onnx \
--vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
--vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
--vits-dict-dir=./vits-melo-tts-zh_en/dict \
--output-filename=./zh-en-0.wav \
"This is a text to speech "
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-melo-tts-zh_en/model.onnx \
--vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
--vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
--vits-dict-dir=./vits-melo-tts-zh_en/dict \
--output-filename=./zh-en-1.wav \
"machine learningartificial intelligence"
./build/bin/sherpa-onnx-offline-tts-play \
--vits-model=./vits-melo-tts-zh_en/model.onnx \
--vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
--vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
--tts-rule-fsts="./vits-melo-tts-zh_en/date.fst,./vits-melo-tts-zh_en/number.fst" \
--vits-dict-dir=./vits-melo-tts-zh_en/dict \
--output-filename=./zh-en-2.wav \
"Are you ok 20154 I am very happy to be in China."
After running, it will generate three files zh-en-0.wav, zh-en-1.wav, and zh-en-2.wav in the current directory.
soxi zh-en-*.wav
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts-play.py \
--vits-model=./vits-melo-tts-zh_en/model.onnx \
--vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
--vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
--vits-dict-dir=./vits-melo-tts-zh_en/dict \
--output-filename=./zh-en-3.wav \
". Genius is one percent inspiration and ninety-nine percent perspiration. "
soxi zh-en-3.wav
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-glados.tar.bz2
tar xvf vits-piper-en_US-glados.tar.bz2
rm vits-piper-en_US-glados.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
ls -lh vits-piper-en_US-glados/
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
--vits-tokens=./vits-piper-en_US-glados/tokens.txt \
--vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
--output-filename=./glados-liliana.wav \
"liliana, the most beautiful and lovely assistant of our team!"
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
--vits-tokens=./vits-piper-en_US-glados/tokens.txt \
--vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
--output-filename=./glados-code.wav \
"Talk is cheap. Show me the code."
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
--vits-tokens=./vits-piper-en_US-glados/tokens.txt \
--vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
--output-filename=./glados-men.wav \
"Today as always, men fall into two groups: slaves and free men. Whoever does not␣
˓→have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a␣
After running, it will generate three files glados-liliana.wav, glados-code.wav, and glados-men.wav in the current directory.
soxi glados*.wav
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
--vits-tokens=./vits-piper-en_US-glados/tokens.txt \
--vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
--output-filename=./glados-ship.wav \
"A ship in port is safe, but that's not what ships are built for."
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
--vits-tokens=./vits-piper-en_US-glados/tokens.txt \
--vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
--output-filename=./glados-bug.wav \
"Given enough eyeballs, all bugs are shallow."
After running, it will generate two files glados-ship.wav and glados-bug.wav in the current directory.
soxi ./glados-{ship,bug}.wav
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-libritts_r-medium.tar.bz2
tar xvf vits-piper-en_US-libritts_r-medium.tar.bz2
rm vits-piper-en_US-libritts_r-medium.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
ls -lh vits-piper-en_US-libritts_r-medium/
total 153552
-rw-r--r-- 1 fangjun staff 279B Nov 29 2023 MODEL_CARD
-rw-r--r-- 1 fangjun staff 75M Nov 29 2023 en_US-libritts_r-medium.onnx
-rw-r--r-- 1 fangjun staff 20K Nov 29 2023 en_US-libritts_r-medium.onnx.json
drwxr-xr-x 122 fangjun staff 3.8K Nov 28 2023 espeak-ng-data
-rw-r--r-- 1 fangjun staff 954B Nov 29 2023 tokens.txt
-rwxr-xr-x 1 fangjun staff 1.8K Nov 29 2023 vits-piper-en_US.py
-rwxr-xr-x 1 fangjun staff 730B Nov 29 2023 vits-piper-en_US.sh
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
--vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
--vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
--output-filename=./libritts-liliana-109.wav \
--sid=109 \
"liliana, the most beautiful and lovely assistant of our team!"
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
--vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
--vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
--output-filename=./libritts-liliana-900.wav \
--sid=900 \
"liliana, the most beautiful and lovely assistant of our team!"
After running, it will generate two files libritts-liliana-109.wav and libritts-liliana-900.wav in the current directory.
soxi libritts-liliana-*.wav
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
--vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
--vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
--sid=200 \
--output-filename=./libritts-armstrong-200.wav \
"That's one small step for a man, a giant leap for mankind."
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
--vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
--vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
--sid=500 \
--output-filename=./libritts-armstrong-500.wav \
"That's one small step for a man, a giant leap for mankind."
soxi ./libritts-armstrong*.wav
This model is converted from pretrained_ljspeech.pth, which is trained by the vits author Jaehyeon Kim on the LJ
Speech dataset. It supports only English and is a single-speaker model.
Note: If you are interested in how the model is converted, please see https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-ljs.py
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-ljs.tar.bz2
tar xvf vits-ljs.tar.bz2
rm vits-ljs.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-ljs/vits-ljs.onnx \
--vits-lexicon=./vits-ljs/lexicon.txt \
--vits-tokens=./vits-ljs/tokens.txt \
--output-filename=./liliana.wav \
"liliana, the most beautiful and lovely assistant of our team!"
soxi ./liliana.wav
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-ljs/vits-ljs.onnx \
--vits-lexicon=./vits-ljs/lexicon.txt \
--vits-tokens=./vits-ljs/tokens.txt \
--output-filename=./armstrong.wav \
"That's one small step for a man, a giant leap for mankind."
soxi ./armstrong.wav
This model is converted from pretrained_vctk.pth, which is trained by the vits author Jaehyeon Kim on the VCTK
dataset. It supports only English and is a multi-speaker model. It contains 109 speakers.
Note: If you are interested in how the model is converted, please see https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-vctk.py
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-vctk.tar.bz2
tar xvf vits-vctk.tar.bz2
rm vits-vctk.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Since there are 109 speakers available, we can choose a speaker from 0 to 108. The default speaker ID is 0.
We use speaker ID 0, 10, and 108 below to generate audio for the same text.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=0 \
--output-filename=./kennedy-0.wav \
"Ask not what your country can do for you; ask what you can do for your country."
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=10 \
--output-filename=./kennedy-10.wav \
"Ask not what your country can do for you; ask what you can do for your country."
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=108 \
--output-filename=./kennedy-108.wav \
"Ask not what your country can do for you; ask what you can do for your country."
We use speaker ID 30, 66, and 99 below to generate audio for different transcripts.
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=30 \
--output-filename=./einstein-30.wav \
"Life is like riding a bicycle. To keep your balance, you must keep moving."
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=66 \
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=99 \
--output-filename=./martin-99.wav \
"Darkness cannot drive out darkness: only light can do that. Hate cannot drive out␣
˓→hate: only love can do that"
usage:
sherpa-onnx-offline-tts \
--vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
--vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
--vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
--vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
--vits-length-scale=0.5 \
sherpa-onnx-offline-tts \
--vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
--vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
--vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
--vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
--sid=1 \
--tts-rule-fsts=./sherpa-onnx-vits-zh-ll/number.fst \
--output-filename="./1-numbers.wav" \
"14"
sherpa-onnx-offline-tts \
--vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
--vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
--vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
--vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
--tts-rule-fsts=./sherpa-onnx-vits-zh-ll/phone.fst,./sherpa-onnx-vits-zh-ll/number.fst␣
˓→\
--sid=2 \
--output-filename="./2-numbers.wav" \
"110 18601200909"
sherpa-onnx-offline-tts \
--vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
--vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
--vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
--vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
--sid=3 \
--output-filename="./3-wo-mi.wav" \
""
sherpa-onnx-offline-tts \
--vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
--vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
--vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
--vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
--tts-rule-fsts=./sherpa-onnx-vits-zh-ll/number.fst \
--sid=4 \
--output-filename="./4-heteronym.wav" \
"35, 9"
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-fanchen-C.tar.bz2
tar xvf vits-zh-hf-fanchen-C.tar.bz2
rm vits-zh-hf-fanchen-C.tar.bz2
total 291M
-rw-r--r-- 1 1001 127 58K Apr 21 05:40 date.fst
drwxr-xr-x 3 1001 127 4.0K Apr 19 12:42 dict
-rwxr-xr-x 1 1001 127 4.0K Apr 21 05:40 export-onnx-zh-hf-fanchen-models.py
-rwxr-xr-x 1 1001 127 2.5K Apr 21 05:40 generate-lexicon-zh-hf-fanchen-models.py
-rw-r--r-- 1 1001 127 2.4M Apr 21 05:40 lexicon.txt
-rw-r--r-- 1 1001 127 22K Apr 21 05:40 new_heteronym.fst
-rw-r--r-- 1 1001 127 63K Apr 21 05:40 number.fst
-rw-r--r-- 1 1001 127 87K Apr 21 05:40 phone.fst
-rw-r--r-- 1 1001 127 173M Apr 21 05:40 rule.far
-rw-r--r-- 1 1001 127 331 Apr 21 05:40 tokens.txt
-rw-r--r-- 1 1001 127 116M Apr 21 05:40 vits-zh-hf-fanchen-C.onnx
-rwxr-xr-x 1 1001 127 2.0K Apr 21 05:40 vits-zh-hf-fanchen-models.sh
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=0.5 \
--output-filename="./value-2x.wav" \
""
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=1.0 \
--tts-rule-fsts=./vits-zh-hf-fanchen-C/number.fst \
--output-filename="./numbers.wav" \
"14"
sherpa-onnx-offline-tts \
--sid=100 \
sherpa-onnx-offline-tts \
--sid=14 \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=1.0 \
--output-filename="./wo-mi-14.wav" \
""
sherpa-onnx-offline-tts \
--sid=102 \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-fanchen-C/number.fst \
--vits-length-scale=1.0 \
--output-filename="./heteronym-102.wav" \
"35, 91"
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-fanchen-wnj.tar.bz2
tar xvf vits-zh-hf-fanchen-wnj.tar.bz2
rm vits-zh-hf-fanchen-wnj.tar.bz2
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
--vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
--output-filename="./kuayue.wav" \
""
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
--vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-fanchen-wnj/number.fst \
--output-filename="./os.wav" \
"14"
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-theresa.tar.bz2
tar xvf vits-zh-hf-theresa.tar.bz2
rm vits-zh-hf-theresa.tar.bz2
total 596992
-rw-r--r-- 1 fangjun staff 58K Apr 21 13:39 date.fst
drwxr-xr-x 9 fangjun staff 288B Apr 19 20:42 dict
-rw-r--r-- 1 fangjun staff 2.6M Apr 21 13:39 lexicon.txt
-rw-r--r-- 1 fangjun staff 21K Apr 21 13:39 new_heteronym.fst
-rw-r--r-- 1 fangjun staff 63K Apr 21 13:39 number.fst
-rw-r--r-- 1 fangjun staff 87K Apr 21 13:39 phone.fst
-rw-r--r-- 1 fangjun staff 172M Apr 21 13:39 rule.far
-rw-r--r-- 1 fangjun staff 116M Apr 21 13:39 theresa.onnx
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-theresa/theresa.onnx \
--vits-dict-dir=./vits-zh-hf-theresa/dict \
--vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
--vits-tokens=./vits-zh-hf-theresa/tokens.txt \
--sid=0 \
--output-filename="./reai-0.wav" \
""
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-theresa/theresa.onnx \
--vits-dict-dir=./vits-zh-hf-theresa/dict \
--vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
--vits-tokens=./vits-zh-hf-theresa/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-theresa/number.fst \
--debug=1 \
--sid=88 \
--output-filename="./mi14-88.wav" \
"141000000"
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-eula.tar.bz2
tar xvf vits-zh-hf-eula.tar.bz2
rm vits-zh-hf-eula.tar.bz2
total 596992
-rw-r--r-- 1 fangjun staff 58K Apr 21 13:39 date.fst
drwxr-xr-x 9 fangjun staff 288B Apr 19 20:42 dict
-rw-r--r-- 1 fangjun staff 116M Apr 21 13:39 eula.onnx
-rw-r--r-- 1 fangjun staff 2.6M Apr 21 13:39 lexicon.txt
-rw-r--r-- 1 fangjun staff 21K Apr 21 13:39 new_heteronym.fst
-rw-r--r-- 1 fangjun staff 63K Apr 21 13:39 number.fst
-rw-r--r-- 1 fangjun staff 87K Apr 21 13:39 phone.fst
-rw-r--r-- 1 fangjun staff 172M Apr 21 13:39 rule.far
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-eula/eula.onnx \
--vits-dict-dir=./vits-zh-hf-eula/dict \
--vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
--vits-tokens=./vits-zh-hf-eula/tokens.txt \
--debug=1 \
--sid=666 \
--output-filename="./news-666.wav" \
""
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-eula/eula.onnx \
--vits-dict-dir=./vits-zh-hf-eula/dict \
--vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
--vits-tokens=./vits-zh-hf-eula/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-eula/number.fst \
--sid=99 \
--output-filename="./news-99.wav" \
"925"
Hint: You can download the Android APK for this model at
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
(Please search for vits-icefall-zh-aishell3 in the above Android APK page)
Note: If you are interested in how the model is converted, please see the documentation of icefall.
If you are interested in training your own model, please also refer to icefall.
icefall is also developed by us.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-icefall-zh-aishell3.tar.bz2
tar xvf vits-icefall-zh-aishell3.tar.bz2
rm vits-icefall-zh-aishell3.tar.bz2
Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.
Since there are 174 speakers available, we can choose a speaker from 0 to 173. The default speaker ID is 0.
We use speaker ID 10, 33, and 99 below to generate audio for the same text.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=10 \
--output-filename=./liliana-10.wav \
""
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=33 \
--output-filename=./liliana-33.wav \
""
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=99 \
--output-filename=./liliana-99.wav \
""
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=66 \
--output-filename=./rule-66.wav \
"35, 91"
We use speaker ID 21, 41, and 45 below to generate audio for different transcripts.
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=21 \
--output-filename=./liubei-21.wav \
""
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=41 \
--output-filename=./demokelite-41.wav \
""
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=45 \
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.
˓→fst,./vits-icefall-zh-aishell3/number.fst \
--sid=103 \
--output-filename=./rule-103.wav \
"7144349737831141177872411013812345678"
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-
˓→lessac-medium.tar.bz2
tar xf vits-piper-en_US-lessac-medium.tar.bz2
Hint: You can find a lot of pre-trained models for over 40 languages at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
--vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
--vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
--output-filename=./liliana-piper-en_US-lessac-medium.wav \
"liliana, the most beautiful and lovely assistant of our team!"
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts-play \
--vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
--vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
--vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
--output-filename=./liliana-piper-en_US-lessac-medium.wav \
"liliana, the most beautiful and lovely assistant of our team!"
soxi ./liliana-piper-en_US-lessac-medium.wav
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
--vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
--vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
--output-filename=./armstrong-piper-en_US-lessac-medium.wav \
"That's one small step for a man, a giant leap for mankind."
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts-play.py \
--vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
--vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
--vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
--output-filename=./armstrong-piper-en_US-lessac-medium.wav \
"That's one small step for a man, a giant leap for mankind."
soxi ./armstrong-piper-en_US-lessac-medium.wav
8.30.3 WebAssembly
In this section, we describe how to build text-to-speech from sherpa-onnx for WebAssembly so that you can run text-to-speech with WebAssembly.
Please follow the steps below to build and run sherpa-onnx for WebAssembly.
Hint: We provide a colab notebook for you to try this section step by step.
If you are using Windows or you don’t want to set up your local environment to build WebAssembly support, please use the above colab notebook.
Install Emscripten
We need to compile the C/C++ files in sherpa-onnx with the help of emscripten.
Please refer to https://emscripten.org/docs/getting_started/downloads for detailed installation instructions.
The following is an example showing how to install it on Linux/macOS:
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
emcc -v
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /Users/fangjun/open-source/emsdk/upstream/bin
Build
After installing emscripten, we can build text-to-speech from sherpa-onnx for WebAssembly now.
Please use the following command to build it:
cd wasm/tts/assets
wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_
˓→US-libritts_r-medium.tar.bz2
tar xf vits-piper-en_US-libritts_r-medium.tar.bz2
rm vits-piper-en_US-libritts_r-medium.tar.bz2
mv vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx ./model.onnx
mv vits-piper-en_US-libritts_r-medium/tokens.txt ./
mv vits-piper-en_US-libritts_r-medium/espeak-ng-data ./
rm -rf vits-piper-en_US-libritts_r-medium
cd ../../..
./build-wasm-simd-tts.sh
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libkaldi-decoder-core.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libsherpa-onnx-kaldifst-core.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libsherpa-onnx-fst.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libonnxruntime.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libespeak-ng.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libucd.a
-- Up-to-date: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libucd.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libpiper_phonemize.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/./
˓→sherpa-onnx.pc
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→pkgconfig/espeak-ng.pc
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/share/
˓→vim/vimfiles/ftdetect
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/share/
˓→vim/vimfiles/ftdetect/espeakfiletype.vim
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/share/
˓→vim/vimfiles/syntax
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/share/
˓→vim/vimfiles/syntax/espeakrules.vim
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/share/
˓→vim/vimfiles/syntax/espeaklist.vim
-- Up-to-date: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libucd.a
-- Up-to-date: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libespeak-ng.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libsherpa-onnx-core.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/lib/
˓→libsherpa-onnx-c-api.a
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/
˓→include/sherpa-onnx/c-api/c-api.h
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/bin/
˓→wasm/tts/sherpa-onnx-wasm-main.js
-- Up-to-date: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/bin/
˓→wasm/tts/sherpa-onnx-wasm-main.js
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/bin/
˓→wasm/tts/index.html
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/bin/
˓→wasm/tts/sherpa-onnx.js
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/bin/
˓→wasm/tts/app.js
-- Installing: /Users/fangjun/open-source/sherpa-onnx/build-wasm-simd-tts/install/bin/
˓→wasm/tts/sherpa-onnx-wasm-main.data
+ ls -lh install/bin/wasm/tts
total 211248
-rw-r--r-- 1 fangjun staff 5.3K Feb 22 09:18 app.js
-rw-r--r-- 1 fangjun staff 1.3K Feb 22 09:18 index.html
-rw-r--r-- 1 fangjun staff 92M Feb 22 10:35 sherpa-onnx-wasm-main.data
-rw-r--r-- 1 fangjun staff 117K Feb 22 10:39 sherpa-onnx-wasm-main.js
-rw-r--r-- 1 fangjun staff 11M Feb 22 10:39 sherpa-onnx-wasm-main.wasm
-rw-r--r-- 1 fangjun staff 4.5K Feb 22 09:18 sherpa-onnx.js
cd build-wasm-simd-tts/install/bin/wasm/tts
python3 -m http.server 6008
Start your browser and visit http://localhost:6008/; you should see the following page:
We provide two Huggingface spaces so that you can try text-to-speech with WebAssembly in your browser.
English TTS
https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-tts-sherpa-onnx-en/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/wasm-simd-hf-space-en-tts.yaml
German TTS
https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-de
Hint: If you don’t have access to Huggingface, please visit the following mirror:
https://modelscope.cn/studios/k2-fsa/web-assembly-tts-sherpa-onnx-de/summary
Note: The script for building this space can be found at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/wasm-simd-hf-space-de-tts.yaml
8.30.4 Piper
In this section, we describe how to convert piper pre-trained models from https://huggingface.co/rhasspy/piper-voices.
Hint:
You can find all of the converted models from piper in the following address:
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
If you want to convert your own pre-trained piper models or if you want to learn how the conversion works, please read
on.
Otherwise, you only need to download the converted models from the above link.
Note that there are pre-trained models for over 30 languages from piper. All models share the same conversion method, so we use an American English model in this section as an example.
Install dependencies
Hint: We suggest that you always use the latest version of onnxruntime.
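The conversion script below only needs onnx to edit the model, and onnxruntime is useful for testing the result; a typical install (the exact versions are up to you) is:
pip install onnx onnxruntime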
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/low/en_US-amy-
˓→low.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/low/en_US-amy-
˓→low.onnx.json
Please use the following code to add meta data to the downloaded onnx model.
#!/usr/bin/env python3

import json
import os
from typing import Any, Dict

import onnx


def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    """Add meta data to an ONNX model. It is changed in-place.

    Args:
      filename:
        Filename of the ONNX model to be changed.
      meta_data:
        Key-value pairs.
    """
    model = onnx.load(filename)
    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)


def load_config(model):
    with open(f"{model}.json", "r") as file:
        config = json.load(file)
    return config


def generate_tokens(config):
    id_map = config["phoneme_id_map"]
    with open("tokens.txt", "w", encoding="utf-8") as f:
        for s, i in id_map.items():
            f.write(f"{s} {i[0]}\n")
    print("Generated tokens.txt")


def main():
    # Caution: Please change the filename
    filename = "en_US-amy-low.onnx"

    config = load_config(filename)

    print("generate tokens")
    generate_tokens(config)

    # NOTE: The meta data keys below follow the usual piper-conversion
    # convention used by sherpa-onnx; verify against the scripts in the
    # sherpa-onnx repo if anything differs.
    print("add model metadata")
    meta_data = {
        "model_type": "vits",
        "comment": "piper",  # must be "piper" for models from piper
        "language": config["language"]["name_english"],
        "voice": config["espeak"]["voice"],  # e.g., en-us
        "has_espeak": 1,
        "n_speakers": config["num_speakers"],
        "sample_rate": config["audio"]["sample_rate"],
    }
    print(meta_data)
    add_meta_data(filename, meta_data)


main()
After running the above script, your en_US-amy-low.onnx is updated with meta data and a new file tokens.txt is generated.
From now on, you don’t need the config JSON file en_US-amy-low.onnx.json any longer.
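Hint: To verify that the meta data was written, you can load the model again and print its metadata_props using the standard onnx Python API:
import onnx

model = onnx.load("en_US-amy-low.onnx")
print({p.key: p.value for p in model.metadata_props})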
Download espeak-ng-data
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/espeak-ng-data.
˓→tar.bz2
tar xf espeak-ng-data.tar.bz2
Note that espeak-ng-data.tar.bz2 is shared by all models from piper, no matter which language you are using for your model.
Please refer to Installation to install sherpa-onnx, and then use the following commands to test your model:
which sherpa-onnx-offline-tts
sherpa-onnx-offline-tts \
--vits-model=./en_US-amy-low.onnx \
--vits-tokens=./tokens.txt \
--vits-data-dir=./espeak-ng-data \
--output-filename=./test.wav \
"How are you doing? This is a text-to-speech application using next generation Kaldi."
8.30.5 MMS
Install dependencies
Suppose we want to convert the English model; we can use the following commands to download it:
name=eng
wget -q https://huggingface.co/facebook/mms-tts/resolve/main/models/$name/G_100000.pth
wget -q https://huggingface.co/facebook/mms-tts/resolve/main/models/$name/config.json
wget -q https://huggingface.co/facebook/mms-tts/resolve/main/models/$name/vocab.txt
pushd MMS/vits/monotonic_align

python3 setup.py build

ls -lh build/
ls -lh build/lib*/
ls -lh build/lib*/*/

cp build/lib*/vits/monotonic_align/core*.so .

popd
Please save the following code into a file named ./vits-mms.py:
#!/usr/bin/env python3
# NOTE: Parts of this script were cut off in the source document; the token
# generation, checkpoint loading, and export details below follow the usual
# sherpa-onnx VITS export recipe. Verify against the sherpa-onnx repo if
# anything differs.

import collections
import os
from typing import Any, Dict

import onnx
import torch
from vits import commons, utils
from vits.models import SynthesizerTrn


class OnnxModel(torch.nn.Module):
    """Wraps SynthesizerTrn so that only the inference path is exported."""

    def __init__(self, model: SynthesizerTrn):
        super().__init__()
        self.model = model

    def forward(
        self,
        x,
        x_lengths,
        noise_scale=0.667,
        length_scale=1.0,
        noise_scale_w=0.8,
    ):
        return self.model.infer(
            x=x,
            x_lengths=x_lengths,
            noise_scale=noise_scale,
            length_scale=length_scale,
            noise_scale_w=noise_scale_w,
        )[0]


def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    """Add meta data to an ONNX model. It is changed in-place.

    Args:
      filename:
        Filename of the ONNX model to be changed.
      meta_data:
        Key-value pairs.
    """
    model = onnx.load(filename)
    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)


def load_vocab():
    return [
        x.replace("\n", "") for x in open("vocab.txt", encoding="utf-8").readlines()
    ]


@torch.no_grad()
def main():
    hps = utils.get_hparams_from_file("config.json")
    is_uroman = hps.data.training_files.split(".")[-1] == "uroman"
    if is_uroman:
        raise ValueError("We don't support uroman!")

    symbols = load_vocab()

    # Write one "symbol id" pair per line
    print("generate tokens.txt")
    with open("tokens.txt", "w", encoding="utf-8") as f:
        for i, s in enumerate(symbols):
            f.write(f"{s} {i}\n")

    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model,
    )
    net_g.cpu()
    _ = utils.load_checkpoint("G_100000.pth", net_g, None)
    _ = net_g.eval()

    model = OnnxModel(net_g)

    opset_version = 13
    filename = "model.onnx"

    # Dummy inputs used only for tracing
    x = torch.randint(low=0, high=len(symbols), size=(50,), dtype=torch.int64)
    x = x.unsqueeze(0)
    x_length = torch.tensor([x.shape[1]], dtype=torch.int64)

    torch.onnx.export(
        model,
        (x, x_length),
        filename,
        opset_version=opset_version,
        input_names=["x", "x_length"],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_length": {0: "N"},
            "y": {0: "N", 2: "L"},
        },
    )

    lang = os.environ.get("lang", "eng")
    add_meta_data(
        filename,
        {
            "model_type": "vits",
            "comment": f"mms-{lang}",
            "language": lang,
            "sample_rate": hps.data.sampling_rate,
        },
    )


main()
export PYTHONPATH=$PWD/MMS:$PYTHONPATH
export PYTHONPATH=$PWD/MMS/vits:$PYTHONPATH
export lang=eng
python3 ./vits-mms.py
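Before wiring the exported model into sherpa-onnx, you can smoke-test model.onnx with onnxruntime. This is a minimal sketch; the input names match the export script above, so adjust them if your script differs:
import numpy as np
import onnxruntime as ort

# Load the exported model on CPU and run a dummy token sequence through it
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.random.randint(0, 10, size=(1, 50)).astype(np.int64)
x_length = np.array([50], dtype=np.int64)
(y,) = sess.run(None, {"x": x, "x_length": x_length})
print(y.shape)  # roughly (1, 1, num_samples)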
We can use the converted model with the following command after installing sherpa-onnx.
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./model.onnx \
--vits-tokens=./tokens.txt \
--debug=1 \
--output-filename=./mms-eng.wav \
"How are you doing today? This is a text-to-speech application using models from␣
˓→facebook with next generation Kaldi"
The above command does NOT require you to install a C++ compiler and it supports a variety of platforms, such as:
• Linux
– x64
– arm, e.g., 32-bit Raspberry Pi
– arm64, e.g., 64-bit Raspberry Pi
• Windows
– x64, e.g., 64-bit Windows
– x86, e.g., 32-bit Windows
• macOS
– x64
– arm64, e.g., M1 and M2 chips
If you want to build the sherpa-onnx Python package from source, please refer to Install the Python Package.
After installation, please refer to https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/offline-tts.py for example usage.
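Hint: As a quick orientation, the core of that example reduces to a few lines. This is a minimal sketch, assuming the config field names used by offline-tts.py; the model paths below are placeholders for whichever VITS model you downloaded:
#!/usr/bin/env python3
# Minimal sketch of the Python TTS API; see python-api-examples/offline-tts.py
# for the complete, up-to-date example.
import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx",
            tokens="./vits-piper-en_US-lessac-medium/tokens.txt",
            data_dir="./vits-piper-en_US-lessac-medium/espeak-ng-data",
        ),
        num_threads=2,
    ),
)
tts = sherpa_onnx.OfflineTts(config)

audio = tts.generate("How are you doing?", sid=0, speed=1.0)
sf.write("test.wav", audio.samples, samplerate=audio.sample_rate)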
Hint: pip install sherpa-onnx also installs an executable sherpa-onnx-offline-tts. The directory where
it is installed should be already on your PATH after you activate your Python virtual environment.
You can run
sherpa-onnx-offline-tts --help
9 Triton
Nvidia Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.
The following content describes how to deploy ASR models trained by icefall using Triton.
9.1 Installation
We prepare a dockerfile based on the official Triton docker containers. The customized dockerfile integrates Triton-server, Triton-client and sherpa-related requirements into a single image. You need to install Docker before starting the installation.
Hint: For your production environment, you could build Triton manually to reduce the size of the container.
Note: It may take a lot of time since we build k2 from source. If you only need the greedy search scorer, you could comment out the k2-related lines.
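For reference, building the image typically looks like the following; this is a sketch, and the dockerfile path inside the sherpa repo is an assumption here, so please check the triton/ directory of the repo:
git clone https://github.com/k2-fsa/sherpa
cd sherpa/triton
docker build . -f Dockerfile/Dockerfile.server -t sherpa_triton_server:latest --network host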
docker run --gpus all --name sherpa_server --net host --shm-size=1g -it sherpa_triton_
˓→server:latest
9.2 Triton-server
This page gives several examples of deploying streaming and offline ASR pre-trained models with Triton server.
export SHERPA_SRC=./sherpa
export ICEFALL_SRC=/workspace/icefall
# copy essentials
cp $SHERPA_SRC/triton/scripts/*onnx*.py $ICEFALL_SRC/egs/wenetspeech/ASR/pruned_transducer_stateless5/
cd $ICEFALL_SRC/egs/wenetspeech/ASR/
# download pretrained models
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/luomingshuang/icefall_asr_
˓→wenetspeech_pruned_transducer_stateless5_streaming
cd ./icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming
git lfs pull --include "exp/pretrained_epoch_7_avg_1.pt"
cd -
# export to onnx fp16
ln -s ./icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming/exp/pretrained_
˓→epoch_7_avg_1.pt ./icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming/exp/
˓→epoch-999.pt
./pruned_transducer_stateless5/export_onnx.py \
--exp-dir ./icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming/exp \
--tokenizer-file ./icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming/
˓→data/lang_char \
--epoch 999 \
--avg 1 \
--streaming-model 1 \
--causal-convolution 1 \
--onnx 1 \
--left-context 64 \
--right-context 4 \
--fp16
Note: For Chinese models, --tokenizer-file points to <pretrained_dir>/data/lang_char. For English models, it points to the <pretrained_dir>/data/lang_bpe_500/bpe.model file.
Then, in the docker container, you could start the service with:
cd sherpa/triton/
bash scripts/start_streaming_server.sh
Caution: Currently, we only support FP32 offline ASR inference for torchscript backend. Streaming ASR and
FP16 inference are not supported.
export SHERPA_SRC=./sherpa
export ICEFALL_SRC=/workspace/icefall
# Download pretrained models
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-04-29 $ICEFALL_SRC/egs/librispeech/ASR/pruned_stateless_transducer3/
cd icefall-asr-librispeech-pruned-transducer-stateless3-2022-04-29
git lfs pull --include "exp/pretrained-epoch-25-avg-7.pt"
# export them to three jit models: encoder_jit.pt, decoder_jit.pt, joiner_jit.pt
cp $SHERPA_SRC/triton/scripts/conformer_triton.py $ICEFALL_SRC/egs/librispeech/ASR/pruned_stateless_transducer3/
cp $SHERPA_SRC/triton/scripts/export_jit.py $ICEFALL_SRC/egs/librispeech/ASR/pruned_stateless_transducer3/
cd $ICEFALL_SRC/egs/librispeech/ASR/pruned_stateless_transducer3
python3 export_jit.py \
  --pretrained-model $ICEFALL_SRC/egs/librispeech/ASR/pruned_stateless_transducer3/icefall-asr-librispeech-pruned-transducer-stateless3-2022-04-29 \
cp <bpe_model_path> <jit_model_dir>
Note: If you export models outside the docker container, you could mount the exported <jit_model_dir> with -v <host_dir>:<container_dir> when launching the container.
Then, in the docker container, you could start the service with:
cd sherpa/triton/
bash scripts/start_offline_server_jit.sh
9.3 Triton-client
cd sherpa/triton/client
# Test one audio using offline ASR
python3 client.py --audio_file=./test_wavs/1089-134686-0001.wav --url=localhost:8001
The above command sends a single audio 1089-134686-0001.wav to the server and gets the result. The --url option specifies the IP and port of the server. In this example, the server and client run on the same machine, so the IP is localhost, and we use port 8001 since it is the default gRPC port for Triton.
You can also test a batch of audios together with the client. Just specify the path of wav.scp with the --wavscp option, set the test set directory with the --data_dir option, and set the ground-truth transcript file with the --trans option; the client will then decode all audios in the test set and calculate the WER on it, as shown in the sketch below.
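For instance, a batch-decoding run might look like the following sketch; the option names are as described above, but the test-set layout is an assumption:
python3 client.py \
  --wavscp=./test_data/wav.scp \
  --data_dir=./test_data \
  --trans=./test_data/trans.txt \
  --url=localhost:8001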
You could also decode a whole dataset to benchmark metrics, e.g., RTF and WER.
Caution: Decoding manifests in simulated streaming mode will be supported in the future.
cd sherpa/triton/client
python3 decode_manifest_triton.py \
--server-addr localhost \
--num-tasks 300 \
--log-interval 20 \
--model-name transducer \
--manifest-filename ./aishell-test-dev-manifests/data/fbank/aishell_cuts_test.jsonl.
˓→gz \
--compute-cer
9.4 Perf analyzer
We can use the perf_analyzer tool provided by Triton to test the performance of the service.
cd sherpa/triton/client
# en
python3 generate_perf_input.py --audio_file=test_wavs/1089-134686-0001.wav
# zh
python3 generate_perf_input.py --audio_file=test_wavs/zh/mid.wav
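The exact perf_analyzer invocation is not shown here; a typical run looks roughly like the sketch below. The flags are standard perf_analyzer options, but the model name and the JSON file produced by generate_perf_input.py are assumptions:
perf_analyzer -m transducer -b 1 -a -p 20000 \
  --concurrency-range 300 \
  --input-data=offline_input.json \
  -u localhost:8001 -i gRPC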
Concurrency | Inferences/Second | Client Send | Network+Server Send/Recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | p50 latency | p90 latency | p95 latency | p99 latency
300 | 226.24 | 109 | 230434 | 1 | 9314 | 1068792 | 14512 | 1 | 1254206 | 1616224 | 1958246 | 3551406
(All latency-related columns are in microseconds.)
9.5 TensorRT
This page shows how to use a TensorRT engine to accelerate inference speed for k2 models.
9.5.1 Preparation
First of all, you have to install TensorRT. We suggest you use a docker container to run TRT. Just run the following command:
docker run --gpus '"device=0"' -it --rm --net host -v $PWD/:/k2 nvcr.io/nvidia/tensorrt:22.12-py3
Note: Please pay attention that the TRT version must be >= 8.5.3!
If your TRT version is < 8.5.3, you can download the desired TRT version and set it up inside the docker container to use the TRT you just downloaded.
You have to prepare the ONNX model by referring here to export your models into ONNX format. Assume you have put your ONNX model in the $model_dir directory. Then, just run the provided conversion command.
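If you prefer building the engine by hand, a bare-bones trtexec invocation looks roughly like this; trtexec ships with the TRT container, and note that real streaming encoders additionally need --minShapes/--optShapes/--maxShapes for their dynamic inputs:
trtexec \
  --onnx=$model_dir/encoder.onnx \
  --saveEngine=$model_dir/encoder.trt \
  --fp16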
The generated TRT model will be saved into $model_dir/encoder.trt. We also give an example model_repo for the TRT model. You can follow the same procedure as described here to deploy the pipeline using Triton.