Speech Recognition With Vosk

Blog about speech technologies: recognition, synthesis, identification. Mostly it's about the scientific part: the core design of the engines, new methods, and machine learning. It also covers the technical part, such as the architecture of the recognizer and the design decisions behind it.

Feed: https://alphacephei.com/nsh/feed.xml (updated 2026-05-26T00:38:55+02:00)

In-depth evaluation of ASR engines
Nickolay Shmyrev, 2026-05-24T23:00:00+02:00
https://alphacephei.com/nsh/2026/05/24/asr-details

Factorizing E2E on acoustic and language models
Nickolay Shmyrev, 2026-02-23T22:00:00+01:00
https://alphacephei.com/nsh/2026/02/23/am-lm-factor

Failure of SSL
Nickolay Shmyrev, 2025-12-13T22:00:00+01:00
https://alphacephei.com/nsh/2025/12/13/failure-of-ssl

Open models for Russian speech recognition 2025
Nickolay Shmyrev, 2025-04-18T23:00:00+02:00
https://alphacephei.com/nsh/2025/04/18/russian-models

* Nvidia RNNT Fastconformer Large
* Nemo Parakeet V3
* Nemo Canary V2
* Whisper Large V3
* Whisper V3 Turbo
* Whisper Podlodka Turbo
* GigaAM
* T-one
* Vikhr Borealis

Write to us if you know of a good model that could be tested.

Experiments with correction of speech recognition output with LLMs
Nickolay Shmyrev, 2025-03-15T22:00:00+01:00
https://alphacephei.com/nsh/2025/03/15/generative-error-correction

Some notable papers:

* Large language model based generative error correction: A challenge and baselines for speech recognition, speaker tagging, and emotion recognition
* Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction

Overall, generative error correction (GEC) results are somewhat controversial because most experiments are on book-sourced texts, and LLMs know book texts very well:

[Most SpeechLLMs are trained on the test sets of common speech datasets](https://www.linkedin.com/posts/titouan-parcollet-b233a698_oh-and-by-the-way-most-speechllms-are-trained-activity-7298700339111735297-161j/)

We recently tried to rescore 5-best transcriptions of Russian telephony calls with LLMs. There are many LLMs to try. We tried ones that fit on an 8 GB card, plus Gemini Flash 2.0 Lite as a big model. We also tried an LLM fine-tuned specifically for GEC in Russian, [Meno Tiny](https://huggingface.co/bond005/meno-tiny-0.1).

Here is what our prompt looks like:

```
You need to edit and improve the output of speech recognition system.
Here are 5 variants of transcription of a support call. Calls are in Russian.
Speech recognizer makes mistakes, for example it uses "southpan" instead of "southpark".
First variant is most precise. The second variant recognizes proper names better than the others.
Correct mistakes and print most accurate transcription using the context, grammar and knowledge about phonetics.
You need to provide only one answer.
Number of lines in the answer must match the number of lines in the first variant.

# Example input:

1. what is the price for the house ok good i got it goodbye
2. what is the price for the horse ok good i've got it ok goodbye
3. what is the price for the house ok great i've got it goodbye
4. what is the price for the house ok great i've got it ok goodbye
5. what is the price for the house ok great i've got it ok goodbye

# Example output:

what is the price for the house ok good i've got it ok goodbye

# Input:

1.
2.
....

# Output:
```

Here are our approximate results:

{:class="table table-bordered"}
| Model                                        | WER, %        |
|----------------------------------------------|---------------|
| 1-best ASR                                   | 15.9          |
| 5-best ROVER                                 | 14.8          |
| Qwen2.5-7B-Instruct-1M-Q4_K_M                | 100+ unstable |
| vikhr-llama3.1-8b-instruct-r-21-09-24-q4_k_m | 40+ unstable  |
| vikhr-yandexgpt-5-lite-8b-it-q4_k_m          | unstable      |
| meno-tiny-0.1-fp16                           | 40+ unstable  |
| gemma-2-9b-it-Q4_K_M                         | 16.0          |
| google_gemma-3-4b-it-Q8_0                    | 16.7          |
| Gemini Flash 2.0 Lite                        | 14.6          |
| Gemini Flash 2.0 Lite English prompt         | 14.7          |
| Gemini Flash 2.0 Lite 10-line chunks         | 14.8          |

Some of our observations:

1. Most 8B models at 4-bit quantization are not very stable; hallucinations are present in about 25% of cases. Qwen is very unstable for this task.
2. Gemma2 and Gemma3 are OK; we have yet to try the 27B version.
3. The simple prompt from the papers certainly doesn't work. One has to provide much more detail and describe specific issues in the prompt. We have yet to work on the prompt more.
4. Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16%.
5. For now GEC doesn't seem like a breakthrough technology; it seems some extra sauce is needed. Simple ROVER is equally good and much more stable.
6. We discussed on the channel with iLa that an English prompt helps for non-English languages. I think it is possible for some models, but I can't confirm it in experiments.
7. For the big model, splitting the input doesn't help much.
8. There is still a lot of overcorrection of proper names, which are rare and unknown to the LLM, and overcorrection of grammar. We need to work more on it.
9. The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.
10. Most models have very poor knowledge of rare domains and poor knowledge of speech (phonetics).

So the results are interesting, and more work is needed. Eventually we can make a real benchmark from this; it is actually interesting which LLM performs best here.

PS. Sreyan Ghosh on Twitter suggested to me the following paper, which stresses the issues with named entity recognition for GEC. Right on the subject:

Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
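
For readers who want to reproduce numbers like the WER figures in this post, here is a minimal sketch of corpus-level WER based on word-level Levenshtein distance. It is an illustration only, not the scoring script used for the experiments; the function names `edit_distance` and `wer` are made up for this example, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return 100.0 * errors / words


if __name__ == "__main__":
    ref = ["what is the price for the house ok good i've got it ok goodbye"]
    hyp = ["what is the price for the horse ok good i've got it ok goodbye"]
    print(f"WER: {wer(ref, hyp):.1f}%")
```

Because insertions count as errors, a model that hallucinates long extra text can score above 100%, which is how entries like "100+" can appear in a WER table.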

Experiments with solvers and decoding-time guidance in flow matching
Nickolay Shmyrev, 2025-01-17T22:00:00+01:00
https://alphacephei.com/nsh/2025/01/17/guidance

Here are our experiments with guided sampling and different solvers.

![Guided sampling](/img/blog/guided-sampling.png){: width="600" }

As you can see, like any regularization method, guided sampling helps to reduce artifacts and improve clarity (CER is reduced). It also significantly reduces expressiveness (FAD increases significantly). However, simply reducing the temperature has a similar effect. The question then is why we should spend compute time on guided sampling. I've seen many times that researchers propose some new regularization method but never consider the alternatives.

As for solvers, I don't see any effect from the 2nd-order Heun solver. Maybe diffusion has to be fixed first (replaced with DiT).

By the way, the default VITS temperature of 0.8 is pretty high and often leads to artifacts. I've heard many times in discussions that production people use lower values, down to 0.2-0.3. The voice is not as expressive, but artifacts are significantly reduced.

Also, Matcha/VITS have problems with modeling speakers. The next post is about that.

## Update 03.2025

By the way, the original paper [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598) claims that low-temperature mode results in bad samples and that CFG works better. That's for pictures; I'm not sure the same applies to TTS. But we ended up with CFG too. With a weight like 1.0 the quality of the results improves significantly and FAD doesn't degrade much.

Also, from the paper it is clear that CFG must be applied to both training and inference, not just inference. The NVIDIA paper is wrong here.
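
To make the three knobs discussed in this post concrete (guidance weight, solver order, temperature), here is a small self-contained sketch on a one-dimensional toy problem. Everything in it is an assumption for illustration: `v_cond` and `v_uncond` are hypothetical stand-ins for the conditional and unconditional outputs of a trained network, the guidance rule uses the `(1 + w) * cond - w * uncond` form from the classifier-free guidance paper, and the temperature is modeled as a scale on the initial noise. It is not the Matcha or VITS code.

```python
import math


def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: extrapolate away from the unconditional estimate."""
    return (1.0 + w) * v_cond - w * v_uncond


def euler_solve(velocity, x0, n_steps):
    """First-order Euler integration of dx/dt = velocity(x, t) over t in [0, 1]."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def heun_solve(velocity, x0, n_steps):
    """Second-order Heun integration: average the slopes at both ends of a step."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        k1 = velocity(x, t)
        k2 = velocity(x + dt * k1, t + dt)  # second network call per step
        x = x + 0.5 * dt * (k1 + k2)
    return x


# Hypothetical toy fields standing in for a trained network.
def v_cond(x, t):
    return 1.0 - x  # conditional branch pulls towards 1.0


def v_uncond(x, t):
    return 0.0 - x  # unconditional branch pulls towards 0.0


def v_guided(x, t, w=1.0):
    return guided_velocity(v_cond(x, t), v_uncond(x, t), w)


temperature = 0.8  # scales the initial noise
noise = 0.5        # fixed stand-in for a Gaussian sample
x0 = temperature * noise

x_euler = euler_solve(v_guided, x0, n_steps=10)
x_heun = heun_solve(v_guided, x0, n_steps=10)
# With w = 1.0 the guided field is 2.0 - x, which has a closed-form solution.
x_exact = 2.0 + (x0 - 2.0) * math.exp(-1.0)

print(f"euler={x_euler:.4f} heun={x_heun:.4f} exact={x_exact:.4f}")
```

On this smooth toy field Heun is closer to the exact solution at the same number of steps, but it costs two velocity evaluations per step, and guidance doubles the network calls again. Whether any of that pays off on a real TTS model is an empirical question, which is the point of the experiments above.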

Why discrete units
Nickolay Shmyrev, 2025-01-12T22:00:00+01:00
https://alphacephei.com/nsh/2025/01/12/discrete-units

It says

"This discretization allows models to operate within a discrete probability space, enabling the use of the crossentropy loss, in analogy to their application in language models. However, quantization methods typically require additional losses (e.g., commitment and codebook losses) during VAE training and may introduce a hyperparameter overhead. Secondly, continuous embeddings can encode information more efficiently than discrete tokens..."

While the first sentence states the core advantage of discrete units, the second does not align with the theory.

The point is that many distributions we try to model are significantly non-Gaussian. We will cover that later in detail, but that's a fact. And when we try to model non-Gaussian distributions with Gaussian models and L2 loss, we fail totally. Flow, flow matching, and diffusion models don't help here, since they still keep that Gaussian nature even if they try to approximate the target distribution. This is exactly the reason we need a discrete approximation and cross-entropy loss.

Given that, it is strange that the paper above tries to fight error accumulation but never mentions which distributions it tries to model, and only provides experimental evidence of the advantages.

## Hung-yi Lee talk

This whole idea was very nicely introduced to me in a talk by Professor Hung-yi Lee at Interspeech 2024.

Here is the image from the slides:

![Discrete Units](/img/blog/discrete.png){: width="600" }

This is probably a very obvious thing, but I haven't seen a paper that uses or mentions it consistently.

# Discrete units for duration

Why did this idea come to my mind again? Because we worked a bit more on VITS durations. Here is the histogram of overall duration values and their VITS predictions.

![Discrete Duration Units](/img/blog/discrete-durations.png){: width="600" }

As you can see, durations are clearly non-Gaussian, and a simple convnet has trouble modeling them. Not surprisingly, the flow model and the flow matching model have trouble too!

The solution is simple: let's consider durations as discrete units and use cross-entropy loss. Where did we see it already? In [StyleTTS2](https://arxiv.org/abs/2306.07691).

I must admit again, StyleTTS2 is a very advanced and well-thought-out architecture, and while it might not be obvious from the start, it definitely needs attention. Many things like advanced discriminators, ASR alignment, and durations come up again and again in my studies. StyleTTS2 never calls durations discrete, but the idea is the same. I'd predict the next generation of duration models will be all discrete.

This idea is used not just in StyleTTS. For example, in the recent paper [Total-Duration-Aware Duration Modeling for Text-to-Speech Systems](https://arxiv.org/abs/2406.04281) we also see Microsoft researchers consider discrete units and prove their advantage.

On top of that, our experiments with the StyleTTS2 duration model in Matcha TTS show very good results.

# Unit selection vs generative models

That non-Gaussian nature of things reminded me of the old story where everyone was discussing unit selection TTS vs HMM TTS. The first was more natural-sounding but less flexible; the second never sounded good but was really versatile.

Since we model something non-Gaussian, unit selection and patch mixing can provide really good results. So welcome back to the unit selection world. Papers like this appear quite often. One example is [KNN-VC](https://github.com/bshall/knn-vc).

And DiT/MaskGIT is exactly the thing here. They are on the rise, and it is nice to have a theory that confirms they are really reasonable.

# Why brain signals are discrete

Clearly some distributions are more Gaussian, some less. We need to understand their nature before we select the method. But the question arises why many of our distributions are discrete.

In my opinion there is a simple answer here: the mechanics of the brain. Since neurons have fairly fixed states, it is natural to think they are discrete. Somewhere I heard a 4-bit estimate from neuroscientists (so 4-bit LLMs probably make the most sense too).

So no wonder that our speech has a discrete nature as well. Something to keep in mind.
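
The duration argument in this post can be illustrated with a tiny numeric sketch. The data below is hypothetical (two clusters of durations in frames, not the histogram from the figure), and the "models" are context-free, so this only shows what each loss rewards, not how a real duration predictor behaves.

```python
import math
from collections import Counter

# Hypothetical toy data: durations in frames with two clusters (short and long).
durations = [2] * 50 + [3] * 20 + [10] * 25 + [11] * 5

# Under L2 loss the best single prediction is the mean of the data.
l2_prediction = sum(durations) / len(durations)

# Under cross-entropy over discrete classes the best distribution is the histogram.
counts = Counter(durations)
probs = {d: c / len(durations) for d, c in counts.items()}
ce_prediction = max(probs, key=probs.get)  # most likely duration class

# Cross-entropy (in nats) of the data under the fitted categorical distribution.
cross_entropy = -sum(math.log(probs[d]) for d in durations) / len(durations)

print(f"L2 prediction: {l2_prediction:.2f} frames, seen in data: {l2_prediction in counts}")
print(f"CE prediction: {ce_prediction} frames, cross-entropy: {cross_entropy:.3f} nats")
```

The L2 optimum lands between the two clusters, on a duration that never occurs in the data, while the categorical model keeps both modes and can be sampled or decoded to a duration that actually exists.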

Matcha TTS notes
Nickolay Shmyrev, 2025-01-03T22:00:00+01:00
https://alphacephei.com/nsh/2025/01/03/matcha-tts-notes

TTS Design Thoughts
Nickolay Shmyrev, 2024-10-18T23:00:00+02:00
https://alphacephei.com/nsh/2024/10/18/tts-design

Evaluation of Russian TTS models
Nickolay Shmyrev, 2024-07-12T23:00:00+02:00
https://alphacephei.com/nsh/2024/07/12/russian-tts

Some information about the evaluation data:

* Audiobooks, about 100 speakers, about 1k utterances.

Some observations:

* Fastspeech2 methods still show the best clarity (Silero/Yandex/EdgeTTS). They are not very good at intonation, but their clarity is hard to beat. For the end user this really makes sense: you can deal with plain intonation, but artifacts are really annoying.
* The training database matters a lot. Even a small dataset gives very good results (CER and UTMOS) if the data is good (compare Piper Irina with the other Piper voices; there the training data is only 1 hour).
* Multi-voice systems seriously suffer from fuzziness (CER goes from 0.7 to 2.0+); something needs to be done about it.
* Tortoise is pretty good at intonation (as expected).
* It is necessary to add another metric responsible for the liveliness of speech (F0 correlation? duration?). FAD is relevant, but it only works for multi-voice systems.
* XTTS2 results are much worse than I expected, in both similarity and clarity of speech.
* A good metric to evaluate would be the diversity of speech generation. VITS, for example, is specifically optimized for diversity compared to Fastspeech. Something to implement in the future.