<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://alphacephei.com/nsh/feed.xml" rel="self" type="application/atom+xml" /><link href="https://alphacephei.com/nsh/" rel="alternate" type="text/html" /><updated>2026-05-26T00:38:55+02:00</updated><id>https://alphacephei.com/nsh/feed.xml</id><title type="html">Speech Recognition With Vosk</title><subtitle>Blog about speech technologies - recognition, synthesis, identification. Mostly it's about scientific part of it, the core design of the engines, the new methods, machine learning and about about technical part like architecture of the recognizer and design decisions behind it.</subtitle><entry><title type="html">In-depth evaluation of ASR engines</title><link href="https://alphacephei.com/nsh/2026/05/24/asr-details.html" rel="alternate" type="text/html" title="In-depth evaluation of ASR engines" /><published>2026-05-24T23:00:00+02:00</published><updated>2026-05-24T23:00:00+02:00</updated><id>https://alphacephei.com/nsh/2026/05/24/asr-details</id><content type="html" xml:base="https://alphacephei.com/nsh/2026/05/24/asr-details.html"><![CDATA[While everyone focuses on latency, there are many measurable aspects of
ASR that are easy to evaluate and have a significant impact on user
experience. Here are some of them:

1. **Hallucination rates from noisy inputs.** This is really useful for
estimating how robust your recognizer is. We just feed 1000 noisy samples
into every engine and measure how many words it produces. Some systems
hallucinate significantly, with insertions exceeding 100%, while some give
just a few extra words. Very easy to measure, very useful in practice.

2. **Recognition of short inputs.** Because everyone trains on longer samples,
short inputs sometimes get less attention, and as a result systems have
serious trouble recognizing a simple "yes". A huge thing for voicebots, much
more important than latency. We specifically feed real-life short data and
compare engines; results can differ a lot from system to system.

3. **Ability to identify non-speech sounds such as music and noises.** As
modern systems become more and more intelligent, users expect them
to react to a wide range of conditions. Even embedded systems need to
identify the surrounding environment properly. Very few engines are able to do that.

4. **Rare words problem.** Because end-to-end systems learn distributions
as is, hidden distribution mismatches become a subtle but serious
problem. For example, end-to-end systems learn to recognize frequent
words well, but our semantics obviously depends a lot on rare words:
names, street names, products, medications. End-to-end systems have
trouble with them, even the ones trained on millions of hours of data.
People use LLMs to estimate WER and talk about BERT WER, but the reality is
simple. You drop the 30k most frequent words and consider the substitution rate
for the remaining tail. Take the standard Tedlium test and two modern
English systems, Cohere and Qwen3-ASR.

|                        |  Qwen3-ASR    | Cohere        |
|------------------------|---------------|---------------|
| WER                    | 3.02          |  4.22         |
| Substitution rate      | 1.52          |  1.51         |
| Rare substitution rate | 14.69         |  11.88        |

As you can see, even top systems with different architectures (encoder-decoder and
LLM) make mistakes on rare words much more often than on common words. A
paper like [Improving accuracy of rare words for RNN-Transducer through unigram shallow fusion](https://arxiv.org/abs/2012.00133) might be a
great start, as well as subsequent publications.
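
The rare-word measurement described above (drop the most frequent words, then look at substitutions on the remaining tail) can be sketched roughly as follows. This is a minimal illustration and not the script behind the table: `align`, `rare_substitution_rate` and the tiny `frequent` set are hypothetical names, and a real run would load the 30k most frequent words from a corpus frequency list.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists; return (op, ref_word, hyp_word) tuples."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def rare_substitution_rate(pairs, frequent_words):
    """Substitution rate (%) over reference words outside `frequent_words`."""
    rare_total = rare_subs = 0
    for ref, hyp in pairs:
        for op, ref_word, _ in align(ref.split(), hyp.split()):
            if ref_word is None or ref_word in frequent_words:
                continue  # insertions and frequent words are not counted
            rare_total += 1
            rare_subs += op == "sub"
    return 100.0 * rare_subs / rare_total if rare_total else 0.0


# Toy stand-in for the list of most frequent words (an assumption).
frequent = {"the", "patient", "takes", "daily"}
pairs = [("the patient takes metoprolol daily", "the patient takes metropolis daily")]
print(rare_substitution_rate(pairs, frequent))  # 100.0
```

The only design choice here is that the rate is normalized by the number of rare reference words, so deletions of rare words enter the denominator but not the numerator.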

Recently we worked a lot on Russian ASR, and we can now introduce a new version,
Vosk 0.62, focused exactly on the problems above. You might see that the accuracy
didn't improve much, but performance in noise, on rare words and in music
detection improved significantly.

| Dataset               | Vosk 0.54 | Vosk 0.62        |  Vosk Small Streaming 0.54 | Vosk Small Streaming 0.62 | GigaAM3 RNNT | T-one CTC + LM Streaming |
|-----------------------|-----------|------------------|----------------------------|---------------------------|--------------|--------------------------|
| Audiobooks            | 1.2       | 1.1              | 4.1                        |  4.0                      | 3.6          | 5.8                      |
| Ru Librispeech        | 9.4       | 8.4              | 14.4                       |  14.1                     | 4.4          | 6.2                      |
| CommonVoice 12.0      | 6.1       | 5.8              | 11.2                       |  10.9                     | 2.6          | 5.5                      |
| Golos Crowd           | 3.1       | 3.1              | 5.5                        |  5.6                      | 2.6          | 5.6                      |
| Golos Farfield        | 6.2       | 6.5              | 10.1                       |  10.6                     | 4.3          | 12.5                     |
| Sova Devices          | 11.6      | 11.8             | 14.7                       |  15.4                     | 10.2         | 10.1                     |
| TV Broadcast          | 16.6      | 16.6             | 19.8                       |  20.9                     | 12.0         | 19.5                     |
| Medical               | 15.7      | 14.3             | 17.9                       |  18.7                     | 9.0          | 17.1                     |
| Short commands        | 4.4       | 4.8              | 7.1                        |  6.4                      | 3.2          | 12.2                     |
| Callcenter orders     | 20.0      | 19.6             | 27.9                       |  29.7                     | 14.6         | 18.5                     |
| Callcenter support    | 12.9      | 12.8             | 16.8                       |  17.2                     | 12.6         | 14.8                     |
|-----------------------|-----------|------------------|----------------------------|---------------------------|--------------|--------------------------|
| Average               | 9.74      | 9.53             | 13.95                      |  13.94                    | 7.19         | 11.62                    |
|-----------------------|-----------|------------------|----------------------------|---------------------------|--------------|--------------------------|


Extra features

| Dataset               | Vosk 0.54 | Vosk 0.62        |  Vosk Small Streaming 0.54 | Vosk Small Streaming 0.62 | GigaAM3 RNNT | T-one CTC + LM Streaming |
|-----------------------|-----------|------------------|----------------------------|---------------------------|--------------|--------------------------|
| Noises                | 99.03     | 29.59            | 48.38                      | 40.04                     | 79.90        | 14.10                    |
|-----------------------|-----------|------------------|----------------------------|---------------------------|--------------|--------------------------|
| Music/Noise           | -         |  +               | -                          | +                         | -            | -                        |
|-----------------------|-----------|------------------|----------------------------|---------------------------|--------------|--------------------------|
| Rare words            | 28.39     | 25.7             | 35.44                      | 34.82                     | 24.44        | 38.91                    |


Despite not having the top accuracy, we still see our models improve user experience
significantly.
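
For the noise test, a rough sketch of the bookkeeping is shown below. The post does not spell out how the "Noises" row is normalized, so this sketch simply reports two plausible numbers: the share of noise-only clips that produced any words and the average number of words per clip. The function name and the returned keys are assumptions made for illustration.

```python
def hallucination_stats(transcripts):
    """Summarize recognizer output on clips that contain no speech.

    `transcripts` holds one decoded string per noise-only clip, so every
    produced word counts as a hallucinated insertion.
    """
    word_counts = [len(text.split()) for text in transcripts]
    clips = len(word_counts)
    if clips == 0:
        return {"clips": 0, "clips_with_words_pct": 0.0, "words_per_clip": 0.0}
    non_empty = sum(1 for count in word_counts if count > 0)
    return {
        "clips": clips,
        "clips_with_words_pct": 100.0 * non_empty / clips,
        "words_per_clip": sum(word_counts) / clips,
    }


# Four noise clips, two of which triggered some output.
print(hallucination_stats(["", "thank you", "", "yes"]))
# {'clips': 4, 'clips_with_words_pct': 50.0, 'words_per_clip': 0.75}
```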

Latency and WER are, of course, still important, and we need to
incorporate latency more systematically into our standard tests.
Hopefully, we will also develop additional approaches for handling rare
words, although the problem requires much deeper architectural changes and
brings us back to the LM integration issues discussed in the previous
post. More on it later.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[While everyone focuses on latency, there are many measurable aspects of ASR that are easy to evaluate and have a significant impact on user experience. Here are some of them:]]></summary></entry><entry><title type="html">Factorizing E2E on acoustic and language models</title><link href="https://alphacephei.com/nsh/2026/02/23/am-lm-factor.html" rel="alternate" type="text/html" title="Factorizing E2E on acoustic and language models" /><published>2026-02-23T22:00:00+01:00</published><updated>2026-02-23T22:00:00+01:00</updated><id>https://alphacephei.com/nsh/2026/02/23/am-lm-factor</id><content type="html" xml:base="https://alphacephei.com/nsh/2026/02/23/am-lm-factor.html"><![CDATA[While end-to-end speech recognition systems are dominating leaderboards,
it's still valuable to consider the acoustic and language models
separately. This separation is present in the network itself: the lower layers
handle acoustic information, filtering out noise, while the
higher layers encode linguistic patterns, including cross-word
dependencies. For various reasons, it remains beneficial to factorize
large models into these acoustic and language subcomponents, and into some
other components too, such as speaker identity.

A clear example is cross-lingual models. While the ideal is to learn
everything from the acoustic features, in practice, mastering the
language component is just as critical. This includes handling proper
names, city names, and other language-specific elements. Achieving this
requires massive datasets for each language, not just a large English
corpus with small amounts of data for others.

For instance, when we evaluated Qwen3-ASR-1.7B for Russian, the results
were disappointing. The model struggled to recognize Russian words, even
simple, common names, despite being trained on millions of hours of data.
This illustrates the importance of adequately addressing the language
component in training.

Another challenge is domain transfer. If we have millions of hours of
data from a real estate call center, can we repurpose it for banking
assistants? Unfortunately, the answer is no. It is not easy to replace
the internal LM without real data. This means that even with massive
datasets, there will always be gaps, especially in niche domains like
medical, where specialized systems like Google's MedASR outperform
general models.

While the acoustic and language factors are distinct, perfect accuracy
depends on their deep integration. This is why I'm skeptical about audio
LLMs. Without a robust feedback loop between the acoustic layer and the
LLM, these models won't achieve optimal robustness. More importantly,
testing should focus on noisy, real-world conditions, not just clean
datasets like Librispeech or Fleurs.

Currently, most frameworks don't account for explicit separation of
acoustic and language models. Hopefully, we'll see such tools in the
future.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[While end-to-end speech recognition systems are dominating leaderboards, it’s still valuable to consider the separate acoustic and language models. This separation present in the network as the lower layers of the network handle acoustic information, filtering out noise, while the higher layers encode linguistic patterns, including cross-word dependencies. For various reasons, it remains beneficial to factorize large models into these acoustic and language subcomponents and some other components too like speaker identity.]]></summary></entry><entry><title type="html">Failure of SSL</title><link href="https://alphacephei.com/nsh/2025/12/13/failure-of-ssl.html" rel="alternate" type="text/html" title="Failure of SSL" /><published>2025-12-13T22:00:00+01:00</published><updated>2025-12-13T22:00:00+01:00</updated><id>https://alphacephei.com/nsh/2025/12/13/failure-of-ssl</id><content type="html" xml:base="https://alphacephei.com/nsh/2025/12/13/failure-of-ssl.html"><![CDATA[The recent release of the FAIR omnilingual model, LeCun news and active use of
wav2vec in "semantics" made me think again about SSL in speech.

This is going to be a brick in speech technology science. The thing is
that the premise of SSL is that one can train a stable baseline
multilingual model with just acoustics and then finetune it for many
languages. However, all our experience with end-to-end models
demonstrates that effective and robust speech recognition and
understanding require tight integration between the acoustic and
understanding (language model) layers, just like in Whisper.

This means that you cannot sort things out from the acoustics alone. And
most of the multilingual models fail for exactly this reason. Despite
claims to support 1000 languages, real support for language specifics
is limited to the 3-5 major ones, and most others are not well supported. For
example, omnilingual 300m [performs really badly for
Swedish](https://github.com/alphacep/awesome-speech/blob/main/swedish.md),
a fairly well-resourced language. Whisper is not good for Italian; language-specific
conformers are much better. Essentially there is no well-defined
universal phonetic alphabet (like IPA), allophones are very context
dependent and the context is large (up to semantic layers).

On top of that, many multilingual models (wav2vec, hubert, xls-r, mms,
seamless, omnilingual) are not trained for robustness (noise
augmentation, telephony augmentation, etc). Only WavLM has that sort of
multi-environment training, as Jinyu Li's recent
[talk](https://www.youtube.com/watch?v=dJIQoZ3uxsk) explained; others are
usually focused on simple uniform inputs like Fleurs.

As a result, you need to finetune wav2vec A LOT to get good accuracy,
essentially training it from scratch (kudos to our friend Anton Nekrasov
who mentions that frequently).

My estimate is that models of modern size can effectively learn 3-4
languages at once, not 1000. To learn 1000 languages one needs 100B
models, and those models have to have good speech understanding of each
language they claim to support.

Given that, one can think about methods for training multilingual models
which are modular enough to support many languages well and at least have
some means to inject the language model quickly. For example, traditional
semi-supervised learning that integrates some sort of language model into
the learning process makes much more sense actually.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[The recent release of the FAIR omnilingual model, LeCun news and active use of wav2vec in “semantics” made me think again about SSL in speech.]]></summary></entry><entry><title type="html">Открытые модели для распознавания русской речи 2025</title><link href="https://alphacephei.com/nsh/2025/04/18/russian-models.html" rel="alternate" type="text/html" title="Открытые модели для распознавания русской речи 2025" /><published>2025-04-18T23:00:00+02:00</published><updated>2025-04-18T23:00:00+02:00</updated><id>https://alphacephei.com/nsh/2025/04/18/russian-models</id><content type="html" xml:base="https://alphacephei.com/nsh/2025/04/18/russian-models.html"><![CDATA[Updated 14.09.2025:

 * Added the Vikhr Borealis model

Updated 17.08.2025:

  * Added Nemo Canary V2 and Whisper Podlodka Turbo

Updated 15.08.2025:

  * Added Nemo Parakeet V3

Updated 21.07.2025:

  * Added the streaming models Vosk Small Streaming 0.54 and t-tech/T-One

Previous versions: [2023](https://alphacephei.com/nsh/2023/01/22/russian-models.html), [2024](https://alphacephei.com/nsh/2024/04/14/russian-models.html)

We tested the available models for Russian speech recognition on a variety of datasets. There are quite a few interesting models, each with
its own specifics. So far the GigaAM2 CTC model is in the lead, which is rather unusual because RNNT is generally considered more accurate.

{:class="table table-bordered"}
| Dataset               | Vosk 0.54 | Vosk 0.54 LODR | Nemo RNNT Fastconformer | Nemo Parakeet TDT V3 |Nemo Canary V2 | Whisper Large V3 Transformers | Whisper v3 Turbo | Whisper Podlodka Turbo | GigaAM2 RNNT | GigaAM2 CTC + LM | Vosk Small Streaming 0.54 | T-one CTC + LM Streaming | Vikhr Borealis |
|-----------------------|-----------|----------------|-------------------------|----------------------|---------------|-------------------------------|------------------|------------------------|--------------|------------------|---------------------------|--------------------------|----------------|
| Audiobooks AC         | **1.2**   | 1.3            | 8.2                     |  6.9                 | 12.0          | 5.8                           | 6.5              | 6.9                    | 4.4          | 3.4              |  4.1                      |  5.8                     | 8.4           |
| Ru Librispeech        | 9.4       | 9.0            | 11.2                    | 10.6                 | 19.8          | 9.5                           | 9.7              | 9.3                    | 5.2          | **4.4**          |  14.4                     |  6.2                     | 5.9           |
| CommonVoice 12.0      | 6.1       | 5.6            | 5.9                     | 5.3                  | 8.7           | 5.5                           | 6.2              | 5.2                    | **2.6**      | 2.9              |  11.2                     |  5.5                     | 2.9           |
| Golos Crowd           | 3.1       | 3.0            | 2.7                     | 3.9                  | 9.2           | 14.7                          | 14.5             | 11.1                   | 2.5          | **2.2**          |  5.5                      |  5.6                     | 8.0           |
| Golos Farfield        | 6.2       | 5.9            | 7.1                     | 7.6                  | 15.6          | 17.6                          | 18.7             | 10.9                   | 4.4          | **4.1**          |  10.1                     |  12.5                    | 11.3          |
| Sova Devices          | 11.6      | 11.4           | 7                       | 16.2                 | 19.8          | 15.9                          | 16               | 14.0                   | **5.6**      | 8.3              |  14.7                     |  10.1                    | 14.5          |
| TV Broadcast          | 16.6      | 16.2           | 22.6                    | 18.0                 | 21.3          | 17.9                          | 18.2             | 19.7                   | 14.4         | **13.8**         |  19.8                     |  19.5                    | 22.7          |
| Medical               | 15.6      | 15.4           | 19.2                    | 13.4                 | 16.9          | 13.8                          | 13.7             | 10.8                   | 10.9         | **9.8**          |  17.9                     |  17.1                    | 17.3          |
| Yandex Commands       | 4.4       | 4.3            | 3.8                     | 19.5                 | 12.2          | 18.6                          | 21.8             | 11.2                   | **1.9**      | 3.4              |  7.1                      |  12.2                    | 8.7           |
| Callcenter orders     | 20.0      | 18.8           | 22.8                    | 32.5                 | 35.7          | 23.7                          | 24.8             | 21.8                   | 15.5         | **13.7**         |  27.9                     |  18.5                    | 29.5          |
| Callcenter support    | 12.9      | 12.6           | 23.8                    | 29.4                 | 34.3          | 26.8                          | 27.5             | 23.7                   | 14.2         | **12.4**         |  16.8                     |  14.8                    | 28.9          |
|-----------------------|-----------|----------------|-------------------------|----------------------|---------------|-------------------------------|------------------|------------------------|--------------|------------------|---------------------------|--------------------------|----------------|
| Average               | 11.02     | 10.69          | 13.95                   | 16.02                | 20.24         | 16.21                         | 16.84            | 13.78                  | 8.64         | **8.42**         |  14.67                    |  12.79                   | 15.99         |

Links to the models:

  * Vosk 0.54 <https://huggingface.co/alphacep/vosk-model-ru>
  * Nvidia RNNT Fastconformer Large <https://huggingface.co/nvidia/stt_ru_fastconformer_hybrid_large_pc>
  * Nemo Parakeet V3 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3>
  * Nemo Canary V2 <https://huggingface.co/nvidia/canary-1b-v2>
  * Whisper Large V3 <https://huggingface.co/openai/whisper-large-v3>
  * Whisper V3 Turbo <https://huggingface.co/openai/whisper-large-v3-turbo>
  * Whisper Podlodka Turbo <https://huggingface.co/bond005/whisper-podlodka-turbo>
  * GigaAM <https://github.com/salute-developers/GigaAM>
  * T-one <https://huggingface.co/t-tech/T-one>
  * Vikhr Borealis <https://huggingface.co/Vikhrmodels/Borealis>

Let us know if you are aware of a good model that we could test.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[Updated 14.09.2025]]></summary></entry><entry><title type="html">Experiments with correction of speech recognition output with LLMs</title><link href="https://alphacephei.com/nsh/2025/03/15/generative-error-correction.html" rel="alternate" type="text/html" title="Experiments with correction of speech recognition output with LLMs" /><published>2025-03-15T22:00:00+01:00</published><updated>2025-03-15T22:00:00+01:00</updated><id>https://alphacephei.com/nsh/2025/03/15/generative-error-correction</id><content type="html" xml:base="https://alphacephei.com/nsh/2025/03/15/generative-error-correction.html"><![CDATA[Generative error correction is a thing recently, there are many papers on that, even
a challenge:

<https://huggingface.co/GenSEC-LLM>

Some notable papers:

  * Large language model based generative error correction: A challenge and baselines for speech recognition, speaker tagging, and emotion recognition <https://arxiv.org/abs/2409.09785>
  * Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction <https://arxiv.org/abs/2407.16370>

Overall, GEC results are somewhat controversial because most experiments are on book-sourced texts, and LLMs know
book texts very well:

[Most SpeechLLMs are trained on the test sets of common speech datasets](https://www.linkedin.com/posts/titouan-parcollet-b233a698_oh-and-by-the-way-most-speechllms-are-trained-activity-7298700339111735297-161j/)

We recently tried to rescore 5-best transcriptions of Russian telephony
calls with LLMs. There are many LLMs to try. We tried ones that fit on an 8 GB
card, plus Gemini Flash 2.0 Lite as a big model. We also tried an LLM
finetuned specifically for GEC in Russian, [Meno Tiny](https://huggingface.co/bond005/meno-tiny-0.1).

Here is what our prompt looks like:

```
You need to edit and improve the output of speech recognition system.
Here are 5 variants of transcription of a support call.
Calls are in Russian.
Speech recognizer makes mistakes, for example it uses "southpan" instead of "southpark".
First variant is most precise.
The second variant recognizes proper names better than the others.
Correct mistakes and print most accurate transcription using the context, grammar and knowledge about phonetics.
You need to provide only one answer. Number of lines in the answer must match the number of lines in the first variant.

# Example input:

1.

what is the price for the house
ok good i got it
goodbye

2.

what is the price for the horse
ok good i've got it
ok goodbye

3.

what is the price for the house
ok great i've got it
goodbye

4.

what is the price for the house
ok great i've got it
ok goodbye

5.

what is the price for the house
ok great i've got it
ok goodbye

# Example output:

what is the price for the house
ok good i've got it
ok goodbye

# Input:

1.

<lines1>

2.

<lines2>

....

# Output:
```
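
Filling the `# Input:` part of such a prompt and checking the line-count requirement can be sketched as below. The function names are hypothetical, and falling back to the first variant when the line count does not match is our assumption for this sketch, not something the experiments above are stated to have done.

```python
def build_prompt(instructions, variants):
    """Append numbered n-best variants to the instruction text.

    `instructions` is the fixed part of the prompt (task description and
    example); `variants` is a list of n-best hypotheses, each a list of lines.
    """
    parts = [instructions.rstrip(), "", "# Input:", ""]
    for index, lines in enumerate(variants, start=1):
        parts += [f"{index}.", "", *lines, ""]
    parts.append("# Output:")
    return "\n".join(parts)


def accept_answer(answer, first_variant):
    """Return the answer lines if their count matches the first variant,
    otherwise fall back to the first variant (the 1-best transcription)."""
    lines = [line.strip() for line in answer.strip().splitlines() if line.strip()]
    return lines if len(lines) == len(first_variant) else list(first_variant)


variants = [["hello", "bye"], ["hallo", "bye"]]
print(build_prompt("Fix the transcript.", variants))
print(accept_answer("hello there\nok\nbye", variants[0]))  # ['hello', 'bye']
```

A mismatch in the number of lines is a cheap signal that the model went off track, which is why the check is done before any scoring.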

Here are our approximate results:

{:class="table table-bordered"}
| Model       | WER     |
|----------------------------------------------|---------|
| 1-best ASR                                   |  15.9   |
| 5-best ROVER                                 |  14.8   |
| Qwen2.5-7B-Instruct-1M-Q4_K_M           | 100+ unstable |
| vikhr-llama3.1-8b-instruct-r-21-09-24-q4_k_m | 40+ unstable |
| vikhr-yandexgpt-5-lite-8b-it-q4_k_m     | unstable |
| meno-tiny-0.1-fp16                      | 40+ unstable |
| gemma-2-9b-it-Q4_K_M                    | 16.0 |
| google_gemma-3-4b-it-Q8_0               | 16.7 |
| Gemini Flash 2.0 Lite                        | 14.6 |
| Gemini Flash 2.0 Lite English prompt         | 14.7 |
| Gemini Flash 2.0 Lite 10-line chunks         | 14.8 |


Some of our observations:

1. Most 8B models at 4-bit quantization are not very stable; hallucinations are present in about 25% of cases. Qwen is very unstable for this task.

2. Gemma2 and Gemma3 are ok; we have yet to try the 27B version.

3. The simple prompt from the papers certainly doesn't work. One has to
provide many more details and specific issues in the prompt. We have yet to work
on the prompt more.

4. Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16%.

5. For now GEC doesn't seem like a breakthrough tech; it seems some extra sauce is needed. Simple
ROVER is equally ok and much more stable.

6. We discussed on the channel with iLa that an English prompt helps for non-English languages. I think it is possible for some models, but I can't confirm it in experiments.

7. For a big model, splitting the input doesn't help much.

8. There is still a lot of overcorrection of proper names, which are rare and unknown to the LLM, and overcorrection of grammar.
We need to work more on it.

9. The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.

10. Most models have very poor knowledge of rare domains and poor knowledge of speech (phonetics).

So the results are interesting and more work is needed. Eventually we can make a real benchmark from this; it is actually interesting which LLM performs
the best here.
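
For readers unfamiliar with ROVER, the voting idea behind it can be illustrated with a much-reduced sketch. Real ROVER builds a word transition network from iteratively aligned hypotheses and can use confidence scores; the version below is only an approximation under stated simplifications: each hypothesis is aligned to the first one with `difflib`, and words inserted relative to the first hypothesis are ignored. The function name `vote` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote(hypotheses):
    """Word-level majority vote over n-best hypotheses.

    Simplification: every hypothesis is aligned to the first one only, and
    words inserted relative to the first hypothesis are ignored.
    """
    base = hypotheses[0].split()
    votes = [Counter({word: 1}) for word in base]
    for hyp in hypotheses[1:]:
        words = hyp.split()
        matcher = SequenceMatcher(a=base, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset][words[j1 + offset]] += 1
            elif tag == "delete":
                for i in range(i1, i2):
                    votes[i][""] += 1  # this hypothesis votes for "no word"
    winners = [counter.most_common(1)[0][0] for counter in votes]
    return " ".join(word for word in winners if word)


print(vote([
    "what is the price for the horse",
    "what is the price for the house",
    "what is the price for the house",
]))  # what is the price for the house
```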

PS. Sreyan Ghosh on Twitter pointed me to the following paper, which stresses the issues with named entity recognition in GEC. Right
on the subject:

Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

<https://arxiv.org/abs/2410.13198>]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[Generative error correction is a thing recently, there are many papers on that, even a challenge:]]></summary></entry><entry><title type="html">Experiments with solvers and decoding-time guidance in flow matching</title><link href="https://alphacephei.com/nsh/2025/01/17/guidance.html" rel="alternate" type="text/html" title="Experiments with solvers and decoding-time guidance in flow matching" /><published>2025-01-17T22:00:00+01:00</published><updated>2025-01-17T22:00:00+01:00</updated><id>https://alphacephei.com/nsh/2025/01/17/guidance</id><content type="html" xml:base="https://alphacephei.com/nsh/2025/01/17/guidance.html"><![CDATA[Some features are somewhat small and require few lines of code, not
really worth a conference paper or a poster. Still, they are somewhat
widespread. A blog post about them feels just right.

In the P-Flow paper [P-Flow: A Fast and Data-Efficient Zero-Shot TTS through
Speech Prompting](https://openreview.net/forum?id=zNA7u7wtIN) there is a
big section on guided sampling. The authors claim that pronunciation clarity
can be further enhanced by applying techniques from the classifier-free
guidance method.

The implementation is really simple; you just run the estimator on the mean and adjust the gradient:

<https://github.com/p0p4k/pflowtts_pytorch/blob/master/pflow/models/components/flow_matching.py#L168>
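
To make the idea concrete, here is a toy scalar sketch of guided sampling with Euler and Heun steps and a temperature on the initial noise. It is written in the spirit of the linked code, not copied from it: `toy_estimator` is a hypothetical stand-in for the trained network, and all names and default values are assumptions.

```python
import random


def guided_velocity(estimator, x, cond, cond_mean, t, scale):
    """Classifier-free-guidance style velocity for one ODE step.

    The conditional estimate is pushed away from the estimate obtained with
    an uninformative condition (here: the mean of the conditioning features).
    """
    v_cond = estimator(x, cond, t)
    if scale == 0.0:
        return v_cond
    v_mean = estimator(x, cond_mean, t)
    return v_cond + scale * (v_cond - v_mean)


def sample(estimator, cond, cond_mean, steps=10, scale=0.0,
           temperature=1.0, heun=False, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1, starting from scaled noise."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0) * temperature  # lower temperature = less initial noise
    dt = 1.0 / steps
    for step in range(steps):
        t = step * dt
        v = guided_velocity(estimator, x, cond, cond_mean, t, scale)
        if heun:
            # 2nd-order Heun: average the slope at the start and at the Euler prediction.
            v_next = guided_velocity(estimator, x + dt * v, cond, cond_mean, t + dt, scale)
            v = 0.5 * (v + v_next)
        x += dt * v
    return x


def toy_estimator(x, cond, t):
    """Hypothetical stand-in for the trained network: pulls x towards `cond`."""
    return cond - x


plain = sample(toy_estimator, cond=1.0, cond_mean=0.5, temperature=0.0)
guided = sample(toy_estimator, cond=1.0, cond_mean=0.5, scale=1.0, temperature=0.0)
print(plain, guided)
```

With this toy estimator the guidance term reduces to `scale * (cond - cond_mean)`, so guidance just pulls the sample further away from the mean condition. Guidance costs a second estimator call per step, while temperature only rescales the starting noise.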

Here are our experiments with guided sampling and different solvers.

![Guided sampling](/img/blog/guided-sampling.png){: width="600" }

As you see, like any regularization method it helps to reduce artifacts and
improve clarity (see that CER is reduced). It also significantly reduces
expressiveness (see that FAD is significantly increased). However, one can
see that simply reducing the temperature has a similar effect. The question
then is why we should spend compute time on guided sampling. I've seen
many times that researchers propose some new regularization method
but never consider the alternatives.

As for solvers, I don't see any effect from the 2nd-order Heun solver. Maybe
diffusion has to be fixed first (replaced with DiT).

By the way, the default VITS temperature of 0.8 is pretty high and often leads
to artifacts. I've heard many times in discussions that production guys
use lower values, down to 0.2-0.3. The voice is not that expressive, but
artifacts are significantly reduced.

By the way, Matcha/VITS also have problems with modeling speakers. The next post is about that.

## Update 03.2025

Btw, the original paper

[Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598)

claims that low-temperature mode results in bad samples and CFG works
better. That's for pictures. Not sure if the same applies to TTS. But we
ended up with CFG in the end too. With a weight like 1.0 the quality of the
results significantly improves and the FAD doesn't degrade much.

Also, from the paper it is clear that CFG must be applied to both 
training and inference, not just inference. NVIDIA paper is wrong here.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[Some features are somewhat small and require few lines of code, not really worth a conference paper or a poster. Still, they are somewhat widespread. A blog post about them feels just right.]]></summary></entry><entry><title type="html">Why discrete units</title><link href="https://alphacephei.com/nsh/2025/01/12/discrete-units.html" rel="alternate" type="text/html" title="Why discrete units" /><published>2025-01-12T22:00:00+01:00</published><updated>2025-01-12T22:00:00+01:00</updated><id>https://alphacephei.com/nsh/2025/01/12/discrete-units</id><content type="html" xml:base="https://alphacephei.com/nsh/2025/01/12/discrete-units.html"><![CDATA[Discrete units made a splash since Hubert probably (2021, four years
already), then with Tortoise TTS and successors. Before that there were
many attempts too, like the very old system by our respected colleagues
Jan Cernocky, Genevieve Baudoin, Gerard Chollet 
[The use of ALISP for automatic acoustic-phonetic transcription 1998](https://www.isca-archive.org/sposs_1998/cernocky98_sposs.html).

Originally I was sceptical about discrete units for speech. As usual, it
is hard for me to understand things quickly. It seemed to me that speech
is really continuous, as the generation part is clearly mechanical.
However, these days I see more and more arguments for discrete units. You
may say "haha, it has been a few years already"; I'd reply that only now do we have
enough evidence.

Even now, the theory behind discrete units is somewhat lacking. There is a
definite understanding of why they are required, but it is certainly not
expressed well or widely accepted. Let's take a recent work on replacing
discrete units with continuous representations:

Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

<https://arxiv.org/abs/2411.18447>

It says "This discretization allows models to operate within a discrete
probability space, enabling the use of the crossentropy loss, in analogy
to their application in language models. However, quantization methods
typically require additional losses (e.g., commitment and codebook
losses) during VAE training and may introduce a hyperparameter overhead.
Secondly, continuous embeddings can encode information more efficiently
than discrete tokens..."

While the first sentence is the core advantage of discrete units, the second
sentence does not align with the theory. The point is that many
distributions we try to model are significantly non-gaussian. We will
cover that later in detail, but that's the fact. When we try to model
non-gaussian distributions with gaussian models and L2 loss, we fail
totally. Flow, flow matching and diffusion models don't help here, since
they still keep that gaussian nature even if they try to approximate the
target distribution. This is exactly the reason we need a discrete
approximation and cross-entropy loss.
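
A toy illustration of the argument, not taken from the paper: suppose the
target only ever takes two values, 2 and 10 (an invented bimodal example).
The constant that minimizes L2 loss is the mean, a value that never occurs
in the data, while a categorical model trained with cross-entropy converges
to the empirical class frequencies and can still answer with a real mode.

```python
from collections import Counter

# Invented bimodal targets: short (2) and long (10) values, nothing in between.
targets = [2] * 60 + [10] * 40

def l2_loss(prediction, values):
    """Mean squared error of a single constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)

# The constant minimizing L2 loss is the mean of the targets.
l2_best = sum(targets) / len(targets)

# A categorical model trained with cross-entropy converges to the
# empirical class frequencies, so both modes survive in the output.
counts = Counter(targets)
probs = {value: n / len(targets) for value, n in counts.items()}
categorical_best = max(probs, key=probs.get)

print("L2-optimal prediction:", l2_best)
print("observed values:", sorted(counts))
print("class probabilities:", probs)
print("most likely class:", categorical_best)
```

The L2 answer sits between the two modes, which is exactly the kind of
prediction that sounds wrong in speech.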

Given that, it is strange that the paper above tries to fight error
accumulation but never mentions which distributions it tries to model,
and only provides experimental evidence of the advantages.

## Hung-yi Lee talk

This whole idea actually was very nicely introduced to me in a talk by
Professor Hung-yi Lee at Interspeech 2024, see here

<https://x.com/HungyiLee2/status/1830698181757411769>

Here is the image from the slides:

![Discrete Units](/img/blog/discrete.png){: width="600" }

This is probably a very obvious thing, but I haven't seen a paper that
uses or mentions this consistently.

# Discrete units for duration

Why did this idea come to my mind again? Because we worked a bit more on VITS durations. Here
is the histogram of overall duration values and their VITS predictions.

![Discrete Duration Units](/img/blog/discrete-durations.png){: width="600" }

As you can see, durations are clearly non-gaussian and a simple convnet has trouble modeling
them. Not surprisingly, the flow model and the flow matching model have trouble too!

The solution is simple: let's treat duration as discrete units and use
cross-entropy loss. Where did we see it already? In
[StyleTTS2](https://arxiv.org/abs/2306.07691).
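
As a minimal sketch of what "duration as discrete units" can mean in
practice: each phone duration in frames is a class index, the predictor
outputs one logit per class, and the loss is cross-entropy. The class count
`MAX_FRAMES` and the example logits are made up for illustration, and this
is a generic softmax classifier, not the StyleTTS2 code.

```python
import math

MAX_FRAMES = 8  # hypothetical cap: classes are durations 0..7 frames

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def duration_cross_entropy(logits, target_frames):
    """Negative log-probability of the true duration class."""
    return -log_softmax(logits)[target_frames]

def decode_duration(logits):
    """Pick the most likely class; the answer is always a whole frame count."""
    return max(range(len(logits)), key=lambda k: logits[k])

# Made-up logits for one phone: the model hesitates between 2 and 6 frames.
logits = [-4.0, -1.0, 2.0, -1.0, -2.0, -1.0, 1.5, -3.0]

print("predicted frames:", decode_duration(logits))
print("loss if truth is 2:", round(duration_cross_entropy(logits, 2), 3))
print("loss if truth is 6:", round(duration_cross_entropy(logits, 6), 3))
print("loss if truth is 4:", round(duration_cross_entropy(logits, 4), 3))
```

The model can keep probability mass on both 2 and 6 frames without being
pushed towards 4, which is what a regression loss would do.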

I must admit again, StyleTTS2 is a very advanced and well-thought-out
architecture, and while it might not be obvious from the start, it
definitely needs attention. Many things like advanced discriminators, ASR
alignment and durations come up again and again in my studies.

StyleTTS2 never calls durations discrete, but the idea is the same.
I'd predict the next generation of duration models will be all discrete.

This idea is used not just in StyleTTS2. For example, in the recent paper
[Total-Duration-Aware Duration Modeling for Text-to-Speech Systems](https://arxiv.org/abs/2406.04281)
we also see Microsoft researchers consider discrete units and show their advantage.

On top of that, our experiments with StyleTTS2 duration in Matcha TTS show very good
results.

# Unit selection vs generative models

That non-gaussian nature of things reminded me of the old story where
everyone was discussing unit selection TTS vs HMM TTS. The first was more
natural-sounding but less flexible, the second never sounded good but was
really versatile.

Since we model something non-gaussian, unit selection and patch mixing
can provide really good results. So welcome back to the unit selection world.

Papers like this appear quite often. One example is
[KNN-VC](https://github.com/bshall/knn-vc). And DiT/MaskGIT is exactly
the thing here. They are on the rise, and it is nice to have a theory
that confirms they are really reasonable.

# Why brain signals are discrete

Clearly some distributions are more gaussian, some less. We need to
understand their nature before we select the method. But the question
arises why many of our distributions are discrete.

In my opinion there is a simple answer here - the mechanics of the brain.
Since neurons have pretty fixed states it is natural to think they are
discrete. Somewhere I heard a 4-bit estimate from neuroscientists (so
4-bit LLMs probably make the most sense too). So no wonder that our speech has a
discrete nature as well. Something to remember about.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[Discrete units made a splash since Hubert probably (2021, four years already), then with Tortoise TTS and successors. Before that there were many attempts too, like the very old system by our respected colleagues Jan Cernocky, Genevieve Baudoin, Gerard Chollet The use of ALISP for automatic acoustic-phonetic transcription 1998.]]></summary></entry><entry><title type="html">Matcha TTS notes</title><link href="https://alphacephei.com/nsh/2025/01/03/matcha-tts-notes.html" rel="alternate" type="text/html" title="Matcha TTS notes" /><published>2025-01-03T22:00:00+01:00</published><updated>2025-01-03T22:00:00+01:00</updated><id>https://alphacephei.com/nsh/2025/01/03/matcha-tts-notes</id><content type="html" xml:base="https://alphacephei.com/nsh/2025/01/03/matcha-tts-notes.html"><![CDATA[Recently I've spent some time with [Matcha](https://github.com/shivammehta25/Matcha-TTS) by Shivam Mehta. Some related papers

[Matcha-TTS: A fast TTS architecture with conditional flow matching](https://arxiv.org/abs/2309.03199)

[Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech](https://arxiv.org/abs/2406.05401)

[P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting](https://proceedings.neurips.cc/paper_files/paper/2023/file/eb0965da1d2cb3fbbbb8dbbad5fa0bfc-Paper-Conference.pdf)

Overall, Matcha is attractive because it is a very simple system
following the VITS design and incorporating recent advances in TTS. It is
fast and light. We test Matcha on a Russian database of about 1000 hours
and 100 speakers. I wish those things would be mentioned in the paper, but
papers these days are not a good source of information.

# Better quality than VITS2

Out of the box, Matcha gives you better synthesis clarity (CER metric) and
intonation (FAD metric) than VITS2 at the price of slightly slower speed
and reduced quality (UTMOS). The quality drop is due to the codec and the
mel-based architecture; end-to-end quality is better, as expected. Our
results are like this:

{:class="table table-bordered"}
| Metric | VITS2 | Matcha | Matcha + Vocos |
|--------|-------|--------|----------------|
| CER    | 1.9   | 1.2    |     1.2        |
| UTMOS  | 3.4   | 3.0    |     3.2        |
| FAD    | 9.7   | 5.0    |     3.0        |
| SIM    | 0.87  | 0.82   |    0.84        |
| CPU xRT| 0.07  | 0.40   |    0.14        |

Note that MOS is not expected to be better (as the paper claims).

# Focus on a single speaker

Matcha default params are optimized for a single speaker database. There
is a VCTK setup too, but it doesn't feel optimal. Such a focus has its
use case, but overall it is not sufficient for modern TTS. As a result,
some parts need extra inputs (for example, the duration module doesn't use
the speaker embedding, which creates big issues with duration), and some
need more parameters (it is
[recommended](https://github.com/shivammehta25/Matcha-TTS/issues/52) to
increase params of the decoder from 10M to 40M). Overall, it is interesting
that most of the current light models are underparameterized. An example
of this is the good boost in quality of Kokoro-StyleTTS2 vs plain
StyleTTS2.

# Mel parameters

Matcha uses an 8 kHz cut-off for mel coefficients. Maybe at one time it was
relevant, when you wanted to mix in 16 kHz data, but these days there is
plenty of wideband data around. I have yet to retrain with full 11 kHz mel,
but I already think there will be an improvement. 80 mels doesn't feel like
enough either, it is probably better to use 100.
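
A rough back-of-the-envelope check of the band count, assuming the HTK mel
formula, a filterbank starting at 0 Hz and 22.05 kHz audio (so "full 11 kHz"
means the 11025 Hz Nyquist limit). All three are assumptions made for this
sketch, not verified Matcha settings. It asks how many bands are needed to
keep the per-band resolution of 80 mels up to 8 kHz once the range is
extended.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel scale (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

N_MELS = 80          # band count mentioned in the text
FMAX_OLD = 8000.0    # cut-off mentioned in the text
FMAX_NEW = 11025.0   # assumed Nyquist frequency for 22.05 kHz audio

# Width of one band on the mel axis with the current settings.
mel_per_band = hz_to_mel(FMAX_OLD) / N_MELS

# Bands needed to cover the wider range at the same resolution.
bands_needed = hz_to_mel(FMAX_NEW) / mel_per_band

print("mel per band:", round(mel_per_band, 1))
print("bands needed up to 11025 Hz:", math.ceil(bands_needed))
```

Under these assumptions the count comes out somewhat below 100, so moving
to 100 bands would also add a bit of resolution on top of the wider range.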

# BERT semantics

It is clear that proper synthesis is not really possible without text
understanding. Since [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2), BERT embeddings have been
very helpful for implementing that. In our experiments, BERT embeddings are
very helpful for Matcha as well: we see that FAD improves from 3.0 to
2.3, which is a really good improvement.

Not sure why modern research systems still ignore this, essentially weakening the baselines.

# Duration issues

Matcha copies the simple VITS duration predictor. It is sufficient for a
single speaker, but not really sufficient for life-like synthesis.

Overall, duration is the weakest point of modern VITS successors. It is
very sad that the problem is not fully understood and explored. For example,
it is surprising that the duration paper linked above never mentions
duration metrics, only the WER metric. GlowTTS/VITS duration modeling is
somewhat innovative (for example, MAS) but very basic -- no skipping of
silence phones, no chance of learning a proper alignment, and a very
inaccurate plain CNN duration model. It is very sad that this piece of code
is copied from repo to repo. As a result, researchers claim duration models
are not helpful (as in E5/F5).

First of all, it is still not clear to me why we feed text encoder
outputs to the duration predictor. Text encoder outputs are optimized with
the prior loss, an L2 distance to the mel spectrogram, so essentially they
are just a rough mel spectrogram where all semantics are lost. No
punctuation, nothing. And we hope to predict the duration from it.

Another example of the issue copied from repo to repo is 
[About ceiling for calculating phoneme duration](https://github.com/jaywalnut310/vits/issues/11).  
Essentially, what happens here is that the VITS duration predictor is not
very good and often predicts very short durations for important phones,
which causes very perceptible phone skips in the audio. The ceil helps to
hide this problem by pushing durations up, but it still creates many
problems with intonation. To match the original length the audio has to be
scaled down by a factor of approximately 0.9, and you still get irregular
durations sometimes.

The proper solution would be to improve the duration predictor. Then ceil
can be replaced with round, which leads to much better intonation (FAD
drops from 2.3 to 2.0). But then you get issues with phone
skips (CER rises from 1.2 to 1.9).
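
The trade-off is easy to see on a few made-up predicted durations (in
frames). The numbers below are invented for illustration; only the
ceil-versus-round comparison matters.

```python
import math

# Invented predicted durations in frames for one short utterance.
predicted = [0.4, 6.2, 9.6, 0.3, 11.1, 5.5, 8.4]

ceil_frames = [math.ceil(d) for d in predicted]
# Round half up; Python's built-in round() uses banker's rounding instead.
round_frames = [math.floor(d + 0.5) for d in predicted]

# ceil never returns 0 for a positive duration, so no phone is dropped,
# but every phone is stretched by up to one frame.
skipped_with_ceil = sum(1 for n in ceil_frames if n == 0)
skipped_with_round = sum(1 for n in round_frames if n == 0)

# Factor needed to bring the ceil-based length back to the round-based one.
length_scale = sum(round_frames) / sum(ceil_frames)

print("ceil :", ceil_frames, "total", sum(ceil_frames))
print("round:", round_frames, "total", sum(round_frames))
print("phones skipped (ceil/round):", skipped_with_ceil, skipped_with_round)
print("length scale to compensate ceil:", round(length_scale, 2))
```

With round, the two phones predicted below half a frame disappear
entirely, which is the phone-skip problem; with ceil they survive, but the
whole utterance gets longer and has to be rescaled.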

Our experiment with flow-matching duration from the paper above didn't
demonstrate advantages of the method. In fact it got worse. Probably
because there is no speaker vector. See the note on the overall flow
matching issues below. We have yet to explore this part.

Some other good duration ideas: duration discriminator (VITS2), interpolation between
sdp and dp predictors (MeloTTS), LSTM duration (StyleTTS2).

# MAS vs ASR alignment

MAS is somewhat innovative (it implements probabilistic alignment instead
of the old fixed Viterbi-style alignment). However, it is certainly not sufficient for
all use cases. It definitely works well for a large single-speaker database, but
is expected to fail for diverse data. Here are the cases:

1. Fine-tuning on a small and emotional dataset of, say, 0.5 hours. It is a perfectly
valid use case, but MAS is not going to work here, there is not enough internal data
to align properly.

2. Training on diverse datasets with a variable amount of speech per speaker. Some
speakers have hours, some minutes. MAS will also fail here.

As a result, it becomes obvious that a modern light TTS system should have an ASR aligner,
not a MAS aligner. This also fits the modularity requirements we covered in the previous
post.

An interesting thing is that StyleTTS2 has the proper architecture here. Overall, StyleTTS2
makes many proper decisions and needs more attention.

# Mel representation, Vocos and BigVGAN codecs

By default Matcha uses HifiGAN, which is ok and even has good quality, but is not as universal
and fast as modern vocoders.

We tried BigVGAN with Matcha expecting it to produce high quality
synthesis. Unfortunately, it doesn't really work. UTMOS is 2.5 (10 NFE)
and something like 2.8 (50 NFE), much lower than Vocos (3.2). The thing is
that Matcha doesn't model mel precisely and BigVGAN is not just a speech
codec, so it renders inaccurate mel as non-speech sounds (clicks),
affecting quality a lot.

Yet to investigate other codecs, for example, some projects like
[Pflow-Encodec](https://github.com/seastar105/pflow-encodec) report good
results with the Encodec latents.

Mel is clearly not sufficient to reproduce speech clearly. The VITS2 system
has UTMOS 3.4, higher than Matcha, and uses a latent of dimension 192.
Hopefully a more advanced codec could help to close the gap between Matcha
and E2E systems while keeping a reasonable speed of synthesis.

# Flow matching and outliers

Overall, flow matching and diffusion are not a silver bullet. They can
emulate complex distributions with some precision but they also have a
drawback - the outliers. If your distribution doesn't match the model you
get variance in the prediction. Sometimes this variance is nice, when you
want to have emotional speech. But sometimes it really hurts - you get
outliers which spoil the impression of the result. As an outcome, VITS CER
is always very high compared to FastSpeech2 models. The former is usually
about 2%, the latter 0.7%. VITS often skips or misrenders some phones due
to the flow model.

We see the same thing with flow matching in Matcha. We improve expressiveness
overall, but we introduce some gross outliers. We can hear them in
acoustics and in duration too. Like VITS, Matcha is not very good at
modeling liquid phones (l and r).

It is better to create a good model matching the target distribution (for
example, introduce intonation vector as input) than to hope that flow
matching will solve your problems. It will not.

# No style vectors

Style vectors like in StyleTTS2 or HierSpeech++ seem like an important
way to control synthesis. Besides a simple timbre vector, an emotional
style vector could affect intonation based on a reference file and do many
other things. Matcha doesn't use anything similar, probably due to the
focus on more uniform speech, but later testing with more diverse
databases and speech styles might show that architecture modifications
are required.

# Streaming

Low-latency synthesis has become popular over the last year. While we don't
believe in it, something like two-stage synthesis definitely should make
sense. Since we need to understand the full semantics to render the line
properly, we still need to look at the whole sentence. For example, we
synthesize an intermediate representation of the whole line quickly and then
render it with diffusion in streaming fashion.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[Recently I’ve spent some time with Matcha by Shivam Mehta. Some related papers]]></summary></entry><entry><title type="html">TTS Design Thoughts</title><link href="https://alphacephei.com/nsh/2024/10/18/tts-design.html" rel="alternate" type="text/html" title="TTS Design Thoughts" /><published>2024-10-18T23:00:00+02:00</published><updated>2024-10-18T23:00:00+02:00</updated><id>https://alphacephei.com/nsh/2024/10/18/tts-design</id><content type="html" xml:base="https://alphacephei.com/nsh/2024/10/18/tts-design.html"><![CDATA[We spent last year working mostly on TTS just as in the good old Festival times. Here are some more random thoughts I have on the subject. Rants follow, I still have trouble living in a positive thinking world. That one of course has advantages as life demonstrates but I just can’t get there.

These days there are a dozen TTS systems around and their strengths and weaknesses are not fully understood. Some just mimic others. For example, the popular MeloTTS is just a VITS system with very tiny modifications and well-trained single-speaker voices. So it has the same weaknesses as VITS - weak global intonation, no emotion, no text understanding. It is fast, like any VITS, though.

New audio LLMs arrive every week. Everyone markets short response latency. Somehow they all forget to mention WER which is usually two times worse than the offline systems due to streaming nature. It is actually crazy that [LLM reviews talk about everything](https://arxiv.org/abs/2410.03751) but forget to mention WER metrics.

Recently the F5-TTS made a splash. People claim it is good from listening to a few short samples. Everyone [finetunes it with 80 hours of data](https://github.com/SWivid/F5-TTS/discussions/57) on their 4090 with gradient accumulation from 80 steps. Nobody tests anything rigorously, not even WER.

The F5-TTS paper reports WER and speaker similarity but never reports UTMOS and F0 correlation. Not surprising, since F5 uses Vocos, which is not great at UTMOS and not very good for multilingual generation. The reasons are simple. First, Vocos is fast but it is trained on English data only; for other languages it needs fine-tuning. Second, it uses mel as input, and as a result it is not great for complex sounds like fricatives and clicks. Kind of crazy nobody mentions that.

Nobody mentions that simple transformers can’t learn semantics, even on 200k hours of training data. Very few people honestly report that the transformer swallows syllables and has trouble with complex text inputs like repeating numbers. Nobody talks about pause control, pronunciation control and so on. Nevertheless, F5-TTS is described as the next “Elevenlabs killer”.

But I have to admit diffusion transformer idea from Stable Diffusion 3 is super nice.

Let us describe a few basic principles and demonstrate how they affect the design of a modern TTS system.


## Design purpose

The no-free-lunch rule still holds, and the optimal system design depends on the purpose. Tests and everything else are all shaped by purpose. There are different purposes for TTS:

  * Book reading TTS - it has to read the whole book and render dialogs properly. It doesn’t have to be fast as the generation can be offline.
  * On-device interactive TTS - information reading, simple chatbot. It has to be fast and clear. Pronunciation should be solid.
  * Emotional chatbots and production TTS - it has to be emotional and deliver human-like speech in various conditions. It has to support arbitrary voices and probably arbitrary languages.
  * Singing TTS has a separate feature set.

The thing is - design of the TTS for use cases above doesn’t have to be the same. Many simple fast TTS (VITS, Matcha, Pflow) are actually single voice TTS and they serve their purpose well - they are fast and sometimes clear to deliver information. Advanced GPT is unnecessary there, though there are specific things where AI is required (like number pronunciation).

Now, when one sees results for the TTS systems in the paper one has to account for the purpose of the system too. I.e. you can’t compare a single speaker system with an emotional multi speaker system. Single speaker is trained on LJSpeech usually and results reported on LJSpeech only. The examples of such systems are VITS, MatchaTTS and StyleTTS2. Good for single speaker training and simple plain non-emotional speech they usually fail for emotional speech altogether. Typically such a system has a very plain simple duration predictor with no semantics. For example, MatchaTTS has a very simple CNN-based duration predictor. For non-fiction audiobooks and simple reporting it is ok. If you ask them to render emotional conversational input they usually fail in intonation significantly. 

On the other hand, proper multispeaker multipurpose systems are usually extremely heavy and don’t fit small devices.

At the moment it looks like we have to design several different TTS systems for different purposes as it is hard to construct a universal one.

Here are some more points which get very important.

## Lean compute and modularity (the end of end2end)

In recent times when good compute capabilities are not really available for us, a big part of the strategy is to reuse as many components as possible. It really means the end of end2end for small researchers. Things like speaker identification networks, LLMs, vocoders are all better pre-trained than trained from scratch.

From experiments, end-to-end systems like VITS still show very impressive quality compared to modular ones after training for months on simple hardware, so it is a big research task on how to make it work. For example, Vocos is very hard for multilingual cases like I mentioned above. Hopefully, we can find a reasonable replacement. One thing that is clear is that Mel is certainly not good enough, probably some encodec-like multi-scale multi-level features make more sense.

This requirement also means all big monolithic architectures have to pass by. It is very hard to resist the common opinion that a big compute is always better but right now I don’t see the way forward unfortunately. Maybe I will be wrong again.

Recently we trained a good old Kaldi ASR for one not very well supported language. It is not perfect but it ended up being more robust than any MMS-like or Whisper finetune. I kind of feel that a simple modern TDNN ASR system has great potential in the world of LLMs.

## Dirty data training

One requirement that is frequently omitted in modern TTS papers is the ability to properly train from dirty data. Given the amounts and languages, the input data is certainly dirty and some methods work better with dirty data than others. No good research here, but hopefully we can do something about it soon.

Monotonic alignment for example is a nice move forward here. Due to probabilistic nature it actually can deal with some data inconsistencies. But the full nature and consequences are not fully understood here.

Everyone is using transformers, how good they are for the dirty data, that is the question.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[We spent last year working mostly on TTS just as in the good old Festival times. Here are some more random thoughts I have on the subject. Rants follow, I still have trouble living in a positive thinking world. That one of course has advantages as life demonstrates but I just can’t get there.]]></summary></entry><entry><title type="html">Evaluation of Russian TTS models</title><link href="https://alphacephei.com/nsh/2024/07/12/russian-tts.html" rel="alternate" type="text/html" title="Evaluation of Russian TTS models" /><published>2024-07-12T23:00:00+02:00</published><updated>2024-07-12T23:00:00+02:00</updated><id>https://alphacephei.com/nsh/2024/07/12/russian-tts</id><content type="html" xml:base="https://alphacephei.com/nsh/2024/07/12/russian-tts.html"><![CDATA[We recently evaluated Russian open source and proprietary TTS models. Here are the results:

{:class="table table-bordered"}
|Engine | Voice | CER | xRT GPU | xRT CPU | UTMOS | Similarity Avg/Min | Encodec FAD
|-------|-------|-----|---------|---------|-------|--------------------|------------
|Silero v3_1 | Aidar | 0.7 | 0.0177 | 0.1256 | 2.544 | - | 97.36
|Silero v3_1 | Baya | 0.7 | 0.0177 | 0.1256 | 2.978 | - | 170.53
|Silero 4 | Aidar | 1.0 | 0.0149 | 0.0544 | 1.755 | - | 79.33
|Silero 4 | Baya | 0.9 | 0.0149 | 0.0544 | 2.144 | - | 118.63
|Vosk-TTS 0.6 | Multi | 2.3 | - | 0.0605 | 3.283 | 0.869/0.571 | 9.99
|TeraTTS | Natasha | 1.6 | - | 0.1945 | 3.281 | - | 70.10
|UtrobinTTS | Female | 2.1 | 0.0265 | 0.1323 | 2.851 | - | 73.34
|UtrobinTTS | Male | 2.1 | 0.0265 | 0.1323 | 3.186 | - | 46.14
|XTTS2 | Multi | 2.7 | 0.3458 | - | 3.035 | 0.762/0.468 | 97.05
|Vosk-TTS GPT | Multi | 2.1 | 0.2690 | - | 3.381 | 0.814/0.544 | 10.08
|Piper | Denis | 3.7 | - | 0.045 | 3.056 | - | 142.91
|Piper | Dmitry | 3.6 | - | 0.045 | 2.864 | - | 130.9
|Piper | Irina | 1.4 | - | 0.045 | 3.672 | - | 74.98
|Piper | Ruslan | 3.0 | - | 0.045 | 2.975 | - | 72.22
|BeneGes | Ruslan | 2.4 | - | 0.321 | 2.537 | - | 63.02
|EdgeTTS | Dmitry | 0.7 | - | 0.076 (cloud) | 3.565 | - | 32.69
|EdgeTTS | Svetlana | 0.7 | - | 0.076 (cloud) | 3.513 | - | 30.60
|Yandex | Alexander | 0.6 | - | 0.028 (cloud) | 3.413 | - | 54.10
|Yandex | Marina | 0.6 | - | 0.028 (cloud) | 3.482 | - | 49.40
|Tortoise Ruslan | Multi | 6.2 | 25.0300 | - | 2.893 | 0.660/0.483 | 14.21
|Bark Small | Ru_4 | 10.3 | 1.201 | - | 2.554 | - | 61.71


To make it clear here are the evaluated metrics:

  * CER - character error rate, obtained by running an ASR recognizer on the synthesized audio. It reflects clarity of speech (phonemes are not swallowed, they are pronounced correctly)
  * xRT - synthesis speed as a real-time factor (lower is faster)
  * UTMOS - reflects sound purity (studio sound quality)
  * Similarity - for multi-voice systems, how similar the synthesized voice is to the reference sample
  * Encodec FAD - intonation quality
  * The code for evaluation is here: <https://github.com/alphacep/vosk-tts/tree/master/extra/tts-test/ru>

Similar repo is <https://github.com/Edresson/ZS-TTS-Evaluation>
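
For reference, the two simplest metrics can be sketched in a few lines.
This is an illustration, not the code from the repository above: CER is
taken as character-level edit distance between the input text and the ASR
transcript of the synthesized audio, and xRT as synthesis time divided by
audio duration (an assumed definition, consistent with lower being faster).
The example strings and timings are invented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def cer_percent(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)

def xrt(synthesis_seconds, audio_seconds):
    """Real-time factor: below 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds

reference = "hello world"
recognized = "hello word"  # hypothetical ASR output for the synthesized audio

print("CER %:", round(cer_percent(reference, recognized), 1))
print("xRT:", xrt(0.5, 10.0))
```

In a real evaluation the error counts are usually summed over all
utterances before dividing, rather than averaging per-utterance rates.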

Some information about evaluation data:

  * Audiobooks, about 100 speakers, about 1k utterances.

Some observations:

  * Fastspeech2 methods still show best clarity (Silero/Yandex/EdgeTTS). They are not very good in intonation but clarity
    is hard to beat. For end user it really makes sense actually, you can deal with plain intonation but artifacts are
    really annoying.

  * Training database matters a lot. Even a small database gives very good results (CER and UTMOS) if the data is good: compare Piper Irina to the other Piper voices, and there the data is only 1 hour.

  * Multi-voice systems seriously suffer from fuzziness (CER goes from 0.7 to 2.0+), something needs to be done about it.

  * Tortoise is pretty good in intonation (as expected).

  * It is necessary to add another metric responsible for the liveliness of speech (F0 correlation? duration?). FAD is relevant, but only works for multivoice systems.

  * XTTS2 results are much worse than I expected. Both similarity and clarity of speech.

  * A good thing to evaluate would be the diversity of speech generation. VITS, for example, is specifically optimized for diversity compared
    to Fastspeech. Something to implement in the future.

See also <https://alphacephei.com/nsh/2023/11/29/new-tts.html>.]]></content><author><name>Nickolay Shmyrev</name></author><summary type="html"><![CDATA[We recently evaluated Russian open source and proprietary TTS models. Here are the results:]]></summary></entry></feed>