
A review-based study on different Text-to-Speech technologies

Md. Jalal Uddin Chowdhury, Ashab Hussan
Leading University, jalalchy101, ashabhtanim@gmail.com

Abstract - This research paper presents a comprehensive review-based study on various Text-to-Speech (TTS) technologies. TTS technology is an important aspect of human-computer interaction, enabling machines to convert written text into audible speech. The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS. The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications. In addition, the paper explores the latest advancements in TTS technology, including neural TTS and hybrid TTS. The findings of this research will provide valuable insights for researchers, developers, and users who want to understand the different TTS technologies and their suitability for specific applications.

Index Terms – Natural Language Processing, Text-to-Speech.
INTRODUCTION
Nowadays, we live in the digital era, and almost everything associated with our lives is becoming digital. Almost every smartphone has a smart assistant that can speak and communicate like a human. Speech recognition is one of the technologies used in those smart assistants, and Text-to-Speech complements it by producing the spoken responses. Text-to-speech (TTS) is a natural language modeling approach that converts text units into speech units for audio presentation. Numerous technologies, and many programming languages, are used to build TTS systems. Python is the programming language most commonly used for TTS, and it offers many libraries, e.g. gTTS, pyttsx3, and paddlespeech, whose performance is not the same. This study measures the efficiency of various Text-to-Speech technologies from various aspects.
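As a minimal sketch of two of the Python libraries just mentioned (standard usage of gTTS and pyttsx3; the sentence and the output file name are our own illustrative choices, not material from the study), the snippet below synthesizes the same text once through a cloud-backed engine and once offline:

```python
# Sketch: the same sentence spoken by two of the Python TTS libraries named above.
# Assumes `pip install gTTS pyttsx3`; the output file name is illustrative.
from gtts import gTTS
import pyttsx3

text = "Text to speech converts written text into audible speech."

# gTTS: sends the text to Google's online TTS service and saves an MP3 file.
gTTS(text=text, lang="en").save("gtts_output.mp3")

# pyttsx3: drives the local speech engine (SAPI5, NSSpeechSynthesizer, or eSpeak).
engine = pyttsx3.init()
engine.setProperty("rate", 170)  # speaking rate in words per minute
engine.say(text)
engine.runAndWait()
```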
I. How TTS Works
Text-to-speech converts text into human-like speech, along
with the ability to create a unique, custom voice.
A. Types of Voice
Standard voice: Standard voice is the simplest and most cost-effective type of voice. In the past few years, standard voice has improved considerably to provide a human-like voice in multiple regional dialects, such as Hindi or Irish English. Regional dialects provide greater clarity of pronunciation for region-specific words or phrases, making for more understandable and accessible accents.

Neural Voice: Neural voice is a new type of synthesized speech that is nearly indistinguishable from human recordings. Powered by deep neural networks, neural voices sound more natural than standard voices by producing human-like speech patterns, such as the stress and loudness of individual words. Because of this human-like speech, users get a more precise articulation of words, along with a significant reduction in listening fatigue when interacting with AI systems.

Custom Neural Voice: Custom neural voice uses your own audio data to create a one-of-a-kind customized synthetic voice. Custom neural voice offers the deepest level of voice personalization, with realistic speech that can be used to represent brands, personify machines, and allow users to interact with applications conversationally.

B. Some Terminology

Phoneme: A phoneme is the smallest unit of sound that makes a word's pronunciation and meaning different from another word.

Prosody: The patterns of rhythm, stress, and intonation in speech.

Mel-spectrogram: It is derived by applying a non-linear transformation to the frequency axis of the short-time Fourier transform (STFT) of audio in order to reduce its dimensionality. It emphasizes details in the low frequencies, which are very important for distinguishing speech, and de-emphasizes details in the high frequencies, which are usually noise.

Text-To-Speech (TTS) Structure

Fig. 1: Text-to-speech structure
This is a high-level diagram of different components used in
the TTS system. The input to our model is text, which passes
through several blocks and eventually is converted to audio.
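Before walking through the blocks, the Mel-spectrogram term from the terminology above can be made concrete with a short sketch; librosa, the parameter values, and the file name are our own assumptions rather than tools named in the paper:

```python
# Sketch: compute a (log) mel-spectrogram from an audio file with librosa.
# Assumes `pip install librosa`; "speech.wav" is a placeholder file name.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050)  # waveform samples and sample rate

# STFT -> mel filterbank -> logarithmic magnitude compression.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80  # 80 mel bands is a common TTS choice
)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames)
```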

Preprocessor

● Tokenize: Tokenizes a sentence into words.


● Phonemes/Pronunciation: Breaks the input text into phonemes based on their pronunciation. For example, “Hello, Have a good day” converts to HH AH0 L OW1, HH AE1 V AH0 G UH1 D D EY1 (see the sketch after this list).

● Phoneme duration: Represents the total time taken by each phoneme in the audio.

● Pitch: A key feature for conveying emotion; it greatly affects the speech prosody.

● Energy: Indicates the frame-level magnitude of the mel-spectrogram and directly affects the volume and prosody of speech.
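As a rough illustration of the phoneme step above, the snippet below uses the g2p_en package, which emits ARPAbet symbols of the kind shown in the example; the package choice is our assumption, not a tool used by the systems reviewed here.

```python
# Sketch: grapheme-to-phoneme conversion with the g2p_en package.
# Assumes `pip install g2p_en`; output symbols are ARPAbet with stress digits.
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("Hello, have a good day")
print(phonemes)
# Expected to look roughly like:
# ['HH', 'AH0', 'L', 'OW1', ',', ' ', 'HH', 'AE1', 'V', ' ', 'AH0', ' ',
#  'G', 'UH1', 'D', ' ', 'D', 'EY1']
```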
The Linguistic feature only contains phonemes. Energy, pitch, and duration are actually used to train the energy predictor, the pitch predictor, and the duration predictor, respectively, which the model uses to produce a more natural output.

Encoder

The encoder takes the Linguistic features (phonemes) as input and outputs an n-dimensional embedding. This embedding between the encoder and decoder is known as the latent feature. Latent features are crucial because other features, such as speaker embeddings, are concatenated with them and passed to the decoder. Furthermore, the latent features are also used for the prediction of energy, pitch, and duration, which in turn play a crucial role in controlling the naturalness of the audio.

Decoder

The decoder converts the information embedded in the latent processed features into the acoustic feature, i.e. the mel-spectrogram.
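To make the encoder/decoder split concrete, here is a minimal sketch of a model that maps phoneme IDs to mel-spectrogram frames; it is our own illustration with arbitrary sizes, not the architecture of FastSpeech or of any system reviewed here:

```python
# Minimal encoder/decoder sketch: phoneme IDs -> latent features -> mel frames.
# Assumes `pip install torch`; vocabulary size, dimensions, and layer counts are illustrative.
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    def __init__(self, n_phonemes=80, d_model=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)  # phoneme IDs -> vectors
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # latent features
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)      # latent -> acoustic
        self.to_mel = nn.Linear(d_model, n_mels)                        # project to mel bins

    def forward(self, phoneme_ids):
        latent = self.encoder(self.embed(phoneme_ids))  # "latent processed features"
        acoustic, _ = self.decoder(latent)
        # One mel frame per phoneme; a real system would also upsample by predicted duration.
        return self.to_mel(acoustic)

mel = TinyAcousticModel()(torch.randint(0, 80, (1, 12)))  # batch of 1, 12 phoneme IDs
print(mel.shape)  # torch.Size([1, 12, 80])
```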
Vocoder

The vocoder converts the acoustic feature (mel-spectrogram) into the waveform output (audio). This can be done with a mathematical method such as Griffin-Lim, or a neural network can be trained to learn the mapping from mel-spectrograms to waveforms. In practice, learning-based methods usually outperform the Griffin-Lim method.

So instead of directly predicting the waveform with the decoder, we split this complex and sophisticated task into two stages: first predicting the mel-spectrogram from the latent processed features, and then generating audio from the mel-spectrogram.
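The Griffin-Lim alternative mentioned above can be sketched with librosa; the input mel-spectrogram here is computed from a placeholder file only so that the inversion has something to work on, and the parameters are illustrative assumptions:

```python
# Sketch: invert a mel-spectrogram back to a waveform with Griffin-Lim (no neural vocoder).
# Assumes `pip install librosa soundfile`; file names and settings are illustrative.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# mel_to_audio approximately inverts the mel filterbank and runs Griffin-Lim phase estimation.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=32
)
sf.write("reconstructed.wav", y_hat, sr)
```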

LITERATURE REVIEW
Designing an effective text-to-speech synthesis system is
quite difficult. Building a whole TTS system requires
completing several steps, including normalizing text,
converting text to phonemes, identifying prosodic emotional
content, and generating speech.
Speech synthesis for different languages has already been the
subject of many research proposals. Before electronic signal
processing was invented, some early scientists tried to make
machines that could mimic human speech.
A unit-selection approach for text-to-speech synthesis using syllabic units was presented in [1]. In this paper, they select syllables as their unit; hence, this was the first syllable-based text-to-speech conversion system for the Bangla language. For this system, it is necessary to conduct a substantial amount of testing with an even larger text corpus than the experimental text corpus they utilized.

The research by F. Alam and colleagues resulted in the development of a speech synthesizer for the Bangla language [2,3]. The diphone concatenation method was used to build this system. It needs a pronunciation dictionary that tells it how to say words so that it can talk; there are 93,000 entries in the dictionary [3]. The proposed system builds voice data for Festival and adds support for the Bangla language to Festival using its embedded Scheme scripting interface. It turns Bangla Unicode text into ASCII text based on the Bangla phone set. However, there is no explanation of how the transliteration process works, nor any information about how the letter-to-sound (LTS) rules for words that are not in the lexicon were made.

In [4], the authors showed how a Bangla Text-to-Speech (TTS) system was designed and built from scratch without using third-party speech synthesis tools. The system was approached from two different angles: one based on phonemes and the other on syllables. This study was conducted at a very basic level, and the researchers used recordings of their own voices to produce the phonemes and syllables. The syllable-based method produced higher-quality speech than the phoneme-based method, but limited syllable and phoneme data were used in the development process.

In [5], the authors used a concatenative synthesis technique to make the system's speech sound natural. They proposed a system that converted Bangla text to Romanized text based on the Bangla grapheme set and a set of romanization rules. They used the MBROLA diphone database and did not develop their own database. Also, the sound quality is not particularly natural.

In [6], the author presents FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. Pitch contours are predicted by the model during inference. By changing these predictions, the generated speech can be made more expressive, better match the meaning of the utterance, and ultimately be more interesting to the listener.

In [7], the authors propose LightSpeech, which leverages neural architecture search (NAS) to automatically design more lightweight and efficient models based on FastSpeech. After thoroughly profiling the components of the current FastSpeech model, they carefully designed a new search space that includes a variety of lightweight and potentially efficient architectures. Then, within this search space, NAS is used to automatically find well-performing architectures. According to their experiments, the model found by their method achieved a 15x model compression ratio and a 6.5x inference speedup on CPU while maintaining comparable voice quality.

In [8], the authors built a rule-based system for normalizing Bangla text instead of a decision tree and a decision list for ambiguous tokens. In this paper, a lexical analyzer was developed to tokenize each NSW (Non-Standard Word) using regular expressions and the tool JFlex [9]. This was done based on semiotic classes. The main point of the work was that it was done in a sequence of tokenization, token classification, token sense disambiguation, and standard word generation. This work will be useful in the future because it combines TTS and speech recognition and compares the ways that rule-based systems and other classification systems handle ambiguity.

In [10], the authors developed an audio programming tool, based on text-to-speech technology, for blind and vision-impaired people learning programming. In this paper, they demonstrate how users of the tool can edit, compile, debug, and run programs, and the authors note that all of these stages can be voiced. They use C# as the programming language for evaluation, and VisualStudio.NET is used to create the tool. Evaluations demonstrated that the programming tool can support the implementation of software applications by blind and vision-impaired individuals and the achievement of equality of access and opportunity in information technology education. To communicate with a computer, vision-impaired people preferred to use mouse events, while blind people preferred to use keyboards with shortcut keys defined in JAWS. This means there is no inbuilt or intuitive systematic approach to handling the interaction with computers.

A diphone-based concatenative technique was utilized by the authors in the development of a speech synthesizer for the Bangla language [11]. In addition to this unique collection of words, the tokenization of null-modified characters has been presented in this study. This is an important and, to put it mildly, tough task for a text-to-speech (TTS) program.

From the authors' perspective, despite the fact that over 1.6 billion Muslims live in the world and that Arabic is spoken by millions of people as an official language in 24 different nations, it has received less attention than other languages [12]. These considerations highlight the necessity, from the authors' point of view, for an Arabic TTS that is of the highest quality, lightweight, and absolutely free. In their view, a rule-based system with an exception dictionary for words that do not follow the letter-to-phoneme rules might be a much more sensible approach, since the vowelized written text of Arabic carries the pronunciation rules with few exceptions. This study developed a rule-based text-to-speech hybrid synthesis system that combined formant and concatenation approaches to produce speech that sounds natural enough. However, due to the lack of significant stressed syllables and intonation, the overall system might not perform intuitively, nor handle the differences between Arabic accents.

CONCLUSION AND FUTURE DIRECTION
This review-based study has examined different Text-to-Speech (TTS) technologies and highlighted their advantages and limitations. The study has provided an overview of the basic functionalities of TTS systems and has shown how they have evolved over time, from rule-based systems to neural-based models. The study has also explored the impact of TTS on different industries, including education, entertainment, and healthcare. One of the key findings of this study is that recent advancements in deep learning have significantly improved the quality of TTS systems. However, there are still several challenges that need to be addressed, such as the lack of emotional expressiveness and naturalness in synthesized speech, which can affect the user experience. In terms of future directions, further research is needed to improve the performance of TTS systems in terms of naturalness, expressiveness, and intonation. This can be achieved by developing more advanced algorithms that can capture the nuances of human speech and emotion. Additionally, more studies are needed to evaluate the effectiveness of TTS in various applications, such as language learning and speech therapy. Overall, TTS technology has the potential to revolutionize the way we communicate and interact with machines. As the technology continues to evolve, it will become increasingly important to address the limitations and challenges of TTS to ensure that it can be used to its full potential.
REFERENCES
[1] Sadeque, F. Y., Yasar, S., & Islam, M. M. (2013, May). Bangla text to
speech conversion: A syllabic unit selection approach. In 2013 International
Conference on Informatics, Electronics and Vision (ICIEV) (pp. 1-6). IEEE.

[2] Firoj Alam, Promila Kanti Nath, Mumit Khan (2007) ‘Text to speech for Bangla language using festival’, BRAC University.

[3] Firoj Alam, Promila Kanti Nath, Mumit Khan (2011) ‘Bangla text to speech using festival’, Conference on Human Language Technology for Development, pp. 154-161.

[4] Arafat, M. Y., Fahrin, S., Islam, M. J., Siddiquee, M. A., Khan, A.,
Kotwal, M. R. A., & Huda, M. N. (2014, December). Speech synthesis for
bangla text to speech conversion. In The 8th International Conference on
Software, Knowledge, Information Management and Applications (SKIMA
2014) (pp. 1-6). IEEE.

[5] Ahmed, K. M., Mandal, P., & Hossain, B. M. (2019). Text to Speech
Synthesis for Bangla Language. International Journal of Information
Engineering and Electronic Business, 12(2), 1.

[6] A. Łańcucki, "Fastpitch: Parallel Text-to-Speech with Pitch Prediction," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6588-6592, doi: 10.1109/ICASSP39728.2021.9413889.

[7] R. Luo et al., "Lightspeech: Lightweight and Fast Text to Speech with
Neural Architecture Search," ICASSP 2021 - 2021 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp.
5699-5703, doi: 10.1109/ICASSP39728.2021.9414403.

[8] Firoj Alam, S.M. Murtoza Habib, Mumit Khan, “Text normalization
system for Bangla,” Proc. of Conf. on Language and Technology, Lahore,
pp. 22-24, 2009.

[9] Elliot Berk, JFlex - The Fast Scanner Generator for Java, 2004, version 1.4.1, http://jflex.de

[10] Tran, D., Haines, P., Ma, W., & Sharma, D. (2007, September). Text-to-speech technology-based programming tool. In International Conference On Signal, Speech and Image Processing.

[11] Rashid, M. M., Hussain, M. A., & Rahman, M. S. (2010). Text normalization and diphone preparation for bangla speech synthesis. Journal of Multimedia, 5(6), 551.

[12] Zeki, M., Khalifa, O. O., & Naji, A. W. (2010, May). Development of an Arabic text-to-speech system. In International Conference on Computer and Communication Engineering (ICCCE'10) (pp. 1-5). IEEE.
