CHAPTER
16 Text-to-Speech
“Words mean more than what is set down on paper. It takes the human voice
to infuse them with shades of deeper meaning.”
Maya Angelou, I Know Why the Caged Bird Sings
The task of mapping from text to speech has an even longer history than speech to text. In Vienna in 1769, Wolfgang von Kempelen built for the Empress Maria Theresa the famous Mechanical Turk, a chess-playing automaton consisting of a wooden box filled with gears, behind which sat a robot mannequin who played chess by moving pieces with his mechanical arm. The Turk toured Europe and the Americas for decades, defeating Napoleon Bonaparte and even playing Charles Babbage. The Mechanical Turk might have been one of the early successes of artificial intelligence were it not for the fact that it was, alas, a hoax, powered by a human chess player hidden inside the box.
What is less well known is that von Kempelen, an extraordinarily prolific inventor, also built between 1769 and 1790 what was definitely not a hoax: the first full-sentence speech synthesizer, shown partially to the right. His device consisted of a bellows to simulate the lungs, a rubber mouthpiece and a nose aperture, a reed to simulate the vocal folds, various whistles for the fricatives, and a small auxiliary bellows to provide the puff of air for plosives. By moving levers with both hands to open and close apertures, and adjusting the flexible leather “vocal tract”, an operator could produce different consonants and vowels.
More than two centuries later, we no longer build our synthesizers out of wood and leather, nor do we need human operators. The modern task of text-to-speech or TTS, also called speech synthesis, is exactly the reverse of ASR: to map text:

It’s time for lunch!

to an acoustic waveform:
Modern TTS systems do this mapping with a language model whose vocabulary includes both speech tokens and text tokens.
We train this language model to take as input two sequences, a text transcript and a small sample of speech from the desired talker, to tokenize both the text and the speech into discrete tokens, and then to conditionally generate discrete samples of the speech corresponding to the text string, in the desired voice. At inference time we prompt this language model with a tokenized text string and a sample of the desired voice (tokenized by the codec into discrete audio tokens) and conditionally generate to produce the desired audio tokens. Then these tokens can be converted into a waveform.
Such a system thus has two components:
1. The neural audio codec, or audio tokenizer, which we describe in the next section. Codecs have three parts: an encoder (that turns speech into embedding vectors), a quantizer (that turns the embeddings into discrete tokens), and a decoder (that turns the discrete tokens back into speech).
2. The 2-stage conditional language model that can generate audio tokens corresponding to the desired text. We’ll sketch this in Section 16.3.
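To make this division of labor concrete, here is a minimal Python sketch of the inference-time pipeline. Every function below is a dummy stand-in for a real trained component (text tokenizer, neural codec, audio language model); none of the names come from an actual library.

import numpy as np

def tokenize_text(text):
    """Stand-in text tokenizer: one integer id per character."""
    return [ord(c) for c in text]

def codec_encode(waveform):
    """Stand-in codec encoder: waveform -> (frames, n_codebooks) matrix of code ids."""
    n_frames = max(1, len(waveform) // 320)         # assumes ~320x downsampling
    return np.zeros((n_frames, 8), dtype=np.int64)  # assumes 8 codebooks; dummy codes

def codec_decode(codes):
    """Stand-in codec decoder: code matrix -> waveform."""
    return np.zeros(codes.shape[0] * 320, dtype=np.float32)

def lm_generate(text_tokens, prompt_codes, n_frames=200):
    """Stand-in conditional LM: audio codes for the text, in the prompt's voice."""
    return np.zeros((n_frames, 8), dtype=np.int64)

# Inference: the text to speak plus a short sample of the desired voice.
text_tokens  = tokenize_text("It's time for lunch!")
voice_sample = np.random.randn(3 * 24000).astype(np.float32)  # ~3 s of enrolled audio
prompt_codes = codec_encode(voice_sample)                     # discrete audio tokens
new_codes    = lm_generate(text_tokens, prompt_codes)         # conditional generation
waveform     = codec_decode(new_codes)                        # audio tokens back to audio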
[Figure 16.2 diagram: an audio waveform x is passed through an Encoder to embeddings z_t, through Quantization to discrete codes q_{t,1}, q_{t,2}, ..., q_{t,Nc} and a quantized vector z_{q_t}, and through a Decoder to the reconstruction x̂.]
Fig. 16.2, adapted from Mousavi et al. (2025), shows the standard architecture of an audio tokenizer. Audio tokenizers take as input an audio waveform, and are trained to recreate the same audio waveform as output, via an intermediate representation consisting of discrete tokens created by vector quantization.
Audio tokenizers have three stages:
1. an encoder maps the acoustic waveform, a series of T values x = x1 , x2 , ..., xT ,
to a sequence of τ embeddings z = z1 , z2 , ..., zτ . τ is typically 100-1000 times
smaller than T .
2. a vector quantizer that takes each embedding zt corresponding to part of the
waveform, and represents it by a sequence of discrete tokens each taken from
one of the Nc codebooks, qt = qt,1 , qt,2 , ..., qt,Nc . The vector quantizer also
sums the vector codewords from each codebook to create a quantizer output
vector zq t .
3. a decoder that generates a lossy reconstructed waveform span x̂ from the
quantizer output vector zq t .
Audio tokenizers are generally learned end-to-end, using loss functions that reward a tokenization that allows the system to reconstruct the input waveform.
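To get a feel for the sizes involved, here is a small worked example in Python, assuming the 24 kHz sampling rate and roughly 320x downsampling of the EnCodec tokenizer described below (the number of codebooks, 8, is explained in the RVQ section):

# Rough shape arithmetic for an audio tokenizer (illustrative numbers).
sample_rate = 24_000              # waveform samples per second
downsample  = 320                 # T / tau: encoder reduction factor
Nc          = 8                   # discrete codes per embedding frame
seconds     = 2.0

T   = int(seconds * sample_rate)  # 48,000 waveform samples
tau = T // downsample             # 150 embedding frames (75 per second)
print(T, tau, (tau, Nc))          # 48000 150 (150, 8): 150 frames x 8 codes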
In the following subsections we’ll go through the components of one particular tokenizer, the EnCodec tokenizer of Défossez et al. (2023).
[Figure 16.3 diagram: the Encoder is a Conv1D layer followed by a stack of EncoderBlocks (each containing Residual Units and a strided Conv1D with kernel K = 2S) with strides S = 2, 4, 5, 8 and channel widths 2C, 4C, 8C, 16C; the Decoder mirrors this with DecoderBlocks built from transposed convolutions (Conv1DT).]
Figure 16.3 The encoder and decoder stages of the EnCodec model. The goal of the encoder is to downsample an input waveform by encoding it as a series of embeddings zt at 75 Hz, i.e. 75 embeddings a second. Because the original signal was represented at 24 kHz, this is a downsampling of 24000/75 = 320 times. Between the encoder and decoder is a quantization step producing a lossy embedding z_{q_t}. The goal of the decoder is to take the lossy embedding z_{q_t} and upsample it, converting it back to a waveform.
The encoder and decoder of the EnCodec model (Défossez et al., 2023) are sketched in Fig. 16.3. The goal of the encoder is to downsample a span of waveform at time t, which is at 24 kHz (one second of speech has 24,000 real values), to an embedding representation zt at 75 Hz (one second of audio is represented by 75 vectors, each of dimensionality D). For the purposes of this explanation, we’ll use D = 256.
This downsampling is accomplished by a series of encoder blocks made up of convolutional layers with strides larger than 1 that iteratively downsample the audio, as we discussed at the end of Section ??. The convolution blocks are sketched in Fig. 16.3, and include a long series of convolutions as well as residual units that add a convolution to the prior input.
The output of the encoder is an embedding zt at time t, 75 of which are produced
per second. This embedding is then quantized (as discussed in the next section),
turning each embedding zt into a series of Nc discrete symbols qt = qt,1 , qt,2 , ..., qt,Nc ,
and also turning the series of symbols into a new quantizer output vector zq t . Fi-
nally, the decoder takes the output embedding from the quantizer zq t and generates
a waveform via a symmetric set of convnets that upsample the audio.
In summary, a 24kHz waveform comes through, we encode/downsample it into
a vector zt of dimensionality D = 256, quantize it into discrete symbols qt , turn it
back into a vector zq t of dimensionality D = 256, and then decode/upsample that
vector back into a waveform at 24kHz.
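The factor of 320 is just the product of the block strides in Fig. 16.3 (2 · 4 · 5 · 8 = 320). Here is a minimal PyTorch sketch of such a strided stack, purely to illustrate the shape bookkeeping; for simplicity each layer uses a non-overlapping kernel (K = S) and made-up channel sizes, whereas EnCodec itself uses wider kernels (K = 2S), residual units, and its own channel schedule.

import torch
import torch.nn as nn

# A strided Conv1d stack that downsamples 24 kHz audio by 2*4*5*8 = 320x.
channels = [32, 64, 128, 256]     # illustrative, not EnCodec's actual sizes
strides  = [2, 4, 5, 8]

layers, in_ch = [], 1
for out_ch, s in zip(channels, strides):
    layers += [nn.Conv1d(in_ch, out_ch, kernel_size=s, stride=s), nn.ELU()]
    in_ch = out_ch
encoder = nn.Sequential(*layers)

x = torch.randn(1, 1, 24_000)     # one second of audio at 24 kHz
z = encoder(x)
print(z.shape)                    # torch.Size([1, 256, 75]): 75 embeddings per second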
[Figure 16.4 diagram: a 256-dimensional vector output by the encoder is compared by similarity to each of the 1024 256-dimensional codewords in the codebook; the most similar is codeword 3, so the discrete symbol 3 is output.]
Figure 16.4 The basic VQ algorithm at inference time, after the codebook has been learned.
The input is a span of speech encoded by the encoder into a vector of dimensionality D =
256. This vector is compared with each codeword (cluster centroid) in the codebook. The
codeword for cluster 3 is most similar, so the VQ outputs 3 as the discrete representation of
this vector.
Codebooks are learned by clustering, for example with the k-means algorithm: in the assignment step, each training vector is assigned to the cluster whose codeword (centroid) it is closest to; in the estimation step, the codeword for each cluster is recomputed by recalculating a new mean vector. The result is that the clusters and their centroids slowly adjust to the training space. We iterate back and forth between these two steps until the algorithm converges.
VQ can also be used as part of end-to-end training, as we will discuss below, in which case, instead of iterative k-means, we recompute the means during minibatch training via online algorithms like exponential moving averages.
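As a concrete sketch of this kind of online update (in numpy; the decay value and sizes are illustrative, not EnCodec's exact recipe), each codeword is nudged toward the running mean of the encoder outputs currently assigned to it:

import numpy as np

rng = np.random.default_rng(0)
K, D, decay = 1024, 256, 0.99                  # codebook size, dimensionality, EMA decay
codebook    = rng.normal(size=(K, D))
ema_count   = np.ones(K)                       # running count of vectors per cluster
ema_sum     = codebook.copy()                  # running sum of assigned vectors

batch  = rng.normal(size=(512, D))             # encoder outputs z_t for one minibatch
dists  = (batch ** 2).sum(1, keepdims=True) + (codebook ** 2).sum(1) - 2 * batch @ codebook.T
assign = dists.argmin(axis=1)                  # nearest codeword for each vector

one_hot   = np.eye(K)[assign]                  # (512, K) assignment indicators
ema_count = decay * ema_count + (1 - decay) * one_hot.sum(axis=0)
ema_sum   = decay * ema_sum   + (1 - decay) * one_hot.T @ batch
codebook  = ema_sum / ema_count[:, None]       # updated centroids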
At the end of clustering, the cluster index can be used as a discrete symbol. Each cluster is also associated with a codeword, the vector which is the centroid of all the vectors in the cluster. We call the list of cluster ids (tokens) together with their codewords the codebook, and we often call the cluster id the code.
In inference, when a new vector comes in, we compare it to each codeword in the codebook. Whichever codeword is closest, we assign the vector to that codeword’s associated cluster. Fig. 16.4 shows an intuition of this inference step in the context of speech encoding:
1. an input speech waveform is encoded into a vector v,
2. this input vector v is compared to each of the 1024 possible codewords in the
codebook,
3. v is found to be most similar to codeword 3,
4. and so the output of VQ is the discrete symbol 3 as a representation of v.
As we will see below, for training the EnCodec model end-to-end we will need
a way to turn this discrete symbol back into a waveform. For simple VQ we do
that by directly using the codeword for that cluster, passing that codeword to the
decoder for it to reconstruct the waveform. Of course the codeword vector won’t
exactly match the original vector encoding of the input speech span, especially with
only 1024 possible codewords, but the hope is that it’s at least close if our codebook
is good, and the decoder will still produce reasonable speech. Nonetheless, more
powerful methods are usually used, as we’ll see in the next section.
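In code, this inference step is just a nearest-neighbor lookup; here is a minimal numpy sketch, with a random codebook standing in for a learned one (sizes follow the D = 256, 1024-codeword example above):

import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(1024, 256))   # 1024 codewords of dimensionality 256

def vq_encode(z):
    """Return the id (the 'code') of the codeword nearest to encoder output z."""
    dists = ((codebook - z) ** 2).sum(axis=1)
    return int(dists.argmin())

def vq_decode(code):
    """Simple VQ decoding: look up the codeword for this cluster id."""
    return codebook[code]

z    = rng.normal(size=256)    # an encoder output vector z_t
code = vq_encode(z)            # a single discrete symbol
z_q  = vq_decode(code)         # lossy stand-in for z, handed to the decoder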
Figure 16.5 Residual VQ (figure from Chen et al. (2025)). We run VQ on the encoder output embedding to produce a discrete symbol and corresponding codeword. We then look at the residual, the difference between the encoder output embedding zt and the codeword chosen by VQ. We then take a second codebook and run VQ on this residual. We repeat the process until we have 8 tokens.

The idea is very simple. We run standard VQ with a codebook just as in Fig. 16.4 in the prior section. Then for an input embedding zt we take the codeword vector that is produced, let’s call it z^(1)_{q_t} for the zt as quantized by codebook 1, and take the difference between the two:

residual = z_t − z^(1)_{q_t}    (16.1)

This residual is the error in the VQ; the part of the original vector that the VQ didn’t capture. The residual is kind of a rounding error; it’s as if in VQ we ‘round’ the vector to the nearest codeword, and that creates some error. So we then take that residual vector and pass it through another vector quantizer! That gives us a second codeword that represents the residual part of the vector. We then take the residual from the second codeword, and do this again. The total result is 8 codewords (the original codeword and the 7 residuals).

That means for RVQ we represent the original speech span by a sequence of 8 discrete symbols (instead of 1 discrete symbol in basic VQ). Fig. 16.5 shows the intuition.

What do we do when we want to reconstruct the speech? The method used in EnCodec RVQ is again simple: we take the 8 codewords and add them together! The resulting vector z_{q_t} is then passed through the decoder to generate a waveform.
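Extending the numpy sketch from the previous section, residual quantization takes only a few more lines (8 codebooks of 1024 codewords each; the random codebooks again stand in for learned ones):

import numpy as np

rng = np.random.default_rng(2)
N_C, K, D = 8, 1024, 256
codebooks = rng.normal(size=(N_C, K, D))       # one codebook per quantizer stage

def rvq_encode(z):
    """Quantize z with codebook 1, then quantize the residual, and iterate."""
    codes, residual = [], z.copy()
    for c in range(N_C):
        dists = ((codebooks[c] - residual) ** 2).sum(axis=1)
        q = int(dists.argmin())
        codes.append(q)
        residual = residual - codebooks[c, q]  # the part this stage didn't capture
    return codes

def rvq_decode(codes):
    """Reconstruction: sum the chosen codewords from all 8 stages."""
    return sum(codebooks[c, q] for c, q in enumerate(codes))

z     = rng.normal(size=D)
codes = rvq_encode(z)     # 8 discrete symbols for this frame
z_q   = rvq_decode(codes) # quantizer output vector, passed on to the decoder

Note that with 8 codebooks of 1024 entries, each 75 Hz frame costs 8 × log2(1024) = 80 bits, i.e. 6000 bits per second of audio.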
16.2.4 Training the EnCodec model

The EnCodec model (like similar audio tokenizer models) is trained end to end. The input is a waveform, a span of speech of perhaps 1 or 10 seconds extracted from a longer original waveform. The desired output is the same waveform span, since
the model is a kind of autoencoder that learns to map to itself. The model is trained
to do this reconstruction on large speech datasets like Common Voice (Ardila et al.,
2020) (over 30,000 hours of speech in 133 languages) as well as other audio data
like Audio Set (Gemmeke et al., 2017) (1.7 million 10 sec excerpts from YouTube
videos labeled from a large ontology including natural, animal, and machine sounds,
music, and so on).
[Figure 16.6 diagram: the encoder, quantization, and decoder pipeline of Fig. 16.2, annotated with the losses L_GAN, L_VQ, and L_reconstruction.]
Figure 16.6 Architecture of audio tokenizer training, figure adapted from Mousavi et al.
(2025). The audio tokenizer is trained with a weighted combination of various loss functions,
summarized in the figure and described below.
The EnCodec model, like most audio tokenizers, is trained with a number of loss functions, as suggested in Fig. 16.6. The reconstruction loss Lreconstruction measures how similar the output waveform is to the input waveform, for example by the sum-squared difference between the original and reconstructed audio:

Lreconstruction(x, x̂) = Σ_{t=1}^{T} ||x_t − x̂_t||²    (16.2)

The VQ loss LVQ measures the difference between the encoder output vector z_t and the reconstructed vector after the quantization z_{q_t}, i.e. the codeword, summed over all the Nc codebooks and residuals:

LVQ(x, x̂) = Σ_{t=1}^{T} Σ_{c=1}^{Nc} ||z^(c)_t − z^(c)_{q_t}||    (16.3)

The GAN loss LGAN is an adversarial loss, in which discriminators are trained to distinguish real from reconstructed audio and the codec is rewarded for producing reconstructions that fool them.
The total loss function can then just be a weighted sum of these losses:
L(x, x̂) = λ1 Lreconstruction (x, x̂) + λ2 LGAN (x, x̂) + λ3 LVQ (x, x̂) (16.4)
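As a rough numpy sketch of how these pieces fit together (using the time-domain reconstruction loss of Eq. 16.2 and the quantization loss of Eq. 16.3; the adversarial term is left as a placeholder argument, and the λ weights are made up rather than EnCodec's actual settings):

import numpy as np

def reconstruction_loss(x, x_hat):
    """Eq. 16.2: sum-squared difference between original and reconstructed audio."""
    return ((x - x_hat) ** 2).sum()

def vq_loss(z, z_q):
    """Eq. 16.3: distance between encoder outputs and their quantized versions,
    summed over time steps and codebooks; z and z_q have shape (T, Nc, D)."""
    return np.sqrt(((z - z_q) ** 2).sum(axis=-1)).sum()

def total_loss(x, x_hat, z, z_q, gan_loss=0.0, lam=(1.0, 3.0, 1.0)):
    """Eq. 16.4: weighted sum of the individual losses (weights are placeholders)."""
    l1, l2, l3 = lam
    return l1 * reconstruction_loss(x, x_hat) + l2 * gan_loss + l3 * vq_loss(z, z_q)

rng   = np.random.default_rng(3)
x     = rng.normal(size=24_000)                # 1 s of original audio
x_hat = x + 0.01 * rng.normal(size=24_000)     # pretend reconstruction
z     = rng.normal(size=(75, 8, 256))          # per-frame, per-codebook encoder targets
z_q   = z + 0.1 * rng.normal(size=z.shape)     # their quantized versions
print(total_loss(x, x_hat, z, z_q))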
[Figure 16.7 diagram: given the text and the audio prompt, the autoregressive transformer generates the first-quantizer codes c_{T′+1,1}, c_{T′+2,1}, ..., c_{T,1}; non-autoregressive transformers then generate the codes for the remaining quantizers.]
Figure 16.7 The 2-stage language modelling approach for VALL-E, showing the inference
stage for the autoregressive transformer and the first 3 of the 7 non-autoregressive transform-
ers. The output sequence of discrete audio codes is generated in two stages. First the au-
toregressive LM generates all the codes for the first quantizer from left to right. Then the
non-autoregressive model is called 7 times to generate the remaining codes conditioned on all
the codes from the preceding quantizer, including conditioning on the codes to the right.
Because each frame is represented by 8 codes, one from each quantizer, the language model generates the acoustic codes in two stages. First, an autoregressive LM generates the first-quantizer codes for the entire output sequence, given the
input text and enrolled audio. Then given those codes, a non-autoregressive LM is
run 7 times, each time taking as input the output of the initial autoregressive codes
and the prior non-autoregressive quantizer and thus generating the codes from the
remaining quantizers one by one. Fig. 16.7 shows the intuition for the inference
step.
Now let’s see the architecture in a bit more detail. For training, we are given an audio sample y and its tokenized text transcription x = [x0, x1, . . . , xL]. We use a pretrained EnCodec to convert y into a code matrix C. Let T be the number of downsampled vectors output by EnCodec, with 8 codes per vector. Then we can represent the encoder output as

C_{T×8} = EnCodec(y)    (16.5)

Here C is a two-dimensional acoustic code matrix with T × 8 entries, where the rows represent time and the columns represent the different quantizers. That is, the row vector c_{t,:} of the matrix contains the 8 codes for the t-th frame, and the column vector c_{:,j} contains the code sequence from the j-th vector quantizer, where j ∈ [1, ..., 8].
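In numpy-style indexing terms (purely illustrative):

import numpy as np

T = 200
C = np.zeros((T, 8), dtype=np.int64)  # acoustic code matrix: T frames x 8 quantizers
frame_codes   = C[10, :]              # c_{t,:}  the 8 codes for frame t = 10
quantizer_seq = C[:, 0]               # c_{:,j}  the code sequence from quantizer j = 1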
Given the text x and audio C, we train the TTS as a conditional code language
model to maximize the likelihood of C conditioned on x:
L = − log p(C|x)
  = − log Π_{t=0}^{T} p(c_{t,:} | c_{<t,:}, x)    (16.6)
Figure 16.8 Training procedure for VALL-E. Given the text prompt, the autoregressive transformer is first trained to generate each code of the first-quantizer code sequence, autoregressively. Then the non-autoregressive transformer generates the rest of the codes. Figure from Chen et al. (2025).
Fig. 16.8 shows the intuition. On the left, we have an audio sample and its transcription, and both are tokenized. Then we append an [EOS] and [BOS] token to x and an [EOS] token to the end of C and train the autoregressive transformer to predict the acoustic tokens, starting with c0,1, until [EOS], and then the non-autoregressive transformers to fill in the other tokens.

During inference, we are given a text sequence to be spoken, as well as y0, an enrolled speech sample from some unseen speaker, for which we have the transcript(y0). We first run the codec to get an acoustic code matrix for y0, which will be CP = C_{:T′,:} = [c_{0,:}, c_{1,:}, . . . , c_{T′,:}]. Next we concatenate the transcription of y0 to the text sequence to be spoken to create the total input text x, which we pass through a text tokenizer. At this stage we thus have a tokenized text x and a tokenized audio prompt CP.
Then we generate CT = C_{>T′,:} = [c_{T′+1,:}, . . . , c_{T,:}] conditioned on the text sequence x and the prompt CP:

CT = argmax_{CT} p(CT | CP, x)
   = argmax_{CT} Π_{t=T′+1}^{T} p(c_{t,:} | c_{<t,:}, x)    (16.7)
Then the generated tokens CT can be converted by the EnCodec decoder into a waveform. Fig. 16.9 shows the intuition.
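A schematic version of this two-stage decoding loop is sketched below. The two predictors are dummy stand-ins for the trained autoregressive and non-autoregressive transformers, and the enrolled-voice codes are kept as a prefix that conditions generation; only the frames after the prompt correspond to new speech.

import numpy as np

EOS, N_Q = -1, 8    # end-of-sequence marker, number of quantizers (codebooks)

def ar_predict_next(text_tokens, codes_q1):
    """Stand-in for the autoregressive transformer: next first-quantizer code."""
    return EOS if len(codes_q1) >= 200 else 0

def nar_predict_codes(text_tokens, codes_so_far, j):
    """Stand-in for the non-autoregressive transformer: all codes for quantizer j."""
    return np.zeros(codes_so_far.shape[0], dtype=np.int64)

def synthesize(text_tokens, prompt_codes):
    # Stage 1: generate the first-quantizer codes left to right, after the prompt,
    # until the model produces EOS.
    codes_q1 = list(prompt_codes[:, 0])
    while True:
        nxt = ar_predict_next(text_tokens, codes_q1)
        if nxt == EOS:
            break
        codes_q1.append(nxt)
    T = len(codes_q1)
    C = np.zeros((T, N_Q), dtype=np.int64)
    C[:, 0] = codes_q1
    # Stage 2: call the non-autoregressive model 7 times, one quantizer at a time,
    # each call conditioned on all codes generated so far (to the left and the right).
    for j in range(1, N_Q):
        C[:, j] = nar_predict_codes(text_tokens, C[:, :j], j)
    return C    # T x 8 code matrix, handed to the codec decoder

text_tokens  = [1, 2, 3]                            # tokenized text (dummy)
prompt_codes = np.zeros((75, N_Q), dtype=np.int64)  # 1 s of enrolled-voice codes (dummy)
print(synthesize(text_tokens, prompt_codes).shape)  # (200, 8)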
See Chen et al. (2025) for more details on the transformer components and other details of training.

16.4 TTS Evaluation

TTS systems are evaluated by humans, by playing an utterance to listeners and asking them to give a mean opinion score (MOS), a rating of how good the synthesized utterances are, usually on a scale from 1–5. We can then compare systems by comparing their MOS scores on the same sentences (using, e.g., paired t-tests to test for significant differences).
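For instance, given per-sentence MOS values for the same test sentences from two systems, the comparison might look like the following sketch (scipy's ttest_rel implements the paired t-test; the ratings here are invented):

import numpy as np
from scipy import stats

# MOS ratings (1-5) for the same 10 sentences from two TTS systems, each value
# already averaged over listeners. The numbers are made up for illustration.
system_a = np.array([4.1, 3.8, 4.4, 3.9, 4.2, 4.0, 3.7, 4.3, 4.1, 3.9])
system_b = np.array([3.9, 3.6, 4.1, 3.8, 4.0, 3.9, 3.5, 4.0, 3.8, 3.7])

print("MOS A:", system_a.mean(), "MOS B:", system_b.mean())
t, p = stats.ttest_rel(system_a, system_b)   # paired t-test over the same sentences
print(f"t = {t:.2f}, p = {p:.4f}")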
16.7 Summary
This chapter introduced the fundamental algorithms of text-to-speech (TTS).
• A common modern algorithm for TTS is to use conditional generation with a
language model over audio tokens learned by a codec model.
• A neural audio codec, short for coder/decoder, is a system that encodes analog speech signals into a digitized, discrete, compressed representation.
• The discrete symbols that a codec produces as its compressed representation
can be used as discrete codes for language modeling.
• A codec includes an encoder that uses convnets to downsample speech into a lower-rate sequence of embeddings, a quantizer that converts each embedding into a series of discrete tokens, and a decoder that uses convnets to upsample the tokens/embeddings back into a lossy reconstructed waveform.
• Vector Quantization (VQ) is a method for turning a series of vectors into a series of discrete symbols. This can be done by using k-means clustering and then creating a codebook in which each code is represented by a vector at the centroid of its cluster, called a codeword. An input vector is then assigned the code of the nearest codeword.
• Residual Vector Quantization (RVQ) is a hierarchical version of vector quantization that produces multiple codes for an input vector: it first quantizes the vector with one codebook, then quantizes the residual (the difference between the chosen codeword and the input vector), and iterates.
• TTS systems like VALL-E take a text to be synthesized and a sample of the
voice to be used, tokenize with BPE (text) and an audio codec (speech) and
then use an LM to conditionally generate discrete audio tokens corresponding
to the text prompt, in the voice of the speech sample.
• TTS is evaluated by playing a sentence to human listeners and having them
give a mean opinion score (MOS).
Historical Notes
As we noted at the beginning of the chapter, speech synthesis is one of the earliest
fields of speech and language processing. The 18th century saw a number of physical
models of the articulation process, including the von Kempelen model mentioned
above, as well as the 1773 vowel model of Kratzenstein in Copenhagen using organ
pipes.
The early 1950s saw the development of three early paradigms of waveform
synthesis: formant synthesis, articulatory synthesis, and concatenative synthesis.
Formant synthesizers originally were inspired by attempts to mimic human
speech by generating artificial spectrograms. The Haskins Laboratories Pattern
Playback Machine generated a sound wave by painting spectrogram patterns on a
moving transparent belt and using reflectance to filter the harmonics of a wave-
form (Cooper et al., 1951); other very early formant synthesizers include those of
Lawrence (1953) and Fant (1951). Perhaps the most well-known of the formant
synthesizers were the Klatt formant synthesizer and its successor systems, includ-
ing the MITalk system (Allen et al., 1987) and the Klattalk software used in Digital
Equipment Corporation’s DECtalk (Klatt, 1982). See Klatt (1975) for details.
14 C HAPTER 16 • T EXT- TO -S PEECH
A second early paradigm, concatenative synthesis, seems to have been first pro-
posed by Harris (1953) at Bell Laboratories; he literally spliced together pieces of
magnetic tape corresponding to phones. Soon afterwards, Peterson et al. (1958) pro-
posed a theoretical model based on diphones, including a database with multiple
copies of each diphone with differing prosody, each labeled with prosodic features
including F0, stress, and duration, and the use of join costs based on F0 and formant
distance between neighboring units. But such diphone synthesis models were not
actually implemented until decades later (Dixon and Maxey 1968, Olive 1977). The
1980s and 1990s saw the invention of unit selection synthesis, based on larger units of non-uniform length and the use of a target cost (Sagisaka 1988, Sagisaka et al. 1992, Hunt and Black 1996, Black and Taylor 1994, Syrdal et al. 2000).
A third paradigm, articulatory synthesis, attempts to synthesize speech by modeling the physics of the vocal tract as an open tube. Representative models
include Stevens et al. (1953), Flanagan et al. (1975), and Fant (1986). See Klatt
(1975) and Flanagan (1972) for more details.
Most early TTS systems used phonemes as input; development of the text anal-
ysis components of TTS came somewhat later, drawing on NLP. Indeed the first
true text-to-speech system seems to have been the system of Umeda and Teranishi
(Umeda et al. 1968, Teranishi and Umeda 1968, Umeda 1976), which included a
parser that assigned prosodic boundaries, as well as accent and stress.
History of codecs and modern history of neural TTS TBD.
Exercises
Allen, J., M. S. Hunnicut, and D. H. Klatt. 1987. From Text to Speech: The MITalk system. Cambridge University Press.
Ardila, R., M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber. 2020. Common voice: A massively-multilingual speech corpus. LREC.
Black, A. W. and P. Taylor. 1994. CHATR: A generic speech synthesis system. COLING.
Chen, S., C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei. 2025. Neural codec language models are zero-shot text to speech synthesizers. IEEE Trans. on TASLP, 33:705–718.
Cooper, F. S., A. M. Liberman, and J. M. Borst. 1951. The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences, 37(5):318–325.
Défossez, A., J. Copet, G. Synnaeve, and Y. Adi. 2023. High fidelity neural audio compression. TMLR.
Dixon, N. and H. Maxey. 1968. Terminal analog synthesis of continuous speech using the diphone method of segment assembly. IEEE Transactions on Audio and Electroacoustics, 16(1):40–50.
Fant, G. M. 1951. Speech communication research. Ing. Vetenskaps Akad. Stockholm, Sweden, 24:331–337.
Fant, G. M. 1986. Glottal flow: Models and interaction. Journal of Phonetics, 14:393–399.
Flanagan, J. L. 1972. Speech Analysis, Synthesis, and Perception. Springer.
Flanagan, J. L., K. Ishizaka, and K. L. Shipley. 1975. Synthesis of speech from a dynamic model of the vocal cords and vocal tract. The Bell System Technical Journal, 54(3):485–506.
Gemmeke, J. F., D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. ICASSP.
Gray, R. M. 1984. Vector quantization. IEEE Transactions on ASSP, ASSP-1(2):4–29.
Harris, C. M. 1953. A study of the building blocks in speech. JASA, 25(5):962–969.
Hunt, A. J. and A. W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. ICASSP.
Klatt, D. H. 1975. Voice onset time, friction, and aspiration in word-initial consonant clusters. Journal of Speech and Hearing Research, 18:686–706.
Klatt, D. H. 1982. The Klattalk text-to-speech conversion system. ICASSP.
Lawrence, W. 1953. The synthesis of speech from signals which have a low information rate. In W. Jackson, ed., Communication Theory, 460–469. Butterworth.
Mousavi, P., G. Maimon, A. Moumen, D. Petermann, J. Shi, H. Wu, H. Yang, A. Kuznetsova, A. Ploujnikov, R. Marxer, B. Ramabhadran, B. Elizalde, L. Lugosch, J. Li, C. Subakan, P. Woodland, M. Kim, H.-y. Lee, S. Watanabe, Y. Adi, and M. Ravanelli. 2025. Discrete audio tokens: More than a survey! ArXiv preprint.
Olive, J. P. 1977. Rule synthesis of speech from dyadic units. ICASSP.
Peterson, G. E., W. S.-Y. Wang, and E. Sivertsen. 1958. Segmentation techniques in speech synthesis. JASA, 30(8):739–742.
Sagisaka, Y. 1988. Speech synthesis by rule using an optimal selection of non-uniform synthesis units. ICASSP.
Sagisaka, Y., N. Kaiki, N. Iwahashi, and K. Mimura. 1992. ATR ν-talk speech synthesis system. ICSLP.
Stevens, K. N., S. Kasowski, and G. M. Fant. 1953. An electrical analog of the vocal tract. JASA, 25(4):734–742.
Syrdal, A. K., C. W. Wightman, A. Conkie, Y. Stylianou, M. Beutnagel, J. Schroeter, V. Strom, and K.-S. Lee. 2000. Corpus-based techniques in the AT&T NEXTGEN synthesis system. ICSLP.
Teranishi, R. and N. Umeda. 1968. Use of pronouncing dictionary in speech synthesis experiments. 6th International Congress on Acoustics.
Umeda, N. 1976. Linguistic rules for text-to-speech synthesis. Proceedings of the IEEE, 64(4):443–451.
Umeda, N., E. Matui, T. Suzuki, and H. Omura. 1968. Synthesis of fairy tale using an analog vocal tract. 6th International Congress on Acoustics.
Van Den Oord, A., O. Vinyals, and K. Kavukcuoglu. 2017. Neural discrete representation learning. NeurIPS.