
NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

Jonathan Shen1, Ruoming Pang1, Ron J. Weiss1, Mike Schuster1, Navdeep Jaitly1, Zongheng Yang∗2, Zhifeng Chen1, Yu Zhang1, Yuxuan Wang1, RJ Skerry-Ryan1, Rif A. Saurous1, Yannis Agiomyrgiannakis1, and Yonghui Wu1

1 Google, Inc.    2 University of California, Berkeley
∗ Work done while at Google.
{jonathanasdf,rpang,yonghui}@google.com

ABSTRACT

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

Index Terms— Tacotron 2, WaveNet, text-to-speech

1. INTRODUCTION

Generating natural speech from text (text-to-speech synthesis, TTS) remains a challenging task despite decades of investigation [1]. Over time, different techniques have dominated the field. Concatenative synthesis with unit selection, the process of stitching small units of pre-recorded waveforms together [2, 3], was the state-of-the-art for many years. Statistical parametric speech synthesis [4, 5, 6, 7], which directly generates smooth trajectories of speech features to be synthesized by a vocoder, followed, solving many of the issues that concatenative synthesis had with boundary artifacts. However, the audio produced by these systems often sounds muffled and unnatural compared to human speech.

WaveNet [8], a generative model of time domain waveforms, produces audio fidelity that begins to rival that of real human speech and is already used in some complete TTS systems [9, 10, 11]. The inputs to WaveNet (linguistic features, predicted log fundamental frequency (F0), and phoneme durations), however, require significant domain expertise to produce, involving elaborate text-analysis systems as well as a robust lexicon (pronunciation guide).

Tacotron [12], a sequence-to-sequence architecture [13] for producing magnitude spectrograms from a sequence of characters, simplifies the traditional speech synthesis pipeline by replacing the production of these linguistic and acoustic features with a single neural network trained from data alone. To vocode the resulting magnitude spectrograms, Tacotron uses the Griffin-Lim algorithm [14] for phase estimation, followed by an inverse short-time Fourier transform. As the authors note, this was simply a placeholder for future neural vocoder approaches, as Griffin-Lim produces characteristic artifacts and lower audio fidelity than approaches like WaveNet.

In this paper, we describe a unified, entirely neural approach to speech synthesis that combines the best of the previous approaches: a sequence-to-sequence Tacotron-style model [12] that generates mel spectrograms, followed by a modified WaveNet vocoder [10, 15]. This system allows end-to-end learning of TTS directly from character sequences and speech waveforms, yielding natural sounding speech that approaches the audio fidelity of real human speech.

Deep Voice 3 [11] describes a similar approach. However, unlike our system, its audio fidelity has not been shown to rival that of human speech. Char2Wav [16] describes yet another similar approach to end-to-end TTS using a neural vocoder. However, they use different intermediate representations (traditional vocoder features) and their model architecture differs significantly.

2. MODEL ARCHITECTURE

Our proposed system consists of two components, shown in Figure 1: (1) a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence, and (2) a modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.

2.1. Intermediate Feature Representation

In this work, we choose a low-level acoustic representation: mel-frequency spectrograms, to bridge the two components of our system. Using a representation that is easily computed from time-domain waveforms allows us to train the two components separately. This representation is also smoother than waveform samples and is easier to train using a mean squared error loss because it is invariant to phase within each frame.

A mel-frequency spectrogram is related to the linear-frequency spectrogram, i.e. the short-time Fourier transform (STFT) magnitude. It is obtained by applying a nonlinear transform to the frequency axis of the STFT, inspired by measured responses from the human auditory system, and summarizes the frequency content with fewer dimensions. Using such an auditory frequency scale has the effect of emphasizing details in lower frequencies, which are critical to speech intelligibility, while de-emphasizing high frequency details, which are dominated by fricatives and other noise bursts and generally do not need to be modeled with high fidelity. Because of these properties, features derived from the mel scale have been used as an underlying representation for speech recognition for many decades [17].

While linear spectrograms discard phase information (and are therefore lossy), algorithms such as Griffin-Lim [14] are capable of estimating this discarded information, which enables time-domain conversion via the inverse short-time Fourier transform. Mel spectrograms discard even more information, presenting a challenging inverse problem. However, in comparison to the linguistic and acoustic features used in WaveNet, the mel spectrogram is a simpler, lower-level acoustic representation of audio signals. It should therefore be straightforward for a similar WaveNet model conditioned on mel spectrograms to generate audio, essentially as a neural vocoder. Indeed, we will show that it is possible to generate high quality audio from mel spectrograms using a modified WaveNet architecture.

Fig. 1. Block diagram of the system architecture: character embedding → 3 conv layers → bi-directional LSTM encoder → location sensitive attention → 2 layer pre-net → 2 LSTM layers → linear projection → 5 conv layer post-net → mel spectrogram → WaveNet MoL → waveform samples.

2.2. Spectrogram Prediction Network

As in Tacotron, mel spectrograms are computed through a short-time Fourier transform (STFT) using a 50 ms frame size, 12.5 ms frame hop, and a Hann window function. We transform the STFT magnitude to the mel-scale using an 80 channel mel filterbank spanning 125 Hz to 7.6 kHz, followed by log dynamic range compression. Prior to log compression, the filterbank output magnitudes are stabilized to a floor of 0.01 in order to limit dynamic range in the logarithmic domain.
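To make the feature extraction concrete, the following is a minimal sketch of this computation using librosa; at the 24 kHz sampling rate used later in the paper, the 50 ms window and 12.5 ms hop correspond to 1200 and 300 samples, while the FFT size, the use of magnitude (rather than power) spectra, and the natural-log compression are assumptions not pinned down in the text.

```python
import librosa
import numpy as np

def mel_spectrogram(wav, sr=24000):
    """Sketch of the paper's feature extraction: 50 ms frames, 12.5 ms hop,
    Hann window, 80-channel mel filterbank (125 Hz - 7.6 kHz), log compression
    with a 0.01 floor. The FFT size of 2048 is an assumption."""
    stft = librosa.stft(wav, n_fft=2048, hop_length=int(0.0125 * sr),
                        win_length=int(0.050 * sr), window="hann")
    magnitude = np.abs(stft)                              # (1025, n_frames)
    mel_basis = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=80,
                                    fmin=125, fmax=7600)  # (80, 1025)
    mel = mel_basis @ magnitude                           # (80, n_frames)
    return np.log(np.maximum(mel, 0.01))                  # floor, then log
```

At 24 kHz this yields one 80-dimensional frame every 12.5 ms, the representation both the feature prediction network and the vocoder operate on.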
The network is composed of an encoder and a decoder with attention. The encoder converts a character sequence into a hidden feature representation which the decoder consumes to predict a spectrogram. Input characters are represented using 512-dimensional character embeddings, which are passed through a stack of 3 convolutional layers each containing 512 filters with shape 5 × 1, i.e. where each filter spans 5 characters, followed by batch normalization [18] and ReLU activations. As in Tacotron, these convolutional layers model longer-term context (e.g. N-grams) in the input character sequence. The output of the final convolutional layer is passed into a single bi-directional [19] LSTM [20] layer containing 512 units (256 in each direction) to generate the encoded features.
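A minimal PyTorch sketch of this encoder, shown mainly to pin down the tensor shapes; the vocabulary size and the padding convention are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Character embedding -> 3 conv layers (512 filters, width 5, BN + ReLU,
    dropout 0.5) -> single bi-directional LSTM with 256 units per direction."""
    def __init__(self, n_symbols=148):  # vocabulary size is an assumption
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, 512)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv1d(512, 512, kernel_size=5, padding=2),
                          nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.5))
            for _ in range(3)])
        self.lstm = nn.LSTM(512, 256, batch_first=True, bidirectional=True)

    def forward(self, chars):                       # chars: (batch, text_len) int64
        x = self.embedding(chars).transpose(1, 2)   # (batch, 512, text_len)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                       # (batch, text_len, 512)
        outputs, _ = self.lstm(x)                   # (batch, text_len, 512)
        return outputs
```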
The encoder output is consumed by an attention network which summarizes the full encoded sequence as a fixed-length context vector for each decoder output step. We use the location sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder. Attention probabilities are computed after projecting inputs and location features to 128-dimensional hidden representations. Location features are computed using 32 1-D convolution filters of length 31.
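The following sketch spells out one plausible reading of this mechanism: an additive (Bahdanau-style) score augmented with features convolved from the cumulative attention weights. Bias handling, initialization, the choice of decoder state as the query, and masking of padded positions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Additive attention [22] extended with location features computed from the
    cumulative attention weights [21]: 32 conv filters of length 31, with all
    terms projected to a 128-dim space before scoring."""
    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, 32, kernel_size=31, padding=15, bias=False)
        self.location_proj = nn.Linear(32, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cum_weights):
        # query: (B, query_dim), memory: (B, T, memory_dim), cum_weights: (B, T)
        loc = self.location_conv(cum_weights.unsqueeze(1)).transpose(1, 2)  # (B, T, 32)
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1)
            + self.memory_proj(memory)
            + self.location_proj(loc))).squeeze(-1)                        # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)       # (B, memory_dim)
        return context, weights
```

Accumulating `weights` over decoder steps and feeding the running sum back in as `cum_weights` is what biases the model to keep moving forward through the input.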
The decoder is an autoregressive recurrent neural network which predicts the output spectrogram from the encoded input sequence one frame at a time. The prediction from the previous time step is first passed through a small "pre-net" containing 2 fully connected layers of 256 hidden ReLU units. We found that the pre-net acting as an information bottleneck was essential for learning attention. The pre-net output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units. The concatenation of the LSTM output and the attention context vector is then projected through a linear transform to produce a prediction of the target spectrogram frame. Finally, the predicted features are passed through a 5-layer convolutional "post-net" which predicts a residual to add to the initial prediction to improve the overall reconstruction. Each post-net layer is comprised of 512 filters with shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer.

We minimize the summed mean squared error (MSE) from before and after the post-net to aid convergence. We also experimented with a log-likelihood loss by modeling the output distribution with a Mixture Density Network [23, 24] to avoid assuming a constant variance over time, but found that these were more difficult to train and they did not lead to better sounding samples.

In parallel to spectrogram frame prediction, the concatenation of decoder LSTM output and the attention context is projected down to a scalar and passed through a sigmoid activation to predict the probability that the output sequence has completed. This "stop token" prediction is used during inference to allow the model to dynamically determine when to terminate generation instead of always generating for a fixed duration.
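A sketch of a single decoder step under this description, taking the context vector produced by an attention module like the one sketched above as an input; state bookkeeping, zoneout, the post-net (applied to the whole predicted sequence), and the stop threshold used at inference (commonly 0.5) are simplifications or assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive step: pre-net -> 2 LSTM cells (1024 units) fed with the
    attention context -> linear projections for the 80-dim mel frame and the
    scalar stop token."""
    def __init__(self, n_mels=80, context_dim=512):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                    nn.Linear(256, 256), nn.ReLU())
        self.lstm1 = nn.LSTMCell(256 + context_dim, 1024)
        self.lstm2 = nn.LSTMCell(1024, 1024)
        self.frame_proj = nn.Linear(1024 + context_dim, n_mels)
        self.stop_proj = nn.Linear(1024 + context_dim, 1)

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        out = torch.cat([h2, context], dim=-1)
        frame = self.frame_proj(out)        # predicted mel frame (before post-net)
        stop_logit = self.stop_proj(out)    # sigmoid(stop_logit) > 0.5 ends inference
        return frame, stop_logit, ((h1, c1), (h2, c2))
```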
The convolutional layers in the network are regularized using dropout [25] with probability 0.5, and LSTM layers are regularized using zoneout [26] with probability 0.1. In order to introduce output variation at inference time, dropout with probability 0.5 is applied only to layers in the pre-net of the autoregressive decoder.

In contrast to Tacotron, our model uses simpler building blocks, using vanilla LSTM and convolutional layers in the encoder and decoder instead of "CBHG" stacks and GRU recurrent layers. We do not use a "reduction factor", i.e. each decoder step corresponds to a single spectrogram frame.

2.3. WaveNet Vocoder

We use a modified version of the WaveNet architecture from [8] to invert the mel spectrogram feature representation into time-domain waveform samples. As in the original architecture, there are 30 dilated convolution layers, grouped into 3 dilation cycles, i.e. the dilation rate of layer k (k = 0 . . . 29) is 2^(k mod 10).
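The dilation schedule is easy to write out explicitly. A small sketch follows; the receptive field formula assumes non-causal filters of width 3, which is an inference from the receptive field sizes reported later in Table 4 rather than something stated in the text.

```python
def dilation_rates(n_layers=30, cycle_size=10):
    """Dilation of layer k is 2 ** (k mod cycle_size): three cycles of
    1, 2, 4, ..., 512 for the 30-layer baseline."""
    return [2 ** (k % cycle_size) for k in range(n_layers)]

def receptive_field(dilations, kernel_size=3):
    """Receptive field (in samples) of a stack of dilated convolutions.
    kernel_size=3 is inferred from Table 4 (6,139 samples for 30 layers)."""
    return 1 + (kernel_size - 1) * sum(dilations)

rates = dilation_rates()         # [1, 2, 4, ..., 512] repeated three times
print(receptive_field(rates))    # 6139 samples, i.e. ~255.8 ms at 24 kHz
```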
However, instead of predicting discretized buckets with a softmax layer, we follow PixelCNN++ [27] and recent improvements to WaveNet [28] and use a 10-component mixture of logistic distributions (MoL) to generate 16-bit samples at 24 kHz. To compute the logistic mixture distribution, the WaveNet stack output is passed through a ReLU activation followed by a linear projection layer to predict parameters (mean, log scale, mixture weight) for each mixture component. The loss is computed as the negative log-likelihood of the ground truth sample.
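A sketch of such an output head: the stack output passes through ReLU and a linear layer producing three parameters per mixture component, and a sample is drawn by picking a component and inverting the logistic CDF. The channel width and the omission of the discretization used in the PixelCNN++-style likelihood are assumptions and simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLHead(nn.Module):
    """ReLU + linear projection predicting (mixture weight logit, mean, log scale)
    for each of 10 logistic components, per output time step."""
    def __init__(self, stack_channels=512, n_mix=10):
        super().__init__()
        self.proj = nn.Linear(stack_channels, 3 * n_mix)

    def forward(self, stack_out):                       # (batch, time, stack_channels)
        params = self.proj(F.relu(stack_out))
        logit_probs, means, log_scales = params.chunk(3, dim=-1)
        return logit_probs, means, log_scales

    def sample(self, stack_out):
        logit_probs, means, log_scales = self.forward(stack_out)
        # Pick a mixture component, then invert the logistic CDF (u ~ Uniform(0,1)).
        comp = torch.distributions.Categorical(logits=logit_probs).sample()
        mean = means.gather(-1, comp.unsqueeze(-1)).squeeze(-1)
        scale = log_scales.gather(-1, comp.unsqueeze(-1)).squeeze(-1).exp()
        u = torch.rand_like(mean)
        return mean + scale * (torch.log(u) - torch.log1p(-u))
```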
The original WaveNet used linguistic features, phoneme durations, and log F0 at a frame rate of 5 ms. In our experiments we noticed significant pronunciation issues when predicting spectrogram frames spaced this closely, so we modified the WaveNet architecture to work with 12.5 ms feature spacing by using only 2 upsampling layers in the transposed convolutional network.

3. EXPERIMENTS & RESULTS

3.1. Training Setup

Our training process involves first training the feature prediction network on its own, followed by training a modified WaveNet independently on the outputs generated by the first network.

To train the feature prediction network, we apply the standard maximum-likelihood training procedure (feeding in the correct output instead of the predicted output on the decoder side, also referred to as teacher-forcing) with a batch size of 64 on a single GPU. We use the Adam optimizer [29] with β1 = 0.9, β2 = 0.999, ε = 10^-6 and a learning rate of 10^-3 exponentially decaying to 10^-5 starting after 50,000 iterations. We also apply L2 regularization with weight 10^-6.

We then train our modified WaveNet on the ground truth-aligned predictions of the feature prediction network. That is, these predictions are produced in teacher-forcing mode so each spectrogram frame exactly aligns with the target waveform samples. We train with a batch size of 128 distributed across 32 GPUs with synchronous updates, using the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10^-8 and a fixed learning rate of 10^-4. It helps quality to average model weights over recent updates. Therefore we maintain an exponentially-weighted moving average of the network parameters over update steps with a decay of 0.9999 – this version is used for inference (see also [29]). To speed up convergence, we scale the waveform targets by a factor of 127.5. This scaling brings the initial outputs of the mixture of logistics layer closer to the eventual distributions.
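A minimal PyTorch sketch of this optimization recipe for the feature prediction network; the exact shape of the decay curve after iteration 50,000, the number of steps over which it reaches the 10^-5 floor, and the use of Adam weight decay as a stand-in for the L2 term are assumptions.

```python
import torch

def make_optimizer(model, decay_steps=250_000):
    """Adam (beta1=0.9, beta2=0.999, eps=1e-6) with L2 weight 1e-6; the learning
    rate decays exponentially from 1e-3 to 1e-5 starting after 50,000 steps.
    decay_steps (how quickly the floor is reached) is an assumption."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                           eps=1e-6, weight_decay=1e-6)
    def lr_lambda(step):
        if step < 50_000:
            return 1.0
        frac = min((step - 50_000) / decay_steps, 1.0)
        return (1e-5 / 1e-3) ** frac        # multiplicative factor on the base 1e-3
    return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

def ema_update(model, shadow, decay=0.9999):
    """Exponentially-weighted moving average of parameters (decay 0.9999); the
    averaged weights are the ones used for inference with the WaveNet vocoder."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            shadow[name].mul_(decay).add_(p, alpha=1.0 - decay)
```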
We train all models on an internal US English dataset, which contains 24.6 hours of speech from a single professional female speaker. All text in our datasets is spelled out, e.g. "16" is written as "sixteen", i.e. our models are all trained on pre-normalized text.

3.2. Evaluation

When generating speech in inference mode, the ground truth targets are not known. Therefore, the predicted outputs from the previous step are fed in during decoding, in contrast to the teacher-forcing configuration used for training.

We randomly selected 100 fixed examples from the test set as the evaluation set. Audio generated on this set is sent to a human rating service similar to Amazon's Mechanical Turk where each sample is rated by at least 8 raters on a scale from 1 to 5 with 0.5 point increments, from which a subjective mean opinion score (MOS) is calculated. Each evaluation is conducted independently from the others, so the outputs of two different models are not directly compared when asking raters to assign a score to them.

Note that while instances in the evaluation set never appear in the training set, there are some recurring patterns and common words between the two sets. While this could potentially result in an inflated MOS compared to an evaluation set consisting of sentences generated from random words, using this set allows us to compare to the ground truth. Since all the systems we compare are trained on the same data, relative comparisons are still meaningful.

Table 1 shows a comparison of our method against various prior systems. In order to better isolate the effect of using mel spectrograms as features, we compare to a WaveNet conditioned on linguistic features with similar modifications to the WaveNet architecture as introduced above. We also compare to the original Tacotron that predicts linear spectrograms and uses Griffin-Lim to synthesize audio, as well as concatenative [30] and parametric [31] baseline systems, both of which have been used in production at Google. We find that the proposed system significantly outperforms all other TTS systems, and results in an MOS comparable to that of the ground truth audio.

Name                        MOS
Parametric                  3.492 ± 0.096
Tacotron (Griffin-Lim)      4.001 ± 0.087
Concatenative               4.166 ± 0.091
WaveNet (Linguistic)        4.341 ± 0.051
Ground Truth                4.582 ± 0.053
Tacotron 2 (this paper)     4.526 ± 0.066

Table 1. Mean Opinion Score (MOS) evaluations with 95% confidence intervals for various systems.

We also conduct a side-by-side evaluation between audio synthesized by our system and the ground truth. For each pair of utterances, raters are asked to give a score ranging from -3 (synthesized much worse than ground truth) to 3 (synthesized much better than ground truth). The overall mean score of −0.270 ± 0.155 shows that raters have a small but statistically significant preference towards ground truth over our results. See Figure 2 for a detailed breakdown. The comments from raters indicate that occasional mispronunciation by our system is the primary reason for this preference.

Fig. 2. Synthesized vs. ground truth: 800 ratings on 100 items.

We manually analyze the error modes of our system on the custom 100-sentence test set from Appendix E of [11]. Within the audio generated from those sentences, 0 contained repeated words, 6 contained mispronunciations, 1 contained skipped words, and 23 were subjectively decided to contain unnatural prosody, such as emphasis on the wrong syllables or words, or unnatural pitch. In one case, the longest sentence, end-point prediction failed. Overall, our model achieves a MOS of 4.354 on these inputs. These results show that while our system is able to reliably attend to the entire input, there is still room for improvement in prosody modeling.

Finally, we evaluate samples generated from 37 news headlines to test the generalization ability of our system to out-of-domain text. On this task, our model receives a MOS of 4.148 ± 0.124 while WaveNet conditioned on linguistic features receives a MOS of 4.137 ± 0.128. A side-by-side evaluation comparing speech generated by these systems also shows a virtual tie – a statistically insignificant preference towards our results by 0.142 ± 0.338. Examination of rater comments shows that our neural system tends to generate speech that feels more natural and human-like to raters, but it sometimes runs into pronunciation difficulties, e.g., when handling names. This result points to a challenge our end-to-end neural approach faces – it requires training on data that cover intended usage.

3.3. Ablation Studies

3.3.1. Predicted Features versus Ground Truth

While the two components of our model were trained independently, the WaveNet component depends on having predicted features for training. An alternative would be to train WaveNet on mel spectrograms extracted from ground truth audio, which would allow it to be trained in isolation from the feature prediction network. We explore this possibility in Table 2.

Training        Inference: Predicted    Inference: Ground truth
Predicted       4.526 ± 0.066           4.449 ± 0.060
Ground truth    4.362 ± 0.066           4.522 ± 0.055

Table 2. Comparison of evaluated MOS for our system when a WaveNet trained on predicted/ground truth mel spectrograms is made to synthesize from predicted/ground truth mel spectrograms.

As expected, the best performance is obtained when the type of features used for training matches those used for inference. However, when trained on mel spectrograms extracted from ground truth audio and made to synthesize from predicted features, the result is much worse than the opposite. This is likely because of inherent noise in the predicted features that a model trained on ground truth is unable to handle. It is possible that this difference can be eliminated with data augmentation.

3.3.2. Linear Spectrograms

Instead of predicting mel spectrograms, we experiment with training to predict linear-frequency spectrograms, making it possible to invert the spectrogram using Griffin-Lim.

Name                            MOS
Tacotron 2 (Linear + G-L)       3.944 ± 0.091
Tacotron 2 (Linear + WaveNet)   4.510 ± 0.054
Tacotron 2 (Mel + WaveNet)      4.526 ± 0.066

Table 3. Comparison of evaluated MOS for Griffin-Lim vs. WaveNet as a vocoder, and using 1,025-dimensional linear spectrograms vs. 80-dimensional mel spectrograms as conditioning features for WaveNet.

As noted in [10], WaveNet produces much higher quality audio compared to Griffin-Lim. However, there is not much difference between the use of linear-scale or mel-scale spectrograms. As such, the use of mel spectrograms seems to be a strictly better choice since it is a more compact representation. It would be interesting to explore the trade-off between the number of mel frequency bins versus audio quality (MOS) in future work.

3.3.3. Post-Processing Network

Since it is not possible to use the information of predicted future frames before they have been decoded, we use a convolutional post-processing network to incorporate past and future frames after decoding to improve the feature predictions. However, because WaveNet already contains convolutional layers, one may wonder if the post-net is still necessary when WaveNet is used as the vocoder. To answer this question, we compared our model with and without the post-net, and found that without it, our model only obtains a MOS of 4.429 ± 0.071, compared to 4.526 ± 0.066 with it, meaning that empirically the post-net is still an important part of the network design.

3.3.4. Simplifying WaveNet

A defining feature of WaveNet is its use of dilated convolution to increase the receptive field exponentially with the number of layers. We evaluate WaveNet models with varying receptive field sizes and numbers of layers to test our hypothesis that a shallower network with a smaller receptive field may solve the problem satisfactorily, since mel spectrograms are a much closer representation of the waveform than linguistic features and already capture long-term dependencies across frames.

As shown in Table 4, we find that our model can generate high-quality audio using as few as 12 layers with a receptive field of 10.5 ms, compared to 30 layers and 256 ms in the baseline model. These results confirm the observations in [9] that a large receptive field size is not an essential factor for audio quality. However, we hypothesize that it is the choice to condition on mel spectrograms that allows this reduction in complexity.

On the other hand, if we eliminate dilated convolutions altogether, the receptive field becomes two orders of magnitude smaller than the baseline and the quality degrades significantly even though the stack is as deep as the baseline model. This indicates that the model requires sufficient context at the time scale of waveform samples in order to generate high quality sound.

Total layers   Num cycles   Dilation cycle size   Receptive field (samples / ms)   MOS
30             3            10                    6,139 / 255.8                    4.526 ± 0.066
24             4            6                     505 / 21.0                       4.547 ± 0.056
12             2            6                     253 / 10.5                       4.481 ± 0.059
30             30           1                     61 / 2.5                         3.930 ± 0.076

Table 4. WaveNet with various layer and receptive field sizes.
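As a sanity check, the receptive field column of Table 4 is reproduced exactly by assuming non-causal dilated convolutions with filter width 3 (so each layer of dilation d adds 2d samples); the filter width is inferred from these numbers rather than stated in the text.

```python
def receptive_field(n_layers, n_cycles, cycle_size, kernel_size=3, sr=24000):
    """Receptive field of a stack of dilated convolutions with the paper's
    dilation schedule; kernel_size=3 is an inference from Table 4."""
    assert n_layers == n_cycles * cycle_size
    dilations = [2 ** (k % cycle_size) for k in range(n_layers)]
    samples = 1 + (kernel_size - 1) * sum(dilations)
    return samples, 1000.0 * samples / sr

for cfg in [(30, 3, 10), (24, 4, 6), (12, 2, 6), (30, 30, 1)]:
    print(cfg, receptive_field(*cfg))
# (30, 3, 10) -> (6139, ~255.8 ms)    (24, 4, 6)  -> (505, ~21.0 ms)
# (12, 2, 6)  -> (253, ~10.5 ms)      (30, 30, 1) -> (61, ~2.5 ms)
```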
4. CONCLUSION

This paper provides a detailed description of Tacotron 2, an end-to-end neural TTS system that combines a sequence-to-sequence recurrent network with attention to predict mel spectrograms with a modified WaveNet vocoder. The resulting system synthesizes speech with Tacotron-level prosody and WaveNet-level audio quality. This system can be trained directly from data without relying on complex feature engineering, and achieves state-of-the-art sound quality close to that of natural human speech.

5. ACKNOWLEDGMENTS

The authors thank Jan Chorowski, Samy Bengio, Aäron van den Oord, and the WaveNet and Machine Hearing teams for their helpful discussions and advice, as well as Heiga Zen and the Google TTS team for their feedback and assistance with running evaluations.
6. REFERENCES

[1] P. Taylor, Text-to-Speech Synthesis, Cambridge University Press, New York, NY, USA, 1st edition, 2009.
[2] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proceedings of ICASSP, 1996, pp. 373–376.
[3] A. W. Black and P. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," in Proceedings of Eurospeech, September 1997, pp. 601–604.
[4] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proceedings of ICASSP, 2000, pp. 1315–1318.
[5] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proceedings of ICASSP, 2013, pp. 7962–7966.
[7] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[9] S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi, "Deep Voice: Real-time neural text-to-speech," CoRR, vol. abs/1702.07825, 2017.
[10] S. Ö. Arik, G. F. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," CoRR, vol. abs/1705.08947, 2017.
[11] W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: 2000-speaker neural text-to-speech," CoRR, vol. abs/1710.07654, 2017.
[12] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proceedings of Interspeech, Aug. 2017.
[13] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 3104–3112.
[14] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 236–243, 1984.
[15] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proceedings of Interspeech, 2017, pp. 1118–1122.
[16] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," in Proceedings of ICLR, 2017.
[17] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of ICML, 2015, pp. 448–456.
[19] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[20] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[21] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[22] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of ICLR, 2015.
[23] C. M. Bishop, "Mixture density networks," Tech. Rep., 1994.
[24] M. Schuster, On supervised learning from sequential data with applications for speech recognition, Ph.D. thesis, Nara Institute of Science and Technology, 1999.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[26] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al., "Zoneout: Regularizing RNNs by randomly preserving hidden activations," in Proceedings of ICLR, 2017.
[27] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," in Proceedings of ICLR, 2017.
[28] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," CoRR, vol. abs/1711.10433, Nov. 2017.
[29] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of ICLR, 2015.
[30] X. Gonzalvo, S. Tazari, C.-a. Chan, M. Becker, A. Gutkin, and H. Silen, "Recent advances in Google real-time HMM-driven unit selection synthesizer," in Proceedings of Interspeech, 2016.
[31] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, "Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices," in Proceedings of Interspeech, 2016.
