NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS
Jonathan Shen¹, Ruoming Pang¹, Ron J. Weiss¹, Mike Schuster¹, Navdeep Jaitly¹, Zongheng Yang*², Zhifeng Chen¹, Yu Zhang¹, Yuxuan Wang¹, RJ Skerry-Ryan¹, Rif A. Saurous¹, Yannis Agiomyrgiannakis¹, and Yonghui Wu¹

¹Google, Inc.    ²University of California, Berkeley

{jonathanasdf,rpang,yonghui}@google.com
[Fig. 1. Block diagram of the Tacotron 2 architecture: a feature prediction network generates a mel spectrogram, which conditions a WaveNet MoL vocoder that outputs waveform samples.]

… high-frequency details, which are dominated by fricatives and other noise bursts and generally do not need to be modeled with high fidelity. Because of these properties, features derived from the mel scale have been used as an underlying representation for speech recognition for many decades [17].
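For illustration, a mel spectrogram of the kind referred to here can be computed with an off-the-shelf library as in the sketch below; the 24 kHz sample rate and the STFT frame and hop sizes are assumptions for illustration, and only the 80 mel channels are taken from elsewhere in this paper.

# Sketch: extracting a log-mel spectrogram as a conditioning feature.
# The sample rate, frame length, and hop length below are illustrative
# assumptions, not the exact values used by the system described here.
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=24000, n_fft=1024, hop_length=256, n_mels=80):
    y, _ = librosa.load(path, sr=sr)  # load and resample the waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # dynamic range compression

# frames = log_mel_spectrogram("utterance.wav")  # shape: (80, n_frames)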
3. EXPERIMENTS & RESULTS

3.1. Training Setup

Our training process involves first training the feature prediction network on its own, followed by training a modified WaveNet independently on the outputs generated by the first network.

To train the feature prediction network, we apply the standard maximum-likelihood training procedure (feeding in the correct output instead of the predicted output on the decoder side, also referred to as teacher-forcing) with a batch size of 64 on a single GPU. We use the Adam optimizer [29] with β1 = 0.9, β2 = 0.999, ε = 10⁻⁶, and a learning rate of 10⁻³ exponentially decaying to 10⁻⁵ starting after 50,000 iterations. We also apply L2 regularization with weight 10⁻⁶.
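For concreteness, the sketch below configures an optimizer with these settings; PyTorch and the stand-in model are used purely for illustration, and the number of steps over which the decay reaches 10⁻⁵ is an assumption, since it is not specified above.

# Sketch of the feature prediction network optimizer settings described above.
# PyTorch and the stand-in `model` are illustrative; DECAY_STEPS is an
# assumption -- the text only says decay starts after 50,000 iterations and
# ends at 1e-5.
import torch

model = torch.nn.Linear(80, 80)  # stand-in for the feature prediction network

LR0, LR_FINAL, DECAY_START, DECAY_STEPS = 1e-3, 1e-5, 50_000, 100_000

def lr_factor(step):
    # Multiplicative factor relative to LR0: flat, then exponential decay.
    if step < DECAY_START:
        return 1.0
    frac = min((step - DECAY_START) / DECAY_STEPS, 1.0)
    return (LR_FINAL / LR0) ** frac

optimizer = torch.optim.Adam(model.parameters(), lr=LR0,
                             betas=(0.9, 0.999), eps=1e-6,
                             weight_decay=1e-6)  # approximates the L2 term
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# Call scheduler.step() once per training iteration, after optimizer.step().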
We then train our modified WaveNet on the ground-truth-aligned predictions of the feature prediction network. That is, these predictions are produced in teacher-forcing mode, so each spectrogram frame exactly aligns with the target waveform samples. We train with a batch size of 128 distributed across 32 GPUs with synchronous updates, using the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, and a fixed learning rate of 10⁻⁴. It helps quality to average model weights over recent updates. Therefore, we maintain an exponentially-weighted moving average of the network parameters over update steps with a decay of 0.9999; this version is used for inference (see also [29]). To speed up convergence, we scale the waveform targets by a factor of 127.5, which brings the initial outputs of the mixture of logistics layer closer to the eventual distributions.
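Maintaining such an exponentially-weighted moving average takes only a few lines; the sketch below is a generic PyTorch implementation under that reading, not the authors' code.

# Sketch: exponentially-weighted moving average (EMA) of model parameters,
# decay 0.9999; the averaged copy is the version used at inference time.
import copy
import torch

class ParameterEMA:
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # averaged weights
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current parameters
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage: ema = ParameterEMA(wavenet); after each optimizer.step(), call
# ema.update(wavenet); synthesize with ema.shadow.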
We train all models on an internal US English dataset, which contains 24.6 hours of speech from a single professional female speaker. All text in our datasets is spelled out, e.g., “16” is written as “sixteen”; i.e., our models are all trained on pre-normalized text.
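As a small illustration of this kind of pre-normalization (the actual text-normalization pipeline is not described here), digits can be spelled out with an off-the-shelf library such as num2words:

# Illustrative only: spell out digits so the model sees pre-normalized text,
# e.g. "16" -> "sixteen". The actual normalization pipeline is not specified here.
import re
from num2words import num2words

def spell_out_numbers(text):
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(spell_out_numbers("Chapter 16 begins on page 42."))
# -> "Chapter sixteen begins on page forty-two."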
3.2. Evaluation
When generating speech in inference mode, the ground truth targets are not known. Therefore, the predicted outputs from the previous step are fed in during decoding, in contrast to the teacher-forcing configuration used for training.

We randomly selected 100 fixed examples from the test set as the evaluation set. Audio generated on this set is sent to a human rating service similar to Amazon's Mechanical Turk, where each sample is rated by at least 8 raters on a scale from 1 to 5 with 0.5-point increments, from which a subjective mean opinion score (MOS) is calculated. Each evaluation is conducted independently of the others, so the outputs of two different models are not directly compared when raters assign their scores.
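As an aside, a minimal sketch of how individual ratings could be aggregated into such a MOS with a 95% confidence interval is shown below; the normal-approximation interval is an assumption, since the exact interval computation is not described here.

# Sketch: aggregate individual 1-5 ratings (0.5 increments) into a MOS
# with a 95% confidence interval using a normal approximation (assumed).
import numpy as np

def mos_with_ci(ratings):
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mean, half_width

# ratings = [4.5, 5.0, 4.0, 4.5, ...]  # at least 8 ratings per sample
# mos, ci = mos_with_ci(ratings)       # report as f"{mos:.3f} ± {ci:.3f}"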
Note that while instances in the evaluation set never appear in the training set, there are some recurring patterns and common words between the two sets. While this could potentially result in an inflated MOS compared to an evaluation set consisting of sentences generated from random words, using this set allows us to compare to the ground truth. Since all the systems we compare are trained on the same data, relative comparisons remain meaningful.

Table 1 shows a comparison of our method against various prior systems. In order to better isolate the effect of using mel spectrograms as features, we compare to a WaveNet conditioned on linguistic features with similar modifications to the WaveNet architecture as described above.

    System                     MOS
    Parametric                 3.492 ± 0.096
    Tacotron (Griffin-Lim)     4.001 ± 0.087
    Concatenative              4.166 ± 0.091
    WaveNet (Linguistic)       4.341 ± 0.051
    Ground truth               4.582 ± 0.053
    Tacotron 2 (this paper)    4.526 ± 0.066

Table 1. Mean Opinion Score (MOS) evaluations with 95% confidence intervals for various systems.

We also conduct a side-by-side evaluation between audio synthesized by our system and the ground truth. For each pair of utterances, raters are asked to give a score ranging from -3 (synthesized much worse than ground truth) to 3 (synthesized much better than ground truth). The overall mean score of −0.270 ± 0.155 shows that raters have a small but statistically significant preference towards ground truth over our results. See Figure 2 for a detailed breakdown. The comments from raters indicate that occasional mispronunciation by our system is the primary reason for this preference.

[Fig. 2. Synthesized vs. ground truth: 800 ratings on 100 items.]

We manually analyze the error modes of our system on the custom 100-sentence test set from Appendix E of [11]. Within the audio generated from those sentences, 0 contained repeated words, 6 contained mispronunciations, 1 contained skipped words, and 23 were subjectively judged to contain unnatural prosody, such as emphasis on the wrong syllables or words, or unnatural pitch. In one case, on the longest sentence, end-point prediction failed. Overall, our model achieves a MOS of 4.354 on these inputs. These results show that while our system is able to reliably attend to the entire input, there is still room for improvement in prosody modeling.

Finally, we evaluate samples generated from 37 news headlines to test the generalization ability of our system to out-of-domain text. On this task, our model receives a MOS of 4.148 ± 0.124, while WaveNet conditioned on linguistic features receives a MOS of 4.137 ± 0.128. A side-by-side evaluation comparing speech generated by these systems also shows a virtual tie: a statistically insignificant preference towards our results by 0.142 ± 0.338. Examination of rater comments shows that our neural system tends to generate speech that feels more natural and human-like to raters, but it sometimes runs into pronunciation difficulties, e.g., when handling names. This result points to a challenge our end-to-end neural approach faces: it requires training on data that cover the intended usage.
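Reading such side-by-side results is straightforward: a mean preference score is statistically significant at the 95% level when its confidence interval excludes zero, as in the small check below (assuming the ± value is the interval half-width).

# Sketch: a mean preference score is statistically significant at the 95% level
# when its confidence interval excludes zero (assuming ± is the half-width).
def preference_is_significant(mean, half_width):
    return abs(mean) > half_width

print(preference_is_significant(-0.270, 0.155))  # True: ground truth preferred
print(preference_is_significant(0.142, 0.338))   # False: a virtual tie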
3.3. Ablation Studies

3.3.1. Predicted Features versus Ground Truth
While the two components of our model were trained independently, the WaveNet component depends on having predicted features for training. An alternative would be to train WaveNet on mel spectrograms extracted from ground truth audio, which would allow it to be trained in isolation from the feature prediction network. We explore this possibility in Table 2.

                     Inference
    Training         Predicted        Ground truth
    Predicted        4.526 ± 0.066    4.449 ± 0.060
    Ground truth     4.362 ± 0.066    4.522 ± 0.055

Table 2. Comparison of evaluated MOS for our system when a WaveNet trained on predicted or ground truth mel spectrograms is made to synthesize from predicted or ground truth mel spectrograms.

As expected, the best performance is obtained when the type of features used for training matches the type used for inference. However, when trained on mel spectrograms extracted from ground truth audio and made to synthesize from predicted features, the result is much worse than the opposite. This is likely because of inherent noise in the predicted features that a model trained on ground truth is unable to handle. It is possible that this difference can be eliminated with data augmentation.
3.3.2. Linear Spectrograms

    Tacotron 2 (Linear + G-L)        3.944 ± 0.091
    Tacotron 2 (Linear + WaveNet)    4.510 ± 0.054
    Tacotron 2 (Mel + WaveNet)       4.526 ± 0.066

Table 3. Comparison of evaluated MOS for Griffin-Lim vs. WaveNet as a vocoder, and using 1,025-dimensional linear spectrograms vs. 80-dimensional mel spectrograms as conditioning features for WaveNet.

As noted in [10], WaveNet produces much higher quality audio than Griffin-Lim. However, there is not much difference between the use of linear-scale and mel-scale spectrograms. As such, the use of mel spectrograms seems to be a strictly better choice, since it is a more compact representation. It would be interesting to explore the trade-off between the number of mel frequency bins and audio quality (MOS) in future work.

3.3.3. Post-Processing Network

Since it is not possible to use the information of predicted future frames before they have been decoded, we use a convolutional post-processing network to incorporate past and future frames after decoding to improve the feature predictions. However, because WaveNet already contains convolutional layers, one may wonder whether the post-net is still necessary when WaveNet is used as the vocoder. To answer this question, we compared our model with and without the post-net, and found that without it, the model only obtains a MOS of 4.429 ± 0.071, compared to 4.526 ± 0.066 with it, meaning that empirically the post-net remains an important part of the network design.

3.3.4. Simplifying WaveNet

A defining feature of WaveNet is its use of dilated convolutions to increase the receptive field exponentially with the number of layers. We evaluate WaveNet models with varying receptive field sizes and numbers of layers to test our hypothesis that a shallower network with a smaller receptive field may solve the problem satisfactorily, since mel spectrograms are a much closer representation of the waveform than linguistic features and already capture long-term dependencies across frames.

As shown in Table 4, we find that our model can generate high-quality audio using as few as 12 layers with a receptive field of 10.5 ms, compared to 30 layers and 256 ms in the baseline model. These results confirm the observations in [9] that a large receptive field size is not an essential factor for audio quality. However, we hypothesize that it is the choice to condition on mel spectrograms that allows this reduction in complexity.

On the other hand, if we eliminate dilated convolutions altogether, the receptive field becomes two orders of magnitude smaller than the baseline and the quality degrades significantly, even though the stack is as deep as the baseline model. This indicates that the model requires sufficient context at the time scale of waveform samples in order to generate high-quality sound.

Table 4. WaveNet with various layer and receptive field sizes.
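The receptive-field figures quoted above can be reproduced with a short calculation; the kernel size of 3, the doubling dilation pattern within each cycle, and the 24 kHz sample rate are assumptions based on the standard WaveNet setup rather than values stated in this section.

# Sketch: receptive field of a stack of dilated convolutions, in samples and ms.
# Assumes kernel size 3, dilations doubling within each cycle (1, 2, 4, ...),
# and a 24 kHz output sample rate; these are assumptions for illustration.
def receptive_field(n_layers, n_cycles, kernel_size=3, sample_rate=24000):
    layers_per_cycle = n_layers // n_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(n_layers)]
    rf_samples = (kernel_size - 1) * sum(dilations) + 1
    return rf_samples, 1000.0 * rf_samples / sample_rate

print(receptive_field(30, 3))  # ~6139 samples, ~256 ms (baseline-like stack)
print(receptive_field(12, 2))  # 253 samples, ~10.5 ms (shallow stack)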
4. CONCLUSION

This paper provides a detailed description of Tacotron 2, an end-to-end neural TTS system that combines a sequence-to-sequence recurrent network with attention to predict mel spectrograms with a modified WaveNet vocoder. The resulting system synthesizes speech with Tacotron-level prosody and WaveNet-level audio quality. This system can be trained directly from data without relying on complex feature engineering, and achieves state-of-the-art sound quality close to that of natural human speech.

5. ACKNOWLEDGMENTS

The authors thank Jan Chorowski, Samy Bengio, Aäron van den Oord, and the WaveNet and Machine Hearing teams for their helpful discussions and advice, as well as Heiga Zen and the Google TTS team for their feedback and assistance with running evaluations.
6. REFERENCES

[1] P. Taylor, Text-to-Speech Synthesis, Cambridge University Press, New York, NY, USA, 1st edition, 2009.
[2] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proceedings of ICASSP, 1996, pp. 373–376.
[3] A. W. Black and P. Taylor, “Automatically clustering similar units for unit selection in speech synthesis,” in Proceedings of Eurospeech, September 1997, pp. 601–604.
[4] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proceedings of ICASSP, 2000, pp. 1315–1318.
[5] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proceedings of ICASSP, 2013, pp. 7962–7966.
[7] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016.
[9] S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep Voice: Real-time neural text-to-speech,” CoRR, vol. abs/1702.07825, 2017.
[10] S. Ö. Arik, G. F. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-speaker neural text-to-speech,” CoRR, vol. abs/1705.08947, 2017.
[11] W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” CoRR, vol. abs/1710.07654, 2017.
[12] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Proceedings of Interspeech, Aug. 2017.
[13] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[14] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 236–243, 1984.
[15] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proceedings of Interspeech, 2017, pp. 1118–1122.
[16] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in Proceedings of ICLR, 2017.
[17] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of ICML, 2015, pp. 448–456.
[19] M. Schuster and K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[21] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of ICLR, 2015.
[23] C. M. Bishop, “Mixture density networks,” Tech. Rep., 1994.
[24] M. Schuster, On supervised learning from sequential data with applications for speech recognition, Ph.D. thesis, Nara Institute of Science and Technology, 1999.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[26] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al., “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in Proceedings of ICLR, 2017.
[27] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications,” in Proceedings of ICLR, 2017.
[28] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” CoRR, vol. abs/1711.10433, Nov. 2017.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR, 2015.
[30] X. Gonzalvo, S. Tazari, C.-a. Chan, M. Becker, A. Gutkin, and H. Silen, “Recent advances in Google real-time HMM-driven unit selection synthesizer,” in Proceedings of Interspeech, 2016.
[31] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, “Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices,” in Proceedings of Interspeech, 2016.