Neural text-to-speech with a modeling-by-generation excitation vocoder

TTS demos

Sample1
Raw	Proposed MbG-ExcitNet
Baseline WaveNet	Baseline ExcitNet
Baseline G-WaveNet	Baseline G-ExcitNet
Sample2
Raw	Proposed MbG-ExcitNet
Baseline WaveNet	Baseline ExcitNet
Baseline G-WaveNet	Baseline G-ExcitNet
Sample3
Raw	Proposed MbG-ExcitNet
Baseline WaveNet	Baseline ExcitNet
Baseline G-WaveNet	Baseline G-ExcitNet
Sample4
Raw	Proposed MbG-ExcitNet
Baseline WaveNet	Baseline ExcitNet
Baseline G-WaveNet	Baseline G-ExcitNet

References

[1] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, “Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems,” in Proc. ASRU, 2019, pp. 214–221.
[2] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. INTERSPEECH, 2017, pp. 1118–1122.
[3] E. Song, K. Byun, and H.-G. Kang, “ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems,” in Proc. EUSIPCO, 2019, pp. 1179–1183.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018, pp. 4779–4783.

Citation

@inproceedings{song2020neural,
    title={Neural text-to-speech with a modeling-by-generation excitation vocoder},
    author={Song, Eunwoo and Hwang, Min-Jae and Yamamoto, Ryuichi and Kim, Jin-Seob and Kwon, Ohsung and Kim, Jae-Min},
    booktitle={Proc. INTERSPEECH},
    pages={3570--3574},
    year={2020}
}
