Development of Vietnamese speech synthesis system using deep neural networks

In this paper, we present our first Vietnamese speech synthesis system based on deep
neural networks (DNNs). To improve the training data collected from the Internet, a cleaning method is
proposed. The experimental results indicate that deeper architectures achieve better text-to-speech (TTS)
performance than shallow architectures such as hidden Markov models (HMMs). We also
present the effect of using different amounts of data to train the TTS systems. In the VLSP TTS
challenge 2018, our proposed DNN-based speech synthesis system won first place in all three
categories: naturalness, intelligibility, and MOS.

ction 3.1).
Two DNN-based TTS systems are trained, one without and one with the data cleaning procedure.
Table 1 shows that carefully cleaning the training data improves synthesized speech quality
in both the objective and the subjective evaluation. Specifically, all 4 objective error
metrics decrease (for example, MCD from 4.758 dB to 4.721 dB), while all 3 subjective scores
increase (for example, MOS from 4.50 to 4.61).
358 NGUYEN VAN THINH, NGUYEN QUOC BAO, PHAN HUY KINH, DO VAN HAI
Table 1. The objective and subjective evaluations for the two DNN-based TTS systems without and
with the training data cleaning procedure. (MCD: Mel-Cepstral Distortion; BAP: distortion of
band aperiodicities; F0 RMSE: Root mean squared error in log F0; V/UV: voiced/unvoiced error)

                         Objective evaluation                            Subjective evaluation
Training data cleaning   MCD (dB)   BAP (dB)   F0 RMSE (Hz)   V/UV (%)   Naturalness   Intelligibility   MOS
No (DNN1)                4.758      0.171      23.038         6.084      92.67         94.00             4.50
Yes (DNN2)               4.721      0.163      22.119         6.052      94.67         96.33             4.61
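The objective metrics in Table 1 are not defined in this excerpt, so the following is only a minimal sketch of how three of them are commonly computed. It assumes time-aligned reference and synthesized frames, mel-cepstra that exclude the 0th (energy) coefficient, and F0 contours in which 0 marks an unvoiced frame. The F0 error is computed in Hz to match the unit in the table, although the caption mentions log F0. All function names are illustrative.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over frames.

    ref and syn are lists of frames; each frame is a list of mel-cepstral
    coefficients (assumed aligned, 0th coefficient excluded).
    """
    const = 10.0 / math.log(10.0)
    per_frame = [
        const * math.sqrt(2.0 * sum((r - s) ** 2 for r, s in zip(rf, sf)))
        for rf, sf in zip(ref, syn)
    ]
    return sum(per_frame) / len(per_frame)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced in both contours."""
    errors = [r - s for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def vuv_error(ref_f0, syn_f0):
    """Percentage of frames whose voiced/unvoiced decision differs."""
    mismatches = sum((r > 0) != (s > 0) for r, s in zip(ref_f0, syn_f0))
    return 100.0 * mismatches / len(ref_f0)
```

For all three functions, lower values mean the synthesized parameters are closer to those extracted from natural speech.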
3.3.3. Effect of DNN architecture
In the previous experiments, 6-layer DNNs were used for the duration and acoustic models.
Now we investigate the effect of the DNN architecture on the quality of the TTS system. Note
that in all cases, we use the training data after cleaning, i.e., the DNN2 configuration in Table 1.
Table 2 shows the results given by the DNN-based TTS systems with different DNN
architectures. The last row is the result given by the HMM-based TTS baseline. We can see
that increasing the number of hidden layers from 1 to 6 improves both the objective
and the subjective metrics. However, when more than 4 hidden layers are used, little further
improvement is observed in the objective evaluation, except for the voiced/unvoiced error. In the
subjective evaluation, no improvement is achieved by using more than 5 hidden layers for the DNN
models.
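To make the notion of "number of hidden layers" concrete, the sketch below builds a plain feedforward network whose depth is a parameter and runs one forward pass. It is an illustration only: the layer widths, the tanh activation, the random initialization, and the function names are assumptions, not the configuration used in the experiments.

```python
import math
import random


def build_mlp(input_dim, hidden_dim, output_dim, num_hidden_layers, seed=0):
    """Return a list of (weights, biases) pairs for a feedforward network."""
    rng = random.Random(seed)
    dims = [input_dim] + [hidden_dim] * num_hidden_layers + [output_dim]
    layers = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        scale = 1.0 / math.sqrt(fan_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, features):
    """tanh on hidden layers, linear output layer (a regression network)."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = pre if is_output else [math.tanh(v) for v in pre]
    return activations


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


if __name__ == "__main__":
    # Hypothetical toy sizes: 8 linguistic inputs, 16 hidden units, 4 outputs.
    for depth in range(1, 7):
        model = build_mlp(8, 16, 4, depth)
        print(depth, "hidden layers:", count_parameters(model), "parameters")
```

Each added hidden layer contributes the same number of extra parameters, so depth can be varied while the input and output interfaces of the duration and acoustic models stay fixed.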
Table 2. The objective and subjective evaluations for the TTS systems with different DNN architectures;
the last row is the result for the HMM-based TTS system. (MCD: Mel-Cepstral Distortion;
BAP: distortion of band aperiodicities; F0 RMSE: Root mean squared error in log F0; V/UV: voiced/unvoiced error)

              Objective evaluation                            Subjective evaluation
Model         MCD (dB)   BAP (dB)   F0 RMSE (Hz)   V/UV (%)   Naturalness   Intelligibility   MOS
1 layer-DNN   5.104      0.173      24.158         7.097      88.33         91.67             4.31
2 layer-DNN   4.875      0.169      23.010         6.577      91.67         94.00             4.47
3 layer-DNN   4.769      0.166      22.434         6.310      92.33         94.33             4.49
4 layer-DNN   4.729      0.163      22.051         6.212      92.33         94.67             4.50
5 layer-DNN   4.724      0.163      21.969         6.141      94.67         96.33             4.67
6 layer-DNN   4.721      0.163      22.119         6.052      94.67         96.33             4.67
HMM           4.790      0.185      23.012         8.528      89.67         90.00             4.40
Compared with the HMM-based system in the last row, the DNN-based system (6 hidden
layers) has similar performance in Mel-cepstral distortion and root mean squared error in
log F0. However, the DNN system is clearly better than the HMM system in distortion
of band aperiodicities (0.163 dB versus 0.185 dB) and voiced/unvoiced error (6.052% versus
8.528%). In the subjective evaluation, the DNN system consistently outperforms the HMM system
in all three metrics: naturalness, intelligibility, and MOS. This shows that deeper architectures
achieve better TTS performance than shallow architectures such as an HMM or a neural network
with 1 hidden layer.
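The size of the gap between the two systems is easier to judge as a relative error reduction. The short calculation below uses the objective scores of the 6-layer DNN and the HMM rows of Table 2; the helper name is illustrative.

```python
# Objective scores copied from Table 2 (lower is better for all four).
DNN_6_LAYER = {"MCD": 4.721, "BAP": 0.163, "F0 RMSE": 22.119, "V/UV": 6.052}
HMM = {"MCD": 4.790, "BAP": 0.185, "F0 RMSE": 23.012, "V/UV": 8.528}


def relative_reduction(baseline, system):
    """Percentage by which `system` lowers each error relative to `baseline`."""
    return {
        name: 100.0 * (baseline[name] - system[name]) / baseline[name]
        for name in baseline
    }


if __name__ == "__main__":
    for name, value in relative_reduction(HMM, DNN_6_LAYER).items():
        print(f"{name}: {value:.1f}% lower than HMM")
```

The reductions are about 1.4% for MCD and 3.9% for F0 RMSE, against about 11.9% for BAP and 29.0% for V/UV error, which matches the observation that the two systems are close on the first two metrics and far apart on the last two.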
3.3.4. Effect of training data size
Figure 4. Subjective evaluation for both the DNN-based and HMM-based TTS systems with different
amounts of training data
Now, we investigate the effect of training data size on TTS performance. We randomly
sample the full training set (3156 sentences) into smaller subsets of 1600, 800, and 400
sentences. Figure 4 shows the subjective evaluation of the DNN-based system (with 6
hidden layers) and the HMM-based system trained with different amounts of data.
Performance degrades for both the DNN and HMM systems as less training
data is used. The DNN system achieves better performance than the HMM system
in all three metrics: naturalness, intelligibility, and MOS.
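The subsampling step can be sketched as follows. The text does not say whether the smaller subsets are nested inside the larger ones or which random seed was used, so this sketch assumes independent draws without replacement and a fixed, arbitrary seed; the function name and the integer sentence identifiers are illustrative.

```python
import random


def sample_subsets(sentence_ids, sizes, seed=0):
    """Draw one random subset (without replacement) per requested size."""
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    return {size: rng.sample(sentence_ids, size) for size in sizes}


if __name__ == "__main__":
    full_set = list(range(3156))  # stand-ins for the 3156 training sentences
    subsets = sample_subsets(full_set, [1600, 800, 400])
    for size, ids in subsets.items():
        print(size, "sentences, first five ids:", ids[:5])
```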
3.3.5. Effect of applying postfilter
In this section, we discuss the effect of applying a postfilter on synthesized speech quality. Two
DNN-based systems with 6 hidden layers are compared: the first system is configured with a
postfilter, and the second is a normal system without a postfilter. The subjective
evaluation is shown in Table 3. The DNN-based system with the postfilter
achieves better results in naturalness, intelligibility, and MOS.
3.3.6. Effect of applying parallel processing to postfilter
Table 3. Subjective evaluation for the DNN-based TTS with the postfilter applied and the DNN-based
TTS without the postfilter

Apply postfilter   MOS    Naturalness   Intelligibility
No                 4.39   83.73         92.05
Yes                4.67   94.67         96.33

The results of the previous section show that applying a postfilter to the DNN-based speech
synthesis system notably improves synthesized speech quality. In this
section, we compare the response time of three DNN-based text-to-speech systems with 6
hidden layers: the Original Postfilter system (with the original postfilter from HTS),
the No Postfilter system (without a postfilter), and the Parallel Postfilter system (with a
parallelized postfilter).
We ran a performance test to compare the response time of the three systems above. The test
corpus is a set of sentences of varying length (for example, 4, 5, 6, or 10 words).
For each length, three sentences were used for testing. The average response time of each
system for each length group is shown in Figure 5. With parallel
processing, the system responds faster, and the difference in response time becomes
larger as the length of the sentence increases.
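This excerpt does not describe how the postfilter is parallelized, so the following is only a sketch of one plausible scheme: split the frame sequence into contiguous chunks, filter the chunks concurrently, and concatenate the results in order. It assumes the postfilter treats every frame independently. `postfilter_frame` is a hypothetical stand-in and is not the HTS postfilter; all names and the `beta` value are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def postfilter_frame(frame, beta=0.4):
    """Hypothetical stand-in for a per-frame postfilter (NOT the HTS one):
    it scales every coefficient except the first."""
    return [frame[0]] + [(1.0 + beta) * c for c in frame[1:]]


def postfilter_chunk(chunk):
    """Apply the per-frame postfilter to a contiguous run of frames."""
    return [postfilter_frame(frame) for frame in chunk]


def split_into_chunks(frames, num_chunks):
    """Split frames into at most num_chunks contiguous pieces."""
    size = max(1, -(-len(frames) // num_chunks))  # ceiling division
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def parallel_postfilter(frames, num_workers=4):
    """Filter chunks concurrently and stitch them back together in order."""
    chunks = split_into_chunks(frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(postfilter_chunk, chunks))  # order is preserved
    return [frame for chunk in results for frame in chunk]
```

Because frames are independent under this assumption, the parallel result is identical to the sequential one. In CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock, so a real implementation would more likely use a process pool or native code; the thread pool here only keeps the sketch short.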
Figure 5. The response time comparison of three systems: No postfilter is the speech synthesis system
without postfiltering, Original Postfilter is the system with the postfilter originated from HTS, and
Parallel Postfilter is the system with the postfilter implemented by applying parallel processing
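A measurement of this kind, with sentences grouped by word count and the response time averaged per group, could be scripted roughly as below. The `synthesize` callable stands for any of the three systems and is an assumption, as is the use of wall-clock time from `time.perf_counter`.

```python
import time
from collections import defaultdict


def average_response_time(synthesize, sentences):
    """Average wall-clock seconds of synthesize(sentence), keyed by word count."""
    timings = defaultdict(list)
    for sentence in sentences:
        start = time.perf_counter()
        synthesize(sentence)
        elapsed = time.perf_counter() - start
        timings[len(sentence.split())].append(elapsed)
    return {
        length: sum(values) / len(values)
        for length, values in sorted(timings.items())
    }
```

Calling this once per system with the same test sentences gives one curve per system, which is the form of comparison plotted in Figure 5.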
3.3.7. Performance in the VLSP TTS challenge 2018
Our proposed TTS system was also submitted to the VLSP TTS challenge 2018. The test
set consists of 30 sentences in the news domain, and each team submits the 30 corresponding
synthesized audio files. Twenty listeners, male and female, speaking different dialects, and
including both phoneticians and non-phoneticians, were asked to score naturalness,
intelligibility, and MOS. As shown in Table 4, our TTS system (Viettel) won first place and
outperformed the other TTS systems by a wide margin in all three categories.

Table 4. The scores obtained by the 3 teams in the VLSP TTS challenge 2018

Team                   Naturalness   Intelligibility   MOS
VAIS                   65.50         72.54             3.48
MICA                   72.69         76.94             3.79
Our system (Viettel)   90.54         93.02             4.66
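The aggregation procedure used in the challenge is not described here. As a rough illustration, a mean opinion score is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale, and a confidence interval helps judge whether two systems really differ. The sketch below assumes that convention and a normal approximation for the 95% interval; the function name and the ratings in the usage example are hypothetical, not challenge data.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% confidence half-width) for a flat list of ratings."""
    mean = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical ratings for illustration only.
    mos, half_width = mean_opinion_score([5, 4, 5, 4, 5, 5, 4, 5])
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```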
4. CONCLUSIONS
In this paper, we presented our effort to build the first DNN-based Vietnamese TTS
system. To reduce the synthesis time, a postfilter using parallel processing
was proposed. Experimental results showed that using cleaned data improves the quality of
the speech synthesized by the TTS system. We also showed that deeper architectures
achieve better synthesized speech quality than shallow architectures
such as an HMM or a neural network with 1 hidden layer. The results also indicated that using less
training data reduces speech quality. In general, the DNN system
outperforms the HMM system in all cases. Our TTS system also won first place in the VLSP TTS
challenge 2018 in all three categories: naturalness, intelligibility, and MOS. Our future
work is to optimize the TTS system for different dialects of Vietnamese.
Received on October 04, 2018
Revised on December 28, 2018