Development of a Vocal Synthesis Application Using Python

Кліщ, Максим Володимирович; Klishch, Maksym Volodymyrovych

Please use this identifier to cite or link to this item: http://elartu.tntu.edu.ua/handle/lib/45748

Title:	Розробка застосунку синтезу вокалу засобами Python
Alternative titles:	Development of a Vocal Synthesis Application Using Python
Authors:	Кліщ, Максим Володимирович; Klishch, Maksym Volodymyrovych
Affiliation:	Ternopil Ivan Puluj National Technical University (TNTU), Faculty of Computer Information Systems and Software Engineering, Department of Computer Science, Ternopil, Ukraine
Bibliographic description (Ukraine):	Кліщ М. В. Розробка застосунку синтезу вокалу засобами Python : робота на здобуття кваліфікаційного ступеня бакалавра : спец. 122 - комп'ютерні науки / наук. кер. Г. В. Козбур. Тернопіль : Тернопільський національний технічний університет імені Івана Пулюя, 2024. 60 с.
Date:	28-Jun-2024
Submitted date:	14-Jun-2024
Date of entry:	3-Jul-2024
Country (code):	UA
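Place of the edition/event:	Ternopil Ivan Puluj National Technical University (TNTU), Faculty of Computer Information Systems and Software Engineering (FIS), Ternopil, Ukraine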
Supervisor:	Козбур, Галина Володимирівна
Committee members:	Семенишин, Галина Мирославівна
UDC:	004.08
Keywords:	vocal synthesis; singing voice synthesis; neural networks; machine learning; deep learning; residual network; Python
Abstract:	This bachelor's qualification thesis develops a vocal synthesis method and an application built on it. Chapter 1 surveys existing vocal synthesis applications and existing deep-learning-based methods of vocal synthesis, and defines the requirements for the application developed in the thesis. Chapter 2 proposes the architecture of the vocal synthesis model, describes the architecture of the application, and covers the stages of dataset processing and the model training process. Chapter 3 describes testing of the application and evaluates the quality of the resulting model through objective and subjective evaluation. Chapter 4 describes the physiological and psychological effects of synthesized vocals on humans, highlights problems that may arise when working with the application, and gives recommendations for working with it safely.
Content:	Introduction
Chapter 1. Analysis of the vocal synthesis problem and problem statement
1.1 Subject area
1.2 Overview of existing vocal synthesis applications
1.3 Overview of existing deep-learning-based solutions
1.4 Problem statement
1.5 Conclusion to Chapter 1
Chapter 2. Design of the model architecture and the vocal synthesis application
2.1 Application pipeline
2.2 Model architecture
2.3 Dataset
2.4 Dataset preprocessing
2.5 Loss function
2.6 Training process
2.7 Postfilter
2.8 Design of the application's class system
2.9 Interface of the vocal synthesis application
2.10 Conclusion to Chapter 2
Chapter 3. Model quality evaluation and testing of the vocal synthesis application
3.1 Evaluation of predicted values
3.2 Objective evaluation
3.3 Subjective evaluation
3.4 Functional testing of the vocal synthesis application
3.5 Conclusion to Chapter 3
Chapter 4. Life safety and fundamentals of occupational safety
4.1 Physiological and psychological effects of synthesized vocals on human life activity
4.2 Measures to reduce risks for the PC operator when working with the vocal synthesis application
4.3 Conclusion to Chapter 4
Conclusions
List of references
URI:	http://elartu.tntu.edu.ua/handle/lib/45748
Copyright owner:	© Кліщ Максим Володимирович, 2024
Content type:	Bachelor Thesis
Appears in collections:	122 - Computer Science (Bachelor's)

Files in this item:

File	Description	Size	Format
2024_KRB_SNs-42_Klishch_MV.pdf		1.63 MB	Adobe PDF
