000 | 00000nam c2200205 c 4500 | |
001 | 000046026247 | |
005 | 20230526122055 | |
007 | ta | |
008 | 200106s2020 ulkad bmAC 000c eng | |
040 | ▼a 211009 ▼c 211009 ▼d 211009 | |
041 | 0 | ▼a eng ▼b kor |
085 | 0 | ▼a 0510 ▼2 KDCP |
090 | ▼a 0510 ▼b 6D36 ▼c 1115 | |
100 | 1 | ▼a 송찬호, ▼g 宋燦浩 |
245 | 1 0 | ▼a Individual prosody control using generated duration and pitch via mel spectrogram deformation / ▼d Chan-ho Song |
260 | ▼a Seoul : ▼b Graduate School, Korea University, ▼c 2020 | |
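300 | ▼a vi, 25 leaves : ▼b color illustrations, charts ; ▼c 26 cm | |
500 | ▼a Advisor: 이성환 | |
502 | 0 | ▼a Thesis (Master's)-- ▼b Graduate School, Korea University, ▼c Department of Computer and Radio Communications Engineering, ▼d 2020. 2 |
504 | ▼a References: leaves 22-25 | |
530 | ▼a Also available as a PDF file; ▼c Requires PDF file reader(application/pdf) | |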
653 | ▼a Prosody control ▼a Speech synthesis | |
776 | 0 | ▼t Individual Prosody Control using Generated Duration and Pitch via Mel Spectrogram Deformation ▼w (DCOLL211009)000000127343 |
900 | 1 0 | ▼a Song, Chan-ho, ▼e author |
900 | 1 0 | ▼a 이성환, ▼g 李晟瑍, ▼d 1962-, ▼e advisor ▼0 AUTH(211009)151678 |
945 | ▼a KLPA |
Holdings
No. | Location | Call number | Registration number | Status |
---|---|---|---|---|
1 | Science Library/Thesis Stacks/ | 0510 6D36 1115 | 123063751 | Available for loan |
2 | Science Library/Thesis Stacks/ | 0510 6D36 1115 | 123063752 | Available for loan |
Content information
Abstract
In speech and signal processing, controlling prosody is essential for synthesizing varied speech. However, because prosody is entangled with other speech features, it is difficult to control individual prosodic attributes. Disentangling these features allows each one to be controlled without affecting the others. In this paper, we propose an individual prosody control architecture consisting of a domain adversarial neural network (DANN) and a conditional variational autoencoder (CVAE). We apply data augmentation that randomly warps the duration and local pitch of the mel spectrogram to deform the speaker's unique prosody, and we use the difference between the original and deformed data to represent disentangled voice and prosody features. This lets us control prosody and synthesize speech with varied duration and pitch using only a monotonic speech dataset, without requiring a high-quality dataset. We evaluate feature disentanglement and individual prosody control of the synthesized speech using t-SNE, F0 (pitch) tracks, MOS, and ABX tests.
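The augmentation described in the abstract (randomly warping the duration and pitch of a mel spectrogram) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `time_warp`, `frequency_warp`, and `random_deform`, the uniform (global rather than local) warp, the linear interpolation, and the ±20% range are all assumptions made for illustration.

```python
import random


def _interp(values, pos):
    """Linearly interpolate `values` at fractional index `pos` (clamped to range)."""
    pos = min(max(pos, 0.0), len(values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return values[lo] * (1.0 - frac) + values[hi] * frac


def time_warp(mel, factor):
    """Stretch (factor > 1) or compress (factor < 1) the time axis.

    `mel` is a list of frames; each frame is a list of mel-bin magnitudes.
    """
    n_frames, n_bins = len(mel), len(mel[0])
    new_len = max(1, round(n_frames * factor))
    # One trajectory over time per mel bin, so each can be resampled.
    columns = [[frame[b] for frame in mel] for b in range(n_bins)]
    warped = []
    for t in range(new_len):
        # Map the output frame index back to a fractional input index.
        pos = t * (n_frames - 1) / (new_len - 1) if new_len > 1 else 0.0
        warped.append([_interp(columns[b], pos) for b in range(n_bins)])
    return warped


def frequency_warp(mel, factor):
    """Scale the mel-bin axis; factor > 1 moves energy toward higher bins."""
    n_bins = len(mel[0])
    # Output bin b reads from source position b / factor.
    return [[_interp(frame, b / factor) for b in range(n_bins)] for frame in mel]


def random_deform(mel, rng, max_change=0.2):
    """Apply random duration and pitch factors (assumed range: +/- max_change)."""
    duration = rng.uniform(1.0 - max_change, 1.0 + max_change)
    pitch = rng.uniform(1.0 - max_change, 1.0 + max_change)
    return frequency_warp(time_warp(mel, duration), pitch), (duration, pitch)


if __name__ == "__main__":
    # Toy 4-frame, 3-bin "spectrogram" standing in for real data.
    toy = [[0.0, 2.0, 4.0], [1.0, 3.0, 5.0], [2.0, 4.0, 6.0], [3.0, 5.0, 7.0]]
    deformed, factors = random_deform(toy, random.Random(0))
    print(len(toy), "->", len(deformed), "frames; factors:", factors)
```

Pairing each original spectrogram with its deformed copy gives the kind of original/deformed difference the abstract mentions. A practical pipeline would more likely use an array library and segment-wise (local) warps.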
Table of contents
1 Introduction
2 Related Work
3 Individual Prosody Control
3.1 Data Augmentation
3.1.1 Time Warping
3.1.2 Frequency Warping
3.2 Proposed Architecture and Objective
3.2.1 Reference Encoder
3.2.2 Adversarial Training Module
3.2.3 Objective Function
4 Experiments and Results
4.1 Experimental Setup
4.2 Speaker Encoding Disentanglement
4.3 Prosody Transfer
4.4 Prosody Control
4.4.1 Local Pitch Control
4.4.2 Duration Control
5 Conclusion
References