Individual prosody control using generated duration and pitch via mel spectrogram deformation

Material type
Thesis
Personal Author
송찬호, 宋燦浩
Title Statement
Individual prosody control using generated duration and pitch via mel spectrogram deformation / Chan-ho Song
Publication, Distribution, etc
Seoul : Graduate School, Korea University, 2020
Physical Medium
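vi, 25 leaves : color illustrations, charts ; 26 cm
Additional Physical Form Entry
Individual Prosody Control using Generated Duration and Pitch via Mel Spectrogram Deformation (DCOLL211009)000000127343
Dissertation Note
Thesis (Master's) -- Korea University Graduate School: Department of Computer and Radio Communications Engineering, 2020. 2
Department Code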
0510   6D36   1115  
General Note
Advisor: 이성환
Bibliography, Etc. Note
References: leaves 22-25
Additional Physical Form Available
Also available as a PDF file; requires a PDF file reader (application/pdf)
Uncontrolled Subject Terms
Prosody control, Speech synthesis
000 00000nam c2200205 c 4500
001 000046026247
005 20200428155150
007 ta
008 200106s2020 ulkad bmAC 000c eng
040 ▼a 211009 ▼c 211009 ▼d 211009
041 0 ▼a eng ▼b kor
085 0 ▼a 0510 ▼2 KDCP
090 ▼a 0510 ▼b 6D36 ▼c 1115
100 1 ▼a 송찬호, ▼g 宋燦浩
245 1 0 ▼a Individual prosody control using generated duration and pitch via mel spectrogram deformation / ▼d Chan-ho Song
260 ▼a Seoul : ▼b Graduate School, Korea University, ▼c 2020
300 ▼a vi, 25장 : ▼b 천연색삽화, 도표 ; ▼c 26 cm
500 ▼a 지도교수: 이성환
502 0 ▼a 학위논문(석사)-- ▼b 고려대학교 대학원: ▼c 컴퓨터·전파통신공학과, ▼d 2020. 2
504 ▼a 참고문헌: 장 22-25
530 ▼a PDF 파일로도 이용가능; ▼c Requires PDF file reader(application/pdf)
653 ▼a Prosody control ▼a Speech synthesis
776 0 ▼t Individual Prosody Control using Generated Duration and Pitch via Mel Spectrogram Deformation ▼w (DCOLL211009)000000127343
900 1 0 ▼a Song, Chan-ho, ▼e
900 1 0 ▼a 이성환, ▼g 李晟瑍, ▼e 지도교수
945 ▼a KLPA

Holdings Information

No. | Location | Call Number | Accession No. | Availability
1 | Science & Engineering Library/Stacks(Thesis) | 0510 6D36 1115 | 123063751 | Available
2 | Science & Engineering Library/Stacks(Thesis) | 0510 6D36 1115 | 123063752 | Available

Contents information

Abstract

In speech and signal processing, controlling prosody is essential for synthesizing varied speech. However, because prosody is entangled with other features, it is difficult to control individual prosodic attributes. Disentangling these features allows each one to be controlled without affecting the others. In this paper, we propose an individual prosody control architecture that consists of a domain adversarial neural network (DANN) and a conditional variational autoencoder (CVAE). We apply data augmentation that randomly warps the duration and local pitch of the mel spectrogram to deform the speaker's unique prosody. We use the difference between the original and deformed data to represent disentangled voice and prosody features. With this approach, we can control prosody and synthesize speech with various durations and pitches using only a monotonic speech dataset, without a high-quality dataset. We evaluate feature disentanglement and individual prosody control of the synthesized speech via t-SNE, F0 (pitch) tracks, MOS, and ABX tests.
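
As a rough illustration of the augmentation the abstract mentions (randomly warping the duration and pitch of a mel spectrogram), the following NumPy sketch may help. It is not the thesis's implementation, which this record does not describe. The function names (`time_warp`, `freq_warp`, `random_deform`), the linear interpolation, the global (rather than local) shift along the mel axis, and the default ranges are all assumptions made for illustration.

```python
import numpy as np


def time_warp(mel, rate):
    """Stretch (rate > 1) or compress (rate < 1) a (n_mels, n_frames)
    mel spectrogram along the time axis using linear interpolation."""
    n_mels, n_frames = mel.shape
    new_frames = max(1, int(round(n_frames * rate)))
    # Positions in the original frame grid that each new frame samples from.
    src = np.linspace(0, n_frames - 1, new_frames)
    frames = np.arange(n_frames)
    out = np.empty((n_mels, new_frames), dtype=float)
    for i in range(n_mels):
        out[i] = np.interp(src, frames, mel[i])
    return out


def freq_warp(mel, shift_bins):
    """Shift spectral content along the mel axis by a (possibly fractional)
    number of bins. This is a crude global stand-in for pitch warping;
    values shifted in from outside the range are clamped to the edge bin."""
    n_mels, n_frames = mel.shape
    bins = np.arange(n_mels)
    src = bins - shift_bins
    out = np.empty((n_mels, n_frames), dtype=float)
    for t in range(n_frames):
        out[:, t] = np.interp(src, bins, mel[:, t])
    return out


def random_deform(mel, rng, max_rate=0.2, max_shift=2.0):
    """Apply a random duration change and a random mel-axis shift.
    The ranges are hypothetical defaults, not values from the thesis."""
    rate = 1.0 + rng.uniform(-max_rate, max_rate)
    shift = rng.uniform(-max_shift, max_shift)
    return freq_warp(time_warp(mel, rate), shift), rate, shift


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.random((80, 100))  # placeholder data, not real speech
    deformed, rate, shift = random_deform(mel, rng)
    print(mel.shape, deformed.shape, rate, shift)
```

In a setup like the one the abstract outlines, the original and deformed spectrograms would form a pair whose difference carries the prosody change while the speaker identity stays fixed. How the thesis actually uses that pair is covered in its Section 3.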

Table of Contents

1  Introduction
2  Related Work
3  Individual Prosody Control
    3.1  Data Augmentation
         3.1.1 Time Warping
         3.1.2 Frequency Warping
    3.2  Proposed Architecture and Objective
         3.2.1 Reference Encoder
         3.2.2 Adversarial Training Module
         3.2.3 Objective Function
4  Experiments and Results
    4.1  Experimental Setup
    4.2  Speaker Encoding Disentanglement
    4.3  Prosody Transfer
    4.4  Prosody Control
         4.4.1 Local Pitch Control
         4.4.2 Duration Control
5  Conclusion
References