Flow-based neural vocoders, such as FloWaveNet and WaveGlow, have recently shown significant improvements in real-time speech generation systems. These models transform a sequence of randomly generated noise samples into an audio waveform in parallel. However, training the model to learn a target continuous density function from quantized data can degrade model performance due to the topological difference between the target and source distributions.
To resolve this issue, we propose various audio dequantization methods that can be easily applied to any flow-based neural vocoder to improve model performance. Inspired by data dequantization, a well-known method in image generation, audio dequantization helps the model learn a topologically better-fitted distribution, reducing degradation during inference. We implemented various audio dequantization methods in flow-based neural vocoders and investigated their effect on the generated audio.
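The simplest form of the image-generation technique referred to above is uniform dequantization: adding noise drawn from U[0, 1) to integer-quantized samples so the model trains on a continuous distribution rather than a lattice of discrete points. The sketch below illustrates this idea for 16-bit PCM audio; the function name and bit depth are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def uniform_dequantize(quantized, num_levels=2**16):
    """Uniform dequantization sketch (assumed illustration, not the
    authors' exact method): add noise u ~ U[0, 1) to integer-quantized
    samples so each discrete level is spread over a unit-width interval,
    yielding a continuous distribution for the flow model to fit."""
    noise = np.random.uniform(0.0, 1.0, size=np.shape(quantized))
    # Rescale so dequantized values lie in the continuous range [0, 1)
    return (np.asarray(quantized, dtype=np.float64) + noise) / num_levels

# Example: dequantize a few 16-bit PCM sample values
samples = np.array([0, 32768, 65535])
continuous = uniform_dequantize(samples)
```

Each call yields slightly different continuous values for the same quantized input, which is exactly what removes the degenerate point-mass structure of the training data.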
We conducted various objective performance assessments and subjective evaluations to show that audio dequantization can help improve audio generation quality. Our experiments show that audio dequantization produces audio waveforms with better harmonic structure and fewer digital artifacts.