While our model significantly improves zero-shot speech synthesis performance, it also synthesizes any background noise present alongside the voice, because its modeling does not disentangle the two.
To mitigate this, we apply a denoiser before the style encoder. This improves audio quality but worsens reconstruction metrics such as CER and WER, that is, both error rates increase.
We found that the denoiser tends to remove critical speech components, which harms pronunciation in the synthesized output and calls for further refinement.
To address the issues introduced by both noise and denoising, we interpolate between the original and denoised style representations, which improves results.
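The interpolation between the two style representations can be illustrated with a minimal sketch. It assumes the interpolation is linear and that both representations are fixed-size vectors of the same shape; the function name `interpolate_style` and the parameter `denoise_ratio` are hypothetical names introduced here for illustration, not taken from our implementation.

```python
import numpy as np


def interpolate_style(style_orig, style_denoised, denoise_ratio):
    """Linearly blend two style vectors (illustrative assumption).

    denoise_ratio = 0.0 returns the original style unchanged;
    denoise_ratio = 1.0 returns the fully denoised style.
    """
    if not 0.0 <= denoise_ratio <= 1.0:
        raise ValueError("denoise_ratio must lie in [0, 1]")

    style_orig = np.asarray(style_orig, dtype=np.float64)
    style_denoised = np.asarray(style_denoised, dtype=np.float64)
    if style_orig.shape != style_denoised.shape:
        raise ValueError("style representations must have the same shape")

    # Weighted average of the two representations.
    return (1.0 - denoise_ratio) * style_orig + denoise_ratio * style_denoised


# Toy vectors standing in for style-encoder outputs (made-up values).
original = np.array([1.0, 2.0])
denoised = np.array([3.0, 6.0])
blended = interpolate_style(original, denoised, 0.5)  # -> [2.0, 4.0]
```

Under this assumption, a smaller ratio keeps more of the original style (and any noise it carries), while a larger ratio leans on the denoised style at the risk of the pronunciation degradation described above.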