Fine-grained style control in Transformer-based Text-to-speech Synthesis

LJ Speech samples (single speaker)

Style reference speech | Generated speech | Text

VCTK (multiple speakers, samples from speakers held-out in the test set)

Same style and speaker reference speech

Style reference speech | Generated speech | Text

Different style and speaker reference speech

Speaker reference speech | Style reference speech | Generated speech | Text

Observations