A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Li-Wei Chen¹, Takuya Higuchi², Zakaria Aldeneh², Ahmed Hussen Abdelaziz², Alexander Rudnicky¹

¹Language Technologies Institute, Carnegie Mellon University    ²Apple

Abstract

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, human raters prefer the speech continuations it produces.

Method Overview

System Overview Figure
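
The abstract describes a variational approach that learns continuous speech attributes to complement the semantic tokens. As a rough illustration of the two generic ingredients such an approach typically relies on, the sketch below samples a latent vector with the reparameterization trick and computes the KL regularizer of a diagonal Gaussian posterior. This is not the authors' architecture: the function names, the latent size, and the standard-normal prior are assumptions made purely for illustration, and this page does not specify the prior or the encoder actually used.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Reparameterization trick: z = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over dimensions.

    The standard-normal prior is an assumption of this sketch.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


# Hypothetical posterior parameters for one speech frame (4 latent dimensions).
rng = random.Random(0)
mean = [0.0, 0.0, 0.0, 0.0]
log_var = [0.0, 0.0, 0.0, 0.0]

z = sample_latent(mean, log_var, rng)       # continuous attributes for this frame
kl = kl_to_standard_normal(mean, log_var)   # zero when posterior equals the prior
```

In a full system the sampled latent would be combined with the semantic token at each step, and the KL term would be added to the reconstruction loss; both of those pieces are outside the scope of this sketch.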

Speech Continuation Comparison

Below are audio samples using different methods for speech continuation. Each example begins with the same 3-second prompt, followed by generated speech from various approaches.
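
The prompt/continuation setup described above can be sketched as a simple split of the reference waveform at the 3-second mark. The helper name `split_prompt` and the 16 kHz sampling rate are assumptions for illustration; the page does not state the sampling rate or how the prompts were actually cut.

```python
def split_prompt(waveform, sample_rate, prompt_seconds=3.0):
    """Split a waveform into a fixed-length prompt and the remainder."""
    n_prompt = int(round(prompt_seconds * sample_rate))
    n_prompt = min(n_prompt, len(waveform))  # guard against short utterances
    return waveform[:n_prompt], waveform[n_prompt:]


sample_rate = 16000                      # assumed sampling rate in Hz
waveform = [0.0] * (5 * sample_rate)     # placeholder 5-second utterance

prompt, ground_truth_continuation = split_prompt(waveform, sample_rate)
```

Each system then receives the same `prompt` and generates its own continuation, while the remainder of the original utterance plays the role of the human reference.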

Prompt

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Token-LM

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Token-LM + Pitch

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Proposed (VAE-GSLM)

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Human Speech (Ground Truth)

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8