Approach for Creating a Real-Time Speech-to-Speech Model with Emotions, Laughter, and Crying—aka "The Perfect Voice Changer"

There are already many TTS (Text-to-Speech) and STT (Speech-to-Text) models available. I once tried a paid app that claimed to offer Speech-to-Speech (S2S) conversion. However, it was quite obvious that internally, it was using STT followed by TTS, which resulted in a significant loss of tone and expressiveness. Additionally, it was limited to English vocabulary, meaning it couldn’t accurately process non-verbal sounds like laughter or crying, leading to a loss of emotional nuance.

Has anyone ever trained a model that performs direct speech-to-speech conversion without relying on text as an intermediary? Most existing S2S models focus on translation, which inherently involves a vocabulary-based transformation. My goal is to train a model that completely bypasses any reliance on text-based representation.

Training Data

A straightforward approach would be to use existing TTS models to generate different voices reading the same text. However, TTS struggles with accurately reproducing emotions like crying or laughter.
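
Even though it would miss the emotional side, building that kind of parallel corpus would look roughly like this (`synthesize` is just a placeholder for whatever TTS model is used, not a real API):

```python
# Building a parallel corpus with existing TTS: the same script rendered in a
# "source" and a "target" voice gives aligned input/output pairs. `synthesize`
# is a placeholder, and this data would carry little of the emotion
# (laughter, crying) that I actually need.
import numpy as np

def synthesize(text: str, voice: str) -> np.ndarray:
    """Placeholder: render `text` in the given voice and return waveform samples."""
    raise NotImplementedError

def build_parallel_pairs(script_lines, source_voice="voice_a", target_voice="voice_b"):
    pairs = []
    for line in script_lines:
        src = synthesize(line, source_voice)   # what the converter will hear
        tgt = synthesize(line, target_voice)   # what it should learn to produce
        pairs.append((src, tgt))
    return pairs
```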

Instead, with some personal funding, I would hire film students (cough) to record the same script while heavily emphasizing various emotions for an hour or so. I’m not sure if that would provide enough data, though.

Using Stacked Models as a Base

Rather than training a model from scratch, my plan is to start with existing STT and TTS models and chain them together as a foundation. Of course, in this setup, the intermediate representation would still be text, meaning it wouldn't yet capture crying, laughter, or other emotional cues.
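
For reference, the stacked baseline is essentially this, with `transcribe` and `synthesize` again standing in for whatever concrete models are used; the comment marks exactly where the expressiveness gets lost:

```python
# Stacked STT -> TTS baseline. Everything expressive is squeezed through a
# text string in the middle, which is where laughter, crying, and tone are lost.
import numpy as np

def transcribe(audio: np.ndarray, sample_rate: int) -> str:
    """Placeholder: pretrained STT model."""
    raise NotImplementedError

def synthesize(text: str, voice: str) -> np.ndarray:
    """Placeholder: pretrained TTS model."""
    raise NotImplementedError

def stacked_s2s(audio: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    text = transcribe(audio, sample_rate)          # text bottleneck: emotion dropped here
    return synthesize(text, voice="target_voice")
```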

Splitting the Model

The TTS and STT models I’ve examined typically consist of around six layers. My idea is to remove the last two layers of the STT model and the first two layers of the TTS model, and bridge the gap with a four-layer core in their place. I would then train this core using the emotionally varied speech data mentioned earlier.

The weights and biases of the input/output layers (the remaining parts of STT and TTS) wouldn’t need to be modified. In theory, this approach should preserve the “voice recognition” capabilities from the STT model’s initial layers while maintaining the voice generation quality from the TTS model’s final layers. At least, that’s my hypothesis.
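
In PyTorch-style pseudocode, the splice I'm imagining would look something like the sketch below. I'm assuming, purely for illustration, that both pretrained models expose their layers as an ordered module list with a shared hidden width; real checkpoints would need their own loading code and probably projection/adapter layers in between:

```python
# Sketch of the splice: frozen STT front end (first 4 layers), a new trainable
# 4-layer core replacing the removed text-facing layers, frozen TTS back end
# (last 4 layers). The layer lists and the shared width of 512 are assumptions.
import torch
import torch.nn as nn

HIDDEN = 512  # assumed hidden width shared by both models (unlikely in practice)

class SplicedS2S(nn.Module):
    def __init__(self, stt_layers: nn.ModuleList, tts_layers: nn.ModuleList):
        super().__init__()
        self.front = nn.Sequential(*stt_layers[:4])   # STT minus its last 2 layers
        self.back = nn.Sequential(*tts_layers[2:])    # TTS minus its first 2 layers
        for p in list(self.front.parameters()) + list(self.back.parameters()):
            p.requires_grad = False                   # keep pretrained weights fixed
        self.core = nn.Sequential(*[                  # the only part that gets trained
            nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
            for _ in range(4)
        ])

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, HIDDEN) acoustic features, e.g. log-mels projected
        # to the model width by the STT model's own input stem.
        h = self.front(feats)
        h = self.core(h)
        return self.back(h)

# Only the core's parameters would go to the optimizer, e.g.:
# optimizer = torch.optim.Adam(model.core.parameters(), lr=1e-4)
```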

Maybe I’m oversimplifying things—has anyone attempted something like this before? Is this a common approach?

Real-Time

The existing open-source models don’t really work in real-time, as far as I know. I would first focus on making everything above work before tackling real-time synthesis.

For real-time conversion, I would split the input stream into 100ms segments and process the last ~4–16 segments (around 0.4 to 1.6 seconds) together. The output would then be synchronized with the previous output and smoothed for a seamless transition.
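
Roughly what I have in mind for the streaming loop, where `convert` stands in for the spliced model above and the 0.8 s context and 25 ms crossfade are just placeholder values I'd tune:

```python
# Streaming sketch: 100 ms frames, a sliding context window of the last
# CONTEXT_FRAMES frames, and a short crossfade so successive outputs join
# without clicks. `convert` is assumed to return audio of the same length
# as its input window.
from collections import deque
import numpy as np

SAMPLE_RATE = 16_000
FRAME = SAMPLE_RATE // 10      # 100 ms per segment
CONTEXT_FRAMES = 8             # ~0.8 s of context (anywhere from 4 to 16)
FADE = FRAME // 4              # 25 ms crossfade (a placeholder value to tune)

def crossfade(old: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Linear fade from the previously held audio into the newly converted audio."""
    ramp = np.linspace(0.0, 1.0, len(new))
    return old * (1.0 - ramp) + new * ramp

def stream(frames, convert):
    """`frames` yields 100 ms float32 chunks; yields smoothed output chunks.
    Output trails the input by FADE samples plus the model's compute time."""
    context = deque(maxlen=CONTEXT_FRAMES)
    held = None                              # last FADE samples, not yet emitted
    for frame in frames:
        context.append(frame)
        out = convert(np.concatenate(context))
        seg = out[-(FRAME + FADE):]          # newest frame plus a short lead-in
        if held is not None:
            seg[:FADE] = crossfade(held, seg[:FADE])  # blend with previous output
        yield seg[:-FADE]                    # emit everything except the tail...
        held = seg[-FADE:]                   # ...which is held for the next blend
    if held is not None:
        yield held                           # flush the final tail
```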

Are there any good papers on approaches or designs for low-latency or real-time applications in audio? Or perhaps insights from real-time video processing, such as autonomous driving?

When you combine them, there is inevitably a large time lag even when streaming. It would be good if there were more models like this.