Approach for Creating a Real-Time Speech-to-Speech Model with Emotions, Laughter, and Crying—aka "The Perfect Voice Changer"

There are already many TTS (Text-to-Speech) and STT (Speech-to-Text) models available. I once tried a paid app that claimed to offer Speech-to-Speech (S2S) conversion. However, it was quite obvious that internally, it was using STT followed by TTS, which resulted in a significant loss of tone and expressiveness. Additionally, it was limited to English vocabulary, meaning it couldn’t accurately process non-verbal sounds like laughter or crying, leading to a loss of emotional nuance.

Has anyone ever trained a model that performs direct speech-to-speech conversion without relying on text as an intermediary? Most existing S2S models focus on translation, which inherently involves a vocabulary-based transformation. My goal is to train a model that completely bypasses any reliance on text-based representation.

Training Data

A straightforward approach would be to use existing TTS models to generate different voices reading the same text. However, TTS struggles with accurately reproducing emotions like crying or laughter.
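
Even though it would miss the emotional side, building that kind of parallel corpus would look roughly like this (`synthesize` is just a placeholder for whatever TTS model is used, not a real API):

```python
# Building a parallel corpus with existing TTS: the same script rendered in a
# "source" and a "target" voice gives aligned input/output pairs. `synthesize`
# is a placeholder, and this data would carry little of the emotion
# (laughter, crying) that I actually need.
import numpy as np

def synthesize(text: str, voice: str) -> np.ndarray:
    """Placeholder: render `text` in the given voice and return waveform samples."""
    raise NotImplementedError

def build_parallel_pairs(script_lines, source_voice="voice_a", target_voice="voice_b"):
    pairs = []
    for line in script_lines:
        src = synthesize(line, source_voice)   # what the converter will hear
        tgt = synthesize(line, target_voice)   # what it should learn to produce
        pairs.append((src, tgt))
    return pairs
```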

Instead, with some personal funding, I would hire film students (cough) to record the same script while heavily emphasizing various emotions for an hour or so. I’m not sure if that would provide enough data, though.

Using Stacked Models as a Base

Rather than training a model from scratch, my plan is to start with existing STT and TTS models and chain them together as a foundation. Of course, in this setup, the intermediate representation would still be text, meaning it wouldn't yet capture crying, laughter, or other emotional cues.
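
For reference, the stacked baseline is essentially this, with `transcribe` and `synthesize` again standing in for whatever concrete models are used; the comment marks exactly where the expressiveness gets lost:

```python
# Stacked STT -> TTS baseline. Everything expressive is squeezed through a
# text string in the middle, which is where laughter, crying, and tone are lost.
import numpy as np

def transcribe(audio: np.ndarray, sample_rate: int) -> str:
    """Placeholder: pretrained STT model."""
    raise NotImplementedError

def synthesize(text: str, voice: str) -> np.ndarray:
    """Placeholder: pretrained TTS model."""
    raise NotImplementedError

def stacked_s2s(audio: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    text = transcribe(audio, sample_rate)          # text bottleneck: emotion dropped here
    return synthesize(text, voice="target_voice")
```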

Splitting the Model

The TTS and STT models I’ve examined typically consist of around six layers. My idea is to remove the last two layers of the STT model and the first two layers of the TTS model, and bridge the gap with a four-layer core in their place. I would then train this core using the emotionally varied speech data mentioned earlier.

The weights and biases of the input/output layers (the remaining parts of STT and TTS) wouldn’t need to be modified. In theory, this approach should preserve the “voice recognition” capabilities from the STT model’s initial layers while maintaining the voice generation quality from the TTS model’s final layers. At least, that’s my hypothesis.
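
In PyTorch-style pseudocode, the splice I'm imagining would look something like the sketch below. I'm assuming, purely for illustration, that both pretrained models expose their layers as an ordered module list with a shared hidden width; real checkpoints would need their own loading code and probably projection/adapter layers in between:

```python
# Sketch of the splice: frozen STT front end (first 4 layers), a new trainable
# 4-layer core replacing the removed text-facing layers, frozen TTS back end
# (last 4 layers). The layer lists and the shared width of 512 are assumptions.
import torch
import torch.nn as nn

HIDDEN = 512  # assumed hidden width shared by both models (unlikely in practice)

class SplicedS2S(nn.Module):
    def __init__(self, stt_layers: nn.ModuleList, tts_layers: nn.ModuleList):
        super().__init__()
        self.front = nn.Sequential(*stt_layers[:4])   # STT minus its last 2 layers
        self.back = nn.Sequential(*tts_layers[2:])    # TTS minus its first 2 layers
        for p in list(self.front.parameters()) + list(self.back.parameters()):
            p.requires_grad = False                   # keep pretrained weights fixed
        self.core = nn.Sequential(*[                  # the only part that gets trained
            nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
            for _ in range(4)
        ])

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, HIDDEN) acoustic features, e.g. log-mels projected
        # to the model width by the STT model's own input stem.
        h = self.front(feats)
        h = self.core(h)
        return self.back(h)

# Only the core's parameters would go to the optimizer, e.g.:
# optimizer = torch.optim.Adam(model.core.parameters(), lr=1e-4)
```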

Maybe I’m oversimplifying things—has anyone attempted something like this before? Is this a common approach?

Real-Time

The existing open-source models don’t really work in real-time, as far as I know. I would first focus on making everything above work before tackling real-time synthesis.

For real-time conversion, I would split the input stream into 100ms segments and process the last ~4–16 segments (around 0.4 to 1.6 seconds) together. The output would then be synchronized with the previous output and smoothed for a seamless transition.
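
Roughly what I have in mind for the streaming loop, where `convert` stands in for the spliced model above and the 0.8 s context and 25 ms crossfade are just placeholder values I'd tune:

```python
# Streaming sketch: 100 ms frames, a sliding context window of the last
# CONTEXT_FRAMES frames, and a short crossfade so successive outputs join
# without clicks. `convert` is assumed to return audio of the same length
# as its input window.
from collections import deque
import numpy as np

SAMPLE_RATE = 16_000
FRAME = SAMPLE_RATE // 10      # 100 ms per segment
CONTEXT_FRAMES = 8             # ~0.8 s of context (anywhere from 4 to 16)
FADE = FRAME // 4              # 25 ms crossfade (a placeholder value to tune)

def crossfade(old: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Linear fade from the previously held audio into the newly converted audio."""
    ramp = np.linspace(0.0, 1.0, len(new))
    return old * (1.0 - ramp) + new * ramp

def stream(frames, convert):
    """`frames` yields 100 ms float32 chunks; yields smoothed output chunks.
    Output trails the input by FADE samples plus the model's compute time."""
    context = deque(maxlen=CONTEXT_FRAMES)
    held = None                              # last FADE samples, not yet emitted
    for frame in frames:
        context.append(frame)
        out = convert(np.concatenate(context))
        seg = out[-(FRAME + FADE):]          # newest frame plus a short lead-in
        if held is not None:
            seg[:FADE] = crossfade(held, seg[:FADE])  # blend with previous output
        yield seg[:-FADE]                    # emit everything except the tail...
        held = seg[-FADE:]                   # ...which is held for the next blend
    if held is not None:
        yield held                           # flush the final tail
```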

Are there any good papers on approaches or designs for low-latency or real-time applications in audio? Or perhaps insights from real-time video processing, such as autonomous driving?

When you combine them, there is inevitably a large time lag even when streaming. It would be good if there were more models like this.