Speech synthesis model with Styles Like Emoticons or emphasis

padmalcom · December 24, 2024, 1:02pm

Hi, we are creating a speech dataset to generate speech with several emotions and emphasis. We are now looking for a text-to-speech implementation that can generate speech from text like this:

sad_I am freezing_sad but I have a warm jacket.

I know that Suno's Bark can achieve similar results, but there is no real documentation on Bark and it feels somewhat abandoned.

Can anybody give me a hint on a newer implementation with some reference code that shows how I could use special tokens in a TTS model?

Thank you all!

mahmutc · December 25, 2024, 9:51am

Hi @padmalcom

I just tried coqui/XTTS-v2 (on Hugging Face) to see whether it is promising. It seems that the emotion and speed arguments work only with Coqui Studio models, which are discontinued. You can find more information at this GitHub link.

Relevant links: coqui/XTTS-v2 · Hugging Face and infinisoft/tts · Hugging Face

Bark's documentation contains the following information, but it will not be sufficient for your project:

Below is a list of some known non-speech sounds, but we are finding more every day. Please let us know if you find patterns that work particularly well on Discord!

[laughter]

[laughs]

[sighs]

[music]

[gasps]

[clears throat]

— or ... for hesitations

♪ for song lyrics

CAPITALIZATION for emphasis of a word

[MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively

By the way, I can confirm that TTS ignores both emotion and speed, and that the quality of the speech generated with Bark is poor.

mahmutc · December 25, 2024, 10:02am

However, XTTS-v2 claims an ‘Emotion and style transfer by cloning’ feature that needs only a few seconds of reference audio, so you could record reference clips with the relevant emotions and generate each part of a sentence with the matching clip.

First, generate ‘I am freezing’ using a sad example, and then generate ‘but I have a warm jacket.’ using a neutral example. Merge these two parts to form the complete sentence.
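The splitting step could be automated. Below is a minimal sketch, assuming the style_text_style markup from the first post; the function name split_by_style and the "neutral" default label are hypothetical, not part of XTTS-v2 or any library. It only turns one tagged string into (style, text) pairs that could then be synthesized one by one.

```python
import re

# Assumed markup, taken from the example in the first post:
#   sad_I am freezing_sad but I have a warm jacket.
# A lowercase style name, an underscore, the text, an underscore, the same name.
TAG = re.compile(r"\b([a-z]+)_(.+?)_\1\b")


def split_by_style(text, default="neutral"):
    """Return a list of (style, text) segments in reading order."""
    segments = []
    pos = 0
    for match in TAG.finditer(text):
        # Untagged text before this tag gets the default style.
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append((default, plain))
        segments.append((match.group(1), match.group(2).strip()))
        pos = match.end()
    # Untagged text after the last tag.
    tail = text[pos:].strip()
    if tail:
        segments.append((default, tail))
    return segments
```

Each style name would then be mapped to a reference clip with that emotion, and each segment generated separately.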

I can confirm that XTTS-v2 generates high-quality output for several languages.

mahmutc · December 25, 2024, 12:16pm

You can find the files generated by XTTS-v2 on Hugging Face:

Input files are from https://speechbot.github.io/expresso/ (Read Short Sentences, ex03, default and sad).

The code I ran:

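# First run: pip install TTS

from TTS.api import TTS

# Load the multilingual XTTS-v2 model (gpu=True assumes a CUDA device is available)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# First part of the sentence, cloned from the sad reference clip
tts.tts_to_file(text="I am freezing",
  file_path="sad.wav",
  speaker_wav="sad-input.wav",
  language="en")

# Second part of the sentence, cloned from the default (neutral) reference clip
tts.tts_to_file(text="But ... I have a warm jacket.",
  file_path="default.wav",
  speaker_wav="default-input.wav",
  language="en")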
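
To merge the two parts into one sentence, the output files can be concatenated. This is a rough sketch using only Python's standard wave module; concat_wavs is a hypothetical helper, and it assumes both files are uncompressed PCM WAV with the same channel count, sample width, and sample rate.

```python
import wave


def concat_wavs(paths, out_path):
    """Append the audio of several PCM WAV files into one file, in order."""
    if not paths:
        raise ValueError("no input files given")
    fmt = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as src:
            this_fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                # Mixing formats would produce garbled audio, so refuse.
                raise ValueError(f"{path} has a different audio format: {this_fmt} != {fmt}")
            chunks.append(src.readframes(src.getnframes()))
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(fmt[0])
        dst.setsampwidth(fmt[1])
        dst.setframerate(fmt[2])
        for chunk in chunks:
            dst.writeframes(chunk)
```

For the files above, the call would be concat_wavs(["sad.wav", "default.wav"], "sentence.wav"). A short silence between the parts or a crossfade may sound more natural, but that is not covered here.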
