Hi, we are creating a speech dataset for generating speech with several emotions and emphasis. We are now looking for a text-to-speech implementation that can generate speech from text like this:
sad_I am freezing_sad but I have a warm jacket.
I know that Suno's Bark can achieve similar results, but there is no real documentation for Bark and it feels somewhat abandoned.
Can anybody point me to a newer implementation with some reference code that shows how I could use special tokens in a TTS model?
I just checked whether coqui/XTTS-v2 (on Hugging Face) looks promising. It seems that the emotion and speed arguments work only with Coqui Studio models, which are discontinued. You can find more information at this GitHub link.
Bark's documentation includes the following information, but it will not be sufficient for your project:
Below is a list of some known non-speech sounds, but we are finding more every day. Please let us know if you find patterns that work particularly well on Discord!
[laughter]
[laughs]
[sighs]
[music]
[gasps]
[clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word
[MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively
By the way, I can confirm that TTS ignores both emotion and speed, and that the quality of the speech generated with Bark is poor.
But since XTTS-v2 claims the ‘Emotion and style transfer by cloning’ feature, which requires only a few seconds of example audio, you could create example clips with the relevant emotions and generate each part of a sentence using the matching clip.
First, generate ‘I am freezing’ using a sad example, and then generate ‘but I have a warm jacket.’ using a neutral example. Merge these two parts to form the complete sentence.
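To automate the splitting step for text written in the format from the question, something like the following could work. This is only a minimal sketch: the function name split_by_emotion, the "neutral" default label, and the assumption that tags are lowercase words in the form emotion_text_emotion are mine, not part of any TTS library.

```python
import re

# Assumed tag format (taken from the example in the question): sad_I am freezing_sad
# The same lowercase word opens and closes the tagged span.
TAG_PATTERN = re.compile(r"\b([a-z]+)_(.+?)_\1\b")


def split_by_emotion(text, default="neutral"):
    """Split tagged text into a list of (emotion, text) segments.

    Untagged text is labelled with `default` (a hypothetical label).
    """
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Untagged text before this tagged span
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append((default, plain))
        segments.append((match.group(1), match.group(2).strip()))
        pos = match.end()
    # Untagged text after the last tagged span
    tail = text[pos:].strip()
    if tail:
        segments.append((default, tail))
    return segments


print(split_by_emotion("sad_I am freezing_sad but I have a warm jacket."))
# → [('sad', 'I am freezing'), ('neutral', 'but I have a warm jacket.')]
```

Each segment can then be synthesized with the reference clip that matches its emotion label.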
I can confirm that XTTS-v2 generates high-quality output for several languages.
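# First run: pip install TTS
from TTS.api import TTS

# Load the multilingual XTTS-v2 model (gpu=True assumes a CUDA-capable GPU)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Emotional part: the style is cloned from a sad reference clip
tts.tts_to_file(text="I am freezing",
                file_path="sad.wav",
                speaker_wav="sad-input.wav",
                language="en")

# Neutral part: the style is cloned from a neutral reference clip
tts.tts_to_file(text="But ... I have a warm jacket.",
                file_path="default.wav",
                speaker_wav="default-input.wav",
                language="en")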
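For the merge step, a minimal sketch using only Python's standard wave module might look like this. The helper name concat_wavs is hypothetical, and the sketch assumes that all input files are uncompressed PCM WAV files with the same channel count, sample width, and sample rate; it simply appends the audio back to back without any crossfade or pause.

```python
import wave


def concat_wavs(paths, out_path):
    """Concatenate PCM WAV files with identical formats into one file."""
    if not paths:
        raise ValueError("no input files given")

    fmt = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as src:
            current = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if fmt is None:
                fmt = current
            elif current != fmt:
                # Mixing formats would produce distorted audio, so stop here
                raise ValueError(f"{path} has a different audio format: {current} != {fmt}")
            chunks.append(src.readframes(src.getnframes()))

    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(fmt[0])
        dst.setsampwidth(fmt[1])
        dst.setframerate(fmt[2])
        for chunk in chunks:
            dst.writeframes(chunk)
```

For the two files generated above, the call would be concat_wavs(["sad.wav", "default.wav"], "merged.wav"), where merged.wav is just an example output name.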