ML for Audio Study Group - Text to Speech Deep Dive (Jan 4)

Hey everyone,

Thank you for joining the stream yesterday.
I’ve put together responses to the questions below:

Question: If I want to train a TTS model on my own speech to imitate my voice, how much recorded audio do I need to train a model with good quality?

It differs from use case to use case. If you want to model general speech, you can definitely use something like YourTTS: [2112.02418] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
They claim that 1 minute of audio is sufficient, but the amount you need will depend on the quality you expect from the end result.
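
For reference, here's a minimal sketch of zero-shot voice cloning with the YourTTS checkpoint from the Coqui TTS model zoo. It assumes a recent `pip install TTS`; the reference clip and output path are placeholders:

```python
# Minimal sketch: zero-shot voice cloning with YourTTS via the Coqui TTS Python API.
# Assumes a recent version of the TTS package; "my_voice.wav" is a placeholder for
# a short, clean recording of the voice you want to imitate.
from TTS.api import TTS

# Load the multilingual YourTTS checkpoint from the Coqui model zoo
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Synthesize speech that imitates the reference speaker
tts.tts_to_file(
    text="This sentence should sound roughly like the reference speaker.",
    speaker_wav="my_voice.wav",  # reference audio to clone
    language="en",
    file_path="cloned_output.wav",
)
```

In practice, the cleaner and more varied the reference recording, the better the cloned voice tends to sound.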

Question: Should I fine-tune a pretrained model or train one from scratch?

In almost all cases, fine-tuning a pre-trained model makes much more sense than training from scratch.

Question: What architecture and instruments would be better to use?

FastSpeech2, Tacotron2, and TransformerTTS are good architectures to start with. You can leverage the fine-tuning guide by Coqui-ai: Fine-tuning a 🐸 TTS model - TTS 0.5.0 documentation
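
To give a feel for what fine-tuning with Coqui TTS looks like, here is a rough sketch based on the structure of their training recipes. Exact class names and arguments vary between TTS versions, and all paths, dataset settings, and the pretrained checkpoint location below are placeholders, so treat it as a starting point rather than a copy-paste recipe:

```python
# Rough sketch of fine-tuning a Coqui TTS model (Glow-TTS here) on your own recordings.
# Class names and arguments follow recent Coqui TTS recipes and may differ in your version;
# all paths and dataset settings below are placeholders.
import os

from trainer import Trainer, TrainerArgs
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "finetune_output"

# Your recordings, formatted like an LJSpeech-style dataset (wavs/ + metadata.csv)
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path="my_voice_dataset/"
)

config = GlowTTSConfig(
    batch_size=32,
    epochs=100,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

# restore_path points to the downloaded pretrained checkpoint you want to fine-tune from
trainer = Trainer(
    TrainerArgs(restore_path="pretrained/glow_tts/model_file.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

The key idea is the `restore_path` argument: it initializes training from the pretrained checkpoint instead of random weights, which is what the fine-tuning guide walks through.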

Question: Is an English pretrained model good for fine-tuning on other languages, or is it necessary to find a model trained on the same language as the target one?

If you have nothing else available, an English model can be a good starting point, or at least serve as a good initial benchmark. Typically, though, fine-tuning a model pretrained on the target language is preferable.

As always, you can find the slides on our GitHub repo: GitHub - Vaibhavs10/ml-with-audio: HF's ML for Audio study group
