ML for Audio Study Group - Text to Speech Deep Dive (Jan 4)

Hey everyone,

Thank you for joining the stream yesterday.
I’ve put together responses to the questions below:

Question: If I want to train a TTS model on my own speech to imitate my voice, how much recorded audio do I need to train a model with good quality?

It differs from use case to use case. If you want to model general speech, you can definitely use something like YourTTS: [2112.02418] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
They claim that 1 minute of audio is sufficient, but the amount you need will depend on the quality you expect from the end result.
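
For reference, here's a minimal sketch of zero-shot voice cloning with the YourTTS checkpoint from the Coqui TTS model zoo. It assumes a recent `pip install TTS`; the reference clip and output path are placeholders:

```python
# Minimal sketch: zero-shot voice cloning with YourTTS via the Coqui TTS Python API.
# Assumes a recent version of the TTS package; "my_voice.wav" is a placeholder for
# a short, clean recording of the voice you want to imitate.
from TTS.api import TTS

# Load the multilingual YourTTS checkpoint from the Coqui model zoo
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Synthesize speech that imitates the reference speaker
tts.tts_to_file(
    text="This sentence should sound roughly like the reference speaker.",
    speaker_wav="my_voice.wav",  # reference audio to clone
    language="en",
    file_path="cloned_output.wav",
)
```

In practice, the cleaner and more varied the reference recording, the better the cloned voice tends to sound.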

Question: Should I fine-tune a pretrained model or train one from scratch?

In almost all cases, fine-tuning a pre-trained model makes much more sense than training from scratch.

Question: What architecture and instruments would be better to use?

FastSpeech2, Tacotron2, and TransformerTTS are good architectures to start with. You can leverage the fine-tuning guide by Coqui-ai: Fine-tuning a 🐸 TTS model - TTS 0.5.0 documentation
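
To give a feel for what fine-tuning with Coqui TTS looks like, here is a rough sketch based on the structure of their training recipes. Exact class names and arguments vary between TTS versions, and all paths, dataset settings, and the pretrained checkpoint location below are placeholders, so treat it as a starting point rather than a copy-paste recipe:

```python
# Rough sketch of fine-tuning a Coqui TTS model (Glow-TTS here) on your own recordings.
# Class names and arguments follow recent Coqui TTS recipes and may differ in your version;
# all paths and dataset settings below are placeholders.
import os

from trainer import Trainer, TrainerArgs
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "finetune_output"

# Your recordings, formatted like an LJSpeech-style dataset (wavs/ + metadata.csv)
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path="my_voice_dataset/"
)

config = GlowTTSConfig(
    batch_size=32,
    epochs=100,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

# restore_path points to the downloaded pretrained checkpoint you want to fine-tune from
trainer = Trainer(
    TrainerArgs(restore_path="pretrained/glow_tts/model_file.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

The key idea is the `restore_path` argument: it initializes training from the pretrained checkpoint instead of random weights, which is what the fine-tuning guide walks through.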

Question: Is an English pretrained model good for fine-tuning on other languages, or is it necessary to find a model trained on the same language as the target one?

If you have nothing else available, an English model can be a good starting point, or at least serve as a good initial benchmark. Typically, though, fine-tuning a model pretrained on the target language is preferable.

As always, you can find the slides on our GitHub repo: GitHub - Vaibhavs10/ml-with-audio: HF's ML for Audio study group
