ML for Audio Study Group - Text to Speech Deep Dive (Jan 4)

Welcome to the second week of ML for Audio Study Group! :loud_sound: :loud_sound:

This week we will do a deep dive into Text to Speech (TTS), with VB (Deloitte) and Vatsal giving the presentations.

Topic: Deep dive into TTS

Suggested Resources (to be read before the session)

The first link points to two Colab notebooks that will let you put into practice what was taught in the previous session.

How to join

Speakers
Vaibhav (VB) is a consultant turned student researcher at University of Stuttgart, Germany. His current research is in the field of Performance Prediction for NLP models and Speech Synthesis. He is also an active volunteer with Europython and Python DE. LinkedIn: https://www.linkedin.com/in/vaibhavs10/

Vatsal left the world of mathematics in 2017 to dive into Speech Synthesis soon after he came across the WaveNet paper. His research has focused on Normalising Flows, a particular kind of Deep Generative Model. At Amazon, he researched the deep-learning based vocoding module that is used in production, and disentanglement in deep generative models for zero-shot speech generation (text-to-speech & voice conversion): publishing 4 papers, 5 patents, and developing multiple product proof-of-concepts. Beyond speech, Vatsal has also spent some time in a team of researchers focused on Bayesian Models/Sparse Gaussian Processes. LinkedIn: https://www.linkedin.com/in/vatsal-aggarwal-993472104/.

You can post all your questions in this topic! They will be answered during the session.

1 Like

Question from averkij from Discord:

If I want to train a TTS model on my own speech to imitate my voice, how much recorded audio do I need to train a model with good quality?

2 Likes

Question from averkij from Discord:

Should I fine-tune a pretrained model or train one from scratch?

2 Likes

Question from averkij from Discord

What architecture and instruments would be better to use?

2 Likes

Question from averkij from Discord

Is an English pretrained model good for fine-tuning on other languages? Or is it necessary to find a model trained on the same language as the target one?

2 Likes

For one speaker, I used an hour and a half of tagged audio with pytorch-dc-tts; maybe with Tacotron 2 you need less.

1 Like

If you have one available for the language, use it for sure. It takes a lot less training time.

Not necessarily, but I think it should still be better than starting from scratch (you will most likely need to change the vocabulary). Finding a model for the target language is always better.
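Changing the vocabulary usually means extending the model's character set with whatever symbols the new language adds. Here is a minimal, framework-agnostic sketch of that check; the English character set below is an assumption for illustration, not any specific model's symbol table:

```python
# Sketch: find which characters in new-language transcripts are missing
# from the character vocabulary an (assumed) English checkpoint was
# trained with. Embeddings for these symbols would need to be added
# before fine-tuning.

def collect_charset(transcripts):
    """Return the sorted set of characters used by a list of transcripts."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    return sorted(chars)

# Hypothetical charset of an English-only checkpoint:
ENGLISH_CHARSET = set("abcdefghijklmnopqrstuvwxyz '!,.?-")

def missing_symbols(transcripts, base_charset=ENGLISH_CHARSET):
    """Characters in the new data that the old vocabulary lacks."""
    return [c for c in collect_charset(transcripts) if c not in base_charset]

# Example: German transcripts introduce the eszett and umlauts.
print(missing_symbols(["Schöne Grüße aus Stuttgart!"]))  # ['ß', 'ö', 'ü']
```

In practice the symbol table lives in the toolkit's config (e.g. a `characters` field in Coqui's configs), so the fix is to extend that list and resize the corresponding embedding layer.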

Hey everyone,

Thank you for joining the stream yesterday.
I’m putting together the responses to the questions below:

Question: If I want to train a TTS model on my own speech to imitate my voice, how much recorded audio do I need to train a model with good quality?

It'd differ from use case to use case. If you want to model general speech, you can definitely use something like YourTTS: [2112.02418] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
They claim that 1 minute of audio is sufficient; however, how much you actually need will depend on what you require from the end result.
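A practical first step is simply measuring how much audio you have recorded. A small standard-library sketch, assuming your clips are uncompressed WAV files:

```python
# Sketch: sum the duration of a set of WAV recordings, using only the
# Python standard library (uncompressed WAV assumed).
import wave

def wav_seconds(path):
    """Duration of a single WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def total_audio_minutes(paths):
    """Summed duration of a recording session, in minutes."""
    return sum(wav_seconds(p) for p in paths) / 60.0
```

For reference, the YourTTS paper's claim above is about one minute of target-speaker audio; for conventional fine-tuning, people in this thread report using on the order of an hour or more.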

Question: Should I fine-tune a pretrained model or train one from scratch?

In almost all cases, fine-tuning a pre-trained model makes much more sense than training from scratch.

Question: What architecture and instruments would be better to use?

FastSpeech2, Tacotron2, and TransformerTTS architectures are a good place to start. You can leverage the fine-tuning guide by Coqui-ai: Fine-tuning a 🐸 TTS model - TTS 0.5.0 documentation
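Whichever architecture you pick, the fine-tuning recipes generally expect your recordings in a standard dataset layout. A common one is the LJSpeech convention: a folder of WAV clips plus a `metadata.csv` with one pipe-separated line per clip. A hedged sketch of writing such a file (the third column is normally the normalized text; here it just repeats the raw transcription, a simplifying assumption — check the formatter you choose):

```python
# Sketch: write an LJSpeech-style metadata.csv for a custom dataset.
# Line format: clip_id|raw_text|normalized_text

def write_metadata(rows, path="metadata.csv"):
    """rows: iterable of (clip_id, transcription) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for clip_id, text in rows:
            f.write(f"{clip_id}|{text}|{text}\n")

write_metadata([
    ("clip_0001", "Hello from the study group."),
    ("clip_0002", "Text to speech is fun."),
])
```

Clip IDs here correspond to WAV file names (e.g. `wavs/clip_0001.wav`); the exact column handling depends on the toolkit's dataset formatter.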

Question: Is an English pretrained model good for fine-tuning on other languages? Or is it necessary to find a model trained on the same language as the target one?

If you have nothing else available, an English model can be a good starting point, or at least serve as a good initial benchmark. Typically, though, fine-tuning a model pretrained on the target language is preferable.

As always you can find the slides on our GitHub repo: GitHub - Vaibhavs10/ml-with-audio: HF's ML for Audio study group

2 Likes

Thank you for the answers. I've started to fine-tune pretrained models, and it's looking good so far.

1 Like