ML for Audio Study Group - Text to Speech Deep Dive (Jan 4)

Welcome to the second week of ML for Audio Study Group! :loud_sound: :loud_sound:

This week we will do a deep dive into Text to Speech (TTS), with VB (Deloitte) and Vatsal giving the presentations.

Topic: Deep dive into TTS

Suggested Resources (to be read before the session)

The first link points to two Colab notebooks that will let you put into practice what was taught in the previous session.

How to join

Speakers
Vaibhav (VB) is a consultant turned student researcher at University of Stuttgart, Germany. His current research is in the field of Performance Prediction for NLP models and Speech Synthesis. He is also an active volunteer with Europython and Python DE. LinkedIn: https://www.linkedin.com/in/vaibhavs10/

Vatsal left the world of mathematics in 2017 to dive into Speech Synthesis soon after he came across the WaveNet paper. His research has focused on Normalising Flows, a particular kind of Deep Generative Model. At Amazon, he researched the deep-learning based vocoding module that is used in production, and disentanglement in deep generative models for zero-shot speech generation (text-to-speech & voice conversion): publishing 4 papers, 5 patents, and developing multiple product proof-of-concepts. Beyond speech, Vatsal has also spent some time in a team of researchers focused on Bayesian Models/Sparse Gaussian Processes. LinkedIn: https://www.linkedin.com/in/vatsal-aggarwal-993472104/.

You can post all your questions in this topic! They will be answered during the session.

1 Like

Question from averkij from Discord:

If I want to train a TTS model on my own speech to imitate my voice, how much recorded audio do I need to train a model with good quality?

2 Likes

Question from averkij from Discord:

Should I fine-tune a pretrained model or train one from scratch?

2 Likes

Question from averkij from Discord

What architecture and instruments would be better to use?

2 Likes

Question from averkij from Discord

Is an English pretrained model good for fine-tuning on other languages? Or is it necessary to find a model trained on the same language as the target one?

2 Likes

For one speaker, I used an hour and a half of tagged audio with pytorch-dc-tts; maybe with Tacotron 2 you need less.

1 Like

If you have one available for the language, use it for sure. It takes a lot less training time.

Not necessarily, but I think it should still be better than starting from scratch (you will most likely need to change the vocabulary). Finding a model for the target language is always better.
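Changing the vocabulary usually means extending the model's character set with whatever symbols the new language adds. Here is a minimal, framework-agnostic sketch of that check; the English character set below is an assumption for illustration, not any specific model's symbol table:

```python
# Sketch: find which characters in new-language transcripts are missing
# from the character vocabulary an (assumed) English checkpoint was
# trained with. Embeddings for these symbols would need to be added
# before fine-tuning.

def collect_charset(transcripts):
    """Return the sorted set of characters used by a list of transcripts."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    return sorted(chars)

# Hypothetical charset of an English-only checkpoint:
ENGLISH_CHARSET = set("abcdefghijklmnopqrstuvwxyz '!,.?-")

def missing_symbols(transcripts, base_charset=ENGLISH_CHARSET):
    """Characters in the new data that the old vocabulary lacks."""
    return [c for c in collect_charset(transcripts) if c not in base_charset]

# Example: German transcripts introduce the eszett and umlauts.
print(missing_symbols(["Schöne Grüße aus Stuttgart!"]))  # ['ß', 'ö', 'ü']
```

In practice the symbol table lives in the toolkit's config (e.g. a `characters` field in Coqui's configs), so the fix is to extend that list and resize the corresponding embedding layer.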

Hey everyone,

Thank you for joining the stream yesterday.
I’m putting together the responses to the questions below:

Question: If I want to train a TTS model on my own speech to imitate my voice, how much recorded audio do I need to train a model with good quality?

It'd differ from use case to use case. If you want to model general speech, you can definitely use something like YourTTS: [2112.02418] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
They claim that 1 minute of audio is sufficient; however, how much you actually need will depend on what you require from the end result.
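A practical first step is simply measuring how much audio you have recorded. A small standard-library sketch, assuming your clips are uncompressed WAV files:

```python
# Sketch: sum the duration of a set of WAV recordings, using only the
# Python standard library (uncompressed WAV assumed).
import wave

def wav_seconds(path):
    """Duration of a single WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def total_audio_minutes(paths):
    """Summed duration of a recording session, in minutes."""
    return sum(wav_seconds(p) for p in paths) / 60.0
```

For reference, the YourTTS paper's claim above is about one minute of target-speaker audio; for conventional fine-tuning, people in this thread report using on the order of an hour or more.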

Question: Should I fine-tune a pretrained model or train one from scratch?

In almost all cases, fine-tuning a pre-trained model makes much more sense than training from scratch.

Question: What architecture and instruments would be better to use?

FastSpeech2, Tacotron2, and TransformerTTS architectures are a good place to start. You can leverage the fine-tuning guide by Coqui-ai: Fine-tuning a 🐸 TTS model - TTS 0.5.0 documentation
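Whichever architecture you pick, the fine-tuning recipes generally expect your recordings in a standard dataset layout. A common one is the LJSpeech convention: a folder of WAV clips plus a `metadata.csv` with one pipe-separated line per clip. A hedged sketch of writing such a file (the third column is normally the normalized text; here it just repeats the raw transcription, a simplifying assumption — check the formatter you choose):

```python
# Sketch: write an LJSpeech-style metadata.csv for a custom dataset.
# Line format: clip_id|raw_text|normalized_text

def write_metadata(rows, path="metadata.csv"):
    """rows: iterable of (clip_id, transcription) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for clip_id, text in rows:
            f.write(f"{clip_id}|{text}|{text}\n")

write_metadata([
    ("clip_0001", "Hello from the study group."),
    ("clip_0002", "Text to speech is fun."),
])
```

Clip IDs here correspond to WAV file names (e.g. `wavs/clip_0001.wav`); the exact column handling depends on the toolkit's dataset formatter.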

Question: Is an English pretrained model good for fine-tuning on other languages? Or is it necessary to find a model trained on the same language as the target one?

If you have nothing else available, an English model can be a good starting point, or at least serve as a good initial benchmark. Typically, though, fine-tuning a model pretrained on the target language is preferable.

As always you can find the slides on our GitHub repo: GitHub - Vaibhavs10/ml-with-audio: HF's ML for Audio study group

2 Likes

Thank you for the answers. I've started to fine-tune pretrained models, and it's looking good so far.

1 Like