Summary update of progress (tl;dr: not much) so far.

- I took a look at the foreign characters mentioned in the previous message and there aren’t many. Listening to the samples, some of them are omitted in the speech and some are pronounced the wrong way (for example, the Japanese character “の” is pronounced “sigma” by one speaker). I did a couple of translations that I thought could make sense and assigned the rest to `[UNK]` (first sketch after this list).
- After data exploration, I consolidated all dataset preparation tasks into a single `map` function, to prevent disk usage from exploding (due to caches and temporary files) and to reduce computation time. I also disabled caching and explicitly saved the pre-processed dataset at the points I’m interested in (second sketch below).
- After I set up the model and invoke `train`, it takes a long time for training to actually start. I’m talking hours when using the complete Common Voice dataset (training and validation samples together). I tried with and without a `dataloader_num_workers` training argument (third sketch below). Even with `dataloader_num_workers` set to 16, I see one process doing something while the rest of the CPUs sit idle. I don’t know what it’s doing; I’ll try to investigate.
- Training on a local GPU with 24 GB of RAM (a 3090), I get CUDA out-of-memory errors after a few steps when using a batch size of 32. I suppose some of the samples are longer and I got unlucky in one batch. I could also omit longer samples (fourth sketch below); does anyone have a feeling for a reasonable maximum duration?
- Iteration and hill-climbing are going to be very slow for the reasons above. I’m currently training on a subset with just 10% of the data using a batch size of 24, to see if that works (last sketch below). At just 3% progress I get `~1.30s/it`. It feels slow, but I don’t really know how that compares. If that works, I plan to train a few epochs per 10% subset.
- The OVH environment looks awesome (thanks a lot!), but the ephemeral disk space is not big enough to process the Spanish dataset. I’m thinking about mounting an additional block storage unit and somehow uploading my pre-processed dataset there. Not sure how that works; I’ll take a look later.

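A few sketches of what I’m actually doing, in the same order as the bullets above. First, the character cleanup: a minimal sketch where the `sentence` column and the `common_voice` dataset id are the standard Common Voice ones, but the replacement table and the leftover character set are illustrative placeholders, not my full mapping.

```python
import re
from datasets import load_dataset

# Spanish Common Voice train split (legacy `common_voice` dataset id, for illustration).
cv_train = load_dataset("common_voice", "es", split="train")

# Illustrative placeholders: a couple of substitutions that seemed to make sense
# after listening to the clips; everything else collapses to [UNK].
replacements = {
    "о": "o",  # hypothetical: a Cyrillic "о" that is clearly read as a Latin "o"
}
leftover_chars = re.compile("[のλ]")  # hypothetical set of remaining foreign characters

def clean_sentence(batch):
    sentence = batch["sentence"]
    for old, new in replacements.items():
        sentence = sentence.replace(old, new)
    # Whatever is left in the foreign set becomes the unknown token.
    batch["sentence"] = leftover_chars.sub("[UNK]", sentence)
    return batch

cv_train = cv_train.map(clean_sentence)
```
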
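Second, the consolidated preparation `map`. This assumes a `Wav2Vec2Processor` already saved locally (hypothetical path) and 16 kHz resampling; the exact arguments are a sketch of my setup, not a drop-in script.

```python
import datasets
import librosa
from transformers import Wav2Vec2Processor

datasets.set_caching_enabled(False)  # stop `map` from filling the disk with cache files

# Hypothetical local path: a processor built earlier from the dataset vocab.
processor = Wav2Vec2Processor.from_pretrained("./processor-es")

def prepare(batch):
    # Load and resample the audio, then extract features and labels in a
    # single pass so intermediate columns never get written to disk.
    speech, _ = librosa.load(batch["path"], sr=16_000)
    batch["input_values"] = processor(speech, sampling_rate=16_000).input_values[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv_train = cv_train.map(prepare, remove_columns=cv_train.column_names)

# Persist explicitly at the point I care about instead of relying on caching.
cv_train.save_to_disk("cv-es-train-preprocessed")
```
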
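Third, the training arguments, in case there’s something obviously wrong with how I’m passing `dataloader_num_workers` (the values match the bullets above; everything else is a placeholder from my config rather than a recommendation).

```python
from transformers import TrainingArguments

# These get handed to the standard Trainer setup; only the arguments relevant
# to the points above are shown here.
training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-es",    # hypothetical output path
    per_device_train_batch_size=32,   # the batch size that eventually runs out of memory
    dataloader_num_workers=16,        # most CPUs still sit idle before training starts
    fp16=True,                        # placeholder from my config
)
```
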
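Fourth, the duration filter I have in mind for the out-of-memory problem. `MAX_SECONDS` is exactly the number I’m asking about, so the value below is a placeholder.

```python
MAX_SECONDS = 10        # placeholder: the value I'm asking for opinions on
SAMPLING_RATE = 16_000

# `input_values` comes from the preparation map above, so its length divided
# by the sampling rate is the clip duration in seconds.
cv_train = cv_train.filter(
    lambda batch: len(batch["input_values"]) / SAMPLING_RATE <= MAX_SECONDS
)
```
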
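Finally, the 10% subsets are just a shuffle plus a slice; later slices would start at `subset_size`, `2 * subset_size`, and so on.

```python
SUBSET_FRACTION = 0.1

shuffled = cv_train.shuffle(seed=42)  # fixed seed so the slices are reproducible
subset_size = int(SUBSET_FRACTION * len(shuffled))

# First 10% slice of the shuffled data.
cv_train_subset = shuffled.select(range(subset_size))
```
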
In summary, processing a language with a relatively large amount of training data is harder than I expected. I’m most worried about the delay before training starts; I might have something misconfigured on my computer. Any hints about that, or strategies for dealing with huge datasets, would be appreciated.
But it’s fun and a great learning experience.