Spanish ASR: Fine-Tuning Wav2Vec2

A summary of my progress so far (tl;dr: not much).

  • I took a look at the foreign characters mentioned in the previous message, and there aren’t many. Listening to the samples, some of them are omitted in the speech and some are pronounced incorrectly (for example, the Japanese character “の” is pronounced “sigma” by one speaker). I applied a couple of substitutions that I thought could make sense and mapped the rest to [UNK].
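
For the record, the character handling amounts to a simple substitution table. A minimal sketch (the mappings below are illustrative, not the ones I actually used):

```python
# Illustrative mappings only; the real sets depend on what shows up in the transcripts.
CHAR_MAP = {"の": "no"}     # characters I chose a substitution for
DROP_TO_UNK = set("漢火")   # everything else unusual becomes the [UNK] token

def clean_transcript(text: str) -> str:
    out = []
    for ch in text:
        if ch in CHAR_MAP:
            out.append(CHAR_MAP[ch])
        elif ch in DROP_TO_UNK:
            out.append("[UNK]")
        else:
            out.append(ch)
    return "".join(out)
```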

  • After data exploration, I consolidated all dataset preparation tasks into a single map function, to prevent disk usage from exploding (due to caches and temporary files) and to reduce computation time. I also disabled caching and explicitly saved the pre-processed dataset at the points I’m interested in.
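
In case it’s useful, the consolidation looks roughly like this (the step functions are placeholders for my actual preparation steps; the 🤗 datasets calls are shown as comments because they need the full environment):

```python
# Sketch: one combined function means a single dataset.map() pass instead of
# several chained ones, each of which would write its own intermediate Arrow
# cache file to disk.

def lowercase_text(example):
    example["sentence"] = example["sentence"].lower()
    return example

def strip_punctuation(example):
    example["sentence"] = "".join(
        c for c in example["sentence"] if c.isalnum() or c.isspace()
    )
    return example

def prepare(example):
    # everything in one pass; with caching off, no intermediate files at all
    example = lowercase_text(example)
    example = strip_punctuation(example)
    return example

# With 🤗 datasets this would look roughly like:
#   from datasets import disable_caching
#   disable_caching()
#   ds = ds.map(prepare)
#   ds.save_to_disk("prepared_es")  # explicit checkpoint where I need one
```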

  • After I set up the model and call train, it takes a long time for training to actually start. I’m talking hours when using the complete Common Voice dataset (training and validation samples together). I tried with and without the dataloader_num_workers training argument; even with dataloader_num_workers set to 16, I see one process doing something while the rest of the CPUs sit idle. I don’t know what it’s doing, I’ll try to investigate.
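
One guess I want to rule out (an assumption on my part, not a diagnosis): if group_by_length=True is set in the training arguments, the Trainer’s length-grouped sampler reads every sample once, single-threaded, before the first step. Precomputing a length column should let it skip that scan. A sketch:

```python
# Assumption: the slow start comes from the length sampler scanning the whole
# dataset. A precomputed length column lets the Trainer read lengths directly.

def add_input_length(example):
    example["input_length"] = len(example["input_values"])
    return example

# ds = ds.map(add_input_length, num_proc=16)  # parallel, done once up front
# args = TrainingArguments(
#     ...,
#     group_by_length=True,
#     length_column_name="input_length",
# )
```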

  • Training on a local GPU with 24 GB of VRAM (an RTX 3090), I get CUDA out-of-memory errors after a few steps when using a batch size of 32. I suppose some of the samples are longer and I got unlucky in one batch. I could also omit longer samples; does anyone have a feeling for a reasonable maximum duration?
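
For dropping long clips, I’d filter on sample count before training. A sketch, with an assumed cutoff (not a recommendation, since that’s exactly what I’m asking about):

```python
SAMPLING_RATE = 16_000  # Wav2Vec2 expects 16 kHz input
MAX_SECONDS = 12.0      # assumed cutoff, to be tuned

def short_enough(example) -> bool:
    # duration in seconds = number of samples / sampling rate
    return len(example["input_values"]) / SAMPLING_RATE <= MAX_SECONDS

# ds = ds.filter(short_enough, num_proc=16)
```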

  • Iteration and hill-climbing are going to be very slow for the reasons above. I’m currently training on a subset with just 10% of the data, using a batch size of 24, to see if that works. At just 3% progress I’m seeing ~1.30 s/it. It feels slow, but I don’t really know how that compares. If that works, I plan to train a few epochs per 10% subset.
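
The rotation through 10% subsets would be something like this (a plain-Python sketch of the index bookkeeping; with 🤗 datasets, ds.select(indices) would then take each chunk):

```python
import random

def rotating_subsets(n_examples: int, fraction: float = 0.1, seed: int = 42):
    """Yield disjoint, shuffled index lists covering the data in
    fraction-sized chunks, so each subset sees fresh examples."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    step = max(1, int(n_examples * fraction))
    for start in range(0, n_examples, step):
        yield idx[start:start + step]
```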

  • The OVH environment looks awesome (thanks a lot!), but the ephemeral disk space is not big enough to process the Spanish dataset. I’m thinking about mounting an additional block storage volume and uploading my pre-processed dataset there somehow. Not sure how that works, I’ll take a look later.

In summary, processing a language with a relatively large amount of training data is harder than I expected. I’m most worried about the delay before training starts; I might have something misconfigured on my machine. Any hints about that, or strategies for dealing with huge datasets, would be appreciated.

But it’s fun and a great learning experience :slight_smile:
