Summary update of progress (tl;dr: not much) so far.

- I took a look at the foreign characters mentioned in the previous message and there aren’t many. Listening to the samples, some of them are omitted in the speech and some are pronounced the wrong way (for example, the Japanese character “の” is pronounced “sigma” by one speaker). I did a couple of translations that I thought could make sense and assigned the rest to `[UNK]` (first sketch after this list).
- After data exploration, I consolidated all dataset preparation tasks into a single `map` function, to prevent disk usage from exploding (due to caches and temporary files) and to reduce computation time. I also disabled caching and explicitly saved the pre-processed dataset at the points I’m interested in (second sketch below).
- After I set up the model and invoke `train`, it takes a long time for training to actually start. I’m talking hours when using the complete Common Voice dataset (training and validation samples together). I tried with and without a `dataloader_num_workers` training argument (third sketch below). Even with `dataloader_num_workers` set to 16, I see one process doing something while the rest of the CPUs sit idle. I don’t know what it’s doing; I’ll try to investigate.
- Training on a local GPU with 24 GB of RAM (a 3090), I get CUDA out-of-memory errors after a few steps when using a batch size of 32. I suppose some of the samples are longer and I got unlucky in one batch. I could also omit longer samples (fourth sketch below); does anyone have a feeling for a reasonable maximum duration?
- Iteration and hill-climbing are going to be very slow for the reasons above. I’m currently training on a subset with just 10% of the data using a batch size of 24, to see if that works (last sketch below). At just 3% progress I get `~1.30s/it`. It feels slow, but I don’t really know how that compares. If that works, I plan to train a few epochs per 10% subset.
- The OVH environment looks awesome (thanks a lot!), but the ephemeral disk space is not big enough to process the Spanish dataset. I’m thinking about mounting an additional block storage unit and somehow uploading my pre-processed dataset there. Not sure how that works; I’ll take a look later.

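A few sketches of what I’m actually doing, in the same order as the bullets above. First, the character cleanup: a minimal sketch where the `sentence` column and the `common_voice` dataset id are the standard Common Voice ones, but the replacement table and the leftover character set are illustrative placeholders, not my full mapping.

```python
import re
from datasets import load_dataset

# Spanish Common Voice train split (legacy `common_voice` dataset id, for illustration).
cv_train = load_dataset("common_voice", "es", split="train")

# Illustrative placeholders: a couple of substitutions that seemed to make sense
# after listening to the clips; everything else collapses to [UNK].
replacements = {
    "о": "o",  # hypothetical: a Cyrillic "о" that is clearly read as a Latin "o"
}
leftover_chars = re.compile("[のλ]")  # hypothetical set of remaining foreign characters

def clean_sentence(batch):
    sentence = batch["sentence"]
    for old, new in replacements.items():
        sentence = sentence.replace(old, new)
    # Whatever is left in the foreign set becomes the unknown token.
    batch["sentence"] = leftover_chars.sub("[UNK]", sentence)
    return batch

cv_train = cv_train.map(clean_sentence)
```
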
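Second, the consolidated preparation `map`. This assumes a `Wav2Vec2Processor` already saved locally (hypothetical path) and 16 kHz resampling; the exact arguments are a sketch of my setup, not a drop-in script.

```python
import datasets
import librosa
from transformers import Wav2Vec2Processor

datasets.set_caching_enabled(False)  # stop `map` from filling the disk with cache files

# Hypothetical local path: a processor built earlier from the dataset vocab.
processor = Wav2Vec2Processor.from_pretrained("./processor-es")

def prepare(batch):
    # Load and resample the audio, then extract features and labels in a
    # single pass so intermediate columns never get written to disk.
    speech, _ = librosa.load(batch["path"], sr=16_000)
    batch["input_values"] = processor(speech, sampling_rate=16_000).input_values[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv_train = cv_train.map(prepare, remove_columns=cv_train.column_names)

# Persist explicitly at the point I care about instead of relying on caching.
cv_train.save_to_disk("cv-es-train-preprocessed")
```
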
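Third, the training arguments, in case there’s something obviously wrong with how I’m passing `dataloader_num_workers` (the values match the bullets above; everything else is a placeholder from my config rather than a recommendation).

```python
from transformers import TrainingArguments

# These get handed to the standard Trainer setup; only the arguments relevant
# to the points above are shown here.
training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-es",    # hypothetical output path
    per_device_train_batch_size=32,   # the batch size that eventually runs out of memory
    dataloader_num_workers=16,        # most CPUs still sit idle before training starts
    fp16=True,                        # placeholder from my config
)
```
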
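Fourth, the duration filter I have in mind for the out-of-memory problem. `MAX_SECONDS` is exactly the number I’m asking about, so the value below is a placeholder.

```python
MAX_SECONDS = 10        # placeholder: the value I'm asking for opinions on
SAMPLING_RATE = 16_000

# `input_values` comes from the preparation map above, so its length divided
# by the sampling rate is the clip duration in seconds.
cv_train = cv_train.filter(
    lambda batch: len(batch["input_values"]) / SAMPLING_RATE <= MAX_SECONDS
)
```
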
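Finally, the 10% subsets are just a shuffle plus a slice; later slices would start at `subset_size`, `2 * subset_size`, and so on.

```python
SUBSET_FRACTION = 0.1

shuffled = cv_train.shuffle(seed=42)  # fixed seed so the slices are reproducible
subset_size = int(SUBSET_FRACTION * len(shuffled))

# First 10% slice of the shuffled data.
cv_train_subset = shuffled.select(range(subset_size))
```
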
In summary, processing a language with a relatively large amount of training data is harder than I expected. I’m most worried about the delay before training starts; I might have something misconfigured on my computer. Any hints about that, or strategies for dealing with huge datasets, would be appreciated.
But it’s fun and a great learning experience.