I am trying to fine-tune Facebook's 60K-hour wav2vec2 model as described in Patrick von Platen's article: Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers
I have about 200 hours of speech data. I am using 3 RTX8000 GPUs on a 48 Core Lenovo SR670 with 369 GB of memory (on our NYU compute cluster).
It seems to be using all 3 GPUs (utilization is high), but training is as slow as molasses. If I'm reading the output right, it looks like it will take over a week to fine-tune the model for 30 epochs.
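For context, my week-plus figure comes from back-of-envelope arithmetic like the sketch below. The utterance length, per-GPU batch size, and steps/sec are illustrative assumptions, not my exact numbers; the steps/sec figure is the kind of value you'd read off the Trainer's progress bar.

```python
import math

# Rough wall-clock estimate for fine-tuning.
# All figures below are assumptions for illustration --
# substitute your own dataset stats and Trainer progress-bar readings.
hours_of_audio = 200
avg_utterance_sec = 10                 # assumed mean clip length
num_samples = int(hours_of_audio * 3600 / avg_utterance_sec)  # ~72,000 clips

per_device_batch = 8                   # assumed per-GPU batch size
num_gpus = 3
grad_accum = 1                         # gradient accumulation steps
effective_batch = per_device_batch * num_gpus * grad_accum

epochs = 30
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = epochs * steps_per_epoch

steps_per_sec = 0.15                   # assumed, read off the tqdm progress bar
est_days = total_steps / steps_per_sec / 86400
print(f"{total_steps} steps, ~{est_days:.1f} days")
# With these assumed numbers: 90,000 steps, roughly 7 days
```

With plausible numbers plugged in, the estimate lands right around a week, which is why I'm wondering whether this is expected or whether something in my setup is wrong.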
Does this sound correct?
Thanks
Michael Picheny