Very Slow Fine Tuning Performance for Speech?

I am trying to fine-tune Facebook's wav2vec2 model (pretrained on 60K hours of speech) as described in Patrick von Platen's article: Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers

I have about 200 hours of speech data. I am using 3 RTX 8000 GPUs on a 48-core Lenovo SR670 with 369 GB of memory (on our NYU compute cluster).

It seems to be using all 3 GPUs (utilization is high), but training is as slow as molasses. If I read the output correctly, it looks like it will take over a week to fine-tune the model for 30 epochs.
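For a rough sense of scale, here is the back-of-envelope math I'm doing. The utterance length, batch sizes, and seconds-per-step below are my own assumptions for illustration, not numbers from the article, so please correct me if any of them look off:

```python
# Rough estimate of total fine-tuning time.
# All numeric values below are assumptions, not measured from my run.

hours_of_speech = 200
avg_utterance_sec = 10                     # assumed average clip length
num_samples = int(hours_of_speech * 3600 / avg_utterance_sec)

per_device_batch = 16                      # assumed per-GPU batch size
n_gpus = 3
grad_accum = 1                             # assumed no gradient accumulation
effective_batch = per_device_batch * n_gpus * grad_accum

steps_per_epoch = num_samples // effective_batch
epochs = 30
total_steps = steps_per_epoch * epochs

sec_per_step = 4                           # placeholder; read the real value
total_days = total_steps * sec_per_step / 86400  # off the Trainer progress bar

print(f"{num_samples} samples, {steps_per_epoch} steps/epoch, "
      f"{total_steps} total steps, ~{total_days:.1f} days at "
      f"{sec_per_step} s/step")
```

Plugging the actual seconds-per-step shown on the Trainer progress bar into `sec_per_step` is how I'm getting my "over a week" estimate.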

Does this sound correct?

Thanks
Michael Picheny