Multi-GPU Audio Fine-Tuning for Wav2Vec2 Failing on 4 GPUs but Successful on 1 GPU

Hi All,

Thanks in advance for your help.

Context: I am having an issue with training using the Trainer in a multi-GPU setup. I am fine-tuning Wav2Vec 2.0 on a dataset with >100,000 audio segments, and hence I cannot realistically run the full job on a single GPU.

What am I doing?
To build a script that I can test on both a single-GPU and a multi-GPU setup, I am picking only 1,000 short audio segments (<6 seconds) and their transcriptions for fine-tuning.
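For reference, the subsampling looks roughly like this (a minimal sketch assuming a 🤗 `datasets` dataset with an `audio` column; the loading call and paths are placeholders for my actual data):

```python
from datasets import load_dataset, Audio

# Placeholder loading call -- the real dataset lives elsewhere.
ds = load_dataset("audiofolder", data_dir="data/segments", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

MAX_SECONDS = 6.0

def is_short(example):
    # Keep only segments shorter than 6 seconds.
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] < MAX_SECONDS

# Filter to short clips, then take a fixed 1,000-segment subset
# for the single-GPU vs multi-GPU comparison.
ds = ds.filter(is_short)
ds = ds.shuffle(seed=42).select(range(1000))
```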

What’s happening?
When I train on a single GPU, training runs successfully and I have no problems with the 1,000 audio segments. However, when I try to do the same thing with 4 GPUs, it crashes with an out-of-memory error (RuntimeError: CUDA out of memory. Tried to allocate XXXXXXXX).

Help Required?
How can I make this work? As per the article here (From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease), the Trainer takes care of distributed training by itself, so there is clearly something wrong that I am unable to debug. Any help here would be appreciated.
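For reference, here is a trimmed-down sketch of roughly what the training setup looks like. The hyperparameter values follow the XLSR notebook linked below; `processor`, `data_collator`, `train_ds`, and `eval_ds` are placeholders for the objects built earlier in the script, and the launcher mentioned afterwards is likewise an assumption about how the 4-GPU job is started.

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer

# Model is loaded as in the XLSR notebook (placeholder `processor` assumed).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()
model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    output_dir="./wav2vec2-finetuned",
    group_by_length=True,            # batch similarly-sized segments together
    per_device_train_batch_size=16,  # per GPU, so 4 GPUs see 4x this per step
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,          # padding collator from the notebook
    train_dataset=train_ds,               # the 1,000-segment subset
    eval_dataset=eval_ds,
    tokenizer=processor.feature_extractor,
)

trainer.train()
```

On a single GPU I run the script directly with `python train.py`; for the 4-GPU attempt it is launched with `torchrun --nproc_per_node=4 train.py` (script name here is a placeholder).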

Script similar to this notebook: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers
Hardware: Tesla V100-SXM2-32GB