Wav2vec2 CUDA OOM with distributed training

I’m trying to reproduce distributed training for the Wav2vec2 model.
My computation resources:
A server with 128 cores, 2 TB RAM, and 8× A100 40GB GPUs.
For the dataset, I’m trying to train a Chinese ASR acoustic model on about 580 GB of WAV files.
My reproduction code is in this repo:

My question is: with the same configuration on a single A100, it works fine.
But when I run with all 8 GPUs, CUDA OOM appears after a few minutes.
What is wrong with my code?
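
For context, here is a minimal sketch of the kind of multi-GPU setup I mean (this is not my exact repo code; the checkpoint name, optimizer, and torchrun launcher are just illustrative):

```python
# Minimal DDP sketch: same per-GPU batch size as the single-GPU run,
# launched as 8 processes with torchrun. Placeholder model/optimizer.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import Wav2Vec2ForCTC

def main():
    # torchrun sets LOCAL_RANK for each of the 8 worker processes
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Example checkpoint only; my actual config lives in the repo above
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    # ... DataLoader with DistributedSampler, same per-GPU batch size ...
    # for batch in loader:
    #     out = model(input_values=batch["input_values"].cuda(local_rank),
    #                 labels=batch["labels"].cuda(local_rank))
    #     out.loss.backward()
    #     optimizer.step()
    #     optimizer.zero_grad()

if __name__ == "__main__":
    main()

# Launch: torchrun --nproc_per_node=8 train.py
```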