Segmentation fault with gradient_checkpointing on multiGPU

mirix · August 21, 2023, 9:13am

I am trying to train a wav2vec2 model on my own dataset by following this template.

I have two issues:

1. The model does not seem to be learning much. I have tried different learning rates and the results differ, but none of them is good enough.
2. If I set gradient_checkpointing=True, training segfaults (core dumped) when CUDA_VISIBLE_DEVICES lists more than one GPU (single node). With just one GPU it works, no matter which one. If gradient_checkpointing is not set, training can take advantage of multiple GPUs. Is this a known issue or a feature? Are there any extra options that need to be set?

I am running CUDA 12.1 with the latest driver and the nightly development build of PyTorch on two RTX 4090s.
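
For readers trying to reproduce the second issue, here is a minimal sketch of the kind of configuration involved. It is not a confirmed fix and nothing in this thread verifies it. It assumes a transformers release recent enough to accept `gradient_checkpointing_kwargs` (4.35 or later) and a script launched with `torchrun --nproc_per_node=2 train.py`, so that the Trainer uses DistributedDataParallel rather than `nn.DataParallel`. The helper names, the `output_dir` value, and the choice of `use_reentrant=False` are illustrative assumptions, not settings from the original post.

```python
# Sketch only: hypothetical helper names and values, not taken from the post.


def checkpointing_kwargs(use_reentrant: bool = False) -> dict:
    """Return keyword arguments for transformers.TrainingArguments."""
    return {
        "output_dir": "wav2vec2-out",  # hypothetical path
        "gradient_checkpointing": True,
        # Selects the non-reentrant checkpoint implementation in PyTorch.
        # Assumption: worth trying under DDP, not a verified cure.
        "gradient_checkpointing_kwargs": {"use_reentrant": use_reentrant},
        # Skip DDP's search for unused parameters on every step.
        "ddp_find_unused_parameters": False,
    }


def build_training_arguments():
    """Construct TrainingArguments; needs transformers and torch installed."""
    from transformers import TrainingArguments  # imported lazily on purpose

    return TrainingArguments(**checkpointing_kwargs())
```

Calling `build_training_arguments()` inside the training script and passing the result to `Trainer` keeps the rest of the template unchanged, which makes it easier to tell whether the launch mode or the checkpointing variant affects the crash.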

rhasan · September 5, 2024, 5:24am

Hi, did you get it solved by any chance?

Unfortunately, I’m also getting a segfault with multi-GPU training, and turning off gradient checkpointing does not solve it either.
I’m using AMD GPUs with the following packages on Python 3.12.3:
pytorch-triton-rocm 3.0.0
torch 2.4.0+rocm6.1
transformers 4.44.2
datasets 2.21.0
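
When reporting a crash like this, it can help to capture a Python-level traceback at the moment of the segfault along with the exact package versions. The following is a minimal sketch using only the standard library; the function name `package_versions` is hypothetical, and the snippet is meant to be placed at the top of the training script, before the model is built.

```python
import faulthandler
import sys
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # On SIGSEGV (and other fatal signals), dump the Python traceback of all
    # threads to stderr before the process dies.
    faulthandler.enable(file=sys.stderr, all_threads=True)
    # Print the versions that are usually requested in bug reports.
    print(package_versions(["torch", "transformers", "datasets"]))
```

The same traceback dump can be enabled without editing the script by setting the environment variable `PYTHONFAULTHANDLER=1` before launching.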

