Run crashes with all GPUs but succeeds with fewer

Hi everyone,

I’m fine-tuning GPT2-small on the OpenWebText dataset on a distributed setup (8 GPUs, 1 node), and I’m facing a weird issue: when I use all 8 GPUs, one of the processes receives a SIGTERM.
However, when I use fewer than 8, for example 7 (it doesn’t matter which seven), this doesn’t happen.

An illustration of how I’m running my code:

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=${NUM} \
    run_clm.py \
    --model_name_or_path ${MODEL} \
    --dataset_name ${DS_NAME} \
    --do_train \
    --fp16 \
    --do_eval \
    --ddp_timeout 3240000 \
    --ddp_find_unused_parameters False \
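
For reference, the placeholders in the command are shell variables set along these lines (the exact values here are illustrative):

NUM=8                  # number of processes, i.e. GPUs used on the node
MODEL=gpt2             # the GPT2-small checkpoint
DS_NAME=openwebtext    # dataset name passed to run_clm.py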

Notes:

  • I’m using the ddp_timeout parameter to avoid a timeout, since loading and processing my data takes longer than the default 30 minutes.
  • I face the same issue with fp16 and with the regular fp32 weights.
  • The first thing I tracked was memory usage over the course of the run, and I don’t seem to be hitting OOM errors (the snippet after this list shows how I monitored it).
  • The process that receives the SIGTERM isn’t necessarily the same one each time.
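
This is roughly how I monitored per-GPU memory during the run (a simple nvidia-smi poll; the interval and log file name are just examples):

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 10 > gpu_mem.log   # sample every 10 s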

My error:

In the error below, you can see that it loads my huge dataset (which I already downloaded to a local directory), starts tokenizing it, and then stops after a few steps.
This is what the error looks like:

Versions:

transformers = 4.24
system = Linux
torch = 1.10.2+cu113

Hoping someone has clues about this weird issue,
and if this isn’t the right place for it, I can also open a discussion in the repo issues.

Thanks in advance guys :slight_smile: