Run crashes with all GPUs but succeeds with fewer

Hi everyone,

I’m fine-tuning GPT2-small on the OpenWebText dataset on a distributed setup (8 GPUs, 1 node), and I’m facing a weird issue: when I use all 8 GPUs, one of the processes receives a SIGTERM.
However, when I use fewer than 8, for example 7 (it doesn’t matter which seven), this doesn’t happen.

An illustration of how I’m running my code:

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=${NUM} \
    run_clm.py \
    --model_name_or_path ${MODEL} \
    --dataset_name ${DS_NAME} \
    --do_train \
    --fp16 \
    --do_eval \
    --ddp_timeout 3240000 \
    --ddp_find_unused_parameters False \
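
For reference, the placeholders in the command are shell variables set along these lines (the exact values here are illustrative):

NUM=8                  # number of processes, i.e. GPUs used on the node
MODEL=gpt2             # the GPT2-small checkpoint
DS_NAME=openwebtext    # dataset name passed to run_clm.py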

Notes:

  • I’m using the ddp_timeout parameter to avoid a timeout, since loading and processing my data takes longer than the default 30 minutes.
  • I face the same issue with fp16 and with the regular fp32 weights.
  • The first thing I tracked was memory usage over the course of the run, and I don’t seem to be hitting OOM errors (the snippet after this list shows how I monitored it).
  • The process that receives the SIGTERM isn’t necessarily the same one each time.
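
This is roughly how I monitored per-GPU memory during the run (a simple nvidia-smi poll; the interval and log file name are just examples):

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 10 > gpu_mem.log   # sample every 10 s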

My error:

In the error below, you can see that it loads my huge dataset (which I already downloaded to a local directory), starts tokenizing it, and then stops after a few steps.
This is what the error looks like:

Versions:

transformers = 4.24
system = Linux
torch = 1.10.2+cu113

Hoping someone has clues about this weird issue,
and if this isn’t the right place for it, I can also open a discussion in the repo issues.

Thanks in advance guys :slight_smile: