Hi everyone,
I’m fine-tuning GPT2-small on the OpenWebText dataset on a single node with 8 GPUs, and I’m facing a weird issue: when I use all 8 GPUs, one of the processes sends a SIGTERM.
However, when I use fewer than 8, for example 7 (it doesn’t matter which seven), it doesn’t happen.
An illustration of how I’m running my code:
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=${NUM} \
  run_clm.py \
  --model_name_or_path ${MODEL} \
  --dataset_name ${DS_NAME} \
  --do_train \
  --do_eval \
  --fp16 \
  --ddp_timeout 3240000 \
  --ddp_find_unused_parameters False
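For concreteness, the shell variables are set roughly like this (the values below are an illustration of my setup, not the exact script):

NUM=8               # fails with 8 processes, works with 7 or fewer
MODEL=gpt2          # GPT2-small
DS_NAME=openwebtext # OpenWebText from the Hub, already downloaded locally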
Notes:
- I’m using the ddp_timeout parameter to avoid a timeout, since loading and processing my data takes longer than 30 minutes.
- I face the same issue with fp16 and with the regular fp32 weights.
- The first thing I tracked was memory usage during the run (see the monitoring snippet below), and I don’t seem to hit OOM errors.
- The process that sends the SIGTERM isn’t necessarily the same one each time.
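For the memory check mentioned above, I’m just polling per-GPU usage from another shell while training runs; a minimal version of what I use is simply nvidia-smi on a loop:

watch -n 5 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv'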
My error:
In the error output below you can see that it loads my huge dataset (which I already downloaded to a local directory), starts tokenizing it, and then stops after a few steps.
What the error looks like:
Versions:
- transformers = 4.24
- torch = 1.10.2+cu113
- system = Linux
I hope someone has clues about this weird issue;
if this isn’t the right place for it, I can also open a discussion in the repo issues.
Thanks in advance!