Hi everyone and @muellerzr
I pre-trained MaskedLM models using the RoBERTa class and am now running fine-tuning for sequence classification tasks.
Without HPO, my code runs fine, including in the DDP setting (one node, 8 GPUs).
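For context, the fine-tuning setup looks roughly like this; paths, data, and hyperparameters below are placeholders, not my exact script:

```python
# Rough sketch of my fine-tuning setup; paths, data and hyperparameters are
# placeholders, not the exact script. Launched with:
#   torchrun --nproc_per_node=8 finetune.py
import torch
from torch.utils.data import Dataset
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

MODEL_PATH = "path/to/my-pretrained-roberta"  # checkpoint from the MaskedLM pre-training

tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_PATH)

class ToyDataset(Dataset):
    """Stand-in for my real tokenized sequence-classification dataset."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length", max_length=64)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = ToyDataset(["a positive example", "a negative example"] * 64, [1, 0] * 64)
eval_ds = ToyDataset(["a held-out example"] * 16, [1] * 16)

model = RobertaForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=2)

training_args = TrainingArguments(
    output_dir="finetune_out",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()  # runs fine on 1 GPU and under DDP on 8 GPUs
```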
So I moved on to HPO, following the guide Hyperparameter Search using Trainer API.
I used the Optuna backend; it runs without issues on a single GPU and the results seem coherent.
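The HPO part follows the guide closely. Continuing the sketch above, it is essentially this (the search space shown here is illustrative, not my exact one; the real script is run_HPsearch_PEFT.py, launched the same way via torchrun for the DDP case):

```python
# Minimal sketch of the HPO call, following the Trainer API guide;
# the search space is illustrative, not my exact one.
def model_init(trial):
    # Trainer re-instantiates the model for every Optuna trial
    return RobertaForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=2)

def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32, 64]
        ),
    }

trainer = Trainer(
    model=None,              # the model comes from model_init for each trial
    model_init=model_init,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

best_trial = trainer.hyperparameter_search(
    direction="minimize",    # minimize eval loss
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=20,
)
print(best_trial)
```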
However, when I run HPO with PyTorch DDP (i.e. each HPO trial uses all 8 GPUs), it gets stuck after the first logging step while the GPUs keep running until the NCCL timeout hits.
It goes like this:
0%| | 1/10000 [00:02<7:08:24, 2.57s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800144 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800338 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800144 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
...
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
run_HPsearch_PEFT.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2024-02-09_16:34:56
host : xxx
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 2630978)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2630978
[2]:
time : 2024-02-09_16:34:56
host : xxx
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 2630979)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2630979
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-09_16:34:56
host : xxx
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 2630977)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2630977
========================================================
I am having a hard time finding any hints on the internet, and the logs themselves are not helpful at all…
Any hints on what could be breaking the DDP runs with HPO, please?
Thanks in advance!
System info:
- transformers version: 4.37.0
- Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
- Python version: 3.9.18
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes (8 GPUs)
- Using distributed or parallel set-up in script?: yes (PyTorch DDP via torchrun)