Hyperparameter optimization with Optuna backend times out when using PyTorch DDP

Hi everyone and @muellerzr

I pre-trained MaskedLM models using the RoBERTa class and am now running some fine-tuning for sequence classification tasks.

Without HPO, my code runs nicely, including in the DDP setting (one node, 8 GPUs).

So I moved on to HPO, following your guide "Hyperparameter Search using Trainer API".

I used the Optuna backend; it runs without issues on a single GPU and the results seem coherent.
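
For reference, the HPO part of my script is roughly equivalent to the sketch below. The checkpoint path, the GLUE SST-2 stand-in dataset, the search space, and the trial count are placeholders here, not my exact values:

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

MODEL_PATH = "my-pretrained-roberta"  # placeholder for my pre-trained checkpoint

# Placeholder data: GLUE SST-2 stands in for my actual classification dataset
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
raw = load_dataset("glue", "sst2")
tokenized = raw.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True
)

def model_init(trial):
    # hyperparameter_search re-instantiates the model for every trial
    return RobertaForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=2)

def hp_space(trial):
    # Illustrative Optuna search space, not my exact one
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32]
        ),
    }

training_args = TrainingArguments(
    output_dir="hpo_out",
    evaluation_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    args=training_args,
    model_init=model_init,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # default collator pads each batch dynamically
)

best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=20,
    direction="minimize",  # minimize eval loss with the default objective
)
print(best_trial)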

However, when I run HPO with PyTorch DDP (i.e. each HPO trial uses all 8 GPUs), it gets stuck after the first logging step while the GPUs keep running until the NCCL timeout.
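
The DDP runs are launched essentially like this (script name taken from the failure log below; extra arguments omitted):

torchrun --nproc_per_node=8 run_HPsearch_PEFT.py

The single-GPU case is the same script run without the distributed launcher.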

The output looks like this:

  0%|                                                                                                                                            | 1/10000 [00:02<7:08:24,  2.57s/it]

[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800144 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800338 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800144 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'

...

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
run_HPsearch_PEFT.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-09_16:34:56
  host      : xxx
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 2630978)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2630978
[2]:
  time      : 2024-02-09_16:34:56
  host      : xxx
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 2630979)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2630979
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-09_16:34:56
  host      : xxx
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 2630977)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2630977
========================================================

I am having a hard time finding any hints on the internet, and the logs themselves are not helpful at all…

Any hints on what could be breaking the DDP runs with HPO, please?

Thanks in advance!

System info:

  • transformers version: 4.37.0
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.9.18
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: