Hyperparameter optimization with Optuna backend times out when using PyTorch DDP

Hi everyone and @muellerzr

I pre-trained MaskedLM models using the RoBERTa class and am now running some fine-tuning for sequence classification tasks.

Without HPO, my code runs nicely, including in the DDP setting (one node, 8 GPUs).

So I moved on to HPO, following your guide "Hyperparameter Search using Trainer API".

I used the Optuna backend; it runs without issues on a single GPU and the results seem coherent.
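
For reference, the HPO part of my script is roughly equivalent to the sketch below. The checkpoint path, the GLUE SST-2 stand-in dataset, the search space, and the trial count are placeholders here, not my exact values:

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

MODEL_PATH = "my-pretrained-roberta"  # placeholder for my pre-trained checkpoint

# Placeholder data: GLUE SST-2 stands in for my actual classification dataset
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
raw = load_dataset("glue", "sst2")
tokenized = raw.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True
)

def model_init(trial):
    # hyperparameter_search re-instantiates the model for every trial
    return RobertaForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=2)

def hp_space(trial):
    # Illustrative Optuna search space, not my exact one
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32]
        ),
    }

training_args = TrainingArguments(
    output_dir="hpo_out",
    evaluation_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    args=training_args,
    model_init=model_init,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # default collator pads each batch dynamically
)

best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=20,
    direction="minimize",  # minimize eval loss with the default objective
)
print(best_trial)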

However, when I run HPO with PyTorch DDP (i.e. each HPO trial uses all 8 GPUs), it gets stuck after the first logging step while the GPUs keep running until the NCCL timeout.
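
The DDP runs are launched essentially like this (script name taken from the failure log below; extra arguments omitted):

torchrun --nproc_per_node=8 run_HPsearch_PEFT.py

The single-GPU case is the same script run without the distributed launcher.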

The output looks like this:

  0%|                                                                                                                                            | 1/10000 [00:02<7:08:24,  2.57s/it]

[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800144 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800338 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27, OpType=BROADCAST, NumelIn=460, NumelOut=460, Timeout(ms)=1800000) ran for 1800144 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'

...

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
run_HPsearch_PEFT.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-09_16:34:56
  host      : xxx
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 2630978)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2630978
[2]:
  time      : 2024-02-09_16:34:56
  host      : xxx
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 2630979)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2630979
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-09_16:34:56
  host      : xxx
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 2630977)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2630977
========================================================

I am having a hard time finding any hints on the internet, and the logs themselves are not helpful at all…

Any hints on what could be breaking the DDP runs with HPO, please?

Thanks in advance!

System info:

  • transformers version: 4.37.0
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.9.18
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: