[E ProcessGroupNCCL.cpp:828] [Rank X] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3634, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800429 milliseconds before timing out

Error when attaching evaluation to the Trainer with accelerate + DeepSpeed ZeRO-3.

TrainingArguments:

trainer_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    # bf16=True,
    ddp_find_unused_parameters=False,
    gradient_accumulation_steps=2,
    warmup_steps=10,
    num_train_epochs=2,
    # max_steps=20000,
    learning_rate=2e-6,
    evaluation_strategy="steps",
    eval_steps=1,
    per_device_eval_batch_size=1,
    include_inputs_for_metrics=True,
    # fp16=True,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",  # "epoch", "steps", or "no"
    save_total_limit=5,
    save_steps=30,
    # save_on_each_node=False,
    report_to="wandb",
    dataloader_drop_last=True,
    dataloader_num_workers=0,
    # resume_from_checkpoint="outputs/checkpoint-9500/",
    hub_strategy="checkpoint",
    save_safetensors=True,
    # fsdp=["full_shard", "offload", "autowrap"],
    deepspeed="deepspeed_zero3_config.json",
    # Model & Data Sharding
    # Model checkpointing
)
deepspeed_zero3_config.json:

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 2,
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
Accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: deepspeed_zero3_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

It works fine when the number of GPUs is set to 1.
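
For reference, the Timeout(ms)=1800000 in the watchdog message is just the default 30-minute distributed timeout. A minimal sketch of raising it while debugging, assuming a transformers version recent enough to expose the ddp_timeout argument (value in seconds; 7200 here is arbitrary):

import transformers

# Sketch only: give the collectives more headroom than the default 1800 s
# so a slow eval step does not trip the NCCL watchdog immediately.
trainer_args = transformers.TrainingArguments(
    output_dir="outputs",
    ddp_timeout=7200,  # seconds; used for the process-group timeout
    # ... all other arguments as in the config above ...
)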

What is your dataset situation like?

The validation set has the same format as the training set, just with fewer samples (2800 vs. 40). Both are generated by a single function, and padding and truncation work perfectly.

I encountered a similar issue training a T5 model on 4 V100 GPUs in the cluster.

Can you try a couple of things? (There is a small sketch after this list applying the first two.)

  • A higher value for eval_steps (with eval_steps=1 you evaluate on every single step).
    Additionally, are you doing any expensive calculations during eval?

  • A higher value for logging_steps.

  • Alternatively, you can run evaluation on only 1 machine and see if that helps.
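
A minimal sketch of the first two suggestions applied to the original TrainingArguments; the exact values are illustrative, not tuned:

import transformers

trainer_args = transformers.TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="steps",
    eval_steps=200,    # evaluate much less often than every step
    logging_steps=50,  # log less often as well
    per_device_eval_batch_size=1,
    deepspeed="deepspeed_zero3_config.json",
    # ... remaining arguments unchanged from the original config ...
)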

I've had this issue in the past during eval with DeepSpeed across multiple GPUs. My hunch was that one of the processes hangs due to some issue with metric computation or logging during eval, and the others wait forever. So I simplified my eval loop; a sketch of what that can look like is below.
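
Concretely, a minimal sketch of that kind of simplification. Here model, train_ds, eval_ds, and trainer_args are placeholders for your own objects, and the token-accuracy metric is purely illustrative; the point is to shrink what gets gathered across ranks and keep compute_metrics cheap so no single rank stalls inside it:

import transformers

def preprocess_logits_for_metrics(logits, labels):
    # Reduce the full vocabulary logits to token ids on the GPU, so far less
    # data has to be padded and gathered across ranks during eval.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

def compute_metrics(eval_pred):
    # Cheap, deterministic metric: plain token accuracy on non-ignored labels.
    preds, labels = eval_pred.predictions, eval_pred.label_ids
    mask = labels != -100
    return {"token_accuracy": float((preds[mask] == labels[mask]).mean())}

trainer = transformers.Trainer(
    model=model,             # placeholder: your model
    args=trainer_args,       # placeholder: the TrainingArguments above
    train_dataset=train_ds,  # placeholder datasets
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)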

Hope this helps