[E ProcessGroupNCCL.cpp:828] [Rank X] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3634, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800429 milliseconds before timing out

Error when attaching evaluation to the Trainer with accelerate + DeepSpeed ZeRO-3.

TrainingArguments:

trainer_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    # bf16=True,
    ddp_find_unused_parameters=False,
    gradient_accumulation_steps=2,
    warmup_steps=10,
    num_train_epochs=2,
    # max_steps=20000,
    learning_rate=2e-6,
    evaluation_strategy="steps",
    eval_steps=1,
    per_device_eval_batch_size=1,
    include_inputs_for_metrics=True,
    # fp16=True,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",  # "epoch", "steps", or "no"
    save_total_limit=5,
    save_steps=30,
    # save_on_each_node=False,
    report_to="wandb",
    dataloader_drop_last=True,
    dataloader_num_workers=0,
    # resume_from_checkpoint="outputs/checkpoint-9500/",
    hub_strategy="checkpoint",
    save_safetensors=True,
    # fsdp=["full_shard", "offload", "autowrap"],
    deepspeed="deepspeed_zero3_config.json",
    # Model & Data Sharding
    # Model checkpointing
)
deepspeed_zero3_config.json:

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 2,
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
Accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: deepspeed_zero3_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

It works fine when the number of GPUs is set to 1.
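
For reference, the Timeout(ms)=1800000 in the watchdog message is just the default 30-minute distributed timeout. A minimal sketch of raising it while debugging, assuming a transformers version recent enough to expose the ddp_timeout argument (value in seconds; 7200 here is arbitrary):

import transformers

# Sketch only: give the collectives more headroom than the default 1800 s
# so a slow eval step does not trip the NCCL watchdog immediately.
trainer_args = transformers.TrainingArguments(
    output_dir="outputs",
    ddp_timeout=7200,  # seconds; used for the process-group timeout
    # ... all other arguments as in the config above ...
)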

What is your dataset situation like?

The validation set has the same format as the training set, just with fewer samples (2800 vs. 40). Both are generated by a single function, and padding and truncation work perfectly.

I encountered a similar issue training a T5 model on 4 V100 GPUs in the cluster.

Can you try a couple of things? (There is a small sketch after this list applying the first two.)

  • A higher value for eval_steps (with eval_steps=1 you evaluate on every single step).
    Additionally, are you doing any expensive calculations during eval?

  • A higher value for logging_steps.

  • Alternatively, you can run evaluation on only 1 machine and see if that helps.
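
A minimal sketch of the first two suggestions applied to the original TrainingArguments; the exact values are illustrative, not tuned:

import transformers

trainer_args = transformers.TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="steps",
    eval_steps=200,    # evaluate much less often than every step
    logging_steps=50,  # log less often as well
    per_device_eval_batch_size=1,
    deepspeed="deepspeed_zero3_config.json",
    # ... remaining arguments unchanged from the original config ...
)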

I've had this issue in the past during eval with DeepSpeed across multiple GPUs. My hunch was that one of the processes hangs due to some issue with metric computation or logging during eval, and the others wait forever. So I simplified my eval loop; a sketch of what that can look like is below.
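
Concretely, a minimal sketch of that kind of simplification. Here model, train_ds, eval_ds, and trainer_args are placeholders for your own objects, and the token-accuracy metric is purely illustrative; the point is to shrink what gets gathered across ranks and keep compute_metrics cheap so no single rank stalls inside it:

import transformers

def preprocess_logits_for_metrics(logits, labels):
    # Reduce the full vocabulary logits to token ids on the GPU, so far less
    # data has to be padded and gathered across ranks during eval.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

def compute_metrics(eval_pred):
    # Cheap, deterministic metric: plain token accuracy on non-ignored labels.
    preds, labels = eval_pred.predictions, eval_pred.label_ids
    mask = labels != -100
    return {"token_accuracy": float((preds[mask] == labels[mask]).mean())}

trainer = transformers.Trainer(
    model=model,             # placeholder: your model
    args=trainer_args,       # placeholder: the TrainingArguments above
    train_dataset=train_ds,  # placeholder datasets
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)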

Hope this helps