Hi,
I use the Trainer class for my model on SageMaker, and it works well on single-GPU instances. However, when I switch to a multi-GPU instance, I run into an NCCL timeout.
Is it because of the Trainer class or because of my configuration?

```
from sagemaker.huggingface import HuggingFace

transformers_version = "4.46"
pytorch_version = '2.3'
python_version = "py311"
git_transformers_version = "v4.46.0"
hyperparameters = {
    'model_name_or_path': 'facebook/mask2former-swin-base-IN21k-cityscapes-instance',
    'output_dir': '/opt/ml/model',
    'dataset_name': 'xxx/xxx',
    'token': hf_token,
    'image_height': 1080,
    'image_width': 1920,
    'do_train': True,
    'fp16': True,
    'num_train_epochs': 20,
    'learning_rate': 1e-5,
    'lr_scheduler_type': 'constant',
    'per_device_train_batch_size': 1,
    'gradient_accumulation_steps': 4,
    'dataloader_num_workers': 8,
    'dataloader_persistent_workers': True,
    # 'dataloader_prefetch_factor': 4,  # Does not seem to be supported
    'do_eval': True,
    'evaluation_strategy': 'epoch',
    'logging_strategy': 'epoch',
    'save_strategy': 'epoch',
    'save_total_limit': 2,
    'push_to_hub': True,
    'hub_model_id': 'xxx/xxx',
    # Multi-GPU training specific parameters
    'ddp_find_unused_parameters': False,
    'ddp_bucket_cap_mb': 25,
    # Early stopping not available for our model?
    # 'early_stopping_patience': 3,
    # 'metric_for_best_model': 'eval_segm-AP',
    # 'load_best_model_at_end': True,
}

git_config = {
    'repo': 'https://github.com/huggingface/transformers.git',
    'branch': git_transformers_version,
    'username': hf_username,  # Your HF username
    'password': hf_token,     # Your HF token here
}
huggingface_estimator = HuggingFace(
    entry_point='train_v4.46.0.py',  # local copy of run_instance_segmentation.py
    source_dir='/home/sagemaker-user/code/',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=python_version,
    hyperparameters=hyperparameters,
    distribution={'pytorchddp': {'enabled': True}},
)
```
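In case it helps, this is the debugging direction I was going to try next. My understanding is that SageMaker estimators accept an `environment` dict that is exported inside the training container, so NCCL and torch.distributed logging could be enabled like this (sketch only, not tested; `debug_environment` is just my own name):

```
# Sketch only, not tested: same estimator as above, with NCCL / torch.distributed
# debug logging enabled through container environment variables.
debug_environment = {
    'NCCL_DEBUG': 'INFO',                 # verbose NCCL init/collective logs in CloudWatch
    'TORCH_DISTRIBUTED_DEBUG': 'DETAIL',  # extra torch.distributed diagnostics
}

huggingface_estimator = HuggingFace(
    entry_point='train_v4.46.0.py',
    source_dir='/home/sagemaker-user/code/',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=python_version,
    hyperparameters=hyperparameters,
    distribution={'pytorchddp': {'enabled': True}},
    environment=debug_environment,  # assumption: forwarded to the training job as env vars
)
```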
The error is:

```
[1,mpirank:0,algo-1]<stderr>:[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=17
[1,mpirank:0,algo-1]<stderr>:[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[1,mpirank:0,algo-1]<stderr>:[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 17
[1,mpirank:4,algo-1]<stderr>:/opt/conda/bin/runwithenvvars: line 66: 101 Aborted (core dumped) $@
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
```
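If the hang turns out to be a false positive rather than a real deadlock, the traceback itself names two variables that I could merge into the `environment` dict from the sketch above before creating the estimator (again only a sketch, the timeout value is a guess):

```
# Sketch only: the two knobs suggested by the error message (timeout value is illustrative).
debug_environment.update({
    'TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC': '1800',  # give the heartbeat monitor more time, in seconds
    'TORCH_NCCL_ENABLE_MONITORING': '0',         # or disable the heartbeat monitor entirely
})
```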