Correct configuration to train Mask2Former on an Amazon SageMaker multi-GPU ml.p4d.24xlarge instance

Hi,

I use the Trainer class to train my model on SageMaker, and it works well on single-GPU instances. However, when I switch to a multi-GPU instance, I hit an NCCL timeout.

Is it because of the Trainer class or because of my configuration?

transformers_version = "4.46"
pytorch_version = '2.3'
python_version = "py311"
git_transformers_version = "v4.46.0"

hyperparameters = {
    'model_name_or_path': 'facebook/mask2former-swin-base-IN21k-cityscapes-instance', 
    'output_dir':'/opt/ml/model',
    'dataset_name': 'xxx/xxx',
    'token': hf_token,
    'image_height':1080,
    'image_width':1920,
    'do_train':True,
    'fp16': True,
    'num_train_epochs': 20, 
    'learning_rate': 1e-5,
    'lr_scheduler_type': 'constant',
    'per_device_train_batch_size': 1, 
    'gradient_accumulation_steps': 4,
    'dataloader_num_workers': 8,
    'dataloader_persistent_workers': True,
    #'dataloader_prefetch_factor': 4, # Does not seem to be supported
    'do_eval': True,
    'evaluation_strategy': 'epoch',
    'logging_strategy':'epoch',
    'save_strategy':'epoch',
    'save_total_limit': 2,
    'push_to_hub': True,
    'hub_model_id': 'xxx/xxx',
    # Multi-GPU training specific parameters
    'ddp_find_unused_parameters': False,
    'ddp_bucket_cap_mb': 25,
    # Early stopping not available for our model?
    # 'early_stopping_patience': 3,
    # 'metric_for_best_model': 'eval_segm-AP',
    # 'load_best_model_at_end': True,
}

git_config = {
    'repo': 'https://github.com/huggingface/transformers.git',
    'branch': git_transformers_version,
    'username': hf_username,      # Your HF username
    'password': hf_token          # Your HF token here
}

huggingface_estimator = HuggingFace(
    entry_point='train_v4.46.0.py',  # local copy of run_instance_segmentation.py
    source_dir='/home/sagemaker-user/code/',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=python_version,
    hyperparameters=hyperparameters,
    distribution={'pytorchddp': {'enabled': True}},
)

The error is:

[1,mpirank:0,algo-1]<stderr>:[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=17
[1,mpirank:0,algo-1]<stderr>:[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[1,mpirank:0,algo-1]<stderr>:[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 17
[1,mpirank:4,algo-1]<stderr>:/opt/conda/bin/runwithenvvars: line 66:   101 Aborted                 (core dumped) $@
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
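For what it's worth, the error message itself suggests raising TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC or disabling the heartbeat monitor. Below is a minimal, untested sketch of how I assume those variables could be forwarded to the training containers through the estimator's environment argument; the variable names come straight from the log above, and the chosen values are only examples.

huggingface_estimator = HuggingFace(
    entry_point='train_v4.46.0.py',
    source_dir='/home/sagemaker-user/code/',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=python_version,
    hyperparameters=hyperparameters,
    distribution={'pytorchddp': {'enabled': True}},
    # Untested assumption: environment variables set here reach every rank in the job.
    environment={
        'TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC': '1800',  # raise the 600 s heartbeat monitor timeout
        # 'TORCH_NCCL_ENABLE_MONITORING': '0',       # or disable the monitor while debugging
        'NCCL_DEBUG': 'INFO',                        # verbose NCCL logging to locate the hang
    },
)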

Hmmm… Difficult issue.

Thanks, the 10-minute timeout may just be too short. Anyway, I have decided to stay on single-GPU training for the time being. I also tried multi-GPU with Accelerate: it works well, but the resulting model cannot be integrated into CVAT as a pre-annotation model through CVAT's standard Hugging Face integration.
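If I come back to multi-GPU with the Trainer, another knob I might try (untested on my side) is the TrainingArguments ddp_timeout field, which raises the timeout applied to the distributed process group and can be passed as a plain hyperparameter, e.g.:

hyperparameters = {
    # ... same entries as in the first post ...
    'ddp_timeout': 7200,  # seconds; default is 1800, applied to torch.distributed collectives
}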
