SageMaker multi-GPU training

shreyans92dhankhar · November 17, 2022, 2:36pm

Hi,

I am doing multi-GPU training using Trainer. When I run it in SageMaker Studio it works fine, but when I launch the same code, with exactly the same library versions, as a SageMaker training job, I observe two scenarios:

Newer models like microsoft/deberta-v3-small produce the same results in both the notebook and the job.
Models like roberta-base or legalbert produce different results. It looks like the outputs from the multiple GPUs are not being combined properly in the training job, even though the job exits with a success status.

Has anyone observed something like this with newer versus older models? If so, what is the workaround?
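To narrow down a notebook-versus-job discrepancy like this, it may help to print the same environment summary in both places and compare the output. The sketch below is only a diagnostic aid, not a fix: the helper name collect_run_info and the package list are hypothetical choices, and the environment variables are the ones distributed launchers such as torchrun typically set. If WORLD_SIZE is unset in the notebook but set in the job, the two runs are probably using different multi-GPU strategies (a single process driving all GPUs versus one process per GPU), which could change the effective batch size and the results.

```python
import json
import os
from importlib import metadata


def collect_run_info(packages=("torch", "transformers", "accelerate")):
    """Gather facts that may differ between a notebook run and a training job.

    The package list is an illustrative assumption; extend it as needed.
    """
    info = {"versions": {}, "env": {}}

    # Installed versions, or None when a package is not installed.
    for name in packages:
        try:
            info["versions"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["versions"][name] = None

    # Variables usually set by distributed launchers; None means "not set".
    for key in ("WORLD_SIZE", "RANK", "LOCAL_RANK", "CUDA_VISIBLE_DEVICES"):
        info["env"][key] = os.environ.get(key)

    # Number of GPUs visible to this process, if torch is importable.
    try:
        import torch
        info["gpu_count"] = torch.cuda.device_count()
    except ImportError:
        info["gpu_count"] = None

    return info


if __name__ == "__main__":
    print(json.dumps(collect_run_info(), indent=2, sort_keys=True))
```

Running this at the top of the training script in both environments and diffing the two JSON outputs shows whether the job really matches the notebook in library versions, visible GPUs, and launch mode.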
