I am doing multi-GPU training using the Trainer. When I run it in SageMaker Studio it works fine, but when I launch the exact same code, with the exact same library versions, as a SageMaker training job, I observe two scenarios:
1. Newer models like microsoft/deberta-v3-small produce the same results in both the notebook and the job.
2. Models like roberta-base or legalbert produce different results. It looks like in the training job the outputs from the multiple GPUs are not being properly aggregated, even though the job exits with a success status.
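One thing I'm checking as a first diagnostic (a minimal stdlib-only sketch, not tied to any particular model): logging the distributed environment at the top of the entry script, since torch-based launchers such as torchrun and SageMaker's distributed launcher are expected to set these variables. If WORLD_SIZE is unset or 1 inside the training job, the Trainer would be running single-process rather than true DDP, which could explain mismatched results. The function name here is my own, hypothetical helper:

```python
import os

def report_distributed_env():
    # Variables that a DDP launcher (torchrun / SageMaker) is expected
    # to set for each worker process. "<unset>" means the launcher did
    # not configure that variable, hinting at a single-process run.
    keys = ["RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"]
    env = {k: os.environ.get(k, "<unset>") for k in keys}
    for k, v in env.items():
        print(f"{k}={v}")
    return env

env = report_distributed_env()
```

Comparing this output between the Studio notebook run and the training job run would show whether both are actually launching one process per GPU.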
Has anyone observed something like this with newer vs. older models? If yes, what is the workaround?