I see that the HF Trainer run_qa script is compatible with SageMaker Distributed Data Parallel, but I don't see where it is configured.
In particular, I can see in training_args that smdistributed gets imported and configured, but where does the model get wrapped with the smdistributed DDP class?
According to the smdistributed docs, the snippet below is a required step; I'd like to understand where this happens with the HF Trainer:
# Net and device are placeholders from the smdistributed docs
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
model = DDP(Net().to(device))
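For reference, here is a fuller sketch of the setup as I read the smdistributed (v1 API) docs; Net is just a stand-in for the docs' placeholder model, and the process-group/device handling reflects my understanding of the docs rather than anything I've verified inside the Trainer:

import torch
import torch.nn as nn
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

# Trivial placeholder standing in for the docs' Net
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

dist.init_process_group()            # initialize the SMDDP process group
local_rank = dist.get_local_rank()   # one process per GPU
torch.cuda.set_device(local_rank)    # pin this process to its GPU
model = DDP(Net().to(local_rank))    # the wrap I'd expect Trainer to do somewhere

What I'd like to confirm is whether the HF Trainer performs an equivalent wrap internally, and if so, where in its code that happens.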