Trainer API for Model Parallelism on Multiple GPUs

According to the main page of the Trainer API,

“The API supports distributed training on multiple GPUs/TPUs, mixed precision through NVIDIA Apex and Native AMP for PyTorch.”

It seems like a user does not have to configure anything when using the Trainer class for distributed training. However, I am not able to find which distribution strategy this Trainer API supports. My understanding is that the Trainer class automatically uses DDP when multiple GPUs are detected, and that if a user also wants to distribute model parameters across different GPUs, they have to pass in configurations for FSDP or DeepSpeed.
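For example, what I mean by "pass in configurations" is roughly the following (a sketch based on my reading of the TrainingArguments docs; the output directory and config file path are just placeholders):

```python
from transformers import TrainingArguments

# Plain multi-GPU launch: my understanding is that Trainer falls back
# to DDP automatically, with no parallelism-related arguments needed.
ddp_args = TrainingArguments(output_dir="out")

# Sharding model parameters seems to require opting in explicitly,
# e.g. via the `fsdp` argument...
fsdp_args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",  # shard params, grads, optimizer state
)

# ...or via DeepSpeed with a config file (placeholder path).
ds_args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",
)
```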

Can someone confirm whether my understanding is correct, please?

It depends on how you launch the script. If you use torch.distributed.launch (or have accelerate config set up for multi-GPU), it’ll use DistributedDataParallel. To use model parallelism, just launch with python {} and it should pick up model parallelism. (If you find it does not, or need some more assistance, let me know!)

You can verify which mode you are in by checking whether trainer.args.parallel_mode prints ParallelMode.NOT_DISTRIBUTED.


Thanks for your reply! It is super helpful. It is great to know that by just running python {} the class will use model parallelism.

A follow-up question from me: how does the Trainer’s model parallelism differ from DeepSpeed and FSDP? Is there any documentation I can read to gain more knowledge of what is happening in the backend?

Thanks a lot!

It just uses raw PyTorch model parallelism at that point, which is the equivalent of calling .to(device_num) on different parts of the model. Here’s a good doc discussing it: Model Parallelism — transformers 4.7.0 documentation
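To illustrate what that .to(device_num) style placement looks like, here is a toy two-device sketch in plain PyTorch (not the Trainer’s actual internals; it falls back to CPU so it runs anywhere):

```python
import torch
import torch.nn as nn

# Use two GPUs when available, otherwise keep everything on CPU.
has_two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0") if has_two_gpus else torch.device("cpu")
dev1 = torch.device("cuda:1") if has_two_gpus else torch.device("cpu")

class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First chunk of layers lives on one device...
        self.part1 = nn.Linear(16, 32).to(dev0)
        # ...second chunk on the other: the ".to(device_num)"
        # placement mentioned above.
        self.part2 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        x = torch.relu(self.part1(x.to(dev0)))
        # Activations hop between devices at the split point;
        # only one device computes at a time (no pipelining).
        return self.part2(x.to(dev1))

model = TwoDeviceModel()
out = model(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 4])
```

This is why naive model parallelism helps with memory but not speed: the devices work sequentially, unlike DeepSpeed/FSDP, which shard state while keeping all devices busy.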


Is it enough to install DeepSpeed (pip3 install deepspeed), run accelerate config, and say yes when asked about DeepSpeed, to get DeepSpeed with Trainer()?
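Concretely, the workflow I have in mind is the following (train.py is a placeholder for a script that builds a Trainer as usual; I am not sure this alone is sufficient):

```shell
pip3 install deepspeed
accelerate config            # answer "yes" to the DeepSpeed question
accelerate launch train.py   # instead of launching with plain python
```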