Trainer API for Model Parallelism on Multiple GPUs

According to the main page of the Trainer API,

“The API supports distributed training on multiple GPUs/TPUs, mixed precision through NVIDIA Apex and Native AMP for PyTorch.”

It seems like a user does not have to configure anything when using the Trainer class for distributed training. However, I am not able to find which distribution strategy the Trainer API actually uses. From my understanding, the Trainer class automatically uses DDP when multiple GPUs are detected, and if a user also wants to distribute model parameters across different GPUs, then they have to pass in a configuration for FSDP or DeepSpeed.
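To illustrate my understanding, here is a rough sketch (the output directory and the DeepSpeed config path are just placeholders, and I may be wrong about the defaults):

```python
from transformers import TrainingArguments

# Plain multi-GPU data parallelism: my understanding is that no extra argument
# is needed; the Trainer picks up the distributed environment from the launcher.
args_ddp = TrainingArguments(output_dir="out")

# Sharding model parameters across GPUs has to be requested explicitly,
# e.g. via the fsdp argument ...
args_fsdp = TrainingArguments(output_dir="out", fsdp="full_shard")

# ... or by pointing at a DeepSpeed config file (path is a placeholder).
args_ds = TrainingArguments(output_dir="out", deepspeed="ds_config.json")
```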

Can someone confirm whether my understanding is correct, please?

It depends on how you launch the script. If you use torch.distributed.launch (or have run accelerate config for a multi-GPU setup), it will use DistributedDataParallel (DDP). To use model parallelism, just launch with python {myscript.py} and it should pick up model parallelism. (If you find it does not, or need some more assistance, let me know!)

You can verify this by checking whether trainer.args.parallel_mode is ParallelMode.NOT_DISTRIBUTED.
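For context, a quick way to check (this assumes trainer is an already-constructed Trainer instance; the expected values per launch method are my reading of the behaviour described above):

```python
from transformers.training_args import ParallelMode

# trainer is an existing transformers.Trainer instance.
print(trainer.args.parallel_mode)

# Launched with plain `python myscript.py` on a multi-GPU machine:
#   ParallelMode.NOT_DISTRIBUTED  -> naive model parallelism, single process
# Launched with torch.distributed.launch / torchrun / accelerate launch:
#   ParallelMode.DISTRIBUTED      -> DistributedDataParallel
```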


Thanks for your reply! It is super helpful. It is great to know that just running python {myscript.py} makes the Trainer use model parallelism.

A follow-up question from me is: how does the Trainer's model parallelism differ from DeepSpeed and FSDP? Is there any documentation I can read to gain more knowledge of what is happening in the backend?

Thanks a lot!

It just uses raw PyTorch model parallelism at that point, which is the equivalent of doing .to(device_num). Here's a good doc discussing it: Model Parallelism — transformers 4.7.0 documentation
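For intuition, here is a toy sketch of what that kind of naive (vertical) model parallelism looks like in plain PyTorch: layers are placed on different GPUs with .to(...) and activations are moved between them by hand (the module sizes and names here are made up):

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Toy example: first half of the layers on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the intermediate activations to the second GPU by hand.
        x = self.part2(x.to("cuda:1"))
        return x

model = TwoGPUNet()
out = model(torch.randn(8, 512))  # note: only one GPU is busy at a time
```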


Is it enough to install DeepSpeed (pip3 install deepspeed), run accelerate config, and say yes when asked about DeepSpeed, to get DeepSpeed with Trainer()?
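In other words, would that launcher-based setup be equivalent to pointing the Trainer at a DeepSpeed config directly, e.g. something like this (the config file name is just a placeholder):

```python
from transformers import TrainingArguments

# Explicit route: hand the Trainer a DeepSpeed JSON config. "ds_config.json"
# is a placeholder path; its contents (ZeRO stage, optimizer, etc.) would be
# the same choices I would otherwise answer in `accelerate config`.
args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")
```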