According to the main page of the Trainer API,
“The API supports distributed training on multiple GPUs/TPUs, mixed precision through NVIDIA Apex and Native AMP for PyTorch.”
It seems like a user does not have to configure anything when using the `Trainer` class for distributed training. However, I am not able to find out which distribution strategy the Trainer API supports. From my understanding, the `Trainer` class automatically uses DDP when multiple GPUs are detected, and if a user also wants to distribute the model parameters across different GPUs, then we have to pass in a configuration for FSDP or DeepSpeed.
Can someone please confirm whether my understanding is correct?
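For concreteness, this is the kind of configuration I mean (a rough sketch only; the batch size and the `fsdp`/`deepspeed` values are illustrative placeholders):

```python
from transformers import TrainingArguments

# Rough sketch of "passing in configurations for FSDP or DeepSpeed":
# both are options on TrainingArguments. The values below are placeholders,
# and the script would still need to be started with torchrun / accelerate launch.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    fsdp="full_shard auto_wrap",      # shard parameters/gradients across GPUs
    # deepspeed="ds_config.json",     # or point at a DeepSpeed config instead
)
```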
It depends on how you launch the script. If you use `torch.distributed.launch` (or have `accelerate config` set up for multi-GPU), it'll use DistributedDataParallel. To use model parallelism, just launch with `python {myscript.py}` and it should pick up model parallelism. (If you find it does not, or need some more assistance, let me know!)
You can verify this by checking whether `trainer.args.parallel_mode` prints `ParallelMode.NOT_DISTRIBUTED`.
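In case it helps anyone else, a minimal sketch of that check (it assumes `trainer.args` is just the `TrainingArguments` you passed in, so the property can be inspected directly; `output_dir="out"` is a placeholder):

```python
from transformers import TrainingArguments
from transformers.training_args import ParallelMode

# Minimal sketch: TrainingArguments exposes the detected parallel mode,
# which is what trainer.args.parallel_mode returns once the Trainer is built.
args = TrainingArguments(output_dir="out")
print(args.parallel_mode)

# Per the reply above (on a multi-GPU machine):
#   ParallelMode.DISTRIBUTED      -> launched via torch.distributed.launch / accelerate
#   ParallelMode.NOT_DISTRIBUTED  -> launched with plain `python myscript.py`
if args.parallel_mode == ParallelMode.NOT_DISTRIBUTED:
    print("Not running under DistributedDataParallel")
```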
Thanks for your reply! It is super helpful. It is great to know that by just running `python {myscript.py}` the class will use model parallelism.
A follow-up question from me: how does the Trainer's model parallelism differ from DeepSpeed and FSDP? Is there any documentation I can read to learn more about what is happening on the backend?
Thanks a lot!
It just uses raw PyTorch model parallelism at that point, which is the equivalent of doing `.to(device_num)`. Here's a good doc discussing it: Model Parallelism — transformers 4.7.0 documentation
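To illustrate what "raw PyTorch model parallelism" looks like, here is a minimal sketch (not the Trainer's actual code; it assumes two CUDA devices are available):

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Naive model parallelism: each half of the network lives on its own GPU,
    and activations are moved between devices with .to(...)."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(512, 512).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))  # hop to the second GPU
        return x

model = TwoDeviceModel()
print(model(torch.randn(8, 512)).device)  # cuda:1
```

The key difference from FSDP or DeepSpeed ZeRO is that here only one GPU computes at a time, while sharded approaches partition parameters, gradients, and optimizer state so that all GPUs work in parallel.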
Is it enough to install DeepSpeed (`pip3 install deepspeed`), run `accelerate config`, and answer yes when asked about DeepSpeed, in order to get DeepSpeed with `Trainer()`?
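For reference, the other route I am looking at is to pass a DeepSpeed config directly to `TrainingArguments`; a rough sketch (the config values are placeholders, and the script still has to be started with a distributed launcher such as `deepspeed` or `accelerate launch`):

```python
from transformers import TrainingArguments

# Rough sketch: the `deepspeed` argument accepts either a path to a DeepSpeed
# JSON config or a dict. The values below are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},  # ZeRO-2: shard optimizer state and gradients
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    deepspeed=ds_config,  # or deepspeed="ds_config.json"
)
```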
Hello @muellerzr, I would like to see the inner workings of how the Hugging Face Trainer activates model parallelism when `is_model_parallel` is set to True.
I checked that this flag sets `place_model_on_device=False` in transformers/src/transformers/trainer.py at main · huggingface/transformers · GitHub.
However, I don't understand what happens in the next step. Is there code that handles the division of the model parameters across devices? It would be great if I could get your help.
Thank you.
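For concreteness, this is the setup I am trying to understand (a rough sketch; "gpt2" is a placeholder and it assumes a multi-GPU machine with accelerate installed). From what I can tell so far, the split of parameters happens at load time (here via `device_map="auto"`), and the Trainer then only decides not to move the model:

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Rough sketch: device_map="auto" asks accelerate to spread the layers over the
# available GPUs; the placement is recorded in model.hf_device_map.
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")
print(model.hf_device_map)  # e.g. {"transformer.wte": 0, ..., "lm_head": 1}

args = TrainingArguments(output_dir="out")
trainer = Trainer(model=model, args=args)

# When the model already spans several devices, the Trainer flags it as model
# parallel and skips the usual model.to(args.device) call.
print(trainer.is_model_parallel, trainer.place_model_on_device)
```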