Using Transformers with DistributedDataParallel — any examples?

brando · August 17, 2022, 2:29pm

Can you share the command you ran and summarize what you did, please? @treeofknowledge

This discussion is slightly incomplete, IMHO. For example, we usually wrap the model in DDP to get this kind of distributed data parallel training to work.

Did you:

wrap the model in DDP?
change the arguments to the trainer, or the trainer args, in any way?
wrap the optimizer in any distributed trainer (like cherry? cherry is a PyTorch library for things like this)
Also, what about the process group initialization (the init group) that is usually needed?

Thanks in advance

I made a proper question out of this here:

How to run an end to end example of distributed data parallel with hugging face's trainer api (ideally on a single node multiple gpus)?