Using Transformers with DistributedDataParallel — any examples?

treeofknowledge · October 14, 2021, 2:33pm

Hi!
I’ve been consulting this page:
https://huggingface.co/transformers/parallelism.html#data-parallel
that says using DDP with transformers is “almost trivial”. But is there an example available?
Am I supposed to follow
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
just as if I was working with a regular PyTorch model and its optimizer exposed (as opposed to having it abstracted via transformers.Trainer)?

Also, I have some Dataset-related questions. I’ve written a custom dataset class that extends torch.utils.data.Dataset. It yields samples from stored binary chunks of pre-shuffled, pre-tokenized data (to maximize reading speed within a chunk). Because of that, I had to disable Trainer’s shuffling behavior by replacing RandomSampler with SequentialSampler within Trainer._get_train_sampler.
Will this hack work with DDP? Would it work if I switched to another distributed backend, like deepspeed? Is there a better way to do this?

sgugger · October 14, 2021, 2:46pm

The Transformers repo has examples for all tasks that use Accelerate, which is our library for distributed training.

As for your hack, you will need to use the distributed version of the SequentialSampler. You might be better off replacing the sampler for the training dataloader by _get_eval_sampler instead of _get_train_sampler.
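
To illustrate what a "distributed version of the SequentialSampler" means, here is a minimal sketch in plain PyTorch. It uses DistributedSampler with shuffle=False, which is one way to get sequential per-process shards; it is not necessarily the sampler that _get_eval_sampler returns. The PreShuffledChunks class and the sequential_shard helper are hypothetical names made up for this example.

```python
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class PreShuffledChunks(Dataset):
    """Hypothetical stand-in for a dataset backed by pre-shuffled chunks."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx


def sequential_shard(dataset, num_replicas, rank):
    # shuffle=False keeps the stored order; each rank gets a disjoint slice.
    # Passing num_replicas and rank explicitly avoids needing an initialized
    # process group, which makes the sketch runnable in a single process.
    return DistributedSampler(
        dataset, num_replicas=num_replicas, rank=rank, shuffle=False
    )


dataset = PreShuffledChunks(8)
shards = {rank: list(sequential_shard(dataset, 2, rank)) for rank in range(2)}
print(shards)  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

Note that the shards are interleaved rather than contiguous, so each process strides through the data. If reading speed depends on staying inside one chunk, a custom sampler that hands out contiguous ranges may suit the dataset better.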

treeofknowledge · October 15, 2021, 12:45pm

The introduction to the Accelerate library says I have to be willing to write my own training loop (forgoing Trainer). Is there a way for me to enable DDP training while continuing to use Trainer?

Replacing _get_train_sampler with _get_eval_sampler looks like a much more elegant solution, thank you!

sgugger · October 15, 2021, 12:58pm

I misunderstood your question, then. I thought you wanted an example with the model and optimizer exposed, which is why I pointed you to Accelerate.

treeofknowledge · October 15, 2021, 1:01pm

Yup! I’d like to keep using Trainer, I’m sorry if I wasn’t clear enough.

sgugger · October 15, 2021, 1:08pm

Then you just need to properly launch your training script, see here.

treeofknowledge · October 15, 2021, 1:50pm

I had to create a --local_rank launch argument and pass its value to Trainer, and then it worked. Thanks!
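
A rough sketch of what that can look like, assuming a script that builds TrainingArguments by hand. The helper names parse_local_rank and build_training_args are made up for this example, and the LOCAL_RANK fallback is an addition for launchers that set the environment variable instead of passing the flag. torch.distributed.launch passes --local_rank=N to each process, and TrainingArguments had a local_rank field in the Transformers versions from around this time.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the rank that the launcher hands to each process."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank",
        type=int,
        # Fallback (assumption): newer launchers export LOCAL_RANK instead.
        default=int(os.environ.get("LOCAL_RANK", -1)),
    )
    # parse_known_args ignores the script's other, unrelated arguments.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def build_training_args(local_rank, output_dir="out"):
    # Imported lazily so the parsing part runs without transformers installed.
    from transformers import TrainingArguments

    # -1 means "not distributed"; any other value is this process's GPU index.
    return TrainingArguments(output_dir=output_dir, local_rank=local_rank)
```

In the training script this would be used as build_training_args(parse_local_rank()), with the result passed to Trainer as args. Scripts that already parse TrainingArguments from the command line with HfArgumentParser should pick up --local_rank without the extra argument.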

brando · August 17, 2022, 2:29pm

Can you share the command you ran, and summarize what you did, please? @treeofknowledge

This discussion is slightly incomplete imho. For example, we usually wrap the model in DDP to get (distributed) data parallelism to work.

Did you

- wrap the model in DDP?
- change the args to Trainer or TrainingArguments in any way?
- wrap the optimizer in any distributed trainer (like cherry? cherry is a PyTorch lib for things like this)
- also, what about the process group init that is usually needed?

Thanks in advance

made a real question of this here:

brando · August 17, 2022, 2:42pm

@sgugger this doc, Efficient Training on Multiple GPUs, suggests that ZeRO might also be used in this case (where only the Trainer API is needed/wanted). How would one use ZeRO in that case and keep the Trainer? Do you have any links/demos/notebooks?

brando · August 17, 2022, 7:12pm

In case that wasn’t clear this will do everything automatically:

python -m torch.distributed.launch --nproc_per_node 2 ~/src/main_debug.py

see details: How to run an end to end example of distributed data parallel with hugging face's trainer api (ideally on a single node multiple gpus)? - #4 by brando

Colorful · September 5, 2022, 3:49am

LOL. I find most questions can be solved by carefully reading the documentation. Thanks for your help~

fahadh4ilyas · May 8, 2023, 6:46am

Hi @sgugger, I’m curious about how Trainer works. After looking at the script, I found that when saving the model at a checkpoint, it doesn’t use the local_rank argument to make sure that only the first worker saves the model. But the example from PyTorch here shows saving the model at a checkpoint using the local_rank parameter. Is it okay to do what the Trainer does?
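
For reference, the rank-guarded save I mean looks roughly like this. It is a sketch of the general pattern in plain PyTorch, not Trainer’s actual code, and the helper names is_main_process and save_checkpoint are made up for this example.

```python
import torch
import torch.distributed as dist


def is_main_process() -> bool:
    # Outside a distributed run there is only one process, so it is "main".
    if not (dist.is_available() and dist.is_initialized()):
        return True
    return dist.get_rank() == 0


def save_checkpoint(state_dict, path) -> bool:
    """Write the checkpoint from rank 0 only; return True if this rank wrote it."""
    saved = False
    if is_main_process():
        torch.save(state_dict, path)
        saved = True
    # Make the other ranks wait so nobody reads a half-written file.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    return saved
```

Trainer does expose an is_world_process_zero() method, so it may do a similar check somewhere other than where I was looking.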
