Using Transformers with DistributedDataParallel — any examples?

I’ve been consulting this page:
that says using DDP with transformers is “almost trivial”. But is there an example available?
Am I supposed to follow
just as if I was working with a regular PyTorch model and its optimizer exposed (as opposed to having it abstracted via transformers.Trainer)?

Also, I have some Dataset-related questions. I’ve written a custom dataset class that extends torch.utils.data.Dataset. It yields samples from stored binary chunks of pre-shuffled, pre-tokenized data (to maximize reading speed within a chunk). Because of that, I had to disable Trainer’s shuffling behavior by replacing RandomSampler with SequentialSampler within Trainer._get_train_sampler.
Will this hack work with DDP? Would it work if I switched to another distributed backend, like deepspeed? Is there a better way to do this?


There are examples using Accelerate, our library for distributed training, for all tasks in the Transformers repo.

As for your hack, you will need to use the distributed version of the SequentialSampler. You might be better off getting the sampler for the training dataloader from _get_eval_sampler instead of _get_train_sampler.
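To illustrate what a "distributed version of the SequentialSampler" means in practice, here is a minimal pure-Python sketch of how a non-shuffling distributed sampler could split indices across processes. The function name `sequential_shard_indices` is hypothetical and not part of any library; the striding and wrap-around padding are modeled on how PyTorch's `DistributedSampler` behaves when `shuffle=False`, so treat it as an approximation of that behavior rather than the code Trainer actually runs.

```python
import math
from typing import List


def sequential_shard_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Return, in order, the dataset indices that one rank would read (no shuffling)."""
    if not 0 <= rank < num_replicas:
        raise ValueError(f"rank must be in [0, {num_replicas}), got {rank}")
    if dataset_len == 0:
        return []
    # Pad by wrapping around so every rank receives the same number of samples.
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    indices = [i % dataset_len for i in range(total)]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total:num_replicas]


print(sequential_shard_indices(8, 2, 0))  # [0, 2, 4, 6]
print(sequential_shard_indices(8, 2, 1))  # [1, 3, 5, 7]
print(sequential_shard_indices(5, 2, 1))  # [1, 3, 0]  (index 0 is repeated as padding)
```

Note that under this scheme each rank strides through the data instead of reading one contiguous block, which may matter if reading speed depends on staying within a chunk.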

The introduction to the Accelerate library says I have to be willing to write my own training loop (forgoing Trainer). Is there a way for me to enable DDP training while continuing to use Trainer?

Replacing _get_train_sampler with _get_eval_sampler looks like a much more elegant solution, thank you!

I misunderstood your question then, but I thought you wanted an example with the model and optimizer exposed. That’s why I pointed you to Accelerate.

Yup! I’d like to keep using Trainer, I’m sorry if I wasn’t clear enough.


Then you just need to properly launch your training script, see here.

I had to create a --local_rank launch argument and pass its value to Trainer, and then it worked. Thanks!
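For anyone wondering what that looks like, here is a rough sketch, not the poster's actual script. It assumes the script is started by `torch.distributed.launch`, which appends `--local_rank=<n>` to each worker's arguments, and that the value reaches Trainer through the `local_rank` field of `TrainingArguments`. The helper names `parse_local_rank` and `build_training_arguments` are hypothetical, and the `LOCAL_RANK` environment fallback is an assumption for launchers that export the rank instead of passing the flag.

```python
import argparse
import os
from typing import List, Optional


def parse_local_rank(argv: Optional[List[str]] = None) -> int:
    """Read the per-process local rank; -1 means 'not launched in distributed mode'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    # parse_known_args ignores any other arguments the script may define elsewhere.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank == -1:
        # Assumption: some launchers set LOCAL_RANK in the environment instead.
        return int(os.environ.get("LOCAL_RANK", -1))
    return args.local_rank


def build_training_arguments(local_rank: int, output_dir: str = "out"):
    """Hypothetical helper: hand the rank to Trainer via TrainingArguments."""
    # Imported lazily so the parsing helper works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, local_rank=local_rank)


if __name__ == "__main__":
    print("local_rank =", parse_local_rank())
```

The result of `build_training_arguments(parse_local_rank())` would then be passed as `args=` when constructing the Trainer.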

Can you share the command you ran, and summarize what you did, please? :) @treeofknowledge

This discussion is slightly incomplete, imho. For example, we usually wrap the model in DDP to get this kind of (distributed) data parallelism to work.

Did you

  1. wrap the model in DDP?
  2. change the args to Trainer, or the trainer args, in any way?
  3. wrap the optimizer in any distributed trainer (like cherry? cherry is a PyTorch lib for things like this)?
  4. also, what about initializing the process group, which is usually needed?

Thanks in advance

made a real question of this here:

@sgugger the doc Efficient Training on Multiple GPUs suggests that ZeRO might also be used in this case (where only the Trainer API is needed/wanted). How would one use ZeRO in that case and keep the Trainer? Do you have any links/demos/notebooks?

In case that wasn’t clear, this will do everything automatically:

python -m torch.distributed.launch --nproc_per_node 2 ~/src/

see details: How to run an end to end example of distributed data parallel with hugging face's trainer api (ideally on a single node multiple gpus)? - #4 by brando


LOL. I find most questions can be solved by carefully reading the documentation. Thanks for your help~

Hi @sgugger, I’m curious about how Trainer works. Looking at the script, I found that when saving the model at a checkpoint, it doesn’t use the local_rank argument to make sure only the first worker saves the model. But the PyTorch example here saves the checkpoint using the local_rank parameter. Is it okay to do what the Trainer does?
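For reference, the pattern I mean from the PyTorch example looks roughly like the sketch below: only the process with rank 0 writes the checkpoint, and every other process skips the write. This is a simplified illustration, not Trainer's code. The function name `save_checkpoint_on_main_process` is hypothetical, and `json.dump` stands in for `torch.save(model.state_dict(), path)` so the snippet runs without PyTorch.

```python
import json
import os
import tempfile


def save_checkpoint_on_main_process(state: dict, path: str, rank: int) -> bool:
    """Write `state` to `path` only when `rank` is 0; return whether a write happened."""
    if rank != 0:
        # Other workers hold the same weights under DDP, so they skip the write.
        return False
    with open(path, "w") as f:
        json.dump(state, f)  # stand-in for torch.save(model.state_dict(), path)
    return True


# Simulate two workers trying to save the same training state.
with tempfile.TemporaryDirectory() as tmp:
    for rank in (0, 1):
        path = os.path.join(tmp, f"ckpt_rank{rank}.json")
        wrote = save_checkpoint_on_main_process({"step": 10}, path, rank)
        print(rank, wrote, os.path.exists(path))
```

On a single node the local rank and the global rank coincide, which is presumably why the example can key the check on local_rank.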