Using Transformers with DistributedDataParallel — any examples?

I’ve been consulting this page:

that says using DDP with transformers is “almost trivial”. But is there an example available?
Am I supposed to follow
just as if I was working with a regular PyTorch model and its optimizer exposed (as opposed to having it abstracted via transformers.Trainer)?

Also, I have some Dataset-related questions. I’ve written a custom dataset class that extends torch.utils.data.Dataset. It yields samples from stored binary chunks of pre-shuffled, pre-tokenized data (to maximize reading speed within a chunk). Because of that, I had to disable Trainer’s shuffling by replacing RandomSampler with SequentialSampler inside Trainer._get_train_sampler.
Will this hack work with DDP? Would it work if I switched to another distributed backend, like deepspeed? Is there a better way to do this?

The Transformers repo has examples for all tasks that use Accelerate, our library for distributed training.

As for your hack, you will need to use the distributed version of the SequentialSampler. You might be better off having the training dataloader use the sampler returned by _get_eval_sampler instead of the one from _get_train_sampler.
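To illustrate why a plain SequentialSampler is not enough under DDP: every process would iterate over the same indices, so each rank has to be given its own slice of the dataset. Below is a minimal, framework-free sketch of that idea. The class name ContiguousShardSampler and its arguments are hypothetical and not part of Transformers or PyTorch. It assumes contiguous per-rank slices (which keeps reads sequential within stored chunks) and pads by wrapping around so that all ranks see the same number of samples. In a real run, rank and world_size would normally come from torch.distributed.get_rank() and torch.distributed.get_world_size().

```python
import math


class ContiguousShardSampler:
    """Hypothetical sampler: yields one contiguous, in-order slice of indices per rank."""

    def __init__(self, dataset_len, world_size, rank):
        if not 0 <= rank < world_size:
            raise ValueError("rank must be in [0, world_size)")
        self.dataset_len = dataset_len
        self.world_size = world_size
        self.rank = rank
        # Every rank must produce the same number of samples, otherwise the
        # DDP processes fall out of step, so round up and pad if needed.
        self.num_samples = math.ceil(dataset_len / world_size)

    def __iter__(self):
        start = self.rank * self.num_samples
        for i in range(start, start + self.num_samples):
            # Wrap around to the beginning to pad the last shard(s).
            yield i % self.dataset_len

    def __len__(self):
        return self.num_samples


# Example: 10 samples split across 4 processes.
shards = [list(ContiguousShardSampler(10, 4, rank)) for rank in range(4)]
print(shards)
```

Since torch.utils.data.DataLoader accepts any iterable of indices as its sampler argument, an object like this could be passed there directly. The built-in torch.utils.data.DistributedSampler with shuffle=False is the more standard choice, but it hands out strided indices (rank, rank + world_size, ...) rather than contiguous slices.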


The introduction to the Accelerate library says I have to be willing to write the training loop myself (forgoing Trainer). Is there a way for me to enable DDP training while continuing to use Trainer?

Replacing _get_train_sampler with _get_eval_sampler looks like a much more elegant solution, thank you!

I misunderstood your question then, but I thought you wanted an example with the model and optimizer exposed. That’s why I pointed you to Accelerate.

Yup! I’d like to keep using Trainer, I’m sorry if I wasn’t clear enough.

Then you just need to properly launch your training script, see here.

I had to create a --local_rank launch argument and pass its value to Trainer, and then it worked. Thanks!
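For anyone landing here later, this is roughly what that looks like. It is a sketch under assumptions rather than the exact script: the helper names parse_local_rank and build_training_args and the output_dir value are made up for illustration, and it assumes the script is started with a launcher such as python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py. Depending on the PyTorch version, the launcher passes the rank as --local_rank or --local-rank, or only sets the LOCAL_RANK environment variable, so the sketch accepts all three.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the per-process local rank supplied by the distributed launcher."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank",
        "--local-rank",  # spelling used by newer launcher versions
        type=int,
        # Fall back to the environment variable; -1 means "not distributed".
        default=int(os.environ.get("LOCAL_RANK", -1)),
    )
    # Ignore any other script arguments in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def build_training_args(local_rank):
    """Pass the rank on to Trainer via TrainingArguments (needs transformers installed)."""
    from transformers import TrainingArguments

    # "out" is a placeholder output directory.
    return TrainingArguments(output_dir="out", local_rank=local_rank)


# In the training script one would then do something like:
#   training_args = build_training_args(parse_local_rank())
#   trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
#   trainer.train()
```

If the script already parses its arguments into TrainingArguments with HfArgumentParser, the --local_rank flag is picked up automatically, since local_rank is one of the TrainingArguments fields.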