I’ve been consulting this page:
that says using DDP with transformers is “almost trivial”. But is there an example available?
Am I supposed to follow
just as if I were working with a regular PyTorch model with its optimizer exposed (as opposed to having it abstracted away by transformers.Trainer)?
Also, I have some Dataset-related questions. I’ve written a custom dataset class that extends torch.utils.data.Dataset. It yields samples from stored binary chunks of pre-shuffled, pre-tokenized data (to maximize reading speed within a chunk). Because of that, I had to disable Trainer’s shuffling behavior by replacing RandomSampler with SequentialSampler inside Trainer._get_train_sampler.
Will this hack work with DDP? Would it still work if I switched to another distributed backend, like DeepSpeed? Is there a better way to do this?
You have examples using Accelerate, which is our library for distributed training, for all tasks in the Transformers repo.
As for your hack, you will need to use the distributed version of the SequentialSampler. You might be better off replacing the sampler for the training dataloader by
_get_eval_sampler instead of
The introduction to the Accelerate library says I have to be willing to write a forward loop (forgoing Trainer). Is there a way for me to enable DDP training while continuing to use Trainer?
_get_eval_sampler looks like a much more elegant solution, thank you!
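For anyone reading later, here is a rough sketch of what a "distributed sequential" sampler boils down to: every process walks the dataset in order but keeps only its own share of the indices. This is plain Python for illustration, not Trainer's actual implementation. `shard_indices` is a made-up helper, and the padding reflects my understanding of how `torch.utils.data.DistributedSampler` behaves with `shuffle=False` and `drop_last=False`.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices one process reads, in order, with no shuffling."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    if dataset_len == 0:
        return []
    # Pad by repeating indices from the start so every rank gets the same
    # number of samples (this keeps the processes in lockstep).
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    repeats = math.ceil(total / dataset_len)
    indices = (list(range(dataset_len)) * repeats)[:total]
    # Strided split: rank r takes positions r, r + num_replicas, ...
    return indices[rank:total:num_replicas]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(10, 4, r))
```

Note that a strided split like this makes every rank read from every chunk. With chunked binary storage, a contiguous split (rank r taking one consecutive block of `per_rank` indices) may preserve more of the within-chunk read speed; which one a given Trainer version uses is worth checking in its source.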
I misunderstood your question then. I thought you wanted an example with the model and optimizer exposed, which is why I pointed you to Accelerate.
Yup! I’d like to keep using Trainer, I’m sorry if I wasn’t clear enough.
Then you just need to properly launch your training script, see here.
I had to create a --local_rank launch argument and pass its value to Trainer, and then it worked. Thanks!
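The exact code isn't shown above, so the following is only a guess at what "create a --local_rank launch argument" might look like. The helper name `parse_launch_args` is hypothetical. The sketch assumes the launcher either passes the rank as a flag (`--local_rank` in older PyTorch, `--local-rank` from 2.0 on) or exports it as the `LOCAL_RANK` environment variable, as torchrun does.

```python
import argparse
import os


def parse_launch_args(argv=None):
    """Pick up the local rank handed over by the launcher (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Accept both flag spellings; fall back to the LOCAL_RANK environment
    # variable, and to -1 (meaning "not distributed") if neither is present.
    parser.add_argument(
        "--local_rank",
        "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", -1)),
    )
    # parse_known_args leaves the script's other flags alone in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_launch_args()
    print("local_rank =", args.local_rank)
    # Assumed hand-off (not shown in the thread): pass the value on as
    # TrainingArguments(output_dir=<your dir>, local_rank=args.local_rank)
    # and give those arguments to Trainer.
```

If the script already parses `TrainingArguments` from the command line, the flag may well be picked up without any extra code, so this is only needed when the arguments are built by hand.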
Can you share the command you ran, and summarize what you did, please? @treeofknowledge
This discussion is slightly incomplete, IMHO. For example, we usually wrap the model in DDP to get (distributed) data parallelism to work. With Trainer, do we still need to:
- wrap the model in DDP?
- change the args to Trainer, or the trainer args, in any way?
- wrap the optimizer in any distributed trainer (like cherry? cherry is a PyTorch lib for things like this)
- also, what about the process group initialization (the init group) that is usually needed?
Thanks in advance
I made a real question of this here:
@sgugger, the doc Efficient Training on Multiple GPUs suggests that ZeRO might also be used in this case (where only the Trainer API is needed/wanted). How would one use ZeRO in that case and keep the Trainer? Do you have any links/demos/notebooks?
In case that wasn’t clear, this will do everything automatically:
python -m torch.distributed.launch --nproc_per_node 2 ~/src/main_debug.py
see details: How to run an end to end example of distributed data parallel with hugging face's trainer api (ideally on a single node multiple gpus)? - #4 by brando
LOL. I find most questions can be solved by carefully reading the documentation. Thanks for your help~
Hi @sgugger, I’m curious about how Trainer works. Looking at the script, I found that when saving the model at a checkpoint, it doesn’t use the local_rank argument to make sure only the first worker saves the model. But the example from PyTorch here shows checkpoint saving guarded by the local_rank parameter. Is it okay to do what the Trainer does?
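For context, the pattern the PyTorch example relies on can be sketched as below. As far as I can tell, Trainer applies an equivalent guard internally (it exposes `trainer.is_world_process_zero()` for the same check), which would explain why the script has no explicit rank test, but please verify against the version you use. The sketch is plain Python: `is_main_process` and `save_checkpoint` are hypothetical helpers, JSON stands in for a real `torch.save`, and the `RANK` environment variable is assumed to be set by the launcher.

```python
import json
import os
import tempfile


def is_main_process():
    """True on the process with global rank 0, or when not running distributed."""
    # RANK is assumed to be exported by the launcher; if it is unset we
    # treat the run as a plain single-process job.
    return int(os.environ.get("RANK", 0)) == 0


def save_checkpoint(state, path):
    """Write the checkpoint from one process only; return whether a file was written."""
    if not is_main_process():
        return False
    with open(path, "w") as f:
        json.dump(state, f)
    return True


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    print("wrote file:", save_checkpoint({"step": 100}, target))
```

One caveat on the choice of variable: a local rank of 0 exists once per node, so on multi-node runs guarding on the local rank writes one copy per machine, while guarding on the global rank writes a single copy overall.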