Also, I have some Dataset-related questions. I’ve written a custom dataset class that extends torch.utils.data.Dataset. It yields samples from stored binary chunks of pre-shuffled, pre-tokenized data (to maximize read speed within a chunk). Because the data is already shuffled, I had to disable Trainer’s shuffling behavior by replacing RandomSampler with SequentialSampler inside Trainer._get_train_sampler.
Will this hack work with DDP? Would it work if I switched to another distributed backend, like DeepSpeed? Is there a better way to do this?
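For context, the override described above can be done by subclassing Trainer rather than patching it. A minimal sketch (note that _get_train_sampler is a private hook and its exact signature varies between transformers releases):

```python
from torch.utils.data import SequentialSampler
from transformers import Trainer


class SequentialTrainer(Trainer):
    """Keeps the pre-shuffled on-disk order instead of reshuffling."""

    def _get_train_sampler(self, *args, **kwargs):
        # The signature of this private method differs across transformers
        # versions, so any extra arguments are simply ignored here.
        return SequentialSampler(self.train_dataset)
```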
There are examples using Accelerate, our library for distributed training, for all tasks in the Transformers repo.
As for your hack, you will need the distributed version of SequentialSampler for DDP to work. You might be better off having the training dataloader use the sampler from _get_eval_sampler instead of the one from _get_train_sampler.
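A rough sketch of that suggestion, assuming a Trainer release that exposes the private _get_eval_sampler(dataset) helper:

```python
from transformers import Trainer


class SequentialDistributedTrainer(Trainer):
    """Shards the training data across ranks without shuffling."""

    def _get_train_sampler(self, *args, **kwargs):
        # Reuse the evaluation sampler, which is sequential and
        # distributed-aware, for the training dataloader.
        return self._get_eval_sampler(self.train_dataset)
```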
The introduction to the Accelerate library says I have to be willing to write my own training loop (forgoing Trainer). Is there a way to enable DDP training while continuing to use Trainer?
Replacing _get_train_sampler with _get_eval_sampler looks like a much more elegant solution, thank you!
Can you share the command you ran and summarize what you did, please? @treeofknowledge
This discussion seems slightly incomplete, imho. Usually, for (distributed) data parallel training to work, we have to wrap the model in DDP ourselves. Did you:

- wrap the model in DDP?
- change the arguments to Trainer or the TrainingArguments in any way?
- wrap the optimizer in any distributed wrapper (like cherry? cherry is a PyTorch lib for things like this)?

Also, what about the process group initialization that is usually needed? For reference, the usual manual steps are sketched below.
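Here is a plain-PyTorch sketch of those manual steps; when Trainer is launched with torchrun (or torch.distributed.launch), it performs the equivalent setup internally, so none of this has to be done by hand:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def manual_ddp_setup(model: torch.nn.Module) -> DDP:
    # 1. Initialize the process group (torchrun sets RANK/WORLD_SIZE/LOCAL_RANK).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. Wrap the model in DDP so gradients are all-reduced across ranks.
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```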
@sgugger this doc, Efficient Training on Multiple GPUs, suggests that ZeRO might also be used in this case (where only the Trainer API is needed/wanted). How would one use ZeRO in that case and keep the Trainer? Do you have any links/demos/notebooks?
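As a rough sketch of how this usually fits together: the Trainer has a built-in DeepSpeed integration, so ZeRO can be enabled by passing a DeepSpeed config (file path or dict) through TrainingArguments and launching with the deepspeed launcher. The config values below are illustrative only:

```python
from transformers import TrainingArguments

# Illustrative ZeRO stage 2 config; "auto" lets the integration fill in
# values from the TrainingArguments.
ds_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    deepspeed=ds_config,  # or a path to a ds_config.json file
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_ds)
# Launch with e.g.: deepspeed --num_gpus=8 train.py
```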
Hi @sgugger, I’m curious about how Trainer works. After looking at the script, I found that when saving a model at a checkpoint, it doesn’t use the local_rank argument to make sure the model is only saved on the first worker. But the example from PyTorch here shows the checkpoint being saved only on the worker with local_rank 0. Is it okay to do what the Trainer does?
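For reference, the pattern in the PyTorch DDP example looks roughly like the sketch below; the Trainer applies an equivalent main-process check internally rather than taking local_rank at the save call:

```python
import torch
import torch.distributed as dist


def save_checkpoint(model, path="checkpoint.pt"):
    # Save from a single process only; every rank holds identical DDP
    # weights, so writing the file once is enough.
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    # Make the other ranks wait until the checkpoint exists before moving on.
    dist.barrier()
```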