Using Transformers with DistributedDataParallel — any examples?

Can you share the command you ran and summarize what you did, please? :slight_smile: @treeofknowledge


This discussion is slightly incomplete, imho. For example, we usually have to wrap the model in DDP for this kind of distributed data parallelism to work at all (a rough sketch of the pieces I mean is right after the list below).

Did you

  1. wrap the model in DDP?
  2. change the Trainer arguments or TrainingArguments in any way?
  3. wrap the optimizer in any distributed wrapper (cherry, for example? cherry is a PyTorch library for things like this)?
  4. also, what about the process group initialization (torch.distributed.init_process_group) that is usually needed?
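
For context, here is a minimal sketch of the manual (non-Trainer) path I have in mind. This is just an illustration, not a claim about what the Trainer does internally; the model name, learning rate, and backend are placeholders, and it assumes the script is launched with torchrun so that RANK, LOCAL_RANK, and WORLD_SIZE are set in the environment:

```python
# Minimal sketch of manual DDP setup around a Transformers model.
# Assumed launch: torchrun --nproc_per_node=<num_gpus> this_script.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForSequenceClassification


def main():
    # 1. Initialize the process group (the "init group" step, point 4).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. Build the model and wrap it in DDP (point 1).
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # 3. A plain optimizer is enough here: DDP synchronizes gradients during
    #    backward, so no special distributed optimizer wrapper is required (point 3).
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # ... training loop with a DistributedSampler-backed DataLoader goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

As far as I understand, when you launch a stock Trainer script with torchrun (or the older python -m torch.distributed.launch), the Trainer detects the distributed environment and wraps the model in DDP for you, so points 1 and 4 shouldn't need any manual code in that case. But that is exactly the kind of thing I'd like confirmed here.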

Thanks in advance

I made a real question out of this here:
