Does the HF Trainer class support multi-node training? Or only single-host multi-GPU training?
It supports both single-node and multi-node distributed training with the PyTorch launcher (torch.distributed.launch).
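For example, here is a minimal sketch of a Trainer script plus the launch commands you would run on each node. The model name, dataset, addresses, ports, and hyperparameters below are placeholder assumptions, not values from this thread:

```python
# train.py -- a minimal sketch of a Trainer script that runs unchanged on one node or many.
import argparse

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# torch.distributed.launch injects --local_rank into each process it spawns.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
cli_args, _ = parser.parse_known_args()

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    local_rank=cli_args.local_rank,   # lets the Trainer set up distributed training for this process
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=dataset).train()

# Launched on every node (example values only):
#   node 0: python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 \
#               --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py
#   node 1: same command with --node_rank=1
```

The training script itself needs no multi-node-specific code; the launcher starts one process per GPU on each node and the Trainer picks up the rank information from there.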
Thanks! Who is doing the cross-instance allreduce: PyTorch DDP, Horovod, or some custom HF allreduce? Is there any sample?
It’s standard PyTorch DDP behind the scenes.
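As a rough conceptual sketch (not the Trainer's actual source, names are illustrative): each process started by the launcher joins a torch.distributed process group and wraps the model in DistributedDataParallel, and DDP performs the allreduce over NCCL, both across GPUs within a node and across nodes.

```python
# Conceptual sketch of what happens inside each launched process.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by torch.distributed.launch.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Gradients are averaged across all processes on all nodes during backward().
    return DDP(model, device_ids=[local_rank])
```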
@sgugger I tried the distributed training per this HF blog and it failed with multi-node training. It gave an mpirun error when I tried to use multiple instances (similar to what happened when I tried the AWS blog). In both cases I tried ml.p3.16xlarge and ml.p3dn.24xlarge instances.
Edit: I tried it as written and with all variations of the current PyTorch/HF URI containers (can't link due to new account, limited to 2 links).
Any ideas on how to fix?
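For reference, this is roughly the shape of the estimator call from the blog that I'm running, simplified and with placeholder versions/paths (my actual script, role, and hyperparameters differ):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Placeholder entry point and versions; multi-node via SageMaker data parallel.
estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.16xlarge",
    instance_count=2,                     # fails with >1 instance (mpirun error)
    role=role,
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters={"epochs": 1, "per_device_train_batch_size": 32},
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()
```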