Does the HF Trainer class support multi-node training?

Does the HF Trainer class support multi-node training? Or only single-host multi-GPU training?

It supports both single-node and multi-node distributed training with the PyTorch launcher (torch.distributed.launch).
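
For concreteness, here is a minimal sketch of what that looks like: a Trainer script that you launch with torch.distributed.launch on every node. The checkpoint, dataset, master address, and port below are placeholders for illustration, not anything specific from this thread.

```python
# Minimal sketch of a Trainer script launched with torch.distributed.launch.
# Run the same command on each node (example: 2 nodes x 8 GPUs, node_rank 0 or 1):
#   python -m torch.distributed.launch \
#       --nproc_per_node 8 --nnodes 2 --node_rank <0|1> \
#       --master_addr <node0-private-ip> --master_port 29500 \
#       train.py
import argparse

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# torch.distributed.launch injects --local_rank into the script's arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
local_rank = parser.parse_known_args()[0].local_rank

model_name = "distilbert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny dataset slice, just to make the sketch runnable.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

# Passing local_rank is what tells Trainer to set up distributed training;
# everything else in the script is identical to the single-GPU case.
args = TrainingArguments(
    output_dir="out",
    local_rank=local_rank,
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=dataset).train()
```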


Thanks! Who is doing the cross-instance allreduce? PyTorch DDP? Horovod? Or some custom HF allreduce? Any sample? 🙂

It’s standard PyTorch DDP behind the scenes.
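
To make that concrete, here is a simplified sketch (not the actual Trainer source) of the mechanism: each process wraps its model in torch.nn.parallel.DistributedDataParallel, and DDP allreduces gradient buckets over NCCL during backward(), whether the other ranks live on the same host or on other nodes. The helper name below is made up for illustration.

```python
# Simplified sketch of what "standard PyTorch DDP behind the scenes" means here.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def wrap_for_ddp(model, dataset, local_rank, batch_size=8):
    # One process per GPU; the env:// rendezvous (MASTER_ADDR, RANK, WORLD_SIZE)
    # is set up by the launcher before this runs.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # DDP hooks into backward() and allreduces gradients via NCCL,
    # across GPUs on the same node and across nodes alike.
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    # Each rank trains on a distinct shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```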


@sgugger I tried the distributed training per this HF blog and it failed with multi-node training. It gave an mpirun error when I tried to use multiple instances (similar to what I saw when I tried the AWS blog). In both cases I tried ml.p3.16xlarge and ml.p3dn.24xlarge instances.

Edit: I tried it as written and with all variations of the current PyTorch/HF URI containers (I can’t link them here because new accounts are limited to 2 links).
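
For reference, my launch follows the estimator pattern from the blog; it looks roughly like this (the role, versions, hyperparameters, and script names here are illustrative, not my exact values):

```python
# Rough shape of the SageMaker launch that fails once instance_count > 1.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # Trainer-based training script
    source_dir="./scripts",
    role="<SageMakerExecutionRole>",   # placeholder
    instance_type="ml.p3dn.24xlarge",
    instance_count=2,                  # multi-node is where it breaks
    transformers_version="4.6.1",      # illustrative container versions
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters={"epochs": 1},
    # data-parallel distribution across instances, per the blog
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()
```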

Any ideas on how to fix this?