Does the HF Trainer class support multi-node training?

Does the HF Trainer class support multi-node training? Or only single-host multi-GPU training?

It supports both single-node and multi-node distributed training when launched with the PyTorch distributed launcher (torch.distributed.launch, or torchrun on recent PyTorch).
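For illustration, here is a minimal sketch (the script name, model, dataset, node count, and IP address are placeholders, not anything prescribed by Trainer): the same script is started on every node with the launcher, and a recent Trainer picks up the process group from the environment variables the launcher sets, so no extra distributed code is needed in the script itself.

```python
# minimal_trainer_ddp.py -- hypothetical multi-node example
#
# Launch the same command on each node (placeholders: 2 nodes, 8 GPUs per node,
# node 0 reachable at 10.0.0.1):
#
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=<0 or 1> \
#       --master_addr=10.0.0.1 --master_port=29500 minimal_trainer_ddp.py
#
# (on older PyTorch, `python -m torch.distributed.launch --use_env` with the
#  same flags plays the same role)
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Small slice of a public dataset just to make the sketch runnable.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# Trainer reads LOCAL_RANK / RANK / WORLD_SIZE set by the launcher and
# initializes distributed training itself.
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```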


Thanks! Who does the cross-instance allreduce? PyTorch DDP? Horovod? Or some custom HF allreduce? Any sample? 🙂

It’s standard PyTorch DDP behind the scenes.
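Conceptually it works like the sketch below (this is the mechanism Trainer relies on, not its actual source): each launched process wraps the model in torch.nn.parallel.DistributedDataParallel, and DDP hooks into backward() to all-reduce gradients across every process in the group, including processes on other nodes, typically over NCCL.

```python
# Rough sketch of the DDP mechanism behind Trainer (not Trainer's own code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")           # NCCL carries the cross-GPU / cross-node allreduce
local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun / torch.distributed.launch
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])       # wraps the model; registers gradient hooks

loss = model(torch.randn(4, 10, device=local_rank)).sum()
loss.backward()                                   # gradients are all-reduced across all ranks here
```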
