Does the HF Trainer class support multi-node training? Or only single-host multi-GPU training?
It supports both single-node and multi-node distributed training with the PyTorch launcher (torch.distributed.launch).
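For example, here is a minimal sketch of a Trainer script plus the launch commands you would run on each node. The model name, dataset, addresses, ports, and hyperparameters below are placeholder assumptions, not values from this thread:

```python
# train.py -- a minimal sketch of a Trainer script that runs unchanged on one node or many.
import argparse

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# torch.distributed.launch injects --local_rank into each process it spawns.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
cli_args, _ = parser.parse_known_args()

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    local_rank=cli_args.local_rank,   # lets the Trainer set up distributed training for this process
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=dataset).train()

# Launched on every node (example values only):
#   node 0: python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 \
#               --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py
#   node 1: same command with --node_rank=1
```

The training script itself needs no multi-node-specific code; the launcher starts one process per GPU on each node and the Trainer picks up the rank information from there.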
Thanks! Who is doing the cross-instance allreduce: PyTorch DDP, Horovod, or some custom HF allreduce? Is there any sample?
It’s standard PyTorch DDP behind the scenes.
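As a rough conceptual sketch (not the Trainer's actual source, names are illustrative): each process started by the launcher joins a torch.distributed process group and wraps the model in DistributedDataParallel, and DDP performs the allreduce over NCCL, both across GPUs within a node and across nodes.

```python
# Conceptual sketch of what happens inside each launched process.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by torch.distributed.launch.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Gradients are averaged across all processes on all nodes during backward().
    return DDP(model, device_ids=[local_rank])
```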
@sgugger I tried the distributed training per this HF blog and it failed with multi-node training. It gave an mpirun error when I tried to use multiple instances (similar to what happened when I tried the AWS blog). In both cases I tried ml.p3.16xlarge and ml.p3dn.24xlarge instances.
Edit: I tried it as written and with all variations of the current PyTorch/HF URI containers (can't link due to new account, limited to 2 links).
Any ideas on how to fix?
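For reference, this is roughly the shape of the estimator call from the blog that I'm running, simplified and with placeholder versions/paths (my actual script, role, and hyperparameters differ):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Placeholder entry point and versions; multi-node via SageMaker data parallel.
estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.16xlarge",
    instance_count=2,                     # fails with >1 instance (mpirun error)
    role=role,
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters={"epochs": 1, "per_device_train_batch_size": 32},
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()
```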