Does the HF Trainer class support multi-node training?

@sgugger I tried distributed training per this HF blog, and it failed as soon as I went multi-node: it gave an mpirun error when I tried to use multiple instances (similar to what I saw when following the AWS blog). In both cases I tried ml.p3.16xlarge and ml.p3dn.24xlarge instances.
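For reference, this is roughly the launch setup I'm using, following the blog. The entry point, role, versions, and hyperparameters here are placeholders standing in for my actual values, so treat this as a sketch rather than the exact failing config:

```python
# Sketch of the multi-node SageMaker launch that fails for me.
# entry_point, role, versions, and hyperparameters are placeholders.
from sagemaker.huggingface import HuggingFace

# SageMaker distributed data parallel, as in the HF blog
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

huggingface_estimator = HuggingFace(
    entry_point="train.py",            # training script using the Trainer class
    source_dir="./scripts",
    instance_type="ml.p3.16xlarge",    # also tried ml.p3dn.24xlarge
    instance_count=2,                  # multi-node: this is where the mpirun error appears
    role="<sagemaker-execution-role>", # placeholder
    transformers_version="4.6",        # placeholder; tried several container versions
    pytorch_version="1.7",
    py_version="py36",
    distribution=distribution,
    hyperparameters={"epochs": 1, "per_device_train_batch_size": 16},
)

huggingface_estimator.fit()
```

With `instance_count=1` the same script trains fine; the failure only shows up when I raise the count.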

Edit: I tried it as written and with all variations of the current PyTorch/HF URI containers (can't link them here; new accounts are limited to 2 links).

Any ideas on how to fix this?