How do I run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?

The flag needs to be --nproc_per_node, i.e. the number of worker processes to launch on each node (typically one per GPU).
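A minimal launch sketch, assuming a Trainer-based training script named train.py (the script name and GPU count are placeholders, adjust them to your setup):

```shell
# Launch one worker process per GPU on this machine.
# torchrun sets the rendezvous environment variables (RANK, WORLD_SIZE,
# LOCAL_RANK, ...) so Trainer picks up DDP automatically.
torchrun --nproc_per_node=2 train.py

# Older PyTorch versions used the launcher module instead:
# python -m torch.distributed.launch --nproc_per_node=2 train.py
```

No changes to the Trainer code itself should be needed for single-node DDP; the launcher is what tells it how many processes to coordinate.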

Node = computer in this case. I've updated the example above; I think I forgot to do that when I ran into the bug myself!
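To make the node/process distinction concrete: as I understand it, torchrun spawns one process per GPU and hands each one its identity through environment variables, which Trainer (via Accelerate) reads to configure DDP. A stdlib-only sketch of what each worker sees (the simulated values below are illustrative, not from a real launch):

```python
import os

def worker_identity(env=None):
    """Read the rendezvous variables torchrun sets for each spawned process.

    Defaults mimic a plain single-process (non-distributed) run.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),            # global index of this process
        "world_size": int(env.get("WORLD_SIZE", 1)), # total processes across all nodes
        "local_rank": int(env.get("LOCAL_RANK", 0)), # index on this node, maps to a GPU
    }

# Simulate the two processes that `torchrun --nproc_per_node=2` would spawn
# on a single node:
for r in range(2):
    env = {"RANK": str(r), "WORLD_SIZE": "2", "LOCAL_RANK": str(r)}
    print(worker_identity(env))
```

On a single node, rank and local_rank coincide; with multiple nodes, world_size grows while local_rank still only indexes the GPUs of one machine.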