How do you run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?
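
For reference, the kind of script being asked about might look roughly like the minimal sketch below. The model name, dataset, and hyperparameters are only placeholders, not an official example:

```python
# train_ddp.py -- a minimal sketch; swap in your own model, data, and settings.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def main():
    model_name = "distilbert-base-uncased"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Placeholder dataset with "text"/"label" columns.
    raw = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

    train_ds = raw["train"].map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=16,  # per GPU; effective batch = 16 x world size
        num_train_epochs=1,
        logging_steps=50,
        save_strategy="epoch",
    )

    # When launched with torchrun, the Trainer reads the distributed env vars
    # (RANK, LOCAL_RANK, WORLD_SIZE) and wraps the model in
    # DistributedDataParallel itself -- no manual DDP code is needed.
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()


if __name__ == "__main__":
    main()
```

On a single node with, say, 4 GPUs, this would be launched with `torchrun --nproc_per_node=4 train_ddp.py` (or `python -m torch.distributed.run ...` on older PyTorch versions).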

Yes, the Trainer does the same thing on the backend (and so does `save_state` in Accelerate): checkpoints are only written to and read from the main worker during saving.
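
So an explicit rank check is only needed for files you write yourself, outside of those helpers. A minimal sketch with Accelerate (the toy model and paths are placeholders):

```python
# Sketch only: toy model and paths are placeholders. Trainer.save_model() and
# accelerator.save_state() already coordinate the writing internally.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2)  # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

# save_state checkpoints the model/optimizer/RNG state and handles which
# process writes what, so it needs no guard around it.
accelerator.save_state("ckpt")

# Only files you write yourself need an explicit main-process check:
if accelerator.is_main_process:
    with open("ckpt/notes.txt", "w") as f:
        f.write("written once, from the main process\n")
```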