How do I run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?

@muellerzr thank you so much, I appreciate your help! I needed to update a bunch of my libraries since my torch was old and it wasn't running, but now that it works I did:

python -m torch.distributed.launch --nproc_per_node 2 ~/src/main_debug.py

and it worked! See nvidia-smi and the script running side by side (tmux is annoying for copying text, so here is a screenshot):
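For anyone landing here: when you launch a Trainer script this way, each GPU gets its own process and Trainer picks up the distributed environment automatically. As a rough illustration of what happens under the hood (this is a hypothetical sketch, not the actual `main_debug.py`), here is the bare DDP mechanic in plain PyTorch. It uses the `gloo` backend so it also runs on CPU with a single process; under `torch.distributed.launch` the `RANK`/`WORLD_SIZE` environment variables are set for you, and on GPUs you would use the `nccl` backend instead. Note that newer PyTorch versions deprecate `torch.distributed.launch` in favor of `torchrun`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step():
    # torch.distributed.launch / torchrun set these; default to
    # single-process values so the sketch runs standalone too.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")  # use "nccl" on GPUs

    # DDP replicates the model per process and all-reduces gradients.
    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()  # gradient sync across ranks happens here
    opt.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(train_step())
```

With the Trainer API you don't write any of this yourself; launching the script with one process per GPU is enough, exactly as in the command above.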
