How do I run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?

@muellerzr thank you so much, I appreciate your help! I needed to update a bunch of my libraries since my torch was old and it wasn't running, but now that it works I did:

python -m torch.distributed.launch --nproc_per_node 2 ~/src/main_debug.py

and it worked! See nvidia-smi and the script running side by side (tmux is annoying for copying text, so here is a screenshot):
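For anyone landing here: when you launch a Trainer script this way, each GPU gets its own process and Trainer picks up the distributed environment automatically. As a rough illustration of what happens under the hood (this is a hypothetical sketch, not the actual `main_debug.py`), here is the bare DDP mechanic in plain PyTorch. It uses the `gloo` backend so it also runs on CPU with a single process; under `torch.distributed.launch` the `RANK`/`WORLD_SIZE` environment variables are set for you, and on GPUs you would use the `nccl` backend instead. Note that newer PyTorch versions deprecate `torch.distributed.launch` in favor of `torchrun`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step():
    # torch.distributed.launch / torchrun set these; default to
    # single-process values so the sketch runs standalone too.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")  # use "nccl" on GPUs

    # DDP replicates the model per process and all-reduces gradients.
    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()  # gradient sync across ranks happens here
    opt.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(train_step())
```

With the Trainer API you don't write any of this yourself; launching the script with one process per GPU is enough, exactly as in the command above.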
