How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?

OK, I must admit it didn't occur to me to just run my normal script by prepending torchrun --nproc_per_node 2 ... or python -m torch.distributed.launch --nproc_per_node 2 main_data_parallel_ddp_pg.py (the older launcher, which is more familiar to me).
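For what it's worth, here is a minimal sketch of a script that can be launched that way. The file name matches the one above, but the model, dataset, and training arguments are my own assumptions for a quick smoke test; the point is that the script contains no DDP-specific code, since the Trainer picks up the rank and world-size environment variables that torchrun sets:

```python
# main_data_parallel_ddp_pg.py
# launch with: torchrun --nproc_per_node 2 main_data_parallel_ddp_pg.py
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # assumption: any sequence-classification checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# a tiny IMDB slice, just enough to watch both GPUs light up in nvidia-smi
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(output_dir="ddp_out", per_device_train_batch_size=8,
                         num_train_epochs=1, logging_steps=10)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```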

I will launch it as you suggested and track GPU usage with nvidia-smi.

I assume the processes are somehow in communication and coordinate so that they know when an epoch has truly ended.
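As far as I understand, that coordination is mostly implicit: each process gets a disjoint, equal-sized shard of the data from a DistributedSampler (the Trainer sets this up for you), so every rank runs out of batches at the same step, and the gradient all-reduce in each backward pass keeps them in lockstep; no explicit end-of-epoch handshake is needed. A bare-PyTorch sketch of the sharding part (the file name sampler_demo.py is just for illustration):

```python
# sampler_demo.py
# run with: torchrun --nproc_per_node 2 sampler_demo.py
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("gloo")  # "nccl" on GPU; gloo keeps the demo CPU-only
rank = dist.get_rank()

dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset)  # each rank gets a disjoint, equal-sized shard
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
    n_batches = sum(1 for _ in loader)
    # every rank sees the same batch count, so all ranks agree the epoch is over
    print(f"rank {rank}: epoch {epoch} done after {n_batches} batches")

dist.destroy_process_group()
```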

Thanks, that was useful.


FYI, in case torchrun doesn't work for you:

# pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

or, if you want to upgrade an existing install:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113 --upgrade
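Whichever variant you use, a quick sanity check that the install actually sees your GPUs (plain torch calls, nothing specific to this thread):

```python
import torch

print(torch.__version__)          # should end in +cu113 for this install
print(torch.cuda.is_available())  # True if the CUDA build can see a driver
print(torch.cuda.device_count())  # should match the GPU count in nvidia-smi
```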