OK, I must admit it didn't occur to me to just run my normal script by prepending torchrun --nnodes 2 ... or python -m torch.distributed.launch --nproc_per_node 2 main_data_parallel_ddp_pg.py (the older launcher that is more familiar to me).
I will launch it as you suggested and track GPU usage with nvidia-smi.
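For anyone following along, here is a minimal sketch of the kind of script I mean, assuming a single node with 2 GPUs (the Linear model and tensor shapes are just placeholders):

```python
# Minimal DDP sketch; assumes one node with 2 GPUs, launched with e.g.:
#   torchrun --nproc_per_node 2 main_data_parallel_ddp_pg.py
# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in each process's environment.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend='nccl')   # reads rank/world size from the env vars
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)         # each process drives one GPU

    model = torch.nn.Linear(10, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(8, 10).to(local_rank)
    model(x).sum().backward()  # gradients are all-reduced across the processes here

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```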
I assume the processes are somehow in communication and coordinate so that each one knows when an epoch has truly ended.
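If I understand it right, most of that coordination is implicit: DistributedSampler gives each rank an equal-sized shard, so every process finishes the epoch after the same number of iterations, and the gradient all-reduce inside backward() keeps them in lockstep step by step. A rough sketch of the epoch loop I have in mind (the TensorDataset is a placeholder, and it assumes the process group from the sketch above is already initialized):

```python
# Sketch of an epoch loop under DDP; dataset/shapes are placeholders.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
sampler = DistributedSampler(dataset)  # equal-sized shard per rank
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # keeps shuffling consistent across ranks each epoch
    for x, y in loader:
        ...  # forward/backward/step; backward() syncs gradients across ranks
    # Every rank exits the inner loop after the same number of iterations, so
    # the epoch ends together; an explicit barrier is only needed for things
    # like rank-0-only checkpointing.
    dist.barrier()
```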
Thanks, that was useful.
FYI, in case torchrun doesn't work for you:
#pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
or, if you want to upgrade an existing install:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113 --upgrade
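After installing, a quick sanity check that the wheel actually sees the GPUs before retrying torchrun:

```python
import torch

print(torch.__version__)          # should report a +cu113 build
print(torch.version.cuda)         # CUDA version the wheel was compiled against
print(torch.cuda.is_available())  # True if the driver and GPUs are visible
print(torch.cuda.device_count())  # should match what nvidia-smi shows
```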