I am facing the following error:
File "site-packages/torch/cuda/random.py", line 61, in cb
    default_generator = torch.cuda.default_generators[idx]
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
IndexError: tuple index out of range
I am using DDP, launched with torchrun.
Earlier I trained a model on 8 GPUs with DDP and saved a checkpoint containing the following files:
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
rng_state_4.pth
rng_state_5.pth
rng_state_6.pth
rng_state_7.pth
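As far as I can tell from the traceback, the checkpoint loader tries to restore one rng_state_{rank}.pth per originally saved rank, but torch.cuda.default_generators has only one entry per currently visible GPU, so indices 6 and 7 have nothing to index into on a 6-GPU machine. A minimal stand-in reproducing the mismatch (plain Python, no CUDA; the names are placeholders, not the real torch internals):

```python
# Stand-in for torch.cuda.default_generators on a 6-GPU machine:
default_generators = tuple(f"generator_{i}" for i in range(6))
# The 8 per-rank RNG state files saved by the original 8-GPU run:
saved_states = [f"rng_state_{i}.pth" for i in range(8)]

failed_ranks = []
for idx, _ in enumerate(saved_states):
    try:
        default_generators[idx]  # mirrors torch.cuda.default_generators[idx]
    except IndexError:           # raised once idx exceeds the GPU count
        failed_ranks.append(idx)

print(failed_ranks)  # the two ranks with no matching generator
```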
Now I want to resume training with the same script from this checkpoint, but on only 6 GPUs.
How can I resume training from the checkpoint in an environment with fewer GPUs when using DDP?