I am facing the following error:
File "site-packages/torch/cuda/random.py", line 61, in cb
    default_generator = torch.cuda.default_generators[idx]
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
IndexError: tuple index out of range
I am using DDP, launched with torchrun.
Earlier I trained a model on 8 GPUs with DDP and saved a checkpoint containing the following files:
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
rng_state_4.pth
rng_state_5.pth
rng_state_6.pth
rng_state_7.pth
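As far as I can tell from the traceback, the checkpoint loader tries to restore one rng_state_{rank}.pth per originally saved rank, but torch.cuda.default_generators has only one entry per currently visible GPU, so indices 6 and 7 have nothing to index into on a 6-GPU machine. A minimal stand-in reproducing the mismatch (plain Python, no CUDA; the names are placeholders, not the real torch internals):

```python
# Stand-in for torch.cuda.default_generators on a 6-GPU machine:
default_generators = tuple(f"generator_{i}" for i in range(6))
# The 8 per-rank RNG state files saved by the original 8-GPU run:
saved_states = [f"rng_state_{i}.pth" for i in range(8)]

failed_ranks = []
for idx, _ in enumerate(saved_states):
    try:
        default_generators[idx]  # mirrors torch.cuda.default_generators[idx]
    except IndexError:           # raised once idx exceeds the GPU count
        failed_ranks.append(idx)

print(failed_ranks)  # the two ranks with no matching generator
```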
Now I want to resume training with the same script from this checkpoint, but on only 6 GPUs.
How can I resume training from the checkpoint in an environment with fewer GPUs when using DDP?