Getting an error when resuming training with a single GPU

Hello all,

I’m new to the Hugging Face library, though I’m familiar with deep learning in general and with some of the DL libraries.

I’m trying to train a BERT model from scratch on a custom text corpus. I followed the instructions here and managed to get training going with 2 GPUs. Later on, however, I had to stop the training and resume it with a single GPU, at which point I got the error below. I’ve searched this forum, Stack Overflow, etc., but haven’t found a solution so far.

The environment info is as follows (the result of the command transformers-cli env):

  • transformers version: 4.10.0.dev0
  • Platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.1+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: First yes, then no (I guess)

I created a big text file with one sentence per line and a blank line between documents (I guess this is called the NSP format), and I’m using a tokenizer from the Hugging Face Hub.
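For illustration, the file is laid out roughly like this (placeholder sentences, not my actual data):

This is the first sentence of document one.
This is the second sentence of document one.

This is the first sentence of document two.
Another sentence of document two.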
So I started the training with this command:

python run_mlm.py --model_type bert --train_file <path_to_my_train_file> --validation_file <path_to_my_val_file> --do_train --do_eval --output_dir /my/local/path/test-mlm --tokenizer_name dbmdz/bert-base-turkish-cased --cache_dir /my/local/cache/dir --line_by_line --save_total_limit 2 --logging_dir /my/log/dir

I have two GPUs, and this command used both of them as intended. Then I stopped the training and tried to resume it on one GPU with the following command (basically, I just added CUDA_VISIBLE_DEVICES=1 at the beginning):

CUDA_VISIBLE_DEVICES=1 python run_mlm.py --model_type bert --train_file <path_to_my_train_file> --validation_file <path_to_my_val_file> --do_train --do_eval --output_dir /my/local/path/test-mlm --tokenizer_name dbmdz/bert-base-turkish-cased --cache_dir /my/local/cache/dir --line_by_line --save_total_limit 2 --logging_dir /my/log/dir

First, it skipped some batches in order to resume from the last checkpoint, and then it threw the following traceback while loading the checkpoint:

Traceback (most recent call last):
  File "run_mlm.py", line 550, in <module>
    main()
  File "run_mlm.py", line 501, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/my/local/path/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1262, in train
    self._load_rng_state(resume_from_checkpoint)
  File "/my/local/path/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1477, in _load_rng_state
    torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/random.py", line 73, in set_rng_state_all
    set_rng_state(state, i)
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/random.py", line 64, in set_rng_state
    _lazy_call(cb)
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 114, in _lazy_call
    callable()
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/random.py", line 61, in cb
    default_generator = torch.cuda.default_generators[idx]
IndexError: tuple index out of range
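If I read the traceback right, the checkpoint’s CUDA RNG state was saved while both GPUs were visible, and restoring it with only one visible device fails. Here is a minimal sketch of what I think is happening (untested; the checkpoint-XXXX path is just a placeholder, and I’m assuming the RNG states live in rng_state.pth inside the checkpoint folder):

import torch

# Assumed location of the saved RNG states inside the checkpoint folder.
rng_file = "/my/local/path/test-mlm/checkpoint-XXXX/rng_state.pth"
checkpoint_rng_state = torch.load(rng_file)

cuda_states = checkpoint_rng_state["cuda"]   # one RNG state per GPU at save time
print(len(cuda_states))                      # 2 in my case (saved with 2 GPUs)
print(torch.cuda.device_count())             # 1 when CUDA_VISIBLE_DEVICES=1

# set_rng_state_all() loops over the saved states and tries to restore state
# index 1 on a machine that now only exposes device 0, hence the IndexError
# from torch.cuda.default_generators[1].
torch.cuda.random.set_rng_state_all(cuda_states)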

It works without a problem if I remove CUDA_VISIBLE_DEVICES=1, but then it uses both GPUs. As a desperate attempt, I doubled per_device_train_batch_size (the default is 8), hoping it would somehow balance out the single-GPU index error:

CUDA_VISIBLE_DEVICES=1 python run_mlm.py --model_type bert --train_file <path_to_my_train_file> --validation_file <path_to_my_val_file> --do_train --do_eval --output_dir /my/local/path/test-mlm --tokenizer_name dbmdz/bert-base-turkish-cased --cache_dir /my/local/cache/dir --line_by_line --save_total_limit 2 --logging_dir /my/log/dir --per_device_train_batch_size 16

Unfortunately, this didn’t work either, so I’m stuck at this point.
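One idea I haven’t dared to try yet is trimming the saved CUDA RNG states down to the number of currently visible GPUs before resuming, roughly like this (again, the checkpoint-XXXX path is a placeholder and the rng_state.pth filename is an assumption on my part; I’m also not sure whether dropping a GPU’s RNG state is safe for reproducibility):

import torch

ckpt_dir = "/my/local/path/test-mlm/checkpoint-XXXX"  # placeholder checkpoint folder
rng_file = f"{ckpt_dir}/rng_state.pth"

state = torch.load(rng_file)
# Keep only as many CUDA RNG states as there are GPUs visible right now.
state["cuda"] = state["cuda"][: torch.cuda.device_count()]
torch.save(state, rng_file)

Would something like that be acceptable, or is there a proper way to resume a multi-GPU checkpoint on a single GPU?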
Any help is appreciated. Thanks!